Current and past research directions in LLMs, controllable generation, robustness, and causal NLP.
This research investigates controllable counterfactual text generation in large language models by modeling stylistic attributes as directional structures in latent representation space. Instead of relying solely on prompt-based control, the work introduces a geometric framework that performs structured representation-level interventions to modify attributes such as writing style while preserving semantic content.
The proposed approach integrates contrastive objectives, identity-preserving constraints, and semantic regularization to guide latent representations toward target stylistic anchors while minimizing semantic drift. Extensive evaluation using automatic metrics, LLM-based judges, and geometric analysis demonstrates that stylistic transformations can be interpreted as controlled movements along attribute directions in embedding space.
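The idea of a stylistic attribute as a direction in embedding space can be illustrated with a small sketch. This is not the project's implementation: it assumes the attribute direction is estimated as a difference of style centroids, and the function names (`attribute_direction`, `steer`), the toy three-dimensional embeddings, and the step size `alpha` are all hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attribute_direction(source_embs, target_embs):
    """Unit vector pointing from the source-style centroid to the target-style centroid.

    Assumption: a difference of means stands in for whatever learned
    direction the actual framework uses.
    """
    diff = [t - s for s, t in zip(centroid(source_embs), centroid(target_embs))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]


def steer(embedding, direction, alpha):
    """Move an embedding by `alpha` along the attribute direction.

    Only the component along `direction` changes; everything orthogonal
    to it (a crude proxy for semantic content) is left untouched.
    """
    return [e + alpha * d for e, d in zip(embedding, direction)]


# Toy embeddings (hypothetical): the first axis happens to encode style.
formal = [[1.0, 0.0, 0.5], [1.2, 0.1, 0.4]]
casual = [[-1.0, 0.0, 0.5], [-0.8, 0.1, 0.4]]

direction = attribute_direction(casual, formal)
x = [-0.9, 0.3, 0.7]                      # a "casual" sentence embedding
x_steered = steer(x, direction, alpha=2.0)

shift = dot(x_steered, direction) - dot(x, direction)
print(f"movement along the attribute direction: {shift:.3f}")
```

In this toy setting the intervention changes only the projection onto the attribute direction, which is the geometric reading of "modify style while preserving semantic content". A learned framework would replace the centroid difference with directions shaped by its contrastive and regularization objectives.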
This research explores the use of the Minimum Description Length (MDL) principle to detect adversarial attacks in natural language processing systems. Adversarial attacks often introduce subtle perturbations that change model predictions while preserving human readability.
The proposed framework approximates the algorithmic information content of text with a masked language model and analyzes how adversarial perturbations alter the description length of sentences. This provides a principled information-theoretic signal for detecting adversarial manipulation in NLP systems.
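A minimal sketch of the description-length signal follows, assuming the code length of a sentence is the sum of -log2 p(token | rest of sentence) over masked positions. The probability function `toy_masked_prob` and its table are hypothetical stand-ins for a real masked language model, and the example sentences are invented for illustration.

```python
import math


def description_length(tokens, masked_prob):
    """Approximate code length of a sentence in bits.

    Each token is masked in turn and charged -log2 of the probability the
    model assigns to it given the remaining context.
    """
    total_bits = 0.0
    for i, token in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total_bits += -math.log2(masked_prob(context, i, token))
    return total_bits


# Hypothetical stand-in for a masked LM: a fixed table that ignores the
# context and gives unseen (e.g. perturbed) tokens a very small probability.
TOY_PROBS = {"the": 0.5, "movie": 0.25, "was": 0.5, "great": 0.25}


def toy_masked_prob(context, position, token):
    return TOY_PROBS.get(token, 1 / 1024)


clean = ["the", "movie", "was", "great"]
perturbed = ["the", "movie", "was", "gr3at"]   # character-level perturbation

clean_bits = description_length(clean, toy_masked_prob)
perturbed_bits = description_length(perturbed, toy_masked_prob)
print(f"clean: {clean_bits:.1f} bits, perturbed: {perturbed_bits:.1f} bits")
```

The perturbed token is improbable under the model, so the sentence costs more bits to encode. A detector built on this idea would flag inputs whose description length is unusually high relative to clean text; how that threshold is chosen is not specified here.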
This work proposes CEBERT, a robust hate speech detection framework designed to mitigate spurious correlations in text classification models. Traditional models often rely on superficial lexical cues that lead to poor generalization and vulnerability to adversarial manipulation.
The proposed approach introduces a causal graph formulation and entropy-based regularization to encourage models to focus on causal linguistic signals rather than spurious patterns. Experiments demonstrate improved robustness against word-level and character-level adversarial perturbations while maintaining strong predictive performance.
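The summary does not spell out the exact form of the entropy-based regularizer, so the sketch below shows one plausible reading rather than CEBERT's actual loss. It assumes the penalty rewards high entropy in the model's token-importance distribution, so that a prediction leaning on a single lexical cue costs more. The names `regularized_loss` and `lam`, and all numeric values, are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cross_entropy(pred_probs, label):
    """Standard classification loss for a single example."""
    return -math.log(pred_probs[label])


def regularized_loss(pred_probs, label, token_weights, lam=0.1):
    """Classification loss minus a weighted entropy bonus.

    Assumption: `token_weights` is a normalized importance distribution
    over input tokens (e.g. attention). Higher entropy means the model
    spreads its reliance across the sentence instead of one keyword.
    """
    return cross_entropy(pred_probs, label) - lam * entropy(token_weights)


pred = [0.2, 0.8]                       # predicted class probabilities
peaked = [0.97, 0.01, 0.01, 0.01]       # nearly all weight on one token
spread = [0.25, 0.25, 0.25, 0.25]       # weight shared across tokens

loss_peaked = regularized_loss(pred, 1, peaked)
loss_spread = regularized_loss(pred, 1, spread)
print(f"peaked: {loss_peaked:.3f}, spread: {loss_spread:.3f}")
```

With identical predictions, the example that concentrates its importance on one token receives the higher loss, which is the kind of pressure away from superficial lexical cues described above. The causal graph formulation is not represented in this sketch.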