Selected publications and manuscripts related to large language models, robust NLP, and machine learning.
This ongoing research investigates counterfactual text generation in large language models through structured representation-level interventions. The work explores how stylistic attributes can be modeled as directional structures in latent embedding space, enabling controlled modification of attributes such as writing style while preserving semantic meaning.
The proposed framework integrates contrastive learning, semantic regularization, and identity-preserving constraints to guide latent representations toward target attribute anchors while minimizing semantic drift. The framework is evaluated with automatic metrics, LLM-based judges, and geometric analysis of the latent space to characterize how controllably attributes can be manipulated in language models.
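As a minimal sketch of the directional-intervention idea (not the paper's actual method), a style attribute can be approximated as the normalized difference between mean embeddings of two style corpora, and an intervention shifts a latent vector along that direction while capping how far it moves from the original representation. All function names and the drift cap are illustrative assumptions.

```python
import numpy as np

def style_direction(source_embs: np.ndarray, target_embs: np.ndarray) -> np.ndarray:
    """Estimate a style direction as the unit-norm difference of class-mean embeddings.
    (Illustrative: the underlying work may learn directions contrastively instead.)"""
    d = target_embs.mean(axis=0) - source_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def intervene(z: np.ndarray, direction: np.ndarray, alpha: float, max_drift: float) -> np.ndarray:
    """Shift a latent vector along the style direction, limiting semantic drift
    by capping the L2 distance from the original representation."""
    z_new = z + alpha * direction
    drift = np.linalg.norm(z_new - z)
    if drift > max_drift:
        # Rescale the shift so the intervened vector stays within the drift budget.
        z_new = z + (max_drift / drift) * (z_new - z)
    return z_new
```

The drift cap plays the role of an identity-preserving constraint: the attribute shift is applied only up to a budget that keeps the representation near its original semantics.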
This work studies adversarial robustness in natural language processing from an information-theoretic perspective. Instead of directly modeling adversarial perturbations, the approach treats the attack process as a complex causal mechanism and quantifies its algorithmic information using the Minimum Description Length (MDL) framework.
Using masked language modeling, the method estimates the amount of information required to transform an original text into its adversarially modified version. This signal can then be used to identify altered tokens and detect adversarial manipulation even without access to the original text.
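A hedged sketch of the MDL intuition: if a masked language model assigns probability p to each token given its masked context, the description length of the text is the sum of per-token code lengths, -log2 p. Tokens that are unusually expensive to encode are candidates for adversarial edits. The threshold and function names below are illustrative assumptions, not the paper's implementation.

```python
import math

def description_length_bits(token_probs):
    """Total bits needed to encode tokens under an MLM's predictive distribution:
    sum over tokens of -log2 p(token | masked context)."""
    return sum(-math.log2(p) for p in token_probs)

def flag_altered_tokens(tokens, token_probs, threshold_bits=10.0):
    """Flag tokens whose individual code length exceeds a threshold.
    Surprisingly expensive tokens (low MLM probability in context) are
    candidates for adversarial substitutions."""
    return [t for t, p in zip(tokens, token_probs) if -math.log2(p) > threshold_bits]
```

Because the code lengths are computed from the model's own predictive distribution over the observed text, this detection requires no access to the original, unperturbed text.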
This work proposes a robust hate speech detection model that mitigates spurious correlations between lexical cues and prediction labels. Conventional classifiers often rely on these surface-level cues, leaving them vulnerable to adversarial word- and character-level perturbations.
The proposed method formulates hate speech detection using a causal graph and quantifies spurious correlations through causal strength. A regularized entropy loss function is then introduced to reduce reliance on these correlations, improving robustness against adversarial attacks.
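One plausible instantiation of such a regularized objective (illustrative only; the paper's exact loss may differ) combines standard cross-entropy with an entropy term weighted by the estimated causal strength of spurious lexical cues: when cues are causally strong, the model is pushed toward higher predictive entropy, discouraging overconfident predictions driven by surface tokens. The `lam` hyperparameter and the scalar `cue_strength` are assumptions for the sketch.

```python
import numpy as np

def regularized_entropy_loss(probs: np.ndarray, label: int,
                             cue_strength: float, lam: float = 0.1) -> float:
    """Cross-entropy plus an entropy regularizer scaled by the estimated causal
    strength of spurious lexical cues in the input.

    probs: predicted class distribution for one example.
    cue_strength: estimated causal strength of spurious cues (higher = more spurious).
    """
    ce = -np.log(probs[label] + 1e-12)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    # Subtracting weighted entropy rewards less confident predictions on
    # examples dominated by spurious cues, reducing reliance on them.
    return float(ce - lam * cue_strength * entropy)
```

In training, `cue_strength` would be computed per example from the causal graph, so only inputs dominated by spurious cues receive a strong entropy reward.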