Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
Published: Dec 25, 2025 05:00
• 1 min read
• ArXiv NLP
Analysis
This paper asks whether demographic bias can be removed from language models without erasing their ability to recognize demographic information. Using a multi-task evaluation setup, the authors compare attribution-based and correlation-based methods for identifying bias-carrying features. The key finding is that targeted feature ablations, particularly via sparse autoencoders in Gemma-2-9B, reduce bias without significantly degrading recognition performance. The study also stresses the need for dimension-specific interventions, since debiasing along one demographic dimension can inadvertently increase bias along others. Overall, the results suggest that demographic bias stems from task-specific mechanisms rather than inherent demographic markers, paving the way for more precise and effective debiasing strategies.
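To make the "targeted feature ablation" idea concrete, here is a minimal sketch of ablating selected sparse-autoencoder features: activations are encoded into a sparse feature basis, features flagged as bias-related are zeroed, and the result is decoded back into the model's activation space. All names, dimensions, and weights below are illustrative stand-ins, not the paper's actual SAE or feature indices.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, D_SAE = 16, 64  # toy sizes; real SAEs on Gemma-2-9B are far larger

# Randomly initialized stand-in for a trained sparse autoencoder.
W_enc = rng.standard_normal((D_MODEL, D_SAE)) / D_MODEL ** 0.5
b_enc = np.zeros(D_SAE)
W_dec = rng.standard_normal((D_SAE, D_MODEL)) / D_SAE ** 0.5
b_dec = np.zeros(D_MODEL)

def ablate_sae_features(acts, ablate_idx):
    """Encode activations, zero the flagged features, decode back.

    acts:       (batch, D_MODEL) residual-stream activations
    ablate_idx: indices of features identified (e.g. by attribution
                or correlation methods) as bias-related
    """
    feats = np.maximum(acts @ W_enc + b_enc, 0.0)  # sparse feature activations (ReLU)
    feats[:, ablate_idx] = 0.0                      # targeted ablation
    return feats @ W_dec + b_dec                    # reconstructed activations

acts = rng.standard_normal((4, D_MODEL))
patched = ablate_sae_features(acts, ablate_idx=[3, 17, 42])
```

In a real intervention, `patched` would be written back into the forward pass (e.g. via an activation hook) in place of the original activations; the dimension-specific caveat above corresponds to choosing `ablate_idx` separately per demographic dimension.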
Reference
“demographic bias arises from task-specific mechanisms rather than absolute demographic markers”