Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

🔬 Research · #llm · Analyzed: Dec 25, 2025 10:16
Published: Dec 25, 2025 05:00
1 min read
ArXiv NLP

Analysis

This paper explores whether demographic bias can be removed from language models without sacrificing their ability to recognize demographic information. The research uses a multi-task evaluation setup and compares attribution-based and correlation-based methods for identifying bias features. The key finding is that targeted feature ablations, particularly using sparse autoencoders on Gemma-2-9B, can reduce bias without significantly degrading recognition performance. However, the study also highlights the importance of dimension-specific interventions, since debiasing along one demographic dimension can inadvertently increase bias along another. The research suggests that demographic bias stems from task-specific mechanisms rather than inherent demographic markers, paving the way for more precise and effective debiasing strategies.
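Targeted feature ablation with a sparse autoencoder can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the dimensions, weights, and feature indices are all hypothetical, and a real workflow would load a trained SAE for a specific Gemma-2-9B layer and select feature indices from attribution or correlation scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d_model for the residual-stream activation,
# d_sae for the SAE's (overcomplete) latent features.
d_model, d_sae = 8, 32

# Toy SAE weights; a trained SAE would be loaded, not sampled randomly.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)

def sae_ablate(h, ablate_idx):
    """Encode activation h, zero the selected latent features, and
    subtract only those features' decoded contribution from h."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)   # ReLU latent code
    z_abl = z.copy()
    z_abl[ablate_idx] = 0.0                  # targeted feature ablation
    # h' = h - decode(z - z_abl): remove only the ablated features'
    # contribution, leaving the rest of the activation untouched.
    return h - (z - z_abl) @ W_dec

h = rng.normal(size=d_model)
bias_features = [3, 17]   # in practice, chosen by attribution scores
h_debiased = sae_ablate(h, bias_features)
```

Subtracting only the decoded contribution of the ablated features (rather than replacing `h` with a full SAE reconstruction) is one common design choice: it avoids introducing reconstruction error into the parts of the activation that were not targeted.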
Reference / Citation
"demographic bias arises from task-specific mechanisms rather than absolute demographic markers"
ArXiv NLP · Dec 25, 2025 05:00
* Cited for critical analysis under Article 32.