Distilling Consistent Features in Sparse Autoencoders
Published: Dec 31, 2025 • 1 min read • ArXiv
Analysis
This paper addresses feature redundancy and inconsistency in sparse autoencoders (SAEs), which hinder interpretability and reusability. The authors propose a distillation method, Distilled Matryoshka Sparse Autoencoders (DMSAEs), to extract a compact, consistent core of useful features. An iterative distillation cycle measures each feature's contribution via gradient × activation and retains only the most important features. The approach is validated on Gemma-2-2B, demonstrating improved performance and transferability of the learned features.
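To make the attribution step concrete, here is a minimal PyTorch sketch of gradient × activation scoring. The function name, tensor shapes, and the surrogate loss are illustrative assumptions, not the paper's code:

```python
import torch

def feature_attribution(codes: torch.Tensor, loss: torch.Tensor) -> torch.Tensor:
    """Score each latent feature by |gradient x activation| against the loss.

    codes: (batch, seq, n_features) SAE latent activations inside the autograd graph.
    loss:  scalar loss (next-token loss in the paper) computed from the reconstruction.
    """
    grads = torch.autograd.grad(loss, codes, retain_graph=True)[0]
    return (grads * codes).abs().sum(dim=(0, 1))  # aggregate over batch and positions

# Toy stand-in: random "latent codes" and a placeholder loss.
codes = torch.randn(2, 8, 16, requires_grad=True)
loss = (codes ** 2).mean()          # placeholder for the next-token loss
scores = feature_attribution(codes, loss)
print(scores.shape)                 # torch.Size([16]): one score per feature
```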
Key Takeaways
- Proposes DMSAEs, a novel distillation method for sparse autoencoders.
- Uses gradient × activation to identify and retain the most important features.
- Demonstrates improved performance and transferability of features on Gemma-2-2B.
- Addresses the problem of feature redundancy and inconsistency in SAEs.
Reference
“DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient X activation to measure each feature's contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution.”
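As a sketch of the "smallest subset that explains a fixed fraction of the attribution" rule, the greedy selection below sorts features by score and keeps them until their cumulative share crosses the threshold. The helper name and the threshold value are assumptions for illustration, not the paper's implementation:

```python
import torch

def core_feature_subset(scores: torch.Tensor, fraction: float = 0.9) -> torch.Tensor:
    """Indices of the smallest feature subset whose scores cover `fraction` of the total."""
    order = torch.argsort(scores, descending=True)       # greedy: largest scores first
    cumulative = torch.cumsum(scores[order], dim=0)
    k = int(torch.searchsorted(cumulative, fraction * scores.sum())) + 1
    return order[:k]

scores = torch.tensor([0.05, 0.40, 0.02, 0.30, 0.20, 0.03])
keep = core_feature_subset(scores, fraction=0.85)
print(keep)  # tensor([1, 3, 4]): the fewest features covering >= 85% of attribution
```

In the full cycle described above, this selection would feed back into training: the retained core features seed the next Matryoshka SAE, and the attribute-then-prune loop repeats.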