Distilling Consistent Features in Sparse Autoencoders
Analysis
Key Takeaways
- Proposes DMSAEs, a novel distillation method for sparse autoencoders.
- Uses gradient × activation to identify and retain the most important features.
- Demonstrates improved performance and transferability of features on Gemma-2-2B.
- Addresses the problem of feature redundancy and inconsistency in SAEs.
“DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient × activation to measure each feature's contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution.”
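The selection step described in the quote can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-feature attribution scores (gradient × activation, already aggregated over tokens) are given, and the hypothetical `select_features` keeps the smallest subset whose summed absolute attribution reaches a fixed fraction of the total.

```python
import numpy as np

def select_features(attributions: np.ndarray, fraction: float = 0.9) -> np.ndarray:
    """Return indices of the smallest feature subset whose summed
    |attribution| covers `fraction` of the total attribution mass.

    `attributions` is a hypothetical per-feature score vector, e.g.
    gradient x activation averaged over a batch of tokens.
    """
    scores = np.abs(attributions)
    order = np.argsort(scores)[::-1]            # most important features first
    cumulative = np.cumsum(scores[order])       # running attribution mass
    # Smallest prefix length whose mass reaches the target fraction.
    k = int(np.searchsorted(cumulative, fraction * cumulative[-1])) + 1
    return np.sort(order[:k])

# Toy example: five features with made-up attribution scores.
attr = np.array([0.50, 0.05, 0.30, 0.01, 0.14])
kept = select_features(attr, fraction=0.9)
print(kept)  # → [0 2 4]
```

In the full cycle this pruning would alternate with retraining the Matryoshka SAE, so each iteration distills the dictionary down to the features that consistently drive the next-token loss.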