Interpretable Safety Alignment for LLMs
Analysis
This paper addresses the lack of interpretability in low-rank adaptation methods for fine-tuning large language models (LLMs). It proposes using Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space and then constructs an interpretable low-rank subspace from those features for safety alignment. The method achieves high safety rates while updating only a small fraction of parameters and offers insight into the learned alignment subspace.
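The summary does not include the paper's pipeline, but as a rough illustration of the feature-identification step it describes, the sketch below assumes a pretrained SAE over residual-stream activations and ranks latent features by how differently they fire on harmful versus harmless prompts. All names, shapes, and the scoring rule are assumptions, not the paper's implementation.

```python
# Hedged sketch: ranking SAE latent features by task (safety) relevance.
# Assumes a pretrained sparse autoencoder whose encoder maps residual-stream
# activations (d_model) into a wide, sparse latent space (d_sae).
import torch

d_model, d_sae = 4096, 32768          # hidden size and SAE dictionary size (assumed)
W_enc = torch.randn(d_sae, d_model)   # placeholder for pretrained SAE encoder weights
b_enc = torch.zeros(d_sae)

def sae_encode(acts: torch.Tensor) -> torch.Tensor:
    """Encode activations into the sparse, disentangled SAE feature space."""
    return torch.relu(acts @ W_enc.T + b_enc)

# Activations collected from the model on harmful vs. harmless prompts (placeholders).
acts_harmful = torch.randn(512, d_model)
acts_harmless = torch.randn(512, d_model)

# Score each SAE feature by how differently it activates on the two prompt sets;
# the top-scoring features are treated as the task-relevant (safety) features.
gap = sae_encode(acts_harmful).mean(0) - sae_encode(acts_harmless).mean(0)
task_features = torch.topk(gap.abs(), k=64).indices
print(task_features[:10])
```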
Key Takeaways
- Proposes a novel method for interpretable safety alignment in LLMs.
- Uses Sparse Autoencoders (SAEs) to identify task-relevant features.
- Constructs an interpretable low-rank subspace for alignment (see the sketch after this list).
- Achieves high safety rates with parameter-efficient fine-tuning.
- Provides insights into the learned alignment subspace.
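To illustrate how selected SAE features could define an interpretable low-rank subspace for parameter-efficient fine-tuning, here is a minimal sketch of a LoRA-style adapter whose fixed projection directions are the decoder rows of the chosen features. The class name, shapes, and freezing scheme are assumptions under this reading, not the paper's actual method.

```python
# Hedged sketch: a low-rank adapter whose update is confined to the span of the
# decoder directions of selected SAE features, so every update direction maps
# back to a named, human-inspectable feature.
import torch
import torch.nn as nn

d_model, d_sae, rank = 4096, 32768, 8
W_dec = torch.randn(d_sae, d_model)      # placeholder SAE decoder weights
task_features = torch.arange(rank)       # indices chosen as in the previous sketch

class SubspaceLoRA(nn.Module):
    """LoRA-style update restricted to the selected SAE feature directions."""
    def __init__(self, base: nn.Linear, directions: torch.Tensor):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # keep the base weights frozen
            p.requires_grad_(False)
        # B is fixed to the interpretable directions; only A is trained.
        self.register_buffer("B", directions)      # (rank, d_model)
        self.A = nn.Parameter(torch.zeros(directions.shape[0], base.in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus a rank-limited correction inside the feature subspace.
        return self.base(x) + (x @ self.A.T) @ self.B

directions = W_dec[task_features]                  # (rank, d_model) decoder rows
layer = SubspaceLoRA(nn.Linear(d_model, d_model, bias=False), directions)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params
```

Freezing the projection onto the SAE directions is what would keep the adapter interpretable: the only trainable parameters decide how strongly each input dimension drives each named feature direction.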
Reference
“The method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters.”