Search:
Match:
7 results
Paper#LLM🔬 ResearchAnalyzed: Jan 3, 2026 06:17

Distilling Consistent Features in Sparse Autoencoders

Published:Dec 31, 2025 17:12
1 min read
ArXiv

Analysis

This paper addresses the problem of feature redundancy and inconsistency in sparse autoencoders (SAEs), which hinders interpretability and reusability. The authors propose a novel distillation method, Distilled Matryoshka Sparse Autoencoders (DMSAEs), to extract a compact and consistent core of useful features. This is achieved through an iterative distillation cycle that measures feature contribution using gradient x activation and retains only the most important features. The approach is validated on Gemma-2-2B, demonstrating improved performance and transferability of learned features.
Reference

DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient X activation to measure each feature's contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution.

Paper#LLM🔬 ResearchAnalyzed: Jan 3, 2026 06:27

Memory-Efficient Incremental Clustering for Long-Text Coreference Resolution

Published:Dec 31, 2025 08:26
1 min read
ArXiv

Analysis

This paper addresses the challenge of coreference resolution in long texts, a crucial area for LLMs. It proposes MEIC-DT, a novel approach that balances efficiency and performance by focusing on memory constraints. The dual-threshold mechanism and SAES/IRP strategies are key innovations. The paper's significance lies in its potential to improve coreference resolution in resource-constrained environments, making LLMs more practical for long documents.
Reference

MEIC-DT achieves highly competitive coreference performance under stringent memory constraints.

Paper#LLM🔬 ResearchAnalyzed: Jan 3, 2026 19:02

Interpretable Safety Alignment for LLMs

Published:Dec 29, 2025 07:39
1 min read
ArXiv

Analysis

This paper addresses the lack of interpretability in low-rank adaptation methods for fine-tuning large language models (LLMs). It proposes a novel approach using Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, leading to an interpretable low-rank subspace for safety alignment. The method achieves high safety rates while updating a small fraction of parameters and provides insights into the learned alignment subspace.
Reference

The method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters.

Research#llm🔬 ResearchAnalyzed: Dec 25, 2025 09:40

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Published:Dec 25, 2025 05:00
1 min read
ArXiv NLP

Analysis

This paper introduces a novel method using sparse autoencoders (SAEs) to identify competency gaps in large language models (LLMs) and imbalances in their benchmarks. The approach extracts SAE concept activations and computes saliency-weighted performance scores, grounding evaluation in the model's internal representations. The study reveals that LLMs often underperform on concepts contrasting sycophancy and related to safety, aligning with existing research. Furthermore, it highlights benchmark gaps, where obedience-related concepts are over-represented, while other relevant concepts are missing. This automated, unsupervised method offers a valuable tool for improving LLM evaluation and development by identifying areas needing improvement in both models and benchmarks, ultimately leading to more robust and reliable AI systems.
Reference

We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions.

Research#llm📝 BlogAnalyzed: Jan 3, 2026 07:50

Gemma Scope 2 Release Announced

Published:Dec 22, 2025 21:56
2 min read
Alignment Forum

Analysis

Google DeepMind's mech interp team is releasing Gemma Scope 2, a suite of Sparse Autoencoders (SAEs) and transcoders trained on the Gemma 3 model family. This release offers advancements over the previous version, including support for more complex models, a more comprehensive release covering all layers and model sizes up to 27B, and a focus on chat models. The release includes SAEs trained on different sites (residual stream, MLP output, and attention output) and MLP transcoders. The team hopes this will be a useful tool for the community despite deprioritizing fundamental research on SAEs.

Key Takeaways

Reference

The release contains SAEs trained on 3 different sites (residual stream, MLP output and attention output) as well as MLP transcoders (both with and without affine skip connections), for every layer of each of the 10 models in the Gemma 3 family (i.e. sizes 270m, 1b, 4b, 12b and 27b, both the PT and IT versions of each).

Research#llm🔬 ResearchAnalyzed: Jan 4, 2026 07:14

Features Emerge as Discrete States: The First Application of SAEs to 3D Representations

Published:Dec 12, 2025 03:54
1 min read
ArXiv

Analysis

This article likely discusses the application of Sparse Autoencoders (SAEs) to 3D representations. The title suggests a novel approach where features are learned as discrete states, which could lead to more efficient and interpretable representations. The use of SAEs implies an attempt to learn sparse and meaningful features from 3D data.

Key Takeaways

    Reference

    Research#llm🔬 ResearchAnalyzed: Jan 4, 2026 08:00

    SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

    Published:Nov 25, 2025 20:14
    1 min read
    ArXiv

    Analysis

    This article introduces SAGE, a framework designed to interpret features learned by Sparse Autoencoders (SAEs) within Language Models (LLMs). The use of an 'agentic' approach suggests an attempt to automate or enhance the interpretability process, potentially offering a more nuanced understanding of how LLMs function. The focus on SAEs indicates an interest in understanding the internal representations of LLMs, which is a key area of research for improving model transparency and control.

    Key Takeaways

      Reference