Warnings in Training Data Backfire for Language Models
Published: Dec 25, 2025 20:07
• 1 min read
• ArXiv
Analysis
This paper highlights a critical vulnerability in current language models: they fail to learn from negative examples when those examples are framed as warnings. The study shows that models exposed to warnings about harmful content reproduce that content at rates comparable to models given the content directly. This has significant implications for the safety and reliability of AI systems, particularly those trained on data containing warnings or disclaimers. The paper's sparse-autoencoder analysis points to the underlying mechanism: the models fail to orthogonalize the warned-against content from its warning context, and statistical co-occurrence dominates pragmatic understanding. The findings suggest that current architectures prioritize the association between content and its context over the meaning or intent behind it.
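The orthogonalization claim can be made concrete with a quick check on sparse-autoencoder features. The sketch below is illustrative only: the decoder matrix `W_dec` and the feature indices for "warning framing" and "flagged content" are placeholders, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical SAE decoder matrix: one row (a direction in the model's
# residual-stream space) per learned feature. Shapes and indices are
# placeholders chosen for illustration.
W_dec = torch.randn(16384, 768)          # (n_features, d_model)
warning_idx, content_idx = 101, 2048     # assumed feature indices

warning_dir = F.normalize(W_dec[warning_idx], dim=0)
content_dir = F.normalize(W_dec[content_idx], dim=0)

# Under proper orthogonalization this cosine similarity would be near
# zero; a large value means the warning context and the warned-against
# content are entangled along the same direction.
print(torch.dot(warning_dir, content_dir).item())
```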
Key Takeaways
- Language models fail to learn from warning-framed negative examples.
- Models reproduce warned-against content at rates similar to direct exposure.
- The issue stems from a failure of orthogonalization and the dominance of statistical co-occurrence.
- Training-time feature ablation is suggested as a potential solution (see the sketch after this list).
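One way to read the ablation idea is as a forward hook that projects a chosen feature direction out of a layer's hidden states during training, so gradients never reinforce that feature. The following is a minimal PyTorch sketch under assumed names: the layer `model.transformer.h[10]` and the vector `harmful_dir` are placeholders, not the paper's actual layer or feature.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes the component of a layer's hidden
    states lying along `direction` (normalized to unit length)."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Some transformer blocks return tuples; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = hidden @ direction                      # (batch, seq)
        hidden = hidden - coeff.unsqueeze(-1) * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage -- the layer choice and `harmful_dir` are assumptions:
# handle = model.transformer.h[10].register_forward_hook(
#     make_ablation_hook(harmful_dir))
# ... run the usual training loop, then handle.remove() ...
```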
Reference
“Models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%).”