Warnings in Training Data Backfire for Language Models

Published: Dec 25, 2025 20:07
1 min read
ArXiv

Analysis

This paper highlights a critical vulnerability in current language models: they fail to learn from negative examples when those examples are presented in a warning-framed context. Models exposed to warnings about harmful content reproduce that content at rates statistically indistinguishable from models trained on it directly. This has significant implications for the safety and reliability of AI systems, particularly those trained on data containing warnings or disclaimers. The paper's sparse-autoencoder analysis traces the underlying mechanism to a failure of orthogonalization, with statistical co-occurrence dominating over pragmatic understanding: the warning framing does not push the flagged content into a distinct region of representation space. The findings suggest that current architectures learn the association between content and its context rather than the meaning or intent behind that context.
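To make the mechanistic claim concrete, here is a minimal sketch of the kind of check a sparse-autoencoder analysis might perform: compare which SAE features fire for the same content under warning-framed versus direct presentation. The function names and synthetic activations are illustrative assumptions, not the paper's code; high feature overlap and cosine similarity between the two framings would be the signature of the failure of orthogonalization described above.

```python
# Minimal sketch (not the authors' code): how separable are SAE feature
# activations for warning-framed vs. directly presented content?
import numpy as np

def active_feature_set(activations: np.ndarray, threshold: float = 0.0) -> set:
    """Indices of SAE features firing above the threshold."""
    return set(np.flatnonzero(activations > threshold))

def feature_overlap(warned: np.ndarray, direct: np.ndarray) -> float:
    """Jaccard overlap of active features; values near 1.0 mean the warning
    framing does not induce a distinct (orthogonal) representation."""
    a, b = active_feature_set(warned), active_feature_set(direct)
    return len(a & b) / max(len(a | b), 1)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two activation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in activations: in practice these would come from running an SAE
    # over hidden states for the two framings of the same flagged content.
    # They are constructed here to be nearly identical, mimicking the
    # failure-of-orthogonalization result.
    direct = np.clip(rng.normal(0.0, 1.0, 4096), 0.0, None)
    warned = direct * (1.0 + 0.05 * rng.normal(0.0, 1.0, 4096))
    print("feature overlap:", round(feature_overlap(warned, direct), 3))
    print("cosine similarity:", round(cosine(warned, direct), 3))
```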

Reference

Models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%).
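To illustrate what "statistically indistinguishable" means at these rates, the sketch below uses assumed counts of 30 trials per condition (which would yield 76.7% and 83.3%); the actual sample sizes are not given here. A Fisher exact test on such counts returns a p-value far above conventional significance thresholds.

```python
# Illustrative significance check with assumed sample sizes, not figures
# taken from the paper itself.
from scipy.stats import fisher_exact

warned_repro, warned_total = 23, 30    # assumed counts consistent with 76.7%
direct_repro, direct_total = 25, 30    # assumed counts consistent with 83.3%

table = [
    [warned_repro, warned_total - warned_repro],
    [direct_repro, direct_total - direct_repro],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")  # p well above 0.05
```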