DarkPatterns-LLM: A Benchmark for Detecting Manipulative AI Behavior
Published: Dec 27, 2025 05:05 · 1 min read · ArXiv
Analysis
This paper introduces DarkPatterns-LLM, a benchmark for assessing manipulative and harmful behaviors in Large Language Models (LLMs). It addresses a critical gap in existing safety benchmarks by providing a fine-grained, multi-dimensional approach to manipulation detection rather than a simple binary classification. The framework combines a four-layer analytical pipeline with seven harm categories (Legal/Power, Psychological, Emotional, Physical, Autonomy, Economic, and Societal Harm) to enable a comprehensive evaluation of LLM outputs. An evaluation of state-of-the-art models reveals performance disparities and weaknesses, particularly in detecting autonomy-undermining patterns, underscoring the benchmark's value for improving AI trustworthiness.
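To make the framework's structure concrete, here is a minimal Python sketch of how a layered pipeline over the seven harm categories might be organized. Everything here is hypothetical: the class names (`HarmCategory`, `Verdict`, `evaluate`), the keyword triage, and the scoring heuristics are stand-ins, since the summary does not specify what each of the paper's four layers actually computes.

```python
from dataclasses import dataclass
from enum import Enum

# The seven harm categories named in the paper. The string labels are
# illustrative; the benchmark's actual schema may differ.
class HarmCategory(Enum):
    LEGAL_POWER = "legal/power"
    PSYCHOLOGICAL = "psychological"
    EMOTIONAL = "emotional"
    PHYSICAL = "physical"
    AUTONOMY = "autonomy"
    ECONOMIC = "economic"
    SOCIETAL = "societal"

@dataclass
class Verdict:
    manipulative: bool
    scores: dict[HarmCategory, float]  # per-category score in [0, 1]
    rationale: str

def evaluate(output_text: str) -> Verdict:
    """Hypothetical four-layer pipeline; each layer refines the last."""
    # Layer 1: surface screening -- cheap pattern triage over the text.
    flagged = any(cue in output_text.lower()
                  for cue in ("act now", "only you", "last chance"))

    # Layer 2: per-category scoring -- a stand-in heuristic; the real
    # benchmark presumably uses trained classifiers or LLM judges.
    scores = {cat: (0.8 if flagged else 0.1) for cat in HarmCategory}

    # Layer 3: aggregation -- collapse category scores into a binary call.
    manipulative = max(scores.values()) > 0.5

    # Layer 4: diagnostics -- emit the fine-grained, actionable report.
    top = max(scores, key=scores.get)
    rationale = f"Dominant pattern: {top.value} (score {scores[top]:.2f})"
    return Verdict(manipulative, scores, rationale)

if __name__ == "__main__":
    print(evaluate("Only you can fix this -- act now before it's too late!"))
```

The point of the layered design, as the paper frames it, is that a single binary label hides *which* kind of harm an output inflicts; keeping per-category scores alongside the final verdict is what makes the diagnostics actionable.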
Key Takeaways
- Introduces DarkPatterns-LLM, a new benchmark for detecting manipulative behaviors in LLMs.
- Employs a multi-layered analytical pipeline for fine-grained assessment.
- Evaluates LLMs across seven harm categories.
- Highlights performance disparities and weaknesses in existing models.
- Aims to improve AI trustworthiness through actionable diagnostics.
Reference
“DarkPatterns-LLM establishes the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, offering actionable diagnostics toward more trustworthy AI systems.”