HeartBench: Evaluating Anthropomorphic Intelligence in Chinese LLMs
Analysis
This paper introduces HeartBench, a novel framework for evaluating the anthropomorphic intelligence of Large Language Models (LLMs) specifically within the Chinese linguistic and cultural context. It addresses a critical gap in current LLM evaluation by focusing on the social, emotional, and ethical dimensions where LLMs often struggle. The use of authentic psychological counseling scenarios and collaboration with clinical experts strengthens the benchmark's validity. The findings, notably that even leading models reach only about 60% of the expert-defined ideal score and that performance decays further in complex scenarios, highlight the limitations of current LLMs and the need for further research in this area. The methodology, including the rubric-based evaluation and the 'reasoning-before-scoring' protocol, provides a valuable blueprint for future work.
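A reasoning-before-scoring protocol of the kind described here can be sketched as follows. Note that the rubric items, prompt wording, and score format below are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import re

def build_judge_prompt(response: str, rubric: list[str]) -> str:
    """Assemble a judge prompt that requires written reasoning before a
    final score (hypothetical wording; not HeartBench's real prompt)."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        "Evaluate the counseling response against each criterion.\n"
        f"Criteria:\n{criteria}\n"
        f"Response:\n{response}\n"
        "First explain your reasoning, then end with a line 'Score: X/10'."
    )

def parse_score(judge_output: str) -> int:
    """Extract the final numeric score. The reasoning text itself is
    discarded, but requiring it first encourages calibrated judgments."""
    match = re.search(r"Score:\s*(\d+)\s*/\s*10", judge_output)
    if match is None:
        raise ValueError("judge output missing a 'Score: X/10' line")
    return int(match.group(1))

# Example: a judge output that reasons first, then scores.
output = (
    "The reply validates the user's feelings but misses cultural nuance.\n"
    "Score: 6/10"
)
print(parse_score(output))  # → 6
```

Forcing the judge model to emit its reasoning before the score, rather than a bare number, is the core of the protocol; the parser only accepts outputs that follow that order.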
Key Takeaways
- HeartBench is a new framework for evaluating anthropomorphic intelligence in Chinese LLMs.
- It focuses on emotional, cultural, and ethical dimensions.
- The benchmark uses authentic psychological counseling scenarios.
- Leading LLMs reach a performance ceiling of around 60% of the expert-defined ideal score.
- The framework provides a blueprint for creating high-quality, human-aligned training data.
“Even leading models achieve only 60% of the expert-defined ideal score.”