RAIR: A New Benchmark for E-commerce Relevance Assessment
Published: Dec 31, 2025 16:09
•ArXiv
Analysis
This paper introduces RAIR, a benchmark dataset for evaluating the relevance of search results in e-commerce. It addresses the limitations of existing benchmarks with a more complex and comprehensive evaluation framework that adds a long-tail subset and a visual salience subset to a general subset. Its significance lies in its potential to standardize relevance assessment and to give LLMs and VLMs a more challenging testbed in the e-commerce domain, with the inclusion of visual elements being particularly noteworthy.
Key Takeaways
- RAIR is a new Chinese dataset for e-commerce relevance assessment.
- It includes a general subset, a long-tail subset, and a visual salience subset.
- RAIR aims to standardize relevance evaluation and provide a more challenging benchmark.
- Experiments show RAIR challenges even state-of-the-art models like GPT-5.
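One way to read results on a benchmark split into subsets is to score each subset separately, so that a weak long-tail or visual salience score is not hidden by the general subset. The sketch below is illustrative only: the record fields (`subset`, `query`, `item`, `label`), the binary labels, and the use of plain accuracy are all assumptions, not RAIR's actual schema or metric.

```python
from collections import defaultdict


def per_subset_accuracy(examples, predictions):
    """Return {subset_name: accuracy} for parallel lists of examples and predicted labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        subset = example["subset"]
        total[subset] += 1
        if predicted == example["label"]:
            correct[subset] += 1
    return {name: correct[name] / total[name] for name in total}


# Toy records with hypothetical field names and labels (1 = relevant, 0 = not relevant).
examples = [
    {"subset": "general", "query": "q1", "item": "i1", "label": 1},
    {"subset": "general", "query": "q2", "item": "i2", "label": 0},
    {"subset": "long_tail", "query": "q3", "item": "i3", "label": 1},
    {"subset": "visual_salience", "query": "q4", "item": "i4", "label": 0},
]
predictions = [1, 1, 1, 0]  # stand-in for model outputs

scores = per_subset_accuracy(examples, predictions)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

Reporting one number per subset, rather than a single pooled score, keeps the smaller subsets from being swamped by the larger one.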
Reference
“RAIR presents sufficient challenges even for GPT-5, which achieved the best performance.”