Search: This benchmark, for future - ai.jp.net

Paper #llm 🔬 Research • Analyzed: Jan 3, 2026 19:27

HiSciBench: A Hierarchical Benchmark for Scientific Intelligence

Published: Dec 28, 2025 12:08 • 1 min read • ArXiv

Analysis

This paper introduces HiSciBench, a novel benchmark designed to evaluate large language models (LLMs) and multimodal models on scientific reasoning. It addresses the limitations of existing benchmarks by providing a hierarchical and multi-disciplinary framework that mirrors the complete scientific workflow, from basic literacy to scientific discovery. The benchmark's comprehensive nature, including multimodal inputs and cross-lingual evaluation, allows for a detailed diagnosis of model capabilities across different stages of scientific reasoning. The evaluation of leading models reveals significant performance gaps, highlighting the challenges in achieving true scientific intelligence and providing actionable insights for future model development. The public release of the benchmark will facilitate further research in this area.

Key Takeaways

• HiSciBench is a new hierarchical benchmark for evaluating scientific intelligence in LLMs and multimodal models.
• It covers a complete scientific workflow from literacy to discovery.
• The benchmark supports multimodal inputs and cross-lingual evaluation.
• Evaluations reveal significant performance gaps in current models.
• The benchmark will be publicly released to facilitate future research.

Reference

“While models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges.”


Research #VLM 🔬 Research • Analyzed: Jan 10, 2026 07:38

VisRes Bench: Evaluating Visual Reasoning in VLMs

Published: Dec 24, 2025 14:18 • 1 min read • ArXiv

Analysis

This research introduces VisRes Bench, a benchmark for evaluating the visual reasoning capabilities of Vision-Language Models (VLMs). Standardized benchmarking of this kind supports VLM development and helps clarify where the models fall short.

Key Takeaways

• VisRes Bench provides a standardized way to assess VLMs' reasoning abilities.
• The research contributes to a better understanding of current VLM strengths and weaknesses.
• This benchmark can guide future VLM development and improvements.

Reference

“VisRes Bench is a benchmark for evaluating the visual reasoning capabilities of VLMs.”

