LLM Jigsaw: Benchmarking Spatial Reasoning in VLMs - frontier models hit a wall at 5x5 puzzles
Analysis
Key Takeaways
“The benchmark demands a great deal of models' visual reasoning: they must mentally rotate pieces, count coordinates properly, keep track of each piece's starred square, and determine the relationships between pieces on the board.” (a sketch of this piece-rotation bookkeeping appears after this list)
“Enhancing the baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease, suggesting MLLMs exhibit poor context awareness in embodied navigation tasks.”
“The two-stage approach decomposes spatial reasoning into atomic building blocks and their composition.”
“Memory representation plays a central role in consolidating spatial experience, with structured memories, particularly sequential and graph-based representations, substantially improving performance on structure-intensive tasks such as path planning.” (see the graph-memory sketch after this list)
“ViReLoc plans routes between two given ground images.”
“LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making.”
“FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.”
“DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.”
“The paper introduces SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs, and SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks.”
“Leading LLMs showed a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning.”
“The paper highlights that VPTracker 'significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking.'”
“...a novel framework that leverages a hybrid view-transformation branch with 3D Gaussian and edge priors to enhance both geometric consistency and boundary awareness in 3D panoptic occupancy prediction.”
“The integration of embodied agents into human environments demands embodied social intelligence: reasoning over both social norms and physical constraints.”
“Cube Bench is a benchmark for spatial visual reasoning in MLLMs.”
“The paper focuses on dynamic spatial understanding, hinting at the consideration of time as a dimension.”
“The study reveals a spatial reasoning gap in MLLMs.”
“The research paper is sourced from arXiv.”
“R4 likely involves leveraging retrieval-augmented techniques to process and reason about visual information across both spatial and temporal dimensions.”
“The research utilizes graph-based RAG.” (see the graph-RAG sketch after this list)
“The framework utilizes a dual-stage approach.”
“The article is a 'complete guide' to the topic.”
“Language models process text (*already* compressed human knowledge) using the same mechanism we use to learn from raw data.”
“The paper focuses on auto-labeling and reasoning about spatial movement in videos.”
“The research focuses on benchmarking microscopic spatial intelligence on molecules via vision-language models.”
“The research focuses on the impact of camera tilt and object interference on VLM spatial reasoning.”
“The research focuses on aerial vision-language navigation.”
“The study focuses on benchmarking multi-step cartographic reasoning in Vision-Language Models.”
“SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery”
“The study focuses on evaluating Vision-Language Models for 3D geospatial reasoning from aerial imagery.”
“The research focuses on unlocking spatial reasoning capabilities in Large Language Models for 3D Scene-Language Understanding.”
“The research focuses on sequential embodied MLLM reasoning and exploration.”
“The article's focus on 'reasoning path' and 'latent state' suggests an interest in the 'black box' nature of AI and a desire to understand the internal workings of these models.”
“The research focuses on boosting spatial reasoning capability of MLLMs for 3D Visual Grounding.”
“DrawingBench evaluates spatial reasoning and UI interaction capabilities through mouse-based drawing tasks.”
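To make the jigsaw takeaway concrete, here is a minimal Python sketch of the bookkeeping the benchmark demands: rotating a piece's grid 90° while tracking where its starred square lands. The piece format and function names are illustrative assumptions, not the benchmark's actual interface.

```python
# Hypothetical illustration of the bookkeeping the jigsaw benchmark demands:
# rotate a piece (a small 0/1 grid) 90 degrees clockwise and track where its
# starred square ends up. The piece format and names are assumptions, not
# the benchmark's actual interface.

def rotate_piece_cw(piece, star):
    """Rotate a piece 90 degrees clockwise; return the new grid and the
    new (row, col) of its starred square."""
    rows, cols = len(piece), len(piece[0])
    # Clockwise rotation sends cell (r, c) to (c, rows - 1 - r).
    rotated = [[piece[rows - 1 - c][r] for c in range(rows)]
               for r in range(cols)]
    star_r, star_c = star
    return rotated, (star_c, rows - 1 - star_r)

# An L-shaped piece whose starred square sits at (1, 0).
piece = [[1, 0],
         [1, 1]]
grid, star = rotate_piece_cw(piece, (1, 0))
print(grid)  # [[1, 1], [1, 0]]
print(star)  # (0, 0)
```

Even this tiny example shows why models struggle: every rotation silently relabels coordinates, and the starred square must be re-derived rather than remembered.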
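The memory-representation takeaway is easiest to see in code. Below is a minimal sketch, under an assumed schema, of a graph-based spatial memory in which observed transitions become edges and path planning reduces to breadth-first search; it illustrates the general idea, not any paper's implementation.

```python
# A minimal sketch of a graph-based spatial memory: places are nodes,
# observed traversals are edges, and path planning is graph search.
# The schema is an assumption for illustration.
from collections import deque

class SpatialGraphMemory:
    def __init__(self):
        self.edges = {}  # place -> set of directly reachable places

    def observe_transition(self, a, b):
        """Record that the agent moved between places a and b."""
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def plan_path(self, start, goal):
        """Breadth-first search over remembered transitions."""
        frontier = deque([[start]])
        visited = {start}
        while frontier:
            path = frontier.popleft()
            if path[-1] == goal:
                return path
            for nxt in self.edges.get(path[-1], ()):
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append(path + [nxt])
        return None  # goal not reachable from remembered experience

memory = SpatialGraphMemory()
for a, b in [("hall", "kitchen"), ("hall", "office"), ("office", "lab")]:
    memory.observe_transition(a, b)
print(memory.plan_path("kitchen", "lab"))
# ['kitchen', 'hall', 'office', 'lab']
```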
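Finally, a rough sketch of what "graph-based RAG" usually means in practice: retrieving an entity's neighborhood from a knowledge graph and serializing it as prompt context, rather than retrieving flat text chunks. The graph schema and serialization below are assumptions for illustration.

```python
# A rough sketch of graph-based RAG: retrieve an entity's neighborhood
# from a knowledge graph and serialize it as context for the model.
# The graph schema and serialization format are assumptions.

graph = {
    "kitchen": [("contains", "fridge"), ("adjacent_to", "hall")],
    "hall": [("adjacent_to", "kitchen"), ("adjacent_to", "office")],
}

def retrieve_subgraph(entity, hops=1):
    """Collect (subject, relation, object) triples within `hops` of entity."""
    triples, frontier = set(), {entity}
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for relation, neighbor in graph.get(node, []):
                triples.add((node, relation, neighbor))
                next_frontier.add(neighbor)
        frontier = next_frontier
    return triples

def build_context(entity):
    """Serialize retrieved triples into a prompt fragment."""
    lines = [f"{s} --{r}--> {o}" for s, r, o in sorted(retrieve_subgraph(entity))]
    return "Known facts:\n" + "\n".join(lines)

print(build_context("kitchen"))
```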