LLM Jigsaw: Benchmarking Spatial Reasoning in VLMs - frontier models hit a wall at 5x5 puzzles
Analysis
Key Takeaways
“DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis.”
“RAIR remains challenging even for GPT-5, which achieved the best performance.”
“LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency.”
“SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements.”
“HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT) to enhance reasoning diversity and a pairwise reward model for capturing subjective humor.”
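The takeaway describes the pairwise reward model only at a high level. Below is a minimal sketch of the standard Bradley-Terry-style pairwise ranking loss commonly used to model subjective preferences such as humor; the function and variable names are illustrative, not HUMOR's actual implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_a: torch.Tensor,
                         score_b: torch.Tensor,
                         pref_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss (illustrative, not HUMOR's exact loss).

    score_a, score_b: reward-model scores for two candidate responses, shape (batch,)
    pref_a: 1.0 where annotators judged candidate A funnier, 0.0 where B was preferred
    """
    # P(A preferred over B) = sigmoid(score_a - score_b) under the Bradley-Terry model.
    logits = score_a - score_b
    return F.binary_cross_entropy_with_logits(logits, pref_a)

# Toy usage with made-up scores and preference labels.
score_a = torch.tensor([1.2, -0.3, 0.8])
score_b = torch.tensor([0.4, 0.5, 0.9])
pref_a = torch.tensor([1.0, 0.0, 0.0])
print(pairwise_reward_loss(score_a, score_b, pref_a).item())
```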
“The paper introduces "Semantic Lookout", a camera-only, candidate-constrained vision-language model (VLM) fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority.”
“LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making.”
“SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5.”
“The paper introduces Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, and FACT, a flow-matching-based action tokenizer.”
“Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks.”
“The paper introduces evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation.”
“MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks.”
“The Hilbert-VLM model achieves a Dice score of 82.35% on the BraTS2021 segmentation benchmark, with a diagnostic classification accuracy (ACC) of 78.85%.”
“Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks.”
“LVLMs trained on datasets that include output-format instructions tend to follow instructions more accurately than models trained without them.”
“The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets.”
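The takeaway does not spell out the normalization, so the sketch below assumes accuracy is min-max normalized (higher is better) and cost is inverted so that cheaper configurations score closer to 1; the ranking score is then the harmonic mean of the two, which penalizes routers that are strong on one axis but weak on the other. All names and values are illustrative.

```python
def harmonic_ranking_score(accuracy: float, cost: float,
                           acc_min: float, acc_max: float,
                           cost_min: float, cost_max: float) -> float:
    """Harmonic mean of normalized accuracy and normalized (inverted) cost.

    The normalization scheme is an assumption about the protocol, not a
    verbatim reproduction of it.
    """
    acc_norm = (accuracy - acc_min) / (acc_max - acc_min)
    cost_norm = (cost_max - cost) / (cost_max - cost_min)  # cheaper -> closer to 1
    if acc_norm + cost_norm == 0.0:
        return 0.0
    return 2.0 * acc_norm * cost_norm / (acc_norm + cost_norm)

# Toy comparison: a cheap, mid-accuracy router vs. an expensive, accurate one.
print(harmonic_ranking_score(0.72, 2.0, acc_min=0.5, acc_max=0.9, cost_min=1.0, cost_max=10.0))
print(harmonic_ranking_score(0.88, 9.0, acc_min=0.5, acc_max=0.9, cost_min=1.0, cost_max=10.0))
```

Because the harmonic mean collapses toward the weaker of the two terms, the expensive high-accuracy configuration in the second call scores lower than the cheaper, moderately accurate one.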
“TV-RAG realizes a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning.”
“CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies.”
“The paper introduces SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs, and SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks.”
“ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.”
“The proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval.”
“The paper gives finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings.”
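The bound itself is not reproduced in the takeaway. As a rough illustration of the general shape such results take (an assumption about the form, not the paper's actual statement), a covering-number argument over the prompt-embedding space combined with Lipschitz stability typically yields, up to constants:

```latex
% Illustrative form only; the paper's actual bound may differ.
% \widehat{A}_n(p): empirical accuracy (or calibration functional) of the
%   classifier induced by prompt embedding p, estimated from n samples
% A(p): its population counterpart
% L: Lipschitz constant of the functional w.r.t. the prompt embedding
% \mathcal{N}(\mathcal{P}, \varepsilon): covering number of the embedding space
\sup_{p \in \mathcal{P}} \left| \widehat{A}_n(p) - A(p) \right|
  \;\le\; L\,\varepsilon
  \;+\; \sqrt{\frac{\log \mathcal{N}(\mathcal{P}, \varepsilon) + \log(2/\delta)}{2n}}
  \qquad \text{with probability at least } 1 - \delta .
```

The Lipschitz term pays for discretizing the prompt space to an ε-cover, and the square-root term is the usual Hoeffding-plus-union-bound price over the finite cover.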
“SID analyzes inputs using a structured analysis stage that separates content (wireframe / skeleton) from style (visual physics) in JSON form.”
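The exact schema is not given in the takeaway; a hypothetical example of what such a content/style split could look like (field names are assumptions, not SID's actual format):

```python
import json

# Hypothetical structured-analysis output. "content" holds the wireframe/skeleton
# of the scene; "style" holds visual-physics attributes. Field names are
# illustrative assumptions, not SID's published schema.
analysis = {
    "content": {
        "objects": ["table", "mug", "window"],
        "skeleton": {"layout": "three-quarter view",
                     "relations": [["mug", "on", "table"]]},
    },
    "style": {
        "lighting": "soft daylight from the left",
        "materials": {"table": "wood", "mug": "glossy ceramic"},
        "camera": {"focal_length_mm": 35, "depth_of_field": "shallow"},
    },
}
print(json.dumps(analysis, indent=2))
```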
“MoVLR explores the reward space through iterative interaction between control optimization and VLM feedback, aligning control policies with physically coordinated behaviors.”
“MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone.”
“Structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON.”
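A minimal sketch of why structural validation does not imply geometric correctness: the jsonschema check below passes as long as fields and types match, even though the reported bounding box is geometrically impossible, so a separate plausibility check is still required. The schema, field names, and values are illustrative assumptions, not the system's actual contract.

```python
import jsonschema

# Illustrative schema: one detection with a frame-local person id and a bbox.
schema = {
    "type": "object",
    "properties": {
        "person_id": {"type": "integer"},
        "bbox": {"type": "array", "items": {"type": "number"},
                 "minItems": 4, "maxItems": 4},
    },
    "required": ["person_id", "bbox"],
}

# Syntactically valid output with an impossible box (x_max < x_min).
output = {"person_id": 3, "bbox": [250.0, 40.0, 100.0, 90.0]}
jsonschema.validate(output, schema)  # passes: the structure is fine

# Geometric correctness needs its own check on top of schema validation.
x_min, y_min, x_max, y_max = output["bbox"]
if not (x_max > x_min and y_max > y_min):
    print("schema-valid but geometrically invalid bounding box")
```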
“The Active Cognition-based Reasoning (ACR) module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT.”
“ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.”
“BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.”
“The ALEAHallu framework follows an 'Activate-Locate-Edit Adversarially' paradigm, fine-tuning hallucination-prone parameter clusters using adversarially tuned prefixes to maximize visual neglect.”
“The LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model.”
“Even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, including when presented with a copyright notice.”
“Intermediate hidden states consistently outperform caption-based representations.”
“DIOR outperforms existing training-free baselines, including CLIP.”
“By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk.”
“Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method on MovieNet.”
“The architecture uses a consortium of heterogeneous LLM and VLM agents to generate candidate outputs, a dedicated reasoning agent for consolidation, and explicit cross-model comparison for explainability.”
“The paper reveals pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides.”
“The paper focuses on fine-tuning vision-language models.”
“...training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities”
“adaptive preprocessing reduces per-image inference time by over 50%”
“The paper originates from ArXiv, suggesting it's a research publication.”
“LookPlanGraph leverages VLM graph augmentation.”
“VisRes Bench is a benchmark for evaluating the visual reasoning capabilities of VLMs.”
“The paper originates from ArXiv, indicating it is a pre-print or research publication.”
“"At its core is a novel coarse-to-fine autonomous data generation pipeline without manual intervention."”