LLM Jigsaw: Benchmarking Spatial Reasoning in VLMs - frontier models hit a wall at 5x5 puzzles
Analysis
Key Takeaways
“Our estimator can be trained without computing the autocovariance kernels and it can be parallelized to provide the estimates much faster than existing approaches.”
“The paper presents an online variational inference framework to compute its approximation at each time step.”
“The method achieves approximately $4\sim10\times$ and $2\times$ speedups while using $1000$ cores, respectively, under the same level of structural and thermodynamic accuracy and with a reduced memory usage.”
“The method effectively suppresses spectral pollution and resolves closely spaced spectral components.”
“The algorithm minimizes a discretized version of the energy using finite elements, generalizing existing TV-minimization methods.”
“Neural operators are a powerful novel tool for high-performance control when hidden low-dimensional structure can be exploited, yet they remain fundamentally constrained by the intrinsic dimensional complexity in more challenging settings.”
“The authors report an experimentally accessible protocol for directly measuring the mixed-state topological invariant.”
“This work successfully reveals the intrinsic topological characteristics encoded within the Floquet eigenstates themselves.”
“The proposed method applies SSL comprehensively for both the architecture search and model pretraining processes.”
“The multimodal design achieved an 83% boost in 31P B1 efficiency and a 21% boost in 1H B1 efficiency at the coil center compared to same-sized single-tuned references.”
“A novel contribution arises that is proportional to the phase derivative of the levels broadening.”
“The proposed framework achieves superior performance over conventional sensing methods with reduced sensing power.”
“Utilizing 2:4 sparsity combined with quantization on $4096\times 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines.”
“The algorithm's runtime scaling is $O(N 4^N)$, a significant improvement over the brute-force approach.”
“The paper proposes a multi-modal cross-domain mixed fusion model with dual disentanglement for fault diagnosis.”
“The method achieves improved performance over state-of-the-art reconstruction methods, without task-specific supervised training or fine-tuning.”
“The algorithm achieves minimax-optimal regret independent of the ambient dimension $d$, thereby overcoming the curse of dimensionality.”
“The proposed HOOA achieves significant improvements, reducing average task completion delay by 2.5% and average energy consumption by 3.1% compared with the best-performing benchmark approach and the state-of-the-art DRL algorithm, respectively.”
“HMP-DRL consistently outperforms other methods, including state-of-the-art approaches, in terms of key metrics of robot navigation: success rate, collision rate, and time to reach the goal.”
“The model improves multi-hop reasoning accuracy by 16.8 percent on HotpotQA, 14.3 percent on 2WikiMultihopQA, and 19.2 percent on MeetingBank, while improving consistency by 21.5 percent.”
“CPR achieves an F1 score of 0.632 under SAP attacks, surpassing Median Smoothing (0.541 F1) by 9.1%.”
“HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT) to enhance reasoning diversity and a pairwise reward model for capturing subjective humor.”
“The paper claims "significant superiority" and "faster convergence, enhanced training stability, and improved robustness to noise interference" compared to conventional optimization algorithms.”
“DISF reduces CoM misalignment while maintaining geometric compatibility, translating into higher grasp success in both simulation and real-world execution compared to baselines.”
“The model achieves 25.96 dB PSNR and 0.8375 SSIM on the test set, demonstrating its effectiveness in compressing low-resolution video while maintaining good perceptual quality.”
“The two-stage approach decomposes spatial reasoning into atomic building blocks and their composition.”
“The algorithm achieves linear time complexity when the number of colors is greater than 3.637 times the maximum degree plus 1.”
“The paper establishes TEDPara and YTSegPara as the first benchmarks for the paragraph segmentation task in the speech domain.”
“The approach achieves a 20.15% reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log-transformed MS-SSIM gain over strong learned baselines.”
“The paper introduces a general, model-agnostic training and inference framework for joint generative forecasting and shows how it enables assessment of forecast robustness and reliability using three complementary uncertainty quantification metrics.”
“The employed pulses combine active interference cancellation (AIC) and adaptive symbol transition (AST) terms in a transparent way to the receiver.”
“A "pointy" limit set implies asymptotic dependence, offering practical geometric criteria for identifying extremal dependence classes.”
“Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones.”
“The paper proposes a two-step MCMC algorithm in a Bayesian framework to overcome the issue of partial observations.”
“The paper presents an encoder-only transformer built with minimum layers for intrusion detection.”
“SaM2B leverages lightweight cues such as environmental visual, flight posture, and geospatial data to adaptively allocate contributions across modalities at different time points through reliability-aware dynamic weight updates.”
“The paper highlights that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities.”
“The core of PartMotionEdit is a Part-aware Motion Modulation (PMM) module, which builds upon a predefined five-part body decomposition.”
“The framework incorporates a Salient Region Selection module and a Jacobian Vector Product Guidance mechanism to generate physically plausible adversarial objects.”
“Random multiplexing achieves statistical fading-channel ergodicity for transmitted signals by constructing an equivalent input-isotropic channel matrix in the random transform domain.”
“The approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.”
“The method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes.”
“Experimental results show that our unsupervised approach still attains 80% test accuracy, on par with several supervised deep learning-based strategies.”
“The multimodal Transformer achieves RMSE = 0.90 mm and R^2 = 0.97 on the test set on the eastern Ireland tile (E32N34).”
“The proposed method outperforms zero forcing (ZF) and maximum ratio transmission (MRT) techniques, particularly in high-interference scenarios, while remaining robust to CSI imperfections.”
“The method combines two known field theories into a new composite field theory whose target space is the product of the original target spaces.”
“The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances.”
“The paper proposes to leverage connections between proximal operators and Hamilton-Jacobi partial differential equations (HJ PDEs) to develop novel deep learning architectures for learning the prior.”
“The paper proposes to use a standard reduced basis method (RBM) to construct this low-order rational function. Algorithmically, this procedure is an iterative greedy approach, where the greedy objective is evaluated through an error estimator that exploits the linearity of the frequency domain representation.”
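One of the takeaways above reports CPR at 0.632 F1 against 0.541 F1 for Median Smoothing and describes the gap as 9.1%. As a quick check of that arithmetic, the sketch below computes the gap two ways: as an absolute difference in percentage points and as an improvement relative to the baseline. The function names are illustrative only and do not come from the quoted paper; the two F1 values are the only inputs taken from the text.

```python
# F1 scores quoted in the takeaway above (0-1 scale).
cpr_f1 = 0.632
median_smoothing_f1 = 0.541


def absolute_gain_points(new: float, old: float) -> float:
    """Difference between two 0-1 scores, expressed in percentage points."""
    return round((new - old) * 100, 1)


def relative_gain_percent(new: float, old: float) -> float:
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return round((new - old) / old * 100, 1)


points = absolute_gain_points(cpr_f1, median_smoothing_f1)
relative = relative_gain_percent(cpr_f1, median_smoothing_f1)

print(f"absolute gain: {points} percentage points")
print(f"relative gain: {relative}%")
```

Under these two numbers, the quoted 9.1 figure matches the absolute difference in percentage points, while the relative improvement over the baseline works out to a larger value.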