Paper#Computer Vision, Natural Language Processing, 3D Scene Understanding🔬 ResearchAnalyzed: Jan 3, 2026 08:39
2D-Trained Systems Adapt to 3D Scenes
Published:Dec 31, 2025 12:39
•1 min read
•ArXiv
Analysis
This paper addresses the challenge of applying 2D vision-language models to 3D scenes. The core contribution is a novel method for controlling an in-scene camera to bridge the dimensionality gap, enabling adaptation to object occlusions and feature differentiation without requiring pretraining or finetuning. The use of derivative-free optimization for regret minimization in mutual information estimation is a key innovation.
Key Takeaways
- •Addresses the problem of applying 2D vision-language models to 3D scenes.
- •Introduces a method for controlling an in-scene camera.
- •Employs derivative-free optimization for improved mutual information estimation.
- •Enables adaptation to object occlusions and feature differentiation.
- •Avoids the need for pretraining or finetuning.
Reference
“Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features.”