Search:
Match:
1 results

Analysis

This paper addresses the challenge of applying 2D vision-language models to 3D scenes. The core contribution is a novel method for controlling an in-scene camera to bridge the dimensionality gap, enabling adaptation to object occlusions and feature differentiation without requiring pretraining or finetuning. The use of derivative-free optimization for regret minimization in mutual information estimation is a key innovation.
Reference

Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features.