ARM: Enhancing CLIP for Open-Vocabulary Segmentation
Analysis
Key Takeaways
- Proposes ARM, a lightweight, learnable module for improving CLIP-based open-vocabulary semantic segmentation.
- ARM uses a 'train once, use anywhere' paradigm, acting as a plug-and-play post-processor.
- Addresses the limitations of CLIP's coarse image-level representations by refining pixel-level details.
- Demonstrates improved performance on multiple benchmarks with negligible inference overhead.
“ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block.”
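The quoted description maps naturally onto standard attention primitives: shallow features supply the queries, deep features supply keys and values in a cross-attention step, and a self-attention step refines the fused result. Below is a minimal PyTorch sketch of that idea. All names (`ARMBlock`, `dim`, `num_heads`), the normalization placement, and the residual wiring are illustrative assumptions, not the authors' actual implementation.

```python
# A minimal sketch of the described fusion block, assuming token sequences
# taken from different depths of the CLIP image encoder. Not the paper's code.
import torch
import torch.nn as nn


class ARMBlock(nn.Module):
    """Fuses detail-rich shallow features with semantically robust deep features.

    Cross-attention: shallow features form the queries (Q); deep features
    supply keys and values (K, V), guiding which spatial details to keep.
    A self-attention block then refines the fused representation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # shallow, deep: (batch, num_tokens, dim) patch-token sequences.
        q = self.norm_q(shallow)
        kv = self.norm_kv(deep)
        fused, _ = self.cross_attn(query=q, key=kv, value=kv)
        fused = shallow + fused  # residual keeps the shallow spatial detail
        x = self.norm_out(fused)
        refined, _ = self.self_attn(x, x, x)
        return fused + refined


# Usage sketch: refine patch tokens before computing text-patch similarity.
if __name__ == "__main__":
    arm = ARMBlock(dim=512)
    shallow_feats = torch.randn(2, 196, 512)  # early-layer tokens (fine detail)
    deep_feats = torch.randn(2, 196, 512)     # final-layer tokens (robust semantics)
    out = arm(shallow_feats, deep_feats)
    print(out.shape)  # torch.Size([2, 196, 512])
```

Because the block operates on encoder features rather than on the segmentation head, it can be trained once and attached to different CLIP-based segmenters as a post-processing refinement step, consistent with the plug-and-play claim above.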