ARM: Enhancing CLIP for Open-Vocabulary Segmentation
Published: Dec 30, 2025 • 1 min read • ArXiv
Analysis
This paper introduces the Attention Refinement Module (ARM), a lightweight, learnable module that improves CLIP-based open-vocabulary semantic segmentation. The key contribution is a 'train once, use anywhere' paradigm: ARM is trained once and then applied as a plug-and-play post-processor. It addresses the limitation that CLIP's image-level representations are too coarse for dense prediction by adaptively fusing hierarchical features and refining pixel-level details. The paper's significance lies in its efficiency and effectiveness, offering a computationally inexpensive solution to a challenging problem in open-vocabulary dense prediction.
Key Takeaways
- Proposes ARM, a lightweight, learnable module for improving CLIP-based open-vocabulary semantic segmentation.
- ARM follows a 'train once, use anywhere' paradigm, acting as a plug-and-play post-processor (see the sketch after this list).
- Addresses the limitations of CLIP's coarse image-level representations by refining pixel-level details.
- Demonstrates improved performance on multiple benchmarks with negligible inference overhead.
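To make the plug-and-play idea concrete, here is a minimal inference sketch of how ARM-refined dense features could be scored against CLIP text embeddings in the usual CLIP-style way. The tensors and names below (`refined`, `text_embeds`) are illustrative placeholders, not the paper's code, and the shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Hypothetical inference sketch: ARM-refined per-patch features are compared
# against CLIP text embeddings of candidate class names. All tensors here are
# random placeholders standing in for real model outputs.
B, H, W, C = 1, 14, 14, 512
refined = torch.randn(B, H * W, C)             # per-patch features after ARM
text_embeds = torch.randn(20, C)               # one CLIP text embedding per class

refined = F.normalize(refined, dim=-1)         # cosine similarity, CLIP-style
text_embeds = F.normalize(text_embeds, dim=-1)
logits = refined @ text_embeds.t()             # (B, H*W, num_classes)
seg_map = logits.argmax(dim=-1).view(B, H, W)  # coarse class map; upsample for pixel output
```

Because the backbone stays frozen and the scoring step is unchanged, a module trained once in this setting can be dropped in front of this step for any compatible feature extractor, which is what 'train once, use anywhere' suggests.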
Reference
“ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block.”
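Read literally, the quoted design maps onto a standard attention pattern: shallow features provide the queries, deep features provide the keys and values, and a self-attention block follows. The PyTorch sketch below shows one way to realize that reading; the head count, norm placement, and residual connections are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ARMBlock(nn.Module):
    """Sketch of the quoted refinement step: semantically-guided cross-attention
    (deep features -> K/V, shallow features -> Q) followed by self-attention.
    Layer sizes, norms, and residuals are assumptions."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_sa = nn.LayerNorm(dim)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # shallow, deep: (B, N, C) patch-token sequences from different CLIP layers.
        q = self.norm_q(shallow)               # detail-rich shallow features -> queries
        kv = self.norm_kv(deep)                # robust deep features -> keys/values
        attn, _ = self.cross_attn(q, kv, kv)   # deep semantics select/refine shallow detail
        x = shallow + attn                     # residual connection (assumed)
        z = self.norm_sa(x)
        attn, _ = self.self_attn(z, z, z)      # self-attention block from the quote
        return x + attn
```

Using the deep features only as K/V keeps the output aligned with the detail-rich shallow tokens while letting the more semantically robust layer decide which details to emphasize, which matches the quote's framing of "select and refine".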