Rethinking Fine-Tuning for Vision-Language Models
Analysis
Key Takeaways
- Proposes Mask Fine-Tuning (MFT) for Vision-Language Models (VLMs).
- MFT reparameterizes the model using learnable gating scores instead of weight updates.
- Demonstrates superior performance compared to LoRA and full fine-tuning.
- Highlights the importance of re-establishing connections within existing model knowledge for effective adaptation.
- Offers a more efficient and potentially less destructive fine-tuning approach.
“MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone.”
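To make the idea concrete, here is a minimal sketch of gating-based adaptation over a frozen layer. This is an illustrative reconstruction, not the paper's implementation: the class name `MaskedLinear`, the per-weight sigmoid gate, and the score initialization are all assumptions chosen to show the general mechanism of learning gating scores while leaving pretrained weights untouched.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MaskedLinear:
    """Hypothetical MFT-style layer: the pretrained weight W is frozen;
    only the per-weight gating scores would receive gradients.

    Effective weight = W * sigmoid(scores), so training reshapes which
    connections are active rather than changing the weights themselves.
    """

    def __init__(self, W, b):
        self.W = W  # frozen pretrained weight, shape (out_dim, in_dim)
        self.b = b  # frozen pretrained bias, shape (out_dim,)
        # Initialize scores large and positive so sigmoid(scores) ~ 1 and
        # the layer starts out behaving like the original pretrained layer.
        self.scores = np.full_like(W, 4.0)  # learnable parameters

    def forward(self, x):
        gate = sigmoid(self.scores)          # soft mask in (0, 1)
        return x @ (self.W * gate).T + self.b

    def forward_hard(self, x):
        # At inference, the soft gate can be binarized into a hard mask,
        # pruning connections whose scores fell below zero during training.
        mask = (self.scores > 0).astype(self.W.dtype)
        return x @ (self.W * mask).T + self.b
```

Because only `scores` is trainable, the optimizer state and gradient memory scale with the mask rather than with full weight updates, and the frozen backbone can be recovered exactly by discarding the gates.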