Frozen LVLMs for Micro-Video Recommendation: A Systematic Study
Published: Dec 26, 2025 04:56
•ArXiv
Analysis
This paper examines how frozen Large Video Language Models (LVLMs) should be used for micro-video recommendation. It empirically and systematically evaluates different feature extraction and fusion strategies, a choice practitioners face whenever they plug such a model into a recommender. The findings give concrete guidance for integrating LVLMs into recommender systems rather than treating them as black boxes. The paper also proposes a Dual Feature Fusion (DFF) Framework, which it reports as achieving state-of-the-art performance.
Key Takeaways
- Intermediate hidden states from LVLMs yield better features than caption-based representations for micro-video recommendation.
- Fusing LVLM features with ID embeddings outperforms replacing ID embeddings with LVLM features.
- LVLM layers differ in effectiveness, which motivates multi-layer feature fusion.
- The proposed Dual Feature Fusion (DFF) Framework is reported to achieve state-of-the-art results when integrating LVLMs into micro-video recommender systems.
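To make the fusion takeaways concrete, here is a minimal sketch in plain Python. It is not the paper's DFF Framework, whose details this summary does not give. It assumes, purely for illustration, that per-layer LVLM features have already been extracted from a frozen model and projected to the ID-embedding dimension, that layers are combined with softmax-normalized weights, and that the content feature is added to the ID embedding through a scalar gate. The names `fuse_layers`, `fuse_with_id`, and `gate` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_feats, layer_logits):
    """Weighted sum of per-layer feature vectors (one vector per LVLM layer).

    Assumption: all layer vectors share the same dimension.
    """
    weights = softmax(layer_logits)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]


def fuse_with_id(id_emb, content_feat, gate):
    """Keep the ID embedding and add the content feature scaled by `gate`.

    This fuses rather than replaces: gate = 0 recovers the pure ID embedding.
    """
    return [i + gate * c for i, c in zip(id_emb, content_feat)]


def score(user_emb, item_emb):
    """Dot-product relevance score between a user and an item."""
    return sum(u * v for u, v in zip(user_emb, item_emb))


# Toy example: two layers, two-dimensional features, equal layer weights.
layer_feats = [[1.0, 0.0], [0.0, 1.0]]
content = fuse_layers(layer_feats, layer_logits=[0.0, 0.0])
item = fuse_with_id(id_emb=[1.0, 2.0], content_feat=content, gate=0.5)
relevance = score(user_emb=[1.0, 1.0], item_emb=item)
```

In a trained system the layer logits and the gate would be learned parameters, and the frozen LVLM would only supply the per-layer features.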
Reference
“Intermediate hidden states consistently outperform caption-based representations.”