Empowering VLMs for Humorous Meme Generation
Analysis
This paper introduces HUMOR, a framework for improving how well Vision-Language Models (VLMs) generate humorous memes. It moves beyond direct image-to-caption generation in two ways: it adds hierarchical Chain-of-Thought (CoT) reasoning, and it aligns outputs with human preferences through a reward model and reinforcement learning. Its stated novelty lies in the multi-path CoT and group-wise preference learning, both aimed at more diverse and higher-quality memes.
Key Takeaways
- Proposes HUMOR, a framework for meme generation using VLMs.
- Employs a hierarchical Chain-of-Thought for diverse reasoning.
- Utilizes a pairwise reward model for capturing subjective humor and aligning with human preferences.
- Demonstrates superior reasoning diversity, preference alignment, and meme quality in experiments.
- Presents a general training paradigm for human-aligned multimodal generation.
“HUMOR employs a hierarchical, multi-path Chain-of-Thought (CoT) to enhance reasoning diversity and a pairwise reward model for capturing subjective humor.”
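To make the idea of a pairwise reward model concrete, here is a minimal sketch of the kind of objective such models are commonly trained with. It assumes a Bradley-Terry style loss, which this summary does not confirm as HUMOR's actual formulation. The function names and the example scores are hypothetical, and in practice the scalar scores would come from a learned network that rates each meme.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: this is a common choice for pairwise reward models,
    not necessarily the loss used by HUMOR.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_probability(score_a: float, score_b: float) -> float:
    """Modelled probability that meme A is preferred over meme B."""
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Hypothetical scores for two memes built on the same template.
funnier_score, weaker_score = 1.8, 0.4
loss = pairwise_preference_loss(funnier_score, weaker_score)
prob = preference_probability(funnier_score, weaker_score)
```

The loss shrinks as the model scores the human-preferred meme further above the rejected one, so training only needs relative judgments ("this one is funnier than that one") rather than absolute humor ratings, which suits a subjective quality like humor.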