Research Paper | Tags: Vision Transformers, Compositionality, Wavelet Transforms | Analyzed: Jan 3, 2026 09:28
Compositionality in Vision Transformers Explored with Wavelets
Published: Dec 30, 2025 19:43
Source: ArXiv
Analysis
This paper investigates compositionality in Vision Transformers (ViTs), using the Discrete Wavelet Transform (DWT) to derive input-dependent primitives from each image. It adapts a compositionality framework originally developed for language tasks to analyze how ViT encoders structure information. The results suggest that ViT encoders may behave compositionally in latent space: the representations of primitives obtained from a one-level DWT decomposition approximately compose.
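To make "input-dependent primitives" concrete, here is a minimal sketch of one plausible reading: run a one-level 2D DWT, then reconstruct the image from each subband alone so that every primitive lives in image space. The Haar wavelet, the function names, and the subband-wise reconstruction are assumptions made for illustration; the summary does not say which wavelet or construction the paper uses.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT on an array with even height and width.

    Returns four half-resolution subbands (ll, lh, hl, hh).
    """
    img = np.asarray(img, dtype=float)
    a, b = img[0::2, 0::2], img[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = img[1::2, 0::2], img[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0                # local average (approximation)
    lh = (a + b - c - d) / 2.0                # top rows minus bottom rows
    hl = (a - b + c - d) / 2.0                # left columns minus right columns
    hh = (a - b - c + d) / 2.0                # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution array."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=float)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def dwt_primitives(img):
    """Hypothetical helper: one image-space primitive per subband.

    Each primitive is the inverse DWT with all other subbands zeroed.
    """
    bands = haar_dwt2(img)
    primitives = []
    for k in range(len(bands)):
        kept = [b if i == k else np.zeros_like(b) for i, b in enumerate(bands)]
        primitives.append(haar_idwt2(*kept))
    return primitives
```

Because the transform is linear, the four primitives sum back to the original image, which is what makes them natural candidates for a compositionality test: the input decomposes exactly, and the question is whether the encoder's representation does too.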
Key Takeaways
- Applies a compositionality analysis framework, previously used for language models, to Vision Transformers.
- Utilizes Discrete Wavelet Transforms (DWTs) to generate image primitives.
- Finds evidence of compositional behavior in ViT latent space using DWT-based primitives.
- Offers a new perspective on how ViTs structure visual information.
Reference
“Primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space.”
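One way to read "approximately compose" is that the encoder's representation of the full image lies close to some combination of its representations of the primitives. The sketch below measures that gap using plain summation as the composition function. This is an assumption: the paper may use a learned composition function instead. The encoders here are random stand-ins and the inputs are placeholder vectors, not a ViT and not real images.

```python
import numpy as np

def composition_error(full_emb, primitive_embs):
    """Relative L2 gap between the full embedding and the summed primitive embeddings.

    0.0 means the representations compose exactly under summation.
    """
    composed = np.sum(primitive_embs, axis=0)
    return np.linalg.norm(full_emb - composed) / np.linalg.norm(full_emb)

# Placeholder setup: 4 flattened "primitives" whose sum plays the role of the image.
rng = np.random.default_rng(0)
parts = rng.normal(size=(4, 64))
x = parts.sum(axis=0)

# Stand-in encoders (hypothetical, for illustration only).
W = rng.normal(size=(16, 64))

def linear_encoder(v):
    return W @ v

def nonlinear_encoder(v):
    return np.tanh(W @ v)

err_linear = composition_error(
    linear_encoder(x), np.stack([linear_encoder(p) for p in parts])
)
err_nonlinear = composition_error(
    nonlinear_encoder(x), np.stack([nonlinear_encoder(p) for p in parts])
)
print(f"linear encoder:    {err_linear:.2e}")
print(f"nonlinear encoder: {err_nonlinear:.2e}")
```

A linear encoder composes exactly under summation (up to floating-point error), while a nonlinear one generally does not. For a real ViT encoder, a small but nonzero error on DWT primitives would be the kind of evidence the quoted claim describes.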