Compositionality in Vision Transformers Explored with Wavelets
Analysis
Key Takeaways
- Applies a compositionality analysis framework, previously used for language models, to Vision Transformers.
- Utilizes Discrete Wavelet Transforms (DWTs) to generate image primitives.
- Finds evidence of compositional behavior in ViT latent space using DWT-based primitives.
- Offers a new perspective on how ViTs structure visual information.
“Primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space.”
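To make the idea in the quote concrete, here is a minimal sketch in pure Python. It assumes a Haar wavelet and builds four primitives by reconstructing the image from one sub-band at a time, so the primitives sum back to the original image. The `toy_encoder` function is a hypothetical stand-in (a fixed random projection followed by `tanh`), not a ViT, and comparing the whole-image embedding with the plain sum of primitive embeddings is only one simple choice of composition function; the summary does not say which wavelet or composition function the authors use. Sub-band names (LL, LH, HL, HH) follow one common convention, and conventions vary between libraries.

```python
import math
import random

BANDS = ("LL", "LH", "HL", "HH")

def haar_dwt2(img):
    """One-level 2D Haar DWT of a grayscale image with even height and width."""
    h, w = len(img) // 2, len(img[0]) // 2
    bands = {k: [[0.0] * w for _ in range(h)] for k in BANDS}
    for i in range(h):
        for j in range(w):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            bands["LL"][i][j] = (a + b + c + d) / 2  # local average
            bands["LH"][i][j] = (a + b - c - d) / 2  # top-minus-bottom detail
            bands["HL"][i][j] = (a - b + c - d) / 2  # left-minus-right detail
            bands["HH"][i][j] = (a - b - c + d) / 2  # diagonal detail
    return bands

def haar_idwt2(bands):
    """Inverse of haar_dwt2: rebuild the full-resolution image."""
    h, w = len(bands["LL"]), len(bands["LL"][0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            ll, lh, hl, hh = (bands[k][i][j] for k in BANDS)
            img[2 * i][2 * j] = (ll + lh + hl + hh) / 2
            img[2 * i][2 * j + 1] = (ll + lh - hl - hh) / 2
            img[2 * i + 1][2 * j] = (ll - lh + hl - hh) / 2
            img[2 * i + 1][2 * j + 1] = (ll - lh - hl + hh) / 2
    return img

def primitives(img):
    """Reconstruct the image from each sub-band alone (other bands zeroed)."""
    bands = haar_dwt2(img)
    out = {}
    for name in BANDS:
        only = {k: (v if k == name else [[0.0] * len(v[0]) for _ in v])
                for k, v in bands.items()}
        out[name] = haar_idwt2(only)
    return out

def toy_encoder(img, dim=8, seed=0):
    """Hypothetical placeholder for a ViT encoder: random projection + tanh."""
    rng = random.Random(seed)  # same seed on every call, so weights are fixed
    flat = [p for row in img for p in row]
    return [math.tanh(sum(rng.uniform(-1, 1) * p for p in flat) / len(flat))
            for _ in range(dim)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Small synthetic 4x4 "image" used purely for illustration.
image = [[float((3 * r + 5 * c) % 7) for c in range(4)] for r in range(4)]
prims = primitives(image)

# In pixel space the primitives add up to the original image (DWT is linear).
recon = [[sum(prims[k][r][c] for k in BANDS) for c in range(4)] for r in range(4)]
max_err = max(abs(recon[r][c] - image[r][c]) for r in range(4) for c in range(4))

# In latent space, compare the whole-image embedding with the summed embeddings.
whole = toy_encoder(image)
composed = [sum(vals) for vals in zip(*(toy_encoder(prims[k]) for k in BANDS))]
score = cosine(whole, composed)
print(f"pixel reconstruction error: {max_err:.2e}, latent cosine: {score:.3f}")
```

Exact composition holds in pixel space because the transform is linear. Whether it also holds approximately after a nonlinear encoder is the empirical question: a cosine score near 1 would suggest that the encoder's representation of the whole is close to the combination of its representations of the parts. With a real model, `toy_encoder` would be replaced by the ViT's encoder output.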