Compositionality in Vision Transformers Explored with Wavelets

Research Paper | Tags: Vision Transformers, Compositionality, Wavelet Transforms | Analyzed: Jan 3, 2026 09:28

Published: Dec 30, 2025 19:43

Analysis

This paper investigates the compositionality of Vision Transformers (ViTs) by using Discrete Wavelet Transforms (DWTs) to create input-dependent primitives. It adapts a framework originally developed for language tasks to analyze how ViT encoders structure information. Probing ViT representations with DWT-based primitives is a novel approach, and the results suggest that ViTs may exhibit compositional behavior in their latent space.
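To make "input-dependent primitives" concrete, here is a minimal sketch of one way a one-level DWT can split an image into primitives. It assumes a Haar wavelet and a NumPy-only implementation; the summary does not say which wavelet or library the paper uses, and the function names (`haar_dwt2`, `haar_idwt2`, `dwt_primitives`) are hypothetical. Each primitive is the image rebuilt from a single subband, so the primitives depend on the input and sum back to it.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(x):
    """One-level 2D Haar DWT of an array with even height and width."""
    # Transform along rows (pairs of adjacent rows).
    a = (x[0::2, :] + x[1::2, :]) / SQRT2  # low-pass
    d = (x[0::2, :] - x[1::2, :]) / SQRT2  # high-pass
    # Transform along columns. Subband labels follow one common
    # convention; other libraries may swap "LH" and "HL".
    return {
        "LL": (a[:, 0::2] + a[:, 1::2]) / SQRT2,
        "LH": (a[:, 0::2] - a[:, 1::2]) / SQRT2,
        "HL": (d[:, 0::2] + d[:, 1::2]) / SQRT2,
        "HH": (d[:, 0::2] - d[:, 1::2]) / SQRT2,
    }

def haar_idwt2(bands):
    """Inverse of haar_dwt2."""
    ll, lh, hl, hh = (bands[k] for k in ("LL", "LH", "HL", "HH"))
    h, w = ll.shape
    a = np.empty((h, 2 * w))
    d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    d[:, 0::2], d[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = (a + d) / SQRT2, (a - d) / SQRT2
    return x

def dwt_primitives(image):
    """Rebuild the image from each subband alone (others zeroed)."""
    bands = haar_dwt2(image)
    primitives = {}
    for name in bands:
        kept = {k: (v if k == name else np.zeros_like(v))
                for k, v in bands.items()}
        primitives[name] = haar_idwt2(kept)
    return primitives

# Usage on a random stand-in "image" (assumption: grayscale, even size).
rng = np.random.default_rng(0)
image = rng.random((8, 8))
primitives = dwt_primitives(image)
recon = sum(primitives.values())  # the transform is linear, so this equals image
```

Because the transform is linear and invertible, the four primitives are full-resolution images that add up exactly to the original, which is what makes them usable as inputs to an encoder.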

Key Takeaways

• Applies a compositionality analysis framework, previously used for language models, to Vision Transformers.
• Utilizes Discrete Wavelet Transforms (DWTs) to generate image primitives.
• Finds evidence of compositional behavior in ViT latent space using DWT-based primitives.
• Offers a new perspective on how ViTs structure visual information.

Reference / Citation

"Primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space."
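One simple way to quantify "approximately compose" is sketched below. It is an illustration under stated assumptions, not the paper's procedure: the composition rule here is plain addition of primitive embeddings (the summary does not specify the rule), `toy_encoder` is a hypothetical stand-in for a ViT encoder, and only two primitives are used. The low-pass primitive is computed as the 2x2 block mean, which is what a one-level Haar approximation subband reconstructs to; the remainder lumps the detail subbands together.

```python
import numpy as np

def haar_lowpass_primitive(image):
    """Image rebuilt from the Haar approximation subband alone.

    For a one-level Haar DWT this equals replacing every 2x2 block
    with its mean (height and width must be even).
    """
    h, w = image.shape
    block_means = image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return np.repeat(np.repeat(block_means, 2, axis=0), 2, axis=1)

def toy_encoder(image, weights):
    """Hypothetical nonlinear encoder standing in for a ViT."""
    return np.tanh(weights @ image.ravel())

def composition_score(encoder, image, primitives):
    """Cosine similarity between the embedding of the full image and
    the sum of the primitive embeddings (additive rule is an assumption)."""
    target = encoder(image)
    composed = sum(encoder(p) for p in primitives)
    denom = np.linalg.norm(composed) * np.linalg.norm(target)
    return float(composed @ target / denom)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
weights = 0.1 * rng.standard_normal((16, 64))  # random projection, illustration only

low = haar_lowpass_primitive(image)
detail = image - low  # sum of the three detail primitives
score = composition_score(lambda im: toy_encoder(im, weights), image, [low, detail])
print(f"composition score: {score:.3f}")
```

A purely linear encoder would score exactly 1 under this measure, so the interesting question for a nonlinear encoder is how far below 1 the score falls.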

arXiv, Dec 30, 2025 19:43

* Cited for critical analysis under Article 32.
