Search: 言語モデルで以前使用されていた構成性分析フレームワークをVision - ai.jp.net

Research Paper #Vision Transformers, Compositionality, Wavelet Transforms 🔬 ResearchAnalyzed: Jan 3, 2026 09:28

Compositionality in Vision Transformers Explored with Wavelets

Published:Dec 30, 2025 19:43

•

1 min read

•

ArXiv

Analysis

This paper investigates the compositionality of Vision Transformers (ViTs) by using Discrete Wavelet Transforms (DWTs) to create input-dependent primitives. It adapts a framework from language tasks to analyze how ViT encoders structure information. The use of DWTs provides a novel approach to understanding ViT representations, suggesting that ViTs may exhibit compositional behavior in their latent space.

Key Takeaways

•Applies a compositionality analysis framework, previously used for language models, to Vision Transformers.
•Utilizes Discrete Wavelet Transforms (DWTs) to generate image primitives.
•Finds evidence of compositional behavior in ViT latent space using DWT-based primitives.
•Offers a new perspective on how ViTs structure visual information.

Reference

“Primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space.”

Permalink ArXiv

Compositionality in Vision Transformers Explored with Wavelets

Analysis

Key Takeaways

📬 Get AI News Delivered

Browse by Category

Trending Topics

📬 Get AI News Delivered

Browse by Category

Trending Topics