Self-Attention Reveals Machine Attention Patterns
Analysis
This paper investigates the inner workings of self-attention in language models, specifically BERT-12, by analyzing the similarities between token vectors generated by the attention heads. It provides insights into how different attention heads specialize in identifying linguistic features like token repetitions and contextual relationships. The study's findings contribute to a better understanding of how these models process information and how attention mechanisms evolve through the layers.
Key Takeaways
- The study analyzes self-attention mechanisms in BERT-12.
- Attention heads specialize in different linguistic features.
- Attention shifts from long-range to short-range similarities through the layers.
- Each head focuses on a unique token and builds similarity pairs around it.
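To make the similarity analysis concrete, the sketch below shows one way to compute cosine similarities between the token vectors a single attention head produces. This is an illustrative toy with random weights, not the paper's exact procedure; the function name `head_token_similarities` and all dimensions are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_token_similarities(X, Wq, Wk, Wv):
    """Cosine similarities between the per-token output vectors of one
    attention head (illustrative sketch, not the paper's method)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))    # attention weights, one row per token
    out = A @ V                          # per-token head outputs
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    unit = out / np.clip(norms, 1e-9, None)
    return unit @ unit.T                 # cosine similarity matrix

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 8
X = rng.normal(size=(n_tokens, d_model))           # toy token embeddings
S = head_token_similarities(
    X,
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
)
print(S.shape)  # (6, 6): pairwise similarities among the head's token vectors
```

Comparing such matrices across heads and layers is one way high long-range vs. short-range similarity structure, as described in the takeaways, could be surfaced.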
“Different attention heads within an attention block focused on different linguistic characteristics, such as identifying token repetitions in a given text or recognizing a token of common appearance in the text and its surrounding context.”