Comprehensive List of Attention Mechanisms
1. Scaled Dot-Product Attention
- Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where d_k is the key dimension.
- Purpose: Computes attention weights by scaling dot products of query and key.
- Example: Used in most Transformer-based models like GPT and BERT.
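The formula above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not from any library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy example: 3 query tokens, 4 key/value tokens, dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of `w` is a probability distribution over the 4 keys.
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.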
2. Multi-Head Attention
- Formula: MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V).
- Purpose: Allows the model to focus on different parts of the sequence in parallel.
- Example: Each head can capture a different relationship; used in essentially all modern Transformers, including GPT-4 and PaLM.
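A minimal sketch of the split-attend-concatenate pattern, with illustrative weight matrices Wq/Wk/Wv/Wo (real implementations fuse these projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Project X into n_heads subspaces, attend in each, then concatenate."""
    seq, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                       # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq, n_heads = 16, 5, 4
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

Note that the heads attend in parallel over d_model/n_heads-dimensional subspaces, so the total cost matches a single full-width attention.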
3. Self-Attention
- Formula: Attention(Q, K, V) with Q = XW^Q, K = XW^K, V = XW^V all derived from the same sequence X.
- Purpose: Attention within the same input sequence.
- Example: Used in BERT for bidirectional context understanding.
4. Cross-Attention
- Formula: Attention(Q_dec, K_enc, V_enc), with queries from the decoder and keys/values from the encoder.
- Purpose: Aligns information between two different sequences (e.g., encoder and decoder).
- Example: Found in T5, BART for tasks like translation and summarization.
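A sketch of the encoder-decoder alignment, with hypothetical decoder/encoder state matrices standing in for real model activations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values from the encoder."""
    Q = decoder_states @ Wq
    K = encoder_states @ Wk
    V = encoder_states @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
dec = rng.normal(size=(3, d))   # 3 target-side tokens
enc = rng.normal(size=(6, d))   # 6 source-side tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(dec, enc, Wq, Wk, Wv)
# One output row per decoder token, each a mixture of encoder values.
```

In translation, each row of the attention matrix is a soft alignment of one target token against all source tokens.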
5. Sparse Attention
- Purpose: Reduces computational cost by only computing attention for a subset of tokens.
- Examples:
- Longformer: Combines sliding-window attention with a few task-specific global tokens for long-document understanding.
- BigBird: Adds random and global connections to windowed attention, handling much longer sequences at linear cost.
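The Longformer-style sparsity pattern can be illustrated as a boolean mask (a simplified sketch; real implementations use banded kernels rather than dense masks):

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_idx=()):
    """Boolean mask: True where attention is allowed.
    Each token attends to a local window; tokens in `global_idx`
    attend to (and are attended by) everything, as in Longformer."""
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window
    for g in global_idx:
        mask[g, :] = True   # the global token attends everywhere
        mask[:, g] = True   # every token attends to the global token
    return mask

mask = longformer_style_mask(seq_len=8, window=1, global_idx=(0,))
# Row 4 can see only positions 3..5, plus the global token at 0.
```

Only the True entries need to be computed, which is how the quadratic cost drops to roughly O(n · window).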
6. Local Attention
- Purpose: Focuses on a fixed window of nearby tokens instead of the entire sequence.
- Example: Sliding-window attention in Longformer and Mistral; Transformer-XL similarly restricts attention to the current segment plus a cached previous segment.
7. Global Attention
- Purpose: Allows designated "global" tokens to attend to, and be attended by, all tokens, improving long-range dependency modeling.
- Example: The global tokens in Longformer and BigBird (e.g., a [CLS]-style token for classification).
8. Causal Attention (Masked Attention)
- Purpose: Prevents tokens from attending to future tokens in autoregressive models.
- Example: GPT uses this to ensure the prediction is based only on past tokens.
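The causal mask can be sketched as follows (illustrative NumPy, not GPT's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Set future positions to -inf so the softmax assigns them zero weight."""
    seq = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # strictly above diagonal
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = causal_attention(X, X, X)
# Row i of `w` puts zero weight on every position j > i.
```

Token 0 can attend only to itself, so its attention weight on position 0 is exactly 1.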
9. Rotary Positional Embedding (RoPE) Attention
- Purpose: Encodes position by rotating query and key vectors, so attention scores depend on the relative offset between tokens rather than their absolute positions.
- Example: Used in LLaMA and GPT-NeoX.
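A simplified RoPE sketch (the dimension-pairing convention varies between implementations, but the key relative-offset property is easy to demonstrate):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding: rotate each (x_i, x_{i+d/2}) pair
    by a position- and frequency-dependent angle."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)          # one frequency per pair
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q_vec, k_vec = rng.normal(size=8), rng.normal(size=8)
Q = np.tile(q_vec, (6, 1))   # the same query vector at every position
K = np.tile(k_vec, (6, 1))   # the same key vector at every position
a = rope(Q)[2] @ rope(K)[5]  # positions 2 and 5 (offset 3)
b = rope(Q)[0] @ rope(K)[3]  # positions 0 and 3 (offset 3)
# a ≈ b: the score depends only on the relative offset, not absolute position.
```

Because rotations are applied before the dot product, the inner product of a query at position m and a key at position n depends only on n - m.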
10. Linear Attention
- Purpose: Replaces the softmax operation to achieve linear time complexity.
- Example: Applied in Performer for efficient handling of large sequences.
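A sketch of kernelized linear attention using the elu(x)+1 feature map from the "Transformers are RNNs" formulation (Performer uses random features instead, so this is illustrative, not Performer's exact method):

```python
import numpy as np

def feature_map(x):
    """ELU(x) + 1: a strictly positive kernel feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Replace softmax(QK^T)V with phi(Q)(phi(K)^T V) / normalizer.
    Computing phi(K)^T V first costs O(n d^2) instead of O(n^2 d)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V                      # (d, d) summary of all keys and values
    Z = Qf @ Kf.sum(axis=0)            # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(100, 8))
K = rng.normal(size=(100, 8))
V = rng.normal(size=(100, 8))
out = linear_attention(Q, K, V)
```

The trick is associativity: (phi(Q) phi(K)ᵀ) V is quadratic in sequence length, but phi(Q) (phi(K)ᵀ V) is linear.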
11. Adaptive Attention
- Purpose: Learns how much context each attention head should use, rather than fixing the span in advance.
- Example: Adaptive Attention Span Transformers learn a per-head span, spending compute on long context only where it helps.
12. Memory-Augmented Attention
- Purpose: Augments attention with an external memory that context can be stored in and retrieved from.
- Example: RETRO retrieves nearest-neighbor text chunks from a large database and attends to them via cross-attention.
13. Hierarchical Attention
- Purpose: Focuses on different levels of granularity in a hierarchical structure (e.g., sentence, paragraph).
- Example: Used in document-level models like HAN (Hierarchical Attention Networks).
14. MoE (Mixture of Experts)
- Purpose: Routes each token to a small subset of expert subnetworks, so only a fraction of the parameters is active per input.
- Example: Switch Transformer, where the experts replace the feed-forward layers (the attention itself is standard).
15. Dynamic Convolution Attention
- Purpose: Replaces self-attention with lightweight convolutions whose kernels are generated dynamically from the current input.
- Example: Used in DynamicConv models for translation at lower cost than full self-attention.
16. Attention with Linear Biases (ALiBi)
- Purpose: Adds a linear, distance-proportional penalty to attention scores in place of positional embeddings.
- Example: Used in BLOOM and MPT, enabling extrapolation to sequences longer than those seen during training.
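A sketch of the ALiBi bias computation for power-of-two head counts as described in the paper (illustrative, not any particular library's code):

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric head-specific slopes: 2^(-8/n), 2^(-16/n), ..."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len, n_heads):
    """Bias added to attention scores: -slope * (distance to the key).
    Nearer keys are penalized less; no positional embeddings needed."""
    dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]  # j - i
    dist = np.minimum(dist, 0)      # causal: only past keys receive a bias
    return alibi_slopes(n_heads)[:, None, None] * dist[None, :, :]

bias = alibi_bias(seq_len=5, n_heads=4)
# bias[h, i, j] = -slope_h * (i - j) for j <= i; bias is 0 on the diagonal.
```

Since the penalty is a fixed linear function of distance, it applies unchanged to positions beyond the training length, which is what makes extrapolation work.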
17. Hybrid Attention
- Purpose: Combines self-attention with recurrent or convolutional layers.
- Example: Used in Hybrid Transformers to capture local and global context.
18. Dual Attention
- Purpose: Uses two attention mechanisms, such as intra-sequence and cross-sequence attention.
- Example: Applied in multimodal models to align image and text representations.
19. Attention over Attention (AoA)
- Purpose: Computes attention on top of an already existing attention distribution.
- Example: Found in AoA Transformers for reading comprehension tasks.
20. Funnel Attention
- Purpose: Reduces sequence length hierarchically while retaining critical information.
- Example: Found in Funnel Transformers for efficient modeling of long texts.
Summary Table
| Attention Mechanism | Purpose | Example |
|---|---|---|
| Scaled Dot-Product | Core mechanism for token alignment | GPT, BERT |
| Multi-Head | Captures diverse relationships | GPT-4, PaLM |
| Self-Attention | Within-sequence token attention | BERT, GPT |
| Cross-Attention | Aligns encoder-decoder sequences | T5, BART |
| Sparse Attention | Efficient attention for subsets | Longformer, BigBird |
| Causal Attention | Prevents future token access | GPT |
| Rotary (RoPE) | Encodes relative position by rotating Q/K | LLaMA |
| Memory-Augmented | Adds external memory retrieval | RETRO |
| Local Attention | Attends to a nearby window of tokens | Longformer, Mistral |
| Global Attention | Designated tokens attend to everything | Longformer, BigBird |
| Hierarchical Attention | Focuses on different text levels | Hierarchical Attention Networks (HAN) |
| MoE Attention | Activates specific attention heads | Switch Transformer |
No single mechanism is best everywhere: the right choice depends on sequence length, compute budget, and the task at hand, which is why modern models often combine several of the patterns above.