Comprehensive List of Attention Mechanisms
1. Scaled Dot-Product Attention
- Formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where d_k is the key dimension.
- Purpose: Computes attention weights by scaling dot products of query and key.
- Example: Used in most Transformer-based models like GPT and BERT.
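The formula above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not from any library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy example: 3 query tokens, 4 key/value tokens, dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of `w` is a probability distribution over the 4 keys.
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.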
2. Multi-Head Attention
- Formula: MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V).
- Purpose: Allows the model to focus on different parts of the sequence in parallel.
- Example: Each head can capture a different relationship; used in essentially all modern Transformers, including GPT-4 and PaLM.
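A minimal sketch of the split-attend-concatenate pattern, with illustrative weight matrices Wq/Wk/Wv/Wo (real implementations fuse these projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Project X into n_heads subspaces, attend in each, then concatenate."""
    seq, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                       # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq, n_heads = 16, 5, 4
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
```

Note that the heads attend in parallel over d_model/n_heads-dimensional subspaces, so the total cost matches a single full-width attention.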
3. Self-Attention
- Formula: Attention(Q, K, V) with Q = XW^Q, K = XW^K, V = XW^V all derived from the same sequence X.
- Purpose: Attention within the same input sequence.
- Example: Used in BERT for bidirectional context understanding.
4. Cross-Attention
- Formula: Attention(Q_dec, K_enc, V_enc), with queries from the decoder and keys/values from the encoder.
- Purpose: Aligns information between two different sequences (e.g., encoder and decoder).
- Example: Found in T5, BART for tasks like translation and summarization.
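A sketch of the encoder-decoder alignment, with hypothetical decoder/encoder state matrices standing in for real model activations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values from the encoder."""
    Q = decoder_states @ Wq
    K = encoder_states @ Wk
    V = encoder_states @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
dec = rng.normal(size=(3, d))   # 3 target-side tokens
enc = rng.normal(size=(6, d))   # 6 source-side tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(dec, enc, Wq, Wk, Wv)
# One output row per decoder token, each a mixture of encoder values.
```

In translation, each row of the attention matrix is a soft alignment of one target token against all source tokens.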
5. Sparse Attention
- Purpose: Reduces computational cost by only computing attention for a subset of tokens.
- Examples:
- Longformer: Combines sliding-window attention with a few task-specific global tokens for long-document understanding.
- BigBird: Adds random and global connections to windowed attention, handling much longer sequences at linear cost.
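The Longformer-style sparsity pattern can be illustrated as a boolean mask (a simplified sketch; real implementations use banded kernels rather than dense masks):

```python
import numpy as np

def longformer_style_mask(seq_len, window, global_idx=()):
    """Boolean mask: True where attention is allowed.
    Each token attends to a local window; tokens in `global_idx`
    attend to (and are attended by) everything, as in Longformer."""
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window
    for g in global_idx:
        mask[g, :] = True   # the global token attends everywhere
        mask[:, g] = True   # every token attends to the global token
    return mask

mask = longformer_style_mask(seq_len=8, window=1, global_idx=(0,))
# Row 4 can see only positions 3..5, plus the global token at 0.
```

Only the True entries need to be computed, which is how the quadratic cost drops to roughly O(n · window).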
6. Local Attention
- Purpose: Focuses on a fixed window of nearby tokens instead of the entire sequence.
- Example: Sliding-window attention in Longformer and Mistral; Transformer-XL similarly restricts attention to the current segment plus a cached previous segment.
7. Global Attention
- Purpose: Allows designated "global" tokens to attend to, and be attended by, all tokens, improving long-range dependency modeling.
- Example: The global tokens in Longformer and BigBird (e.g., a [CLS]-style token for classification).
8. Causal Attention (Masked Attention)
- Purpose: Prevents tokens from attending to future tokens in autoregressive models.
- Example: GPT uses this to ensure the prediction is based only on past tokens.
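The causal mask can be sketched as follows (illustrative NumPy, not GPT's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Set future positions to -inf so the softmax assigns them zero weight."""
    seq = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # strictly above diagonal
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = causal_attention(X, X, X)
# Row i of `w` puts zero weight on every position j > i.
```

Token 0 can attend only to itself, so its attention weight on position 0 is exactly 1.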
9. Rotary Positional Embedding (RoPE) Attention
- Purpose: Encodes position by rotating query and key vectors, so attention scores depend on the relative offset between tokens rather than their absolute positions.
- Example: Used in LLaMA and GPT-NeoX.
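A simplified RoPE sketch (the dimension-pairing convention varies between implementations, but the key relative-offset property is easy to demonstrate):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding: rotate each (x_i, x_{i+d/2}) pair
    by a position- and frequency-dependent angle."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)          # one frequency per pair
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q_vec, k_vec = rng.normal(size=8), rng.normal(size=8)
Q = np.tile(q_vec, (6, 1))   # the same query vector at every position
K = np.tile(k_vec, (6, 1))   # the same key vector at every position
a = rope(Q)[2] @ rope(K)[5]  # positions 2 and 5 (offset 3)
b = rope(Q)[0] @ rope(K)[3]  # positions 0 and 3 (offset 3)
# a ≈ b: the score depends only on the relative offset, not absolute position.
```

Because rotations are applied before the dot product, the inner product of a query at position m and a key at position n depends only on n - m.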
10. Linear Attention
- Purpose: Replaces the softmax operation to achieve linear time complexity.
- Example: Applied in Performer for efficient handling of large sequences.
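A sketch of kernelized linear attention using the elu(x)+1 feature map from the "Transformers are RNNs" formulation (Performer uses random features instead, so this is illustrative, not Performer's exact method):

```python
import numpy as np

def feature_map(x):
    """ELU(x) + 1: a strictly positive kernel feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Replace softmax(QK^T)V with phi(Q)(phi(K)^T V) / normalizer.
    Computing phi(K)^T V first costs O(n d^2) instead of O(n^2 d)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V                      # (d, d) summary of all keys and values
    Z = Qf @ Kf.sum(axis=0)            # per-query normalizer, shape (n,)
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(100, 8))
K = rng.normal(size=(100, 8))
V = rng.normal(size=(100, 8))
out = linear_attention(Q, K, V)
```

The trick is associativity: (phi(Q) phi(K)ᵀ) V is quadratic in sequence length, but phi(Q) (phi(K)ᵀ V) is linear.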
11. Adaptive Attention
- Purpose: Learns how much context each attention head should use, rather than fixing the span in advance.
- Example: Adaptive Attention Span Transformers learn a per-head span, spending compute on long context only where it helps.
12. Memory-Augmented Attention
- Purpose: Augments attention with an external memory that context can be stored in and retrieved from.
- Example: RETRO retrieves nearest-neighbor text chunks from a large database and attends to them via cross-attention.
13. Hierarchical Attention
- Purpose: Focuses on different levels of granularity in a hierarchical structure (e.g., sentence, paragraph).
- Example: Used in document-level models like HAN (Hierarchical Attention Networks).
14. MoE (Mixture of Experts)
- Purpose: Routes each token to a small subset of expert subnetworks, so only a fraction of the parameters is active per input.
- Example: Switch Transformer, where the experts replace the feed-forward layers (the attention itself is standard).
15. Dynamic Convolution Attention
- Purpose: Replaces self-attention with lightweight convolutions whose kernels are generated dynamically from the current input.
- Example: Used in DynamicConv models for translation at lower cost than full self-attention.
16. Attention with Linear Biases (ALiBi)
- Purpose: Adds a linear, distance-proportional penalty to attention scores in place of positional embeddings.
- Example: Used in BLOOM and MPT, enabling extrapolation to sequences longer than those seen during training.
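A sketch of the ALiBi bias computation for power-of-two head counts as described in the paper (illustrative, not any particular library's code):

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric head-specific slopes: 2^(-8/n), 2^(-16/n), ..."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len, n_heads):
    """Bias added to attention scores: -slope * (distance to the key).
    Nearer keys are penalized less; no positional embeddings needed."""
    dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]  # j - i
    dist = np.minimum(dist, 0)      # causal: only past keys receive a bias
    return alibi_slopes(n_heads)[:, None, None] * dist[None, :, :]

bias = alibi_bias(seq_len=5, n_heads=4)
# bias[h, i, j] = -slope_h * (i - j) for j <= i; bias is 0 on the diagonal.
```

Since the penalty is a fixed linear function of distance, it applies unchanged to positions beyond the training length, which is what makes extrapolation work.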
17. Hybrid Attention
- Purpose: Combines self-attention with recurrent or convolutional layers.
- Example: Used in Hybrid Transformers to capture local and global context.
18. Dual Attention
- Purpose: Uses two attention mechanisms, such as intra-sequence and cross-sequence attention.
- Example: Applied in multimodal models to align image and text representations.
19. Attention over Attention (AoA)
- Purpose: Computes attention on top of an already existing attention distribution.
- Example: Found in AoA Transformers for reading comprehension tasks.
20. Funnel Attention
- Purpose: Reduces sequence length hierarchically while retaining critical information.
- Example: Found in Funnel Transformers for efficient modeling of long texts.
Summary Table
| Attention Mechanism | Purpose | Example |
|---|---|---|
| Scaled Dot-Product | Core mechanism for token alignment | GPT, BERT |
| Multi-Head | Captures diverse relationships | GPT-4, PaLM |
| Self-Attention | Within-sequence token attention | BERT, GPT |
| Cross-Attention | Aligns encoder-decoder sequences | T5, BART |
| Sparse Attention | Efficient attention for subsets | Longformer, BigBird |
| Causal Attention | Prevents future token access | GPT |
| Rotary (RoPE) | Encodes relative position by rotating Q/K | LLaMA |
| Memory-Augmented | Adds external memory retrieval | RETRO |
| Local Attention | Attends to a nearby window of tokens | Longformer, Mistral |
| Global Attention | Designated tokens attend to everything | Longformer, BigBird |
| Hierarchical Attention | Focuses on different text levels | Hierarchical Attention Networks (HAN) |
| MoE Attention | Activates specific attention heads | Switch Transformer |
No single mechanism is best everywhere: the right choice depends on sequence length, compute budget, and the task at hand, which is why modern models often combine several of the patterns above.