
List of Attention Mechanisms in LLMs

Comprehensive List of Attention Mechanisms


1. Scaled Dot-Product Attention

  • Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V
  • Purpose: Computes attention weights from the scaled dot products of queries and keys.
  • Example: Used in most Transformer-based models like GPT and BERT.
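The formula above can be sketched in a few lines of NumPy. This is a minimal, unbatched version; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k)
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V, weights

# Toy example: 3 queries, 4 keys/values, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 8)
print(w.sum(axis=-1))  # each row sums to 1
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.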

2. Multi-Head Attention

  • Formula: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
  • Purpose: Allows the model to focus on different parts of the sequence in parallel.
  • Example: Each head captures a unique relationship in GPT-4 and PaLM.
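A simplified NumPy sketch of the idea: project once, split the model dimension into per-head slices, run scaled dot-product attention per head, then concatenate and apply W^O. Real implementations use separate per-head projection matrices and batched reshapes rather than this explicit loop:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Split d_model into n_heads slices of size d_k each.
    n, d_model = X.shape
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_k, (h + 1) * d_k)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(1)
n, d_model, H = 5, 16, 4
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, H)
print(out.shape)  # (5, 16)
```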

3. Self-Attention

  • Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q, K, V all come from the same sequence
  • Purpose: Attention within the same input sequence.
  • Example: Used in BERT for bidirectional context understanding.

4. Cross-Attention

  • Formula: Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q comes from one sequence and K, V from another
  • Purpose: Aligns information between two different sequences (e.g., encoder and decoder).
  • Example: Found in T5, BART for tasks like translation and summarization.
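The only difference from self-attention is where Q, K, and V come from. A stripped-down sketch (for clarity, this version uses the encoder states directly as both keys and values, omitting the learned projections a real model would apply):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    # Queries come from the decoder; keys and values from the encoder.
    d_k = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d_k)
    return softmax(scores) @ encoder_states

rng = np.random.default_rng(2)
enc = rng.standard_normal((7, 8))  # 7 source tokens
dec = rng.standard_normal((3, 8))  # 3 target tokens
out = cross_attention(dec, enc)
print(out.shape)  # (3, 8) -- one context vector per target token
```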

5. Sparse Attention

  • Purpose: Reduces computational cost by only computing attention for a subset of tokens.
  • Example:
    • Longformer: Combines sliding-window attention with a few global tokens for long-document understanding.
    • BigBird: Adds random attention links to window and global patterns, handling long sequences with linear complexity.

6. Local Attention

  • Purpose: Focuses on a fixed window of nearby tokens instead of the entire sequence.
  • Example: Used as sliding-window attention in Longformer and Mistral to handle long sequences efficiently.
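The window constraint is just a Boolean mask over the score matrix. A minimal sketch (window size and sequence length are illustrative):

```python
import numpy as np

def local_attention_mask(n, window):
    # Token i may attend only to tokens j with |i - j| <= window.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(6, 1)
# Banded matrix: ones within one step of the diagonal.
print(mask.astype(int))
```

Positions where the mask is False are set to -inf in the score matrix before the softmax, exactly as in causal masking below, which drops the cost from O(n²) to O(n · window).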

7. Global Attention

  • Purpose: Allows certain "global" tokens to attend to all tokens, improving long-term dependency modeling.
  • Example: Found in Longformer and BigBird, where designated tokens (such as [CLS]) attend to the whole sequence.

8. Causal Attention (Masked Attention)

  • Purpose: Prevents tokens from attending to future tokens in autoregressive models.
  • Example: GPT uses this to ensure the prediction is based only on past tokens.
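A minimal NumPy sketch: the upper triangle of the score matrix (future positions) is set to -inf before the softmax, so those weights become exactly zero:

```python
import numpy as np

def causal_attention(Q, K, V):
    # Mask out scores where key index > query index (future tokens).
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 8))
out, w = causal_attention(X, X, X)
print(np.triu(w, k=1).max())  # 0.0 -- no weight on future tokens
```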

9. Rotary Positional Embedding (RoPE) Attention

  • Purpose: Encodes position by rotating query and key vectors, so attention scores depend on the relative distance between tokens.
  • Example: Used in LLaMA and GPT-NeoX to capture long-range token dependencies.
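A minimal sketch of the rotation itself: each pair of dimensions (2i, 2i+1) of the token at position m is rotated by angle m·θ_i, with θ_i = 10000^(−2i/d). The pairing convention varies between implementations; this version rotates adjacent dimension pairs:

```python
import numpy as np

def apply_rope(x):
    # Rotate dimension pair (2i, 2i+1) of token m by angle m * theta_i,
    # theta_i = 10000 ** (-2i / d). Because rotations compose, the dot
    # product of rotated q_m and k_n depends only on m - n.
    n, d = x.shape
    pos = np.arange(n)[:, None]                   # (n, 1)
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d/2,)
    ang = pos * theta                             # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(4)
q = rng.standard_normal((5, 8))
q_rot = apply_rope(q)
# Rotations preserve vector norms.
print(np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1)))  # True
```

RoPE is applied to queries and keys only (not values), and needs no learned positional parameters, which is part of why it extrapolates well.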

10. Linear Attention

  • Purpose: Replaces the softmax with a kernel feature map so attention can be computed in time linear in sequence length.
  • Example: Applied in Performer and Linear Transformers for efficient handling of long sequences.
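The trick is associativity: with a feature map φ, (φ(Q)φ(K)^T)V equals φ(Q)(φ(K)^T V), and the right-hand side costs O(n·d²) instead of O(n²·d). A sketch using the φ(x) = elu(x) + 1 feature map popularized by the Linear Transformers line of work:

```python
import numpy as np

def linear_attention(Q, K, V):
    # phi(x) = elu(x) + 1 keeps all features positive.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    # Compute Kp^T V (d x d) once, instead of the n x n score matrix.
    kv = Kp.T @ V                 # (d, d)
    z = Qp @ Kp.sum(axis=0)       # (n,) normalizer
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(5)
Q = rng.standard_normal((6, 4))
K = rng.standard_normal((6, 4))
V = rng.standard_normal((6, 4))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

For short sequences the result is identical to computing the full kernel-weighted attention matrix; the savings appear when n greatly exceeds d.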

11. Adaptive Attention

  • Purpose: Dynamically chooses between different attention mechanisms.
  • Example: Found in Adaptive Transformers, switches between local and global attention.

12. Memory-Augmented Attention

  • Purpose: Uses external memory to store and retrieve context.
  • Example: RETRO retrieves relevant documents during inference.

13. Hierarchical Attention

  • Purpose: Focuses on different levels of granularity in a hierarchical structure (e.g., sentence, paragraph).
  • Example: Used in document-level models like HAN (Hierarchical Attention Networks).

14. MoE (Mixture of Experts) Attention

  • Purpose: Routes each token to a small subset of expert sub-networks, so only part of the model is active per input.
  • Example: Switch Transformer applies this routing to its feed-forward layers for efficient scaling.

15. Dynamic Convolution Attention

  • Purpose: Combines attention with dynamic convolutions for better context capture.
  • Example: Used in DynamicConv models.

16. Attention with Linear Biases (ALiBi)

  • Purpose: Adds a distance-proportional linear penalty to attention scores instead of positional embeddings.
  • Example: Used in BLOOM and MPT; allows extrapolation to sequences longer than those seen in training.
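A sketch of the bias matrix: head h gets slope m_h = 2^(−8h/H) (the paper's geometric sequence, exact for power-of-two head counts), and the bias −m_h·|i−j| is simply added to QK^T/√d_k before the softmax:

```python
import numpy as np

def alibi_bias(n, n_heads):
    # Slope for head h (1-indexed): 2 ** (-8h / n_heads).
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)  # (H,)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # |i - j|
    # More negative with distance, so far-away tokens are downweighted.
    return -slopes[:, None, None] * dist                          # (H, n, n)

bias = alibi_bias(5, 4)
print(bias.shape)             # (4, 5, 5)
print(bias[0, 0, 0] == 0.0)   # True -- no penalty at distance zero
```

Because the penalty is a fixed function of distance rather than a learned embedding, it keeps working at positions beyond the training length.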

17. Hybrid Attention

  • Purpose: Combines self-attention with recurrent or convolutional layers.
  • Example: Used in Hybrid Transformers to capture local and global context.

18. Dual Attention

  • Purpose: Uses two attention mechanisms, such as intra-sequence and cross-sequence attention.
  • Example: Applied in multimodal models to align image and text representations.

19. Attention over Attention (AoA)

  • Purpose: Computes attention on top of an already existing attention distribution.
  • Example: Found in AoA Transformers for reading comprehension tasks.

20. Funnel Attention

  • Purpose: Reduces sequence length hierarchically while retaining critical information.
  • Example: Found in Funnel Transformers for efficient modeling of long texts.

Summary Table

Attention Mechanism    | Purpose                                 | Example
Scaled Dot-Product     | Core mechanism for token alignment      | GPT, BERT
Multi-Head             | Captures diverse relationships          | GPT-4, PaLM
Self-Attention         | Within-sequence token attention         | BERT, GPT
Cross-Attention        | Aligns encoder-decoder sequences        | T5, BART
Sparse Attention       | Efficient attention over token subsets  | Longformer, BigBird
Causal Attention       | Prevents future token access            | GPT
Rotary (RoPE)          | Relative positions via rotation         | LLaMA
Memory-Augmented       | Adds external memory retrieval          | RETRO
Local Attention        | Attends to nearby tokens                | Longformer, Mistral
Global Attention       | Designated tokens attend everywhere     | Longformer, BigBird
Hierarchical Attention | Focuses on different text levels        | Hierarchical Attention Networks (HAN)
MoE Attention          | Routes tokens to expert subsets         | Switch Transformer

Choosing among these mechanisms is a trade-off between accuracy, memory, and compute: efficient variants (sparse, local, linear) extend Transformers to longer sequences, while routing and retrieval variants (MoE, memory-augmented) scale capacity.
