Here’s a detailed breakdown of the key transformer layers, their functionalities, and examples of how they are used in Large Language Models (LLMs):
1. Embedding Layer
- Function: Converts input tokens (words, subwords) into dense vector representations.
- Example: In BERT, tokens like "apple" and "banana" are converted to vectors of size 768.
- Types:
- Token Embeddings: Maps vocabulary tokens to dense vectors.
- Positional Embeddings: Adds positional information since transformers lack inherent sequence ordering.
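The token-lookup-plus-position step above can be sketched in NumPy. Sizes here are toy values and the embedding tables are random stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_len, d_model = 1000, 128, 16  # toy sizes (BERT-base uses d_model=768)

# Token and positional embedding tables, normally learned during training.
token_emb = rng.normal(size=(vocab_size, d_model))
pos_emb = rng.normal(size=(max_len, d_model))

def embed(token_ids):
    """Look up token vectors and add positional vectors elementwise."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions]

x = embed(np.array([5, 42, 7]))
print(x.shape)  # (3, 16): one d_model-sized vector per token
```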
2. Self-Attention Layer
- Function: Computes relationships between tokens by determining which tokens should "attend" to others.
- Example:
- In GPT, "The cat sat on the mat" assigns high attention to "cat" and "sat" when predicting "on."
- Variants:
- Scaled Dot-Product Attention: Divides the query–key dot products by √d_k so the softmax inputs stay in a stable range.
- Causal (Masked) Attention: Ensures tokens only attend to previous tokens (used in GPT).
- Bidirectional Attention: Allows tokens to attend both past and future tokens (used in BERT).
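Both variants share one core computation; a minimal NumPy sketch of scaled dot-product attention with an optional causal mask (random toy inputs, no learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot product
    if causal:
        # Mask future positions so token i attends only to tokens <= i (GPT-style).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V  # attention weights times values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out = attention(Q, K, V, causal=True)
print(out.shape)  # (4, 8); row 0 equals V[0] since it can only attend to itself
```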
3. Multi-Head Attention
- Function: Improves model's capacity to capture diverse patterns by splitting attention into multiple "heads."
- Example: GPT-3 uses 96 attention heads per layer to capture complex dependencies across long contexts. (GPT-4's architecture has not been publicly disclosed.)
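The "splitting" is just a reshape: the model dimension is divided evenly across heads, each of which attends independently. A minimal sketch of that reshape:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (seq, d_model) -> (n_heads, seq, d_head) so each head
    attends over its own d_head-sized slice of the features."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    return x.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

x = np.arange(4 * 12, dtype=float).reshape(4, 12)  # 4 tokens, d_model = 12
heads = split_heads(x, n_heads=3)
print(heads.shape)  # (3, 4, 4): 3 heads, 4 tokens, 4 features each
```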
4. Feed-Forward Network (FFN)
- Function: Applies two dense layers with a non-linear activation in between.
- Example:
- In GPT-3, FFN layers are responsible for transforming intermediate token representations.
- Typical Structure: Linear → GELU (or ReLU) → Linear, usually expanding the hidden dimension by 4× in the middle.
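A minimal NumPy sketch of that expand–activate–project structure, with the tanh approximation of GELU used by GPT-2 and BERT (weights are random stand-ins):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2/BERT
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand to 4*d_model, apply GELU, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model = 8
W1 = rng.normal(size=(d_model, 4 * d_model)); b1 = np.zeros(4 * d_model)
W2 = rng.normal(size=(4 * d_model, d_model)); b2 = np.zeros(d_model)
x = rng.normal(size=(3, d_model))
print(ffn(x, W1, b1, W2, b2).shape)  # (3, 8): same shape in and out
```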
5. Layer Normalization
- Function: Normalizes activations within each layer to stabilize learning and prevent vanishing/exploding gradients.
- Example: Found in every transformer layer in BERT and GPT.
- Equation: y = γ · (x − μ) / √(σ² + ε) + β, where μ and σ² are the mean and variance over the feature dimension, and γ, β are learned scale and shift parameters.
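A direct NumPy implementation of layer normalization over the feature dimension:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features to zero mean / unit variance,
    then apply the learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.var())  # approximately 0.0 and 1.0
```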
6. Residual Connections
- Function: Adds the input of a layer back to its output, ensuring information is preserved and gradients flow efficiently.
- Example:
- In PaLM, residual connections allow deeper layers to retain critical information from earlier layers.
- Structure: output = x + Sublayer(x), where Sublayer is the attention or FFN block.
7. Positional Encoding
- Function: Injects positional information into token embeddings, allowing models to capture sequence order.
- Example:
- The original Transformer uses sinusoidal positional encodings; GPT models instead learn their positional embeddings during training.
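The sinusoidal scheme from the original Transformer paper, implemented directly: even feature indices get sin(pos / 10000^(2i/d)), odd indices get the matching cos:

```python
import numpy as np

def sinusoidal_pos_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pos_encoding(50, 16)
print(pe.shape, pe[0, 0], pe[0, 1])  # (50, 16) 0.0 1.0
```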
8. Dropout Layer
- Function: Randomly sets some activations to zero during training to prevent overfitting.
- Example: BERT uses dropout with a probability of 0.1 on attention and FFN layers.
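A minimal sketch of (inverted) dropout: survivors are scaled by 1/(1−p) so the expected activation is unchanged, and the layer is a no-op at inference time:

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Zero each activation with probability p during training;
    scale survivors by 1/(1-p) so the expected value is unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1 - p)

rng = np.random.default_rng(0)
x = np.ones((2, 4))
print(dropout(x, p=0.5, rng=rng))                   # each entry is 0.0 or 2.0
print(dropout(x, p=0.5, rng=rng, training=False))   # identity at inference time
```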
9. Cross-Attention Layer (in Encoder-Decoder Models)
- Function: Aligns representations between two sequences, commonly between the encoder and decoder.
- Example:
- In translation tasks (e.g., T5), cross-attention aligns source and target sequences.
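Cross-attention reuses the same attention arithmetic, but queries come from one sequence (the decoder) while keys and values come from the other (the encoder output). A toy sketch with random projection weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out, Wq, Wk, Wv):
    """Queries from the decoder; keys/values from the encoder output."""
    Q, K, V = dec_x @ Wq, enc_out @ Wk, enc_out @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
dec_x = rng.normal(size=(3, d))    # 3 target-side tokens
enc_out = rng.normal(size=(5, d))  # 5 source-side tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(dec_x, enc_out, Wq, Wk, Wv)
print(out.shape)  # (3, 8): one output per target token
```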
10. Sparse Attention
- Function: Focuses on a subset of tokens instead of all tokens, optimizing performance for long sequences.
- Example:
- Longformer uses sparse attention for documents with thousands of tokens.
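One common sparse pattern is a sliding window, where each token attends only to a fixed number of neighbors. A sketch of such a mask (the full Longformer pattern also adds global tokens, omitted here):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: each token sees +/- window neighbors."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(6, 1)
print(mask.sum())  # 16 allowed pairs instead of the full 36
```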
11. Memory Layers
- Function: Incorporates memory modules to retain past context over long conversations or documents.
- Example:
- Memory-augmented transformer variants attach external memory banks to retain context beyond the model's input window.
12. Mixture of Experts (MoE) Layers
- Function: Activates only a subset of the model (experts) for specific inputs, increasing efficiency.
- Example:
- GLaM and Switch Transformer use MoE layers to reduce computation while maintaining high performance.
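A toy sketch of Switch-style top-1 routing: a router scores each token, and only the single highest-scoring expert runs for that token (real MoE layers add load balancing and run experts in parallel; everything here is a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4

# Each "expert" is a small linear layer; the router picks one per token.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe(x):
    logits = x @ router                 # (tokens, n_experts) routing scores
    choice = logits.argmax(axis=-1)     # top-1 routing: one expert per token
    out = np.empty_like(x)
    for t, e in enumerate(choice):
        out[t] = x[t] @ experts[e]      # only the chosen expert runs
    return out

x = rng.normal(size=(5, d))
print(moe(x).shape)  # (5, 8)
```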
13. Gated Linear Units (GLU)
- Function: Improves the expressiveness of FFN layers using gating mechanisms.
- Example:
- GLU variants such as SwiGLU are used in models like PaLM and LLaMA to enhance the FFN's nonlinear transformations.
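The gating idea in the SwiGLU variant: one linear branch is passed through Swish and multiplies a second linear branch elementwise. A sketch with random weights:

```python
import numpy as np

def swiglu(x, W, V):
    """SwiGLU gate: Swish(xW) multiplied elementwise by the linear branch xV."""
    a = x @ W
    swish = a / (1 + np.exp(-a))   # Swish/SiLU activation
    return swish * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
W = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
out = swiglu(x, W, V)
print(out.shape)  # (3, 16)
```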
14. Rotary Positional Embedding (RoPE)
- Function: Encodes position by rotating query and key vectors by position-dependent angles, providing relative-position information that extrapolates better to longer contexts.
- Example:
- Applied in LLaMA models for efficient context scaling.
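A minimal sketch of the rotation: each consecutive pair of features is treated as 2-D coordinates and rotated by an angle that grows with position (position 0 is left unchanged). Frequency layout follows the common RoPE convention:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle."""
    seq, d = x.shape
    half = d // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq)[:, None] * freqs[None, :]   # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.ones((4, 8))
print(rope(x).shape)  # (4, 8); row 0 is unrotated (angle 0)
```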
15. Attention Over Memory (Memory-Augmented Layers)
- Function: Allows models to access a memory buffer for improved long-term context.
- Example:
- RETRO uses attention over retrieved documents for better factual consistency.
16. Key-Value Cache Layers
- Function: Stores key and value states for faster decoding.
- Example:
- GPT-3 uses caching during inference to efficiently generate long sequences.
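The caching idea in miniature: during autoregressive decoding, each step appends the new token's key and value instead of recomputing attention inputs for the whole prefix. A single-head sketch with random vectors:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

class KVCache:
    """Append-only store of key/value vectors so past tokens
    are not re-encoded at every decoding step."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)          # (steps_so_far, d)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(len(q)) # attend over all cached positions
        return softmax(scores) @ V

rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):                       # one decoding step per new token
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(out.shape, len(cache.keys))  # (8,) 3
```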
Examples of Layer Usage in LLMs
| Model | Transformer Layers | Attention Mechanisms | Other Components |
|---|---|---|---|
| GPT-4 | Embedding, Multi-Head, FFN, LayerNorm | Causal Attention | Positional Encoding, Dropout, Residual |
| BERT | Embedding, Bidirectional Self-Attention, FFN | Bidirectional Attention | LayerNorm, Dropout (trained with Masked Language Modeling and Next Sentence Prediction) |
| T5 | Embedding, Multi-Head, Cross-Attention | Bidirectional & Cross-Attention | Encoder-Decoder Structure |
| PaLM | Embedding, FFN, Multi-Head, LayerNorm | Causal Multi-Head Attention | Rotary Positional Embedding, Dropout |
| Longformer | Sparse Attention | Global + Local Attention | Long Sequence Support |
Each layer plays a crucial role in transforming and preserving information while enabling models to handle complex language tasks.
Example Usage Table
| Layer | Model | Role |
|---|---|---|
| Embedding | GPT, BERT | Converts tokens into dense vectors |
| Self-Attention | GPT, BERT | Captures relationships between tokens |
| Multi-Head | GPT-4, PaLM | Improves context understanding through parallel attention mechanisms |
| Cross-Attention | T5, BART | Aligns input and output for translation |
| Sparse Attention | Longformer | Optimizes token computation for long sequences |
| LayerNorm | GPT, BERT | Stabilizes training by normalizing inputs |
| Residual | GPT-3, GPT-4 | Preserves input for deeper networks |
| Dropout | BERT | Prevents overfitting during training |
| RoPE | LLaMA | Extends positional encoding for better long-term dependencies |
| Memory | RETRO | Retrieves relevant information for factual accuracy |
Each transformer component contributes to a robust system capable of handling complex NLP tasks efficiently.