2.1.3: Attention Mechanism
Self-Attention: The Core of Transformers
Imagine you’re reading a sentence: "The cat sat on the mat." Each word is important, but some words are more related to others:
- “cat” is related to “sat.”
- “mat” is related to “sat.”
- “on” is less important.
Self-Attention helps the model decide which words to focus on!
How Self-Attention Works (Step-by-Step)
Self-attention is done in five steps:
1. Convert Words into Vectors (Embeddings)
Computers don’t understand words, so we convert them into numbers (word embeddings).
Example:
| Word | Vector Representation (Simplified) |
|---|---|
| The | [0.1, 0.2, 0.3] |
| Cat | [0.5, 0.6, 0.7] |
| Sat | [0.8, 0.9, 1.0] |
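A minimal sketch of this lookup step (the numbers are the made-up toy values from the table above, not real learned embeddings):
```python
import torch

# Toy embedding table: the made-up values from the table above, not learned embeddings
embeddings = {
    "the": torch.tensor([0.1, 0.2, 0.3]),
    "cat": torch.tensor([0.5, 0.6, 0.7]),
    "sat": torch.tensor([0.8, 0.9, 1.0]),
}

sentence = ["the", "cat", "sat"]
x = torch.stack([embeddings[word] for word in sentence])
print(x.shape)  # torch.Size([3, 3]) -> 3 words, each a 3-dimensional vector
```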
2. Create Query, Key, and Value (Q, K, V)
Each word is transformed into three vectors:
- Query (Q) → “What am I looking for?”
- Key (K) → “What information do I have?”
- Value (V) → “What should be returned?”
Example:
| Word | Query (Q) | Key (K) | Value (V) |
|---|---|---|---|
| Cat | 0.5 | 0.4 | 0.6 |
| Sat | 0.7 | 0.8 | 0.9 |
The Query (Q) of “sat” will be compared with Keys (K) of all words to see how related they are.
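For reference, in a real Transformer these Q, K, and V values are not hand-picked like the table above; each one is produced by multiplying the word’s embedding with a learned weight matrix. A rough sketch of the mechanics, with random matrices standing in for learned weights:
```python
import torch

torch.manual_seed(0)

# Word embeddings from step 1 (toy values)
x = torch.tensor([[0.1, 0.2, 0.3],   # the
                  [0.5, 0.6, 0.7],   # cat
                  [0.8, 0.9, 1.0]])  # sat

embed_size = 3
# In a trained model these matrices are learned; here they are random placeholders
W_q = torch.rand(embed_size, embed_size)
W_k = torch.rand(embed_size, embed_size)
W_v = torch.rand(embed_size, embed_size)

Q = x @ W_q  # "What am I looking for?"
K = x @ W_k  # "What information do I have?"
V = x @ W_v  # "What should be returned?"
print(Q.shape, K.shape, V.shape)  # each torch.Size([3, 3])
```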
Now, let’s compute the Attention Scores using Q × Kᵀ.
3. Compute Attention Scores (Q × Kᵀ)
Now, we compare Query (Q) and Key (K) using the dot product.
- If Q and K are similar, the word is important.
- If Q and K are different, the word is less important.
Example:
- “sat” is strongly related to “cat” → High attention score.
- “sat” is weakly related to “the” → Low attention score.
| Word Pair | Calculation |
|---|---|
| Cat → Cat | (0.5 × 0.4) = 0.2 |
| Cat → Sat | (0.5 × 0.8) = 0.4 |
| Sat → Cat | (0.7 × 0.4) = 0.28 |
| Sat → Sat | (0.7 × 0.8) = 0.56 |
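The same four products can be computed in one matrix multiplication. A quick sketch using the toy scalar Q and K values for “cat” and “sat”:
```python
import torch

Q = torch.tensor([[0.5],   # query for "cat"
                  [0.7]])  # query for "sat"
K = torch.tensor([[0.4],   # key for "cat"
                  [0.8]])  # key for "sat"

scores = Q @ K.T  # every query dotted with every key
print(scores)     # ≈ [[0.20, 0.40], [0.28, 0.56]]
```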
Thus, our Attention Score matrix is:
\[ \begin{bmatrix} 0.2 & 0.4 \\ 0.28 & 0.56 \end{bmatrix} \]
4. Normalize the Scores Using Softmax
What is Softmax?
Softmax is a function that converts raw scores into probabilities. Example: Let’s say we have three scores:
Raw Scores: [0.56, 0.72, 0.11]
Softmax converts them into values between 0 and 1:
Softmax Output (approx.): [0.36, 0.42, 0.23]
Why do we use Softmax?
So the values sum to 1, making them easy to interpret as “importance levels.”
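A quick way to verify the example above, using PyTorch’s built-in softmax:
```python
import torch
import torch.nn.functional as F

raw_scores = torch.tensor([0.56, 0.72, 0.11])
probs = F.softmax(raw_scores, dim=-1)
print(probs)        # ≈ [0.36, 0.42, 0.23]
print(probs.sum())  # 1.0
```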
Coming back to our actual attention scores, Softmax normalizes each row so that its values sum to 1.
\[ \text{Softmax}(0.2, 0.4) = \left[ \frac{e^{0.2}}{e^{0.2} + e^{0.4}}, \frac{e^{0.4}}{e^{0.2} + e^{0.4}} \right] \]
Approximating the exponentials:
\[ e^{0.2} \approx 1.221, \quad e^{0.4} \approx 1.491 \]
\[ \text{sum} = 1.221 + 1.491 = 2.712 \]
\[ \text{Softmax} = \left[ \frac{1.221}{2.712}, \frac{1.491}{2.712} \right] = [0.45, 0.55] \]
For the second row:
\[ e^{0.28} \approx 1.323, \quad e^{0.56} \approx 1.751 \]
\[ \text{sum} = 1.323 + 1.751 = 3.074 \]
\[ \text{Softmax} = \left[ \frac{1.323}{3.074}, \frac{1.751}{3.074} \right] = [0.43, 0.57] \]
So, the normalized attention weights (Softmax scores) are:
\[ \begin{bmatrix} 0.45 & 0.55 \\ 0.43 & 0.57 \end{bmatrix} \]
5. Multiply Attention Weights by Value (V) and Sum Up
Each word’s final representation is computed by multiplying the softmax scores by the Value (V) matrix and summing up:
Final Word Representation = Attention Weights × Value (V)
| Word | V |
|---|---|
| Cat | 0.6 |
| Sat | 0.9 |
For Cat (first row):
\[ (0.45 \times 0.6) + (0.55 \times 0.9) = 0.27 + 0.495 = 0.765 \]
For Sat (second row):
\[ (0.43 \times 0.6) + (0.57 \times 0.9) = 0.258 + 0.513 = 0.771 \]
This process refines each word’s meaning based on context.
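The hand calculations in steps 3 to 5 can be reproduced end to end with a short sketch, using the toy score matrix, a row-wise softmax, and the V column from the table:
```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[0.20, 0.40],    # attention scores from step 3
                       [0.28, 0.56]])
V = torch.tensor([[0.6],                # value for "cat"
                  [0.9]])               # value for "sat"

weights = F.softmax(scores, dim=-1)     # row-wise softmax (step 4)
context = weights @ V                   # weighted sum of values (step 5)

print(weights)  # ≈ [[0.45, 0.55], [0.43, 0.57]]
print(context)  # ≈ [[0.765], [0.771]]
```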
6. Final Output: Context Vector
The final contextualized representations are:
\[ \begin{bmatrix} 0.765 \\ 0.771 \end{bmatrix} \]
What This Means
- Each word’s new representation now depends on its relationship with others, weighted by attention!
- In practice, Q, K, and V are multi-dimensional vectors; single numbers were used here only to keep the arithmetic simple.
The final output (context vector) represents the new embeddings for each word after applying the self-attention mechanism. These new values are no longer just the original word embeddings; instead, they now encode contextual information from the surrounding words based on how much attention each word gives to others.
Contextualized Representation
- Originally, each word had its own static vector (e.g., “Cat” = [0.5, 0.6, 0.7]).
- After self-attention, each word’s new representation incorporates weighted contributions from other words based on attention scores.
Information Flow
- The new values (0.765, 0.771) indicate that “Cat” and “Sat” now carry some information from each other, influenced by the attention weights.
- Words that are more relevant to each other have stronger influences.
Why Is This Important?
- Before self-attention, “Cat” was just “Cat,” and “Sat” was just “Sat.”
- Now, “Cat” understands that “Sat” is nearby and incorporates some of its meaning.
- This is how transformers capture context and relationships in sentences!
7. Sample Code
```python
from transformers import BertModel, BertTokenizer
# Load pre-trained BERT model with attention outputs enabled
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize and convert input into tensors
inputs = tokenizer("Hello, Generative AI!", return_tensors="pt")
# Forward pass to get outputs including attention weights
outputs = model(**inputs)
# Extract attention layers
attentions = outputs.attentions
# Print number of attention layers
print("Number of Attention Layers:", len(attentions))Explanation of the Code
- `BertModel.from_pretrained("bert-base-uncased", output_attentions=True)`: loads the BERT model with the option to return attention weights.
- `tokenizer("Hello, Generative AI!", return_tensors="pt")`: converts the input text into a format the model understands (PyTorch tensors).
- `model(**inputs)`: passes the tokenized inputs through the BERT model.
- `outputs.attentions`: extracts the attention weights from the different transformer layers.
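Continuing from the snippet above, each entry of `attentions` is a tensor of shape (batch_size, num_heads, seq_len, seq_len); for bert-base-uncased there are 12 such layers with 12 heads each:
```python
# Continuing from the snippet above:
# each layer's attention tensor has shape (batch_size, num_heads, seq_len, seq_len)
first_layer = attentions[0]
print("Layer 0 attention shape:", first_layer.shape)  # torch.Size([1, 12, seq_len, seq_len])

# Attention that the first token pays to every token, in head 0 of layer 0
print(first_layer[0, 0, 0])
```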
8. Python Example: Simple Self-Attention Implementation from Scratch
Now, let’s implement self-attention from scratch in Python.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        Q = self.query(x)  # Convert to Query
        K = self.key(x)    # Convert to Key
        V = self.value(x)  # Convert to Value
        # Compute attention scores, scaled by sqrt(embed_size)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention = F.softmax(scores, dim=-1)  # Apply Softmax
        # Multiply the attention weights by the values
        out = torch.matmul(attention, V)
        return out
# Example usage
embed_size = 8
seq_length = 5
x = torch.rand((1, seq_length, embed_size))
self_attn = SelfAttention(embed_size)
output = self_attn(x)
print("Output Shape:", output.shape) # Expected: (1, seq_length, embed_size)Multi-Head Attention (Why Do We Need Multiple Attention Layers?)
Instead of one attention mechanism, Transformers use multiple (“heads”) to capture different relationships.
- Why? A single attention mechanism may miss important details.
- Multi-head attention ensures the model sees information from different perspectives.
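A minimal sketch of this idea, building on the SelfAttention class above (an illustration of the split-into-heads-and-concatenate mechanics, not the exact implementation from the original Transformer paper):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_size, num_heads):
        super().__init__()
        assert embed_size % num_heads == 0, "embed_size must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_size // num_heads
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
        self.out = nn.Linear(embed_size, embed_size)  # Mixes information across heads

    def forward(self, x):
        batch, seq_len, _ = x.shape

        # Project, then reshape into (batch, num_heads, seq_len, head_dim)
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        Q = split_heads(self.query(x))
        K = split_heads(self.key(x))
        V = split_heads(self.value(x))

        # Scaled dot-product attention within each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attention = F.softmax(scores, dim=-1)
        context = torch.matmul(attention, V)

        # Concatenate the heads back into (batch, seq_len, embed_size)
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.out(context)

# Example usage
x = torch.rand((1, 5, 8))  # (batch, seq_length, embed_size)
mha = MultiHeadSelfAttention(embed_size=8, num_heads=2)
print("Output Shape:", mha(x).shape)  # Expected: (1, 5, 8)
```
Each head attends over a smaller head_dim slice of the embedding, so different heads can specialize in different relationships before the final linear layer mixes them back together.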