2.1: Key Concepts in GenAI
Key Concepts in Generative AI
| Concept | Definition |
| --- | --- |
| Large Language Models (LLMs) | AI models trained on vast amounts of text data. They use the Transformer architecture, which relies on attention mechanisms to process input. Examples: GPT (Generative Pre-trained Transformer), BERT, T5. |
| Tokenization | Breaking text into smaller units (tokens) for processing. Example: "Hello, world!" might be tokenized into ["Hello", ",", "world", "!"]. |
| Embeddings | Representing tokens as numerical vectors in a high-dimensional space. Embeddings capture semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen"). |
| Self-Attention / Attention Mechanism | A mechanism that lets the model weigh how relevant the other words in the input are when processing each word. |
| Transformers | The deep learning architecture used in LLMs and the backbone of most modern generative models. Key components: encoder, decoder, and the attention mechanism. |
| Pre-training | Training a model on a large dataset (e.g., all of Wikipedia) to learn general language patterns. |
| Fine-tuning | Adapting the pre-trained model to a specific task (e.g., sentiment analysis, a chatbot). |
| Prompt Engineering | Designing effective inputs to guide model responses. |
2.1.1: Tokenization
Tokenization
Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model “understand” the text by converting it into a format that can be fed into the neural network.
- Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].
- Subword-level tokenization (used in models like GPT) splits words into smaller parts, or subwords. This is more efficient and handles unknown words better, because the model can understand smaller pieces of a word. Example: "delightful" → ["delight", "ful"].
- Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].
Hands-On: Tokenization
Tokenization Example
from transformers import AutoTokenizer
# Load the tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize a sentence
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)
Output:
Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]
Explanation:
tokenizer.tokenize(text): Splits the text into tokens.
tokenizer.encode(text): Converts tokens into their corresponding numerical IDs.
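You can also map the IDs back to text with the same tokenizer. A minimal round-trip sketch, reusing the gpt2 tokenizer from above:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = tokenizer.encode("Hello, how are you?")
# Decode the IDs back into the original string
print(tokenizer.decode(token_ids))  # Hello, how are you?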
2.1.2: Embeddings
Embeddings
Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.
Why are Embeddings Important?
- Words with similar meanings have similar embeddings.
- Embeddings enable models to generalize and understand context.
Example: the embeddings for "king", "queen", "man", and "woman" might approximately satisfy the relationship:
king - man + woman ≈ queen
Hands-On: Embeddings
from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Extract the hidden states: one embedding vector per token
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5]) # First 5 dimensions
Output:
Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123, 0.0456, -0.0678, 0.0234, 0.0891])
Explanation:
model(**inputs): Passes the token IDs through the model to generate embeddings.
outputs.last_hidden_state: Contains the embeddings for each token.
- Each token is represented as a 768-dimensional vector (for GPT-2).
- Output: torch.Size([1, 6, 768]) means:
  - Batch size 1, since we passed in one input sentence.
  - Sequence length 6: the sentence "Hello, how are you?" is split into 6 tokens.
  - Embedding dimension 768: each token is represented by a 768-dimensional vector in GPT-2.
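To see "similar words have similar embeddings" in practice, here is a minimal sketch that compares GPT-2 vectors with cosine similarity. It averages each word's token embeddings and feeds single words without any context, purely for illustration; the exact numbers will vary:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def embed(word):
    # One vector per word: average the last hidden states over the word's tokens
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, num_tokens, 768)
    return hidden.mean(dim=1)                       # (1, 768)

king, queen, car = embed("king"), embed("queen"), embed("car")
print("king vs queen:", F.cosine_similarity(king, queen).item())
print("king vs car:  ", F.cosine_similarity(king, car).item())
# king/queen should typically score higher than king/car.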
2.1.3: Attention Mechanism
Imagine you’re reading a sentence: "The cat sat on the mat." Each word is important, but some words are more related to each other than others.
- “cat” is related to “sat.”
- “mat” is related to “sat.”
- “on” is less important.
Self-Attention helps the model decide which words to focus on!
How Self-Attention Works (Step-by-Step)
Self-attention is computed in five steps:
1. Convert Words into Vectors (Embeddings)
Computers don’t understand words, so we convert them into numbers (word embeddings).
Example:
| Word | Vector Representation (Simplified) |
| --- | --- |
| The | [0.1, 0.2, 0.3] |
| Cat | [0.5, 0.6, 0.7] |
| Sat | [0.8, 0.9, 1.0] |
2. Create Query, Key, and Value (Q, K, V)
Each word is transformed into three vectors:
- Query (Q) → “What am I looking for?”
- Key (K) → “What information do I have?”
- Value (V) → “What should be returned?”
Example (to keep the arithmetic simple, each Q, K, and V is shown here as a single number rather than a full vector):

| Word | Query (Q) | Key (K) | Value (V) |
| --- | --- | --- | --- |
| Cat | 0.5 | 0.4 | 0.6 |
| Sat | 0.7 | 0.8 | 0.9 |
The Query (Q) of “sat” will be compared with Keys (K) of all words to see how related they are.
Now, let’s compute the Attention Scores using Q × Kᵀ.
3. Compute Attention Scores (Importance of Words) (Q × Kᵀ)
Now, we compare Query (Q) and Key (K) using the dot product.
- If Q and K are similar, the word is important.
- If Q and K are different, the word is less important.
\[
\text{Attention Score} = Q \times K^T
\]
Example:
- “sat” is strongly related to “cat” → High attention score.
- “sat” is weakly related to “the” → Low attention score.
| Word Pair | Calculation |
| --- | --- |
| Cat → Cat | (0.5 × 0.4) = 0.2 |
| Cat → Sat | (0.5 × 0.8) = 0.4 |
| Sat → Cat | (0.7 × 0.4) = 0.28 |
| Sat → Sat | (0.7 × 0.8) = 0.56 |
Thus, our Attention Score matrix is:
\[
\begin{bmatrix}
0.2 & 0.4 \\
0.28 & 0.56
\end{bmatrix}
\]
4. Normalize the Scores Using Softmax
What is Softmax?
Softmax is a function that converts raw scores into probabilities. Example: let’s say we have three scores:
Raw Scores: [0.56, 0.72, 0.11]
Softmax converts them into values between 0 and 1 that sum to 1:
Softmax Output: ≈ [0.356, 0.417, 0.227]
Why do we use Softmax?
So the values sum to 1, making them easy to interpret as “importance levels.”
Coming back to our actual values now,
Softmax normalizes these values so that they sum to 1 per row.
\[
\text{Softmax}(0.2, 0.4) = \left[ \frac{e^{0.2}}{e^{0.2} + e^{0.4}}, \frac{e^{0.4}}{e^{0.2} + e^{0.4}} \right]
\]
Approximating the exponentials:
\[
e^{0.2} \approx 1.221, \quad e^{0.4} \approx 1.491, \quad \sum = 1.221 + 1.491 = 2.712
\]
\[
\text{Softmax} = \left[ \frac{1.221}{2.712}, \frac{1.491}{2.712} \right] = [0.45, 0.55]
\]
For the second row:
\[
e^{0.28} \approx 1.323, \quad e^{0.56} \approx 1.751, \quad \sum = 1.323 + 1.751 = 3.074
\]
\[
\text{Softmax} = \left[ \frac{1.323}{3.074}, \frac{1.751}{3.074} \right] = [0.43, 0.57]
\]
So, the normalized attention weights (Softmax scores) are:
\[
\begin{bmatrix}
0.45 & 0.55 \\
0.43 & 0.57
\end{bmatrix}
\]
5. Multiply Attention Weights by Value (V) and Sum Up
Each word’s final value is computed by multiplying the softmax scores by the Value (V) entries and summing:
Final Word Representation = Attention Weights × Value (V)
For Cat (first row):
\[
(0.45 \times 0.6) + (0.55 \times 0.9) = 0.27 + 0.495 = 0.765
\]
For Sat (second row):
\[
(0.43 \times 0.6) + (0.57 \times 0.9) = 0.258 + 0.513 = 0.771
\]
This process refines each word’s meaning based on context.
Final Output: Context Vector
The final contextualized representations are:
\[
\begin{bmatrix}
0.765 \\
0.771
\end{bmatrix}
\]
What This Means
- Each word’s new representation now depends on its relationship with others, weighted by attention!
The final output (context vector) represents the new embeddings for each word after applying the self-attention mechanism. These new values are no longer just the original word embeddings; instead, they now encode contextual information from the surrounding words based on how much attention each word gives to others.
- Contextualized Representation
  - Originally, each word had its own static vector (e.g., “Cat” = [0.5, 0.6, 0.7]).
  - After self-attention, each word’s new representation incorporates weighted contributions from other words based on attention scores.
- Information Flow
  - The new values (0.765, 0.771) indicate that “Cat” and “Sat” now carry some information from each other, influenced by the attention weights.
  - Words that are more relevant to each other have stronger influences.
- Why Is This Important?
  - Before self-attention, “Cat” was just “Cat,” and “Sat” was just “Sat.”
  - Now, “Cat” understands that “Sat” is nearby and incorporates some of its meaning.
  - This is how transformers capture context and relationships in sentences!
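The whole toy walkthrough can be reproduced in a few lines of PyTorch. This sketch uses the same made-up scalar Q, K, and V values and, like the hand calculation, skips the usual scaling by the square root of the key dimension:
import torch
import torch.nn.functional as F

Q = torch.tensor([0.5, 0.7])  # queries for "Cat", "Sat"
K = torch.tensor([0.4, 0.8])  # keys for "Cat", "Sat"
V = torch.tensor([0.6, 0.9])  # values for "Cat", "Sat"

scores = torch.outer(Q, K)           # Q x K^T -> [[0.20, 0.40], [0.28, 0.56]]
weights = F.softmax(scores, dim=-1)  # [[0.45, 0.55], [0.43, 0.57]]
context = weights @ V                # [0.765, 0.771]

print(scores)
print(weights)
print(context)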
Sample Code
from transformers import BertModel, BertTokenizer
# Load pre-trained BERT model with attention outputs enabled
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize and convert input into tensors
inputs = tokenizer("Hello, Generative AI!", return_tensors="pt")
# Forward pass to get outputs including attention weights
outputs = model(**inputs)
# Extract attention layers
attentions = outputs.attentions
# Print number of attention layers
print("Number of Attention Layers:", len(attentions))
Explanation of the Code
BertModel.from_pretrained("bert-base-uncased", output_attentions=True): Loads the BERT model with the option to return attention weights.
tokenizer("Hello, Generative AI!", return_tensors="pt"): Converts input text into a format the model understands (PyTorch tensors).
model(**inputs): Passes the tokenized inputs through the BERT model.
outputs.attentions: Extracts attention weights from different transformer layers.
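Each element of outputs.attentions is a tensor of shape (batch_size, num_heads, seq_len, seq_len). Continuing the example above, a quick way to inspect one layer:
# Inspect the first layer's attention weights
first_layer = attentions[0]
print("Shape:", first_layer.shape)                 # (batch, num_heads, seq_len, seq_len); 12 heads for bert-base
print("Row sums:", first_layer[0, 0].sum(dim=-1))  # each row sums to 1, because of the softmax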
Python Example: Simple Self-Attention Implementation from Scratch
Now, let’s implement self-attention from scratch in Python.
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        Q = self.query(x)  # Convert to Query
        K = self.key(x)    # Convert to Key
        V = self.value(x)  # Convert to Value
        # Compute Attention Scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention = F.softmax(scores, dim=-1)  # Apply Softmax
        # Multiply by values
        out = torch.matmul(attention, V)
        return out
# Example usage
embed_size = 8
seq_length = 5
x = torch.rand((1, seq_length, embed_size))
self_attn = SelfAttention(embed_size)
output = self_attn(x)
print("Output Shape:", output.shape) # Expected: (1, seq_length, embed_size)
Multi-Head Attention (Why Do We Need Multiple Attention Layers?)
Instead of one attention mechanism, Transformers use multiple (“heads”) to capture different relationships.
- Why? A single attention mechanism may miss important details.
- Multi-head attention ensures the model sees information from different perspectives.
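PyTorch ships a ready-made multi-head attention layer, so the idea can be demonstrated in a few lines (the sizes here are arbitrary, for illustration only):
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.rand(1, 5, 8)  # (batch, sequence length, embedding size)

# Self-attention: the same tensor is used as query, key, and value
out, weights = mha(x, x, x)
print(out.shape)      # torch.Size([1, 5, 8])
print(weights.shape)  # torch.Size([1, 5, 5]) -- attention averaged over the 2 heads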
2.1.4: Transformers
The Transformer Model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:
Key Components of the Transformer
- Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word. For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can focus on the context to determine if “bank” refers to a financial institution or the side of a river.
- Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words.
- Positional Encoding: Unlike sequential models such as RNNs, transformers don’t inherently encode the order of words, so positional encodings are added to give the model information about each word’s position in the sentence (a short sketch follows after this list).
- Encoder-Decoder Architecture:
- Encoder: Processes the input data (e.g., a sentence).
- Decoder: Generates the output data (e.g., a translation of the sentence).
Both the Transformer Encoder and Decoder consist of:
- Feed-Forward Layers: Further processes information after attention.
- Layer Normalization: Normalizes the output of each sub-layer before passing it to the next layer, stabilizing training and preventing exploding gradients.
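As promised above, here is a concrete illustration of positional encoding: a minimal sketch of the sinusoidal scheme from the original Transformer paper (one common choice; learned positional embeddings are another):
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)             # (seq_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # torch.Size([6, 8])
# These vectors are added to the token embeddings so the model knows each word's position.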
Feed-Forward Layers (FFNN)
What is a Feed-Forward Layer?
A Feed-Forward Layer, also known as a Feed-Forward Neural Network (FFNN), is a fully connected neural network layer that takes the output from the attention mechanism and further processes it.
How does a Feed-Forward Layer work?
- Input: The output from the attention mechanism is fed into the FFNN.
- Linear Transformation: The input is transformed using a linear layer (e.g., a dense layer) with a learnable weight matrix and bias term.
- Activation Function: The output from the linear transformation is passed through an activation function, such as ReLU (Rectified Linear Unit) or GeLU (Gaussian Error Linear Unit).
- Output: The output from the activation function is the final output of the FFNN.
Purpose of Feed-Forward Layers
The FFNN serves two purposes:
- Feature Transformation: It transforms the output from the attention mechanism into a higher-level representation that’s more suitable for the task at hand.
- Non-Linearity Introduction: The activation function introduces non-linearity into the model, allowing it to learn more complex relationships between the input and output (see the short example after this list). Common choices include:
  - ReLU (Rectified Linear Unit): maps all negative values to 0 and leaves positive values unchanged.
  - Sigmoid: maps the input to a value between 0 and 1.
  - GELU (Gaussian Error Linear Unit): a smooth activation that combines properties of ReLU and the sigmoid, and is widely used in Transformer models.
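A quick way to see how these activations differ, using PyTorch's built-in functions on a few sample values:
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print("ReLU:   ", torch.relu(x))     # negatives become 0, positives pass through
print("Sigmoid:", torch.sigmoid(x))  # everything squashed into (0, 1)
print("GELU:   ", F.gelu(x))         # smooth curve that approaches ReLU for large inputs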
Layer Normalization
What is Layer Normalization?
Layer Normalization is a technique used to normalize the output of each sub-layer (e.g., the FFNN) before passing it to the next layer.
How does Layer Normalization work?
- Compute Mean and Variance: The mean and variance of the output from the sub-layer are computed.
- Normalize: The output is normalized by subtracting the mean and dividing by the standard deviation (square root of variance).
- Scale and Shift: The normalized output is then scaled and shifted using learnable parameters (gamma and beta).
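These three steps can be written out directly. A small sketch comparing the manual computation with PyTorch's nn.LayerNorm (whose gamma and beta are initialized to 1 and 0, so the two results should match):
import torch
import torch.nn as nn

x = torch.tensor([[0.1, 0.2, 0.3]])

# 1. Compute mean and variance
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
# 2. Normalize (a small epsilon avoids division by zero)
manual = (x - mean) / torch.sqrt(var + 1e-5)
# 3. Scale and shift are handled by the learnable gamma and beta inside nn.LayerNorm
layer_norm = nn.LayerNorm(3)

print(manual)
print(layer_norm(x))  # matches the manual result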
Purpose of Layer Normalization
Layer Normalization serves several purposes:
- Stabilizes Training: Normalization helps stabilize the training process by reducing the effects of exploding gradients.
- Improves Generalization: Normalization can improve the model’s ability to generalize to new, unseen data.
- Reduces Dependence on Initialization: Normalization reduces the dependence on the initialization of the model’s weights.
By using Layer Normalization, the Transformer model can better handle the complex interactions between the attention mechanism and the FFNN, leading to improved performance and stability.
Examples
FFN Example
Suppose we have a simple FFN with one input layer, one hidden layer, and one output layer. The input layer has 2 neurons, the hidden layer has 3 neurons, and the output layer has 1 neuron.
Here’s the FFN architecture: Input Layer (2 neurons) → Hidden Layer (3 neurons) → Output Layer (1 neuron)
Let’s say we have an input vector x = [1, 2]. The FFN processes this input as follows:
- Input Layer: The input vector x is passed through the input layer.
- Hidden Layer: The output from the input layer is passed through a linear transformation (e.g., a dense layer) with weights W1 and biases b1, followed by an activation function (e.g., ReLU). Let’s call the output from the hidden layer h.
- Output Layer: The output from the hidden layer h is passed through another linear transformation with weights W2 and biases b2. The final output is y.
Here’s some sample PyTorch code to illustrate this:
import torch
import torch.nn as nn
# Define the FFN model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        y = self.output_layer(h)
        return y
# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])
# Forward pass
y = model(x)
print(y)
FFNN Example: Image Classification
Suppose we want to build a simple image classification model that can distinguish between images of cats and dogs.
Here’s how an FFNN can be used:
- Input Layer: The input layer takes in the image data, which is typically represented as a matrix of pixel values.
- Hidden Layer: The hidden layer applies a series of transformations to the input data, using weights and biases learned during training.
- Output Layer: The output layer generates a probability distribution over the two classes (cat or dog).
For example, if we input an image of a cat, the FFNN might output: [0.8, 0.2]
This indicates that the model is 80% confident that the image is a cat and 20% confident that it’s a dog.
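A minimal PyTorch sketch of this idea (the image size, layer widths, and class order are made up for illustration, and the network is untrained, so its outputs are close to uniform):
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),                 # (batch, 3, 32, 32) -> (batch, 3072)
    nn.Linear(3 * 32 * 32, 128),  # input layer -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 2),            # hidden layer -> 2 classes (cat, dog)
)

image = torch.rand(1, 3, 32, 32)  # a stand-in for a real image
probs = torch.softmax(classifier(image), dim=-1)
print(probs)  # e.g. tensor([[0.48, 0.52]]) before any training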
Layer Normalization Example
Now, let’s add Layer Normalization to the FFN model. We’ll apply Layer Normalization to the output of the hidden layer.
Here’s the modified PyTorch code:
import torch
import torch.nn as nn
import torch.nn.functional as F
# Define the FFN model with Layer Normalization
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.layer_norm = nn.LayerNorm(3)    # Layer Normalization for hidden layer
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        h = self.layer_norm(h)                # apply Layer Normalization
        y = self.output_layer(h)
        return y
# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])
# Forward pass
y = model(x)
print(y)
In this example, we’ve added a LayerNorm module to the FFN model, which applies Layer Normalization to the output of the hidden layer.
The LayerNorm module normalizes the input data by subtracting the mean and dividing by the standard deviation, which helps to stabilize the training process and improve the model’s performance.
Layer Normalization Example: Language Translation
Suppose we want to build a machine translation system that translates English sentences to Hindi. We have a dataset of paired English and Hindi sentences.
Our sequence-to-sequence model consists of three main components:
- Encoder : The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state. The encoder uses a GRU (Gated Recurrent Unit) layer to process the input sequence.
- Layer Normalization : The layer normalization component normalizes the output from the encoder. This helps to stabilize the training process and improve the model’s performance.
- Decoder : The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence. The decoder also uses a GRU layer to process the input sequence.
Layer Normalization helps to:
- Stabilize the training process by reducing the effects of exploding gradients.
- Improve the model’s ability to generalize to new, unseen data.
Example Walkthrough
Let’s walk through an example of how this model works:
- English Sentence: We start with an English sentence, “Hello, how are you?”
- Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state.
| English Word | Encoder Output |
| --- | --- |
| Hello | [0.1, 0.2, 0.3] |
| how | [0.4, 0.5, 0.6] |
| are | [0.7, 0.8, 0.9] |
| you | [1.0, 1.1, 1.2] |
Layer Normalization: The layer normalization component normalizes the output from the encoder (the numbers below are illustrative rather than the exact LayerNorm arithmetic).

| English Word | Normalized Output |
| --- | --- |
| Hello | [-0.5, 0.0, 0.5] |
| how | [-0.3, 0.2, 0.7] |
| are | [-0.1, 0.4, 0.9] |
| you | [0.1, 0.6, 1.1] |
Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence.
| Hindi Word | Decoder Output |
| --- | --- |
| नमस्ते | [0.8, 0.9, 1.0] |
| कैसे | [0.6, 0.7, 0.8] |
| हो | [0.4, 0.5, 0.6] |
| ? | [0.2, 0.3, 0.4] |
Final Output: The final output from the decoder is the translated Hindi sentence, “नमस्ते, कैसे हो?”.
# Import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
# Define the encoder model
class Encoder(nn.Module):
    def __init__(self):
        # Initialize the encoder model
        super(Encoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded)
        # Return the output and hidden state
        return output, hidden

# Define the layer normalization model
class LayerNormalization(nn.Module):
    def __init__(self):
        # Initialize the layer normalization model
        super(LayerNormalization, self).__init__()
        # Define the layer normalization layer with 256 dimensions
        self.layer_norm = nn.LayerNorm(normalized_shape=256)

    def forward(self, x):
        # Normalize the input sequence
        return self.layer_norm(x)

# Define the decoder model
class Decoder(nn.Module):
    def __init__(self):
        # Initialize the decoder model
        super(Decoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
        # Define the fully connected layer with 256 input dimensions and 10000 output dimensions
        self.fc = nn.Linear(256, 10000)

    def forward(self, x, hidden):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded, hidden)
        # Pass the output through the fully connected layer
        output = self.fc(output[:, -1, :])
        # Return the output and hidden state
        return output, hidden
# Define the sequence-to-sequence model
class Seq2Seq(nn.Module):
    def __init__(self):
        # Initialize the sequence-to-sequence model
        super(Seq2Seq, self).__init__()
        # Define the encoder model
        self.encoder = Encoder()
        # Define the layer normalization model
        self.layer_norm = LayerNormalization()
        # Define the decoder model
        self.decoder = Decoder()

    def forward(self, x):
        # Pass the input sequence through the encoder
        encoder_output, hidden = self.encoder(x)
        # Normalize the encoder output with layer normalization
        normalized_output = self.layer_norm(encoder_output)
        # Feed the first input token and the encoder hidden state to the decoder.
        # Note: this simplified decoder does not consume normalized_output directly;
        # it is computed here only to show where LayerNorm sits in the pipeline.
        decoder_output, _ = self.decoder(x[:, 0:1], hidden)
        # Return the decoder output
        return decoder_output
# Initialize the sequence-to-sequence model
model = Seq2Seq()
# Define the vocabulary
english_vocabulary = {'Hello': 0, 'are': 1, 'you': 2, '<EOS>': 3}
hindi_vocabulary = {'नमस्ते': 0, 'कैसे': 1, 'हो': 2, '<EOS>': 3}
# Define a sample English input sequence
english_input = torch.tensor([[english_vocabulary['Hello'], english_vocabulary['are'], english_vocabulary['you']]])
# Define a sample Hindi output sequence
hindi_output = torch.tensor([[hindi_vocabulary['नमस्ते'], hindi_vocabulary['कैसे'], hindi_vocabulary['हो']]])
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Train the model
for epoch in range(100):
    # Zero the gradients
    optimizer.zero_grad()
    # Pass the input sequence through the model
    output = model(english_input)
    # Calculate the loss
    loss = criterion(output, hindi_output[:, 0])
    # Backpropagate the loss
    loss.backward()
    # Update the model parameters
    optimizer.step()
    # Print the loss
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')
# Use the trained model for translation
translated_output = model(english_input)
translated_output_idx = torch.argmax(translated_output, dim=1)
# Map the predicted index back to a word (fall back to '<UNK>' if it lies outside our tiny vocabulary)
idx2word = {idx: word for word, idx in hindi_vocabulary.items()}
predicted_word = idx2word.get(translated_output_idx.item(), '<UNK>')
print(predicted_word)
2.1.5: Recap
Summary: How Generative AI Processes Text
When you enter a prompt like:
➡️ "Explain black holes"
A Generative AI model follows these steps:
Step 1: Tokenization
Breaks text into smaller parts (tokens).
Example:
"Explain black holes" → ["explain", "black", "holes"]
Step 2: Embeddings
Each token is converted into a numerical vector for processing.
Step 3: Transformer Model Processing
- Self-attention determines which words matter the most.
- Multiple layers refine understanding.
Step 4: Text Generation
- Predicts the most likely next token at each step.
- Constructs output based on learned patterns.
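A minimal end-to-end sketch of these four steps with GPT-2, using the same transformers APIs as earlier (the generated text will vary):
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Steps 1-2: tokenize the prompt (the embedding lookup happens inside the model)
inputs = tokenizer("Explain black holes", return_tensors="pt")

# Steps 3-4: transformer layers process the tokens and predict the next token repeatedly
outputs = model.generate(inputs["input_ids"], max_length=30, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))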
2.2: Controlling GenAI Model Output
Temperature
- Purpose: Controls the randomness of the predictions. It’s a hyperparameter used to scale the logits (the raw scores that the softmax turns into probabilities) before sampling.
- How it works: The model computes probabilities for each token, and the temperature parameter adjusts these probabilities.
- Low temperature (<1.0): Makes the model more deterministic by amplifying the difference between high-probability tokens and low-probability tokens. This makes the model more likely to choose the most probable token.
- High temperature (>1.0): Makes the model more random by flattening the probabilities. This results in more diverse, creative, and sometimes less coherent text.
Example
- Temperature = 0.7: The model will likely choose the more predictable or likely tokens.
- Temperature = 1.5: The model will take more risks, leading to more unexpected, diverse outputs.
# Note: in Hugging Face transformers, temperature (and top-k / top-p below) only take effect when do_sample=True
# Example of lower temperature (more deterministic)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=0.7)
# Example of higher temperature (more creative/random)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=1.5)
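You can see what temperature does by rescaling some toy logits before the softmax (the numbers are illustrative):
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5])  # raw scores for three candidate tokens

for t in (0.7, 1.0, 1.5):
    probs = F.softmax(logits / t, dim=-1)
    print(f"temperature={t}: {probs.numpy().round(2)}")
# Lower temperature sharpens the distribution (more deterministic);
# higher temperature flattens it (more random).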
Top-k Sampling
- Purpose: Limits the number of tokens to sample from, making the generation process more efficient and sometimes more coherent.
- How it works: Instead of considering all possible tokens (the entire vocabulary), top-k sampling restricts the set of possible next tokens to the top-k most likely tokens based on their probability scores.
- k = 1: This would make the model behave deterministically, always picking the most probable token.
- k = 50: The model will sample from the top 50 tokens with the highest probabilities.
Example
- Top-k = 10: The model will only consider the 10 tokens with the highest probabilities when selecting the next word.
- Top-k = 100: The model will consider the top 100 tokens, giving it more variety.
# Example with top-k sampling (restricted to top 50 tokens)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_k=50)
- Effect of Top-k: By limiting the token options to the top-k, the model’s output tends to be more controlled and less random than pure sampling from all tokens.
Top-p (Nucleus Sampling)
- Purpose: Similar to top-k, but instead of limiting to a fixed number of tokens, top-p limits the tokens considered based on their cumulative probability.
- How it works: The model keeps sampling from the smallest set of tokens whose cumulative probability exceeds a threshold p (where p is between 0 and 1). This dynamic method is often referred to as nucleus sampling.
- p = 0.9: The model will consider the smallest set of tokens whose cumulative probability is at least 90%. This results in considering a variable number of tokens based on how steep the probability distribution is.
- p = 1.0: This would be equivalent to top-k sampling with k = all tokens, allowing the model to sample from all tokens.
Example
- Top-p = 0.9: The model considers the smallest set of tokens whose combined probability is at least 90%. This prevents very unlikely tokens from being considered while still allowing more diversity.
- Top-p = 0.95: The model will sample from a slightly larger set of tokens.
# Example with top-p (nucleus) sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_p=0.9)
- Effect of Top-p: Nucleus sampling tends to generate more coherent and diverse text than top-k sampling, as the model is free to choose tokens from a set that dynamically adjusts based on their probabilities.
Temperature, Top-k, and Top-p Combined
You can combine these parameters to fine-tune the model’s output. For example:
outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9
)
This will give you:
- A lower temperature (0.8), making the generation more predictable.
- Top-k sampling with the top 50 tokens.
- Top-p sampling that only includes tokens whose cumulative probability is at least 90%.
By tuning these parameters, you can experiment with how controlled or creative the generated text is.
Summary of Differences
- Temperature: Adjusts the randomness of the sampling. Higher temperature means more diverse output; lower means more predictable.
- Top-k Sampling: Limits the number of candidate tokens to the top-k most likely tokens.
- Top-p (Nucleus) Sampling: Limits the candidate tokens to those whose cumulative probability is at least p (a probability threshold), providing more flexible diversity control.
Details
Confused? Let’s break down top-k and top-p with simpler examples.
Top-k Sampling (Simplified)
Imagine the model is choosing the next word from a list of 5 possible words, each with a probability:
| Word | Probability |
| --- | --- |
| “apple” | 0.5 |
| “banana” | 0.3 |
| “cherry” | 0.1 |
| “date” | 0.05 |
| “elderberry” | 0.05 |
Top-k = 2:
With top-k=2, the model will only consider the top 2 most probable words. So it will only consider “apple” and “banana”. The model ignores the words “cherry”, “date”, and “elderberry” because they are less likely.
If the model needs to choose the next word, it will only sample from these 2 words: “apple” and “banana”. This makes the sampling process more controlled and focused.
Top-k = 3:
If top-k=3, it will consider “apple”, “banana”, and “cherry”. This is a little more diverse but still limited to the top 3.
Top-p (Nucleus Sampling) (Simplified)
Now, let’s look at top-p (nucleus sampling), which works a bit differently.
Let’s use the same words and probabilities:
| Word | Probability |
| --- | --- |
| “apple” | 0.5 |
| “banana” | 0.3 |
| “cherry” | 0.1 |
| “date” | 0.05 |
| “elderberry” | 0.05 |
Top-p = 0.8:
With top-p=0.8, the model will add up the probabilities from the most likely words until the total probability is greater than or equal to 0.8.
- “apple” = 0.5
- “banana” = 0.3
- Total = 0.8
At this point, the model has already reached 0.8 probability. So it will stop and consider only “apple” and “banana”.
This is different from top-k because it doesn’t limit to a fixed number of tokens. It dynamically chooses the most likely words until the total probability reaches the given threshold (in this case, 0.8).
Top-p = 0.9:
If we set top-p=0.9, the model will keep adding tokens until the cumulative probability is 0.9.
- “apple” = 0.5
- “banana” = 0.3
- “cherry” = 0.1
- Total = 0.9
Now, the model will consider “apple”, “banana”, and “cherry”.
Key Difference between Top-k and Top-p
- Top-k restricts you to a fixed number of the most likely tokens.
- Example: top-k=2 would only allow the model to choose from the top 2 words.
- Top-p (Nucleus sampling) restricts you to the smallest set of tokens whose cumulative probability is greater than or equal to p.
- Example: top-p=0.8 means the model will sample from the tokens that, together, have at least 80% probability.
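The difference is easy to see on the toy distribution above. A small plain-Python sketch with hypothetical helper functions (a tiny tolerance handles floating-point rounding in the cumulative sum):
probs = {"apple": 0.5, "banana": 0.3, "cherry": 0.1, "date": 0.05, "elderberry": 0.05}

def top_k_pool(probs, k):
    # Keep exactly the k most probable tokens
    return dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

def top_p_pool(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p
    pool, total = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool[word] = prob
        total += prob
        if total + 1e-9 >= p:  # tolerance for sums like 0.5 + 0.3
            break
    return pool

print(top_k_pool(probs, 2))    # {'apple': 0.5, 'banana': 0.3}
print(top_p_pool(probs, 0.8))  # {'apple': 0.5, 'banana': 0.3}
print(top_p_pool(probs, 0.9))  # {'apple': 0.5, 'banana': 0.3, 'cherry': 0.1}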
Summary
- Top-k: Always limits to a fixed number of tokens (e.g., top 3, top 5).
- Top-p: Dynamically limits to the smallest set of tokens whose cumulative probability is at least p (e.g., 80% or 90%).
FAQ
1. What happens if every word has a probability of 0.7 and we use top-p sampling with a threshold of 0.8?
Scenario: Assume the following token probabilities (these illustrative values don’t sum to 1 the way real probabilities would, but the cumulative-threshold mechanics are the same):
| Word | Probability |
| --- | --- |
| “apple” | 0.7 |
| “banana” | 0.7 |
| “cherry” | 0.7 |
| “date” | 0.7 |
| “elderberry” | 0.7 |
Top-p = 0.8: In top-p sampling, the model keeps adding tokens to the pool until their cumulative probability exceeds or meets the top-p threshold (0.8).
Step-by-step breakdown:
- “apple” = 0.7
- “banana” = 0.7 (cumulative probability = 0.7 + 0.7 = 1.4)
At this point, the cumulative probability is 1.4, which exceeds the 0.8 threshold. So the sampling pool will be limited to these two words: “apple” and “banana”.
Since the total probability already exceeds 0.8 after the first two words, the model will include both “apple” and “banana” in the selection pool.
Key Points:
- Top-p sampling doesn’t strictly limit the number of tokens — it selects tokens whose cumulative probability is at least the threshold (0.8 in this case).
- If all tokens have the same probability (0.7), then the model will keep adding tokens until the cumulative probability reaches the top-p threshold.
- In this case, the model will sample from the first two words (“apple” and “banana”), as their cumulative probability (1.4) exceeds the threshold of 0.8.
Final Conclusion:
If every word has the same probability of 0.7, and you’re using top-p = 0.8, the model will include all words up to the point where the cumulative probability exceeds 0.8. In this case, it will stop at the second word, and you’ll end up with a pool of two words to choose from.
2. How does top-k sampling behave in the same case, where every word has a probability of 0.7?
Scenario: We have the same token probabilities:
| Word | Probability |
| --- | --- |
| “apple” | 0.7 |
| “banana” | 0.7 |
| “cherry” | 0.7 |
| “date” | 0.7 |
| “elderberry” | 0.7 |
Top-k = 2: In top-k sampling, the model selects the top-k most probable tokens. The number k is fixed, meaning the model will consider exactly the top k tokens based on their probabilities.
How it works:
- Regardless of the probabilities, the model will pick the top 2 most probable tokens.
- In this case, since all the words have the same probability of 0.7, the model will choose the first 2 tokens (based on their order or position in the list).
What Happens Here:
- Since top-k=2, the model will always select the first 2 tokens, because every token has the same probability (0.7).
- The model doesn’t care about the cumulative probability here; it only cares about the number of tokens, which is fixed at 2 in this case.
Key Points:
- Top-k simply selects the top k most probable words — it doesn’t dynamically sum probabilities like top-p.
- In the case where all words have the same probability, top-k just picks the first k words in the list.
- Top-k is not influenced by the cumulative probability — it just selects a fixed number of top tokens.