2: Introduction to GenAI

What is Generative AI?

Generative AI refers to a class of artificial intelligence models that generate new content, such as text, images, audio, or video. Unlike traditional AI models focused on classification or prediction, generative models create new data based on learned patterns, producing outputs similar to the input data but with variability. The ultimate goal is to produce realistic content that’s indistinguishable from human-created work.

Types of Generative AI Models

  1. Large Language Models (LLMs) for text generation: models that generate human-like text using deep learning. Examples: GPT-4, LLaMA, Claude, Mistral, Gemini.
  2. Diffusion Models for image and video generation: models that generate images or video from noise, refining the output over multiple denoising steps. Examples: Stable Diffusion, DALL·E, Midjourney, Sora.
  3. Audio & Music Generators: models that generate realistic speech, music, or sound effects. Examples: MusicGen, Jukebox, VALL-E, Bark.
  4. Multi-modal Models: single models that can process and generate text, images, video, and audio. Examples: Gemini, GPT-4 Turbo (Vision), LLaVA.

Example Use Cases of GenAI

  • 📝 Text Generation: Article generation, essay writing, etc. (ChatGPT, Gemini, LLaMA)
  • 🎨 Image Generation: Creating art, photos, or designs. (Stable Diffusion, DALL·E, Midjourney)
  • 🎶 Audio Generation: Composing music or generating speech. (Jukebox, MusicGen)
  • 🎥 Video Generation: Deepfake technology and AI-assisted filmmaking. (Sora, Pika Labs)
  • 🧑‍🎨 Chatbots: Conversational agents that can interact with users. (ChatGPT, Gemini, LLaMA)

Popular GenAI Model Families

  • GPT (Generative Pre-trained Transformer):

    • GPT models are trained to predict the next word in a sentence given the previous words. They use transformer architecture, which allows them to understand and generate human-like text.
    • Use Case: Writing articles, answering questions, generating code, etc.
  • BERT (Bidirectional Encoder Representations from Transformers):

    • Unlike GPT, BERT is trained to understand the context of a word in both directions (left-to-right and right-to-left). It’s mainly used for tasks that require a deep understanding of the context of language.
    • Use Case: Sentiment analysis, text classification, question answering.
  • LLaMA (Large Language Model Meta AI):

    • Developed by Meta (formerly Facebook), LLaMA is an open-source language model similar to GPT. It focuses on providing access to large models while maintaining efficiency and usability.
    • Use Case: Text generation, summarization, and more.

GenAI Applications and Impact

GenAI has various applications across industries:

  • Text Generation: GenAI models like GPT are used in content generation, such as blog writing, coding, and chatbot responses. For example, OpenAI’s GPT-3 is employed for tasks ranging from generating marketing copy to drafting emails.
  • Conversational AI: Models like GPT-3, paired with specialized APIs, are used to build virtual assistants (like Siri and Alexa) or customer service chatbots, which can hold meaningful conversations with humans.
  • Image Generation: DALL·E is an example of an AI that generates images from textual descriptions. This is used in creative industries like marketing and design.
  • Code Generation: AI models like GitHub Copilot (based on GPT) assist developers by suggesting code and helping write functions.

Real-World GenAI Projects and Case Studies

  1. GPT-3 in Action:
    OpenAI’s GPT-3 is used across various sectors, from writing blog posts to generating legal contracts and automating customer service.

  2. DeepMind’s AlphaFold:
    AlphaFold is a deep learning model developed by DeepMind that predicts the 3D structure of proteins. This has significant implications for drug discovery and biology.

  3. Meta’s LLaMA:
    Meta’s LLaMA models are used for efficient natural language processing tasks, offering an open-source alternative to GPT models for research purposes.


Ethical Considerations

  • Bias in AI: AI models can inherit biases from their training data. This can affect the fairness of models in real-world applications.
  • Transparency and Accountability: Models like GPT may produce outputs that are hard to interpret, raising concerns about accountability in AI-generated content.
  • Deepfakes and Misinformation: GenAI models are capable of generating realistic but fake content, such as videos or voices, which can be used maliciously.

Subsections of Introduction to GenAI

2.1: Key Concepts in GenAI

Key Concepts in Generative AI

  • Large Language Models (LLMs): AI models trained on vast amounts of text data. They use the Transformer architecture, which relies on attention mechanisms to process input data. Examples: GPT (Generative Pre-trained Transformer), BERT, T5.
  • Tokenization: Breaking down text into smaller units (tokens) for processing. Example: the sentence "Hello, world!" might be tokenized into ["Hello", ",", "world", "!"].
  • Embeddings: Representing tokens as numerical vectors in a high-dimensional space. Embeddings capture semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
  • Self-Attention / Attention Mechanism: The mechanism that lets the model focus on the most relevant words when processing each token.
  • Transformers: The deep learning architecture used in LLMs and the backbone of most modern generative models. Key components: Encoder, Decoder, and Attention Mechanism.
  • Pre-training: Training a model on a large dataset (e.g., all of Wikipedia) to learn general language patterns.
  • Fine-tuning: Adapting the pre-trained model to a specific task (e.g., sentiment analysis, a chatbot).
  • Prompt Engineering: Designing effective inputs to guide model responses.

Tokenization

Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model “understand” the text by converting it into a format that can be fed into the neural network.

  • Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].
  • Subword-level tokenization (used in models like GPT) splits words into smaller parts or subwords. This is more efficient and handles unknown words better, as the model can understand smaller pieces of a word. Example: "delightful" → ["delight", "ful"].
  • Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].

Hands-On: Tokenization

Tokenization Example

from transformers import AutoTokenizer

# Load the tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a sentence
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Output:

Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]

Explanation:

  • tokenizer.tokenize(text): Splits the text into tokens.
  • tokenizer.encode(text): Converts tokens into their corresponding numerical IDs.
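
As a quick sanity check, the token IDs can be mapped back to text with the tokenizer's decode method. A short, self-contained sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
token_ids = tokenizer.encode("Hello, how are you?")

# decode() reverses encode(), mapping token IDs back to the original text
print(tokenizer.decode(token_ids))  # Hello, how are you?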


Embeddings

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?

  • Words with similar meanings have similar embeddings.
  • Embeddings enable models to generalize and understand context.

Example:

  • The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship:
    king - man + woman ≈ queen
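
This analogy can be checked with toy vectors; the numbers below are made up purely for illustration and are not taken from a real model:

import torch
import torch.nn.functional as F

# Toy 3-dimensional "embeddings" (illustrative values only)
king  = torch.tensor([0.8, 0.9, 0.1])
man   = torch.tensor([0.7, 0.2, 0.1])
woman = torch.tensor([0.7, 0.2, 0.9])
queen = torch.tensor([0.8, 0.9, 0.9])

result = king - man + woman
# Cosine similarity close to 1.0 means the result points in the same direction as "queen"
print(F.cosine_similarity(result, queen, dim=0))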

Hands-On: Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings for the first token
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123,  0.0456, -0.0678,  0.0234,  0.0891])

Explanation:

  • model(**inputs): Passes the token IDs through the model to generate embeddings.
  • outputs.last_hidden_state: Contains the embeddings for each token.
  • Each token is represented as a 768-dimensional vector (for GPT-2).
  • Output torch.Size([1, 6, 768]) means:
    • Batch size: 1, since we passed a single input sentence.
    • Sequence length: 6, because "Hello, how are you?" is split into 6 tokens.
    • Embedding dimension: 768, as each token is represented by a 768-dimensional vector in GPT-2.

Self-Attention: The Core of Transformers

Imagine you’re reading a sentence: “The cat sat on the mat.” Each word is important, but some are more related to others.

  • “cat” is related to “sat.”
  • “mat” is related to “sat.”
  • “on” is less important.

Self-Attention helps the model decide which words to focus on!

How Self-Attention Works (Step-by-Step)

Self-attention can be broken down into the following steps:

1. Convert Words into Vectors (Embeddings)

Computers don’t understand words, so we convert them into numbers (word embeddings).

Example:

Word Vector Representation (Simplified)
The [0.1, 0.2, 0.3]
Cat [0.5, 0.6, 0.7]
Sat [0.8, 0.9, 1.0]

2. Create Query, Key, and Value (Q, K, V)

Each word is transformed into three vectors:

  • Query (Q) → “What am I looking for?”
  • Key (K) → “What information do I have?”
  • Value (V) → “What should be returned?”

Example:

  • Cat: Q = 0.5, K = 0.4, V = 0.6
  • Sat: Q = 0.7, K = 0.8, V = 0.9

The Query (Q) of “sat” will be compared with Keys (K) of all words to see how related they are.
Now, let’s compute the Attention Scores using Q × Kᵀ.

3. Compute Attention Scores (Importance of Words) (Q × Kᵀ)

Now, we compare Query (Q) and Key (K) using the dot product.

  • If Q and K are similar, the word is important.
  • If Q and K are different, the word is less important.
\[ \text{Attention Score} = Q \times K^T \]

Example:

  • “sat” is strongly related to “cat” → High attention score.
  • “sat” is weakly related to “the” → Low attention score.

  • Cat → Cat: 0.5 × 0.4 = 0.2
  • Cat → Sat: 0.5 × 0.8 = 0.4
  • Sat → Cat: 0.7 × 0.4 = 0.28
  • Sat → Sat: 0.7 × 0.8 = 0.56

Thus, our Attention Score matrix is:

\[ \begin{bmatrix} 0.2 & 0.4 \\ 0.28 & 0.56 \end{bmatrix} \]

4. Normalize the Scores Using Softmax

What is Softmax?

Softmax is a function that converts raw scores into probabilities. Example: Let’s say we have three scores:

Raw Scores: [0.56, 0.72, 0.11]

Softmax converts them into values between 0 and 1 that sum to 1:

Softmax Output: [0.36, 0.42, 0.23] (approximately)

Why do we use Softmax?

So the values sum to 1, making them easy to interpret as “importance levels.”
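
This can be verified in a couple of lines with PyTorch's built-in softmax (a minimal sketch):

import torch
import torch.nn.functional as F

scores = torch.tensor([0.56, 0.72, 0.11])
print(F.softmax(scores, dim=0))  # tensor([0.3557, 0.4174, 0.2268]), sums to 1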

Coming back to our actual values now,

Softmax normalizes these values so that they sum to 1 per row.

\[ \text{Softmax}(0.2, 0.4) = \frac{e^{0.2}}{e^{0.2} + e^{0.4}}, \frac{e^{0.4}}{e^{0.2} + e^{0.4}} \]

Approximating exponentials:

\[ e^{0.2} \approx 1.221, \quad e^{0.4} \approx 1.491 \]\[ \sum = 1.221 + 1.491 = 2.712 \]\[ \text{Softmax} = \left[ \frac{1.221}{2.712}, \frac{1.491}{2.712} \right] = [0.45, 0.55] \]

For the second row:

\[ e^{0.28} \approx 1.323, \quad e^{0.56} \approx 1.751 \]\[ \sum = 1.323 + 1.751 = 3.074 \]\[ \text{Softmax} = \left[ \frac{1.323}{3.074}, \frac{1.751}{3.074} \right] = [0.43, 0.57] \]

So, the normalized attention weights (Softmax scores) are:

\[ \begin{bmatrix} 0.45 & 0.55 \\ 0.43 & 0.57 \end{bmatrix} \]

5. Multiply Attention Weights by Value (V) and Sum Up

Now we multiply the softmax scores (the attention weights) by the Value (V) matrix and sum up the results for each word:

Final Word Representation = Attention Weights × Value (V)

  • Cat: V = 0.6
  • Sat: V = 0.9

For Cat (first row):

\[ (0.45 \times 0.6) + (0.55 \times 0.9) = 0.27 + 0.495 = 0.765 \]

For Sat (second row):

\[ (0.43 \times 0.6) + (0.57 \times 0.9) = 0.258 + 0.513 = 0.771 \]

This process refines each word’s meaning based on context.

6. Final Output: Context Vector

The final contextualized representations are:

\[ \begin{bmatrix} 0.765 \\ 0.771 \end{bmatrix} \]
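
The whole worked example can be reproduced in a few lines of PyTorch, using the toy Q, K, and V values from the tables above:

import torch
import torch.nn.functional as F

# Toy Q, K, V values for "Cat" and "Sat" from the tables above
Q = torch.tensor([[0.5], [0.7]])
K = torch.tensor([[0.4], [0.8]])
V = torch.tensor([[0.6], [0.9]])

scores = Q @ K.T                     # [[0.20, 0.40], [0.28, 0.56]]
weights = F.softmax(scores, dim=-1)  # [[0.45, 0.55], [0.43, 0.57]] (rounded)
context = weights @ V                # approximately [[0.765], [0.771]]
print(context)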

What This Means

  • Each word’s new representation now depends on its relationship with others, weighted by attention!

The final output (context vector) represents the new embeddings for each word after applying the self-attention mechanism. These new values are no longer just the original word embeddings; instead, they now encode contextual information from the surrounding words based on how much attention each word gives to others.

  1. Contextualized Representation

    • Originally, each word had its own static vector (e.g., “Cat” = [0.5, 0.6, 0.7]).
    • After self-attention, each word’s new representation incorporates weighted contributions from other words based on attention scores.
  2. Information Flow

    • The new values (0.765, 0.771) indicate that “Cat” and “Sat” now carry some information from each other, influenced by the attention weights.
    • Words that are more relevant to each other have stronger influences.
  3. Why Is This Important?

    • Before self-attention, “Cat” was just “Cat,” and “Sat” was just “Sat.”
    • Now, “Cat” understands that “Sat” is nearby and incorporates some of its meaning.
    • This is how transformers capture context and relationships in sentences!

7. Sample Code: Inspecting Attention Weights in BERT

from transformers import BertModel, BertTokenizer

# Load pre-trained BERT model with attention outputs enabled
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # the tokenizer itself does not need output_attentions

# Tokenize and convert input into tensors
inputs = tokenizer("Hello, Generative AI!", return_tensors="pt")

# Forward pass to get outputs including attention weights
outputs = model(**inputs)

# Extract attention layers
attentions = outputs.attentions

# Print number of attention layers
print("Number of Attention Layers:", len(attentions))

Explanation of the Code

  • BertModel.from_pretrained("bert-base-uncased", output_attentions=True): Loads the BERT model with the option to return attention weights.
  • tokenizer("Hello, Generative AI!", return_tensors="pt"): Converts input text into a format the model understands (PyTorch tensors).
  • model(**inputs): Passes the tokenized inputs through the BERT model.
  • outputs.attentions: Extracts attention weights from different transformer layers.

8. Python Example: Simple Self-Attention Implementation from Scratch

Now, let’s implement self-attention from scratch in Python.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
    
    def forward(self, x):
        Q = self.query(x)   # Convert to Query
        K = self.key(x)     # Convert to Key
        V = self.value(x)   # Convert to Value

        # Compute Attention Scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention = F.softmax(scores, dim=-1)  # Apply Softmax

        # Multiply by values
        out = torch.matmul(attention, V)
        return out

# Example usage
embed_size = 8
seq_length = 5
x = torch.rand((1, seq_length, embed_size))

self_attn = SelfAttention(embed_size)
output = self_attn(x)

print("Output Shape:", output.shape)  # Expected: (1, seq_length, embed_size)

Transformer Architecture

The Transformer Model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:

Key Components of the Transformer

  • Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word. For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can focus on the context to determine if “bank” refers to a financial institution or the side of a river.
  • Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words.
  • Positional Encoding: Since transformers don’t inherently understand the order of words (unlike sequential models such as RNNs), positional encoding is added to provide information about the position of each word in the sentence (a short sketch follows after the lists below).
  • Encoder-Decoder Architecture:
    • Encoder: Processes the input data (e.g., a sentence).
    • Decoder: Generates the output data (e.g., a translation of the sentence).

Both the Transformer Encoder and Decoder consist of:

  • Feed-Forward Layers: Further processes information after attention.
  • Layer Normalization: Normalizes the output of each sub-layer before passing it to the next layer, stabilizing training and preventing exploding gradients.
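
To make the positional-encoding idea concrete, here is a minimal sketch of the sinusoidal encoding used in the original Transformer paper; the function name and dimensions are purely illustrative:

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# One encoding vector per position; these are added to the token embeddings
pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # torch.Size([6, 8])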

Feed-Forward Layers (FFNN)

What is a Feed-Forward Layer?

A Feed-Forward Layer, also known as a Feed-Forward Neural Network (FFNN), is a fully connected neural network layer that takes the output from the attention mechanism and further processes it.

How does a Feed-Forward Layer work?

  • Input: The output from the attention mechanism is fed into the FFNN.
  • Linear Transformation: The input is transformed using a linear layer (e.g., a dense layer) with a learnable weight matrix and bias term.
  • Activation Function: The output from the linear transformation is passed through an activation function, such as ReLU (Rectified Linear Unit) or GeLU (Gaussian Error Linear Unit).
  • Output: The output from the activation function is the final output of the FFNN.

Purpose of Feed-Forward Layers

The FFNN serves two purposes:

  • Feature Transformation: It transforms the output from the attention mechanism into a higher-level representation that’s more suitable for the task at hand.
  • Non-Linearity Introduction: The activation function introduces non-linearity into the model, allowing it to learn more complex relationships between the input and output. Common choices include (see the short snippet after this list):
    • ReLU (Rectified Linear Unit): maps all negative values to 0 and leaves positive values unchanged.
    • Sigmoid: maps the input to a value between 0 and 1.
    • GELU (Gaussian Error Linear Unit): a smooth activation that combines properties of ReLU and the sigmoid.
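
A quick way to see what these activations do is to apply them to a small tensor; a minimal sketch using PyTorch's built-in functions:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(F.relu(x))         # negatives become 0, positives pass through unchanged
print(torch.sigmoid(x))  # every value squashed into the range (0, 1)
print(F.gelu(x))         # a smooth, ReLU-like curve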

Layer Normalization

What is Layer Normalization?

Layer Normalization is a technique used to normalize the output of each sub-layer (e.g., the FFNN) before passing it to the next layer.

How does Layer Normalization work?

  • Compute Mean and Variance: The mean and variance of the output from the sub-layer are computed.
  • Normalize: The output is normalized by subtracting the mean and dividing by the standard deviation (square root of variance).
  • Scale and Shift: The normalized output is then scaled and shifted using learnable parameters (gamma and beta).

Purpose of Layer Normalization

Layer Normalization serves several purposes:

  • Stabilizes Training: Normalization helps stabilize the training process by reducing the effects of exploding gradients.
  • Improves Generalization: Normalization can improve the model’s ability to generalize to new, unseen data.
  • Reduces Dependence on Initialization: Normalization reduces the dependence on the initialization of the model’s weights.

By using Layer Normalization, the Transformer model can better handle the complex interactions between the attention mechanism and the FFNN, leading to improved performance and stability.
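
To see how self-attention, the feed-forward layer, and layer normalization fit together, here is a minimal sketch of a single encoder block using PyTorch's built-in nn.TransformerEncoderLayer; the dimensions are chosen only for illustration:

import torch
import torch.nn as nn

# One Transformer encoder block: multi-head self-attention + feed-forward network,
# each followed by a residual connection and layer normalization.
block = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=256, batch_first=True)

x = torch.rand(1, 10, 64)   # (batch, sequence length, embedding size)
out = block(x)
print(out.shape)            # torch.Size([1, 10, 64]): same shape, now contextualized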


Examples

FFN Example

Suppose we have a simple FFN with one input layer, one hidden layer, and one output layer. The input layer has 2 neurons, the hidden layer has 3 neurons, and the output layer has 1 neuron.

Here’s the FFN architecture: Input Layer (2 neurons) → Hidden Layer (3 neurons) → Output Layer (1 neuron)

Let’s say we have an input vector x = [1, 2]. The FFN processes this input as follows:

  • Input Layer: The input vector x is passed through the input layer.
  • Hidden Layer: The output from the input layer is passed through a linear transformation (e.g., a dense layer) with weights W1 and biases b1, followed by an activation function (e.g., ReLU). Let’s call the output from the hidden layer h.
  • Output Layer: The output from the hidden layer h is passed through another linear transformation with weights W2 and biases b2. The final output is y.

Here’s some sample PyTorch code to illustrate this:

import torch
import torch.nn as nn

# Define the FFN model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

FFNN Example: Image Classification

Suppose we want to build a simple image classification model that can distinguish between images of cats and dogs.

Here’s how an FFNN can be used:

  • Input Layer: The input layer takes in the image data, which is typically represented as a matrix of pixel values.
  • Hidden Layer: The hidden layer applies a series of transformations to the input data, using weights and biases learned during training.
  • Output Layer: The output layer generates a probability distribution over the two classes (cat or dog).

For example, if we input an image of a cat, the FFNN might output: [0.8, 0.2]

This indicates that the model is 80% confident that the image is a cat and 20% confident that it’s a dog.

Layer Normalization Example

Now, let’s add Layer Normalization to the FFN model. We’ll apply Layer Normalization to the output of the hidden layer.

Here’s the modified PyTorch code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the FFN model with Layer Normalization
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.layer_norm = nn.LayerNorm(3)  # Layer Normalization for hidden layer
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        h = self.layer_norm(h)  # apply Layer Normalization
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

In this example, we’ve added a LayerNorm module to the FFN model, which applies Layer Normalization to the output of the hidden layer. The LayerNorm module normalizes the input data by subtracting the mean and dividing by the standard deviation, which helps to stabilize the training process and improve the model’s performance.

Layer Normalization Example: Language Translation

Suppose we want to build a machine translation system that translates English sentences to Hindi. We have a dataset of paired English and Hindi sentences.

Our sequence-to-sequence model consists of three main components:

  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state. The encoder uses a GRU (Gated Recurrent Unit) layer to process the input sequence.
  • Layer Normalization: The layer normalization component normalizes the output from the encoder. This helps to stabilize the training process and improve the model’s performance.
  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence. The decoder also uses a GRU layer to process the input sequence.

Layer Normalization helps to:

  • Stabilize the training process by reducing the effects of exploding gradients.
  • Improve the model’s ability to generalize to new, unseen data.

Example Walkthrough

Let’s walk through an example of how this model works:

  • English Sentence: We start with an English sentence, “Hello, how are you?”
  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state.
English Word Encoder Output
Hello [0.1, 0.2, 0.3]
how [0.4, 0.5, 0.6]
are [0.7, 0.8, 0.9]
you [1.0, 1.1, 1.2]

Layer Normalization: The layer normalization component normalizes the output from the encoder.

English Word Normalized Output
Hello [-0.5, 0.0, 0.5]
how [-0.3, 0.2, 0.7]
are [-0.1, 0.4, 0.9]
you [0.1, 0.6, 1.1]

Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence.

Hindi Word Decoder Output
नमस्ते [0.8, 0.9, 1.0]
कैसे [0.6, 0.7, 0.8]
हो [0.4, 0.5, 0.6]
? [0.2, 0.3, 0.4]

Final Output: The final output from the decoder is the translated Hindi sentence, “नमस्ते, कैसे हो?”.

# Import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Define the encoder model
class Encoder(nn.Module):
    def __init__(self):
        # Initialize the encoder model
        super(Encoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded)
        # Return the output and hidden state
        return output, hidden

# Define the layer normalization model
class LayerNormalization(nn.Module):
    def __init__(self):
        # Initialize the layer normalization model
        super(LayerNormalization, self).__init__()
        # Define the layer normalization layer with 256 dimensions
        self.layer_norm = nn.LayerNorm(normalized_shape=256)

    def forward(self, x):
        # Normalize the input sequence
        return self.layer_norm(x)

# Define the decoder model
class Decoder(nn.Module):
    def __init__(self):
        # Initialize the decoder model
        super(Decoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
        # Define the fully connected layer with 256 input dimensions and 10000 output dimensions
        self.fc = nn.Linear(256, 10000)

    def forward(self, x, hidden):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded, hidden)
        # Pass the output through the fully connected layer
        output = self.fc(output[:, -1, :])
        # Return the output and hidden state
        return output, hidden

# Define the sequence-to-sequence model
class Seq2Seq(nn.Module):
    def __init__(self):
        # Initialize the sequence-to-sequence model
        super(Seq2Seq, self).__init__()
        # Define the encoder model
        self.encoder = Encoder()
        # Define the layer normalization model
        self.layer_norm = LayerNormalization()
        # Define the decoder model
        self.decoder = Decoder()

    def forward(self, x):
        # Pass the input sequence through the encoder
        encoder_output, hidden = self.encoder(x)
        # Pass the encoder output through the layer normalization
        # (in a full model the normalized output would feed an attention mechanism;
        #  it is not used further in this simplified example)
        normalized_output = self.layer_norm(encoder_output)
        # Pass the first input token and the encoder hidden state through the decoder
        decoder_output, _ = self.decoder(x[:, 0:1], hidden)
        # Return the decoder output
        return decoder_output

# Initialize the sequence-to-sequence model
model = Seq2Seq()

# Define the vocabulary
english_vocabulary = {'Hello': 0, 'are': 1, 'you': 2, '<EOS>': 3}
hindi_vocabulary = {'नमस्ते': 0, 'कैसे': 1, 'हो': 2, '<EOS>': 3}

# Define a sample English input sequence
english_input = torch.tensor([[english_vocabulary['Hello'], english_vocabulary['are'], english_vocabulary['you']]])

# Define a sample Hindi output sequence
hindi_output = torch.tensor([[hindi_vocabulary['नमस्ते'], hindi_vocabulary['कैसे'], hindi_vocabulary['हो']]])

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(100):
    # Zero the gradients
    optimizer.zero_grad()
    # Pass the input sequence through the model
    output = model(english_input)
    # Calculate the loss
    loss = criterion(output, hindi_output[:, 0])
    # Backpropagate the loss
    loss.backward()
    # Update the model parameters
    optimizer.step()
    # Print the loss
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

# Use the trained model for translation
translated_output = model(english_input)
translated_output_idx = torch.argmax(translated_output, dim=1)
# Note: this lookup assumes the predicted index falls inside the tiny 4-word vocabulary,
# which only holds because the model was overfit on a single training example
predicted_word = list(hindi_vocabulary.keys())[translated_output_idx.item()]
print(predicted_word)

Multi-Head Attention (Why Do We Need Multiple Attention Layers?)

Instead of a single attention mechanism, Transformers run several attention heads in parallel to capture different relationships.

  • Why? A single attention mechanism may miss important details.
  • Multi-head attention ensures the model sees information from different perspectives.
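
A minimal sketch of multi-head attention using PyTorch's built-in nn.MultiheadAttention; the dimensions are illustrative, and exposing the per-head weights via average_attn_weights=False assumes a reasonably recent PyTorch version:

import torch
import torch.nn as nn

# Multi-head self-attention: several attention "heads" run in parallel,
# each able to focus on different relationships in the sequence.
embed_size, num_heads, seq_length = 8, 4, 5
mha = nn.MultiheadAttention(embed_dim=embed_size, num_heads=num_heads, batch_first=True)

x = torch.rand(1, seq_length, embed_size)            # (batch, sequence, embedding)
out, weights = mha(x, x, x, average_attn_weights=False)

print("Output shape:", out.shape)                    # (1, 5, 8)
print("Attention weights shape:", weights.shape)     # (1, 4, 5, 5): one 5x5 map per head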

Summary: How Generative AI Processes Text

When you enter a prompt like:
➡️ "Explain black holes"

A Generative AI model follows these steps:

Step 1: Tokenization

Breaks text into smaller parts (tokens).
Example:
"Explain black holes"["explain", "black", "holes"]

Step 2: Embeddings

Each token is converted into a numerical vector for processing.

Step 3: Transformer Model Processing

  • Self-attention determines which words matter the most.
  • Multiple layers refine understanding.

Step 4: Text Generation

  • Predicts the most likely next token at each step.
  • Constructs output based on learned patterns.
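
These four steps can be seen end to end in a few lines. The sketch below uses GPT-2 simply because it is small and freely available, so the actual output quality will be modest:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain black holes"
inputs = tokenizer(prompt, return_tensors="pt")                # Steps 1-2: tokenize (IDs feed the embedding layer)
outputs = model.generate(**inputs, max_length=40)              # Steps 3-4: transformer layers + next-token prediction
print(tokenizer.decode(outputs[0], skip_special_tokens=True))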

2.2: Controlling GenAI Model Output

Temperature

  • Purpose: Controls the randomness of the predictions. It’s a hyperparameter used to scale the logits (predicted probabilities) before sampling.
  • How it works: The model computes probabilities for each token, and the temperature parameter adjusts these probabilities.
    • Low temperature (<1.0): Makes the model more deterministic by amplifying the difference between high-probability tokens and low-probability tokens. This makes the model more likely to choose the most probable token.
    • High temperature (>1.0): Makes the model more random by flattening the probabilities. This results in more diverse, creative, and sometimes less coherent text.
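
As a rough sketch of what the temperature does mathematically, the logits are divided by the temperature before the softmax. The numbers below are illustrative, not from a real model:

import torch
import torch.nn.functional as F

# Illustrative logits for four candidate next tokens
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for temperature in (0.7, 1.0, 1.5):
    probs = F.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])

# Lower temperature sharpens the distribution toward the top token;
# higher temperature flattens it, giving less likely tokens more chance.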

Example

  • Temperature = 0.7: The model favors the most probable tokens, producing more predictable text.
  • Temperature = 1.5: The model takes more risks, producing more unexpected and diverse output.
# Example of lower temperature (more deterministic)
# Note: sampling must be enabled (do_sample=True) for temperature to take effect
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=0.7)

# Example of higher temperature (more creative/random)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=1.5)

Top-k Sampling

  • Purpose: Limits the set of tokens the model can sample from, which keeps the generation more focused and often more coherent.
  • How it works: Instead of considering all possible tokens (the entire vocabulary), top-k sampling restricts the set of possible next tokens to the top-k most likely tokens based on their probability scores.
    • k = 1: This would make the model behave deterministically, always picking the most probable token.
    • k = 50: The model will sample from the top 50 tokens with the highest probabilities.

Example

  • Top-k = 10: The model will only consider the 10 tokens with the highest probabilities when selecting the next word.
  • Top-k = 100: The model will consider the top 100 tokens, giving it more variety.
# Example with top-k sampling (restricted to top 50 tokens); do_sample=True enables sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_k=50)
  • Effect of Top-k: By limiting the token options to the top-k, the model’s output tends to be more controlled and less random than pure sampling from all tokens.
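
Under the hood, top-k filtering can be sketched in a few lines of PyTorch. The probabilities here are illustrative, not taken from a real model:

import torch

# Illustrative next-token probabilities over a tiny five-word vocabulary
probs = torch.tensor([0.5, 0.3, 0.1, 0.05, 0.05])

k = 2
top_probs, top_indices = torch.topk(probs, k)   # keep only the k most likely tokens
top_probs = top_probs / top_probs.sum()         # renormalize so the kept probabilities sum to 1
next_token = top_indices[torch.multinomial(top_probs, 1)]
print(top_indices.tolist(), top_probs.tolist(), next_token.item())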

Top-p (Nucleus Sampling)

  • Purpose: Similar to top-k, but instead of limiting to a fixed number of tokens, top-p limits the tokens considered based on their cumulative probability.
  • How it works: The model keeps sampling from the smallest set of tokens whose cumulative probability exceeds a threshold p (where p is between 0 and 1). This dynamic method is often referred to as nucleus sampling.
    • p = 0.9: The model will consider the smallest set of tokens whose cumulative probability is at least 90%. This results in considering a variable number of tokens based on how steep the probability distribution is.
    • p = 1.0: No truncation at all; the model may sample from the entire vocabulary.

Example

  • Top-p = 0.9: The model considers the smallest set of tokens whose combined probability is at least 90%. This prevents very unlikely tokens from being considered while still allowing more diversity.
  • Top-p = 0.95: The model will sample from a slightly larger set of tokens.
# Example with top-p (nucleus) sampling; do_sample=True enables sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_p=0.9)
  • Effect of Top-p: Nucleus sampling tends to generate more coherent and diverse text than top-k sampling, as the model is free to choose tokens from a set that dynamically adjusts based on their probabilities.
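
Nucleus sampling can be sketched the same way: sort the probabilities, keep the smallest prefix whose cumulative probability reaches p, and sample from that set. The probabilities below are illustrative:

import torch

# Illustrative next-token probabilities, already sorted from most to least likely
probs = torch.tensor([0.5, 0.25, 0.2, 0.03, 0.02])

p = 0.9
cumulative = torch.cumsum(probs, dim=0)
# number of tokens needed before the cumulative probability reaches p
cutoff = int((cumulative < p).sum().item()) + 1
kept = probs[:cutoff] / probs[:cutoff].sum()    # renormalize the kept tokens
next_token = torch.multinomial(kept, 1)
print(cutoff, kept.tolist(), next_token.item())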

Temperature, Top-k, and Top-p Combined

You can combine these parameters to fine-tune the model’s output. For example:

outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    do_sample=True,   # required for temperature/top_k/top_p to take effect
    temperature=0.8,
    top_k=50,
    top_p=0.9
)

This will give you:

  • Sampling enabled (do_sample=True), which is required for the three settings below to have any effect.
  • A moderately low temperature (0.8), keeping the generation fairly focused and predictable.
  • Top-k sampling restricted to the 50 most likely tokens.
  • Top-p sampling that only includes tokens whose cumulative probability is at least 90%.

By tuning these parameters, you can experiment with how controlled or creative the generated text is.


Summary of Differences

  • Temperature: Adjusts the randomness of the sampling. Higher temperature means more diverse output; lower means more predictable.
  • Top-k Sampling: Limits the number of candidate tokens to the top-k most likely tokens.
  • Top-p (Nucleus) Sampling: Limits the candidate tokens to those whose cumulative probability is at least p (a probability threshold), providing more flexible diversity control.

Details

Confused? Let’s break down top-k and top-p with simpler examples.


Top-k Sampling (Simplified)

Imagine the model is choosing the next word from a list of 5 possible words, each with a probability:

Word           Probability
“apple”        0.5
“banana”       0.3
“cherry”       0.1
“date”         0.05
“elderberry”   0.05

Top-k = 2:

With top-k=2, the model will only consider the top 2 most probable words. So it will only consider “apple” and “banana”. The model ignores the words “cherry”, “date”, and “elderberry” because they are less likely.

If the model needs to choose the next word, it will only sample from these 2 words: “apple” and “banana”. This makes the sampling process more controlled and focused.

Top-k = 3:

If top-k=3, it will consider “apple”, “banana”, and “cherry”. This is a little more diverse but still limited to the top 3.


Top-p (Nucleus Sampling) (Simplified)

Now, let’s look at top-p (nucleus sampling), which works a bit differently.

Let’s use the same words and probabilities:

Word           Probability
“apple”        0.5
“banana”       0.3
“cherry”       0.1
“date”         0.05
“elderberry”   0.05

Top-p = 0.8:

With top-p=0.8, the model will add up the probabilities from the most likely words until the total probability is greater than or equal to 0.8.

  • “apple” = 0.5
  • “banana” = 0.3
  • Total = 0.8

At this point, the model has already reached 0.8 probability. So it will stop and consider only “apple” and “banana”.

This is different from top-k because it doesn’t limit to a fixed number of tokens. It dynamically chooses the most likely words until the total probability reaches the given threshold (in this case, 0.8).

Top-p = 0.9:

If we set top-p=0.9, the model will keep adding tokens until the cumulative probability is 0.9.

  • “apple” = 0.5
  • “banana” = 0.3
  • “cherry” = 0.1
  • Total = 0.9

Now, the model will consider “apple”, “banana”, and “cherry”.


Key Difference between Top-k and Top-p

  • Top-k restricts you to a fixed number of the most likely tokens.
    • Example: top-k=2 would only allow the model to choose from the top 2 words.
  • Top-p (Nucleus sampling) restricts you to the smallest set of tokens whose cumulative probability is greater than or equal to p.
    • Example: top-p=0.8 means the model will sample from the tokens that, together, have at least 80% probability.

Summary

  • Top-k: Always limits to a fixed number of tokens (e.g., top 3, top 5).
  • Top-p: Dynamically limits to the smallest set of tokens whose cumulative probability is at least p (e.g., 80% or 90%).
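
You can verify these worked examples with a few lines of plain Python using the same illustrative probabilities:

probs = {"apple": 0.5, "banana": 0.3, "cherry": 0.1, "date": 0.05, "elderberry": 0.05}
ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

# Top-k: keep a fixed number of the most likely words
k = 2
top_k_pool = [word for word, _ in ranked[:k]]

# Top-p: keep adding words until the cumulative probability reaches p
p = 0.8
top_p_pool, cumulative = [], 0.0
for word, prob in ranked:
    top_p_pool.append(word)
    cumulative += prob
    if cumulative >= p - 1e-9:   # small tolerance for floating-point rounding
        break

print(top_k_pool)   # ['apple', 'banana']
print(top_p_pool)   # ['apple', 'banana']; with p = 0.9 it would also include 'cherry'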

FAQ

1. Let’s work through the scenario where every word has a probability of 0.7 and you’re using top-p sampling with a threshold of 0.8.

Scenario: Assume the following token probabilities (purely hypothetical; a real softmax distribution sums to 1, so identical values of 0.7 could never actually occur, but they are enough to illustrate the mechanics):

Word           Probability
“apple”        0.7
“banana”       0.7
“cherry”       0.7
“date”         0.7
“elderberry”   0.7

Top-p = 0.8: In top-p sampling, the model keeps adding tokens to the pool until their cumulative probability exceeds or meets the top-p threshold (0.8).

Step-by-step breakdown:

  • “apple” = 0.7
  • “banana” = 0.7 (cumulative probability = 0.7 + 0.7 = 1.4)

At this point, the cumulative probability is 1.4, which exceeds the 0.8 threshold. So the sampling pool will be limited to these two words: “apple” and “banana”.

Since the total probability already exceeds 0.8 after the first two words, the model will include both “apple” and “banana” in the selection pool.

Key Points:

  • Top-p sampling doesn’t strictly limit the number of tokens — it selects tokens whose cumulative probability is at least the threshold (0.8 in this case).
  • If all tokens have the same probability (0.7), then the model will keep adding tokens until the cumulative probability reaches the top-p threshold.
  • In this case, the model will sample from the first two words (“apple” and “banana”), as their cumulative probability (1.4) exceeds the threshold of 0.8.

Final Conclusion: If every word has the same probability of 0.7, and you’re using top-p = 0.8, the model will include all words up to the point where the cumulative probability exceeds 0.8. In this case, it will stop at the second word, and you’ll end up with a pool of two words to choose from.


2. Let’s now look at how top-k sampling works in this case where every word has a probability of 0.7.

Scenario: We have the same token probabilities:

Word           Probability
“apple”        0.7
“banana”       0.7
“cherry”       0.7
“date”         0.7
“elderberry”   0.7

Top-k = 2: In top-k sampling, the model selects the top-k most probable tokens. The number k is fixed, meaning the model will consider exactly the top k tokens based on their probabilities.

How it works:

  • Regardless of the probabilities, the model will pick the top 2 most probable tokens.
  • In this case, since all the words have the same probability of 0.7, the model will choose the first 2 tokens (based on their order or position in the list).

What Happens Here:

  • Since top-k=2, the model will always select the first 2 tokens, because every token has the same probability (0.7).
  • The model doesn’t care about the cumulative probability here; it only cares about the number of tokens, which is fixed at 2 in this case.

Key Points:

  • Top-k simply selects the top k most probable words — it doesn’t dynamically sum probabilities like top-p.
  • In the case where all words have the same probability, top-k just picks the first k words in the list.
  • Top-k is not influenced by the cumulative probability — it just selects a fixed number of top tokens.

2.3: Seeing In Action

Simple Hands On: Text Generation with GPT

Let’s write some code to generate text using a pre-trained GPT model. We’ll use the transformers library by Hugging Face, which provides easy access to many pre-trained models.

Step 1: Install the Required Libraries

You’ll need Python installed on your machine along with the following packages:

  • transformers (from Hugging Face)
  • torch (PyTorch backend)
pip install transformers torch

Step 2: Write the Code

from transformers import pipeline

# Load a pre-trained GPT-2 model for text generation
generator = pipeline('text-generation', model='gpt2')

# Generate text
prompt = "The future of AI is"
output = generator(prompt, max_length=50, num_return_sequences=1)

# Print the generated text
print(output[0]['generated_text'])

Explanation:

  • pipeline('text-generation', model='gpt2'): Loads the GPT-2 model for text generation.
  • prompt: The starting text for generation.
  • max_length: The maximum length of the generated text.
  • num_return_sequences: The number of sequences to generate.

Output Example:

The future of AI is bright, with advancements in natural language processing, computer vision, and robotics. As AI continues to evolve, it will transform industries, improve healthcare, and enhance our daily lives.
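
You can also ask the pipeline for several alternative continuations in a single call. This is a small variation on the code above; do_sample=True enables sampling so the three outputs differ from each other:

# Generate three different continuations of the same prompt
outputs = generator(prompt, max_length=50, num_return_sequences=3, do_sample=True)
for out in outputs:
    print(out['generated_text'])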

Experiment with Different Prompts

Try changing the prompt variable to see how the model responds. For example:

  • “In a world where robots rule,”
  • “Once upon a time, there was a”
  • “The secret to happiness is”

Practice Yourself

  1. Run the Code: Execute the text generation code and experiment with different prompts.
  2. Explore Other Models: Replace gpt2 with other models like EleutherAI/gpt-neo-1.3B or gpt-j (if available).

For a list of available models, check the Hugging Face Model Hub.

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')
  3. Read More: Familiarize yourself with the Hugging Face documentation and explore other tasks like translation, summarization, and question answering.

Additional Resources


2.3.1: DeepDive Hands On

Deep Dive : Text Generation with GPT and Tokenizer

We’ll start with loading a pretrained model (like GPT-2 or BERT) and running a simple text generation task. We’ll use Hugging Face’s transformers library for this.

Step 1: Install the required libraries

You’ll need Python installed on your machine along with the following packages:

  • transformers (from Hugging Face)
  • torch (PyTorch backend)

To install these, run:

pip install transformers torch

Step 2: Load a Pretrained GPT-2 Model and Tokenizer
Here’s a simple example to load the GPT-2 model and tokenizer, then generate text based on a prompt.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can change to 'gpt2-medium', 'gpt2-large', etc., for a larger model
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

# Encode the prompt
prompt = "In the near future, artificial intelligence will"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Step 3: Run the Code and Observe the Output
When you run the script, the GPT-2 model will generate a continuation of the prompt:

In the near future, artificial intelligence will

Let’s break down what happens under the hood:

  1. Tokenizer Encodes: The tokenizer will first convert this text into token IDs.
  2. Model Generates: The model uses these token IDs to generate the continuation of the sentence.
  3. Tokenizer Decodes: The output token IDs are converted back into a string of text.

For example, the output might look like:

"In the near future, artificial intelligence will be able to predict our every move, revolutionize industries, and improve the quality of life. With advancements in machine learning algorithms and deep learning techniques, AI will be a central part of our daily lives."

Step 4: Experiment with Different Prompts
You can modify the prompt to see how the model responds to different inputs. For example, try:

  • “Once upon a time, in a land far away,”
  • “The economy of the future will be driven by”
  • “The secret to a successful business is”

This will give you a sense of how the GPT-2 model can generate creative and contextually relevant text. Feel free to experiment with different parameters like max_length, num_return_sequences, or other settings to customize the output.


Let’s break down tokenization and the components involved, as well as explain the different parameters used in the code.

Tokenizer in the Code

The tokenizer is responsible for converting human-readable text into tokens that the model can understand. In Hugging Face’s transformers library, the tokenizer is used to:

  • Convert text into token IDs that the model can process.
  • Convert token IDs back into human-readable text (decoding).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can change to 'gpt2-medium', 'gpt2-large', etc., for a larger model
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

GPT2Tokenizer.from_pretrained(model_name): This loads the tokenizer associated with the GPT-2 model. This tokenizer is trained specifically for the model and knows how to convert text into tokens and vice versa.


Key Parameters Used in the Code

a. Encoding the Input Text

inputs = tokenizer(prompt, return_tensors="pt")
  • prompt: This is the initial text input that you want the model to complete or generate further text from.
  • tokenizer(prompt): This will convert the prompt text into token IDs that GPT-2 can understand.
  • return_tensors="pt": This specifies that the output should be in the form of PyTorch tensors. This is required because PyTorch is used for processing the data inside the model. (If you’re using TensorFlow, you’d use return_tensors="tf").

The result inputs will look something like this (the token IDs shown are illustrative, not the actual encoding of this prompt):

{
    'input_ids': tensor([[50256, 318, 257, 4768, 282, 2740]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
}
  • input_ids: These are the actual token IDs that represent the words in the prompt.
  • attention_mask: This tells the model which tokens to focus on (1 means to focus, 0 means ignore).
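
To see the attention_mask in action, you can tokenize two prompts of different lengths in one batch. This is a small illustrative sketch; GPT-2 has no padding token by default, so we reuse the end-of-sequence token for padding:

# Reuse the EOS token as the padding token (GPT-2 does not define one)
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(["Hello world", "A much longer example prompt here"], padding=True, return_tensors="pt")
print(batch['attention_mask'])  # 0s mark the padded positions the model should ignore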

b. Generating Text

outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)
  • inputs['input_ids']: The token IDs for the prompt are passed as input to the model.
  • max_length=50: This limits the total number of tokens (words/subwords) in the generated text, including both the input and the output. In this case, the total length is 50 tokens.
  • num_return_sequences=1: This defines how many different sequences of text you want the model to generate. In this case, the model will generate 1 sequence.
  • no_repeat_ngram_size=2: This parameter prevents any sequence of 2 consecutive tokens (a bigram) from appearing more than once in the generated text, which reduces repetitive output such as the same phrase being produced over and over.

c. Decoding the Output

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  • outputs[0]: This contains the token IDs for the generated text. The model outputs the predicted token IDs for each step in the sequence.
  • tokenizer.decode(...): This converts the token IDs back into human-readable text.
  • skip_special_tokens=True: This removes special tokens like the end-of-sequence token (typically used in transformers models to indicate the end of the generated text).

For example, outputs[0] could be a list of token IDs like [50256, 318, 257, 4768, 282] (illustrative values), which tokenizer.decode(...) turns back into human-readable text.

Conclusion

  • Tokenization is the process of converting text into tokens (IDs) that a model can understand and process.
  • The Tokenizer is a crucial component in transforming text for a model and back into text after processing.
  • Parameters like max_length, num_return_sequences, and no_repeat_ngram_size control the length, number of sequences, and quality of the generated output.

More hands-on examples

Let’s dive into more hands-on examples to reinforce the concepts of tokenization and model generation.

1. Experimenting with Different Prompt Types

Let’s start by experimenting with different types of prompts to see how GPT-2 responds.

Example 1: Story Prompt

prompt = "Once upon a time, in a land far away,"

Example 2: Business Scenario

prompt = "The future of artificial intelligence in business is"

Example 3: Philosophical Question

prompt = "What is the meaning of life?"

For each of these, run the same code and observe the text generated by GPT-2.

2. Play with Generation Parameters

You can adjust parameters to experiment with how the model generates text:

  1. Change max_length:

    • If you increase max_length to 100 or 200, the model will generate a longer continuation of the prompt.

    • Note that we have added min_length to ensure the generated text is at least 100 tokens long.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, no_repeat_ngram_size=2)
  2. Experiment with num_return_sequences:

    • If you set num_return_sequences to 3, the model will generate 3 different continuations of the same prompt.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=3, no_repeat_ngram_size=2, do_sample=True)
    Note: generating more than one sequence requires sampling (do_sample=True, as above) or beam search (num_beams). Without either, you will see an error like:

    ValueError: Greedy methods without beam search do not support num_return_sequences different than 1 (got 3). Replace num_return_sequences with num_beams

  3. Experiment with temperature and top_k:

    • temperature controls randomness. A lower temperature (e.g., 0.7) generates more focused, deterministic text, while a higher temperature (e.g., 1.5) generates more creative, diverse output.

    • top_k restricts the sampling to the top-k most likely next tokens, which controls diversity.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, do_sample=True, temperature=0.9, top_k=50)
      • Lower temperature: More predictable output.
      • Higher temperature: More random, creative output.
  4. Experiment with top_p (nucleus sampling):

    • top_p restricts sampling to the smallest set of tokens whose cumulative probability is greater than p (e.g., top_p=0.9 means it will sample from the smallest set of tokens that cumulatively have 90% of the probability mass).
    outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, do_sample=True, top_p=0.9, temperature=0.8)

3. Try Decoding the Output

You’ll see different outputs for each of the above changes. Use the tokenizer’s decode method to see how the generated tokens look.

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

This allows you to experiment and see how changing each parameter affects the model’s output.


Summary of Parameters

Here’s a quick recap of the parameters we explored:

  • max_length: Controls the length of the generated text (in tokens).
  • num_return_sequences: Controls how many different outputs you want to generate.
  • no_repeat_ngram_size: Prevents repetitive sequences (n-grams) from appearing in the generated text.
  • temperature: Controls the randomness of the text generation. Higher = more randomness.
  • top_k: Limits sampling to the top-k tokens by probability. Controls diversity.
  • top_p: Nucleus sampling; limits the set of tokens to a cumulative probability p.
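
Putting these together, a single call that reuses the model, tokenizer, and inputs defined in the code above might look like this (a sketch; the sampled output varies from run to run):

outputs = model.generate(
    inputs['input_ids'],
    max_length=100,
    num_return_sequences=2,
    no_repeat_ngram_size=2,
    do_sample=True,     # sampling must be enabled for temperature, top_k, and top_p to apply
    temperature=0.8,
    top_k=50,
    top_p=0.9
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))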