Attention Is All You Need


The paper “Attention Is All You Need” (Vaswani et al., 2017) is a landmark research paper in artificial intelligence, specifically in natural language processing (NLP). It introduced a new type of model called the Transformer, which has become the foundation for many modern AI systems like ChatGPT, BERT, and others.


Attention is All You Need

The paper “Attention is All You Need” introduces a new model called Transformer for processing sequences of data, like language. Before this, models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) were used for tasks like translation or text generation. These models processed data one step at a time, in a sequence, which made them slow and inefficient. They also struggled with long sentences because they had trouble remembering information from far back in the sequence.

The Big Idea: Attention

The key idea of the Transformer model is the Attention Mechanism. Instead of processing words one by one in order, the model could look at all the words in a sentence at once and figure out which words were most important to each other. For example, in the sentence “The cat sat on the mat,” the word “sat” is closely related to “cat” and “mat.” Attention helps the model focus on these relationships.

How Transformers Work

The Transformer is built entirely on this idea of attention. Here’s how it works in simple terms:

  1. Attention Mechanism: This allows the model to look at each word in a sentence and pay attention to other words that might be important for understanding its meaning. For example, in the sentence “The cat sat on the mat,” the model might focus on the word “cat” when interpreting the word “sat.”
  2. Parallelization: Since the Transformer doesn’t process words one by one, it can look at all words in parallel, speeding up training and making it more efficient.
  3. Encoder-Decoder Structure: The Transformer is split into two parts:
    • Encoder: Reads and processes the input (like a sentence in English).
    • Decoder: Produces the output (like a translation in French).
  4. Multi-Head Attention: The model doesn’t just have one “attention” mechanism but multiple, allowing it to understand different aspects of the input at once, which improves accuracy.

Why Transformers Are Better

  • Speed: Because Transformers process all words at once, they’re much faster than older models.
  • Accuracy: They’re better at understanding long sentences and complex relationships between words.
  • Scalability: Transformers can be trained on huge amounts of data, which makes them very powerful.

Impact of the Paper

The Transformer architecture revolutionized NLP and AI. It led to the development of models like:

  • GPT (Generative Pre-trained Transformer): Used for text generation.
  • BERT (Bidirectional Encoder Representations from Transformers): Used for understanding language.
  • Many others that power tools like Google Translate, chatbots, and more.

Key Takeaway

The main idea of the paper is that attention is all you need to build powerful language models. By focusing on how words relate to each other, Transformers can understand and generate language much better than older sequence models: they are faster to train, more scalable, and better at handling long sentences and complex relationships between words. This architecture is the foundation for many advanced models like GPT and BERT.

Subsections of Home

1. Introduction to AI

What is AI?

Artificial Intelligence (AI) refers to the field of computer science that aims to create machines or systems capable of performing tasks that typically require human intelligence. These tasks include reasoning, problem-solving, understanding language, and recognizing patterns. AI is commonly described at three levels of capability:

  • Artificial Narrow Intelligence (ANI): This is AI designed for a specific task. For example, a machine learning model used to recommend videos on YouTube is an ANI.
  • Artificial General Intelligence (AGI): AGI refers to a theoretical form of AI that can perform any intellectual task a human can. This is still in the research phase and has not yet been achieved.
  • Artificial Superintelligence (ASI): This is the next stage beyond AGI, where AI surpasses human intelligence in all aspects.

Key Terminology

  • Machine Learning (ML): A subset of AI where algorithms learn patterns from data to make predictions or decisions. There are three primary types (the first two are contrasted in the short sketch after this list):

    • Supervised Learning: The model is trained on labeled data.
    • Unsupervised Learning: The model finds patterns in unlabeled data.
    • Reinforcement Learning: The model learns by interacting with an environment and receiving feedback.
  • Deep Learning: A subset of machine learning that uses neural networks with many layers (hence “deep”) to learn from large amounts of data. It is particularly effective for tasks like image recognition, speech recognition, and language processing.

  • Natural Language Processing (NLP): This is the field that focuses on the interaction between computers and human language, enabling computers to process, analyze, and understand text or speech in a way that is meaningful.
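
A minimal sketch of the supervised/unsupervised distinction, assuming scikit-learn is installed (the tiny toy dataset below is purely illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Labeled data: inputs X with known answers y -> supervised learning
X = [[0.0], [1.0], [9.0], [10.0]]
y = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y)     # learns a mapping from X to y
print(clf.predict([[8.5]]))              # predicts the label of a new point

# The same inputs without labels -> unsupervised learning
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                        # groups the points into 2 clusters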

Discriminative Models vs. Generative Models

Discriminative models and generative models are two fundamental types of machine learning models that serve different purposes and are used in various applications.

Discriminative Models

Discriminative models are designed to predict the probability of a target variable given a set of input features. They learn to distinguish between different classes or labels and are typically used for classification tasks.

Key Characteristics:

  • Focus on prediction: Discriminative models aim to predict the target variable accurately.
  • Conditional probability: They model the conditional probability of the target variable given the input features, P(Y|X).
  • Classification-oriented: Discriminative models are widely used for classification tasks, such as spam detection, sentiment analysis, and image classification.

Examples of Discriminative Models:

  • Logistic Regression
  • Support Vector Machines (SVMs)
  • Neural Networks (e.g., Multilayer Perceptron)

Generative Models

Generative models, on the other hand, are designed to model the underlying distribution of the data. They learn to generate new data samples that are similar to the existing data.

Key Characteristics:

  • Focus on data generation: Generative models aim to generate new data samples that resemble the existing data.
  • Joint probability: They model the joint probability of the input features and the target variable, P(X, Y).
  • Data generation-oriented: Generative models are used for tasks such as data augmentation, anomaly detection, and image/video generation.

Examples of Generative Models:

  • Generative Adversarial Networks (GANs)
  • Variational Autoencoders (VAEs)
  • Gaussian Mixture Models (GMMs)

Key Differences

  • Purpose: Discriminative models focus on prediction, while generative models focus on data generation.
  • Probability modeling: Discriminative models model conditional probabilities, whereas generative models model joint probabilities.
  • Applications: Discriminative models are widely used for classification tasks, while generative models are used for data generation, anomaly detection, and more.

In summary, discriminative models are designed for prediction tasks, while generative models are designed for data generation and modeling the underlying data distribution. Both types of models have their strengths and are used in various applications in machine learning and artificial intelligence.
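
A minimal sketch of this contrast, assuming scikit-learn and NumPy are available: a logistic regression (discriminative) learns P(Y|X) directly, while a Gaussian mixture (generative) models the distribution of the data and can sample new points from it.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two classes drawn from different Gaussians
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Discriminative: model P(Y|X) and predict class probabilities for a new point
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0, 2.0]]))

# Generative: model the data distribution itself and draw new samples from it
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
new_points, _ = gmm.sample(3)
print(new_points)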

2: Introduction to GenAI

What is Generative AI?

Generative AI refers to a class of artificial intelligence models that generate new content, such as text, images, audio, or video. Unlike traditional AI models focused on classification or prediction, generative models create new data based on learned patterns, producing outputs similar to the input data but with variability. The ultimate goal is to produce realistic content that’s indistinguishable from human-created work.

Types of Generative AI Models

  1. Large Language Models (LLMs): Text Generation: Models that generate human-like text using deep learning. Examples: GPT-4, LLaMA, Claude, Mistral, Gemini
  2. Diffusion Models: Image & Video Generation: Generate images/video from noise, refining them over multiple steps. Examples: Stable Diffusion, DALL·E, Midjourney, Sora
  3. Audio & Music Generators: Generate realistic speech, music, or sound effects. Examples: MusicGen, Jukebox, VALL-E, Bark
  4. Multi-modal Models: Can process and generate text, images, video, and audio in a single model. Examples: Gemini, GPT-4 Turbo (Vision), LLaVA

Example Use Cases of GenAI

  • 📝 Text Generation: Article generation, essay writing, etc. (ChatGPT, Gemini, LLaMA)
  • 🎨 Image Generation: Creating art, photos, or designs. (Stable Diffusion, DALL·E, Midjourney)
  • 🎶 Audio Generation: Composing music or generating speech. (Jukebox, MusicGen)
  • 🎥 Video Generation: Deepfake technology and AI-assisted filmmaking. (Sora, Pika Labs)
  • 🧑‍🎨 Chatbots: Conversational agents that can interact with users. (ChatGPT, Gemini, LLaMA)

Some of the model families behind these use cases (a short sketch contrasting GPT-style and BERT-style usage follows this list):

  • GPT (Generative Pretrained Transformer):

    • GPT models are trained to predict the next word in a sentence given the previous words. They use transformer architecture, which allows them to understand and generate human-like text.
    • Use Case: Writing articles, answering questions, generating code, etc.
  • BERT (Bidirectional Encoder Representations from Transformers):

    • Unlike GPT, BERT is trained to understand the context of a word in both directions (left-to-right and right-to-left). It’s mainly used for tasks that require a deep understanding of the context of language.
    • Use Case: Sentiment analysis, text classification, question answering.
  • LLaMA (Large Language Model Meta AI):

    • Developed by Meta (formerly Facebook), LLaMA is an open-source language model similar to GPT. It focuses on providing access to large models while maintaining efficiency and usability.
    • Use Case: Text generation, summarization, and more.
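
To make the GPT/BERT distinction concrete, here is a minimal sketch using the Hugging Face pipeline API (the small public checkpoints gpt2 and bert-base-uncased are used purely as examples and are downloaded on first use):

from transformers import pipeline

# GPT-style: predict the next tokens left-to-right (text generation)
generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture is", max_new_tokens=20)[0]["generated_text"])

# BERT-style: fill in a masked word using context from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The Transformer is used in [MASK] language processing.")[0]["token_str"])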

GenAI Applications and Impact

GenAI has various applications across industries:

  • Text Generation: GenAI models like GPT are used in content generation, such as blog writing, coding, and chatbot responses. For example, OpenAI’s GPT-3 is employed for tasks ranging from generating marketing copy to drafting emails.
  • Conversational AI: Models like GPT, paired with specialized APIs, are used to build virtual assistants and customer-service chatbots that can hold meaningful conversations with users.
  • Image Generation: DALL·E is an example of an AI that generates images from textual descriptions. This is used in creative industries like marketing and design.
  • Code Generation: AI models like GitHub Copilot (based on GPT) assist developers by suggesting code and helping write functions.

Real-World GenAI Projects and Case Studies

  1. GPT-3 in Action:
    OpenAI’s GPT-3 is used across various sectors, from writing blog posts to generating legal contracts and automating customer service.

  2. DeepMind’s AlphaFold:
    AlphaFold is a deep learning model developed by DeepMind that predicts the 3D structure of proteins. This has significant implications for drug discovery and biology.

  3. Meta’s LLaMA:
    Meta’s LLaMA models are used for efficient natural language processing tasks, offering an open-source alternative to GPT models for research purposes.


Ethical Considerations

  • Bias in AI: AI models can inherit biases from their training data. This can affect the fairness of models in real-world applications.
  • Transparency and Accountability: Models like GPT may produce outputs that are hard to interpret, raising concerns about accountability in AI-generated content.
  • Deepfakes and Misinformation: GenAI models are capable of generating realistic but fake content, such as videos or voices, which can be used maliciously.

Subsections of Introduction to GenAI

2.1: Key Concepts in GenAI

Key Concepts in Generative AI

| Concept | Definition |
| --- | --- |
| Large Language Models (LLMs) | AI models trained on vast amounts of text data. They use the Transformer architecture, which relies on attention mechanisms to process input data. Examples: GPT (Generative Pre-trained Transformer), BERT, T5. |
| Tokenization | Breaking down text into smaller units (tokens) for processing. Example: the sentence “Hello, world!” might be tokenized into ["Hello", ",", "world", "!"]. |
| Embeddings | Representing tokens as numerical vectors in a high-dimensional space. Embeddings capture semantic meaning (e.g., “king” - “man” + “woman” ≈ “queen”). |
| Self-Attention / Attention Mechanism | A mechanism that helps the model focus on the most relevant words in the input. |
| Transformers | The deep learning architecture used in LLMs and the backbone of most modern generative models. Key components: Encoder, Decoder, and Attention Mechanism. |
| Pre-training | Training a model on a large dataset (e.g., all of Wikipedia) to learn general language patterns. |
| Fine-tuning | Adapting the pre-trained model to a specific task (e.g., sentiment analysis, a chatbot). |
| Prompt Engineering | Designing effective inputs to guide model responses. |

Tokenization

Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model “understand” the text by converting it into a format that can be fed into the neural network.

  • Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].
  • Subword-level tokenization (used in models like GPT) splits words into smaller parts or subwords. This is more efficient and handles unknown words better, as the model can understand smaller pieces of a word. Example: "delightful" → ["delight", "ful"].
  • Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].

Hands-On: Tokenization

Tokenization Example

from transformers import AutoTokenizer

# Load the tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a sentence
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Output:

Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]

Explanation:

  • tokenizer.tokenize(text): Splits the text into tokens.
  • tokenizer.encode(text): Converts tokens into their corresponding numerical IDs.
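
To map the token IDs back to text, the tokenizer’s decode method reverses the encoding (a quick addition to the snippet above):

# Round-trip: convert the IDs back into the original text
print(tokenizer.decode(token_ids))  # Hello, how are you?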


Embeddings

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?

  • Words with similar meanings have similar embeddings.
  • Embeddings enable models to generalize and understand context.

Example:

  • The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship:
    king - man + woman ≈ queen
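
This analogy can be checked with classic static word vectors such as GloVe. A minimal sketch, assuming the gensim package is installed (the vectors are downloaded on first use and the exact ranking can vary):

import gensim.downloader as api

# 50-dimensional GloVe vectors (a one-time download)
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# the classic result is ('queen', ...) at the top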

Hands-On: Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings for the first token
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123,  0.0456, -0.0678,  0.0234,  0.0891])

Explanation:

  • model(**inputs): Passes the token IDs through the model to generate embeddings.
  • outputs.last_hidden_state: Contains the embeddings for each token.
  • Each token is represented as a 768-dimensional vector (for GPT-2).
  • Output: torch.Size([1, 6, 768]) means
    • Batch size: 1. Since we have one input sentence.
    • Sequence length: 6. The input sentence has 6 tokens for the sentence "Hello, how are you?"
    • Embedding dimensions: 768. Each token is represented by a 768-dimensional vector for GPT-2.

Self-Attention: The Core of Transformers

Imagine you’re reading a sentence: "The cat sat on the mat." Each word is important, but some words are more related to each other than others.

  • “cat” is related to “sat.”
  • “mat” is related to “sat.”
  • “on” is less important.

Self-Attention helps the model decide which words to focus on!

How Self-Attention Works (Step-by-Step)

Self-attention is computed in the following steps:

1. Convert Words into Vectors (Embeddings)

Computers don’t understand words, so we convert them into numbers (word embeddings).

Example:

| Word | Vector Representation (Simplified) |
| --- | --- |
| The | [0.1, 0.2, 0.3] |
| Cat | [0.5, 0.6, 0.7] |
| Sat | [0.8, 0.9, 1.0] |

2. Create Query, Key, and Value (Q, K, V)

Each word is transformed into three vectors:

  • Query (Q) → “What am I looking for?”
  • Key (K) → “What information do I have?”
  • Value (V) → “What should be returned?”

Example:

| Word | Query (Q) | Key (K) | Value (V) |
| --- | --- | --- | --- |
| Cat | 0.5 | 0.4 | 0.6 |
| Sat | 0.7 | 0.8 | 0.9 |

The Query (Q) of “sat” will be compared with Keys (K) of all words to see how related they are.
Now, let’s compute the Attention Scores using Q × Kᵀ.

3. Compute Attention Scores (Importance of Words) (Q × Kᵀ)

Now, we compare Query (Q) and Key (K) using the dot product.

  • If Q and K are similar, the word is important.
  • If Q and K are different, the word is less important.
\[ \text{Attention Score} = Q \times K^T \]

Example:

  • “sat” is strongly related to “cat” → High attention score.
  • “sat” is weakly related to “the” → Low attention score.

| Word Pair | Calculation |
| --- | --- |
| Cat → Cat | (0.5 × 0.4) = 0.2 |
| Cat → Sat | (0.5 × 0.8) = 0.4 |
| Sat → Cat | (0.7 × 0.4) = 0.28 |
| Sat → Sat | (0.7 × 0.8) = 0.56 |

Thus, our Attention Score matrix is:

\[ \begin{bmatrix} 0.2 & 0.4 \\ 0.28 & 0.56 \end{bmatrix} \]

4. Normalize the Scores Using Softmax

What is Softmax?

Softmax is a function that converts raw scores into probabilities. Example: Let’s say we have three scores:

Raw Scores: [0.56, 0.72, 0.11]

Softmax converts them into values between 0 and 1:

Softmax Output (approx.): [0.36, 0.42, 0.23]

Why do we use Softmax?

So the values sum to 1, making them easy to interpret as “importance levels.”
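
A quick check of this example in Python (a minimal sketch):

import numpy as np

scores = np.array([0.56, 0.72, 0.11])
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.round(2))  # [0.36 0.42 0.23] -- positive values that sum to 1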

Coming back to our actual values now,

Softmax normalizes these values so that they sum to 1 per row.

\[ \text{Softmax}(0.2, 0.4) = \frac{e^{0.2}}{e^{0.2} + e^{0.4}}, \frac{e^{0.4}}{e^{0.2} + e^{0.4}} \]

Approximating exponentials:

\[ e^{0.2} \approx 1.221, \quad e^{0.4} \approx 1.491 \]
\[ \sum = 1.221 + 1.491 = 2.712 \]
\[ \text{Softmax} = \left[ \frac{1.221}{2.712}, \frac{1.491}{2.712} \right] = [0.45, 0.55] \]

For the second row:

\[ e^{0.28} \approx 1.323, \quad e^{0.56} \approx 1.751 \]
\[ \sum = 1.323 + 1.751 = 3.074 \]
\[ \text{Softmax} = \left[ \frac{1.323}{3.074}, \frac{1.751}{3.074} \right] = [0.43, 0.57] \]

So, the normalized attention weights (Softmax scores) are:

\[ \begin{bmatrix} 0.45 & 0.55 \\ 0.43 & 0.57 \end{bmatrix} \]

5. Multiply Attention Weights by Value (V) and Sum Up

Now we multiply the softmax scores (attention weights) by each word’s Value (V) and sum the results:

Final Word Representation = Attention Weights × Value (V)

| Word | Value (V) |
| --- | --- |
| Cat | 0.6 |
| Sat | 0.9 |

For Cat (first row):

\[ (0.45 \times 0.6) + (0.55 \times 0.9) = 0.27 + 0.495 = 0.765 \]

For Sat (second row):

\[ (0.43 \times 0.6) + (0.57 \times 0.9) = 0.258 + 0.513 = 0.771 \]

This process refines each word’s meaning based on context.

6. Final Output: Context Vector

The final contextualized representations are:

\[ \begin{bmatrix} 0.765 \\ 0.771 \end{bmatrix} \]
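
The toy numbers above can be reproduced in a few lines of NumPy (a minimal sketch using the scalar Q, K, V values from the tables; a real model works with vectors and also scales the scores by 1/√d):

import numpy as np

Q = np.array([[0.5], [0.7]])   # queries for "Cat", "Sat"
K = np.array([[0.4], [0.8]])   # keys
V = np.array([[0.6], [0.9]])   # values

scores = Q @ K.T                                                      # [[0.2, 0.4], [0.28, 0.56]]
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
context = weights @ V                                                 # weighted sum of values

print(weights.round(2))  # [[0.45 0.55]
                         #  [0.43 0.57]]
print(context.round(3))  # [[0.765]
                         #  [0.771]]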

What This Means

  • Each word’s new representation now depends on its relationship with others, weighted by attention!

The final output (context vector) represents the new embeddings for each word after applying the self-attention mechanism. These new values are no longer just the original word embeddings; instead, they now encode contextual information from the surrounding words based on how much attention each word gives to others.

  1. Contextualized Representation

    • Originally, each word had its own static vector (e.g., “Cat” = [0.5, 0.6, 0.7]).
    • After self-attention, each word’s new representation incorporates weighted contributions from other words based on attention scores.
  2. Information Flow

    • The new values (0.765, 0.771) indicate that “Cat” and “Sat” now carry some information from each other, influenced by the attention weights.
    • Words that are more relevant to each other have stronger influences.
  3. Why Is This Important?

    • Before self-attention, “Cat” was just “Cat,” and “Sat” was just “Sat.”
    • Now, “Cat” understands that “Sat” is nearby and incorporates some of its meaning.
    • This is how transformers capture context and relationships in sentences!

7. Sample Code: Inspecting Attention Weights in BERT

from transformers import BertModel, BertTokenizer

# Load pre-trained BERT model with attention outputs enabled
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize and convert input into tensors
inputs = tokenizer("Hello, Generative AI!", return_tensors="pt")

# Forward pass to get outputs including attention weights
outputs = model(**inputs)

# Extract attention layers
attentions = outputs.attentions

# Print number of attention layers
print("Number of Attention Layers:", len(attentions))

Explanation of the Code

  • BertModel.from_pretrained("bert-base-uncased", output_attentions=True): Loads the BERT model with the option to return attention weights.
  • tokenizer("Hello, Generative AI!", return_tensors="pt"): Converts input text into a format the model understands (PyTorch tensors).
  • model(**inputs): Passes the tokenized inputs through the BERT model.
  • outputs.attentions: A tuple with one attention tensor per transformer layer (12 for bert-base-uncased), each of shape (batch_size, num_heads, seq_len, seq_len).

8. Python Example: Simple Self-Attention Implementation from Scratch

Now, let’s implement self-attention from scratch in Python.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
    
    def forward(self, x):
        Q = self.query(x)   # Convert to Query
        K = self.key(x)     # Convert to Key
        V = self.value(x)   # Convert to Value

        # Compute Attention Scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention = F.softmax(scores, dim=-1)  # Apply Softmax

        # Multiply by values
        out = torch.matmul(attention, V)
        return out

# Example usage
embed_size = 8
seq_length = 5
x = torch.rand((1, seq_length, embed_size))

self_attn = SelfAttention(embed_size)
output = self_attn(x)

print("Output Shape:", output.shape)  # Expected: (1, seq_length, embed_size)

Transformer Architecture

The Transformer Model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:

Key Components of the Transformer

  • Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word. For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can focus on the context to determine if “bank” refers to a financial institution or the side of a river.
  • Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words.
  • Positional Encoding: Since transformers don’t inherently understand the order of words (unlike sequential models), positional encoding is added to provide information about the position of each word in the sentence (a minimal sketch follows at the end of this subsection).
  • Encoder-Decoder Architecture:
    • Encoder: Processes the input data (e.g., a sentence).
    • Decoder: Generates the output data (e.g., a translation of the sentence).

Both the Transformer Encoder and Decoder consist of:

  • Feed-Forward Layers: Further processes information after attention.
  • Layer Normalization: Normalizes the output of each sub-layer before passing it to the next layer, stabilizing training and preventing exploding gradients.
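
The sinusoidal positional encoding from the original paper can be sketched in a few lines of PyTorch (assuming an even embedding dimension; the sequence length and dimension below are arbitrary):

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=768)
print(pe.shape)  # torch.Size([6, 768]) -- added element-wise to the token embeddings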

Feed-Forward Layers (FFNN)

What is a Feed-Forward Layer?

A Feed-Forward Layer, also known as a Feed-Forward Neural Network (FFNN), is a fully connected neural network layer that takes the output from the attention mechanism and further processes it.

How does a Feed-Forward Layer work?

  • Input: The output from the attention mechanism is fed into the FFNN.
  • Linear Transformation: The input is transformed using a linear layer (e.g., a dense layer) with a learnable weight matrix and bias term.
  • Activation Function: The output from the linear transformation is passed through an activation function, such as ReLU (Rectified Linear Unit) or GeLU (Gaussian Error Linear Unit).
  • Output: The output from the activation function is the final output of the FFNN.

Purpose of Feed-Forward Layers

The FFNN serves two purposes:

  • Feature Transformation: It transforms the output from the attention mechanism into a higher-level representation that’s more suitable for the task at hand.
  • Non-Linearity Introduction: The activation function introduces non-linearity into the model, allowing it to learn more complex relationships between the input and output. Common choices (compared in the short sketch below):
    • ReLU (Rectified Linear Unit): maps all negative values to 0 and leaves positive values unchanged.
    • Sigmoid: maps the input to a value between 0 and 1.
    • GELU (Gaussian Error Linear Unit): a smooth activation that combines the benefits of ReLU and the sigmoid.
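
A minimal sketch comparing these activations on a few sample values with PyTorch:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))     # negatives clamped to 0, positives unchanged
print(torch.sigmoid(x))  # every value squashed into (0, 1)
print(F.gelu(x))         # smooth curve, close to ReLU for large |x|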

Layer Normalization

What is Layer Normalization?

Layer Normalization is a technique used to normalize the output of each sub-layer (e.g., the FFNN) before passing it to the next layer.

How does Layer Normalization work?

  • Compute Mean and Variance: The mean and variance of the output from the sub-layer are computed.
  • Normalize: The output is normalized by subtracting the mean and dividing by the standard deviation (square root of variance).
  • Scale and Shift: The normalized output is then scaled and shifted using learnable parameters (gamma and beta). The short sketch below reproduces these steps and checks them against PyTorch’s nn.LayerNorm.
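
A minimal sketch of the computation, assuming gamma = 1 and beta = 0 (the defaults that nn.LayerNorm starts from):

import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0]])

# Manual layer normalization: subtract the mean, divide by the standard deviation
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + 1e-5)

# PyTorch's built-in LayerNorm gives the same result before gamma/beta are trained
layer_norm = nn.LayerNorm(3)
print(manual)
print(layer_norm(x))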

Purpose of Layer Normalization

Layer Normalization serves several purposes:

  • Stabilizes Training: Normalization helps stabilize the training process by reducing the effects of exploding gradients.
  • Improves Generalization: Normalization can improve the model’s ability to generalize to new, unseen data.
  • Reduces Dependence on Initialization: Normalization reduces the dependence on the initialization of the model’s weights.

By using Layer Normalization, the Transformer model can better handle the complex interactions between the attention mechanism and the FFNN, leading to improved performance and stability.


Examples

FFN Example

Suppose we have a simple FFN with one input layer, one hidden layer, and one output layer. The input layer has 2 neurons, the hidden layer has 3 neurons, and the output layer has 1 neuron.

Here’s the FFN architecture: Input Layer (2 neurons) → Hidden Layer (3 neurons) → Output Layer (1 neuron)

Let’s say we have an input vector x = [1, 2]. The FFN processes this input as follows:

  • Input Layer: The input vector x is passed through the input layer.
  • Hidden Layer: The output from the input layer is passed through a linear transformation (e.g., a dense layer) with weights W1 and biases b1, followed by an activation function (e.g., ReLU). Let’s call the output from the hidden layer h.
  • Output Layer: The output from the hidden layer h is passed through another linear transformation with weights W2 and biases b2. The final output is y.

Here’s some sample PyTorch code to illustrate this:

import torch
import torch.nn as nn

# Define the FFN model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

FFNN Example: Image Classification

Suppose we want to build a simple image classification model that can distinguish between images of cats and dogs.

Here’s how an FFNN can be used:

  • Input Layer: The input layer takes in the image data, which is typically represented as a matrix of pixel values.
  • Hidden Layer: The hidden layer applies a series of transformations to the input data, using weights and biases learned during training.
  • Output Layer: The output layer generates a probability distribution over the two classes (cat or dog).

For example, if we input an image of a cat, the FFNN might output: [0.8, 0.2]

This indicates that the model is 80% confident that the image is a cat and 20% confident that it’s a dog.

Layer Normalization Example

Now, let’s add Layer Normalization to the FFN model. We’ll apply Layer Normalization to the output of the hidden layer.

Here’s the modified PyTorch code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the FFN model with Layer Normalization
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.layer_norm = nn.LayerNorm(3)  # Layer Normalization for hidden layer
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        h = self.layer_norm(h)  # apply Layer Normalization
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

In this example, we’ve added a LayerNorm module to the FFN model, which applies Layer Normalization to the output of the hidden layer. The LayerNorm module normalizes the input data by subtracting the mean and dividing by the standard deviation, which helps to stabilize the training process and improve the model’s performance.

Layer Normalization Example: Language Translation

Suppose we want to build a machine translation system that translates English sentences to Hindi. We have a dataset of paired English and Hindi sentences.

Our sequence-to-sequence model consists of three main components:

  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state. The encoder uses a GRU (Gated Recurrent Unit) layer to process the input sequence.
  • Layer Normalization: The layer normalization component normalizes the output from the encoder. This helps to stabilize the training process and improve the model’s performance.
  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence. The decoder also uses a GRU layer to process the input sequence.

Layer Normalization helps to:

  • Stabilize the training process by reducing the effects of exploding gradients.
  • Improve the model’s ability to generalize to new, unseen data.

Example Walkthrough

Let’s walk through an example of how this model works:

  • English Sentence: We start with an English sentence, “Hello, how are you?”
  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state.

| English Word | Encoder Output |
| --- | --- |
| Hello | [0.1, 0.2, 0.3] |
| how | [0.4, 0.5, 0.6] |
| are | [0.7, 0.8, 0.9] |
| you | [1.0, 1.1, 1.2] |

  • Layer Normalization: The layer normalization component normalizes the output from the encoder. Because each encoder output above is an evenly spaced triple, every row normalizes to the same zero-mean, unit-variance vector:

| English Word | Normalized Output (approx.) |
| --- | --- |
| Hello | [-1.22, 0.00, 1.22] |
| how | [-1.22, 0.00, 1.22] |
| are | [-1.22, 0.00, 1.22] |
| you | [-1.22, 0.00, 1.22] |

  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence (the decoder outputs below are illustrative):

| Hindi Word | Decoder Output |
| --- | --- |
| नमस्ते | [0.8, 0.9, 1.0] |
| कैसे | [0.6, 0.7, 0.8] |
| हो | [0.4, 0.5, 0.6] |
| ? | [0.2, 0.3, 0.4] |

  • Final Output: The final output from the decoder is the translated Hindi sentence, “नमस्ते, कैसे हो?”.

# Import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Define the encoder model
class Encoder(nn.Module):
    def __init__(self):
        # Initialize the encoder model
        super(Encoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded)
        # Return the output and hidden state
        return output, hidden

# Define the layer normalization model
class LayerNormalization(nn.Module):
    def __init__(self):
        # Initialize the layer normalization model
        super(LayerNormalization, self).__init__()
        # Define the layer normalization layer with 256 dimensions
        self.layer_norm = nn.LayerNorm(normalized_shape=256)

    def forward(self, x):
        # Normalize the input sequence
        return self.layer_norm(x)

# Define the decoder model
class Decoder(nn.Module):
    def __init__(self):
        # Initialize the decoder model
        super(Decoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
        # Define the fully connected layer with 256 input dimensions and 10000 output dimensions
        self.fc = nn.Linear(256, 10000)

    def forward(self, x, hidden):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded, hidden)
        # Pass the output through the fully connected layer
        output = self.fc(output[:, -1, :])
        # Return the output and hidden state
        return output, hidden

# Define the sequence-to-sequence model
class Seq2Seq(nn.Module):
    def __init__(self):
        # Initialize the sequence-to-sequence model
        super(Seq2Seq, self).__init__()
        # Define the encoder model
        self.encoder = Encoder()
        # Define the layer normalization model
        self.layer_norm = LayerNormalization()
        # Define the decoder model
        self.decoder = Decoder()

    def forward(self, x):
        # Pass the input sequence through the encoder
        encoder_output, hidden = self.encoder(x)
        # Normalize the encoder output (in a full model this would feed an attention
        # mechanism; this simplified decoder only consumes the encoder's hidden state)
        normalized_output = self.layer_norm(encoder_output)
        # Feed the first input token and the encoder hidden state to the decoder
        decoder_output, _ = self.decoder(x[:, 0:1], hidden)
        # Return the decoder output (a distribution over the Hindi vocabulary)
        return decoder_output

# Initialize the sequence-to-sequence model
model = Seq2Seq()

# Define tiny toy vocabularies for this example
english_vocabulary = {'Hello': 0, 'how': 1, 'are': 2, 'you': 3, '<EOS>': 4}
hindi_vocabulary = {'नमस्ते': 0, 'कैसे': 1, 'हो': 2, '<EOS>': 3}

# Define a sample English input sequence ("Hello how are you")
english_input = torch.tensor([[english_vocabulary['Hello'], english_vocabulary['how'],
                               english_vocabulary['are'], english_vocabulary['you']]])

# Define a sample Hindi output sequence ("नमस्ते कैसे हो")
hindi_output = torch.tensor([[hindi_vocabulary['नमस्ते'], hindi_vocabulary['कैसे'], hindi_vocabulary['हो']]])

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(100):
    # Zero the gradients
    optimizer.zero_grad()
    # Pass the input sequence through the model
    output = model(english_input)
    # Calculate the loss against the first Hindi token only (a simplification)
    loss = criterion(output, hindi_output[:, 0])
    # Backpropagate the loss
    loss.backward()
    # Update the model parameters
    optimizer.step()
    # Print the loss
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

# Use the trained model to predict the first Hindi word of the translation
translated_output = model(english_input)
translated_output_idx = torch.argmax(translated_output, dim=1)
predicted_word = list(hindi_vocabulary.keys())[translated_output_idx.item()]
print(predicted_word)

Multi-Head Attention (Why Do We Need Multiple Attention Layers?)

Instead of one attention mechanism, Transformers use multiple (“heads”) to capture different relationships.

  • Why? A single attention head may miss important details.
  • Multi-head attention lets the model see information from different perspectives (a minimal sketch using PyTorch’s built-in nn.MultiheadAttention follows).
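
A minimal sketch with PyTorch’s built-in nn.MultiheadAttention (the embedding size and number of heads below are arbitrary; in self-attention the same tensor serves as query, key, and value):

import torch
import torch.nn as nn

embed_size, num_heads, seq_len = 8, 2, 5
x = torch.rand(1, seq_len, embed_size)   # (batch, sequence, embedding)

mha = nn.MultiheadAttention(embed_dim=embed_size, num_heads=num_heads, batch_first=True)
out, attn_weights = mha(x, x, x)         # self-attention: query = key = value = x

print(out.shape)           # torch.Size([1, 5, 8])
print(attn_weights.shape)  # torch.Size([1, 5, 5]) -- attention weights averaged over heads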

Summary: How Generative AI Processes Text

When you enter a prompt like:
➡️ "Explain black holes"

A Generative AI model follows these steps:

Step 1: Tokenization

Breaks text into smaller parts (tokens).
Example:
"Explain black holes"["explain", "black", "holes"]

Step 2: Embeddings

Each token is converted into a numerical vector for processing.

Step 3: Transformer Model Processing

  • Self-attention determines which words matter the most.
  • Multiple layers refine understanding.

Step 4: Text Generation

  • Predicts the most likely next token at each step.
  • Constructs the output based on learned patterns (a minimal end-to-end sketch follows).
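
A minimal end-to-end sketch of these steps with GPT-2 (a small public checkpoint used purely as an example; greedy decoding picks the single most likely next token at each step):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Steps 1-2: tokenize the prompt; the model looks up its embeddings internally
inputs = tokenizer("Explain black holes", return_tensors="pt")

# Steps 3-4: run the Transformer and generate the most likely continuation
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))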

Subsections of Key Concepts in GenAI

2.1.1: Tokenization

Tokenization

Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model “understand” the text by converting it into a format that can be fed into the neural network.

  • Word-level tokenization splits text into words. Example: "I love AI"["I", "love", "AI"].
  • Subword-level tokenization (used in models like GPT) splits words into smaller parts or subwords. This is more efficient and handles unknown words better, as the model can understand smaller pieces of a word. Example, “delightful”[“delight”, “ful”]`.
  • Character-Level Tokenization Splits text into individual characters. Example: "AI"["A", "I"].

Hands-On: Tokenization

Tokenization Example

from transformers import AutoTokenizer

# Load the tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a sentence
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Output:

Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]

Explanation:

  • tokenizer.tokenize(text): Splits the text into tokens.
  • tokenizer.encode(text): Converts tokens into their corresponding numerical IDs.

Tokenization: Additional Resources


2.1.2: Embeddings

Embeddings

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?

  • Words with similar meanings have similar embeddings.
  • Embeddings enable models to generalize and understand context.

Example:

  • The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship:
    king - man + woman ≈ queen

Hands-On: Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings for the first token
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123,  0.0456, -0.0678,  0.0234,  0.0891])

Explanation:

  • model(**inputs): Passes the token IDs through the model to generate embeddings.
  • outputs.last_hidden_state: Contains the embeddings for each token.
  • Each token is represented as a 768-dimensional vector (for GPT-2).
  • Output: torch.Size([1, 6, 768]) means
    • Batch size: 1. Since we have one input sentence.
    • Sequence length: 6. The input sentence has 6 tokens for the sentence "Hello, how are you?"
    • Embedding dimensions: 768. Each token is represented by a 768-dimensional vector for GPT-2.

2.1.3: Attention Mechanism

Self-Attention: The Core of Transformers

Imagine you’re reading a sentence: "The cat sat on the mat. Each word is important, but some are more related to others.

  • “cat” is related to “sat.”
  • “mat” is related to “sat.”
  • “on” is less important.

Self-Attention helps the model decide which words to focus on!

How Self-Attention Works (Step-by-Step)

Self-attention is done in 4 steps:

1. Convert Words into Vectors (Embeddings)

Computers don’t understand words, so we convert them into numbers (word embeddings).

Example:

Word Vector Representation (Simplified)
The [0.1, 0.2, 0.3]
Cat [0.5, 0.6, 0.7]
Sat [0.8, 0.9, 1.0]

2. Create Query, Key, and Value (Q, K, V)

Each word is transformed into three vectors:

  • Query (Q) → “What am I looking for?”
  • Key (K) → “What information do I have?”
  • Value (V) → “What should be returned?”

Example:

Word Query (Q) Key (K) Value (V)
Cat 0.5 0.4 0.6
Sat 0.7 0.8 0.9

The Query (Q) of “sat” will be compared with Keys (K) of all words to see how related they are.
Now, let’s compute the Attention Scores using Q × Kᵀ.

3. Compute Attention Scores (Importance of Words) (Q × Kᵀ)

Now, we compare Query (Q) and Key (K) using the dot product.

  • If Q and K are similar, the word is important.
  • If Q and K are different, the word is less important.
\[ \text{Attention Score} = Q \times K^T \]

Example:

  • “sat” is strongly related to “cat” → High attention score.
  • “sat” is weakly related to “the” → Low attention score.
Word Pair Calculation
Cat → Cat (0.5 × 0.4) = 0.2
Cat → Sat (0.5 × 0.8) = 0.4
Sat → Cat (0.7 × 0.4) = 0.28
Sat → Sat (0.7 × 0.8) = 0.56

Thus, our Attention Score matrix is:

\[ \begin{bmatrix} 0.2 & 0.4 \\ 0.28 & 0.56 \end{bmatrix} \]

4. Now we normalize these scores using Softmax.

What is Softmax?

Softmax is a function that converts raw scores into probabilities. Example: Let’s say we have three scores:

Raw Scores: [0.56, 0.72, 0.11]

Softmax converts them into values between 0 and 1:

Softmax Output: [0.30, 0.50, 0.20]

Why do we use Softmax?

So the values sum to 1, making them easy to interpret as “importance levels.”

Coming back to our actual values now,

Softmax normalizes these values so that they sum to 1 per row.

\[ \text{Softmax}(0.2, 0.4) = \frac{e^{0.2}}{e^{0.2} + e^{0.4}}, \frac{e^{0.4}}{e^{0.2} + e^{0.4}} \]

Approximating exponentials:

\[ e^{0.2} \approx 1.221, \quad e^{0.4} \approx 1.491 \]\[ \sum = 1.221 + 1.491 = 2.712 \]\[ \text{Softmax} = \left[ \frac{1.221}{2.712}, \frac{1.491}{2.712} \right] = [0.45, 0.55] \]

For the second row:

\[ e^{0.28} \approx 1.323, \quad e^{0.56} \approx 1.751 \]\[ \sum = 1.323 + 1.751 = 3.074 \]\[ \text{Softmax} = \left[ \frac{1.323}{3.074}, \frac{1.751}{3.074} \right] = [0.43, 0.57] \]

So, the normalized attention weights (Softmax scores) are:

\[ \begin{bmatrix} 0.45 & 0.55 \\ 0.43 & 0.57 \end{bmatrix} \]

4. Multiply Attention Weights by Value (V) and Sum Up

Each word’s final value is computed as:

Now, we multiply the softmax scores by the Value (V) matrix.

Final Word Representation = Attention Score × Value (V)
Word V
Cat 0.6
Sat 0.9

For Cat (first row):

\[ (0.45 \times 0.6) + (0.55 \times 0.9) = 0.27 + 0.495 = 0.765 \]

For Sat (second row):

\[ (0.43 \times 0.6) + (0.57 \times 0.9) = 0.258 + 0.513 = 0.771 \]

This process refines each word’s meaning based on context.

5. Final Output: Context Vector

The final contextualized representations are:

\[ \begin{bmatrix} 0.765 \\ 0.771 \end{bmatrix} \]

What This Means

  • Each word’s new representation now depends on its relationship with others, weighted by attention!
  • Would you like me to extend this to multi-dimensional Q, K, and V?

The final output (context vector) represents the new embeddings for each word after applying the self-attention mechanism. These new values are no longer just the original word embeddings; instead, they now encode contextual information from the surrounding words based on how much attention each word gives to others.

  1. Contextualized Representation

    • Originally, each word had its own static vector (e.g., “Cat” = [0.5, 0.6, 0.7]).
    • After self-attention, each word’s new representation incorporates weighted contributions from other words based on attention scores.
  2. Information Flow

    • The new values (0.765, 0.771) indicate that “Cat” and “Sat” now carry some information from each other, influenced by the attention weights.
    • Words that are more relevant to each other have stronger influences.
  3. Why Is This Important?

    • Before self-attention, “Cat” was just “Cat,” and “Sat” was just “Sat.”
    • Now, “Cat” understands that “Sat” is nearby and incorporates some of its meaning.
    • This is how transformers capture context and relationships in sentences!

6. Sample Code

from transformers import BertModel, BertTokenizer

# Load pre-trained BERT model with attention outputs enabled
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", output_attentions=True)

# Tokenize and convert input into tensors
inputs = tokenizer("Hello, Generative AI!", return_tensors="pt")

# Forward pass to get outputs including attention weights
outputs = model(**inputs)

# Extract attention layers
attentions = outputs.attentions

# Print number of attention layers
print("Number of Attention Layers:", len(attentions))

Explanation of the Code

  • BertModel.from_pretrained("bert-base-uncased", output_attentions=True): Loads the BERT model with the option to return attention weights.
  • tokenizer("Hello, Generative AI!", return_tensors="pt"): Converts input text into a format the model understands (PyTorch tensors).
  • model(**inputs): Passes the tokenized inputs through the BERT model.
  • outputs.attentions: Extracts attention weights from different transformer layers.

7. Python Example: Simple Self-Attention Implementation from Scratch

Now, let’s implement self-attention from scratch in Python.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
    
    def forward(self, x):
        Q = self.query(x)   # Convert to Query
        K = self.key(x)     # Convert to Key
        V = self.value(x)   # Convert to Value

        # Compute Attention Scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention = F.softmax(scores, dim=-1)  # Apply Softmax

        # Multiply by values
        out = torch.matmul(attention, V)
        return out

# Example usage
embed_size = 8
seq_length = 5
x = torch.rand((1, seq_length, embed_size))

self_attn = SelfAttention(embed_size)
output = self_attn(x)

print("Output Shape:", output.shape)  # Expected: (1, seq_length, embed_size)

Multi-Head Attention (Why Do We Need Multiple Attention Layers?)

Instead of one attention mechanism, Transformers use multiple (“heads”) to capture different relationships.

  • Why? A single attention mechanism may miss important details.
  • Multi-head attention ensures the model sees information from different perspectives.

2.1.4: Transformers

Transformer Architecture

The Transformer Model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:

Key Components of the Transformer

  • Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word. For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can focus on the context to determine if “bank” refers to a financial institution or the side of a river.
  • Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words.
  • Positional Encoding: Since transformers don’t inherently understand the order of words (like sequential models), positional encoding is added to provide information about the position of words in a sentence.
  • Encoder-Decoder Architecture:
    • Encoder: Processes the input data (e.g., a sentence).
    • Decoder: Generates the output data (e.g., a translation of the sentence).

Both the Transformer Encoder and Decoder consist of:

  • Feed-Forward Layers: Further processes information after attention.
  • Layer Normalization: Normalizes the output of each sub-layer before passing it to the next layer, stabilizing training and preventing exploding gradients.

Feed-Forward Layers (FFNN)

What is a Feed-Forward Layer?

A Feed-Forward Layer, also known as a Feed-Forward Neural Network (FFNN), is a fully connected neural network layer that takes the output from the attention mechanism and further processes it.

How does a Feed-Forward Layer work?

  • Input: The output from the attention mechanism is fed into the FFNN.
  • Linear Transformation: The input is transformed using a linear layer (e.g., a dense layer) with a learnable weight matrix and bias term.
  • Activation Function: The output from the linear transformation is passed through an activation function, such as ReLU (Rectified Linear Unit) or GeLU (Gaussian Error Linear Unit).
  • Output: The output from the activation function is the final output of the FFNN.

Purpose of Feed-Forward Layers

The FFNN serves two purposes:

  • Feature Transformation: It transforms the output from the attention mechanism into a higher-level representation that’s more suitable for the task at hand.
  • Non-Linearity Introduction: The activation function introduces non-linearity into the model, allowing it to learn more complex relationships between the input and output. For example,
    • ReLU (Rectified Linear Unit), Maps all negative values to 0 and all positive values to the same value. Another one,
    • Sigmoid, which maps the input to a value between 0 and 1.
    • GELU (Gaussian Error Linear Unit): combines the benefits of ReLU and sigmoid.

Layer Normalization

What is Layer Normalization?

Layer Normalization is a technique used to normalize the output of each sub-layer (e.g., the FFNN) before passing it to the next layer.

How does Layer Normalization work?

  • Compute Mean and Variance: The mean and variance of the output from the sub-layer are computed.
  • Normalize: The output is normalized by subtracting the mean and dividing by the standard deviation (square root of variance).
  • Scale and Shift: The normalized output is then scaled and shifted using learnable parameters (gamma and beta).

Purpose of Layer Normalization

Layer Normalization serves several purposes:

  • Stabilizes Training: Normalization helps stabilize the training process by reducing the effects of exploding gradients.
  • Improves Generalization: Normalization can improve the model’s ability to generalize to new, unseen data.
  • Reduces Dependence on Initialization: Normalization reduces the dependence on the initialization of the model’s weights.

By using Layer Normalization, the Transformer model can better handle the complex interactions between the attention mechanism and the FFNN, leading to improved performance and stability.


Examples

FNN Example

Suppose we have a simple FFN with one input layer, one hidden layer, and one output layer. The input layer has 2 neurons, the hidden layer has 3 neurons, and the output layer has 1 neuron.

Here’s the FFN architecture: Input Layer (2 neurons) → Hidden Layer (3 neurons) → Output Layer (1 neuron)

Let’s say we have an input vector x = [1, 2]. The FFN processes this input as follows:

  • Input Layer: The input vector x is passed through the input layer.
  • Hidden Layer: The output from the input layer is passed through a linear transformation (e.g., a dense layer) with weights W1 and biases b1, followed by an activation function (e.g., ReLU). Let’s call the output from the hidden layer h.
  • Output Layer: The output from the hidden layer h is passed through another linear transformation with weights W2 and biases b2. The final output is y.

Here’s some sample PyTorch code to illustrate this:

import torch
import torch.nn as nn

# Define the FFN model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)
FFNN Example: Image Classification

Suppose we want to build a simple image classification model that can distinguish between images of cats and dogs.

Here’s how an FFNN can be used:

  • Input Layer: The input layer takes in the image data, which is typically represented as a matrix of pixel values.
  • Hidden Layer: The hidden layer applies a series of transformations to the input data, using weights and biases learned during training.
  • Output Layer: The output layer generates a probability distribution over the two classes (cat or dog).

For example, if we input an image of a cat, the FFNN might output: [0.8, 0.2]

This indicates that the model is 80% confident that the image is a cat and 20% confident that it’s a dog.

Layer Normalization Example

Now, let’s add Layer Normalization to the FFN model. We’ll apply Layer Normalization to the output of the hidden layer.

Here’s the modified PyTorch code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the FFN model with Layer Normalization
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.layer_norm = nn.LayerNorm(3)  # Layer Normalization for hidden layer
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        h = self.layer_norm(h)  # apply Layer Normalization
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

In this example, we’ve added a LayerNorm module to the FFN model, which applies Layer Normalization to the output of the hidden layer. The LayerNorm module normalizes the input data by subtracting the mean and dividing by the standard deviation, which helps to stabilize the training process and improve the model’s performance.

Layer Normalization Example: Language Translation

Suppose we want to build a machine translation system that translates English sentences to Hindi. We have a dataset of paired English and Hindi sentences.

Our sequence-to-sequence model consists of three main components:

  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state. The encoder uses a GRU (Gated Recurrent Unit) layer to process the input sequence.
  • Layer Normalization: The layer normalization component normalizes the output from the encoder. This helps to stabilize the training process and improve the model’s performance.
  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence. The decoder also uses a GRU layer to process the input sequence.

Layer Normalization helps to:

  • Stabilize the training process by reducing the effects of exploding gradients.
  • Improve the model’s ability to generalize to new, unseen data.

Example Walkthrough

Let’s walk through an example of how this model works:

  • English Sentence: We start with an English sentence, “Hello, how are you?”
  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state.

English Word | Encoder Output
Hello        | [0.1, 0.2, 0.3]
how          | [0.4, 0.5, 0.6]
are          | [0.7, 0.8, 0.9]
you          | [1.0, 1.1, 1.2]

Layer Normalization: The layer normalization component normalizes the output from the encoder.

English Word | Normalized Output
Hello        | [-0.5, 0.0, 0.5]
how          | [-0.3, 0.2, 0.7]
are          | [-0.1, 0.4, 0.9]
you          | [0.1, 0.6, 1.1]

Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence.

Hindi Word | Decoder Output
नमस्ते | [0.8, 0.9, 1.0]
कैसे | [0.6, 0.7, 0.8]
हो | [0.4, 0.5, 0.6]
? | [0.2, 0.3, 0.4]

Final Output: The final output from the decoder is the translated Hindi sentence, “नमस्ते, कैसे हो?”.

# Import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Define the encoder model
class Encoder(nn.Module):
    def __init__(self):
        # Initialize the encoder model
        super(Encoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded)
        # Return the output and hidden state
        return output, hidden

# Define the layer normalization model
class LayerNormalization(nn.Module):
    def __init__(self):
        # Initialize the layer normalization model
        super(LayerNormalization, self).__init__()
        # Define the layer normalization layer with 256 dimensions
        self.layer_norm = nn.LayerNorm(normalized_shape=256)

    def forward(self, x):
        # Normalize the input sequence
        return self.layer_norm(x)

# Define the decoder model
class Decoder(nn.Module):
    def __init__(self):
        # Initialize the decoder model
        super(Decoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
        # Define the fully connected layer with 256 input dimensions and 10000 output dimensions
        self.fc = nn.Linear(256, 10000)

    def forward(self, x, hidden):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded, hidden)
        # Pass the output through the fully connected layer
        output = self.fc(output[:, -1, :])
        # Return the output and hidden state
        return output, hidden

# Define the sequence-to-sequence model
class Seq2Seq(nn.Module):
    def __init__(self):
        # Initialize the sequence-to-sequence model
        super(Seq2Seq, self).__init__()
        # Define the encoder model
        self.encoder = Encoder()
        # Define the layer normalization model
        self.layer_norm = LayerNormalization()
        # Define the decoder model
        self.decoder = Decoder()

    def forward(self, x):
        # Pass the input sequence through the encoder
        encoder_output, hidden = self.encoder(x)
        # Pass the encoder output through the layer normalization
        normalized_output = self.layer_norm(encoder_output)
        # Pass the first input token and the encoder's hidden state through the decoder
        # (this simplified model does not consume the normalized encoder output directly;
        # a full seq2seq model would feed it to an attention mechanism)
        decoder_output, _ = self.decoder(x[:, 0:1], hidden)
        # Return the decoder output
        return decoder_output

# Initialize the sequence-to-sequence model
model = Seq2Seq()

# Define a small demo vocabulary (a real system would use a much larger one)
english_vocabulary = {'Hello': 0, 'how': 1, 'are': 2, 'you': 3, '<EOS>': 4}
hindi_vocabulary = {'नमस्ते': 0, 'कैसे': 1, 'हो': 2, '?': 3, '<EOS>': 4}

# Define a sample English input sequence: "Hello, how are you?"
english_input = torch.tensor([[english_vocabulary['Hello'], english_vocabulary['how'], english_vocabulary['are'], english_vocabulary['you']]])

# Define the corresponding Hindi target sequence: "नमस्ते, कैसे हो?"
hindi_output = torch.tensor([[hindi_vocabulary['नमस्ते'], hindi_vocabulary['कैसे'], hindi_vocabulary['हो'], hindi_vocabulary['?']]])

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(100):
    # Zero the gradients
    optimizer.zero_grad()
    # Pass the input sequence through the model
    output = model(english_input)
    # Calculate the loss (this simplified example only trains on the first Hindi token)
    loss = criterion(output, hindi_output[:, 0])
    # Backpropagate the loss
    loss.backward()
    # Update the model parameters
    optimizer.step()
    # Print the loss
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

# Use the trained model for translation
translated_output = model(english_input)
translated_output_idx = torch.argmax(translated_output, dim=1).item()
# Map the predicted token ID back to a word (guard against IDs outside the tiny demo vocabulary)
if translated_output_idx < len(hindi_vocabulary):
    predicted_word = list(hindi_vocabulary.keys())[translated_output_idx]
else:
    predicted_word = f"<token {translated_output_idx}>"
print(predicted_word)

2.1.5: Recap

Summary: How Generative AI Processes Text

When you enter a prompt like:
➡️ "Explain black holes"

A Generative AI model follows these steps:

Step 1: Tokenization

Breaks text into smaller parts (tokens).
Example:
"Explain black holes"["explain", "black", "holes"]

Step 2: Embeddings
Each token is converted into a numerical vector for processing.

Step 3: Transformer Model Processing

  • Self-attention determines which words matter the most.
  • Multiple layers refine understanding.

Step 4: Text Generation

  • Predicts the most likely next token at each step.
  • Constructs output based on learned patterns.

2.2: Controlling GenAI Model Output

Temperature

  • Purpose: Controls the randomness of the predictions. It’s a hyperparameter used to scale the logits (the raw scores) before they are converted into probabilities and sampled.
  • How it works: The model computes probabilities for each token, and the temperature parameter adjusts these probabilities.
    • Low temperature (<1.0): Makes the model more deterministic by amplifying the difference between high-probability tokens and low-probability tokens. This makes the model more likely to choose the most probable token.
    • High temperature (>1.0): Makes the model more random by flattening the probabilities. This results in more diverse, creative, and sometimes less coherent text.

Example

  • Temperature = 0.7: The model will likely choose the more predictable or likely tokens.
  • Temperature = 1.5: The model will take more risks, leading to more unexpected, diverse outputs.
# Example of lower temperature (more deterministic)
# do_sample=True enables sampling so that temperature actually takes effect
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=0.7)

# Example of higher temperature (more creative/random)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=1.5)
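
To make the scaling concrete, here is a minimal sketch with made-up logits showing how dividing by the temperature before the softmax sharpens or flattens the distribution:

import torch

# Made-up logits (raw scores) for four candidate tokens
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

for temperature in [0.7, 1.0, 1.5]:
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])

# Lower temperature -> the highest-scoring token dominates (more deterministic)
# Higher temperature -> the distribution flattens out (more random/creative)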

Top-k Sampling

  • Purpose: Limits the number of tokens to sample from, making the generation process more efficient and sometimes more coherent.
  • How it works: Instead of considering all possible tokens (the entire vocabulary), top-k sampling restricts the set of possible next tokens to the top-k most likely tokens based on their probability scores.
    • k = 1: This would make the model behave deterministically, always picking the most probable token.
    • k = 50: The model will sample from the top 50 tokens with the highest probabilities.

Example

  • Top-k = 10: The model will only consider the 10 tokens with the highest probabilities when selecting the next word.
  • Top-k = 100: The model will consider the top 100 tokens, giving it more variety.
# Example with top-k sampling (restricted to top 50 tokens); do_sample=True enables sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_k=50)
  • Effect of Top-k: By limiting the token options to the top-k, the model’s output tends to be more controlled and less random than pure sampling from all tokens.
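
Under the hood, top-k keeps only the k highest-scoring tokens and renormalizes before sampling. A minimal sketch with made-up logits:

import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])   # made-up scores for 5 candidate tokens
k = 2

# Keep only the k highest logits, mask the rest with -inf, then renormalize with softmax
topk_values, topk_indices = torch.topk(logits, k)
filtered = torch.full_like(logits, float('-inf'))
filtered[topk_indices] = topk_values

probs = torch.softmax(filtered, dim=-1)
print(probs)                                          # only the top-2 tokens have non-zero probability
next_token = torch.multinomial(probs, num_samples=1)  # sample from the restricted pool
print(next_token)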

Top-p (Nucleus Sampling)

  • Purpose: Similar to top-k, but instead of limiting to a fixed number of tokens, top-p limits the tokens considered based on their cumulative probability.
  • How it works: The model keeps sampling from the smallest set of tokens whose cumulative probability exceeds a threshold p (where p is between 0 and 1). This dynamic method is often referred to as nucleus sampling.
    • p = 0.9: The model will consider the smallest set of tokens whose cumulative probability is at least 90%. This results in considering a variable number of tokens based on how steep the probability distribution is.
    • p = 1.0: This would be equivalent to top-k sampling with k = all tokens, allowing the model to sample from all tokens.

Example

  • Top-p = 0.9: The model considers the smallest set of tokens whose combined probability is at least 90%. This prevents very unlikely tokens from being considered while still allowing more diversity.
  • Top-p = 0.95: The model will sample from a slightly larger set of tokens.
# Example with top-p (nucleus) sampling; do_sample=True enables sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_p=0.9)
  • Effect of Top-p: Nucleus sampling tends to generate more coherent and diverse text than top-k sampling, as the model is free to choose tokens from a set that dynamically adjusts based on their probabilities.

Temperature, Top-k, and Top-p Combined

You can combine these parameters to fine-tune the model’s output. For example:

outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    do_sample=True,   # enable sampling so temperature/top_k/top_p take effect
    temperature=0.8,
    top_k=50,
    top_p=0.9
)

This will give you:

  • A lower temperature (0.8), making the generation more predictable.
  • Top-k sampling with the top 50 tokens.
  • Top-p sampling that only includes tokens whose cumulative probability is at least 90%.

By tuning these parameters, you can experiment with how controlled or creative the generated text is.


Summary of Differences

  • Temperature: Adjusts the randomness of the sampling. Higher temperature means more diverse output; lower means more predictable.
  • Top-k Sampling: Limits the number of candidate tokens to the top-k most likely tokens.
  • Top-p (Nucleus) Sampling: Limits the candidate tokens to those whose cumulative probability is at least p (a probability threshold), providing more flexible diversity control.

Details

Confused? Let’s break down top-k and top-p with simpler examples.


Top-k Sampling (Simplified)

Imagine the model is choosing the next word from a list of 5 possible words, each with a probability:

Word         | Probability
“apple”      | 0.5
“banana”     | 0.3
“cherry”     | 0.1
“date”       | 0.05
“elderberry” | 0.05

Top-k = 2:

With top-k=2, the model will only consider the top 2 most probable words. So it will only consider “apple” and “banana”. The model ignores the words “cherry”, “date”, and “elderberry” because they are less likely.

If the model needs to choose the next word, it will only sample from these 2 words: “apple” and “banana”. This makes the sampling process more controlled and focused.

Top-k = 3:

If top-k=3, it will consider “apple”, “banana”, and “cherry”. This is a little more diverse but still limited to the top 3.


Top-p (Nucleus Sampling) (Simplified)

Now, let’s look at top-p (nucleus sampling), which works a bit differently.

Let’s use the same words and probabilities:

Word         | Probability
“apple”      | 0.5
“banana”     | 0.3
“cherry”     | 0.1
“date”       | 0.05
“elderberry” | 0.05

Top-p = 0.8:

With top-p=0.8, the model will add up the probabilities from the most likely words until the total probability is greater than or equal to 0.8.

  • “apple” = 0.5
  • “banana” = 0.3
  • Total = 0.8

At this point, the model has already reached 0.8 probability. So it will stop and consider only “apple” and “banana”.

This is different from top-k because it doesn’t limit to a fixed number of tokens. It dynamically chooses the most likely words until the total probability reaches the given threshold (in this case, 0.8).

Top-p = 0.9:

If we set top-p=0.9, the model will keep adding tokens until the cumulative probability is 0.9.

  • “apple” = 0.5
  • “banana” = 0.3
  • “cherry” = 0.1
  • Total = 0.9

Now, the model will consider “apple”, “banana”, and “cherry”.


Key Difference between Top-k and Top-p

  • Top-k restricts you to a fixed number of the most likely tokens.
    • Example: top-k=2 would only allow the model to choose from the top 2 words.
  • Top-p (Nucleus sampling) restricts you to the smallest set of tokens whose cumulative probability is greater than or equal to p.
    • Example: top-p=0.8 means the model will sample from the tokens that, together, have at least 80% probability.

Summary

  • Top-k: Always limits to a fixed number of tokens (e.g., top 3, top 5).
  • Top-p: Dynamically limits to the smallest set of tokens whose cumulative probability is at least p (e.g., 80% or 90%).
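
Using the same made-up word probabilities as the tables above, the sketch below shows how the nucleus is built: sort by probability, accumulate, and stop once the running total reaches p:

import numpy as np

words = ["apple", "banana", "cherry", "date", "elderberry"]
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])

def top_p_pool(words, probs, p, eps=1e-9):
    # Sort tokens by probability (highest first) and accumulate until the total reaches p
    order = np.argsort(probs)[::-1]
    pool, total = [], 0.0
    for i in order:
        pool.append(words[i])
        total += probs[i]
        if total >= p - eps:   # small tolerance for floating-point rounding
            break
    return pool

print(top_p_pool(words, probs, 0.8))   # ['apple', 'banana']
print(top_p_pool(words, probs, 0.9))   # ['apple', 'banana', 'cherry']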

FAQ

1. Let’s work through a scenario where every word is listed with a probability of 0.7 and you’re using top-p sampling with a threshold of 0.8. (Real softmax probabilities always sum to 1, so treat these numbers as purely illustrative of the cumulative-sum mechanics.)

Scenario: Let’s assume the following token probabilities:

Word         | Probability
“apple”      | 0.7
“banana”     | 0.7
“cherry”     | 0.7
“date”       | 0.7
“elderberry” | 0.7

Top-p = 0.8: In top-p sampling, the model keeps adding tokens to the pool until their cumulative probability exceeds or meets the top-p threshold (0.8).

Step-by-step breakdown:

  • “apple” = 0.7
  • “banana” = 0.7 (cumulative probability = 0.7 + 0.7 = 1.4)

At this point, the cumulative probability is 1.4, which exceeds the 0.8 threshold. So the sampling pool will be limited to these two words: “apple” and “banana”.

Since the total probability already exceeds 0.8 after the first two words, the model will include both “apple” and “banana” in the selection pool.

Key Points:

  • Top-p sampling doesn’t strictly limit the number of tokens — it selects tokens whose cumulative probability is at least the threshold (0.8 in this case).
  • If all tokens have the same probability (0.7), then the model will keep adding tokens until the cumulative probability reaches the top-p threshold.
  • In this case, the model will sample from the first two words (“apple” and “banana”), as their cumulative probability (1.4) exceeds the threshold of 0.8.

Final Conclusion: If every word has the same probability of 0.7, and you’re using top-p = 0.8, the model will include all words up to the point where the cumulative probability exceeds 0.8. In this case, it will stop at the second word, and you’ll end up with a pool of two words to choose from.


2. Let’s now look at how top-k sampling works in this case where every word has a probability of 0.7.

Scenario: We have the same token probabilities:

Word         | Probability
“apple”      | 0.7
“banana”     | 0.7
“cherry”     | 0.7
“date”       | 0.7
“elderberry” | 0.7

Top-k = 2: In top-k sampling, the model selects the top-k most probable tokens. The number k is fixed, meaning the model will consider exactly the top k tokens based on their probabilities.

How it works:

  • Regardless of the probabilities, the model will pick the top 2 most probable tokens.
  • In this case, since all the words have the same probability of 0.7, the model will choose the first 2 tokens (based on their order or position in the list).

What Happens Here:

  • Since top-k=2, the model will always select the first 2 tokens, because every token has the same probability (0.7).
  • The model doesn’t care about the cumulative probability here; it only cares about the number of tokens, which is fixed at 2 in this case.

Key Points:

  • Top-k simply selects the top k most probable words — it doesn’t dynamically sum probabilities like top-p.
  • In the case where all words have the same probability, top-k just picks the first k words in the list.
  • Top-k is not influenced by the cumulative probability — it just selects a fixed number of top tokens.

2.3: Seeing In Action

Simple Hands On: Text Generation with GPT

Let’s write some code to generate text using a pre-trained GPT model. We’ll use the transformers library by Hugging Face, which provides easy access to many pre-trained models.

Step 1: Install the Required Libraries

You’ll need Python installed on your machine along with the following packages:

  • transformers (from Hugging Face)
  • torch (PyTorch backend)
pip install transformers torch

Step 2: Write the Code

from transformers import pipeline

# Load a pre-trained GPT-2 model for text generation
generator = pipeline('text-generation', model='gpt2')

# Generate text
prompt = "The future of AI is"
output = generator(prompt, max_length=50, num_return_sequences=1)

# Print the generated text
print(output[0]['generated_text'])

Explanation:

  • pipeline('text-generation', model='gpt2'): Loads the GPT-2 model for text generation.
  • prompt: The starting text for generation.
  • max_length: The maximum length of the generated text.
  • num_return_sequences: The number of sequences to generate.

Output Example:

The future of AI is bright, with advancements in natural language processing, computer vision, and robotics. As AI continues to evolve, it will transform industries, improve healthcare, and enhance our daily lives.

Experiment with Different Prompts

Try changing the prompt variable to see how the model responds. For example:

  • “In a world where robots rule,”
  • “Once upon a time, there was a”
  • “The secret to happiness is”

Practice Yourself

  1. Run the Code: Execute the text generation code and experiment with different prompts.
  2. Explore Other Models: Replace gpt2 with other models like EleutherAI/gpt-neo-1.3B or gpt-j (if available).

For a list of available models, check the Hugging Face Model Hub.

generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')
  3. Read More: Familiarize yourself with the Hugging Face documentation and explore other tasks like translation, summarization, and question answering.

Additional Resources


2.3.1: DeepDive Hands On

Deep Dive : Text Generation with GPT and Tokenizer

We’ll start with loading a pretrained model (like GPT-2 or BERT) and running a simple text generation task. We’ll use Hugging Face’s transformers library for this.

Step 1: Install the required libraries

You’ll need Python installed on your machine along with the following packages:

  • transformers (from Hugging Face)
  • torch (PyTorch backend)

To install these, run:

pip install transformers torch

Step 2: Load a Pretrained GPT-2 Model and Tokenizer
Here’s a simple example to load the GPT-2 model and tokenizer, then generate text based on a prompt.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can change to 'gpt2-medium', 'gpt2-large', etc., for a larger model
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

# Encode the prompt
prompt = "In the near future, artificial intelligence will"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Step 3: Run the Code and Observe the Output
When you run the script, the GPT-2 model will generate a continuation of the prompt:

In the near future, artificial intelligence will

Let’s break down what happens to the prompt:

  1. Tokenizer Encodes: The tokenizer will first convert this text into token IDs.
  2. Model Generates: The model uses these token IDs to generate the continuation of the sentence.
  3. Tokenizer Decodes: The output token IDs are converted back into a string of text.

For example, the output might look like:

"In the near future, artificial intelligence will be able to predict our every move, revolutionize industries, and improve the quality of life. With advancements in machine learning algorithms and deep learning techniques, AI will be a central part of our daily lives."

Step 4: Experiment with Different Prompts
You can modify the prompt to see how the model responds to different inputs. For example, try:

  • “Once upon a time, in a land far away,”
  • “The economy of the future will be driven by”
  • “The secret to a successful business is”

This will give you a sense of how the GPT-2 model can generate creative and contextually relevant text. Feel free to experiment with different parameters like max_length, num_return_sequences, or other settings to customize the output.


Let’s break down tokenization and the components involved, as well as explain the different parameters used in the code.

Tokenizer in the Code

The tokenizer is responsible for converting human-readable text into tokens that the model can understand. In Hugging Face’s transformers library, the tokenizer is used to:

  • Convert text into token IDs that the model can process.
  • Convert token IDs back into human-readable text (decoding).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can change to 'gpt2-medium', 'gpt2-large', etc., for a larger model
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

GPT2Tokenizer.from_pretrained(model_name): This loads the tokenizer associated with the GPT-2 model. This tokenizer is trained specifically for the model and knows how to convert text into tokens and vice versa.


Key Parameters Used in the Code

a. Encoding the Input Text

inputs = tokenizer(prompt, return_tensors="pt")
  • prompt: This is the initial text input that you want the model to complete or generate further text from.
  • tokenizer(prompt): This will convert the prompt text into token IDs that GPT-2 can understand.
  • return_tensors="pt": This specifies that the output should be in the form of PyTorch tensors. This is required because PyTorch is used for processing the data inside the model. (If you’re using TensorFlow, you’d use return_tensors="tf").

The result inputs will look something like this (the token IDs shown below are illustrative, not the actual GPT-2 encoding of this prompt):

{
    'input_ids': tensor([[50256, 318, 257, 4768, 282, 2740]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
}
  • input_ids: These are the actual token IDs that represent the words in the prompt.
  • attention_mask: This tells the model which tokens to focus on (1 means to focus, 0 means ignore).

b. Generating Text

outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)
  • inputs['input_ids']: The token IDs for the prompt are passed as input to the model.
  • max_length=50: This limits the total number of tokens (words/subwords) in the generated text, including both the input and the output. In this case, the total length is 50 tokens.
  • num_return_sequences=1: This defines how many different sequences of text you want the model to generate. In this case, the model will generate 1 sequence.
  • no_repeat_ngram_size=2: This parameter prevents any sequence of 2 consecutive tokens (a bigram) from appearing more than once in the generated text. For example, once the model has produced the bigram “of the”, it will not generate that exact pair again, which helps avoid repetitive loops.

c. Decoding the Output

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  • outputs[0]: This contains the token IDs for the generated text. The model outputs the predicted token IDs for each step in the sequence.
  • tokenizer.decode(...): This converts the token IDs back into human-readable text.
  • skip_special_tokens=True: This removes special tokens like the end-of-sequence token (typically used in transformers models to indicate the end of the generated text).

For example, the outputs[0] could be a list of token IDs like [50256, 318, 257, 4768, 282], which the decoder will turn into human-readable text.

Conclusion

  • Tokenization is the process of converting text into tokens (IDs) that a model can understand and process.
  • The Tokenizer is a crucial component in transforming text for a model and back into text after processing.
  • Parameters like max_length, num_return_sequences, and no_repeat_ngram_size control the length, number of sequences, and quality of the generated output.

More hands-on examples

Let’s dive into more hands-on examples to reinforce the concepts of tokenization and model generation.

1. Experimenting with Different Prompt Types

Let’s start by experimenting with different types of prompts to see how GPT-2 responds.

Example 1: Story Prompt

prompt = "Once upon a time, in a land far away,"

Example 2: Business Scenario

prompt = "The future of artificial intelligence in business is"

Example 3: Philosophical Question

prompt = "What is the meaning of life?"

For each of these, run the same code and observe the text generated by GPT-2.

2. Play with Generation Parameters

You can adjust parameters to experiment with how the model generates text:

  1. Change max_length:

    • If you increase max_length to 100 or 200, the model will generate a longer continuation of the prompt.

    • Note that we have added min_length to ensure the generated text is at least 100 tokens long.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, no_repeat_ngram_size=2)
  2. Experiment with num_return_sequences:

    • If you set num_return_sequences to 3, the model will generate 3 different continuations of the same prompt.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=3, no_repeat_ngram_size=2)
    Note: if you see a ValueError like the one below, it means greedy decoding only supports a single return sequence; set num_beams (or enable sampling with do_sample=True) to request multiple sequences:

    ValueError: Greedy methods without beam search do not support num_return_sequences different than 1 (got 3). Replace num_return_sequences with num_beams

  3. Experiment with temperature and top_k:

    • temperature controls randomness. A lower temperature (e.g., 0.7) generates more focused, deterministic text, while a higher temperature (e.g., 1.5) generates more creative, diverse output.

    • top_k restricts the sampling to the top-k most likely next tokens, which controls diversity.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, do_sample=True, temperature=0.9, top_k=50)
      • Lower temperature: More predictable output.
      • Higher temperature: More random, creative output.
  4. Experiment with top_p (nucleus sampling):

    • top_p restricts sampling to the smallest set of tokens whose cumulative probability is greater than p (e.g., top_p=0.9 means it will sample from the smallest set of tokens that cumulatively have 90% of the probability mass).
    outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, do_sample=True, top_p=0.9, temperature=0.8)

3. Try Decoding the Output

You’ll see different outputs for each of the above changes. Use the tokenizer’s decode method to see how the generated tokens look.

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

This allows you to experiment and see how changing each parameter affects the model’s output.


Summary of Parameters

Here’s a quick recap of the parameters we explored:

  • max_length: Controls the length of the generated text (in tokens).
  • num_return_sequences: Controls how many different outputs you want to generate.
  • no_repeat_ngram_size: Prevents repetitive sequences (n-grams) from appearing in the generated text.
  • temperature: Controls the randomness of the text generation. Higher = more randomness.
  • top_k: Limits sampling to the top-k tokens by probability. Controls diversity.
  • top_p: Nucleus sampling; limits the set of tokens to a cumulative probability p.

1. Understanding LLMs & Text Generation

How LLMs Generate Text

LLMs don’t “think” like humans. They predict the most probable next word (token) based on previous words.

Step 1: Convert Text to Tokens

Example (Word-based tokenization):

Sentence:  "The cat sat on the mat."
Tokens:    ["The", "cat", "sat", "on", "the", "mat", "."]

Example (Sub-word tokenization, used in LLaMA models):

Sentence:  "Artificial intelligence"
Tokens:    ["Art", "ificial", "intelli", "gence"]

Why sub-word tokenization?

  • Handles new words by breaking them into smaller known parts.
  • Reduces vocabulary size, improving efficiency.
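
Here is a small sketch using the GPT-2 tokenizer from Hugging Face to see sub-word splitting in practice (the exact pieces depend on the tokenizer's learned vocabulary, so treat the splits as illustrative):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Common words usually stay as a single token; rarer words get split into sub-word pieces
for text in ["intelligence", "Artificial intelligence", "electroencephalography"]:
    print(text, "->", tokenizer.tokenize(text))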

Step 2: Assign Probability to Next Token

Example: Predicting the next token for the phrase: "The capital of France is"

Token Probabilities:
"Paris" → 85%
"London" → 5%
"Berlin" → 3%
"Rome" → 2%

📌 The model chooses “Paris” because it has the highest probability.
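
The sketch below shows where these numbers come from in practice: run the prefix through GPT-2, take the logits for the last position, apply softmax, and inspect the most likely next tokens (the exact percentages depend on the model, so the 85%/5%/3%/2% figures above are illustrative):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                         # shape: (batch, sequence_length, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)     # distribution over the next token
top_probs, top_ids = torch.topk(next_token_probs, 5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()])!r}: {prob.item():.1%}")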

Step 3: Decoding Strategies (Choosing the Next Word)

Once we have probabilities, we need a decoding strategy to pick the best next word.


Decoding Strategies

With Simple Examples & Ollama Implementation

1. Greedy Search (Always Pick the Highest Probability Token)

How it works: Greedy search selects the token with the highest probability at each step, i.e., it always chooses the single most likely next word.

Example:

Input: "The cat sat on the"
Greedy Output: "mat mat mat mat..."

Pros:

  • Computationally efficient
  • Simple and fast to implement

Cons:

  • May not produce optimal results
  • Can get stuck in local maxima

Use cases: Greedy search is suitable for applications where computational efficiency is crucial, such as real-time language translation or text summarization.

import ollama

response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}])
print(response['message']['content'])  # Greedy search (default in Ollama)
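
As a toy illustration of the mechanics (and of why greedy decoding can loop, as in the “mat mat mat…” example above), here is a tiny sketch with a hand-written next-word table; the words and probabilities are made up:

# Toy next-word probability table (completely made up for illustration)
next_word_probs = {
    "the": {"mat": 0.6, "sofa": 0.3, "floor": 0.1},
    "mat": {"the": 0.5, "because": 0.4, ".": 0.1},
}

def greedy_next(word):
    # Always pick the single most probable continuation
    candidates = next_word_probs[word]
    return max(candidates, key=candidates.get)

sequence = ["the"]
for _ in range(6):
    sequence.append(greedy_next(sequence[-1]))
print(" ".join(sequence))   # "the mat the mat the mat the" -- greedy happily loops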

2. Beam Search (Looks at Multiple Possibilities)

How it works: Beam search maintains a set of top-scoring candidate sequences (beams). At each step, it expands each beam by adding possible next tokens and selects the top-scoring beams to continue.

Example:

Input: "The cat sat on the"
Beam Search Output: "mat because it was comfortable."

Pros:

  • More accurate than greedy search
  • Keeps several candidate sequences (possibly of different lengths) in play, which reduces repetition

Cons:

  • Computationally expensive
  • Requires careful tuning of beam size and other hyperparameters

Use cases: Beam search is suitable for applications where high accuracy is crucial, such as machine translation, speech recognition, or text generation.

# Note: num_beams is not a documented Ollama option and may simply be ignored; it is
# shown here only to indicate where a beam-search setting would go in other libraries.
response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}], options={"num_beams": 3})
print(response['message']['content'])  # Beam Search

3. Top-k Sampling (Adds Randomness)

How it works: Top-K sampling selects the top-K tokens with the highest probabilities and samples from this subset to generate the next token. Instead of always picking the highest probability word, Top-k sampling picks randomly from the top K most probable words.

Example (Top-3 Sampling):

Input: "The cat sat on the"
Possible Outputs: "sofa / couch / floor" (randomly picked from top 3)

Pros:

  • Encourages diversity in generated text i.e. Introduces randomness, making text more creative
  • Can handle rare or unseen tokens

Cons:

  • May not produce optimal results
  • Requires careful tuning of K as can sometimes generate weird outputs

Use cases: Top-K sampling is suitable for applications where diversity and creativity are important, such as language generation, chatbots, or creative writing.

response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}], options={"top_k": 50})
print(response['message']['content'])  # Top-k Sampling

4. Top-p (Nucleus) Sampling (More Dynamic)

How it works: Top-p (nucleus) sampling selects the smallest set of top tokens whose combined probability mass reaches p and samples from this subset to generate the next token. Instead of a fixed K, top-p dynamically grows the candidate pool until the cumulative probability reaches p.

Example (Top-p = 0.9):

Input: "The cat sat on the"
Top-p Output: "mat because it was sunny outside."

Pros:

  • Adapts the size of the candidate pool to the shape of the probability distribution (unlike a fixed K)
  • Encourages diversity and creativity, producing more natural-sounding text

Cons:

  • May not produce optimal results
  • Requires careful tuning of p

Use cases: Top-p sampling is suitable for applications where diversity and natural-sounding output are important, such as language generation, chatbots, or creative writing.

response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}], options={"top_p": 0.9})
print(response['message']['content'])  # Top-p Sampling

Real-World Example: AI Chat Assistant

Using Ollama LLaMA 8B

Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b

Step 2: Implement Chatbot

import ollama

def chatbot(prompt):
    response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": prompt}])
    return response['message']['content']

# Example conversation
user_input = "Tell me a fun fact about space!"
print(chatbot(user_input))

Understanding Softmax: How Probabilities are Computed

We have discussed this in previous topics in detail, but to summarize: Softmax is a mathematical function that converts raw scores (logits) into probabilities that sum to 1.

Simple Example:

Let’s say the model predicts these scores:

Raw Scores: [3.2, 2.1, 5.8, 1.4]

Softmax converts them into probabilities:

Probabilities: [≈6.7%, ≈2.2%, ≈90%, ≈1.1%]

Softmax Formula

\[ P(y_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]

Where:

  • \( x_i \) is the raw score for token \( i \)
  • \( e^{x_i} \) exponentiates the score, making every term positive; dividing by the sum normalizes the results so they add up to 1

Python Code

import numpy as np

def softmax(logits):
    exp_values = np.exp(logits - np.max(logits))  # Stability trick
    return exp_values / np.sum(exp_values)

logits = [3.2, 2.1, 5.8, 1.4]
probabilities = softmax(logits)
print(probabilities)  # Output: approximately [0.067, 0.022, 0.900, 0.011]

Best Decoding Strategy for Different Use Cases

Use Case         | Best Decoding Strategy
Factual Answers  | Greedy or Beam Search
Creative Writing | Top-k or Top-p Sampling
Chatbots         | Top-p Sampling (for variety)
Summarization    | Beam Search

1. Embeddings & Vector Representation

What Are Word Embeddings?

Word embeddings are numerical representations of words in a continuous vector space. These embeddings capture the meaning, relationships, and context of words based on how they appear in text data.

Why Do LLMs Need Word Embeddings?

LLMs like GPT, BERT, and LLaMA work with numbers, not raw text. Embeddings convert words into numerical format so they can be processed by neural networks.

  • Without embeddings: The model treats words like independent tokens (e.g., “king” and “queen” would be unrelated).
  • With embeddings: The model understands relationships (e.g., “king” and “queen” are semantically close).

Key Idea: Words with similar meanings will have similar vector representations in the embedding space.

Different Types of Word Embeddings

  • Word2Vec
    • How it works: Continuous Bag of Words (CBOW) predicts a target word from its context words; Skip-Gram predicts context words from a target word.
    • Pros: Captures semantic meaning well.
    • Cons: Cannot handle new words (fixed vocabulary).
  • GloVe
    • How it works: Uses word co-occurrence statistics; represents words as vectors by factorizing a word-context co-occurrence matrix.
    • Pros: Captures global meaning better.
    • Cons: Static embeddings (no context awareness).
  • BERT
    • How it works: Contextual embeddings (a word gets a different vector in different sentences); uses a multi-layer bidirectional transformer encoder.
    • Pros: Solves polysemy (e.g., “bank” as a river bank vs. a financial bank).
    • Cons: Heavier computation.
  • OpenAI Embeddings
    • How it works: Transformer-based, optimized for retrieval & search.
    • Pros: Well suited for LLM-based applications.
    • Cons: Requires API calls (not open-source).
  • ELMo
    • How it works: Contextualized word embeddings from a bidirectional LSTM (Long Short-Term Memory).
    • Pros: Captures nuances of language.
    • Cons: Computationally expensive.
  • RoBERTa
    • How it works: Optimized BERT approach with dynamic masking and larger batch sizes; produces context-dependent embeddings.
    • Pros: Improved performance over BERT.
    • Cons: Requires significant computational resources.
  • FastText
    • How it works: Extension of Word2Vec that uses subword information; represents words as a bag of character n-grams.
    • Pros: Handles out-of-vocabulary words.
    • Cons: Computationally expensive; may not capture semantic meaning as well.
  • Sentence-BERT
    • How it works: Siamese network (two identical sub-networks) approach for sentence embeddings, usable for semantic search and clustering.
    • Pros: Effective for sentence-level tasks.
    • Cons: May not capture nuances of individual words.

Hands-on: Generating & Visualizing Word Embeddings

We’ll generate embeddings using:

  1. OpenAI’s API (for latest transformer-based embeddings).
  2. SentenceTransformers (open-source alternative).
  3. t-SNE visualization (to plot embeddings in 2D).

Example 1: Generating Embeddings Using OpenAI API

Best for real-world applications like search, retrieval, and semantic similarity

Step 1: Install OpenAI SDK

pip install openai

Step 2: Generate Embeddings

from openai import OpenAI

# OpenAI API Key (Replace with your actual key)
client = OpenAI(api_key="your_api_key")


def get_embedding(text, model="text-embedding-ada-002"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding


# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]
# Generate embeddings
embeddings = [get_embedding(text) for text in texts]
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

📝 Notes:

  • OpenAI’s embedding model (text-embedding-ada-002) is a widely used general-purpose embedding model and works well for LLM applications.
  • The embeddings can be used for semantic search, clustering, and recommendation systems.

Example 2: Using SentenceTransformers (Open-Source Alternative)

Best for local and offline applications

Step 1: Install SentenceTransformers

pip install sentence-transformers  tf-keras

Step 2: Generate Embeddings

from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

📝 Notes:

  • all-MiniLM-L6-v2 is a lightweight model optimized for speed and accuracy.
  • The embeddings are useful for classification, search, and NLP tasks.
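
A quick way to check that similar words really get similar vectors is to compare cosine similarities; the sketch below reuses the same model (exact scores will vary, but the related pair should score noticeably higher):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["king", "queen", "banana"], convert_to_tensor=True)

print("king vs queen :", util.cos_sim(embeddings[0], embeddings[1]).item())
print("king vs banana:", util.cos_sim(embeddings[0], embeddings[2]).item())
# Expect the king/queen score to be noticeably higher than king/banana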

Example 3: Visualizing Embeddings with t-SNE

Helps understand how words are related in the embedding space.

Step 1: Install Matplotlib & SciKit-Learn

pip install matplotlib scikit-learn

Step 2: Visualize Embeddings

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)  # project the high-dimensional embeddings down to 2D

# Plot embeddings
plt.figure(figsize=(6, 6))
for i, text in enumerate(texts):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, text, fontsize=12)

plt.title("t-SNE Visualization of Word Embeddings")
plt.show()

Output:

[Figure: tSNE-Visual-Embedding.png, a 2-D t-SNE scatter plot of the word embeddings]

📝 What You’ll See:

  • Similar words (e.g., “king” and “queen”) should appear close to each other.
  • Dissimilar words (e.g., “king” and “banana”) should be far apart.

In the example above, we used three t-SNE parameters: perplexity, random_state, and n_components. Here’s a more detailed explanation of each, with examples:

Perplexity

In t-SNE, perplexity roughly controls the effective number of nearest neighbours each point takes into account (it must also be smaller than the number of samples).

  • High Perplexity: Each point considers many neighbours, emphasizing the global structure of the data.
  • Low Perplexity: Each point considers only a few neighbours, emphasizing local structure (tight clusters).

Example:

from sklearn.manifold import TSNE

# High perplexity (each point considers many neighbours; requires more than 50 samples)
tsne_high_perplexity = TSNE(n_components=2, perplexity=50)
# Low perplexity (each point considers few neighbours; emphasizes local clusters)
tsne_low_perplexity = TSNE(n_components=2, perplexity=5)

Random State

Random state ensures reproducibility of results by setting the seed for random number generation.

  • Fixed Random State: Produces the same results every time the code is run.
  • Random Random State: Produces different results every time the code is run.

Example:

from sklearn.manifold import TSNE

# Fixed random state (reproducible results)
tsne_fixed_random_state = TSNE(n_components=2, random_state=42)
# Random random state (non-reproducible results)
tsne_random_random_state = TSNE(n_components=2, random_state=None)

n_components

n_components specifies the number of dimensions to reduce the data to.

  • Low n_components: Reduces the data to a few dimensions, losing some information.
  • High n_components: Retains more dimensions, but may not reduce the data effectively.

Example:

from sklearn.manifold import TSNE

# Low n_components (reduce to 2D)
tsne_low_n_components = TSNE(n_components=2)
# Higher n_components (reduce to 3D; the default barnes_hut method supports at most
# 3 output dimensions, so pass method='exact' if you need more)
tsne_high_n_components = TSNE(n_components=3)

Summary so far!

  • Word embeddings convert text into numbers so LLMs can process them.
  • Different methods (Word2Vec, GloVe, BERT, OpenAI) have pros & cons.
  • SentenceTransformers is a free, offline alternative.
  • t-SNE helps visualize relationships between words.

Real-World Use Cases

  • Semantic Search: Find similar documents using embeddings.
  • Chatbots & Q&A Systems: Improve response relevance.
  • Recommendation Systems: Recommend similar items.
  • Text Clustering: Group similar content automatically.
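
As a tiny illustration of the semantic-search use case, the sketch below ranks a few made-up documents against a query by cosine similarity, reusing the all-MiniLM-L6-v2 model from the earlier examples:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to treat high blood pressure with diet and exercise",
    "Top 10 tourist attractions in Paris",
    "Symptoms and causes of type 2 diabetes",
]
query = "ways to lower hypertension"

doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query (highest first)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.3f}  {doc}")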


4.1. Custom Embeddings

Custom Embeddings: Why, When, and How?

Why Would You Need Custom Embeddings?

Pre-trained embeddings (like OpenAI’s text-embedding-ada-002 or SentenceTransformers) work well in most cases. However, custom embeddings are necessary when:

  1. Domain-Specific Knowledge

    • If you’re working with medical, legal, finance, or technical text, general-purpose embeddings may not capture key relationships.
    • Example: “BP” in general NLP models means “British Petroleum,” but in medicine, it means “Blood Pressure.”
  2. Multilingual Support

    • Many embedding models are optimized for English, so custom training is needed for non-English or code-mixed languages (e.g., Hinglish).
  3. Fine-Tuned Retrieval & Search

    • If you’re building a semantic search system, fine-tuning embeddings on your own dataset gives better results than generic embeddings.
  4. Industry-Specific Search & Clustering

    • A legal search engine should rank case laws before blogs.
    • A medical chatbot should understand symptoms better than casual conversations.

Pre-trained vs. Custom Embeddings: Pros & Cons

Feature            | Pre-trained Embeddings (e.g., OpenAI, BERT) | Custom Trained Embeddings
Training Data      | General internet text, books, Wikipedia     | Your own dataset (domain-specific)
Performance        | Good for broad use cases                    | Excellent for domain-specific tasks
New Vocabulary     | Cannot handle completely unseen words       | Learns domain-specific terms
Computational Cost | Free or API-based                           | Requires GPUs & storage
Ease of Use        | Ready to use                                | Requires training & maintenance

How to Create Custom Embeddings?

We’ll explore two main approaches:

  1. Fine-tuning an existing model (Easier, uses SentenceTransformers).
  2. Training from scratch (Harder, needs large datasets).

Approach 1: Fine-Tuning SentenceTransformers on Custom Data

Best for cases where you already have good embeddings but need slight adjustments.

Step 1: Install SentenceTransformers

pip install sentence-transformers tf-keras datasets transformers[torch] sentencepiece

Step 2: Prepare Your Dataset

You need pairs of sentences where one is the query, and the other is a relevant response.

Example Dataset (Medical Search Engine):

[
  {"query": "What are the symptoms of diabetes?", "response": "Common symptoms include frequent urination and fatigue."},
  {"query": "How to lower blood pressure?", "response": "A low-sodium diet and regular exercise help lower blood pressure."}
]

Step 3: Fine-Tune SentenceTransformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import torch

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare training data
train_data = [
    InputExample(texts=["What are the symptoms of diabetes?", "Common symptoms include frequent urination and fatigue."]),
    InputExample(texts=["How to lower blood pressure?", "A low-sodium diet and regular exercise help lower blood pressure."])
]

# Convert to DataLoader with batch size 2
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=2)

# Use contrastive loss for fine-tuning
train_loss = losses.MultipleNegativesRankingLoss(model)

# Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100)

# Save the model
model.save("custom_medical_embeddings")

Step 4: Use Custom Embeddings

custom_model = SentenceTransformer("custom_medical_embeddings")
custom_model_embedding = custom_model.encode("What are the symptoms of diabetes?")
print(custom_model_embedding)

Result: Your model now generates custom embeddings tailored to medical queries.


Approach 2: Training Word Embeddings from Scratch (Word2Vec / FastText)

Best when no suitable pre-trained embeddings exist (e.g., for a new language or industry-specific jargon).

Step 1: Install Gensim

pip install gensim

Step 2: Train Word2Vec on Your Dataset

from gensim.models import Word2Vec


# Sample medical text data
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]


# Train Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)


# Save and load the model
model.save("word2vec_medical.model")
model = Word2Vec.load("word2vec_medical.model")


# Get embedding for a word
print(model.wv["diabetes"])  # Prints embedding for "diabetes"

Result: You now have word embeddings trained on custom medical terminology.
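
You can also query the trained model for related words using gensim’s most_similar; with a corpus of only three toy sentences the neighbours are essentially noise, but on a real domain corpus they reflect meaningful relationships:

from gensim.models import Word2Vec

# Reload the model trained above and look up the nearest neighbours of a word
model = Word2Vec.load("word2vec_medical.model")
print(model.wv.most_similar("blood", topn=3))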


Real-World Applications of Custom Embeddings

  • Medical Chatbots - Fine-tune embeddings to understand medical queries.
  • Legal Document Search - Optimize embeddings for law-related searches.
  • E-commerce Search - Improve product recommendations with domain-specific embeddings.
  • Multilingual NLP - Train embeddings for underrepresented languages.

When to Use Which Approach?

Situation                                                                 | Best Approach
You need fast, ready-to-use embeddings                                    | OpenAI API or SentenceTransformers
You need embeddings fine-tuned for a specific task                        | Fine-tune SentenceTransformers
You’re working with a new domain/language                                 | Train Word2Vec/FastText from scratch
You want contextual embeddings (different meaning in different sentences) | Fine-tune a Transformer model (BERT, GPT, etc.)

Summary

  • Custom embeddings outperform pre-trained ones in domain-specific tasks.
  • Fine-tuning pre-trained embeddings is easier & more efficient than training from scratch.
  • Training from scratch is best for new languages or highly specialized fields.

4.2. Custom Embeddings - Examples