2.1: Key Concepts in GenAI

Key Concepts in Generative AI

  • Large Language Models (LLMs): AI models trained on vast amounts of text data. They use the Transformer architecture, which relies on attention mechanisms to process input data. Examples: GPT (Generative Pre-trained Transformer), BERT, T5.
  • Tokenization: Breaking text down into smaller units (tokens) for processing. Example: the sentence "Hello, world!" might be tokenized into ["Hello", ",", "world", "!"].
  • Embeddings: Representing tokens as numerical vectors in a high-dimensional space. Embeddings capture semantic meaning (e.g., "king" - "man" + "woman" ≈ "queen").
  • Self-Attention/Attention Mechanism: A mechanism that lets the model weigh how relevant each word in the input is to every other word.
  • Transformers: The deep learning architecture used in LLMs and the backbone of most modern generative models. Key components: Encoder, Decoder, and the Attention Mechanism.
  • Pre-training: Training a model on a large dataset (e.g., all of Wikipedia) to learn general language patterns.
  • Fine-tuning: Adapting the pre-trained model to a specific task (e.g., sentiment analysis, a chatbot).
  • Prompt Engineering: Designing effective inputs to guide model responses.

Tokenization

Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. Tokenization helps the model “understand” the text by converting it into a format that can be fed into the neural network.

  • Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].
  • Subword-level tokenization (used in models like GPT) splits words into smaller parts, or subwords. This is more efficient and handles unknown words better, as the model can understand the smaller pieces of a word. Example: "delightful" → ["delight", "ful"].
  • Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].

Hands-On: Tokenization

Tokenization Example

from transformers import AutoTokenizer

# Load the tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize a sentence
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Output:

Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]

Explanation:

  • tokenizer.tokenize(text): Splits the text into tokens.
  • tokenizer.encode(text): Converts tokens into their corresponding numerical IDs.
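
The mapping is reversible: tokenizer.decode converts the IDs back into text. A quick check, continuing the same example:

# Convert the token IDs back into text
decoded = tokenizer.decode(token_ids)
print("Decoded:", decoded)  # Decoded: Hello, how are you?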

Embeddings

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?

  • Words with similar meanings have similar embeddings.
  • Embeddings enable models to generalize and understand context.

Example:

  • The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship:
    king - man + woman ≈ queen
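
To make the arithmetic concrete, here is a minimal sketch using tiny, made-up 3-dimensional vectors (real embeddings have hundreds of dimensions; the numbers below are purely hypothetical and chosen so the analogy works out exactly):

import torch
import torch.nn.functional as F

# Hypothetical 3-dimensional embeddings (illustrative values only)
king  = torch.tensor([0.8, 0.9, 0.1])
man   = torch.tensor([0.7, 0.2, 0.1])
woman = torch.tensor([0.7, 0.2, 0.9])
queen = torch.tensor([0.8, 0.9, 0.9])

# king - man + woman should land close to queen
result = king - man + woman
print("king - man + woman =", result)
print("cosine similarity with queen:", F.cosine_similarity(result, queen, dim=0).item())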

Hands-On: Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract embeddings for the first token
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123,  0.0456, -0.0678,  0.0234,  0.0891])

Explanation:

  • model(**inputs): Passes the token IDs through the model to generate embeddings.
  • outputs.last_hidden_state: Contains the embeddings for each token.
  • Each token is represented as a 768-dimensional vector (for GPT-2).
  • Output: torch.Size([1, 6, 768]) means:
    • Batch size: 1, since we passed a single input sentence.
    • Sequence length: 6, because "Hello, how are you?" is split into 6 tokens.
    • Embedding dimension: 768, since each token is represented by a 768-dimensional vector in GPT-2.

Self-Attention: The Core of Transformers

Imagine you're reading the sentence "The cat sat on the mat." Each word is important, but some words are more closely related to each other than others.

  • “cat” is related to “sat.”
  • “mat” is related to “sat.”
  • “on” is less important.

Self-Attention helps the model decide which words to focus on!

How Self-Attention Works (Step-by-Step)

Self-attention works in the following steps:

1. Convert Words into Vectors (Embeddings)

Computers don’t understand words, so we convert them into numbers (word embeddings).

Example:

Word | Vector Representation (Simplified)
The  | [0.1, 0.2, 0.3]
Cat  | [0.5, 0.6, 0.7]
Sat  | [0.8, 0.9, 1.0]

2. Create Query, Key, and Value (Q, K, V)

Each word is transformed into three vectors:

  • Query (Q) → “What am I looking for?”
  • Key (K) → “What information do I have?”
  • Value (V) → “What should be returned?”
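
In matrix form, each of these three vectors is produced by multiplying the input embeddings X by a learned weight matrix (the toy example below skips these learned projections and uses hand-picked scalar values instead):

\[ Q = XW^Q, \quad K = XW^K, \quad V = XW^V \]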

Example:

Word | Query (Q) | Key (K) | Value (V)
Cat  | 0.5       | 0.4     | 0.6
Sat  | 0.7       | 0.8     | 0.9

The Query (Q) of “sat” will be compared with Keys (K) of all words to see how related they are.
Now, let’s compute the Attention Scores using Q × Kᵀ.

3. Compute Attention Scores (Importance of Words) (Q × Kᵀ)

Now, we compare Query (Q) and Key (K) using the dot product.

  • If Q and K are similar, the word is important.
  • If Q and K are different, the word is less important.
\[ \text{Attention Score} = Q \times K^T \]

Example:

  • “sat” is strongly related to “cat” → High attention score.
  • “sat” is weakly related to “the” → Low attention score.

Word Pair | Calculation
Cat → Cat | 0.5 × 0.4 = 0.2
Cat → Sat | 0.5 × 0.8 = 0.4
Sat → Cat | 0.7 × 0.4 = 0.28
Sat → Sat | 0.7 × 0.8 = 0.56

Thus, our Attention Score matrix is:

\[ \begin{bmatrix} 0.2 & 0.4 \\ 0.28 & 0.56 \end{bmatrix} \]

4. Normalize the Scores Using Softmax

What is Softmax?

Softmax is a function that converts raw scores into probabilities. Example: Let’s say we have three scores:

Raw Scores: [0.56, 0.72, 0.11]

Softmax converts them into values between 0 and 1:

Softmax Output: [0.356, 0.417, 0.227] (approximately)

Why do we use Softmax?

So the values sum to 1, making them easy to interpret as “importance levels.”
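
As a quick check, the raw scores above can be run through softmax directly; a minimal sketch using torch.softmax:

import torch

raw_scores = torch.tensor([0.56, 0.72, 0.11])
probabilities = torch.softmax(raw_scores, dim=0)
print(probabilities)        # approximately [0.356, 0.417, 0.227]
print(probabilities.sum())  # the probabilities sum to 1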

Coming back to our attention scores: Softmax normalizes each row of the score matrix so that it sums to 1. For the first row:

\[ \text{Softmax}(0.2, 0.4) = \left[ \frac{e^{0.2}}{e^{0.2} + e^{0.4}}, \; \frac{e^{0.4}}{e^{0.2} + e^{0.4}} \right] \]

Approximating exponentials:

\[ e^{0.2} \approx 1.221, \quad e^{0.4} \approx 1.491 \]
\[ \sum = 1.221 + 1.491 = 2.712 \]
\[ \text{Softmax} = \left[ \frac{1.221}{2.712}, \frac{1.491}{2.712} \right] = [0.45, 0.55] \]

For the second row:

\[ e^{0.28} \approx 1.323, \quad e^{0.56} \approx 1.751 \]
\[ \sum = 1.323 + 1.751 = 3.074 \]
\[ \text{Softmax} = \left[ \frac{1.323}{3.074}, \frac{1.751}{3.074} \right] = [0.43, 0.57] \]

So, the normalized attention weights (Softmax scores) are:

\[ \begin{bmatrix} 0.45 & 0.55 \\ 0.43 & 0.57 \end{bmatrix} \]

5. Multiply Attention Weights by Value (V) and Sum Up

Now we multiply the normalized attention weights by each word's Value (V) and sum the results to get each word's final representation:

Final Word Representation = Attention Weights × Value (V)

Word | V
Cat  | 0.6
Sat  | 0.9

For Cat (first row):

\[ (0.45 \times 0.6) + (0.55 \times 0.9) = 0.27 + 0.495 = 0.765 \]

For Sat (second row):

\[ (0.43 \times 0.6) + (0.57 \times 0.9) = 0.258 + 0.513 = 0.771 \]

This process refines each word’s meaning based on context.

6. Final Output: Context Vector

The final contextualized representations are:

\[ \begin{bmatrix} 0.765 \\ 0.771 \end{bmatrix} \]

What This Means

  • Each word’s new representation now depends on its relationship with others, weighted by attention!

The final output (context vector) represents the new embeddings for each word after applying the self-attention mechanism. These new values are no longer just the original word embeddings; instead, they now encode contextual information from the surrounding words based on how much attention each word gives to others.

  1. Contextualized Representation

    • Originally, each word had its own static vector (e.g., “Cat” = [0.5, 0.6, 0.7]).
    • After self-attention, each word’s new representation incorporates weighted contributions from other words based on attention scores.
  2. Information Flow

    • The new values (0.765, 0.771) indicate that “Cat” and “Sat” now carry some information from each other, influenced by the attention weights.
    • Words that are more relevant to each other have stronger influences.
  3. Why Is This Important?

    • Before self-attention, “Cat” was just “Cat,” and “Sat” was just “Sat.”
    • Now, “Cat” understands that “Sat” is nearby and incorporates some of its meaning.
    • This is how transformers capture context and relationships in sentences!
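
The hand calculation above can be reproduced in a few lines of PyTorch. This is a minimal sketch that reuses the scalar Q, K, V values from the tables and, to match the worked example, omits the sqrt(d_k) scaling used in real transformers:

import torch

Q = torch.tensor([[0.5], [0.7]])   # queries for "Cat" and "Sat"
K = torch.tensor([[0.4], [0.8]])   # keys for "Cat" and "Sat"
V = torch.tensor([[0.6], [0.9]])   # values for "Cat" and "Sat"

scores = Q @ K.T                          # [[0.20, 0.40], [0.28, 0.56]]
weights = torch.softmax(scores, dim=-1)   # [[0.45, 0.55], [0.43, 0.57]]
context = weights @ V                     # approximately [[0.765], [0.771]]

print(weights)
print(context)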

7. Sample Code

from transformers import BertModel, BertTokenizer

# Load pre-trained BERT model with attention outputs enabled
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize and convert input into tensors
inputs = tokenizer("Hello, Generative AI!", return_tensors="pt")

# Forward pass to get outputs including attention weights
outputs = model(**inputs)

# Extract attention layers
attentions = outputs.attentions

# Print number of attention layers
print("Number of Attention Layers:", len(attentions))

Explanation of the Code

  • BertModel.from_pretrained("bert-base-uncased", output_attentions=True): Loads the BERT model with the option to return attention weights.
  • tokenizer("Hello, Generative AI!", return_tensors="pt"): Converts input text into a format the model understands (PyTorch tensors).
  • model(**inputs): Passes the tokenized inputs through the BERT model.
  • outputs.attentions: Extracts attention weights from different transformer layers.

8. Python Example: Simple Self-Attention Implementation from Scratch

Now, let’s implement self-attention from scratch in Python.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, embed_size):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
    
    def forward(self, x):
        Q = self.query(x)   # Convert to Query
        K = self.key(x)     # Convert to Key
        V = self.value(x)   # Convert to Value

        # Compute Attention Scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.embed_size ** 0.5)
        attention = F.softmax(scores, dim=-1)  # Apply Softmax

        # Multiply by values
        out = torch.matmul(attention, V)
        return out

# Example usage
embed_size = 8
seq_length = 5
x = torch.rand((1, seq_length, embed_size))

self_attn = SelfAttention(embed_size)
output = self_attn(x)

print("Output Shape:", output.shape)  # Expected: (1, seq_length, embed_size)

Transformer Architecture

The Transformer Model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:

Key Components of the Transformer

  • Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word. For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can focus on the context to determine if “bank” refers to a financial institution or the side of a river.
  • Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words.
  • Positional Encoding: Since transformers don’t inherently understand the order of words (like sequential models), positional encoding is added to provide information about the position of words in a sentence.
  • Encoder-Decoder Architecture:
    • Encoder: Processes the input data (e.g., a sentence).
    • Decoder: Generates the output data (e.g., a translation of the sentence).

Both the Transformer Encoder and Decoder consist of:

  • Feed-Forward Layers: Further process the information after attention.
  • Layer Normalization: Normalizes the output of each sub-layer before passing it to the next layer, stabilizing training and preventing exploding gradients.
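
Putting these pieces together, here is a minimal sketch of a single Transformer encoder block built from PyTorch's nn.MultiheadAttention, a feed-forward layer, and layer normalization (a simplified illustration with arbitrary sizes, not a production implementation):

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, embed_size=64, num_heads=4, ff_size=256):
        super().__init__()
        self.attention = nn.MultiheadAttention(embed_size, num_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, ff_size),
            nn.GELU(),
            nn.Linear(ff_size, embed_size),
        )
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

    def forward(self, x):
        # Self-attention sub-layer with a residual connection and layer normalization
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sub-layer with a residual connection and layer normalization
        x = self.norm2(x + self.feed_forward(x))
        return x

block = EncoderBlock()
x = torch.rand(1, 10, 64)   # (batch, sequence length, embedding size)
print(block(x).shape)       # torch.Size([1, 10, 64])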

Feed-Forward Layers (FFNN)

What is a Feed-Forward Layer?

A Feed-Forward Layer, also known as a Feed-Forward Neural Network (FFNN), is a fully connected neural network layer that takes the output from the attention mechanism and further processes it.

How does a Feed-Forward Layer work?

  • Input: The output from the attention mechanism is fed into the FFNN.
  • Linear Transformation: The input is transformed using a linear layer (e.g., a dense layer) with a learnable weight matrix and bias term.
  • Activation Function: The output from the linear transformation is passed through an activation function, such as ReLU (Rectified Linear Unit) or GeLU (Gaussian Error Linear Unit).
  • Output: The output from the activation function is the final output of the FFNN.

Purpose of Feed-Forward Layers

The FFNN serves two purposes:

  • Feature Transformation: It transforms the output of the attention mechanism into a higher-level representation that is more suitable for the task at hand.
  • Non-Linearity Introduction: The activation function introduces non-linearity into the model, allowing it to learn more complex relationships between the input and output. Common choices include:
    • ReLU (Rectified Linear Unit): maps all negative values to 0 and leaves positive values unchanged.
    • Sigmoid: maps the input to a value between 0 and 1.
    • GELU (Gaussian Error Linear Unit): a smooth activation that combines properties of ReLU and sigmoid.
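
In a Transformer, this feed-forward sub-layer is typically just two linear layers with an activation in between, applied to each token position independently. A minimal sketch (the 4x expansion factor follows the original Transformer paper; the other sizes are arbitrary):

import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, embed_size=512, expansion=4):
        super().__init__()
        self.linear1 = nn.Linear(embed_size, embed_size * expansion)  # expand
        self.activation = nn.GELU()                                   # non-linearity
        self.linear2 = nn.Linear(embed_size * expansion, embed_size)  # project back

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))

ffn = PositionwiseFeedForward()
x = torch.rand(1, 6, 512)   # (batch, tokens, embedding size)
print(ffn(x).shape)         # torch.Size([1, 6, 512])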

Layer Normalization

What is Layer Normalization?

Layer Normalization is a technique used to normalize the output of each sub-layer (e.g., the FFNN) before passing it to the next layer.

How does Layer Normalization work?

  • Compute Mean and Variance: The mean and variance of the output from the sub-layer are computed.
  • Normalize: The output is normalized by subtracting the mean and dividing by the standard deviation (square root of variance).
  • Scale and Shift: The normalized output is then scaled and shifted using learnable parameters (gamma and beta).
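
In equation form, for a vector of activations x with mean μ and variance σ²:

\[ \text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

where ε is a small constant added for numerical stability, and γ (scale) and β (shift) are the learnable parameters.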

Purpose of Layer Normalization

Layer Normalization serves several purposes:

  • Stabilizes Training: Normalization helps stabilize the training process by reducing the effects of exploding gradients.
  • Improves Generalization: Normalization can improve the model’s ability to generalize to new, unseen data.
  • Reduces Dependence on Initialization: Normalization reduces the dependence on the initialization of the model’s weights.

By using Layer Normalization, the Transformer model can better handle the complex interactions between the attention mechanism and the FFNN, leading to improved performance and stability.


Examples

FFN Example

Suppose we have a simple FFN with one input layer, one hidden layer, and one output layer. The input layer has 2 neurons, the hidden layer has 3 neurons, and the output layer has 1 neuron.

Here’s the FFN architecture: Input Layer (2 neurons) → Hidden Layer (3 neurons) → Output Layer (1 neuron)

Let’s say we have an input vector x = [1, 2]. The FFN processes this input as follows:

  • Input Layer: The input vector x is passed through the input layer.
  • Hidden Layer: The output from the input layer is passed through a linear transformation (e.g., a dense layer) with weights W1 and biases b1, followed by an activation function (e.g., ReLU). Let’s call the output from the hidden layer h.
  • Output Layer: The output from the hidden layer h is passed through another linear transformation with weights W2 and biases b2. The final output is y.

Here’s some sample PyTorch code to illustrate this:

import torch
import torch.nn as nn

# Define the FFN model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

FFNN Example: Image Classification

Suppose we want to build a simple image classification model that can distinguish between images of cats and dogs.

Here’s how an FFNN can be used:

  • Input Layer: The input layer takes in the image data, typically flattened into a vector of pixel values.
  • Hidden Layer: The hidden layer applies a series of transformations to the input data, using weights and biases learned during training.
  • Output Layer: The output layer generates a probability distribution over the two classes (cat or dog).

For example, if we input an image of a cat, the FFNN might output: [0.8, 0.2]

This indicates that the model is 80% confident that the image is a cat and 20% confident that it’s a dog.
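
A minimal sketch of such a classifier in PyTorch (the 64×64 grayscale input and the layer sizes are illustrative choices, not taken from any particular dataset):

import torch
import torch.nn as nn

class CatDogFFNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()             # image matrix -> vector of pixel values
        self.hidden = nn.Linear(64 * 64, 128)   # hidden layer
        self.output = nn.Linear(128, 2)         # two classes: cat, dog

    def forward(self, x):
        h = torch.relu(self.hidden(self.flatten(x)))
        return torch.softmax(self.output(h), dim=-1)   # probability distribution

model = CatDogFFNN()
image = torch.rand(1, 64, 64)   # one fake 64x64 grayscale image
print(model(image))             # e.g. tensor([[0.52, 0.48]]) before any training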

Layer Normalization Example

Now, let’s add Layer Normalization to the FFN model. We’ll apply Layer Normalization to the output of the hidden layer.

Here’s the modified PyTorch code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the FFN model with Layer Normalization
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.layer_norm = nn.LayerNorm(3)  # Layer Normalization for hidden layer
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        h = self.layer_norm(h)  # apply Layer Normalization
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

In this example, we’ve added a LayerNorm module to the FFN model, which applies Layer Normalization to the output of the hidden layer. The LayerNorm module normalizes the input data by subtracting the mean and dividing by the standard deviation, which helps to stabilize the training process and improve the model’s performance.

Layer Normalization Example: Language Translation

Suppose we want to build a machine translation system that translates English sentences to Hindi. We have a dataset of paired English and Hindi sentences.

Our sequence-to-sequence model consists of three main components:

  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state. The encoder uses a GRU (Gated Recurrent Unit) layer to process the input sequence.
  • Layer Normalization: The layer normalization component normalizes the output from the encoder. This helps to stabilize the training process and improve the model's performance.
  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence. The decoder also uses a GRU layer to process the input sequence.

Layer Normalization helps to:

  • Stabilize the training process by reducing the effects of exploding gradients.
  • Improve the model’s ability to generalize to new, unseen data.

Example Walkthrough

Let’s walk through an example of how this model works:

  • English Sentence: We start with an English sentence, “Hello, how are you?”
  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state.

English Word | Encoder Output
Hello        | [0.1, 0.2, 0.3]
how          | [0.4, 0.5, 0.6]
are          | [0.7, 0.8, 0.9]
you          | [1.0, 1.1, 1.2]

Layer Normalization: The layer normalization component normalizes the output from the encoder. Each vector is shifted to zero mean and scaled to unit variance (before the learnable scale and shift), so every row above, which has the same spread, normalizes to roughly the same values:

English Word | Normalized Output
Hello        | [-1.22, 0.00, 1.22]
how          | [-1.22, 0.00, 1.22]
are          | [-1.22, 0.00, 1.22]
you          | [-1.22, 0.00, 1.22]

Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence.

Hindi Word | Decoder Output
नमस्ते | [0.8, 0.9, 1.0]
कैसे | [0.6, 0.7, 0.8]
हो | [0.4, 0.5, 0.6]
? | [0.2, 0.3, 0.4]

Final Output: The final output from the decoder is the translated Hindi sentence, “नमस्ते, कैसे हो?”.

# Import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Define the encoder model
class Encoder(nn.Module):
    def __init__(self):
        # Initialize the encoder model
        super(Encoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded)
        # Return the output and hidden state
        return output, hidden

# Define the layer normalization model
class LayerNormalization(nn.Module):
    def __init__(self):
        # Initialize the layer normalization model
        super(LayerNormalization, self).__init__()
        # Define the layer normalization layer with 256 dimensions
        self.layer_norm = nn.LayerNorm(normalized_shape=256)

    def forward(self, x):
        # Normalize the input sequence
        return self.layer_norm(x)

# Define the decoder model
class Decoder(nn.Module):
    def __init__(self):
        # Initialize the decoder model
        super(Decoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
        # Define the fully connected layer with 256 input dimensions and 10000 output dimensions
        self.fc = nn.Linear(256, 10000)

    def forward(self, x, hidden):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded, hidden)
        # Pass the output through the fully connected layer
        output = self.fc(output[:, -1, :])
        # Return the output and hidden state
        return output, hidden

# Define the sequence-to-sequence model
class Seq2Seq(nn.Module):
    def __init__(self):
        # Initialize the sequence-to-sequence model
        super(Seq2Seq, self).__init__()
        # Define the encoder model
        self.encoder = Encoder()
        # Define the layer normalization model
        self.layer_norm = LayerNormalization()
        # Define the decoder model
        self.decoder = Decoder()

    def forward(self, x):
        # Pass the input sequence through the encoder
        encoder_output, hidden = self.encoder(x)
        # Normalize the encoder outputs (these are the values shown in the walkthrough table)
        normalized_output = self.layer_norm(encoder_output)
        # Normalize the hidden state before handing it to the decoder
        hidden = self.layer_norm(hidden)
        # This simplified decoder has no attention, so it only consumes the first token
        # and the normalized hidden state; a fuller model would also attend over
        # normalized_output
        decoder_output, _ = self.decoder(x[:, 0:1], hidden)
        # Return the decoder output
        return decoder_output

# Initialize the sequence-to-sequence model
model = Seq2Seq()

# Define the vocabulary
english_vocabulary = {'Hello': 0, 'how': 1, 'are': 2, 'you': 3, '<EOS>': 4}
hindi_vocabulary = {'नमस्ते': 0, 'कैसे': 1, 'हो': 2, '<EOS>': 3}

# Define a sample English input sequence
english_input = torch.tensor([[english_vocabulary['Hello'], english_vocabulary['how'], english_vocabulary['are'], english_vocabulary['you']]])

# Define a sample Hindi output sequence
hindi_output = torch.tensor([[hindi_vocabulary['नमस्ते'], hindi_vocabulary['कैसे'], hindi_vocabulary['हो']]])

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(100):
    # Zero the gradients
    optimizer.zero_grad()
    # Pass the input sequence through the model
    output = model(english_input)
    # Calculate the loss
    loss = criterion(output, hindi_output[:, 0])
    # Backpropagate the loss
    loss.backward()
    # Update the model parameters
    optimizer.step()
    # Print the loss
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

# Use the trained model for translation
translated_output = model(english_input)
translated_output_idx = torch.argmax(translated_output, dim=1)
predicted_word = list(hindi_vocabulary.keys())[translated_output_idx.item()]
print(predicted_word)

Multi-Head Attention (Why Do We Need Multiple Attention Layers?)

Instead of a single attention mechanism, Transformers use several attention "heads" in parallel to capture different relationships.

  • Why? A single attention mechanism may miss important details.
  • Multi-head attention ensures the model sees information from different perspectives.
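
You can see the individual heads directly in the BERT sample code from earlier: each entry of outputs.attentions holds one attention matrix per head. A short continuation of that example (bert-base-uncased uses 12 heads per layer):

# Continuing the BERT example above: inspect one layer's attention weights
first_layer_attention = attentions[0]
print(first_layer_attention.shape)
# (batch size, number of heads, tokens, tokens) -> 12 heads, each with its own
# pattern of which tokens attend to which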

Summary: How Generative AI Processes Text

When you enter a prompt like:
➡️ "Explain black holes"

A Generative AI model follows these steps:

Step 1: Tokenization

Breaks text into smaller parts (tokens).
Example:
"Explain black holes" → ["explain", "black", "holes"]

Step 2: Embeddings

Each token is converted into a numerical vector for processing.

Step 3: Transformer Model Processing

  • Self-attention determines which words matter the most.
  • Multiple layers refine understanding.

Step 4: Text Generation

  • Predicts the most likely next token at each step.
  • Constructs output based on learned patterns.
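
These four steps can be seen end to end with a few lines of the transformers library. A minimal sketch using GPT-2 (the generated continuation will vary and is not guaranteed to be a good explanation):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Steps 1 and 2: tokenize the prompt and convert it to tensors
inputs = tokenizer("Explain black holes", return_tensors="pt")

# Steps 3 and 4: the transformer processes the tokens and predicts the next ones
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))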