2.1.4: Transformers

Transformer Architecture

The Transformer Model is the core architecture behind most modern NLP and GenAI models like GPT, BERT, and LLaMA. Here’s how it works:

Key Components of the Transformer

  • Self-Attention Mechanism: This allows the model to focus on different words in a sentence when processing each word. For example, when processing the word “bank” in the sentence “I went to the bank to withdraw money,” the model can focus on the context to determine if “bank” refers to a financial institution or the side of a river.
  • Multi-Head Attention: This technique allows the model to focus on different aspects of the sentence simultaneously, using multiple attention heads to capture different relationships between words (a minimal sketch of self-attention and multi-head attention follows this list).
  • Positional Encoding: Since transformers, unlike sequential models such as RNNs, have no inherent sense of word order, positional encoding is added to provide information about each word's position in the sentence.
  • Encoder-Decoder Architecture:
    • Encoder: Processes the input data (e.g., a sentence).
    • Decoder: Generates the output data (e.g., a translation of the sentence).
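
To make these ideas concrete, here is a minimal PyTorch sketch of scaled dot-product self-attention and of the built-in multi-head attention module. The sequence length, embedding size, and number of heads are illustrative assumptions, not values from any particular model:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: 6 tokens, 16-dimensional embeddings, 4 attention heads
seq_len, d_model, num_heads = 6, 16, 4
x = torch.randn(1, seq_len, d_model)  # a toy "sentence" of 6 token embeddings (batch size 1)

# Scaled dot-product self-attention, written out by hand
W_q, W_k, W_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
Q, K, V = W_q(x), W_k(x), W_v(x)
scores = Q @ K.transpose(-2, -1) / (d_model ** 0.5)  # how strongly each word attends to every other word
weights = F.softmax(scores, dim=-1)                  # attention weights sum to 1 over the sentence
attended = weights @ V                               # context-aware representation of each word
print(attended.shape)  # torch.Size([1, 6, 16])

# The same idea with multiple heads, using PyTorch's built-in module
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
out, attn_weights = mha(x, x, x)  # query, key, and value all come from the same sentence
print(out.shape)  # torch.Size([1, 6, 16])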

In addition to attention, both the Transformer Encoder and Decoder blocks contain the following sub-layers; a combined sketch of one encoder block follows this list:

  • Feed-Forward Layers: Further process the information after the attention sub-layer.
  • Layer Normalization: Normalizes the output of each sub-layer before passing it to the next layer, stabilizing training and preventing exploding gradients.
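
Putting the pieces together, a single encoder block can be sketched as follows. This is a minimal post-norm variant written for clarity; the dimensions are assumptions, and production implementations add dropout, masking, and a stack of many such blocks:

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal transformer encoder block: attention -> add & norm -> feed-forward -> add & norm."""
    def __init__(self, d_model=16, num_heads=4, d_ff=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention over the whole sequence
        x = self.norm1(x + attn_out)       # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))    # feed-forward sub-layer + residual + layer normalization
        return x

block = EncoderBlock()
x = torch.randn(1, 6, 16)   # batch of 1 sentence, 6 tokens, 16-dimensional embeddings
print(block(x).shape)       # torch.Size([1, 6, 16])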

Feed-Forward Layers (FFNN)

What is a Feed-Forward Layer?

A Feed-Forward Layer, also known as a Feed-Forward Neural Network (FFNN), is a fully connected neural network layer that takes the output from the attention mechanism and further processes it.

How does a Feed-Forward Layer work?

  • Input: The output from the attention mechanism is fed into the FFNN.
  • Linear Transformation: The input is transformed using a linear layer (e.g., a dense layer) with a learnable weight matrix and bias term.
  • Activation Function: The output from the linear transformation is passed through an activation function, such as ReLU (Rectified Linear Unit) or GeLU (Gaussian Error Linear Unit).
  • Output: The output from the activation function is the final output of the FFNN (a short sketch of this sub-layer follows this list).
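
Inside a transformer, this sub-layer is applied to every position independently and typically expands to a wider hidden dimension before projecting back. Here is a minimal sketch; the 16 and 64 dimensions are arbitrary choices for illustration:

import torch
import torch.nn as nn

d_model, d_ff = 16, 64  # illustrative sizes

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # linear transformation with learnable weights and bias
    nn.GELU(),                  # activation function introduces non-linearity
    nn.Linear(d_ff, d_model),   # project back down to the model dimension
)

attention_output = torch.randn(1, 6, d_model)  # stand-in for the attention mechanism's output
print(ffn(attention_output).shape)             # torch.Size([1, 6, 16]); the same FFN is applied at each position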

Purpose of Feed-Forward Layers

The FFNN serves two purposes:

  • Feature Transformation: It transforms the output from the attention mechanism into a higher-level representation that’s more suitable for the task at hand.
  • Non-Linearity Introduction: The activation function introduces non-linearity into the model, allowing it to learn more complex relationships between the input and output. Common choices include (a short numerical comparison follows this list):
    • ReLU (Rectified Linear Unit): maps all negative values to 0 and leaves positive values unchanged.
    • Sigmoid: maps the input to a value between 0 and 1.
    • GELU (Gaussian Error Linear Unit): a smooth alternative to ReLU that gates the input with a sigmoid-like (Gaussian CDF) curve, damping small negative values instead of zeroing them.
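
The differences between these activations are easy to see numerically; the input values below are arbitrary:

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])  # arbitrary sample inputs

print(torch.relu(x))     # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000]) -- negatives clipped to 0
print(torch.sigmoid(x))  # tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808]) -- squashed into (0, 1)
print(F.gelu(x))         # smooth curve: small negative inputs are damped rather than zeroed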

Layer Normalization

What is Layer Normalization?

Layer Normalization is a technique used to normalize the output of each sub-layer (e.g., the FFNN) before passing it to the next layer.

How does Layer Normalization work?

  • Compute Mean and Variance: The mean and variance of the output from the sub-layer are computed.
  • Normalize: The output is normalized by subtracting the mean and dividing by the standard deviation (square root of variance).
  • Scale and Shift: The normalized output is then scaled and shifted using learnable parameters (gamma and beta), as the sketch after this list shows.
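
These three steps can be written out directly and checked against PyTorch's nn.LayerNorm; the 4-dimensional input vector is an arbitrary example:

import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0, 4.0]])  # one arbitrary activation vector

# 1. Compute mean and variance over the feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)

# 2. Normalize: subtract the mean, divide by the standard deviation
eps = 1e-5  # small constant for numerical stability
x_norm = (x - mean) / torch.sqrt(var + eps)

# 3. Scale and shift with learnable parameters gamma and beta
gamma = torch.ones(4)   # gamma is initialized to 1
beta = torch.zeros(4)   # beta is initialized to 0
out_manual = gamma * x_norm + beta

# The built-in module performs the same computation
layer_norm = nn.LayerNorm(4)
out_module = layer_norm(x)
print(torch.allclose(out_manual, out_module, atol=1e-5))  # True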

Purpose of Layer Normalization

Layer Normalization serves several purposes:

  • Stabilizes Training: Normalization helps stabilize the training process by reducing the effects of exploding gradients.
  • Improves Generalization: Normalization can improve the model’s ability to generalize to new, unseen data.
  • Reduces Dependence on Initialization: Normalization reduces the dependence on the initialization of the model’s weights.

By using Layer Normalization, the Transformer model can better handle the complex interactions between the attention mechanism and the FFNN, leading to improved performance and stability.


Examples

FFN Example

Suppose we have a simple FFN with one input layer, one hidden layer, and one output layer. The input layer has 2 neurons, the hidden layer has 3 neurons, and the output layer has 1 neuron.

Here’s the FFN architecture: Input Layer (2 neurons) → Hidden Layer (3 neurons) → Output Layer (1 neuron)

Let’s say we have an input vector x = [1, 2]. The FFN processes this input as follows:

  • Input Layer: The input vector x is passed through the input layer.
  • Hidden Layer: The output from the input layer is passed through a linear transformation (e.g., a dense layer) with weights W1 and biases b1, followed by an activation function (e.g., ReLU). Let’s call the output from the hidden layer h.
  • Output Layer: The output from the hidden layer h is passed through another linear transformation with weights W2 and biases b2. The final output is y.

Here’s some sample PyTorch code to illustrate this:

import torch
import torch.nn as nn

# Define the FFN model
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

FFNN Example: Image Classification

Suppose we want to build a simple image classification model that can distinguish between images of cats and dogs.

Here’s how an FFNN can be used:

  • Input Layer: The input layer takes in the image data, which is typically represented as a matrix of pixel values.
  • Hidden Layer: The hidden layer applies a series of transformations to the input data, using weights and biases learned during training.
  • Output Layer: The output layer generates a probability distribution over the two classes (cat or dog).

For example, if we input an image of a cat, the FFNN might output: [0.8, 0.2]

This indicates that the model is 80% confident that the image is a cat and 20% confident that it’s a dog.
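
A hedged sketch of such a classifier in PyTorch might look like the following; the 64x64 RGB image size, hidden width, and class count are assumptions for illustration, and a real model would be trained on labeled data rather than random pixels:

import torch
import torch.nn as nn

class CatDogClassifier(nn.Module):
    """Minimal FFNN image classifier: flattened pixels -> hidden layer -> 2-class probabilities."""
    def __init__(self, image_pixels=64 * 64 * 3, hidden_size=128, num_classes=2):
        super().__init__()
        self.hidden_layer = nn.Linear(image_pixels, hidden_size)
        self.output_layer = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = x.flatten(start_dim=1)              # flatten the pixel matrix into a vector
        h = torch.relu(self.hidden_layer(x))    # hidden layer with ReLU activation
        logits = self.output_layer(h)           # raw scores for [cat, dog]
        return torch.softmax(logits, dim=-1)    # probability distribution over the two classes

model = CatDogClassifier()
image = torch.rand(1, 3, 64, 64)  # a stand-in for one RGB image
print(model(image))               # e.g. tensor([[0.51, 0.49]]) before any training

In practice the softmax would usually be folded into the loss function (e.g. nn.CrossEntropyLoss), but it is shown here so the output matches the [0.8, 0.2]-style probabilities described above.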

Layer Normalization Example

Now, let’s add Layer Normalization to the FFN model. We’ll apply Layer Normalization to the output of the hidden layer.

Here’s the modified PyTorch code:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the FFN model with Layer Normalization
class FFN(nn.Module):
    def __init__(self):
        super(FFN, self).__init__()
        self.hidden_layer = nn.Linear(2, 3)  # input layer (2) -> hidden layer (3)
        self.layer_norm = nn.LayerNorm(3)  # Layer Normalization for hidden layer
        self.output_layer = nn.Linear(3, 1)  # hidden layer (3) -> output layer (1)

    def forward(self, x):
        h = torch.relu(self.hidden_layer(x))  # activation function: ReLU
        h = self.layer_norm(h)  # apply Layer Normalization
        y = self.output_layer(h)
        return y

# Initialize the FFN model and input vector
model = FFN()
x = torch.tensor([1.0, 2.0])

# Forward pass
y = model(x)
print(y)

In this example, we’ve added a LayerNorm module to the FFN model, which applies Layer Normalization to the output of the hidden layer. The LayerNorm module normalizes the input data by subtracting the mean and dividing by the standard deviation, which helps to stabilize the training process and improve the model’s performance.

Layer Normalization Example: Language Translation

Suppose we want to build a machine translation system that translates English sentences to Hindi. We have a dataset of paired English and Hindi sentences.

Our sequence-to-sequence model consists of three main components:

  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state. The encoder uses a GRU (Gated Recurrent Unit) layer to process the input sequence.
  • Layer Normalization: The layer normalization component normalizes the output from the encoder. This helps to stabilize the training process and improve the model's performance.
  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence. The decoder also uses a GRU layer to process the input sequence.

Layer Normalization helps to:

  • Stabilize the training process by reducing the effects of exploding gradients.
  • Improve the model’s ability to generalize to new, unseen data.

Example Walkthrough

Let’s walk through an example of how this model works:

  • English Sentence: We start with an English sentence, “Hello, how are you?”
  • Encoder: The encoder takes in the English sentence and outputs a sequence of vectors and a hidden state.

English Word    Encoder Output
Hello           [0.1, 0.2, 0.3]
how             [0.4, 0.5, 0.6]
are             [0.7, 0.8, 0.9]
you             [1.0, 1.1, 1.2]

  • Layer Normalization: The layer normalization component normalizes each word's encoder output (here with gamma = 1 and beta = 0). Because every encoder output above is an evenly spaced triple, each one normalizes to the same pattern:

English Word    Normalized Output
Hello           [-1.22, 0.00, 1.22]
how             [-1.22, 0.00, 1.22]
are             [-1.22, 0.00, 1.22]
you             [-1.22, 0.00, 1.22]

  • Decoder: The decoder takes in the normalized output and the hidden state from the encoder and outputs a Hindi sentence.

Hindi Word    Decoder Output
नमस्ते         [0.8, 0.9, 1.0]
कैसे           [0.6, 0.7, 0.8]
हो             [0.4, 0.5, 0.6]
?              [0.2, 0.3, 0.4]

  • Final Output: The final output from the decoder is the translated Hindi sentence, “नमस्ते, कैसे हो?”.

# Import the necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim

# Define the encoder model
class Encoder(nn.Module):
    def __init__(self):
        # Initialize the encoder model
        super(Encoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

    def forward(self, x):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded)
        # Return the output and hidden state
        return output, hidden

# Define the layer normalization model
class LayerNormalization(nn.Module):
    def __init__(self):
        # Initialize the layer normalization model
        super(LayerNormalization, self).__init__()
        # Define the layer normalization layer with 256 dimensions
        self.layer_norm = nn.LayerNorm(normalized_shape=256)

    def forward(self, x):
        # Normalize the input sequence
        return self.layer_norm(x)

# Define the decoder model
class Decoder(nn.Module):
    def __init__(self):
        # Initialize the decoder model
        super(Decoder, self).__init__()
        # Define the embedding layer with 10000 possible words and 128-dimensional vectors
        self.embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)
        # Define the GRU layer with 128 input dimensions, 256 hidden dimensions, and 1 layer
        self.rnn = nn.GRU(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
        # Define the fully connected layer with 256 input dimensions and 10000 output dimensions
        self.fc = nn.Linear(256, 10000)

    def forward(self, x, hidden):
        # Embed the input sequence
        embedded = self.embedding(x)
        # Pass the embedded sequence through the GRU layer
        output, hidden = self.rnn(embedded, hidden)
        # Pass the output through the fully connected layer
        output = self.fc(output[:, -1, :])
        # Return the output and hidden state
        return output, hidden

# Define the sequence-to-sequence model
class Seq2Seq(nn.Module):
    def __init__(self):
        # Initialize the sequence-to-sequence model
        super(Seq2Seq, self).__init__()
        # Define the encoder model
        self.encoder = Encoder()
        # Define the layer normalization model
        self.layer_norm = LayerNormalization()
        # Define the decoder model
        self.decoder = Decoder()

    def forward(self, x):
        # Pass the input sequence through the encoder
        encoder_output, hidden = self.encoder(x)
        # Normalize the encoder outputs (in a fuller model these would feed an attention mechanism)
        normalized_output = self.layer_norm(encoder_output)
        # Also normalize the final hidden state so the layer-norm block participates in this simplified model
        normalized_hidden = self.layer_norm(hidden)
        # Pass the first input token and the normalized hidden state through the decoder
        # (a real system would start the decoder from a start-of-sequence token on the target side)
        decoder_output, _ = self.decoder(x[:, 0:1], normalized_hidden)
        # Return the decoder output
        return decoder_output

# Initialize the sequence-to-sequence model
model = Seq2Seq()

# Define toy vocabularies covering only the walkthrough sentence
english_vocabulary = {'Hello': 0, 'how': 1, 'are': 2, 'you': 3, '<EOS>': 4}
hindi_vocabulary = {'नमस्ते': 0, 'कैसे': 1, 'हो': 2, '?': 3, '<EOS>': 4}

# Define a sample English input sequence: "Hello how are you"
english_input = torch.tensor([[english_vocabulary['Hello'], english_vocabulary['how'],
                               english_vocabulary['are'], english_vocabulary['you']]])

# Define a sample Hindi output sequence: "नमस्ते कैसे हो ?"
hindi_output = torch.tensor([[hindi_vocabulary['नमस्ते'], hindi_vocabulary['कैसे'],
                              hindi_vocabulary['हो'], hindi_vocabulary['?']]])

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
for epoch in range(100):
    # Zero the gradients
    optimizer.zero_grad()
    # Pass the input sequence through the model
    output = model(english_input)
    # Calculate the loss (for simplicity, the model is trained to predict only the first Hindi word)
    loss = criterion(output, hindi_output[:, 0])
    # Backpropagate the loss
    loss.backward()
    # Update the model parameters
    optimizer.step()
    # Print the loss
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

# Use the trained model for translation
translated_output = model(english_input)
translated_output_idx = torch.argmax(translated_output, dim=1)
index_to_hindi = {idx: word for word, idx in hindi_vocabulary.items()}
predicted_word = index_to_hindi.get(translated_output_idx.item(), '<UNK>')  # map the predicted index back to a Hindi word
print(predicted_word)