2.1.2: Embeddings

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?

  • Words with similar meanings have similar embeddings (see the similarity sketch after this list).
  • Embeddings enable models to generalize and understand context.
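
To make the first point concrete, here is a minimal sketch that scores word pairs with cosine similarity. It reuses the same GPT-2 setup as the hands-on code below; the word choices ("dog", "puppy", "car") and the mean-pooling step are illustrative assumptions, and GPT-2's hidden states are only a rough stand-in for purpose-built similarity embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def embed(text):
    # Mean-pool the final hidden states into one vector per input text
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)

dog, puppy, car = embed("dog"), embed("puppy"), embed("car")

# Related words are expected to score higher than unrelated ones
print("dog vs puppy:", F.cosine_similarity(dog, puppy).item())
print("dog vs car:  ", F.cosine_similarity(dog, car).item())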

Example:

  • The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship (a toy numerical check follows this example):
    king - man + woman ≈ queen
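
A toy numerical check of this relationship, using hand-crafted 3-dimensional vectors (the dimensions and values are made up for illustration; real learned embeddings have hundreds of dimensions and satisfy the analogy only approximately):

import numpy as np

# Toy "embeddings" with made-up dimensions: [royalty, maleness, person-ness]
vectors = {
    "king":  np.array([0.9, 0.9, 0.7]),
    "queen": np.array([0.9, 0.1, 0.7]),
    "man":   np.array([0.1, 0.9, 0.7]),
    "woman": np.array([0.1, 0.1, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
for word, vec in vectors.items():
    print(f"{word:>5}: cosine similarity to result = {cosine(result, vec):.3f}")

With these toy values, "queen" scores highest, which is exactly what the analogy predicts.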

Hands-On: Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract the embeddings for all tokens
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123,  0.0456, -0.0678,  0.0234,  0.0891])

Explanation:

  • model(**inputs): Passes the token IDs through the model to generate embeddings.
  • outputs.last_hidden_state: Contains the contextual embedding (the model's final hidden state) for each token.
  • Each token is represented as a 768-dimensional vector (for GPT-2).
  • Output: torch.Size([1, 6, 768]) means:
    • Batch size: 1, since we passed a single input sentence.
    • Sequence length: 6, since "Hello, how are you?" is split into 6 tokens.
    • Embedding dimension: 768, the size of each token's vector in GPT-2.
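
Note that last_hidden_state holds contextual embeddings: the vector for "Hello" already reflects the tokens around it. As an illustrative aside (not part of the original example), the sketch below looks at the static token-embedding table that feeds the model before any transformer layers run:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Static lookup table: one row per vocabulary entry (50257 x 768 for GPT-2)
embedding_layer = model.get_input_embeddings()
print("Embedding table shape:", embedding_layer.weight.shape)

# Looking up the input IDs gives the pre-transformer token embeddings
with torch.no_grad():
    static_embeddings = embedding_layer(inputs["input_ids"])
print("Static embeddings shape:", static_embeddings.shape)  # torch.Size([1, 6, 768])

Comparing a row of this table with the corresponding slice of last_hidden_state shows how much the transformer layers reshape each token's vector based on its context.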