2.1.2: Embeddings

Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.

Why are Embeddings Important?

  • Words with similar meanings have similar embeddings (see the similarity sketch after this list).
  • Embeddings enable models to generalize and understand context.
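
To make the first point concrete, here is a minimal sketch that scores word pairs with cosine similarity. It reuses the same GPT-2 setup as the hands-on code below; the word choices ("dog", "puppy", "car") and the mean-pooling step are illustrative assumptions, and GPT-2's hidden states are only a rough stand-in for purpose-built similarity embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def embed(text):
    # Mean-pool the final hidden states into one vector per input text
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)

dog, puppy, car = embed("dog"), embed("puppy"), embed("car")

# Related words are expected to score higher than unrelated ones
print("dog vs puppy:", F.cosine_similarity(dog, puppy).item())
print("dog vs car:  ", F.cosine_similarity(dog, car).item())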

Example:

  • The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship (a toy numerical check follows this example):
    king - man + woman ≈ queen
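
A toy numerical check of this relationship, using hand-crafted 3-dimensional vectors (the dimensions and values are made up for illustration; real learned embeddings have hundreds of dimensions and satisfy the analogy only approximately):

import numpy as np

# Toy "embeddings" with made-up dimensions: [royalty, maleness, person-ness]
vectors = {
    "king":  np.array([0.9, 0.9, 0.7]),
    "queen": np.array([0.9, 0.1, 0.7]),
    "man":   np.array([0.1, 0.9, 0.7]),
    "woman": np.array([0.1, 0.1, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
result = vectors["king"] - vectors["man"] + vectors["woman"]
for word, vec in vectors.items():
    print(f"{word:>5}: cosine similarity to result = {cosine(result, vec):.3f}")

With these toy values, "queen" scores highest, which is exactly what the analogy predicts.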

Hands-On: Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Extract the embeddings for all tokens
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions

Output:

Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123,  0.0456, -0.0678,  0.0234,  0.0891])

Explanation:

  • model(**inputs): Passes the token IDs through the model to generate embeddings.
  • outputs.last_hidden_state: Contains the contextual embedding (the model's final hidden state) for each token.
  • Each token is represented as a 768-dimensional vector (for GPT-2).
  • Output: torch.Size([1, 6, 768]) means:
    • Batch size: 1, since we passed a single input sentence.
    • Sequence length: 6, since "Hello, how are you?" is split into 6 tokens.
    • Embedding dimension: 768, the size of each token's vector in GPT-2.
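
Note that last_hidden_state holds contextual embeddings: the vector for "Hello" already reflects the tokens around it. As an illustrative aside (not part of the original example), the sketch below looks at the static token-embedding table that feeds the model before any transformer layers run:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

# Static lookup table: one row per vocabulary entry (50257 x 768 for GPT-2)
embedding_layer = model.get_input_embeddings()
print("Embedding table shape:", embedding_layer.weight.shape)

# Looking up the input IDs gives the pre-transformer token embeddings
with torch.no_grad():
    static_embeddings = embedding_layer(inputs["input_ids"])
print("Static embeddings shape:", static_embeddings.shape)  # torch.Size([1, 6, 768])

Comparing a row of this table with the corresponding slice of last_hidden_state shows how much the transformer layers reshape each token's vector based on its context.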