2.1.2: Embeddings
Embeddings are numerical representations of tokens in a high-dimensional space. They capture the semantic meaning of tokens, allowing models to understand relationships between words.
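To make "numerical representation" concrete, here is a minimal sketch of an embedding lookup table built with torch.nn.Embedding. The toy vocabulary, the 8-dimensional size, and the random initialization are illustrative assumptions; in a real model the vectors are learned during training.

import torch
import torch.nn as nn

# Toy vocabulary and a randomly initialized lookup table (illustrative only;
# real models learn these vectors and use far larger vocabularies).
vocab = {"hello": 0, "how": 1, "are": 2, "you": 3}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Each token ID maps to one row of the table: its embedding vector.
token_id = torch.tensor([vocab["hello"]])
vector = embedding(token_id)
print(vector.shape)  # torch.Size([1, 8]) -- one token, 8 dimensions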
Why are Embeddings Important?
- Words with similar meanings have similar embeddings (see the cosine-similarity sketch after this list).
- Embeddings enable models to generalize and understand context.
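A quick way to see this is to compare word vectors with cosine similarity. The sketch below is an illustration built on assumptions not in the lesson (the word choices, the leading-space handling for GPT-2's BPE tokenizer, and mean-pooling over subword pieces); related words such as "king" and "queen" typically score higher than unrelated ones, though the exact values depend on the model.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
embedding_table = model.get_input_embeddings()  # GPT-2's token embedding matrix

def word_vector(word):
    # A word may split into several BPE tokens, so mean-pool their vectors.
    ids = tokenizer(" " + word, return_tensors="pt")["input_ids"]
    return embedding_table(ids).mean(dim=1).squeeze(0)

king, queen, apple = word_vector("king"), word_vector("queen"), word_vector("apple")
print("king vs queen:", F.cosine_similarity(king, queen, dim=0).item())
print("king vs apple:", F.cosine_similarity(king, apple, dim=0).item())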
Example:
- The embeddings for "king", "queen", "man", and "woman" might satisfy the relationship: king - man + woman ≈ queen
Hands-On: Embeddings
from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
# Tokenize input text
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Extract the hidden states (embeddings) for every token
embeddings = outputs.last_hidden_state
print("Embeddings shape:", embeddings.shape)
print("Embeddings for 'Hello':", embeddings[0, 0, :5])  # First 5 dimensions of the first token

Output:
Embeddings shape: torch.Size([1, 6, 768])
Embeddings for 'Hello': tensor([-0.0123, 0.0456, -0.0678, 0.0234, 0.0891])

Explanation:
- model(**inputs): Passes the token IDs through the model to generate embeddings.
- outputs.last_hidden_state: Contains the embeddings for each token.
- Each token is represented as a 768-dimensional vector (for GPT-2).
- Output: torch.Size([1, 6, 768]) means:
  - Batch size: 1, since we have one input sentence.
  - Sequence length: 6, since the sentence "Hello, how are you?" is split into 6 tokens.
  - Embedding dimensions: 768, since each token is represented by a 768-dimensional vector for GPT-2.