1. Embeddings & Vector Representation
What Are Word Embeddings?
Word embeddings are numerical representations of words in a continuous vector space. These embeddings capture the meaning, relationships, and context of words based on how they appear in text data.
Why Do LLMs Need Word Embeddings?
LLMs like GPT, BERT, and LLaMA work with numbers, not raw text. Embeddings convert words into numerical format so they can be processed by neural networks.
- Without embeddings: The model treats words like independent tokens (e.g., “king” and “queen” would be unrelated).
- With embeddings: The model understands relationships (e.g., “king” and “queen” are semantically close).
Key Idea: Words with similar meanings will have similar vector representations in the embedding space.
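To make this concrete, here is a minimal sketch of how "closeness" is measured with cosine similarity. The 3-dimensional vectors are made up for illustration only; real embeddings have hundreds of dimensions.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional vectors (illustrative only, not real embeddings)
king = np.array([0.8, 0.65, 0.1])
queen = np.array([0.75, 0.7, 0.15])
banana = np.array([0.1, 0.2, 0.9])

print("king vs queen :", cosine_similarity(king, queen))   # high (close in meaning)
print("king vs banana:", cosine_similarity(king, banana))  # low (unrelated)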
Different Types of Word Embeddings
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Word2Vec | Continuous Bag of Words (CBOW): Predicts a target word based on its context words. Skip-Gram: Predicts context words based on a target word. | Captures semantic meaning well. | Cannot handle new words (fixed vocabulary). |
| GloVe | Uses word co-occurrence statistics. It represents words as vectors by factorizing a word-context co-occurrence matrix. | Captures global meaning better. | Static embeddings (no context awareness). |
| BERT | Contextual embeddings (different meaning in different sentences). It uses a multi-layer bidirectional transformer encoder to generate context-dependent word embeddings. | Solves polysemy (e.g., “bank” as a river vs. financial). | Heavier computation. |
| OpenAI Embeddings | Transformer-based, optimized for retrieval & search. | Best for LLM-based applications. | Requires API calls (not open-source). |
| ELMo | Contextualized word embeddings using a bidirectional LSTM (Long Short-Term Memory). It generates context-dependent word embeddings. | Captures nuances of language. | Computationally expensive. |
| RoBERTa | Optimized BERT approach with dynamic masking and larger batch sizes. It generates context-dependent word embeddings. | Improved performance over BERT. | Requires significant computational resources. |
| FastText | Extension of Word2Vec, takes into account subword information. It represents words as a bag of character n-grams. | Handles out-of-vocabulary words. | Computationally expensive. May not capture semantic meaning as well. |
| Sentence-BERT | Siamese network-based approach (two identical sub-networks with shared weights) that produces sentence embeddings for tasks like semantic search and clustering. | Effective for sentence-level tasks. | May not capture nuances of individual words. |
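To see the difference between a fixed-vocabulary method (Word2Vec) and a subword-based method (FastText) in practice, here is a small sketch using the gensim library (requires pip install gensim). The toy corpus and parameter values are illustrative, not tuned.
from gensim.models import Word2Vec, FastText

# Tiny toy corpus (each sentence is a list of tokens)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "eats", "the", "banana"],
]

# Word2Vec: fixed vocabulary, skip-gram (sg=1)
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)
print(w2v.wv.most_similar("king", topn=3))  # nearest words in the toy corpus

# FastText: builds vectors from character n-grams, so it can embed unseen words
ft = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=100)
print(ft.wv["kingly"][:5])  # "kingly" never appears in the corpus, but still gets a vector
# w2v.wv["kingly"] would raise a KeyError because Word2Vec's vocabulary is fixed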
Hands-on: Generating & Visualizing Word Embeddings
We’ll generate embeddings using:
- OpenAI’s API (for latest transformer-based embeddings).
- SentenceTransformers (open-source alternative).
- t-SNE visualization (to plot embeddings in 2D).
Example 1: Generating Embeddings Using OpenAI API
Best for real-world applications like search, retrieval, and semantic similarity
Step 1: Install OpenAI SDK
pip install openai
Step 2: Generate Embeddings
import openai

# OpenAI API key (replace with your actual key)
openai.api_key = "your_api_key"

def get_embedding(text, model="text-embedding-ada-002"):
    # Call the Embeddings API and return the embedding vector for the given text
    response = openai.embeddings.create(input=text, model=model)
    return response.data[0].embedding
# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]
# Generate embeddings
embeddings = [get_embedding(text) for text in texts]
print("Generated Embeddings:", embeddings[:2]) # Print first 2 embeddings📝 Notes:
- OpenAI’s embedding model (text-embedding-ada-002) is a strong general-purpose model and works well for LLM applications.
- The embeddings can be used for semantic search, clustering, and recommendation systems.
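If you are embedding many texts, it is usually better to batch them. Here is a hedged sketch of the same call with a list input (it reuses the openai setup from the snippet above; the API returns one embedding per input text):
def get_embeddings(texts, model="text-embedding-ada-002"):
    # The Embeddings API also accepts a list of strings in a single request
    response = openai.embeddings.create(input=texts, model=model)
    # response.data contains one item per input text, in the same order
    return [item.embedding for item in response.data]

texts = ["king", "queen", "apple", "banana", "dog", "cat"]
embeddings = get_embeddings(texts)
print(len(embeddings), "embeddings of length", len(embeddings[0]))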
Example 2: Using SentenceTransformers (Open-Source Alternative)
Best for local and offline applications
Step 1: Install SentenceTransformers
pip install sentence-transformers tf-keras
Step 2: Generate Embeddings
from sentence_transformers import SentenceTransformer
# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]
# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2]) # Print first 2 embeddings📝 Notes:
all-MiniLM-L6-v2is a lightweight model optimized for speed and accuracy.- The embeddings are useful for classification, search, and NLP tasks.
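Since the notes mention search, here is a short sketch of sentence-level semantic search with the same model, using sentence_transformers.util.cos_sim. The corpus and query sentences are made up for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny corpus and a query (illustrative sentences)
corpus = [
    "The queen addressed the royal court.",
    "I sliced a banana into my cereal.",
    "The dog barked at the cat.",
]
query = "The king spoke to his subjects."

# Encode and rank the corpus by cosine similarity to the query
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_emb)[0]

for sentence, score in sorted(zip(corpus, scores.tolist()), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
# The royal-court sentence should rank highest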
Example 3: Visualizing Embeddings with t-SNE
Helps understand how words are related in the embedding space.
Step 1: Install Matplotlib & SciKit-Learn
pip install matplotlib scikit-learn
Step 2: Visualize Embeddings
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]
# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2]) # Print first 2 embeddings
# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)  # Project the embeddings down to 2D for plotting
# Plot embeddings
plt.figure(figsize=(6, 6))
for i, text in enumerate(texts):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, text, fontsize=12)
plt.title("t-SNE Visualization of Word Embeddings")
plt.show()
Output:
📝 What You’ll See:
- Similar words (e.g., “king” and “queen”) should appear close to each other.
- Dissimilar words (e.g., “king” and “banana”) should be far apart.
In the example above, we used three parameters to tune t-SNE: perplexity, random_state, and n_components. Here's a more detailed explanation of each, with examples:
Perplexity
In t-SNE, perplexity roughly controls how many nearest neighbors each point is compared against, balancing local and global structure. It must be smaller than the number of samples (which is why we used perplexity=5 for only six words).
- High perplexity: Each point considers many neighbors, emphasizing global structure; it needs a larger dataset.
- Low perplexity: Each point considers only its closest neighbors, emphasizing local structure.
Example:
from sklearn.manifold import TSNE
# High perplexity (considers many neighbors; needs a larger dataset)
tsne_high_perplexity = TSNE(n_components=2, perplexity=50)
# Low perplexity (focuses on local neighborhoods; fine for small datasets)
tsne_low_perplexity = TSNE(n_components=2, perplexity=5)
Random State
Random state ensures reproducibility of results by setting the seed for random number generation.
- Fixed random state: Produces the same results every time the code is run.
- Unset random state (None): May produce different results on each run.
Example:
from sklearn.manifold import TSNE
# Fixed random state (reproducible results)
tsne_fixed_random_state = TSNE(n_components=2, random_state=42)
# Unset random state (results may vary between runs)
tsne_random_random_state = TSNE(n_components=2, random_state=None)
n_components
n_components specifies the number of dimensions to reduce the data to.
- Low n_components: Reduces the data to a few dimensions, losing some information.
- High n_components: Retains more dimensions, but may not reduce the data effectively.
Example:
from sklearn.manifold import TSNE
# Low n_components (reduce to 2D, suitable for plotting)
tsne_low_n_components = TSNE(n_components=2)
# High n_components (reduce to 10D); requires method="exact", because the default
# Barnes-Hut algorithm only supports up to 3 output dimensions
tsne_high_n_components = TSNE(n_components=10, method="exact")
Summary so far!
- Word embeddings convert text into numbers so LLMs can process them.
- Different methods (Word2Vec, GloVe, BERT, OpenAI) have pros & cons.
- SentenceTransformers is a free, offline alternative.
- t-SNE helps visualize relationships between words.
Real-World Use Cases
✔ Semantic Search: Find similar documents using embeddings.
✔ Chatbots & Q&A Systems: Improve response relevance.
✔ Recommendation Systems: Recommend similar items.
✔ Text Clustering: Group similar content automatically (see the sketch below).
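As a small illustration of the clustering use case, here is a hedged sketch using SentenceTransformers and scikit-learn's KMeans. The example sentences and the choice of 2 clusters are arbitrary.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is the delivery time for my order?",
    "When will my package arrive?",
]

# Embed the documents and group them into 2 clusters
embeddings = model.encode(docs)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(embeddings)

for doc, label in zip(docs, labels):
    print(label, doc)
# Password/login questions and delivery questions should land in separate clusters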
