1. Embeddings & Vector Representation

What Are Word Embeddings?

Word embeddings are numerical representations of words in a continuous vector space. These embeddings capture the meaning, relationships, and context of words based on how they appear in text data.

Why Do LLMs Need Word Embeddings?

LLMs like GPT, BERT, and LLaMA work with numbers, not raw text. Embeddings convert words into numerical format so they can be processed by neural networks.

  • Without embeddings: The model treats words as arbitrary, independent tokens (e.g., “king” and “queen” would be unrelated).
  • With embeddings: The model understands relationships (e.g., “king” and “queen” are semantically close).

Key Idea: Words with similar meanings will have similar vector representations in the embedding space.
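
To make this key idea concrete, here is a minimal sketch of how “similar vectors” is typically measured, using cosine similarity. The three-dimensional vectors below are invented purely for illustration; real embeddings have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means "similar direction"
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only
king = [0.80, 0.65, 0.10]
queen = [0.75, 0.70, 0.15]
apple = [0.10, 0.20, 0.90]

print(cosine_similarity(king, queen))  # high score -> semantically close
print(cosine_similarity(king, apple))  # lower score -> less related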

Different Types of Word Embeddings

| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Word2Vec | Continuous Bag of Words (CBOW) predicts a target word from its context words; Skip-Gram predicts context words from a target word. | Captures semantic meaning well. | Cannot handle new words (fixed vocabulary). |
| GloVe | Uses global word co-occurrence statistics; represents words as vectors by factorizing a word-context co-occurrence matrix. | Captures global meaning better. | Static embeddings (no context awareness). |
| BERT | Contextual embeddings (different meaning in different sentences); uses a multi-layer bidirectional transformer encoder to generate context-dependent word embeddings. | Solves polysemy (e.g., “bank” as a river bank vs. a financial bank). | Heavier computation. |
| OpenAI Embeddings | Transformer-based, optimized for retrieval & search. | Best for LLM-based applications. | Requires API calls (not open-source). |
| ELMo | Contextualized word embeddings using a bidirectional LSTM (Long Short-Term Memory); generates context-dependent word embeddings. | Captures nuances of language. | Computationally expensive. |
| RoBERTa | Optimized BERT approach with dynamic masking and larger batch sizes; generates context-dependent word embeddings. | Improved performance over BERT. | Requires significant computational resources. |
| FastText | Extension of Word2Vec that uses subword information; represents words as a bag of character n-grams. | Handles out-of-vocabulary words. | Computationally expensive; may not capture semantic meaning as well. |
| Sentence-BERT | Siamese network (two identical sub-networks) approach that produces sentence embeddings for tasks like semantic search and clustering. | Effective for sentence-level tasks. | May not capture nuances of individual words. |
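
To see one of these classic methods in action, here is a minimal, illustrative sketch of training Word2Vec locally with the gensim library (assumes pip install gensim; the tiny corpus and parameter values are invented for demonstration, and real training needs far more text):

from gensim.models import Word2Vec

# Tiny toy corpus (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# sg=1 selects Skip-Gram; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["king"][:5])                  # first 5 dimensions of the "king" vector
print(model.wv.similarity("king", "queen"))  # cosine similarity between the two words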

Hands-on: Generating & Visualizing Word Embeddings

We’ll generate embeddings using:

  1. OpenAI’s API (for latest transformer-based embeddings).
  2. SentenceTransformers (open-source alternative).
  3. t-SNE visualization (to plot embeddings in 2D).

Example 1: Generating Embeddings Using OpenAI API

Best for real-world applications like search, retrieval, and semantic similarity

Step 1: Install OpenAI SDK

pip install openai

Step 2: Generate Embeddings

from openai import OpenAI

# Create a client with your API key (replace with your actual key,
# or set the OPENAI_API_KEY environment variable and omit the argument)
client = OpenAI(api_key="your_api_key")


def get_embedding(text, model="text-embedding-ada-002"):
    # Request an embedding for a single piece of text and return it as a list of floats
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding


# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]
# Generate embeddings
embeddings = [get_embedding(text) for text in texts]
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

📝 Notes:

  • OpenAI’s text-embedding-ada-002 is a widely used general-purpose embedding model that works well for LLM applications (newer options such as text-embedding-3-small are also available).
  • The embeddings can be used for semantic search, clustering, and recommendation systems, as sketched below.
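
As an illustration, here is a minimal sketch of a toy semantic search over these embeddings, reusing get_embedding, texts, and embeddings from the snippet above (the query string is invented for demonstration):

import numpy as np

# Reuse get_embedding, texts, and embeddings from the snippet above
query_vec = np.array(get_embedding("royalty"))
doc_vecs = np.array(embeddings)

# Cosine similarity between the query and every text
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))

# Print the texts ranked from most to least similar to the query
for text, score in sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{text}: {score:.3f}")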

Example 2: Using SentenceTransformers (Open-Source Alternative)

Best for local and offline applications

Step 1: Install SentenceTransformers

pip install sentence-transformers tf-keras

Step 2: Generate Embeddings

from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

📝 Notes:

  • all-MiniLM-L6-v2 is a lightweight model optimized for speed and accuracy.
  • The embeddings are useful for tasks such as classification, clustering, and semantic search; a quick similarity check is sketched below.
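
As a quick follow-up, here is a minimal sketch that compares a few of these embeddings with the library’s built-in cosine-similarity helper, sentence_transformers.util.cos_sim (it reuses model from the snippet above):

from sentence_transformers import util

# Encode individual words as tensors (reuses `model` from the previous snippet)
king_vec = model.encode("king", convert_to_tensor=True)
queen_vec = model.encode("queen", convert_to_tensor=True)
banana_vec = model.encode("banana", convert_to_tensor=True)

print("king vs queen :", util.cos_sim(king_vec, queen_vec).item())
print("king vs banana:", util.cos_sim(king_vec, banana_vec).item())

Related words should score noticeably higher than unrelated ones.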

Example 3: Visualizing Embeddings with t-SNE

Helps understand how words are related in the embedding space.

Step 1: Install Matplotlib & SciKit-Learn

pip install matplotlib scikit-learn

Step 2: Visualize Embeddings

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)  # perplexity must be smaller than the number of samples (6 here)
embeddings_2d = tsne.fit_transform(embeddings)  # reduce each 384-dimensional vector to 2D

# Plot embeddings
plt.figure(figsize=(6, 6))
for i, text in enumerate(texts):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, text, fontsize=12)

plt.title("t-SNE Visualization of Word Embeddings")
plt.show()

Output:

[Figure: tSNE-Visual-Embedding.png — t-SNE scatter plot of the six example words in 2D]

📝 What You’ll See:

  • Similar words (e.g., “king” and “queen”) should appear close to each other.
  • Dissimilar words (e.g., “king” and “banana”) should be far apart.

In the example above, we tuned three t-SNE parameters: perplexity, random_state, and n_components. Here is a more detailed explanation of each, with examples:

Perplexity

In t-SNE, perplexity is roughly the effective number of nearest neighbors considered for each point; it balances attention between local and global structure. Typical values range from 5 to 50, and the value must be smaller than the number of samples.

  • High perplexity: Each point is compared against more neighbors, emphasizing global structure (requires a larger dataset).
  • Low perplexity: Each point is compared against fewer neighbors, emphasizing local clusters.

Example:

from sklearn.manifold import TSNE

# Higher perplexity: more neighbors per point, more emphasis on global structure
# (note: perplexity must be smaller than the number of samples)
tsne_high_perplexity = TSNE(n_components=2, perplexity=50)
# Lower perplexity: fewer neighbors per point, more emphasis on local clusters
tsne_low_perplexity = TSNE(n_components=2, perplexity=5)

Random State

Random state ensures reproducibility of results by setting the seed for random number generation.

  • Fixed random state (e.g., random_state=42): Produces the same layout every time the code is run.
  • Unset random state (random_state=None): Produces a different layout on each run.

Example:

from sklearn.manifold import TSNE

# Fixed random state (reproducible results)
tsne_fixed_random_state = TSNE(n_components=2, random_state=42)
# No fixed seed (layout changes between runs)
tsne_random_random_state = TSNE(n_components=2, random_state=None)

n_components

n_components specifies the number of dimensions to reduce the data to.

  • Low n_components (2 or 3): Reduces the data to a plottable number of dimensions, discarding some information.
  • High n_components: Retains more dimensions but defeats the purpose of visualization; scikit-learn’s default barnes_hut method supports at most 3 components, so higher values require method="exact".

Example:

from sklearn.manifold import TSNE

# Low n_components (reduce to 2D for plotting)
tsne_low_n_components = TSNE(n_components=2)
# High n_components (reduce to 10D; method="exact" is required because barnes_hut supports at most 3 components)
tsne_high_n_components = TSNE(n_components=10, method="exact")

Summary so far!

  • Word embeddings convert text into numbers so LLMs can process them.
  • Different methods (Word2Vec, GloVe, BERT, OpenAI) have pros & cons.
  • SentenceTransformers is a free, offline alternative.
  • t-SNE helps visualize relationships between words.

Real-World Use Cases

  • Semantic Search: Find similar documents using embeddings.
  • Chatbots & Q&A Systems: Improve response relevance.
  • Recommendation Systems: Recommend similar items.
  • Text Clustering: Group similar content automatically (see the sketch below).
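
To ground the text-clustering use case, here is a minimal sketch that clusters a handful of sentences with SentenceTransformers and scikit-learn’s KMeans (the example sentences and the choice of two clusters are illustrative assumptions):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Small illustrative corpus mixing two rough topics (royalty vs. fruit)
docs = [
    "The king addressed the royal court.",
    "The queen hosted a banquet at the palace.",
    "Apples and bananas are rich in fiber.",
    "I added mango and banana to the smoothie.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(docs)

# Group the sentences into 2 clusters based on embedding similarity
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(doc_embeddings)

for doc, label in zip(docs, labels):
    print(f"cluster {label}: {doc}")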