1. Embeddings & Vector Representation

What Are Word Embeddings?

Word embeddings are numerical representations of words in a continuous vector space. These embeddings capture the meaning, relationships, and context of words based on how they appear in text data.

Why Do LLMs Need Word Embeddings?

LLMs like GPT, BERT, and LLaMA work with numbers, not raw text. Embeddings convert words into numerical format so they can be processed by neural networks.

  • Without embeddings: The model treats words like independent tokens (e.g., “king” and “queen” would be unrelated).
  • With embeddings: The model understands relationships (e.g., “king” and “queen” are semantically close).

Key Idea: Words with similar meanings will have similar vector representations in the embedding space.
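
One way to make "similar vector representations" concrete is cosine similarity, which scores two vectors close to 1 when they point in nearly the same direction and near 0 when they are unrelated. The tiny sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions.

import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-D embeddings, hand-made for illustration only
king = [0.80, 0.65, 0.10]
queen = [0.75, 0.70, 0.15]
banana = [0.05, 0.10, 0.90]

print("king vs queen :", round(cosine_similarity(king, queen), 3))   # high (close to 1)
print("king vs banana:", round(cosine_similarity(king, banana), 3))  # low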

Different Types of Word Embeddings

| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Word2Vec | Continuous Bag of Words (CBOW) predicts a target word from its context words; Skip-Gram predicts context words from a target word. | Captures semantic meaning well. | Cannot handle new words (fixed vocabulary). |
| GloVe | Uses word co-occurrence statistics; represents words as vectors by factorizing a word-context co-occurrence matrix. | Captures global meaning better. | Static embeddings (no context awareness). |
| BERT | Contextual embeddings (different meaning in different sentences); uses a multi-layer bidirectional transformer encoder to generate context-dependent word embeddings. | Solves polysemy (e.g., “bank” as a riverbank vs. a financial institution). | Heavier computation. |
| OpenAI Embeddings | Transformer-based, optimized for retrieval & search. | Best for LLM-based applications. | Requires API calls (not open-source). |
| ELMo | Contextualized word embeddings using a bidirectional LSTM (Long Short-Term Memory); generates context-dependent word embeddings. | Captures nuances of language. | Computationally expensive. |
| RoBERTa | Optimized BERT approach with dynamic masking and larger batch sizes; generates context-dependent word embeddings. | Improved performance over BERT. | Requires significant computational resources. |
| FastText | Extension of Word2Vec that uses subword information; represents words as a bag of character n-grams. | Handles out-of-vocabulary words. | Computationally expensive; may not capture semantic meaning as well. |
| Sentence-BERT | Siamese network (two identical sub-networks sharing weights) approach for sentence embeddings; generates sentence embeddings for tasks like semantic search and clustering. | Effective for sentence-level tasks. | May not capture nuances of individual words. |
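
To see the static-vs-contextual distinction from the table in practice, here is a minimal sketch (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available) that extracts BERT’s vector for the word “bank” in two sentences. The two vectors differ because BERT conditions on the surrounding words, whereas a static method such as Word2Vec would return the same vector in both cases.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return BERT's contextual vector for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")  # position of the word "bank"
    return outputs.last_hidden_state[0, idx]

v_money = bank_vector("I deposited money at the bank.")
v_river = bank_vector("We sat on the bank of the river.")

# The two "bank" vectors differ because BERT looks at the surrounding words
similarity = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")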

Hands-on: Generating & Visualizing Word Embeddings

We’ll generate embeddings using:

  1. OpenAI’s API (for latest transformer-based embeddings).
  2. SentenceTransformers (open-source alternative).
  3. t-SNE visualization (to plot embeddings in 2D).

Example 1: Generating Embeddings Using OpenAI API

Best for real-world applications like search, retrieval, and semantic similarity

Step 1: Install OpenAI SDK

pip install openai

Step 2: Generate Embeddings

import openai

# OpenAI API key (replace with your actual key)
OPENAI_API_KEY = "your_api_key"

# Create a client once; the key is supplied here rather than on every request
client = openai.OpenAI(api_key=OPENAI_API_KEY)


def get_embedding(text, model="text-embedding-ada-002"):
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding


# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]
# Generate embeddings
embeddings = [get_embedding(text) for text in texts]
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

📝 Notes:

  • OpenAI’s text-embedding-ada-002 model produces high-quality, general-purpose embeddings and works well for LLM applications.
  • The embeddings can be used for semantic search, clustering, and recommendation systems.
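
To make the semantic-search point concrete, the sketch below ranks a few documents against a query by cosine similarity. It is a minimal sketch that assumes the get_embedding function from Step 2 is defined and that numpy is installed; the documents and query are made up for illustration.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical mini-corpus and query, for illustration only
documents = [
    "Cats are small domesticated felines.",
    "Bananas are rich in potassium.",
    "The queen addressed the parliament.",
]
query = "Which fruit contains potassium?"

doc_embeddings = [get_embedding(doc) for doc in documents]
query_embedding = get_embedding(query)

# Rank documents by similarity to the query (highest first)
scores = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")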

Example 2: Using SentenceTransformers (Open-Source Alternative)

Best for local and offline applications

Step 1: Install SentenceTransformers

pip install sentence-transformers tf-keras

Step 2: Generate Embeddings

from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

📝 Notes:

  • all-MiniLM-L6-v2 is a lightweight model optimized for speed and accuracy.
  • The embeddings are useful for classification, search, and NLP tasks.
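
The library also ships a cosine-similarity helper, so you can check which of the example words land closest together, previewing what the t-SNE plot in Example 3 shows visually. This sketch assumes the texts and embeddings variables from Step 2 are still in scope.

import torch
from sentence_transformers import util

# Pairwise cosine similarities between all six embeddings
similarity_matrix = util.cos_sim(embeddings, embeddings)

# For each word, report its closest neighbour (ignoring itself)
for i, text in enumerate(texts):
    scores = similarity_matrix[i].clone()
    scores[i] = -1.0  # exclude the word itself
    nearest = texts[int(torch.argmax(scores))]
    print(f"{text} -> {nearest}")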

Example 3: Visualizing Embeddings with t-SNE

Helps understand how words are related in the embedding space.

Step 1: Install Matplotlib & SciKit-Learn

pip install matplotlib scikit-learn

Step 2: Visualize Embeddings

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Example texts
texts = ["king", "queen", "apple", "banana", "dog", "cat"]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=False)
print("Generated Embeddings:", embeddings[:2])  # Print first 2 embeddings

# Reduce dimensions using t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot embeddings
plt.figure(figsize=(6, 6))
for i, text in enumerate(texts):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, text, fontsize=12)

plt.title("t-SNE Visualization of Word Embeddings")
plt.title("t-SNE Visualization of Word Embeddings")
plt.show()

Output:

tSNE-Visual-Embedding.png (2D scatter plot of the six example words)

📝 What You’ll See:

  • Similar words (e.g., “king” and “queen”) should appear close to each other.
  • Dissimilar words (e.g., “king” and “banana”) should be far apart.

In the example above, we used three t-SNE parameters: perplexity, random_state, and n_components. Here is a more detailed explanation of each, with examples:

Perplexity

In t-SNE, perplexity roughly controls how many nearest neighbours each point takes into account when the low-dimensional map is built. It balances local versus global structure and must be smaller than the number of data points.

  • High perplexity: Each point considers many neighbours, emphasizing global structure (typical for larger datasets).
  • Low perplexity: Each point considers few neighbours, emphasizing local structure (appropriate for small datasets like our six words).

Example:

from sklearn.manifold import TSNE

# High perplexity (each point considers many neighbours; needs enough data points)
tsne_high_perplexity = TSNE(n_components=2, perplexity=50)
# Low perplexity (each point considers few neighbours; suits small datasets)
tsne_low_perplexity = TSNE(n_components=2, perplexity=5)

Random State

Random state ensures reproducibility of results by setting the seed for random number generation.

  • Fixed random state: Produces the same results every time the code is run.
  • Unset random state (None): May produce different results on each run.

Example:

from sklearn.manifold import TSNE

# Fixed random state (reproducible results)
tsne_fixed_random_state = TSNE(n_components=2, random_state=42)
# Unset random state (results may differ between runs)
tsne_random_random_state = TSNE(n_components=2, random_state=None)

n_components

n_components specifies the number of dimensions to reduce the data to. For visualization, 2 or 3 dimensions are the standard choice.

  • Low n_components (2 or 3): Easy to plot, but discards some information.
  • High n_components (above 3): Retains more structure, but is rarely useful with t-SNE and requires scikit-learn’s slower "exact" solver.

Example:

from sklearn.manifold import TSNE

# Low n_components (reduce to 2D, the usual choice for plotting)
tsne_low_n_components = TSNE(n_components=2)
# High n_components (reduce to 10D; values above 3 require the slower "exact" solver)
tsne_high_n_components = TSNE(n_components=10, method="exact")

Summary so far!

  • Word embeddings convert text into numbers so LLMs can process them.
  • Different methods (Word2Vec, GloVe, BERT, OpenAI) have pros & cons.
  • SentenceTransformers is a free, offline alternative.
  • t-SNE helps visualize relationships between words.

Real-World Use Cases

  • Semantic Search: Find similar documents using embeddings.
  • Chatbots & Q&A Systems: Improve response relevance.
  • Recommendation Systems: Recommend similar items.
  • Text Clustering: Group similar content automatically.
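
As a small illustration of the text-clustering use case, the sketch below groups a few short sentences with KMeans on top of SentenceTransformers embeddings. The sentences and the choice of two clusters are assumptions made purely for illustration.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical snippets covering roughly two topics (animals vs. fruit)
sentences = [
    "Cats love to sleep during the day.",
    "Dogs are loyal companions.",
    "Bananas are a good source of potassium.",
    "Apples keep well in the fridge.",
]

embeddings = model.encode(sentences)

# Group the sentences into two clusters based on embedding similarity
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

for sentence, label in zip(sentences, labels):
    print(f"Cluster {label}: {sentence}")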

Subsections of Embeddings & Vector Representation

4.1. Custom Embeddings

Custom Embeddings: Why, When, and How?

Why Would You Need Custom Embeddings?

Pre-trained embeddings (like OpenAI’s text-embedding-ada-002 or SentenceTransformers) work well in most cases. However, custom embeddings are necessary when:

  1. Domain-Specific Knowledge

    • If you’re working with medical, legal, finance, or technical text, general-purpose embeddings may not capture key relationships.
    • Example: “BP” in general NLP models means “British Petroleum,” but in medicine, it means “Blood Pressure.”
  2. Multilingual Support

    • Many embedding models are optimized for English, so custom training is needed for non-English or code-mixed languages (e.g., Hinglish).
  3. Fine-Tuned Retrieval & Search

    • If you’re building a semantic search system, fine-tuning embeddings on your own dataset gives better results than generic embeddings.
  4. Industry-Specific Search & Clustering

    • A legal search engine should rank case laws before blogs.
    • A medical chatbot should understand symptoms better than casual conversations.

Pre-trained vs. Custom Embeddings: Pros & Cons

| Feature | Pre-trained Embeddings (e.g., OpenAI, BERT) | Custom Trained Embeddings |
|---|---|---|
| Training Data | General internet text, books, Wikipedia | Your own dataset (domain-specific) |
| Performance | Good for broad use cases | Excellent for domain-specific tasks |
| New Vocabulary | Cannot handle completely unseen words | Learns domain-specific terms |
| Computational Cost | Free or API-based | Requires GPUs & storage |
| Ease of Use | Ready to use | Requires training & maintenance |

How to Create Custom Embeddings?

We’ll explore two main approaches:

  1. Fine-tuning an existing model (Easier, uses SentenceTransformers).
  2. Training from scratch (Harder, needs large datasets).

Approach 1: Fine-Tuning SentenceTransformers on Custom Data

Best for cases where you already have good embeddings but need slight adjustments.

Step 1: Install SentenceTransformers

pip install sentence-transformers tf-keras datasets transformers[torch] sentencepiece

Step 2: Prepare Your Dataset

You need pairs of sentences where one is the query, and the other is a relevant response.

Example Dataset (Medical Search Engine):

[
  {"query": "What are the symptoms of diabetes?", "response": "Common symptoms include frequent urination and fatigue."},
  {"query": "How to lower blood pressure?", "response": "A low-sodium diet and regular exercise help lower blood pressure."}
]
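
If you store the pairs in a JSON file like the one above, they can be loaded and converted into the InputExample objects that SentenceTransformers expects. A minimal sketch, assuming the file is saved as medical_pairs.json (a hypothetical filename); Step 3 below simply hard-codes the same two pairs for brevity.

import json
from sentence_transformers import InputExample

# Load the query/response pairs from the JSON file shown above
# ("medical_pairs.json" is a hypothetical filename)
with open("medical_pairs.json", "r", encoding="utf-8") as f:
    pairs = json.load(f)

# Each pair becomes an InputExample with the query and its relevant response
train_data = [InputExample(texts=[pair["query"], pair["response"]]) for pair in pairs]
print(f"Loaded {len(train_data)} training pairs")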

Step 3: Fine-Tune SentenceTransformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import torch

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare training data (each query paired with a relevant response)
train_data = [
    InputExample(texts=["What are the symptoms of diabetes?", "Common symptoms include frequent urination and fatigue."]),
    InputExample(texts=["How to lower blood pressure?", "A low-sodium diet and regular exercise help lower blood pressure."])
]

# Convert to DataLoader with batch size 2
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other responses in a batch as negatives,
# pulling each query towards its own response and away from the rest
train_loss = losses.MultipleNegativesRankingLoss(model)

# Train on GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100)

# Save the model
model.save("custom_medical_embeddings")

Step 4: Use Custom Embeddings

custom_model = SentenceTransformer("custom_medical_embeddings")
custom_model_embedding = custom_model.encode("What are the symptoms of diabetes?")
print(custom_model_embedding)

Result: Your model now generates custom embeddings tailored to medical queries.


Approach 2: Training Word Embeddings from Scratch (Word2Vec / FastText)

Best when no suitable pre-trained embeddings exist (e.g., for a new language or industry-specific jargon).

Step 1: Install Gensim

pip install gensim

Step 2: Train Word2Vec on Your Dataset

from gensim.models import Word2Vec


# Sample medical text data
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]


# Train Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)


# Save and load the model
model.save("word2vec_medical.model")
model = Word2Vec.load("word2vec_medical.model")


# Get embedding for a word
print(model.wv["diabetes"])  # Prints embedding for "diabetes"

Result: You now have word embeddings trained on custom medical terminology.
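
The heading above also mentions FastText; in gensim its API mirrors Word2Vec, and because FastText builds vectors from character n-grams it can return an embedding even for a word that never appeared in the training data. A minimal sketch reusing the toy corpus above (with so little data the vectors are illustrative only):

from gensim.models import FastText

# Same toy medical corpus as above
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]

# Train FastText; subword n-grams let it embed unseen words
model = FastText(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)

# "diabetic" never appears in the corpus, but FastText still returns a vector
print("diabetic" in model.wv.key_to_index)  # False - out of vocabulary
print(model.wv["diabetic"][:5])             # embedding built from character n-grams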


Real-World Applications of Custom Embeddings

  • Medical Chatbots: Fine-tune embeddings to understand medical queries.
  • Legal Document Search: Optimize embeddings for law-related searches.
  • E-commerce Search: Improve product recommendations with domain-specific embeddings.
  • Multilingual NLP: Train embeddings for underrepresented languages.

When to Use Which Approach?

| Situation | Best Approach |
|---|---|
| You need fast, ready-to-use embeddings | OpenAI API or SentenceTransformers |
| You need embeddings fine-tuned for a specific task | Fine-tune SentenceTransformers |
| You’re working with a new domain/language | Train Word2Vec/FastText from scratch |
| You want contextual embeddings (different meaning in different sentences) | Fine-tune a Transformer model (BERT, GPT, etc.) |

Summary

  • Custom embeddings outperform pre-trained ones in domain-specific tasks.
  • Fine-tuning pre-trained embeddings is easier & more efficient than training from scratch.
  • Training from scratch is best for new languages or highly specialized fields.

4.2. Custom Embeddings - Examples