4.1. Custom Embeddings

Custom Embeddings: Why, When, and How?

Why Would You Need Custom Embeddings?

Pre-trained embeddings (like OpenAI’s text-embedding-ada-002 or SentenceTransformers models) work well for most use cases. However, custom embeddings are necessary when:

  1. Domain-Specific Knowledge

    • If you’re working with medical, legal, finance, or technical text, general-purpose embeddings may not capture key relationships.
    • Example: a general-purpose model may associate “BP” with “British Petroleum,” while in medical text it usually means “blood pressure.”
  2. Multilingual Support

    • Many embedding models are optimized for English, so custom training is needed for non-English or code-mixed languages (e.g., Hinglish).
  3. Fine-Tuned Retrieval & Search

    • If you’re building a semantic search system, fine-tuning embeddings on your own dataset gives better results than generic embeddings.
  4. Industry-Specific Search & Clustering

    • A legal search engine should rank case laws before blogs.
    • A medical chatbot should understand symptoms better than casual conversations.

Pre-trained vs. Custom Embeddings: Pros & Cons

| Feature | Pre-trained Embeddings (e.g., OpenAI, BERT) | Custom-Trained Embeddings |
|---|---|---|
| Training Data | General internet text, books, Wikipedia | Your own dataset (domain-specific) |
| Performance | Good for broad use cases | Excellent for domain-specific tasks |
| New Vocabulary | May miss the meaning of unseen domain terms | Learns domain-specific terms |
| Computational Cost | Free or API-based | Requires GPUs & storage |
| Ease of Use | Ready to use | Requires training & maintenance |

How to Create Custom Embeddings?

We’ll explore two main approaches:

  1. Fine-tuning an existing model (Easier, uses SentenceTransformers).
  2. Training from scratch (Harder, needs large datasets).

Approach 1: Fine-Tuning SentenceTransformers on Custom Data

Best for cases where you already have good general-purpose embeddings but need to adapt them slightly to your domain.

Step 1: Install SentenceTransformers

pip install sentence-transformers tf-keras datasets transformers[torch] sentencepiece

Step 2: Prepare Your Dataset

You need pairs of sentences where one is the query and the other is a relevant response.

Example Dataset (Medical Search Engine):

[
  {"query": "What are the symptoms of diabetes?", "response": "Common symptoms include frequent urination and fatigue."},
  {"query": "How to lower blood pressure?", "response": "A low-sodium diet and regular exercise help lower blood pressure."}
]
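
If you store these pairs in a JSON file, they can be loaded and converted into SentenceTransformers InputExample objects before training. A minimal sketch (the file name medical_pairs.json is only a placeholder):

import json
from sentence_transformers import InputExample

# Load the query/response pairs (assumes the JSON above was saved as medical_pairs.json)
with open("medical_pairs.json") as f:
    pairs = json.load(f)

# Each pair becomes one training example: a query and its relevant response
train_data = [InputExample(texts=[p["query"], p["response"]]) for p in pairs]
print(f"Loaded {len(train_data)} training pairs")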

Step 3: Fine-Tune SentenceTransformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import torch

# Load a pre-trained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare training data
train_data = [
    InputExample(texts=["What are the symptoms of diabetes?", "Common symptoms include frequent urination and fatigue."]),
    InputExample(texts=["How to lower blood pressure?", "A low-sodium diet and regular exercise help lower blood pressure."])
]

# Convert to DataLoader with batch size 2
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other responses in the batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

# Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100)

# Save the model
model.save("custom_medical_embeddings")

Step 4: Use Custom Embeddings

custom_model = SentenceTransformer("custom_medical_embeddings")
custom_model_embedding = custom_model.encode("What are the symptoms of diabetes?")
print(custom_model_embedding)

Result: Your model now generates custom embeddings tailored to medical queries.
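
To sanity-check the fine-tuned model, you can rank candidate answers against a query by cosine similarity. A minimal sketch using sentence_transformers.util (the candidate sentences below are just the toy examples from the training data):

from sentence_transformers import SentenceTransformer, util

custom_model = SentenceTransformer("custom_medical_embeddings")

# Candidate answers to search over
corpus = [
    "Common symptoms include frequent urination and fatigue.",
    "A low-sodium diet and regular exercise help lower blood pressure.",
]
corpus_embeddings = custom_model.encode(corpus, convert_to_tensor=True)

# Embed the query and rank the candidates by cosine similarity
query_embedding = custom_model.encode("What are the symptoms of diabetes?", convert_to_tensor=True)
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best], float(scores[best]))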


Approach 2: Training Word Embeddings from Scratch (Word2Vec / FastText)

Best when no suitable pre-trained embeddings exist (e.g., for a new language or industry-specific jargon).

Step 1: Install Gensim

pip install gensim

Step 2: Train Word2Vec on Your Dataset

from gensim.models import Word2Vec


# Sample medical text data
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]


# Train Word2Vec model
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)


# Save and load the model
model.save("word2vec_medical.model")
model = Word2Vec.load("word2vec_medical.model")


# Get embedding for a word
print(model.wv["diabetes"])  # Prints embedding for "diabetes"

Result: You now have word embeddings trained on custom medical terminology.
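
The heading above also mentions FastText. Unlike Word2Vec, FastText builds vectors from character n-grams, so it can embed domain terms it never saw during training. A minimal sketch with Gensim, reusing the same toy corpus (results on such a tiny corpus are illustrative only):

from gensim.models import FastText

# Same toy corpus as in Step 2
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]

# FastText composes word vectors from character n-grams
ft_model = FastText(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)

# "hyperglycemia" never appears in the corpus, but FastText still builds a vector
# from its subword n-grams instead of raising a KeyError
print(ft_model.wv["hyperglycemia"][:5])

# Nearest neighbours of a seen word (scores on a toy corpus are not meaningful)
print(ft_model.wv.most_similar("diabetes", topn=3))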


Real-World Applications of Custom Embeddings

  • Medical Chatbots - Fine-tune embeddings to understand medical queries.
  • Legal Document Search - Optimize embeddings for law-related searches.
  • E-commerce Search - Improve product recommendations with domain-specific embeddings.
  • Multilingual NLP - Train embeddings for underrepresented languages.

When to Use Which Approach?

| Situation | Best Approach |
|---|---|
| You need fast, ready-to-use embeddings | OpenAI API or SentenceTransformers |
| You need embeddings fine-tuned for a specific task | Fine-tune SentenceTransformers |
| You’re working with a new domain/language | Train Word2Vec/FastText from scratch |
| You want contextual embeddings (different meanings in different sentences) | Fine-tune a Transformer model (BERT, GPT, etc.) |

Summary

  • Custom embeddings outperform pre-trained ones in domain-specific tasks.
  • Fine-tuning pre-trained embeddings is easier & more efficient than training from scratch.
  • Training from scratch is best for new languages or highly specialized fields.