4.1. Custom Embeddings
Custom Embeddings: Why, When, and How?
Why Would You Need Custom Embeddings?
Pre-trained embeddings (like OpenAI’s text-embedding-ada-002 or SentenceTransformers) work well in most cases. However, custom embeddings are necessary when:
- Domain-Specific Knowledge
  - If you’re working with medical, legal, financial, or technical text, general-purpose embeddings may not capture key relationships.
  - Example: in general NLP models, “BP” is likely to be read as “British Petroleum,” but in medicine it means “Blood Pressure” (see the short check after this list).
- Multilingual Support
  - Many embedding models are optimized for English, so custom training is needed for non-English or code-mixed languages (e.g., Hinglish).
- Fine-Tuned Retrieval & Search
  - If you’re building a semantic search system, fine-tuning embeddings on your own dataset gives better results than generic embeddings.
- Industry-Specific Search & Clustering
  - A legal search engine should rank case law above blog posts.
  - A medical chatbot should understand symptom descriptions better than casual conversation.
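As a quick illustration of the domain-ambiguity point, here is a minimal sketch that checks how a general-purpose model places an ambiguous abbreviation relative to its two readings. The sentences are made up for this check, and the exact similarity scores depend on the model:

```python
from sentence_transformers import SentenceTransformer, util

# A general-purpose model, not tuned for any particular domain
model = SentenceTransformer("all-MiniLM-L6-v2")

# "BP" is ambiguous: blood pressure vs. the oil company (illustrative sentences)
texts = [
    "The patient's BP was 140 over 90.",
    "Blood pressure should be checked regularly.",
    "British Petroleum reported quarterly earnings.",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Compare the ambiguous sentence against each reading; scores vary by model
print("vs. blood pressure:  ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("vs. British Petroleum:", util.cos_sim(embeddings[0], embeddings[2]).item())
```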
Pre-trained vs. Custom Embeddings: Pros & Cons
| Feature | Pre-trained Embeddings (e.g., OpenAI, BERT) | Custom Trained Embeddings |
|---|---|---|
| Training Data | General internet text, books, Wikipedia | Your own dataset (domain-specific) |
| Performance | Good for broad use cases | Excellent for domain-specific tasks |
| New Vocabulary | May not represent unseen domain-specific terms well | Learns domain-specific terms |
| Computational Cost | Free or API-based | Requires GPUs & storage |
| Ease of Use | Ready to use | Requires training & maintenance |
How to Create Custom Embeddings?
We’ll explore two main approaches:
- Fine-tuning an existing model (Easier, uses SentenceTransformers).
- Training from scratch (Harder, needs large datasets).
Approach 1: Fine-Tuning SentenceTransformers on Custom Data
**Best for cases where you already have good embeddings but need slight adjustments.**
Step 1: Install SentenceTransformers
```bash
pip install sentence-transformers tf-keras datasets transformers[torch] sentencepiece
```
Step 2: Prepare Your Dataset
You need pairs of sentences where one is the query, and the other is a relevant response.
Example Dataset (Medical Search Engine):
```json
[
  {"query": "What are the symptoms of diabetes?", "response": "Common symptoms include frequent urination and fatigue."},
  {"query": "How to lower blood pressure?", "response": "A low-sodium diet and regular exercise help lower blood pressure."}
]
```
Step 3: Fine-Tune SentenceTransformers
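If your pairs live in a JSON file like the one above, a minimal sketch for loading them into InputExample objects might look like this (the file name medical_pairs.json is an assumption for this sketch):

```python
import json
from sentence_transformers import InputExample

# Load the query/response pairs shown in Step 2
# ("medical_pairs.json" is an assumed file name for this sketch)
with open("medical_pairs.json", "r", encoding="utf-8") as f:
    pairs = json.load(f)

# Each pair becomes one training example: (query, relevant response)
train_data = [InputExample(texts=[p["query"], p["response"]]) for p in pairs]
```

The script below builds the same two examples inline so it runs without an external file.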
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import torch

# Load a pre-trained model as the starting point
model = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare training data: each example pairs a query with a relevant response
train_data = [
    InputExample(texts=["What are the symptoms of diabetes?", "Common symptoms include frequent urination and fatigue."]),
    InputExample(texts=["How to lower blood pressure?", "A low-sodium diet and regular exercise help lower blood pressure."])
]

# Wrap the examples in a DataLoader with batch size 2
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=2)

# Use a contrastive-style loss for fine-tuning
train_loss = losses.MultipleNegativesRankingLoss(model)

# Train the model (on GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3, warmup_steps=100)

# Save the fine-tuned model
model.save("custom_medical_embeddings")
```
Step 4: Use Custom Embeddings
```python
custom_model = SentenceTransformer("custom_medical_embeddings")
custom_model_embedding = custom_model.encode("What are the symptoms of diabetes?")
print(custom_model_embedding)
```
Result: Your model now generates custom embeddings tailored to medical queries.
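To see the fine-tuned embeddings in a retrieval setting, here is a minimal sketch that ranks candidate answers by cosine similarity. The candidate sentences are illustrative, and sentence_transformers.util.cos_sim is used for the similarity computation:

```python
from sentence_transformers import SentenceTransformer, util

custom_model = SentenceTransformer("custom_medical_embeddings")

# Illustrative candidate answers to rank against a query
candidates = [
    "Common symptoms include frequent urination and fatigue.",
    "A low-sodium diet and regular exercise help lower blood pressure.",
]

query_embedding = custom_model.encode("What are the symptoms of diabetes?", convert_to_tensor=True)
candidate_embeddings = custom_model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the query and each candidate; higher = more relevant
scores = util.cos_sim(query_embedding, candidate_embeddings)[0]
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {text}")
```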
Approach 2: Training Word Embeddings from Scratch (Word2Vec / FastText)
Best when no suitable pre-trained embeddings exist (e.g., for a new language or industry-specific jargon).
Step 1: Install Gensim
```bash
pip install gensim
```
Step 2: Train Word2Vec on Your Dataset
```python
from gensim.models import Word2Vec

# Sample medical text data (each sentence is a list of tokens)
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]

# Train a Word2Vec model on the corpus
model = Word2Vec(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)

# Save and reload the model
model.save("word2vec_medical.model")
model = Word2Vec.load("word2vec_medical.model")

# Get the embedding for a word
print(model.wv["diabetes"])  # Prints the 50-dimensional vector for "diabetes"
```
Result: You now have word embeddings trained on custom medical terminology.
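This approach’s heading also mentions FastText. Here is a minimal sketch of the same training with Gensim’s FastText class, which can compose vectors for words it never saw during training by using character n-grams (the out-of-vocabulary word below is only an illustration):

```python
from gensim.models import FastText

# Reuse the same tokenized medical corpus from the Word2Vec example
corpus = [
    ["diabetes", "causes", "high", "blood", "sugar"],
    ["high", "blood", "pressure", "treatment", "exercise"],
    ["hypertension", "is", "related", "to", "high", "blood", "pressure"]
]

# Train a FastText model; subword (character n-gram) information lets it
# build vectors for unseen words such as misspellings or rare jargon
model = FastText(sentences=corpus, vector_size=50, window=5, min_count=1, workers=4)

# "prediabetes" never appears in the corpus, but FastText can still compose a
# vector for it from overlapping character n-grams (illustrative example)
print(model.wv["prediabetes"])
```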
Real-World Applications of Custom Embeddings
✔ Medical Chatbots - Fine-tune embeddings to understand medical queries.
✔ Legal Document Search - Optimize embeddings for law-related searches.
✔ E-commerce Search - Improve product recommendations with domain-specific embeddings.
✔ Multilingual NLP - Train embeddings for underrepresented languages.
When to Use Which Approach?
| Situation | Best Approach |
|---|---|
| You need fast, ready-to-use embeddings | OpenAI API or SentenceTransformers |
| You need embeddings fine-tuned for a specific task | Fine-tune SentenceTransformers |
| You’re working with a new domain/language | Train Word2Vec/FastText from scratch |
| You want contextual embeddings (different meaning in different sentences) | Fine-tune a Transformer model (BERT, GPT, etc.) |
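For the last row of the table, here is a minimal sketch of obtaining contextual embeddings from a pre-trained Transformer with the transformers library. The model name bert-base-uncased and the mean-pooling step are assumptions for illustration; the fine-tuning loop itself is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is an illustrative choice; any BERT-style checkpoint works similarly
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The bank raised interest rates.",     # financial sense of "bank"
    "We had a picnic on the river bank.",  # geographic sense of "bank"
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings (ignoring padding) to get one vector per sentence;
# unlike static word vectors, the token embeddings differ with sentence context
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (2, 768) for bert-base-uncased
```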
Summary
- Custom embeddings outperform pre-trained ones in domain-specific tasks.
- Fine-tuning pre-trained embeddings is easier & more efficient than training from scratch.
- Training from scratch is best for new languages or highly specialized fields.