Embeddings & Similarity Search: From Text to Vectors to Meaning
Embeddings convert text, images, and other data into dense numerical vectors where similar items are close together. This single idea powers semantic search, recommendations, clustering, and anomaly detection.
What Are Embeddings?#
An embedding is a fixed-size vector (array of floats) that captures the meaning of input data:
"How do I reset my password?" → [0.021, -0.134, 0.087, ..., 0.045] (1536 dims)
"I forgot my login credentials" → [0.019, -0.128, 0.091, ..., 0.041] (1536 dims)
"The weather is nice today" → [-0.203, 0.067, -0.112, ..., 0.189] (1536 dims)
The first two vectors are close (similar meaning). The third is far away (different topic). Distance in embedding space equals semantic difference.
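A quick way to see this numerically, using made-up three-dimensional vectors in place of real embeddings (real models produce hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dim "embeddings" standing in for real model output
reset_password = np.array([0.9, 0.1, 0.0])
forgot_login   = np.array([0.8, 0.2, 0.1])
weather        = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(reset_password, forgot_login))  # high, ~0.98
print(cosine_similarity(reset_password, weather))       # low, ~0.01
```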
Embedding Models#
OpenAI Embeddings#
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-large",
input="How do I reset my password?"
)
vector = response.data[0].embedding # 3072 dimensions
Models available:
- text-embedding-3-small — 1536 dims, cheapest, good for most use cases
- text-embedding-3-large — 3072 dims, highest quality, supports dimension truncation
OpenAI's v3 models support Matryoshka embeddings — you can truncate to fewer dimensions (e.g., 256) with minimal quality loss, saving storage and compute.
Cohere Embeddings#
import cohere
co = cohere.Client("your-api-key")
response = co.embed(
texts=["How do I reset my password?"],
model="embed-english-v3.0",
input_type="search_document"
)
vector = response.embeddings[0] # 1024 dimensions
Cohere distinguishes between search_document and search_query input types, which improves retrieval quality. They also offer multilingual models covering 100+ languages.
Sentence-Transformers (Open Source)#
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["How do I reset my password?"])
# 384 dimensions, runs locally, no API costs
Popular models:
- all-MiniLM-L6-v2 — 384 dims, fast, good baseline
- all-mpnet-base-v2 — 768 dims, higher quality
- BAAI/bge-large-en-v1.5 — 1024 dims, top benchmark scores
- nomic-ai/nomic-embed-text-v1.5 — 768 dims, Matryoshka support
Running locally means zero API costs and full data privacy.
Image Embeddings#
Embeddings work for images too. CLIP-style models embed images and text into the same vector space:
from sentence_transformers import SentenceTransformer
from PIL import Image
model = SentenceTransformer("clip-ViT-B-32")
# Embed text and images into the same space
text_embedding = model.encode("a photo of a cat")
image_embedding = model.encode(Image.open("cat.jpg"))
# Now you can compare text queries against image embeddings
This enables text-to-image search, image-to-image similarity, and multimodal retrieval.
Dimensionality#
Embedding dimensions affect quality, speed, and storage:
| Dimensions | Storage per vector | Quality | Use case |
|---|---|---|---|
| 256 | 1 KB | Good | High-volume, cost-sensitive |
| 384 | 1.5 KB | Good+ | General purpose (local) |
| 768 | 3 KB | Very good | Production search |
| 1536 | 6 KB | Excellent | High-precision tasks |
| 3072 | 12 KB | Best | Maximum quality |
At scale, dimensions matter:
- 10M vectors at 1536 dims = ~60 GB just for vectors
- 10M vectors at 256 dims = ~10 GB
- Lower dimensions also mean faster similarity computation
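The storage figures above follow directly from 4 bytes per float32 component; a quick sanity check:

```python
def vector_storage_gb(num_vectors, dims, bytes_per_float=4):
    """Raw storage for float32 vectors, ignoring index overhead."""
    return num_vectors * dims * bytes_per_float / 1e9

print(vector_storage_gb(10_000_000, 1536))  # ~61 GB
print(vector_storage_gb(10_000_000, 256))   # ~10 GB
```

Real vector indexes add overhead on top of this (graph links, metadata), so treat these as lower bounds.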
Matryoshka Embeddings#
Modern models support dimension truncation — you can use just the first N dimensions:
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-large",
input="query text",
dimensions=256 # Truncate from 3072 to 256
)
Quality degrades gracefully. You might lose 2-5% retrieval accuracy but save 90% storage.
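If you truncate vectors yourself (e.g. with an open-source Matryoshka model rather than the OpenAI `dimensions` parameter), note that the truncated vector is no longer unit length, so renormalize before computing cosine similarity. A sketch with a random stand-in vector:

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then renormalize to unit length."""
    truncated = np.asarray(vec)[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a real embedding
short = truncate_embedding(full, 256)
print(short.shape)                        # (256,)
print(round(float(np.linalg.norm(short)), 6))  # 1.0
```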
Similarity Metrics#
How you measure distance between vectors matters:
Cosine Similarity (Most Common)#
import numpy as np
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Returns -1 to 1, where 1 = identical direction
Cosine similarity ignores magnitude and compares direction only. Best for text embeddings where magnitude varies.
Euclidean Distance (L2)#
def euclidean_distance(a, b):
return np.linalg.norm(a - b)
# Returns 0 to infinity, where 0 = identical
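For unit-length vectors, Euclidean distance and cosine similarity are directly related: ||a - b||^2 = 2 * (1 - cos(a, b)), so ranking by either produces the same order. A quick numeric check of the identity:

```python
import numpy as np

# Two normalized 2-dim vectors
a = np.array([3.0, 4.0]); a /= np.linalg.norm(a)
b = np.array([1.0, 1.0]); b /= np.linalg.norm(b)

l2_sq = np.linalg.norm(a - b) ** 2
cos = np.dot(a, b)  # equals cosine similarity for unit vectors
assert np.isclose(l2_sq, 2 * (1 - cos))
print("identity holds")
```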
Dot Product#
def dot_product(a, b):
return np.dot(a, b)
Fastest to compute. Equivalent to cosine similarity when vectors are L2-normalized, which most embedding models do by default.
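A small numeric check of both claims above: cosine similarity ignores scale, and dot product equals cosine similarity once vectors are normalized.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

# Scaling a vector does not change cosine similarity
assert np.isclose(cosine_similarity(a, b), cosine_similarity(a, 10 * b))

# After L2-normalization, dot product and cosine similarity coincide
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
assert np.isclose(np.dot(a_n, b_n), cosine_similarity(a, b))
print("checks pass")
```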
Fine-Tuning Embeddings#
Off-the-shelf models work well, but fine-tuning on your domain data can boost retrieval by 5-15%:
Training Data Format#
[
{"query": "password reset", "positive": "How to change your password", "negative": "Pricing plans"},
{"query": "billing issue", "positive": "Update payment method", "negative": "API documentation"}
]
Each example has a query, a positive (relevant) document, and a negative (irrelevant) document.
Fine-Tuning with Sentence-Transformers#
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader
model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [
InputExample(texts=["password reset", "How to change your password"], label=1.0),
InputExample(texts=["password reset", "Pricing plans"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100
)
When to Fine-Tune#
- Domain-specific vocabulary (medical, legal, internal jargon)
- Specific retrieval patterns your base model struggles with
- Significant quality gap between base model and your needs
- Sufficient training data — at least 1,000 query-document pairs
Semantic Similarity Search#
The core use case: find documents similar to a query.
# Index phase (once)
documents = load_documents()
embeddings = model.encode(documents)
index.add(embeddings) # Add to vector store (Pinecone, Weaviate, pgvector, etc.)
# Query phase (per request)
query_embedding = model.encode("How do I reset my password?")
results = index.search(query_embedding, top_k=5)
# Returns the 5 most similar documents
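For small corpora you do not need a vector database at all. A brute-force version of the index/search pseudocode above, with an illustrative `InMemoryIndex` class (the class name and the random embeddings are stand-ins, not a real library API):

```python
import numpy as np

class InMemoryIndex:
    """Brute-force cosine search; fine up to roughly 100k vectors."""

    def __init__(self):
        self.vectors = None

    def add(self, embeddings):
        vecs = np.asarray(embeddings, dtype=np.float32)
        # Normalize rows so dot product == cosine similarity
        self.vectors = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def search(self, query, top_k=5):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q
        top = np.argsort(-scores)[:top_k]
        return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(42)
index = InMemoryIndex()
index.add(rng.normal(size=(1000, 384)))   # 1,000 fake document embeddings
results = index.search(rng.normal(size=384), top_k=5)
# results: [(doc_index, score), ...] sorted by similarity
```

Past a few hundred thousand vectors, approximate nearest-neighbor indexes (as used by the databases below) become necessary for latency.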
Vector Databases#
| Database | Type | Highlights |
|---|---|---|
| Pinecone | Managed | Fully managed, simple API |
| Weaviate | Self-hosted | Hybrid search, GraphQL API |
| Qdrant | Self-hosted | Rust-based, filtering support |
| pgvector | Extension | PostgreSQL extension, familiar SQL |
| ChromaDB | Embedded | Great for prototyping |
Clustering#
Embeddings make clustering straightforward — similar items naturally group together:
from sklearn.cluster import KMeans
embeddings = model.encode(documents)
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(embeddings)
# Each document now has a cluster label
for doc, label in zip(documents, labels):
print(f"Cluster {label}: {doc[:50]}...")
Use cases: topic discovery, content categorization, support ticket routing.
Anomaly Detection#
Items far from any cluster center are anomalies:
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
# `embeddings` and `documents` come from the clustering example above
centroid = np.mean(embeddings, axis=0)
distances = cosine_distances([centroid], embeddings)[0]
# Flag items beyond 2 standard deviations from the mean distance
threshold = np.mean(distances) + 2 * np.std(distances)
anomalies = [doc for doc, dist in zip(documents, distances) if dist > threshold]
Use cases: detecting spam, identifying novel support tickets, finding mislabeled data.
Production Architecture#
┌──────────┐ ┌────────────┐ ┌───────────────┐
│ Query │───▶│ Embedding │───▶│ Vector Store │
│ │ │ Model │ │ (Pinecone/ │
└──────────┘ └────────────┘ │ pgvector) │
└───────┬───────┘
│ top-k
┌───────▼───────┐
│ Re-ranker │
│ (optional) │
└───────┬───────┘
│
┌───────▼───────┐
│ Results │
└───────────────┘
Re-ranking with a cross-encoder model after initial retrieval typically improves precision by 10-20% at minimal latency cost.
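The retrieve-then-re-rank flow can be sketched as follows. Here `score_pair` is a dummy word-overlap scorer standing in for a real cross-encoder (e.g. a sentence-transformers CrossEncoder), which would score each (query, document) pair jointly:

```python
def score_pair(query, document):
    """Dummy relevance scorer — a real system would call a cross-encoder here."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def rerank(query, candidates, top_k=3):
    """Re-score vector-search candidates with a stronger (slower) model."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

candidates = [  # imagined top-k results from the vector store
    "Update payment method",
    "How to reset your password",
    "Password requirements for new accounts",
]
print(rerank("reset my password", candidates, top_k=2))
# ['How to reset your password', 'Password requirements for new accounts']
```

The pattern is cheap-but-approximate retrieval over millions of documents, then expensive-but-precise scoring over only the top few dozen.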
Key Takeaways#
- Embeddings encode meaning — similar concepts have similar vectors
- Choose your model based on quality needs, cost, and privacy requirements
- Dimensionality is a tradeoff — Matryoshka embeddings let you pick your point
- Fine-tuning boosts domain-specific retrieval by 5-15%
- Cosine similarity is the default metric for text embeddings
- Beyond search — embeddings power clustering, anomaly detection, and recommendations