# Vector Database Architecture — How Embeddings Power Modern AI Search
Traditional databases index rows by primary key. Vector databases index high-dimensional embeddings and answer a fundamentally different question: "What is most similar to this?"
If you are building RAG pipelines, semantic search, recommendation engines, or anomaly detection, you need to understand how vector databases work under the hood.
## What Is an Embedding?
An embedding is a fixed-length array of floats that represents the meaning of a piece of data — a sentence, an image, a product, a code snippet.
- Text embeddings — OpenAI `text-embedding-3-large` outputs 3072 dimensions
- Image embeddings — CLIP maps images and text into a shared 512-d space
- Code embeddings — models like `code-search-ada` embed functions for retrieval
The key insight: items that are semantically similar end up close together in vector space.
## Similarity Metrics
When you query a vector database, you are asking "find the K vectors closest to this query vector." Closeness depends on the metric.
### Cosine Similarity
Measures the angle between two vectors, ignoring magnitude. Values range from -1 (opposite) to 1 (identical direction). Most common for text.
### Euclidean Distance (L2)
Measures straight-line distance between two points. Lower is more similar. Better when magnitude matters — e.g., comparing raw pixel features.
### Dot Product (Inner Product)
Measures both direction and magnitude. If vectors are normalized, dot product equals cosine similarity. Used when you want magnitude to influence ranking.
Rule of thumb: use cosine for text search, Euclidean for spatial data, and dot product when vectors are already normalized.
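The three metrics, and the normalized-dot-equals-cosine identity, are a few lines of NumPy (a minimal sketch):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle only: 1.0 = same direction, -1.0 = opposite; magnitude ignored."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line (L2) distance: lower = more similar."""
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Direction and magnitude together."""
    return float(np.dot(a, b))

q = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

print(cosine_similarity(q, v))   # 1.0 (magnitude ignored)
print(euclidean_distance(q, v))  # ~3.74 (magnitude matters)
# For unit-normalized vectors, dot product equals cosine similarity:
qn, vn = q / np.linalg.norm(q), v / np.linalg.norm(v)
print(np.isclose(dot_product(qn, vn), cosine_similarity(q, v)))  # True
```

Note how `q` and `v` point the same way but differ in length: cosine calls them identical, while L2 distance separates them.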
## Approximate Nearest Neighbor (ANN) Search
Exact nearest-neighbor search is O(n) — you compare the query against every vector. At millions of vectors, that is too slow.
ANN algorithms trade a small amount of accuracy for massive speed gains. The two dominant approaches:
### HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph where each node connects to its nearest neighbors. Higher layers contain fewer, more spread-out nodes for fast coarse navigation. Lower layers contain all nodes for precise search.
How search works:
- Enter the graph at the top layer
- Greedily move to the neighbor closest to the query
- Drop to the next layer and repeat
- At the bottom layer, explore the local neighborhood and return top-K
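The greedy descent above can be sketched in NumPy. This is a deliberately simplified single-layer version: real HNSW maintains multiple layers and an `ef`-sized candidate beam, but the hop-to-the-closest-neighbor loop is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16))

# Build a single-layer neighbor graph by brute force (real HNSW builds
# its graph incrementally across multiple layers).
M = 8  # neighbors per node
dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
graph = {i: list(np.argsort(dists[i])[1:M + 1]) for i in range(len(vectors))}

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Greedy descent: hop to whichever neighbor is closest to the query;
    stop at a local minimum. This is one layer of the HNSW routine."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nbr in graph[current]:
            d = np.linalg.norm(vectors[nbr] - query)
            if d < best_dist:
                best, best_dist = nbr, d
        if best == current:   # no neighbor improves: local minimum reached
            return current
        current, current_dist = best, best_dist

query = rng.normal(size=16)
approx = greedy_search(query)
exact = int(np.argmin(np.linalg.norm(vectors - query, axis=1)))
print(approx, exact)  # usually equal or very close
```

Because pure greedy descent can stall in a local minimum, production HNSW keeps a beam of `ef` candidates rather than a single current node.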
Tradeoffs:
- Very fast queries (sub-millisecond at 1M vectors)
- High memory usage — the full graph lives in RAM
- Slow index build time compared to IVF
### IVF (Inverted File Index)
IVF partitions vectors into clusters using k-means. At query time, you only search the nearest clusters instead of the full dataset.
How search works:
- Run k-means on all vectors to create `nlist` centroids
- Assign each vector to its nearest centroid
- At query time, find the `nprobe` nearest centroids
- Search only vectors in those clusters
Tradeoffs:
- Lower memory than HNSW — you only need centroids in RAM
- Tunable accuracy via `nprobe` (more probes = more accurate, slower)
- Faster index build, especially with GPU-accelerated k-means
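The build-and-probe flow can be sketched in NumPy, with a toy k-means standing in for a production trainer and `nlist`/`nprobe` used as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 32))
nlist, nprobe = 16, 4

# --- Build: a few k-means iterations place the nlist centroids ---
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None, :], axis=-1), axis=1)
    for c in range(nlist):
        if (assign == c).any():
            centroids[c] = vectors[assign == c].mean(axis=0)
assign = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None, :], axis=-1), axis=1)

# Inverted lists: vector ids grouped under their nearest centroid
inverted = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(query, k=5):
    """Search only the nprobe clusters whose centroids are nearest the query."""
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.concatenate([inverted[c] for c in probe])
    d = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

query = rng.normal(size=32)
print(ivf_search(query))
```

With `nprobe=4` of 16 lists, each query scans roughly a quarter of the dataset; raising `nprobe` trades speed back for recall.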
### IVF + PQ (Product Quantization)
Combines IVF with compression. Each vector is split into sub-vectors and quantized to a codebook entry. Dramatically reduces memory at some accuracy cost. Used by FAISS at billion-scale.
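A minimal product-quantization sketch in NumPy. The parameters (`m` sub-vectors, 16-entry codebooks) are chosen for illustration, not FAISS defaults; distances to compressed codes are computed with the standard lookup-table ("asymmetric distance") trick:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(500, 32))
m, ksub = 4, 16                 # 4 sub-vectors of 8 dims, 16 codes each
dsub = vectors.shape[1] // m

def train_codebook(data, k, iters=10):
    """Tiny k-means: returns k centroids for one sub-space."""
    cent = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        a = np.argmin(np.linalg.norm(data[:, None] - cent[None, :], axis=-1), axis=1)
        for c in range(k):
            if (a == c).any():
                cent[c] = data[a == c].mean(axis=0)
    return cent

codebooks = [train_codebook(vectors[:, i*dsub:(i+1)*dsub], ksub) for i in range(m)]

# Encode: each 32-float vector becomes m small integer codes (4 bytes vs 128)
codes = np.stack([
    np.argmin(np.linalg.norm(
        vectors[:, i*dsub:(i+1)*dsub][:, None] - codebooks[i][None, :], axis=-1), axis=1)
    for i in range(m)], axis=1)

def asymmetric_distance(query, codes):
    """Squared distance from a full query to compressed codes, via one
    per-sub-space lookup table instead of decompressing every vector."""
    tables = [np.linalg.norm(codebooks[i] - query[i*dsub:(i+1)*dsub], axis=1) ** 2
              for i in range(m)]
    return sum(tables[i][codes[:, i]] for i in range(m))

query = rng.normal(size=32)
print(np.argsort(asymmetric_distance(query, codes))[:5])
```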
## Vector Database Comparison
### Pinecone
Fully managed. No infrastructure to operate. Supports metadata filtering, namespaces, and sparse-dense hybrid search. Pricing is per-pod or serverless.
Best for: Teams that want zero ops overhead and fast prototyping.
### Weaviate
Open-source, supports HNSW indexing, built-in vectorization modules (OpenAI, Cohere, HuggingFace), and GraphQL API. Can run on Kubernetes.
Best for: Teams that want built-in embedding generation and a rich query language.
### Milvus
Open-source, built for billion-scale. Supports IVF, HNSW, DiskANN, and GPU indexes. Separates storage and compute. Runs on Kubernetes via the Milvus Operator.
Best for: High-scale production workloads that need fine-grained index tuning.
### Chroma
Lightweight, Python-native, embeddable. Runs in-process or as a server. Great developer experience with a simple API.
Best for: Local development, prototyping, and small-to-medium datasets.
### pgvector
A PostgreSQL extension that adds vector columns and ANN search (HNSW and IVF). Your vectors live alongside relational data in one database.
Best for: Teams already on PostgreSQL who want to avoid a second database.
## Hybrid Search
Pure vector search misses exact keyword matches. Pure keyword search misses semantic meaning. Hybrid search combines both.
Architecture:
- Run a sparse retrieval (BM25 / full-text) for keyword relevance
- Run a dense retrieval (vector similarity) for semantic relevance
- Fuse the two ranked lists using Reciprocal Rank Fusion (RRF) or a learned re-ranker
Pinecone supports sparse-dense vectors natively. Weaviate has a built-in BM25 + vector hybrid mode. With pgvector, you can combine `tsvector` full-text search with vector similarity in a single SQL query.
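The fusion step itself is tiny. This sketch uses the k=60 constant from the original RRF paper and hypothetical document ids; each document scores the sum of 1/(k + rank) over every ranked list it appears in:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked id lists with Reciprocal Rank Fusion: a document's score
    is the sum of 1 / (k + rank) over every list that returned it."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]  # sparse / keyword ranking
vector_hits = ["doc1", "doc4", "doc3"]  # dense / semantic ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['doc1', 'doc3', 'doc4', 'doc7']
```

`doc1` wins because it ranks high in both lists, even though neither retriever put it first; that cross-list agreement is exactly what RRF rewards.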
## RAG Pipeline Integration
Vector databases are the retrieval backbone of Retrieval-Augmented Generation (RAG):
- Ingest — chunk documents, generate embeddings, store in vector DB with metadata
- Retrieve — embed the user query, run ANN search, return top-K chunks
- Augment — inject retrieved chunks into the LLM prompt as context
- Generate — the LLM produces a grounded answer
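The four steps can be sketched end to end. `toy_embed` below is a stand-in for a real embedding model (in production this would be an API call), and the final LLM call is left as a comment:

```python
import numpy as np

def toy_embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: a normalized
    character-bigram count vector. Illustration only."""
    vec = np.zeros(26 * 26)
    letters = [c for c in text.lower() if c.isalpha()]
    for a, b in zip(letters, letters[1:]):
        vec[(ord(a) - 97) * 26 + (ord(b) - 97)] += 1
    n = np.linalg.norm(vec)
    return vec / n if n else vec

chunks = [
    "HNSW builds a multi-layer graph for fast approximate search.",
    "IVF partitions vectors into clusters with k-means.",
    "Postgres is a relational database.",
]
index = np.stack([toy_embed(c) for c in chunks])       # 1. ingest

def retrieve(query: str, k: int = 2) -> list:
    scores = index @ toy_embed(query)                  # 2. retrieve (cosine)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("how does HNSW graph search work?"))
prompt = f"Answer using only this context:\n{context}\n\nQ: how does HNSW work?"
print(prompt)  # 3. augment done; 4. generate: send `prompt` to your LLM
```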
Key decisions:
- Chunk size — 256-512 tokens balances specificity and context
- Overlap — 10-20% overlap between chunks prevents losing context at boundaries
- Metadata filters — filter by source, date, or category before vector search to improve relevance
- Re-ranking — use a cross-encoder (e.g., Cohere Rerank, bge-reranker) to re-score top-K results
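The chunk-size and overlap decisions above can be sketched as a minimal chunker, with whitespace words standing in for tokens (production code would count real tokenizer tokens, e.g. via tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap_ratio: float = 0.15):
    """Split text into fixed-size chunks whose starts are staggered so that
    consecutive chunks share overlap_ratio of their content."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=400, overlap_ratio=0.15)
print(len(chunks))           # → 3 (400-word chunks, 60-word overlap)
print(chunks[1].split()[0])  # → word340 (second chunk starts 340 words in)
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, which is exactly the boundary-loss problem the 10-20% guideline addresses.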
## Indexing Pipeline
A production vector database needs an ingestion pipeline:
- Extract — pull text from PDFs, HTML, Markdown, Notion, Confluence
- Chunk — split into fixed-size or semantic chunks
- Embed — call the embedding model API (batch for throughput)
- Upsert — write vectors + metadata to the database
- Sync — handle updates and deletes when source documents change
Use idempotent upserts keyed on a content hash to avoid duplicates. Track a `last_updated` timestamp to enable incremental syncs.
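A sketch of the content-hash upsert pattern, with a plain dict standing in for the database (a real pipeline would call your vector DB's upsert API keyed the same way):

```python
import hashlib

store = {}  # stand-in for a vector DB collection

def upsert_chunk(chunk: str, embedding: list, source: str, updated_at: str) -> str:
    """Idempotent upsert: the record id is a hash of the content, so
    re-ingesting an unchanged chunk overwrites the same record instead
    of creating a duplicate."""
    doc_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    store[doc_id] = {
        "text": chunk,
        "embedding": embedding,
        "source": source,
        "last_updated": updated_at,  # enables incremental syncs
    }
    return doc_id

a = upsert_chunk("vector databases index embeddings", [0.1, 0.2], "doc.md", "2024-01-01")
b = upsert_chunk("vector databases index embeddings", [0.1, 0.2], "doc.md", "2024-02-01")
print(a == b, len(store))  # True 1: same content, one record, timestamp refreshed
```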
## Performance Tuning
- Index type — HNSW for low-latency; IVF+PQ for memory-constrained billion-scale
- Dimension reduction — Matryoshka embeddings or PCA to reduce dimensions without losing much accuracy
- Batch queries — amortize network overhead by batching multiple queries
- Pre-filtering vs. post-filtering — pre-filter metadata before ANN search to reduce the candidate set
- Warm cache — keep hot indexes in memory; use memory-mapped files for cold data
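Pre-filtering from the list above can be sketched as: select candidate ids by metadata first, then rank only that subset. Brute-force distance stands in for the ANN index here, and the metadata schema is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(100, 8))
# Hypothetical metadata: even ids are "docs", odd ids are "blog"
metadata = [{"category": "blog" if i % 2 else "docs"} for i in range(100)]

def search(query, k=3, category=None):
    """Pre-filter: shrink the candidate set by metadata, then rank only
    those vectors by distance."""
    ids = np.array([i for i, m in enumerate(metadata)
                    if category is None or m["category"] == category])
    d = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(d)[:k]]

hits = search(rng.normal(size=8), category="docs")
print(hits)  # only even ids: every hit passed the filter before ranking
```

Post-filtering would instead rank all 100 vectors and discard non-matching hits afterwards, which can return fewer than `k` results when the filter is selective.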
## When NOT to Use a Vector Database
- Your dataset is under 10K items — brute-force cosine similarity in NumPy is fast enough
- You need exact matches only — a traditional database or search engine is simpler
- You have no embedding model — vector databases require vectors; they do not generate them (unless the DB has built-in vectorization like Weaviate)
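For the small-dataset case, the brute-force baseline really is a few lines of NumPy: normalize the corpus once, then a single matrix-vector product scores every item exactly.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest neighbors by cosine similarity, no index needed.
    At 10K x 768 this is a few milliseconds on a laptop."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = corpus_n @ query_n
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(4)
corpus = rng.normal(size=(10_000, 768))
query = corpus[42] + 0.01 * rng.normal(size=768)  # near-duplicate of item 42
print(top_k_cosine(query, corpus)[0])  # → 42
```

Only when this loop becomes the bottleneck, or the corpus no longer fits in memory, does an ANN index start paying for its complexity.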
## Start Building
Vector databases are the infrastructure layer that makes AI search, RAG, and recommendations possible. Pick the right tool for your scale, understand the tradeoffs between HNSW and IVF, and always combine vector search with metadata filtering and re-ranking for production quality.
Try designing a vector database architecture on codelit.io — describe your system, get an interactive architecture diagram, export as code.