# Vector Database Architecture — How Embeddings Power Modern AI Search
Traditional databases index rows by primary key. Vector databases index high-dimensional embeddings and answer a fundamentally different question: "What is most similar to this?"
If you are building RAG pipelines, semantic search, recommendation engines, or anomaly detection, you need to understand how vector databases work under the hood.
## What Is an Embedding?
An embedding is a fixed-length array of floats that represents the meaning of a piece of data — a sentence, an image, a product, a code snippet.
- Text embeddings — OpenAI `text-embedding-3-large` outputs 3072 dimensions
- Image embeddings — CLIP maps images and text into a shared 512-d space
- Code embeddings — models like `code-search-ada` embed functions for retrieval
The key insight: items that are semantically similar end up close together in vector space.
## Similarity Metrics
When you query a vector database, you are asking "find the K vectors closest to this query vector." Closeness depends on the metric.
### Cosine Similarity
Measures the angle between two vectors, ignoring magnitude. Values range from -1 (opposite) to 1 (identical direction). Most common for text.
### Euclidean Distance (L2)
Measures straight-line distance between two points. Lower is more similar. Better when magnitude matters — e.g., comparing raw pixel features.
### Dot Product (Inner Product)
Measures both direction and magnitude. If vectors are normalized, dot product equals cosine similarity. Used when you want magnitude to influence ranking.
Rule of thumb: use cosine for text search, Euclidean for spatial data, and dot product when vectors are already normalized.
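The three metrics, and the normalized-dot-equals-cosine identity, are a few lines of NumPy (a minimal sketch):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle only: 1.0 = same direction, -1.0 = opposite; magnitude ignored."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line (L2) distance: lower = more similar."""
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Direction and magnitude together."""
    return float(np.dot(a, b))

q = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

print(cosine_similarity(q, v))   # 1.0 (magnitude ignored)
print(euclidean_distance(q, v))  # ~3.74 (magnitude matters)
# For unit-normalized vectors, dot product equals cosine similarity:
qn, vn = q / np.linalg.norm(q), v / np.linalg.norm(v)
print(np.isclose(dot_product(qn, vn), cosine_similarity(q, v)))  # True
```

Note how `q` and `v` point the same way but differ in length: cosine calls them identical, while L2 distance separates them.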
## Approximate Nearest Neighbor (ANN) Search
Exact nearest-neighbor search is O(n) — you compare the query against every vector. At millions of vectors, that is too slow.
ANN algorithms trade a small amount of accuracy for massive speed gains. The two dominant approaches:
### HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph where each node connects to its nearest neighbors. Higher layers contain fewer, more spread-out nodes for fast coarse navigation. Lower layers contain all nodes for precise search.
How search works:
- Enter the graph at the top layer
- Greedily move to the neighbor closest to the query
- Drop to the next layer and repeat
- At the bottom layer, explore the local neighborhood and return top-K
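The greedy descent above can be sketched in NumPy. This is a deliberately simplified single-layer version: real HNSW maintains multiple layers and an `ef`-sized candidate beam, but the hop-to-the-closest-neighbor loop is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16))

# Build a single-layer neighbor graph by brute force (real HNSW builds
# its graph incrementally across multiple layers).
M = 8  # neighbors per node
dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
graph = {i: list(np.argsort(dists[i])[1:M + 1]) for i in range(len(vectors))}

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Greedy descent: hop to whichever neighbor is closest to the query;
    stop at a local minimum. This is one layer of the HNSW routine."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        best, best_dist = current, current_dist
        for nbr in graph[current]:
            d = np.linalg.norm(vectors[nbr] - query)
            if d < best_dist:
                best, best_dist = nbr, d
        if best == current:   # no neighbor improves: local minimum reached
            return current
        current, current_dist = best, best_dist

query = rng.normal(size=16)
approx = greedy_search(query)
exact = int(np.argmin(np.linalg.norm(vectors - query, axis=1)))
print(approx, exact)  # usually equal or very close
```

Because pure greedy descent can stall in a local minimum, production HNSW keeps a beam of `ef` candidates rather than a single current node.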
Tradeoffs:
- Very fast queries (sub-millisecond at 1M vectors)
- High memory usage — the full graph lives in RAM
- Slow index build time compared to IVF
### IVF (Inverted File Index)
IVF partitions vectors into clusters using k-means. At query time, you only search the nearest clusters instead of the full dataset.
How search works:
- Run k-means on all vectors to create `nlist` centroids
- Assign each vector to its nearest centroid
- At query time, find the `nprobe` nearest centroids
- Search only vectors in those clusters
Tradeoffs:
- Lower memory than HNSW — you only need centroids in RAM
- Tunable accuracy via `nprobe` (more probes = more accurate, slower)
- Faster index build, especially with GPU-accelerated k-means
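The build-and-probe flow can be sketched in NumPy, with a toy k-means standing in for a production trainer and `nlist`/`nprobe` used as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 32))
nlist, nprobe = 16, 4

# --- Build: a few k-means iterations place the nlist centroids ---
centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
for _ in range(10):
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None, :], axis=-1), axis=1)
    for c in range(nlist):
        if (assign == c).any():
            centroids[c] = vectors[assign == c].mean(axis=0)
assign = np.argmin(
    np.linalg.norm(vectors[:, None] - centroids[None, :], axis=-1), axis=1)

# Inverted lists: vector ids grouped under their nearest centroid
inverted = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(query, k=5):
    """Search only the nprobe clusters whose centroids are nearest the query."""
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.concatenate([inverted[c] for c in probe])
    d = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(d)[:k]]

query = rng.normal(size=32)
print(ivf_search(query))
```

With `nprobe=4` of 16 lists, each query scans roughly a quarter of the dataset; raising `nprobe` trades speed back for recall.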
### IVF + PQ (Product Quantization)
Combines IVF with compression. Each vector is split into sub-vectors and quantized to a codebook entry. Dramatically reduces memory at some accuracy cost. Used by FAISS at billion-scale.
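A minimal product-quantization sketch in NumPy. The parameters (`m` sub-vectors, 16-entry codebooks) are chosen for illustration, not FAISS defaults; distances to compressed codes are computed with the standard lookup-table ("asymmetric distance") trick:

```python
import numpy as np

rng = np.random.default_rng(2)
vectors = rng.normal(size=(500, 32))
m, ksub = 4, 16                 # 4 sub-vectors of 8 dims, 16 codes each
dsub = vectors.shape[1] // m

def train_codebook(data, k, iters=10):
    """Tiny k-means: returns k centroids for one sub-space."""
    cent = data[rng.choice(len(data), k, replace=False)].copy()
    for _ in range(iters):
        a = np.argmin(np.linalg.norm(data[:, None] - cent[None, :], axis=-1), axis=1)
        for c in range(k):
            if (a == c).any():
                cent[c] = data[a == c].mean(axis=0)
    return cent

codebooks = [train_codebook(vectors[:, i*dsub:(i+1)*dsub], ksub) for i in range(m)]

# Encode: each 32-float vector becomes m small integer codes (4 bytes vs 128)
codes = np.stack([
    np.argmin(np.linalg.norm(
        vectors[:, i*dsub:(i+1)*dsub][:, None] - codebooks[i][None, :], axis=-1), axis=1)
    for i in range(m)], axis=1)

def asymmetric_distance(query, codes):
    """Squared distance from a full query to compressed codes, via one
    per-sub-space lookup table instead of decompressing every vector."""
    tables = [np.linalg.norm(codebooks[i] - query[i*dsub:(i+1)*dsub], axis=1) ** 2
              for i in range(m)]
    return sum(tables[i][codes[:, i]] for i in range(m))

query = rng.normal(size=32)
print(np.argsort(asymmetric_distance(query, codes))[:5])
```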
## Vector Database Comparison
### Pinecone
Fully managed. No infrastructure to operate. Supports metadata filtering, namespaces, and sparse-dense hybrid search. Pricing is per-pod or serverless.
Best for: Teams that want zero ops overhead and fast prototyping.
### Weaviate
Open-source, supports HNSW indexing, built-in vectorization modules (OpenAI, Cohere, HuggingFace), and GraphQL API. Can run on Kubernetes.
Best for: Teams that want built-in embedding generation and a rich query language.
### Milvus
Open-source, built for billion-scale. Supports IVF, HNSW, DiskANN, and GPU indexes. Separates storage and compute. Runs on Kubernetes via the Milvus Operator.
Best for: High-scale production workloads that need fine-grained index tuning.
### Chroma
Lightweight, Python-native, embeddable. Runs in-process or as a server. Great developer experience with a simple API.
Best for: Local development, prototyping, and small-to-medium datasets.
### pgvector
A PostgreSQL extension that adds vector columns and ANN search (HNSW and IVF). Your vectors live alongside relational data in one database.
Best for: Teams already on PostgreSQL who want to avoid a second database.
## Hybrid Search
Pure vector search misses exact keyword matches. Pure keyword search misses semantic meaning. Hybrid search combines both.
Architecture:
- Run a sparse retrieval (BM25 / full-text) for keyword relevance
- Run a dense retrieval (vector similarity) for semantic relevance
- Fuse the two ranked lists using Reciprocal Rank Fusion (RRF) or a learned re-ranker
Pinecone supports sparse-dense vectors natively. Weaviate has a built-in BM25 + vector hybrid mode. With pgvector, you can combine `tsvector` full-text search with vector similarity in a single SQL query.
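The fusion step itself is tiny. This sketch uses the k=60 constant from the original RRF paper and hypothetical document ids; each document scores the sum of 1/(k + rank) over every ranked list it appears in:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked id lists with Reciprocal Rank Fusion: a document's score
    is the sum of 1 / (k + rank) over every list that returned it."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]  # sparse / keyword ranking
vector_hits = ["doc1", "doc4", "doc3"]  # dense / semantic ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['doc1', 'doc3', 'doc4', 'doc7']
```

`doc1` wins because it ranks high in both lists, even though neither retriever put it first; that cross-list agreement is exactly what RRF rewards.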
## RAG Pipeline Integration
Vector databases are the retrieval backbone of Retrieval-Augmented Generation (RAG):
- Ingest — chunk documents, generate embeddings, store in vector DB with metadata
- Retrieve — embed the user query, run ANN search, return top-K chunks
- Augment — inject retrieved chunks into the LLM prompt as context
- Generate — the LLM produces a grounded answer
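The four steps can be sketched end to end. `toy_embed` below is a stand-in for a real embedding model (in production this would be an API call), and the final LLM call is left as a comment:

```python
import numpy as np

def toy_embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: a normalized
    character-bigram count vector. Illustration only."""
    vec = np.zeros(26 * 26)
    letters = [c for c in text.lower() if c.isalpha()]
    for a, b in zip(letters, letters[1:]):
        vec[(ord(a) - 97) * 26 + (ord(b) - 97)] += 1
    n = np.linalg.norm(vec)
    return vec / n if n else vec

chunks = [
    "HNSW builds a multi-layer graph for fast approximate search.",
    "IVF partitions vectors into clusters with k-means.",
    "Postgres is a relational database.",
]
index = np.stack([toy_embed(c) for c in chunks])       # 1. ingest

def retrieve(query: str, k: int = 2) -> list:
    scores = index @ toy_embed(query)                  # 2. retrieve (cosine)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

context = "\n".join(retrieve("how does HNSW graph search work?"))
prompt = f"Answer using only this context:\n{context}\n\nQ: how does HNSW work?"
print(prompt)  # 3. augment done; 4. generate: send `prompt` to your LLM
```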
Key decisions:
- Chunk size — 256-512 tokens balances specificity and context
- Overlap — 10-20% overlap between chunks prevents losing context at boundaries
- Metadata filters — filter by source, date, or category before vector search to improve relevance
- Re-ranking — use a cross-encoder (e.g., Cohere Rerank, bge-reranker) to re-score top-K results
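The chunk-size and overlap decisions above can be sketched as a minimal chunker, with whitespace words standing in for tokens (production code would count real tokenizer tokens, e.g. via tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap_ratio: float = 0.15):
    """Split text into fixed-size chunks whose starts are staggered so that
    consecutive chunks share overlap_ratio of their content."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=400, overlap_ratio=0.15)
print(len(chunks))           # → 3 (400-word chunks, 60-word overlap)
print(chunks[1].split()[0])  # → word340 (second chunk starts 340 words in)
```

The overlap means a sentence straddling a chunk boundary still appears whole in at least one chunk, which is exactly the boundary-loss problem the 10-20% guideline addresses.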
## Indexing Pipeline
A production vector database needs an ingestion pipeline:
- Extract — pull text from PDFs, HTML, Markdown, Notion, Confluence
- Chunk — split into fixed-size or semantic chunks
- Embed — call the embedding model API (batch for throughput)
- Upsert — write vectors + metadata to the database
- Sync — handle updates and deletes when source documents change
Use idempotent upserts keyed on a content hash to avoid duplicates. Track a `last_updated` timestamp to enable incremental syncs.
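A sketch of the content-hash upsert pattern, with a plain dict standing in for the database (a real pipeline would call your vector DB's upsert API keyed the same way):

```python
import hashlib

store = {}  # stand-in for a vector DB collection

def upsert_chunk(chunk: str, embedding: list, source: str, updated_at: str) -> str:
    """Idempotent upsert: the record id is a hash of the content, so
    re-ingesting an unchanged chunk overwrites the same record instead
    of creating a duplicate."""
    doc_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    store[doc_id] = {
        "text": chunk,
        "embedding": embedding,
        "source": source,
        "last_updated": updated_at,  # enables incremental syncs
    }
    return doc_id

a = upsert_chunk("vector databases index embeddings", [0.1, 0.2], "doc.md", "2024-01-01")
b = upsert_chunk("vector databases index embeddings", [0.1, 0.2], "doc.md", "2024-02-01")
print(a == b, len(store))  # True 1: same content, one record, timestamp refreshed
```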
## Performance Tuning
- Index type — HNSW for low-latency; IVF+PQ for memory-constrained billion-scale
- Dimension reduction — Matryoshka embeddings or PCA to reduce dimensions without losing much accuracy
- Batch queries — amortize network overhead by batching multiple queries
- Pre-filtering vs. post-filtering — pre-filter metadata before ANN search to reduce the candidate set
- Warm cache — keep hot indexes in memory; use memory-mapped files for cold data
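Pre-filtering from the list above can be sketched as: select candidate ids by metadata first, then rank only that subset. Brute-force distance stands in for the ANN index here, and the metadata schema is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(100, 8))
# Hypothetical metadata: even ids are "docs", odd ids are "blog"
metadata = [{"category": "blog" if i % 2 else "docs"} for i in range(100)]

def search(query, k=3, category=None):
    """Pre-filter: shrink the candidate set by metadata, then rank only
    those vectors by distance."""
    ids = np.array([i for i, m in enumerate(metadata)
                    if category is None or m["category"] == category])
    d = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(d)[:k]]

hits = search(rng.normal(size=8), category="docs")
print(hits)  # only even ids: every hit passed the filter before ranking
```

Post-filtering would instead rank all 100 vectors and discard non-matching hits afterwards, which can return fewer than `k` results when the filter is selective.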
## When NOT to Use a Vector Database
- Your dataset is under 10K items — brute-force cosine similarity in NumPy is fast enough
- You need exact matches only — a traditional database or search engine is simpler
- You have no embedding model — vector databases require vectors; they do not generate them (unless the DB has built-in vectorization like Weaviate)
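For the small-dataset case, the brute-force baseline really is a few lines of NumPy: normalize the corpus once, then a single matrix-vector product scores every item exactly.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest neighbors by cosine similarity, no index needed.
    At 10K x 768 this is a few milliseconds on a laptop."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = corpus_n @ query_n
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(4)
corpus = rng.normal(size=(10_000, 768))
query = corpus[42] + 0.01 * rng.normal(size=768)  # near-duplicate of item 42
print(top_k_cosine(query, corpus)[0])  # → 42
```

Only when this loop becomes the bottleneck, or the corpus no longer fits in memory, does an ANN index start paying for its complexity.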
## Start Building
Vector databases are the infrastructure layer that makes AI search, RAG, and recommendations possible. Pick the right tool for your scale, understand the tradeoffs between HNSW and IVF, and always combine vector search with metadata filtering and re-ranking for production quality.
Try designing a vector database architecture on codelit.io — describe your system, get an interactive architecture diagram, export as code.