Recommendation Engine Architecture: From Collaborative Filtering to Deep Learning#
Recommendations drive an estimated 35% of Amazon's revenue and 80% of the hours streamed on Netflix. Behind every "you might also like" is a system balancing relevance, diversity, freshness, and latency.
Why Recommendations Matter#
Without recommendations:
User searches → browses → maybe finds something → high bounce rate
With recommendations:
User arrives → personalized feed → discovers items → longer sessions
Result: 10-30% increase in engagement and conversion
The difference between a mediocre and great recommendation engine is architecture, not just algorithms.
Collaborative Filtering#
The most intuitive approach: users who agreed in the past will likely agree in the future, so recommend what similar users liked.
User-Based Collaborative Filtering#
Find users similar to the target user, then recommend what those similar users liked.
```python
# Simplified user-based CF
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# User-item interaction matrix (ratings); 0 = not yet rated
# Rows = users, Columns = items
ratings = np.array([
    [5, 3, 0, 1],  # User A
    [4, 0, 0, 1],  # User B
    [1, 1, 0, 5],  # User C
    [0, 0, 5, 4],  # User D
])

# Compute user-user similarity
user_sim = cosine_similarity(ratings)

# For User B (row 1), the most similar other user is User A
neighbor = np.argsort(user_sim[1])[-2]  # [-1] is User B itself
# Recommend items the neighbor rated highly that User B hasn't seen
# → Recommend Item 2 (rating 3 from User A)
```
Pros: No item metadata needed, captures complex patterns. Cons: Cold start for new users, doesn't scale well with millions of users.
Item-Based Collaborative Filtering#
Find items similar to what the user already liked. Amazon popularized this approach.
```python
# Item-based CF — compute item similarity
item_sim = cosine_similarity(ratings.T)

# User liked Item 1 → find items most similar to Item 1
# Similarity scores tell us which items co-occur in ratings
# More stable than user-based (items change less than user behavior)
```
Why item-based often wins: Item relationships are more stable than user relationships. A user's taste shifts; the similarity between two movies doesn't.
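Because item relationships are stable, the Amazon-style variant precomputes a short neighbor list per item offline, so serving is just a table lookup. A minimal numpy sketch (function names are illustrative, not from any library):

```python
import numpy as np

def cosine_item_sim(ratings):
    """Item-item cosine similarity from a user x item ratings matrix."""
    gram = ratings.T @ ratings
    norms = np.sqrt(np.diag(gram))
    return gram / np.outer(norms, norms)

def build_neighbor_table(ratings, k=2):
    """Offline job: top-k most similar items for each item."""
    sim = cosine_item_sim(ratings)
    np.fill_diagonal(sim, -1.0)  # exclude the item itself
    return {
        item: list(np.argsort(sim[item])[::-1][:k])
        for item in range(sim.shape[0])
    }

ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

neighbors = build_neighbor_table(ratings)
# Serving "users who liked X also liked..." is now a dict lookup per item
```

The expensive similarity computation runs in a batch job; the request path never touches the ratings matrix.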
Content-Based Filtering#
Recommend items with features similar to what the user previously liked.
```python
# Content-based: use item features
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (year omitted here: numeric features need scaling before cosine similarity)
item_features = {
    "movie_1": {"genre": "sci-fi", "director": "Nolan"},
    "movie_2": {"genre": "sci-fi", "director": "Villeneuve"},
    "movie_3": {"genre": "comedy", "director": "Gerwig"},
}

# One-hot encode the categorical features
vec = DictVectorizer()
X = vec.fit_transform(item_features.values())

# User watched movie_1 and liked it → compare it to every item
sim = cosine_similarity(X[0], X)
# movie_2 shares the genre → highest score among the other items
# Recommend movie_2
```
Pros: No cold start for items (features available immediately), transparent reasoning. Cons: Over-specialization (filter bubble), can't discover surprising recommendations.
Hybrid Approaches#
Production systems combine multiple strategies:
```
┌─────────────────────────────────────────────┐
│             Hybrid Recommender              │
├─────────────────────────────────────────────┤
│                                             │
│  Collaborative ──┐                          │
│  Filtering       ├──→ Weighted ──→ Final    │
│                  │    Combination   List    │
│  Content-Based ──┤                          │
│  Filtering       │                          │
│                  │                          │
│  Popularity  ────┘                          │
│  Baseline                                   │
│                                             │
└─────────────────────────────────────────────┘
```
Common hybrid strategies:
- Weighted: Score = 0.6 * CF_score + 0.3 * content_score + 0.1 * popularity
- Switching: Use content-based for new users, CF once enough data exists
- Cascade: CF generates candidates, content-based re-ranks them
- Feature augmentation: CF embeddings become features for a content model
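The weighted strategy from the list above is a few lines in practice. A minimal sketch, assuming each component score has already been normalized to [0, 1] (the item names and scores are hypothetical):

```python
def hybrid_score(cf, content, popularity, weights=(0.6, 0.3, 0.1)):
    """Weighted blend of per-item scores from each recommender."""
    return weights[0] * cf + weights[1] * content + weights[2] * popularity

# Hypothetical component scores for two candidate items
candidates = {
    "item_a": hybrid_score(cf=0.9, content=0.2, popularity=0.5),
    "item_b": hybrid_score(cf=0.4, content=0.8, popularity=0.9),
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
# item_a ranks first: its strong collaborative signal outweighs
# item_b's popularity and content match
```

In production the weights themselves are usually tuned via A/B tests rather than set by hand.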
Matrix Factorization#
The breakthrough behind the Netflix Prize. Decompose the sparse user-item matrix into dense latent factors.
```python
# Matrix Factorization with the Surprise library
import pandas as pd
from surprise import SVD, Dataset, Reader

# Load ratings (placeholder path; needs user_id, item_id, rating columns)
df = pd.read_csv("ratings.csv")

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(
    df[['user_id', 'item_id', 'rating']], reader
)

# SVD — learns latent factors for users and items
algo = SVD(n_factors=100, n_epochs=20, lr_all=0.005)
trainset = data.build_full_trainset()
algo.fit(trainset)

# Predict: user 42's rating for item 314
prediction = algo.predict(uid=42, iid=314)
# prediction.est holds the predicted rating, e.g. 4.2
```
Latent factors capture hidden dimensions — a movie might score high on "cerebral sci-fi" and low on "family friendly" without anyone labeling those dimensions.
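To make "learning latent factors" concrete, here is a toy SGD matrix-factorization loop in plain numpy. The data, dimensions, and hyperparameters are illustrative, not tuned:

```python
import numpy as np

# (user, item, rating) triples for the observed entries only
ratings = [
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
    (2, 3, 5.0), (3, 2, 5.0), (3, 3, 4.0),
]
n_users, n_items, k = 4, 4, 8

rng = np.random.default_rng(0)
P = rng.normal(0, 0.1, (n_users, k))  # user latent factors
Q = rng.normal(0, 0.1, (n_items, k))  # item latent factors

lr, reg = 0.05, 0.02
for _ in range(300):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                  # prediction error
        P[u] += lr * (err * Q[i] - reg * P[u])  # gradient step + L2
        Q[i] += lr * (err * P[u] - reg * Q[i])

# A predicted rating is just a dot product of the learned factors
pred = P[0] @ Q[0]  # approaches the observed rating of 5.0
```

Only the observed entries drive the updates; predictions for unseen user-item pairs fall out of the learned factor geometry.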
Deep Learning Recommendations#
Neural networks handle complex patterns, sequences, and multimodal data.
```python
# Two-tower model with TensorFlow Recommenders
import tensorflow as tf
import tensorflow_recommenders as tfrs

# user_ids / item_ids: arrays of unique IDs; items: a tf.data.Dataset
# of item IDs — all prepared elsewhere

# User tower — learns user embeddings
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=user_ids),
    tf.keras.layers.Embedding(len(user_ids) + 1, 64),
])

# Item tower — learns item embeddings
item_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=item_ids),
    tf.keras.layers.Embedding(len(item_ids) + 1, 64),
])

# Retrieval task — trains the towers so a user's embedding lands
# near the embeddings of items they interacted with
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        candidates=items.batch(128).map(item_model)
    )
)
```
When to use deep learning: Large datasets (millions of interactions), sequential patterns (session-based), multimodal features (text + image + behavior).
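Whatever framework trains the towers, serving reduces to nearest-neighbor search over item embeddings. A framework-agnostic numpy sketch (random vectors stand in for trained embeddings; at catalog scale you would use an approximate index such as Faiss or ScaNN instead of brute force):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these came out of the trained item tower, L2-normalized
item_emb = rng.normal(size=(1000, 64))
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def retrieve(user_vec, k=10):
    """Brute-force top-k retrieval in embedding space."""
    user_vec = user_vec / np.linalg.norm(user_vec)
    scores = item_emb @ user_vec          # cosine similarity per item
    return np.argsort(scores)[::-1][:k]   # indices of the k best items

# Pretend this came out of the user tower for the current user
user_vec = rng.normal(size=64)
top_items = retrieve(user_vec)
```

The dot-product scoring is why two-tower models scale: item embeddings are precomputed, and each request costs one matrix-vector product (or one ANN lookup).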
The Cold Start Problem#
The hardest challenge: recommending for new users or new items with no interaction history.
| Strategy | New Users | New Items |
|---|---|---|
| Popularity baseline | Show trending items | N/A |
| Content features | Ask preferences on signup | Use item metadata |
| Demographic matching | Match similar demographics | N/A |
| Exploration bonus | Boost diverse items | Boost new items |
| Bandit algorithms | Explore vs exploit balance | Explore vs exploit |
```python
# Multi-armed bandit for cold start (Thompson Sampling)
import numpy as np

class ThompsonSampling:
    def __init__(self, n_items):
        self.alpha = np.ones(n_items)  # successes (e.g. clicks)
        self.beta = np.ones(n_items)   # failures (e.g. skips)

    def select_item(self):
        # Sample a plausible CTR per item, show the best draw
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, item_idx, reward):
        if reward:
            self.alpha[item_idx] += 1
        else:
            self.beta[item_idx] += 1

# Usage: show an item, observe a click (1) or skip (0), update
bandit = ThompsonSampling(n_items=50)
item = bandit.select_item()
bandit.update(item, reward=1)
```
A/B Testing Recommendations#
You can't improve what you can't measure. Every recommendation change needs rigorous testing.
```
Control group (50%):   Current algorithm
Treatment group (50%): New algorithm

Metrics to track:
├── Engagement: CTR, time-on-site, pages-per-session
├── Conversion: purchases, signups, completions
├── Diversity:  unique items shown, category spread
├── Novelty:    how "surprising" recommendations are
└── Coverage:   % of catalog that gets recommended
```
Pitfall: Optimizing only for CTR creates clickbait. Track downstream metrics (purchases, retention) alongside clicks.
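Deciding whether a CTR lift is real or noise is a standard two-proportion z-test. A stdlib-only sketch (the traffic numbers are made up):

```python
from math import sqrt, erf

def ab_test_ctr(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test: is the treatment CTR significantly different?"""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p_pool = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control: 4.0% CTR, Treatment: 4.6% CTR, 50k views per arm
z, p = ab_test_ctr(2000, 50000, 2300, 50000)
# p < 0.05 → the lift is statistically significant at this traffic level
```

The same test applies to any binary downstream metric (purchase, signup); run it on those alongside CTR to catch clickbait regressions.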
Real-Time vs Batch Processing#
| Aspect | Batch | Real-Time | Hybrid |
|---|---|---|---|
| Latency | Hours | Milliseconds | Minutes |
| Freshness | Stale | Immediate | Near-real-time |
| Cost | Low | High | Medium |
| Complexity | Simple | Complex | Moderate |
Most production systems use hybrid: batch-compute candidate sets, real-time re-rank based on session context.
```
Batch pipeline (nightly):
  User history → Train model → Generate top-1000 candidates per user → Store in Redis

Real-time layer (per request):
  Session context → Re-rank candidates → Apply business rules → Return top-20
```
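The serving path above can be sketched in a few lines. The dict stands in for the Redis candidate store, and the item names, categories, and boost weight are all illustrative:

```python
# Batch layer output: per-user candidate lists with precomputed scores
candidate_store = {
    "user_42": [("item_1", 0.91), ("item_2", 0.85), ("item_3", 0.80)],
}
item_categories = {"item_1": "sci-fi", "item_2": "comedy", "item_3": "comedy"}

def rerank(user_id, session_categories, top_n=2):
    """Request time: blend batch score with current-session context."""
    rescored = []
    for item, batch_score in candidate_store.get(user_id, []):
        # Boost items matching what the user browsed this session
        boost = 0.2 if item_categories.get(item) in session_categories else 0.0
        rescored.append((item, batch_score + boost))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in rescored[:top_n]]

feed = rerank("user_42", session_categories={"comedy"})
# → ["item_2", "item_3"]: session context reorders the batch ranking
```

The split keeps request latency low: the expensive scoring happened overnight, and the per-request work is a lookup plus a small sort.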
Tools and Frameworks#
| Tool | Best For | Scale |
|---|---|---|
| Surprise | Prototyping, research | Small-medium |
| TensorFlow Recommenders | Deep learning recs | Large |
| Apache Mahout | Hadoop-based CF | Very large |
| LensKit | Academic research | Small |
| Merlin (NVIDIA) | GPU-accelerated training | Enterprise |
| Feast | Feature store for ML | Any |
| Milvus / Pinecone | Vector similarity search | Large |
Production Architecture#
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Event Stream │────→│   Feature    │────→│    Model     │
│   (Kafka)    │     │    Store     │     │   Training   │
└──────────────┘     │   (Feast)    │     │  (nightly)   │
                     └──────────────┘     └──────┬───────┘
                                                 │
┌──────────────┐     ┌──────────────┐     ┌──────▼───────┐
│ API Gateway  │←────│  Re-Ranker   │←────│  Candidate   │
│  (response)  │     │ (real-time)  │     │ Store (Redis)│
└──────────────┘     └──────────────┘     └──────────────┘
```
Key Takeaways#
- Start simple — popularity and item-based CF beat complex models with small data
- Hybrid always wins — combine multiple signals for robustness
- Solve cold start explicitly — bandits and content features fill the gap
- Measure everything — A/B test with downstream metrics, not just CTR
- Batch + real-time — precompute candidates, re-rank in real time
Build smarter systems with codelit.io — your visual architecture companion.
Article 191 of the Codelit engineering blog series.