ML System Design: Architecture Patterns for Production Machine Learning
Building a machine learning model in a notebook is the easy part. Designing the system that trains, serves, monitors, and iterates on that model in production is where the real engineering lives. This guide covers the architecture patterns that power ML systems at scale.
ML Pipeline Architecture#
A production ML system is a pipeline with distinct stages. Each stage has its own failure modes, scaling characteristics, and tooling:
┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐
│    Data    │──▶│  Feature   │──▶│  Training  │──▶│   Model    │──▶│  Serving   │
│ Ingestion  │   │   Store    │   │  Pipeline  │   │  Registry  │   │   Layer    │
└────────────┘   └────────────┘   └────────────┘   └────────────┘   └────────────┘
      │                │                │                │                │
      ▼                ▼                ▼                ▼                ▼
  Raw Data         Features        Experiments      Artifacts       Predictions
Data Ingestion#
Data ingestion feeds the entire pipeline. It must handle:
- Batch sources — Data warehouses (BigQuery, Snowflake), data lakes (S3, GCS), database replicas.
- Streaming sources — Kafka, Kinesis, Pub/Sub for real-time event data.
- Data validation — Schema checks, distribution drift detection, missing value alerts. Tools like Great Expectations and TFX Data Validation catch problems before they corrupt your features.
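The validation idea can be sketched in plain Python. This is a minimal illustration, not a replacement for Great Expectations or TFX Data Validation; the names `EXPECTED_SCHEMA`, `MAX_MISSING_RATE`, and `validate_batch` are invented for the example.

```python
# Minimal data-validation sketch: schema and missing-value checks run
# before rows enter the feature pipeline. In production a tool like
# Great Expectations would own these rules; everything here is illustrative.

EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}
MAX_MISSING_RATE = 0.05  # alert if more than 5% of a column is null

def validate_batch(rows):
    """Return a list of human-readable issues found in a batch of rows."""
    issues = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        values = [row.get(name) for row in rows]
        # Schema check: every non-null value must match the expected type.
        bad = [v for v in values if v is not None and not isinstance(v, expected_type)]
        if bad:
            issues.append(f"{name}: {len(bad)} value(s) of wrong type")
        # Missing-value check: alert when the null rate exceeds the threshold.
        missing_rate = values.count(None) / len(values)
        if missing_rate > MAX_MISSING_RATE:
            issues.append(f"{name}: {missing_rate:.0%} missing")
    return issues

batch = [
    {"user_id": 1, "amount": 9.99, "country": "US"},
    {"user_id": 2, "amount": None, "country": "DE"},
    {"user_id": 3, "amount": "12.50", "country": "US"},  # wrong type: string
]
print(validate_batch(batch))
```

A real deployment would run checks like these as a gate before feature materialization, failing the pipeline rather than silently ingesting bad rows.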
Feature Store#
A feature store is a centralized repository for computing, storing, and serving features consistently across training and inference:
                  ┌──────────────────────────────────┐
Training Job ───▶ │          Feature Store           │ ◀─── Serving Layer
                  │                                  │
Batch ETL    ───▶ │  Offline Store  │  Online Store  │ ◀─── Streaming
                  │  (Parquet/BQ)   │  (Redis/Dynamo)│
                  └─────────────────┴────────────────┘
Why it matters:
- Training-serving skew is the silent killer of ML systems. If training reads features from a warehouse but serving computes them on the fly, subtle differences produce degraded predictions.
- Feature reuse — Teams across the organization share curated features instead of re-deriving them.
- Point-in-time correctness — The store handles time-travel queries so training data never leaks future information.
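The point-in-time rule can be shown with a small sketch: for each training label, use only the latest feature value recorded at or before the label's timestamp. A real feature store does this with time-travel joins; the `feature_as_of` helper below is invented for illustration.

```python
# Point-in-time feature retrieval sketch: never let a training label see a
# feature value recorded after the label's timestamp (no future leakage).
import bisect

def feature_as_of(history, ts):
    """history: list of (timestamp, value) sorted by timestamp.
    Return the value in effect at time ts, or None if none exists yet."""
    timestamps = [t for t, _ in history]
    # Find the rightmost feature update at or before ts.
    i = bisect.bisect_right(timestamps, ts)
    return history[i - 1][1] if i > 0 else None

# Hypothetical "30-day spend" feature for one user, updated at t=1, 5, 9.
history = [(1, 100.0), (5, 250.0), (9, 400.0)]

print(feature_as_of(history, 7))   # a label at t=7 sees the t=5 value
print(feature_as_of(history, 0))   # a label before any update sees None
```

Joining naively on entity ID without this as-of logic is exactly how future information leaks into training sets.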
Popular options:
| Tool | Strengths | Considerations |
|---|---|---|
| Feast | Open-source, cloud-agnostic, active community | Requires self-hosting infra |
| Tecton | Managed, real-time feature pipelines, enterprise support | Commercial licensing |
| Vertex AI Feature Store | Native GCP integration | GCP lock-in |
| SageMaker Feature Store | Native AWS integration | AWS lock-in |
Training Pipeline#
The training pipeline orchestrates data loading, feature retrieval, model training, evaluation, and artifact storage:
- Orchestrators — Kubeflow Pipelines, Airflow, Prefect, or Dagster schedule and retry training runs.
- Experiment tracking — MLflow, Weights & Biases, or Neptune log hyperparameters, metrics, and artifacts.
- Distributed training — For large models, frameworks like Horovod, DeepSpeed, or PyTorch FSDP distribute computation across GPUs.
Model Registry#
The model registry is the version control system for trained models:
- Store model artifacts with metadata (training data version, hyperparameters, evaluation metrics).
- Tag models with lifecycle stages: staging, production, archived.
- MLflow Model Registry is the most widely adopted open-source option.
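The registry pattern can be sketched as a small in-memory class. This is an illustration of the concepts (versioning, metadata, stage transitions), not MLflow's actual API; the `ModelRegistry` class and its methods are invented for the example.

```python
# Minimal model-registry sketch: versioned artifacts with metadata and
# lifecycle stages. Promoting a new version to production automatically
# archives the previous production version.

class ModelRegistry:
    def __init__(self):
        self._versions = {}   # (name, version) -> metadata dict
        self._counter = {}    # name -> latest version number

    def register(self, name, artifact_uri, metadata):
        version = self._counter.get(name, 0) + 1
        self._counter[name] = version
        self._versions[(name, version)] = {
            "artifact_uri": artifact_uri,
            "stage": "staging",          # every new version starts in staging
            **metadata,
        }
        return version

    def promote(self, name, version, stage):
        assert stage in ("staging", "production", "archived")
        if stage == "production":
            # Archive any previous production version of this model.
            for (n, _v), meta in self._versions.items():
                if n == name and meta["stage"] == "production":
                    meta["stage"] = "archived"
        self._versions[(name, version)]["stage"] = stage

    def production_version(self, name):
        for (n, v), meta in self._versions.items():
            if n == name and meta["stage"] == "production":
                return v
        return None

registry = ModelRegistry()
v1 = registry.register("fraud", "s3://models/fraud/1", {"auc": 0.91})
v2 = registry.register("fraud", "s3://models/fraud/2", {"auc": 0.93})
registry.promote("fraud", v1, "production")
registry.promote("fraud", v2, "production")   # v1 is auto-archived
print(registry.production_version("fraud"))   # 2
```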
Batch vs Real-Time Inference#
The serving strategy depends on latency requirements and cost constraints:
Batch Inference#
Pre-compute predictions on a schedule and store them for lookup:
Input Data ──▶ Model ──▶ Predictions Table ──▶ Application
                          (refreshed nightly)
- Use when: Latency tolerance is hours, predictions are for a known set of entities (e.g., daily product recommendations).
- Advantages: Simple infrastructure, easy to validate before serving, cost-efficient.
- Tooling: Spark, Dataflow, or a simple Python job on Kubernetes.
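A batch job reduces to scoring a known set of entities and materializing a lookup table. The sketch below shows the shape of such a job; the `model_score` function is a stand-in for a real model, and all names are illustrative.

```python
# Batch-inference sketch: score every known entity on a schedule and
# write the results to a predictions table the application reads.

def model_score(features):
    # Hypothetical model: propensity grows with recent session activity.
    return min(1.0, 0.1 + 0.02 * features["sessions_7d"])

def run_batch_job(entities):
    """Score every entity and return the predictions table."""
    predictions = {}
    for entity_id, features in entities.items():
        predictions[entity_id] = round(model_score(features), 3)
    return predictions

entities = {"u1": {"sessions_7d": 10}, "u2": {"sessions_7d": 0}}
table = run_batch_job(entities)   # in production: write to a DB nightly
print(table)
```

Because the whole table exists before anything is served, it can be validated (row counts, score distributions) before the application switches over to it.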
Real-Time Inference#
Serve predictions on demand via an API:
User Request ──▶ API Gateway ──▶ Model Server ──▶ Response
(< 100 ms)
- Use when: Predictions depend on fresh input (e.g., fraud detection at transaction time, search ranking).
- Advantages: Always up-to-date, handles unseen inputs.
- Challenges: Strict latency budgets, autoscaling complexity, feature retrieval at serving time.
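The feature-retrieval challenge above can be sketched as an online-store lookup with a fallback, so a cache miss degrades the prediction rather than failing the request. `ONLINE_STORE` and `FEATURE_DEFAULTS` are illustrative stand-ins for Redis and a config file, and the model is hypothetical.

```python
# Real-time serving sketch: fetch features from the online store inside
# the request path, falling back to safe defaults if the lookup misses
# so the endpoint degrades gracefully instead of erroring.

ONLINE_STORE = {"u1": {"txn_count_1h": 4, "avg_amount_30d": 52.0}}
FEATURE_DEFAULTS = {"txn_count_1h": 0, "avg_amount_30d": 0.0}

def fraud_score(features):
    # Stand-in model: a high short-term transaction count raises risk.
    return min(1.0, 0.05 * features["txn_count_1h"])

def handle_request(user_id):
    # Online-store lookup with fallback to defaults (graceful degradation).
    features = ONLINE_STORE.get(user_id, FEATURE_DEFAULTS)
    return {"user_id": user_id, "score": fraud_score(features)}

print(handle_request("u1"))        # known user: features from the store
print(handle_request("unknown"))   # miss: defaults keep the endpoint up
```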
Hybrid Approach#
Many production systems combine both: batch inference for the long tail, real-time inference for high-priority or time-sensitive requests.
Model Serving Infrastructure#
TensorFlow Serving#
Purpose-built for TensorFlow SavedModels. Supports model versioning, batching, and gRPC/REST endpoints. Mature and battle-tested at Google scale.
NVIDIA Triton Inference Server#
Model-agnostic server supporting TensorFlow, PyTorch, ONNX, TensorRT, and custom backends. Features dynamic batching, concurrent model execution, and GPU scheduling. Ideal for heterogeneous model fleets.
vLLM#
Optimized for large language model serving. Uses PagedAttention for efficient KV-cache management, delivering 2-4x higher throughput than earlier state-of-the-art serving systems. The go-to choice for LLM inference at scale.
Comparison#
| Server | Best For | Multi-Framework | Dynamic Batching |
|---|---|---|---|
| TF Serving | TensorFlow models | No | Yes |
| Triton | Mixed model fleets | Yes | Yes |
| vLLM | LLM inference | PyTorch/HF | Yes (continuous) |
| TorchServe | PyTorch models | No | Yes |
A/B Testing Models#
Deploying a new model is a hypothesis. A/B testing validates it:
                          ┌──────────────┐
                90% ────▶ │ Model A (v1) │
User Traffic ──┤          └──────────────┘
                10% ────▶ ┌──────────────┐
                          │ Model B (v2) │
                          └──────────────┘
Implementation Patterns#
- Traffic splitting — The API gateway or service mesh (Istio, Envoy) routes a percentage of traffic to each model version.
- Shadow mode — Route 100% of traffic to both models, but only return Model A's result. Log Model B's predictions for offline comparison.
- Interleaving — For ranking systems, interleave results from both models in a single response and measure user engagement per source.
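Traffic splitting is usually done with a deterministic hash so the same user always lands in the same arm. Here is a minimal sketch; the salt, bucket count, and 10% treatment share are illustrative choices.

```python
# Hash-based traffic-splitting sketch: deterministically assign each user
# to a model arm so assignments are sticky across requests.
import hashlib

def assign_arm(user_id, treatment_pct=10, salt="fraud-v2-exp"):
    """Hash user_id into 100 buckets; the lowest `treatment_pct` go to B."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "model_b" if bucket < treatment_pct else "model_a"

arms = [assign_arm(f"user-{i}") for i in range(10_000)]
share_b = arms.count("model_b") / len(arms)
print(f"model_b share: {share_b:.1%}")                 # close to 10%
print(assign_arm("user-42") == assign_arm("user-42"))  # sticky: True
```

Salting the hash per experiment matters: without it, the same users would land in the treatment arm of every experiment, correlating results across tests.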
Statistical Rigor#
- Define the primary metric (e.g., click-through rate, conversion, latency) and guardrail metrics (e.g., error rate must not increase).
- Calculate sample size before the experiment to ensure statistical power.
- Run the test for a fixed duration — do not peek and stop early.
- Use sequential testing (e.g., mixture sequential probability ratio tests or always-valid p-values) if you need early stopping with controlled false-positive rates.
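The up-front sample-size step can be made concrete with the standard normal-approximation formula for a two-proportion test. The baseline rate, minimum detectable effect, and significance/power settings below are example values.

```python
# Sample-size sketch for a two-proportion test (e.g., detecting a CTR
# uplift), computed before the experiment starts.
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, p_treat, alpha=0.05, power=0.8):
    """Users needed in EACH arm to detect a shift from p_base to p_treat."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    n = (z_alpha + z_beta) ** 2 * variance / (p_base - p_treat) ** 2
    return math.ceil(n)

# Detecting a CTR lift from 10.0% to 11.0% at alpha=0.05 and 80% power:
print(sample_size_per_arm(0.10, 0.11))
```

Note how quickly the required sample grows as the detectable effect shrinks: the denominator is the squared difference, so halving the effect roughly quadruples the sample size.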
MLOps Tooling#
MLOps is the discipline of operationalizing ML. The ecosystem has converged around a few key platforms:
MLflow#
Open-source platform covering experiment tracking, model registry, and model deployment. Works with any ML framework. The default choice for teams starting their MLOps journey.
Kubeflow#
Kubernetes-native ML platform. Provides Pipelines (DAG orchestration), KServe (model serving), Katib (hyperparameter tuning), and training operators. Best for teams already invested in Kubernetes.
End-to-End Platforms#
- Vertex AI (GCP) — Managed pipelines, feature store, model monitoring, and serving.
- SageMaker (AWS) — Training jobs, endpoints, pipelines, and built-in algorithms.
- Databricks — Unified analytics and ML with MLflow integration and managed Spark.
CI/CD for ML#
ML CI/CD extends software CI/CD with model-specific steps:
- Code tests — Unit tests for feature transformations and model code.
- Data tests — Validate input data schema and distributions.
- Training — Retrain on latest data.
- Evaluation — Compare against the current production model on a holdout set.
- Promotion — If metrics pass thresholds, promote to staging, then production.
- Monitoring — Continuous post-deployment checks.
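The evaluation-and-promotion steps above amount to a gate: the candidate must beat production on the primary metric without regressing a guardrail. The sketch below shows that gate; the metric names and thresholds are illustrative.

```python
# Promotion-gate sketch for the CI/CD evaluation step.

def should_promote(candidate, production,
                   min_auc_gain=0.002, max_error_rate_increase=0.0):
    """Return (decision, reason) for promoting candidate to staging."""
    # Primary metric: candidate must improve AUC by at least the threshold.
    auc_gain = candidate["auc"] - production["auc"]
    if auc_gain < min_auc_gain:
        return False, f"AUC gain {auc_gain:.4f} below threshold"
    # Guardrail: error rate must not increase.
    err_increase = candidate["error_rate"] - production["error_rate"]
    if err_increase > max_error_rate_increase:
        return False, "guardrail violated: error rate increased"
    return True, "promote to staging"

prod = {"auc": 0.910, "error_rate": 0.012}
cand = {"auc": 0.918, "error_rate": 0.011}
print(should_promote(cand, prod))
```

Running this gate automatically on every retrain is what keeps "retrain on latest data" from quietly shipping a worse model.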
Monitoring Model Drift#
A model that was accurate at deployment degrades over time as the world changes. Drift monitoring catches this before users notice.
Types of Drift#
- Data drift (covariate shift) — The distribution of input features changes. Example: a fraud model trained on US transactions starts receiving EU traffic.
- Concept drift — The relationship between features and the target changes. Example: user purchasing behavior shifts during a recession.
- Prediction drift — The distribution of model outputs changes, even if inputs look stable.
Detection Methods#
| Method | What It Detects | Tools |
|---|---|---|
| PSI (Population Stability Index) | Feature distribution shift | EvidentlyAI, NannyML |
| KS Test (Kolmogorov-Smirnov) | Distribution difference | SciPy, custom |
| ADWIN | Concept drift in streams | River, MOA |
| Performance monitoring | Accuracy/F1 degradation | Prometheus + custom |
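PSI is simple enough to sketch directly: bucket the baseline and live distributions into shared bins and sum the weighted log-ratios. The bin proportions below are made-up example data; common rule-of-thumb thresholds are PSI < 0.1 (stable), 0.1-0.25 (moderate shift), > 0.25 (significant shift).

```python
# Population Stability Index sketch over pre-binned feature proportions.
import math

def psi(expected_props, actual_props, eps=1e-6):
    """PSI between a baseline and a live distribution (same bins)."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.10, 0.20, 0.40, 0.20, 0.10]   # training-time bin shares
stable   = [0.11, 0.19, 0.40, 0.21, 0.09]   # live traffic, minor noise
shifted  = [0.30, 0.30, 0.25, 0.10, 0.05]   # live traffic after drift

print(f"stable:  {psi(baseline, stable):.4f}")    # well under 0.1
print(f"shifted: {psi(baseline, shifted):.4f}")   # over 0.25: alert
```

A monitoring job would compute this per feature on a window of live traffic and fire the alert step of the playbook below when a threshold is crossed.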
Response Playbook#
- Alert — Drift score exceeds threshold.
- Diagnose — Identify which features shifted and whether concept drift is involved.
- Retrain — Trigger the training pipeline with recent data.
- Evaluate — Compare the retrained model against the current production model.
- Deploy — Roll out via A/B test or shadow mode.
- Update baselines — Reset drift reference distributions.
Architecture Checklist#
Before shipping an ML system to production, verify:
- Feature store prevents training-serving skew.
- Model registry tracks every artifact with lineage.
- Serving layer handles autoscaling and graceful degradation.
- A/B testing framework is in place for safe rollouts.
- Drift monitoring alerts on data, concept, and prediction drift.
- Rollback can revert to the previous model version in under a minute.
- CI/CD pipeline runs data validation, training, and evaluation automatically.
- Logging captures inputs, outputs, and latency for every prediction.
ML system design is not about choosing the fanciest model — it is about building the infrastructure that lets you iterate on models safely, quickly, and at scale.
Design, build, and ship ML systems with confidence at codelit.io.
This is article #166 in the Codelit engineering blog series.