ML System Design: Architecture Patterns for Production Machine Learning
Building a machine learning model in a notebook is the easy part. Designing the system that trains, serves, monitors, and iterates on that model in production is where the real engineering lives. This guide covers the architecture patterns that power ML systems at scale.
ML Pipeline Architecture#
A production ML system is a pipeline with distinct stages. Each stage has its own failure modes, scaling characteristics, and tooling:
┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐
│    Data    │──▶│  Feature   │──▶│  Training  │──▶│   Model    │──▶│  Serving   │
│ Ingestion  │   │   Store    │   │  Pipeline  │   │  Registry  │   │   Layer    │
└────────────┘   └────────────┘   └────────────┘   └────────────┘   └────────────┘
      │                │                │                │                │
      ▼                ▼                ▼                ▼                ▼
  Raw Data         Features        Experiments      Artifacts       Predictions
Data Ingestion#
Data ingestion feeds the entire pipeline. It must handle:
- Batch sources — Data warehouses (BigQuery, Snowflake), data lakes (S3, GCS), database replicas.
- Streaming sources — Kafka, Kinesis, Pub/Sub for real-time event data.
- Data validation — Schema checks, distribution drift detection, missing value alerts. Tools like Great Expectations and TFX Data Validation catch problems before they corrupt your features.
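The validation idea can be sketched in plain Python. This is a minimal illustration, not a replacement for Great Expectations or TFX Data Validation; the names `EXPECTED_SCHEMA`, `MAX_MISSING_RATE`, and `validate_batch` are invented for the example.

```python
# Minimal data-validation sketch: schema and missing-value checks run
# before rows enter the feature pipeline. In production a tool like
# Great Expectations would own these rules; everything here is illustrative.

EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}
MAX_MISSING_RATE = 0.05  # alert if more than 5% of a column is null

def validate_batch(rows):
    """Return a list of human-readable issues found in a batch of rows."""
    issues = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        values = [row.get(name) for row in rows]
        # Schema check: every non-null value must match the expected type.
        bad = [v for v in values if v is not None and not isinstance(v, expected_type)]
        if bad:
            issues.append(f"{name}: {len(bad)} value(s) of wrong type")
        # Missing-value check: alert when the null rate exceeds the threshold.
        missing_rate = values.count(None) / len(values)
        if missing_rate > MAX_MISSING_RATE:
            issues.append(f"{name}: {missing_rate:.0%} missing")
    return issues

batch = [
    {"user_id": 1, "amount": 9.99, "country": "US"},
    {"user_id": 2, "amount": None, "country": "DE"},
    {"user_id": 3, "amount": "12.50", "country": "US"},  # wrong type: string
]
print(validate_batch(batch))
```

A real deployment would run checks like these as a gate before feature materialization, failing the pipeline rather than silently ingesting bad rows.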
Feature Store#
A feature store is a centralized repository for computing, storing, and serving features consistently across training and inference:
                  ┌──────────────────────────────────┐
Training Job ───▶ │          Feature Store           │ ◀─── Serving Layer
                  │                                  │
Batch ETL    ───▶ │  Offline Store  │  Online Store  │ ◀─── Streaming
                  │  (Parquet/BQ)   │  (Redis/Dynamo)│
                  └─────────────────┴────────────────┘
Why it matters:
- Training-serving skew is the silent killer of ML systems. If training reads features from a warehouse but serving computes them on the fly, subtle differences produce degraded predictions.
- Feature reuse — Teams across the organization share curated features instead of re-deriving them.
- Point-in-time correctness — The store handles time-travel queries so training data never leaks future information.
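The point-in-time rule can be shown with a small sketch: for each training label, use only the latest feature value recorded at or before the label's timestamp. A real feature store does this with time-travel joins; the `feature_as_of` helper below is invented for illustration.

```python
# Point-in-time feature retrieval sketch: never let a training label see a
# feature value recorded after the label's timestamp (no future leakage).
import bisect

def feature_as_of(history, ts):
    """history: list of (timestamp, value) sorted by timestamp.
    Return the value in effect at time ts, or None if none exists yet."""
    timestamps = [t for t, _ in history]
    # Find the rightmost feature update at or before ts.
    i = bisect.bisect_right(timestamps, ts)
    return history[i - 1][1] if i > 0 else None

# Hypothetical "30-day spend" feature for one user, updated at t=1, 5, 9.
history = [(1, 100.0), (5, 250.0), (9, 400.0)]

print(feature_as_of(history, 7))   # a label at t=7 sees the t=5 value
print(feature_as_of(history, 0))   # a label before any update sees None
```

Joining naively on entity ID without this as-of logic is exactly how future information leaks into training sets.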
Popular options:
| Tool | Strengths | Considerations |
|---|---|---|
| Feast | Open-source, cloud-agnostic, active community | Requires self-hosting infra |
| Tecton | Managed, real-time feature pipelines, enterprise support | Commercial licensing |
| Vertex AI Feature Store | Native GCP integration | GCP lock-in |
| SageMaker Feature Store | Native AWS integration | AWS lock-in |
Training Pipeline#
The training pipeline orchestrates data loading, feature retrieval, model training, evaluation, and artifact storage:
- Orchestrators — Kubeflow Pipelines, Airflow, Prefect, or Dagster schedule and retry training runs.
- Experiment tracking — MLflow, Weights & Biases, or Neptune log hyperparameters, metrics, and artifacts.
- Distributed training — For large models, frameworks like Horovod, DeepSpeed, or PyTorch FSDP distribute computation across GPUs.
Model Registry#
The model registry is the version control system for trained models:
- Store model artifacts with metadata (training data version, hyperparameters, evaluation metrics).
- Tag models with lifecycle stages: staging, production, archived.
- MLflow Model Registry is the most widely adopted open-source option.
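The registry pattern can be sketched as a small in-memory class. This is an illustration of the concepts (versioning, metadata, stage transitions), not MLflow's actual API; the `ModelRegistry` class and its methods are invented for the example.

```python
# Minimal model-registry sketch: versioned artifacts with metadata and
# lifecycle stages. Promoting a new version to production automatically
# archives the previous production version.

class ModelRegistry:
    def __init__(self):
        self._versions = {}   # (name, version) -> metadata dict
        self._counter = {}    # name -> latest version number

    def register(self, name, artifact_uri, metadata):
        version = self._counter.get(name, 0) + 1
        self._counter[name] = version
        self._versions[(name, version)] = {
            "artifact_uri": artifact_uri,
            "stage": "staging",          # every new version starts in staging
            **metadata,
        }
        return version

    def promote(self, name, version, stage):
        assert stage in ("staging", "production", "archived")
        if stage == "production":
            # Archive any previous production version of this model.
            for (n, _v), meta in self._versions.items():
                if n == name and meta["stage"] == "production":
                    meta["stage"] = "archived"
        self._versions[(name, version)]["stage"] = stage

    def production_version(self, name):
        for (n, v), meta in self._versions.items():
            if n == name and meta["stage"] == "production":
                return v
        return None

registry = ModelRegistry()
v1 = registry.register("fraud", "s3://models/fraud/1", {"auc": 0.91})
v2 = registry.register("fraud", "s3://models/fraud/2", {"auc": 0.93})
registry.promote("fraud", v1, "production")
registry.promote("fraud", v2, "production")   # v1 is auto-archived
print(registry.production_version("fraud"))   # 2
```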
Batch vs Real-Time Inference#
The serving strategy depends on latency requirements and cost constraints:
Batch Inference#
Pre-compute predictions on a schedule and store them for lookup:
Input Data ──▶ Model ──▶ Predictions Table ──▶ Application
                          (refreshed nightly)
- Use when: Latency tolerance is hours, predictions are for a known set of entities (e.g., daily product recommendations).
- Advantages: Simple infrastructure, easy to validate before serving, cost-efficient.
- Tooling: Spark, Dataflow, or a simple Python job on Kubernetes.
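A batch job reduces to scoring a known set of entities and materializing a lookup table. The sketch below shows the shape of such a job; the `model_score` function is a stand-in for a real model, and all names are illustrative.

```python
# Batch-inference sketch: score every known entity on a schedule and
# write the results to a predictions table the application reads.

def model_score(features):
    # Hypothetical model: propensity grows with recent session activity.
    return min(1.0, 0.1 + 0.02 * features["sessions_7d"])

def run_batch_job(entities):
    """Score every entity and return the predictions table."""
    predictions = {}
    for entity_id, features in entities.items():
        predictions[entity_id] = round(model_score(features), 3)
    return predictions

entities = {"u1": {"sessions_7d": 10}, "u2": {"sessions_7d": 0}}
table = run_batch_job(entities)   # in production: write to a DB nightly
print(table)
```

Because the whole table exists before anything is served, it can be validated (row counts, score distributions) before the application switches over to it.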
Real-Time Inference#
Serve predictions on demand via an API:
User Request ──▶ API Gateway ──▶ Model Server ──▶ Response
(< 100 ms)
- Use when: Predictions depend on fresh input (e.g., fraud detection at transaction time, search ranking).
- Advantages: Always up-to-date, handles unseen inputs.
- Challenges: Strict latency budgets, autoscaling complexity, feature retrieval at serving time.
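The feature-retrieval challenge above can be sketched as an online-store lookup with a fallback, so a cache miss degrades the prediction rather than failing the request. `ONLINE_STORE` and `FEATURE_DEFAULTS` are illustrative stand-ins for Redis and a config file, and the model is hypothetical.

```python
# Real-time serving sketch: fetch features from the online store inside
# the request path, falling back to safe defaults if the lookup misses
# so the endpoint degrades gracefully instead of erroring.

ONLINE_STORE = {"u1": {"txn_count_1h": 4, "avg_amount_30d": 52.0}}
FEATURE_DEFAULTS = {"txn_count_1h": 0, "avg_amount_30d": 0.0}

def fraud_score(features):
    # Stand-in model: a high short-term transaction count raises risk.
    return min(1.0, 0.05 * features["txn_count_1h"])

def handle_request(user_id):
    # Online-store lookup with fallback to defaults (graceful degradation).
    features = ONLINE_STORE.get(user_id, FEATURE_DEFAULTS)
    return {"user_id": user_id, "score": fraud_score(features)}

print(handle_request("u1"))        # known user: features from the store
print(handle_request("unknown"))   # miss: defaults keep the endpoint up
```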
Hybrid Approach#
Many production systems combine both: batch inference for the long tail, real-time inference for high-priority or time-sensitive requests.
Model Serving Infrastructure#
TensorFlow Serving#
Purpose-built for TensorFlow SavedModels. Supports model versioning, batching, and gRPC/REST endpoints. Mature and battle-tested at Google scale.
NVIDIA Triton Inference Server#
Model-agnostic server supporting TensorFlow, PyTorch, ONNX, TensorRT, and custom backends. Features dynamic batching, concurrent model execution, and GPU scheduling. Ideal for heterogeneous model fleets.
vLLM#
Optimized for large language model serving. Uses PagedAttention for efficient KV-cache management, delivering 2-4x higher throughput than earlier state-of-the-art serving systems. The go-to choice for LLM inference at scale.
Comparison#
| Server | Best For | Multi-Framework | Dynamic Batching |
|---|---|---|---|
| TF Serving | TensorFlow models | No | Yes |
| Triton | Mixed model fleets | Yes | Yes |
| vLLM | LLM inference | PyTorch/HF | Yes (continuous) |
| TorchServe | PyTorch models | No | Yes |
A/B Testing Models#
Deploying a new model is a hypothesis. A/B testing validates it:
                          ┌──────────────┐
                90% ────▶ │ Model A (v1) │
User Traffic ──┤          └──────────────┘
                10% ────▶ ┌──────────────┐
                          │ Model B (v2) │
                          └──────────────┘
Implementation Patterns#
- Traffic splitting — The API gateway or service mesh (Istio, Envoy) routes a percentage of traffic to each model version.
- Shadow mode — Route 100% of traffic to both models, but only return Model A's result. Log Model B's predictions for offline comparison.
- Interleaving — For ranking systems, interleave results from both models in a single response and measure user engagement per source.
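Traffic splitting is usually done with a deterministic hash so the same user always lands in the same arm. Here is a minimal sketch; the salt, bucket count, and 10% treatment share are illustrative choices.

```python
# Hash-based traffic-splitting sketch: deterministically assign each user
# to a model arm so assignments are sticky across requests.
import hashlib

def assign_arm(user_id, treatment_pct=10, salt="fraud-v2-exp"):
    """Hash user_id into 100 buckets; the lowest `treatment_pct` go to B."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "model_b" if bucket < treatment_pct else "model_a"

arms = [assign_arm(f"user-{i}") for i in range(10_000)]
share_b = arms.count("model_b") / len(arms)
print(f"model_b share: {share_b:.1%}")                 # close to 10%
print(assign_arm("user-42") == assign_arm("user-42"))  # sticky: True
```

Salting the hash per experiment matters: without it, the same users would land in the treatment arm of every experiment, correlating results across tests.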
Statistical Rigor#
- Define the primary metric (e.g., click-through rate, conversion, latency) and guardrail metrics (e.g., error rate must not increase).
- Calculate sample size before the experiment to ensure statistical power.
- Run the test for a fixed duration — do not peek and stop early.
- Use sequential testing (e.g., mixture sequential probability ratio tests or always-valid p-values) if you need early stopping with controlled false-positive rates.
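The up-front sample-size step can be made concrete with the standard normal-approximation formula for a two-proportion test. The baseline rate, minimum detectable effect, and significance/power settings below are example values.

```python
# Sample-size sketch for a two-proportion test (e.g., detecting a CTR
# uplift), computed before the experiment starts.
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, p_treat, alpha=0.05, power=0.8):
    """Users needed in EACH arm to detect a shift from p_base to p_treat."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    n = (z_alpha + z_beta) ** 2 * variance / (p_base - p_treat) ** 2
    return math.ceil(n)

# Detecting a CTR lift from 10.0% to 11.0% at alpha=0.05 and 80% power:
print(sample_size_per_arm(0.10, 0.11))
```

Note how quickly the required sample grows as the detectable effect shrinks: the denominator is the squared difference, so halving the effect roughly quadruples the sample size.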
MLOps Tooling#
MLOps is the discipline of operationalizing ML. The ecosystem has converged around a few key platforms:
MLflow#
Open-source platform covering experiment tracking, model registry, and model deployment. Works with any ML framework. The default choice for teams starting their MLOps journey.
Kubeflow#
Kubernetes-native ML platform. Provides Pipelines (DAG orchestration), KServe (model serving), Katib (hyperparameter tuning), and training operators. Best for teams already invested in Kubernetes.
End-to-End Platforms#
- Vertex AI (GCP) — Managed pipelines, feature store, model monitoring, and serving.
- SageMaker (AWS) — Training jobs, endpoints, pipelines, and built-in algorithms.
- Databricks — Unified analytics and ML with MLflow integration and managed Spark.
CI/CD for ML#
ML CI/CD extends software CI/CD with model-specific steps:
- Code tests — Unit tests for feature transformations and model code.
- Data tests — Validate input data schema and distributions.
- Training — Retrain on latest data.
- Evaluation — Compare against the current production model on a holdout set.
- Promotion — If metrics pass thresholds, promote to staging, then production.
- Monitoring — Continuous post-deployment checks.
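The evaluation-and-promotion steps above amount to a gate: the candidate must beat production on the primary metric without regressing a guardrail. The sketch below shows that gate; the metric names and thresholds are illustrative.

```python
# Promotion-gate sketch for the CI/CD evaluation step.

def should_promote(candidate, production,
                   min_auc_gain=0.002, max_error_rate_increase=0.0):
    """Return (decision, reason) for promoting candidate to staging."""
    # Primary metric: candidate must improve AUC by at least the threshold.
    auc_gain = candidate["auc"] - production["auc"]
    if auc_gain < min_auc_gain:
        return False, f"AUC gain {auc_gain:.4f} below threshold"
    # Guardrail: error rate must not increase.
    err_increase = candidate["error_rate"] - production["error_rate"]
    if err_increase > max_error_rate_increase:
        return False, "guardrail violated: error rate increased"
    return True, "promote to staging"

prod = {"auc": 0.910, "error_rate": 0.012}
cand = {"auc": 0.918, "error_rate": 0.011}
print(should_promote(cand, prod))
```

Running this gate automatically on every retrain is what keeps "retrain on latest data" from quietly shipping a worse model.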
Monitoring Model Drift#
A model that was accurate at deployment degrades over time as the world changes. Drift monitoring catches this before users notice.
Types of Drift#
- Data drift (covariate shift) — The distribution of input features changes. Example: a fraud model trained on US transactions starts receiving EU traffic.
- Concept drift — The relationship between features and the target changes. Example: user purchasing behavior shifts during a recession.
- Prediction drift — The distribution of model outputs changes, even if inputs look stable.
Detection Methods#
| Method | What It Detects | Tools |
|---|---|---|
| PSI (Population Stability Index) | Feature distribution shift | EvidentlyAI, NannyML |
| KS Test (Kolmogorov-Smirnov) | Distribution difference | SciPy, custom |
| ADWIN | Concept drift in streams | River, MOA |
| Performance monitoring | Accuracy/F1 degradation | Prometheus + custom |
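PSI is simple enough to sketch directly: bucket the baseline and live distributions into shared bins and sum the weighted log-ratios. The bin proportions below are made-up example data; common rule-of-thumb thresholds are PSI < 0.1 (stable), 0.1-0.25 (moderate shift), > 0.25 (significant shift).

```python
# Population Stability Index sketch over pre-binned feature proportions.
import math

def psi(expected_props, actual_props, eps=1e-6):
    """PSI between a baseline and a live distribution (same bins)."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.10, 0.20, 0.40, 0.20, 0.10]   # training-time bin shares
stable   = [0.11, 0.19, 0.40, 0.21, 0.09]   # live traffic, minor noise
shifted  = [0.30, 0.30, 0.25, 0.10, 0.05]   # live traffic after drift

print(f"stable:  {psi(baseline, stable):.4f}")    # well under 0.1
print(f"shifted: {psi(baseline, shifted):.4f}")   # over 0.25: alert
```

A monitoring job would compute this per feature on a window of live traffic and fire the alert step of the playbook below when a threshold is crossed.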
Response Playbook#
- Alert — Drift score exceeds threshold.
- Diagnose — Identify which features shifted and whether concept drift is involved.
- Retrain — Trigger the training pipeline with recent data.
- Evaluate — Compare the retrained model against the current production model.
- Deploy — Roll out via A/B test or shadow mode.
- Update baselines — Reset drift reference distributions.
Architecture Checklist#
Before shipping an ML system to production, verify:
- Feature store prevents training-serving skew.
- Model registry tracks every artifact with lineage.
- Serving layer handles autoscaling and graceful degradation.
- A/B testing framework is in place for safe rollouts.
- Drift monitoring alerts on data, concept, and prediction drift.
- Rollback can revert to the previous model version in under a minute.
- CI/CD pipeline runs data validation, training, and evaluation automatically.
- Logging captures inputs, outputs, and latency for every prediction.
ML system design is not about choosing the fanciest model — it is about building the infrastructure that lets you iterate on models safely, quickly, and at scale.
Design, build, and ship ML systems with confidence at codelit.io.
This is article #166 in the Codelit engineering blog series.