LLM Inference Architecture — Serving, Batching, Quantization, and GPU Optimization
Training gets the headlines. Inference pays the bills.
Serving a large language model in production is an engineering challenge: you need low latency, high throughput, and reasonable cost — all while the model wants to consume every GPU you own.
This guide covers the architecture of LLM inference from GPU memory to HTTP response.
The Inference Pipeline#
Every LLM inference request follows the same steps:
- Tokenize — convert text to token IDs
- Prefill — process all input tokens in parallel (compute-bound)
- Decode — generate output tokens one at a time, autoregressively (memory-bound)
- Detokenize — convert token IDs back to text
The prefill phase is fast per-token but processes many tokens. The decode phase is slow because each new token depends on the previous one. Most optimization targets the decode phase.
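The four stages can be sketched in miniature. A stub lookup stands in for the model here, and every name is illustrative; a real engine runs a transformer forward pass and reuses a KV cache so each decode step only processes the newest token:

```python
# Toy version of the four pipeline stages.
VOCAB = {"hello": 0, "world": 1, "<eos>": 2}
INV = {v: k for k, v in VOCAB.items()}

def forward(context):
    # Stub next-token predictor: "hello" -> "world", otherwise stop.
    return VOCAB["world"] if context[-1] == VOCAB["hello"] else VOCAB["<eos>"]

def generate(prompt, max_new_tokens=8):
    ids = [VOCAB[w] for w in prompt.split()]   # 1. tokenize
    cache = list(ids)                          # 2. prefill: whole prompt at once
    out = []
    for _ in range(max_new_tokens):            # 3. decode: one token per step
        nxt = forward(cache)
        if nxt == VOCAB["<eos>"]:
            break
        out.append(nxt)
        cache.append(nxt)                      # grow the (stand-in) KV cache
    return " ".join(INV[i] for i in out)       # 4. detokenize

print(generate("hello"))  # world
```

The prefill happens in one pass over the prompt; the sequential loop is why decode latency scales with output length.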
Model Serving Frameworks#
vLLM#
The most popular open-source LLM serving engine. Key innovation: PagedAttention, which manages KV cache memory like an operating system manages virtual memory pages.
Features:
- PagedAttention for near-zero KV cache waste
- Continuous batching
- Tensor parallelism across GPUs
- OpenAI-compatible API
- Supports most HuggingFace models
Best for: General-purpose LLM serving with maximum throughput.
Text Generation Inference (TGI)#
Built by HuggingFace. Production-ready with built-in support for quantization, tensor parallelism, and streaming.
Features:
- Flash Attention 2 integration
- Continuous batching
- Token streaming via SSE
- Watermarking of generated text
- gRPC and HTTP APIs
Best for: HuggingFace ecosystem users who want a batteries-included server.
NVIDIA Triton Inference Server#
A general-purpose inference server that supports any model framework (PyTorch, TensorFlow, TensorRT, ONNX). Combined with the TensorRT-LLM backend, it is highly optimized for NVIDIA GPUs.
Features:
- Multi-model serving
- Dynamic batching
- Model ensembles and pipelines
- GPU and CPU inference
- Kubernetes-native with Helm charts
Best for: Enterprise deployments needing multi-model serving on NVIDIA hardware.
Ollama#
Lightweight, local-first LLM runner. Downloads and serves models with a single command. Uses llama.cpp under the hood.
Best for: Local development, experimentation, and edge deployment.
Batching Strategies#
Batching multiple requests together amortizes the fixed cost of loading model weights from GPU memory.
Static Batching#
Collect N requests, process them together, return all results. Simple but wasteful — short sequences wait for long ones to finish.
Dynamic Batching#
Set a maximum wait time (e.g., 50ms). Batch together whatever requests arrive in that window. Better utilization than static batching.
Continuous Batching (Iteration-Level)#
The breakthrough that vLLM and TGI use. Instead of waiting for all sequences in a batch to finish, evict completed sequences and insert new ones at every decode step.
Why it matters: consider a batch of 32 requests where some finish in 10 tokens and others need 500. Static batching holds GPU memory for all 32 slots until the longest sequence finishes. Continuous batching frees slots as sequences complete and fills them with waiting requests.
Continuous batching can improve throughput by 2-10x over static batching.
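A toy step-count simulation makes the gap concrete. This is not vLLM's actual scheduler, just a model under the assumption that every active sequence decodes one token per iteration:

```python
def static_steps(lengths, batch_size):
    # Each static batch runs until its longest sequence finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_steps(lengths, batch_size):
    # Every iteration: fill free slots from the queue, decode one token
    # per active sequence, evict sequences that just finished.
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [200, 200] + [10] * 40   # two long requests, many short ones
print(static_steps(lengths, 4), continuous_steps(lengths, 4))  # prints: 300 200
```

With more skewed length distributions and deeper queues, the gap widens further, which is where the 2-10x figures come from.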
KV Cache Management#
During autoregressive decoding, the model reuses the key-value tensors from all previous tokens. This KV cache grows linearly with sequence length and batch size.
For a 7B-parameter model with 4K context (FP16 cache, multi-head attention):
- KV cache per sequence: ~2 GB (grouped-query attention cuts this 4-8x)
- Batch of 32: ~64 GB, most of an A100-80GB consumed by KV cache alone
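The exact size depends on layer count, number of KV heads, head dimension, and cache dtype. A back-of-envelope calculator (the configs below are approximations in the style of Llama-2-7B and Mistral-7B; check each model card for real values):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for keys and values; per layer, per KV head, per head dim, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Multi-head attention, Llama-2-7B-like: 32 layers, 32 KV heads, head_dim 128.
mha = kv_cache_bytes(32, 32, 128, 4096)   # FP16, 4K context
# Grouped-query attention, Mistral-7B-like: only 8 KV heads.
gqa = kv_cache_bytes(32, 8, 128, 4096)
print(mha // 2**20, gqa // 2**20)  # prints: 2048 512  (MiB per sequence)
```

Grouped-query attention shrinks the cache by the ratio of query heads to KV heads, which is why newer models tolerate much larger batches.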
PagedAttention (vLLM)#
Instead of pre-allocating a contiguous block for each sequence's maximum length, PagedAttention allocates KV cache in small fixed-size pages on demand. Pages can be non-contiguous in GPU memory.
Benefits:
- Near-zero memory waste (no pre-allocation padding)
- Sequences can share KV cache pages (e.g., common system prompts)
- Memory utilization goes from ~50% to ~95%
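A minimal sketch of the page-table idea (the page size is illustrative, and real vLLM additionally reference-counts pages so sequences can share them):

```python
PAGE_SIZE = 16  # tokens per KV page (vLLM calls these blocks)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # pool of free physical pages
        self.tables = {}                    # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        # Allocate a physical page only when a sequence crosses a page
        # boundary; pages need not be contiguous in GPU memory.
        table = self.tables.setdefault(seq_id, [])
        if pos % PAGE_SIZE == 0:
            table.append(self.free.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE  # (page, offset)

    def release(self, seq_id):
        # Finished sequence: recycle its pages immediately.
        self.free.extend(self.tables.pop(seq_id))

cache = PagedKVCache(num_pages=64)
for pos in range(20):
    page, offset = cache.append_token("seq-1", pos)
print(len(cache.tables["seq-1"]))  # 2 pages for 20 tokens, no max-length padding
```

Contrast this with contiguous pre-allocation, which would reserve the sequence's maximum length up front and waste everything past the actual length.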
KV Cache Quantization#
Compress the KV cache to FP8 or INT8, halving memory usage with minimal quality loss. Supported in vLLM and TensorRT-LLM.
Prefix Caching#
If many requests share the same system prompt, cache the KV values for that prefix. New requests skip the prefill for the shared prefix entirely. vLLM calls this "automatic prefix caching."
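In sketch form (the cache key and bookkeeping are simplified here; a real engine caches KV pages keyed by hashes of token blocks):

```python
prefix_cache = {}  # prefix token ids -> stand-in for cached KV pages

def prefill_work(tokens, prefix_len):
    # Return how many tokens actually need a forward pass.
    key = tuple(tokens[:prefix_len])
    if key in prefix_cache:
        return len(tokens) - prefix_len  # prefix KV is reused, skip it
    prefix_cache[key] = "cached-kv"      # first request pays the full cost
    return len(tokens)

system = list(range(100))                        # a 100-token shared system prompt
first = prefill_work(system + [200, 201], 100)   # full prefill
second = prefill_work(system + [300, 301], 100)  # prefix already cached
print(first, second)  # prints: 102 2
```

For chat workloads with long system prompts, this can eliminate the bulk of prefill compute.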
Quantization#
Full-precision models (FP16/BF16) use 2 bytes per parameter. A 70B model needs 140 GB — two A100-80GB GPUs minimum. Quantization shrinks the model.
GPTQ (Post-Training Quantization)#
Quantizes weights to 4-bit or 3-bit using a calibration dataset. Weights are stored as integers and dequantized during computation.
- 70B model at 4-bit: ~35 GB (fits on one A100-80GB)
- Minimal quality loss for 4-bit; noticeable at 3-bit
- One-time quantization cost using a calibration set
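The storage scheme can be sketched with plain round-to-nearest quantization. This omits GPTQ's key contribution, calibration-driven error compensation, but the 4-bit-integers-plus-per-group-scale layout is the same idea:

```python
def quantize_group(weights, bits=4):
    # One FP scale per group; weights stored as signed ints in
    # [-2^(bits-1), 2^(bits-1) - 1].
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Done on the fly inside the matmul kernel at inference time.
    return [v * scale for v in q]

group = [0.12, -0.43, 0.05, 0.31]
q, scale = quantize_group(group)
restored = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(group, restored))
print(q, err <= scale / 2)  # 4-bit ints; error bounded by half a quantization step
```

Real implementations use group sizes of 64-128 weights so the per-group scale overhead stays small.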
AWQ (Activation-Aware Weight Quantization)#
Identifies the ~1% of "salient" weight channels that matter most for accuracy and protects them by scaling them up before quantization; the remaining weights are quantized normally. This avoids hardware-unfriendly mixed-precision storage.
- Often better quality than GPTQ at the same bit-width
- Fast inference via optimized low-bit kernels
- Supported by vLLM and TGI
GGUF (llama.cpp)#
A file format for quantized models used by llama.cpp and Ollama. Supports 2-bit through 8-bit quantization with various schemes (Q4_K_M, Q5_K_S, etc.).
- Runs on CPU, Apple Silicon (Metal), and NVIDIA GPUs
- Great for local and edge deployment
- Active community quantizing every new model within hours of release
FP8#
NVIDIA H100 and newer GPUs natively support FP8 arithmetic. FP8 quantization halves memory vs. FP16 with almost no quality loss because the hardware handles the reduced precision natively.
Speculative Decoding#
Autoregressive decoding is slow because each token depends on the previous one. Speculative decoding parallelizes this.
How it works:
- A small draft model (e.g., 1B params) generates K candidate tokens quickly
- The large target model verifies all K tokens in a single forward pass (parallelizable)
- If the target model agrees, all K tokens are accepted
- If it disagrees at position i, tokens after i are discarded and regenerated
Speedup: 2-3x faster decoding with no quality loss (the target model always verifies). Works best when the draft model has high agreement with the target.
vLLM, TGI, and TensorRT-LLM all support speculative decoding.
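The accept/reject loop can be simulated with two toy predictors. This is the greedy variant; the sampling variant uses rejection sampling so the output distribution matches the target exactly:

```python
def speculative_step(draft_next, target_next, context, k=4):
    # 1. Draft proposes k tokens autoregressively (cheap).
    ctx, proposed = list(context), []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. Target checks every position; in a real engine this is one
    #    parallel forward pass, not k sequential ones.
    accepted, ctx = [], list(context)
    for tok in proposed:
        verified = target_next(ctx)
        accepted.append(verified)        # the target's token is always kept
        if verified != tok:
            break                        # disagreement: discard the rest
        ctx.append(tok)
    return accepted

# Toy models: the target continues the integer sequence; the draft
# agrees with it except at one position.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 99 if len(ctx) == 3 else ctx[-1] + 1
print(speculative_step(draft, target, [1]))  # [2, 3, 4]: 3 tokens per target pass
```

Each target pass yields as many tokens as the draft got right, plus the target's own correction, which is where the 2-3x speedup comes from.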
Streaming Responses#
Users expect to see tokens appear as they are generated, not wait for the full response.
Implementation:
- Server sends tokens via Server-Sent Events (SSE) over HTTP
- Each SSE event contains one or more new tokens
- Client renders tokens incrementally
Architecture considerations:
- Load balancers must support long-lived HTTP connections
- Token-level streaming means the connection stays open for the full generation time
- vLLM and TGI expose OpenAI-compatible streaming endpoints out of the box
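The wire format itself is simple. A sketch of OpenAI-style SSE framing (payload fields abbreviated; real endpoints add ids, timestamps, and finish reasons):

```python
import json

def sse_events(token_stream, model="my-model"):
    # One SSE event per generated token: "data: <json>\n\n" frames,
    # terminated by the OpenAI-style "[DONE]" sentinel.
    for tok in token_stream:
        chunk = {"model": model, "choices": [{"delta": {"content": tok}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

events = list(sse_events(["Hel", "lo"]))
print(events[0], end="")  # first frame carries the "Hel" delta
```

The blank line after each `data:` payload is what delimits SSE events, so proxies that buffer or strip newlines will silently break streaming.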
GPU Optimization#
Tensor Parallelism#
Split model layers across multiple GPUs on the same node. Each GPU holds a slice of every layer. Requires high-bandwidth interconnect (NVLink).
- 70B model on 4x A100-80GB: each GPU holds ~17.5B parameters
- Linear speedup for prefill; moderate speedup for decode
Pipeline Parallelism#
Split model layers sequentially — GPU 0 runs layers 0-15, GPU 1 runs layers 16-31. Simpler than tensor parallelism but introduces pipeline bubbles.
Flash Attention#
Rewrites the attention computation to minimize reads and writes to GPU high-bandwidth memory (attention is IO-bound: limited by memory bandwidth, not FLOPs). Flash Attention 2 is standard in all modern serving frameworks; Flash Attention 3 targets H100 hardware.
CUDA Graphs#
Capture the GPU kernel launch sequence for a fixed batch size and replay it without CPU overhead. Reduces per-token latency by eliminating kernel launch overhead.
Cost Per Token#
LLM inference cost depends on:
| Factor | Impact |
|---|---|
| Model size | Larger models need more GPUs |
| Quantization | 4-bit cuts GPU count by ~4x vs FP16 |
| Batch size | Higher batch = lower cost per token |
| Sequence length | Longer context = more KV cache memory |
| GPU choice | H100 is ~2x faster than A100, but ~2x the price |
Rough cost benchmarks (cloud GPU, 2026):
- GPT-4 class (API): $2-10 per 1M output tokens
- Llama 3.1 70B (self-hosted, 4-bit, A100): ~$0.50 per 1M output tokens
- Llama 3.1 8B (self-hosted, FP16, L4): ~$0.10 per 1M output tokens
Self-hosting breaks even at roughly 1-10M tokens per day, depending on model size and GPU cost.
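The underlying math is one line. All numbers below are illustrative placeholders, not benchmarks; plug in your own GPU price and measured throughput:

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, num_gpus=1):
    # tokens_per_second is aggregate throughput across the whole
    # continuous batch, not per request.
    tokens_per_hour = tokens_per_second * 3600
    return num_gpus * gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical: one A100 at $2/hr sustaining 1,000 output tokens/sec.
print(round(cost_per_million_tokens(2.0, 1000), 2))  # prints: 0.56
```

Doubling the sustained batch throughput halves the cost per token, which is why continuous batching dominates every other optimization on this list.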
Production Checklist#
- Choose a serving framework — vLLM for throughput, TGI for HuggingFace ecosystem, Triton for enterprise
- Quantize — AWQ or GPTQ to 4-bit unless quality demands FP16
- Enable continuous batching — default in vLLM and TGI
- Set max sequence length — limit KV cache memory usage
- Enable prefix caching — if requests share system prompts
- Add autoscaling — scale GPU instances based on queue depth
- Monitor — track tokens/second, time-to-first-token, queue wait time, GPU utilization
- Set up streaming — SSE endpoints for responsive UX
Start Building#
LLM inference is where AI meets systems engineering. The difference between a $50K/month GPU bill and a $5K/month bill is in the architecture: quantization, continuous batching, KV cache management, and smart GPU allocation.
Design your LLM serving architecture on codelit.io — describe your system, get an interactive architecture diagram, export as code.