LLM Inference Architecture — Serving, Batching, Quantization, and GPU Optimization
Training gets the headlines. Inference pays the bills.
Serving a large language model in production is an engineering challenge: you need low latency, high throughput, and reasonable cost — all while the model wants to consume every GPU you own.
This guide covers the architecture of LLM inference from GPU memory to HTTP response.
The Inference Pipeline#
Every LLM inference request follows the same steps:
- Tokenize — convert text to token IDs
- Prefill — process all input tokens in parallel (compute-bound)
- Decode — generate output tokens one at a time, autoregressively (memory-bound)
- Detokenize — convert token IDs back to text
The prefill phase is fast per-token but processes many tokens. The decode phase is slow because each new token depends on the previous one. Most optimization targets the decode phase.
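The four stages can be sketched in miniature. A stub lookup stands in for the model here, and every name is illustrative; a real engine runs a transformer forward pass and reuses a KV cache so each decode step only processes the newest token:

```python
# Toy version of the four pipeline stages.
VOCAB = {"hello": 0, "world": 1, "<eos>": 2}
INV = {v: k for k, v in VOCAB.items()}

def forward(context):
    # Stub next-token predictor: "hello" -> "world", otherwise stop.
    return VOCAB["world"] if context[-1] == VOCAB["hello"] else VOCAB["<eos>"]

def generate(prompt, max_new_tokens=8):
    ids = [VOCAB[w] for w in prompt.split()]   # 1. tokenize
    cache = list(ids)                          # 2. prefill: whole prompt at once
    out = []
    for _ in range(max_new_tokens):            # 3. decode: one token per step
        nxt = forward(cache)
        if nxt == VOCAB["<eos>"]:
            break
        out.append(nxt)
        cache.append(nxt)                      # grow the (stand-in) KV cache
    return " ".join(INV[i] for i in out)       # 4. detokenize

print(generate("hello"))  # world
```

The prefill happens in one pass over the prompt; the sequential loop is why decode latency scales with output length.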
Model Serving Frameworks#
vLLM#
The most popular open-source LLM serving engine. Key innovation: PagedAttention, which manages KV cache memory like an operating system manages virtual memory pages.
Features:
- PagedAttention for near-zero KV cache waste
- Continuous batching
- Tensor parallelism across GPUs
- OpenAI-compatible API
- Supports most HuggingFace models
Best for: General-purpose LLM serving with maximum throughput.
Text Generation Inference (TGI)#
Built by HuggingFace. Production-ready with built-in support for quantization, tensor parallelism, and streaming.
Features:
- Flash Attention 2 integration
- Continuous batching
- Token streaming via SSE
- Watermarking of generated text
- gRPC and HTTP APIs
Best for: HuggingFace ecosystem users who want a batteries-included server.
NVIDIA Triton Inference Server#
A general-purpose inference server that supports any model framework (PyTorch, TensorFlow, TensorRT, ONNX). Combined with the TensorRT-LLM backend, it is highly optimized for NVIDIA GPUs.
Features:
- Multi-model serving
- Dynamic batching
- Model ensembles and pipelines
- GPU and CPU inference
- Kubernetes-native with Helm charts
Best for: Enterprise deployments needing multi-model serving on NVIDIA hardware.
Ollama#
Lightweight, local-first LLM runner. Downloads and serves models with a single command. Uses llama.cpp under the hood.
Best for: Local development, experimentation, and edge deployment.
Batching Strategies#
Batching multiple requests together amortizes the fixed cost of loading model weights from GPU memory.
Static Batching#
Collect N requests, process them together, return all results. Simple but wasteful — short sequences wait for long ones to finish.
Dynamic Batching#
Set a maximum wait time (e.g., 50ms). Batch together whatever requests arrive in that window. Better utilization than static batching.
Continuous Batching (Iteration-Level)#
The breakthrough that vLLM and TGI use. Instead of waiting for all sequences in a batch to finish, evict completed sequences and insert new ones at every decode step.
Why it matters: consider a batch of 32 requests where some finish in 10 tokens and others need 500. Static batching holds GPU memory for all 32 slots until the longest sequence finishes. Continuous batching frees slots as sequences complete and fills them with waiting requests.
Continuous batching can improve throughput by 2-10x over static batching.
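A toy step-count simulation makes the gap concrete. This is not vLLM's actual scheduler, just a model under the assumption that every active sequence decodes one token per iteration:

```python
def static_steps(lengths, batch_size):
    # Each static batch runs until its longest sequence finishes.
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_steps(lengths, batch_size):
    # Every iteration: fill free slots from the queue, decode one token
    # per active sequence, evict sequences that just finished.
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [200, 200] + [10] * 40   # two long requests, many short ones
print(static_steps(lengths, 4), continuous_steps(lengths, 4))  # prints: 300 200
```

With more skewed length distributions and deeper queues, the gap widens further, which is where the 2-10x figures come from.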
KV Cache Management#
During autoregressive decoding, the model reuses the key-value tensors from all previous tokens. This KV cache grows linearly with sequence length and batch size.
For a 7B-parameter model with 4K context (FP16 cache, multi-head attention):
- KV cache per sequence: ~2 GB (grouped-query attention cuts this 4-8x)
- Batch of 32: ~64 GB, most of an A100-80GB consumed by KV cache alone
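The exact size depends on layer count, number of KV heads, head dimension, and cache dtype. A back-of-envelope calculator (the configs below are approximations in the style of Llama-2-7B and Mistral-7B; check each model card for real values):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for keys and values; per layer, per KV head, per head dim, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Multi-head attention, Llama-2-7B-like: 32 layers, 32 KV heads, head_dim 128.
mha = kv_cache_bytes(32, 32, 128, 4096)   # FP16, 4K context
# Grouped-query attention, Mistral-7B-like: only 8 KV heads.
gqa = kv_cache_bytes(32, 8, 128, 4096)
print(mha // 2**20, gqa // 2**20)  # prints: 2048 512  (MiB per sequence)
```

Grouped-query attention shrinks the cache by the ratio of query heads to KV heads, which is why newer models tolerate much larger batches.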
PagedAttention (vLLM)#
Instead of pre-allocating a contiguous block for each sequence's maximum length, PagedAttention allocates KV cache in small fixed-size pages on demand. Pages can be non-contiguous in GPU memory.
Benefits:
- Near-zero memory waste (no pre-allocation padding)
- Sequences can share KV cache pages (e.g., common system prompts)
- Memory utilization goes from ~50% to ~95%
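A minimal sketch of the page-table idea (the page size is illustrative, and real vLLM additionally reference-counts pages so sequences can share them):

```python
PAGE_SIZE = 16  # tokens per KV page (vLLM calls these blocks)

class PagedKVCache:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # pool of free physical pages
        self.tables = {}                    # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        # Allocate a physical page only when a sequence crosses a page
        # boundary; pages need not be contiguous in GPU memory.
        table = self.tables.setdefault(seq_id, [])
        if pos % PAGE_SIZE == 0:
            table.append(self.free.pop())
        return table[pos // PAGE_SIZE], pos % PAGE_SIZE  # (page, offset)

    def release(self, seq_id):
        # Finished sequence: recycle its pages immediately.
        self.free.extend(self.tables.pop(seq_id))

cache = PagedKVCache(num_pages=64)
for pos in range(20):
    page, offset = cache.append_token("seq-1", pos)
print(len(cache.tables["seq-1"]))  # 2 pages for 20 tokens, no max-length padding
```

Contrast this with contiguous pre-allocation, which would reserve the sequence's maximum length up front and waste everything past the actual length.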
KV Cache Quantization#
Compress the KV cache to FP8 or INT8, halving memory usage with minimal quality loss. Supported in vLLM and TensorRT-LLM.
Prefix Caching#
If many requests share the same system prompt, cache the KV values for that prefix. New requests skip the prefill for the shared prefix entirely. vLLM calls this "automatic prefix caching."
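In sketch form (the cache key and bookkeeping are simplified here; a real engine caches KV pages keyed by hashes of token blocks):

```python
prefix_cache = {}  # prefix token ids -> stand-in for cached KV pages

def prefill_work(tokens, prefix_len):
    # Return how many tokens actually need a forward pass.
    key = tuple(tokens[:prefix_len])
    if key in prefix_cache:
        return len(tokens) - prefix_len  # prefix KV is reused, skip it
    prefix_cache[key] = "cached-kv"      # first request pays the full cost
    return len(tokens)

system = list(range(100))                        # a 100-token shared system prompt
first = prefill_work(system + [200, 201], 100)   # full prefill
second = prefill_work(system + [300, 301], 100)  # prefix already cached
print(first, second)  # prints: 102 2
```

For chat workloads with long system prompts, this can eliminate the bulk of prefill compute.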
Quantization#
Full-precision models (FP16/BF16) use 2 bytes per parameter. A 70B model needs 140 GB — two A100-80GB GPUs minimum. Quantization shrinks the model.
GPTQ (Post-Training Quantization)#
Quantizes weights to 4-bit or 3-bit using a calibration dataset. Weights are stored as integers and dequantized during computation.
- 70B model at 4-bit: ~35 GB (fits on one A100-80GB)
- Minimal quality loss for 4-bit; noticeable at 3-bit
- One-time quantization cost using a calibration set
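The storage scheme can be sketched with plain round-to-nearest quantization. This omits GPTQ's key contribution, calibration-driven error compensation, but the 4-bit-integers-plus-per-group-scale layout is the same idea:

```python
def quantize_group(weights, bits=4):
    # One FP scale per group; weights stored as signed ints in
    # [-2^(bits-1), 2^(bits-1) - 1].
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Done on the fly inside the matmul kernel at inference time.
    return [v * scale for v in q]

group = [0.12, -0.43, 0.05, 0.31]
q, scale = quantize_group(group)
restored = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(group, restored))
print(q, err <= scale / 2)  # 4-bit ints; error bounded by half a quantization step
```

Real implementations use group sizes of 64-128 weights so the per-group scale overhead stays small.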
AWQ (Activation-Aware Weight Quantization)#
Identifies the ~1% of "salient" weight channels that matter most for accuracy and protects them by scaling them up before quantization; the remaining weights are quantized normally. This avoids hardware-unfriendly mixed-precision storage.
- Often better quality than GPTQ at the same bit-width
- Fast inference via optimized low-bit kernels
- Supported by vLLM and TGI
GGUF (llama.cpp)#
A file format for quantized models used by llama.cpp and Ollama. Supports 2-bit through 8-bit quantization with various schemes (Q4_K_M, Q5_K_S, etc.).
- Runs on CPU, Apple Silicon (Metal), and NVIDIA GPUs
- Great for local and edge deployment
- Active community quantizing every new model within hours of release
FP8#
NVIDIA H100 and newer GPUs natively support FP8 arithmetic. FP8 quantization halves memory vs. FP16 with almost no quality loss because the hardware handles the reduced precision natively.
Speculative Decoding#
Autoregressive decoding is slow because each token depends on the previous one. Speculative decoding parallelizes this.
How it works:
- A small draft model (e.g., 1B params) generates K candidate tokens quickly
- The large target model verifies all K tokens in a single forward pass (parallelizable)
- If the target model agrees, all K tokens are accepted
- If it disagrees at position i, tokens after i are discarded and regenerated
Speedup: 2-3x faster decoding with no quality loss (the target model always verifies). Works best when the draft model has high agreement with the target.
vLLM, TGI, and TensorRT-LLM all support speculative decoding.
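The accept/reject loop can be simulated with two toy predictors. This is the greedy variant; the sampling variant uses rejection sampling so the output distribution matches the target exactly:

```python
def speculative_step(draft_next, target_next, context, k=4):
    # 1. Draft proposes k tokens autoregressively (cheap).
    ctx, proposed = list(context), []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. Target checks every position; in a real engine this is one
    #    parallel forward pass, not k sequential ones.
    accepted, ctx = [], list(context)
    for tok in proposed:
        verified = target_next(ctx)
        accepted.append(verified)        # the target's token is always kept
        if verified != tok:
            break                        # disagreement: discard the rest
        ctx.append(tok)
    return accepted

# Toy models: the target continues the integer sequence; the draft
# agrees with it except at one position.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: 99 if len(ctx) == 3 else ctx[-1] + 1
print(speculative_step(draft, target, [1]))  # [2, 3, 4]: 3 tokens per target pass
```

Each target pass yields as many tokens as the draft got right, plus the target's own correction, which is where the 2-3x speedup comes from.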
Streaming Responses#
Users expect to see tokens appear as they are generated, not wait for the full response.
Implementation:
- Server sends tokens via Server-Sent Events (SSE) over HTTP
- Each SSE event contains one or more new tokens
- Client renders tokens incrementally
Architecture considerations:
- Load balancers must support long-lived HTTP connections
- Token-level streaming means the connection stays open for the full generation time
- vLLM and TGI expose OpenAI-compatible streaming endpoints out of the box
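The wire format itself is simple. A sketch of OpenAI-style SSE framing (payload fields abbreviated; real endpoints add ids, timestamps, and finish reasons):

```python
import json

def sse_events(token_stream, model="my-model"):
    # One SSE event per generated token: "data: <json>\n\n" frames,
    # terminated by the OpenAI-style "[DONE]" sentinel.
    for tok in token_stream:
        chunk = {"model": model, "choices": [{"delta": {"content": tok}}]}
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

events = list(sse_events(["Hel", "lo"]))
print(events[0], end="")  # first frame carries the "Hel" delta
```

The blank line after each `data:` payload is what delimits SSE events, so proxies that buffer or strip newlines will silently break streaming.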
GPU Optimization#
Tensor Parallelism#
Split model layers across multiple GPUs on the same node. Each GPU holds a slice of every layer. Requires high-bandwidth interconnect (NVLink).
- 70B model on 4x A100-80GB: each GPU holds ~17.5B parameters
- Linear speedup for prefill; moderate speedup for decode
Pipeline Parallelism#
Split model layers sequentially — GPU 0 runs layers 0-15, GPU 1 runs layers 16-31. Simpler than tensor parallelism but introduces pipeline bubbles.
Flash Attention#
Rewrites the attention computation to minimize reads and writes to GPU high-bandwidth memory (attention is IO-bound: limited by memory bandwidth, not FLOPs). Flash Attention 2 is standard in all modern serving frameworks; Flash Attention 3 targets H100 hardware.
CUDA Graphs#
Capture the GPU kernel launch sequence for a fixed batch size and replay it without CPU overhead. Reduces per-token latency by eliminating kernel launch overhead.
Cost Per Token#
LLM inference cost depends on:
| Factor | Impact |
|---|---|
| Model size | Larger models need more GPUs |
| Quantization | 4-bit cuts GPU count by ~4x vs FP16 |
| Batch size | Higher batch = lower cost per token |
| Sequence length | Longer context = more KV cache memory |
| GPU choice | H100 is ~2x faster than A100, but ~2x the price |
Rough cost benchmarks (cloud GPU, 2026):
- GPT-4 class (API): $2-10 per 1M output tokens
- Llama 3.1 70B (self-hosted, 4-bit, A100): ~$0.50 per 1M output tokens
- Llama 3.1 8B (self-hosted, FP16, L4): ~$0.10 per 1M output tokens
Self-hosting breaks even at roughly 1-10M tokens per day, depending on model size and GPU cost.
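The underlying math is one line. All numbers below are illustrative placeholders, not benchmarks; plug in your own GPU price and measured throughput:

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second, num_gpus=1):
    # tokens_per_second is aggregate throughput across the whole
    # continuous batch, not per request.
    tokens_per_hour = tokens_per_second * 3600
    return num_gpus * gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical: one A100 at $2/hr sustaining 1,000 output tokens/sec.
print(round(cost_per_million_tokens(2.0, 1000), 2))  # prints: 0.56
```

Doubling the sustained batch throughput halves the cost per token, which is why continuous batching dominates every other optimization on this list.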
Production Checklist#
- Choose a serving framework — vLLM for throughput, TGI for HuggingFace ecosystem, Triton for enterprise
- Quantize — AWQ or GPTQ to 4-bit unless quality demands FP16
- Enable continuous batching — default in vLLM and TGI
- Set max sequence length — limit KV cache memory usage
- Enable prefix caching — if requests share system prompts
- Add autoscaling — scale GPU instances based on queue depth
- Monitor — track tokens/second, time-to-first-token, queue wait time, GPU utilization
- Set up streaming — SSE endpoints for responsive UX
Start Building#
LLM inference is where AI meets systems engineering. The difference between a $50K/month GPU bill and a $5K/month bill is in the architecture: quantization, continuous batching, KV cache management, and smart GPU allocation.
Design your LLM serving architecture on codelit.io — describe your system, get an interactive architecture diagram, export as code.