How OpenAI Serves 200M Users — The Architecture Nobody Talks About
The API call that changes everything
When you call openai.chat.completions.create(), you're triggering one of the most complex inference pipelines ever built. Behind that simple API call is an infrastructure that handles 200 million weekly users, billions of tokens per day, and GPU clusters worth hundreds of millions of dollars.
Let's explore how it actually works.
The request journey
Your API call doesn't go straight to a GPU. It passes through at least seven services before a single token is generated. Let's trace the journey.
Take the rate limiter. It's more complex than you'd think: it isn't just a counter. It tracks usage across multiple dimensions at once (per-key, per-organization, per-model, per-tier), and it has to do this at millions of requests per minute without becoming a bottleneck itself.
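A minimal sketch of such a limiter, assuming a token bucket per dimension. The dimension names, capacities, and refill rates here are illustrative, not OpenAI's actual tiers:

```python
import time
from collections import defaultdict

class MultiDimensionalRateLimiter:
    """Token-bucket limiter that enforces independent limits per dimension."""

    def __init__(self, limits):
        # limits: {dimension_name: (bucket_capacity, refill_per_second)}
        self.limits = limits
        self.buckets = defaultdict(dict)  # dimension -> key -> (tokens, last_refill)

    def allow(self, dims, now=None):
        """dims maps dimension names to keys, e.g. {"api_key": "sk-...", "model": "gpt-4"}.
        The request passes only if every dimension has a token to spend."""
        now = time.monotonic() if now is None else now
        # Check all buckets before spending, so a failure in one dimension
        # doesn't consume tokens from the dimensions that would have passed.
        refilled = {}
        for dim, key in dims.items():
            capacity, rate = self.limits[dim]
            tokens, last = self.buckets[dim].get(key, (capacity, now))
            tokens = min(capacity, tokens + (now - last) * rate)
            if tokens < 1:
                return False
            refilled[dim] = (key, tokens)
        for dim, (key, tokens) in refilled.items():
            self.buckets[dim][key] = (tokens - 1, now)
        return True
```

The two-phase check-then-spend matters: a request rejected on one dimension should not burn quota on the others.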
GPU scheduling: the hard part
The actual hard problem isn't the AI — it's scheduling. OpenAI runs thousands of GPUs, and each model needs a different number of them. GPT-4 might need 8 GPUs per inference, while GPT-3.5 needs 1. How do you efficiently pack requests onto hardware?
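One way to picture the packing problem is first-fit-decreasing bin packing. This toy scheduler (the request IDs, cluster names, and GPU counts are made up for illustration) places each request's GPU demand onto the first cluster that still has room:

```python
def schedule(requests, clusters):
    """First-fit decreasing: place each request's GPU demand onto the
    first cluster with enough free GPUs.

    requests: {request_id: gpus_needed}
    clusters: {cluster_name: total_free_gpus}
    """
    placements = {}
    free = dict(clusters)
    # Placing the largest demands first reduces fragmentation.
    for req_id, gpus_needed in sorted(requests.items(), key=lambda kv: -kv[1]):
        for cluster, free_gpus in free.items():
            if free_gpus >= gpus_needed:
                placements[req_id] = cluster
                free[cluster] -= gpus_needed
                break
        else:
            placements[req_id] = None  # no capacity: queue until GPUs free up
    return placements
```

Real schedulers also weigh model residency (which weights are already loaded where), latency targets, and preemption, but the core shape is the same: a bin-packing heuristic running continuously.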
The KV Cache Manager is the secret weapon. When you're in a conversation and send a follow-up message, OpenAI doesn't reprocess the entire conversation. It caches the internal model state (key-value pairs) from your previous messages. This is why follow-ups are faster and cheaper than first messages.
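A rough sketch of that prefix reuse, with a plain list of tokens standing in for the real key-value tensors. The class and method names are hypothetical:

```python
import hashlib

class KVCache:
    """Caches per-conversation model state keyed by the token prefix,
    so a follow-up message only pays for its new tokens."""

    def __init__(self):
        self.store = {}  # prefix hash -> cached state

    @staticmethod
    def _key(tokens):
        return hashlib.sha256("\x00".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return (tokens_actually_computed, state). Walk back from the full
        sequence to find the longest cached prefix, then compute the rest."""
        for cut in range(len(tokens), 0, -1):
            state = self.store.get(self._key(tokens[:cut]))
            if state is not None:
                computed = tokens[cut:]
                break
        else:
            state, computed = [], tokens
        state = state + computed  # stand-in for appending new KV pairs
        self.store[self._key(tokens)] = state
        return len(computed), state
```

On the first turn everything is computed; on the follow-up, only the new tokens are, which is exactly why follow-ups are faster and cheaper.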
Streaming: why tokens trickle in
Notice how ChatGPT shows tokens one by one? That's not just a UI trick — it's an architectural decision that fundamentally changes the infrastructure.
Without streaming, the server holds the entire response in memory, processes all tokens, then sends the complete response. With 100,000 concurrent users, that's 100,000 responses in memory.
With streaming, each token is sent immediately via Server-Sent Events (SSE). The server memory per request drops dramatically because you're only buffering one token at a time.
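A minimal SSE emitter might look like this. The `delta` field name is illustrative, though the `data:` framing and the `[DONE]` sentinel match the wire format OpenAI's streaming API uses:

```python
import json

def sse_stream(token_iterator):
    """Yield each token as a Server-Sent Event the moment it is generated,
    so the server buffers one token at a time instead of the whole response."""
    for token in token_iterator:
        # Each SSE message is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps({'delta': token})}\n\n"
    yield "data: [DONE]\n\n"  # sentinel telling the client the stream is over
```

Any ASGI/WSGI framework can serve this generator with `Content-Type: text/event-stream`; the key property is that memory per request is one event, not one response.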
The billing system nobody appreciates
Every token in and every token out has to be counted and priced: differently per model, per tier, and per feature (vision costs more than text). OpenAI processes billions of billing events per day.
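In miniature, metering is a fold over usage events. The price table and event fields below are placeholders, not real rates:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices; real prices vary by model, tier, and feature.
PRICES = {
    ("gpt-4", "input"): 0.03, ("gpt-4", "output"): 0.06,
    ("gpt-3.5", "input"): 0.0005, ("gpt-3.5", "output"): 0.0015,
}

def aggregate_usage(events):
    """Fold a stream of per-request usage events into a bill per organization.
    Each event: {"org": ..., "model": ..., "input_tokens": n, "output_tokens": m}."""
    bills = defaultdict(float)
    for e in events:
        bills[e["org"]] += (e["input_tokens"] / 1000) * PRICES[(e["model"], "input")]
        bills[e["org"]] += (e["output_tokens"] / 1000) * PRICES[(e["model"], "output")]
    return dict(bills)
```

At real scale this fold runs over an event pipeline rather than a list, but the invariant is the same: every generated token maps to exactly one billing event.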
What this teaches us
OpenAI's infrastructure isn't magic. It's:
- Request routing — getting the right request to the right GPU
- Caching — KV caches for conversation continuity
- Streaming — SSE for real-time token delivery
- Scheduling — bin-packing requests onto GPU clusters
- Metering — counting everything for billing
Every AI company building on top of LLMs needs some version of this. The scale is different, but the patterns are the same.