How OpenAI Serves 200M Users — The Architecture Nobody Talks About
The API call that changes everything
When you call openai.chat.completions.create(), you're triggering one of the most complex inference pipelines ever built. Behind that simple API call is an infrastructure that handles 200 million weekly users, billions of tokens per day, and GPU clusters worth hundreds of millions of dollars.
Let's explore how it actually works.
The request journey
Your API call doesn't go straight to a GPU. It passes through at least seven services before a single token is generated. Let's trace the journey.
Take the rate limiter. It's more complex than you'd think: it isn't just a counter. It tracks usage across multiple dimensions at once (per-key, per-organization, per-model, per-tier), and it has to do this at millions of requests per minute without becoming a bottleneck itself.
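A minimal sketch of such a limiter, assuming a token bucket per dimension. The dimension names, capacities, and refill rates here are illustrative, not OpenAI's actual tiers:

```python
import time
from collections import defaultdict

class MultiDimensionalRateLimiter:
    """Token-bucket limiter that enforces independent limits per dimension."""

    def __init__(self, limits):
        # limits: {dimension_name: (bucket_capacity, refill_per_second)}
        self.limits = limits
        self.buckets = defaultdict(dict)  # dimension -> key -> (tokens, last_refill)

    def allow(self, dims, now=None):
        """dims maps dimension names to keys, e.g. {"api_key": "sk-...", "model": "gpt-4"}.
        The request passes only if every dimension has a token to spend."""
        now = time.monotonic() if now is None else now
        # Check all buckets before spending, so a failure in one dimension
        # doesn't consume tokens from the dimensions that would have passed.
        refilled = {}
        for dim, key in dims.items():
            capacity, rate = self.limits[dim]
            tokens, last = self.buckets[dim].get(key, (capacity, now))
            tokens = min(capacity, tokens + (now - last) * rate)
            if tokens < 1:
                return False
            refilled[dim] = (key, tokens)
        for dim, (key, tokens) in refilled.items():
            self.buckets[dim][key] = (tokens - 1, now)
        return True
```

The two-phase check-then-spend matters: a request rejected on one dimension should not burn quota on the others.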
GPU scheduling: the hard part
The actual hard problem isn't the AI — it's scheduling. OpenAI runs thousands of GPUs, and each model needs a different number of them. GPT-4 might need 8 GPUs per inference, while GPT-3.5 needs 1. How do you efficiently pack requests onto hardware?
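One way to picture the packing problem is first-fit-decreasing bin packing. This toy scheduler (the request IDs, cluster names, and GPU counts are made up for illustration) places each request's GPU demand onto the first cluster that still has room:

```python
def schedule(requests, clusters):
    """First-fit decreasing: place each request's GPU demand onto the
    first cluster with enough free GPUs.

    requests: {request_id: gpus_needed}
    clusters: {cluster_name: total_free_gpus}
    """
    placements = {}
    free = dict(clusters)
    # Placing the largest demands first reduces fragmentation.
    for req_id, gpus_needed in sorted(requests.items(), key=lambda kv: -kv[1]):
        for cluster, free_gpus in free.items():
            if free_gpus >= gpus_needed:
                placements[req_id] = cluster
                free[cluster] -= gpus_needed
                break
        else:
            placements[req_id] = None  # no capacity: queue until GPUs free up
    return placements
```

Real schedulers also weigh model residency (which weights are already loaded where), latency targets, and preemption, but the core shape is the same: a bin-packing heuristic running continuously.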
The KV Cache Manager is the secret weapon. When you're in a conversation and send a follow-up message, OpenAI doesn't reprocess the entire conversation. It caches the internal model state (key-value pairs) from your previous messages. This is why follow-ups are faster and cheaper than first messages.
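A rough sketch of that prefix reuse, with a plain list of tokens standing in for the real key-value tensors. The class and method names are hypothetical:

```python
import hashlib

class KVCache:
    """Caches per-conversation model state keyed by the token prefix,
    so a follow-up message only pays for its new tokens."""

    def __init__(self):
        self.store = {}  # prefix hash -> cached state

    @staticmethod
    def _key(tokens):
        return hashlib.sha256("\x00".join(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return (tokens_actually_computed, state). Walk back from the full
        sequence to find the longest cached prefix, then compute the rest."""
        for cut in range(len(tokens), 0, -1):
            state = self.store.get(self._key(tokens[:cut]))
            if state is not None:
                computed = tokens[cut:]
                break
        else:
            state, computed = [], tokens
        state = state + computed  # stand-in for appending new KV pairs
        self.store[self._key(tokens)] = state
        return len(computed), state
```

On the first turn everything is computed; on the follow-up, only the new tokens are, which is exactly why follow-ups are faster and cheaper.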
Streaming: why tokens trickle in
Notice how ChatGPT shows tokens one by one? That's not just a UI trick — it's an architectural decision that fundamentally changes the infrastructure.
Without streaming, the server holds the entire response in memory, processes all tokens, then sends the complete response. With 100,000 concurrent users, that's 100,000 responses in memory.
With streaming, each token is sent immediately via Server-Sent Events (SSE). The server memory per request drops dramatically because you're only buffering one token at a time.
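A minimal SSE emitter might look like this. The `delta` field name is illustrative, though the `data:` framing and the `[DONE]` sentinel match the wire format OpenAI's streaming API uses:

```python
import json

def sse_stream(token_iterator):
    """Yield each token as a Server-Sent Event the moment it is generated,
    so the server buffers one token at a time instead of the whole response."""
    for token in token_iterator:
        # Each SSE message is a "data:" line terminated by a blank line.
        yield f"data: {json.dumps({'delta': token})}\n\n"
    yield "data: [DONE]\n\n"  # sentinel telling the client the stream is over
```

Any ASGI/WSGI framework can serve this generator with `Content-Type: text/event-stream`; the key property is that memory per request is one event, not one response.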
The billing system nobody appreciates
Every token in and every token out has to be counted and priced: differently per model, per tier, and per feature (vision costs more than text). OpenAI processes billions of billing events per day.
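In miniature, metering is a fold over usage events. The price table and event fields below are placeholders, not real rates:

```python
from collections import defaultdict

# Hypothetical per-1K-token prices; real prices vary by model, tier, and feature.
PRICES = {
    ("gpt-4", "input"): 0.03, ("gpt-4", "output"): 0.06,
    ("gpt-3.5", "input"): 0.0005, ("gpt-3.5", "output"): 0.0015,
}

def aggregate_usage(events):
    """Fold a stream of per-request usage events into a bill per organization.
    Each event: {"org": ..., "model": ..., "input_tokens": n, "output_tokens": m}."""
    bills = defaultdict(float)
    for e in events:
        bills[e["org"]] += (e["input_tokens"] / 1000) * PRICES[(e["model"], "input")]
        bills[e["org"]] += (e["output_tokens"] / 1000) * PRICES[(e["model"], "output")]
    return dict(bills)
```

At real scale this fold runs over an event pipeline rather than a list, but the invariant is the same: every generated token maps to exactly one billing event.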
What this teaches us
OpenAI's infrastructure isn't magic. It's:
- Request routing — getting the right request to the right GPU
- Caching — KV caches for conversation continuity
- Streaming — SSE for real-time token delivery
- Scheduling — bin-packing requests onto GPU clusters
- Metering — counting everything for billing
Every AI company building on top of LLMs needs some version of this. The scale is different, but the patterns are the same.