GraphQL Subscriptions — Real-Time Data with WebSockets, PubSub, and Scaling
The third operation type#
GraphQL has three operation types: queries (read), mutations (write), and subscriptions (real-time). While queries and mutations follow request-response patterns, subscriptions maintain a persistent connection and push data to the client as events occur.
subscription OnMessageSent($channelId: ID!) {
messageSent(channelId: $channelId) {
id
text
sender {
name
avatar
}
createdAt
}
}
When a client subscribes, the server keeps the connection open. Every time a new message is sent in that channel, the server pushes the data — no polling required.
WebSocket transport — the underlying protocol#
GraphQL subscriptions typically run over WebSockets using the graphql-ws protocol (successor to the older subscriptions-transport-ws).
Connection lifecycle#
- Client opens WebSocket to
wss://api.example.com/graphql - Connection init — client sends
connection_initwith auth payload - Server acknowledges with
connection_ack - Client subscribes — sends
subscribemessage with query and variables - Server streams
nextmessages as events occur - Client or server sends
completeto end the subscription
import { createClient } from 'graphql-ws';
const client = createClient({
url: 'wss://api.example.com/graphql',
connectionParams: {
authToken: 'bearer-token-here',
},
});
const unsubscribe = client.subscribe(
{
query: `subscription ($channelId: ID!) {
messageSent(channelId: $channelId) {
id
text
sender { name }
}
}`,
variables: { channelId: 'general' },
},
{
next: (data) => console.log('New message:', data),
error: (err) => console.error('Subscription error:', err),
complete: () => console.log('Subscription ended'),
}
);
Why not Server-Sent Events?#
SSE is simpler and works over HTTP, but it's unidirectional — the client cannot send messages back over the same connection. GraphQL subscriptions sometimes need bidirectional communication for connection init and subscription management. Some libraries like graphql-sse do support SSE transport, but WebSocket remains the standard.
Subscription resolvers — server-side wiring#
A subscription resolver differs from query/mutation resolvers. It uses a subscribe function that returns an AsyncIterator.
// Apollo Server / graphql-yoga style
const resolvers = {
Subscription: {
messageSent: {
subscribe: (_, { channelId }, { pubsub }) => {
return pubsub.asyncIterableIterator(`MESSAGE_SENT_${channelId}`);
},
},
},
Mutation: {
sendMessage: async (_, { channelId, text, senderId }, { pubsub, db }) => {
const message = await db.messages.create({
channelId,
text,
senderId,
createdAt: new Date().toISOString(),
});
await pubsub.publish(`MESSAGE_SENT_${channelId}`, {
messageSent: message,
});
return message;
},
},
};
The flow: a mutation publishes an event to a topic. The subscription resolver listens on that topic and forwards events to connected clients.
PubSub engines — in-memory to distributed#
The PubSub engine is the backbone of subscription delivery. It determines how events flow from publishers to subscribers.
In-memory PubSub#
import { PubSub } from 'graphql-subscriptions';
const pubsub = new PubSub();
Simple and fast. Events are published and consumed within a single process. This breaks the moment you have more than one server instance. A client connected to server A won't receive events published on server B.
Use only for development and prototyping.
Redis PubSub#
import { RedisPubSub } from 'graphql-redis-subscriptions';
import Redis from 'ioredis';
const pubsub = new RedisPubSub({
publisher: new Redis({ host: 'redis.internal', port: 6379 }),
subscriber: new Redis({ host: 'redis.internal', port: 6379 }),
});
Redis PubSub is fire-and-forget — messages are delivered to currently connected subscribers and not persisted. If a server goes down and reconnects, it misses events published during the outage.
Best for: chat messages, typing indicators, presence updates — events where missing one is acceptable.
Kafka PubSub#
import { KafkaPubSub } from 'graphql-kafka-subscriptions';
const pubsub = new KafkaPubSub({
topic: 'graphql-events',
host: 'kafka-broker-1',
port: '9092',
groupId: 'graphql-server-group',
});
Kafka persists messages to a log. Consumers can replay events, handle backpressure, and recover from downtime without data loss.
Best for: financial updates, order status changes, audit-critical events — anything where missing an event is unacceptable.
Comparison#
| Engine | Persistence | Latency | Scalability | Use case |
|---|---|---|---|---|
| In-memory | None | Sub-ms | Single instance | Development |
| Redis | None (fire-and-forget) | 1-2ms | Multi-instance | Chat, presence |
| Kafka | Durable log | 5-20ms | Massive scale | Financial, audit |
| NATS | Optional | 1-5ms | Multi-instance | IoT, microservices |
Scaling subscriptions — the hard problem#
Subscriptions are stateful. Each client holds an open WebSocket connection pinned to a specific server. This makes scaling fundamentally different from stateless HTTP.
Challenge 1: sticky sessions#
A load balancer must route WebSocket upgrade requests consistently. Once a client connects to server A, all messages for that subscription flow through server A.
Use IP hash or cookie-based affinity at the load balancer:
upstream graphql {
ip_hash;
server graphql-1:4000;
server graphql-2:4000;
server graphql-3:4000;
}
Challenge 2: connection limits#
Each WebSocket holds a file descriptor. A single server can typically handle 10,000-50,000 concurrent connections depending on memory and CPU.
Monitor:
- Open file descriptors per process
- Memory per connection (typically 5-50KB)
- CPU for heartbeat/keepalive processing
Challenge 3: fan-out#
When a message is sent to a channel with 10,000 subscribers, the server must iterate and push to each connection. This is CPU-bound.
Mitigations:
- Shard channels across servers — channel A lives on server 1, channel B on server 2
- Use a dedicated subscription gateway (separate from query/mutation servers)
- Batch event delivery with microbatching (collect events over 50ms, send once)
Architecture for scale#
Client --> Load Balancer (sticky) --> Subscription Gateway
|
Redis/Kafka PubSub
|
API Servers (mutations)
Separate your subscription servers from your query/mutation servers. Subscription servers are long-lived and stateful. API servers are stateless and horizontally scalable. Don't let a spike in subscriptions affect your API latency.
Authentication and authorization#
Connection-level auth#
Authenticate during the WebSocket handshake. Reject unauthorized connections before they consume resources.
import { useServer } from 'graphql-ws/lib/use/ws';
useServer(
{
context: async (ctx) => {
const token = ctx.connectionParams?.authToken;
if (!token) throw new Error('Missing auth token');
const user = await verifyToken(token);
if (!user) throw new Error('Invalid token');
return { user, pubsub };
},
subscribe: async (ctx, msg) => {
// Per-subscription authorization
const { user } = ctx;
const { channelId } = msg.payload.variables;
const hasAccess = await checkChannelAccess(user.id, channelId);
if (!hasAccess) throw new Error('Forbidden');
// Proceed with subscription
},
},
wsServer
);
Token expiration#
WebSocket connections are long-lived. Tokens can expire while the connection is open. Handle this by:
- Periodic re-auth — client sends a fresh token via a custom message
- Connection TTL — server closes connections after a maximum duration (e.g., 24 hours)
- Token refresh on reconnect — when the connection drops, the client reconnects with a fresh token
Filtering and per-user subscriptions#
Not every event should go to every subscriber. Filter at the resolver level.
messageSent: {
subscribe: withFilter(
(_, { channelId }, { pubsub }) =>
pubsub.asyncIterableIterator(`MESSAGE_SENT_${channelId}`),
(payload, variables, context) => {
// Only deliver if user has access to this channel
return context.user.channels.includes(variables.channelId);
}
),
},
The withFilter higher-order function evaluates each event before delivering it. Events that don't pass the filter are silently dropped for that subscriber.
Error handling and reconnection#
Server-side#
- Catch errors in subscription resolvers and send
errormessages instead of crashing - Implement heartbeat/ping-pong to detect dead connections
- Clean up resources (unsubscribe from PubSub topics) when connections close
Client-side#
The graphql-ws client handles reconnection automatically with exponential backoff:
const client = createClient({
url: 'wss://api.example.com/graphql',
retryAttempts: 10,
retryWait: async (retries) => {
await new Promise((resolve) =>
setTimeout(resolve, Math.min(1000 * 2 ** retries, 30000))
);
},
});
After reconnection, the client re-establishes all active subscriptions. Events published during the disconnect are lost (unless using Kafka or a similar durable PubSub).
Visualize your real-time architecture#
Map out your subscription flow, PubSub engines, and WebSocket gateways — try Codelit to generate interactive architecture diagrams.
Key takeaways#
- Subscriptions are the third GraphQL operation — persistent connections that push data as events occur
- WebSocket via graphql-ws is the standard transport — SSE is an alternative for simpler cases
- PubSub engine choice matters — Redis for speed, Kafka for durability, in-memory only for dev
- Scaling is stateful — use sticky sessions, monitor connection limits, and separate subscription servers
- Authenticate at connection init — reject unauthorized WebSockets before they consume resources
- Filter per-user with
withFilterto prevent data leakage across subscribers - Handle reconnection with exponential backoff and automatic re-subscription
Article #430 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.
Try it on Codelit
Chaos Mode
Simulate node failures and watch cascading impact across your architecture
Cost Estimator
See estimated AWS monthly costs for every component in your architecture
Related articles
Try these templates
Uber Real-Time Location System
Handles 5M+ GPS pings per second using H3 hexagonal geospatial indexing.
6 componentsReal-Time Collaborative Editor
Notion-like document editor with real-time collaboration, conflict resolution, and rich media.
9 componentsReal-Time Analytics Dashboard
Live analytics platform with event ingestion, stream processing, and interactive dashboards.
8 componentsBuild this architecture
Generate an interactive architecture for GraphQL Subscriptions in seconds.
Try it in Codelit →
Comments