Data Serialization Formats: JSON vs Protobuf vs Avro vs MessagePack vs CBOR
Every time two systems exchange data, they agree on a serialization format. The choice affects message size, parsing speed, schema evolution safety, and developer ergonomics. Picking the wrong format creates pain that compounds as the system grows.
The Format Landscape#
Serialization formats fall into two broad categories:
Text-based — Human-readable, self-describing, larger on the wire.
- JSON, XML, YAML
Binary — Compact, fast to parse, usually schema-driven.
- Protocol Buffers (Protobuf), Avro, MessagePack, CBOR, FlatBuffers, Cap'n Proto
This guide focuses on the five most common choices for service-to-service and storage use cases.
JSON#
JSON is the lingua franca of web APIs. Every language has a parser. Every developer can read it.
{
"id": 42,
"name": "Alice",
"email": "alice@example.com",
"roles": ["admin", "editor"]
}
Strengths:
- Universal tooling and browser support
- Self-describing — no schema required to read
- Easy to debug with `curl` and `jq`
Weaknesses:
- Verbose — field names repeated in every message
- No native binary, date, or integer-size types
- Parsing is CPU-intensive compared to binary formats
- No built-in schema evolution guarantees
JSON is the right default for public APIs, configuration files, and low-throughput internal services where debuggability matters more than efficiency.
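One way to quantify the verbosity cost is to measure how much of a compact JSON payload is spent repeating field names. A small sketch using only the standard library (the record shape and counts are illustrative, not from the original article):

```python
import json

# 100 records with the same three fields
users = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com"}
    for i in range(100)
]

# Compact encoding: no whitespace between tokens
compact = json.dumps(users, separators=(",", ":"))

# Bytes spent on the keys '"id":', '"name":', '"email":' in every record
name_overhead = sum(len(f'"{k}":') for k in ("id", "name", "email")) * len(users)

print(len(compact), name_overhead, round(name_overhead / len(compact), 2))
```

Roughly a third of the payload here is field names, which is exactly the overhead that schema-driven formats eliminate.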
Protocol Buffers (Protobuf)#
Protobuf is Google's binary serialization format. It uses a schema (.proto file) to generate strongly typed code in many languages.
syntax = "proto3";
message User {
int32 id = 1;
string name = 2;
string email = 3;
repeated string roles = 4;
}
The encoded message contains only field numbers and values — no field names. A User with the fields above serializes to roughly 30 bytes versus 90 bytes in JSON.
Schema evolution rules:
- Add new fields with new field numbers — old readers ignore unknown fields (forward compatible).
- Never reuse a field number — use `reserved` to prevent accidental reuse.
- Removing a field is safe if its number is reserved.
- Changing a field type is restricted to compatible type pairs (e.g., `int32` to `int64`).
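The wire format behind these rules can be sketched by hand. The snippet below is a simplified encoder covering only varint and string fields (no packed repeated fields, no zigzag encoding for signed types); the function names are illustrative, not part of any Protobuf library:

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def tag(field_number: int, wire_type: int) -> bytes:
    """A field's key on the wire is (field_number << 3) | wire_type."""
    return varint((field_number << 3) | wire_type)

def encode_int(field_number: int, n: int) -> bytes:
    return tag(field_number, 0) + varint(n)  # wire type 0 = varint

def encode_string(field_number: int, s: str) -> bytes:
    data = s.encode("utf-8")
    return tag(field_number, 2) + varint(len(data)) + data  # wire type 2 = length-delimited

# message User { int32 id = 1; string name = 2; }
msg = encode_int(1, 42) + encode_string(2, "Alice")
print(msg.hex(" "))  # 08 2a 12 05 41 6c 69 63 65
```

Note that only the field number and value appear on the wire (9 bytes here), and that `int32` and `int64` share the same varint encoding, which is why that type change is compatible.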
Strengths:
- Small message size (3-10x smaller than JSON)
- Fast serialization and deserialization
- Strong schema evolution with field numbering
- Mature code generation for 10+ languages
Weaknesses:
- Not human-readable — requires `protoc` or the schema to decode
- Schema must be distributed to all consumers
- No self-describing wire format (without descriptors)
Protobuf is the standard for gRPC services and high-throughput internal communication.
Apache Avro#
Avro takes a different approach to schema management. The writer's schema is included with the data (or resolved from a schema registry), and the reader uses its own schema. Avro resolves differences between the two at deserialization time.
{
"type": "record",
"name": "User",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "email", "type": "string"},
{"name": "roles", "type": {"type": "array", "items": "string"}}
]
}
Schema evolution rules:
- Add a field with a default value — backward compatible.
- Remove a field that has a default — forward compatible.
- The schema registry enforces compatibility checks before a new schema version is registered.
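Avro's resolution step can be modeled in a few lines. This is a conceptual sketch of the default-filling behavior only, not the real Avro codec (which also handles type promotion, aliases, and binary decoding); the field dictionaries stand in for parsed schemas:

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Simplified model of Avro schema resolution: the reader keeps only
    the fields its own schema declares, filling gaps from reader-side
    defaults when the writer's schema never had the field."""
    result = {}
    for f in reader_fields:
        name = f["name"]
        if name in record:            # field present in the writer's data
            result[name] = record[name]
        elif "default" in f:          # field added by the reader's newer schema
            result[name] = f["default"]
        else:
            raise ValueError(f"no value and no default for {name!r}")
    return result

# Reader v2 added an "email" field with a default — backward compatible.
reader_v2 = [
    {"name": "id"},
    {"name": "name"},
    {"name": "email", "default": ""},
]

old_record = {"id": 42, "name": "Alice"}  # written under schema v1
print(resolve(old_record, reader_v2))
# {'id': 42, 'name': 'Alice', 'email': ''}
```

This is why the rules above insist on defaults: without one, an old record cannot be resolved against the new schema, and the registry rejects the change.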
Strengths:
- Schema resolution enables flexible evolution
- Compact encoding — no field tags in the payload
- Native support in Hadoop, Kafka, and Spark
- Schema registry integration (Confluent, AWS Glue)
Weaknesses:
- Requires the writer schema to decode — not self-describing on its own
- Smaller ecosystem outside the JVM and data engineering
- Dynamic languages lose some type safety
Avro is the dominant choice for event streaming (Kafka), data pipelines, and long-term storage where schema evolution must be managed centrally.
MessagePack#
MessagePack is a binary format that mirrors JSON's data model. It is self-describing — no schema needed.
# JSON (24 bytes)
{"id":42,"name":"Alice"}
# MessagePack (16 bytes)
82 a2 69 64 2a a4 6e 61 6d 65 a5 41 6c 69 63 65
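The bytes above can be reproduced by hand for a map this small. A minimal sketch covering only three MessagePack type codes — fixmap, fixstr, and positive fixint (the real format defines many more); the helper names are illustrative:

```python
def msgpack_str(s: str) -> bytes:
    data = s.encode("utf-8")
    assert len(data) < 32                 # fixstr covers lengths 0-31
    return bytes([0xA0 | len(data)]) + data

def msgpack_small_map(pairs: dict) -> bytes:
    assert len(pairs) < 16                # fixmap covers up to 15 entries
    out = bytearray([0x80 | len(pairs)])  # 0x82 for a 2-entry map
    for k, v in pairs.items():
        out += msgpack_str(k)
        if isinstance(v, int) and 0 <= v < 128:
            out.append(v)                 # positive fixint: the byte is the value
        else:
            out += msgpack_str(v)
    return bytes(out)

encoded = msgpack_small_map({"id": 42, "name": "Alice"})
print(encoded.hex(" "))
# 82 a2 69 64 2a a4 6e 61 6d 65 a5 41 6c 69 63 65
```

Every type code carries its own length, which is what makes the format self-describing — and why the field names still have to travel with every message.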
Strengths:
- Drop-in replacement for JSON with 20-40% size reduction
- Self-describing — decode without a schema
- Faster parsing than JSON
- Broad language support
Weaknesses:
- No schema enforcement or evolution guarantees
- Still includes field names (unlike Protobuf/Avro)
- Less compact than schema-driven formats
MessagePack fits where you want better performance than JSON without adopting a schema system — caching layers, session storage, and internal APIs with stable contracts.
CBOR#
Concise Binary Object Representation (CBOR, RFC 8949) is another self-describing binary format. It extends JSON's model with native support for binary data, dates, and tags.
Strengths:
- IETF standard with formal specification
- Native binary byte strings (no Base64 encoding needed)
- Extensible type system via semantic tags
- Deterministic encoding mode for canonical forms
- Required by protocols like COSE, CWT, and WebAuthn/FIDO2
Weaknesses:
- Smaller community than MessagePack
- Tooling is less mature in some ecosystems
- Tag semantics require agreement between producer and consumer
CBOR is the right choice when you need a standards-based binary format, especially for IoT, security tokens, and protocols that mandate it.
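For comparison with the MessagePack example, the same map can be encoded by hand in CBOR's major-type scheme. A sketch covering only small unsigned integers, text strings, and maps (RFC 8949 defines eight major types plus tags; the helper names are illustrative):

```python
def cbor_head(n: int, major: int) -> bytes:
    """Encode a CBOR head: 3-bit major type plus shortest-form argument."""
    if n < 24:
        return bytes([(major << 5) | n])          # argument fits in the head byte
    if n < 256:
        return bytes([(major << 5) | 24, n])      # one-byte argument follows
    raise NotImplementedError("2/4/8-byte argument forms omitted in this sketch")

def cbor_text(s: str) -> bytes:
    data = s.encode("utf-8")
    return cbor_head(len(data), major=3) + data   # major type 3 = text string

def cbor_map(pairs: dict) -> bytes:
    out = bytearray(cbor_head(len(pairs), major=5))  # major type 5 = map
    for k, v in pairs.items():
        out += cbor_text(k)
        out += cbor_head(v, major=0) if isinstance(v, int) else cbor_text(v)
    return bytes(out)

encoded = cbor_map({"id": 42, "name": "Alice"})
print(encoded.hex(" "))
# a2 62 69 64 18 2a 64 6e 61 6d 65 65 41 6c 69 63 65
```

The result is within a byte or two of MessagePack for this payload; the practical differences lie in CBOR's standardized tags, byte strings, and deterministic encoding mode rather than in size.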
Performance Comparison#
Benchmarks vary by message shape and language, but the relative order is consistent:
| Format | Encode | Decode | Size |
|---|---|---|---|
| JSON | 1.0x | 1.0x | 1.0x (baseline) |
| MessagePack | 1.5x | 1.8x | 0.7x |
| CBOR | 1.4x | 1.6x | 0.7x |
| Protobuf | 3.0x | 4.0x | 0.3x |
| Avro | 2.5x | 3.5x | 0.25x |
Higher multipliers = faster. Lower size = smaller. Values are approximate relative to JSON.
The schema-driven formats (Protobuf, Avro) win on both speed and size because they eliminate field names from the wire and enable optimized generated code.
Schema Evolution: Backward and Forward Compatibility#
Backward compatibility — New code can read data written by old code. Achieved by giving new fields default values.
Forward compatibility — Old code can read data written by new code. Achieved by ignoring unknown fields.
Full compatibility — Both backward and forward. Required when producers and consumers deploy independently.
| | Writer v1 | Writer v2 |
|---|---|---|
| Reader v1 | ✓ | Forward compat needed |
| Reader v2 | Backward compat needed | ✓ |
| Format | Backward | Forward | Mechanism |
|---|---|---|---|
| JSON | Manual | Manual | Ignore unknown keys by convention |
| Protobuf | Yes | Yes | Field numbers + unknown field preservation |
| Avro | Yes | Yes | Schema resolution + registry |
| MessagePack | Manual | Manual | No schema enforcement |
| CBOR | Manual | Manual | No schema enforcement |
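For the "manual" rows, compatibility lives in reader code rather than in the format. A sketch of the usual JSON convention — ignore unknown keys for forward compatibility, apply defaults for backward compatibility (the `DEFAULTS` table and `read_user` helper are illustrative):

```python
import json

DEFAULTS = {"id": 0, "name": "", "roles": []}   # reader-side defaults, by convention

def read_user(payload: str) -> dict:
    raw = json.loads(payload)
    # Forward compatibility: keep only the keys this version understands,
    # silently ignoring fields added by newer writers.
    # Backward compatibility: fall back to a default for fields this
    # version expects but an older writer never sent.
    return {k: raw.get(k, default) for k, default in DEFAULTS.items()}

# Newer writer added "email"; older writer never sent "roles".
print(read_user('{"id": 42, "name": "Alice", "email": "a@example.com"}'))
print(read_user('{"id": 7, "name": "Bob"}'))
```

The convention works, but nothing enforces it — which is the gap that Protobuf's field numbers and Avro's registry close mechanically.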
When to Use Which#
JSON — Public APIs, browser clients, config files, low-throughput services.
Protobuf — gRPC services, high-throughput internal APIs, mobile clients, any context where generated types improve safety.
Avro — Kafka event streams, data lakes, Hadoop/Spark pipelines, environments with a schema registry.
MessagePack — Cache serialization, Redis values, drop-in JSON replacement where schema adoption is not feasible.
CBOR — IoT devices, security protocols (WebAuthn, COSE), constrained environments, IETF-mandated formats.
Key Takeaways#
- JSON is the default for human-facing and public APIs. Optimize elsewhere.
- Protobuf and Avro deliver 3-4x smaller payloads and faster parsing through schema-driven encoding.
- Schema evolution is not free — Protobuf uses field numbers, Avro uses registry-enforced resolution, and JSON relies on convention.
- MessagePack and CBOR are self-describing binary formats that improve on JSON without requiring a schema.
- Choose based on your ecosystem: Protobuf for gRPC, Avro for Kafka, CBOR for IoT and security standards.
Build and explore system design concepts hands-on at codelit.io.