Object Storage Architecture: File, Block & Object Storage Explained#
Every application stores files — user uploads, images, logs, backups, ML datasets. The architecture you choose for storage determines cost, performance, and scalability.
Three Storage Paradigms#
Block Storage#
Raw storage volumes attached to a single compute instance. Think of it as a virtual hard drive.
Block storage:
/dev/sda1 → 500GB EBS volume → attached to one EC2 instance
Low latency (~1ms), fixed size, no built-in sharing
Use cases: Databases, OS boot volumes, high-IOPS workloads. Examples: AWS EBS, Google Persistent Disk, Azure Managed Disks.
File Storage#
A shared filesystem accessible by multiple machines over a network (NFS/SMB).
File storage:
/shared/uploads/ → NFS mount → accessible by 10 app servers
Hierarchical directories, file locking, POSIX semantics
Use cases: Shared application data, home directories, CMS media. Examples: AWS EFS, Google Filestore, Azure Files.
Object Storage#
Flat namespace of objects (blobs) accessed via HTTP APIs. No directories — just buckets and keys.
Object storage:
s3://my-bucket/users/42/avatar.png
↑ bucket ↑ key (not a directory path)
HTTP PUT/GET/DELETE — no filesystem semantics
Virtually unlimited capacity, pay-per-GB
Use cases: User uploads, static assets, backups, data lakes, ML training data. Examples: AWS S3, Google Cloud Storage, Azure Blob Storage.
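The flat-namespace model above can be sketched in a few lines. This is a hypothetical in-memory `Bucket` class, not a real S3 client; it just shows that "directories" are nothing more than a shared key prefix you can list by.

```javascript
// Minimal in-memory sketch of object-storage semantics: a flat map of
// key → bytes, with prefix listing standing in for "directories".
class Bucket {
  constructor(name) {
    this.name = name;
    this.objects = new Map(); // flat namespace: full key → value
  }
  put(key, value) { this.objects.set(key, value); }  // HTTP PUT
  get(key) { return this.objects.get(key); }         // HTTP GET
  delete(key) { this.objects.delete(key); }          // HTTP DELETE
  list(prefix) {                                     // LIST by prefix
    return [...this.objects.keys()].filter(k => k.startsWith(prefix));
  }
}

const bucket = new Bucket('my-bucket');
bucket.put('users/42/avatar.png', Buffer.from('...png bytes...'));
bucket.put('users/42/cover.jpg', Buffer.from('...jpg bytes...'));
// "users/42/" is not a directory — just a key prefix shared by two objects:
console.log(bucket.list('users/42/').length); // 2
```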
Comparison#
| Feature | Block | File | Object |
|---|---|---|---|
| Access pattern | Mounted volume | Network filesystem | HTTP API |
| Max size | TBs per volume | PBs | Unlimited |
| Latency | ~1ms | ~5-10ms | ~50-200ms |
| Concurrent access | Single instance | Multiple instances | Unlimited |
| Cost (per GB/mo) | $0.08-0.10 | $0.30 | $0.023 |
| Metadata | Filesystem attrs | Filesystem attrs | Custom key-value |
Object storage wins on cost and scalability. That's why it dominates modern architectures.
S3 Architecture Internals#
Amazon S3 — the de facto standard — is engineered for 99.999999999% (11 nines) durability.
How S3 Stores Data#
PUT s3://bucket/photo.jpg
1. Object split into chunks
2. Each chunk erasure-coded (not simple replication)
3. Coded fragments distributed across multiple AZs
4. Metadata stored in a distributed index
5. ACK returned to client
Erasure coding is more space-efficient than 3x replication while providing equivalent durability. A typical scheme like Reed-Solomon 8/4 stores 12 fragments — any 8 can reconstruct the object.
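To make the erasure-coding idea concrete, here is the simplest possible scheme: single XOR parity, as used in RAID-5. This is an illustration only, not the Reed-Solomon 8/4 math S3 actually uses, but it shows the core property: a lost fragment can be rebuilt from the survivors plus parity.

```javascript
// Illustration: XOR parity, the simplest erasure code. With k data
// fragments plus one parity fragment, any single lost fragment can be
// reconstructed from the remaining k fragments.
function xorParity(fragments) {
  const parity = Buffer.alloc(fragments[0].length);
  for (const f of fragments) {
    for (let i = 0; i < f.length; i++) parity[i] ^= f[i];
  }
  return parity;
}

// Rebuilding a missing fragment is the same operation: XOR the
// surviving fragments together with the parity.
function rebuild(survivors, parity) {
  return xorParity([...survivors, parity]);
}

const data = [Buffer.from('AAAA'), Buffer.from('BBBB'), Buffer.from('CCCC')];
const parity = xorParity(data);
// Lose fragment 1; reconstruct it from fragments 0, 2 and the parity:
const recovered = rebuild([data[0], data[2]], parity);
console.log(recovered.toString()); // "BBBB"
```

Reed-Solomon generalizes this to survive multiple simultaneous fragment losses, which is what makes the 11-nines durability claim possible across AZ failures.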
Key Design Decisions#
- Flat namespace — No directories. The `/` in keys is just a convention
- Immutable objects — You overwrite, not modify in place
- Read-after-write consistency — S3 provides strong consistency for all operations (since 2020)
- Unlimited scale — No provisioning. Throughput scales automatically
Presigned URLs#
Presigned URLs let clients upload/download directly to S3 without exposing credentials.
// Server generates a presigned upload URL (AWS SDK for JavaScript v2)
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const url = s3.getSignedUrl('putObject', {
  Bucket: 'uploads',
  Key: `users/${userId}/${fileId}`,
  ContentType: 'image/jpeg',
  Expires: 300, // URL valid for 5 minutes
});
// Client uploads directly to S3
// PUT https://uploads.s3.amazonaws.com/users/42/abc123?X-Amz-Signature=...
Architecture Flow#
1. Client → Server: "I want to upload a 5MB JPEG"
2. Server → S3: generate presigned PUT URL (5min TTL)
3. Server → Client: presigned URL
4. Client → S3: PUT file directly (server never touches the bytes)
5. S3 → Lambda/webhook: notify server of completed upload
6. Server: validate, process, update database
This offloads bandwidth and CPU from your servers entirely.
Multipart Uploads#
For large files (>100MB), multipart uploads provide reliability and parallelism.
Multipart upload flow:
1. Initiate → S3 returns uploadId
2. Upload parts in parallel (5MB-5GB each)
Part 1: bytes 0-10MB → ETag "abc"
Part 2: bytes 10MB-20MB → ETag "def"
Part 3: bytes 20MB-30MB → ETag "ghi"
3. Complete → send ordered list of ETags
4. S3 assembles final object
Failed part? Retry just that part, not the entire file.
Benefits:
- Parallel uploads — Saturate bandwidth with concurrent parts
- Retry granularity — Only re-upload failed parts
- Pause/resume — Upload can span multiple sessions
- Required for objects >5GB
Storage Tiers#
Not all data is accessed equally. Tiering reduces costs dramatically.
| Tier | Access Frequency | Retrieval Time | Cost/GB/mo | Example |
|---|---|---|---|---|
| Hot (Standard) | Frequent | Instant | $0.023 | Active user uploads |
| Warm (IA) | Monthly | Instant | $0.0125 | Old user files |
| Cold (Glacier Instant) | Quarterly | Instant | $0.004 | Compliance archives |
| Archive (Glacier Deep) | Rarely | 12-48 hours | $0.00099 | Legal holds, raw logs |
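A quick back-of-envelope calculation using the per-GB prices from the table shows why tiering matters. Assume a hypothetical 10TB dataset where 80% of the data goes cold after the first month:

```javascript
// Per-GB/month prices from the tier table above.
const PRICE = { standard: 0.023, ia: 0.0125, glacierInstant: 0.004 };

// Sum the monthly bill for a given distribution of GB across tiers.
function monthlyCost(gbByTier) {
  return Object.entries(gbByTier)
    .reduce((sum, [tier, gb]) => sum + gb * PRICE[tier], 0);
}

const allHot = monthlyCost({ standard: 10_000 });
const tiered = monthlyCost({ standard: 2_000, ia: 3_000, glacierInstant: 5_000 });
console.log(allHot.toFixed(2)); // 230.00
console.log(tiered.toFixed(2)); // 103.50 — less than half the cost
```

Retrieval fees for the colder tiers are omitted here; they matter if the "cold" data turns out to be accessed more than expected.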
Lifecycle Policies#
Automate transitions between tiers:
{
"Rules": [
{
"ID": "TierDown",
"Status": "Enabled",
"Transitions": [
{ "Days": 30, "StorageClass": "STANDARD_IA" },
{ "Days": 90, "StorageClass": "GLACIER_INSTANT_RETRIEVAL" },
{ "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
],
"Expiration": { "Days": 2555 }
}
]
}
Intelligent Tiering#
S3 Intelligent-Tiering automatically moves objects between tiers based on access patterns. Small monitoring fee per object, but zero retrieval fees.
CDN Integration#
Object storage + CDN = fast global delivery.
Architecture:
User → CloudFront/Cloudflare edge (cache HIT) → response in 5ms
User → CloudFront/Cloudflare edge (cache MISS) → S3 origin → cache + respond
Cache key: bucket + object key + query params
TTL: configured per path pattern (images: 30d, API: 0)
Cache Invalidation Strategies#
- Immutable keys — Include a content hash in the filename: avatar-a3f8b2c1.jpg
- Versioned keys — v3/logo.png
- Explicit invalidation — Purge specific paths (slow, costly at scale)
Immutable keys are the gold standard — no invalidation needed, infinite cache TTL.
Metadata and Indexing#
Object storage has limited query capability. For searchable metadata, maintain a separate index.
S3 object:
Key: uploads/user-42/invoice-2026-03.pdf
Custom metadata: { "userId": "42", "type": "invoice", "month": "2026-03" }
System metadata: { "size": 245000, "contentType": "application/pdf", "lastModified": "..." }
Separate index (DynamoDB/Postgres):
{ key, userId, type, uploadedAt, size, status, thumbnailKey }
S3 can list objects by prefix, but cannot query by metadata values. Always maintain an external index for search.
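Building the index row typically happens in the upload-notification handler. A sketch, assuming a hypothetical `uploads/user-<id>/<filename>` key convention (the event shape here is simplified, not the exact S3 event payload):

```javascript
// Derive a searchable index row from an upload event. The row is what
// lands in DynamoDB/Postgres — S3 itself is never queried for metadata.
function toIndexRecord(event) {
  const match = event.key.match(/^uploads\/user-(\d+)\/(.+)$/);
  if (!match) throw new Error(`unexpected key: ${event.key}`);
  return {
    key: event.key,
    userId: match[1],
    filename: match[2],
    size: event.size,
    contentType: event.contentType,
    uploadedAt: event.uploadedAt,
    status: 'pending', // flips to 'processed' after validation
  };
}

const record = toIndexRecord({
  key: 'uploads/user-42/invoice-2026-03.pdf',
  size: 245000,
  contentType: 'application/pdf',
  uploadedAt: '2026-03-01T12:00:00Z',
});
console.log(record.userId); // "42"
```

Queries like "all invoices for user 42 in March" then hit the database index, never an S3 LIST.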
Deduplication#
Avoid storing duplicate files to save cost and bandwidth.
Content-Addressable Storage#
Hash the file content and use the hash as the key:
Upload flow:
1. Client hashes file → SHA-256: "a3f8b2c1..."
2. Client → Server: "Do you have a3f8b2c1?"
3. Server checks index → if exists, skip upload
4. If new, upload to s3://bucket/blobs/a3f8b2c1
5. Store reference: user-42/report.pdf → a3f8b2c1
Multiple users upload same file → stored once
Reference Counting#
Track how many logical files point to each physical blob. Delete the blob only when the reference count hits zero.
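A minimal sketch of that bookkeeping (an in-memory map standing in for a database counter):

```javascript
const refCounts = new Map(); // content hash → number of logical files

// A new logical file starts pointing at the blob.
function addRef(hash) {
  refCounts.set(hash, (refCounts.get(hash) ?? 0) + 1);
}

// A logical file is deleted. Returns true only when the last reference
// is gone — the caller's cue to actually DELETE the blob from storage.
function release(hash) {
  const n = (refCounts.get(hash) ?? 0) - 1;
  if (n <= 0) { refCounts.delete(hash); return true; }
  refCounts.set(hash, n);
  return false;
}

addRef('a3f8b2c1'); // user-42/report.pdf
addRef('a3f8b2c1'); // user-99/report.pdf
console.log(release('a3f8b2c1')); // false: one reference remains
console.log(release('a3f8b2c1')); // true: safe to delete the blob
```

In a real system the count lives in the database and must be updated transactionally with the reference table, or concurrent deletes can orphan or prematurely delete blobs.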
Tools and Providers#
| Tool | Type | Standout Feature |
|---|---|---|
| AWS S3 | Cloud | Industry standard, deepest ecosystem |
| Google Cloud Storage | Cloud | Tight BigQuery/ML integration |
| MinIO | Self-hosted | S3-compatible, runs on Kubernetes |
| Cloudflare R2 | Cloud | Zero egress fees |
| Backblaze B2 | Cloud | Lowest cost per GB ($0.006/GB/mo) |
Choosing the Right Tool#
- Default choice: S3 — widest tooling support, most documentation
- Egress-heavy workloads: R2 — zero egress fees save thousands per month
- Budget storage: Backblaze B2 — 1/4 the cost of S3 for archival
- Self-hosted/air-gapped: MinIO — full S3 API compatibility on your infrastructure
- GCP ecosystem: GCS — native integration with BigQuery, Vertex AI, Cloud Functions
Architecture Checklist#
- Choose access pattern — Direct S3 access vs presigned URLs vs CDN
- Set lifecycle policies — Automate tier transitions from day one
- Enable versioning — Protect against accidental overwrites and deletions
- Configure CORS — Required for browser-based direct uploads
- Implement deduplication — Content-addressable storage for user-uploaded files
- Maintain metadata index — External database for searchable file metadata
- Set up CDN — CloudFront or Cloudflare in front of your bucket
- Monitor costs — Storage, requests, and especially egress
Key Takeaways#
- Object storage is the default — Use block/file storage only for specific needs
- Presigned URLs offload your servers — Clients upload directly to S3
- Tiering is free money — Move cold data to cheaper tiers automatically
- CDN is non-negotiable — Cache static assets at the edge
- Index externally — S3 is a blob store, not a database
- Watch egress costs — Consider R2 or B2 if bandwidth is significant
Object storage underpins nearly every modern application. For deeper dives into storage patterns, CDN architecture, and system design, visit codelit.io.
This is article #170 in the Codelit engineering blog series.