Cloud Cost Optimization: Right-Size, Reserve, and Automate Your Way to 40% Savings
Cloud Cost Optimization#
The average company wastes 30-35% of cloud spend on idle or oversized resources. Cloud cost optimization is not about cutting corners — it is about paying only for what you actually use.
The Problem#
Month 1: Spin up dev cluster "temporarily" → $800/mo
Month 3: Nobody remembers who owns it → still running
Month 6: Finance asks "why is AWS $47K/mo?" → panic
Month 7: Audit finds 23 idle load balancers, plus unattached volumes and forgotten clusters → $4,200/mo wasted
Cloud makes it easy to provision. It makes it hard to de-provision.
Right-Sizing#
Most instances run at 10-20% CPU utilization. Right-sizing means matching instance size to actual usage.
Before:
Production API: c5.4xlarge (16 vCPU, 32 GB)
Avg CPU: 12%
Avg Memory: 18%
Monthly cost: $490
After right-sizing:
Production API: c5.xlarge (4 vCPU, 8 GB)
Avg CPU: 48%
Avg Memory: 72%
Monthly cost: $122
Savings: 75%
How to identify candidates:
1. Pull 14-day CPU/memory metrics from CloudWatch
2. Flag instances with peak utilization below 40%
3. Recommend one size down (or two if peak is below 20%)
4. Test in staging first
5. Apply during maintenance window
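The selection rule in steps 2-3 can be sketched in a few lines; the size ladder and thresholds below are illustrative, not a full instance catalog, and a real tool would check memory headroom as well as CPU:

```python
# Hypothetical size ladder for one instance family (e.g. c5).
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "9xlarge"]

def rightsizing_recommendation(size: str, peak_cpu_pct: float):
    """One size down below 40% peak CPU, two sizes down below 20%."""
    if peak_cpu_pct >= 40:
        return None  # utilization is healthy; leave it alone
    steps = 2 if peak_cpu_pct < 20 else 1
    idx = SIZE_LADDER.index(size)
    return SIZE_LADDER[max(idx - steps, 0)]

print(rightsizing_recommendation("4xlarge", 12))  # xlarge (two sizes down)
print(rightsizing_recommendation("2xlarge", 35))  # xlarge (one size down)
print(rightsizing_recommendation("xlarge", 65))   # None (no change)
```

The first call reproduces the c5.4xlarge → c5.xlarge move from the example above: at 12% peak, two sizes down is warranted.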
Reserved Instances and Savings Plans#
For stable workloads, commit to a 1- or 3-year term in exchange for steep discounts:
| Purchase Type | Discount | Flexibility | Risk |
|---|---|---|---|
| On-Demand | 0% | Full | None |
| Savings Plan (1yr) | 30-40% | Region + family flexible | Low |
| Reserved Instance (1yr) | 35-45% | Locked to type + AZ | Medium |
| Reserved Instance (3yr) | 55-65% | Locked to type + AZ | High |
Strategy:
Baseline load (always running) → Reserved Instances / Savings Plans
Variable load (scales with traffic) → On-Demand + Auto-Scaling
Batch / fault-tolerant jobs → Spot Instances
Dev/test environments → Spot + scheduled shutdown
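A rough cost model shows why the split matters. The 35% discount is the midpoint of the 1-year Savings Plan range in the table; the instance counts and the $100/mo rate are made up:

```python
def blended_monthly_cost(baseline_n, avg_variable_n, rate,
                         commit_discount=0.35):
    """Cover the always-on baseline with commitments; leave bursts on-demand."""
    committed = baseline_n * rate * (1 - commit_discount)
    on_demand = avg_variable_n * rate
    return committed + on_demand

all_on_demand = (10 + 4) * 100            # $1,400/mo, no commitments
split = blended_monthly_cost(10, 4, 100)  # $1,050/mo
print(f"savings: {1 - split / all_on_demand:.0%}")  # savings: 25%
```

Committing only the 10 always-on instances captures most of the discount without the risk of over-committing on variable load.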
Spot Instances#
Spare cloud capacity at a 60-90% discount. The catch: instances can be reclaimed with only two minutes' notice.
Good for:
→ CI/CD build runners
→ Batch data processing
→ Stateless web workers behind a load balancer
→ Machine learning training jobs (with checkpointing)
→ Test environments
Bad for:
→ Databases
→ Stateful services without checkpointing
→ Single-instance critical services
Spot interruption strategy:
1. Use multiple instance types (diversify capacity pools)
2. Use multiple AZs
3. Set up interruption handler: save state → drain connections → terminate
4. Mix spot + on-demand in ASG (e.g., 70% spot, 30% on-demand)
Auto-Scaling#
Scale resources to match demand — pay for peak only when peak happens.
Target tracking:
Auto-Scaling Policy:
Metric: Average CPU utilization
Target: 60%
Min instances: 2
Max instances: 20
Cooldown: 300 seconds
Traffic spike → CPU rises above 60% → add instances
Traffic drops → CPU falls below 60% → remove instances
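A simplified model of the target-tracking math: scale the fleet proportionally so average CPU lands back near the target. Real policies also apply cooldowns and warm-up periods, which are omitted here:

```python
import math

def desired_capacity(current_n, avg_cpu_pct, target=60, min_n=2, max_n=20):
    """Proportional scaling toward a target average CPU, clamped to bounds."""
    desired = math.ceil(current_n * avg_cpu_pct / target)
    return max(min_n, min(max_n, desired))

print(desired_capacity(4, 90))   # spike: 4 -> 6 instances
print(desired_capacity(6, 30))   # lull:  6 -> 3 instances
print(desired_capacity(2, 10))   # floor: stays at the min of 2
```

The min bound matters: it keeps the service alive (and redundant) through quiet periods instead of scaling to zero.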
Scheduled scaling for predictable patterns:
Weekdays 9 AM: scale to 10 instances (business hours)
Weekdays 7 PM: scale to 3 instances (evening)
Weekends: scale to 2 instances (minimal traffic)
Savings: ~55% vs running 10 instances 24/7
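A back-of-the-envelope check of that figure, counting instance-hours per week (the exact hour boundaries are assumptions read off the schedule above):

```python
business = 10 * 10 * 5   # 10 instances x 10 h (9 AM-7 PM) x 5 weekdays
evening  = 3 * 14 * 5    # 3 instances x 14 h (7 PM-9 AM) x 5 weekdays
weekend  = 2 * 48        # 2 instances x 48 weekend hours
scheduled = business + evening + weekend       # 806 instance-hours/week
flat = 10 * 24 * 7                             # 1,680 instance-hours/week
print(f"savings: {1 - scheduled / flat:.0%}")  # savings: 52%
```

This lands close to the quoted ~55%; the exact number depends on where the schedule boundaries fall.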
Idle Resource Detection#
The silent budget killer: resources nobody uses but everybody pays for.
Common culprits:
→ Unattached EBS volumes ($0.10/GB/mo, adds up fast)
→ Idle load balancers ($16-22/mo each, zero traffic)
→ Stopped instances with attached EBS (compute free, storage not)
→ Old snapshots ($0.05/GB/mo, forgotten)
→ Unused Elastic IPs ($3.60/mo each)
→ Orphaned NAT Gateways ($32/mo + data processing)
→ Dev/staging clusters left running (hundreds/mo)
Automated cleanup script pattern:
Daily scan:
1. List all EBS volumes where state = "available" (unattached)
2. List all ELBs with zero requests in 14 days
3. List all instances with avg CPU below 2% for 7 days
4. List all snapshots older than 90 days
5. Generate report → send to Slack channel
6. Auto-tag with "idle-candidate" and "cleanup-date"
7. Delete after 14-day grace period if unclaimed
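A sketch of the scan's decision logic over in-memory records. A real version would pull resource state from the provider's APIs; the field names and sample data are illustrative:

```python
from datetime import date

def find_idle_candidates(resources, today, grace_days=14):
    """Split idle resources into newly flagged vs past-grace deletable."""
    flagged, deletable = [], []
    for r in resources:
        idle = (
            (r["type"] == "ebs_volume" and r["state"] == "available")
            or (r["type"] == "elb" and r["requests_14d"] == 0)
            or (r["type"] == "instance" and r["avg_cpu_7d"] < 2)
            or (r["type"] == "snapshot" and (today - r["created"]).days > 90)
        )
        if not idle:
            continue
        since = r.get("idle_candidate_since")
        if since is None:
            flagged.append(r["id"])           # first sighting: tag + report
        elif (today - since).days >= grace_days:
            deletable.append(r["id"])         # grace period expired
    return flagged, deletable

resources = [
    {"id": "vol-1", "type": "ebs_volume", "state": "available"},
    {"id": "elb-1", "type": "elb", "requests_14d": 0,
     "idle_candidate_since": date(2025, 5, 10)},
    {"id": "i-1", "type": "instance", "avg_cpu_7d": 55.0},
]
print(find_idle_candidates(resources, today=date(2025, 6, 1)))
# (['vol-1'], ['elb-1'])
```

The two-phase return mirrors steps 5-7: newly flagged resources go to the Slack report, while only resources that sat tagged through the full grace period are deleted.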
FinOps: The Cultural Practice#
FinOps is not a tool — it is a practice where engineering, finance, and business collaborate on cloud spending.
Three phases:
Inform → Visibility into who spends what
→ Dashboards, cost allocation, showback reports
Optimize → Right-size, reserve, spot, auto-scale
→ Engineering-driven changes with financial context
Operate → Continuous governance
→ Budgets, alerts, anomaly detection, policy enforcement
Key FinOps metrics:
Unit economics: cost per transaction, cost per user, cost per request
Coverage ratio: % of spend covered by commitments (target: 70-80%)
Waste ratio: % of spend on idle resources (target: below 5%)
Forecast accuracy: predicted vs actual spend (target: within 5%)
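Each of these metrics is a simple ratio over the monthly bill. The dollar figures below are illustrative, loosely based on the numbers elsewhere in this article:

```python
def finops_metrics(total, committed, idle, forecast, actual):
    """Headline FinOps ratios from monthly spend figures."""
    return {
        "coverage_ratio": committed / total,                # target: 70-80%
        "waste_ratio": idle / total,                        # target: < 5%
        "forecast_error": abs(forecast - actual) / actual,  # target: < 5%
    }

m = finops_metrics(total=47_000, committed=33_000, idle=4_200,
                   forecast=45_000, actual=47_000)
print({k: f"{v:.1%}" for k, v in m.items()})
# waste lands near 9% here -- above the 5% target, so cleanup is due
```
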
Tagging Strategy#
Tags are the foundation of cost visibility. Without tags, you cannot allocate costs.
Mandatory tags for every resource:
team: "platform"
environment: "production"
service: "payment-api"
owner: "jane@company.com"
cost-center: "engineering-42"
created-by: "terraform"
Enforcement:
1. AWS SCP / Azure Policy → deny resource creation without required tags
2. CI check → Terraform plan must include all mandatory tags
3. Weekly scan → find untagged resources → auto-notify owner
4. Dashboard → show % of spend that is tagged (target: 95%+)
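The CI check in step 2 reduces to set arithmetic over a resource's tag keys; a minimal version using the mandatory tag list above:

```python
MANDATORY_TAGS = {"team", "environment", "service",
                  "owner", "cost-center", "created-by"}

def missing_tags(tags: dict) -> list:
    """Mandatory tag keys absent from a resource, for use as a CI gate."""
    return sorted(MANDATORY_TAGS - tags.keys())

print(missing_tags({
    "team": "platform", "environment": "production",
    "service": "payment-api", "owner": "jane@company.com",
    "cost-center": "engineering-42", "created-by": "terraform",
}))                                        # [] -> plan passes
print(missing_tags({"team": "platform"}))  # five missing -> plan fails
```

The same predicate works for the weekly scan: any resource with a non-empty result gets an auto-notification to its owner.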
Cost Allocation#
Map cloud spend to teams, products, and features:
Total AWS bill: $47,000/mo
By team:
Platform: $18,000 (38%)
Data: $14,000 (30%)
Product: $10,000 (21%)
Untagged: $5,000 (11%) ← fix this
By environment:
Production: $28,000 (60%)
Staging: $9,000 (19%)
Dev: $5,000 (11%)
Untagged: $5,000 (11%)
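Mechanically, allocation is a group-by over tagged billing line items. A minimal sketch with made-up items, showing how untagged spend surfaces as its own bucket:

```python
from collections import defaultdict

def allocate(line_items, tag_key):
    """Sum spend per tag value, bucketing missing tags as 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["tags"].get(tag_key, "untagged")] += item["cost"]
    return dict(totals)

items = [
    {"cost": 18_000, "tags": {"team": "platform", "environment": "production"}},
    {"cost": 14_000, "tags": {"team": "data", "environment": "production"}},
    {"cost": 5_000, "tags": {}},  # untagged spend is visible immediately
]
print(allocate(items, "team"))
# {'platform': 18000.0, 'data': 14000.0, 'untagged': 5000.0}
```

Running the same function with `tag_key="environment"` produces the by-environment view from the same line items, which is why consistent tagging is the prerequisite for every breakdown above.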
Showback vs chargeback:
Showback: "Team X, your cloud spend was $18K this month" (informational)
Chargeback: "Team X, $18K is deducted from your budget" (accountability)
Start with showback. Move to chargeback once tagging is mature.
Tools#
| Tool | Focus | Best For |
|---|---|---|
| Infracost | Pre-deployment cost estimation | Terraform cost in PRs |
| Kubecost | Kubernetes cost allocation | K8s cluster optimization |
| AWS Cost Explorer | AWS spend analysis | AWS-native visibility |
| Spot.io (by NetApp) | Spot instance management | Automated spot lifecycle |
| CloudHealth | Multi-cloud governance | Enterprise FinOps |
Infracost in CI#
Estimate cost impact before merging infrastructure changes:
PR opens → Terraform plan → Infracost diff
↓
"This change adds $240/mo"
↓
Reviewer sees cost impact in PR comment
↓
Approve or request optimization
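A CI gate on that diff could look like the sketch below. The `diffTotalMonthlyCost` field name is an assumption about Infracost's JSON diff output, so verify it against your version's schema before relying on it:

```python
import json

def cost_gate(diff_json: str, threshold_usd: float = 500.0) -> bool:
    """Pass the PR only if the estimated monthly cost delta is under budget."""
    # Field name assumed from Infracost's JSON diff format; check your version.
    delta = float(json.loads(diff_json).get("diffTotalMonthlyCost", 0))
    return delta <= threshold_usd

print(cost_gate('{"diffTotalMonthlyCost": "240"}'))   # True  -> approve path
print(cost_gate('{"diffTotalMonthlyCost": "1200"}'))  # False -> request changes
```
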
Kubecost#
Real-time Kubernetes cost allocation by namespace, deployment, and label:
Kubecost Dashboard:
namespace/production: $12,400/mo
namespace/staging: $3,200/mo
namespace/monitoring: $1,800/mo
Over-provisioned pods: 47 (requesting 4x actual usage)
Recommended savings: $4,100/mo
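The "requesting 4x actual usage" finding is just a ratio of resource requests to observed usage. A toy version of that check, with made-up pod data:

```python
def overprovisioned(pods, factor=2.0):
    """Pods whose CPU request exceeds measured usage by `factor` or more."""
    return [p["name"] for p in pods
            if p["cpu_request_m"] >= factor * p["cpu_used_m"]]

pods = [
    {"name": "api", "cpu_request_m": 2000, "cpu_used_m": 450},    # ~4.4x
    {"name": "worker", "cpu_request_m": 500, "cpu_used_m": 400},  # ~1.25x
]
print(overprovisioned(pods))  # ['api']
```

Tightening requests on flagged pods lets the cluster autoscaler pack more workloads per node, which is where the recommended savings come from.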
Summary#
- Right-size instances — match capacity to actual utilization
- Reserve stable workloads — 1-year Savings Plans for 30-40% off
- Spot for fault-tolerant jobs — 60-90% savings with interruption handling
- Auto-scale everything — pay for peak only during peak
- Kill idle resources — automated scans, grace periods, then cleanup
- Tag religiously — enforce mandatory tags via policy
- FinOps culture — inform, optimize, operate continuously