Cloud Cost Optimization: Right-Size, Reserve, and Automate Your Way to 40% Savings
Cloud Cost Optimization#
The average company wastes 30-35% of cloud spend on idle or oversized resources. Cloud cost optimization is not about cutting corners — it is about paying only for what you actually use.
The Problem#
Month 1: Spin up dev cluster "temporarily" → $800/mo
Month 3: Nobody remembers who owns it → still running
Month 6: Finance asks "why is AWS $47K/mo?" → panic
Month 7: Audit finds 23 idle load balancers, plus unattached volumes and forgotten clusters → $4,200/mo wasted
Cloud makes it easy to provision. It makes it hard to de-provision.
Right-Sizing#
Most instances run at 10-20% CPU utilization. Right-sizing means matching instance size to actual usage.
Before:
Production API: c5.4xlarge (16 vCPU, 32 GB)
Avg CPU: 12%
Avg Memory: 18%
Monthly cost: $490
After right-sizing:
Production API: c5.xlarge (4 vCPU, 8 GB)
Avg CPU: 48%
Avg Memory: 72%
Monthly cost: $122
Savings: 75%
How to identify candidates:
1. Pull 14-day CPU/memory metrics from CloudWatch
2. Flag instances with peak utilization below 40%
3. Recommend one size down (or two if peak is below 20%)
4. Test in staging first
5. Apply during maintenance window
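The selection rule in steps 2-3 can be sketched in a few lines; the size ladder and thresholds below are illustrative, not a full instance catalog, and a real tool would check memory headroom as well as CPU:

```python
# Hypothetical size ladder for one instance family (e.g. c5).
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge", "9xlarge"]

def rightsizing_recommendation(size: str, peak_cpu_pct: float):
    """One size down below 40% peak CPU, two sizes down below 20%."""
    if peak_cpu_pct >= 40:
        return None  # utilization is healthy; leave it alone
    steps = 2 if peak_cpu_pct < 20 else 1
    idx = SIZE_LADDER.index(size)
    return SIZE_LADDER[max(idx - steps, 0)]

print(rightsizing_recommendation("4xlarge", 12))  # xlarge (two sizes down)
print(rightsizing_recommendation("2xlarge", 35))  # xlarge (one size down)
print(rightsizing_recommendation("xlarge", 65))   # None (no change)
```

The first call reproduces the c5.4xlarge → c5.xlarge move from the example above: at 12% peak, two sizes down is warranted.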
Reserved Instances and Savings Plans#
For stable workloads, commit to a 1- or 3-year term in exchange for steep discounts:
| Purchase Type | Discount | Flexibility | Risk |
|---|---|---|---|
| On-Demand | 0% | Full | None |
| Savings Plan (1yr) | 30-40% | Region + family flexible | Low |
| Reserved Instance (1yr) | 35-45% | Locked to type + AZ | Medium |
| Reserved Instance (3yr) | 55-65% | Locked to type + AZ | High |
Strategy:
Baseline load (always running) → Reserved Instances / Savings Plans
Variable load (scales with traffic) → On-Demand + Auto-Scaling
Batch / fault-tolerant jobs → Spot Instances
Dev/test environments → Spot + scheduled shutdown
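A rough cost model shows why the split matters. The 35% discount is the midpoint of the 1-year Savings Plan range in the table; the instance counts and the $100/mo rate are made up:

```python
def blended_monthly_cost(baseline_n, avg_variable_n, rate,
                         commit_discount=0.35):
    """Cover the always-on baseline with commitments; leave bursts on-demand."""
    committed = baseline_n * rate * (1 - commit_discount)
    on_demand = avg_variable_n * rate
    return committed + on_demand

all_on_demand = (10 + 4) * 100            # $1,400/mo, no commitments
split = blended_monthly_cost(10, 4, 100)  # $1,050/mo
print(f"savings: {1 - split / all_on_demand:.0%}")  # savings: 25%
```

Committing only the 10 always-on instances captures most of the discount without the risk of over-committing on variable load.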
Spot Instances#
Spare cloud capacity at a 60-90% discount. The catch: instances can be reclaimed with only two minutes' notice.
Good for:
→ CI/CD build runners
→ Batch data processing
→ Stateless web workers behind a load balancer
→ Machine learning training jobs (with checkpointing)
→ Test environments
Bad for:
→ Databases
→ Stateful services without checkpointing
→ Single-instance critical services
Spot interruption strategy:
1. Use multiple instance types (diversify capacity pools)
2. Use multiple AZs
3. Set up interruption handler: save state → drain connections → terminate
4. Mix spot + on-demand in ASG (e.g., 70% spot, 30% on-demand)
Auto-Scaling#
Scale resources to match demand — pay for peak only when peak happens.
Target tracking:
Auto-Scaling Policy:
Metric: Average CPU utilization
Target: 60%
Min instances: 2
Max instances: 20
Cooldown: 300 seconds
Traffic spike → CPU rises above 60% → add instances
Traffic drops → CPU falls below 60% → remove instances
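A simplified model of the target-tracking math: scale the fleet proportionally so average CPU lands back near the target. Real policies also apply cooldowns and warm-up periods, which are omitted here:

```python
import math

def desired_capacity(current_n, avg_cpu_pct, target=60, min_n=2, max_n=20):
    """Proportional scaling toward a target average CPU, clamped to bounds."""
    desired = math.ceil(current_n * avg_cpu_pct / target)
    return max(min_n, min(max_n, desired))

print(desired_capacity(4, 90))   # spike: 4 -> 6 instances
print(desired_capacity(6, 30))   # lull:  6 -> 3 instances
print(desired_capacity(2, 10))   # floor: stays at the min of 2
```

The min bound matters: it keeps the service alive (and redundant) through quiet periods instead of scaling to zero.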
Scheduled scaling for predictable patterns:
Weekdays 9 AM: scale to 10 instances (business hours)
Weekdays 7 PM: scale to 3 instances (evening)
Weekends: scale to 2 instances (minimal traffic)
Savings: ~55% vs running 10 instances 24/7
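A back-of-the-envelope check of that figure, counting instance-hours per week (the exact hour boundaries are assumptions read off the schedule above):

```python
business = 10 * 10 * 5   # 10 instances x 10 h (9 AM-7 PM) x 5 weekdays
evening  = 3 * 14 * 5    # 3 instances x 14 h (7 PM-9 AM) x 5 weekdays
weekend  = 2 * 48        # 2 instances x 48 weekend hours
scheduled = business + evening + weekend       # 806 instance-hours/week
flat = 10 * 24 * 7                             # 1,680 instance-hours/week
print(f"savings: {1 - scheduled / flat:.0%}")  # savings: 52%
```

This lands close to the quoted ~55%; the exact number depends on where the schedule boundaries fall.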
Idle Resource Detection#
The silent budget killer: resources nobody uses but everybody pays for.
Common culprits:
→ Unattached EBS volumes ($0.10/GB/mo, adds up fast)
→ Idle load balancers ($16-22/mo each, zero traffic)
→ Stopped instances with attached EBS (compute free, storage not)
→ Old snapshots ($0.05/GB/mo, forgotten)
→ Unused Elastic IPs ($3.60/mo each)
→ Orphaned NAT Gateways ($32/mo + data processing)
→ Dev/staging clusters left running (hundreds/mo)
Automated cleanup script pattern:
Daily scan:
1. List all EBS volumes where state = "available" (unattached)
2. List all ELBs with zero requests in 14 days
3. List all instances with avg CPU below 2% for 7 days
4. List all snapshots older than 90 days
5. Generate report → send to Slack channel
6. Auto-tag with "idle-candidate" and "cleanup-date"
7. Delete after 14-day grace period if unclaimed
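A sketch of the scan's decision logic over in-memory records. A real version would pull resource state from the provider's APIs; the field names and sample data are illustrative:

```python
from datetime import date

def find_idle_candidates(resources, today, grace_days=14):
    """Split idle resources into newly flagged vs past-grace deletable."""
    flagged, deletable = [], []
    for r in resources:
        idle = (
            (r["type"] == "ebs_volume" and r["state"] == "available")
            or (r["type"] == "elb" and r["requests_14d"] == 0)
            or (r["type"] == "instance" and r["avg_cpu_7d"] < 2)
            or (r["type"] == "snapshot" and (today - r["created"]).days > 90)
        )
        if not idle:
            continue
        since = r.get("idle_candidate_since")
        if since is None:
            flagged.append(r["id"])           # first sighting: tag + report
        elif (today - since).days >= grace_days:
            deletable.append(r["id"])         # grace period expired
    return flagged, deletable

resources = [
    {"id": "vol-1", "type": "ebs_volume", "state": "available"},
    {"id": "elb-1", "type": "elb", "requests_14d": 0,
     "idle_candidate_since": date(2025, 5, 10)},
    {"id": "i-1", "type": "instance", "avg_cpu_7d": 55.0},
]
print(find_idle_candidates(resources, today=date(2025, 6, 1)))
# (['vol-1'], ['elb-1'])
```

The two-phase return mirrors steps 5-7: newly flagged resources go to the Slack report, while only resources that sat tagged through the full grace period are deleted.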
FinOps: The Cultural Practice#
FinOps is not a tool — it is a practice where engineering, finance, and business collaborate on cloud spending.
Three phases:
Inform → Visibility into who spends what
→ Dashboards, cost allocation, showback reports
Optimize → Right-size, reserve, spot, auto-scale
→ Engineering-driven changes with financial context
Operate → Continuous governance
→ Budgets, alerts, anomaly detection, policy enforcement
Key FinOps metrics:
Unit economics: cost per transaction, cost per user, cost per request
Coverage ratio: % of spend covered by commitments (target: 70-80%)
Waste ratio: % of spend on idle resources (target: below 5%)
Forecast accuracy: predicted vs actual spend (target: within 5%)
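Each of these metrics is a simple ratio over the monthly bill. The dollar figures below are illustrative, loosely based on the numbers elsewhere in this article:

```python
def finops_metrics(total, committed, idle, forecast, actual):
    """Headline FinOps ratios from monthly spend figures."""
    return {
        "coverage_ratio": committed / total,                # target: 70-80%
        "waste_ratio": idle / total,                        # target: < 5%
        "forecast_error": abs(forecast - actual) / actual,  # target: < 5%
    }

m = finops_metrics(total=47_000, committed=33_000, idle=4_200,
                   forecast=45_000, actual=47_000)
print({k: f"{v:.1%}" for k, v in m.items()})
# waste lands near 9% here -- above the 5% target, so cleanup is due
```
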
Tagging Strategy#
Tags are the foundation of cost visibility. Without tags, you cannot allocate costs.
Mandatory tags for every resource:
team: "platform"
environment: "production"
service: "payment-api"
owner: "jane@company.com"
cost-center: "engineering-42"
created-by: "terraform"
Enforcement:
1. AWS SCP / Azure Policy → deny resource creation without required tags
2. CI check → Terraform plan must include all mandatory tags
3. Weekly scan → find untagged resources → auto-notify owner
4. Dashboard → show % of spend that is tagged (target: 95%+)
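The CI check in step 2 reduces to set arithmetic over a resource's tag keys; a minimal version using the mandatory tag list above:

```python
MANDATORY_TAGS = {"team", "environment", "service",
                  "owner", "cost-center", "created-by"}

def missing_tags(tags: dict) -> list:
    """Mandatory tag keys absent from a resource, for use as a CI gate."""
    return sorted(MANDATORY_TAGS - tags.keys())

print(missing_tags({
    "team": "platform", "environment": "production",
    "service": "payment-api", "owner": "jane@company.com",
    "cost-center": "engineering-42", "created-by": "terraform",
}))                                        # [] -> plan passes
print(missing_tags({"team": "platform"}))  # five missing -> plan fails
```

The same predicate works for the weekly scan: any resource with a non-empty result gets an auto-notification to its owner.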
Cost Allocation#
Map cloud spend to teams, products, and features:
Total AWS bill: $47,000/mo
By team:
Platform: $18,000 (38%)
Data: $14,000 (30%)
Product: $10,000 (21%)
Untagged: $5,000 (11%) ← fix this
By environment:
Production: $28,000 (60%)
Staging: $9,000 (19%)
Dev: $5,000 (11%)
Untagged: $5,000 (11%)
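Mechanically, allocation is a group-by over tagged billing line items. A minimal sketch with made-up items, showing how untagged spend surfaces as its own bucket:

```python
from collections import defaultdict

def allocate(line_items, tag_key):
    """Sum spend per tag value, bucketing missing tags as 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["tags"].get(tag_key, "untagged")] += item["cost"]
    return dict(totals)

items = [
    {"cost": 18_000, "tags": {"team": "platform", "environment": "production"}},
    {"cost": 14_000, "tags": {"team": "data", "environment": "production"}},
    {"cost": 5_000, "tags": {}},  # untagged spend is visible immediately
]
print(allocate(items, "team"))
# {'platform': 18000.0, 'data': 14000.0, 'untagged': 5000.0}
```

Running the same function with `tag_key="environment"` produces the by-environment view from the same line items, which is why consistent tagging is the prerequisite for every breakdown above.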
Showback vs chargeback:
Showback: "Team X, your cloud spend was $18K this month" (informational)
Chargeback: "Team X, $18K is deducted from your budget" (accountability)
Start with showback. Move to chargeback once tagging is mature.
Tools#
| Tool | Focus | Best For |
|---|---|---|
| Infracost | Pre-deployment cost estimation | Terraform cost in PRs |
| Kubecost | Kubernetes cost allocation | K8s cluster optimization |
| AWS Cost Explorer | AWS spend analysis | AWS-native visibility |
| Spot.io (by NetApp) | Spot instance management | Automated spot lifecycle |
| CloudHealth | Multi-cloud governance | Enterprise FinOps |
Infracost in CI#
Estimate cost impact before merging infrastructure changes:
PR opens → Terraform plan → Infracost diff
↓
"This change adds $240/mo"
↓
Reviewer sees cost impact in PR comment
↓
Approve or request optimization
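A CI gate on that diff could look like the sketch below. The `diffTotalMonthlyCost` field name is an assumption about Infracost's JSON diff output, so verify it against your version's schema before relying on it:

```python
import json

def cost_gate(diff_json: str, threshold_usd: float = 500.0) -> bool:
    """Pass the PR only if the estimated monthly cost delta is under budget."""
    # Field name assumed from Infracost's JSON diff format; check your version.
    delta = float(json.loads(diff_json).get("diffTotalMonthlyCost", 0))
    return delta <= threshold_usd

print(cost_gate('{"diffTotalMonthlyCost": "240"}'))   # True  -> approve path
print(cost_gate('{"diffTotalMonthlyCost": "1200"}'))  # False -> request changes
```
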
Kubecost#
Real-time Kubernetes cost allocation by namespace, deployment, and label:
Kubecost Dashboard:
namespace/production: $12,400/mo
namespace/staging: $3,200/mo
namespace/monitoring: $1,800/mo
Over-provisioned pods: 47 (requesting 4x actual usage)
Recommended savings: $4,100/mo
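The "requesting 4x actual usage" finding is just a ratio of resource requests to observed usage. A toy version of that check, with made-up pod data:

```python
def overprovisioned(pods, factor=2.0):
    """Pods whose CPU request exceeds measured usage by `factor` or more."""
    return [p["name"] for p in pods
            if p["cpu_request_m"] >= factor * p["cpu_used_m"]]

pods = [
    {"name": "api", "cpu_request_m": 2000, "cpu_used_m": 450},    # ~4.4x
    {"name": "worker", "cpu_request_m": 500, "cpu_used_m": 400},  # ~1.25x
]
print(overprovisioned(pods))  # ['api']
```

Tightening requests on flagged pods lets the cluster autoscaler pack more workloads per node, which is where the recommended savings come from.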
Summary#
- Right-size instances — match capacity to actual utilization
- Reserve stable workloads — 1-year Savings Plans for 30-40% off
- Spot for fault-tolerant jobs — 60-90% savings with interruption handling
- Auto-scale everything — pay for peak only during peak
- Kill idle resources — automated scans, grace periods, then cleanup
- Tag religiously — enforce mandatory tags via policy
- FinOps culture — inform, optimize, operate continuously