Infrastructure Monitoring Guide — Metrics, Dashboards, and Alerting
Why infrastructure monitoring matters#
Your application runs on infrastructure. When a disk fills up, a container runs out of memory, or a node becomes unreachable, your users feel it before your on-call engineer does — unless you have monitoring in place.
Infrastructure monitoring answers three questions continuously:
- Is everything running? (availability)
- Is everything fast enough? (performance)
- Will we run out of capacity soon? (forecasting)
Host metrics — the foundation#
Every monitoring stack starts with host-level metrics. These are the vital signs of your servers.
CPU#
- cpu_usage_percent — overall utilization across all cores
- cpu_iowait — percentage of time the CPU waits for disk I/O (high iowait signals disk bottlenecks)
- cpu_steal — time stolen by the hypervisor in virtualized environments
- load_average — 1, 5, and 15-minute load averages; compare against core count
Alert when CPU usage exceeds 80% sustained for 5 minutes. Investigate iowait separately — it often masquerades as CPU pressure.
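High iowait can be separated from genuine compute load with a query like the following, assuming the standard node_exporter metric names used later in this guide:

```promql
# Percentage of CPU time spent waiting on disk I/O, averaged per host
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
```

A sustained value above a few percent usually points at a disk bottleneck rather than CPU pressure.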
Memory#
- memory_used_bytes — total memory in use (excluding buffers and cache)
- memory_available_bytes — memory available for new allocations
- memory_swap_used_bytes — swap usage indicates memory pressure
Alert when available memory drops below 10% of total. Swap usage above zero on a production server warrants investigation.
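As a sketch, the 10% threshold maps onto node_exporter's meminfo metrics like this:

```promql
# Available memory as a percentage of total; investigate when this drops below 10
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```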
Disk#
- disk_used_percent — per-mount utilization
- disk_read_bytes / disk_write_bytes — throughput
- disk_io_time — time spent on I/O operations
- disk_inodes_free — running out of inodes is as bad as running out of space
Alert at 85% disk usage. Alert on inode exhaustion separately — it catches a different failure mode.
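Both conditions can be expressed with node_exporter's filesystem collector (metric and label names below assume its defaults; the fstype filter is one common way to exclude pseudo-filesystems):

```promql
# Per-mount disk usage percent; alert at 85
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
       / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85

# Inode exhaustion: free inodes below 10% of total
node_filesystem_files_free / node_filesystem_files < 0.10
```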
Network#
- network_bytes_sent / network_bytes_received — bandwidth utilization
- network_errors — CRC errors, dropped packets
- network_connections — TCP connection count by state (ESTABLISHED, TIME_WAIT)
- network_retransmits — TCP retransmissions indicate packet loss
Alert on sustained error rates above baseline and on connection count spikes.
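One way to express the retransmission signal as a ratio, assuming node_exporter's netstat collector is enabled:

```promql
# Fraction of outgoing TCP segments that are retransmissions
rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m])
```

Baselines vary by network, so alert on deviation from your normal ratio rather than a fixed absolute value.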
Container metrics#
Containers add a layer of abstraction. You need metrics from both the container runtime and the host.
Key container metrics#
- container_cpu_usage_seconds_total — CPU time consumed by the container
- container_memory_usage_bytes — current memory usage including cache
- container_memory_working_set_bytes — memory that cannot be reclaimed (the real pressure indicator)
- container_network_transmit_bytes_total — egress traffic
- container_fs_writes_bytes_total — filesystem write volume
Container-specific concerns#
- OOMKilled events — the kernel killed the container for exceeding its memory limit
- CPU throttling — the container hit its CPU limit and was throttled
- Restart count — a container restarting frequently usually indicates a crash loop
# Detect containers being CPU-throttled
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
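The other two concerns can be caught with kube-state-metrics, assuming its standard metric names (the restart threshold of 3 per hour is an illustrative starting point, not a universal value):

```promql
# Containers whose last termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Containers that restarted more than 3 times in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3
```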
Kubernetes metrics#
Kubernetes adds orchestration metrics on top of container metrics.
Node-level#
- node_cpu_utilization — how much of the node's allocatable CPU is in use
- node_memory_utilization — same for memory
- node_pod_count — number of pods scheduled; watch for nodes hitting pod limits
- node_condition — DiskPressure, MemoryPressure, PIDPressure, Ready
Pod-level#
- pod_phase — Pending, Running, Succeeded, Failed, Unknown
- pod_restart_count — CrashLoopBackOff detection
- pod_cpu_request vs pod_cpu_usage — right-sizing analysis
- pod_memory_request vs pod_memory_usage — same for memory
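A right-sizing sketch for the memory comparison, assuming cAdvisor container metrics and a kube-state-metrics version that exposes kube_pod_container_resource_requests with a resource label:

```promql
# Ratio of actual memory working set to requested memory, per pod
# (well below 1 suggests over-provisioning; near or above 1 suggests under-provisioning)
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="memory"})
```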
Cluster-level#
- kube_deployment_status_replicas_available — are all replicas running?
- kube_hpa_status_current_replicas — HPA scaling activity
- kube_job_status_failed — failed batch jobs
- etcd_server_has_leader — etcd health (critical for cluster stability)
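The "are all replicas running?" question translates directly into a comparison of two kube-state-metrics gauges:

```promql
# Deployments with fewer available replicas than desired
kube_deployment_status_replicas_available < kube_deployment_spec_replicas
```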
Prometheus and Node Exporter setup#
Prometheus is the de facto standard for infrastructure metrics. Node Exporter exposes host metrics in Prometheus format.
Node Exporter#
Install Node Exporter on every host. It exposes metrics at :9100/metrics.
# Docker Compose example
node-exporter:
  image: prom/node-exporter:latest
  ports:
    - "9100:9100"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--path.rootfs=/rootfs'
Prometheus configuration#
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Retention and storage#
- Default retention is 15 days — increase for capacity planning
- Use remote write to send data to long-term storage (Thanos, Cortex, or Mimir)
- Estimate storage: ~1-2 bytes per sample, multiply by series count and scrape frequency
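Prometheus exposes its own ingestion rate, so the estimate can be computed against a live server. A sketch, assuming roughly 1.5 bytes per sample after compression:

```promql
# Rough bytes of TSDB storage needed for 15 days of retention
rate(prometheus_tsdb_head_samples_appended_total[1h]) * 1.5 * 86400 * 15
```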
Grafana dashboards#
Dashboards turn metrics into situational awareness. Follow these principles:
Dashboard hierarchy#
- Overview dashboard — one screen showing the health of the entire infrastructure
- Service dashboards — per-service latency, error rate, throughput
- Node dashboards — deep dive into a specific host or pod
- Debug dashboards — detailed metrics for incident investigation
Essential panels for the overview dashboard#
- Cluster CPU and memory utilization — gauge or stat panel
- Node health matrix — table showing each node's status
- Top 5 pods by CPU — bar chart for quick hotspot identification
- Disk usage by mount — bar gauge with threshold colors
- Alert firing count — stat panel linked to Alertmanager
Dashboard best practices#
- Use template variables for environment, cluster, and namespace filtering
- Set consistent time ranges across panels
- Add annotation overlays for deployments and incidents
- Keep dashboards under 20 panels — too many panels slow rendering and overwhelm operators
Alerting rules#
Good alerts are actionable, not noisy. Follow these guidelines:
Alert design principles#
- Alert on symptoms, not causes — alert on high error rate, not on a specific pod restarting
- Include runbook links — every alert should link to a remediation guide
- Set appropriate severity — page for customer-facing issues, ticket for everything else
- Use inhibition — suppress downstream alerts when an upstream failure explains them
Example alerting rules#
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          runbook: "https://runbooks.example.com/high-cpu"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
Capacity alerts#
Capacity alerts predict exhaustion before it happens.
Linear prediction#
Use Prometheus predict_linear to forecast when a resource will run out:
# Alert if disk will fill within 24 hours
predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
Capacity planning thresholds#
| Resource | Warning | Critical | Prediction window |
|---|---|---|---|
| Disk space | 75% used | 85% used | 24 hours to full |
| Memory | 80% used | 90% used | 6 hours to exhaustion |
| CPU | 70% sustained | 85% sustained | — |
| Pod count | 80% of node limit | 90% of node limit | — |
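The same predict_linear approach covers the memory row of the table; this sketch assumes node_exporter's MemAvailable gauge:

```promql
# Alert if available memory will hit zero within 6 hours
predict_linear(node_memory_MemAvailable_bytes[1h], 6 * 3600) < 0
```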
Explore monitoring architectures#
On Codelit, generate a Prometheus and Grafana monitoring stack to see how metrics flow from exporters through scraping, storage, dashboards, and alerting. Click on any component to explore its configuration and data flow.
This is article #376 in the Codelit engineering blog series.
Build and explore monitoring architectures visually at codelit.io.