Infrastructure Monitoring Guide — Metrics, Dashboards, and Alerting
Why infrastructure monitoring matters#
Your application runs on infrastructure. When a disk fills up, a container runs out of memory, or a node becomes unreachable, your users feel it before your on-call engineer does — unless you have monitoring in place.
Infrastructure monitoring answers three questions continuously:
- Is everything running? (availability)
- Is everything fast enough? (performance)
- Will we run out of capacity soon? (forecasting)
Host metrics — the foundation#
Every monitoring stack starts with host-level metrics. These are the vital signs of your servers.
CPU#
- cpu_usage_percent — overall utilization across all cores
- cpu_iowait — percentage of time the CPU waits for disk I/O (high iowait signals disk bottlenecks)
- cpu_steal — time stolen by the hypervisor in virtualized environments
- load_average — 1, 5, and 15-minute load averages; compare against core count
Alert when CPU usage exceeds 80% sustained for 5 minutes. Investigate iowait separately — it often masquerades as CPU pressure.
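High iowait can be separated from genuine compute load with a query like the following, assuming the standard node_exporter metric names used later in this guide:

```promql
# Percentage of CPU time spent waiting on disk I/O, averaged per host
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
```

A sustained value above a few percent usually points at a disk bottleneck rather than CPU pressure.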
Memory#
- memory_used_bytes — total memory in use (excluding buffers and cache)
- memory_available_bytes — memory available for new allocations
- memory_swap_used_bytes — swap usage indicates memory pressure
Alert when available memory drops below 10% of total. Swap usage above zero on a production server warrants investigation.
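As a sketch, the 10% threshold maps onto node_exporter's meminfo metrics like this:

```promql
# Available memory as a percentage of total; investigate when this drops below 10
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
```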
Disk#
- disk_used_percent — per-mount utilization
- disk_read_bytes / disk_write_bytes — throughput
- disk_io_time — time spent on I/O operations
- disk_inodes_free — running out of inodes is as bad as running out of space
Alert at 85% disk usage. Alert on inode exhaustion separately — it catches a different failure mode.
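Both conditions can be expressed with node_exporter's filesystem collector (metric and label names below assume its defaults; the fstype filter is one common way to exclude pseudo-filesystems):

```promql
# Per-mount disk usage percent; alert at 85
100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
       / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85

# Inode exhaustion: free inodes below 10% of total
node_filesystem_files_free / node_filesystem_files < 0.10
```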
Network#
- network_bytes_sent / network_bytes_received — bandwidth utilization
- network_errors — CRC errors, dropped packets
- network_connections — TCP connection count by state (ESTABLISHED, TIME_WAIT)
- network_retransmits — TCP retransmissions indicate packet loss
Alert on sustained error rates above baseline and on connection count spikes.
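One way to express the retransmission signal as a ratio, assuming node_exporter's netstat collector is enabled:

```promql
# Fraction of outgoing TCP segments that are retransmissions
rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m])
```

Baselines vary by network, so alert on deviation from your normal ratio rather than a fixed absolute value.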
Container metrics#
Containers add a layer of abstraction. You need metrics from both the container runtime and the host.
Key container metrics#
- container_cpu_usage_seconds_total — CPU time consumed by the container
- container_memory_usage_bytes — current memory usage including cache
- container_memory_working_set_bytes — memory that cannot be reclaimed (the real pressure indicator)
- container_network_transmit_bytes_total — egress traffic
- container_fs_writes_bytes_total — filesystem write volume
Container-specific concerns#
- OOMKilled events — the kernel killed the container for exceeding its memory limit
- CPU throttling — the container hit its CPU limit and was throttled
- Restart count — a container restarting frequently usually indicates a crash loop
# Detect containers being CPU-throttled
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
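The other two concerns can be caught with kube-state-metrics, assuming its standard metric names (the restart threshold of 3 per hour is an illustrative starting point, not a universal value):

```promql
# Containers whose last termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Containers that restarted more than 3 times in the last hour
increase(kube_pod_container_status_restarts_total[1h]) > 3
```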
Kubernetes metrics#
Kubernetes adds orchestration metrics on top of container metrics.
Node-level#
- node_cpu_utilization — how much of the node's allocatable CPU is in use
- node_memory_utilization — same for memory
- node_pod_count — number of pods scheduled; watch for nodes hitting pod limits
- node_condition — DiskPressure, MemoryPressure, PIDPressure, Ready
Pod-level#
- pod_phase — Pending, Running, Succeeded, Failed, Unknown
- pod_restart_count — CrashLoopBackOff detection
- pod_cpu_request vs pod_cpu_usage — right-sizing analysis
- pod_memory_request vs pod_memory_usage — same for memory
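A right-sizing sketch for the memory comparison, assuming cAdvisor container metrics and a kube-state-metrics version that exposes kube_pod_container_resource_requests with a resource label:

```promql
# Ratio of actual memory working set to requested memory, per pod
# (well below 1 suggests over-provisioning; near or above 1 suggests under-provisioning)
sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="memory"})
```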
Cluster-level#
- kube_deployment_status_replicas_available — are all replicas running?
- kube_hpa_status_current_replicas — HPA scaling activity
- kube_job_status_failed — failed batch jobs
- etcd_server_has_leader — etcd health (critical for cluster stability)
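The "are all replicas running?" question translates directly into a comparison of two kube-state-metrics gauges:

```promql
# Deployments with fewer available replicas than desired
kube_deployment_status_replicas_available < kube_deployment_spec_replicas
```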
Prometheus and Node Exporter setup#
Prometheus is the de facto standard for infrastructure metrics. Node Exporter exposes host metrics in Prometheus format.
Node Exporter#
Install Node Exporter on every host. It exposes metrics at :9100/metrics.
# Docker Compose example
node-exporter:
  image: prom/node-exporter:latest
  ports:
    - "9100:9100"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--path.rootfs=/rootfs'
Prometheus configuration#
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
Retention and storage#
- Default retention is 15 days — increase for capacity planning
- Use remote write to send data to long-term storage (Thanos, Cortex, or Mimir)
- Estimate storage: ~1-2 bytes per sample, multiply by series count and scrape frequency
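Prometheus exposes its own ingestion rate, so the estimate can be computed against a live server. A sketch, assuming roughly 1.5 bytes per sample after compression:

```promql
# Rough bytes of TSDB storage needed for 15 days of retention
rate(prometheus_tsdb_head_samples_appended_total[1h]) * 1.5 * 86400 * 15
```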
Grafana dashboards#
Dashboards turn metrics into situational awareness. Follow these principles:
Dashboard hierarchy#
- Overview dashboard — one screen showing the health of the entire infrastructure
- Service dashboards — per-service latency, error rate, throughput
- Node dashboards — deep dive into a specific host or pod
- Debug dashboards — detailed metrics for incident investigation
Essential panels for the overview dashboard#
- Cluster CPU and memory utilization — gauge or stat panel
- Node health matrix — table showing each node's status
- Top 5 pods by CPU — bar chart for quick hotspot identification
- Disk usage by mount — bar gauge with threshold colors
- Alert firing count — stat panel linked to Alertmanager
Dashboard best practices#
- Use template variables for environment, cluster, and namespace filtering
- Set consistent time ranges across panels
- Add annotation overlays for deployments and incidents
- Keep dashboards under 20 panels — too many panels slow rendering and overwhelm operators
Alerting rules#
Good alerts are actionable, not noisy. Follow these guidelines:
Alert design principles#
- Alert on symptoms, not causes — alert on high error rate, not on a specific pod restarting
- Include runbook links — every alert should link to a remediation guide
- Set appropriate severity — page for customer-facing issues, ticket for everything else
- Use inhibition — suppress downstream alerts when an upstream failure explains them
Example alerting rules#
groups:
  - name: infrastructure
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          runbook: "https://runbooks.example.com/high-cpu"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk space below 15% on {{ $labels.instance }}"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
Capacity alerts#
Capacity alerts predict exhaustion before it happens.
Linear prediction#
Use Prometheus predict_linear to forecast when a resource will run out:
# Alert if disk will fill within 24 hours
predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
Capacity planning thresholds#
| Resource | Warning | Critical | Prediction window |
|---|---|---|---|
| Disk space | 75% used | 85% used | 24 hours to full |
| Memory | 80% used | 90% used | 6 hours to exhaustion |
| CPU | 70% sustained | 85% sustained | — |
| Pod count | 80% of node limit | 90% of node limit | — |
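The same predict_linear approach covers the memory row of the table; this sketch assumes node_exporter's MemAvailable gauge:

```promql
# Alert if available memory will hit zero within 6 hours
predict_linear(node_memory_MemAvailable_bytes[1h], 6 * 3600) < 0
```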
Explore monitoring architectures#
On Codelit, generate a Prometheus and Grafana monitoring stack to see how metrics flow from exporters through scraping, storage, dashboards, and alerting. Click on any component to explore its configuration and data flow.
This is article #376 in the Codelit engineering blog series.
Build and explore monitoring architectures visually at codelit.io.