Infrastructure Drift Detection — Finding and Fixing Manual Changes
What is infrastructure drift#
You define your infrastructure in code. Terraform, CloudFormation, Pulumi — the tool does not matter. What matters is that someone logged into the AWS console and changed a security group rule. Now your code says port 443 only, but the actual infrastructure has port 22 open to the world.
That gap between what your code declares and what actually exists is drift.
Why drift happens#
- Emergency fixes — production is down, someone SSH'd in and changed a config
- Console clicking — a developer tested something in the UI and forgot to update code
- Automated processes — AWS auto-scaling modifies tags, Kubernetes operators update resources
- Partial applies — Terraform apply failed halfway, leaving resources in a mixed state
- Multiple teams — Team A manages the VPC in their Terraform, Team B modifies it in theirs
Drift is inevitable. The question is how fast you detect it.
Terraform plan as drift detection#
The simplest drift detector: run terraform plan and check if it wants to change anything.
terraform plan -detailed-exitcode -out=drift-check.plan
# Exit codes:
# 0 = no changes (no drift)
# 1 = error
# 2 = plan succeeded with a non-empty diff (drift, provided the checked-out code has already been applied)
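If the check is driven from a script rather than a shell one-liner, the exit codes can be mapped to labels first. The following is a minimal sketch, assuming terraform is on the PATH and the working directory is already initialized; the function names and labels are illustrative, not part of Terraform.

```python
import subprocess

# Meaning of each `terraform plan -detailed-exitcode` exit status.
EXIT_MEANINGS = {0: "clean", 1: "error", 2: "changes"}


def interpret_exit_code(code: int) -> str:
    """Translate a plan exit status into a label; unknown codes count as errors."""
    return EXIT_MEANINGS.get(code, "error")


def check_drift(workdir: str = ".") -> str:
    """Run the plan in `workdir` and return "clean", "changes", or "error"."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_exit_code(result.returncode)
```

Treating unknown exit statuses as errors keeps a crashed or killed process from being mistaken for a clean run.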
Scheduled drift checks in CI#
# .github/workflows/drift-check.yml
name: Infrastructure Drift Check
on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init -backend-config=prod.hcl
      - name: Check for Drift
        id: plan
        run: |
          # After a pipe, $? is tee's status; PIPESTATUS[0] is terraform's
          terraform plan -detailed-exitcode -no-color -out=drift-check.plan 2>&1 | tee plan-output.txt
          echo "exitcode=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
        continue-on-error: true
      - name: Alert on Drift
        if: steps.plan.outputs.exitcode == '2'
        env:
          # Assumes a repository secret named SLACK_WEBHOOK
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text":"Infrastructure drift detected. Review plan output."}'
Run this every 4-6 hours. More frequent checks catch drift faster but increase API calls to your cloud provider.
Terraform Cloud drift detection#
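Terraform Cloud and Terraform Enterprise have built-in drift detection as part of workspace health assessments. Enable it in the workspace's Health settings, or manage the setting with the tfe provider:
resource "tfe_workspace" "production" {
  name                = "production"
  organization        = "example-org" # placeholder organization name
  assessments_enabled = true          # health assessments include drift detection
}
Assessments run on a schedule managed by the platform rather than a per-workspace interval, and the results reach you through workspace notifications.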
Crossplane reconciliation#
Crossplane takes a different approach. Instead of detecting drift after the fact, it continuously reconciles.
apiVersion: ec2.aws.crossplane.io/v1beta1
kind: SecurityGroup
metadata:
  name: web-sg
spec:
  forProvider:
    region: us-east-1
    description: "Web security group"
    ingress:
      - fromPort: 443
        toPort: 443
        ipProtocol: tcp
        ipRanges:
          - cidrIp: "0.0.0.0/0"
If someone adds port 22 in the console, Crossplane detects the diff on its next reconcile and reverts it automatically, typically within minutes depending on the provider's poll interval.
Reconciliation loop: Observe actual state, compare to desired state, take action.
This is the Kubernetes controller pattern applied to infrastructure. The desired state is the YAML in your cluster. The actual state is what exists in AWS. The controller closes the gap continuously.
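The observe, compare, act cycle can be sketched in a few lines. This is an illustration of the pattern only, not Crossplane's implementation: the observe and apply callables and the in-memory cloud dictionary are hypothetical stand-ins for a provider API.

```python
import copy
from typing import Callable, Dict

State = Dict[str, dict]


def reconcile_once(desired: State, observe: Callable[[], State],
                   apply: Callable[[str, dict], None]) -> list:
    """One pass: observe actual state, compare to desired, correct differences."""
    actual = observe()
    corrected = []
    for resource_id, want in desired.items():
        if actual.get(resource_id) != want:
            apply(resource_id, want)  # overwrite actual with desired
            corrected.append(resource_id)
    return corrected


# In-memory stand-in for a cloud API (hypothetical, for illustration only).
cloud = {"web-sg": {"ports": [443, 22]}}  # someone opened port 22 by hand
desired = {"web-sg": {"ports": [443]}}


def observe() -> State:
    return copy.deepcopy(cloud)


def apply(resource_id: str, config: dict) -> None:
    cloud[resource_id] = copy.deepcopy(config)


first = reconcile_once(desired, observe, apply)   # corrects web-sg
second = reconcile_once(desired, observe, apply)  # nothing left to correct
```

A real controller repeats this pass on a timer or in response to events. The sketch leaves out resources that exist in the cloud but not in the desired state.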
Crossplane vs Terraform for drift#
| Aspect | Terraform | Crossplane |
|---|---|---|
| Detection | On-demand (plan) | Continuous (reconciliation) |
| Remediation | Manual (apply) | Automatic |
| State storage | State file (S3, TFC) | Kubernetes etcd |
| Drift response time | Hours (scheduled) | Minutes |
| Risk of auto-fix | None (manual) | Could revert intentional changes |
AWS Config rules#
AWS Config continuously monitors resource configurations.
{
  "ConfigRuleName": "restricted-ssh",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "INCOMING_SSH_DISABLED"
  },
  "Scope": {
    "ComplianceResourceTypes": [
      "AWS::EC2::SecurityGroup"
    ]
  }
}
When a security group allows unrestricted inbound SSH (port 22 open to 0.0.0.0/0), AWS Config flags it as non-compliant. Pair the rule with an AWS Systems Manager Automation runbook to remediate automatically.
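To make the condition concrete, here is a rough approximation of the kind of check such a rule performs, written against the dictionary shape that boto3's describe_security_groups returns. It is a sketch for intuition; the managed rule's exact logic may differ, and the function name is hypothetical.

```python
def allows_public_ssh(security_group: dict) -> bool:
    """Return True if any ingress rule exposes TCP port 22 to the whole internet."""
    for perm in security_group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports
            covers_ssh = True
        elif protocol == "tcp":
            covers_ssh = perm.get("FromPort", 0) <= 22 <= perm.get("ToPort", 65535)
        else:
            covers_ssh = False
        open_v4 = any(r.get("CidrIp") == "0.0.0.0/0"
                      for r in perm.get("IpRanges", []))
        open_v6 = any(r.get("CidrIpv6") == "::/0"
                      for r in perm.get("Ipv6Ranges", []))
        if covers_ssh and (open_v4 or open_v6):
            return True
    return False
```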
Building a drift detection pipeline#
A comprehensive drift detection system has five parts:
Step 1 — State snapshot#
Store the expected state somewhere queryable. For Terraform, this is the state file. For Kubernetes, it is the manifests in git.
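For Terraform, one way to make the state queryable is to read the JSON that terraform show -json prints. The sketch below assumes that output format (a values.root_module tree with nested child_modules) and an initialized working directory; the function names are illustrative.

```python
import json
import subprocess


def flatten_resources(module: dict) -> dict:
    """Map resource address -> attribute values, recursing into child modules."""
    resources = {
        r["address"]: r.get("values", {})
        for r in module.get("resources", [])
        if r.get("mode") == "managed"  # skip data sources
    }
    for child in module.get("child_modules", []):
        resources.update(flatten_resources(child))
    return resources


def expected_from_state(state_json: dict) -> dict:
    """Extract all managed resources from parsed `terraform show -json` output."""
    root = state_json.get("values", {}).get("root_module", {})
    return flatten_resources(root)


def load_expected(workdir: str = ".") -> dict:
    """Run `terraform show -json` in `workdir` and return the expected resources."""
    out = subprocess.run(
        ["terraform", "show", "-json"],
        cwd=workdir, capture_output=True, text=True, check=True,
    ).stdout
    return expected_from_state(json.loads(out))
```

The result is keyed by Terraform address. To compare it with cloud API responses keyed by resource ID, re-key it by each resource's id attribute first.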
Step 2 — Actual state collection#
Query your cloud provider APIs for current resource configurations.
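import boto3

def get_actual_security_groups():
    """Return every security group in the configured region, keyed by GroupId."""
    ec2 = boto3.client('ec2')
    # Paginate so accounts with many groups are not truncated
    paginator = ec2.get_paginator('describe_security_groups')
    return {
        sg['GroupId']: sg
        for page in paginator.paginate()
        for sg in page['SecurityGroups']
    }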
Step 3 — Diff engine#
Compare expected vs actual and produce a structured diff.
def detect_drift(expected: dict, actual: dict) -> list:
    """Compare two {resource_id: config} maps and list every difference."""
    drifts = []
    # Resources we manage that are missing or changed
    for resource_id, expected_config in expected.items():
        actual_config = actual.get(resource_id)
        if actual_config is None:
            drifts.append({"type": "deleted", "id": resource_id})
        elif expected_config != actual_config:
            drifts.append({
                "type": "modified",
                "id": resource_id,
                "expected": expected_config,
                "actual": actual_config,
            })
    # Resources that exist but are not in our code
    for resource_id in actual:
        if resource_id not in expected:
            drifts.append({"type": "unmanaged", "id": resource_id})
    return drifts
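The equality check above reports that a resource changed but not which fields changed. A small recursive helper can narrow a modified entry down to field paths. This is one possible sketch: the helper name is hypothetical, and it compares lists as whole values rather than element by element.

```python
def changed_fields(expected: dict, actual: dict, prefix: str = "") -> list:
    """Return dotted paths whose values differ between two configs.

    Keys missing on one side are treated as None.
    """
    paths = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}{key}"
        left, right = expected.get(key), actual.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Descend into nested objects so the path points at the leaf
            paths.extend(changed_fields(left, right, prefix=f"{path}."))
        elif left != right:
            paths.append(path)
    return paths
```

For a drift record of type "modified", calling changed_fields(drift["expected"], drift["actual"]) gives the list of fields to inspect or classify.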
Step 4 — Classification#
Not all drift is equal:
- Critical — security group rules, IAM policies, encryption settings
- Warning — tags, descriptions, non-functional metadata
- Ignored — auto-generated fields, timestamps, provider-managed attributes
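The three tiers can be expressed as a lookup on field names. The sketch below assumes each drift is described by a list of changed field paths (a hypothetical input shape). The prefixes are examples only and would need to match your own resources; fields that match no prefix default to a warning.

```python
# Example field prefixes per tier; adjust to the resource types you manage.
CRITICAL_PREFIXES = ("IpPermissions", "PolicyDocument", "Encrypted", "KmsKeyId")
IGNORED_PREFIXES = ("ResponseMetadata", "CreateDate", "LastModified")

# Lowest to highest severity.
SEVERITY_ORDER = ["ignored", "warning", "critical"]


def classify_field(path: str) -> str:
    """Classify a single changed field path."""
    if path.startswith(IGNORED_PREFIXES):
        return "ignored"
    if path.startswith(CRITICAL_PREFIXES):
        return "critical"
    return "warning"  # tags, descriptions, and anything unrecognized


def classify_drift(changed_paths: list) -> str:
    """Overall severity is the highest severity among the changed fields."""
    levels = [classify_field(p) for p in changed_paths]
    return max(levels, key=SEVERITY_ORDER.index, default="ignored")
```

Defaulting unknown fields to a warning is a judgment call. A stricter setup could default to critical so that new attributes are never silently downgraded.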
Step 5 — Notification and remediation#
Route critical drift to PagerDuty. Route warnings to Slack. Ignore the rest.
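The routing rule amounts to a small table. The sketch below assumes each drift record carries a "severity" key (a hypothetical field added during classification) and only groups records by destination. Actually sending to PagerDuty or Slack is left to whatever client you already use.

```python
# Severity -> destination; severities without an entry are dropped.
ROUTES = {"critical": "pagerduty", "warning": "slack"}


def route_drifts(drifts: list) -> dict:
    """Group drift records by destination name."""
    outbox = {}
    for drift in drifts:
        destination = ROUTES.get(drift.get("severity"))
        if destination is not None:
            outbox.setdefault(destination, []).append(drift)
    return outbox
```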
Remediation strategies#
Strategy 1 — Reapply code (safest)#
Run terraform apply to bring actual state back to declared state. Review the plan first.
Strategy 2 — Import and update code#
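The manual change was intentional, so make the code match reality. If the change modified a resource Terraform already manages, update your .tf files to include the new rule, then accept the refreshed values into state:
terraform apply -refresh-only
If the change created a resource Terraform does not manage yet, write a matching resource block and import it:
terraform import aws_security_group.web sg-12345
In both cases, finish with terraform plan and confirm it reports no changes.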
Strategy 3 — Auto-remediation (risky)#
Automatically apply on drift detection. Only safe for well-tested, non-destructive changes.
- name: Auto-remediate drift
  if: steps.plan.outputs.exitcode == '2'
  # Assumes the plan step saved its plan with -out=drift-check.plan
  run: terraform apply -auto-approve drift-check.plan
Use this sparingly. Auto-remediation can revert emergency fixes that were applied manually for good reason.
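One way to limit the risk is to inspect the saved plan before applying it and refuse anything beyond in-place updates. The sketch below reads the JSON that terraform show -json prints for a plan file and checks each entry's actions under resource_changes. The allow-list is an assumption about what counts as safe, and the function names are hypothetical.

```python
import json
import subprocess

# Actions considered safe to apply unattended (a policy choice, not a Terraform rule).
SAFE_ACTIONS = {"no-op", "read", "update"}


def is_safe_to_auto_apply(plan_json: dict) -> bool:
    """True only if no planned change creates, deletes, or replaces a resource."""
    for change in plan_json.get("resource_changes", []):
        actions = set(change.get("change", {}).get("actions", []))
        if not actions <= SAFE_ACTIONS:
            return False
    return True


def load_plan(plan_file: str = "drift-check.plan") -> dict:
    """Parse a saved plan file via `terraform show -json`."""
    out = subprocess.run(
        ["terraform", "show", "-json", plan_file],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)
```

A pipeline could call is_safe_to_auto_apply(load_plan()) and fall back to a human review whenever it returns False.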
Strategy 4 — Quarantine and review#
Flag drifted resources, prevent further manual changes (SCPs, OPA policies), and queue for human review.
Tools for drift detection#
- Terraform Cloud/Enterprise — built-in scheduled drift detection
- Spacelift — drift detection with policy-based remediation
- env0 — scheduled plan runs with drift alerts
- Driftctl — open-source CLI for scanning cloud drift against Terraform state
- AWS Config — continuous compliance monitoring for AWS resources
- Crossplane — continuous reconciliation for multi-cloud
- Checkov / OPA — policy-as-code to prevent drift-causing configurations
Preventing drift in the first place#
- Lock down console access — use SCPs to restrict manual changes in production
- Break-glass procedures — document when and how manual changes are allowed, with mandatory follow-up tickets
- GitOps workflow — all changes go through pull requests, no exceptions
- Ephemeral credentials — short-lived tokens reduce the window for manual changes
- Tagging policy — tag every resource with its IaC source so unmanaged resources are obvious
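The tagging policy is easy to audit in code. As a minimal sketch, assuming security groups in the shape boto3's describe_security_groups returns and a hypothetical tag key named iac-source:

```python
IAC_TAG_KEY = "iac-source"  # hypothetical tag key; use whatever your policy defines


def find_untagged(resources: list, tag_key: str = IAC_TAG_KEY) -> list:
    """Return GroupIds of security groups that lack the IaC source tag."""
    untagged = []
    for resource in resources:
        keys = {tag["Key"] for tag in resource.get("Tags", [])}
        if tag_key not in keys:
            untagged.append(resource["GroupId"])
    return untagged
```

Anything this returns was either created by hand or predates the policy, which makes it a good starting list for import or cleanup.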
Key takeaways#
- Drift is inevitable — the question is how fast you detect and respond
- terraform plan -detailed-exitcode is the simplest drift detector — run it on a schedule
- Crossplane reconciles continuously — drift is corrected automatically within minutes
- Classify drift severity — not all drift is equal, route critical changes to alerts
- Auto-remediation is risky — it can revert intentional emergency fixes
- Lock down production consoles — prevention is cheaper than detection
- Import intentional changes — update your code to match reality when manual changes are valid
Article #406 in the Codelit engineering series. Explore our full library of system design, infrastructure, and architecture guides at codelit.io.