The Kubernetes Operator Pattern: Extending the Control Plane
Kubernetes ships with controllers for Deployments, StatefulSets, Services, and other built-in resources. The operator pattern extends this same model to your own domain-specific resources. An operator encodes operational knowledge — the kind of runbook a human SRE would follow — into software that continuously reconciles desired state with actual state.
Core Concepts
Custom Resource Definitions (CRDs)
A CRD teaches the Kubernetes API server about a new resource type. Once registered, you can kubectl apply instances of your custom resource just like any built-in object.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      # Enable the status subresource so a controller can write status separately from spec
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: ["postgres", "mysql"]
                version:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 5
                storageGb:
                  type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                connectionString:
                  type: string
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames:
      - db
Now users can create databases declaratively:
apiVersion: example.com/v1
kind: Database
metadata:
  name: orders-db
  namespace: production
spec:
  engine: postgres
  version: "16"
  replicas: 3
  storageGb: 100
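To make the schema constraints concrete, here is a minimal Go sketch that re-checks the two rules the CRD declares: the engine enum and the replica range. The API server enforces these itself when the object is applied; the DatabaseSpec struct and the validateSpec helper are hypothetical names used only for illustration. One simplification to note: the sketch treats a missing replicas value (zero) as invalid, whereas the schema above does not mark any field as required.

```go
package main

import (
	"errors"
	"fmt"
)

// DatabaseSpec mirrors the spec fields declared in the CRD schema.
// (Hypothetical type for illustration only.)
type DatabaseSpec struct {
	Engine    string
	Version   string
	Replicas  int
	StorageGb int
}

// validateSpec re-checks the constraints the schema declares:
// engine must be postgres or mysql, replicas must be between 1 and 5.
func validateSpec(s DatabaseSpec) error {
	switch s.Engine {
	case "postgres", "mysql":
		// allowed by the enum
	default:
		return fmt.Errorf("engine %q is not one of postgres, mysql", s.Engine)
	}
	if s.Replicas < 1 || s.Replicas > 5 {
		return errors.New("replicas must be between 1 and 5")
	}
	return nil
}

func main() {
	ok := DatabaseSpec{Engine: "postgres", Version: "16", Replicas: 3, StorageGb: 100}
	fmt.Println("orders-db spec:", validateSpec(ok))

	bad := ok
	bad.Replicas = 9
	fmt.Println("nine replicas:", validateSpec(bad))
}
```

An object that fails either rule would be rejected at admission time, before any controller sees it.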
Controllers and the Reconciliation Loop
A controller watches for changes to your custom resources and reconciles the cluster state to match. The reconciliation loop is the heart of every operator:
┌──────────────────────────────────────┐
│  Watch / Informer                    │
│  (API server pushes change events)   │
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│  Work Queue                          │
│  (deduplicated, rate-limited)        │
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│  Reconcile Function                  │
│                                      │
│  1. Read desired state (CR spec)     │
│  2. Read actual state (cluster)      │
│  3. Compute diff                     │
│  4. Apply changes                    │
│  5. Update status subresource        │
└──────────────┬───────────────────────┘
               │
               ▼
        Requeue or Done
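The "deduplicated" part of the work queue can be sketched with a toy queue of object keys. This is a simplified model, not the client-go implementation: it leaves out rate limiting and the tracking of keys that are currently being processed, and dedupQueue and the staging/users-db key are made-up names.

```go
package main

import "fmt"

// dedupQueue is a toy FIFO queue of object keys that ignores a key
// that is already waiting to be processed.
type dedupQueue struct {
	order   []string
	pending map[string]bool
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already pending.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the oldest key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are waiting.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	events := []string{
		"production/orders-db",
		"production/orders-db",
		"staging/users-db",
		"production/orders-db",
	}
	for _, key := range events {
		q.Add(key)
	}
	// Three events for orders-db collapse into a single queue entry.
	fmt.Println("distinct keys waiting:", q.Len())
	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		fmt.Println("reconcile", key)
	}
}
```

Because the queue carries only keys, the reconciler always reads the latest object state when it runs, however many events arrived in between.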
Key principles of reconciliation:
- Level-triggered, not edge-triggered — The reconciler reacts to the current state, not the sequence of events that led to it. This makes it resilient to missed events.
- Idempotent — Running reconcile twice with the same input produces the same result.
- Convergent — Each reconcile loop brings actual state closer to desired state, even if it cannot achieve it in one pass.
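These three properties can be illustrated with a toy reconciler that has no Kubernetes dependency. In this sketch the cluster type stands in for actual state, and reconcile is a hypothetical function that moves one replica per pass so the convergence is visible; a real controller would usually apply the whole change at once.

```go
package main

import "fmt"

// cluster is a toy stand-in for actual state: replica counts by name.
type cluster map[string]int

// reconcile moves the actual replica count for name one step toward desired
// and reports whether another pass is needed. It looks only at current state
// (level-triggered), never at the history of events.
func reconcile(c cluster, name string, desired int) (requeue bool) {
	actual := c[name]
	switch {
	case actual < desired:
		c[name] = actual + 1
	case actual > desired:
		c[name] = actual - 1
	}
	return c[name] != desired
}

func main() {
	c := cluster{"orders-db": 0}

	// Convergent: each pass gets closer until no requeue is requested.
	for reconcile(c, "orders-db", 3) {
		fmt.Println("not converged yet, replicas:", c["orders-db"])
	}
	fmt.Println("converged, replicas:", c["orders-db"])

	// Idempotent: another pass on converged state changes nothing.
	again := reconcile(c, "orders-db", 3)
	fmt.Println("requeue:", again, "replicas:", c["orders-db"])
}
```

The starting count does not matter: the same function scales down from 5 to 3 as readily as up from 0, which is what makes a missed event harmless.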
Building an Operator with Operator SDK
The Operator SDK (part of the Operator Framework) scaffolds Go-based operators with controller-runtime:
# Initialize a new operator project
operator-sdk init --domain example.com --repo github.com/example/db-operator
# Create a new API and controller
operator-sdk create api --group apps --version v1 --kind Database --resource --controller
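This generates the API types and an empty reconciler stub. Note that with --domain example.com and --group apps, the scaffolded API group is apps.example.com, so the generated CRD differs from the hand-written example.com group shown earlier. A filled-in Reconcile function might look like this: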
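// dbv1 is assumed to be the operator's own API package, scaffolded under the
// --repo path; appsv1 is the built-in apps/v1 API that defines StatefulSet.
import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"

	dbv1 "github.com/example/db-operator/api/v1"
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// Step 1: Fetch the Database CR
	var db dbv1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		if apierrors.IsNotFound(err) {
			// The CR is already gone, so there is nothing left to reconcile.
			// Owned objects are garbage-collected through owner references;
			// external cleanup belongs in a finalizer.
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	// Step 2: Check if the StatefulSet exists
	var sts appsv1.StatefulSet
	err := r.Get(ctx, types.NamespacedName{
		Name:      db.Name + "-sts",
		Namespace: db.Namespace,
	}, &sts)
	if apierrors.IsNotFound(err) {
		// Step 3: Create the StatefulSet, owned by the Database CR
		sts = r.buildStatefulSet(&db)
		if err := ctrl.SetControllerReference(&db, &sts, r.Scheme); err != nil {
			return ctrl.Result{}, err
		}
		if err := r.Create(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
		logger.Info("Created StatefulSet", "name", sts.Name)
	} else if err != nil {
		return ctrl.Result{}, err
	}

	// Step 4: Ensure replicas match (Spec.Replicas is a pointer and may be nil)
	desired := int32(db.Spec.Replicas)
	if sts.Spec.Replicas == nil || *sts.Spec.Replicas != desired {
		sts.Spec.Replicas = &desired
		if err := r.Update(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
		logger.Info("Updated replicas", "desired", desired)
	}

	// Step 5: Update status. Simplified: a production operator would check
	// StatefulSet readiness before reporting "Running". The connection string
	// assumes the postgres engine and a headless Service named after the CR.
	db.Status.Phase = "Running"
	db.Status.ConnectionString = fmt.Sprintf(
		"postgres://%s-sts-0.%s.%s.svc:5432/app",
		db.Name, db.Name, db.Namespace,
	)
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}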
Helm vs. Kustomize vs. Operators
These three tools serve different purposes and can coexist:
| Dimension | Helm | Kustomize | Operator |
|---|---|---|---|
| What it does | Template and package | Patch and overlay | Continuously reconcile |
| Day-1 install | Excellent | Good | Good |
| Day-2 operations | Manual (helm upgrade) | Manual (kubectl apply) | Automated |
| Scaling logic | Not built-in | Not built-in | Encoded in controller |
| Backup/restore | Not built-in | Not built-in | Can be automated |
| Failure recovery | Manual | Manual | Self-healing |
| Complexity | Low | Low | High (requires Go/code) |
When to use an operator:
- Your application has complex Day-2 operations (upgrades, failovers, backups, scaling decisions).
- You want the system to self-heal without human intervention.
- The operational domain knowledge is deep enough to justify codifying it.
When Helm or Kustomize is enough:
- The application is stateless or uses managed backing services.
- Day-2 operations are simple (rolling restart, config change).
- Your team does not have capacity to maintain operator code.
Operator Maturity Model
The Operator Framework defines five capability levels:
1. Basic Install: The operator automates application provisioning and configuration.
2. Seamless Upgrades: The operator handles patch and minor version upgrades.
3. Full Lifecycle: Backup, restore, and failure recovery are automated.
4. Deep Insights: The operator exposes metrics, alerts, and log processing.
5. Auto Pilot: The operator auto-scales, auto-tunes, and self-heals based on observed load.
Most production operators aim for level 3 or 4. Level 5 is rare and requires significant investment.
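Finalizers and Cleanup
When a custom resource is deleted, the operator may need to clean up external resources (cloud databases, DNS records, certificates). Finalizers make this possible: while any finalizer is listed on the object, the API server only sets metadata.deletionTimestamp and keeps the resource around, and it removes the resource once the last finalizer is gone: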
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	dbv1 "github.com/example/db-operator/api/v1" // the operator's own API package (assumed path)
)

const finalizerName = "databases.example.com/cleanup"

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var db dbv1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Handle deletion: a non-zero deletionTimestamp means a delete was requested
	if !db.DeletionTimestamp.IsZero() {
		if controllerutil.ContainsFinalizer(&db, finalizerName) {
			// Clean up external resources. This may run more than once if a
			// later step fails, so the cleanup itself should be idempotent.
			if err := r.deleteExternalDatabase(ctx, &db); err != nil {
				return ctrl.Result{}, err
			}
			// Remove the finalizer to allow deletion
			controllerutil.RemoveFinalizer(&db, finalizerName)
			if err := r.Update(ctx, &db); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Add the finalizer if not present
	if !controllerutil.ContainsFinalizer(&db, finalizerName) {
		controllerutil.AddFinalizer(&db, finalizerName)
		if err := r.Update(ctx, &db); err != nil {
			return ctrl.Result{}, err
		}
	}

	// Normal reconciliation logic...
	return ctrl.Result{}, nil
}
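The deletion handshake in the reconciler above can be modeled without a cluster. The following is a toy sketch, assuming an in-memory store in place of the API server; store, object, requestDelete, and removeFinalizer are hypothetical names, and only the finalizer string is taken from the example.

```go
package main

import "fmt"

const finalizerName = "databases.example.com/cleanup"

// object models only the metadata that matters for finalizers.
type object struct {
	finalizers        []string
	deletionRequested bool // stands in for a non-zero deletionTimestamp
}

// store is a toy API server holding objects by name.
type store map[string]*object

// collect removes the object once deletion was requested and no finalizers remain.
func (s store) collect(name string) {
	if obj, ok := s[name]; ok && obj.deletionRequested && len(obj.finalizers) == 0 {
		delete(s, name)
	}
}

// requestDelete marks the object for deletion; it disappears only if
// no finalizers are blocking it.
func (s store) requestDelete(name string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	obj.deletionRequested = true
	s.collect(name)
}

// removeFinalizer drops one finalizer, then lets the store collect the object.
func (s store) removeFinalizer(name, fin string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	kept := obj.finalizers[:0]
	for _, f := range obj.finalizers {
		if f != fin {
			kept = append(kept, f)
		}
	}
	obj.finalizers = kept
	s.collect(name)
}

func main() {
	s := store{"orders-db": {finalizers: []string{finalizerName}}}

	s.requestDelete("orders-db")
	_, exists := s["orders-db"]
	fmt.Println("after delete request, still present:", exists)

	// The controller would run its external cleanup here, then:
	s.removeFinalizer("orders-db", finalizerName)
	_, exists = s["orders-db"]
	fmt.Println("after finalizer removed, still present:", exists)
}
```

The window between the two steps is where the operator gets to run its cleanup, which is why a controller that crashes before removing its finalizer leaves the resource stuck in a terminating state.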
Testing Operators
Operator correctness is critical — bugs can destroy production data. Use a layered testing strategy:
Unit tests — Test reconciliation logic with fake clients:
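// Aliases assumed: clientgoscheme = k8s.io/client-go/kubernetes/scheme,
// dbv1 = the operator's own API package, appsv1 = k8s.io/api/apps/v1.
func TestReconcile_CreatesStatefulSet(t *testing.T) {
	ctx := context.Background()

	// Register the built-in types and the Database type with one scheme.
	s := runtime.NewScheme()
	require.NoError(t, clientgoscheme.AddToScheme(s))
	require.NoError(t, dbv1.AddToScheme(s))

	db := &dbv1.Database{
		ObjectMeta: metav1.ObjectMeta{Name: "test-db", Namespace: "default"},
		Spec:       dbv1.DatabaseSpec{Engine: "postgres", Replicas: 3},
	}
	// WithStatusSubresource lets r.Status().Update work against the fake client.
	c := fake.NewClientBuilder().
		WithScheme(s).
		WithObjects(db).
		WithStatusSubresource(db).
		Build()
	reconciler := &DatabaseReconciler{Client: c, Scheme: s}

	result, err := reconciler.Reconcile(ctx, ctrl.Request{
		NamespacedName: types.NamespacedName{Name: "test-db", Namespace: "default"},
	})
	require.NoError(t, err)
	assert.Equal(t, ctrl.Result{}, result)

	var sts appsv1.StatefulSet
	err = c.Get(ctx, types.NamespacedName{Name: "test-db-sts", Namespace: "default"}, &sts)
	require.NoError(t, err)
	assert.Equal(t, int32(3), *sts.Spec.Replicas)
}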
Integration tests — Use envtest to spin up a real API server and etcd without a full cluster.
End-to-end tests — Deploy to a Kind or k3d cluster and validate the full lifecycle.
Common Pitfalls
- Infinite reconcile loops: Every write to the CR, including a status write, emits a watch event that can trigger another reconcile. Write status through the status subresource, which leaves metadata.generation unchanged, and filter update events on generation (for example with controller-runtime's GenerationChangedPredicate).
- Missing owner references: Without SetControllerReference, child resources are not garbage-collected when the CR is deleted.
- Not handling requeue: Return ctrl.Result{RequeueAfter: time.Minute} for operations that need time to converge.
- Overly broad RBAC: Grant only the permissions your operator needs. Avoid cluster-admin.
- No rate limiting: Without rate limiting on the work queue, a flood of events can overwhelm the operator.
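For the infinite-loop pitfall in particular, one common guard is to ignore update events that did not change the spec. The sketch below models that idea with plain structs; meta and generationChanged are hypothetical names, and the model assumes the resource has the status subresource enabled, in which case Kubernetes bumps metadata.generation on spec changes but not on status writes. controller-runtime ships a comparable filter as predicate.GenerationChangedPredicate.

```go
package main

import "fmt"

// meta models the two counters relevant here: resourceVersion changes on
// every write, generation only on spec changes (with the status subresource
// enabled).
type meta struct {
	generation      int64
	resourceVersion int64
}

// generationChanged reports whether an update event reflects a spec change
// and therefore deserves a reconcile.
func generationChanged(oldObj, newObj meta) bool {
	return oldObj.generation != newObj.generation
}

func main() {
	before := meta{generation: 1, resourceVersion: 10}
	statusWrite := meta{generation: 1, resourceVersion: 11} // the operator's own status update
	specWrite := meta{generation: 2, resourceVersion: 12}   // a user edited the spec

	fmt.Println("reconcile after status write:", generationChanged(before, statusWrite))
	fmt.Println("reconcile after spec write:", generationChanged(before, specWrite))
}
```

A filter like this also drops metadata-only updates such as label changes, so it is a trade-off rather than a universal default.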
The operator pattern is one of the most powerful extension points in Kubernetes. It turns operational knowledge into code that runs 24/7, reacting to failures faster than any human could. Start with Helm for simple deployments; graduate to an operator when Day-2 complexity demands it.