How to Use controller-runtime for Building Operators

Use controller-runtime by defining a custom resource (CRD) and implementing a Reconciler that watches for changes, fetches the desired state, and updates the cluster to match it.

The operator pattern without the boilerplate

You are managing a fleet of databases. An engineer creates a Database object in Kubernetes. You want a Deployment, a Service, a PersistentVolumeClaim, and a Secret to appear automatically. If the Deployment dies, you want it back. If the Database spec changes, you want the Deployment to update. If the Database is deleted, you want all the child resources to vanish.

Writing this from scratch means watching the API server for changes, handling leader election so multiple replicas don't fight, retrying on network blips, managing watch caches, and diffing state. That is a mountain of infrastructure code. controller-runtime exists so you skip the mountain. You write the logic that says "Here is what the world should look like," and the library handles the control loop, the watches, the retries, and the leader election.

The reconciler is a thermostat

Think of a reconciler like a thermostat. You set the desired temperature. The thermostat does not care how the heater works. It checks the current temperature, compares it to the desired one, and turns the heater on or off. If someone opens a window and the room gets cold, the thermostat notices and turns the heater back on.

In Kubernetes, your Reconcile function is the thermostat. The Custom Resource is the temperature setting. The cluster state is the room. The library is the wiring and sensors. You define the desired state in code. The library calls Reconcile whenever something changes. You fetch the current state, compute the desired state, and apply the difference. If you return an error, the library retries. If you return success, it waits for the next event.

The critical insight is idempotency. You can call Reconcile ten times in a row, and the cluster state after the tenth call should be identical to the state after the first. This allows the library to retry safely without breaking the world.
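To make idempotency concrete, here is a minimal sketch that uses only the standard library. The Spec and Cluster types are hypothetical stand-ins for a Custom Resource spec and the API server. They are assumptions for illustration, not controller-runtime types.

```go
package main

import "fmt"

// Spec is a hypothetical desired state: which image to run and how many copies.
type Spec struct {
	Image    string
	Replicas int
}

// Cluster is a stand-in for the API server: object name -> current spec.
type Cluster map[string]Spec

// Reconcile drives the named object toward desired and reports whether it
// had to write anything. Calling it again with the same input is a no-op.
func Reconcile(c Cluster, name string, desired Spec) (changed bool) {
	current, exists := c[name]
	if exists && current == desired {
		return false // already converged, nothing to do
	}
	c[name] = desired // create or update
	return true
}

func main() {
	cluster := Cluster{}
	desired := Spec{Image: "postgres:16", Replicas: 3}
	for i := 1; i <= 3; i++ {
		fmt.Printf("call %d: changed=%v\n", i, Reconcile(cluster, "db", desired))
	}
}
```

Only the first call writes. Every later call finds the state already converged, which is why a retry costs nothing.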

Minimal reconciler example

Here is the core of an operator. The Reconcile method fetches the Custom Resource, builds the desired Deployment, and ensures it exists. The myv1 package stands for your own API types, and Spec.Replicas is assumed to be a *int32 so it can be passed straight to the Deployment.

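package controllers

import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

    // Hypothetical import path. Replace it with your own API package.
    myv1 "example.com/my-operator/api/v1"
)

// MyResourceReconciler embeds the client so r.Get, r.Create, and r.Update work directly.
type MyResourceReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// Reconcile runs whenever the Custom Resource changes or the controller starts.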
func (r *MyResourceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the Custom Resource to read the desired configuration.
    var myRes myv1.MyResource
    if err := r.Get(ctx, req.NamespacedName, &myRes); err != nil {
        // Ignore NotFound errors. The resource is gone, so reconciliation is complete.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Return early if the resource is being deleted to avoid creating new objects.
    if !myRes.DeletionTimestamp.IsZero() {
        return ctrl.Result{}, nil
    }

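    // Construct the Deployment object. This represents the target state, not the current state.
    // The selector must match the pod template labels, or the API server rejects the Deployment.
    labels := map[string]string{"app": myRes.Name}
    desiredDeployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      myRes.Name,
            Namespace: myRes.Namespace,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: myRes.Spec.Replicas,
            Selector: &metav1.LabelSelector{MatchLabels: labels},
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{Labels: labels},
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "app",
                        Image: myRes.Spec.Image,
                    }},
                },
            },
        },
    }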

    // Link the Deployment to the CR. Kubernetes uses this reference for garbage collection.
    if err := controllerutil.SetControllerReference(&myRes, desiredDeployment, r.Scheme); err != nil {
        return ctrl.Result{}, err
    }

    // Create the Deployment if it is missing. Otherwise overwrite its spec.
    existingDeployment := &appsv1.Deployment{}
    err := r.Get(ctx, req.NamespacedName, existingDeployment)
    switch {
    case apierrors.IsNotFound(err):
        // Create the Deployment since it doesn't exist yet.
        if err := r.Create(ctx, desiredDeployment); err != nil {
            return ctrl.Result{}, err
        }
    case err != nil:
        // Any other read failure is returned so the library retries.
        return ctrl.Result{}, err
    default:
        // The Deployment exists. Writing the desired spec again is safe to repeat.
        existingDeployment.Spec = desiredDeployment.Spec
        if err := r.Update(ctx, existingDeployment); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Return success with no requeue. The loop will wake up again on the next event.
    return ctrl.Result{}, nil
}

By convention the receiver is named r, short for the MyResourceReconciler type. The context.Context parameter is always first and is named ctx. The function returns a ctrl.Result and an error. If you return an error, the library requeues the request and retries it with backoff. If you return ctrl.Result{RequeueAfter: 10 * time.Second} with a nil error, the library calls Reconcile again after the delay. Returning ctrl.Result{} with a nil error means "done for now."

The reconciler is a thermostat. Set the state, let the library handle the wiring.

Wiring the manager

The reconciler does not run alone. You need a manager to start the control loop, set up the cache, and handle leader election. The manager holds the shared client and scheme.

func main() {
    // Set up the signal handler for graceful shutdown on SIGTERM.
    ctx := ctrl.SetupSignalHandler()

    // Create the manager that holds the shared cache and client.
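    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme: scheme,
        // Leader election is off unless enabled here. The ID is a hypothetical lease name.
        LeaderElection:   true,
        LeaderElectionID: "myresource-operator.example.com",
    })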
    if err != nil {
        // Log the error and exit. In production, use a structured logger.
        log.Fatal(err)
    }

    // Register the reconciler with the manager.
    if err := (&MyResourceReconciler{
        Client: mgr.GetClient(),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        log.Fatal(err)
    }

    // Start the manager. This blocks until the context is cancelled.
    if err := mgr.Start(ctx); err != nil {
        log.Fatal(err)
    }
}

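The Scheme is more than a type registry. It maps Go types to Kubernetes API groups, versions, and kinds, which is how the client knows what to put on the wire for each object. Build it before creating the manager by calling the AddToScheme function of every API package you use, including your own. If you forget to add your Custom Resource type to the scheme, the program still compiles. The failure shows up at runtime, when client calls return an error such as no kind is registered for the type.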

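SetupWithManager is a method you write on the reconciler (Kubebuilder scaffolds it for you). It registers the reconciler with the manager and declares the watches, typically with ctrl.NewControllerManagedBy(mgr).For(&myv1.MyResource{}).Complete(r). For names the primary type, so when the CR changes, the manager queues a request and calls Reconcile. Add Owns(&appsv1.Deployment{}) to the chain to also be notified when a child Deployment changes or is deleted. Without it, a deleted Deployment goes unnoticed until the CR itself changes. Use Watches for unrelated resources, and use predicates to filter events.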

Leader election is built in, but it is off by default. Enable it by setting LeaderElection: true and a unique LeaderElectionID in ctrl.Options. If you then deploy multiple replicas of the operator, only one replica runs the reconcilers. The others are hot standbys. If the leader dies, the lease expires, and a standby takes over. The cache warms up, and reconciliation resumes. You do not need to write the lease logic yourself.

Leader election protects the cluster. Enable it, run multiple replicas, trust the lease.

Pitfalls and conventions

Operators introduce specific failure modes. The most common is the infinite retry loop. If Reconcile returns an error, the library retries with backoff. If the error is permanent, such as an invalid spec, the request is retried forever, wasting API calls and filling the logs without ever succeeding. Handle transient errors by returning them. Handle permanent errors by recording the failure in the status and returning success.

Another pitfall is missing owner references. If you forget SetControllerReference, the child resources become orphans. Deleting the Custom Resource will not delete the Deployment. The cluster fills up with stale objects. Always set the controller reference for objects your operator creates.
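The effect of a missing owner reference can be shown with a toy model of garbage collection. This is a deliberately simplified sketch in standard-library Go. The Object type and collect function are hypothetical, and the real garbage collector handles more cases, such as multiple owners and deletion policies.

```go
package main

import "fmt"

// Object is a hypothetical, stripped-down Kubernetes object.
type Object struct {
	Name     string
	OwnerUID string // empty means no owner reference was set
}

// collect returns the objects that survive once the owners in deletedOwners
// are gone. Objects without an owner reference are never collected.
func collect(objs []Object, deletedOwners map[string]bool) []Object {
	var kept []Object
	for _, o := range objs {
		if o.OwnerUID != "" && deletedOwners[o.OwnerUID] {
			continue // the owner is gone, so the dependent is removed too
		}
		kept = append(kept, o)
	}
	return kept
}

func main() {
	objs := []Object{
		{Name: "db-deployment", OwnerUID: "uid-123"},
		{Name: "db-service"}, // the controller reference was forgotten
	}
	for _, o := range collect(objs, map[string]bool{"uid-123": true}) {
		fmt.Println("still in cluster:", o.Name)
	}
}
```

After the owner with UID uid-123 is deleted, only db-service remains, because nothing links it to the deleted Custom Resource.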

The compiler enforces some discipline. If you forget to import a package, you get undefined: pkg. If you import a package and do not use it, you get imported and not used. Go requires all imports to be used. This keeps the codebase clean.

Error handling follows the standard Go pattern. if err != nil { return err } is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. Do not swallow errors. Return them so the library can retry.

The context.Context parameter must be passed to every API call. The manager uses the context for cancellation. If the context is cancelled, the calls abort. This prevents goroutine leaks when the operator shuts down.

Goroutine leaks happen when a goroutine waits on a channel that never gets closed. The manager handles the main control loop, but if you spawn background goroutines, you must provide a cancellation path. Pass ctx to every long-lived call.

Idempotency is the law. Write Reconcile so it can run twice without breaking the world. Check if an object exists before creating it. Update only if the spec differs. The library may call Reconcile multiple times for the same event due to race conditions or retries.

Trust the control loop. Make Reconcile safe to call repeatedly.

When to use controller-runtime

Use controller-runtime when building a Kubernetes operator that manages custom resources and child objects. The library handles the control loop, caching, leader election, and event filtering. You focus on the reconciliation logic.

Use the Kubernetes client-go library directly when you need fine-grained control over watch streams or are writing a CLI tool that does not need a long-running control loop. client-go gives you the raw primitives. controller-runtime wraps them in a higher-level abstraction.

Use a simple CronJob or batch script when the task is periodic and does not require reactive state reconciliation. If you just need to run a job every hour, a CronJob is simpler than an operator.

Use a webhook when you need to validate or mutate objects before they are persisted to the API server. Webhooks run synchronously during the admission control phase. Reconcilers run asynchronously after the object is stored.

Use controller-runtime for operators. Use client-go for tools. Use webhooks for validation.

Where to go next