How to Build a Kubernetes Operator in Go

Build a Kubernetes Operator in Go by scaffolding a project with Operator SDK, defining a CRD, implementing a reconciliation loop, and deploying it to your cluster.

When the cluster needs to learn new behavior

You are managing a database cluster. You need to scale replicas, handle backups, and recover from crashes. Doing this by hand with kubectl apply and scripts is error-prone. You want the cluster to self-heal. You write a program that watches the cluster state and fixes drift. That program is an operator.

Kubernetes knows about Pods, Services, and Deployments. It does not know about "MyDatabase" or "MyShardedCache." An operator teaches the cluster about your domain. You define a Custom Resource Definition (CRD) to describe the shape of your object. You implement a controller that runs a reconciliation loop. The loop reads the desired state from the CRD, checks the actual state, and applies changes to close the gap. If a Pod dies, the loop sees the gap and creates a new Pod. The operator makes the cluster behave like a managed service for your workload.

The reconciliation loop

An operator is a controller with opinions. The core mechanism is the reconciliation loop. The framework watches the API server for changes. When a custom resource changes, the framework pushes a request into a workqueue. A worker goroutine pops the request and calls your Reconcile function.

The Reconcile function is the heart of the operator. It fetches the current state, compares it to the desired state defined in the custom resource, and performs actions to converge. If the function returns an error, the framework requeues the request with exponential backoff. If it returns success, the loop waits for the next event. The function must be idempotent. It can be called twice for the same event, and it must produce the same result both times.
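The idempotency requirement is easier to see in a toy model. The sketch below uses only the standard library and an in-memory stand-in for the cluster; the cluster type, its fields, and the reconcile method are hypothetical names for illustration, not part of any framework.

```go
package main

import "fmt"

// cluster is a toy stand-in for the API server: desired and actual replica
// counts keyed by resource name. All names here are hypothetical.
type cluster struct {
	desired map[string]int
	actual  map[string]int
}

// reconcile compares desired and actual state for one key and closes the gap.
// It returns how many replicas it changed; zero means the state had already
// converged and the call was a no-op.
func (c *cluster) reconcile(key string) int {
	want, ok := c.desired[key]
	if !ok {
		// The resource is gone: remove whatever is left of it.
		removed := c.actual[key]
		delete(c.actual, key)
		return removed
	}
	diff := want - c.actual[key]
	c.actual[key] = want
	if diff < 0 {
		diff = -diff
	}
	return diff
}

func main() {
	c := &cluster{
		desired: map[string]int{"db": 3},
		actual:  map[string]int{"db": 1},
	}
	// The same key is queued twice, as can happen with duplicate events.
	queue := []string{"db", "db"}
	for _, key := range queue {
		changed := c.reconcile(key)
		fmt.Printf("reconciled %s: changed=%d actual=%d\n", key, changed, c.actual[key])
	}
}
```

The first pass changes two replicas and the second changes none. The function never looks at what the event was, only at the gap between desired and actual state, which is what makes a repeated call harmless.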

Operators are controllers with opinions. Keep the loop tight and idempotent.

Minimal example

Here's the scaffold command sequence to create a project and generate the API types.

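# Scaffold the project structure and go.mod (the --repo module path is an example)
operator-sdk init --domain example.com --repo github.com/example/memcached-operator
# Generate the API types and controller skeleton without interactive prompts
operator-sdk create api --group cache --version v1 --kind Memcached --resource --controller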

The operator-sdk tool creates a Go module, sets up the project layout, and generates boilerplate code. The create api command generates the type definitions for your CRD and the skeleton for the controller. Afterward you run make generate to produce the deepcopy methods and make manifests to produce the CRD YAML. Both targets invoke controller-gen, which scans your Go types and emits the necessary code. Markers in comments drive this generation. You write // +kubebuilder:validation:Required and the tool emits the matching rule in the CRD validation schema. This keeps your Go types and YAML in sync.
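As a rough illustration of where markers sit, here is a cut-down pair of spec and status structs. The Size and Ready fields and the bounds of 1 and 5 are assumptions made for this sketch, and the structs are kept free of Kubernetes imports so the file compiles on its own; generated API types also embed the usual object metadata.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MemcachedSpec describes the desired state. The Size field and its bounds
// are assumptions made for this sketch.
type MemcachedSpec struct {
	// Size is the number of Memcached instances to run.
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=5
	Size int32 `json:"size"`
}

// MemcachedStatus describes the observed state.
type MemcachedStatus struct {
	// Ready reports whether the workload matches the spec.
	Ready bool `json:"ready,omitempty"`
}

// encodeSpec returns the JSON form of a spec, which is the shape users
// write under .spec in their manifests.
func encodeSpec(s MemcachedSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encodeSpec(MemcachedSpec{Size: 3}))
}
```

The Go compiler treats the marker lines as ordinary comments. Only controller-gen reads them, so a typo in a marker tends to surface as a missing rule in the generated YAML rather than as a build error.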

Here's the core reconciliation loop.

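import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    v1 "github.com/example/memcached-operator/api/v1" // hypothetical module path
)

// Reconcile is called by the controller when the CR or an object it owns changes.
// It returns a Result and an error. A non-nil error triggers a retry with exponential backoff.
func (r *MemcachedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the custom resource instance from the API server.
    // If it no longer exists, there is nothing left to reconcile, so return without an error.
    var memcached v1.Memcached
    if err := r.Get(ctx, req.NamespacedName, &memcached); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Check if the deployment exists. If not, create it.
    // This is the core logic: desired state vs actual state.
    deployment := &appsv1.Deployment{}
    if err := r.Get(ctx, req.NamespacedName, deployment); err != nil {
        if !errors.IsNotFound(err) {
            // Any other error is unexpected: return it so the request is retried.
            return ctrl.Result{}, err
        }
        // Create the deployment matching the CR spec.
        dep := r.buildDeploymentForMemcached(&memcached)
        // Link the deployment to the CR. r.Scheme is assumed to be the scheme field from the scaffold.
        if err := ctrl.SetControllerReference(&memcached, dep, r.Scheme); err != nil {
            return ctrl.Result{}, err
        }
        if err := r.Create(ctx, dep); err != nil {
            return ctrl.Result{}, err
        }
    }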

    // Return empty result to stop requeuing.
    // The controller will only call Reconcile again when a watch event fires.
    return ctrl.Result{}, nil
}

The receiver is named r for reconciler. This matches the convention of short, predictable names. The context.Context is the first parameter, named ctx. The framework passes this context, and it carries cancellation signals. If the context is canceled, you must stop work and return. The ctrl.Request contains the namespace and name of the object. The return type ctrl.Result controls requeueing. An empty result means success. You can return ctrl.Result{RequeueAfter: 10 * time.Second} to schedule a periodic check.
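How the return values translate into queue behavior can be modeled in a few lines. This is a simplified sketch, not the framework's implementation: the result type mirrors only the RequeueAfter field, and the backoff base and cap are illustrative values, since the real ones depend on the rate limiter configured for the controller.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// result mirrors the part of ctrl.Result used in this article.
type result struct {
	RequeueAfter time.Duration
}

// Illustrative backoff bounds (assumptions for this sketch).
const (
	baseDelay = 5 * time.Millisecond
	maxDelay  = 1000 * time.Second
)

// errTransient is a placeholder failure used by the demo.
var errTransient = errors.New("transient failure")

// nextDelay returns how long to wait before the key is processed again and
// whether it is requeued at all. failures counts consecutive failed
// attempts, including the current one.
func nextDelay(res result, err error, failures int) (time.Duration, bool) {
	if err != nil {
		// Exponential backoff: base * 2^(failures-1), capped at maxDelay.
		d := baseDelay
		for i := 1; i < failures; i++ {
			d *= 2
			if d >= maxDelay {
				return maxDelay, true
			}
		}
		return d, true
	}
	if res.RequeueAfter > 0 {
		// Success, but the caller asked for a scheduled re-check.
		return res.RequeueAfter, true
	}
	// Success with an empty result: wait for the next watch event.
	return 0, false
}

func main() {
	for failures := 1; failures <= 4; failures++ {
		d, _ := nextDelay(result{}, errTransient, failures)
		fmt.Printf("failure %d: retry in %v\n", failures, d)
	}
	d, requeue := nextDelay(result{RequeueAfter: 10 * time.Second}, nil, 0)
	fmt.Println("periodic check:", d, requeue)
	d, requeue = nextDelay(result{}, nil, 0)
	fmt.Println("empty result:", d, requeue)
}
```

The error path takes precedence over RequeueAfter in this model: when an error is returned, the backoff decides the delay.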

Run gofmt on your controller code. The community expects standard formatting. Most editors do this on save. Don't argue about indentation; let the tool decide.

Wiring up watches

The controller needs to know what to watch. You configure watches in SetupWithManager. You watch the primary custom resource. You can also watch related objects. If a related object changes, the controller maps the change back to the custom resource and requeues it. This is the owner reference pattern.

Here's how you wire up the watches.

// SetupWithManager configures the controller to watch Memcached objects.
// It also sets up watches on owned Deployments so changes propagate back to the CR.
func (r *MemcachedReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        // Watch the primary custom resource type.
        For(&v1.Memcached{}).
        // Watch Deployments owned by the CR.
        // The framework uses owner references to map Deployment changes back to the Memcached object.
        Owns(&appsv1.Deployment{}).
        Complete(r)
}

The Owns method sets up a watch on Deployments that have an owner reference pointing to a Memcached object. The framework does not add that reference for you. When you create a Deployment, you must set it yourself by calling ctrl.SetControllerReference before the Create call. It links the Deployment to the CR. When the Deployment changes, the framework finds the owner and calls Reconcile for the CR. This ensures the operator reacts to changes in the resources it manages. The same reference lets the Kubernetes garbage collector delete the Deployment when the CR is deleted.
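The mapping from a changed object back to its owner can be sketched with plain structs. Everything below is a simplified model under stated assumptions: ownerRef, object, request, setControllerRef, and requestForOwner are hypothetical stand-ins, and real owner references also carry an API version and a UID.

```go
package main

import "fmt"

// ownerRef is a cut-down owner reference.
type ownerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// object is a cut-down Kubernetes object.
type object struct {
	Kind      string
	Namespace string
	Name      string
	Owners    []ownerRef
}

// request identifies the object to reconcile.
type request struct {
	Namespace string
	Name      string
}

// setControllerRef marks owner as the controlling owner of owned. It fails
// if a different controller already owns the object.
func setControllerRef(owner, owned *object) error {
	for _, ref := range owned.Owners {
		if !ref.Controller {
			continue
		}
		if ref.Kind == owner.Kind && ref.Name == owner.Name {
			return nil // already linked; calling again is a no-op
		}
		return fmt.Errorf("%s %q is already controlled by %s %q",
			owned.Kind, owned.Name, ref.Kind, ref.Name)
	}
	owned.Owners = append(owned.Owners, ownerRef{
		Kind: owner.Kind, Name: owner.Name, Controller: true,
	})
	return nil
}

// requestForOwner maps a changed object to a request for its controlling
// owner of the given kind, if it has one.
func requestForOwner(changed object, ownerKind string) (request, bool) {
	for _, ref := range changed.Owners {
		if ref.Controller && ref.Kind == ownerKind {
			// The owner lives in the same namespace as the owned object.
			return request{Namespace: changed.Namespace, Name: ref.Name}, true
		}
	}
	return request{}, false
}

func main() {
	cr := &object{Kind: "Memcached", Namespace: "default", Name: "cache"}
	dep := &object{Kind: "Deployment", Namespace: "default", Name: "cache"}
	if err := setControllerRef(cr, dep); err != nil {
		fmt.Println("error:", err)
		return
	}
	req, ok := requestForOwner(*dep, "Memcached")
	fmt.Println(req, ok)
}
```

An object without a matching controller reference produces no request at all, which is why a Deployment created without the reference never wakes the reconciler.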

Reconcile runs in a worker goroutine. Treat it as a pure function of state.

Managing lifecycle with finalizers

Real operators manage lifecycle. If a user deletes the CR, you might need to delete dependent resources or run a backup script. You use finalizers. A finalizer is a string in the CR's metadata. The API server refuses to delete the object until the finalizer list is empty. You add a finalizer when you create resources. When the CR has a DeletionTimestamp, you run cleanup, then remove the finalizer.

Here's the finalizer pattern.

// ensureFinalizer adds a finalizer to the CR to prevent deletion until cleanup is done.
// This allows the operator to delete dependent resources before the CR disappears.
// controllerutil is sigs.k8s.io/controller-runtime/pkg/controller/controllerutil.
func (r *MemcachedReconciler) ensureFinalizer(ctx context.Context, memcached *v1.Memcached) error {
    const finalizer = "memcached.example.com"
    if !controllerutil.ContainsFinalizer(memcached, finalizer) {
        controllerutil.AddFinalizer(memcached, finalizer)
        return r.Update(ctx, memcached)
    }
    return nil
}

In Reconcile, you check for the finalizer. If it's missing, you add it. If the CR has a DeletionTimestamp, you perform cleanup. After cleanup, you remove the finalizer and update the CR. The API server then removes the object. The if err != nil check is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. Always handle errors. The compiler only helps partway: it rejects an err variable that is declared and never used, but it accepts a call whose return value you discard entirely.
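The order of those steps matters, so here is the flow as a small state machine. It is a sketch over an in-memory struct, assuming a boolean Deleting flag in place of a real DeletionTimestamp; resource, reconcileFinalizer, and the returned action strings are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName matches the string used in this article.
const finalizerName = "memcached.example.com"

// errCleanup is a placeholder failure used by the demo.
var errCleanup = errors.New("cleanup failed")

// resource is a cut-down custom resource.
type resource struct {
	Finalizers []string
	Deleting   bool // stands in for a non-zero DeletionTimestamp
}

func hasFinalizer(r *resource, name string) bool {
	for _, f := range r.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(r *resource, name string) {
	kept := make([]string, 0, len(r.Finalizers))
	for _, f := range r.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
}

// reconcileFinalizer performs one pass of the finalizer flow and reports
// the action it took.
func reconcileFinalizer(r *resource, cleanup func() error) (string, error) {
	if !r.Deleting {
		if !hasFinalizer(r, finalizerName) {
			r.Finalizers = append(r.Finalizers, finalizerName)
			return "added", nil
		}
		return "noop", nil
	}
	if !hasFinalizer(r, finalizerName) {
		// Cleanup already ran on an earlier pass.
		return "noop", nil
	}
	if err := cleanup(); err != nil {
		// Keep the finalizer so cleanup is retried on the next pass.
		return "cleanup-failed", err
	}
	removeFinalizer(r, finalizerName)
	return "removed", nil
}

func main() {
	r := &resource{}
	action, _ := reconcileFinalizer(r, func() error { return nil })
	fmt.Println(action, r.Finalizers)

	r.Deleting = true
	action, err := reconcileFinalizer(r, func() error { return errCleanup })
	fmt.Println(action, err, r.Finalizers)

	action, _ = reconcileFinalizer(r, func() error { return nil })
	fmt.Println(action, r.Finalizers)
}
```

Removing the finalizer only after cleanup succeeds is the design choice that matters: a failed cleanup leaves the object in place, and the returned error sends the request back through the queue.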

Updating status safely

Operators should update the .status field separately from the spec. Use the status subresource. This prevents update loops where the controller updates the object, triggers a watch, and updates again. With the status subresource enabled, a status write does not bump metadata.generation, which changes only when the spec changes, so the controller can tell its own status writes apart from user edits.

Here's how you update status.

// Write to the status subresource; any spec changes in this request are ignored.
// The type needs the // +kubebuilder:subresource:status marker for this to work.
memcached.Status.Ready = true
if err := r.Status().Update(ctx, &memcached); err != nil {
    return ctrl.Result{}, err
}

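The r.Status() method returns a client that targets the status subresource. You call Update on this client. The API server applies only the status and ignores any spec changes in the request. This is crucial for stability. A status write still bumps the resource version and fires a watch event, but it leaves metadata.generation untouched, so a controller that filters events by generation, or skips the write when the status is already correct, settles after at most one extra pass. If Reconcile instead changes the spec on every call, every write changes the generation, the watch calls Reconcile again, and you get an infinite loop. Keeping controller output in status breaks this cycle.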
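One way to keep status writes from re-triggering work is to filter update events by metadata.generation, which controller-runtime offers as predicate.GenerationChangedPredicate. The sketch below models that idea with plain structs. It is a simplification under stated assumptions: the object type and helper names are hypothetical, and the resource version is shown as an integer although the real field is an opaque string.

```go
package main

import "fmt"

// object is a cut-down resource with the two counters that matter here.
type object struct {
	Generation      int64 // bumped only when the spec changes
	ResourceVersion int64 // bumped on every write, including status
	Size            int32 // spec field (assumed for this sketch)
	Ready           bool  // status field (assumed for this sketch)
}

// updateSpec models a user editing the spec.
func updateSpec(o object, size int32) object {
	o.Size = size
	o.Generation++
	o.ResourceVersion++
	return o
}

// updateStatus models the controller writing through the status subresource.
func updateStatus(o object, ready bool) object {
	o.Ready = ready
	o.ResourceVersion++
	return o
}

// generationChanged decides whether an update event should be enqueued.
func generationChanged(old, updated object) bool {
	return old.Generation != updated.Generation
}

func main() {
	v1 := object{Generation: 1, ResourceVersion: 10, Size: 3}

	v2 := updateStatus(v1, true)
	fmt.Println("status write enqueues:", generationChanged(v1, v2))

	v3 := updateSpec(v2, 5)
	fmt.Println("spec edit enqueues:", generationChanged(v2, v3))
}
```

The trade-off is that a generation filter also hides metadata-only changes such as label edits, so it fits controllers that react only to the spec.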

Pitfalls and errors

The worst bug is the infinite loop. If Reconcile writes a new status on every call, each write triggers a watch event that calls Reconcile again, and you burn CPU. Write status only when it actually changed, and use the status subresource. The framework passes a context. If the context is canceled, stop work. The compiler does not force you to check the error from r.Get(...); a call whose result is discarded compiles fine. Check every returned error yourself, and run a linter such as errcheck to catch the ones you miss.

Reconcile must be idempotent. It can be called twice for the same event. Don't assume state. Check if resources exist before creating them. Use errors.IsNotFound to handle missing objects. The framework handles retries. If Reconcile returns an error, the item goes back to the queue with a delay. If it returns success, the loop waits for the next event.

Goroutine leaks happen when the goroutine waits on a channel that never gets closed. Always have a cancellation path. In operators, the context provides cancellation. Pass the context to every long-lived call. If the context is canceled, return immediately.
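A cancellation path usually comes down to a select on ctx.Done(). The following sketch polls a readiness check until it succeeds or the context ends. The names pollUntilReady and runDemo, and the 50 ms and 10 ms durations, are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntilReady calls ready on every tick until it returns true or ctx is
// canceled. It returns nil on success and the context's error otherwise.
func pollUntilReady(ctx context.Context, interval time.Duration, ready func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker on every return path

	for {
		if ready() {
			return nil
		}
		select {
		case <-ctx.Done():
			// Cancellation path: stop waiting and report why.
			return ctx.Err()
		case <-ticker.C:
			// Next tick: loop around and check again.
		}
	}
}

// runDemo wraps pollUntilReady in a short timeout so the example terminates.
func runDemo(ready func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return pollUntilReady(ctx, 10*time.Millisecond, ready)
}

func main() {
	fmt.Println("never ready:", runDemo(func() bool { return false }))
	fmt.Println("ready at once:", runDemo(func() bool { return true }))
}
```

Without the ctx.Done() case, the first call would block forever on the ticker. In a reconciler the context comes from the framework, so the same select lets a shutdown interrupt the wait.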

Errors trigger backoff. Success stops the retry. Trust the queue.

Decision matrix

Use an operator when you need to manage complex stateful workloads with custom lifecycle logic. Use a Helm chart when you only need to deploy static configurations and don't require runtime reconciliation. Use a CronJob when you need to run a script periodically without maintaining long-lived state. Use a plain Deployment when your app is stateless and Kubernetes built-in primitives handle your needs.

Operators add complexity. Only build one when the cluster needs to learn new behavior.

Where to go next