How to Implement Health Checks for Microservices in Go

The process is alive, but the service is dead

You deploy a Go microservice. The container starts. The process is running. The load balancer sees an open TCP port and assumes everything is fine. It sends a request. Your service tries to talk to the database. The database is down. Your handler hangs waiting for a connection. The load balancer waits. The user waits. The orchestrator sees the process is still alive and does nothing.

The service is effectively dead, but the system thinks it is healthy.

A health check solves this gap. It is a lightweight HTTP endpoint that answers a specific question: can you handle work right now? Not "are you running", but "are you useful". When the answer is no, the endpoint returns a failure status. The load balancer stops sending traffic. The orchestrator knows the service is broken.

Liveness versus readiness

Orchestrators like Kubernetes distinguish between two types of health. The distinction matters when you design your endpoints.

Liveness checks ask if the process is stuck. If the request handlers are deadlocked, or a critical background loop has exited, the liveness check fails. The orchestrator kills the container and restarts it. A restart might fix the problem.

Readiness checks ask if the service can accept traffic. If the database is down, or the configuration file is missing, the readiness check fails. The orchestrator removes the pod from the load balancer. It does not restart the pod. The service stays running, waiting for the dependency to come back.

Use a single endpoint for simple services. Use separate endpoints when you need different behaviors for crashes versus dependency failures.
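
As a rough illustration of the two-endpoint layout, here is a minimal sketch. The paths /live and /ready, the package-level ready flag, and the probeStatus helper are assumptions made for this example, not names any orchestrator requires. The liveness endpoint only proves the process can still answer HTTP. The readiness endpoint reports 503 until the application flips the flag.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready is flipped by the application as its dependencies come and go.
// A zero atomic.Bool reports false, so the service starts out not ready.
var ready atomic.Bool

// newProbeMux registers the two probe endpoints.
// The paths /live and /ready are illustrative choices.
func newProbeMux() *http.ServeMux {
	mux := http.NewServeMux()

	// Liveness: answering at all proves the process can still serve HTTP.
	mux.HandleFunc("/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: report 503 until the service can do useful work.
	mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probeStatus sends an in-process GET to path and returns the status code.
func probeStatus(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}
```

With this layout a dependency outage changes only the /ready answer, so the orchestrator can stop routing traffic without restarting the container.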

The minimal health check

Start with the simplest thing that works. An HTTP handler that pings the database. If the ping succeeds, return 200 OK. If it fails, return 503 Service Unavailable.

Here is the minimal implementation. It checks one dependency and returns a plain text response.

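package main

import (
	"database/sql"
	"log"
	"net/http"
	"os"

	// Assumption: the lib/pq driver. sql.Open("postgres", ...) fails at
	// runtime unless a driver has registered that name.
	_ "github.com/lib/pq"
)

// db holds the database connection pool.
// Initialize this before starting the HTTP server.
var db *sql.DB

func main() {
	// Open the database connection pool.
	// sql.Open may only validate its arguments; it does not have to connect.
	var err error
	db, err = sql.Open("postgres", os.Getenv("DB_URL"))
	if err != nil {
		log.Fatalf("open database: %v", err)
	}

	// Register the health check handler.
	http.HandleFunc("/health", healthHandler)

	// Start the HTTP server. ListenAndServe always returns a non-nil error.
	log.Fatal(http.ListenAndServe(":8080", nil))
}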

// healthHandler checks the database and returns the status.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Ping verifies that a connection to the database is still alive,
	// establishing one if necessary.
	if err := db.Ping(); err != nil {
		// Return 503 to signal the service is unavailable.
		// The load balancer will stop sending traffic.
		http.Error(w, "DB connection failed", http.StatusServiceUnavailable)
		return
	}

	// Return 200 to signal the service is healthy.
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("OK"))
}

The code is straightforward. db.Ping() verifies that a connection to the database is still alive, establishing one if necessary. This matters because sql.Open may only validate its arguments without connecting. If the database is down, Ping returns an error. The handler writes a 503 status and a short message. The load balancer sees the 503 and marks the backend as unhealthy.

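Convention aside: http.Error sets the status code, a plain-text content type, and the body in one call. It does not end the request, so the return that follows it is required. It is the standard way to send plain-text error responses in Go. The explicit if err != nil branch is boilerplate, but the community accepts it because it makes the unhappy path visible.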

Why ping is not enough

The minimal example works for a toy service. Real services have more dependencies. They talk to Redis, message queues, and other APIs. They also need to respect timeouts. A health check that hangs for thirty seconds is worse than no health check. If it backs a liveness probe, the orchestrator will kill the process, creating a restart loop.

A robust health check needs three things. It needs to check all critical dependencies. It needs a timeout to prevent hanging. It needs to return a structured response for debugging.

Here is a realistic implementation. It uses a struct to hold dependencies. It uses context to enforce a timeout. It returns JSON so you can see which dependency failed.

package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// HealthChecker holds the dependencies required for health checks.
type HealthChecker struct {
	db *sql.DB
}

// NewHealthChecker creates a new checker with the given database.
func NewHealthChecker(db *sql.DB) *HealthChecker {
	return &HealthChecker{db: db}
}

// Check performs the health check and returns an error if anything fails.
func (h *HealthChecker) Check(ctx context.Context) error {
	// Create a child context with a short timeout.
	// Health checks must be fast. 500ms is usually enough.
	ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
	defer cancel()

	// Ping the database using the context.
	// This ensures the ping is cancelled if the timeout fires.
	if err := h.db.PingContext(ctx); err != nil {
		// Wrap the error to identify the failing dependency
		// while keeping the underlying cause.
		return fmt.Errorf("database unreachable: %w", err)
	}

	// Add more checks here, like Redis or external APIs.
	// Return nil if all checks pass.
	return nil
}

The HealthChecker struct holds the database connection. The Check method takes a context.Context. This is a Go convention. Functions that perform I/O should accept a context as the first parameter. The context carries deadlines and cancellation signals.

The method creates a child context with a 500-millisecond timeout. It calls db.PingContext instead of db.Ping. PingContext respects the deadline. If the database is slow, the ping is cancelled after 500ms. The check returns an error. The handler does not hang.

Convention aside: context.Context always goes as the first parameter. It is conventionally named ctx. Functions that take a context should respect cancellation and deadlines. This pattern prevents goroutine leaks and ensures your service responds quickly to failure.
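
One possible way to extend the checker beyond a single dependency is sketched below. It runs every check concurrently under one shared deadline and joins the failures into a single error. The names CheckFunc, checkAll, okCheck, slowCheck, and demo are hypothetical and exist only for this example. The stand-in checks take the place of real database, Redis, or API probes.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// CheckFunc probes one dependency. It must return promptly once ctx is done.
type CheckFunc func(ctx context.Context) error

// checkAll runs every named check concurrently under one shared deadline
// and joins the failures into a single error (nil when all checks pass).
func checkAll(ctx context.Context, timeout time.Duration, checks map[string]CheckFunc) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	// Buffered so every goroutine can send its result and exit.
	results := make(chan error, len(checks))
	for name, check := range checks {
		go func(name string, check CheckFunc) {
			if err := check(ctx); err != nil {
				results <- fmt.Errorf("%s: %w", name, err)
				return
			}
			results <- nil
		}(name, check)
	}

	var errs []error
	for range checks {
		errs = append(errs, <-results)
	}
	// errors.Join (Go 1.20+) drops nil entries and returns nil if none remain.
	return errors.Join(errs...)
}

// okCheck simulates a healthy dependency.
func okCheck(ctx context.Context) error { return nil }

// slowCheck simulates a hung dependency: it blocks until ctx is done.
func slowCheck(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo runs checkAll twice: once with only a healthy dependency,
// once with an added dependency that never answers.
func demo() (healthy, degraded error) {
	healthy = checkAll(context.Background(), 50*time.Millisecond,
		map[string]CheckFunc{"db": okCheck})
	degraded = checkAll(context.Background(), 50*time.Millisecond,
		map[string]CheckFunc{"db": okCheck, "cache": slowCheck})
	return healthy, degraded
}
```

Because the checks run in parallel, the whole health check takes about as long as the slowest dependency, capped by the timeout. The joined error names each failing dependency, which is what a structured response needs.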

Wiring the handler

The checker does the work. The handler exposes it over HTTP. The handler calls the checker, marshals the result to JSON, and writes the response.

Here is the handler implementation. It separates the business logic from the HTTP plumbing.

package main

import (
	"encoding/json"
	"net/http"
)

// HealthResponse represents the JSON response for the health endpoint.
type HealthResponse struct {
	Status  string `json:"status"`
	Message string `json:"message,omitempty"`
}

// ServeHTTP implements the http.Handler interface.
// It checks health and writes the JSON response.
func (h *HealthChecker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Use the request context.
	// The server cancels this context if the client disconnects.
	ctx := r.Context()

	// Run the health check.
	err := h.Check(ctx)

	// Prepare the response.
	resp := HealthResponse{Status: "healthy"}
	if err != nil {
		resp.Status = "unhealthy"
		resp.Message = err.Error()
	}

	// Set the content type to JSON.
	w.Header().Set("Content-Type", "application/json")

	// Write the status code.
	if resp.Status == "healthy" {
		w.WriteHeader(http.StatusOK)
	} else {
		w.WriteHeader(http.StatusServiceUnavailable)
	}

	// Encode the response to JSON.
	json.NewEncoder(w).Encode(resp)
}

The HealthChecker type implements the http.Handler interface by defining ServeHTTP. This allows you to pass the checker directly to http.Handle, and the server calls ServeHTTP for each request. The handler uses r.Context() to get the request context. It calls Check. It builds a HealthResponse struct. It sets the content type. It writes the status code. It encodes the JSON.
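
Because the health endpoint is an ordinary http.Handler, it can be exercised in-process without a database or a listening socket. The sketch below is one way to do that, assuming a small checker interface and a fakeChecker test double. Both are hypothetical names introduced here, as are healthHandler, statusFor, and statusWhenDown.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"
	"net/http/httptest"
)

// checker is the one method the handler needs, so tests can supply a fake.
type checker interface {
	Check(ctx context.Context) error
}

// fakeChecker is a test double that returns a fixed result.
type fakeChecker struct{ err error }

func (f fakeChecker) Check(ctx context.Context) error { return f.err }

// healthHandler adapts any checker to an http.Handler that reports
// 200 when Check succeeds and 503 when it fails.
func healthHandler(c checker) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		status, body := http.StatusOK, map[string]string{"status": "healthy"}
		if err := c.Check(r.Context()); err != nil {
			status = http.StatusServiceUnavailable
			body = map[string]string{"status": "unhealthy", "message": err.Error()}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(status)
		json.NewEncoder(w).Encode(body)
	})
}

// statusFor performs an in-process GET /health and returns the status code.
func statusFor(c checker) int {
	rec := httptest.NewRecorder()
	healthHandler(c).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

// statusWhenDown covers the case that matters most:
// a failing dependency must surface as 503, never as 200.
func statusWhenDown() int {
	return statusFor(fakeChecker{err: errors.New("database unreachable")})
}
```

Depending on an interface rather than on *sql.DB is a design choice for testability. The real HealthChecker satisfies the same interface through its Check method.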

Convention aside: w.Header().Set must be called before w.WriteHeader. Once the headers are written, you cannot change them. The compiler will not catch this mistake. It is a runtime logic error. Always set headers first.
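
The ordering rule is easy to observe with httptest.ResponseRecorder, whose Result method returns the headers as they stood when WriteHeader was called. The following is a small sketch. The function name contentTypeSent is made up for this example.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// contentTypeSent reports the Content-Type a client would receive,
// depending on whether the header is set before or after WriteHeader.
func contentTypeSent(setFirst bool) string {
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if setFirst {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
		// Too late: the headers were already committed by WriteHeader.
		w.Header().Set("Content-Type", "application/json")
	})

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	// Result exposes the headers as snapshotted at WriteHeader time.
	return rec.Result().Header.Get("Content-Type")
}
```

Calling contentTypeSent(true) yields "application/json", while contentTypeSent(false) yields an empty string. Both variants compile, which is why the mistake tends to surface only in production.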

Pitfalls and anti-patterns

Health checks seem simple. They are not. A bad health check causes outages.

The restart loop is the most common failure. You configure a liveness check. The database goes down. The check fails. The orchestrator kills the pod. The pod restarts. The database is still down. The check fails. The orchestrator kills the pod again. The service cannot recover until the database does, and the restarts only add churn. Use readiness checks for dependency failures. Use liveness checks only for process hangs.

The slow check is another trap. You add a check for an external API. The API is slow. The health check takes two seconds. The load balancer expects a response in one second. It marks the service as down. The service is healthy, but the check reports a failure. Keep health checks fast. Use short timeouts. Check local state when possible.
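
One way to check local state is to probe the slow dependency in the background and let the endpoint answer from memory. The sketch below assumes a hypothetical cachedHealth type. The names refresh, run, statusAfter, and errDown are illustrative as well. The trade-off is staleness: the endpoint reports the result of the last probe, not the current instant.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// errDown is a stand-in failure used in the usage helper below.
var errDown = errors.New("external API timed out")

// cachedHealth stores the result of the most recent background probe.
type cachedHealth struct {
	mu  sync.RWMutex
	err error // nil means the last probe succeeded
}

// refresh runs one probe with its own timeout and records the outcome.
func (c *cachedHealth) refresh(probe func(context.Context) error, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err := probe(ctx)

	c.mu.Lock()
	c.err = err
	c.mu.Unlock()
}

// run probes immediately and then on every tick until ctx is cancelled.
// Start it in its own goroutine.
func (c *cachedHealth) run(ctx context.Context, every time.Duration, probe func(context.Context) error) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		c.refresh(probe, every)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// ServeHTTP answers from memory, so it never waits on a dependency.
func (c *cachedHealth) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	c.mu.RLock()
	err := c.err
	c.mu.RUnlock()
	if err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusAfter refreshes once with a probe that returns probeErr,
// then reads the endpoint in-process and returns the status code.
func statusAfter(c *cachedHealth, probeErr error) int {
	c.refresh(func(context.Context) error { return probeErr }, time.Second)
	rec := httptest.NewRecorder()
	c.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}
```

Note that a zero cachedHealth reports healthy until the first probe completes, so in this sketch it would make sense to call refresh once before serving traffic.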

The silent failure is the worst. You return 200 OK even when the database is down. The load balancer sends traffic. The requests fail. Users see errors. The orchestrator does nothing. Always return 503 when a critical dependency is down. Do not swallow errors.

Compiler errors catch type mix-ups early. If you pass a context.Context to a function that expects a *sql.DB, the compiler rejects the program with a message like cannot use ctx (variable of type context.Context) as *sql.DB value in argument. If you forget to import a package, you get undefined: followed by the package name. If you import one and never use it, you get imported and not used. Go is strict about types and imports. Use the right types.

Decision matrix

Choose the right pattern for your service.

Use a simple ping when you have one critical dependency and no complex startup logic. A single db.Ping() is enough.

Use a structured JSON response when you need to debug failures in production. Operators can see which dependency failed without digging into logs.

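Use separate readiness and liveness endpoints when you need to distinguish between process crashes and dependency failures. Kubernetes routes traffic based on the readiness probe and restarts containers based on the liveness probe. The paths are yours to configure; /ready and /live are common choices.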

Use a lightweight check when the health check runs frequently. Load balancers poll every few seconds. Do not run expensive queries. Check connection pool stats instead of running a full transaction.

Use a context with timeout when you talk to external services. Prevent the health check from hanging forever. Set a deadline of 500ms or less.

Use a single endpoint when your service is simple. Do not over-engineer. One endpoint that returns 200 or 503 is often enough.
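
The connection-pool suggestion from the matrix above can be sketched with sql.DB.Stats, which reads counters from memory and never touches the network. The helper names exhausted and poolExhausted are assumptions for this example. Treat the result as a cheap saturation signal, not as proof that the database is reachable.

```go
package main

import "database/sql"

// exhausted reports whether every allowed connection is in use.
// A maxOpen of 0 means the pool is unlimited, so it can never be exhausted.
func exhausted(maxOpen, inUse int) bool {
	return maxOpen > 0 && inUse >= maxOpen
}

// poolExhausted reads the pool counters without touching the network.
func poolExhausted(db *sql.DB) bool {
	s := db.Stats()
	return exhausted(s.MaxOpenConnections, s.InUse)
}
```

A readiness handler could report 503 while poolExhausted returns true and leave the actual ping to a less frequent check.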

Where to go next

Health checks are the foundation of observability. Once your service reports its status, you can measure it.

Health checks are promises. Keep them fast. Keep them honest.

In short, a health check is a status report your service gives to an orchestrator or load balancer. It says whether the service is running and whether it can reach its database and other dependencies. Think of it like a doctor's checkup: if the service is sick, traffic goes elsewhere until it recovers.