How to Monitor Background Job Health in Go

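Monitor background job health in Go by exposing an HTTP endpoint that reports job status, error counts, and last execution time from custom counters. The standard runtime/metrics package covers runtime internals such as memory and scheduler statistics, not application-level job state, so the job has to record its own. The job goroutine writes its state to shared variables, and an HTTP handler reads them for external monitoring tools.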

The silent failure

You deploy a Go service that scrapes prices every minute. It works locally. You push it to production. Three days later, the price database is empty. The process is alive. The HTTP server responds to requests. The background job didn't crash the whole program. It hung on a slow network request, or the scraper logic entered an infinite loop, or the database connection dropped and never recovered. The job is dead, but the process keeps running. You need a way to ask the process what's happening without stopping it.

Monitoring a background job means exposing its internal state through a safe interface. The job runs in a goroutine, detached from the main request flow. The monitoring endpoint runs in a different goroutine, triggered by HTTP requests. Both goroutines access the same variables. If you share plain variables between goroutines, you get data races. The compiler won't stop you. The race detector can catch it in testing if the racy code path runs, but in production a race can surface as stale values, corrupted data, or crashes. You need synchronization.

For simple metrics like counters, booleans, and timestamps, sync/atomic provides the right tool. Atomic operations guarantee that reads and writes happen as single, indivisible steps. No locks are needed. No mutex contention. The overhead is tiny. The code stays readable.

Shared state and the race detector

Think of a background job like a worker in a basement. You can't see them. You need a status light or an intercom. In Go, the status light is usually an HTTP endpoint that reads shared state. The state lives in package-level variables or a struct. The job writes to the state. The handler reads from the state.

Atomic types such as atomic.Bool and atomic.Int64 (added to sync/atomic in Go 1.19) wrap primitive values and provide methods like Store, Load, and Add. Store writes a value. Load reads a value. Add increments a counter. These methods are safe to call from multiple goroutines simultaneously. The underlying implementation uses atomic CPU instructions to ensure consistency.

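If you use a plain bool or int64, the compiler generates code that assumes no other goroutine touches the variable. When two goroutines access the same memory without synchronization, the compiler and CPU are free to reorder or cache those accesses. One goroutine writes a value, but another goroutine reads a stale copy. Or worse, on some platforms a write to a value wider than a machine word lands in the middle of a read, producing garbage. The race detector, enabled with the -race flag, instruments memory accesses at runtime to find these bugs. When it finds one, it prints a report with the stack traces of the conflicting accesses; by default the program keeps running. You need to fix the race before deploying.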

Atomic types eliminate the race for simple values. They are the standard library solution for metrics, flags, and counters.
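To make the guarantee concrete, here is a minimal sketch in which many goroutines increment one shared atomic.Int64. The helper name countConcurrently is hypothetical and not part of the service above; it only illustrates that Add never loses an update.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently (a hypothetical helper) starts n goroutines that each
// add 1 to a shared atomic counter, waits for all of them to finish, and
// returns the final value.
func countConcurrently(n int) int64 {
	var counter atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			counter.Add(1) // one indivisible read-modify-write
		}()
	}
	wg.Wait()
	return counter.Load()
}

func main() {
	fmt.Println(countConcurrently(1000)) // prints 1000
}
```

With a plain int64 and counter++ in place of Add, the same program would contain a data race and could print a smaller number.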

Minimal monitoring pattern

Here's the simplest pattern: a goroutine updates atomic counters, and an HTTP handler reads them. The job runs in a loop. The handler responds to requests. The state is shared via atomic variables.

package main

import (
	"sync/atomic"
	"time"
)

// status tracks job metrics with lock-free atomic operations.
// Atomic types allow safe concurrent access without mutexes.
var status struct {
	running    atomic.Bool
	errorCount atomic.Int64
	lastRun    atomic.Int64
}

// runJob updates status at the start and end of each cycle.
func runJob() {
	for {
		status.running.Store(true)
		status.lastRun.Store(time.Now().Unix())

		// Simulate work duration.
		time.Sleep(2 * time.Second)

		status.running.Store(false)

		// Increment counter on simulated failure.
		if time.Now().Unix()%10 == 0 {
			status.errorCount.Add(1)
		}

		time.Sleep(8 * time.Second)
	}
}

The struct groups related metrics. Using a struct keeps the namespace clean. Each field is an atomic type. atomic.Bool tracks whether the job is currently executing. atomic.Int64 tracks error counts and timestamps. The runJob function sets running to true at the start of work, updates lastRun, performs the work, sets running to false, and increments errorCount if needed.

// health.go: a second file in the same package, with its own imports.
// Go requires a file's imports to precede its other declarations.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// healthHandler reads atomic state and formats the response.
// Each Load returns a snapshot of the value at that moment.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "Running: %v\n", status.running.Load())
	fmt.Fprintf(w, "Errors: %d\n", status.errorCount.Load())
	fmt.Fprintf(w, "Last Run: %s\n", time.Unix(status.lastRun.Load(), 0).Format(time.RFC3339))
}

func main() {
	go runJob()
	http.HandleFunc("/health", healthHandler)
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8080", nil))
}

The handler calls Load on each atomic field. Load returns the current value. The loads are independent of each other, so the job can update a field between two of them. For example, a request that arrives just as a cycle starts might read running as true while lastRun still holds the previous cycle's timestamp. This is expected. Each load is a snapshot of one field. The handler formats the output as plain text. In production, you'd return JSON. The main function starts the job in a goroutine and launches the HTTP server.

Atomic operations are fast. They are not a substitute for design.

How the code works

The atomic.Bool type stores a boolean value. Store(true) writes true. Load() reads the value. The write and read are atomic. No other goroutine can see a partial write. The atomic.Int64 type works the same way for integers. Add(1) increments the counter atomically. You don't need to read the value, increment it, and write it back. Add does it in one step.

The lastRun field stores a Unix timestamp. time.Now().Unix() returns the current time as an int64. Storing the timestamp as an integer is efficient. The handler converts it back to a time.Time using time.Unix. This conversion happens on every request. The overhead is negligible.

The job loop runs forever. time.Sleep pauses the goroutine. During the 8-second pause between cycles, running is false. When work starts, running becomes true. This allows the handler to report whether the job is active. The error counter accumulates over time. It never resets. This is useful for tracking total failures. If you need a rate, sample the counter at two points in time and divide the difference by the elapsed minutes.

Convention aside: gofmt formats the code automatically. Don't argue about indentation. Most editors run gofmt on save. The community expects formatted code. If you submit code that isn't formatted, the review will ask you to run gofmt.

Real-world health endpoint

Monitoring tools expect structured data. Plain text is hard to parse. JSON is the standard format. The handler should return a JSON object with a status field, metrics, and timestamps. You can derive a health status from the metrics. If the error count exceeds a threshold, the status becomes degraded.

// Replaces the earlier plain-text healthHandler in the same package.
import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// healthResponse defines the JSON shape for monitoring tools.
// Struct tags control the JSON field names.
type healthResponse struct {
	Status  string `json:"status"`
	Running bool   `json:"running"`
	Errors  int64  `json:"error_count"`
	LastRun string `json:"last_run"`
}

// healthHandler builds a JSON response with derived health status.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Load the counter once so the status and the reported count agree.
	errCount := status.errorCount.Load()
	statusText := "ok"
	if errCount > 100 {
		statusText = "degraded"
	}

	resp := healthResponse{
		Status:  statusText,
		Running: status.running.Load(),
		Errors:  errCount,
		LastRun: time.Unix(status.lastRun.Load(), 0).Format(time.RFC3339),
	}

	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(resp); err != nil {
		log.Printf("health: encode response: %v", err)
	}
}

The healthResponse struct defines the output format. JSON tags map Go field names to JSON keys. json:"error_count" produces error_count in the JSON. The handler loads the error count and determines the status. If errors exceed 100, the status is degraded. The handler builds the response struct and encodes it. json.NewEncoder writes directly to the response writer. The content type header is set to application/json.
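To see what the struct tags produce, the following sketch marshals a healthResponse with made-up field values and prints the result. The encode helper is a hypothetical wrapper around json.Marshal added only for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthResponse mirrors the struct from the handler above.
type healthResponse struct {
	Status  string `json:"status"`
	Running bool   `json:"running"`
	Errors  int64  `json:"error_count"`
	LastRun string `json:"last_run"`
}

// encode (a hypothetical helper) marshals resp to a JSON string.
func encode(resp healthResponse) (string, error) {
	b, err := json.Marshal(resp)
	return string(b), err
}

func main() {
	// The values below are illustrative, not real measurements.
	out, err := encode(healthResponse{
		Status:  "ok",
		Running: true,
		Errors:  3,
		LastRun: "2024-01-01T00:00:00Z",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
	// prints {"status":"ok","running":true,"error_count":3,"last_run":"2024-01-01T00:00:00Z"}
}
```

The keys appear in struct field order, and each key comes from the tag rather than the Go field name.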

Convention aside: Exported names start with a capital letter. Unexported names start lowercase. healthHandler is lowercase because it's only used in this package. If you export it, it becomes HealthHandler. The healthResponse fields are exported so that encoding/json can see them; it skips unexported fields. The receiver name convention applies to methods. If you move the job into a struct, use a short receiver such as (j *Job) Run(), not (this *Job) Run().

Pitfalls and compiler errors

Data races are the most common bug. If you replace atomic.Bool with bool, the compiler won't complain. The code compiles. The race detector catches the bug at runtime, but only when you build or test with the -race flag and the racy code path actually executes. It prints WARNING: DATA RACE and shows the stack traces of the conflicting goroutines. You must fix the race by using atomic types or a mutex.

Inconsistent reads happen when you read multiple fields without synchronization. The handler reads running, then lastRun. Between the two reads, the job might update either field. You could see running: true alongside a lastRun from the previous cycle. Each value was correct at the moment it was loaded, but the pair does not describe a single moment. That's usually acceptable for monitoring. If you need a consistent snapshot of multiple fields, use a mutex.

Compiler errors appear when you misuse types. For example, the compiler rejects status.lastRun.Load() < time.Now() with an invalid operation error that reports mismatched types int64 and time.Time. Converting alone is not enough, because time.Time values do not support the < operator either. Convert the Unix timestamp with time.Unix and compare using the Before or After methods, or call Unix() on the time.Time and compare two int64 values.

If you forget to import a package, you get undefined: pkg. If you import a package but don't use it, you get imported and not used. Go enforces clean imports. Remove unused imports to compile.

Convention aside: if err != nil { return err } is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. Don't hide errors. Return them or log them. In the health handler, json.NewEncoder(w).Encode(resp) returns an error, and it should be checked. If the failure is a write error, the client has typically gone away or the headers have already been sent, so the handler can't switch to an error response. In production code, log the error.

Detecting stuck jobs

A counter tells you how many errors happened. It doesn't tell you if the job is frozen. A frozen job is worse than a failing job. A failing job increments the error count. A frozen job leaves running as true forever. You need a watchdog. The watchdog checks if running is true and lastRun is older than the expected cycle time. If so, the job is stuck.

You can expose this as a stuck boolean in your health endpoint. The check combines the flag with a timestamp comparison: status.running.Load() && time.Since(time.Unix(status.lastRun.Load(), 0)) > 30*time.Second. The 30-second threshold is an example; pick a value comfortably above the job's normal work duration. This logic belongs in the handler. The handler computes the derived state on demand. This keeps the job loop simple. The job only updates atomic values. The handler interprets them.

Convention aside: time.Since is idiomatic. It returns a time.Duration. Comparing durations is straightforward. Don't use time.Now().Sub. time.Since is clearer.

Background jobs run until the process dies. Give them a way to stop, or they leak. Use context.Context to signal cancellation. Pass context.Context as the first argument to your job function. Name it ctx. In the loop, select on ctx.Done() and return when the channel is closed. A bare time.Sleep cannot be interrupted, so wait on a timer or ticker channel in the same select. This follows the standard Go convention. Functions that take a context should respect cancellation.

Context is plumbing. Run it through every long-lived call site.

Decision matrix

Use sync/atomic counters when you need simple metrics like error totals or boolean flags with zero allocation overhead.

Use a sync.Mutex when you need to update multiple related fields atomically, such as resetting a counter and updating a timestamp in one step.

Use an external metrics library like Prometheus when you need histograms, gauges, or integration with existing dashboards.

Use context.Context when the job needs to stop cleanly on shutdown, allowing the handler to report a graceful termination.

Use plain sequential code when the job is fast and runs synchronously; background monitoring adds complexity you don't need.

Monitor the job, not just the process.

Where to go next