The crash that shouldn't kill your service
You are running a background service that processes messages from a queue. The code works perfectly for thousands of messages. Then a bad payload arrives. It contains malformed JSON, and the code that handles it panics. An unrecovered panic in any goroutine crashes the whole Go process, not just that goroutine. If you recover the panic and move on, the process survives, but the worker goroutine has still exited. The queue keeps filling up. The rest of your service is still running, but that worker is gone until you restart the entire binary.
You want a system that isolates failures. If a worker crashes, you want a new worker to take its place immediately. You want the parent to monitor the children and restart them automatically. This pattern is called a supervisor tree. Erlang and Elixir have this built into the runtime. Go does not. Go gives you the primitives to build exactly what you need. You compose goroutines, channels, and context to create supervision.
Supervision is just a loop with goroutines
A supervisor is a goroutine that manages other goroutines. The supervisor spawns a child, waits for the child to finish, and then spawns a new child. The loop continues until the supervisor itself is told to stop. The supervisor does not do the work. The supervisor ensures the work keeps happening.
Think of a supervisor like a shift manager at a restaurant. The manager does not chop vegetables or plate food. The manager watches the line. If a cook drops a pan and walks away, the manager calls a relief cook to take the station. The manager keeps a log of who is working and ensures the kitchen keeps moving. If the restaurant closes for the night, the manager tells everyone to go home.
In Go, the supervisor loop looks like a for loop that runs forever. Inside the loop, you start a goroutine. You create a channel to signal when the goroutine is done. You use select to wait for either the channel signal or a cancellation signal from the context. When the channel signals, the loop repeats. When the context cancels, the loop breaks.
Minimal supervisor example
Here is the simplest supervisor. It spawns a worker, waits for the worker to finish, and restarts the worker. The worker runs until the context is cancelled.
package main

import (
	"context"
	"fmt"
	"time"
)

// worker simulates a long-running task.
func worker(ctx context.Context, id int, done chan<- struct{}) {
	// defer ensures the channel is closed when the function returns.
	defer close(done)
	for {
		select {
		case <-ctx.Done():
			// context cancelled, worker exits cleanly.
			return
		default:
			// do work here.
			fmt.Printf("Worker %d running\n", id)
			time.Sleep(2 * time.Second)
		}
	}
}

// supervisor manages the worker lifecycle.
func supervisor(ctx context.Context) {
	for {
		// create a channel to signal worker completion.
		done := make(chan struct{})
		// spawn worker in a new goroutine.
		go worker(ctx, 1, done)
		// wait for worker to finish or context to cancel.
		select {
		case <-done:
			// worker finished, loop restarts it.
			fmt.Println("Worker finished, restarting...")
		case <-ctx.Done():
			// supervisor told to stop. wait for the worker to exit first
			// so it is not left running after the supervisor returns.
			<-done
			fmt.Println("Supervisor stopping.")
			return
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// stopped is closed once the supervisor has returned.
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		supervisor(ctx)
	}()
	// let it run for a bit.
	time.Sleep(10 * time.Second)
	cancel()
	// wait for the supervisor, and therefore the worker, to exit.
	<-stopped
}
The supervisor loop runs indefinitely. Each iteration creates a new done channel and spawns a new worker. The select statement blocks until one of two things happens. The worker closes the done channel, which triggers the first case. The context is cancelled, which triggers the second case. If the worker finishes, the loop repeats and spawns a fresh worker. If the context cancels, the supervisor returns and the program can exit.
Goroutines are cheap. Channels are not magic. The supervisor pattern works because goroutines have low overhead and channels provide a reliable way to coordinate between them.
Adding panic recovery and backoff
The minimal example restarts the worker if it returns normally. Real workers crash. They panic. If a worker panics and nothing recovers the panic, the runtime terminates the whole program, supervisor included. A panic cannot be recovered from another goroutine, so the recover must run in a deferred function on the worker's own goroutine, either inside the worker or in the wrapper the supervisor uses to spawn it.
You also need to prevent rapid restart loops. If a worker crashes immediately, you do not want to spawn a new worker every microsecond. You add a backoff delay. You wait longer between each restart attempt. This protects the system from thundering herds and gives downstream services time to recover.
Here is a realistic supervisor. It uses a struct to hold configuration. It wraps the worker in a recovery function. It implements exponential backoff.
package main

import (
	"context"
	"fmt"
	"time"
)

// Supervisor holds configuration for the supervision loop.
type Supervisor struct {
	maxRestarts int
	backoff     time.Duration
}

// Run starts the supervision loop.
func (s *Supervisor) Run(ctx context.Context) {
	restarts := 0
	for {
		// spawn worker with panic recovery.
		done := make(chan struct{})
		go func() {
			// recover catches panics so they do not crash the program.
			defer func() {
				if r := recover(); r != nil {
					fmt.Printf("Worker panicked: %v\n", r)
				}
				close(done)
			}()
			// run the worker; a panic in here is caught by the deferred recover.
			worker(ctx)
		}()
		// wait for worker or context.
		select {
		case <-done:
			restarts++
			if restarts > s.maxRestarts {
				fmt.Println("Max restarts reached, giving up.")
				return
			}
			// apply exponential backoff before restarting: the delay doubles
			// on every restart (1x, 2x, 4x, ... the base). this assumes
			// maxRestarts is small enough that the shift cannot overflow.
			backoff := s.backoff << (restarts - 1)
			fmt.Printf("Restarting in %v...\n", backoff)
			select {
			case <-time.After(backoff):
				// backoff elapsed, loop continues.
			case <-ctx.Done():
				return
			}
		case <-ctx.Done():
			// wait for the worker to exit so it is not leaked.
			<-done
			return
		}
	}
}

// worker does the actual work.
func worker(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		default:
			fmt.Println("Working...")
			time.Sleep(1 * time.Second)
		}
	}
}

func main() {
	s := &Supervisor{
		maxRestarts: 5,
		backoff:     1 * time.Second,
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	// finished is closed once Run has returned.
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		s.Run(ctx)
	}()
	time.Sleep(15 * time.Second)
	cancel()
	// wait for Run, and therefore the worker, to return.
	<-finished
}
The Run method uses the receiver name s, a short name derived from the type Supervisor. This follows the Go convention of using short receiver names. The worker is wrapped in an anonymous function that defers a recover. The defer lives inside the spawned goroutine because recover only takes effect in a deferred function on the goroutine that panicked. If the worker panics, the recover catches it, prints the panic value, and closes the done channel. The supervisor sees the channel close and treats it like a normal exit, so clean returns and panics both count toward the restart limit. The supervisor increments the restart counter and applies a backoff delay that grows with each restart. If the restart count exceeds the maximum, the supervisor gives up. This prevents infinite loops when the worker is fundamentally broken.
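A backoff that grows on every restart can end up waiting far longer than is useful, so a common refinement is to cap it. The sketch below is an assumption layered on top of the example, not part of it: backoffDelay and its max parameter are hypothetical names, and the doubling schedule is one reasonable choice among several.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay is a hypothetical helper. It returns base for the first
// restart and doubles the delay for each later restart, never exceeding max.
func backoffDelay(base, max time.Duration, restarts int) time.Duration {
	if restarts < 1 {
		return 0
	}
	d := base
	for i := 1; i < restarts; i++ {
		d *= 2
		if d >= max {
			// Stop doubling once the cap is reached; this also avoids overflow.
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for r := 1; r <= 6; r++ {
		fmt.Printf("restart %d: wait %v\n", r, backoffDelay(time.Second, 10*time.Second, r))
	}
}
```

With a base of one second and a cap of ten seconds, the delays are 1s, 2s, 4s, 8s, and then 10s for every restart after that. You could call a helper like this from the supervisor loop instead of computing the delay inline.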
Context is plumbing. Run it through every long-lived call site. The worker checks ctx.Done() to stop cleanly. The supervisor checks ctx.Done() to stop the loop. Cancellation flows down from the parent to the children.
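Here is a small sketch of how cancellation flows through a tree of contexts. The names queueCtx and emailCtx are made up for illustration and stand for two child supervisors under one root. Cancelling one child stops only that subtree. Cancelling the root stops everything below it.

```go
package main

import (
	"context"
	"fmt"
)

// isDone reports whether ctx has been cancelled.
func isDone(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// run builds a two-level context tree and returns the done state of
// [root, queue, email] after cancelling one child, then after cancelling root.
func run() (afterChild, afterRoot [3]bool) {
	root, stopAll := context.WithCancel(context.Background())
	defer stopAll()
	// Each child supervisor derives its context from its parent's.
	queueCtx, stopQueue := context.WithCancel(root)
	defer stopQueue()
	emailCtx, stopEmail := context.WithCancel(root)
	defer stopEmail()

	// Stop one subtree. The root and the sibling keep running.
	stopQueue()
	afterChild = [3]bool{isDone(root), isDone(queueCtx), isDone(emailCtx)}

	// Stop the root. Every context derived from it is cancelled too.
	stopAll()
	<-emailCtx.Done()
	afterRoot = [3]bool{isDone(root), isDone(queueCtx), isDone(emailCtx)}
	return afterChild, afterRoot
}

func main() {
	afterChild, afterRoot := run()
	fmt.Println("after stopping queue subtree:", afterChild) // [false true false]
	fmt.Println("after stopping root:", afterRoot)           // [true true true]
}
```

Cancellation only ever flows downward. A child cancelling its own context never affects its parent or its siblings, which is what lets a supervisor stop one branch without touching the rest.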
Pitfalls and compiler errors
Supervisor trees introduce complexity. You must handle errors, panics, and cancellation correctly. Here are common mistakes.
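Panic propagation. A panic does not cross goroutine boundaries, and a supervisor cannot recover a panic on behalf of its child. If a spawned goroutine panics and nothing on that goroutine recovers, the runtime terminates the whole process and the supervision tree dies with it. If you recover but skip the code that closes the done channel, the supervisor waits on a channel that never closes and hangs. Always recover in a deferred function inside the spawned goroutine, or use a wrapper function that does it for you.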
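A wrapper function might look like the sketch below. runRecovered is a hypothetical helper, not a standard library function. It runs a function on a new goroutine and turns a panic into an error value the supervisor can inspect.

```go
package main

import "fmt"

// runRecovered is a hypothetical helper. It runs fn on a new goroutine and
// sends exactly one value on the returned channel: nil if fn returned
// normally, or an error describing the panic if fn panicked.
func runRecovered(fn func()) <-chan error {
	result := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() {
		defer func() {
			// recover only takes effect here: in a deferred function
			// running on the goroutine that panicked.
			if r := recover(); r != nil {
				result <- fmt.Errorf("worker panicked: %v", r)
				return
			}
			result <- nil
		}()
		fn()
	}()
	return result
}

func main() {
	err := <-runRecovered(func() { panic("bad payload") })
	fmt.Println(err) // worker panicked: bad payload

	err = <-runRecovered(func() {})
	fmt.Println(err) // <nil>
}
```

Because the result channel always receives one value, the supervisor can never hang waiting on a worker that panicked.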
Goroutine leaks. If you spawn a new worker but the old worker is still running, you leak goroutines. The supervisor must wait for the old worker to finish before spawning a new one. The select on the done channel ensures this. If you forget to close the channel, the supervisor waits forever. The compiler cannot catch this. You must ensure every exit path in the worker closes the channel, including the path taken after a recovered panic. A deferred close is the simplest way to guarantee it.
Context cancellation. If the parent context cancels, the supervisor must stop spawning new workers. The supervisor must also wait for the current worker to finish. If the supervisor returns immediately, the worker goroutine might still be running. This is a goroutine leak. The supervisor should wait for the final worker to finish before returning.
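One way to check that shutdown does not leak is to count live workers. The sketch below is an illustration under assumptions: the running counter, supervise, and leakedAfterShutdown are hypothetical names, and the worker just blocks until cancellation and then simulates a short cleanup.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// running counts live worker goroutines so the example can check for leaks.
var running atomic.Int32

// worker blocks until ctx is cancelled, then simulates a short cleanup.
func worker(ctx context.Context) {
	running.Add(1)
	defer running.Add(-1)
	<-ctx.Done()
	time.Sleep(50 * time.Millisecond)
}

// supervise restarts worker until ctx is cancelled. It returns only after
// the last worker it started has exited.
func supervise(ctx context.Context) {
	for {
		done := make(chan struct{})
		go func() {
			defer close(done)
			worker(ctx)
		}()
		select {
		case <-done:
			// The worker returned on its own.
		case <-ctx.Done():
			// Shutting down: wait for the in-flight worker before returning.
			<-done
		}
		if ctx.Err() != nil {
			return
		}
	}
}

// leakedAfterShutdown runs a supervisor briefly, cancels it, waits for it to
// return, and reports how many workers are still running.
func leakedAfterShutdown() int32 {
	ctx, cancel := context.WithCancel(context.Background())
	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		supervise(ctx)
	}()
	time.Sleep(10 * time.Millisecond)
	cancel()
	<-stopped
	return running.Load()
}

func main() {
	fmt.Println("workers still running:", leakedAfterShutdown()) // 0
}
```

If you remove the <-done in the cancellation case, supervise can return while the worker is still cleaning up, and the count can come back as 1.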
Compiler errors. If you forget to pass the context to the worker, the compiler rejects the program with not enough arguments in call to worker. If you pass the wrong type, for example an int where the context belongs, you get a message that starts with cannot use id (variable of type int) as context.Context value in argument to worker. If you use a package without importing it, you get undefined: followed by the package name. If you import a package and do not use it, you get imported and not used. Go errors are plain text. Read them carefully. They tell you exactly what is wrong.
Error handling. Go does not have exceptions. Errors are values. If the worker returns an error, the supervisor should handle it. You can pass errors through a channel. You can wrap errors with fmt.Errorf. The community accepts verbose error handling because it makes the unhappy path visible. if err != nil { return err } is boilerplate that keeps bugs out.
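Here is a sketch of a worker that reports failure as a value. Everything in it is illustrative: errBadPayload, message, handle, and runOnce are hypothetical names. Note that json.Unmarshal returns an error for malformed input rather than panicking, so a bad payload can travel back to the supervisor on a channel.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errBadPayload is a hypothetical sentinel the supervisor can test for.
var errBadPayload = errors.New("bad payload")

// message is a made-up payload shape for this example.
type message struct {
	ID int `json:"id"`
}

// handle decodes one payload. Malformed JSON comes back as an error value.
func handle(payload []byte) error {
	var m message
	if err := json.Unmarshal(payload, &m); err != nil {
		return fmt.Errorf("%w: %v", errBadPayload, err)
	}
	return nil
}

// worker processes payloads in order and sends one result on errc:
// the first failure, wrapped with its position, or nil if all succeeded.
func worker(payloads [][]byte, errc chan<- error) {
	for i, p := range payloads {
		if err := handle(p); err != nil {
			errc <- fmt.Errorf("message %d: %w", i, err)
			return
		}
	}
	errc <- nil
}

// runOnce plays the supervisor's part for a single worker run. It reports
// whether the worker failed and whether the cause was a bad payload.
func runOnce(payloads ...string) (failed, badPayload bool) {
	raw := make([][]byte, len(payloads))
	for i, p := range payloads {
		raw[i] = []byte(p)
	}
	errc := make(chan error, 1)
	go worker(raw, errc)
	err := <-errc
	if err != nil {
		fmt.Println("worker failed:", err)
	}
	return err != nil, errors.Is(err, errBadPayload)
}

func main() {
	fmt.Println(runOnce(`{"id":1}`, `{"id":`)) // true true
	fmt.Println(runOnce(`{"id":1}`))           // false false
}
```

Wrapping with %w keeps the original cause reachable through errors.Is. That lets the supervisor decide between skipping a bad message and restarting the worker.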
The worst goroutine bug is the one that never logs. Always log when a worker crashes. Always log when the supervisor restarts. Logs are your only way to debug supervision issues in production.
When to use a supervisor tree
Go gives you many ways to manage concurrency. You choose the tool that matches your problem.
Use a supervisor tree when you need automatic restarts for long-running workers. Use a supervisor tree when a worker crash should not kill the entire service. Use a supervisor tree when you want to isolate failures and keep the system alive.
Use errgroup (from golang.org/x/sync/errgroup) when you want to wait for a group of tasks and propagate the first error. Use errgroup when tasks are independent and you do not need to restart them. Use errgroup with errgroup.WithContext when you want to cancel all tasks if one fails.
Use a single goroutine when the task is fire-and-forget and does not need recovery. Use a single goroutine when the task is short-lived and crashes are acceptable. Use a single goroutine when you do not need to monitor the task.
Use an external process manager when you need to manage the entire binary, not just internal goroutines. Use an external process manager when you need to restart the process on memory leaks or configuration changes. Use an external process manager when you want to separate supervision from application logic.
Trust the primitives. Goroutines, channels, and context are enough to build reliable systems. You do not need a framework. You do not need a library. You write the loop. You handle the errors. You own the lifecycle.