How to Debug Goroutine Leaks in Go

Debug goroutine leaks in Go by capturing a profile with runtime/pprof and analyzing the stack traces using go tool pprof.

The staircase graph

Your Go service has been running for three weeks. The load balancer reports healthy status codes, but the memory usage graph looks like a staircase climbing toward the ceiling. CPU usage is near zero. The service isn't crashing; it's slowly suffocating. You restart the pod, and the graph resets. Two weeks later, the climb starts again.

This is the signature of a goroutine leak. Goroutines are cheap, but they are not free. When they pile up without finishing, they hold onto memory and file descriptors until the process runs out of resources. Debugging a leak means finding the goroutine that refuses to die and understanding why it's stuck.

What a leak actually is

A goroutine leak happens when you spawn a goroutine but forget to give it a way to exit. The goroutine stays alive, blocked on something that never arrives. It holds a reference to the data it captured, so the garbage collector can't reclaim that memory. Over time, thousands of these zombie goroutines accumulate.

Think of a restaurant kitchen. Every customer order spawns a cook. The cook grabs ingredients, makes the dish, serves it, and goes back to the break room. If the ticket machine stops sending "done" signals, the cooks stand at the station forever, holding their spatulas and ingredients. The kitchen fills up with cooks doing nothing. Eventually, there's no room for new orders.

The leak isn't the cooks working too hard. It's the cooks who never get sent home. In Go, the "ticket" is usually a channel send or a context cancellation. When the signal never comes, the goroutine blocks indefinitely.

Goroutines are cheap. Leaks are expensive.

Minimal leak pattern

Here's the simplest leak: a goroutine waiting on a channel that no one ever sends on or closes.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Channel created but never sent to.
	done := make(chan struct{})

	// Goroutine blocks forever waiting on done.
	go func() {
		<-done
		fmt.Println("This never prints")
	}()

	// Main goroutine sleeps to show the leak persists.
	time.Sleep(2 * time.Second)
	fmt.Println("Program exits, leaked goroutine is abandoned")
}

The code creates a channel and launches a goroutine. The goroutine hits <-done and blocks, waiting for a value that never comes. The main function sleeps for two seconds. During those two seconds both goroutines are parked: main is waiting on a timer that will fire, and the other is waiting on a channel receive that never completes. The blocked goroutine sits in the channel's wait queue, consuming a small amount of stack memory.

When main finishes, the process terminates. The leaked goroutine is killed instantly. In a long-running server, main never finishes. The goroutine sits in the wait queue forever. Every time this code path executes, another goroutine joins the queue. The queue grows. Memory grows.
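The growth described above can be observed directly with runtime.NumGoroutine. The following is a minimal sketch, not part of the original example: leakDemo is a hypothetical helper that starts n goroutines blocked on the same kind of receive, counts them, and then closes the channel to show that giving them an exit signal releases them.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakDemo starts n goroutines that block on a channel, then closes the
// channel to release them. It returns how many extra goroutines existed
// while they were blocked and how many remained after the close.
func leakDemo(n int) (blocked, released int) {
	base := runtime.NumGoroutine()
	done := make(chan struct{})

	for i := 0; i < n; i++ {
		go func() {
			<-done // same blocking receive as in the minimal leak
		}()
	}
	blocked = runtime.NumGoroutine() - base

	// Closing the channel wakes every receiver at once.
	close(done)

	// Give the goroutines a moment to exit before counting again.
	deadline := time.Now().Add(time.Second)
	for runtime.NumGoroutine() > base && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	released = runtime.NumGoroutine() - base
	return blocked, released
}

func main() {
	blocked, released := leakDemo(1000)
	fmt.Println("extra goroutines while blocked:", blocked)
	fmt.Println("extra goroutines after close:", released)
}
```

Without the close(done) call, the first number is what a long-running server would carry around for the rest of its life.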

A blocked goroutine is a resource held forever.

How the scheduler handles stuck goroutines

Here's the surprise: the runtime scheduler does not detect leaks. It assumes every blocked goroutine is waiting for valid work. The scheduler will happily manage thousands of stuck goroutines until the process runs out of memory. The runtime trusts you to provide an exit path. When you forget that path, the runtime keeps waiting, too.

Each goroutine starts with a small stack, usually 2 kilobytes. The stack grows if the goroutine needs more space. A leaked goroutine keeps its stack alive. If the goroutine captured a large slice or map in a closure, that data stays in memory as long as the goroutine exists. The leak amplifies when the closure holds heavy references.
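To make the closure point concrete, here is a rough sketch that measures the live heap while a blocked goroutine holds a large buffer, and again after the goroutine is released. The names heapMiB, startHolder, and pinDemo are made up for this illustration, and the exact numbers will vary with the Go version and whatever else the process has allocated.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps the compiler from optimizing the buffer away.
var sink byte

// heapMiB forces a garbage collection and returns the heap in use, in MiB.
func heapMiB() int64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return int64(m.HeapAlloc >> 20)
}

// startHolder launches a goroutine whose closure captures a buffer of the
// given size. The goroutine blocks until release is closed. The returned
// channel is closed when the goroutine exits.
func startHolder(sizeMiB int, release <-chan struct{}) <-chan struct{} {
	buf := make([]byte, sizeMiB<<20)
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		<-release
		sink = buf[0] // buf stays reachable until this line has run
	}()
	return exited
}

// pinDemo reports heap growth while the goroutine is blocked and after it
// has exited, both relative to the heap size at the start.
func pinDemo(sizeMiB int) (held, after int64) {
	base := heapMiB()
	release := make(chan struct{})
	exited := startHolder(sizeMiB, release)

	held = heapMiB() - base // the collector ran, but the buffer survived

	close(release)
	<-exited
	after = heapMiB() - base // the buffer is now unreachable and collected
	return held, after
}

func main() {
	held, after := pinDemo(64)
	fmt.Printf("held while blocked: ~%d MiB, after exit: ~%d MiB\n", held, after)
}
```

The goroutine's own stack is tiny. The cost comes from everything the blocked goroutine can still reach.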

The scheduler marks the goroutine as "waiting" and moves on. It doesn't log a warning. It doesn't print an error. The leak is silent until the resource limit hits. You only notice when the memory graph climbs or the process gets OOM-killed.

Trust the scheduler. It does exactly what you told it to do.

Realistic leak in an HTTP handler

Real leaks hide in HTTP handlers or background workers. A common pattern is spawning a goroutine to do work and forgetting to handle cancellation.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// HandleRequest starts background work and returns without waiting for it.
func HandleRequest(w http.ResponseWriter, r *http.Request) {
	// r.Context() carries the client's cancellation signal,
	// but the goroutine below never looks at it.
	go func() {
		// Simulates long-running work that never checks for cancellation.
		time.Sleep(10 * time.Second)
		fmt.Println("Work finished")
	}()

	// Handler returns immediately, but the goroutine keeps running.
	w.WriteHeader(http.StatusOK)
}

The handler launches a goroutine and returns. The goroutine runs for the full ten seconds even if the client disconnected long ago. At 1,000 requests per second that is roughly 10,000 goroutines in flight at any moment, each holding whatever its closure captured. Replace the sleep with a call that can block indefinitely, such as a channel receive or a network read without a deadline, and those goroutines never exit at all.

The fix uses select to watch the context. The goroutine finishes when the ten-second timer fires, or exits early when ctx.Done() is closed.

// HandleRequestFixed demonstrates proper context usage.
func HandleRequestFixed(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	go func() {
		select {
		case <-time.After(10 * time.Second):
			fmt.Println("Work finished")
		case <-ctx.Done():
			fmt.Println("Work cancelled")
		}
	}()

	w.WriteHeader(http.StatusOK)
}

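The fixed version respects cancellation. The goroutine exits as soon as the context is done, so nothing is left behind. Keep in mind that net/http cancels r.Context() when the handler returns as well as when the client disconnects, so in this shape the goroutine is cancelled almost immediately. The pattern fits work that should stop with the request.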
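For work that has to outlive the request, one option is a context that is detached from the request's cancellation but still bounded by its own deadline, since the request context ends when the handler returns. The sketch below assumes Go 1.21 or later for context.WithoutCancel. The names detach, work, and detachDemo are hypothetical, and the cancel() call stands in for what net/http does when a handler returns.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// detach returns a context that keeps parent's values but ignores its
// cancellation, and that expires on its own after limit.
func detach(parent context.Context, limit time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), limit)
}

// work waits for d or for ctx to end, whichever comes first.
func work(ctx context.Context, d time.Duration) string {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return "finished"
	case <-ctx.Done():
		return "cancelled: " + ctx.Err().Error()
	}
}

// detachDemo simulates a handler whose request context is cancelled on return.
func detachDemo() (attached, detached, expired string) {
	reqCtx, cancel := context.WithCancel(context.Background())

	bg, stop := detach(reqCtx, time.Second)
	defer stop()
	short, stopShort := detach(reqCtx, 5*time.Millisecond)
	defer stopShort()

	cancel() // stands in for the handler returning

	attached = work(reqCtx, 20*time.Millisecond) // request context: already cancelled
	detached = work(bg, 20*time.Millisecond)     // detached: runs to completion
	expired = work(short, 200*time.Millisecond)  // detached, but its own deadline fires
	return attached, detached, expired
}

func main() {
	attached, detached, expired := detachDemo()
	fmt.Println("request context: ", attached)
	fmt.Println("detached context:", detached)
	fmt.Println("short deadline:  ", expired)
}
```

The deadline is the important part. A detached context with no timeout removes the only exit signal the goroutine had.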

Convention note: context.Context always goes as the first parameter, conventionally named ctx. Functions that take a context should respect cancellation and deadlines.

Context is plumbing. Run it through every long-lived call site.

Capturing a goroutine profile

Finding the leak requires looking at the stack traces of running goroutines. The standard library provides runtime/pprof for this, and net/http/pprof exposes the same profiles over HTTP so you can capture a snapshot from a running service.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // init registers the /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
	// The blank import above is the standard way to expose profiling
	// endpoints, including /debug/pprof/goroutine. Registering the same
	// paths again by hand can panic with a duplicate-registration error.

	// Serve on localhost for safety. A nil handler means http.DefaultServeMux.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}

The code registers the standard pprof handlers. The /debug/pprof/goroutine endpoint captures the current goroutine stack traces. Fetch the profile while the leak is happening.

# Fetch a full text dump while the leak is happening.
# Quote the URL so the shell does not treat ? as a glob character.
curl -o goroutines.txt 'http://localhost:6060/debug/pprof/goroutine?debug=2'

The debug=2 query parameter returns plain text with one stack trace per goroutine. Each trace starts with a header such as goroutine 18 [chan receive, 5 minutes]:, which gives the goroutine ID, its status, and, for long waits, how long it has been blocked. Look for chan receive or select statuses that repeat hundreds of times.

For a ranked view, use go tool pprof. It cannot parse the debug=2 text, so fetch the binary profile by leaving the debug parameter off.

# Fetch the binary profile and list the top call sites.
curl -o goroutine.prof http://localhost:6060/debug/pprof/goroutine
go tool pprof -top goroutine.prof

The tool prints a table sorted by goroutine count. The top rows show the functions where goroutines are stuck. For the exact blocking line of each goroutine, go back to the text dump.
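When there is no HTTP endpoint to curl, for example in a batch job or a test, the same text dump can be produced in-process with runtime/pprof. This is a small sketch; dumpGoroutines is a hypothetical helper name, while pprof.Lookup and Profile.WriteTo are the standard library calls.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

// dumpGoroutines returns a stack trace for every goroutine, in the same
// text format that the goroutine endpoint serves with debug=2.
func dumpGoroutines() (string, error) {
	var buf bytes.Buffer
	// The second argument is the debug level: 2 means full text stacks.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := dumpGoroutines()
	if err != nil {
		fmt.Fprintln(os.Stderr, "goroutine dump failed:", err)
		os.Exit(1)
	}
	fmt.Print(dump)
}
```

Writing the dump to a log on a signal or an admin command is one way to get a snapshot from a process that exposes no ports.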

Trust pprof. The stack trace tells the truth.

Reading the stack trace

The text profile lists every goroutine. A healthy service shows a few goroutines in various states. A leaked service shows hundreds or thousands of goroutines stuck in the same function.

Look for the pattern. If you see 500 goroutines sitting in net/http.(*conn).serve, the server is holding that many open connections, so check for missing timeouts or clients that never close. If you see them in time.Sleep, look for a polling loop with no exit condition or a very long sleep. If they're blocked on a channel receive, check who is supposed to send.

The stack trace shows the call chain. The top frame is where the goroutine is blocked. The frames below show how it got there. In the debug=2 output, each trace ends with a created by line naming the function that started the goroutine. Follow the chain back to that code. That's where the leak originates.

Convention note: The receiver name is usually one or two letters matching the type. In the HTTP example, the handler method would likely use (s *Server) HandleRequest. Don't use this or self. Go convention prefers short names.
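With thousands of traces, counting by status is faster than scrolling. The following sketch tallies the header lines of a debug=2 dump. countByState is a hypothetical helper, and the sample text is invented for illustration (the function names, paths, and offsets are not from a real service).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// sample is a made-up excerpt in the debug=2 layout.
const sample = `goroutine 1 [running]:
main.main()
	/app/main.go:12 +0x1d

goroutine 18 [chan receive, 5 minutes]:
main.worker(...)
	/app/worker.go:30 +0x45
created by main.HandleRequest in goroutine 7
	/app/handler.go:21 +0x9a

goroutine 19 [chan receive, 6 minutes]:
main.worker(...)
	/app/worker.go:30 +0x45
created by main.HandleRequest in goroutine 9
	/app/handler.go:21 +0x9a
`

// countByState tallies goroutines by the status in each header line, so
// "goroutine 18 [chan receive, 5 minutes]:" counts under "chan receive".
func countByState(dump string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "goroutine ") || !strings.HasSuffix(line, "]:") {
			continue // not a header line
		}
		open := strings.Index(line, "[")
		if open < 0 {
			continue
		}
		state := line[open+1 : len(line)-2]
		// Drop the wait duration so entries with different ages group together.
		if comma := strings.Index(state, ","); comma >= 0 {
			state = state[:comma]
		}
		counts[state]++
	}
	return counts
}

func main() {
	counts := countByState(sample)
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Strings(states)
	for _, s := range states {
		fmt.Printf("%-14s %d\n", s, counts[s])
	}
}
```

For the sample, this reports two goroutines in chan receive and one running. On a leaking service, the status with the runaway count tells you which traces to read first.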

The stack trace is a map. Follow it to the source.

Pitfalls and runtime errors

Debugging leaks involves watching out for specific patterns. Sometimes the runtime helps by crashing loudly. Other times it stays silent.

If every goroutine in the program is blocked, for example a lone goroutine waiting on a channel that no one sends on, the runtime detects a deadlock and crashes with fatal error: all goroutines are asleep - deadlock!. This is actually helpful. It means the bug is immediate and obvious. Leaks are worse because the rest of the program keeps running, so the check never fires.

Sending on a closed channel panics. The runtime stops the program with panic: send on closed channel. Go has no safe way to ask whether a channel is closed before sending, and a select with a default case does not help, because a send on a closed channel panics inside select too. The rule is that only the sender closes the channel, after its last send. With several senders, close the channel once all of them have finished.
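One common way to close a channel safely when several goroutines send on it is to let a sync.WaitGroup track the senders and close only after all of them have returned. This is a sketch of that pattern; fanIn and sum are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// fanIn starts one sender per input slice and closes the output channel
// only after every sender has returned, so no send can reach a closed channel.
func fanIn(inputs [][]int) <-chan int {
	out := make(chan int)
	var wg sync.WaitGroup

	for _, in := range inputs {
		wg.Add(1)
		go func(vals []int) {
			defer wg.Done()
			for _, v := range vals {
				out <- v
			}
		}(in)
	}

	// A single closer goroutine waits for all senders, then closes once.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// sum drains the channel until it is closed.
func sum(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sum(fanIn([][]int{{1, 2}, {3}, {4, 5, 6}}))) // prints 21
}
```

Note the flip side: if the receiver stops reading early, the senders block on out forever. A receiver that can quit early should also give the senders a cancellation signal, such as a context.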

Convention note: if err != nil { return err } is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. When debugging leaks, check error paths. A goroutine might exit early on error, leaving a partner goroutine waiting forever.
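The early-exit case looks roughly like the sketch below, which also shows one way to defuse it. The function process and the error errEmpty are invented for this example. The worker sends its result on a channel with capacity 1, so the send completes even when the caller has already returned on the error path.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var errEmpty = errors.New("empty input")

// process starts a worker and then validates the input. With an unbuffered
// channel, the early return would strand the worker on its send forever.
// Capacity 1 lets the worker finish whether or not anyone receives.
func process(input string) (string, error) {
	results := make(chan string, 1)

	go func() {
		results <- strings.ToUpper(input) // never blocks: the buffer has room
	}()

	if input == "" {
		return "", errEmpty // early exit: nobody receives, the worker still exits
	}
	return <-results, nil
}

func main() {
	out, err := process("ok")
	fmt.Println(out, err)

	out, err = process("")
	fmt.Println(out, err)
}
```

A buffer only works when the number of sends is known in advance. For open-ended streams, pass a context and have the worker select on ctx.Done() alongside the send.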

The worst goroutine bug is the one that never logs.

Decision matrix

Use runtime/pprof when you need to capture a snapshot of all goroutine stacks to find repeating patterns. Use go tool pprof -top when you want to rank goroutines by count and identify the hottest call sites. Use a memory profile when goroutine counts look normal but memory keeps climbing, suggesting the goroutines are holding large allocations. Use context.Context cancellation when you design new code, to prevent leaks before they start. Use a leak detector in your test suite when you want to catch leaks during development rather than in production.
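A test-suite leak check can be as simple as comparing runtime.NumGoroutine before and after the code under test. The sketch below is a deliberately crude version, assuming nothing else in the process starts or stops goroutines during the check. The names leakedBy, leaky, clean, and grace are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// grace is how long leakedBy waits for goroutines to wind down.
const grace = 200 * time.Millisecond

// leakedBy runs f and reports how many more goroutines exist afterwards
// than before, after giving them up to grace to exit.
func leakedBy(f func()) int {
	before := runtime.NumGoroutine()
	f()
	deadline := time.Now().Add(grace)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

// leaky starts three goroutines that wait on a channel nobody closes.
func leaky() {
	never := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() { <-never }()
	}
}

// clean starts three goroutines and waits for each of them to report back.
func clean() {
	done := make(chan struct{})
	for i := 0; i < 3; i++ {
		go func() { done <- struct{}{} }()
	}
	for i := 0; i < 3; i++ {
		<-done
	}
}

func main() {
	fmt.Println("clean leaked:", leakedBy(clean))
	fmt.Println("leaky leaked:", leakedBy(leaky))
}
```

In a real test, a nonzero result would fail the test and print a goroutine dump so the stuck stacks are visible right away.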

Prevent leaks with context. Debug the rest with pprof.

Where to go next