How to Use go tool pprof for Performance Profiling

The slow request that breaks the SLA

You ship a new feature. Locally it runs in twenty milliseconds. In production, the same endpoint stalls at four hundred. The logs show nothing unusual. The database queries look fine. You stare at the code and trace the execution path. This is the exact moment profiling earns its keep. go tool pprof is the standard way to measure where a Go program spends its CPU cycles, allocates memory, or blocks on synchronization. It turns vague suspicion into exact function names and line numbers.

What profiling actually measures

Profiling works by sampling. Instead of tracing every single instruction, the runtime pauses your program at regular intervals and records the current call stack. If you sample every ten milliseconds, a function that appears in half the samples is consuming roughly half your CPU time. Think of it like a security camera that takes a snapshot every few seconds. You never see the exact moment someone walks through the door, but if the camera catches them in the hallway every time it clicks, you know they are lingering there.
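
The arithmetic behind that estimate is small enough to sketch. The snippet below is illustrative only: the helper name estimateCPU and the sample counts are made up, and it assumes a fixed ten-millisecond sampling interval.

```go
package main

import (
    "fmt"
    "time"
)

// estimateCPU converts a number of samples into approximate CPU time,
// assuming every sample stands for one full sampling interval.
func estimateCPU(samples int, interval time.Duration) time.Duration {
    return time.Duration(samples) * interval
}

func main() {
    const interval = 10 * time.Millisecond // assumed sampling interval
    total := 300                           // hypothetical samples in the whole profile
    inFunc := 150                          // hypothetical samples that include one function

    fmt.Println(estimateCPU(inFunc, interval))                 // prints 1.5s
    fmt.Printf("%.0f%%\n", 100*float64(inFunc)/float64(total)) // prints 50%
}
```

The result is an estimate, not a measurement. With only a handful of samples, the percentages can swing widely from one run to the next.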

Go ships with built-in support for this in two packages. runtime/pprof writes profile data to a file, or to any io.Writer, from inside the program. net/http/pprof exposes HTTP endpoints that stream the same data on request. Both produce the pprof format. The go tool pprof command reads that data from a file or downloads it from a URL, parses it, and gives you a terminal interface to explore the hotspots. You do not need to install third-party profilers to get started. The tool ships with the Go toolchain. Run go tool pprof -h to see the available flags.

Sampling gives you a map. The map is only as good as the traffic you drive while it is recording.

Your first CPU profile

Here is the simplest way to generate a CPU profile for a standalone program. The program starts the profiler itself through runtime/pprof and writes the samples to a file. There is no -cpuprofile flag on go run; that flag belongs to go test.

// main.go
package main

import (
    "fmt"
    "log"
    "os"
    "runtime/pprof"
)

// HeavyLoop simulates a CPU-bound task that burns cycles.
func HeavyLoop() int {
    sum := 0
    // Run a tight loop to generate measurable CPU samples.
    for i := 0; i < 100_000_000; i++ {
        sum += i
    }
    return sum
}

func main() {
    // Create the file that will receive the CPU samples.
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Start sampling. StopCPUProfile flushes the data when main returns.
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // Print the result so the compiler cannot discard the loop.
    result := HeavyLoop()
    fmt.Println(result)
}

Run the program, which writes cpu.prof into the current directory, then open the file with pprof. If the profile contains only a handful of samples, the loop finished too quickly; raise the iteration count and run it again.

go run main.go
go tool pprof cpu.prof

The second command opens an interactive shell. You are now inside the profiler. Type top to see the functions consuming the most CPU time. The output ranks functions by flat time and cumulative time. Flat time measures how long the function itself ran. Cumulative time adds up the time spent in that function plus everything it called. If HeavyLoop shows high flat time, it is doing the work. If it shows high cumulative time but low flat time, it is spending its budget calling other functions.
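
To make the two columns concrete, here is a minimal sketch of how flat and cumulative counts can be derived from sampled stacks. It is not how pprof is implemented. The flatAndCum helper and the four hand-written stacks are invented for illustration, and each stack is assumed to list frames from the running function (the leaf) up to main.

```go
package main

import "fmt"

// flatAndCum tallies samples per function. Each stack lists frames from the
// leaf (index 0, the function that was actually running) up to the root.
func flatAndCum(stacks [][]string) (flat, cum map[string]int) {
    flat = map[string]int{}
    cum = map[string]int{}
    for _, stack := range stacks {
        if len(stack) == 0 {
            continue
        }
        // Flat: only the function on top of the stack gets the sample.
        flat[stack[0]]++
        // Cumulative: every distinct function on the stack gets the sample once.
        seen := map[string]bool{}
        for _, fn := range stack {
            if !seen[fn] {
                seen[fn] = true
                cum[fn]++
            }
        }
    }
    return flat, cum
}

func main() {
    // Hypothetical samples, written by hand for this example.
    stacks := [][]string{
        {"HeavyLoop", "main"},
        {"HeavyLoop", "main"},
        {"fmt.Println", "main"},
        {"main"},
    }
    flat, cum := flatAndCum(stacks)
    for _, fn := range []string{"main", "HeavyLoop", "fmt.Println"} {
        fmt.Printf("%-12s flat=%d cum=%d\n", fn, flat[fn], cum[fn])
    }
}
```

In this toy data, main has a flat count of 1 but a cumulative count of 4, which is the signature of an orchestrator. HeavyLoop has equal flat and cumulative counts, which is the signature of a function doing its own work.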

Flat time tells you where the work happens. Cumulative time tells you where the responsibility lies.

Reading the interactive shell

The interactive shell has a few commands that matter most. top gives you a ranked table. list <function> drops you into a source view with a line-by-line breakdown, where the columns next to each line show the flat and cumulative time attributed to that line. peek shows the callers and callees of a function without leaving the current context. web renders a call graph with Graphviz and opens it in your browser, provided the dot executable is installed. In that graph, larger boxes mean more flat time, and thicker edges mean more time spent along that call path. Edge thickness does not count how often a call happened.

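You can also filter the view. Filters are options, set with name=value. Type focus=HeavyLoop to keep only samples whose stacks pass through that function. Type ignore=fmt to drop samples that touch standard library formatting. Both values are regular expressions, and an empty value such as focus= clears the filter. By default, percentages stay relative to the whole profile, which keeps a filtered subsystem in proportion. Set relative_percentages=true if you want them recalculated against the filtered set instead.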

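Profiling is not magic. It measures what the runtime records. If you are measuring CPU time, the profiler samples the CPU. If you are measuring memory, it samples heap allocations. If you are measuring blocking, it samples goroutines waiting on channels or mutexes. The CPU profiler defaults to 100 samples per second, one every ten milliseconds. The heap profiler samples on average one allocation per 512 KiB allocated, controlled by runtime.MemProfileRate. Block and mutex profiling are off until you call runtime.SetBlockProfileRate or runtime.SetMutexProfileFraction. The -sample_index flag does not change any of these rates. It selects which recorded value a report displays, such as inuse_space or alloc_space in a heap profile. You can change the CPU rate with runtime.SetCPUProfileRate, but the defaults suit most workloads.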
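
The sampling rates are adjusted from inside the program, through the runtime package. The sketch below shows one way to do that. The helper names setMemRate and enableContentionProfiles are hypothetical, and the values passed in main are arbitrary picks for illustration, not recommendations.

```go
package main

import (
    "fmt"
    "runtime"
)

// setMemRate changes how often the heap profiler samples and returns the old
// rate. The rate is the average number of allocated bytes between samples.
func setMemRate(bytes int) (previous int) {
    previous = runtime.MemProfileRate
    runtime.MemProfileRate = bytes
    return previous
}

// enableContentionProfiles switches on the block and mutex profiles.
// blockRateNanos aims for one sample per that many nanoseconds spent blocked;
// mutexFraction reports on average 1 in mutexFraction contention events.
func enableContentionProfiles(blockRateNanos, mutexFraction int) {
    runtime.SetBlockProfileRate(blockRateNanos)
    runtime.SetMutexProfileFraction(mutexFraction)
}

func main() {
    // Set rates as early as possible so the whole run is sampled consistently.
    old := setMemRate(256 * 1024) // arbitrary example value
    fmt.Println("previous heap sampling rate (bytes):", old)

    enableContentionProfiles(10_000, 5) // arbitrary example values
}
```

Sampling more often gives finer detail and costs more overhead, so it is usually worth changing one rate at a time and comparing the profiles.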

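Go developers run gofmt on every file before committing, and the community expects you to let the tool decide indentation. Formatting has no effect on the profiler, though. pprof maps samples back to files and line numbers through the symbol information recorded with the binary and the profile, not by reading your layout. What does matter for the list command is that the source on disk matches the code that was actually profiled. Otherwise the line numbers point at the wrong statements.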

The shell is a microscope. Focus on the function, not the noise.

Profiling a real HTTP server

Real applications rarely run as a single main function. They serve HTTP requests, process messages, or run background workers. Here is how you attach profiling to a standard HTTP server. The net/http/pprof package registers routes under /debug/pprof/ automatically. You just need to import it with a blank identifier.

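// server.go
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/ endpoints on http.DefaultServeMux
)

// HandleRequest simulates a slow endpoint that builds and marshals a large JSON payload.
func HandleRequest(w http.ResponseWriter, r *http.Request) {
    // Generate a large payload to trigger memory and CPU samples.
    data := make([]map[string]string, 10000)
    for i := range data {
        data[i] = map[string]string{"id": fmt.Sprintf("item-%d", i)}
    }
    // Marshal the payload so the handler does the JSON work it is meant to simulate.
    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(data); err != nil {
        log.Printf("encode response: %v", err)
    }
}

func main() {
    http.HandleFunc("/slow", HandleRequest)
    // Bind to localhost only so the debug endpoints are not reachable from outside.
    log.Fatal(http.ListenAndServe("localhost:8080", nil))
}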

Start the server in one terminal. Hit the endpoint repeatedly in another terminal or with a load generator. While the traffic flows, open a third terminal and run:

go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

The seconds parameter tells the profiler to sample for thirty seconds before returning. During that window, the runtime captures CPU samples across all goroutines handling requests. When the download finishes, the interactive shell opens with the aggregated data. You can immediately see which handler functions are burning cycles. If HandleRequest dominates the top output, you know the bottleneck lives in your application code, not the network stack.
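
The traffic that feeds the profile can come from any load generator. If none is at hand, a few lines of Go will do. The sketch below is one possible shape, not a benchmark harness: hammer and demo are hypothetical names, and demo points the generator at a throwaway in-process server so the example runs on its own. Against the server above you would pass the /slow URL to hammer instead.

```go
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/http/httptest"
    "sync"
    "sync/atomic"
)

// hammer sends perWorker GET requests from each of workers goroutines and
// returns how many responses came back with status 200.
func hammer(url string, workers, perWorker int) int64 {
    var ok atomic.Int64
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < perWorker; i++ {
                resp, err := http.Get(url)
                if err != nil {
                    continue
                }
                io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
                resp.Body.Close()
                if resp.StatusCode == http.StatusOK {
                    ok.Add(1)
                }
            }
        }()
    }
    wg.Wait()
    return ok.Load()
}

// demo runs hammer against a stand-in server so the sketch is self-contained.
func demo() int64 {
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, "done")
    }))
    defer srv.Close()
    return hammer(srv.URL, 4, 5)
}

func main() {
    fmt.Println("successful requests:", demo())
}
```

Keep the generator running for the whole sampling window. A profile captured while the server sits idle mostly describes the runtime waiting for work.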

Memory profiling follows the same pattern. Replace profile with heap in the URL. The heap profile shows sampled allocations, their sizes, and the call stacks that created them. It does not measure garbage collection pauses. It measures allocation pressure. High allocation pressure forces the garbage collector to run more often, which adds latency. You can also profile blocking goroutines with block in the URL. That profile shows where goroutines wait on channels or mutexes, but it stays empty unless the program has first enabled it by calling runtime.SetBlockProfileRate with a positive rate.
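
Programs without an HTTP server can produce the same heap profile through runtime/pprof. The following is a minimal sketch under a few assumptions: heapSnapshot and retained are names invented here, and the 64 MiB of retained buffers exists only to give the profile something to show.

```go
package main

import (
    "bytes"
    "fmt"
    "runtime"
    "runtime/pprof"
)

// retained keeps allocations reachable so they appear as in-use memory.
var retained [][]byte

// heapSnapshot returns the current heap profile in pprof's binary format.
func heapSnapshot() ([]byte, error) {
    // Run a collection first so the profile reflects up-to-date statistics.
    runtime.GC()
    var buf bytes.Buffer
    if err := pprof.Lookup("heap").WriteTo(&buf, 0); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

func main() {
    // Allocate some memory worth profiling.
    for i := 0; i < 64; i++ {
        retained = append(retained, make([]byte, 1<<20))
    }

    snap, err := heapSnapshot()
    if err != nil {
        fmt.Println("heap profile failed:", err)
        return
    }
    fmt.Println("heap profile size in bytes:", len(snap))
}
```

Writing the returned bytes to a file, for example with os.WriteFile, gives you something go tool pprof can open exactly like a downloaded profile.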

Functions that take a context should always put it as the first parameter, conventionally named ctx. When you profile an HTTP handler, the context is already attached to the request. Pass r.Context() to any downstream call so cancellation propagates. A CPU profile will not reveal a goroutine stuck waiting on a context that never cancels, because a waiting goroutine burns no CPU. The goroutine profile at /debug/pprof/goroutine will: it lists every live goroutine with the stack where it is parked.

Heap profiles show allocation pressure. Fix the pressure before you fight the garbage collector.

Common traps and what the tool tells you

Profiling introduces overhead. CPU sampling interrupts the program briefly for every sample. Memory profiling adds bookkeeping to every sampled allocation. In production, you usually want to limit profiling to a single instance, or use a load balancer to route a small percentage of traffic to the profiled node. Running full CPU profiling on every node in a cluster will skew your metrics and waste resources.
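
Limiting exposure also means deciding who can reach the endpoints. One common arrangement, sketched below, is to register the handlers on a dedicated mux and serve it from a loopback-only or internal listener. The names newDebugMux and probe are hypothetical, as is port 6060 in the closing comment. Note that importing net/http/pprof still registers the handlers on http.DefaultServeMux as a side effect, so this only helps if application traffic is served from its own mux as well.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "net/http/pprof"
)

// newDebugMux registers the pprof handlers on a dedicated mux so they can be
// served from a separate listener than application traffic.
func newDebugMux() *http.ServeMux {
    mux := http.NewServeMux()
    mux.HandleFunc("/debug/pprof/", pprof.Index)
    mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
    mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
    return mux
}

// probe serves a single GET for path from mux in-process and returns the
// status code. It exists only to keep this example self-checking.
func probe(mux *http.ServeMux, path string) int {
    rec := httptest.NewRecorder()
    mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
    return rec.Code
}

func main() {
    mux := newDebugMux()
    fmt.Println("index status:", probe(mux, "/debug/pprof/"))

    // In a real service you might serve it on a loopback-only port, e.g.:
    //   go http.ListenAndServe("localhost:6060", mux)
}
```

With that in place, go tool pprof connects to the debug port while the public listener never exposes /debug/pprof/ at all.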

pprof fails before the shell even opens if you hand it a corrupted profile file. An interrupted download typically produces a parse error that complains about truncated or unexpected input. If a saved profile refuses to load, capture it again rather than trying to repair it. Execution traces fail in a similar way. go tool pprof does not read the trace format, so it rejects a trace file as an unrecognized profile. Traces require go tool trace.

Another common trap is optimizing the wrong function. The top command shows you where time is spent, not necessarily where you should fix it. Sometimes a function appears hot because it is called millions of times with tiny work. Shaving the function itself yields little; the real fix is usually to call it less often, which means changing its caller. Look at the cumulative time column to find that caller. Functions with high cumulative time but low flat time are orchestrators. Functions with high flat time are doing the actual work. Start with high flat time in functions that do substantial work per call. If the flat time is already low everywhere, the bottleneck lives in a dependency or an algorithmic choice.

The compiler also optimizes away dead code. If you profile a function that returns a value but never uses it, the compiler might remove the entire function. You will see zero samples and wonder why the profiler is broken. The tool is working correctly. The code never ran. Use fmt.Print or a global variable to force the compiler to keep the code alive, or profile a real workload where the output matters.
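
The global-variable approach can look like the sketch below. The names sink and sumTo are invented for this example. It relies on the assumption that the compiler will not discard a computation whose result is stored in a package-level variable and then printed.

```go
package main

import "fmt"

// sink is a package-level variable. Storing a result here makes the
// computation observable, so the compiler has to keep it.
var sink int

// sumTo adds the integers below n. It stands in for the code under test.
func sumTo(n int) int {
    sum := 0
    for i := 0; i < n; i++ {
        sum += i
    }
    return sum
}

func main() {
    // Assign to the sink rather than dropping the return value.
    sink = sumTo(100_000_000)
    fmt.Println("result kept alive in sink:", sink)
}
```

If a function still shows zero samples after this, check how long it actually runs. Anything much shorter than the sampling interval may simply never be caught.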

Go developers write if err != nil { return err } on every error return. The boilerplate is intentional. It makes the unhappy path visible. When you profile, you will see error handling paths in the stack traces. Do not ignore them. They consume cycles too. Receiver names follow the same pragmatic rule. Use one or two letters matching the type, like (b *Buffer) Write(...), not (this *Buffer). The receiver name itself never reaches the profile: pprof prints methods by type, as in (*Buffer).Write. Clear type and method names are what keep stack traces readable.

The profiler shows you reality. Reality rarely matches your assumptions.

When to reach for pprof versus other tools

Use go tool pprof with CPU sampling when you need to find which functions consume the most processor time during a sustained load. Use go tool pprof with heap sampling when you suspect excessive allocations are triggering frequent garbage collection cycles. Use go tool pprof with block sampling when goroutines stall on channels or mutexes and you need to identify the synchronization bottleneck. Use go tool trace when you need to visualize scheduler behavior, goroutine creation, and system call latency at a microsecond level. Use runtime.ReadMemStats when you need programmatic access to memory metrics without generating a profile file. Use plain benchmarking with go test -bench when you are comparing two implementations and only care about relative throughput.
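
For the runtime.ReadMemStats case, a minimal sketch might count heap allocations around a piece of code. The names allocsDuring and sink are hypothetical. The count includes allocations made by any other goroutine running in the same window, so treat it as a rough signal.

```go
package main

import (
    "fmt"
    "runtime"
)

// sink keeps allocations reachable so they land on the heap.
var sink []byte

// allocsDuring reports how many heap objects were allocated process-wide
// while f ran, using the cumulative Mallocs counter.
func allocsDuring(f func()) uint64 {
    var before, after runtime.MemStats
    runtime.ReadMemStats(&before)
    f()
    runtime.ReadMemStats(&after)
    return after.Mallocs - before.Mallocs
}

func main() {
    n := allocsDuring(func() {
        for i := 0; i < 1000; i++ {
            sink = make([]byte, 1024)
        }
    })
    fmt.Println("heap allocations observed:", n)
}
```

ReadMemStats briefly stops the world to collect its numbers, so it fits periodic metrics better than a call on every request.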

Pick the tool that matches the symptom. CPU burns cycles. Heap burns memory. Block burns time waiting.

Where to go next

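go tool pprof helps you find the slow or memory-hungry parts of your Go code. It works like a car diagnostic scanner, pointing at the functions behind a performance problem so you can fix them. A reasonable next step is to capture a thirty-second CPU profile of your own slowest endpoint under load and read it with top and list. If the CPU profile comes back clean, move on to the heap and block profiles, and then to go tool trace.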