How to Write Effective Benchmarks in Go

Write effective Go benchmarks by running the target code inside a loop controlled by `b.N`, making sure the compiler cannot optimize the measured work away, and calling `b.ResetTimer()` to exclude setup costs.

The benchmark that lied

You spend an hour refactoring a hot loop. You run the benchmark. The new version is ten times faster. You deploy. The latency stays exactly the same. The benchmark lied. The compiler optimized your test code away because you never used the result.

This happens to everyone. Writing a benchmark that measures real work requires a few specific tricks to keep the compiler honest. Go's compiler eliminates dead code. If it can see that a side-effect-free calculation produces a value nothing ever uses, it is free to remove the calculation, because doing so cannot change the program's behavior. In a benchmark, you often calculate a value just to measure the cost. The compiler sees "calculate and discard" and deletes the calculation. Your benchmark measures the cost of doing nothing.

How benchmarks work

A benchmark is a stopwatch for code. You wrap the code you care about in a loop, run it many times, and measure the average. The Go testing framework handles the loop count and the timing. The function name must start with Benchmark and take a *testing.B parameter. The framework scans for exported functions with this prefix when you run go test -bench.

The parameter b controls the loop. The field b.N determines how many times the loop runs. You do not set b.N. The framework sets it. It starts with b.N = 1, runs the loop, measures the time, and checks whether the run was long enough for stable statistics. If not, it estimates a larger b.N from the time per iteration observed so far and calls the benchmark function again. This repeats until the timed portion reaches the target duration, which is one second by default. The framework then reports the time per iteration.
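To watch the calibration happen, the sketch below records every b.N the framework hands to a benchmark. It is a standalone program that relies on testing.Benchmark, which runs a benchmark function outside of go test. The helper names observeN and spin, and the 10-microsecond cost per iteration, are illustrative choices rather than anything from the testing package.

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// observeN runs a small benchmark through testing.Benchmark and returns
// every value of b.N that the framework tried, in order.
func observeN() []int {
	var seen []int
	testing.Benchmark(func(b *testing.B) {
		// The framework calls this function several times,
		// each time with a larger b.N.
		seen = append(seen, b.N)
		for i := 0; i < b.N; i++ {
			spin(10 * time.Microsecond)
		}
	})
	return seen
}

// spin busy-waits for roughly d so that each iteration has a measurable cost.
func spin(d time.Duration) {
	start := time.Now()
	for time.Since(start) < d {
	}
}

func main() {
	fmt.Println("b.N values tried:", observeN())
}
```

The exact values depend on the machine, but the list should begin at 1 and grow on every call. It also shows why any setup placed in a benchmark function runs more than once.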

Convention aside: benchmark functions must be exported. The name must start with a capital B. The testing package only runs exported functions matching the Benchmark prefix. Unexported functions are ignored. Also, go test skips benchmarks by default. You must pass the -bench flag to run them.

Keeping the compiler honest

Here's the simplest benchmark structure. The loop runs b.N times. The framework adjusts b.N to get a stable measurement. The critical part is preventing the compiler from removing the work. One way to do that is runtime.KeepAlive. It marks its argument as live up to the point of the call, so the compiler must actually compute the value.

Here's a factorial benchmark that forces the compiler to do the work.

import (
    "runtime"
    "testing"
)

// BenchmarkFactorial measures the cost of computing a factorial.
func BenchmarkFactorial(b *testing.B) {
    // b.N is controlled by the testing framework.
    // It increases until the benchmark runs long enough for stable statistics.
    for i := 0; i < b.N; i++ {
        // Compute the value. If the result were never used, the compiler could remove this call.
        result := factorial(20)
        
        // Tell the compiler the result must survive past this point.
        // This prevents dead-code elimination while keeping the benchmark clean.
        runtime.KeepAlive(result)
    }
}

// factorial returns n! recursively.
func factorial(n int) int {
    if n <= 1 {
        return 1
    }
    return n * factorial(n-1)
}

When you run this, the framework calls BenchmarkFactorial. It starts with a small b.N, runs the loop, measures the time, and checks stability. The runtime.KeepAlive call ensures the compiler treats result as live. The multiplication and recursion actually happen. The output shows the time per operation and the number of iterations.

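Convention aside: many developers instead store the result in a package-level variable, like globalResult = result. This also prevents the optimization, and it is a pattern you will see in a lot of existing Go code. runtime.KeepAlive keeps the benchmark self-contained and avoids a shared global, at the cost of borrowing a function that was designed for garbage-collector liveness rather than for benchmarking.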
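For comparison, here is a minimal sketch of the package-level variable approach. The names sink and sumTo are hypothetical stand-ins, and the claim that this defeats dead-code elimination reflects how the current Go compiler behaves, not a language guarantee. The main function uses testing.Benchmark only so the example runs as a standalone program.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is a package-level variable. Other code could read it, so the
// compiler cannot prove that a store to it is dead.
var sink int

// sumTo returns 1 + 2 + ... + n. It stands in for any pure function.
func sumTo(n int) int {
	total := 0
	for i := 1; i <= n; i++ {
		total += i
	}
	return total
}

// BenchmarkSumTo keeps the result in a local inside the loop and publishes
// it to sink once at the end, so the global store is not paid per iteration.
func BenchmarkSumTo(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = sumTo(1000)
	}
	sink = r
}

func main() {
	// testing.Benchmark runs a benchmark function outside of go test.
	fmt.Println(testing.Benchmark(BenchmarkSumTo))
}
```

In a real project, BenchmarkSumTo would live in a _test.go file and run under go test -bench, with no main function needed.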

Excluding setup costs

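Real code often needs setup. Connecting to a database, loading a file, or initializing a struct takes time. You usually want to measure the query or the processing, not the setup. The b.ResetTimer() method zeroes the elapsed time and the allocation counters without stopping the clock. Call it after setup and immediately before the loop, so only the loop is measured. To pause the clock for setup that has to happen inside the loop, use b.StopTimer() and b.StartTimer().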

Here's a benchmark that measures query latency while excluding connection setup.

// BenchmarkDBQuery measures query latency, excluding connection setup.
func BenchmarkDBQuery(b *testing.B) {
    // Setup happens outside the loop.
    // This runs once per benchmark invocation, not per iteration.
    db := connectToDB()
    defer db.Close()
    
    // Reset the timer to discard the time spent on setup.
    // The clock keeps running; only the elapsed time and allocation counts are zeroed.
    b.ResetTimer()
    
    for i := 0; i < b.N; i++ {
        // Execute the query. The call performs I/O, which is a side effect,
        // so the compiler cannot optimize it away.
        rows, err := db.Query("SELECT id FROM users WHERE id = ?", i)
        if err != nil {
            // Fatal stops the benchmark and reports the error.
            b.Fatal(err)
        }
        // Close the rows (assuming db is a *sql.DB) so the connection returns to the pool.
        rows.Close()
    }
}

The setup runs once. b.ResetTimer() clears the elapsed time. The loop measures only the query. If the query has side effects, like network I/O, the compiler cannot remove it even if you discard the result. Pure calculations still need KeepAlive.
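Some setup cannot be hoisted out of the loop. Sorting is a typical case: the input must be restored to its unsorted state before every iteration. The sketch below handles that with b.StopTimer() and b.StartTimer(), which pause and resume the clock. The helper unsorted and the input size of 10,000 are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"testing"
)

// unsorted returns a pseudo-random permutation of 0..n-1 from a fixed seed,
// so every run sorts the same input.
func unsorted(n int) []int {
	return rand.New(rand.NewSource(1)).Perm(n)
}

// BenchmarkSort times sort.Ints only. Restoring the unsorted input is
// per-iteration setup, so it is bracketed by StopTimer and StartTimer.
func BenchmarkSort(b *testing.B) {
	src := unsorted(10000) // one-time setup
	work := make([]int, len(src))
	b.ResetTimer() // discard the time spent building src

	for i := 0; i < b.N; i++ {
		b.StopTimer()   // pause the clock
		copy(work, src) // per-iteration setup, not timed
		b.StartTimer()  // resume the clock

		sort.Ints(work)
	}
}

func main() {
	// testing.Benchmark runs a benchmark function outside of go test.
	fmt.Println(testing.Benchmark(BenchmarkSort))
}
```

StopTimer and StartTimer have a cost of their own, so this pattern fits best when the measured work is much larger than the bookkeeping. For very small operations, the wall-clock duration of the benchmark can grow noticeably.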

Convention aside: use b.Fatal to stop immediately on error. b.Error marks the benchmark as failed but continues running. In benchmarks, b.Fatal is common to stop on the first error and avoid cascading failures.

Measuring memory allocations

Speed is not the only metric. Memory allocations impact performance and garbage collection pressure. Go tracks allocations automatically. Pass the -benchmem flag to see allocation statistics. The output includes B/op (bytes allocated per operation) and allocs/op (allocations per operation).

You can also enable this programmatically with b.ReportAllocs(). This is useful when you want allocation stats in the output without remembering the flag. Most teams add -benchmem to their CI commands so allocation regressions are caught automatically.

Convention aside: -benchmem is the standard flag, and many benchmark reports include it. Allocation counts often matter as much as time. A function that is fast but allocates heavily can cause GC pauses that hurt latency.
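As a small illustration, the sketch below allocates one buffer per iteration and reads the allocation figures back from the testing.BenchmarkResult that testing.Benchmark returns. The names bufSink, BenchmarkAlloc, and allocStats are hypothetical. Under go test, the b.ReportAllocs() call would add the B/op and allocs/op columns for this benchmark even without -benchmem.

```go
package main

import (
	"fmt"
	"testing"
)

// bufSink keeps each buffer reachable, which forces the allocation onto the heap.
var bufSink []byte

// BenchmarkAlloc allocates one 1 KiB buffer per iteration.
func BenchmarkAlloc(b *testing.B) {
	b.ReportAllocs() // report allocation stats for this benchmark under go test
	for i := 0; i < b.N; i++ {
		bufSink = make([]byte, 1024)
	}
}

// allocStats runs the benchmark and returns allocations and bytes per operation.
func allocStats() (allocs, bytes int64) {
	res := testing.Benchmark(BenchmarkAlloc)
	return res.AllocsPerOp(), res.AllocedBytesPerOp()
}

func main() {
	allocs, bytes := allocStats()
	fmt.Printf("%d allocs/op, %d B/op\n", allocs, bytes)
}
```

Reading AllocsPerOp in code like this is one way to turn an allocation budget into an assertion, for example in a regular test that fails when a function starts allocating more than expected.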

Comparing variations with sub-benchmarks

When you refactor code, you want to compare the old and new versions. Sub-benchmarks let you run multiple variations in one test file. Use b.Run(name, func) to create a sub-benchmark. Each sub-benchmark gets its own b.N and timing. The names appear in the output, making comparison easy.

Here's a benchmark that compares naive string concatenation with strings.Builder.

import (
    "runtime"
    "strings"
    "testing"
)

// BenchmarkStringConcat measures different string building strategies.
func BenchmarkStringConcat(b *testing.B) {
    // b.Run creates a sub-benchmark.
    // Each sub-benchmark gets its own b.N and timing.
    b.Run("naive", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            s := ""
            for j := 0; j < 100; j++ {
                s += "x"
            }
            runtime.KeepAlive(s)
        }
    })
    
    b.Run("builder", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            // Named sb so it does not shadow the *testing.B parameter b.
            var sb strings.Builder
            for j := 0; j < 100; j++ {
                sb.WriteString("x")
            }
            runtime.KeepAlive(sb.String())
        }
    })
}

The output shows BenchmarkStringConcat/naive and BenchmarkStringConcat/builder. You can see the time and allocations for each. This keeps variations isolated and comparable.

Convention aside: sub-benchmark names should be descriptive. Use lowercase names separated by slashes if you need hierarchy, like b.Run("json/small", fn). The testing framework supports nested names.
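Sub-benchmarks also combine well with a table of inputs. The sketch below runs one sub-benchmark per input size and builds each name with fmt.Sprintf. The helper build, the variable strSink, and the chosen sizes are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// strSink receives each result so the compiler cannot discard the work.
var strSink string

// build concatenates n copies of "x" using strings.Builder.
func build(n int) string {
	var sb strings.Builder
	for i := 0; i < n; i++ {
		sb.WriteString("x")
	}
	return sb.String()
}

// BenchmarkBuild runs one sub-benchmark per input size. Under go test the
// results are listed under names such as BenchmarkBuild/size=10.
func BenchmarkBuild(b *testing.B) {
	for _, n := range []int{10, 100, 1000} {
		b.Run(fmt.Sprintf("size=%d", n), func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				strSink = build(n)
			}
		})
	}
}

func main() {
	// Run standalone, testing.Benchmark returns a single combined estimate.
	// go test -bench prints one line per sub-benchmark instead.
	fmt.Println(testing.Benchmark(BenchmarkBuild))
}
```

Because the names are regular, a single case can be selected with the -bench pattern, for example -bench='Build/size=100'.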

Pitfalls and errors

Benchmarks have common traps. The compiler optimizing away code is the biggest one. If you forget KeepAlive or a global variable, the result is a suspiciously low figure, often a fraction of a nanosecond per operation. This is a silent failure. The benchmark still reports a number, but almost nothing happens inside the loop.

Another trap is measuring setup costs. If you forget b.ResetTimer(), the setup time is included in the average. This skews the result, especially if b.N is small.

Goroutine leaks are dangerous in benchmarks. If a benchmark spawns goroutines that never finish, the test hangs or leaks memory. Always ensure goroutines exit when the benchmark stops. Use context.Context with cancellation or a done channel to signal goroutines to stop.
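One possible shape for that cleanup is sketched below: the benchmark cancels a context and then waits on a sync.WaitGroup before returning. The counter activeWorkers and the helper leakedWorkers exist only to make the result observable in this example and are not part of any API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"testing"
)

// activeWorkers counts worker goroutines that have started but not yet exited.
var activeWorkers int32

// BenchmarkWorker sends b.N jobs to a background goroutine and makes sure
// that goroutine has exited before the benchmark function returns.
func BenchmarkWorker(b *testing.B) {
	ctx, cancel := context.WithCancel(context.Background())
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	wg.Add(1)
	atomic.AddInt32(&activeWorkers, 1)
	go func() {
		defer wg.Done()
		defer atomic.AddInt32(&activeWorkers, -1)
		for {
			select {
			case <-ctx.Done():
				return // the benchmark is finished: stop instead of leaking
			case n := <-jobs:
				results <- n * 2
			}
		}
	}()
	// Deferred calls run last-in, first-out: cancel fires first,
	// then Wait blocks until the worker has returned.
	defer wg.Wait()
	defer cancel()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		jobs <- i
		<-results
	}
}

// leakedWorkers runs the benchmark and reports how many workers are still alive.
func leakedWorkers() int32 {
	testing.Benchmark(BenchmarkWorker)
	return atomic.LoadInt32(&activeWorkers)
}

func main() {
	fmt.Println("workers still running:", leakedWorkers())
}
```

The framework calls the benchmark function several times with different b.N values, so a worker that is never stopped would be leaked once per call, not once per benchmark.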

Signature mistakes are caught before the benchmark runs. If you name a function BenchmarkFoo but declare its parameter as *testing.T instead of *testing.B, or give it no parameters at all, go test refuses to build the test binary and reports a wrong-signature error for BenchmarkFoo, stating that it must be func BenchmarkFoo(b *testing.B). The error is clear. Fix the signature and the benchmark runs.

Convention aside: run benchmarks with -benchtime to control duration. The default is one second. Use -benchtime=5s for more stable results on noisy systems. Use -count=3 to run the benchmark multiple times and check variance.

The worst benchmark bug is the one that measures nothing. Verify your benchmark by adding a deliberate delay or allocation and checking if the result changes. If the time stays the same, the compiler optimized away your work.
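A related check, sketched below, scales the input instead of adding a delay: if the work is really being measured, a hundred times more work should take clearly longer. The function checksum is a hypothetical stand-in for the code under test, and nsPerOp is a helper written for this example.

```go
package main

import (
	"fmt"
	"testing"
)

// hashSink receives each result so the compiler cannot discard the work.
var hashSink uint64

// checksum is a stand-in for the code under test. Its cost grows with n.
func checksum(n int) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < n; i++ {
		h = (h ^ uint64(i)) * 1099511628211
	}
	return h
}

// nsPerOp benchmarks checksum(n) and returns the measured nanoseconds per operation.
func nsPerOp(n int) int64 {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			hashSink = checksum(n)
		}
	})
	return res.NsPerOp()
}

func main() {
	small := nsPerOp(1000)
	large := nsPerOp(100000)
	fmt.Printf("n=1000: %d ns/op, n=100000: %d ns/op\n", small, large)
	if large <= small {
		fmt.Println("warning: cost did not grow with input; the work may have been optimized away")
	}
}
```

If the two figures come out nearly identical, treat the benchmark as suspect before trusting any comparison built on it.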

When to use benchmarks

Use a benchmark when you need to measure the performance of a specific function or loop. Use b.ResetTimer() when setup costs would skew the measurement of the target code. Use runtime.KeepAlive(result) when the compiler might optimize away a pure calculation. Use b.RunParallel when you want to measure how the code scales with multiple goroutines. Use the -benchmem flag when you need to track memory allocations alongside speed. Use b.Run(name, fn) when you want to compare multiple variations of the same logic in one test file. Use a plain sequential loop over b.N when you don't need concurrency: the simplest thing that works is usually the right thing.
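Since b.RunParallel has no example elsewhere in this article, here is a minimal sketch. It increments a shared counter from several goroutines at once. The names counter and parallelRun are illustrative, and the atomic increment is only a placeholder for whatever concurrent operation you want to measure.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// counter is shared by all goroutines started by RunParallel.
var counter int64

// BenchmarkCounterParallel increments the shared counter concurrently.
func BenchmarkCounterParallel(b *testing.B) {
	// RunParallel starts a set of goroutines (GOMAXPROCS of them by default)
	// and divides the b.N iterations among them.
	b.RunParallel(func(pb *testing.PB) {
		// pb.Next reports whether this goroutine should run another iteration.
		for pb.Next() {
			atomic.AddInt64(&counter, 1)
		}
	})
}

// parallelRun runs the benchmark and returns the final b.N together with
// the total number of increments across all calibration rounds.
func parallelRun() (iterations int, total int64) {
	atomic.StoreInt64(&counter, 0)
	res := testing.Benchmark(BenchmarkCounterParallel)
	return res.N, atomic.LoadInt64(&counter)
}

func main() {
	n, total := parallelRun()
	fmt.Printf("final b.N: %d, total increments: %d\n", n, total)
}
```

Under go test, the -cpu flag reruns a benchmark with different GOMAXPROCS values, which is a convenient way to see how the per-operation time changes as parallelism increases.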

Measure once, compare often. A benchmark without a baseline is just a number.

Recap

Benchmarks are automated tests that run your code thousands of times to measure how fast it is. They help you spot slow parts of your program before users do. Think of it like a stopwatch that runs a race track loop to see exactly how long a lap takes.