How to Use benchstat to Compare Benchmark Results

The benchmark that lied to you

You changed a loop from a manual index to a range. You ran the benchmark. The new version is 8% faster. You commit the change. Two days later, a colleague runs the benchmark on their machine. The old version is 4% faster. You argue about the numbers. The numbers are not lying. The numbers are just noisy.

Benchmarks measure time. Time is messy. Your CPU frequency changes with load and temperature. The garbage collector runs at moments you do not control. A background process wakes up and steals a millisecond. The operating system scheduler moves your thread to a different core. One benchmark run captures all that noise. One number tells you little about the true performance. You need statistics. benchstat gives you the statistics. It runs the numbers through a significance test so you can commit with confidence instead of hope.

Why one run is never enough

benchstat compares two sets of benchmark results. You run your benchmarks many times before a change and many times after. For each benchmark, benchstat summarizes every set with a central value and a measure of spread: older releases print the mean and the spread of the samples around it, newer releases print the median and a confidence interval. It then runs a significance test on the two sets of samples (the Mann-Whitney U-test by default). If the two sets overlap so much that noise alone could explain the difference, the tool reports no significant change. If they separate cleanly, it reports the delta.

The tool reports a p-value. The p-value is the probability of seeing a difference at least this large if the change had no real effect and only noise were at work. A low p-value means noise alone is an unlikely explanation. A high p-value means the difference could easily be noise. The conventional threshold is 0.05, which is also benchstat's default. If the p-value is below 0.05, benchstat treats the difference as significant and prints the delta. If it is above, benchstat prints ~ instead, and you should not claim a change.

benchstat also prints a geometric mean when it summarizes several benchmarks in one row. A performance change is a ratio: the new time is the old time multiplied by some factor. Ratios combine by multiplication, and the geometric mean is the average that respects that. The arithmetic mean does not. If one benchmark gets twice as fast (ratio 0.5) and another gets twice as slow (ratio 2.0), the arithmetic mean of the ratios is 1.25, which suggests a 25% slowdown. The geometric mean is 1.0, which correctly reports that the two changes cancel out.

Minimal example

Install benchstat first. It lives in the golang.org/x/perf repository, not the standard library. You install it with go install.

# Install the latest version of benchstat.
go install golang.org/x/perf/cmd/benchstat@latest

Write a simple benchmark in a file whose name ends in _test.go, because go test only picks up benchmarks from such files. The testing framework controls how many iterations run via b.N. Your job is to write the loop and prevent the compiler from optimizing the work away.

package main

import (
    "runtime"
    "testing"
)

// BenchmarkAdd computes a sum in a loop.
func BenchmarkAdd(b *testing.B) {
    sum := 0
    for i := 0; i < b.N; i++ {
        sum += i
    }
    // KeepAlive prevents the compiler from deleting the sum variable.
    runtime.KeepAlive(sum)
}
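To watch the framework choose b.N, the same loop can be driven from an ordinary program with testing.Benchmark, which the testing package provides for benchmarks that run outside go test. This is only a sketch for exploration; benchAdd is an illustrative name, and for benchstat you still want the go test -bench output described next.

```go
package main

import (
	"fmt"
	"runtime"
	"testing"
)

// benchAdd is the same loop as BenchmarkAdd, written as a plain function.
func benchAdd(b *testing.B) {
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += i
	}
	// KeepAlive marks sum as used so the loop is not discarded.
	runtime.KeepAlive(sum)
}

func main() {
	// testing.Benchmark calls benchAdd with growing values of b.N
	// until the run lasts long enough to time reliably.
	res := testing.Benchmark(benchAdd)
	fmt.Printf("iterations: %d, ns/op: %d\n", res.N, res.NsPerOp())
}
```

The iteration count it prints is the final b.N the framework settled on, which is why you never set b.N yourself.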

Run the benchmark ten times and save the output. The -count=10 flag repeats the benchmark ten times. Each repetition produces one result line in the output. benchstat needs multiple lines per benchmark to calculate statistics.

# Run the benchmark ten times and save the text output.
go test -bench=BenchmarkAdd -count=10 > old.txt

Modify the code. Run the benchmark again. Save the new output.

# Run the benchmark ten times after the change.
go test -bench=BenchmarkAdd -count=10 > new.txt

Pass both files to benchstat. The tool parses the files and prints a comparison table.

# Compare the old and new results.
benchstat old.txt new.txt

Walking through the output

benchstat prints a table with columns for the old result, the new result, and the delta. The delta column shows the percentage change. The p-value and sample size appear in parentheses. The exact layout depends on the benchstat version; the samples here use the older layout.

name    old time/op    new time/op    delta
Add-8   1.20ns ± 3%    1.15ns ± 4%    -4.17%  (p=0.000 n=10+10)

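The ± 3% describes how far the individual runs spread around the reported central value. It is not a standard deviation: in the older output format shown here it is the spread of the samples around the mean, and newer releases report a confidence interval around the median instead. Either way, it tells you how noisy the measurement is. The p=0.000 means the p-value rounds to zero at three decimal places, so noise alone is a very unlikely explanation. The change is significant. The n=10+10 means ten runs for the old version and ten runs for the new version.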

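If the p-value is high, benchstat does not print a delta at all. You will see ~ (p=0.450 n=10+10) in the delta column. That does not mean there is a 45% chance the difference is noise. It means that runs with no real difference would produce a gap this large about 45% of the time, so the data cannot tell the two versions apart. You cannot claim the change is faster. You need more runs, less noise, or a larger effect.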

benchstat also handles allocation statistics if you pass -benchmem to go test. The output then includes additional tables for alloc/op and allocs/op. The tool compares allocations the same way it compares time.

name    old alloc/op    new alloc/op    delta
Add-8   0B ± 0%         0B ± 0%         ~     (all equal)

The ~ symbol means benchstat found no statistically significant difference. The (all equal) note means every sample had the same value, so there was nothing to test.

Realistic example: Slice preallocation

Slice preallocation is a common optimization. You know make([]T, 0, cap) avoids reallocations. You want to prove it. Write two benchmarks. One appends to a nil slice. The other pre-allocates capacity.

package main

import "testing"

// BenchmarkAppendDynamic grows a slice without capacity.
func BenchmarkAppendDynamic(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var s []int
        for j := 0; j < 100; j++ {
            s = append(s, j)
        }
    }
}

// BenchmarkAppendPrealloc sets capacity upfront.
func BenchmarkAppendPrealloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s := make([]int, 0, 100)
        for j := 0; j < 100; j++ {
            s = append(s, j)
        }
    }
}

benchstat matches results by benchmark name, so two benchmarks with different names will not line up in one row. Run each benchmark ten times into its own file and rewrite the name so both files report the same benchmark. Then compare.

# Run each variant ten times; rename both to "Append" so benchstat pairs them.
go test -bench='^BenchmarkAppendDynamic$' -benchmem -count=10 | sed 's/AppendDynamic/Append/' > old.txt
go test -bench='^BenchmarkAppendPrealloc$' -benchmem -count=10 | sed 's/AppendPrealloc/Append/' > new.txt
benchstat old.txt new.txt

The output shows the time delta and the allocation delta. The figures below are illustrative; yours will differ. Preallocation should remove the repeated reallocations, and the time should drop because the runtime stops growing and copying the underlying array.

name        old time/op    new time/op    delta
Append-8    1.50µs ± 2%    0.80µs ± 1%    -46.67%  (p=0.000 n=10+10)

name        old alloc/op    new alloc/op    delta
Append-8    1.69kB ± 0%     0B ± 0%         -100.00%  (p=0.000 n=10+10)

The dynamic version allocates memory repeatedly as the slice grows. The prealloc version needs at most one allocation per iteration, and here it reports 0B, which is what you would see if the compiler keeps the small, non-escaping slice on the stack. The p-value says the improvement is unlikely to be noise. The allocation drop is 100%. The time drop is significant. You can commit this change.
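The reason the dynamic version allocates repeatedly can be seen without a benchmark. The sketch below appends 100 ints to a slice and counts how often the capacity changes, which serves as a proxy for how often append had to move the data to a larger backing array. countGrowths is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// countGrowths appends n ints to s and counts how often the capacity
// changes, a proxy for how often the backing array was reallocated.
func countGrowths(s []int, n int) int {
	growths := 0
	for j := 0; j < n; j++ {
		before := cap(s)
		s = append(s, j)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	// A nil slice has to grow several times on the way to 100 elements.
	fmt.Println("nil slice:   ", countGrowths(nil, 100))
	// A slice with capacity 100 never outgrows its backing array.
	fmt.Println("preallocated:", countGrowths(make([]int, 0, 100), 100))
}
```

The exact growth count for the nil slice depends on the Go version's growth policy, but the preallocated slice always reports zero because 100 appends fit in the capacity it started with.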

Pitfalls and noise

benchstat assumes the hardware is stable. If you compare results from different machines, the output is meaningless. CPU architecture, memory speed, and cache size vary wildly. benchstat does not normalize for hardware. Run both sets on the same machine.

CPU frequency scaling introduces noise. Linux scales the CPU frequency based on load. Your benchmark might run at 2.0GHz for one run and 3.5GHz for another. The variance spikes. The p-value rises, and real differences get buried. On Linux, you can set the CPU governor to performance mode. Use cpupower frequency-set -g performance as root. This reduces frequency scaling as a noise source.

The garbage collector runs concurrently with your code. If your benchmark allocates memory, a collection can start during the run, and the timing varies. Run benchmarks with -benchmem to see allocation pressure. The more a benchmark allocates, the more GC noise it picks up. You can disable the collector by setting GOGC=off in the environment to isolate CPU performance, but this hides real-world behavior. Use benchstat with GC enabled for realistic results.

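benchstat reads the benchmark text format that go test -bench prints: lines that start with Benchmark, followed by an iteration count and value/unit pairs. It is not TSV. The compiler is not involved in reading these files. If you feed benchstat raw text or a different format, it has no result lines to work with, and you get a parse complaint or an empty or partial table. Always use go test -bench to generate the files.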
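For orientation, a result line in go test -bench output carries a name, an iteration count, and then value/unit pairs. The sketch below is a deliberately simplified reader for the first pair only. It is not benchstat's parser; the result type, the parseLine helper, and the sample line are all hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// result holds the first measurement from one benchmark line.
type result struct {
	Name  string
	Iters int
	Value float64
	Unit  string
}

// parseLine reads lines such as
// "BenchmarkAdd-8   1000000000   0.25 ns/op"
// and reports false for anything that does not look like a result.
func parseLine(line string) (result, bool) {
	f := strings.Fields(line)
	if len(f) < 4 || !strings.HasPrefix(f[0], "Benchmark") {
		return result{}, false
	}
	iters, err := strconv.Atoi(f[1])
	if err != nil {
		return result{}, false
	}
	value, err := strconv.ParseFloat(f[2], 64)
	if err != nil {
		return result{}, false
	}
	return result{Name: f[0], Iters: iters, Value: value, Unit: f[3]}, true
}

func main() {
	for _, line := range []string{
		"BenchmarkAdd-8   1000000000   0.25 ns/op", // made-up sample line
		"PASS",
	} {
		r, ok := parseLine(line)
		fmt.Printf("%q -> ok=%v %+v\n", line, ok, r)
	}
}
```

Lines such as PASS or ok that go test also prints are not results, which is why redirecting the whole output to a file is fine.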

benchstat is not in the standard library. It lives in golang.org/x/perf. This is a convention for experimental or supplementary tools. The standard library contains only stable, core features. benchstat is stable in practice but lives in the x repo by design. Install it with go install. Do not vendor it in your project. It is a development tool.

Decision matrix

Use benchstat when you need to prove a performance change is statistically significant. Use go test -bench alone when you are exploring code and do not need precision. Use pprof when you need to find where the time is spent, not just if it is faster. Use benchstat with -benchmem when allocations matter and you need to compare memory usage. Use a single run with -benchtime when the benchmark is too fast for default timing and you need more iterations per run.

Statistics beat intuition. Run it ten times. Trust the p-value.

Recap

benchstat is a tool that helps you decide if a code change actually made your program faster or slower. It takes the raw timing data from running your tests multiple times and uses statistics to filter out random noise. Think of it like a referee that tells you if a runner truly improved their time or if they just got lucky on one specific day.