The benchmark that lied to you
You changed a loop from a manual index to a range. You ran the benchmark. The new version is 8% faster. You commit the change. Two days later, a colleague runs the benchmark on their machine. The old version is 4% faster. You argue about the numbers. The numbers are not lying. The numbers are just noisy.
Benchmarks measure time. Time is messy. Your CPU frequency changes with temperature and load. The garbage collector runs at times you do not control. A background process wakes up and steals a millisecond. The operating system scheduler moves your thread to a different core. One benchmark run captures all that noise. One number tells you little about the true performance. You need statistics. benchstat gives you the statistics. It runs the numbers through a significance test so you can commit with confidence instead of hope.
Why one run is never enough
benchstat compares two sets of benchmark results. You run your benchmarks many times before a change and many times after. For each benchmark, benchstat summarizes both sets with a central value (the mean in older versions, the median in newer ones) and a measure of spread. It then runs a statistical test, by default the Mann-Whitney U test, on the two sets of samples. If the difference between the sets is small compared to the variation within them, the tool says the change is not significant. If the difference is large compared to that variation, it reports the change and its size.
The tool reports a p-value. The p-value is the probability of seeing a difference at least this large if the change had no real effect and only noise separated the two sets. A low p-value means noise alone is an unlikely explanation. A high p-value means the difference could easily be noise. The common convention is a threshold of 0.05, which is also benchstat's default. If the p-value is below 0.05, benchstat treats the difference as significant. If it is above, the data do not support a claim in either direction.
benchstat also uses the geometric mean when it summarizes several benchmarks in one row. Benchmark changes are ratios. A speedup multiplies the time by a factor below one. A slowdown multiplies it by a factor above one. The geometric mean handles multiplicative changes symmetrically. The arithmetic mean does not. If one benchmark halves its time and another doubles it, the new/old ratios are 0.5 and 2.0. Their arithmetic mean is 1.25, which suggests a 25% slowdown. Their geometric mean is 1.0, which correctly says the two changes cancel.
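The difference between the two means is easy to check. This is a minimal sketch with hypothetical new/old ratios of 0.5 and 2.0, one benchmark twice as fast and one twice as slow. The helper names are illustrative, not part of benchstat.

```go
package main

import (
	"fmt"
	"math"
)

// arithMean returns the arithmetic mean of xs.
func arithMean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// geoMean returns the geometric mean of xs, computed through logarithms.
// All values must be positive.
func geoMean(xs []float64) float64 {
	logSum := 0.0
	for _, x := range xs {
		logSum += math.Log(x)
	}
	return math.Exp(logSum / float64(len(xs)))
}

func main() {
	// Hypothetical new/old time ratios: one halved, one doubled.
	ratios := []float64{0.5, 2.0}
	fmt.Printf("arithmetic mean: %.3f\n", arithMean(ratios))
	fmt.Printf("geometric mean:  %.3f\n", geoMean(ratios))
}
```

The arithmetic mean comes out at 1.250 and the geometric mean at 1.000. Only the second treats a 2x speedup and a 2x slowdown as opposites.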
Minimal example
Install benchstat first. It lives in the golang.org/x/perf repository, not the standard library. You install it with go install.
# Install the latest version of benchstat.
go install golang.org/x/perf/cmd/benchstat@latest
Write a simple benchmark in a file whose name ends in _test.go. The testing framework controls how many iterations run via b.N. Your job is to write the loop and prevent the compiler from optimizing the work away.
package main
import (
	"runtime"
	"testing"
)
// BenchmarkAdd computes a sum in a loop.
func BenchmarkAdd(b *testing.B) {
	sum := 0
	for i := 0; i < b.N; i++ {
		sum += i
	}
	// KeepAlive prevents the compiler from deleting the sum variable.
	runtime.KeepAlive(sum)
}
Run the benchmark ten times and save the output. The -count=10 flag forces ten independent runs. Each run produces a row in the output. benchstat needs multiple rows to calculate statistics.
# Run the benchmark ten times and save the plain-text output.
go test -bench=BenchmarkAdd -count=10 > old.txt
Modify the code. Run the benchmark again. Save the new output.
# Run the benchmark ten times after the change.
go test -bench=BenchmarkAdd -count=10 > new.txt
Pass both files to benchstat. The tool parses the files and prints a comparison table.
# Compare the old and new results.
benchstat old.txt new.txt
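The saved files are plain text, one result line per run. To see what benchstat has to work with, here is a rough sketch that pulls the ns/op value out of each result line. The function name and the sample lines are hypothetical, and the real parser in golang.org/x/perf handles far more than this.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNsPerOp extracts the benchmark name and ns/op value from one line of
// `go test -bench` output. It reports ok=false for lines that are not results.
func parseNsPerOp(line string) (name string, nsPerOp float64, ok bool) {
	fields := strings.Fields(line)
	// A result line looks like: BenchmarkAdd-8  1000000000  1.20 ns/op
	if len(fields) < 4 || !strings.HasPrefix(fields[0], "Benchmark") {
		return "", 0, false
	}
	// After the name and iteration count come value/unit pairs.
	for i := 2; i+1 < len(fields); i += 2 {
		if fields[i+1] == "ns/op" {
			v, err := strconv.ParseFloat(fields[i], 64)
			if err != nil {
				return "", 0, false
			}
			return fields[0], v, true
		}
	}
	return "", 0, false
}

func main() {
	// Hypothetical excerpt of a saved results file.
	output := `goos: linux
BenchmarkAdd-8   	1000000000	         1.20 ns/op
BenchmarkAdd-8   	1000000000	         1.18 ns/op
PASS`
	samples := map[string][]float64{}
	for _, line := range strings.Split(output, "\n") {
		if name, v, ok := parseNsPerOp(line); ok {
			samples[name] = append(samples[name], v)
		}
	}
	fmt.Println(samples)
}
```

Every run of the same benchmark lands under the same name. That is why -count=10 matters: it gives each name ten samples instead of one.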
Walking through the output
benchstat prints a table with columns for the old result, the new result, and the delta. The delta column shows the percentage change. The p-value and sample size appear in parentheses. The exact layout depends on the benchstat version. The sample below uses the older column style.
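name   old time/op  new time/op  delta
Add-8  1.20ns ± 3%  1.15ns ± 4%  -4.17%  (p=0.000 n=10+10)
The ± 3% describes how far the runs spread around the central value. It tells you how noisy the measurement is. It is not a standard deviation: older versions print the largest deviation from the mean after discarding outliers, and newer versions print a confidence interval around the median. The p=0.000 means a difference this large would almost never appear from noise alone, at least to three decimal places. The change is significant. The n=10+10 means ten runs for the old version and ten runs for the new version.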
If the p-value is high, the delta is unreliable. benchstat then prints ~ in place of the percentage, as in ~ (p=0.450 n=10+10). That p-value means noise alone would produce a difference this large about 45% of the time. You cannot claim the change is faster. You need more runs or a larger effect.
benchstat also handles allocation statistics if you pass -benchmem to go test. The output then includes alloc/op and allocs/op tables alongside time/op. The tool compares allocations the same way it compares time.
name   old alloc/op  new alloc/op  delta
Add-8  0B ± 0%       0B ± 0%       ~  (all equal)
The ~ symbol means benchstat found no statistically significant difference. The (all equal) note means every sample in both files had the same value, so there was nothing to test.
Realistic example: Slice preallocation
Slice preallocation is a common optimization. You know make([]T, 0, cap) avoids reallocations. You want to prove it. Write two benchmarks. One appends to a nil slice. The other pre-allocates capacity.
package main
import "testing"
// BenchmarkAppendDynamic grows a slice without capacity.
func BenchmarkAppendDynamic(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var s []int
		for j := 0; j < 100; j++ {
			s = append(s, j)
		}
	}
}
// BenchmarkAppendPrealloc sets capacity upfront.
func BenchmarkAppendPrealloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := make([]int, 0, 100)
		for j := 0; j < 100; j++ {
			s = append(s, j)
		}
	}
}
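benchstat matches results by benchmark name across the two files. Run both benchmarks ten times and save the output as the baseline. Then change the body of BenchmarkAppendDynamic to preallocate, leave BenchmarkAppendPrealloc untouched as a control, and run again. Compare.
go test -bench=. -benchmem -count=10 > old.txt
# ... change BenchmarkAppendDynamic to use prealloc ...
go test -bench=. -benchmem -count=10 > new.txt
benchstat old.txt new.txt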
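The output shows the time delta and the allocation delta. In this example preallocation reduces heap allocation to zero, because the fixed-size slice never leaves the function and the compiler can keep it on the stack. When the slice escapes, expect one allocation instead. The time should drop because the runtime stops growing and copying the underlying array.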
name              old time/op   new time/op   delta
AppendDynamic-8   1.50µs ± 2%   0.80µs ± 1%   -46.67%  (p=0.000 n=10+10)
AppendPrealloc-8  0.80µs ± 1%   0.80µs ± 1%   ~        (all equal)

name              old alloc/op  new alloc/op  delta
AppendDynamic-8   1.69kB ± 0%   0B ± 0%       -100.00%  (p=0.000 n=10+10)
AppendPrealloc-8  0B ± 0%       0B ± 0%       ~         (all equal)
The dynamic version reallocates and copies every time the slice outgrows its backing array. The preallocated version sizes the array once. The p-value confirms the improvement is real. The allocation drop is 100%. The time drop is significant. You can commit this change.
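You can sanity-check the allocation story without running a full benchmark. The sketch below uses testing.AllocsPerRun from the standard library. The sink variable is an addition for this sketch and is not in the benchmarks above: it forces each slice onto the heap so the counts are predictable, which means the preallocated version reports one allocation rather than zero.

```go
package main

import (
	"fmt"
	"testing"
)

// sink forces each slice to escape to the heap. It is an assumption of this
// sketch, added so that allocation counts do not depend on escape analysis.
var sink []int

// appendDynamic grows a slice from nil, reallocating as it fills.
func appendDynamic() {
	var s []int
	for j := 0; j < 100; j++ {
		s = append(s, j)
	}
	sink = s
}

// appendPrealloc reserves capacity for all 100 elements up front.
func appendPrealloc() {
	s := make([]int, 0, 100)
	for j := 0; j < 100; j++ {
		s = append(s, j)
	}
	sink = s
}

func main() {
	dynamic := testing.AllocsPerRun(1000, appendDynamic)
	prealloc := testing.AllocsPerRun(1000, appendPrealloc)
	fmt.Printf("dynamic:  %.0f allocs per call\n", dynamic)
	fmt.Printf("prealloc: %.0f allocs per call\n", prealloc)
}
```

This counts allocations but says nothing about time or noise. It complements benchstat and does not replace it.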
Pitfalls and noise
benchstat assumes the hardware is stable. If you compare results from different machines, the output is meaningless. CPU architecture, memory speed, and cache size vary wildly. benchstat does not normalize for hardware. Run both sets on the same machine.
CPU frequency scaling introduces noise. Linux scales the CPU frequency based on load. Your benchmark might run at 2.0GHz for one run and 3.5GHz for another. The variance spikes. A real difference gets buried, or a frequency shift between the two batches shows up as a false one. On Linux, you can switch the frequency governor to performance. Use cpupower frequency-set -g performance as root. This reduces frequency scaling as a noise source, though thermal throttling can still intervene.
The garbage collector runs concurrently. If your benchmark allocates memory, the GC might trigger during the run. The timing varies. Run benchmarks with -benchmem to see allocation pressure. If allocations are high, the GC noise is higher. You can disable the GC by setting GOGC=off to isolate CPU performance, but this hides real-world behavior. Use benchstat with GC enabled for realistic results.
If you pass a file in the wrong format, benchstat cannot use it. The tool expects the plain-text benchmark output of go test. If you feed it raw text or a different format, benchstat may stop with error parsing ... or simply find no benchmark lines to compare. Always use go test -bench to generate the files.
benchstat is not in the standard library. It lives in golang.org/x/perf. This is a convention for experimental or supplementary tools. The standard library contains only stable, core features. benchstat is stable in practice but lives in the x repo by design. Install it with go install. Do not vendor it in your project. It is a development tool.
Decision matrix
Use benchstat when you need to prove a performance change is statistically significant. Use go test -bench alone when you are exploring code and do not need precision. Use pprof when you need to find where the time is spent, not just whether the code is faster. Use benchstat with -benchmem when allocations matter and you need to compare memory usage. Raise -benchtime when each run is too short to give stable numbers and you need more iterations per run, and keep -count high either way.
Statistics beat intuition. Run it ten times. Trust the p-value.