The missing numbers in your dashboard
Your payment service is humming along. The logs show green checkmarks. The CPU graph is flat. Then the dashboard turns red. Orders time out. You check the logs again and see nothing wrong. The problem isn't a crash. The problem is a slow database lock that's invisible to your logs. Logs tell you what happened at a specific moment. Metrics tell you how the system behaves over time. You need numbers to see the shape of the problem.
Go does not ship with a full application metrics library. The standard library offers two narrow building blocks: expvar, which publishes named variables as JSON at /debug/vars, and runtime/metrics, which reports internal runtime telemetry such as heap size and goroutine counts. Everything else is left open so you can pick the format and exporter that match your infrastructure. You define the numbers that matter, protect them from concurrent access, and expose them through a simple HTTP endpoint.
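To make the runtime/metrics side concrete, here is a minimal sketch that reads one runtime value. The helper name readGoroutineCount is made up for this example, and the metric name /sched/goroutines:goroutines is assumed to be supported by your Go version, which is why the code checks the returned kind before using the value.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readGoroutineCount (a hypothetical helper) asks the runtime for the number
// of live goroutines. It returns false if this Go version does not support
// the metric, in which case the sample comes back with a different kind.
func readGoroutineCount() (uint64, bool) {
	samples := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return samples[0].Value.Uint64(), true
}

func main() {
	if n, ok := readGoroutineCount(); ok {
		fmt.Println("goroutines:", n)
	} else {
		fmt.Println("metric not supported by this Go version")
	}
}
```

Runtime metrics like this one describe the engine, not your business logic, so they complement the custom counters built in the rest of this article.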
Metrics are cheap to collect if you design them right. They become expensive when you treat them like logs.
What metrics actually are
A metric is a numerical snapshot of a system state. You track them in three primary shapes. A counter only goes up. You use it for request totals, error counts, or bytes processed. A gauge moves up and down. You use it for active connections, queue depth, or current memory usage. A histogram records the distribution of values. You use it for request latency or payload sizes, breaking the data into buckets so you can calculate percentiles later.
Think of a counter like an odometer in a car. The number only increases as you drive. You can calculate speed by looking at how fast the number changes, but the odometer itself never goes backward. A gauge is like the fuel gauge. It rises when you fill up and drops as you drive. It tells you the current state, not the history. A histogram is like a grade distribution in a class. The average grade might be a B, but the histogram shows you that half the class got Fs and half got As. The average hides the reality. The histogram reveals it.
Latency is rarely uniform. Most requests finish quickly, but a few take forever. The average latency hides the slow requests. The histogram captures the spread. You define buckets with upper bounds like 10ms, 50ms, 100ms, 500ms. Each request falls into a bucket. The scraper sees how many requests landed in each bucket. You can estimate the 99th percentile from those counts, with a resolution limited by the bucket boundaries. If 99% of requests land at or below the 100ms bound, the tail is under control. If the 99th percentile moves into the 500ms bucket, you have a problem.
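The bucket idea can be sketched in a few lines. This is an illustration only: the Histogram type, its methods, and the demo function are invented for this example and are not the API of any client library. The percentile it reports is the upper bound of the bucket that contains the quantile, so it is an estimate, not an exact value.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
	"time"
)

// Histogram counts observations in fixed buckets (hypothetical type).
type Histogram struct {
	bounds []time.Duration // inclusive upper bounds, ascending
	counts []atomic.Int64  // one per bound, plus a final overflow bucket
}

// NewHistogram builds a histogram from ascending upper bounds.
func NewHistogram(bounds ...time.Duration) *Histogram {
	return &Histogram{
		bounds: bounds,
		counts: make([]atomic.Int64, len(bounds)+1),
	}
}

// Observe records d in the first bucket whose bound is >= d.
func (h *Histogram) Observe(d time.Duration) {
	for i, b := range h.bounds {
		if d <= b {
			h.counts[i].Add(1)
			return
		}
	}
	h.counts[len(h.bounds)].Add(1) // slower than every bound
}

// QuantileBound returns the upper bound of the bucket that contains
// quantile q (0 < q <= 1). It returns false when there are no observations
// or when the quantile falls in the overflow bucket. Under concurrent
// writes the bucket reads are not one consistent snapshot.
func (h *Histogram) QuantileBound(q float64) (time.Duration, bool) {
	var total int64
	for i := range h.counts {
		total += h.counts[i].Load()
	}
	if total == 0 {
		return 0, false
	}
	target := int64(math.Ceil(q * float64(total)))
	var cumulative int64
	for i, b := range h.bounds {
		cumulative += h.counts[i].Load()
		if cumulative >= target {
			return b, true
		}
	}
	return 0, false
}

// demo records 97 fast requests and 3 slower ones, then estimates p50 and p99.
func demo() (p50, p99 time.Duration) {
	h := NewHistogram(10*time.Millisecond, 50*time.Millisecond,
		100*time.Millisecond, 500*time.Millisecond)
	for i := 0; i < 97; i++ {
		h.Observe(5 * time.Millisecond)
	}
	for i := 0; i < 3; i++ {
		h.Observe(80 * time.Millisecond)
	}
	p50, _ = h.QuantileBound(0.50)
	p99, _ = h.QuantileBound(0.99)
	return p50, p99
}

func main() {
	p50, p99 := demo()
	fmt.Println(p50, p99) // prints 10ms 100ms
}
```

The three 80ms requests barely move the average, but they push the estimated 99th percentile from the 10ms bucket to the 100ms bucket, which is exactly the signal the average hides.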
The most common setup for Go services is the pull model popularized by Prometheus. Your application does not push data to a central server. Instead, it hosts an HTTP endpoint that returns all current metric values in a plain-text format. A scraper like Prometheus visits that endpoint on a fixed interval, often every fifteen seconds, parses the text, and stores the time series. This approach keeps your application free from network backpressure. If the scraper goes down, your service keeps running. If your service goes down, the scraper records a gap. Pushing metrics requires your app to manage connections, retries, and authentication for the metrics backend. Pulling shifts that complexity to the scraper. Your app just serves text.
Go's concurrency model makes metrics collection straightforward. You increment a counter in a request handler, and the goroutine that serves the scrape request reads it later. The only requirement is safe concurrent access. You protect shared state with mutexes or atomic operations so the scraper never reads a torn value and no increment is lost.
Numbers are only useful if they survive concurrent access. Protect the state, and the graphs will follow.
A thread-safe counter from scratch
Here's the simplest custom metric: a counter that tracks successful requests using atomic operations.
package main

import (
	"sync/atomic"
)

// RequestCounter tracks successful HTTP requests using atomic operations.
type RequestCounter struct {
	// atomic.Int64 guarantees lock-free, thread-safe increments and reads.
	count atomic.Int64
}

// Increment adds one to the counter without blocking other goroutines.
func (c *RequestCounter) Increment() {
	// Atomic add avoids mutex contention under high request volume.
	c.count.Add(1)
}

// Value returns the current count for scraping or logging.
func (c *RequestCounter) Value() int64 {
	// Atomic load ensures the scraper sees a complete, consistent value.
	return c.count.Load()
}
The struct holds a single atomic.Int64, a type available since Go 1.19. The Increment method bumps the value. The Value method reads it. Both operations typically complete in nanoseconds. You can call Increment from hundreds of concurrent request handlers without taking a lock. The scrape handler calls Value each time the scraper polls and formats the result.
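A quick way to convince yourself that no increment is lost is to hammer the counter from many goroutines and check the total. The sketch below repeats the RequestCounter type in compact form so it compiles on its own; the hammer helper is a name invented for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// RequestCounter is repeated from the article so the example is self-contained.
type RequestCounter struct{ count atomic.Int64 }

func (c *RequestCounter) Increment()   { c.count.Add(1) }
func (c *RequestCounter) Value() int64 { return c.count.Load() }

// hammer (hypothetical helper) starts n goroutines that each increment the
// counter once, waits for all of them, and returns the final value.
func hammer(c *RequestCounter, n int) int64 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Increment()
		}()
	}
	wg.Wait()
	return c.Value()
}

func main() {
	var c RequestCounter
	fmt.Println(hammer(&c, 1000)) // prints 1000
}
```

Running the same program with go run -race should stay silent, because every access to the shared value goes through an atomic operation.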
Atomic operations work beautifully for simple counters and gauges. They break down when you need labeled metrics or complex aggregations.
Go convention prefers short receiver names, usually one or two letters drawn from the type name. The receiver is named c for RequestCounter. Use (c *RequestCounter), not (this *RequestCounter) or (self *RequestCounter). This keeps method signatures readable and consistent with the standard library.
How the runtime handles the load
When your service receives a request, the HTTP server dispatches it to a handler. Each request runs in its own goroutine. If you increment a metric inside that handler, you are modifying shared state from multiple goroutines simultaneously. Without synchronization, a goroutine might read a stale or partially written value, or two increments might overwrite each other. The race detector catches this only when you enable it with the -race flag and the conflicting accesses actually happen during the run. It then prints WARNING: DATA RACE followed by the stack traces of the conflicting goroutines.
Atomic operations solve the problem by using CPU-level instructions that guarantee indivisible reads and writes. A mutex solves it by blocking other goroutines until the critical section finishes. For a single counter, atomics win on performance. For a registry of dozens of metrics with dynamic labels, a mutex or a dedicated client library wins on maintainability.
Go's scheduler does not care that you are updating metrics. It treats the atomic instruction like any other memory operation. The cost is negligible compared to network I/O or database queries. You can safely instrument hot paths without worrying about the scheduler stalling.
The scheduler yields to I/O, not to atomic math. Keep the critical section small, and the pipeline stays full.
Exposing metrics to a scraper
A counter is useless if nothing reads it. Here's a registry that holds multiple counters.
package main

// MetricsRegistry holds application counters and exposes them via HTTP.
type MetricsRegistry struct {
	// requests tracks total successful HTTP requests.
	requests RequestCounter
	// errors tracks total failed HTTP requests.
	errors RequestCounter
}

// NewMetricsRegistry returns a fresh registry ready for instrumentation.
func NewMetricsRegistry() *MetricsRegistry {
	// Zero-value atomic.Int64 starts at 0, so no explicit initialization is needed.
	return &MetricsRegistry{}
}
The registry holds two counters. The constructor returns a pointer to the struct. A common Go guideline is to accept interfaces and return concrete types, so constructors return structs or pointers to structs, not interfaces. Callers can then define whatever narrow interface they need, which keeps the code easy to test.
Here's the handler that reads the counters and formats them for the scraper.
import (
	"fmt"
	"net/http"
)

// HandleScrape formats all registered metrics and writes them to the response.
func (m *MetricsRegistry) HandleScrape(w http.ResponseWriter, r *http.Request) {
	// Each value is loaded atomically. The pair is not one consistent
	// snapshot, which is acceptable for independent monotonic counters.
	reqVal := m.requests.Value()
	errVal := m.errors.Value()

	// Prometheus format expects a metric name, optional labels, and a value.
	output := fmt.Sprintf(
		"# HELP http_requests_total Total number of HTTP requests.\n"+
			"# TYPE http_requests_total counter\n"+
			"http_requests_total %d\n"+
			"# HELP http_errors_total Total number of HTTP errors.\n"+
			"# TYPE http_errors_total counter\n"+
			"http_errors_total %d\n",
		reqVal, errVal,
	)

	// version=0.0.4 identifies the Prometheus text exposition format.
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	w.Write([]byte(output))
}
The HandleScrape method reads both values atomically, formats them into the expected text layout, and writes them to the response. The Content-Type header signals to scrapers that this is a standard metrics endpoint. You register this handler on a dedicated route like /metrics and let the scraper poll it.
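Registering the handler on /metrics might look like the sketch below. To stay short and self-contained it uses a trimmed stand-in called registry in place of MetricsRegistry and omits the HELP and TYPE lines. The helpers newMux and scrape are invented for this example, and the scrape is performed in-process with net/http/httptest so the sketch runs without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// registry is a trimmed stand-in for MetricsRegistry (assumption made so
// this example compiles on its own).
type registry struct {
	requests atomic.Int64
	errors   atomic.Int64
}

// handleScrape writes both counters in the plain-text layout.
func (m *registry) handleScrape(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "http_requests_total %d\nhttp_errors_total %d\n",
		m.requests.Load(), m.errors.Load())
}

// newMux registers the scrape handler on the dedicated /metrics route.
func newMux(m *registry) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", m.handleScrape)
	return mux
}

// scrape performs one in-process GET /metrics and returns the body,
// playing the role of the external scraper.
func scrape(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	m := &registry{}
	m.requests.Add(3)
	m.errors.Add(1)
	fmt.Print(scrape(newMux(m)))
	// A real service would serve the mux instead, for example with
	// http.ListenAndServe(":9100", newMux(m)). The port is an assumption.
}
```

The same in-process pattern works well in unit tests: increment a counter, call the handler through a recorder, and assert on the text the scraper would see.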
Go convention dictates that functions accepting a context should take context.Context as the first parameter, conventionally named ctx. An HTTP handler is the exception to the signature rule: its context arrives inside the request and is available through r.Context(). This handler ignores it because it does not perform long-running work or cancelable I/O. If you ever add database queries or downstream calls to your metrics collection, pass r.Context() to them immediately.
Scrapers expect consistency. Format the text once, set the header, and let the polling interval do the heavy lifting.
Where things go sideways
Custom metrics introduce three common failure modes. The first is cardinality explosion. Every distinct combination of label values creates its own time series. If you tag a metric with a user ID, you get one series per user. If you tag it with a request token, you get one series per request. Imagine a service handling 10,000 requests per second: a per-request label creates up to 10,000 new time series every second. No metrics backend handles that gracefully. Memory use climbs until the scraper slows to a crawl or crashes. Keep labels low cardinality. Use status codes, endpoints, or service names. These values repeat across requests, so the number of time series stays small. Never use identifiers that scale with request volume or user count.
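One way to keep a label bounded is to collapse raw values into a small fixed set before they ever reach a metric. The sketch below does this for HTTP status codes. The function name statusClass and the "unknown" fallback are choices made for this example, not an established convention.

```go
package main

import (
	"fmt"
	"strconv"
)

// statusClass (hypothetical helper) maps an HTTP status code to one of at
// most six label values: "1xx" through "5xx", or "unknown" for anything
// outside the valid range.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return strconv.Itoa(code/100) + "xx"
}

func main() {
	for _, code := range []int{200, 204, 404, 503, 42} {
		fmt.Println(code, statusClass(code))
	}
}
```

However much traffic the service handles, a label built this way can never produce more than six time series per metric.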
The second is race conditions on complex aggregations. Atomic operations work for single counters. They do not work for histograms with dynamic buckets or for maps that track per-label values. If you try to increment a map value without a mutex, the compiler will not stop you. At run time the program may abort with fatal error: concurrent map writes, and the race detector will flag WARNING: DATA RACE when tests run with -race. Switch to a mutex or use a battle-tested client library that handles concurrent bucket updates.
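For per-label values, a mutex around the map is the simplest safe option. The following is a minimal sketch: LabeledCounter, its methods, and the countStatuses helper are names invented for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// LabeledCounter keeps one count per label value (hypothetical type).
type LabeledCounter struct {
	mu     sync.Mutex
	counts map[string]int64
}

// NewLabeledCounter allocates the map up front, since a nil map cannot be written.
func NewLabeledCounter() *LabeledCounter {
	return &LabeledCounter{counts: make(map[string]int64)}
}

// Inc adds one to the count for label. The lock covers the whole
// read-modify-write on the map.
func (c *LabeledCounter) Inc(label string) {
	c.mu.Lock()
	c.counts[label]++
	c.mu.Unlock()
}

// Get returns the current count for label, or 0 if it was never incremented.
func (c *LabeledCounter) Get(label string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}

// countStatuses increments the counter from one goroutine per status value
// to simulate concurrent request handlers.
func countStatuses(statuses []string) *LabeledCounter {
	c := NewLabeledCounter()
	var wg sync.WaitGroup
	for _, s := range statuses {
		wg.Add(1)
		go func(label string) {
			defer wg.Done()
			c.Inc(label)
		}(s)
	}
	wg.Wait()
	return c
}

func main() {
	c := countStatuses([]string{"200", "200", "500", "200"})
	fmt.Println(c.Get("200"), c.Get("500")) // prints 3 1
}
```

The critical section is a single map update, so contention stays low for a handful of labels. A client library becomes the better choice once you need many metrics, histograms, or label combinations.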
The third is ignoring the pull model's latency implications. If your /metrics endpoint calculates percentiles on the fly or queries a database, the scrape can exceed the scraper's timeout. Metrics endpoints must be fast. Pre-aggregate values, cache results for a few seconds, or expose raw buckets and let the metrics backend compute percentiles at query time. The endpoint should return in milliseconds, not seconds.
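Caching can be sketched as a handler that only ever serves text rendered ahead of time. The SnapshotCache type, its Refresh method, and the fetch helper are invented for this example, and returning 503 before the first refresh is one possible design choice among several.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// SnapshotCache holds the most recently rendered metrics text (hypothetical type).
type SnapshotCache struct {
	rendered atomic.Pointer[string]
}

// Refresh runs the possibly slow render function and publishes its result.
// It is meant to be called from a background goroutine, not from the handler.
func (s *SnapshotCache) Refresh(render func() string) {
	out := render()
	s.rendered.Store(&out)
}

// ServeHTTP writes the cached text without computing anything.
func (s *SnapshotCache) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	p := s.rendered.Load()
	if p == nil {
		http.Error(w, "metrics not ready", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, *p)
}

// fetch calls the handler in-process and returns the status code and body.
func fetch(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	var cache SnapshotCache
	status, _ := fetch(&cache)
	fmt.Println(status) // prints 503

	// In a real service the render function would do the expensive work.
	cache.Refresh(func() string { return "queue_depth 7\n" })
	status, body := fetch(&cache)
	fmt.Printf("%d %q\n", status, body)
}
```

In a running service you would call Refresh from a loop driven by a time.Ticker, so the cost of rendering is paid on your schedule and not on the scraper's.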
The compiler will not catch a slow metrics endpoint. Your scraper will, and it will mark your service as down.
Picking the right tracking tool
Use a custom atomic counter when you need a single, high-performance number that only increases. Use an atomic gauge when you need a single value that rises and falls, like active connections or queue depth, and switch to a mutex when several values must change together. Use a histogram or summary library when you need latency percentiles and cannot afford to compute them on every scrape. Use runtime/metrics when you only care about Go internals like GC pauses, heap allocations, or goroutine counts. Use a third-party client such as the Prometheus or Datadog Go libraries when you need labeled metrics, automatic histogram buckets, or integration with existing dashboards. Use a plain unsynchronized variable only when a single goroutine reads and writes it, as in a quick prototype or a sequential batch job: the simplest thing that works is usually the right thing.
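Since atomics also suit simple gauges, a gauge for in-flight work can mirror the counter from earlier. The sketch below is one possible shape: the Gauge type and the track helper are names invented for this example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Gauge is a value that can rise and fall (hypothetical type).
type Gauge struct{ value atomic.Int64 }

func (g *Gauge) Inc()         { g.value.Add(1) }
func (g *Gauge) Dec()         { g.value.Add(-1) }
func (g *Gauge) Value() int64 { return g.value.Load() }

// track raises the gauge while work runs and lowers it afterwards.
// The deferred Dec also runs if work panics, so the gauge cannot leak upward.
func track(g *Gauge, work func()) {
	g.Inc()
	defer g.Dec()
	work()
}

func main() {
	var inFlight Gauge
	track(&inFlight, func() {
		fmt.Println("during:", inFlight.Value()) // prints during: 1
	})
	fmt.Println("after:", inFlight.Value()) // prints after: 0
}
```

Wrapping a request handler's body in track gives you an active-requests gauge with the same cost profile as the counter.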
Metrics are infrastructure. Pick the shape that matches the question you are asking.