How to Use Distributed Tracing in Go (OpenTelemetry, Jaeger)

Implement distributed tracing in Go by initializing the OpenTelemetry SDK with an OTLP exporter configured to send spans to a Jaeger collector.

The invisible thread across services

A user clicks checkout on an e-commerce site. The request hits an API gateway, which forwards it to an inventory service, then a payment processor, then a notification worker. The request fails after four seconds. You pull up the logs. The gateway says timeout. The inventory service says db slow. The payment service says connection reset. You have three separate logs, three separate timestamps, and zero visibility into which step actually caused the delay or where the request dropped.

Distributed tracing solves this blind spot. Instead of hunting through isolated log files, you get a single timeline that shows exactly how long each service took, in what order they ran, and where the failure originated. In Go, you build this timeline using OpenTelemetry. The SDK instruments your code to emit spans. An exporter ships those spans over the network. Jaeger collects them and renders the interactive trace view.

What distributed tracing actually does

A trace is the complete journey of a single request across your system. A span is one discrete step inside that journey. Think of it like a relay race. Each runner receives a baton that carries the race ID. They start their leg, run their portion, and pass the baton forward. When they finish, they record their split time. The baton ensures every leg belongs to the same race, even if the runners are on different tracks.

In Go, the baton lives inside context.Context. When you start a span, the SDK attaches a trace ID and a span ID to the context. You pass that context to every function that needs to participate in the trace. When a function starts its own span, it reads the parent IDs from the context and links itself to the chain. The context also carries deadlines and cancellation signals, which means the value that links your spans is the same one that coordinates timeouts and shutdown.
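To make the baton idea concrete, here is a minimal standard-library sketch. It is not how the OpenTelemetry SDK stores spans internally; the traceKey type, the helper functions, and the "trace-42" ID are all hypothetical names chosen for illustration. It only shows that a value attached to a context travels with it, and is lost when a function is handed a fresh context instead.

```go
package main

import (
	"context"
	"fmt"
)

// traceKey is an unexported key type so other packages cannot collide with it.
// (Hypothetical: the real SDK uses its own private key.)
type traceKey struct{}

// withTraceID returns a child context carrying the given trace ID.
func withTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceKey{}, id)
}

// traceIDFrom reads the trace ID back, or "" if the context carries none.
func traceIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(traceKey{}).(string)
	return id
}

// newRequestContext simulates the start of a request with a known trace ID.
func newRequestContext(id string) context.Context {
	return withTraceID(context.Background(), id)
}

// reserve stands in for a downstream function that joins the trace.
func reserve(ctx context.Context) string {
	return "reserve saw trace " + traceIDFrom(ctx)
}

func main() {
	ctx := newRequestContext("trace-42")
	fmt.Println(reserve(ctx))                  // the baton was passed
	fmt.Println(reserve(context.Background())) // the baton was dropped
}
```

The second call prints an empty trace ID, which is the in-process equivalent of a span that has lost its parent.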

OpenTelemetry provides the instrumentation layer. Jaeger provides the storage and visualization layer. The two communicate via OTLP, the OpenTelemetry Protocol. OTLP is just a structured format sent over HTTP or gRPC. You configure the exporter once, and every span flows through the same pipe.

Tracing is not a replacement for logging. Logs capture discrete events. Traces capture relationships and timing. You use both. Tracing tells you where to look. Logging tells you what happened when you get there.

Setting up the tracer provider

You need three pieces to start emitting traces: an exporter to ship data, a resource to identify your service, and a tracer provider to manage the lifecycle. The setup runs once at startup.

Here is the minimal provider initialization. It creates an HTTP exporter, attaches a service name, and registers the provider globally.

package main

import (
	"context"
	"log"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer() (*trace.TracerProvider, error) {
	ctx := context.Background()
	// HTTP exporter reads its target from env vars such as OTEL_EXPORTER_OTLP_ENDPOINT
	exporter, err := otlptracehttp.New(ctx)
	if err != nil {
		return nil, err
	}

	// Resource tags every span with service metadata for filtering in Jaeger
	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceName(os.Getenv("SERVICE_NAME"))),
	)
	if err != nil {
		return nil, err
	}

	// Batcher groups spans to reduce network overhead and improve throughput
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(res),
	)

	return tp, nil
}

The initTracer function returns a *trace.TracerProvider and an error. You call it early in main. If it fails, you exit. If it succeeds, you replace the default no-op provider so every subsequent span actually ships somewhere.

func main() {
	ctx := context.Background()
	tp, err := initTracer()
	if err != nil {
		log.Fatalf("tracing setup failed: %v", err)
	}
	// Ensure queued spans flush before the process exits
	defer func() { _ = tp.Shutdown(ctx) }()

	otel.SetTracerProvider(tp)
	tracer := tp.Tracer("checkout-service")

	// Start the first span to represent the incoming request
	ctx, span := tracer.Start(ctx, "handle-checkout")
	defer span.End()

	log.Println("processing request")
}

The defer span.End() call is mandatory. A span's duration runs from Start to End. The SDK hands a span to the batcher only when End is called, so a span you forget to end is never exported. It goes missing from the trace, and any children it started point at a parent that Jaeger never received. Always pair Start with End.

Walking through the runtime

When tracer.Start runs, the SDK allocates a new span object. It generates a unique span ID and copies the trace ID from the incoming context. If no trace ID exists, it creates one. The span attaches to the context and returns the updated context. Every function that receives this context now participates in the same trace.
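The inherit-or-create decision described above can be sketched without the SDK. This is a simplified model, not the SDK's implementation: spanContext, randomHex, and startSpan are hypothetical names. The ID sizes (16 bytes for a trace ID, 8 bytes for a span ID) follow the OpenTelemetry convention.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// spanContext is a simplified stand-in for the identifiers a span carries.
type spanContext struct {
	TraceID  string // 16 random bytes, hex encoded
	SpanID   string // 8 random bytes, hex encoded
	ParentID string // empty for a root span
}

// randomHex returns n random bytes as a hex string.
func randomHex(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no sensible recovery in a sketch
	}
	return hex.EncodeToString(b)
}

// startSpan inherits the trace ID from the parent if there is one,
// otherwise it begins a new trace. Every span gets a fresh span ID.
func startSpan(parent *spanContext) spanContext {
	sc := spanContext{SpanID: randomHex(8)}
	if parent == nil {
		sc.TraceID = randomHex(16)
		return sc
	}
	sc.TraceID = parent.TraceID
	sc.ParentID = parent.SpanID
	return sc
}

func main() {
	root := startSpan(nil)
	child := startSpan(&root)
	fmt.Println("same trace:", root.TraceID == child.TraceID)
	fmt.Println("child links to root:", child.ParentID == root.SpanID)
}
```

In the real SDK the parent is not passed explicitly; tracer.Start finds it in the context you hand over.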

The batcher sits in the background. It collects ended spans in memory, waits for a configurable flush interval or batch size, serializes them (with optional compression), and sends a single HTTP POST to the OTLP endpoint. This batching keeps your service from issuing one request per span and flooding the collector with tiny payloads.
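The idea behind batching fits in a few lines. The sketch below is a toy, not the SDK's batch span processor: the batcher type and runDemo are hypothetical, and it flushes only on size and on an explicit call, where the real processor also flushes on a timer. In the SDK, the matching knobs are options passed to trace.WithBatcher, such as trace.WithBatchTimeout and trace.WithMaxExportBatchSize.

```go
package main

import "fmt"

// batcher buffers items and hands them to export in groups.
type batcher struct {
	maxBatch int
	buf      []string
	export   func(batch []string)
}

// Add queues one item and flushes when the batch is full.
func (b *batcher) Add(item string) {
	b.buf = append(b.buf, item)
	if len(b.buf) >= b.maxBatch {
		b.Flush()
	}
}

// Flush exports whatever is buffered. A real processor would also
// call this from a timer and during shutdown.
func (b *batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	batch := b.buf
	b.buf = nil
	b.export(batch)
}

// runDemo pushes n spans through a batcher and returns the size of
// each exported batch, in order.
func runDemo(n, maxBatch int) []int {
	var sizes []int
	b := &batcher{maxBatch: maxBatch, export: func(batch []string) {
		sizes = append(sizes, len(batch))
	}}
	for i := 0; i < n; i++ {
		b.Add(fmt.Sprintf("span-%d", i))
	}
	b.Flush() // final flush, as a shutdown would do
	return sizes
}

func main() {
	fmt.Println(runDemo(7, 3)) // seven spans leave as three exports
}
```

Seven spans with a batch size of three produce three exports instead of seven, and the final flush is why tp.Shutdown matters at exit.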

Jaeger receives the OTLP payload, parses the spans, indexes them by trace ID and service name, and stores them in its backend. When you open the Jaeger UI and search for a trace ID, Jaeger reconstructs the tree. Parent spans appear at the top. Child spans indent below them. Duration bars show exactly where time was spent.
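Rebuilding the tree needs nothing more than each span's ID and its parent's ID. The following is a rough sketch of that reconstruction, not Jaeger's actual code: the span struct and renderTree are hypothetical, and siblings keep their input order here, whereas a real UI orders them by start time.

```go
package main

import (
	"fmt"
	"strings"
)

// span holds only the fields needed to rebuild the hierarchy.
type span struct {
	ID       string
	ParentID string // empty marks the root
	Name     string
}

// renderTree groups spans by parent and returns one indented line per span.
func renderTree(spans []span) []string {
	children := make(map[string][]span)
	for _, s := range spans {
		children[s.ParentID] = append(children[s.ParentID], s)
	}

	var lines []string
	var walk func(parentID string, depth int)
	walk = func(parentID string, depth int) {
		for _, s := range children[parentID] {
			lines = append(lines, strings.Repeat("  ", depth)+s.Name)
			walk(s.ID, depth+1)
		}
	}
	walk("", 0) // start from the spans that have no parent
	return lines
}

func main() {
	// Spans can arrive in any order; only the parent links matter.
	spans := []span{
		{ID: "b", ParentID: "a", Name: "db.reserve-inventory"},
		{ID: "c", ParentID: "a", Name: "ext.charge-payment"},
		{ID: "a", ParentID: "", Name: "POST /checkout"},
	}
	for _, line := range renderTree(spans) {
		fmt.Println(line)
	}
}
```

A span whose parent was never exported has no entry to hang from, which is why a forgotten End or a broken context shows up as missing or orphaned rows.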

The resource configuration matters more than it looks. Without a service name, Jaeger groups all spans into a single bucket. You lose the ability to filter by component. The semantic conventions package (semconv) provides standardized attribute keys. Use them. They make cross-tool compatibility automatic.

Context is plumbing. Run it through every long-lived call site.

Instrumenting a real HTTP handler

Real services handle HTTP requests, call databases, and talk to other APIs. You instrument each boundary. The handler starts the root span. The database call starts a child span. The external API call starts another child span. Errors get recorded as span events or status codes.

Here is a realistic handler that instruments a checkout flow. It follows the Go convention of passing context.Context as the first parameter to every function that performs I/O or business logic.

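// Besides the imports shown earlier, this handler needs "net/http" and
// "go.opentelemetry.io/otel/codes".
func checkoutHandler(w http.ResponseWriter, r *http.Request) {
	// Look up a named tracer from the global provider
	tracer := otel.Tracer("checkout-service")
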
	// Start a span that represents the full HTTP request lifecycle
	ctx, span := tracer.Start(r.Context(), "POST /checkout")
	defer span.End()

	// Attach HTTP attributes so Jaeger shows method and path
	span.SetAttributes(
		semconv.HTTPMethodKey.String(r.Method),
		semconv.HTTPRouteKey.String("/checkout"),
	)

	// Pass the enriched context to business logic
	err := processOrder(ctx, r)
	if err != nil {
		// Record the error on the span and set status to error
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	w.WriteHeader(http.StatusOK)
}

The processOrder function receives the context and starts its own spans. This keeps the handler thin and pushes instrumentation closer to the actual work.

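func processOrder(ctx context.Context, r *http.Request) error {
	tracer := otel.Tracer("checkout-service")

	// Start a child span for database operations. Keep the parent ctx
	// unchanged so the payment span below becomes a sibling, not a grandchild.
	dbCtx, dbSpan := tracer.Start(ctx, "db.reserve-inventory")

	// Simulate a database call that respects context cancellation
	err := reserveInventory(dbCtx, r.URL.Query().Get("item"))
	if err != nil {
		dbSpan.RecordError(err)
		dbSpan.SetStatus(codes.Error, err.Error())
	}
	// End explicitly so the span covers only the database call
	dbSpan.End()
	if err != nil {
		return err
	}

	// Start a second child span for the external payment service
	payCtx, paySpan := tracer.Start(ctx, "ext.charge-payment")
	defer paySpan.End()

	err = chargePayment(payCtx, r.URL.Query().Get("token"))
	if err != nil {
		paySpan.RecordError(err)
		paySpan.SetStatus(codes.Error, err.Error())
		return err
	}

	return nil
}

Notice the pattern. Every span is ended exactly once. The database span ends explicitly, right after its call, so its duration covers only that call. The payment span ends with defer because it is the last step. Both children start from the handler's context, so they appear as siblings under POST /checkout rather than nested inside each other. Errors get recorded immediately. The context flows downward. If the HTTP client cancels the request, the context is cancelled, a driver that honors the context aborts, and the resulting context.Canceled error is recorded on the active span. The trace shows exactly where the cancellation propagated.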

The if err != nil { return err } pattern is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. Tracing complements this by attaching the error to the span so you can see it in context rather than hunting through log files.

Don't fight the type system. Wrap the value or change the design.

Where things go wrong

Tracing fails silently when configured incorrectly. The most common mistake is forgetting to set the global provider. If you skip otel.SetTracerProvider(tp), every call to otel.Tracer returns a no-op tracer from the default global provider. Your code runs without panics, but Jaeger receives nothing from those call sites. The compiler will not catch this. You have to verify the exporter is actually shipping data.

Another frequent issue is a broken context chain. If you create a new context with context.Background() inside a handler instead of using r.Context(), the new span becomes a root span in a separate trace, and the parent request disappears from its timeline. The compiler only catches type mistakes: pass a context where a span is expected and the build fails with an error along the lines of cannot use ctx (variable of type context.Context) as trace.Span value in argument. It won't warn you about logical breaks like this one.

Background goroutines are the other trap. If you spawn a worker to send a notification and pass it the request context, that context is cancelled as soon as the handler returns. Any context-aware call in the worker then fails with context canceled, and the notification may never go out. The opposite mistake, a goroutine that waits on a context that never gets cancelled, leaks the goroutine. For work that should outlive the request, derive a context with context.WithoutCancel from the standard library (Go 1.21 and later), which keeps the span stored in the context but drops the cancellation, or explicitly manage the goroutine's lifecycle with its own timeout.
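Here is a small standard-library demonstration of what context.WithoutCancel does, assuming Go 1.21 or later. The traceKey type, the "trace-42" value, and detachDemo are hypothetical stand-ins; the OpenTelemetry SDK also keeps the current span in the context as a value, so the same behavior applies to it.

```go
package main

import (
	"context"
	"fmt"
)

// traceKey is a hypothetical key standing in for the SDK's span key.
type traceKey struct{}

// detachDemo simulates a request context that is cancelled when the
// handler returns, plus a detached copy made for background work.
func detachDemo() (requestErr, backgroundErr error, traceID string) {
	reqCtx, cancel := context.WithCancel(
		context.WithValue(context.Background(), traceKey{}, "trace-42"),
	)

	// WithoutCancel keeps the values but not the cancellation or deadline.
	bgCtx := context.WithoutCancel(reqCtx)

	cancel() // the handler returned

	traceID, _ = bgCtx.Value(traceKey{}).(string)
	return reqCtx.Err(), bgCtx.Err(), traceID
}

func main() {
	reqErr, bgErr, id := detachDemo()
	fmt.Println("request context:", reqErr)
	fmt.Println("background context:", bgErr)
	fmt.Println("trace ID still visible:", id)
}
```

Because the detached context never cancels on its own, it is usually worth wrapping it in context.WithTimeout so the background work still has an upper bound.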

The batcher can also back up if the OTLP endpoint is unreachable. Spans queue in memory until the queue fills. Once it is full, the SDK drops new spans rather than block your request path, and it may do so quietly. You lose visibility exactly when you need it most. Configure a reasonable queue size (trace.WithMaxQueueSize is the batcher option for this) and add a fallback exporter if your environment requires it.
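The drop-when-full behavior is easy to model with a buffered channel and a non-blocking send. This is an illustrative sketch rather than the SDK's queue: boundedQueue and its methods are hypothetical, and the drop counter is not safe for concurrent use.

```go
package main

import "fmt"

// boundedQueue accepts items until it is full, then counts drops
// instead of blocking the caller.
type boundedQueue struct {
	items   chan string
	dropped int
}

// newBoundedQueue creates a queue that holds at most size items.
func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{items: make(chan string, size)}
}

// Enqueue never blocks: when the buffer is full the item is dropped
// and false is returned.
func (q *boundedQueue) Enqueue(item string) bool {
	select {
	case q.items <- item:
		return true
	default:
		q.dropped++
		return false
	}
}

func main() {
	// Nothing drains the queue, as if the endpoint were unreachable.
	q := newBoundedQueue(2)
	for i := 0; i < 5; i++ {
		q.Enqueue(fmt.Sprintf("span-%d", i))
	}
	fmt.Println("queued:", len(q.items), "dropped:", q.dropped)
}
```

The trade-off is deliberate: losing telemetry is preferable to stalling the request that produced it, which is why a drop counter or metric is worth watching.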

The worst goroutine bug is the one that never logs.

When to reach for tracing

Observability tools solve different problems. Picking the right one depends on what you are trying to measure.

Use distributed tracing when you need to follow a single request across multiple services and measure latency at each boundary. Use structured logging when you need searchable, timestamped event records for debugging specific instances or auditing compliance. Use metrics when you need aggregated system health over time, like request rates, error percentages, or CPU utilization. Use OpenTelemetry when you want a single instrumentation layer that can export to multiple backends without rewriting your code. Skip tracing when a request never leaves a single process: the simplest thing that works is usually the right thing.

Tracing adds overhead. Each span allocates memory, generates IDs, and updates context. The batcher serializes and ships data. In high-throughput services, you sample traces. Sampling keeps the signal clear without drowning your collector in noise. The OpenTelemetry ecosystem supports head-based and tail-based sampling. Head-based sampling decides when the trace starts, inside the SDK, and never records the spans it rejects. Tail-based sampling collects every span, waits for the trace to complete, and then decides which traces to keep, which typically happens in the OpenTelemetry Collector rather than in your Go process. Pick based on your traffic volume and debugging needs.
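A head-based decision can be made deterministic by deriving it from the trace ID, so every service that sees the same trace reaches the same verdict. The sketch below illustrates that idea with a hypothetical shouldSample function; it is not a drop-in replacement for the SDK's sampler. In the SDK you would configure this with trace.WithSampler, for example trace.ParentBased(trace.TraceIDRatioBased(0.1)).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample makes a deterministic head-based decision from the trace ID.
// It reads the last 8 bytes of the ID as a number and keeps the trace when
// that number falls below ratio * 2^63.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

func main() {
	// Two hand-built IDs at opposite ends of the number range.
	low := [16]byte{15: 0x01}
	high := [16]byte{8: 0xff, 9: 0xff, 10: 0xff, 11: 0xff, 12: 0xff, 13: 0xff, 14: 0xff, 15: 0xff}

	fmt.Println("low ID kept at 10%:", shouldSample(low, 0.1))
	fmt.Println("high ID kept at 10%:", shouldSample(high, 0.1))
}
```

Because trace IDs are random, roughly the configured fraction of traces falls below the bound, and no coordination between services is needed.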

Convention matters. Public names start with a capital letter. Private names start lowercase. There are no keywords like public or private. "Accept interfaces, return structs" is the most common Go style mantra. Apply the same discipline to tracing: accept a context.Context, return a value or error, and let the SDK handle the heavy lifting.

Trust gofmt. Argue logic, not formatting.

Where to go next