How to Implement Retry with Exponential Backoff in Go

Implement retry with exponential backoff in Go using a loop that doubles the wait time and adds jitter on each failure.

When the network blinks

You ship a service that calls a third-party payment gateway. The gateway returns a 503 error. Your code catches the error and immediately tries again. It fails again. You try again. Within two seconds, your single request has spawned fifty retries. The gateway sees a flood, rate-limits your IP, and your service grinds to a halt.

This is a retry storm, and when every client behaves the same way it becomes the thundering herd problem. Networks drop packets. Load balancers restart. Databases lock tables. Temporary failures are normal in distributed systems. The problem is not the failure itself. The problem is how you respond to it.

The math of patience

Exponential backoff is a retry strategy that increases the wait time between attempts. Instead of hammering a service at a fixed interval, you double the delay after each failure. One hundred milliseconds. Two hundred. Four hundred. Eight hundred. The delay grows fast, which gives the failing service time to recover and prevents your client from overwhelming it.
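
To make the doubling concrete, here is a small sketch that prints the schedule described above. The helper name backoffSchedule and the 100 ms base are illustrative assumptions, and the sketch deliberately leaves out the cap and the jitter.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the wait before each retry: base, 2*base, 4*base, ...
// It is a hypothetical helper used only to illustrate the doubling.
func backoffSchedule(base time.Duration, retries int) []time.Duration {
	var delays []time.Duration
	delay := base
	for i := 0; i < retries; i++ {
		delays = append(delays, delay)
		delay *= 2 // double after every failed attempt
	}
	return delays
}

func main() {
	// Prints 100ms, 200ms, 400ms, 800ms for retries 1 through 4.
	for i, d := range backoffSchedule(100*time.Millisecond, 4) {
		fmt.Printf("retry %d waits %v\n", i+1, d)
	}
}
```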

Jitter is the random noise added to that delay. If ten clients all fail at the exact same millisecond and all calculate the same backoff, they will all retry at the exact same moment. You just rebuilt the thundering herd. Jitter scatters those retries across a time window so they arrive at the server as a gentle stream instead of a wall.

Think of it like knocking on a neighbor's door. You knock once. Wait a second. Knock again. Wait two seconds. Knock a third time. Wait four seconds. If you add jitter, you sometimes wait three and a half seconds, sometimes four and a half. The neighbor gets breathing room. You avoid pounding on the wood.

The minimal loop

Here is the simplest retry function that implements exponential backoff with jitter. It takes a context, a maximum attempt count, and the operation you want to run.

import (
	"context"
	"errors"
	"math/rand"
	"time"
)

// RetryWithBackoff executes fn until it succeeds or reaches maxAttempts.
func RetryWithBackoff(ctx context.Context, maxAttempts int, fn func() error) error {
	if maxAttempts < 1 {
		return errors.New("retry: maxAttempts must be at least 1")
	}
	baseDelay := 100 * time.Millisecond
	maxDelay := 10 * time.Second
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))

	var err error // declared outside the loop so the last failure can be returned
	for i := 0; i < maxAttempts; i++ {
		if err = fn(); err == nil {
			return nil // success path exits immediately
		}

		if i == maxAttempts-1 {
			break // last attempt failed, skip the final sleep
		}

		// Calculate exponential delay using bit shifting
		delay := time.Duration(1<<uint(i)) * baseDelay
		if delay <= 0 || delay > maxDelay {
			delay = maxDelay // cap long waits; a non-positive value means the math overflowed
		}

		// Add random jitter between -10% and +10%
		jitter := time.Duration((rng.Float64()*0.2 - 0.1) * float64(delay))
		select {
		case <-time.After(delay + jitter):
		case <-ctx.Done():
			return ctx.Err() // respect external cancellation
		}
	}
	return err // the error from the final attempt
}

The function is a simple loop. It calls your operation. If the operation returns nil, the function returns immediately. If it returns an error, the loop calculates the next wait time, sleeps, and tries again. The select statement replaces time.Sleep so the function can exit early if the context gets cancelled.

How the pieces interact at runtime

The bit shift 1 << uint(i) is a fast way to calculate powers of two. When i is zero, the shift produces one. When i is one, it produces two. When i is three, it produces eight. Multiplying that by baseDelay gives you the exponential curve without importing a math package or calling math.Pow. Because i is only known at runtime, the shift is not folded away at compile time, but it compiles to a cheap integer shift instruction.

The jitter calculation uses a uniform distribution. rng.Float64() returns a value from zero up to, but not including, one. Multiplying by 0.2 stretches it to 0.0 through 0.2. Subtracting 0.1 centers it at zero, giving a range of -0.1 to +0.1. That range represents negative ten percent to positive ten percent of the current delay, not of baseDelay. Adding it to delay shifts the sleep window slightly left or right.
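
The arithmetic above can be pulled into a small function to check the bounds by hand. This is a sketch under assumptions: applyJitter is a hypothetical name, and it takes the random fraction as a parameter so the mapping is easy to inspect.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// applyJitter shifts delay by up to ±10%.
// u is expected to be in [0, 1), for example a value from rand.Float64.
func applyJitter(delay time.Duration, u float64) time.Duration {
	factor := u*0.2 - 0.1 // map [0, 1) onto [-0.1, +0.1)
	return delay + time.Duration(factor*float64(delay))
}

func main() {
	delay := 400 * time.Millisecond
	fmt.Println(applyJitter(delay, 0.0))            // lower edge, about 10% shorter
	fmt.Println(applyJitter(delay, 0.5))            // midpoint, no shift
	fmt.Println(applyJitter(delay, rand.Float64())) // somewhere in between or above
}
```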

The select block is where Go's concurrency model shines. time.After sends a value on a channel after the duration elapses. ctx.Done() returns a channel that is closed when the context is cancelled or its deadline passes. The select picks whichever channel becomes ready first. If the context wins, the function returns context.Canceled or context.DeadlineExceeded instead of waiting out the full sleep.

By convention, the context is the first parameter of a Go function and is named ctx. Functions that accept a context should check ctx.Done() around blocking operations. This pattern threads cancellation through your entire call stack. Trust the convention. Run it through every long-lived call site.

The idempotency requirement

Retry logic assumes your operation is safe to run multiple times. That property is called idempotency. An idempotent operation produces the same result whether you call it once or ten times. Reading a file is idempotent. Fetching a JSON endpoint is idempotent. Checking a database record is idempotent.

Writing to a file without clearing it first is not idempotent. Appending to a log is not idempotent. Charging a credit card is not idempotent. If you retry a non-idempotent operation, you create duplicate records, double charges, or corrupted state.

You have two options when dealing with non-idempotent operations. You can add an idempotency key to your API requests. The server stores the key and returns the cached result on subsequent calls with the same key. Or you can move the retry logic to the server side where the transaction boundary is clearer. Never retry a mutating operation without a deduplication strategy.
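
The idempotency-key option can be sketched with an in-memory map standing in for the server's deduplication table. Everything here is hypothetical (the chargeStore type, the key format, the receipt string); a real server would persist the keys and expire them.

```go
package main

import (
	"fmt"
	"sync"
)

// chargeStore is a hypothetical stand-in for a server-side deduplication
// table keyed by idempotency key.
type chargeStore struct {
	mu      sync.Mutex
	results map[string]string // idempotency key -> receipt of the first call
	charges int               // how many times the card was actually charged
}

// Charge performs the charge once per key and replays the stored receipt afterwards.
func (s *chargeStore) Charge(key string, cents int) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if receipt, seen := s.results[key]; seen {
		return receipt // duplicate request: return the cached result, do not charge again
	}
	s.charges++
	receipt := fmt.Sprintf("receipt-%d-%dc", s.charges, cents)
	s.results[key] = receipt
	return receipt
}

func main() {
	store := &chargeStore{results: make(map[string]string)}
	first := store.Charge("order-42", 1999)
	retry := store.Charge("order-42", 1999) // client retried after a timeout
	fmt.Println(first == retry, store.charges) // true 1
}
```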

Putting it in production code

Real services rarely retry blindly. You need to distinguish between transient errors and permanent ones. A 404 Not Found is permanent. A 503 Service Unavailable is transient. A database connection timeout is transient. A malformed JSON payload is permanent.
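
One way to encode that distinction for HTTP is a small classifier. The exact set of retryable codes is an assumption in this sketch: it treats every 5xx as transient, as the text does for 503, and adds 408 and 429 as common candidates. Check the documentation of the API you call before adopting it.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableStatus reports whether an HTTP status suggests a transient failure.
// The chosen codes are an assumption; adjust them to the downstream API.
func isRetryableStatus(code int) bool {
	switch code {
	case http.StatusRequestTimeout, http.StatusTooManyRequests: // 408, 429
		return true
	}
	return code >= 500 // server-side errors are treated as temporary
}

func main() {
	for _, code := range []int{404, 429, 503} {
		fmt.Printf("%d retryable: %v\n", code, isRetryableStatus(code))
	}
}
```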

Here is how you wrap an HTTP call with selective retry logic.

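import (
	"context"
	"fmt"
	"io"
	"net/http"
)

// FetchWithRetry calls the URL and retries only on transient failures.
func FetchWithRetry(ctx context.Context, client *http.Client, url string) ([]byte, error) {
	var body []byte
	var permanent error // set when retrying cannot help

	err := RetryWithBackoff(ctx, 5, func() error {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			permanent = err
			return nil // request construction failed, stop retrying
		}

		resp, err := client.Do(req)
		if err != nil {
			return err // network error, retryable
		}
		defer resp.Body.Close()

		if resp.StatusCode >= 500 {
			return fmt.Errorf("server error: %d", resp.StatusCode) // retryable
		}
		if resp.StatusCode == http.StatusNotFound {
			permanent = fmt.Errorf("not found: %s", url)
			return nil // permanent, stop retrying
		}

		body, err = io.ReadAll(resp.Body)
		return err // read error, retryable
	})

	if err == nil {
		err = permanent // surface a permanent failure recorded by the closure
	}
	if err != nil {
		return nil, fmt.Errorf("fetch failed: %w", err)
	}
	return body, nil
}

The closure captures body and permanent from the outer scope. Go closures capture variables by reference, so the final successful response stays in body after the loop exits. RetryWithBackoff retries on every non-nil error, so the closure reports a permanent failure by recording it in permanent and returning nil, which stops the loop; FetchWithRetry then surfaces the recorded error. Any status the closure does not check, such as a 403, still falls through to the read and comes back as a normal body, so extend the checks to match the API you call. The outer RetryWithBackoff handles the timing. Error wrapping with %w preserves the error chain, so callers can still match the cause with errors.Is and errors.As.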

The if err != nil { return err } pattern looks verbose. The community accepts the boilerplate because it forces you to acknowledge the failure path. Silent error swallowing causes production outages. Let the unhappy path be visible.

Where things go wrong

Developers usually break retry logic in three ways. They forget to cap the delay. They retry on errors that will never succeed. They block the caller indefinitely.

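If you omit the maxDelay cap, the delay calculation eventually overflows. time.Duration is an int64 count of nanoseconds on every platform, and with a 100 ms base the multiplication wraps once i reaches 37. The compiler will not catch this because i is a runtime value. The result is a negative, zero, or otherwise meaningless duration. time.Sleep and time.After do not panic on a non-positive duration; they return immediately, so the loop quietly degrades into retrying with no wait at all. Always clamp the delay, and treat a non-positive result as a sign of overflow.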
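
A sketch of one overflow-safe way to compute the delay: double step by step and stop as soon as the next doubling would pass the cap. The function name backoffDelay is an assumption, and the values mirror the 100 ms base and 10 s cap used earlier.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base * 2^attempt, clamped to maxDelay.
// It never multiplies past maxDelay, so the int64 behind Duration cannot wrap.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	if base <= 0 {
		return 0
	}
	if base > maxDelay {
		return maxDelay
	}
	delay := base
	for i := 0; i < attempt; i++ {
		if delay > maxDelay/2 {
			return maxDelay // the next doubling would exceed the cap
		}
		delay *= 2
	}
	return delay
}

func main() {
	base, maxDelay := 100*time.Millisecond, 10*time.Second

	i := 37 // a runtime value, so the compiler cannot flag the overflow
	naive := time.Duration(1<<uint(i)) * base
	fmt.Println(naive < 0) // true: the multiplication wrapped around

	fmt.Println(backoffDelay(3, base, maxDelay))  // 800ms
	fmt.Println(backoffDelay(37, base, maxDelay)) // 10s
}
```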

If you retry on a 404 or an authentication failure, you waste CPU cycles and exhaust your rate limit budget. Check status codes or error types before sleeping. If you accidentally write if err instead of if err != nil, the compiler rejects it because an error value is not a boolean condition. Go requires explicit nil checks.

If you pass a background context and never cancel it, your retry loop will run until maxAttempts is exhausted. That is fine for bounded retries. If you switch to an infinite retry loop, you must provide a cancellation path. Goroutine leaks happen when a goroutine waits on a channel or a sleep that never completes. Always thread context.Context through long-running operations.

The worst retry bug is the one that silently succeeds after three minutes. Set a hard deadline on the context before calling the retry function. ctx, cancel := context.WithTimeout(parentCtx, 30*time.Second) guarantees the retry loop stops waiting once the deadline passes, even if the jitter math goes wrong. An attempt that is already in flight only stops early if the operation itself honors ctx. Call defer cancel() on the line after you create the context. The compiler will not remind you, although go vet's lostcancel check will. Until cancel runs or the deadline passes, the context keeps its timer alive.

Choosing the right strategy

Use a simple fixed-interval retry when you are polling a local file or a fast in-memory cache that occasionally locks. Use exponential backoff with jitter when you are calling external APIs, databases, or message queues that experience temporary load spikes. Use a circuit breaker when the downstream service is completely down and you need to fail fast across your entire fleet. Use plain sequential code when the operation is idempotent, fast, and the cost of a retry outweighs the benefit of automatic recovery.

Retry logic is defensive programming. Build it once, test it against a flaky mock, and let the network do what it does.
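
A flaky mock can be as small as a counter. The sketch below assumes a hypothetical flakyOp test double and a delay-free retryN loop standing in for the helper under test; in a real test you would pass the mock's Do method to your own retry function.

```go
package main

import (
	"errors"
	"fmt"
)

// flakyOp fails the first failUntil calls and succeeds afterwards.
// It is a hypothetical test double, not a real client.
type flakyOp struct {
	failUntil int
	calls     int
}

func (f *flakyOp) Do() error {
	f.calls++
	if f.calls <= f.failUntil {
		return errors.New("temporary failure")
	}
	return nil
}

// retryN is a delay-free stand-in for the retry helper under test.
func retryN(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	recovers := &flakyOp{failUntil: 2}
	err := retryN(5, recovers.Do)
	fmt.Println(err, recovers.calls) // <nil> 3

	neverRecovers := &flakyOp{failUntil: 10}
	err = retryN(5, neverRecovers.Do)
	fmt.Println(err, neverRecovers.calls) // temporary failure 5
}
```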

Where to go next