How to Implement Delayed and Retry Logic for Jobs in Go

Implement job retries in Go with a manual loop, exponential backoff, and time.Sleep, because the standard library does not include a retry helper.

The retry loop is just a loop

You call an external API. It returns a 502 Bad Gateway. Your code crashes. The user sees a red error screen. You know the API is usually fine. The glitch was a network hiccup or a temporary overload. You want to try again. Then wait a moment. Try again. Wait longer. Give up only after you are sure the service is down.

Go does not provide a retry function in the standard library. You write the loop yourself. This is intentional. A retry strategy depends on your specific needs. Some jobs need aggressive retries. Some need long pauses. Some must stop immediately if the user closes the browser. The standard library gives you time and context. You compose them into the logic you need.

Retry logic is a loop with a pause. The pause grows longer each time to avoid overwhelming the failing service. This is exponential backoff. You also add a random sprinkle of time to each pause. This is jitter. Jitter prevents multiple services from hammering the server at the exact same moment.
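To make the growth concrete, here is a minimal sketch that computes the wait before each retry, without jitter. The name backoffSchedule and its parameters are illustrative assumptions, not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the un-jittered delay before each retry:
// base, 2*base, 4*base, and so on. Hypothetical helper for illustration.
func backoffSchedule(base time.Duration, retries int) []time.Duration {
	var delays []time.Duration
	delay := base
	for i := 0; i < retries; i++ {
		delays = append(delays, delay)
		delay *= 2 // Each wait is twice as long as the one before it
	}
	return delays
}

func main() {
	fmt.Println(backoffSchedule(100*time.Millisecond, 4)) // [100ms 200ms 400ms 800ms]
}
```

Four retries starting at 100ms add up to 1.5 seconds of waiting in total, which is a quick way to sanity-check whether a retry budget fits inside a request timeout.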

Minimal retry with exponential backoff

Start with the simplest form. A function that runs a job, checks for an error, sleeps, doubles the sleep duration, and repeats.

Here is the basic structure. It takes a job function, a retry count, and a starting delay.

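import (
    "fmt"
    "time"
)

// RunWithRetry calls job up to maxRetries times; maxRetries should be at least 1.
func RunWithRetry(job func() error, maxRetries int, initialDelay time.Duration) error {
    delay := initialDelay
    var err error
    for i := 0; i < maxRetries; i++ {
        if err = job(); err == nil {
            return nil // Success, exit immediately
        }
        if i < maxRetries-1 {
            time.Sleep(delay) // Block the goroutine for the current delay
            delay *= 2        // Double the delay for the next attempt
        }
    }
    // Keep the last error in the chain so callers can inspect it
    return fmt.Errorf("job failed after %d attempts: %w", maxRetries, err)
}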

The loop runs at most maxRetries times, so despite its name maxRetries counts total attempts, including the first. If job() returns nil, the function returns immediately. If it returns an error, the goroutine sleeps, the delay doubles, and the next attempt waits twice as long as the previous one.

This approach works for simple scripts. It is easy to read. It has no dependencies. It also has limitations. The goroutine blocks during time.Sleep. You cannot cancel the retry if the user navigates away. It has no jitter, so many clients retrying on the same schedule stay synchronized.

Adding context and jitter

Real services need to be cancellable. If a request is cancelled, the retry loop should stop. You also need jitter to avoid the thundering herd problem. If ten services all fail at the same time and all retry after exactly two seconds, they all hit the server at once. That spike can crash a recovering service. Jitter randomizes the wait time so retries arrive at different moments.

Use context.Context for cancellation. Use time.NewTimer instead of time.Sleep so you can stop the timer if the context is done. Use math/rand to generate jitter.

Here is a retry function closer to production use. It respects context cancellation and adds jitter.

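import (
    "context"
    "fmt"
    "math/rand"
    "time"
)

func RunWithRetry(ctx context.Context, job func() error, maxRetries int, initialDelay time.Duration) error {
    delay := initialDelay
    var err error
    for i := 0; i < maxRetries; i++ {
        if err = job(); err == nil {
            return nil
        }
        if i == maxRetries-1 {
            break // Out of attempts, so skip the final wait
        }
        // Add random jitter in [0, delay) to spread out retries across a window.
        // rand.Int63n panics if its argument is <= 0, so guard against that.
        var jitter time.Duration
        if delay > 0 {
            jitter = time.Duration(rand.Int63n(int64(delay)))
        }
        timer := time.NewTimer(delay + jitter)
        select {
        case <-ctx.Done():
            timer.Stop() // Clean up timer if context is cancelled
            return ctx.Err()
        case <-timer.C:
            // Timer expired, proceed to next iteration
        }
        delay *= 2
    }
    // Wrap the last error so callers can inspect the cause
    return fmt.Errorf("job failed after %d attempts: %w", maxRetries, err)
}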

The function signature follows Go convention: context.Context is the first parameter, named ctx. The loop runs the job. If it fails, the function calculates jitter: a random duration from zero up to, but not including, the current delay. The total wait therefore falls between delay and twice delay, which keeps the retry within a predictable window while breaking synchronization.

The select statement waits on two channels. ctx.Done() signals cancellation. timer.C signals that the wait time has passed. If the context is cancelled, the timer is stopped so its resources are released promptly, and the context error is returned. If the timer fires, the loop continues and the delay doubles.
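The wait step can also be looked at in isolation. The sketch below pulls the timer and select into a small helper and runs it under a context with a deadline. The names sleepCtx and waitWithDeadline are hypothetical, chosen only for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx waits for d or until ctx is done, whichever comes first.
// Hypothetical helper name, not a standard library function.
func sleepCtx(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop() // Safe to call even after the timer has fired
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

// waitWithDeadline runs sleepCtx under a context that expires after timeout.
func waitWithDeadline(timeout, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return sleepCtx(ctx, d)
}

func main() {
	// The context expires first, so the wait is cut short.
	fmt.Println(waitWithDeadline(10*time.Millisecond, time.Second)) // context deadline exceeded
	// The timer fires first, so the wait completes normally.
	fmt.Println(waitWithDeadline(time.Second, 10*time.Millisecond)) // <nil>
}
```

Keeping the wait in one helper means the retry loop itself only has to check a returned error, which some readers may find easier to follow than an inline select.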

Error wrapping belongs in the final return statement. Keep the last error returned by job() and wrap it, for example with fmt.Errorf("job failed after %d attempts: %w", maxRetries, err), to preserve the chain. Callers can then use errors.Is or errors.As to find the underlying cause and decide how to react.
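As a small illustration of why the %w verb matters, the sketch below wraps an error with an attempt number and then recovers the original cause. The sentinel errUnavailable and the helpers wrapAttempt and isUnavailable are hypothetical names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical sentinel standing in for a real failure.
var errUnavailable = errors.New("service unavailable")

// wrapAttempt annotates err with the attempt number and keeps the chain intact.
func wrapAttempt(attempt int, err error) error {
	return fmt.Errorf("retry %d: %w", attempt, err)
}

// isUnavailable reports whether errUnavailable appears anywhere in err's chain.
func isUnavailable(err error) bool {
	return errors.Is(err, errUnavailable)
}

func main() {
	err := wrapAttempt(3, errUnavailable)
	fmt.Println(err)                // retry 3: service unavailable
	fmt.Println(isUnavailable(err)) // true
}
```

Had the message been built with %v instead of %w, the text would look the same but errors.Is would report false, because the original error would no longer be part of the chain.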

How the runtime handles the wait

When you call time.Sleep, the Go runtime puts the goroutine to sleep. It does not consume CPU cycles. The scheduler moves other goroutines forward. When the sleep duration expires, the runtime wakes the goroutine and puts it back in the run queue.

time.NewTimer works differently. It creates a timer object and a channel. The runtime tracks the timer in the background. When the duration expires, the timer sends a value on the channel. The select statement receives that value. If you call timer.Stop() before the timer fires, the timer is cancelled and nothing is sent on the channel. This is crucial for cancellation. If you use time.Sleep, you cannot interrupt the sleep. The goroutine waits until the duration ends. time.NewTimer gives you a handle to stop the wait early.

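Jitter uses rand.Int63n. This function returns a pseudo-random number from zero up to, but not including, the argument. Converting the delay to int64 and back to time.Duration is lossless because time.Duration is itself an int64 count of nanoseconds. The real hazards are elsewhere: rand.Int63n panics if its argument is zero or negative, and a delay that keeps doubling can eventually overflow, so validate the initial delay and cap the maximum. The jitter ensures that even if two services start a retry at the same time with the same delay, they will likely wait for different amounts of time.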
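One way to keep the jitter and doubling steps safe is to put each in a small helper. This is a sketch under assumptions: jitteredDelay, nextDelay, and the one-second maxDelay cap are illustrative choices, and delays are assumed to be non-negative.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredDelay returns delay plus a random amount in [0, delay).
// rand.Int63n panics when its argument is <= 0, so that case is handled first.
func jitteredDelay(delay time.Duration) time.Duration {
	if delay <= 0 {
		return 0
	}
	return delay + time.Duration(rand.Int63n(int64(delay)))
}

// nextDelay doubles delay but never exceeds maxDelay, which also keeps
// repeated doubling from overflowing. maxDelay is an assumed cap.
func nextDelay(delay, maxDelay time.Duration) time.Duration {
	if delay > maxDelay/2 {
		return maxDelay
	}
	return delay * 2
}

func main() {
	delay := 100 * time.Millisecond
	maxDelay := time.Second // Assumed cap for this sketch
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Printf("attempt %d: wait about %v\n", attempt, jitteredDelay(delay))
		delay = nextDelay(delay, maxDelay)
	}
}
```

The printed waits vary from run to run because of the jitter, but the base delay stops growing once it reaches the cap.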

Pitfalls and compiler errors

Retry logic introduces subtle bugs. The most common is ignoring context. If you use time.Sleep, the retry continues even if the parent context is cancelled. The goroutine hangs until the sleep finishes. This wastes resources and delays shutdown.

Another pitfall is infinite retries. Always cap maxRetries. A loop without a bound will retry forever if the service is down. The compiler cannot enforce this. You must write the bound. If you omit the loop condition, Go accepts the code and compiles it as an infinite loop. If you use a variable for the bound and forget to declare it, you get undefined: maxRetries. If you declare it but never assign a value, it defaults to zero and the job never runs.

Error handling is verbose by design. The community accepts if err != nil boilerplate because it makes the unhappy path visible. In a retry loop, you must decide which errors are retryable. A 500 Internal Server Error is usually retryable. A 404 Not Found is not. Retrying a 404 wastes time and might trigger rate limits. Check the error type before sleeping.
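For HTTP calls, that check can be as small as a switch on the status code. The sketch below is one possible classification, not a rule: the helper name isRetryableStatus and the exact set of codes are assumptions, and a real service may need to treat some codes differently.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableStatus reports whether a status code suggests a transient,
// server-side failure. The chosen set is an assumption for illustration.
func isRetryableStatus(code int) bool {
	switch code {
	case http.StatusInternalServerError, // 500
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	}
	return false // Client errors such as 400 and 404 will not improve on retry
}

func main() {
	for _, code := range []int{500, 502, 404, 400} {
		fmt.Printf("%d retryable: %v\n", code, isRetryableStatus(code))
	}
}
```

A retry loop would call a check like this after each failed attempt and return immediately when it reports false, instead of sleeping and trying again.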

Timers do not leak goroutines, but before Go 1.23 a timer that was never stopped stayed in memory until it fired. In the select example, timer.Stop() is called on cancellation. On the normal path the timer has already fired, so the garbage collector reclaims it once the loop moves on. This is safe. Goroutine leaks are a separate risk: if you spawn a goroutine for each retry, you must ensure it exits when the context is done. The worst goroutine bug is the one that never logs.

Decision: when to use each approach

Use time.Sleep in a simple loop when you are writing a quick script or a one-off job that does not need cancellation. Use time.NewTimer with select when the job must respect context cancellation or deadlines. Use jitter when multiple instances of your service might retry simultaneously to prevent thundering herd. Use a dedicated retry library when you need complex policies like circuit breakers or adaptive backoff based on error rates. Use no retry when the operation is not idempotent and a repeated call could cause harm, when immediate failure is acceptable, or when the error is client-side like a 400 Bad Request.

Where to go next