How to Use Connection Retry and Circuit Breaker for Databases in Go

Go's standard library does not include built-in retry or circuit breaker logic for database calls; you must implement these patterns yourself or use a third-party library. For a basic retry mechanism, wrap your database call in a loop with exponential backoff to handle transient failures.

When the database blinks

Your database server restarts for a split second. Your Go app doesn't notice the restart, but the next query fails. So does the one after that, and the one after that. Without a plan, your app starts spawning goroutines to retry, or it crashes, or it floods the database with connection attempts the moment it comes back up. The database is still initializing, the flood overwhelms it, and the whole system collapses. You need two tools to survive the chaos: retry logic for temporary hiccups and a circuit breaker to stop the bleeding when things are truly broken.

Retry and circuit breakers in plain words

Retry is knocking on a door. You knock. No answer. You wait a second, knock again. You wait two seconds, knock again. You wait four seconds. Eventually, you stop knocking and go do something else. This handles temporary noise: a network blip, a lock contention, a brief server restart. The key is waiting longer between attempts so you don't hammer the door while it's being unlocked.

Circuit breaker is the safety switch in your electrical panel. If a short circuit happens, the breaker trips. Power stops flowing. This prevents the wires from melting. In software, if the database is down, retrying thousands of times per second does nothing but waste CPU and memory. The circuit breaker trips, rejects requests immediately, and gives the database a chance to recover. After a timeout, it lets one request through to test the connection. If that works, the breaker resets. If it fails, the breaker trips again.

The retry loop

Here's the simplest retry loop. It wraps a function call, sleeps between attempts, and stops after a limit.

import (
    "context"
    "fmt"
    "time"
)

// RetryWithBackoff calls fn until it succeeds or maxRetries attempts have failed.
// It uses exponential backoff to avoid hammering the target.
func RetryWithBackoff(ctx context.Context, maxRetries int, fn func() error) error {
    var err error
    for i := 0; i < maxRetries; i++ {
        err = fn()
        if err == nil {
            return nil // Success, return immediately.
        }
        if i == maxRetries-1 {
            break // Last attempt failed; don't sleep before giving up.
        }
        // Calculate delay: 1s, 2s, 4s, 8s...
        delay := time.Duration(1<<uint(i)) * time.Second
        select {
        case <-ctx.Done():
            return ctx.Err() // Respect cancellation.
        case <-time.After(delay):
            // Backoff elapsed; try again.
        }
    }
    return fmt.Errorf("exhausted %d retries: %w", maxRetries, err)
}

A common mistake is assuming sql.Open connects to the database. It does not. sql.Open may only validate its arguments and set up a connection pool. The pool starts empty. The first query or Ping triggers the actual connection. If you want to verify connectivity at startup, call PingContext. If the ping fails, the *sql.DB handle is still valid, so you retry the ping; there is no need to call sql.Open again. The context.Context parameter goes first and is named ctx. This is a convention, not something the compiler enforces, and it lets every function in the chain respect deadlines and cancellation.

Retry is for transient errors. Don't retry logic errors.

How the retry loop works

The loop runs at most maxRetries times. Inside, fn() executes the risky operation. If it returns nil, we're done. If it errors, we calculate a delay. The expression 1<<uint(i) is a bit shift that doubles the value each iteration: 1, 2, 4, 8. This is exponential backoff. It eases the pressure on a server that is trying to recover. On its own it does not prevent a thundering herd, because clients that failed at the same moment also retry at the same moment; adding a random jitter to each delay spreads them out.

The select statement waits for either the delay to expire or the context to cancel. If the caller cancels the operation, the retry stops immediately. This keeps your app responsive. Without the select, the goroutine would sleep through cancellation and hold resources longer than necessary. The time.After call returns a channel that receives a value once the delay has passed, which is what lets the select listen on it. Before Go 1.23, the timer behind time.After was not reclaimed until it fired, even if the select had already returned through ctx.Done().

Idempotency matters

Retry logic assumes the operation is idempotent. An idempotent operation produces the same result no matter how many times you call it. A SELECT query is idempotent. A DELETE is usually idempotent. An INSERT is not. If the first insert succeeds but the network drops the response, a retry creates a duplicate row. You need database constraints or application-level checks to handle this.

If a query violates a unique constraint, retrying won't help. The second attempt fails for the same reason. PostgreSQL, for example, reports "duplicate key value violates unique constraint" every time, and your retry loop just burns CPU and time. Only retry on transient errors like connection resets, timeouts, or lock wait timeouts. Check the error type before deciding to retry.

The circuit breaker

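A circuit breaker needs state. It tracks failures and decides whether to allow requests. Here's an implementation that keeps the state in an int32 read through sync/atomic, so the common check is cheap, and guards the failure counter and timestamp with a mutex.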

// CircuitBreaker protects a downstream service from cascading failures.
// It tracks failures and trips to Open state when threshold is reached.
type CircuitBreaker struct {
    state       int32 // 0: Closed, 1: Open, 2: HalfOpen
    failures    int32
    threshold   int32
    resetTimer  time.Duration
    lastFailure time.Time
    mu          sync.Mutex
}

const (
    StateClosed  = int32(iota)
    StateOpen
    StateHalfOpen
)

The struct uses int32 for the state because atomic operations require specific pointer types. The receiver name is cb, matching the type abbreviation. This is the standard Go convention for receiver naming.

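// ErrOpenCircuit is returned when the breaker rejects a call without running it.
// (errors.New comes from the standard errors package.)
var ErrOpenCircuit = errors.New("circuit breaker is open")

// Execute runs fn only if the circuit allows it.
// It returns ErrOpenCircuit if the breaker is tripped.
func (cb *CircuitBreaker) Execute(fn func() error) error {
    switch atomic.LoadInt32(&cb.state) {
    case StateOpen:
        // lastFailure is written under cb.mu, so read it under cb.mu too.
        cb.mu.Lock()
        waited := time.Since(cb.lastFailure)
        cb.mu.Unlock()
        // Reject unless the reset period has passed and this goroutine wins the swap.
        if waited <= cb.resetTimer ||
            !atomic.CompareAndSwapInt32(&cb.state, StateOpen, StateHalfOpen) {
            return ErrOpenCircuit
        }
    case StateHalfOpen:
        return ErrOpenCircuit // A test request is already in flight.
    }
    err := fn()
    cb.recordResult(err == nil)
    return err
}

The Execute method reads the state atomically. While the breaker is Open, it rejects calls until resetTimer has elapsed since the last failure. After that, CompareAndSwapInt32 lets exactly one goroutine move the state to HalfOpen and run fn as a test request. Goroutines that lose the swap, or that arrive while the state is HalfOpen, get ErrOpenCircuit, so recovery is tested with a single request instead of a flood. The read of lastFailure takes the mutex because recordResult writes that field under the same lock; reading it unguarded would be a data race.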

// recordResult updates the failure count and state based on success or failure.
func (cb *CircuitBreaker) recordResult(success bool) {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if success {
        cb.failures = 0
        atomic.StoreInt32(&cb.state, StateClosed)
        return
    }
    cb.failures++
    cb.lastFailure = time.Now()
    if cb.failures >= cb.threshold {
        atomic.StoreInt32(&cb.state, StateOpen)
    }
}

The recordResult method uses a mutex to protect the failure counter. Success resets the counter and closes the circuit. Failure increments the counter and trips the breaker if the threshold is reached. The mutex is only held during the update, keeping contention low.

Circuit breakers save your app from suicide. Trip early, reset cautiously.

Pitfalls and errors

If you forget to check ctx.Done() in your retry loop, the goroutine keeps retrying and sleeping even after the client disconnects. This is effectively a goroutine leak. The loop still ends once the retries are exhausted, but with exponential backoff that can take a long time, and until then the goroutine and everything it references stay alive. The compiler won't catch this; it's a logic error that only shows up at runtime. The worst goroutine bug is the one that never logs.

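If you pass a pointer of the wrong type to atomic.LoadInt32, for example a *int, the compiler rejects it with an error along the lines of "cannot use &state (value of type *int) as *int32 value in argument to atomic.LoadInt32". The functions in sync/atomic are defined for fixed-size types, so declare the field as int32 or int64, not int. Since Go 1.19 the package also offers typed values such as atomic.Int32, which avoid the pointer handling altogether.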

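If your circuit breaker threshold is too low, you trip on normal load spikes. If it's too high, you flood a failing service. Tune the threshold based on your error budget. A threshold of 5 failures is a common starting point. Note that the breaker above counts consecutive failures, because any success resets the counter. Counting failures over a sliding time window is an alternative that also catches a service that fails intermittently.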

In production, you combine these patterns. The circuit breaker sits outside. If the breaker is closed, you allow the request to proceed to the retry logic. If the retry logic exhausts all attempts, that counts as a failure for the circuit breaker. This gives you resilience against transient errors while protecting against systemic outages. The retry handles the noise; the breaker handles the fire.

Decision: retry, breaker, or both

Use a simple retry loop when the failure is likely transient and the operation is idempotent, like reading a configuration or querying a cache. Use exponential backoff when retrying too quickly could overwhelm the recovering service or trigger rate limits. Use a circuit breaker when the downstream service is unstable and you need to fail fast to protect your application's resources. Use a half-open state in the circuit breaker to allow testing recovery without risking a full flood of requests. Use plain sequential code without retry when the operation is critical and immediate feedback is required, such as a user login attempt where retries confuse the user experience. Use the standard library database/sql connection pool for most workloads; it discards broken connections and opens new ones on demand, so application-level retry matters most for startup checks, batch operations, and failures the pool cannot hide, such as an error in the middle of a transaction.

Retry handles noise. Circuit breakers handle disasters. Combine them wisely.

Where to go next