How to Implement a Token Bucket Rate Limiter in Go

When your code is too enthusiastic

You wrote a scraper that fetches a thousand pages. It works perfectly on your machine. You deploy it to production, and five minutes later the target server returns 429 Too Many Requests. Your code is correct, but it is too enthusiastic. You need a mechanism to smooth out the bursts and enforce a steady pace without rewriting the core logic.

Rate limiting is not just about politeness. It is a survival mechanism for distributed systems. If your service hammers a database, a third-party API, or a message queue, you risk exhausting connection pools, triggering circuit breakers, or getting your IP banned. A rate limiter acts as a shock absorber. It absorbs sudden spikes in traffic and releases them at a controlled rate that the downstream system can handle.

The token bucket model

The token bucket algorithm models rate limiting as a physical container. Tokens drip into the bucket at a fixed rate. Each request costs one token. If the bucket has tokens, the request proceeds and a token is removed. If the bucket is empty, the request must wait for a token to arrive or get rejected immediately. The bucket size determines the burst capacity. A large bucket allows a sudden flood of requests as long as tokens are available, then forces a slowdown once the tokens run dry. A small bucket enforces a stricter, steadier flow. The refill rate determines the long-term average throughput.

This model differs from a leaky bucket, which enforces a constant output rate regardless of input bursts. A token bucket is more flexible because it allows bursts up to the bucket size, which matches how many real-world systems behave. They can handle a sudden load as long as the average rate stays within limits. Tokens are currency. Spend them wisely.

Minimal example: Allow and Wait

Here is the simplest rate limiter: create a limiter, check if a token is available, and proceed. The golang.org/x/time/rate package provides the Limiter type, which implements this algorithm.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// rate.Limit(2) sets the refill rate to 2 tokens per second
	// 4 is the burst capacity, allowing up to 4 tokens to accumulate
	limiter := rate.NewLimiter(rate.Limit(2), 4)

	for i := 0; i < 5; i++ {
		// Allow reports whether a token is available and consumes it if so; it never blocks
		if !limiter.Allow() {
			fmt.Println("Rate limit exceeded, waiting...")
			// Wait blocks until a token is available or the context is cancelled
			if err := limiter.Wait(context.Background()); err != nil {
				fmt.Println("wait failed:", err)
				return
			}
		}
		fmt.Printf("Request %d allowed at %v\n", i+1, time.Now())
	}
}

The rate.NewLimiter function takes two arguments. The first is a rate.Limit, which defines how many tokens are added per second. The second is the burst size, which defines the maximum number of tokens the bucket can hold. The limiter starts full, so in this example the first four requests are allowed immediately and the fifth has to wait about half a second for a new token.

Install the package with go get golang.org/x/time/rate. It lives in the golang.org/x extension repositories maintained by the Go team and is widely used in production systems. Wait blocks. Allow skips. Choose based on your latency budget.

How the limiter tracks state

The Limiter struct tracks the current token count and the time of the last update. When you call Allow, it refreshes the count and checks it. If at least one token exists, it decrements the count and returns true. If not, it returns false without blocking. The Wait method is different. It calculates how long until a token arrives, blocks for that duration, and then returns nil. It respects context cancellation, so a cancelled request will not hang forever: if the context is cancelled while waiting, Wait returns the context error. Wait also returns an error without waiting when the context deadline would expire before a token becomes available.

The limiter refills tokens lazily. It does not run a background goroutine to add tokens every millisecond. Instead, it calculates the number of tokens that should have been added based on the elapsed time since the last operation. This makes the limiter efficient and free of background overhead. The math is straightforward: multiply the elapsed time by the refill rate, add the result to the current token count, and cap it at the burst size. This lazy evaluation means the limiter uses zero CPU cycles when idle.

The Limiter is safe for concurrent use. Multiple goroutines can call Allow or Wait on the same instance without adding your own mutex, because the internal state is protected by a lock inside the limiter. SetLimit and SetBurst take the same lock, so they are also safe to call concurrently. Changing limits while requests are in flight makes behavior harder to reason about, though, so prefer to set your limits once during initialization and let the limiter do its job.

Realistic example: A rate-limited client

In a real service, you wrap the limiter in a struct and pass it through your request pipeline. This keeps the rate limiting logic separate from the business logic. You can inject the limiter into your HTTP client, database connector, or worker pool.

package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// Client pairs a rate limiter with the request logic.
// A real client would also hold an *http.Client; it is omitted here for brevity.
type Client struct {
	limiter *rate.Limiter
}

// NewClient creates a client with the specified rate and burst
// r is the refill rate in tokens per second; b is the burst capacity
func NewClient(r rate.Limit, b int) *Client {
	return &Client{
		limiter: rate.NewLimiter(r, b),
	}
}

// DoRequest performs a request after waiting for a token
// ctx is the first parameter by convention
func (c *Client) DoRequest(ctx context.Context, url string) error {
	// limiter.Wait respects context cancellation to prevent goroutine leaks
	if err := c.limiter.Wait(ctx); err != nil {
		return fmt.Errorf("rate limit wait failed: %w", err)
	}

	// Simulate the actual request
	fmt.Printf("Fetching %s at %v\n", url, time.Now())
	return nil
}

func main() {
	// 5 requests per second, burst of 10
	client := NewClient(rate.Limit(5), 10)

	ctx := context.Background()
	for i := 0; i < 15; i++ {
		if err := client.DoRequest(ctx, fmt.Sprintf("https://api.example.com/%d", i)); err != nil {
			fmt.Println(err)
		}
	}
}

The Client struct holds the limiter as a field. The DoRequest method calls limiter.Wait before proceeding, so no request is sent until a token is available. The context is passed through: if the caller cancels it, the wait is interrupted and the error is returned. With a rate of 5 and a burst of 10, the first 10 requests in main go out immediately and the remaining 5 follow at roughly 200 ms intervals.

The receiver name is c, a single letter matching the type Client, and ctx is the first parameter. Both follow Go convention. Error handling uses fmt.Errorf with %w to wrap the error, which preserves the error chain so callers can still match it with errors.Is. The community accepts the verbose if err != nil pattern because it makes the unhappy path visible.

Advanced control: Reserve and N-variants

The Allow and Wait methods handle single tokens. Sometimes you need more control. The Reserve method returns a Reservation struct that tells you when a token will be available without blocking. This is useful when you want to batch requests or schedule work in advance.

// Reserve returns a reservation for one token, without blocking
r := limiter.Reserve()
if !r.OK() {
	// The limiter can never grant this token (for example, burst is 0)
	return
}

// Delay returns how long to wait until the token is available
delay := r.Delay()

// You can proceed immediately if delay is zero
if delay > 0 {
	time.Sleep(delay)
}

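The Reserve method is non-blocking. It claims a token, calculates the delay until that token becomes available, and returns both in the Reservation. You can use this delay to schedule the request in a timer or to batch multiple requests together. Check OK before using the reservation: it returns false when the limiter can never grant the request, for example when the requested number of tokens exceeds the burst size. If you call Cancel on the reservation, the limiter reverses its effect as far as possible, so the token goes back to the bucket for other requests to use. This is valuable when you discover mid-flow that a request is no longer needed. You give the token back instead of wasting it.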

The N variants (AllowN, WaitN, ReserveN) handle multiple tokens at once. This is useful when a single operation consumes multiple resources. For example, a request might need two tokens: one for the API call and one for the database write.

// AllowN reports whether 2 tokens are available at the given time
// and consumes them if so; it takes a time.Time, not a context
if limiter.AllowN(time.Now(), 2) {
	// Proceed with the operation
}

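The N variants behave like the single-token methods but consume or wait for n tokens, and the delay is the time needed to accumulate n tokens. Their signatures differ slightly: AllowN and ReserveN take a time.Time and a count, while WaitN takes a context and a count. A request for more tokens than the burst size can never succeed: AllowN returns false, WaitN returns an error, and the reservation from ReserveN reports OK as false. Reserve gives you visibility. Wait gives you simplicity. Pick the tool that matches your latency requirements.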

Pitfalls and runtime traps

A common mistake is treating the burst size as the rate. If you set a rate of 10 per second but a burst of 100, the first 100 requests fly through instantly. The downstream service sees a spike, not a steady stream. Size your burst based on the downstream service's capacity, not your desire for speed.

Another trap is calling Wait with a context that has no deadline, such as context.Background(). If the limiter is misconfigured or the system is under heavy load, your goroutine can block far longer than you intended. The compiler will not stop you, but your process will appear to hang. Always pass a context with a deadline or a cancellation path. If the context is cancelled, Wait returns the context error, allowing you to handle the failure gracefully.

NewLimiter does not validate its arguments and does not panic, so a misconfigured limiter fails quietly: requests are rejected or delayed instead of the program crashing at startup. Check your rate and burst values yourself before constructing the limiter. If you need to disable rate limiting, set the limit to rate.Inf, the infinite limit, which allows all events even if the burst is zero. Goroutine leaks happen when a goroutine waits on a channel that never gets closed, and the same principle applies to rate limiters: a Wait whose context never expires blocks for as long as the limiter makes it wait. Always have a cancellation path. A burst is a spike, not a steady state. Size your bucket for reality, not hope.

Decision matrix

Use a token bucket limiter when you need to smooth out traffic with a specific average rate and controlled burst capacity. Use a channel-based semaphore when you need to bound the number of concurrent operations without caring about the time-based refill rate. Use a simple sleep between requests when the workload is trivial and precise rate control is not required. Use a dedicated rate-limiting middleware when you need to enforce limits across multiple routes or services centrally. Rate limiting is a contract with your downstream service. Honor it.

Where to go next

A token bucket rate limiter controls how often an action can happen, such as limiting API calls. Picture a bucket that fills with tokens at a steady rate: you can only take one out if the bucket is not empty. This keeps a downstream system from being overwhelmed by too many requests at once. From here, try changing the rate and burst in the examples above, then experiment with Reserve and the N variants to see how each affects pacing.