The traffic spike that breaks staging
Your API handles test traffic perfectly. You deploy to production and a single client decides to hammer your login endpoint with ten thousand requests a second. The database connection pool exhausts itself. Response times climb into the seconds. Legitimate users receive timeouts. You need a guard that sits in front of your business logic and says no before the system chokes. Rate limiting is that guard. It caps how often an operation can run, protecting your infrastructure from abuse and accidental traffic spikes.
How the token bucket actually works
Go handles this with the token bucket algorithm, packaged in golang.org/x/time/rate. Imagine a bucket sitting under a tap that drips at a steady rate. Once the bucket is full, extra drops spill over and are lost. You can only draw water out if the bucket has some. If someone grabs a cup quickly, they drain what is available. If they try again immediately, they get nothing until the drip refills the bucket. The bucket size is your burst capacity. The drip speed is your sustained rate.
This model matches how real systems behave. Networks and disks handle short bursts easily, but they degrade under constant pressure. The token bucket lets you define exactly how much burst is acceptable and how fast the system recovers. It also avoids the edge cases of fixed-window counters, which often allow double the intended rate right at the boundary between two time windows. The bucket smooths out traffic while still enforcing a hard ceiling over time.
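The boundary problem is easy to reproduce with simulated timestamps. The sketch below uses a deliberately naive counter (fixedWindow and acceptedAroundBoundary are hypothetical names made up for this illustration, not part of any library) with a limit of 100 events per one-second window, and sends traffic on both sides of a window boundary.

```go
package main

import (
	"fmt"
	"time"
)

// fixedWindow is a hypothetical, deliberately naive counter that allows
// at most limit events per window and resets at each window boundary.
type fixedWindow struct {
	limit  int
	window time.Duration
	start  time.Time // start of the current window
	count  int       // events accepted in the current window
}

// allow reports whether an event at time now fits in its window.
func (f *fixedWindow) allow(now time.Time) bool {
	windowStart := now.Truncate(f.window)
	if !windowStart.Equal(f.start) {
		// A new window began, so the counter resets to zero.
		f.start = windowStart
		f.count = 0
	}
	if f.count >= f.limit {
		return false
	}
	f.count++
	return true
}

// acceptedAroundBoundary sends limit events 10 ms before a window boundary
// and limit more 10 ms after it, then returns how many were accepted.
func acceptedAroundBoundary(limit int) int {
	fw := &fixedWindow{limit: limit, window: time.Second}
	boundary := time.Date(2024, 1, 1, 0, 0, 1, 0, time.UTC)
	accepted := 0
	for i := 0; i < limit; i++ {
		if fw.allow(boundary.Add(-10 * time.Millisecond)) {
			accepted++
		}
	}
	for i := 0; i < limit; i++ {
		if fw.allow(boundary.Add(10 * time.Millisecond)) {
			accepted++
		}
	}
	return accepted
}

func main() {
	// Only 20 ms of simulated time pass, yet twice the limit gets through.
	fmt.Println(acceptedAroundBoundary(100)) // 200
}
```

A token bucket with the same sustained rate and a burst of 100 would admit roughly 100 of those 200 events, because the second batch arrives before the bucket has had time to refill.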
A minimal limiter in action
Here is the simplest way to create a limiter and test it. We define a steady rate of two requests per second and allow a burst of three.
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Two tokens per second, burst capacity of three
	limiter := rate.NewLimiter(rate.Every(500*time.Millisecond), 3)

	// Check if we can proceed right now
	if limiter.Allow() {
		fmt.Println("First request: allowed")
	}
	if limiter.Allow() {
		fmt.Println("Second request: allowed")
	}
	if limiter.Allow() {
		fmt.Println("Third request: allowed")
	}

	// Bucket is empty. This returns false immediately.
	if !limiter.Allow() {
		fmt.Println("Fourth request: rejected")
	}

	// Wait two seconds for the bucket to refill
	time.Sleep(2 * time.Second)
	if limiter.Allow() {
		fmt.Println("After wait: allowed again")
	}
}
The rate.NewLimiter call sets two parameters. The first controls how fast tokens arrive. rate.Every(500*time.Millisecond) means one token appears every half second, which equals two tokens per second. The second parameter sets the maximum bucket size. A burst of three means the limiter starts full and can approve three rapid requests before hitting zero.
Allow() is non-blocking. It checks the internal counter. If a token exists, it subtracts one and returns true. If the counter is zero, it returns false immediately. The caller decides what to do with a rejection. Usually that means returning a 429 Too Many Requests status code.
What happens under the hood
The limiter tracks time internally. It does not spawn background goroutines to refill tokens. Every call to Allow() or Wait() advances the internal clock and calculates how many tokens should have accumulated since the last check. This lazy evaluation keeps the package extremely lightweight. You pay for rate limiting only when you actually check it.
The internal state boils down to three values: the time of the last update, the current number of tokens, and the maximum burst size. When you call Allow(), the method computes the elapsed time, multiplies it by the refill rate, and adds the result to the current token count, capping the total at the burst limit. Tokens accumulate fractionally, so the count is not always a whole number. If at least one full token is available, the method subtracts one, stores the new state, and returns true. Otherwise it returns false without consuming anything.
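The refill arithmetic can be sketched in a few lines of standard-library Go. This is a simplified, single-goroutine illustration under assumed names (bucket, newBucket, allow, and demo are all hypothetical), not the actual x/time/rate source, and it takes the current time as a parameter so the behavior is reproducible.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a simplified illustration of lazy refill. It has no locking
// and is not the x/time/rate implementation.
type bucket struct {
	perSecond float64   // refill rate in tokens per second
	burst     float64   // maximum number of tokens the bucket can hold
	tokens    float64   // tokens available as of last
	last      time.Time // time at which tokens was last computed
}

// newBucket returns a bucket that starts full.
func newBucket(perSecond float64, burst int, now time.Time) *bucket {
	return &bucket{perSecond: perSecond, burst: float64(burst), tokens: float64(burst), last: now}
}

// allow credits tokens for the elapsed time, then tries to take one.
func (b *bucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.perSecond
		if b.tokens > b.burst {
			b.tokens = b.burst // never hold more than the burst size
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demo replays four instant requests, then two more half a second later,
// against a bucket with two tokens per second and a burst of three.
func demo() []bool {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	b := newBucket(2, 3, start)
	var results []bool
	for i := 0; i < 4; i++ {
		results = append(results, b.allow(start))
	}
	later := start.Add(500 * time.Millisecond) // exactly one token accrues
	results = append(results, b.allow(later), b.allow(later))
	return results
}

func main() {
	fmt.Println(demo()) // [true true true false true false]
}
```

No timer or goroutine is involved. The bucket only does work when allow is called, which is the property that keeps this style of limiter cheap.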
The limiter is also safe for concurrent use. Multiple goroutines can call Allow() on the same instance without external synchronization. The package guards its state with an internal mutex, so you do not need to wrap limiter calls in one of your own.
Per-client middleware for real APIs
Real applications rarely rate limit globally. You usually want to track limits per client, per API key, or per IP address. That requires a map of limiters. Maps are not safe for concurrent access, so you need a mutex. Here is a middleware pattern that wraps an HTTP handler.
package main

import (
	"net"
	"net/http"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// ClientLimiter tracks rate limiters per IP address
type ClientLimiter struct {
	mu      sync.Mutex
	clients map[string]*rate.Limiter
}

// NewClientLimiter creates a tracker for per-client limits
func NewClientLimiter() *ClientLimiter {
	return &ClientLimiter{
		clients: make(map[string]*rate.Limiter),
	}
}

// GetLimiter returns a limiter for the given key, creating one if needed
func (cl *ClientLimiter) GetLimiter(key string) *rate.Limiter {
	cl.mu.Lock()
	defer cl.mu.Unlock()

	if limiter, exists := cl.clients[key]; exists {
		return limiter
	}

	// Create a fresh limiter for unknown clients:
	// five requests per second with a burst of ten
	limiter := rate.NewLimiter(rate.Every(200*time.Millisecond), 10)
	cl.clients[key] = limiter
	return limiter
}

The middleware itself attaches to the HTTP chain. It extracts the client identifier, grabs the limiter, and checks availability.

func (cl *ClientLimiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// RemoteAddr has the form "IP:port". Strip the port so that every
		// connection from the same address shares one limiter.
		key, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			// Fall back to the raw value if it cannot be parsed
			key = r.RemoteAddr
		}

		limiter := cl.GetLimiter(key)
		if !limiter.Allow() {
			http.Error(w, "Rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
This structure keeps the map access synchronized. The mutex protects the map itself, not the limiter state. The rate.Limiter type is already safe for concurrent use, so multiple goroutines can call Allow() on the same limiter without extra locking. Go convention dictates that receiver names match the type abbreviation, which is why the methods use cl instead of this or self. It keeps the code scannable and consistent with the standard library.
Where implementations fail
The biggest trap in per-client rate limiting is memory growth. Every new IP address adds an entry to the map. If you never remove old entries, the map grows until the process runs out of memory. You need a cleanup routine that runs periodically and deletes limiters that have not been touched recently. Store the last access time alongside each limiter, or use a cache library that handles eviction automatically. The worst goroutine bug is the one that never logs. The worst memory leak is the one that only appears after a month of steady traffic.
Another common mistake is calling Wait() with a context that never expires. If you switch from Allow() to Wait() to pause requests instead of rejecting them, every call takes a context.Context, and that context is the only thing that bounds the wait. With context.Background(), a sudden traffic spike can leave goroutines piled up for as long as it takes their tokens to arrive. The compiler will not stop you from calling Wait(context.Background()), but your runtime will suffer. Always pass Wait() a context that carries a deadline or cancellation signal. Context is plumbing. Run it through every long-lived call site.
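If you pass a zero or negative burst, or a zero limit, the package does not panic or return an error. rate.NewLimiter accepts the values and hands back a limiter that quietly misbehaves. A burst below one rejects every call unless the limit is rate.Inf. A zero limit never refills the bucket. rate.Every with a zero or negative interval returns rate.Inf, which switches limiting off entirely. Neither the compiler nor the runtime flags any of this, because the values are dynamic and technically valid. Validate configuration before creating limiters. Mistakes the compiler can see are easier to deal with. If you forget to import the rate package, the compiler rejects the program with undefined: rate. If you import it but never use it, you get imported and not used. Go errors are verbose by design. Read them. They tell you exactly what went wrong.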
Error handling follows the standard pattern: check immediately, return or log. Allow() reports a rejection as a boolean, but Wait() returns an error when the context is canceled or its deadline cannot be met. If you wrap that error, use fmt.Errorf("rate limit exceeded: %w", err) so callers can unwrap it later. The community accepts the if err != nil boilerplate because it makes the unhappy path visible. Do not hide it behind silent returns or panic calls.
Pick the right control method
Rate limiting looks simple until you pick the wrong tool for the traffic pattern. Match the method to the requirement.
Use Allow() when you want immediate rejection and low latency. It returns a boolean without blocking the calling goroutine. This fits HTTP endpoints where you prefer to fail fast and let the client retry later.
Use Wait() when dropping requests is unacceptable and you can afford to pause execution. It blocks until a token arrives or the context cancels. This works for background job processors or queue consumers that must eventually process every item.
Use Reserve() when you need to know exactly how long to wait before proceeding. It returns a reservation object with a Delay() method. This fits scenarios where you want to schedule work ahead of time or calculate retry headers for API responses.
Use an external store like Redis when your application runs behind a load balancer or across multiple machines. In-memory limiters only track traffic for the single process that holds them. A distributed counter ensures the limit applies globally.
Use no limiter at all when traffic is small and trusted. Rate limiting adds complexity and latency. Only implement it when you have measured traffic that threatens your system's stability.
Where to go next
Rate limiting protects your application, but it does not replace authentication or security scanning. Build a complete defense by combining traffic controls with identity verification and dependency auditing.