The dynamic phonebook problem
You build a service called billing. It needs to talk to inventory. In development, you hardcode inventory:8080. It works. You deploy to production. You scale inventory to three instances behind a load balancer. billing now talks to the load balancer. That works too.
Then you move to a bare-metal cluster where load balancers are scarce. You hardcode the IP of one inventory instance. That instance crashes. The IP is reassigned to a new instance that hasn't started yet. billing sends requests to a black hole. Or worse, billing talks to the old instance, which is now a zombie holding stale connections.
Hardcoded addresses break the moment your infrastructure changes. You need a mechanism that answers "Where is inventory right now?" and updates the answer when instances join, leave, or fail. That mechanism is service discovery.
Think of service discovery as a dynamic whiteboard in a busy kitchen. Cooks write their station and status when they arrive. If a cook leaves, their name gets erased. Other cooks look at the board to find who can handle an order. The board is the registry. Writing the name is registration. Looking at the board is discovery.
How discovery works
Service discovery relies on three components. The registry stores the mapping of service names to network addresses. Services register themselves with the registry when they start and deregister when they stop. Clients query the registry to find available instances.
The registry can be a DNS server, a dedicated tool like Consul or etcd, or the Kubernetes API. The client can be your Go code, or a sidecar proxy that intercepts traffic. In Go, you often implement client-side discovery: your code fetches the list of peers and picks one to talk to.
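The three roles can be sketched with a toy in-memory registry. This is only an illustration: the Registry type and its Register, Deregister, and Lookup methods are hypothetical names, not the API of Consul, etcd, or Kubernetes, and a real registry would also handle health checks and persistence.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Registry is a hypothetical in-memory stand-in for a real registry.
// It maps service names to the addresses of their live instances.
type Registry struct {
	mu        sync.RWMutex
	instances map[string]map[string]struct{}
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{instances: make(map[string]map[string]struct{})}
}

// Register records addr as a live instance of service.
func (r *Registry) Register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.instances[service] == nil {
		r.instances[service] = make(map[string]struct{})
	}
	r.instances[service][addr] = struct{}{}
}

// Deregister removes addr from the instances of service.
func (r *Registry) Deregister(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.instances[service], addr)
}

// Lookup returns the known addresses for service in sorted order.
func (r *Registry) Lookup(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	addrs := make([]string, 0, len(r.instances[service]))
	for addr := range r.instances[service] {
		addrs = append(addrs, addr)
	}
	sort.Strings(addrs)
	return addrs
}

func main() {
	reg := NewRegistry()

	// Registration: instances announce themselves on startup.
	reg.Register("inventory", "10.0.0.1:8080")
	reg.Register("inventory", "10.0.0.2:8080")

	// Discovery: a client asks where inventory is right now.
	fmt.Println(reg.Lookup("inventory")) // [10.0.0.1:8080 10.0.0.2:8080]

	// Deregistration: an instance leaves and the answer changes.
	reg.Deregister("inventory", "10.0.0.1:8080")
	fmt.Println(reg.Lookup("inventory")) // [10.0.0.2:8080]
}
```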
DNS-based discovery
DNS is the oldest form of service discovery. It requires no external dependencies. You update DNS records when services change, and clients resolve the hostname to get IPs. Go's standard library handles DNS resolution via the net package.
Here's the simplest DNS lookup: resolve a hostname and print the IPs.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve IPs for a service name.
	// Go does not cache the answer; any caching happens in the OS or upstream resolvers.
	ips, err := net.LookupIP("inventory.service.local")
	if err != nil {
		// Handle resolution failure gracefully.
		fmt.Println("Failed to find inventory:", err)
		return
	}
	// Print the discovered addresses.
	fmt.Println("Inventory instances:", ips)
}
DNS is the path of least resistance. Trust the OS, but watch the TTL.
The caching trap
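DNS caches aggressively. Go's resolver does not cache answers itself, but the OS stub resolver (for example systemd-resolved or nscd) and upstream recursive resolvers typically do. The cache lifetime is controlled by the Time-To-Live (TTL) on the DNS record. With a TTL of 60 seconds, a cache may keep serving the old answer for up to a full minute.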
When an instance crashes, DNS might still point to the dead IP for the duration of the TTL. Your Go client will keep retrying a dead address until the cache expires. This is acceptable for slow-changing infrastructure. It is fatal for high-availability systems where seconds matter.
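You can take more control with a custom resolver. Setting PreferGo: true on net.Resolver tells Go to use its built-in DNS client instead of the platform resolver (libc via cgo on Unix-like systems). The Go resolver reads /etc/resolv.conf and queries the listed nameservers directly, without caching, and a custom Dial function lets you choose which server it contacts. If resolv.conf points at a local caching stub, that cache still applies. Bypassing the platform resolver can also skip OS-level configuration, such as NSS modules or per-domain resolver rules, that some split-horizon DNS setups rely on. Use it only when you understand the trade-offs.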
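Here is a minimal sketch of such a resolver. The helper name newResolver and the server address 10.0.0.53:53 are placeholders for illustration; point it at a DNS server you actually control.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newResolver returns a resolver that uses Go's built-in DNS client and
// sends every query to dnsAddr (host:port) instead of the servers listed
// in the system configuration.
func newResolver(dnsAddr string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, dnsAddr)
		},
	}
}

func main() {
	// Placeholder address: replace with a DNS server you control.
	r := newResolver("10.0.0.53:53")

	// Bound the lookup so a dead DNS server cannot stall the caller.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	addrs, err := r.LookupIPAddr(ctx, "inventory.service.local")
	if err != nil {
		fmt.Println("Failed to find inventory:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println("Inventory instance:", a.IP)
	}
}
```

The resolver itself still does not cache. What changes is which server answers, so freshness depends on that server and on the record's TTL.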
Client-side discovery pattern
Production systems often use a dedicated registry like Consul, etcd, or a custom HTTP endpoint. The pattern is consistent: poll the registry, update local state, pick a peer. The client maintains a list of healthy addresses and refreshes it periodically.
This approach decouples the request path from the discovery path. You don't want to query the registry for every request. That adds latency and hammers the registry. You fetch the list in the background and use the local list for routing.
Here's a ServiceClient that maintains a peer list and refreshes it from a registry URL.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

// ServiceClient maintains a list of healthy peers and picks one at random per request.
// It polls a registry endpoint to update the list periodically.
type ServiceClient struct {
	registryURL string
	peers       []string
	mu          sync.RWMutex // guards peers
	client      *http.Client
}

// NewServiceClient creates a client that fetches peers from the registry.
func NewServiceClient(registryURL string) *ServiceClient {
	return &ServiceClient{
		registryURL: registryURL,
		client:      &http.Client{Timeout: 2 * time.Second},
	}
}
The methods below use the receiver name c, short for ServiceClient. Go convention favors short receiver names that hint at the type. Avoid this or self.
The Refresh method fetches the latest peer list. It uses context to support cancellation. Functions that take a context put it first. The parameter is named ctx by convention.
// Refresh fetches the latest peer list from the registry.
// It swaps in the new list under a write lock so readers never see a partial update.
func (c *ServiceClient) Refresh(ctx context.Context) error {
	// Build request with context to support cancellation.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.registryURL, nil)
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	// Execute the request against the registry.
	resp, err := c.client.Do(req)
	if err != nil {
		return fmt.Errorf("registry fetch: %w", err)
	}
	defer resp.Body.Close()
	// Keep the old list if the registry did not answer successfully.
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("registry fetch: unexpected status %s", resp.Status)
	}
	// Simulate parsing peers. Replace with JSON decode in production.
	newPeers := []string{"10.0.0.1:8080", "10.0.0.2:8080"}
	// Swap the peer list under a lock to prevent race conditions.
	c.mu.Lock()
	c.peers = newPeers
	c.mu.Unlock()
	return nil
}
Error handling is verbose. if err != nil is the standard pattern. It forces you to acknowledge the failure path. The community accepts the boilerplate because it makes the unhappy path visible.
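The simulated parsing step in Refresh could be replaced with something like the sketch below. It assumes the registry answers with a JSON array whose elements carry Address, ServiceAddress, and ServicePort fields, following Consul's catalog API; the catalogEntry type and parsePeers function are hypothetical names, and the field names need adjusting for any other registry.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// catalogEntry holds the fields this sketch reads from each element of the
// registry response. The field names are assumed to follow Consul's
// catalog API; adjust them for your registry.
type catalogEntry struct {
	Address        string // node address, used as a fallback
	ServiceAddress string // service-specific address, may be empty
	ServicePort    int
}

// parsePeers turns a JSON array of catalog entries into host:port strings.
func parsePeers(data []byte) ([]string, error) {
	var entries []catalogEntry
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, fmt.Errorf("decode registry response: %w", err)
	}
	peers := make([]string, 0, len(entries))
	for _, e := range entries {
		host := e.ServiceAddress
		if host == "" {
			host = e.Address
		}
		if host == "" || e.ServicePort == 0 {
			continue // skip entries without a usable address
		}
		peers = append(peers, net.JoinHostPort(host, strconv.Itoa(e.ServicePort)))
	}
	return peers, nil
}

func main() {
	// Hand-written sample payload, not captured from a real registry.
	body := []byte(`[
		{"Address": "10.0.0.1", "ServiceAddress": "", "ServicePort": 8080},
		{"Address": "10.0.0.9", "ServiceAddress": "10.0.0.2", "ServicePort": 8080}
	]`)
	peers, err := parsePeers(body)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(peers) // [10.0.0.1:8080 10.0.0.2:8080]
}
```

Inside Refresh, the bytes would come from reading resp.Body, for example with io.ReadAll. Returning an error on bad JSON lets the caller keep the previous peer list instead of replacing it with an empty one.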
The Pick method selects a peer from the local list. It uses a read lock to allow concurrent reads while the background goroutine updates the list.
// Pick returns a random peer from the current list.
// It returns an error if no peers are available.
func (c *ServiceClient) Pick() (string, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if len(c.peers) == 0 {
		return "", fmt.Errorf("no peers available")
	}
	// Select a random peer for load distribution.
	return c.peers[rand.Intn(len(c.peers))], nil
}
The main function starts a background goroutine to refresh the list. It uses a ticker and a context to manage the loop. The goroutine must stop when the context is cancelled to avoid leaks.
func main() {
	client := NewServiceClient("http://consul:8500/v1/catalog/service/inventory")
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Refresh the peer list in the background until the context is cancelled.
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := client.Refresh(ctx); err != nil {
					fmt.Println("Refresh failed:", err)
				}
			}
		}
	}()

	// Populate the list once up front so the first Pick has peers to choose from.
	if err := client.Refresh(ctx); err != nil {
		fmt.Println("Initial refresh failed:", err)
		return
	}
	peer, err := client.Pick()
	if err != nil {
		fmt.Println("Pick failed:", err)
		return
	}
	fmt.Println("Routing to:", peer)
}
Refresh in the background. Pick in the foreground. Never block the request on a registry poll.
Pitfalls and runtime failures
Service discovery introduces concurrency and network dependencies. Bugs here are subtle and painful.
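Goroutine leaks happen when the refresh loop doesn't exit. If you forget to select on ctx.Done(), the goroutine runs until the process exits. In a long-running service that creates clients repeatedly, leaked loops pile up, and any shutdown code that waits for the goroutine hangs. Always have a cancellation path. The compiler catches simpler slips: forget to import context and it rejects the program with undefined: context; import a package without using it and you get imported and not used; declare a variable without using it and you get declared and not used.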
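One way to make the cancellation path explicit is to have shutdown wait for the loop to finish. The sketch below is illustrative only: refreshLoop and runAndStop are hypothetical helpers, and the millisecond durations exist just to keep the demo short.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Durations for the demo only; a real refresh interval would be much longer.
const (
	demoInterval = 10 * time.Millisecond
	demoRun      = 105 * time.Millisecond
)

// refreshLoop calls refresh on every tick and returns once ctx is cancelled.
func refreshLoop(ctx context.Context, interval time.Duration, refresh func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // the cancellation path that prevents the leak
		case <-ticker.C:
			if err := refresh(ctx); err != nil {
				fmt.Println("Refresh failed:", err)
			}
		}
	}
}

// runAndStop starts the loop, lets it run for d, cancels it, and waits for
// the goroutine to exit. It returns how many times refresh was called.
func runAndStop(d, interval time.Duration) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		refreshLoop(ctx, interval, func(context.Context) error {
			calls++ // only touched by the loop goroutine until wg.Wait returns
			return nil
		})
	}()

	time.Sleep(d)
	cancel()  // ask the loop to stop
	wg.Wait() // returns only because the loop selects on ctx.Done()
	return calls
}

func main() {
	n := runAndStop(demoRun, demoInterval)
	fmt.Println("loop exited cleanly after", n, "refreshes")
}
```

If the ctx.Done() case were removed, wg.Wait() would block forever, which is the hang described above.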
Thundering herd is a real risk. If you deploy 1000 instances of your service and they all refresh every 30 seconds, instances that started together (for example after a deploy) hit the registry at nearly the same moment on every cycle. The registry can buckle under those synchronized bursts. Add jitter to the refresh interval: wait 30s plus a random 0 to 5s so requests spread out over time.
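A jittered interval takes only a few lines. This is a minimal sketch: jitteredInterval, baseInterval, and maxJitter are hypothetical names, and the 30s and 5s values are the ones used in the text.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseInterval = 30 * time.Second
	maxJitter    = 5 * time.Second
)

// jitteredInterval returns base plus a random extra delay in [0, maxJitter).
func jitteredInterval(base, maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return base // rand.Int63n panics on non-positive arguments
	}
	return base + time.Duration(rand.Int63n(int64(maxJitter)))
}

func main() {
	// Each call yields a different wait, so instances drift apart over time.
	for i := 0; i < 3; i++ {
		fmt.Println("next refresh in", jitteredInterval(baseInterval, maxJitter))
	}
}
```

A time.Ticker fires at a fixed period, so to apply a fresh jitter on every cycle the refresh loop would use a time.Timer and call Reset with a new jitteredInterval value after each refresh.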
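DNS caching can mask failures. If you use net.LookupIP, caches outside your program (the OS resolver and upstream DNS servers) decide how fresh the answer is. You might not see new instances until the TTL expires. If you need faster updates, lower the TTL, query a server you control through a custom resolver, or use a dedicated registry.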
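The compiler helps with type safety. If you pass a context.Context where a string is expected, the build fails with an error along the lines of cannot use ctx (variable of type context.Context) as string value in argument. Capturing a loop variable in a goroutine is a different case. Before Go 1.22 every iteration shared one variable, and it was go vet, not the compiler, that reported loop variable i captured by func literal. In modules that declare Go 1.22 or later, each iteration gets its own variable. Catching these problems before the program runs saves you from subtle runtime bugs.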
Jitter saves registries. Context saves goroutines. Handle both.
Decision matrix
Service discovery is not one-size-fits-all. Pick the approach that matches your infrastructure and requirements.
Use DNS-based discovery when your infrastructure already supports dynamic DNS updates and you want zero external dependencies.
Use a dedicated registry like Consul or etcd when you need health checks, key-value storage, or fine-grained control over service metadata.
Use Kubernetes Endpoints when you are running inside a cluster and want to leverage the built-in Service abstraction and cluster DNS without managing a separate registry.
Use static configuration when your topology is fixed and changes require a full redeploy; the simplest solution is often the best.
Match the discovery tool to the infrastructure. Complexity is a tax; pay it only when you get value.