How to Implement Horizontal Pod Autoscaling for Go Services

Configure a HorizontalPodAutoscaler resource to automatically scale Go service replicas based on CPU utilization thresholds.

When traffic spikes and your service stalls

Your Go service is running on Kubernetes. Traffic is calm until a marketing email goes out. Requests pile up. The single pod's CPU hits 100%. Latency spikes. Users see 502 errors. You need more replicas, but you don't want to manually scale up and then forget to scale down when the traffic dies. Kubernetes can do this for you.

Horizontal Pod Autoscaling (HPA) watches your deployment and changes the number of running pods. It is not magic. It is a feedback loop. The HPA controller checks metrics, compares them to a target, and updates the deployment's replica count. Think of it like a restaurant manager. When the dining room fills up, the manager calls more servers from the back. When the crowd leaves, the manager sends servers home. The HPA is the manager. The pods are the servers. The metrics are the crowd size.

How the loop works

The HPA controller runs a loop every 15 seconds by default. It queries the metrics server for the current usage of the pods in your target deployment. It calculates the average utilization across all pods. If the average is above your target, the controller computes a new replica count and updates the deployment. The deployment creates new pods. They get scheduled. They become ready. Traffic flows to them. The load spreads. CPU per pod drops.

The math is straightforward. The controller uses this formula:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

If you have 2 pods and the average CPU is 140% of your target, the calculation yields ceil(2 * 1.4) = 3 pods. If the average drops to 35% of the target, it yields ceil(2 * 0.35) = 1 pod. The controller then clamps the result to the minReplicas and maxReplicas bounds, and it skips the change entirely when the ratio is within a small tolerance of 1.0 (10% by default). Scale-down is usually slower than scale-up to prevent flapping, where the system oscillates between scaling up and down due to transient spikes.
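The formula can be sketched in Go to check the arithmetic by hand. This is a simplified model, not the controller's code: desiredReplicas is a hypothetical helper, and it ignores the controller's tolerance band and pod readiness handling.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA formula and clamps the result.
// currentMetric and targetMetric must use the same unit, for example
// CPU utilization as a percentage of the request.
func desiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentMetric / targetMetric))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// 2 pods at 98% utilization against a 70% target (140% of target).
	fmt.Println(desiredReplicas(2, 98, 70, 1, 10))
	// 2 pods at 24.5% utilization (35% of target), held up by minReplicas of 2.
	fmt.Println(desiredReplicas(2, 24.5, 70, 2, 10))
}
```

The first call returns 3 and the second returns 2, because the raw result of 1 is raised to the minReplicas bound.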

HPA is a feedback loop. Feed it bad metrics and it makes bad decisions.

Minimal configuration

Here's the simplest HPA configuration. It targets a deployment, sets a range for replicas, and defines a CPU target.

# apiVersion v2 supports multiple metrics and custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: go-service-hpa
spec:
  # Point to the Deployment that manages the pods
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-service
  # Keep at least 2 pods for high availability
  minReplicas: 2
  # Cap at 10 to prevent runaway costs
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        # Scale when average CPU across pods exceeds 70% of the request
        type: Utilization
        averageUtilization: 70

Apply this with kubectl apply -f hpa.yaml. The HPA object appears in your namespace, and the controller evaluates it on its next sync. Run kubectl get hpa go-service-hpa to see the current and target utilization.

Requests are the anchor

HPA depends on resource requests. If your pod doesn't declare a CPU request, the HPA controller can't calculate utilization, because utilization is usage divided by the request. The HPA reports a warning and doesn't scale. Without a request, there is no percentage.

You must set requests in your deployment. The request tells Kubernetes how much CPU the pod needs to run normally. The limit is the hard cap. HPA uses the request for scaling calculations.

Here's a deployment with correct resource settings.

# Deployment defines the desired state for the pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: go-service
spec:
  # Initial replicas; HPA will override this value
  replicas: 2
  selector:
    matchLabels:
      app: go-service
  template:
    metadata:
      labels:
        app: go-service
    spec:
      containers:
      - name: api
        image: my-go-service:latest
        resources:
          requests:
            # HPA calculates utilization based on this value
            # 250m means 0.25 CPU cores
            cpu: "250m"
            memory: "128Mi"
          limits:
            # CPU above the limit is throttled, not killed
            cpu: "500m"
            # Memory above the limit gets the container OOMKilled
            memory: "256Mi"

Requests are the denominator. No requests, no math.
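To make the denominator concrete, here is a minimal sketch of how a utilization percentage falls out of usage and requests. The pod type, its fields, and averageUtilization are hypothetical names for illustration; the sketch assumes one container per pod and sums usage and requests across pods before dividing.

```go
package main

import (
	"errors"
	"fmt"
)

// pod holds hypothetical per-pod CPU numbers in millicores.
type pod struct {
	usageMilli   int64
	requestMilli int64
}

var errMissingRequest = errors.New("missing request for cpu")

// averageUtilization returns total usage as a percentage of total requests.
// It fails if any pod has no request, because there is nothing to divide by.
func averageUtilization(pods []pod) (int64, error) {
	if len(pods) == 0 {
		return 0, errors.New("no pods")
	}
	var usage, requests int64
	for _, p := range pods {
		if p.requestMilli == 0 {
			return 0, errMissingRequest
		}
		usage += p.usageMilli
		requests += p.requestMilli
	}
	return usage * 100 / requests, nil
}

func main() {
	// Two pods with 250m requests, using 200m and 150m.
	fmt.Println(averageUtilization([]pod{{200, 250}, {150, 250}}))
	// One pod without a request: no percentage can be computed.
	fmt.Println(averageUtilization([]pod{{200, 0}}))
}
```

With 250m requests, 350m of total usage across two pods is 70% utilization, which is exactly the target in the HPA above.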

Tuning scale-down behavior

Default HPA behavior can still be abrupt. The controller waits out a 5-minute stabilization window before scaling down, but after that it can remove every surplus pod in one step. If traffic spikes again shortly afterward, the system flaps. Pods churn. Connections drop. Users see errors.

You can tune the behavior using the behavior field. This lets you set a stabilization window and limit how fast the system scales down. A stabilization window forces the HPA to wait before acting on scale-down signals. This smooths out noise.

Here's a configuration that prevents flapping.

behavior:
  scaleDown:
    # Wait 5 minutes before scaling down to avoid flapping (300 is also the default)
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      # Remove at most 10% of pods per minute
      value: 10
      periodSeconds: 60

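Add this block under spec in your HPA YAML. The 300-second stabilization window matches the Kubernetes default, so setting it explicitly mainly documents the intent: the controller scales down only to the highest replica count it recommended during the last 5 minutes. The policy is the real change. It caps the reduction at 10% of pods per minute, where the default allows removing every surplus pod at once. This gives your service time to redistribute load and finish in-flight requests.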

Tune the window based on your traffic patterns. Short windows react faster but risk flapping. Long windows are stable but slow to release resources.
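The interaction between the window and the rate limit can be sketched as a small model. This is an illustration under assumptions, not the controller's implementation: scaleDownTarget is a hypothetical helper, the recommendations slice stands in for the replica counts computed during the window, and the rounding (at least one pod may be removed per period) is a simplification.

```go
package main

import "fmt"

// scaleDownTarget models a single scale-down decision.
// recommendations holds the replica counts computed during the
// stabilization window, including the most recent one.
func scaleDownTarget(current int, recommendations []int, maxPercent int) int {
	if len(recommendations) == 0 {
		return current
	}
	// Stabilization: act on the highest recommendation seen in the window,
	// so a brief dip cannot trigger a scale-down on its own.
	stabilized := recommendations[0]
	for _, r := range recommendations[1:] {
		if r > stabilized {
			stabilized = r
		}
	}
	if stabilized >= current {
		return current // nothing to scale down
	}
	// Rate limit: remove at most maxPercent of the current pods this period,
	// rounded up so that at least one pod can be removed.
	removable := (current*maxPercent + 99) / 100
	floor := current - removable
	if stabilized < floor {
		return floor
	}
	return stabilized
}

func main() {
	// 10 pods; the window saw recommendations of 8, 4 and 6.
	// Stabilization picks 8, and the 10% policy allows only one removal.
	fmt.Println(scaleDownTarget(10, []int{8, 4, 6}, 10))
	// A spike inside the window (12) blocks scale-down entirely.
	fmt.Println(scaleDownTarget(10, []int{4, 12, 5}, 10))
}
```

The first call returns 9 and the second returns 10. A longer window keeps more old recommendations in play, which is why long windows are stable but slow to release resources.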

Go code must handle shutdown

When Kubernetes scales down, it sends a SIGTERM signal to the container. The container has a grace period, 30 seconds by default (terminationGracePeriodSeconds), to exit cleanly. If your Go service ignores the signal, Kubernetes sends SIGKILL after the grace period. The process dies instantly. In-flight requests drop. Clients get errors.

Your Go code must catch SIGTERM and stop accepting new connections while waiting for active requests to finish. This is graceful shutdown. It protects users during scale-down.

Here's the server setup. It creates an HTTP server and starts listening.

package main

import (
	"errors"
	"log"
	"net/http"
)

// main starts the HTTP server; the shutdown logic is added at the end
func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
	}

	// Run server in background to allow signal handling in main
	go func() {
		// ErrServerClosed is the normal result of Shutdown, not a failure
		if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Signal handling and shutdown logic goes here.
	// Without it, main returns immediately and the process exits.
}

Here's the shutdown logic. It waits for the signal and stops the server gracefully.

// Add these to the import block of the server setup
import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// The statements below go inside main, after the goroutine that runs the server

// Context cancels when K8s sends SIGTERM during scale-down
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

// Block until termination signal arrives
<-ctx.Done()
log.Println("shutdown signal received")

// Stop accepting new connections and wait up to 10 seconds for active requests
shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
	// Log and return instead of log.Fatalf so the deferred calls still run
	log.Printf("shutdown error: %v", err)
}

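The signal.NotifyContext function creates a context that is canceled when the process receives one of the listed signals. The srv.Shutdown method closes the listener, closes idle connections, and waits for active requests to complete. If they don't finish within the 10-second timeout, Shutdown returns the context's error and the process exits with those requests unfinished. Keep the timeout shorter than the pod's grace period so the process exits on its own before Kubernetes sends SIGKILL. With this in place, scale-down no longer cuts off requests that finish within the timeout.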

Graceful shutdown protects users during scale-down. Handle SIGTERM or lose requests.
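To see the draining behavior without a cluster, the following sketch starts a server on a random local port, sends one slow request, and calls Shutdown while that request is still in flight. drainDemo, the /slow path, and the durations are hypothetical choices for illustration; only standard library calls are used.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// drainDemo calls Shutdown while a slow request is in flight and returns
// the status code the client received.
func drainDemo(handlerDelay, shutdownTimeout time.Duration) (int, error) {
	started := make(chan struct{})
	mux := http.NewServeMux()
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		close(started) // the single request is now in flight
		time.Sleep(handlerDelay)
		w.WriteHeader(http.StatusOK)
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln) // returns http.ErrServerClosed once Shutdown is called

	type result struct {
		status int
		err    error
	}
	done := make(chan result, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String() + "/slow")
		if err != nil {
			done <- result{err: err}
			return
		}
		resp.Body.Close()
		done <- result{status: resp.StatusCode}
	}()

	// Wait until the handler is running, or bail out if the request failed early.
	select {
	case <-started:
	case r := <-done:
		srv.Close()
		return r.status, r.err
	}

	ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return 0, fmt.Errorf("shutdown: %w", err)
	}
	r := <-done
	return r.status, r.err
}

func main() {
	status, err := drainDemo(100*time.Millisecond, 2*time.Second)
	fmt.Println(status, err)
}
```

The client still receives its response even though Shutdown was called mid-request, because Shutdown waits for active connections to go idle before returning.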

Pitfalls and errors

If you forget to set CPU requests, the HPA events show failed to get cpu utilization: missing request for cpu, and the ScalingActive condition is False. The controller can't compute a percentage without a denominator. Check your deployment resources first.

If the metrics server is not installed, the HPA can't get data. The status shows unable to fetch metrics from resource metrics API or a similar error, depending on the version. Install the metrics server in your cluster. Many managed Kubernetes services include it by default, but verify rather than assume: kubectl top pods returns data only when the metrics API is available.

If you use averageValue instead of averageUtilization, the target is an absolute value, not a percentage. averageUtilization: 70 means 70% of the request. averageValue: 250m means 250 millicores. Mixing these up leads to unexpected scaling. Use averageUtilization for CPU and memory unless you have a specific reason for absolute values.

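Go's scheduler does not always respect cgroup limits. Kubernetes enforces CPU limits via cgroups. Since Go 1.5, GOMAXPROCS defaults to the number of CPU cores the process can see, which in a container is usually the node's core count, not the pod's limit. Go 1.25 and later lower the default to match the cgroup CPU limit on Linux. On older versions, set GOMAXPROCS yourself or use a library such as go.uber.org/automaxprocs. Otherwise a pod with a 500m limit on a 32-core node runs with GOMAXPROCS=32 and can be throttled heavily.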
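Rather than assuming, it can help to log what the runtime actually picked at startup. A minimal sketch using only the standard library follows; cpuReport is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuReport returns what the Go runtime believes about available CPUs.
func cpuReport() (numCPU, gomaxprocs int, version string) {
	// Passing 0 reads the current GOMAXPROCS value without changing it.
	return runtime.NumCPU(), runtime.GOMAXPROCS(0), runtime.Version()
}

func main() {
	n, p, v := cpuReport()
	fmt.Printf("go=%s NumCPU=%d GOMAXPROCS=%d\n", v, n, p)
}
```

Run this inside the pod. If GOMAXPROCS is far above the pod's CPU limit, the runtime is sized for the node rather than the container.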

Metrics server is the foundation. Check it first.

When to use HPA

Use Horizontal Pod Autoscaling when traffic varies and you need more instances to handle the load. Use custom metrics for HPA when CPU usage doesn't reflect the actual bottleneck, such as a database connection pool or message queue lag. Use Vertical Pod Autoscaling when individual pods are resource-starved but adding more pods won't help the throughput. Use static replica counts when your workload is steady and predictable; autoscaling adds complexity that you don't need.

Where to go next