How to Implement the Saga Pattern in Go

Implement the Saga Pattern in Go by defining a coordinator that executes steps sequentially and triggers compensating actions on failure to ensure data consistency.

The order that never finishes

You're building an order flow for an e-commerce service. Step one creates the order in the database. Step two reserves inventory. Step three charges the credit card. Step three fails because the card is declined. Now you have a half-finished order sitting in the database and inventory locked up for nothing. You need to undo step two and step one.

Database transactions handle this inside a single service. You wrap everything in BEGIN and COMMIT. If anything fails, ROLLBACK restores the state. But when your services are distributed, you can't wrap everything in one ACID transaction. The inventory service has its own database. The payment service has its own database. You need a way to coordinate work across boundaries and undo it when things go wrong. That's what the saga pattern does.

What a saga actually does

A saga is a sequence of local transactions. Each step updates its own database and triggers the next step. If a step fails, the saga runs compensating actions to reverse the previous steps. The result is eventual consistency. The system reaches a valid state, but there's a window where data is inconsistent.

Think of a saga like a multi-stage rocket. Each stage fires and separates. If a stage fails to ignite, the flight termination system destroys the rocket to prevent a crash. The compensation is the destruction sequence. You reverse the progress to reach a safe state. The rocket doesn't magically un-burn the fuel from previous stages. It actively performs new operations to neutralize the damage.

Sagas come in two flavors. Choreography-based sagas have no central coordinator. Each service publishes an event, and other services react. Orchestration-based sagas use a central coordinator that calls each step and manages the state. The code below implements an orchestrator. It's easier to reason about complex workflows when one component drives the sequence.
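For contrast with the orchestrator, here is a rough sketch of the choreography style. It assumes a toy synchronous in-process event bus in place of a real message broker. The Bus type, the wire function, and the event names are all hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Event is a hypothetical domain event; Name identifies what happened.
type Event struct {
	Name    string
	OrderID string
}

// Bus is a toy synchronous event bus standing in for a real message broker.
type Bus struct {
	handlers map[string][]func(Event)
	Log      []string // Every published event name, in order.
}

func NewBus() *Bus { return &Bus{handlers: make(map[string][]func(Event))} }

// Subscribe registers a handler for one event name.
func (b *Bus) Subscribe(name string, h func(Event)) {
	b.handlers[name] = append(b.handlers[name], h)
}

// Publish records the event and calls every handler subscribed to it.
func (b *Bus) Publish(e Event) {
	b.Log = append(b.Log, e.Name)
	for _, h := range b.handlers[e.Name] {
		h(e)
	}
}

// wire connects three hypothetical services purely through events.
func wire(b *Bus, cardOK bool) {
	// Inventory service reacts to new orders.
	b.Subscribe("OrderCreated", func(e Event) {
		b.Publish(Event{Name: "StockReserved", OrderID: e.OrderID})
	})
	// Payment service reacts to reserved stock.
	b.Subscribe("StockReserved", func(e Event) {
		if cardOK {
			b.Publish(Event{Name: "PaymentCharged", OrderID: e.OrderID})
		} else {
			b.Publish(Event{Name: "PaymentFailed", OrderID: e.OrderID})
		}
	})
	// Compensations are also just reactions to events.
	b.Subscribe("PaymentFailed", func(e Event) {
		b.Publish(Event{Name: "StockReleased", OrderID: e.OrderID})
	})
	b.Subscribe("StockReleased", func(e Event) {
		b.Publish(Event{Name: "OrderCancelled", OrderID: e.OrderID})
	})
}

func main() {
	b := NewBus()
	wire(b, false) // Simulate a declined card.
	b.Publish(Event{Name: "OrderCreated", OrderID: "order-456"})
	fmt.Println(b.Log)
	// Prints: [OrderCreated StockReserved PaymentFailed StockReleased OrderCancelled]
}
```

No component in this sketch knows the whole sequence. The workflow only exists as the sum of the subscriptions, which is why it gets harder to follow as the number of steps grows.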

Minimal saga implementation

Here's the simplest saga structure. It defines a step with an action and a compensation. The saga executes steps forward and rolls back in reverse order if an error occurs.

package main

import (
	"fmt"
	"log"
)

// Step holds a forward operation and its undo operation.
type Step struct {
	Action       func() error // Performs the business logic for this step.
	Compensation func() error // Reverses the effect if a later step fails.
}

// Saga holds the ordered list of steps.
type Saga struct {
	Steps []Step
}

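// Execute runs each step forward. If a step fails, it rolls back the
// completed steps and returns the original error, so the caller always
// learns that the saga did not complete.
func (s *Saga) Execute() error {
	for i, step := range s.Steps {
		if err := step.Action(); err != nil {
			log.Printf("Step %d failed: %v", i, err)
			if rbErr := s.Rollback(i - 1); rbErr != nil {
				return fmt.Errorf("step %d failed: %w (rollback also failed: %v)", i, err, rbErr)
			}
			return fmt.Errorf("step %d failed: %w", i, err)
		}
	}
	return nil
}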

// Rollback runs compensations in reverse order from the given index.
func (s *Saga) Rollback(fromIndex int) error {
	for i := fromIndex; i >= 0; i-- {
		if err := s.Steps[i].Compensation(); err != nil {
			return fmt.Errorf("compensation failed at step %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	saga := &Saga{
		Steps: []Step{
			{
				Action:       func() error { fmt.Println("Create order"); return nil },
				Compensation: func() error { fmt.Println("Cancel order"); return nil },
			},
			{
				Action:       func() error { fmt.Println("Reserve stock"); return fmt.Errorf("out of stock") },
				Compensation: func() error { fmt.Println("Release stock"); return nil },
			},
			{
				Action:       func() error { fmt.Println("Charge card"); return nil },
				Compensation: func() error { fmt.Println("Refund card"); return nil },
			},
		},
	}

	if err := saga.Execute(); err != nil {
		log.Printf("Saga failed: %v", err)
	}
}

A saga is not a transaction. It's a recovery plan.

How the execution flow works

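The Execute method iterates through the steps in order. It calls Action. If Action returns an error, the loop stops immediately. The code calls Rollback with i - 1. This index points to the last successful step. When the very first step fails, i - 1 is -1 and the rollback loop does not run at all. The failed step itself is never compensated, so an Action must leave nothing behind when it returns an error.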

Rollback counts down from that index to zero. It calls Compensation for each step. If a compensation fails, the saga aborts and returns the error. The caller must handle this. In production, a failed compensation usually means you need to alert an operator. The system is in an inconsistent state that code cannot fix automatically.

Compensations must be idempotent. Running "Cancel Order" twice should be safe. If the saga retries a compensation due to a network glitch, it shouldn't double-refund a payment or delete data twice. Idempotency is the safety net that makes retries possible.

Realistic saga with context and state

Real sagas involve database calls, HTTP requests, and shared data. You need to pass context for cancellation and deadlines. You also need a way for steps to share state, like an order ID generated in step one that step two needs.

Here's a more realistic implementation. It uses a SagaState struct to pass data between steps. It accepts context.Context in every function. The receiver name is o for Orchestrator, following the convention of short names matching the type.

package main

import (
	"context"
	"errors"
	"fmt"
)

// SagaState holds shared data between steps.
type SagaState struct {
	OrderID string
	UserID  string
	// Add fields as needed for your domain.
}

// ServiceStep wraps a business operation with its compensation.
type ServiceStep struct {
	Name         string
	Action       func(ctx context.Context, state *SagaState) error
	Compensation func(ctx context.Context, state *SagaState) error
}

// Orchestrator manages the saga lifecycle.
type Orchestrator struct {
	Steps []ServiceStep
}

// Run executes steps and returns the index of the failed step for rollback.
func (o *Orchestrator) Run(ctx context.Context, state *SagaState) (int, error) {
	for i, step := range o.Steps {
		if err := step.Action(ctx, state); err != nil {
			// Wrap the error to identify the failing step.
			return i, fmt.Errorf("step %s failed: %w", step.Name, err)
		}
	}
	return -1, nil
}

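// Compensate reverses completed steps in reverse order.
// It keeps going after a failure and returns every compensation error
// joined together, so the caller can detect a partial rollback.
func (o *Orchestrator) Compensate(ctx context.Context, state *SagaState, failedIndex int) error {
	var errs []error
	for i := failedIndex - 1; i >= 0; i-- {
		step := o.Steps[i]
		if err := step.Compensation(ctx, state); err != nil {
			// Record the failure but continue trying the others.
			// In production, you might want to alert here.
			errs = append(errs, fmt.Errorf("compensation for %s failed: %w", step.Name, err))
		}
	}
	// errors.Join (Go 1.20+) returns nil when there are no errors.
	return errors.Join(errs...)
}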

func main() {
	ctx := context.Background()
	state := &SagaState{
		UserID: "user-123",
	}

	orch := &Orchestrator{
		Steps: []ServiceStep{
			{
				Name: "CreateOrder",
				Action: func(ctx context.Context, state *SagaState) error {
					// Simulate generating an order ID.
					state.OrderID = "order-456"
					return nil
				},
				Compensation: func(ctx context.Context, state *SagaState) error {
					// Simulate deleting the order.
					return nil
				},
			},
			{
				Name: "ReserveInventory",
				Action: func(ctx context.Context, state *SagaState) error {
					// Simulate a failure.
					return errors.New("inventory service timeout")
				},
				Compensation: func(ctx context.Context, state *SagaState) error {
					return nil
				},
			},
		},
	}

	failedIndex, err := orch.Run(ctx, state)
	if err != nil {
		fmt.Printf("Saga failed: %v\n", err)
		if compErr := orch.Compensate(ctx, state, failedIndex); compErr != nil {
			fmt.Printf("Compensation also failed: %v\n", compErr)
		}
	}
}

Pass context.Context as the first argument. Name it ctx. This lets the saga cancel long-running steps if a deadline is reached.

Pitfalls and failure modes

Sagas introduce complexity. You trade the simplicity of a single transaction for the ability to scale across services. That trade-off comes with risks.

Compensation failures. If a compensation fails, you're in a bad state. The saga can't just return nil. You need to log the error and alert an operator. The Compensate method above continues running other compensations even if one fails. This ensures you undo as much as possible. The caller checks the return value and handles the alerting.

Idempotency. Compensations often call external APIs. If the API times out, you might retry. If the API already processed the request, retrying causes a double refund. Every compensation needs an idempotency key. Generate a unique ID for the saga instance and pass it to every step. The step stores the ID with the operation. If the step retries, it checks the ID and skips the work if it's already done.
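The idempotency-key idea can be sketched as follows. RefundService is a hypothetical stand-in for a payment API, and the key format (saga ID plus step name) is an assumption. A real service would persist the keys in its own database rather than in an in-memory map.

```go
package main

import (
	"fmt"
	"sync"
)

// RefundService is a hypothetical payment client that remembers which
// idempotency keys it has already processed.
type RefundService struct {
	mu       sync.Mutex
	seen     map[string]bool
	Refunded int // Total amount refunded, in cents.
}

func NewRefundService() *RefundService {
	return &RefundService{seen: make(map[string]bool)}
}

// Refund applies the refund once per key. A repeated key is a no-op.
func (r *RefundService) Refund(key string, cents int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.seen[key] {
		return // Already processed: skip the work.
	}
	r.seen[key] = true
	r.Refunded += cents
}

func main() {
	svc := NewRefundService()
	sagaID := "saga-789" // Assumed to be generated once per saga instance.
	key := sagaID + ":RefundCard"

	svc.Refund(key, 2500)
	svc.Refund(key, 2500) // Retry after a timeout; must not refund twice.
	fmt.Println(svc.Refunded) // Prints: 2500
}
```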

Goroutine leaks. If you spawn goroutines for steps, ensure they finish. A goroutine that waits on a channel that never gets closed will leak memory. Always have a cancellation path. Pass the context to the goroutine and check for cancellation.

Compiler errors. The compiler catches signature mismatches early. If you define Action as func() error but the orchestrator expects func(context.Context, *SagaState) error, the compiler rejects the program with a message like "cannot use step.Action (variable of type func() error) as func(context.Context, *SagaState) error value". If you forget to import context, you get "undefined: context". If you import a package and never use it, you get "imported and not used". The compiler errors are plain text. Read them carefully. They tell you exactly what type mismatch occurred.

Error wrapping. Use %w in fmt.Errorf to wrap errors. This preserves the error chain. Callers can use errors.Is or errors.As to check for specific error types. If you use %v, you lose the chain and debugging becomes harder.

The worst saga bug is a silent compensation failure. Log everything.

When to use a saga

Sagas add complexity. Only reach for them when distributed state forces your hand.

Use a saga when you need to coordinate state changes across multiple services and cannot rely on a single database transaction.

Use a choreography-based saga when services are loosely coupled and each service can react to domain events without a central coordinator.

Use an orchestration-based saga when the workflow is complex, requires conditional branching, or needs a single source of truth for the process state.

Use a single database transaction when all data resides in one database and you need strict atomic guarantees.

Use the outbox pattern when you need to guarantee that database writes and event publishing happen atomically within a single service.

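Sagas trade immediate consistency for availability: between a failed step and its compensations, other readers can see the half-finished state. Accept the trade-off or choose a different pattern.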

Where to go next