How to Handle Partial Failures in Concurrent Go Code

The batch job that won't stop

You are processing a batch of user uploads. Five files arrive at once. You spin up a goroutine for each to resize and save. The third file is corrupted, and its resize fails. Do you discard the other four files that processed fine? Or do you let the corrupted file hang the whole request? Partial failures are the reality of concurrent systems. You need a way to collect results, report the bad ones, and keep the good ones moving.

Partial failure patterns

Partial failure means some work succeeds while other work fails. In sequential code, an error usually stops everything that follows it. In concurrent code, goroutines run independently: if one returns an error, the others keep going. An unrecovered panic is different. It terminates the whole process, not just the goroutine that panicked. You need a mechanism to gather outcomes. Channels are the standard tool. Workers send errors or results through a channel, and a coordinator goroutine reads them. The coordinator can cancel the rest, aggregate the errors, or just log the failure and continue.

context.Context is the kill switch. When a critical error happens, you cancel the context. Cancellation is cooperative: workers that check ctx.Done() or ctx.Err() notice it and exit, which prevents goroutine leaks. Whether you pull the switch depends on your failure policy. Do you stop immediately on the first error? Or do you finish all work and report a summary? The architecture changes based on that decision.
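
As a rough illustration of cooperative cancellation, the sketch below starts a few long-running workers and then cancels them all through one context. The names poll and stopAll are hypothetical, and the ticker stands in for whatever real work a worker would do.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// poll is a hypothetical worker. It wakes up on every tick and returns as
// soon as ctx is cancelled.
func poll(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // kill switch pulled: stop cleanly
		case <-ticker.C:
			// One unit of real work would go here.
		}
	}
}

// stopAll starts n workers, cancels the shared context after a short delay,
// and returns how many workers exited.
func stopAll(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	var exited atomic.Int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			poll(ctx, time.Millisecond)
			exited.Add(1)
		}()
	}

	time.Sleep(10 * time.Millisecond) // let the workers run briefly
	cancel()                          // one call stops every worker
	wg.Wait()
	return int(exited.Load())
}

func main() {
	fmt.Println("workers exited:", stopAll(3))
}
```

The important part is the select in poll: a worker that never looks at ctx.Done() will not notice the cancellation, no matter who calls cancel.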

Minimal pattern: collect and report

Here's the minimal pattern: spawn workers, collect results on a buffered channel, and close the channel after a WaitGroup signals completion.

package main

import (
	"context"
	"fmt"
	"sync"
)

// worker checks the context once, then sends its result on ch.
func worker(ctx context.Context, id int, wg *sync.WaitGroup, ch chan<- string) {
	defer wg.Done()
	// Exit if cancelled to prevent leaks.
	if ctx.Err() != nil {
		return
	}
	// Simulate failure for id 3.
	if id == 3 {
		ch <- fmt.Sprintf("fail: %d", id)
		return
	}
	ch <- fmt.Sprintf("ok: %d", id)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	ch := make(chan string, 5)

	for i := 1; i <= 5; i++ {
		wg.Add(1)
		go worker(ctx, i, &wg, ch)
	}

	// Close channel when all workers finish.
	go func() {
		wg.Wait()
		close(ch)
	}()

	for msg := range ch {
		fmt.Println(msg)
	}
}

Close the channel exactly once, after every sender is done. A deadlock at least fails loudly. A leak does not.

Walk through the mechanics

The program starts by creating a context with a cancel function. The cancel function stops all work when called. A WaitGroup tracks active goroutines. A buffered channel holds results. The buffer size matches the number of workers so sends never block. If the channel were unbuffered, each worker would block until main reads a message. That couples the workers to the coordinator and slows down failure handling. A buffer lets workers send and return immediately.

The loop spawns goroutines. Each worker checks the context. If cancelled, it returns immediately. Otherwise, it does work and sends a string. Worker 3 sends a failure message. A separate goroutine waits on the WaitGroup and closes the channel. This is crucial. If you don't close the channel, the range loop in main blocks forever. The closer goroutine bridges the workers and the receiver. It guarantees the channel closes only after all workers finish.

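The defer wg.Done() in the worker is essential. The worker has several return paths, and the defer guarantees the WaitGroup is decremented on every one of them. If any path skips Done, the counter never reaches zero, the closer never closes the channel, and the program deadlocks. The deferred call also runs while a panic unwinds, but that only helps if something recovers the panic: an unrecovered panic in any goroutine terminates the whole program. Always defer the Done call.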
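
A panic that nobody recovers ends the whole process, so if a task might panic (say, a decoder choking on a corrupted file), one option is to recover inside the worker and report the panic as an ordinary error. The sketch below assumes two hypothetical helpers, safely and runAll; neither is part of the program above.

```go
package main

import (
	"fmt"
	"sync"
)

// safely runs task and converts a panic into an ordinary error.
func safely(task func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	return task()
}

// runAll runs each task in its own goroutine and returns one error slot per
// task, in the same order as the input.
func runAll(tasks []func() error) []error {
	errs := make([]error, len(tasks))
	var wg sync.WaitGroup
	for i, task := range tasks {
		wg.Add(1)
		go func(i int, task func() error) {
			defer wg.Done()
			errs[i] = safely(task) // each goroutine writes only its own slot
		}(i, task)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := runAll([]func() error{
		func() error { return nil },
		func() error { panic("corrupted file") },
		func() error { return nil },
	})
	for i, err := range errs {
		fmt.Printf("task %d: %v\n", i, err)
	}
}
```

Recovering is a policy choice, not a default. It fits failures caused by bad input. A panic that signals a programming bug is usually better left to crash the process.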

Functions that take a context should always have it as the first parameter, named ctx. This convention makes it easy to spot and pass through call chains. The compiler won't enforce this, but the community expects it.

Realistic aggregation

In production, you often need to collect all errors and return them together. This function spawns a worker for each ID, sends errors to a channel, and aggregates them into a slice.

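// ProcessBatch runs doWork for every ID concurrently and returns all errors.
// It needs the context and sync packages; doWork is assumed to be defined elsewhere.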
func ProcessBatch(ctx context.Context, ids []int) []error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(ids))

	for _, id := range ids {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Capture error and send to channel.
			if err := doWork(ctx, id); err != nil {
				errCh <- err
			}
		}(id)
	}

	// Close channel when all workers finish.
	go func() {
		wg.Wait()
		close(errCh)
	}()

	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

Go makes error handling explicit. The if err != nil pattern is verbose, but it keeps the failure path visible at every call site. Nothing stops you from dropping an error, though, and in concurrent code an error swallowed inside a goroutine disappears silently. Always send errors back or log them. Go 1.20 introduced errors.Join, which combines multiple errors into one value that still works with errors.Is and errors.As. In some APIs, this is cleaner than returning a slice.
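
Here is a minimal sketch of the errors.Join variant, assuming Go 1.20 or later. The process function and the errCorrupt sentinel are hypothetical stand-ins for real work. Each goroutine writes to its own slice slot, so no channel is needed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errCorrupt is a hypothetical sentinel error for this example.
var errCorrupt = errors.New("corrupt input")

// process is a hypothetical task that fails for odd IDs.
func process(id int) error {
	if id%2 == 1 {
		return fmt.Errorf("id %d: %w", id, errCorrupt)
	}
	return nil
}

// processAll runs process for every ID concurrently and joins the failures
// into a single error. It returns nil when every task succeeds.
func processAll(ids []int) error {
	errs := make([]error, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			errs[i] = process(id) // each goroutine writes only its own slot
		}(i, id)
	}
	wg.Wait()
	return errors.Join(errs...) // nil entries are discarded
}

func main() {
	err := processAll([]int{1, 2, 3})
	fmt.Println(err)
	fmt.Println("is corrupt:", errors.Is(err, errCorrupt))
}
```

Because the joined error still matches errors.Is, the caller can test for a specific failure without unpacking the individual errors.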

Collect errors, don't panic. The caller decides how to fail.

Stop on first error with errgroup

Sometimes you want to stop immediately when one task fails. The errgroup package simplifies this pattern. It handles context cancellation and error propagation automatically.

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// ProcessFirstError stops on the first failure.
func ProcessFirstError(ctx context.Context, ids []int) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, id := range ids {
		id := id // Per-iteration copy; only needed before Go 1.22.
		g.Go(func() error {
			return doWork(ctx, id)
		})
	}
	return g.Wait()
}

errgroup.WithContext creates a group tied to a derived context. When any goroutine returns a non-nil error, the group cancels that context. The other goroutines stop early only if they check it, which is why doWork receives ctx. g.Wait blocks until every goroutine has returned, then returns the first non-nil error. This pattern is cleaner than manual channel management when you only care about the first failure.
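
To make the mechanics concrete, here is a simplified stand-in for the stop-on-first-error behavior that uses only the standard library. It is not errgroup's implementation, just a sketch of the idea built on sync.Once; firstError and demo are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// firstError runs every task with a shared context, cancels that context on
// the first failure, and returns that first error once all tasks are done.
func firstError(parent context.Context, tasks []func(context.Context) error) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	var (
		wg    sync.WaitGroup
		once  sync.Once
		first error
	)
	for _, task := range tasks {
		wg.Add(1)
		go func(task func(context.Context) error) {
			defer wg.Done()
			if err := task(ctx); err != nil {
				once.Do(func() {
					first = err // only the first failure is recorded
					cancel()    // tell the siblings to stop
				})
			}
		}(task)
	}
	wg.Wait()
	return first
}

// demo runs one failing task next to one that waits for cancellation.
func demo() error {
	return firstError(context.Background(), []func(context.Context) error{
		func(ctx context.Context) error { return errors.New("boom") },
		func(ctx context.Context) error {
			<-ctx.Done() // blocks until a sibling fails
			return ctx.Err()
		},
	})
}

func main() {
	fmt.Println(demo())
}
```

The second task returns context.Canceled, but the caller only sees the error that triggered the cancellation. In real code, errgroup is the better choice because it is already tested and widely understood.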

Pitfalls and runtime errors

If you forget to close the channel, the receiver blocks forever and the program hangs. If you send on a closed channel, the program panics with panic: send on closed channel. If you forget wg.Add, wg.Wait can return before the workers finish, so the closer goroutine closes the channel early and workers panic trying to send. The unmatched wg.Done calls can also panic with sync: negative WaitGroup counter.

Before Go 1.22, a for loop reused one variable across all iterations. If a closure captured that variable instead of receiving it as an argument, the goroutines shared it and typically all saw the final value, so every worker processed the last ID. The compiler accepted this code; go vet flagged it with loop variable i captured by func literal. Go 1.22 gives each iteration its own variable, for modules that declare go 1.22 or later in go.mod, which removes the bug. If your code must build with older versions, pass the loop variable as an argument or shadow it with id := id.
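
Passing the loop variable as a parameter behaves the same on every Go version. A minimal sketch, where collectIDs is a hypothetical helper that records which ID each goroutine actually saw:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectIDs starts one goroutine per ID and returns, in sorted order, the
// IDs the goroutines observed.
func collectIDs(ids []int) []int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		seen []int
	)
	for _, id := range ids {
		wg.Add(1)
		go func(id int) { // id is a parameter, so each goroutine gets its own copy
			defer wg.Done()
			mu.Lock()
			seen = append(seen, id)
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	sort.Ints(seen) // goroutines finish in arbitrary order
	return seen
}

func main() {
	fmt.Println(collectIDs([]int{3, 1, 2}))
}
```

Every input ID shows up exactly once in the result, which is the property the shared-variable bug breaks.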

Goroutine leaks happen when a goroutine blocks on a channel operation that never completes, such as a receive from a channel that nobody sends on or closes. Always have a cancellation path. If a worker blocks on I/O, it needs a timeout or cancel-aware I/O. Otherwise, cancelling the context won't stop the worker, and it leaks until the I/O returns or the process exits.
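
One way to give a blocking call a cancellation path is to bound it with context.WithTimeout and have the call itself watch the context. In this sketch, slowFetch is a hypothetical stand-in for real I/O, simulated with a timer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowFetch is a hypothetical blocking call that honors ctx. The timer
// simulates I/O that takes d to complete.
func slowFetch(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil // the simulated I/O finished in time
	case <-ctx.Done():
		return ctx.Err() // cancelled or timed out: return instead of leaking
	}
}

// fetchWithTimeout gives slowFetch at most limit to finish work that takes
// the given duration.
func fetchWithTimeout(work, limit time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return slowFetch(ctx, work)
}

func main() {
	fmt.Println(fetchWithTimeout(200*time.Millisecond, 10*time.Millisecond))
	fmt.Println(fetchWithTimeout(time.Millisecond, 200*time.Millisecond))
}
```

Real I/O only behaves this way if the API accepts a context or a deadline. A call that ignores both will keep blocking regardless of what the caller cancels.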

Test the error path. Happy paths lie.

Decision matrix

Use a buffered error channel when you need to collect all errors and report them together. Use context cancellation when one failure makes the rest of the work pointless. Use errgroup.Group when you want to stop on the first error and propagate it cleanly. Use a single goroutine with sequential calls when the tasks depend on each other or when concurrency adds more overhead than value. Use sync.WaitGroup alone when you only care that everyone finished and don't need to exchange data.

Concurrency is a tool, not a default. Pick the pattern that matches the failure mode.

Where to go next

When you run multiple tasks at the same time, one failure shouldn't crash the whole program or leave the others running uselessly. Catch the error from each task as it finishes, decide whether the remaining work is still worth doing, and make sure everything shuts down cleanly. Think of it like a team where one person trips: the manager either lets the others finish and files a report, or stops the project because the fall makes the rest of the work pointless.