How to Stream LLM Responses in Go (Server-Sent Events)

Stream LLM responses in Go using Server-Sent Events by setting the text/event-stream header and flushing the response writer after each chunk.

The wall of text problem

You type a prompt into a chat interface. The screen freezes. Five seconds pass. Then the entire response dumps onto the screen at once. The user thinks the app crashed. You know the LLM is just thinking, but the browser has no idea. Streaming fixes this. You send chunks as they arrive. The UI updates in real time. The user sees words appear one by one. The experience feels instant, even if the model takes ten seconds to finish.

Server-Sent Events in plain words

Server-Sent Events is a protocol for pushing updates from a server to a browser over a single HTTP connection. Unlike WebSockets, which allow two-way communication, SSE is one-way: the server talks, the client listens. This fits LLM responses perfectly. The client sends a prompt once. The server streams back tokens until done. No need for the client to send data back during the stream.

SSE also handles reconnection automatically in the browser. If the network drops, the browser's EventSource retries the request. If earlier events carried an id: field, it sends the last one back in a Last-Event-ID header so the server can resume. That saves you from writing client-side retry logic, though the server still decides what "resume" means. Go's net/http package supports SSE with a few header tweaks and a call to Flush().
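The browser only sends Last-Event-ID if the events it received carried an id: field. The server still has to turn that value into a resume point. Here's a minimal sketch. It assumes the finished tokens are held in memory and that each event's id is its token index. The helper names formatEvent, resumeFrom and resumableHandler are hypothetical, not part of net/http.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// formatEvent renders one SSE event with an id: field. The browser
// remembers the last id it saw and echoes it in Last-Event-ID on reconnect.
func formatEvent(id int, data string) string {
	return fmt.Sprintf("id: %d\ndata: %s\n\n", id, data)
}

// resumeFrom maps a Last-Event-ID value to the index of the first token
// the client has not seen. A missing or malformed header means "start over".
func resumeFrom(lastEventID string) int {
	n, err := strconv.Atoi(lastEventID)
	if err != nil || n < 0 {
		return 0
	}
	return n + 1
}

// resumableHandler replays a response from where the client left off.
// Hypothetical: the tokens are already generated and kept in memory.
func resumableHandler(w http.ResponseWriter, r *http.Request) {
	tokens := []string{"Hello", " ", "from", " ", "Go"}

	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
		return
	}

	for i := resumeFrom(r.Header.Get("Last-Event-ID")); i < len(tokens); i++ {
		if _, err := io.WriteString(w, formatEvent(i, tokens[i])); err != nil {
			return // the client disconnected
		}
		flusher.Flush()
	}
}

func main() {
	// Simulate a reconnect after the client saw events 0 through 2.
	req := httptest.NewRequest(http.MethodGet, "/stream", nil)
	req.Header.Set("Last-Event-ID", "2")
	rec := httptest.NewRecorder()
	resumableHandler(rec, req)
	fmt.Print(rec.Body.String())
}
```

Replaying only works if the tokens are stored somewhere. Asking the model again would usually produce a different answer, so a live LLM stream can't simply be regenerated from the middle.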

SSE turns a slow request into a live conversation.

The streaming skeleton

Here's the simplest streaming handler: set headers, check for flush support, write data in SSE format, and flush after every chunk.

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// streamHandler sends chunks of text as Server-Sent Events.
func streamHandler(w http.ResponseWriter, r *http.Request) {
    // SSE requires specific headers to disable caching and keep the connection alive.
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("Connection", "keep-alive")

    // The ResponseWriter must support flushing to send data immediately.
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
        return
    }

    // Simulate an LLM generating tokens one by one.
    tokens := []string{"Hello", " ", "from", " ", "Go"}
    for _, token := range tokens {
        // SSE format requires a "data: " prefix and a blank line (\n\n) to end the event.
        if _, err := fmt.Fprintf(w, "data: %s\n\n", token); err != nil {
            return // the client disconnected
        }
        // Flush pushes the buffered data to the client right now.
        flusher.Flush()
        time.Sleep(500 * time.Millisecond)
    }
}

func main() {
    http.HandleFunc("/stream", streamHandler)
    // ListenAndServe always returns a non-nil error, e.g. when the port is taken.
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Flush or fail. The buffer hides your data until you tell it otherwise.

How the pieces fit together

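When the client hits /stream, the handler sets headers. The Content-Type tells the browser to expect an event stream. Cache-Control: no-cache tells browsers and proxies not to reuse a stored copy without checking with the server, so every request gets the live stream. Connection: keep-alive asks HTTP/1.x clients and proxies to keep the TCP connection open. HTTP/1.1 already does this by default, and HTTP/2 forbids connection-specific headers, so treat it as a harmless hint rather than a requirement.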

The type assertion w.(http.Flusher) checks whether the underlying writer supports flushing. http.ResponseWriter and http.Flusher are separate interfaces, and not every ResponseWriter implementation satisfies both. Some reverse proxies or middleware wrap the writer in their own type and hide this capability. If the assertion fails, the handler returns a 500 error. The compiler can't catch this because the concrete writer is only known at runtime, so the check has to happen there.
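Go 1.20 added http.NewResponseController, which can flush through wrappers that expose the original writer via an Unwrap() http.ResponseWriter method. This is a sketch that assumes Go 1.20 or later. countingWriter stands in for a hypothetical middleware wrapper that would defeat the plain type assertion.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// countingWriter is a hypothetical middleware wrapper. Embedding the
// http.ResponseWriter interface promotes only Header, Write and WriteHeader,
// so the wrapper does not implement http.Flusher.
type countingWriter struct {
	http.ResponseWriter
	bytes int
}

func (cw *countingWriter) Write(p []byte) (int, error) {
	n, err := cw.ResponseWriter.Write(p)
	cw.bytes += n
	return n, err
}

// Unwrap lets http.ResponseController find the writer underneath (Go 1.20+).
func (cw *countingWriter) Unwrap() http.ResponseWriter { return cw.ResponseWriter }

func controllerStreamHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	rc := http.NewResponseController(w)
	// Flushing before any data sends the headers. If no writer in the chain
	// can flush, Flush returns an error without sending anything, so a real
	// status code still works.
	if err := rc.Flush(); err != nil {
		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
		return
	}
	for _, token := range []string{"Hello", "Go"} {
		if _, err := fmt.Fprintf(w, "data: %s\n\n", token); err != nil {
			return
		}
		if err := rc.Flush(); err != nil {
			return
		}
	}
}

// demo runs the handler behind the wrapper and reports what happened.
func demo() (implementsFlusher, flushed bool, body string) {
	rec := httptest.NewRecorder()
	var w http.ResponseWriter = &countingWriter{ResponseWriter: rec}
	_, implementsFlusher = w.(http.Flusher)
	controllerStreamHandler(w, httptest.NewRequest(http.MethodGet, "/stream", nil))
	return implementsFlusher, rec.Flushed, rec.Body.String()
}

func main() {
	implementsFlusher, flushed, body := demo()
	fmt.Println("wrapper implements http.Flusher:", implementsFlusher)
	fmt.Println("flushed through ResponseController:", flushed)
	fmt.Print(body)
}
```

The wrapper still has to provide Unwrap. A wrapper that hides the original writer completely defeats both the type assertion and the controller.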

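Inside the loop, fmt.Fprintf writes the chunk in SSE format. The data: prefix marks the line as the event's payload. The blank line produced by \n\n ends the event and tells the browser to dispatch it. A single newline only ends the current line. The event stays open, and a following data: line is appended to the same event. If the terminating blank line never arrives, the browser keeps buffering and never dispatches the event. flusher.Flush() then forces the data out of the buffer.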
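Model output often contains newlines, for example when it writes code. Written naively, the second line of such a token arrives without a data: prefix. The browser's parser then treats it as some other field, usually an unknown one it ignores. Here's a minimal sketch of a formatter that gives every line its own prefix. formatData and writeData are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// formatData encodes an arbitrary payload as one SSE event. The SSE format
// treats CR, LF and CRLF as line terminators, so every line of the payload
// gets its own "data: " prefix. The browser joins the data lines back
// together with "\n" before dispatching the event.
func formatData(payload string) string {
	payload = strings.ReplaceAll(payload, "\r\n", "\n")
	payload = strings.ReplaceAll(payload, "\r", "\n")
	var b strings.Builder
	for _, line := range strings.Split(payload, "\n") {
		b.WriteString("data: ")
		b.WriteString(line)
		b.WriteString("\n")
	}
	b.WriteString("\n") // the blank line ends the event
	return b.String()
}

// writeData writes one event and hands any network error to the caller.
func writeData(w io.Writer, payload string) error {
	_, err := io.WriteString(w, formatData(payload))
	return err
}

func main() {
	// A token containing a newline, as happens when a model emits code.
	if err := writeData(os.Stdout, "fmt.Println(1)\nfmt.Println(2)"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The parser strips exactly one space after data:. Because formatData always adds exactly one, a token that starts with a space, such as " Go", survives intact.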

Go's HTTP server buffers writes to minimize system calls. Writing to a file descriptor one byte at a time is slow. The server collects data in memory and sends it in larger chunks. This improves throughput for standard requests. Streaming breaks this optimization. You force the server to send small packets immediately. That's why Flush() exists. It tells the server to empty the buffer now. Use it only when latency matters more than throughput.
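You can check that flushing actually works without opening a browser. This sketch is built on httptest. It starts a real local server whose handler (the hypothetical slowHandler) flushes one event and then stalls. The client reads the body line by line and records whether the first event arrived before the stall ended.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// stall stands in for a slow model between the first and second token.
const stall = 500 * time.Millisecond

// slowHandler (hypothetical) sends one event, flushes it, then stalls.
func slowHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
		return
	}
	fmt.Fprint(w, "data: first\n\n")
	flusher.Flush() // without this, the client waits out the whole stall
	time.Sleep(stall)
	fmt.Fprint(w, "data: second\n\n")
	flusher.Flush()
}

// firstEvent returns the first event's payload and whether it arrived
// before the handler finished stalling.
func firstEvent() (string, bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(slowHandler))
	defer srv.Close() // waits for the handler to finish

	start := time.Now()
	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", false, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "data: ") {
			return strings.TrimPrefix(line, "data: "), time.Since(start) < stall, nil
		}
	}
	return "", false, fmt.Errorf("stream ended without an event (scan error: %v)", scanner.Err())
}

func main() {
	payload, early, err := firstEvent()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("first event %q arrived before the stall ended: %v\n", payload, early)
}
```

Remove the first Flush call, and the client sees nothing until the handler has slept and flushed again. That is the buffering behavior described above.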

Convention aside: gofmt is mandatory. Don't argue about indentation; let the tool decide. Most editors run it on save. The code above follows standard formatting.

Adding context and cancellation

Real applications need cancellation. If the user closes the tab, the server should stop generating tokens. This saves money on API calls and frees server resources. r.Context() returns a context that Go's HTTP server cancels when the client disconnects, and also when the handler returns. You don't need to set up signal handlers or watch the connection yourself.
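To see a handler react to cancellation without a browser, you can cancel the request context by hand. In this minimal sketch, the hypothetical tickerHandler waits on r.Context().Done() in the same select as its next tick, so it stops even while idle. The hypothetical stopsOnCancel plays the server's part: it cancels the context of a fake request, just as the real server does on disconnect.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// tickerHandler (hypothetical) emits an event every 50ms until the request
// context is cancelled.
func tickerHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
		return
	}
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; ; i++ {
		select {
		case <-r.Context().Done():
			return // client gone: stop producing
		case <-ticker.C:
			if _, err := fmt.Fprintf(w, "data: tick %d\n\n", i); err != nil {
				return
			}
			flusher.Flush()
		}
	}
}

// stopsOnCancel runs the handler, cancels its context, and reports whether
// the handler returned within a second.
func stopsOnCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	req := httptest.NewRequest(http.MethodGet, "/stream", nil).WithContext(ctx)
	rec := httptest.NewRecorder()

	done := make(chan struct{})
	go func() {
		defer close(done)
		tickerHandler(rec, req)
	}()

	time.Sleep(120 * time.Millisecond) // let a couple of ticks through
	cancel()                           // what the server does when the tab closes
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("handler stopped after cancel:", stopsOnCancel())
}
```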

Here's the handler with context support. It checks for cancellation before writing and passes the context to the token generator.

// streamLLMHandler streams LLM tokens with context cancellation support.
func streamLLMHandler(w http.ResponseWriter, r *http.Request) {
    // Context carries cancellation signals from the client disconnecting.
    ctx := r.Context()

    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")

    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
        return
    }

    // generateTokens runs its own goroutine and stops when ctx is cancelled.
    tokens := generateTokens(ctx)

    for token := range tokens {
        // Skip the write if the client is already gone.
        if ctx.Err() != nil {
            return
        }

        // The client can still disconnect between the check and the write,
        // so the write error matters too.
        if _, err := fmt.Fprintf(w, "data: %s\n\n", token); err != nil {
            return
        }
        flusher.Flush()
    }
}

Context is plumbing. Run it through every long-lived call site.

Convention aside: context.Context always goes as the first parameter. Name it ctx. Functions that take a context should respect cancellation and deadlines. This is the standard Go pattern.

The token generator runs in a goroutine. Each send sits in a select alongside ctx.Done(), so once the context is cancelled the goroutine can give up on a send the handler will never receive.

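// generateTokens simulates an LLM API that yields tokens over a channel.
// It needs "context" and "time" in the file's import list.
func generateTokens(ctx context.Context) <-chan string {
    out := make(chan string)
    go func() {
        defer close(out)
        words := []string{"Thinking", " ", "about", " ", "your", " ", "prompt"}
        for _, word := range words {
            // Simulated model latency that still honours cancellation.
            select {
            case <-ctx.Done():
                return
            case <-time.After(100 * time.Millisecond):
            }
            // Block until the handler takes the token or the client leaves.
            select {
            case <-ctx.Done():
                return
            case out <- word:
            }
        }
    }()
    return out
}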

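Convention aside: if err != nil { return err } is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. Streaming adds a twist: the first Flush sends the 200 status and the headers, so an error that shows up halfway through can no longer change the status code. Calling http.Error at that point doesn't panic. The server logs http: superfluous response.WriteHeader call, keeps the 200, and appends the error text to the stream, where it isn't a valid event. The compiler won't complain either way. Also, once the handler returns, nothing may touch the ResponseWriter again, including goroutines it started.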
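When the model fails halfway through, one option is to report the failure in-band as its own SSE event. This is a sketch under two assumptions: the event name llm-error is a hypothetical convention the client must know about, and errBackend stands in for whatever your LLM client returns. The sketch avoids the name error because EventSource already fires error events for connection problems, which would make the two hard to tell apart.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errBackend stands in for a hypothetical upstream LLM failure.
var errBackend = errors.New("model backend unavailable")

// streamTokens sends each token as an event. If streamErr is non-nil, it
// then reports the failure as a named "llm-error" event (hypothetical name).
func streamTokens(w http.ResponseWriter, tokens []string, streamErr error) {
	w.Header().Set("Content-Type", "text/event-stream")
	flusher, ok := w.(http.Flusher)
	if !ok {
		// Nothing has been sent yet, so a real status code still works here.
		http.Error(w, "Streaming unsupported", http.StatusInternalServerError)
		return
	}
	for _, token := range tokens {
		if _, err := fmt.Fprintf(w, "data: %s\n\n", token); err != nil {
			return // client disconnected
		}
		flusher.Flush()
	}
	if streamErr != nil {
		// The 200 status is already on the wire; http.Error can't change it.
		// Assumes the error message contains no newlines.
		fmt.Fprintf(w, "event: llm-error\ndata: %s\n\n", streamErr)
		flusher.Flush()
	}
}

// render runs streamTokens against a recorder and returns status and body.
func render(tokens []string, streamErr error) (int, string) {
	rec := httptest.NewRecorder()
	streamTokens(rec, tokens, streamErr)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := render([]string{"Partial", " ", "answer"}, errBackend)
	fmt.Println("status:", code)
	fmt.Print(body)
}
```

In the browser, the client would listen with addEventListener("llm-error", ...) and close the EventSource. Otherwise EventSource reconnects when the stream ends and asks again.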

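Convention aside: goroutine leaks happen when a goroutine blocks forever on a channel operation that nobody will complete, such as a send with no receiver left. Always give it a cancellation path. If the handler returns early, the server cancels the request context, the select in the generator takes the ctx.Done() branch, and defer close(out) closes the channel as the goroutine exits. In the normal case, the same close ends the handler's range loop after the last token.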
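You can check the cancellation path directly. This sketch copies the generateTokens pattern without the simulated delay. It reads one token, cancels the context, and then stops reading for a moment, the way a handler that has already returned would. A producer that honours ctx.Done() gives up on its pending send and closes the channel. One that ignored the context would still be holding the remaining six tokens. afterCancel is a hypothetical test helper.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// generateTokens follows the pattern above: every send is paired with a
// ctx.Done() case, and the deferred close signals the consumer.
func generateTokens(ctx context.Context) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		words := []string{"Thinking", " ", "about", " ", "your", " ", "prompt"}
		for _, word := range words {
			select {
			case <-ctx.Done():
				return
			case out <- word:
			}
		}
	}()
	return out
}

// afterCancel reads one token, cancels, pauses without reading, then drains
// the channel. It reports whether the channel closed and how many tokens
// were still delivered after the cancel.
func afterCancel() (closed bool, leftover int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	tokens := generateTokens(ctx)

	<-tokens                          // the handler reads one token...
	cancel()                          // ...then the client disconnects
	time.Sleep(50 * time.Millisecond) // nobody is receiving now

	deadline := time.After(time.Second)
	for {
		select {
		case _, ok := <-tokens:
			if !ok {
				return true, leftover
			}
			leftover++
		case <-deadline:
			return false, leftover
		}
	}
}

func main() {
	closed, leftover := afterCancel()
	fmt.Printf("channel closed: %v, tokens sent after cancel: %d\n", closed, leftover)
}
```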

Pitfalls and runtime traps

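Skipping the flusher check causes silent failures. Suppose you only flush when the assertion succeeds and carry on otherwise. A writer without flush support then buffers everything until its buffer fills (a few kilobytes) or the handler returns. The client sees nothing in the meantime, then gets the response in large lumps. You lose the streaming benefit. Dropping the ok check entirely fails loudly instead: a single-value assertion w.(http.Flusher) panics when the writer doesn't implement the interface.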

Ignoring ctx.Err() wastes resources. If the client disconnects, the server keeps generating tokens that go nowhere, and with a real LLM API you keep paying for them. If the generator also ignores the context, it can block forever on a send nobody receives. In a high-traffic service, leaked goroutines accumulate until the process runs out of memory.

Missing the double newline breaks the event. The browser waits for more data. The UI freezes. The user thinks the app is broken. The SSE protocol is strict about formatting.

Writing to a closed connection fails. If the client disconnects, the underlying TCP connection closes, and further writes return an error such as broken pipe or connection reset. fmt.Fprintf returns that error. Ignoring it hides network failures, and the compiler won't make you check it. Checking ctx.Err() before each write narrows the window, but the client can still vanish between the check and the write, so handle the write error too.

The worst goroutine bug is the one that never logs.

When to stream and when not to

Use SSE when the server pushes updates to the client and the client only sends requests. Use WebSockets when the client and server need to exchange messages in both directions simultaneously. Use HTTP long-polling when you must support legacy browsers that lack SSE support. Use a standard HTTP response when the entire payload is small and available instantly.

Pick the tool that matches the traffic pattern.

Where to go next