How to Implement RAG (Retrieval Augmented Generation) in Go

Implement RAG in Go by retrieving context from a vector store and passing it to an LLM for generation.

The missing context problem

You build a chat interface that answers questions about your internal documentation. It handles general programming questions fine. Then a user asks about the exact refund policy for enterprise contracts. The model invents a thirty-day window that never existed. The failure is not a bug in your code. The model simply never saw your documentation during training. You need to feed it the exact page at the exact moment it needs it.

What RAG actually does

Retrieval Augmented Generation splits the workload into two distinct phases. The first phase searches your private data to find the paragraphs that match the user question. The second phase hands those paragraphs to a language model alongside the original question. The model reads the provided text and writes an answer grounded in your data. Think of it like a research desk. The librarian does not memorize every book in the archive. The librarian searches the catalog, pulls the three most relevant volumes, and slides them across the desk to a writer. The librarian handles the search. The writer handles the synthesis. Your Go code orchestrates both roles.

The skeleton: retrieval meets generation

Here is the simplest working pipeline. It separates the search step from the generation step and passes a context through both.

package main

import (
	"context"
	"fmt"
	"log"
)

// RunRAG executes the retrieval and generation pipeline.
func RunRAG(ctx context.Context, query string) (string, error) {
	// Context carries deadlines and cancellation signals.
	// Pass it first so every downstream call can respect timeouts.
	retrieved, err := retrieveDocuments(ctx, query)
	if err != nil {
		// Wrap the error with %w so callers can still inspect the original cause.
		return "", fmt.Errorf("retrieval failed: %w", err)
	}

	// Hand the retrieved text to the model alongside the original question.
	answer, err := generateAnswer(ctx, query, retrieved)
	if err != nil {
		return "", fmt.Errorf("generation failed: %w", err)
	}

	return answer, nil
}

// retrieveDocuments queries a vector store and returns matching text.
func retrieveDocuments(ctx context.Context, query string) (string, error) {
	// Replace with actual vector store logic like pgvector or Chroma.
	// In production, this converts the query to embeddings and runs a similarity search.
	return "Enterprise refunds require manager approval within five business days.", nil
}

// generateAnswer sends the prompt to an LLM API and returns the response.
func generateAnswer(ctx context.Context, query, retrieved string) (string, error) {
	// Replace with actual HTTP call to OpenAI, Anthropic, or a local model.
	// The prompt template combines the user question with the retrieved context.
	return fmt.Sprintf("Based on policy: %s", retrieved), nil
}

func main() {
	ctx := context.Background()
	result, err := RunRAG(ctx, "What is the enterprise refund policy?")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}

Walking through the pipeline

The program starts by creating a background context. That context object travels through every function call. It carries deadlines, cancellation signals, and request-scoped values. Passing ctx as the first parameter follows the standard Go convention. A function that accepts a context should either hand it to the blocking calls it makes or check ctx.Err() before starting expensive work. If the client disconnects or a timeout fires, the context signals the cancellation and your code stops waiting for network responses.

The retrieveDocuments function represents your vector store. In production, this function converts the query into a numerical embedding, sends it to a database like PostgreSQL with pgvector or a dedicated store like Milvus, and fetches the top matching chunks. The function returns the raw text or a structured slice of documents. If the database is down or the query times out, it returns an error. You never ignore that error. You wrap it with fmt.Errorf and the %w verb so the original cause stays accessible. The verbose if err != nil boilerplate exists by design. It makes the unhappy path visible at every step.
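To make the similarity search concrete, here is a sketch that ranks chunks held in memory by cosine similarity. It is not how pgvector or Milvus work internally. The Chunk type, the cosine and topK helpers, and the hard-coded two-dimensional embeddings are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Chunk is a hypothetical record: stored text plus its embedding.
type Chunk struct {
	Text      string
	Embedding []float32
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns up to k chunks, most similar to the query embedding first.
func topK(query []float32, chunks []Chunk, k int) []Chunk {
	ranked := make([]Chunk, len(chunks))
	copy(ranked, chunks) // Leave the caller's slice order untouched.
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Embedding) > cosine(query, ranked[j].Embedding)
	})
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func main() {
	chunks := []Chunk{
		{"Refunds need manager approval.", []float32{1, 0}},
		{"Office hours are nine to five.", []float32{0, 1}},
	}
	for _, c := range topK([]float32{0.9, 0.1}, chunks, 1) {
		fmt.Println(c.Text) // prints "Refunds need manager approval."
	}
}
```

A linear scan like this is only reasonable for small collections. A dedicated store does the same ranking with an index so it does not have to compare the query against every chunk.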

The generateAnswer function builds the final prompt. It combines the user question with the retrieved context, often using a template string. It sends the combined text to an LLM provider via HTTP. The provider streams tokens back or returns a complete response. Your function parses the JSON payload and extracts the answer.

Go's type system keeps the pipeline strict. If you try to pass a string where the vector store client expects a []float32, the compiler rejects the program with cannot use query (variable of type string) as []float32 value in argument. If you forget to import a package, you get undefined: pkg. These errors are straightforward. They force you to align your data shapes before the program ever runs.

Trust the type system. Align your structs and slices early.

A realistic HTTP handler

Production code rarely runs from main. It lives behind an HTTP server. Here is how you wire the pipeline into a handler that accepts JSON requests and returns JSON responses.

package main

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

// RAGRequest holds the incoming query payload.
type RAGRequest struct {
	Query string `json:"query"`
}

// RAGResponse holds the outgoing answer payload.
type RAGResponse struct {
	Answer string `json:"answer"`
	Error  string `json:"error,omitempty"`
}

// HandleRAG processes incoming POST requests for the RAG pipeline.
func HandleRAG(w http.ResponseWriter, r *http.Request) {
	// Extract a timeout context so long-running queries do not block workers.
	ctx, cancel := context.WithTimeout(r.Context(), 15*time.Second)
	defer cancel()

	var req RAGRequest
	// Decode the JSON body into the request struct.
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid payload", http.StatusBadRequest)
		return
	}

	// Run the pipeline with the deadline-aware context.
	answer, err := RunRAG(ctx, req.Query)
	if err != nil {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		json.NewEncoder(w).Encode(RAGResponse{Error: err.Error()})
		return
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(RAGResponse{Answer: answer})
}

The handler derives a timeout context from the incoming request context. The context.WithTimeout call creates a child context that is canceled automatically after fifteen seconds. The defer cancel() releases the timer and other resources tied to that context as soon as the handler returns. The deadline keeps the handler from waiting indefinitely when the vector store or LLM API hangs, provided those calls honor the context. The request context carries no deadline of its own, but the server cancels it when the client disconnects. Deriving from it means both a timeout and a dropped connection stop the pipeline.

The request body decodes into a struct. If the JSON is malformed, the handler returns a 400 status code immediately. The pipeline runs next. If it fails, the handler sets the content type, writes a 500 status code, and encodes the error into the response struct. The omitempty tag on the error field keeps the JSON clean when everything succeeds. Struct tags like json:"query" are standard Go convention. They map your internal field names to the external API format without renaming your Go code.

Run gofmt on every file before committing. The formatter enforces consistent indentation and brace placement. Argue about logic, not whitespace.

Where things break

RAG pipelines introduce network calls, external dependencies, and strict timing requirements. Runtime failures require more care than compile-time errors. The most common issue is a missing context deadline. If you pass context.Background() all the way to the LLM API and the API hangs, nothing ever cancels the call. When the HTTP client has no timeout of its own, the handler goroutine can block indefinitely. Always attach a timeout, and derive the context from r.Context() when the work starts from an HTTP request.

Another frequent problem is unstructured error handling. If you log the error but continue execution, the LLM receives an empty context string. The model hallucinates confidently. Check every error. Return early. Wrap the cause so debugging tools can trace the failure back to the vector store or the HTTP client.

Prompt size becomes a problem when you fetch too many documents. Vector stores typically return the top k results. If you set k to fifty and each chunk is two thousand tokens, the retrieved text alone comes to about 100,000 tokens, which exceeds the context window of many models. Depending on the provider, the API returns a length error or the input is truncated silently. Keep k small. Filter results by relevance score before passing them to the generator.
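A sketch of that filtering step, assuming results arrive sorted by descending score and that the caller supplies a token estimate per chunk. The Scored type, the selectChunks helper, and the threshold and budget values are all hypothetical.

```go
package main

import "fmt"

// Scored is a hypothetical retrieval result with a relevance score.
type Scored struct {
	Text   string
	Score  float64
	Tokens int // Estimated token count, supplied by the caller.
}

// selectChunks keeps results at or above minScore, in the order given,
// and stops at the first chunk that would push the total past the budget.
func selectChunks(results []Scored, minScore float64, budget int) []Scored {
	var kept []Scored
	used := 0
	for _, r := range results {
		if r.Score < minScore {
			continue // Too weak a match to be worth prompt space.
		}
		if used+r.Tokens > budget {
			break
		}
		kept = append(kept, r)
		used += r.Tokens
	}
	return kept
}

func main() {
	results := []Scored{
		{"refund policy", 0.91, 2000},
		{"contract terms", 0.84, 2000},
		{"holiday schedule", 0.32, 2000},
	}
	kept := selectChunks(results, 0.5, 3000)
	fmt.Println(len(kept)) // prints 1
}
```

Stopping at the first chunk that does not fit keeps the strongest matches and avoids filling the remaining space with weaker ones. A real budget also has to leave room for the question, the instructions, and the generated answer.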

Chunking strategy matters more than people expect. If you split documents by raw character count, you will cut sentences in half. The embedding model loses semantic coherence. Split by paragraphs or use a recursive splitter that respects markdown headers and newline boundaries. Clean whitespace before embedding. Store the original document ID alongside each chunk so you can link back to the source in the final response.
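A simple paragraph splitter along those lines could look like this. It is a sketch, not a recursive splitter: it only splits on empty lines, and the DocChunk type and chunkByParagraph function are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// DocChunk is a hypothetical record that ties a chunk back to its source.
type DocChunk struct {
	DocID string
	Index int
	Text  string
}

// chunkByParagraph splits on empty lines, collapses runs of whitespace
// inside each paragraph, and drops paragraphs that end up empty.
func chunkByParagraph(docID, body string) []DocChunk {
	var chunks []DocChunk
	normalized := strings.ReplaceAll(body, "\r\n", "\n")
	for _, para := range strings.Split(normalized, "\n\n") {
		text := strings.Join(strings.Fields(para), " ")
		if text == "" {
			continue
		}
		chunks = append(chunks, DocChunk{DocID: docID, Index: len(chunks), Text: text})
	}
	return chunks
}

func main() {
	body := "Refunds need approval.\n\n  Requests are due\nwithin five days.  \n\n\n"
	for _, c := range chunkByParagraph("policy-42", body) {
		fmt.Printf("%s#%d: %s\n", c.DocID, c.Index, c.Text)
	}
}
```

Storing DocID and Index with each chunk is what lets the final response cite the document a passage came from.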

The worst pipeline bug is the one that silently degrades. Monitor latency percentiles. Track retrieval hit rates. Log when the fallback answer triggers.

Picking your architecture

RAG is not the only way to connect private data to a language model. You need to match the pattern to your data shape and latency requirements.

Use simple prompt engineering when your data fits in a few hundred tokens and changes rarely. Paste the static text directly into the system prompt. Skip the database entirely.

Use a caching layer when the same questions repeat constantly. Store the final answer in Redis or an in-memory map keyed by query hash. Return the cached result before touching the vector store.
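For the in-memory variant, a sketch might look like the following. The AnswerCache type and its methods are hypothetical, and the cache has no expiry or size limit, which a production version would need.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"sync"
)

// AnswerCache is a hypothetical in-memory cache keyed by a hash of the query.
type AnswerCache struct {
	mu      sync.RWMutex
	answers map[string]string
}

// NewAnswerCache returns an empty cache ready for concurrent use.
func NewAnswerCache() *AnswerCache {
	return &AnswerCache{answers: make(map[string]string)}
}

// key normalizes case and surrounding whitespace so trivially different
// spellings of the same question share one entry.
func key(query string) string {
	sum := sha256.Sum256([]byte(strings.ToLower(strings.TrimSpace(query))))
	return hex.EncodeToString(sum[:])
}

// Get returns the cached answer for the query, if one exists.
func (c *AnswerCache) Get(query string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	a, ok := c.answers[key(query)]
	return a, ok
}

// Put stores the answer under the hash of the query.
func (c *AnswerCache) Put(query, answer string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.answers[key(query)] = answer
}

func main() {
	cache := NewAnswerCache()
	cache.Put("What is the refund policy?", "Manager approval within five business days.")
	answer, ok := cache.Get("  what is the refund policy? ")
	fmt.Println(ok, answer)
}
```

An exact-match cache only helps when users phrase a question the same way. Cached answers also go stale when the underlying documents change, so entries need some form of invalidation.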

Use Retrieval Augmented Generation when your dataset exceeds the model context window and updates frequently. Chain a vector store for semantic search with an LLM client for synthesis.

Use fine-tuning when you need the model to adopt a specific tone or learn domain-specific terminology that does not appear in standard training data. Fine-tuning changes the model weights. It does not replace the need for up-to-date facts.

Use a hybrid search pipeline when keyword matching and semantic similarity both matter. Combine BM25 scoring with vector embeddings. Rank the merged results before sending them to the generator.
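One common way to rank the merged results is reciprocal rank fusion, which the sketch below uses. The article does not prescribe a merging method, so treat this as one option among several. The function name fuseRRF and the document IDs are hypothetical, and the constant 60 is a conventional default rather than a tuned value.

```go
package main

import (
	"fmt"
	"sort"
)

// fuseRRF merges ranked ID lists with reciprocal rank fusion:
// each document scores the sum of 1/(k+rank) over the lists it appears in.
func fuseRRF(k float64, rankings ...[]string) []string {
	scores := make(map[string]float64)
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1)) // Ranks start at 1.
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // Break ties by ID for a deterministic order.
	})
	return ids
}

func main() {
	keyword := []string{"doc-a", "doc-b", "doc-c"}  // e.g. BM25 order
	semantic := []string{"doc-b", "doc-d", "doc-a"} // e.g. vector order
	fmt.Println(fuseRRF(60, keyword, semantic))
	// prints [doc-b doc-a doc-d doc-c]
}
```

Rank fusion uses only positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.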

Where to go next