You have a Go service and a local AI model
You've built a Go backend that handles user requests. You want to add a feature that summarizes text or generates code, but you don't want to send sensitive data to a third-party API, and you're tired of per-token billing. You install Ollama, load a model like llama3, and now you have a local endpoint at localhost:11434. The hard part isn't the AI. It's wiring your Go code to that HTTP endpoint correctly, handling timeouts, and deciding whether you need the full response at once or want to stream tokens as they arrive.
Ollama runs as a local server. Your Go program acts as a client. You send a POST request with JSON containing the model name and your prompt. Ollama processes the request and returns JSON with the result. The protocol is standard HTTP. No special libraries required. The standard library handles everything.
The Ollama API is just HTTP and JSON
Ollama exposes a REST API. The /api/generate endpoint accepts a JSON body with the model name and prompt. By default it streams the answer as a sequence of JSON objects, each carrying a fragment of the generated text. If you set stream: false, it returns a single JSON object with the complete text.
Think of Ollama like a chef in a kitchen. You send a ticket with the order details. The chef cooks and sends back the food. If you ask for streaming, the chef sends you ingredients as they're chopped so you can watch the progress. If you don't, you wait until the dish is plated.
Go's net/http package sends the request. encoding/json marshals your structs to JSON and unmarshals the response. bytes provides efficient byte buffers. That's the entire stack.
Minimal example
Here's the simplest way to get a response. You define structs that match the JSON schema, marshal the request, post it, and decode the result.
package main
import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)
// GenerateRequest matches the JSON schema for the /api/generate endpoint.
type GenerateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	// Stream is always sent: Ollama streams unless it receives "stream": false.
	Stream bool `json:"stream"`
}
// GenerateResponse holds the model's output text.
type GenerateResponse struct {
	Response string `json:"response"`
}
func main() {
	// Marshal the request struct into JSON bytes; Stream: false asks for one JSON object.
	req := GenerateRequest{Model: "llama3", Prompt: "Hello", Stream: false}
	body, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	// Post to the local API; bytes.NewReader is efficient for small payloads.
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	// Decode the single JSON object directly into the response struct.
	var result GenerateResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	fmt.Println(result.Response)
}
Walkthrough
The code starts by defining two structs. GenerateRequest has Model and Prompt fields with JSON tags. The tags tell the JSON encoder to use lowercase keys in the output. If you omit the tags, the keys become Model and Prompt, which do not match the lowercase model and prompt the API documents. Don't rely on the server being lenient about key case.
json.Marshal converts the struct to a byte slice. This operation is fast and allocates a new slice. The result is compact JSON whose keys come from the struct tags, for example "model":"llama3" and "prompt":"Hello".
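To make the effect of the tags concrete, here is a small sketch that marshals the same values with and without them. The tagged and untagged types and the encode helper are names made up for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tagged carries json tags, so its keys are lowercase.
type tagged struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

// untagged has no tags, so encoding/json falls back to the Go field names.
type untagged struct {
	Model  string
	Prompt string
}

// encode marshals v and returns the JSON as a string.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(tagged{Model: "llama3", Prompt: "Hello"}))
	// prints {"model":"llama3","prompt":"Hello"}
	fmt.Println(encode(untagged{Model: "llama3", Prompt: "Hello"}))
	// prints {"Model":"llama3","Prompt":"Hello"}
}
```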
http.Post sends the request. It is a thin wrapper around http.DefaultClient, the shared package-level client with default settings, so it builds the request and executes it without creating a new client. The third argument is an io.Reader. bytes.NewReader wraps the byte slice without copying and implements io.Reader efficiently.
The response body is an io.ReadCloser. You must close it to release the connection back to the pool. defer resp.Body.Close() ensures cleanup even if the function returns early.
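json.NewDecoder(resp.Body).Decode(&result) reads from the response body and unmarshals the JSON into the struct. Decoding straight from the body skips the io.ReadAll step and the extra slice you would hand to json.Unmarshal. The decoder still buffers one complete JSON value internally, so the saving is modest for a single response. The approach pays off with streaming, where each object is decoded as soon as it arrives.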
If the model takes a long time to load, the request might hang. The default HTTP client has no timeout. In production, you need a context with a deadline.
Convention aside: if err != nil { return err } is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. In production code, never ignore the error from json.Marshal, http.Post, or Decode. The minimal example uses panic for brevity, but real code should return errors.
Realistic example with context
Production code needs timeouts and context propagation. Ollama can hang if the model is loading or the GPU is saturated. You should pass a context.Context to the HTTP call so cancellation propagates.
// GenerateText sends a prompt to Ollama with a context for cancellation.
func GenerateText(ctx context.Context, model, prompt string) (string, error) {
	// Stream is left at false, so Ollama replies with a single JSON object.
	req := GenerateRequest{Model: model, Prompt: prompt}
	body, err := json.Marshal(req)
	if err != nil {
		return "", fmt.Errorf("marshal request: %w", err)
	}
	// Use NewRequestWithContext to bind the HTTP call to the context.
	url := "http://localhost:11434/api/generate"
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return "", fmt.Errorf("create request: %w", err)
	}
	httpReq.Header.Set("Content-Type", "application/json")
	// Execute the request; context deadlines prevent hanging on slow models.
	client := &http.Client{}
	resp, err := client.Do(httpReq)
	if err != nil {
		return "", fmt.Errorf("post request: %w", err)
	}
	defer resp.Body.Close()
	// Check the status before decoding; an error body does not match GenerateResponse.
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ollama returned status %d", resp.StatusCode)
	}
	var result GenerateResponse
	err = json.NewDecoder(resp.Body).Decode(&result)
	if err != nil {
		return "", fmt.Errorf("decode response: %w", err)
	}
	return result.Response, nil
}
The function takes ctx as the first parameter. This follows the Go convention for context. Callers can pass a context with a timeout using context.WithTimeout. If the timeout expires, the HTTP call aborts.
http.NewRequestWithContext creates a request bound to the context. client.Do executes the request. If the context is cancelled, Do returns an error.
The function checks the status code. Ollama returns 200 OK on success. Other codes indicate errors like 404 if the model is not found.
Error wrapping with %w allows callers to use errors.Is or errors.As to inspect the error chain.
Convention aside: context.Context always goes as the first parameter, conventionally named ctx. Functions that take a context should respect cancellation and deadlines. Don't pass a context that lives too long; always set a deadline or timeout for external calls.
Streaming responses
Models generate tokens sequentially. Waiting for the full response blocks your user. Ollama supports streaming, where it sends a JSON object for each token. This is essential for chat interfaces or any UI where latency matters.
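// StreamResponse represents a single chunk from the streaming endpoint.
type StreamResponse struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}
// streamPrompt prints tokens as they arrive; cancelling ctx aborts the stream.
func streamPrompt(ctx context.Context, model, prompt string) error {
	req := GenerateRequest{Model: model, Prompt: prompt, Stream: true}
	body, err := json.Marshal(req)
	if err != nil {
		return fmt.Errorf("marshal request: %w", err)
	}
	// Bind the request to ctx; http.Post would ignore the context entirely.
	url := "http://localhost:11434/api/generate"
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("create request: %w", err)
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return fmt.Errorf("post request: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("ollama returned status %d", resp.StatusCode)
	}
	// Decode each JSON object from the stream until done.
	dec := json.NewDecoder(resp.Body)
	for {
		var chunk StreamResponse
		if err := dec.Decode(&chunk); err == io.EOF {
			break
		} else if err != nil {
			return fmt.Errorf("decode chunk: %w", err)
		}
		fmt.Print(chunk.Response)
		if chunk.Done {
			break
		}
	}
	return nil
}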
The request includes Stream: true. Ollama responds with a stream of JSON objects. Each object contains a response field with a token or partial token, and a done field that is true on the final object.
The decoder loop calls Decode repeatedly. Each call reads one JSON object from the stream. When the stream ends, Decode returns io.EOF. The loop breaks.
If chunk.Done is true, the model has finished generating. That object is the last one Ollama sends for the request, so the loop can stop there without waiting for io.EOF.
Streaming requires more code than a single request, but it provides a better user experience. The client can display tokens as they arrive instead of showing a spinner for ten seconds.
Convention aside: when streaming, treat the done flag as the end-of-generation signal. Ollama sends a final object with done: true before it closes the connection. If io.EOF arrives first, the stream was cut short and the text is incomplete, so a loop that breaks on EOF alone can report a truncated answer as success.
Pitfalls and errors
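If you define a struct field with a lowercase name, the code still compiles, but encoding/json silently skips the field: it is left out when marshaling and never filled when decoding. go vet reports the mistake as struct field model has json tag but is not exported. JSON tags only work on exported fields. The field name must start with a capital letter.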
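The silent failure is easy to reproduce. In this sketch, broken and fixed are illustrative types, and the unexported field is left untagged because a tag would make no difference to the output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// broken uses a lowercase field name, so encoding/json cannot see it,
// with or without a json tag.
type broken struct {
	model string
}

// fixed exports the field and uses a tag to keep the lowercase key.
type fixed struct {
	Model string `json:"model"`
}

// encode marshals v and returns the JSON text.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(broken{model: "llama3"})) // prints {}
	fmt.Println(encode(fixed{Model: "llama3"}))  // prints {"model":"llama3"}
}
```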
Ollama streams by default. If you forget to set stream: false in the request, or your request struct has no stream field at all, you get a stream. If you decode a stream into a single struct, you get only the first chunk. The response appears truncated. Always be explicit about streaming mode.
If you use io.ReadAll on a streaming response, you block until the entire response is received. This defeats the purpose of streaming and can cause memory pressure for long outputs. Use json.NewDecoder to read incrementally.
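If the model is not found, for example because it was never pulled, Ollama returns a 404 status code. A model that exists but is not in memory yet is loaded on demand, which shows up as latency rather than an error. Check the status code before decoding the body. An error response does not match your success struct, so decoding it typically leaves the struct empty, with no text and no decode error to warn you.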
If you forget to close the response body, the connection leaks. The HTTP client reuses connections, but unclosed bodies prevent reuse. Always defer resp.Body.Close().
Convention aside: Don't pass a *string. Strings are already cheap to pass by value. Passing a pointer adds indirection without saving memory. Use string parameters unless you need to modify the value or represent nil.
Decision matrix
Use the standard library net/http when you want zero dependencies and full control over the request lifecycle.
Use a community client library when you prefer convenience methods for streaming and error handling over raw HTTP calls.
Use the /api/generate endpoint for single-turn prompts where you only need the final text.
Use the /api/chat endpoint when you need to maintain conversation context across multiple messages.
Use the streaming response format when the model takes longer than a few seconds and you want to display tokens as they arrive.
Use a non-streaming request when you need the complete response before proceeding and latency is acceptable.
Ollama is a server. Treat it like one. Respect timeouts, handle errors, and stream when it matters.