The model runner versus the model server
You trained a model in Python. It achieves 98% accuracy on your validation set. Now you need to serve predictions to a mobile app that sends ten thousand requests per second. You spin up a Python server, and the CPU usage spikes. The GIL serializes your requests. Latency jumps. You look at Go.
Go is rarely the language that runs the matrix multiplications. It does not have native tensor types, SIMD auto-vectorization for deep learning, or built-in CUDA bindings. If you try to write a neural network from scratch in Go, you will spend months optimizing loops that a C++ library handles in microseconds.
Go makes sense for ML inference when it acts as the server, not the engine. Go excels at networking, concurrency, and glue code. It handles HTTP connections, manages request queues, enforces timeouts, and routes traffic. The actual inference happens in native code: a library like ONNX Runtime, the TensorFlow C API, or a custom CUDA kernel. Go wraps that library, manages the lifecycle, and serves the results.
The pattern is simple. Go receives the request. Go prepares the data. Go calls the C++ engine. Go returns the response. Go handles the concurrency so you can scale to thousands of concurrent connections without writing a single thread pool.
Why Go avoids the math
Machine learning inference relies on dense linear algebra. You are multiplying massive matrices of floating-point numbers. Performance depends on memory layout, cache locality, and hardware acceleration.
C++ libraries use contiguous memory blocks and leverage SIMD instructions to process multiple values per cycle. They manage memory manually to avoid allocation overhead during inference. They bind directly to GPU drivers to offload computation.
Go's design priorities differ. Go optimizes for developer productivity and safe concurrency. The garbage collector reclaims memory automatically, which is great for general-purpose programming but introduces pauses and background work whose timing you do not control, and that can hurt latency-sensitive workloads. Go's standard compiler does little or no auto-vectorization, unlike LLVM or GCC with aggressive optimization flags. Go does not provide low-level memory control for pinning buffers to GPU memory.
This does not mean Go cannot serve models. It means Go should not implement the math. The language is a logistics manager, not the factory worker. Go coordinates the flow of data. C++ performs the heavy lifting.
The cgo bridge
Go communicates with C libraries through cgo. This tool generates a bridge between Go code and C code. You can call C functions from Go and pass data back and forth.
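The bridge has a cost. Each call leaves the goroutine's stack for a system stack and coordinates with the scheduler, and any data that C cannot read in place must be copied or converted. Pointers must be handled carefully. The runtime keeps Go memory that is passed as an argument valid and in place for the duration of the call, and C must not hold on to it afterwards.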
Use cgo sparingly. Call the C library once per request or once per batch. Do not call it per pixel or per element. The overhead of the transition can dominate the computation time if the payload is small.
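To make the trade-off concrete, here is a small back-of-the-envelope helper. The function names are hypothetical, and the durations in the note below are illustrative figures rather than measurements, so treat it as a way to reason about the ratio, not as a benchmark.

```go
package main

import "time"

// overheadShare returns the fraction of total time spent crossing the cgo
// boundary when a call pays perCall of fixed overhead and then does work
// of useful computation.
func overheadShare(perCall, work time.Duration) float64 {
	total := perCall + work
	if total <= 0 {
		return 0
	}
	return float64(perCall) / float64(total)
}

// batchedShare is the same ratio when n items share a single crossing.
func batchedShare(perCall, workPerItem time.Duration, n int) float64 {
	return overheadShare(perCall, workPerItem*time.Duration(n))
}
```

With a made-up 100 ns crossing and 300 ns of work per element, one call per element spends a quarter of its time on the boundary. Sending 1,000 elements in a single call drops that share to roughly 0.03%. Measure your own engine before relying on numbers like these.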
// inference.go
package main

/*
#cgo LDFLAGS: -L./lib -linference_engine
#include "inference.h"
*/
import "C"

import (
	"errors"
	"unsafe"
)

// RunModel invokes the C++ inference engine with the provided input data.
// It returns the prediction score, or an error if the input is empty.
func RunModel(data []byte) (float64, error) {
	// Reject empty input before crossing the cgo boundary.
	// Taking &data[0] of an empty slice would panic.
	if len(data) == 0 {
		return 0, errors.New("inference: empty input")
	}
	// Take the address of the slice's backing array as a C pointer.
	// unsafe.Pointer only converts the type. cgo keeps the memory valid
	// and in place for the duration of the call, provided C does not
	// retain the pointer after it returns.
	ptr := (*C.char)(unsafe.Pointer(&data[0]))
	// Call the C function.
	// This transitions from the Go runtime to C code.
	// The cost of this transition is roughly fixed per call.
	result := C.predict(ptr, C.int(len(data)))
	// Convert the C float back to a Go float64.
	return float64(result), nil
}
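The unsafe.Pointer conversion deserves care. It does not pin anything by itself. It only lets a Go pointer be reinterpreted as a C pointer. The safety comes from cgo's pointer-passing rules: Go memory passed as an argument to a C function is kept alive and in place for the duration of that call. The current garbage collector does not move heap objects, but the rules are written so that a future one could, and goroutine stacks already move when they grow. If C stores the pointer and uses it after the call returns, it may dereference freed or relocated memory, and the program crashes. Copy the data into C-allocated memory if the engine needs it beyond the call.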
Anatomy of a production inference handler
A real inference service needs more than a function call. It needs HTTP handling, error management, timeouts, and metrics. Go's standard library provides all of this without external dependencies.
The handler receives a request. It decodes the JSON payload. It calls the inference engine. It returns the result. If the engine takes too long, the handler cancels the request and returns an error. This prevents slow models from exhausting server resources.
// server.go
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// PredictionRequest defines the JSON structure for incoming inference requests.
type PredictionRequest struct {
	Features []float64 `json:"features"`
}

// PredictionResponse defines the JSON structure for the model output.
type PredictionResponse struct {
	Score float64 `json:"score"`
	Error string  `json:"error,omitempty"`
}

// HandlePredict serves inference requests over HTTP.
// It respects context deadlines and returns structured JSON responses.
func HandlePredict(engine *InferenceEngine) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Context carries the deadline from the client or proxy.
		// Pass it through to the engine to support cancellation.
		ctx := r.Context()
		// Apply a server-side timeout to protect against slow inference.
		// This ensures the goroutine does not hang indefinitely.
		ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
		defer cancel()
		var req PredictionRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			// Return 400 for malformed JSON.
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}
		// Convert features to the format expected by the engine.
		// This might involve flattening or scaling the data.
		inputData := prepareInput(req.Features)
		// Run inference with context support.
		// The engine must check ctx.Done() if the operation is long-running.
		score, err := engine.Predict(ctx, inputData)
		if err != nil {
			// Return 503 if the engine fails or times out.
			// This signals a temporary service issue to the client.
			log.Printf("inference failed: %v", err)
			http.Error(w, "inference service unavailable", http.StatusServiceUnavailable)
			return
		}
		// Encode the response as JSON.
		resp := PredictionResponse{Score: score}
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(resp); err != nil {
			// Log encoding errors but do not panic.
			// The connection may have been closed by the client.
			log.Printf("failed to encode response: %v", err)
		}
	}
}
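The handler calls prepareInput without showing it. One plausible shape is sketched here, under the assumption that the engine reads a flat buffer of little-endian float32 values. The article does not specify the engine's input format, so the encoding is a guess made for illustration.

```go
package main

import (
	"encoding/binary"
	"math"
)

// prepareInput packs the features into a contiguous byte slice that can be
// handed to C in a single call.
// Assumption: the engine expects little-endian float32 values.
func prepareInput(features []float64) []byte {
	buf := make([]byte, 4*len(features))
	for i, f := range features {
		// Narrow to float32 and write its IEEE 754 bits at offset 4*i.
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(float32(f)))
	}
	return buf
}
```

Narrowing from float64 to float32 loses precision. That is usually acceptable for inference, but it is worth confirming against the precision the model was exported with.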
The context.Context parameter is the first argument in the Predict method. This follows Go convention. Functions that perform I/O or long-running work accept a context as the first parameter. The context allows the caller to cancel the operation. If the client disconnects, the context is cancelled. The engine can abort the computation and release resources.
Error handling uses the standard if err != nil pattern. Go does not have exceptions. Errors are values. The handler checks the error and returns an appropriate HTTP status code. This makes the failure path explicit and easy to audit.
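Because errors are values, the handler can inspect them and pick a more specific status than a blanket 503. The sketch below is one possible refinement, not part of the handler above. The errInvalidInput sentinel and the statusForError name are hypothetical, and it assumes the engine wraps context errors so that errors.Is can find them.

```go
package main

import (
	"context"
	"errors"
	"net/http"
)

// errInvalidInput is a hypothetical sentinel the engine could return when
// it rejects the shape or size of the input.
var errInvalidInput = errors.New("invalid input")

// statusForError maps an inference error to an HTTP status code.
func statusForError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errInvalidInput):
		// The client sent something the model cannot accept.
		return http.StatusBadRequest
	case errors.Is(err, context.DeadlineExceeded):
		// The server-side timeout fired before the engine finished.
		return http.StatusGatewayTimeout
	default:
		// Engine failures and client cancellations fall through here.
		return http.StatusServiceUnavailable
	}
}
```

In the handler, http.Error(w, "inference failed", statusForError(err)) would replace the fixed status. Whether a timeout should surface as 504 or 503 is a policy choice that depends on how clients and load balancers retry.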
Pitfalls: pointers, pauses, and panics
Wrapping C++ inference engines introduces specific risks. The most common issue involves pointer passing. cgo enforces strict rules about pointers. C must not keep a Go pointer after the call returns. Go memory passed to C must not itself contain Go pointers, which rules out a slice of slices or a struct with pointer fields, unless the memory those inner pointers refer to has been pinned with runtime.Pinner (Go 1.21 and later).
Violations are not always caught by the compiler. The runtime checks cgo arguments dynamically, and a violation panics with a message such as cgo argument has Go pointer to Go pointer (the exact wording varies by Go release). This error occurs when you try to pass a complex structure that contains nested pointers. The solution is to flatten the data. Pass a contiguous byte slice. Let C reconstruct the structure on its side.
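Flattening can be sketched as follows, assuming the engine wants a row-major float32 matrix plus its dimensions. The flatten name and the layout are assumptions for illustration, since the article does not specify what the C side expects.

```go
package main

import "fmt"

// flatten copies a slice of rows into one contiguous, row-major slice.
// A [][]float32 holds Go pointers to each row and must not be passed to C.
// The flat result holds only numbers, so &data[0] is safe to pass.
func flatten(rows [][]float32) (data []float32, nRows, nCols int, err error) {
	if len(rows) == 0 {
		return nil, 0, 0, nil
	}
	nCols = len(rows[0])
	data = make([]float32, 0, len(rows)*nCols)
	for i, r := range rows {
		if len(r) != nCols {
			return nil, 0, 0, fmt.Errorf("row %d has %d columns, want %d", i, len(r), nCols)
		}
		data = append(data, r...)
	}
	return data, len(rows), nCols, nil
}
```

The C side can then index element (i, j) as data[i*nCols + j] without any Go pointers crossing the boundary.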
Garbage collection can affect latency. Go's GC runs concurrently with the program, but each cycle still includes brief stop-the-world phases, and goroutines that allocate heavily during a cycle can be made to assist the collector. If the service allocates large tensors in Go for every request, collections run more often. To minimize the impact, allocate memory in C and reuse it, or reuse Go buffers across requests. Pass pre-allocated buffers back and forth. Avoid per-request allocations in the hot path.
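On the Go side, buffer reuse can be sketched with sync.Pool from the standard library. This is a Go-side variant of the advice above, not the C-allocated approach. The bufferPool type and its fixed buffer size are hypothetical.

```go
package main

import "sync"

// bufferPool hands out reusable float32 buffers of a fixed length so the
// hot path does not allocate a fresh tensor for every request.
type bufferPool struct {
	size int
	pool sync.Pool
}

func newBufferPool(size int) *bufferPool {
	bp := &bufferPool{size: size}
	bp.pool.New = func() any {
		b := make([]float32, size)
		// Store a pointer to the slice so Put does not allocate.
		return &b
	}
	return bp
}

// Get returns a buffer of length size. Contents are not zeroed on reuse.
func (bp *bufferPool) Get() *[]float32 {
	return bp.pool.Get().(*[]float32)
}

// Put returns a buffer to the pool. Undersized buffers are dropped.
func (bp *bufferPool) Put(b *[]float32) {
	if b == nil || cap(*b) < bp.size {
		return
	}
	*b = (*b)[:bp.size]
	bp.pool.Put(b)
}
```

A handler would call Get, fill the buffer, run inference, and Put the buffer back once the cgo call has returned. sync.Pool may discard idle buffers during a collection, so it reduces allocations without guaranteeing zero.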
Goroutine leaks happen when the C library blocks without respecting cancellation. A cgo call cannot be interrupted from Go. If the engine does not check for cancellation, the goroutine and the OS thread it occupies stay blocked until the C function returns. This consumes memory and threads. Always implement a cancellation signal in the C wrapper, such as an atomic flag that the engine polls periodically and that Go sets when the context is cancelled.
The worst goroutine bug is the one that never logs. If a goroutine leaks silently, the server degrades over time. Use a leak detector or monitor goroutine counts in production. Set up alerts if the count grows beyond a threshold.
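A minimal way to watch the count uses runtime.NumGoroutine from the standard library. The function names, the interval, and the threshold are placeholders, and a production service would more likely export the value to its metrics system than write a log line.

```go
package main

import (
	"context"
	"log"
	"runtime"
	"time"
)

// checkGoroutines reports the current goroutine count and whether it
// exceeds threshold.
func checkGoroutines(threshold int) (count int, over bool) {
	count = runtime.NumGoroutine()
	return count, count > threshold
}

// watchGoroutines samples the count every interval until ctx is cancelled
// and logs a warning whenever it is above threshold.
func watchGoroutines(ctx context.Context, interval time.Duration, threshold int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if n, over := checkGoroutines(threshold); over {
				log.Printf("goroutine count %d exceeds threshold %d", n, threshold)
			}
		}
	}
}
```

Starting it with go watchGoroutines(ctx, 10*time.Second, 5000) at boot is enough to notice slow growth. Pick a threshold from the count you observe under normal load.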
Batching: where Go earns its keep
Go's concurrency model shines when batching requests. Batching combines multiple requests into a single inference call. This amortizes per-call overhead such as the cgo crossing and kernel launches, and it gives the hardware larger matrices to work on. Throughput usually increases significantly, at the cost of a small, bounded wait while the batch fills.
Implementing batching in Python typically means asyncio queues or worker threads with explicit locking. Go makes it comparatively straightforward with channels and goroutines.
Create a goroutine that reads requests from a channel. Collect requests until the batch size is reached or a timeout expires. Send the batch to the engine. Distribute the results back to the original request handlers.
// batcher.go
package main

import (
	"context"
	"fmt"
	"time"
)

// BatchItem holds a request and a channel for the response.
// Result should have capacity 1 so that a handler that has already given up
// cannot block the goroutine delivering results.
type BatchItem struct {
	Input  []byte
	Result chan<- BatchResult
}

// BatchResult contains the prediction score and any error.
type BatchResult struct {
	Score float64
	Err   error
}

// BatchProcessor collects requests from requestChan and sends them to the
// engine in batches. A batch is flushed when it reaches maxBatch items or
// when flushInterval has elapsed since its first item arrived.
func BatchProcessor(ctx context.Context, engine *InferenceEngine, requestChan <-chan BatchItem, maxBatch int, flushInterval time.Duration) {
	items := make([]BatchItem, 0, maxBatch)
	var timer *time.Timer
	// flushC is nil while the batch is empty. A nil channel never fires in select.
	var flushC <-chan time.Time

	dispatch := func() {
		go processBatch(ctx, engine, items)
		// The goroutine now owns the old slice, so start a fresh one
		// instead of reusing its backing array.
		items = make([]BatchItem, 0, maxBatch)
		timer.Stop()
		flushC = nil
	}

	for {
		select {
		case <-ctx.Done():
			// Answer remaining items on shutdown. ctx is already cancelled,
			// so the engine is expected to return promptly with an error.
			if len(items) > 0 {
				timer.Stop()
				processBatch(ctx, engine, items)
			}
			return
		case item := <-requestChan:
			items = append(items, item)
			if len(items) == 1 {
				// First item of a new batch: start the flush timer.
				timer = time.NewTimer(flushInterval)
				flushC = timer.C
			}
			if len(items) >= maxBatch {
				// Batch is full. Process immediately.
				dispatch()
			}
		case <-flushC:
			// Flush interval expired. Process whatever we have.
			dispatch()
		}
	}
}

// processBatch sends a batch of items to the engine and distributes results.
func processBatch(ctx context.Context, engine *InferenceEngine, items []BatchItem) {
	// Aggregate inputs from all items.
	batchInput := aggregateInputs(items)
	// Run inference on the batch.
	scores, err := engine.PredictBatch(ctx, batchInput)
	if err == nil && len(scores) != len(items) {
		// Guard against a mismatched result so scores[i] cannot panic.
		err = fmt.Errorf("engine returned %d scores for %d items", len(scores), len(items))
	}
	// Distribute results back to the request handlers.
	for i, item := range items {
		result := BatchResult{Err: err}
		if err == nil {
			result.Score = scores[i]
		}
		item.Result <- result
	}
}
The batch processor runs as a single goroutine. It uses a select statement to handle multiple events. It collects items, checks the batch size, and manages the timer. When the batch is ready, it spawns a goroutine to process the batch. This keeps the collector unblocked.
The processBatch function aggregates inputs, calls the engine, and sends results back through channels. The request handlers wait on their result channels. This decouples the HTTP layer from the inference layer. The server can handle more requests than the engine can process concurrently.
Decision matrix
Use Go when you need high-throughput HTTP serving for a pre-trained model. Go handles thousands of concurrent connections with minimal memory overhead. The standard library provides everything you need for production-grade networking.
Use Go when you are building a microservice that orchestrates multiple models. Go makes it easy to route requests, aggregate results, and manage dependencies. The type system catches integration errors at compile time.
Use Go when you need to integrate inference into an existing Go backend. Sharing state, configuration, and logging between services is simpler when they use the same language. You avoid the complexity of cross-language communication.
Use Python when you are prototyping a new model or training. Python has the richest ecosystem for ML research. Libraries like PyTorch and TensorFlow provide high-level APIs that accelerate development.
Use C++ when you need bare-metal latency and zero abstraction overhead. C++ gives you full control over memory and execution. It is the best choice for embedded devices or real-time systems where every microsecond counts.
Use a dedicated serving framework like NVIDIA Triton Inference Server when you need dynamic batching, model versioning, and GPU management. These frameworks handle the complexity of production serving. They support multiple model formats and hardware backends.
Go is the server, not the engine. Cross the cgo bridge once per request, not once per pixel. Context is the lifeline. Pass it through every call site. Trust the standard library for networking. Let C++ handle the math.