Performance Implications of Cgo

You found a blazing fast C library for image compression. You wrap it in Go, write a benchmark, and the numbers look terrible. The C code is supposed to be instant, but your Go program is crawling. You check the CPU usage and see spikes where nothing should be happening. The bottleneck isn't the algorithm. It's the bridge between Go and C.

Go and C live in different worlds. Go has a garbage collector, safe pointers, and lightweight goroutines. C has manual memory management, raw pointers, and OS threads. When you use cgo, you build a bridge between these worlds. The bridge works, but it has a toll. Every time you cross, you pay in stack switches, argument marshaling, and runtime bookkeeping. If you cross the bridge once per small operation, the toll eats your performance.

How the Bridge Works

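cgo is not just a function call. It is a transition between two runtimes. When Go code calls a C function, the Go runtime must mark the goroutine as leaving Go code, switch stacks, lay out the arguments the way C expects, and jump to C code. When C returns, the process reverses. This round trip typically costs tens of nanoseconds or more, depending on Go version and hardware, while a small Go function call costs about a nanosecond or is inlined away entirely.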

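The overhead comes from three sources. First, the stack switch. Goroutines run on small, growable stacks managed by the Go runtime, and C code cannot run on them. The runtime saves the goroutine state and switches to a system stack before transferring control. Second, the calling convention. Go and C pass arguments and results differently, so every call goes through generated wrapper code that lays out the arguments the way C expects. Types such as C.int already have the same size and layout on both sides, so a conversion like C.int(n) is cheap; copies such as C.CString are not. Third, scheduler and garbage collector coordination. The runtime treats a cgo call much like a blocking system call: it records that the goroutine has left Go code so the scheduler can hand its processor to other goroutines and the GC can proceed without waiting for it. The GC does not scan C stacks, which is why C code may not hold on to Go pointers after a call returns.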

Here's what a minimal cgo call looks like. The structure is simple, but the machinery underneath is heavy.

package main

/*
#include <stdio.h>

// C function that does a tiny amount of work
void do_work(int n) {
    volatile int x = 0;
    for (int i = 0; i < n; i++) {
        x += i;
    }
}
*/
import "C"

func main() {
    // Crosses into C: the runtime switches stacks and passes n as a C int
    C.do_work(C.int(1000))
}

The comment block immediately before import "C", with no blank line in between, is the preamble and contains C code. The cgo tool extracts this code, hands it to the C compiler, and generates a bridge. The import "C" line is a pseudo-import that tells the Go toolchain to invoke cgo. The C. prefix accesses symbols from the C side. The call C.do_work is not a direct jump. It goes through the bridge.

The bridge does work. It converts the Go int to a C int. It switches stacks. It calls the C function. It waits for the result. It switches back. For a function that does trivial work, the bridge cost dominates the total time. For a function that does heavy computation, the bridge cost becomes a small fraction. The key is to make the work in C large enough to amortize the crossing.

The bridge has a toll. Pay it once with a heavy load, not a thousand times with a feather.

Batching: Amortizing the Cost

The most effective way to reduce cgo overhead is batching. Instead of calling C for every item, group items and call C once. This reduces the number of context switches and marshaling operations. The C function loops over the batch internally, doing the work without returning to Go.

Consider processing a slice of integers. A naive approach calls C for each element. A batched approach passes the slice pointer and length to C, letting C loop over the data.

Here's a realistic example. The Go code prepares a batch and sends it to C in a single call.

package main

/*
#include <stdlib.h>

// C function processes a batch of integers
void process_batch(int* data, int count) {
    for (int i = 0; i < count; i++) {
        // Do work on data[i]
        data[i] *= 2;
    }
}
*/
import "C"

import "unsafe"

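// ProcessBatch doubles every element by sending the whole slice to C in one call.
// The element type is int32 because C's int is 32 bits on the platforms Go
// supports, while Go's int is 64 bits on 64-bit systems.
func ProcessBatch(items []int32) {
    if len(items) == 0 {
        return
    }
    // Pass a pointer to the first element plus the length.
    // C receives a raw pointer and a count and loops internally.
    C.process_batch((*C.int)(unsafe.Pointer(&items[0])), C.int(len(items)))
}

C does not understand Go slices. It needs a raw pointer and a count. The expression &items[0] is the address of the first element, not of the slice header, and unsafe.Pointer lets that address be converted to *C.int. The len(items) gives the count. C loops over the array in place. The conversion is only valid because int32 and C.int have the same size and layout. A []int would not work on 64-bit platforms, where each Go int is 8 bytes and C would misread the data. The backing array holds no pointers, so passing it for the duration of the call satisfies cgo's pointer rules.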

This pattern reduces the number of cgo calls from N to 1. The overhead is constant regardless of batch size. For large batches, the overhead becomes negligible. For small batches, you might still see impact. Tune the batch size based on profiling.
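When the data is too large to hand over at once, or arrives as a stream, a middle ground is to cross once per chunk. The sketch below shows only the chunking logic. forEachChunk and double are hypothetical names, and double is a pure Go stand-in for a cgo wrapper such as ProcessBatch so the example runs without a C compiler.

```go
package main

import "fmt"

// forEachChunk hands items to process in consecutive chunks of at most
// size elements and returns how many times process was called.
// process stands in for a cgo wrapper; each call would be one crossing.
func forEachChunk[T any](items []T, size int, process func([]T)) int {
	if size <= 0 {
		size = len(items) // treat a non-positive size as "one big batch"
	}
	calls := 0
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		process(items[start:end]) // the chunk shares memory with items
		calls++
	}
	return calls
}

// double is a pure Go stand-in for the C side of the bridge.
func double(chunk []int32) {
	for i := range chunk {
		chunk[i] *= 2
	}
}

func main() {
	data := []int32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	calls := forEachChunk(data, 4, double)
	fmt.Println(calls, data) // 3 [2 4 6 8 10 12 14 16 18 20]
}
```

With a chunk size of 4, ten items cost three crossings instead of ten. The chunk size is the knob to turn while profiling.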

One call with a thousand items beats a thousand calls with one item.

Memory and the Garbage Collector

Go and C manage memory differently. Go uses a garbage collector that automatically reclaims unused memory. C uses manual allocation with malloc and free. When you cross the border, you must respect both systems.

If C allocates memory and returns a pointer to Go, the Go GC does not know about the allocation and will never free it. You must call C.free, or the library's own release function, when done. If you forget, you leak memory.

If Go allocates memory and passes a pointer to C, the GC still owns the memory. C can read and write it during the call, but C must not free it, and C must not keep the pointer after the call returns unless the memory has been pinned with runtime.Pinner (Go 1.21 and later). Calling free on Go memory corrupts the C allocator's heap, and a stale pointer may refer to memory the GC has already reclaimed.

Here's how to handle C-allocated memory safely. You allocate in C, use the pointer, and free in C.

package main

/*
#include <stdlib.h>
#include <string.h>

// C function allocates and returns a zeroed buffer, or NULL on failure
char* create_buffer(int size) {
    char* buf = (char*)malloc(size);
    if (buf == NULL) {
        return NULL;
    }
    memset(buf, 0, size);
    return buf;
}
*/
import "C"

import "unsafe"

// AllocateBuffer allocates memory in C and returns a Go byte slice backed by it.
// The caller must release the slice with FreeBuffer.
func AllocateBuffer(size int) []byte {
    // Reject sizes that are empty or would not fit in a C int
    if size <= 0 || size > 1<<31-1 {
        return nil
    }
    // create_buffer calls malloc; the Go GC ignores this allocation
    ptr := C.create_buffer(C.int(size))
    if ptr == nil {
        return nil
    }
    // Convert the C pointer to a Go slice.
    // The slice header points to C memory; the GC does not manage it.
    return unsafe.Slice((*byte)(unsafe.Pointer(ptr)), size)
}

// FreeBuffer releases C-allocated memory
func FreeBuffer(buf []byte) {
    if len(buf) == 0 {
        return
    }
    // C.free releases the memory; forgetting this causes a leak
    C.free(unsafe.Pointer(unsafe.SliceData(buf)))
}

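The AllocateBuffer function calls C to allocate memory. It converts the pointer to a Go slice using unsafe.Slice. The slice points to C memory. The GC does not track this memory. The caller must call FreeBuffer to release it, and must not touch the slice afterwards. Pass FreeBuffer the slice exactly as AllocateBuffer returned it: a reslice such as buf[1:] no longer starts at the address malloc returned, and freeing it is undefined behavior.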

Using runtime.SetFinalizer to clean up C memory is risky. Finalizers run in a separate goroutine, and their timing is non-deterministic. The finalizer might run long after you expect, or not at all if the program exits first. Use finalizers only for best-effort cleanup. Never rely on them for correctness.

The garbage collector fears the unknown. Keep Go pointers out of C memory.

Pitfalls and Runtime Errors

cgo introduces pitfalls that do not exist in pure Go. The compiler and runtime enforce rules to keep memory safe, but the errors can be confusing.

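One common error is passing Go pointers to C in ways that violate the rules. Go code may pass C a pointer to Go memory only if that memory does not itself contain Go pointers. The runtime checks cgo arguments at call time, and a violation it detects causes a panic.

The compiler accepts the code; the failure comes at run time, as panic: runtime error: cgo argument has Go pointer to Go pointer (Go 1.21 and later say "unpinned Go pointer"). A typical trigger is passing C a pointer to a Go struct that has a pointer, slice, or string field referring to Go memory. C code is invisible to the garbage collector, so the GC cannot see a Go pointer that only C is holding and may reclaim the object it points to. Go's collector does not currently move heap objects, but the rules are written so that a moving collector stays possible.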

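Another pitfall is goroutines blocked in C. The Go scheduler cannot preempt C code, so a goroutine inside a long cgo call holds its OS thread for the whole call. If many goroutines block in C at once, the runtime starts more threads to keep the remaining goroutines running, and thread count and memory use climb. If your program has many goroutines calling cgo, you might see latency spikes under load. Profile your program and watch the number of OS threads.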
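One way to keep the number of goroutines inside C bounded is a counting semaphore around the cgo call. The sketch below uses a buffered channel for that. Limiter, NewLimiter, Do, and peakConcurrency are hypothetical names, and the place where a real program would call into C is marked with a comment so the example runs without cgo.

```go
package main

import (
	"fmt"
	"sync"
)

// Limiter is a hypothetical counting semaphore that bounds how many
// goroutines may be inside a cgo call at the same time.
type Limiter struct {
	slots chan struct{}
}

// NewLimiter returns a Limiter with n slots (at least one).
func NewLimiter(n int) *Limiter {
	if n < 1 {
		n = 1
	}
	return &Limiter{slots: make(chan struct{}, n)}
}

// Do blocks until a slot is free, runs call, then releases the slot.
func (l *Limiter) Do(call func()) {
	l.slots <- struct{}{}
	defer func() { <-l.slots }()
	call()
}

// peakConcurrency starts workers goroutines that each make one limited
// call and reports the highest number that were inside Do at once.
func peakConcurrency(limit, workers int) int {
	lim := NewLimiter(limit)
	var (
		mu           sync.Mutex
		wg           sync.WaitGroup
		inside, peak int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lim.Do(func() {
				mu.Lock()
				inside++
				if inside > peak {
					peak = inside
				}
				mu.Unlock()

				// A real program would call into C here.

				mu.Lock()
				inside--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(peakConcurrency(4, 100) <= 4) // true
}
```

Goroutines waiting for a slot are parked in Go, where they cost a few kilobytes each, instead of each holding an OS thread inside C.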

C functions cannot be inlined. The Go compiler inlines small functions to reduce call overhead. C functions are compiled separately and are black boxes to the Go compiler. The call always happens. This means small C functions have higher overhead than equivalent Go functions.

The receiver name convention applies to cgo wrappers too. If you wrap a C function in a Go method, use a short receiver name like (c *CWrapper). Do not use (this *CWrapper) or (self *CWrapper). Go convention favors brevity.

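The worst cgo bug is the one that never logs. Memory leaks in C are silent until the system runs out of memory. Go's heap profile in pprof only covers memory allocated by the Go runtime, so a C leak does not appear there. It shows up as the process's resident memory growing while the Go heap stays flat. Compare the two, and use a C-side tool such as Valgrind or AddressSanitizer to find the allocation that is never freed.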

The bridge is safe, but only if you follow the rules. Break them and the runtime may stop you, or the program may fail silently much later.

When to Use Cgo

cgo is a powerful tool, but it comes with costs. Use it when you need functionality that does not exist in Go, or when a C library provides performance that Go cannot match. Avoid it when pure Go alternatives exist.

Use a pure Go library when the algorithm can be implemented in Go; pure Go code avoids the crossing entirely, can be inlined, and plays nicely with the garbage collector.

Use cgo with batching when you depend on a C library and can group operations so the work done in C dwarfs the cost of the border crossing.

Use a separate process or RPC when the C code is unstable or requires a different memory model; isolation prevents crashes from leaking into your Go runtime.

Use runtime.SetFinalizer sparingly for C memory cleanup; finalizers run asynchronously and provide no guarantee on when resources are released.

Reach for cgo only when you have measured the need. Profile first. If pure Go meets your requirements, stay in Go. If you must cross, batch your calls and manage memory carefully.

cgo is a bridge, not a destination. Cross it, do the work, and get back.

Where to go next

Every cgo call pays a fixed cost for crossing between the Go runtime and C, like stopping at a border checkpoint instead of driving straight through. cgo also makes memory management harder, because Go's garbage collector knows nothing about memory that C allocated. Use cgo when you need a C library that has no adequate Go equivalent, and make fewer, bigger calls to C instead of many small ones.