How to Write Assembly Functions in Go

Write assembly functions in Go by creating an architecture-specific `.s` file in your package (for example `adder_amd64.s`), defining the symbol with `TEXT`, and declaring a matching function without a body in a Go file. Directives such as `//go:noescape` and `//go:linkname` are optional refinements, not requirements.

You are staring at a CPU profile. The flame graph is a single tall bar. You have tuned the algorithm, removed allocations, and the Go compiler still generates a few extra instructions that waste cycles in a tight loop. The bottleneck is a specific bit manipulation or a memory access pattern that the compiler refuses to optimize. You need to take the wheel. You need to write assembly.

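Go has no inline assembly. Assembly lives in separate .s files that sit next to your Go code. The syntax is neither AT&T nor Intel. It is Go's own assembler dialect, derived from the Plan 9 assemblers: a shared notation and a set of pseudo-registers layered over each architecture's real instructions. The code itself is not portable. You write one file per architecture, such as hash_amd64.s and hash_arm64.s, and the toolchain picks the one that matches the build target.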

Plan 9 syntax acts like a shared dialect, not a universal remote. Operands read source first, destination last, and pseudo-registers such as FP and SB behave the same way on every target, but the mnemonics and hardware registers stay architecture-specific. On amd64, MOVQ moves 64 bits and you work with AX, BX, and friends. On arm64, the same move is MOVD and the registers are R0 through R30. Nothing translates one into the other. You still think in terms of the target CPU.

Convention aside: gofmt does not touch .s files. You format assembly by hand. The standard library indents instructions with one tab and keeps TEXT directives and labels at column zero. Most editors can be configured for this. Do not fight the convention; match it so your assembly reads like the rest of the ecosystem.

Plan 9 syntax is the bridge. The syntax is shared; the instructions are not.

The Minimal Bridge

The simplest way to call assembly is to declare a function in Go and implement it in a .s file. The Go file provides the type signature and directives. The assembly file provides the instructions.

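Here is the Go declaration. It has no body, so the toolchain looks for the implementation in assembly. The //go:nosplit directive mirrors the NOSPLIT flag on the assembly side and is optional here: the flag on the TEXT directive is what actually omits the stack-split check.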

// adder.go
package main

//go:nosplit
func Add(x, y int) int

Here is the assembly implementation for amd64. The TEXT directive defines the function. The · character (middle dot, U+00B7) in front of the name stands for the current package. SB is the static base pseudo-register. The NOSPLIT name comes from textflag.h, so the file must include it.

// adder_amd64.s
#include "textflag.h"

// func Add(x, y int) int
TEXT ·Add(SB), NOSPLIT, $0-24
	MOVQ x+0(FP), AX     // Load x from its argument slot
	MOVQ y+8(FP), BX     // Load y from its argument slot
	ADDQ BX, AX          // Compute the sum in AX
	MOVQ AX, ret+16(FP)  // Write the result to the return slot
	RET                  // Return control to the caller

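The TEXT directive starts the function definition. ·Add(SB) names the symbol. The middle dot · separates the package path from the function name, and an empty package path means the current package. SB is the static base pseudo-register: every global symbol is written as an offset from it. The flags follow. NOSPLIT tells the assembler to omit the stack-split check from the function prologue, so this function can never trigger stack growth. That is safe for a small leaf function like this one. The linker adds up the frames along every NOSPLIT call chain and rejects the build if they could exceed the small reserve at the bottom of the stack.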

The size $0-24 describes the stack frame. The 0 means the function allocates zero bytes of local stack space. The 24 means the caller reserves 24 bytes for arguments and return values. On a 64-bit system, three int values take 24 bytes.

Arguments live at non-negative offsets from the pseudo-register FP, which points at the first argument. x is at x+0(FP). y is at y+8(FP). The return value lives at ret+16(FP). The name in front of each offset is mandatory, and go vet checks both the name and the offset against the Go declaration. Hand-written assembly always sees this stack-based convention (ABI0), even though compiled Go code passes arguments in registers on amd64 and arm64. The toolchain generates a small wrapper that converts between the two, so your offsets stay valid regardless of how the compiler calls its own functions.
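
You can sanity-check a frame spec before writing any assembly. The sketch below approximates the stack-based layout with reflect. The helper names stackLayout and slot are made up for this example, and the rule that results start at the next pointer-aligned offset is an assumption that holds for simple signatures like the ones here. Treat go vet as the authority.

```go
package main

import (
	"fmt"
	"reflect"
)

// slot records where one argument or result sits relative to FP.
// (Hypothetical helper type for this sketch.)
type slot struct {
	Kind   string // "arg" or "ret"
	Index  int
	Offset uintptr
	Size   uintptr
}

// alignUp rounds off up to the next multiple of a (a power of two).
func alignUp(off, a uintptr) uintptr {
	return (off + a - 1) &^ (a - 1)
}

// stackLayout approximates the stack-based layout of fn's arguments and
// results. Assumption: results begin at the next pointer-aligned offset
// after the arguments.
func stackLayout(fn any) (slots []slot, total uintptr) {
	t := reflect.TypeOf(fn)
	ptrSize := reflect.TypeOf(uintptr(0)).Size()
	var off uintptr
	for i := 0; i < t.NumIn(); i++ {
		in := t.In(i)
		off = alignUp(off, uintptr(in.Align()))
		slots = append(slots, slot{"arg", i, off, in.Size()})
		off += in.Size()
	}
	if t.NumOut() > 0 {
		off = alignUp(off, ptrSize)
	}
	for i := 0; i < t.NumOut(); i++ {
		out := t.Out(i)
		off = alignUp(off, uintptr(out.Align()))
		slots = append(slots, slot{"ret", i, off, out.Size()})
		off += out.Size()
	}
	return slots, off
}

func main() {
	// Same signature as Add.
	slots, total := stackLayout(func(x, y int) int { return 0 })
	for _, s := range slots {
		fmt.Printf("%s%d at +%d(FP), %d bytes\n", s.Kind, s.Index, s.Offset, s.Size)
	}
	fmt.Printf("frame spec: $0-%d\n", total)
}
```

On a 64-bit target this reports the two arguments at offsets 0 and 8, the result at offset 16, and a frame spec of $0-24.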

The dot names the package. The flag and the frame size tell the toolchain how to treat the stack.

Realistic Example: Slices and Linking

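Real assembly often processes slices, and some code bases use //go:linkname to control which symbol a Go declaration binds to. Here is a function that hashes a byte slice. The Go declaration uses //go:linkname to bind the exported Hash function to the unexported assembly symbol hash_asm. The directive needs the symbol's full linker name, including the import path, and the file must import unsafe.

// crypto.go
package crypto

import _ "unsafe" // required by go:linkname

// example.com/crypto is a placeholder for this package's real import path.
//
//go:nosplit
//go:linkname Hash example.com/crypto.hash_asm
func Hash(data []byte) uint64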

Here is the assembly implementation. The function iterates over the slice bytes and accumulates a sum.

// crypto_amd64.s
#include "textflag.h"

// hash_asm computes a simple additive hash over a byte slice.
TEXT ·hash_asm(SB), NOSPLIT, $0-32
	MOVQ data_base+0(FP), AX  // Load slice data pointer
	MOVQ data_len+8(FP), BX   // Load slice length
	XORQ CX, CX               // Zero the hash accumulator
	CMPQ BX, $0               // Check if slice is empty
	JE   done                 // Skip loop if empty
loop:
	MOVBLZX (AX), DX          // Load byte, zero-extended (upper bits of DX cleared)
	ADDQ DX, CX               // Add byte to hash
	INCQ AX                   // Advance pointer
	DECQ BX                   // Decrement remaining length
	JNE  loop                 // Repeat until length is zero
done:
	MOVQ CX, ret+24(FP)       // Store hash result
	RET                       // Return
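
For comparison, here is roughly what that loop does, written in plain Go. The name hashGeneric is made up for this sketch. A version like this is handy as a reference in tests and as a fallback on architectures where you have no assembly.

```go
package main

import "fmt"

// hashGeneric is a pure Go counterpart of the assembly loop:
// it adds every byte into a 64-bit accumulator.
// (Hypothetical name, used only in this sketch.)
func hashGeneric(data []byte) uint64 {
	var sum uint64
	for _, b := range data {
		sum += uint64(b) // widen each byte without sign extension
	}
	return sum
}

func main() {
	fmt.Println(hashGeneric(nil))           // empty input: the loop never runs
	fmt.Println(hashGeneric([]byte("abc"))) // 97 + 98 + 99
}
```

A table-driven test that feeds the same inputs to both versions catches most offset and loop-bound mistakes early.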

The slice header is 24 bytes on a 64-bit system. The pointer is at offset 0, the length at offset 8, and the capacity at offset 16, although this function does not use the capacity. The expected names for those slots are data_base, data_len, and data_cap. The return value follows at offset 24, so the arguments and result total 32 bytes. That is the 32 in $0-32. There is no padding involved.
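
One way to double-check the header layout is to mirror it in Go and print the field offsets. The sliceHeader struct and headerOffsets function below are illustrative stand-ins, not types from the runtime, and the numbers assume a 64-bit target.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sliceHeader mirrors the three words of a slice for illustration only.
// (Hypothetical type; do not use it to build real slices.)
type sliceHeader struct {
	Data unsafe.Pointer
	Len  int
	Cap  int
}

// headerOffsets returns the byte offsets of the pointer, length, and
// capacity fields, plus the total header size.
func headerOffsets() (data, length, capacity, size uintptr) {
	var h sliceHeader
	return unsafe.Offsetof(h.Data), unsafe.Offsetof(h.Len), unsafe.Offsetof(h.Cap), unsafe.Sizeof(h)
}

func main() {
	d, l, c, size := headerOffsets()
	fmt.Printf("data_base+%d(FP) data_len+%d(FP) data_cap+%d(FP)\n", d, l, c)
	fmt.Printf("mirror size %d, []byte size %d\n", size, unsafe.Sizeof([]byte(nil)))
}
```

If the mirror size and the real slice size ever disagree, the mirror is wrong, not the slice.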

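MOVBLZX loads a byte and zero-extends it to 32 bits. On amd64, writing a 32-bit register also clears the upper 32 bits, so DX ends up holding the byte as a clean 64-bit value. MOVBQZX spells out the 64-bit form if you prefer to be explicit. The zero-extension is essential. A plain MOVB writes only the low 8 bits and leaves whatever was already in the rest of the register. The hash would be wrong.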
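
The difference between widening modes is easy to see from Go. This small sketch uses two made-up helpers, zeroExtend and signExtend, to show what happens to a byte with its top bit set.

```go
package main

import "fmt"

// zeroExtend widens a byte the way a zero-extending load does:
// the upper bits become zero.
func zeroExtend(b byte) uint64 {
	return uint64(b)
}

// signExtend widens the same bits as a signed value, the way a
// sign-extending load would: the top bit is copied upward.
func signExtend(b byte) uint64 {
	return uint64(int64(int8(b)))
}

func main() {
	b := byte(0x80)
	fmt.Printf("%#x\n", zeroExtend(b)) // 0x80
	fmt.Printf("%#x\n", signExtend(b)) // 0xffffffffffffff80
}
```

An additive hash built on the second form would be off by a large amount for every byte at or above 0x80.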

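//go:linkname binds the Go name Hash to the assembly symbol hash_asm. The second argument is the symbol's full linker name, including the package import path, and the file must import unsafe. This allows you to expose a function without a Go body under a different name, or to rename a function for compatibility.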

Convention aside: //go:linkname is dangerous. It binds a declaration to a symbol purely by name, with no type checking across the link. The compiler does not warn about it; it only insists that the file import unsafe. Use it only when necessary, for example to reach a symbol defined in a different package. The community treats it like unsafe: it works, but you own the consequences.

Linkname is a bypass. Use it with intent.

Pitfalls and Errors

Assembly gives you direct control, which means you can break things easily. The assembler does not check your offsets against the Go declaration. You calculate them by hand, and go vet is the safety net.

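If you forget NOSPLIT in the assembly but use //go:nosplit in Go, nothing complains. The directive on a bodyless declaration never reaches the assembler, which decides from the TEXT flags alone whether to emit the stack-split check. The opposite mistake is louder: if you write NOSPLIT without #include "textflag.h", the assembler does not know the name and the build fails.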

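If you use stack space in assembly but declare $0 in the size, you corrupt the stack. Writes land on memory the function does not own, such as the return address or the caller's data, and nothing in the toolchain reliably detects that at run time. The usual symptom is a crash far from the cause. The declared frame size must cover the actual usage.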

If a NOSPLIT function calls other NOSPLIT functions and their combined frames exceed the small reserve kept at the bottom of every stack, the linker fails the build with a nosplit stack overflow error. Calling ordinary Go functions from assembly is possible but adds obligations: a nonzero frame, arguments laid out by hand, and stack information for the garbage collector. The convention is to keep assembly functions as small leaf functions that call nothing.

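If you access the wrong offset, you read garbage or crash. The assembler does not catch this. Run go vet: its asmdecl check compares the argument names, offsets, and frame size in each TEXT block against the Go declaration. Do not copy offsets from go tool compile -S output, because compiled Go code uses the register-based internal ABI, not the stack-based one your assembly sees.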

The stack is a contract. Break the offsets and you break the program.

When to Write Assembly

Assembly is powerful but expensive to maintain. Use it only when you have a clear reason.

Use Go code when the compiler generates efficient instructions and the logic is complex. The Go compiler handles the vast majority of code well. It inlines small functions, eliminates many bounds checks, and lowers math/bits calls to single instructions where the hardware has them. It does not auto-vectorize loops. Write Go first.
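
Before reaching for assembly over a bit trick, check whether math/bits already covers it. The sketch below compares a hand-rolled population count with bits.OnesCount64. The function names popcountLoop and popcountBits are made up for this example, and whether the call becomes a single instruction depends on the target CPU, so inspect the generated code if it matters.

```go
package main

import (
	"fmt"
	"math/bits"
)

// popcountLoop counts set bits by clearing the lowest one each pass.
func popcountLoop(x uint64) int {
	n := 0
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		n++
	}
	return n
}

// popcountBits delegates to the standard library, which the compiler
// can replace with a hardware instruction on targets that have one.
func popcountBits(x uint64) int {
	return bits.OnesCount64(x)
}

func main() {
	const v = 0xF0F0
	fmt.Println(popcountLoop(v), popcountBits(v)) // 8 8
}
```

If the library call is fast enough, you keep portable Go and skip the .s file entirely.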

Use assembly when you need a specific instruction sequence that the compiler cannot emit, such as SIMD instructions, specialized cryptographic instructions, or an atomic operation that sync/atomic does not expose.

Use assembly when you are optimizing a proven bottleneck and profiling confirms that instruction-level changes reduce execution time. Profile before you write assembly. Guessing is useless.
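
A baseline number makes that decision concrete. This sketch measures a pure Go byte sum with testing.Benchmark, which runs outside go test. The names sumBytes and benchSumBytes and the 4096-byte input are arbitrary choices for illustration. In a real project you would normally put the benchmark in a _test.go file and compare runs with and without the assembly.

```go
package main

import (
	"fmt"
	"testing"
)

// sumBytes is the pure Go candidate you would try to beat.
func sumBytes(data []byte) uint64 {
	var sum uint64
	for _, b := range data {
		sum += uint64(b)
	}
	return sum
}

// sink keeps the result alive so the loop is not optimized away.
var sink uint64

// benchSumBytes measures sumBytes on a buffer of the given size.
func benchSumBytes(size int) testing.BenchmarkResult {
	testing.Init() // register testing flags when running outside "go test"
	data := make([]byte, size)
	for i := range data {
		data[i] = byte(i)
	}
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size))
		for i := 0; i < b.N; i++ {
			sink += sumBytes(data)
		}
	})
}

func main() {
	r := benchSumBytes(4096)
	fmt.Println(r.String())
}
```

Only if the assembly version beats this number by a margin that matters to your program is the maintenance cost justified.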

Use //go:linkname when you must expose an assembly symbol to a different package or rename a function for compatibility with external tools.

Use C via cgo when you need to call existing C libraries. Every cgo call pays a transition cost well above that of a normal function call, cgo complicates cross-compilation, and the runtime restricts which Go pointers you may pass to C.

Profile first. Assembly is a scalpel, not a hammer.

Where to go next