How to Use SIMD Instructions in Go Assembly

When the loop won't speed up

You are processing a 4K image frame by frame. Your Go loop calculates brightness for every pixel. The profiler screams that 90% of the time is spent in a tight loop doing simple arithmetic. You have optimized the algorithm. You have checked the cache. The bottleneck is still the CPU doing one pixel at a time. The hardware can do four pixels at once. You need SIMD.

SIMD stands for Single Instruction, Multiple Data. The CPU has wide registers that can hold multiple numbers. Instead of adding a + b, you add [a1, a2, a3, a4] + [b1, b2, b3, b4] in one instruction. In the best case, throughput approaches four times the scalar rate without quadrupling the instruction count. Memory bandwidth often caps the real gain.
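To make the lane idea concrete, here is a small sketch in plain Go. It only models the semantics of a four-lane add; Go still executes it as four scalar additions. The names addScalar and addLanes are made up for this illustration.

```go
package main

import "fmt"

// addScalar adds one pair of numbers: the work a scalar instruction does.
func addScalar(a, b float32) float32 {
	return a + b
}

// addLanes models what one 128-bit SIMD add does: four independent
// additions, one per lane. In Go this is still four scalar adds; it only
// illustrates the semantics of the vector instruction.
func addLanes(a, b [4]float32) [4]float32 {
	var out [4]float32
	for i := range out {
		out[i] = a[i] + b[i]
	}
	return out
}

func main() {
	a := [4]float32{1, 2, 3, 4}
	b := [4]float32{10, 20, 30, 40}
	fmt.Println(addScalar(a[0], b[0])) // 11
	fmt.Println(addLanes(a, b))        // [11 22 33 44]
}
```

A vector instruction performs the whole body of addLanes in one step, which is where the speedup comes from.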

Go's compiler is conservative. The standard gc toolchain does not auto-vectorize loops the way GCC or LLVM-based compilers often do. It generally emits scalar instructions, even for a simple element-wise loop. When the algorithm is already tight and the profiler keeps pointing at the same line, you write assembly.

SIMD in plain words

Imagine a factory line. Standard code processes one item per cycle. A worker picks up a box, paints it, and puts it down. SIMD processes a tray of items per cycle. The worker picks up a tray of four boxes, paints all four in one motion, and puts the tray down. The instruction is the same. The payload is larger.

In x86-64, the vector registers are XMM0 through XMM15. Go assembly calls them X0 through X15. Each holds 128 bits. You can treat that as four 32-bit floats, two 64-bit doubles, or sixteen 8-bit integers. The instruction tells the CPU how to interpret the bits. VADDPS adds packed single-precision floats. VADDPD adds packed doubles. The PS and PD suffixes define the data layout. The V prefix marks the AVX encoding, so these mnemonics need a CPU that supports AVX.
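The point that the same 128 bits can be read in several ways can be sketched in plain Go. The helpers below (packFloat32s, asFloat32s, asFloat64s) are hypothetical names for this illustration, and the sketch assumes little-endian byte order, as on x86-64.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// packFloat32s writes four float32 values into a 16-byte block,
// little-endian, the way they would sit in memory on x86-64.
func packFloat32s(f [4]float32) [16]byte {
	var v [16]byte
	for i, x := range f {
		binary.LittleEndian.PutUint32(v[i*4:], math.Float32bits(x))
	}
	return v
}

// asFloat32s reads the block as four float32 lanes,
// the view a packed-single (PS) instruction takes.
func asFloat32s(v [16]byte) [4]float32 {
	var out [4]float32
	for i := range out {
		out[i] = math.Float32frombits(binary.LittleEndian.Uint32(v[i*4:]))
	}
	return out
}

// asFloat64s reads the same block as two float64 lanes,
// the view a packed-double (PD) instruction takes.
func asFloat64s(v [16]byte) [2]float64 {
	var out [2]float64
	for i := range out {
		out[i] = math.Float64frombits(binary.LittleEndian.Uint64(v[i*8:]))
	}
	return out
}

func main() {
	v := packFloat32s([4]float32{1, 2, 3, 4})
	fmt.Println(asFloat32s(v)) // [1 2 3 4]
	fmt.Println(asFloat64s(v)) // same bits, read as two doubles
}
```

The second line of output is not 1 and 2: the bits have not changed, only the interpretation. Picking the wrong suffix in assembly produces exactly this kind of silent garbage.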

Go assembly exposes these registers directly. You write the mnemonics in a .s file. The assembler translates them to machine code. You manage the stack layout, the register allocation, and the data movement. The Go runtime still manages the goroutine and the heap. You are writing a leaf function that ordinary Go code calls.

Minimal example: x86-64

Here is the skeleton of a SIMD function in Go assembly. It adds two 128-bit vectors element-wise and writes the result to a third location.

#include "textflag.h"

// func simdAdd(a, b, dst *[4]float32)
// simdAdd adds two 128-bit vectors element-wise.
// It reads four floats from a and b, adds them, and writes to dst.
TEXT ·simdAdd(SB), NOSPLIT, $0-24
    // Load the three pointer arguments from the caller's frame.
    MOVQ a+0(FP), AX
    MOVQ b+8(FP), BX
    MOVQ dst+16(FP), CX
    // Load 128 bits from the first input into X0.
    VMOVDQU (AX), X0
    // Load 128 bits from the second input into X1.
    VMOVDQU (BX), X1
    // Add the vectors. The destination is the last operand: X0 = X1 + X0.
    VADDPS X0, X1, X0
    // Store the result through the destination pointer.
    VMOVDQU X0, (CX)
    RET

The #include pulls in textflag.h, which defines NOSPLIT. The TEXT directive defines the function. ·simdAdd(SB) names it: the leading · (a middle dot) places the symbol in the current package, and SB is the static-base pseudo-register from which global symbols are addressed. Do not append <> to the name. That suffix makes the symbol local to the file, so the Go declaration could never find it. NOSPLIT tells the assembler to omit the stack growth check. $0-24 defines the frame: zero bytes of local stack, twenty-four bytes of arguments and results.

Assembly functions use Go's stack-based calling convention (ABI0). The caller lays out the arguments in memory, and the function addresses them by name through the FP pseudo-register. a+0(FP) is the first pointer, b+8(FP) the second, dst+16(FP) the third. Each pointer is 8 bytes. MOVQ copies a pointer into a general-purpose register. VMOVDQU moves unaligned double-quad-words. It loads 16 bytes from the address in the register into the vector register. VADDPS performs the addition. RET returns control to the caller.

Assembly gives you the metal. The compiler gives you the safety net. Pick your battles.

Walkthrough: stack, registers, and flow

The function starts with its arguments already in memory, placed there by the caller. A bare offset such as 0(SP) refers to the hardware stack pointer, which on entry points at the return address, not at the first argument. That is why the code uses name+offset(FP) instead. MOVQ a+0(FP), AX fetches the address of the first vector. VMOVDQU (AX), X0 then dereferences it and loads the 128-bit value into X0. The U in VMOVDQU means unaligned. The CPU absorbs any alignment penalty. The aligned variant, VMOVDQA, faults if the address is not a multiple of 16.

The vector registers X0 and X1 now hold the data. VADDPS X0, X1, X0 adds the contents. Go assembly puts the destination last, the reverse of Intel syntax, so the third operand receives the sum. X0 is updated in place. The result is four floats: dst[i] = a[i] + b[i].

The result pointer sits in CX. VMOVDQU X0, (CX) stores the 128 bits to the address it holds. The caller reads the result from dst.
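On the Go side, the function needs a body-less declaration that the assembly implements. The sketch below shows that declaration in a comment, assuming the signature simdAdd(a, b, dst *[4]float32), and pairs it with a hypothetical pure-Go reference, simdAddRef, that is handy for checking the assembly in tests.

```go
package main

import "fmt"

// In the real package this would be a body-less declaration that the
// assembly file implements (assumed signature):
//
//	//go:noescape
//	func simdAdd(a, b, dst *[4]float32)
//
// simdAddRef is a hypothetical pure-Go reference with the same signature.
// A test can run both and compare the results lane by lane.
func simdAddRef(a, b, dst *[4]float32) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	a := [4]float32{1, 2, 3, 4}
	b := [4]float32{4, 5, 6, 7}
	var dst [4]float32
	simdAddRef(&a, &b, &dst)
	fmt.Println(dst) // [5 7 9 11]
}
```

Keeping a reference implementation next to the assembly also gives you a fallback for architectures you have not written assembly for yet.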

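The NOSPLIT flag deserves care. Goroutine stacks grow dynamically. The toolchain inserts a check at the start of most functions to ensure there is enough stack space. If there is not, the runtime copies the stack to a larger allocation and updates the pointers into it. The check itself is cheap, a compare and a branch, but a leaf function with no locals does not need it. NOSPLIT omits it. In exchange, the function and any NOSPLIT functions it calls must fit in the small reserve of stack that every goroutine is guaranteed. The linker verifies this and fails the build with a nosplit stack overflow error if a chain of NOSPLIT calls needs too much space. Keep NOSPLIT functions small and, ideally, leaves.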

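Convention aside: gofmt formats Go code. It does not touch .s files. Your indentation is your responsibility. Pick one style and stay consistent: the examples here use four spaces, while Go's own source tree uses tabs. Run gofmt on the Go stub, not the assembly. go vet does read .s files: its asmdecl check compares the argument names and offsets in the assembly against the Go declaration.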

Realistic example: processing a batch

A single vector add is rarely useful. Real code processes slices. Here is a function that adds vectors from two slices. It handles the loop in assembly for maximum control.

#include "textflag.h"

// func ProcessBatch(dst, src1, src2 *float32, n int)
// ProcessBatch adds vectors from two slices.
// It iterates n times, adding one 128-bit chunk (four floats) per pass.
TEXT ·ProcessBatch(SB), NOSPLIT, $0-32
    // Load the chunk count into AX.
    MOVQ n+24(FP), AX
    // If the count is zero or negative, skip the loop.
    TESTQ AX, AX
    JLE done
    // Load pointers into general-purpose registers.
    MOVQ dst+0(FP), BX
    MOVQ src1+8(FP), CX
    MOVQ src2+16(FP), DX
loop:
    // Load 128-bit vectors from source pointers.
    VMOVDQU 0(CX), X0
    VMOVDQU 0(DX), X1
    // Perform vector addition.
    VADDPS X0, X1, X0
    // Store result to destination pointer.
    VMOVDQU X0, 0(BX)
    // Advance pointers by 16 bytes (four floats).
    ADDQ $16, BX
    ADDQ $16, CX
    ADDQ $16, DX
    // Decrement counter and loop if more work remains.
    DECQ AX
    JNZ loop
done:
    RET

The argument layout is dst *float32, src1 *float32, src2 *float32, n int, where n counts 128-bit chunks, not individual floats. On 64-bit systems, pointers are 8 bytes and int is 8 bytes. The offsets are 0, 8, 16, and 24. The total argument size is 32 bytes.

The code loads the count into AX with MOVQ, because int is 64 bits wide here. TESTQ AX, AX sets flags based on the value. JLE done jumps to the end if the count is zero or negative. The pointers move into BX, CX, and DX. The loop loads vectors, adds them, stores the result, advances the pointers by 16 bytes, and decrements the counter. DECQ AX decrements the full 64-bit register. JNZ loop continues until the counter hits zero. Any elements left over after the last full chunk are the caller's job.
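The loop processes whole 16-byte chunks, so the Go caller has to deal with slices whose length is not a multiple of four. The sketch below shows one way to wrap such a kernel. addChunks is a pure-Go stand-in for the assembly routine, and AddSlices is a hypothetical wrapper name; both are assumptions for illustration.

```go
package main

import "fmt"

// addChunks stands in for the assembly kernel: it adds n groups of four
// floats. In a real package the assembly routine would take its place.
func addChunks(dst, src1, src2 []float32, n int) {
	for i := 0; i < n*4; i++ {
		dst[i] = src1[i] + src2[i]
	}
}

// AddSlices adds src1 and src2 into dst. Whole groups of four go to the
// kernel; the 0 to 3 leftover elements are handled by a scalar loop.
func AddSlices(dst, src1, src2 []float32) {
	n := len(dst)
	if len(src1) < n || len(src2) < n {
		panic("AddSlices: source slices shorter than dst")
	}
	chunks := n / 4
	if chunks > 0 {
		addChunks(dst, src1, src2, chunks)
	}
	for i := chunks * 4; i < n; i++ {
		dst[i] = src1[i] + src2[i]
	}
}

func main() {
	a := []float32{1, 2, 3, 4, 5, 6}
	b := []float32{10, 20, 30, 40, 50, 60}
	dst := make([]float32, len(a))
	AddSlices(dst, a, b)
	fmt.Println(dst) // [11 22 33 44 55 66]
}
```

The length check matters more than it looks: the assembly trusts the count it is given, so the wrapper is the last place a bad length can be caught.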

Loops in assembly are fast. They are also where off-by-one errors live forever.

Architecture differences

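Go runs on many architectures. The mnemonics change. The logic remains the same. x86-64 uses VADDPS. ARM64 uses FADD on a four-lane arrangement. LoongArch's LSX extension uses VFADD.S. Go's assembler may spell these differently or, for less common instructions, not offer them at all. You must maintain separate .s files for each architecture. Go does not auto-translate assembly.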

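Here is the ARM64 equivalent. ARM64 uses V registers for vectors and R registers for integers and pointers. The syntax is different.

#include "textflag.h"

// func simdAddARM(a, b, dst *[4]float32)
// simdAddARM adds vectors on ARM64.
// It uses ARM64 vector registers and mnemonics.
TEXT ·simdAddARM(SB), NOSPLIT, $0-24
    // Load the three pointer arguments.
    MOVD a+0(FP), R0
    MOVD b+8(FP), R1
    MOVD dst+16(FP), R2
    // Load vectors into V0 and V1.
    VLD1 (R0), [V0.S4]
    VLD1 (R1), [V1.S4]
    // Add packed single-precision floats: FADD V0.4S, V0.4S, V1.4S.
    // Emitted as a raw encoding in case the assembler lacks the mnemonic.
    WORD $0x4E21D400
    // Store result.
    VST1 [V0.S4], (R2)
    RET

VLD1 loads a vector from memory. [V0.S4] specifies the register and arrangement: V0 holding four 32-bit lanes. Go writes the arrangement as S4 where ARM's own manuals write 4S. FADD performs the addition. Go's arm64 assembler has historically covered only part of the NEON floating-point set, so this sketch emits the instruction's encoding with WORD, a common workaround. Verify the encoding against the ARM reference before relying on it. VST1 stores the result. The argument offsets are the same as on x86-64. The register names and instructions differ.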

Convention aside: Cross-compilation requires separate files. Name them simd_amd64.s, simd_arm64.s, and simd_loong64.s. Go selects the file based on the target architecture. The Go stub remains the same. The implementation diverges.

Pitfalls and errors

Assembly bypasses the type system. The compiler cannot check your pointer arithmetic. You can read garbage or crash the program.

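The build fails if the Go stub and the assembly do not line up. The stub declares the function signature. The assembly file implements it. Both must exist, and the names must match. If the package has no assembly at all, the compiler reports missing function body. If the .s file exists but does not define the symbol, for example because the name is misspelled or carries a <> suffix, the linker reports relocation target main.simdAdd not defined, with your package path in place of main.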

A segmentation fault occurs if you use aligned instructions on unaligned memory. VMOVDQA requires 16-byte alignment. VMOVDQU does not. If the pointer is not aligned, use the unaligned variant. The performance difference is small on modern CPUs. The crash is not.
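When in doubt about alignment, the Go side can check it before choosing a code path. The helper below is a hypothetical sketch; isAligned16 is not a standard library function.

```go
package main

import (
	"fmt"
	"unsafe"
)

// isAligned16 reports whether the first element of s sits on a 16-byte
// boundary, the requirement for 128-bit aligned loads such as VMOVDQA.
// An empty slice has no first element and is reported as not aligned.
func isAligned16(s []float32) bool {
	if len(s) == 0 {
		return false
	}
	return uintptr(unsafe.Pointer(&s[0]))%16 == 0
}

func main() {
	buf := make([]float32, 8)
	// Elements are 4 bytes apart, so of four consecutive starting
	// positions exactly one falls on a 16-byte boundary.
	for i := 0; i < 4; i++ {
		fmt.Println(i, isAligned16(buf[i:]))
	}
}
```

Slicing into the middle of a buffer, as buf[1:] does here, is the usual way an aligned allocation turns into an unaligned pointer.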

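A nosplit stack overflow is reported at link time if a chain of NOSPLIT functions needs more stack than the guaranteed reserve. NOSPLIT removes the stack check, so the runtime cannot grow the stack inside such a function. The linker adds up the frames along every NOSPLIT call chain and rejects the program if the total is too large. Calling an ordinary Go function from assembly is a separate hazard: the callee performs its own stack check, but the assembly frame must carry pointer information the garbage collector can trust. Keep NOSPLIT functions as leaves with small frames.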

The assembler rejects register names that do not exist on the target architecture. x86-64 has X0 through X15, plus X16 through X31 on CPUs with AVX-512. ARM64 has V0 through V31. Writing V0 in an amd64 file, or X0 in an arm64 file, will not assemble. Check the architecture documentation.

Convention aside: unsafe often lives next to assembly. If you are writing assembly, you are likely managing pointers manually. Use unsafe.Pointer, or a typed pointer to the first element, in the Go stub to bridge the gap. The assembly sees raw addresses. The Go code sees types. The boundary is where bugs hide.

The assembler doesn't save you from yourself. Check your offsets.

Decision matrix

Use standard Go loops when performance is acceptable. The compiler will not vectorize them, but scalar code is often fast enough. Use unsafe.Slice when you need to treat a raw pointer as a slice for iteration without copying data. Use assembly when profiling identifies a tight loop that scalar code cannot speed up and SIMD throughput is required. Use the NOSPLIT flag on small leaf assembly functions that need no stack growth check. //go:nosplit is the equivalent directive for functions written in Go.
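For the unsafe.Slice case, a minimal sketch looks like this. sumFromPointer is a made-up name, and the example assumes Go 1.17 or later, where unsafe.Slice is available.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sumFromPointer views n float32 values starting at p as a slice, without
// copying, and sums them. The caller must guarantee that p points at n
// valid elements; nothing checks that.
func sumFromPointer(p *float32, n int) float32 {
	s := unsafe.Slice(p, n)
	var total float32
	for _, v := range s {
		total += v
	}
	return total
}

func main() {
	data := [4]float32{1, 2, 3, 4}
	fmt.Println(sumFromPointer(&data[0], len(data))) // 10
}
```

The slice shares memory with the original data, so writes through it are visible to whoever owns the pointer.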

Where to go next

SIMD (Single Instruction, Multiple Data) lets your computer process several numbers at once instead of one by one. In Go, you write these instructions in a separate assembly file, similar to writing a recipe for the CPU to follow directly. It's like painting a whole tray of boxes in one motion instead of painting them one at a time.