When Go isn't fast enough
You run the profiler on a hot path. The flame graph points to a tight loop that processes millions of integers per second. The Go compiler did its job, but you need every last cycle. You reach for the assembler. The syntax looks alien. The stack layout is explicit. You are about to talk directly to the CPU instead of relying on the compiler's conservative register allocation.
Writing assembly in Go is not about rewriting your entire application. It is about isolating a specific bottleneck and giving the machine exact instructions. The Go toolchain expects you to stay in Go for 99 percent of the codebase. Assembly is the emergency brake, not the steering wheel.
What the Go assembler actually is
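Go ships with its own assembler. It uses neither AT&T nor Intel syntax. Its dialect descends from the Plan 9 assemblers and is designed to match Go's calling convention, stack management, and garbage collector. The syntax is semi-abstract: most lines correspond to a single machine instruction, but the toolchain chooses the final encoding, and pseudo-instructions such as TEXT and pseudo-registers such as FP and SB have no direct hardware counterpart.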
Think of the Go compiler as a general contractor. It handles architecture, plumbing, electrical work, and scheduling. The assembler is the master carpenter you hire to carve a single intricate joint. You control the registers, the memory addresses, and the exact order of operations. The rest of the program stays in Go. The two worlds meet at the linker.
The assembler uses the FP pseudo-register, a virtual frame pointer, to address arguments and results, and the SP pseudo-register to address local variables. It uses the TEXT directive to define function entry points. Each argument reference spells out both a name and an explicit offset, as in x+0(FP); the name is mandatory, but only the offset determines what is loaded. You write the file with a .s extension, drop it in a package, and the go build command assembles it alongside the .go files.
The assembler doesn't optimize for you. You are the optimizer.
Writing your first assembly function
Here is the simplest bridge between Go and assembly. The assembly file defines a function that adds two 64-bit integers. The Go file declares the signature and calls it.
Here is the assembly routine that reads arguments from the stack, computes the sum, and writes the result back.
// add.s (amd64 only)
// textflag.h defines the NOSPLIT constant used below.
#include "textflag.h"
// TEXT defines the function entry point. The middle dot (·) separates the package from the name.
// NOSPLIT tells the assembler to omit the stack-growth check at function entry.
// $0-24 means 0 bytes of local stack and 24 bytes of arguments and results:
// x (8 bytes) + y (8 bytes) + ret (8 bytes).
TEXT ·Add(SB), NOSPLIT, $0-24
    // Load the first argument (x) from offset 0 into register AX
    MOVQ x+0(FP), AX
    // Load the second argument (y) from offset 8 into register BX
    MOVQ y+8(FP), BX
    // Add BX to AX. The result stays in AX.
    ADDQ BX, AX
    // Write the result from AX to the return value slot at offset 16
    MOVQ AX, ret+16(FP)
    // Return control to the caller.
    RET
Here is the Go side that exposes the assembly function to the rest of the program.
// main.go
package main
// Add is implemented in add.s. A declaration without a body tells the
// compiler that the implementation is supplied in assembly.
func Add(x, y int64) int64
func main() {
    // Call the assembly function like any other Go function.
    result := Add(10, 20)
    // Print the result to verify the bridge works.
    println(result)
}
The bodyless declaration is the whole bridge. It tells the compiler that the implementation of Add lives outside the Go source, and because add.s sits in the same package, the ·Add symbol it defines satisfies the declaration. No //go:linkname directive is involved. That directive serves a different purpose: it binds a declaration to a symbol in another package, requires importing unsafe, and bypasses normal visibility rules. Leave it out unless you actually need to cross a package boundary.
How the compiler and linker bridge the gap
When you run go build, the compiler processes the .go files. It sees a function declaration with no body and records a reference to the symbol main.Add, to be resolved later. Hand-written assembly uses Go's older stack-based calling convention (ABI0), while compiled Go code on amd64 has passed arguments in registers since Go 1.17, so the toolchain also generates a small wrapper that moves the arguments from registers into the stack slots the assembly expects.
The assembler processes the .s file. It translates the TEXT directive into a symbol table entry. The frame-size annotation on the TEXT line tells the toolchain how much local stack the function uses and how many bytes of arguments and results it expects. The NOSPLIT flag deserves a note. Goroutines grow their stacks dynamically, so the toolchain normally inserts a prologue at the start of each function that checks for enough stack space and calls into the runtime to grow the stack if there is not. NOSPLIT omits that prologue. That is safe for a small leaf function like Add, which uses no local stack and calls nothing, and it saves a few instructions per call.
The linker runs last. It resolves the reference to main.Add against the symbol in the assembled object file and patches the call site with the correct address. The result is a single executable where Go code and assembly code share the same stack.
Stack layouts are contracts. Break them and you read garbage.
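The argument size in a TEXT line can be derived mechanically from the Go signature. The following is a minimal sketch of that arithmetic, assuming a 64-bit target, the stack-based convention used by hand-written assembly, and only fixed-width scalar types whose alignment equals their size. The names field, alignUp, and layout are illustrative and not part of any Go tool.

```go
package main

import "fmt"

// field is one argument or result slot. For this sketch, alignment is
// assumed to equal size, which holds for fixed-width integers and
// pointers on amd64 but not for structs, strings, or slices.
type field struct {
	name string
	size int // bytes; expected to be 1, 2, 4, or 8
}

// alignUp rounds off up to the next multiple of align.
func alignUp(off, align int) int {
	return (off + align - 1) / align * align
}

// layout returns the FP offset of each slot and the total size of the
// argument area (the number after the dash in $frame-args).
func layout(args, results []field) (map[string]int, int) {
	const wordSize = 8 // assumption: 64-bit target
	offsets := make(map[string]int)
	off := 0
	place := func(f field) {
		off = alignUp(off, f.size)
		offsets[f.name] = off
		off += f.size
	}
	for _, f := range args {
		place(f)
	}
	// Results start on a word boundary after the last argument.
	if len(results) > 0 {
		off = alignUp(off, wordSize)
	}
	for _, f := range results {
		place(f)
	}
	return offsets, off
}

func main() {
	// func Add(x, y int64) int64
	offs, total := layout(
		[]field{{"x", 8}, {"y", 8}},
		[]field{{"ret", 8}},
	)
	fmt.Printf("x+%d(FP) y+%d(FP) ret+%d(FP) frame $0-%d\n",
		offs["x"], offs["y"], offs["ret"], total)
	// prints: x+0(FP) y+8(FP) ret+16(FP) frame $0-24
}
```

For func Add(x, y int64) int64 the two arguments and the result occupy 24 bytes in total, with the result at offset 16. In practice go vet performs this check for you; the sketch only shows where the numbers come from.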
Inspecting object files and compiled output
You rarely guess how the compiler translates your code. You verify it. The go tool suite ships with the standard distribution and gives you direct access to the intermediate representation and final machine code.
Here is the standard workflow for compiling a package to an object file and inspecting its symbols.
# Compile the package to an object file without linking.
# -p sets the package import path recorded in the object file.
go tool compile -p main -o main.o main.go
# List the symbols in the object file. The type column marks code with T or t,
# data with D or d, and undefined references with U.
go tool nm main.o
# Disassemble the object file to see the actual machine instructions.
go tool objdump main.o
The nm command lists every symbol the compiler emitted, along with its address, type, and name. You use it to audit naming collisions or verify that the linker will find what you expect.
The objdump command shows you the raw assembly the compiler generated. You use it to verify inlining, check register allocation, and spot unexpected memory loads. If you suspect the compiler is spilling registers to the stack too often, objdump proves it.
The -S flag on go tool compile is useful during development. It prints the assembly listing to stdout while the package compiles. Inside a normal build, go build -gcflags=-S passes the same flag through. You pipe the listing to a file or a pager to compare your hand-written assembly against the compiler's output.
Trust the disassembly. The compiler lies less than your intuition.
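To try the inspection workflow on something concrete, it helps to have a small function whose machine code is worth reading. The sketch below uses a hypothetical SumInt64 loop; the //go:noinline directive keeps it as a separate symbol so it is easy to find in the listing.

```go
package main

import "fmt"

// SumInt64 is a hypothetical hot loop, used only to give the compiler
// something worth disassembling.
//
//go:noinline
func SumInt64(xs []int64) int64 {
	var total int64
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	fmt.Println(SumInt64([]int64{1, 2, 3, 4})) // prints 10
}
```

Building with go build -gcflags=-S prints the compiler's listing for the package. After a normal build, go tool objdump -s 'main\.SumInt64' followed by the path to the binary restricts the disassembly to that one function.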
Pitfalls and compiler boundaries
Writing assembly in Go introduces a narrow set of failure modes. Most of them stem from stack layout mismatches or calling convention violations.
If the TEXT line declares 16 bytes of arguments but the Go signature needs 24 (two int64 arguments plus an int64 result), the function reads and writes stack slots that do not match what the caller set up. Do not count on the compiler or linker to catch this. go vet does: its asmdecl check compares each TEXT line and each FP reference against the Go declaration and reports wrong argument sizes, names, and offsets. Without vet, you get silent corruption that shows up as a crash or a corrupted return value.
NOSPLIT cuts the other way from what you might expect. Leaving it off is safe: the assembler inserts the usual stack check at function entry. Trouble starts when you put NOSPLIT on a function with a large frame, or on a deep chain of NOSPLIT functions calling each other, because all of them must fit in the small stack reserve the runtime guarantees. The linker checks this and fails the build with a nosplit stack overflow error. Reserve NOSPLIT for short leaf routines that use little or no local stack.
Mixing C calling conventions with Go assembly breaks immediately. Hand-written Go assembly receives its arguments on the stack. C passes them in registers on most modern platforms. If you try to call a C function directly from Go assembly without a wrapper, the build typically fails with an undefined symbol or the program crashes on entry. The convention in the Go community is to use cgo for C interop and keep pure Go assembly strictly within the Go calling convention.
Another boundary is formatting. gofmt does not touch .s files. You format assembly manually. The community convention is to align operands vertically and keep comments to the right of the instruction. Consistency matters more than style because assembly is read by humans who are already debugging a performance issue.
The worst assembly bug is the one that returns garbage without panicking.
When to reach for assembly
You do not write assembly because it is cool. You write it because the profiler proved the compiler cannot squeeze out the last percent of performance you need. The decision matrix is straightforward.
Use Go assembly when you need to bypass the compiler's conservative register allocation for a tight loop that runs millions of times per second. Use go tool objdump when you want to verify that inlining actually happened and inspect the exact instruction sequence the compiler emitted. Use go tool nm when you need to audit exported symbols and check for naming collisions before linking. Use standard Go code when the profiler shows the bottleneck is I/O, memory allocation, or network latency instead of CPU cycles.
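Before writing any assembly, it is worth putting a number on the Go version so there is a baseline to beat. The following is a minimal sketch using testing.Benchmark from the standard library, which runs a benchmark function outside of go test. The sumInt64 loop, the slice size, and the sink variable are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// sumInt64 is a hypothetical stand-in for the hot loop under suspicion.
func sumInt64(xs []int64) int64 {
	var total int64
	for _, x := range xs {
		total += x
	}
	return total
}

// sink keeps the result alive so the compiler cannot discard the loop.
var sink int64

func main() {
	xs := make([]int64, 4096)
	for i := range xs {
		xs[i] = int64(i)
	}
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sumInt64(xs)
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

If an assembly version does not clearly beat this figure on the same input, the extra maintenance cost is hard to justify.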
Profile first. Assemble second.