When the profiler points to a function that looks innocent
You profiled your service. The CPU flame graph shows a spike in a function you wrote. The code is clean: a loop summing integers, or a helper that formats a string. The logic is sound. Yet the profiler says this function consumes half your CPU time. You suspect the compiler generated a slow loop, or that it failed to inline a helper and you are paying call overhead on every iteration. You need to see what the machine actually executes.
Go lets you peek under the hood. You can inspect the assembly output to verify what the compiler emitted. Assembly is the bridge between your source code and the CPU. Reading it reveals the exact sequence of register moves, memory accesses, and jumps. It won't give you cycle counts, but it shows exactly what work the CPU was asked to do.
Assembly is the compiler's hand movements
Go compiles to native machine code. The compiler translates your source into a sequence of CPU instructions. Assembly is the human-readable representation of those instructions. Each line of assembly corresponds, with a few exceptions, to a single instruction the processor executes.
Think of Go code as a recipe. It lists ingredients and steps. Assembly is the video of the chef cooking. The recipe says "chop onions." The video shows the knife angle, the speed of the hand, and whether the chef pauses to check the phone. If the dish takes too long, the recipe might be fine, but the video reveals the chef is using a dull knife. Assembly shows you the compiler's knife skills.
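Go assembly uses a specific syntax. It is neither AT&T nor Intel syntax. It is Go's own dialect, descended from the Plan 9 assemblers. Instructions are uppercase. Operands are separated by commas. The destination is the last operand; comparisons such as CMPQ are the notable exception because they write no destination. The dialect is also semi-abstract: names such as FP and SB are pseudo-registers that the toolchain resolves, not hardware registers. This consistency makes the output easier to parse once you know the rules.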
Generating assembly from Go code
The standard way to see assembly is to ask the compiler to print it during compilation. You use the -gcflags flag to pass options to the compiler. The -S flag tells the compiler to print an assembly listing while it compiles; the object code is still produced as usual.
go build -gcflags="-S" main.go
This command compiles main.go and prints the assembly listing to standard error, so redirect with 2> if you want it in a file. Each instruction line includes the program counter offset, the source file and line number, the instruction, and the operands. Each function's listing also carries a hex dump of its encoded bytes and metadata lines such as FUNCDATA and PCDATA, which you can skip on a first read.
Minimal example: a function call
Here's a trivial function to generate assembly. The function takes two integers and returns their sum.
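package main
import "fmt"
// Add returns the sum of two integers.
func Add(a, b int) int {
	return a + b
}
// main calls Add so the program links; package main needs a main function.
func main() {
	fmt.Println(Add(2, 3))
}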
Run the build command with the flag. On Go 1.17 and later for amd64, the compiler passes arguments in registers, so the real listing for Add collapses to little more than ADDQ BX, AX and RET. The stack-based form is still worth learning: it is the convention hand-written assembly uses (the toolchain calls it ABI0), and it makes every data movement explicit. Stripped of offsets and metadata, Add in that form looks like this.
# output (simplified, stack-based form):
TEXT "".Add(SB), $0-24
MOVQ a+0(FP), AX
MOVQ b+8(FP), BX
ADDQ BX, AX
MOVQ AX, ret+16(FP)
RET
The listing has six lines. The first line is the function header. The next four lines do the work. The last line returns control to the caller.
Walking through the output
The TEXT line marks the start of the function. "".Add(SB) is the symbol name. SB stands for Static Base. It is a pseudo-register that anchors the name as a global symbol. $0-24 describes the stack frame. The first number is the size of local variables. The second number is the size of arguments and results together. Here, there are zero bytes of locals and 24 bytes of arguments: two 64-bit integer parameters plus one 64-bit result.
In this stack-based convention, all arguments and results travel on the stack. Registers are not used for argument passing. This makes the assembly easier to read. You always look at offsets from FP to find arguments. The register-based convention the compiler uses internally on newer versions trades that readability for speed.
FP is a pseudo-register, not the hardware frame pointer (that is BP). It points at the start of the argument area. Arguments live at non-negative offsets from FP. a+0(FP) means the first argument sits at offset 0. b+8(FP) is the second argument, 8 bytes further along. The name before the offset is a label for the reader; the offset is what selects the memory.
MOVQ moves a 64-bit value. The Q stands for Quad word. MOVQ a+0(FP), AX loads the value of a from the stack into the AX register. AX is a general-purpose register. The compiler uses registers to hold temporary values.
ADDQ BX, AX adds the value in BX to AX. The result stays in AX. This is the core logic of the function.
MOVQ AX, ret+16(FP) stores the result back to the stack. Results sit directly after the arguments in the caller's frame, so they appear at the next offset from FP, here 16. The caller expects the result at this location.
RET returns to the caller. It pops the return address from the stack and jumps to it.
Registers are scarce. The x86-64 architecture has a limited set of general-purpose registers. The compiler assigns registers automatically. If it runs out, it spills values to the stack. Spilling is expensive. You can see spills as extra MOVQ instructions that move data between registers and stack offsets.
Realistic example: loops and optimizations
Loops are where assembly matters most. A poorly optimized loop can dominate CPU usage. Here's a function that sums a slice.
// SumSlice sums all elements in a slice.
func SumSlice(data []int) int {
total := 0
for i := 0; i < len(data); i++ {
total += data[i]
}
return total
}
Here's a simplified listing for the loop. It keeps total and i on the stack, the way an unoptimized build does, and it uses named labels where real compiler output prints numeric program counter targets.
# output (simplified):
TEXT "".SumSlice(SB), $16-32
MOVQ data_base+0(FP), DX
MOVQ data_len+8(FP), BX
MOVQ $0, total-8(SP)
MOVQ $0, i-16(SP)
CMPQ BX, $0
JEQ done
loop:
CMPQ i-16(SP), BX
JGE done
MOVQ i-16(SP), CX
MOVQ (DX)(CX*8), SI
ADDQ SI, total-8(SP)
INCQ i-16(SP)
JMP loop
done:
MOVQ total-8(SP), AX
MOVQ AX, ret+24(FP)
RET
The header reads $16-32: 16 bytes of locals for total and i, and 32 bytes of arguments and results. A slice is three words (pointer, length, capacity), so data occupies 24 bytes and the result sits at ret+24(FP). Locals live at negative offsets from the pseudo-register SP, as in total-8(SP).
The listing shows the loop structure. CMPQ compares values. JEQ jumps if equal. JGE jumps if greater or equal. INCQ increments a value.
The code checks the slice length first. CMPQ BX, $0 compares the length to zero. JEQ done jumps to the end if the slice is empty. total is already zero at that point, so an empty slice returns 0 without entering the loop.
Inside the loop, CMPQ i-16(SP), BX checks the index against the length. JGE done exits the loop if the index reaches the length. This is the loop condition.
The memory access MOVQ (DX)(CX*8), SI loads an element from the slice. DX holds the slice pointer. CX holds the index. (DX)(CX*8) calculates the address by multiplying the index by 8 (the size of an int on a 64-bit platform) and adding it to the pointer. This is pointer arithmetic.
There is no bounds check inside the loop. In Go, indexing a slice normally includes a runtime check that the index is within bounds. In an optimized build, the compiler proves that the loop condition i < len(data) guarantees a valid index and removes the redundant check. This is a common optimization called bounds-check elimination. To verify it in real output, look for a compare followed by a conditional jump to a block that calls runtime.panicIndex. If no such call guards the access, the compiler optimized the check away. An unoptimized build (-N) generally keeps it.
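To see when a bounds check survives, it can help to compare loops whose limits the compiler can and cannot reason about. The sketch below is illustrative: SumFirstN and SumFirstNHinted are hypothetical helpers, not part of the original example, and whether a given check is removed depends on the Go version. Building with -gcflags="-d=ssa/check_bce/debug=1" asks the compiler to report the bounds checks it kept, which is quicker than scanning the listing for runtime.panicIndex.

```go
package main

import "fmt"

// SumSlice is bounded by len(data), so the compiler can usually prove
// every data[i] is in range and drop the check.
func SumSlice(data []int) int {
	total := 0
	for i := 0; i < len(data); i++ {
		total += data[i]
	}
	return total
}

// SumFirstN (hypothetical) is bounded by n. The compiler knows nothing
// about how n relates to len(data), so data[i] typically keeps its check.
func SumFirstN(data []int, n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += data[i]
	}
	return total
}

// SumFirstNHinted (hypothetical) reslices once before the loop. That one
// operation is checked (and panics if n is out of range), but afterwards
// the loop is bounded by len(data) again.
func SumFirstNHinted(data []int, n int) int {
	data = data[:n]
	total := 0
	for i := 0; i < len(data); i++ {
		total += data[i]
	}
	return total
}

func main() {
	data := []int{1, 2, 3, 4, 5}
	fmt.Println(SumSlice(data), SumFirstN(data, 3), SumFirstNHinted(data, 3)) // 15 6 6
}
```

The hinted version moves the check out of the loop rather than deleting it, and it changes behavior for out-of-range n (it panics up front), so treat it as a trade-off to measure, not a free win.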
Assembling raw .s files
Go supports writing assembly in .s files. These files use the same syntax as the compiler output. You can write performance-critical functions in assembly and link them with Go code.
The go tool asm command assembles .s files into object files. The -S flag prints the assembly source alongside the hex encoding. This is useful for verifying hand-written assembly.
go tool asm -S file.s
The output shows the offset, the hex bytes, and the instruction. The sample below is trimmed to those three columns; the real listing also carries the source file and line. The hex encoding is the machine code itself: the bytes the CPU fetches and decodes.
# output:
0000 TEXT "".MyFunc(SB), $0-0
0000 4889F8 MOVQ DI, AX
0003 C3 RET
The first column is the offset. The second column is the hex. The third column is the instruction. 4889F8 is the hex for MOVQ DI, AX. C3 is the hex for RET.
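If you make a mistake in a .s file, the assembler rejects it and reports the file and line. A misspelled mnemonic, for example, is reported as an unrecognized instruction, and an operand combination the CPU does not support is rejected as invalid. The exact wording varies between Go versions. A reference to a symbol that doesn't exist usually gets past the assembler and fails later, when the linker cannot resolve it. The assembler validates register names and instruction encoding. It does not check that your frame size and argument offsets match the Go declaration; go vet does that.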
Pitfalls and conventions
Assembly output depends on the Go version. Optimizations change the output. The compiler may inline functions, unroll loops, or reorder instructions. To get a stable view, disable optimizations and inlining.
go build -gcflags="-S -N -l" main.go
The -N flag disables optimizations. The -l flag disables inlining. The -S flag is still needed to print the listing. This produces assembly that closely matches the source code. It is easier to read but slower to run. Use this mode for debugging, not for performance analysis.
Convention aside: use //go:noinline to force a function boundary. When the compiler inlines a function, its instructions are merged into each caller and interleaved with the caller's own code, so there is no CALL to follow and the body is hard to pick out. Adding //go:noinline above the function definition prevents inlining. This ensures the function appears as a distinct block in the assembly and that callers reach it through a CALL.
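A small side-by-side makes the effect of the directive easier to spot. In this sketch, square and squareNoInline are hypothetical functions invented for illustration. Building with -gcflags="-m" asks the compiler to print its inlining decisions, and adding -S lets you confirm that main contains a CALL for only one of the two.

```go
package main

import "fmt"

// square is small enough that the compiler will normally inline it,
// so its multiply ends up inside the caller's loop.
func square(x int) int {
	return x * x
}

// squareNoInline has the same body, but the directive keeps it a
// separate function that callers reach through a CALL.
//
//go:noinline
func squareNoInline(x int) int {
	return x * x
}

func main() {
	total := 0
	for i := 1; i <= 4; i++ {
		total += square(i) + squareNoInline(i)
	}
	fmt.Println(total) // 60
}
```

Remove the directive before you ship or benchmark for real; it exists here only to keep the function boundary visible while you read the listing.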
Convention aside: gofmt does not format .s files. You must format assembly manually. Indentation matters for readability. Use consistent spacing and comments to document your assembly.
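Registers have roles. SP is the stack pointer. FP, as written in Go assembly, is a pseudo-register for addressing arguments; the hardware frame pointer is BP. AX, BX, CX, DX, SI, and DI are general-purpose. R8 through R15 are additional general-purpose registers on x86-64, though the Go toolchain reserves some of them: under the register-based ABI, R14 holds the current goroutine pointer. The compiler assigns registers automatically. You can see the assignment in the output.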
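The worst assembly bug is the one that never logs. Hand-written assembly can crash the runtime if it corrupts the stack or clobbers a register the runtime relies on. The race detector does not instrument assembly, so it will not catch these mistakes. Run go vet, which checks frame sizes and argument offsets against the Go declaration, compare results against a pure-Go reference implementation, and test under load.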
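One common way to catch silent breakage is to compare the assembly routine against a plain Go reference on many inputs. The sketch below rests on assumptions: sumFast is a hypothetical stand-in written in Go (a 4-way unrolled loop) so the file runs on its own, whereas in a real package it would be a body-less declaration backed by a .s file. checkAgainstRef is likewise an invented helper name.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sumRef is the simple, obviously correct reference implementation.
func sumRef(data []int) int {
	total := 0
	for _, v := range data {
		total += v
	}
	return total
}

// sumFast is a hypothetical stand-in for a routine implemented in a .s
// file. It is plain Go here so the example is runnable.
func sumFast(data []int) int {
	var a, b, c, d int
	i := 0
	for ; i+4 <= len(data); i += 4 {
		a += data[i]
		b += data[i+1]
		c += data[i+2]
		d += data[i+3]
	}
	for ; i < len(data); i++ { // leftover elements
		a += data[i]
	}
	return a + b + c + d
}

// checkAgainstRef runs both implementations on random slices of every
// length from 0 to maxLen and returns the first mismatch it finds.
func checkAgainstRef(maxLen int, seed int64) error {
	rng := rand.New(rand.NewSource(seed))
	for n := 0; n <= maxLen; n++ {
		data := make([]int, n)
		for i := range data {
			data[i] = rng.Intn(2001) - 1000 // values in [-1000, 1000]
		}
		if got, want := sumFast(data), sumRef(data); got != want {
			return fmt.Errorf("len %d: got %d, want %d", n, got, want)
		}
	}
	return nil
}

func main() {
	if err := checkAgainstRef(64, 1); err != nil {
		fmt.Println("mismatch:", err)
		return
	}
	fmt.Println("ok")
}
```

Covering every length up to a small maximum matters more than large inputs: off-by-one errors in the leftover handling are where hand-written loops tend to go wrong.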
When to read assembly
Use go build -gcflags="-S" when you need to inspect the compiler's output for a specific function. Use go tool asm -S when you are writing raw assembly in a .s file and want to verify the hex encoding. Use go test -bench when you want to measure performance without reading assembly. Use pprof when you need to find hotspots in a running program. Use assembly inspection when the benchmark shows unexpected behavior and you need to verify inlining or loop structure. Use -N -l flags when you need a stable assembly view that matches the source code. Use //go:noinline when you need to force a function boundary for inspection.
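For the measuring step, a benchmark normally lives in a _test.go file and runs under go test -bench. The sketch below uses testing.Benchmark from an ordinary program only so that it fits in one runnable file; benchSumSlice, measure, and sink are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// SumSlice sums all elements in a slice.
func SumSlice(data []int) int {
	total := 0
	for i := 0; i < len(data); i++ {
		total += data[i]
	}
	return total
}

// sink keeps the result alive so the compiler cannot discard the loop.
var sink int

// benchSumSlice has the shape of a normal BenchmarkXxx function.
func benchSumSlice(b *testing.B) {
	data := make([]int, 1024)
	for i := range data {
		data[i] = i
	}
	b.ResetTimer() // exclude the setup above from the measurement
	for i := 0; i < b.N; i++ {
		sink = SumSlice(data)
	}
}

// measure runs the benchmark and returns the collected result.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(benchSumSlice)
}

func main() {
	r := measure()
	fmt.Println(r.String()) // iterations and ns/op; numbers vary by machine
}
```

If the reported time per operation looks off, that is the cue to open the listing for SumSlice and check whether the loop and its inlining into the benchmark look the way you expect.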
Assembly is the truth. Benchmarks are the lie detector. Don't optimize until you see the assembly. Registers are scarce. Spills are expensive. The compiler is smart. Trust it until the assembly proves otherwise.