The 15-character limit that broke on Tuesday
You build a username validator. The requirement is straightforward: allow up to fifteen characters. You write if len(username) > 15 { return ErrTooLong }. It passes every test case in your local environment. Then users with names like "José" or "山田太郎" start signing up. Their names cost more than one byte per character, so the validator rejects names that are well under fifteen characters, or worse, a truncation step cuts a name in the middle of a character. The bug is not in your logic. It is in your measurement tool.
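Here is a minimal sketch of the failure. The helper name validateByBytes and the ErrTooLong sentinel are hypothetical, modeled on the one-line check above, and the sample names are only illustrations.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTooLong is a hypothetical sentinel error for this sketch.
var ErrTooLong = errors.New("username too long")

// validateByBytes reproduces the buggy check: len counts bytes, not characters.
func validateByBytes(username string) error {
	if len(username) > 15 {
		return ErrTooLong
	}
	return nil
}

func main() {
	// "山田太郎です" has six characters, but each one occupies three bytes in UTF-8.
	for _, name := range []string{"alexander", "José", "山田太郎です"} {
		fmt.Printf("%q: %d bytes, err = %v\n", name, len(name), validateByBytes(name))
	}
}
```

The last name is rejected even though it has only six characters, because it occupies eighteen bytes.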
Strings are tape, not beads
Go strings are not arrays of characters. They are immutable sequences of bytes. By convention those bytes hold UTF-8 encoded text, and string literals in Go source are always UTF-8, which means every symbol takes between one and four bytes depending on its code point. The built-in len() function does not count symbols. It counts bytes. When you ask for the length of a string, Go hands you the number of bytes it occupies, not the human-readable character count.
Think of a string like a spool of magnetic tape. len() measures the physical length of the tape. Runes measure the number of distinct symbols stamped onto it. ASCII characters take one byte each, so the tape length and symbol count match perfectly. Latin accents, Cyrillic, Chinese, and emojis take two, three, or four bytes. The tape gets longer. The symbol count stays the same.
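To make the tape widths concrete, the sketch below asks utf8.RuneLen how many bytes a few sample symbols need. The wrapper encodedWidth is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedWidth reports how many bytes a single rune occupies in UTF-8.
func encodedWidth(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample each: ASCII, accented Latin, Cyrillic, Chinese, emoji.
	for _, r := range []rune{'A', 'é', 'Я', '世', '😀'} {
		fmt.Printf("%c  %U  %d byte(s)\n", r, r, encodedWidth(r))
	}
}
```

The five samples come out at one, two, two, three, and four bytes respectively.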
Bytes are cheap. Runes are what users see.
Measuring bytes versus runes
Here is the simplest way to see the difference in action. The code creates a mixed ASCII and CJK string, measures it with len(), then measures it with the standard library UTF-8 helper.
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Mixed ASCII and Chinese characters
	s := "Hello 世界"

	// Reads the string header directly. O(1) operation.
	byteLen := len(s)

	// Walks the byte sequence and counts valid UTF-8 code points.
	runeLen := utf8.RuneCountInString(s)

	// Prints both measurements side by side
	fmt.Printf("Bytes: %d, Runes: %d\n", byteLen, runeLen)
}
The output shows twelve bytes but only eight runes. The five ASCII letters take one byte each. The space takes one byte. The two Chinese characters take three bytes each. Five plus one plus six equals twelve. The human eye sees eight distinct symbols.
Under the hood, Go represents a string as a two-field struct: a pointer to the underlying byte array and an integer length. len() simply reads that integer field. It never touches the actual data. That is why it runs in constant time and costs virtually nothing. utf8.RuneCountInString() does the opposite. It iterates through every byte, checks the leading bits to determine where one character ends and the next begins, and increments a counter. It runs in linear time relative to the string size.
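One way to see that the header is separate from the data is to compare the size of the string value itself for a short and a long string. This is a rough sketch: headerSize is a hypothetical helper, and unsafe.Sizeof is used here only for inspection, not as something a validator needs.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// headerSize returns the size of the string value itself (pointer plus length).
// It does not depend on how much data the string refers to.
func headerSize(s string) uintptr {
	return unsafe.Sizeof(s)
}

func main() {
	short := "hi"
	long := strings.Repeat("世界", 1000)

	// Same header size for both; only len differs.
	fmt.Println(headerSize(short), len(short))
	fmt.Println(headerSize(long), len(long))
}
```

On typical 64-bit platforms the header is 16 bytes in both cases, while len reports 2 and 6000.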
Trust the header for size. Walk the tape for meaning.
How UTF-8 actually packs data
Understanding why len() diverges from character count requires looking at the encoding rules. UTF-8 was designed to be backward compatible with ASCII while supporting the entire Unicode standard. It achieves this through variable-length prefix bytes.
Single-byte characters use the pattern 0xxxxxxx. The leading zero tells the decoder this is a complete character. ASCII letters, digits, and punctuation all fit here. Two-byte characters start with 110xxxxx followed by a continuation byte 10xxxxxx. Three-byte characters start with 1110xxxx followed by two continuation bytes. Four-byte characters start with 11110xxx followed by three continuation bytes.
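The prefix patterns are easy to see if you print the bytes in binary. The sketch below uses a hypothetical helper, bitPattern, and a handful of sample characters chosen for illustration.

```go
package main

import "fmt"

// bitPattern formats each byte of s as eight binary digits.
func bitPattern(s string) []string {
	out := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, fmt.Sprintf("%08b", s[i]))
	}
	return out
}

func main() {
	// One, two, three, and four-byte encodings.
	for _, s := range []string{"A", "é", "世", "😀"} {
		fmt.Println(s, bitPattern(s))
	}
}
```

In the output, "A" is a single byte starting with 0, while "世" appears as 11100100 10111000 10010110: one lead byte starting with 1110, then two continuation bytes starting with 10.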
The decoder in unicode/utf8 uses these bit patterns to navigate the string. When utf8.RuneCountInString() encounters a byte starting with 110, it knows the character continues into the next byte. When it sees 1110, it continues for two more. A byte that does not fit the rules is counted as a single rune of width one. The decoder never needs to consult a massive lookup table. It just follows the prefix rules. This keeps it fast and cache-friendly, but it still requires scanning every byte. You cannot jump to the tenth character without walking through the first nine.
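The prefix rules are simple enough to apply by hand. The sketch below is not how the standard library implements its counter. It is a simplified illustration that counts every byte that is not a continuation byte, which matches the rune count only when the input is valid UTF-8. The name countLeadBytes is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countLeadBytes counts every byte that is not a continuation byte (10xxxxxx).
// For valid UTF-8 this equals the number of code points. Unlike
// utf8.RuneCountInString, it makes no attempt to handle malformed input.
func countLeadBytes(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		// The top two bits are 10 only on continuation bytes.
		if s[i]&0xC0 != 0x80 {
			n++
		}
	}
	return n
}

func main() {
	s := "Hello 世界"
	fmt.Println(countLeadBytes(s), utf8.RuneCountInString(s))
}
```

Both counts agree for this string. Either way, the loop touches all twelve bytes to arrive at eight.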
Go places the low-level encoding utilities in the unicode/utf8 package, not in strings. The strings package understands UTF-8 too (strings.ToUpper and strings.IndexRune decode runes, for example), but every position it reports or accepts is a byte offset. If you need to count or decode characters yourself, you import unicode/utf8 explicitly. The compiler will not warn you if you treat a byte offset as a character position, but the results will diverge the moment non-ASCII data appears.
When byte boundaries bite
The mismatch between bytes and runes causes silent data corruption when you slice strings by index. Go allows direct indexing with s[i], but it returns a single byte, not a character. If you index into the middle of a multi-byte sequence, you extract a fragment that does not form valid UTF-8.
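Try indexing "世界"[1] and the compiler will not stop you. It happily hands you a byte value, 0xB8, from the middle of the first character. Slice from that position and you get a string that is not valid UTF-8. Go does not panic on such a string. Instead, a range loop or a []rune conversion quietly substitutes U+FFFD, the replacement character, for each bad byte, and encoding/json does the same when it marshals the string. Stricter consumers, such as some databases and protocol libraries, reject the value outright. Either way, the user's text is damaged.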
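A small sketch of what a mid-character fragment looks like to the decoder. The helper firstRune is a hypothetical wrapper around utf8.DecodeRuneInString, used here only to inspect the result.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune decodes the first code point of s and reports how many bytes it used.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "世界"

	// A continuation byte from the middle of the first character.
	b := s[1]
	// A substring that starts mid-character, so it is not valid UTF-8.
	fragment := s[1:]

	fmt.Printf("s[1] = 0x%X\n", b)
	fmt.Println("valid:", utf8.ValidString(fragment))

	r, size := firstRune(fragment)
	fmt.Printf("first rune: %U (size %d)\n", r, size)
}
```

The decoder reports U+FFFD with a size of one byte for the broken fragment. Calling utf8.ValidString before handing text to another system is a cheap way to catch this early.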
The same trap appears with byte offsets from the strings package. Functions like strings.Index report positions in bytes. For ASCII, the byte offset and the character position are the same number. For Unicode text they are not. An offset that strings.Index returns for valid UTF-8 input always lands on a character boundary, but an offset you compute yourself, such as a fixed limit of fifteen or an index plus a character count, usually does not. The runtime panics with slice bounds out of range only if you exceed the total byte length. Nothing catches a mid-character slice.
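The gap between a byte offset and a character position is easy to demonstrate. In this sketch, runePosition is a hypothetical helper that converts the result of strings.Index into a character position by counting the runes in front of it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runePosition converts the byte offset returned by strings.Index into a
// character position by counting the runes that precede it.
// It returns -1 when substr is not present.
func runePosition(s, substr string) int {
	byteOff := strings.Index(s, substr)
	if byteOff < 0 {
		return -1
	}
	return utf8.RuneCountInString(s[:byteOff])
}

func main() {
	s := "世界 Hello"

	// Byte offset of the match.
	fmt.Println(strings.Index(s, "Hello"))
	// Character position of the same match.
	fmt.Println(runePosition(s, "Hello"))
}
```

The match starts at byte 7 but at character 3. Treating one number as the other is where truncation bugs come from.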
Slice by index at your own risk. Range over the string instead.
Real-world validation and truncation
Real applications need to enforce character limits for passwords, bios, or database columns. The safe pattern uses a range loop, which automatically decodes UTF-8 and yields runes one by one. This avoids manual byte arithmetic and guarantees you never split a character.
package main

import (
	"errors"
	"fmt"
)

// ValidateUsername ensures the input stays within the character limit.
func ValidateUsername(name string, maxLen int) error {
	// Range over strings yields runes, not bytes.
	count := 0
	for range name {
		count++
		// Short-circuit on the first overflow to avoid unnecessary work.
		if count > maxLen {
			return errors.New("username exceeds maximum length")
		}
	}
	return nil
}

func main() {
	// Tests the validator against a multi-byte string
	err := ValidateUsername("山田太郎", 4)
	fmt.Println(err)
}
The range keyword handles the heavy lifting. It reads the byte sequence, decodes each rune, and hands you the rune along with the byte offset at which it starts. You get a clean count without importing unicode/utf8. The early return keeps the function fast for strings that exceed the limit.
An explicit loop keeps the boilerplate visible, which means the boundary condition is obvious to anyone reading the code. If you need the actual substring, one option is string([]rune(name)[:maxLen]), but the slice expression panics when the name has fewer than maxLen runes, so check the length first. Converting to a rune slice also copies the whole string into a newly allocated array. Reserve it for cases where you actually need to mutate or reorder characters.
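If all you need is a truncated copy, a range loop can find the cut point without a rune slice. The sketch below is one possible approach, not the only one, and truncateRunes is a hypothetical helper name.

```go
package main

import "fmt"

// truncateRunes returns at most maxRunes characters of s, always cutting on a
// rune boundary. The range loop yields the byte offset at which each rune
// starts, so the result is a plain substring and no []rune is allocated.
func truncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	count := 0
	for byteOff := range s {
		if count == maxRunes {
			// byteOff is the start of the first rune past the limit.
			return s[:byteOff]
		}
		count++
	}
	return s // s already fits within the limit
}

func main() {
	fmt.Println(truncateRunes("山田太郎", 2))
	fmt.Println(truncateRunes("José", 10))
}
```

The first call returns the first two characters. The second returns the name unchanged because it is already within the limit.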
Measure bytes for the machine. Count runes for the human.
Choosing the right ruler
Pick the right measurement tool based on what you are actually trying to do.
Use len() when you are measuring network payloads, file sizes, or buffer capacities. The operating system and transport layers care about bytes, not characters.
Use utf8.RuneCountInString() when you need a quick character count for logging, metrics, or simple validation. It avoids the allocation cost of converting the string to a rune slice.
Use a range loop when you need to inspect, filter, or transform characters individually. It decodes UTF-8 on the fly and gives you direct access to each code point.
Use []rune(s) when you must reorder, reverse, or slice a string by character index. The conversion allocates memory, so reserve it for cases where you actually need random access to runes.
Use ASCII-only constraints when performance is critical and you can guarantee the input charset. Restricting the domain eliminates the byte-rune mismatch entirely.
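As an illustration of the []rune case in the list above, here is a minimal sketch of reversing a string by code point. The helper name reverseRunes is hypothetical.

```go
package main

import "fmt"

// reverseRunes reverses s one code point at a time. Converting to []rune
// allocates, but gives the random access that reversal needs.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	fmt.Println(reverseRunes("Hello 世界"))
}
```

This works on code points, so a character built from a base letter plus a separate combining mark would have its parts reordered. For plain precomposed text like the example, the result is what you would expect.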