How to Handle Unicode and UTF-8 in Go

The character count that breaks your UI

You build a text field that limits input to twenty characters. You grab the length with len(input), slice it at index twenty, and save it to the database. The first few users type English. Everything works. Then someone pastes a Japanese greeting or a string of emojis. The UI shows half a character. The database rejects the payload with an encoding error. Your twenty-character limit was really a twenty-byte limit: it allowed twenty ASCII letters, but only six Japanese characters or five emojis.

Go does not hide Unicode from you. It forces you to choose what you are measuring. The language stores every string as a sequence of UTF-8 bytes. When you ask for the length, you get bytes. When you index, you get a single byte. When you iterate with range, you get Unicode code points. The gap between those two concepts is where most string bugs live.

Bytes, runes, and the tape measure

UTF-8 is a variable-width encoding. ASCII characters take one byte. Accented Latin letters such as é take two. Most CJK characters take three. Emojis and less common symbols take four. Go represents this reality directly: a string is an immutable sequence of bytes. There is no hidden character array underneath.
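The widths above are easy to check. The following is a minimal sketch; byteWidths is a hypothetical helper name, and the samples are written as escapes so each one is exactly one code point.

```go
package main

import "fmt"

// byteWidths is a hypothetical helper: it returns the UTF-8 size in bytes
// of each sample string.
func byteWidths(samples []string) []int {
	widths := make([]int, len(samples))
	for i, s := range samples {
		widths[i] = len(s) // len reports bytes, not characters
	}
	return widths
}

func main() {
	// ASCII letter, accented Latin letter, CJK character, emoji.
	samples := []string{"a", "\u00e9", "\u4e16", "\U0001F600"}
	for i, w := range byteWidths(samples) {
		fmt.Printf("%q is %d byte(s)\n", samples[i], w)
	}
}
```

The four samples report one, two, three, and four bytes respectively.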

The language provides rune as an alias for int32. A rune holds a single Unicode code point. When you convert a string to a slice of runes, Go decodes the UTF-8 bytes and packs each code point into a 32-bit integer. The conversion allocates memory and runs a decoder, but it gives you predictable indexing.
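To make the difference between byte indexing and rune indexing concrete, here is a small sketch. runeAt is an illustrative helper, not a standard library function.

```go
package main

import "fmt"

// runeAt is a hypothetical helper: it returns the i-th code point of s,
// or false if i is out of range.
func runeAt(s string, i int) (rune, bool) {
	runes := []rune(s) // decodes UTF-8 and allocates a new slice
	if i < 0 || i >= len(runes) {
		return 0, false
	}
	return runes[i], true
}

func main() {
	s := "Hello 世界"

	// Index 6 in rune space is the first Chinese character.
	r, ok := runeAt(s, 6)
	fmt.Printf("%c %v\n", r, ok)

	// Index 6 in byte space is only the first byte of that character.
	fmt.Printf("%#x\n", s[6])
}
```

The conversion runs on every call here, which is fine for a demonstration. In real code you would convert once and reuse the slice.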

Think of a string like a roll of film. len() measures the physical length of the roll in inches. That tells you how much shelf space it needs. It does not tell you how many photographs are on it. utf8.RuneCountInString() counts the photographs. range walks through the roll and hands you each photograph one by one.

Bytes measure storage. Runes measure meaning.

Counting and iterating

Here is the simplest way to see the difference between byte length and character count:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// Mixed ASCII and multi-byte characters to show encoding differences
	s := "Hello 世界"
	
	// len counts raw bytes, not visible characters
	fmt.Println("Bytes:", len(s))
	
	// RuneCountInString decodes UTF-8 and returns the code point count
	fmt.Println("Runes:", utf8.RuneCountInString(s))
	
	// range over a string automatically decodes UTF-8 and yields runes
	for _, r := range s {
		// %c prints the character, %04X prints the hex code point
		fmt.Printf("Rune: %c (U+%04X)\n", r, r)
	}
}

When the program runs, len(s) returns twelve. The word "Hello" takes five bytes, the space takes one, and each Chinese character takes three. Five plus one plus six equals twelve. utf8.RuneCountInString(s) returns eight. The loop runs eight times. Each iteration yields a rune value, which is just a 32-bit integer holding the Unicode code point.

The range keyword does the heavy lifting. It decodes the string the same way utf8.DecodeRuneInString does: it reads the first byte, checks the leading bits to determine the sequence length, consumes the continuation bytes, validates the encoding, and returns the decoded code point. If the string contains invalid UTF-8, range yields the replacement character U+FFFD and advances by one byte to avoid an infinite loop.
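The invalid-input behavior is worth seeing once. This sketch feeds range a string with a stray 0xFF byte, which can never appear in valid UTF-8. decodeAll is a hypothetical helper that just records what range yields.

```go
package main

import "fmt"

// decodeAll is a hypothetical helper: it walks s with range and records
// each yielded rune together with its starting byte offset.
func decodeAll(s string) (offsets []int, runes []rune) {
	for i, r := range s {
		offsets = append(offsets, i)
		runes = append(runes, r)
	}
	return offsets, runes
}

func main() {
	// 'a', one invalid byte, 'b', then a three-byte CJK character.
	s := "a\xffb\u4e16"

	offsets, runes := decodeAll(s)
	for i := range runes {
		fmt.Printf("offset %d: %U\n", offsets[i], runes[i])
	}
}
```

The invalid byte shows up as U+FFFD at offset 1, and the loop resumes at offset 2 with 'b'.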

Strings are immutable. You cannot change a character in place. Attempting to assign to an index like s[0] = 'X' triggers a compile-time error: cannot assign to s[0]. The compiler enforces this because strings are read-only headers pointing to backing arrays. If you need to modify content, convert to a slice first.
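A minimal sketch of the convert, edit, convert-back pattern follows. replaceFirstRune is an illustrative name, and going through []rune is one choice among several; []byte works when you only touch ASCII.

```go
package main

import "fmt"

// replaceFirstRune is a hypothetical helper: it returns a copy of s with
// its first code point replaced by r. The original string is untouched.
func replaceFirstRune(s string, r rune) string {
	runes := []rune(s) // mutable copy, one element per code point
	if len(runes) == 0 {
		return s
	}
	runes[0] = r
	return string(runes) // re-encode as UTF-8
}

func main() {
	original := "世界"
	changed := replaceFirstRune(original, 'X')
	fmt.Println(original, changed)
}
```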

Safe truncation in practice

Counting is straightforward. Slicing by character count requires care. Direct indexing cuts through byte sequences and produces invalid UTF-8. Here is a function that truncates to a maximum number of code points without breaking multi-byte characters:

package main

import (
	"fmt"
	"unicode/utf8"
)

// TruncateRune returns the first n runes from s as a new string.
// If s contains n or fewer runes, it returns s unchanged.
func TruncateRune(s string, n int) string {
	// Guard against negative or zero limits before doing any work
	if n <= 0 {
		return ""
	}

	// Convert to []rune to get predictable character-level indexing
	runes := []rune(s)

	// Nothing to cut: return the original string untouched
	if n >= len(runes) {
		return s
	}

	// string() re-encodes the first n runes into valid UTF-8
	return string(runes[:n])
}

func main() {
	// Test with a string that would break on byte slicing
	input := "Hello 世界"
	
	// Truncate to 7 code points: keeps "Hello 世", drops "界"
	result := TruncateRune(input, 7)
	
	// Verify the output length matches the requested rune count
	fmt.Println("Truncated:", result)
	fmt.Println("Rune count:", utf8.RuneCountInString(result))
}

The function converts the string to a []rune slice. That allocation is the price of character-level indexing. The slice expression runes[:n] counts in code points because every element is now a fixed-size 32-bit value. Converting back to string runs the UTF-8 encoder, which guarantees the output is valid; when the string is actually cut, any invalid bytes in the kept prefix come back as U+FFFD. The utf8 package provides fast, allocation-free helpers for counting, decoding, and validation. Random access to the nth code point needs the []rune conversion, while sequential work such as counting or finding a cut point can walk the string without allocating.
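Truncation only needs to find one cut point, so it can also be done by walking the string. The following is a sketch of that alternative, not a replacement for TruncateRune; truncateNoAlloc is a hypothetical name. It relies on the fact that range reports the byte offset of each code point and that slicing a string shares the original bytes.

```go
package main

import "fmt"

// truncateNoAlloc is a hypothetical helper: it returns the first n code
// points of s by slicing at a byte offset, without building a []rune.
func truncateNoAlloc(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset where each code point starts
		if count == n {
			return s[:i] // cut exactly on a code point boundary
		}
		count++
	}
	return s // s has n or fewer code points
}

func main() {
	fmt.Println(truncateNoAlloc("Hello 世界", 7))
	fmt.Println(truncateNoAlloc("Hello 世界", 50))
}
```

One difference to keep in mind: this version passes invalid bytes through untouched, while the []rune round trip rewrites them as U+FFFD. Which behavior you want depends on whether the input was validated earlier.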

Grapheme clusters complicate the picture further. A single visual character can consist of multiple code points. Emojis with skin tone modifiers, family emojis joined by zero-width joiners, and accented Latin letters written as a base letter plus combining marks all span more than one rune. utf8.RuneCountInString counts code points, not visual glyphs. If your UI requires exact visual width or glyph count, the standard library stops at runes. You will need a third-party package like github.com/rivo/uniseg to handle grapheme segmentation.
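The standard library cannot count graphemes, but it can show the mismatch. This sketch uses only unicode/utf8; sizes is a hypothetical helper, and the samples are written as escapes so the code points are explicit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes is a hypothetical helper: it returns the byte length and the
// code point count of s.
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := map[string]string{
		"precomposed e-acute":    "\u00e9",               // one code point
		"e + combining acute":    "e\u0301",              // base letter plus combining mark
		"thumbs up + skin tone":  "\U0001F44D\U0001F3FD", // emoji plus modifier
	}
	for name, s := range samples {
		b, r := sizes(s)
		fmt.Printf("%s: %d bytes, %d runes\n", name, b, r)
	}
}
```

All three samples render as one visual character, yet two of them report two runes. A rune-based limit would count them double.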

Strings are immutable. Copy to a slice before you mutate.

Where the standard library trips you up

The most common mistake is treating len() as a character counter. It is not. It measures bytes. Network buffers, file sizes, and HTTP headers care about bytes. User interfaces and database character limits care about runes. Mixing them up causes silent data corruption or rejected payloads.

Indexing a string with s[i] returns a byte. If i lands in the middle of a multi-byte sequence, you get a continuation byte that is not valid on its own. Passing that fragment to a decoder or printing it directly produces garbled output. The compiler will not stop you from indexing, but utf8.DecodeRuneInString(s[i:]) will return utf8.RuneError and a size of one.
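Here is a short sketch of what landing mid-sequence looks like, and one way to detect it. startsRune is a hypothetical helper built on utf8.RuneStart, which reports whether a byte could be the first byte of an encoded rune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// startsRune is a hypothetical helper: it reports whether byte offset i
// in s is the first byte of a UTF-8 sequence.
func startsRune(s string, i int) bool {
	return i >= 0 && i < len(s) && utf8.RuneStart(s[i])
}

func main() {
	s := "世界" // two code points, three bytes each

	// Offset 1 is a continuation byte, so decoding from there fails.
	r, size := utf8.DecodeRuneInString(s[1:])
	fmt.Println(r == utf8.RuneError, size)

	for i := 0; i < len(s); i++ {
		fmt.Printf("offset %d starts a rune: %v\n", i, startsRune(s, i))
	}
}
```

Only offsets 0 and 3 are safe places to cut this string.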

Another trap is assuming rune and byte are interchangeable. They are not. byte is an alias for uint8. rune is an alias for int32. Assigning a rune to a byte without explicit conversion triggers a type mismatch: cannot use r (variable of type rune) as byte value in assignment. The compiler forces you to acknowledge the width difference.
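An explicit conversion silences the compiler, but it does not make the value fit. The sketch below shows the truncation and one defensive pattern. asciiByte is a hypothetical helper; utf8.RuneSelf is the standard library constant below which code points encode as a single byte.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// asciiByte is a hypothetical helper: it converts r to a byte only when r
// is a single-byte (ASCII) code point, so no bits are silently dropped.
func asciiByte(r rune) (byte, bool) {
	if r < 0 || r >= utf8.RuneSelf {
		return 0, false
	}
	return byte(r), true
}

func main() {
	r := '世' // U+4E16

	// The explicit conversion compiles, but keeps only the low 8 bits.
	fmt.Printf("%#x\n", byte(r))

	b, ok := asciiByte('A')
	fmt.Println(b, ok)

	_, ok = asciiByte(r)
	fmt.Println(ok)
}
```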

Validation is often skipped until it breaks. Untrusted input from forms, APIs, or files may contain malformed UTF-8. Run utf8.ValidString(input) at the boundary, before any other processing. It scans the byte sequence and returns false if a continuation byte is missing or unexpected, if a sequence is an overlong encoding or a surrogate half, or if a code point exceeds the Unicode maximum. Invalid strings can cause errors in some downstream parsers and silently turn into replacement characters in others.
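A boundary check can look like the sketch below. Whether to reject or repair bad input is an application decision; this example repairs it with strings.ToValidUTF8 purely for illustration, and sanitize is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize is a hypothetical helper: it returns s unchanged when it is
// valid UTF-8. Otherwise it replaces each run of invalid bytes with a
// single U+FFFD and reports that a repair happened.
func sanitize(s string) (clean string, wasValid bool) {
	if utf8.ValidString(s) {
		return s, true
	}
	return strings.ToValidUTF8(s, "\uFFFD"), false
}

func main() {
	for _, in := range []string{"Hello 世界", "ok\xff\xfe!"} {
		out, valid := sanitize(in)
		fmt.Printf("%q valid=%v\n", out, valid)
	}
}
```

Rejecting the request outright is often the safer choice for structured data, since a repaired string no longer matches what the client sent.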

The worst string bug is the one that silently drops half a character. Validate early. Count explicitly.

Picking the right unit

Go gives you multiple ways to measure and manipulate text. The right choice depends on what you are trying to achieve.

Use len(s) when you need byte size for network buffers, file storage calculations, or HTTP header limits.

Use utf8.RuneCountInString(s) when you need a code point count for UI input limits, database column limits that are defined in characters, or rough display width estimates.

Use []rune(s) when you need random access to individual code points, want to slice by character count, or must reverse a string code point by code point.

Use range s when you want to iterate over every code point without manual decoding or index management.

Use utf8.ValidString(s) when you receive untrusted input and need to guarantee well-formed UTF-8 before passing it to parsers or databases.
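Reversal is the classic case where []rune earns its allocation. A minimal sketch, with reverseRunes as a hypothetical helper name:

```go
package main

import "fmt"

// reverseRunes is a hypothetical helper: it reverses s code point by
// code point, so multi-byte characters stay intact.
func reverseRunes(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func main() {
	fmt.Println(reverseRunes("Hello 世界"))
}
```

This is safe at the code point level only. A combining mark ends up in front of its base letter after reversal, so text with grapheme clusters needs a segmentation library instead.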

Pick the right unit for the job. Count bytes for the wire, count runes for the screen.

Where to go next

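Go treats text as a sequence of bytes that is UTF-8 by convention, and UTF-8 can encode every Unicode code point. The language does not convert other encodings for you, and it does not decide which unit you should count. If your input arrives as UTF-16, Latin-1, or Shift JIS, transcode it at the boundary; the golang.org/x/text/encoding packages cover that. If you need to compare strings that may mix precomposed and combining forms, look at golang.org/x/text/unicode/norm. Everything else comes down to the choice this article started with: know whether you are counting bytes, code points, or graphemes.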