How to Use the unicode and unicode/utf8 Packages

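The unicode and unicode/utf8 packages ship with the Go standard library. They provide functions to validate, count, and classify characters without manual configuration, using the Unicode tables bundled with your Go toolchain (Unicode 17 at the time of writing; the unicode.Version constant reports the exact version).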

The byte vs rune trap

You build a comment form that limits text to 280 characters. The frontend counts 280 and lets the user submit. The backend rejects the payload. You check the database and see the string is 500 bytes long. The user is confused. The bug isn't in your logic. The bug is that Go strings are bytes, and your frontend counted characters while your backend counted bytes.

Go's string type is an immutable sequence of bytes. It does not know about characters. When you write len("Hello, 世界"), Go returns 13, not 9. The characters '世' and '界' each take three bytes in UTF-8 encoding. If you slice the string at byte index 11, you cut '界' in half and produce invalid UTF-8 data.
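The effect of a badly placed cut can be checked directly. The sketch below uses a hypothetical helper, cutIsValid (not part of any library), to test whether a byte-index slice still holds well-formed UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutIsValid is a hypothetical helper for illustration only.
// It reports whether the prefix s[:n] is still well-formed UTF-8.
func cutIsValid(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

func main() {
	s := "Hello, 世界"

	// "Hello, " is 7 bytes; '世' occupies bytes 7-9 and '界' bytes 10-12.
	fmt.Println(len(s), utf8.RuneCountInString(s)) // prints 13 9

	fmt.Println(cutIsValid(s, 10)) // prints true: byte 10 starts '界'
	fmt.Println(cutIsValid(s, 11)) // prints false: the cut lands inside '界'
}
```

A cut is only safe when it lands on the first byte of a rune or at the end of the string.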

The unicode/utf8 package handles the byte-level mechanics of UTF-8 encoding. It validates strings, counts characters, and decodes runes without allocating memory. The unicode package classifies runes by their properties, telling you if a character is a letter, digit, or space according to the Unicode standard. Together, they bridge the gap between raw bytes and human-readable text.

UTF-8 in Go: bytes first, runes second

UTF-8 is a variable-length encoding. ASCII characters use one byte. Characters from other scripts use two, three, or four bytes. Go stores strings as UTF-8 byte sequences. A rune in Go is an alias for int32 and represents a single Unicode code point.
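To see the variable width in practice, utf8.RuneLen reports how many bytes a code point needs. This is a small sketch; encodedWidth is a hypothetical wrapper added only to give the call a descriptive name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedWidth is a hypothetical wrapper for illustration.
// utf8.RuneLen returns the number of bytes needed to encode r,
// or -1 if r is not a value that can be encoded in UTF-8.
func encodedWidth(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One rune from each width class: ASCII, Latin-1 supplement,
	// CJK, and a code point outside the Basic Multilingual Plane.
	samples := []rune{'A', '\u00e9', '世', '\U0001F600'}
	for _, r := range samples {
		fmt.Printf("%c U+%04X %d\n", r, r, encodedWidth(r))
	}
	// Output:
	// A U+0041 1
	// é U+00E9 2
	// 世 U+4E16 3
	// 😀 U+1F600 4
}
```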

The unicode/utf8 package provides functions that operate directly on strings or byte slices. These functions scan the byte stream, interpret the UTF-8 structure, and return results without creating intermediate allocations. This makes them fast and safe for high-throughput code.

The unicode package works with runes. It exposes functions like IsLetter, IsDigit, and SimpleFold that check or map Unicode properties. These functions help you classify characters and, together with strings.EqualFold, handle case-insensitive comparisons across scripts.

Here's the difference between byte length and character count.

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "Hello, 世界"

	// len counts bytes, not characters.
	// '世' and '界' are 3 bytes each in UTF-8.
	fmt.Println(len(s)) // prints 13 (7 ASCII bytes + 2 × 3 bytes)

	// RuneCountInString decodes the UTF-8 stream to count code points.
	// It scans the string without allocating memory.
	count := utf8.RuneCountInString(s)
	fmt.Println(count) // prints 9
}

Bytes are for the machine. Runes are for the user. Know the difference.

Validating and counting without allocation

Network data, user input, and file contents can contain malformed UTF-8. If you pass invalid bytes to a function that expects valid text, you risk silent data corruption or panics downstream. The utf8.ValidString function checks if a string contains only well-formed UTF-8 sequences. It scans the bytes, verifies start bytes and continuation bytes, and rejects overlong encodings. It returns a boolean and never allocates.

When you need to extract runes, utf8.DecodeRuneInString returns the rune and its width in bytes. If the string starts with an invalid sequence, it returns utf8.RuneError and a width of one; for an empty string it returns utf8.RuneError and a width of zero. A correctly encoded U+FFFD also decodes to utf8.RuneError, but with a width of three, so check the width to tell the cases apart. This sentinel value lets you handle bad input inside a tight loop without a separate error return.

Here's how to validate input and decode safely without allocating memory.

package main

import (
	"fmt"
	"unicode/utf8"
)

// ValidateInput checks if a string contains valid UTF-8 data.
// It returns true only if every byte sequence is a well-formed rune.
func ValidateInput(data string) bool {
	// ValidString scans the string's bytes without allocating.
	// It rejects overlong encodings and invalid continuation bytes.
	return utf8.ValidString(data)
}

// FirstRune extracts the first code point from a string.
// It handles multi-byte characters correctly.
func FirstRune(s string) rune {
	// DecodeRuneInString returns the rune and its width in bytes.
	// If the string is empty, it returns utf8.RuneError (U+FFFD) with width 0.
	// We discard the width using _ because we only need the rune.
	r, _ := utf8.DecodeRuneInString(s)
	return r
}

func main() {
	valid := "café"
	invalid := "café\x80" // stray continuation byte with no lead byte

	fmt.Println(ValidateInput(valid))   // prints true
	fmt.Println(ValidateInput(invalid)) // prints false

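	// A rune is an int32, so convert it to a string before printing;
	// fmt.Println on the bare rune would print the number 19990.
	fmt.Println(string(FirstRune("世界"))) // prints 世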
}

The underscore _ discards the width value intentionally. It tells the compiler and other readers that you considered the second return value and chose to drop it. Use it sparingly, but it's the right tool here.
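When the width does matter, a decode loop can use it to tell a malformed byte from a legitimately encoded U+FFFD. The following is a minimal sketch; countInvalidBytes is a hypothetical helper name, not a library function.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalidBytes is a hypothetical helper for illustration.
// It walks s one rune at a time and counts the bytes that are not
// part of a well-formed UTF-8 sequence.
func countInvalidBytes(s string) int {
	bad := 0
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// RuneError with width 1 marks a malformed byte.
		// A correctly encoded U+FFFD decodes with width 3.
		if r == utf8.RuneError && size == 1 {
			bad++
		}
		i += size
	}
	return bad
}

func main() {
	fmt.Println(countInvalidBytes("caf\u00e9"))  // prints 0
	fmt.Println(countInvalidBytes("a\x80b\xff")) // prints 2
	fmt.Println(countInvalidBytes("\uFFFD"))     // prints 0: a real U+FFFD is valid
}
```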

Invalid UTF-8 is a silent data corruption waiting to happen. Validate early.

The unicode package: properties and classification

The unicode package classifies runes based on Unicode properties. Functions like IsLetter, IsDigit, IsSpace, and IsPunct check if a rune belongs to a specific category. These functions work with all Unicode scripts, not just ASCII. The letter '世' is a letter. The digit '٣' (Arabic-Indic digit three) is a digit.

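Case handling is another area where the standard library helps. strings.ToLower maps each rune to its Unicode lower case, but lowercasing both sides is not the same as case folding, because some characters have more than two case variants. unicode.SimpleFold steps through the runes that Unicode's simple case folding treats as equivalent, and strings.EqualFold builds on the same folding to compare two strings case-insensitively across scripts. Both work one code point at a time, so multi-character mappings such as 'ß' versus "SS" are out of scope.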
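Case-insensitive comparison in the standard library rests on unicode.SimpleFold and strings.EqualFold. The sketch below walks the set of case variants for one rune and compares a few strings; foldOrbit and sameIgnoringCase are hypothetical helper names introduced for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// foldOrbit is a hypothetical helper for illustration.
// It returns every rune that simple case folding treats as
// equivalent to r, starting with r itself.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	// SimpleFold cycles through the equivalents and eventually
	// returns to the starting rune.
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

// sameIgnoringCase is a thin, hypothetical wrapper around strings.EqualFold.
func sameIgnoringCase(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	// 'k' folds with the Kelvin sign (U+212A) and with 'K'.
	fmt.Printf("%U\n", foldOrbit('k')) // prints [U+006B U+212A U+004B]

	fmt.Println(sameIgnoringCase("Go", "GO")) // prints true

	// Simple folding is rune-by-rune, so 'ß' never matches "SS".
	fmt.Println(sameIgnoringCase("Straße", "STRASSE")) // prints false
}
```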

Here's how to classify runes and check alphanumeric content using Unicode properties.

package main

import (
	"fmt"
	"unicode"
)

// IsAlphanumeric checks if a rune is a letter or a digit.
// It uses Unicode properties, not ASCII ranges.
func IsAlphanumeric(r rune) bool {
	// IsLetter includes characters from all scripts, not just A-Z.
	// IsDigit covers decimal digits in every script (category Nd),
	// including full-width forms. It excludes fractions and numerals.
	return unicode.IsLetter(r) || unicode.IsDigit(r)
}

func main() {
	// Test with ASCII
	fmt.Println(IsAlphanumeric('A')) // prints true
	fmt.Println(IsAlphanumeric('1')) // prints true

	// Test with non-ASCII Unicode characters
	fmt.Println(IsAlphanumeric('世')) // prints true
	fmt.Println(IsAlphanumeric('é')) // prints true
}
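The boundary between digits and other numeric characters is easy to get wrong, so here is a small sketch that contrasts unicode.IsDigit with unicode.IsNumber. The classify function and its labels are hypothetical, chosen only for this illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper for illustration.
// It separates decimal digits (category Nd) from other numeric
// characters such as fractions and Roman numerals.
func classify(r rune) string {
	switch {
	case unicode.IsDigit(r):
		return "decimal digit"
	case unicode.IsNumber(r):
		return "other number"
	case unicode.IsLetter(r):
		return "letter"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classify('7'))      // prints decimal digit
	fmt.Println(classify('\u0663')) // Arabic-Indic three: prints decimal digit
	fmt.Println(classify('\uFF17')) // full-width seven: prints decimal digit
	fmt.Println(classify('\u00BD')) // one half: prints other number
	fmt.Println(classify('\u2167')) // Roman numeral eight: prints other number
	fmt.Println(classify('世'))      // prints letter
}
```

If the input should be limited to characters you can parse as a base-10 number, IsDigit is the narrower and safer check.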

The Unicode tables are bundled with the standard library, so they update with the Go toolchain. When a Go release adopts a newer Unicode version, you get it by upgrading and rebuilding, without changing code. Your program reflects the Unicode version of the Go release it was built with, which the unicode.Version constant reports.
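To check which Unicode version a given build uses, print the unicode.Version constant next to the toolchain version. This is a minimal sketch; unicodeVersion is a hypothetical wrapper, and the output depends on the Go release you build with.

```go
package main

import (
	"fmt"
	"runtime"
	"unicode"
)

// unicodeVersion is a hypothetical wrapper for illustration.
// unicode.Version is the Unicode edition the package tables derive from.
func unicodeVersion() string {
	return unicode.Version
}

func main() {
	// The printed values vary by toolchain, so no fixed output is shown.
	fmt.Printf("%s uses Unicode %s tables\n", runtime.Version(), unicodeVersion())
}
```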

Don't hardcode ASCII ranges. Use Unicode properties.

Realistic example: safe text truncation

Truncating text is a common task for UIs and databases. If you slice a string by byte index, you risk cutting a rune in half. The result is invalid UTF-8 that breaks downstream parsers. A safe truncator counts runes and stops at a rune boundary.

Here's a function that truncates text to a maximum number of characters while preserving UTF-8 validity.

package main

import (
	"fmt"
	"unicode/utf8"
)

// TruncateRune returns a prefix of s limited to maxRunes characters.
// It never splits a code point; the result is valid UTF-8 whenever s is.
func TruncateRune(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}

	// Count runes and find the byte index for the limit.
	// We iterate using DecodeRuneInString to advance by character width.
	count := 0
	byteIndex := 0
	for byteIndex < len(s) {
		// DecodeRuneInString returns the rune and its byte width.
		// We discard the rune because we only need the width.
		_, size := utf8.DecodeRuneInString(s[byteIndex:])
		byteIndex += size
		count++

		// Stop after reaching the rune limit.
		if count == maxRunes {
			break
		}
	}

	// Return the substring up to the calculated byte index.
	return s[:byteIndex]
}

func main() {
	text := "Hello, 世界"
	// Limit to 5 characters: "Hello"
	fmt.Println(TruncateRune(text, 5))
	// Limit to 8 characters: "Hello, 世"
	fmt.Println(TruncateRune(text, 8))
}

This function iterates through the string, decoding one rune at a time. It accumulates the byte index and stops when it reaches the rune limit. Because the cut always lands on a rune boundary, the result is valid UTF-8 whenever the input is; invalid bytes in the input count as one rune each and pass through unchanged. The function handles empty strings, strings shorter than the limit, and strings with multi-byte characters.
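Storage limits are often expressed in bytes rather than characters. A byte-budget variant can back up from the limit to the nearest rune boundary using utf8.RuneStart. The sketch below assumes the input is already valid UTF-8; truncateBytes is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes is a hypothetical helper for illustration.
// It returns the longest prefix of s that fits in maxBytes bytes
// without splitting a rune. It assumes s is valid UTF-8.
func truncateBytes(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	// Step back until the byte at the cut position starts a rune,
	// so the prefix ends on a rune boundary.
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	text := "Hello, 世界" // 13 bytes

	fmt.Println(truncateBytes(text, 11)) // prints Hello, 世
	fmt.Println(truncateBytes(text, 13)) // prints Hello, 世界
}
```

The loop inspects at most three bytes, since a UTF-8 sequence is never longer than four.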

Allocation kills performance. Decode in place.

Pitfalls and compiler errors

The most common mistake is using len(s) to count characters. len returns the byte length. If you need the character count, use utf8.RuneCountInString. The compiler won't stop you, but your logic will break on non-ASCII input.

Converting a string to a []rune slice allocates memory. The expression []rune(s) creates a new slice and copies every rune. For a 1 MB string, this can allocate 4 MB of memory. In a hot loop, this triggers the garbage collector. Use utf8 functions to iterate or count without allocation.
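As a rough illustration of the two approaches, the sketch below counts letters twice: once by ranging over the string directly and once after converting to []rune. Both function names are hypothetical. The results are identical; only the second one asks for a copy of the data.

```go
package main

import (
	"fmt"
	"unicode"
)

// countLetters is a hypothetical helper for illustration.
// Ranging over a string decodes one rune per iteration in place.
func countLetters(s string) int {
	n := 0
	for _, r := range s {
		if unicode.IsLetter(r) {
			n++
		}
	}
	return n
}

// countLettersCopy does the same work after converting to []rune,
// which copies every code point into a new slice first.
func countLettersCopy(s string) int {
	n := 0
	runes := []rune(s)
	for _, r := range runes {
		if unicode.IsLetter(r) {
			n++
		}
	}
	return n
}

func main() {
	s := "Hello, 世界"
	fmt.Println(countLetters(s))     // prints 7
	fmt.Println(countLettersCopy(s)) // prints 7
}
```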

Looping over a string with for range yields utf8.RuneError (U+FFFD) for every byte that is not part of a valid UTF-8 sequence. If you rebuild text from those runes, you lose the original bytes. If data integrity matters, validate the string first with utf8.ValidString.
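The loss is easy to demonstrate by rebuilding a string from the runes that for range produces. This is a minimal sketch; rebuild is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// rebuild is a hypothetical helper for illustration.
// It reassembles s from the runes produced by a for range loop.
func rebuild(s string) string {
	var b strings.Builder
	for _, r := range s {
		// An invalid byte arrives here as U+FFFD, so the
		// original byte value is no longer available.
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	in := "a\x80b" // 3 bytes, the middle one is invalid UTF-8
	out := rebuild(in)

	fmt.Println(len(in), len(out))  // prints 3 5
	fmt.Println(out == "a\uFFFDb") // prints true
}
```

The single bad byte became the three-byte encoding of U+FFFD, so the output is valid UTF-8 but no longer matches the input.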

The compiler enforces type safety between strings and runes. You cannot pass a string to a function expecting a rune; the compiler rejects the call with an error along the lines of cannot use s (variable of type string) as rune value in argument, with the exact wording depending on the Go version. You also cannot write a rune literal with multiple characters: 'ab' fails with an error such as more than one character in rune literal. Use 'a' for a rune or "ab" for a string.

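When you compare strings for equality, Go compares bytes. "café" written with the precomposed é (U+00E9) and "café" written as e followed by a combining acute accent (U+0301) look identical but are different byte sequences, so they compare as unequal. Case folding does not help here. The strings need to be brought into the same normalization form first, for example with the golang.org/x/text/unicode/norm package, which lives outside the standard library.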
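The mismatch between composed and decomposed forms can be shown with the standard library alone. The sketch below only demonstrates the problem; fixing it needs a normalization step that the standard library does not provide. The describe helper is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper for illustration.
// It returns the byte length and the code point count of s.
func describe(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	composed := "caf\u00e9"    // é as one code point, U+00E9
	decomposed := "cafe\u0301" // e followed by combining acute, U+0301

	fmt.Println(describe(composed))     // prints 5 4
	fmt.Println(describe(decomposed))   // prints 6 5
	fmt.Println(composed == decomposed) // prints false
}
```

Both strings render the same way on screen, which is why this class of bug tends to surface only in lookups and deduplication.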

The worst string bug is the one that silently corrupts data. Validate input at the boundary.

Decision matrix

Use utf8.ValidString when you need to verify byte-level encoding integrity before processing text. Use utf8.RuneCountInString when you need the character count without allocating memory. Use utf8.DecodeRuneInString when you need to iterate over a string or extract runes from it. Use unicode.IsLetter or unicode.IsDigit when you need to classify a rune by its Unicode properties. Use strings.EqualFold, or unicode.SimpleFold for single runes, when you need case-insensitive comparison across scripts. Use []rune(s) when you need random access to characters or frequent mutation, and allocation cost is acceptable. Use len(s) when you need the byte size for storage or network transmission calculations.

Trust the standard library. The unicode and unicode/utf8 packages are optimized, well tested, and maintained alongside the Go toolchain.

Where to go next

These packages help your Go program handle text from any language correctly. The unicode package tells you what kind of character you have (like a letter or number), while unicode/utf8 helps you count and check text safely. Think of it as a built-in translator that ensures your code understands emojis, Chinese characters, and special symbols just as well as English letters.