How to Work with Character Encodings (UTF-8, ISO-8859-1) in Go

The UTF-8 default

You open a CSV exported from a legacy banking system. The names look fine in the old app, but your Go program prints Müller as MÃ¼ller. Or the file crashes your parser with a malformed UTF-8 sequence. You didn't do anything wrong. The file just speaks a different language than Go expects.

Go treats all source code and string literals as UTF-8. When you write "hello", the compiler stores that as UTF-8 bytes. Most of the standard library assumes that string values contain valid UTF-8. The type system won't stop you from creating a string with invalid bytes, and nothing crashes when you do. Range loops and []rune conversions yield the replacement character U+FFFD for each bad byte, while byte-oriented functions pass the garbage through untouched. Garbage in, garbage out, silently.

Go strings are UTF-8 by convention, not by enforcement. A string can hold any bytes, but nearly every text function expects UTF-8. If your data comes from the outside world in ISO-8859-1, Windows-1252, or Shift-JIS, you must convert it to UTF-8 before processing. Converting at the boundary keeps the rest of your code simple and safe.
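A minimal sketch of what happens when non-UTF-8 bytes end up in a string. The helper name describe is made up for this illustration; the rest is standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports the byte length of s,
// whether s is valid UTF-8, and the runes a range loop produces.
func describe(s string) (byteLen int, valid bool, runes []rune) {
	for _, r := range s {
		runes = append(runes, r)
	}
	return len(s), utf8.ValidString(s), runes
}

func main() {
	// "M\xfcller" is Müller as ISO-8859-1 bytes. 0xFC is not valid UTF-8.
	n, ok, runes := describe("M\xfcller")
	fmt.Println(n, ok)           // prints: 6 false
	fmt.Printf("%U\n", runes[1]) // prints: U+FFFD
}
```

The program keeps running. The ü is simply gone, replaced by U+FFFD wherever the string is decoded.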

Bytes, runes, and the string type

A Go string is a read-only sequence of bytes, represented as a pointer and a length. The type is immutable. You cannot change the bytes inside a string. When you pass a string to a function, Go copies the pointer and length, not the underlying bytes. Strings are cheap to pass by value. Don't pass a *string. The pointer adds indirection without saving memory.

A rune is an alias for int32. It represents a Unicode code point. The letter ü is a single rune with value 0x00FC. In UTF-8, that rune encodes as two bytes: 0xC3 0xBC. Confusing bytes and runes is the most common source of bugs in Go text processing.

package main

import "fmt"

func main() {
	// "Müller" contains 6 runes but 7 bytes.
	// The ü encodes as two bytes in UTF-8.
	s := "Müller"

	// len returns the byte count, not the rune count.
	fmt.Println(len(s)) // prints: 7

	// range over a string yields runes.
	// i is the byte index, r is the rune value.
	for i, r := range s {
		fmt.Printf("index %d: rune %c (0x%x)\n", i, r, r)
	}
}

The range loop hides the complexity of UTF-8 decoding. It iterates over runes, skipping the internal byte structure. The index i advances by the number of bytes consumed for each rune. If you need the character count, use utf8.RuneCountInString. If you need to manipulate individual characters, convert to []rune. The conversion allocates a new slice and decodes every rune.

package main

import "fmt"

func main() {
	s := "Müller"

	// Convert to []rune to get a slice of code points.
	// This allocates memory and decodes the string.
	runes := []rune(s)

	// Now len returns the rune count.
	fmt.Println(len(runes)) // prints: 6

	// Access by rune index, not byte index.
	fmt.Println(string(runes[3])) // prints: l
}

A string is a byte slice with a length. A rune is a number. Confusing them breaks your code.

Decoding legacy data

When you read data from a file, database, or network, you often get raw bytes. If the source uses ISO-8859-1, those bytes map to characters differently than UTF-8. The byte 0xFC is ü in ISO-8859-1. In UTF-8, 0xFC never appears in a valid sequence. If you cast the bytes directly to a string, you get a string containing invalid UTF-8. Rune-aware code such as strings.ToUpper or a regexp match then sees the replacement character U+FFFD instead of ü, so comparisons and matches silently go wrong.

Use the golang.org/x/text/encoding/charmap package to convert byte slices. The package provides decoders for common single-byte legacy encodings. A decoder transforms bytes from the source encoding to UTF-8. The result is a new byte slice containing valid UTF-8.

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// ISO-8859-1 encodes ü as the single byte 0xFC.
	// Go strings expect UTF-8, where ü is two bytes: 0xC3 0xBC.
	raw := []byte{0x4D, 0xFC, 0x6C, 0x6C, 0x65, 0x72}

	// NewDecoder returns an *encoding.Decoder that maps ISO-8859-1 bytes to UTF-8.
	// The decoder wraps a transform.Transformer backed by the encoding table.
	decoder := charmap.ISO8859_1.NewDecoder()

	// Bytes runs the transformation on the whole slice.
	// It returns the UTF-8 result and an error.
	utf8Bytes, err := decoder.Bytes(raw)
	if err != nil {
		panic(err)
	}

	// string() converts []byte to string.
	// The result is valid UTF-8 and safe for all standard library text functions.
	text := string(utf8Bytes)
	fmt.Println(text) // prints: Müller
}

The compiler rejects charmap.ISO8859_1 with undefined: charmap if you forget the import. Go does not bundle golang.org/x/text in the standard library. Add the module to your project with go get golang.org/x/text. If you pass a byte slice where a string is expected, the compiler reports a type mismatch along the lines of cannot use raw (variable of type []byte) as string value in argument; the exact wording depends on the Go version. Convert explicitly with string(...), and only after the bytes have been decoded to UTF-8.

Transform at the boundary. Keep the rest of your code in UTF-8.

Walking through the transformation

At compile time, Go checks that charmap.ISO8859_1 exists and type-checks the call to NewDecoder, which returns an *encoding.Decoder wrapping a transform.Transformer. It does not validate the bytes. At runtime, decoder.Bytes iterates over the input slice. For each byte, it looks up the mapping table.

Bytes in the range 0x00 to 0x7F are ASCII. The decoder passes them through unchanged. For byte 0xFC, the decoder finds the Unicode code point U+00FC. It emits the UTF-8 sequence 0xC3 0xBC. The result is a new []byte containing the UTF-8 data. The original slice remains unchanged.

Converting to string copies the UTF-8 bytes into a new string. The string is immutable. You can pass it to any function that expects text. The strings package, regexp package, and fmt package all assume UTF-8. They will work correctly on the decoded string.

If the input contains bytes that have no mapping in the source encoding, the x/text decoders substitute the replacement character U+FFFD instead of returning an error. ISO-8859-1 maps every byte to a code point, so its decoder never hits that case. The other direction is different: encoding UTF-8 to ISO-8859-1 returns an error if the string contains characters outside the ISO-8859-1 range.

Real-world stream processing

Real data comes from files or networks, not hardcoded slices. You cannot always buffer the entire payload in memory. Use transform.NewReader to wrap an io.Reader. The wrapper reads from the underlying reader, transforms the bytes, and outputs UTF-8. Downstream code sees a UTF-8 stream without knowing the source encoding.

package main

import (
	"fmt"
	"io"
	"os"

	"golang.org/x/text/encoding/charmap"
	"golang.org/x/text/transform"
)

// ReadISOFile reads a file encoded in ISO-8859-1 and returns a UTF-8 string.
// It handles the full file read and transformation in one pass.
func ReadISOFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	// Wrap the file reader with a transformer.
	// The transformer reads from f and outputs UTF-8 bytes.
	decoder := charmap.ISO8859_1.NewDecoder()
	reader := transform.NewReader(f, decoder)

	// ReadAll consumes the transformed stream and buffers all of it in memory.
	// For inputs too large for that, read from reader incrementally instead.
	data, err := io.ReadAll(reader)
	if err != nil {
		return "", err
	}

	return string(data), nil
}

func main() {
	text, err := ReadISOFile("legacy.csv")
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}

The if err != nil pattern is verbose by design. The Go community accepts the boilerplate because it makes the error path visible. Every function that can fail returns an error. Check it immediately. Context is plumbing, but io.Reader has no context parameter. If the read might hang on a slow network, set a deadline on the connection or wrap the reader in one that checks a context.Context between reads.

The receiver name convention applies to methods. If you add a method to a type, use a short name matching the type. (r *Reader) Read(...) is standard. Not (this *Reader) or (self *Reader). Public names start with a capital letter. Private names start lowercase. No keywords like public or private.

Validate early. Trust nothing from the outside world.

Pitfalls and validation

Double encoding is the worst nightmare. If you decode UTF-8 bytes as ISO-8859-1, each byte of a multi-byte sequence becomes its own character, and the result is mojibake. The two bytes of ü, 0xC3 and 0xBC, turn into Ã and ¼, so Müller becomes MÃ¼ller. Once data is double-encoded, recovering the original text is hard, especially after several rounds or a lossy step in between. Prevent this by enforcing UTF-8 at the input boundary. Log the encoding of every external source.
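The mangling and a best-effort repair can be reproduced in a few lines. This sketch assumes exactly one round of ISO-8859-1 misdecoding. Real-world mojibake often involves Windows-1252 instead, which maps 0x80 to 0x9F differently and needs its own table. Both function names are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// misdecodeAsLatin1 treats each byte of s as an ISO-8859-1 character
// and re-encodes the result as UTF-8. This is the double-encoding bug.
func misdecodeAsLatin1(s string) string {
	out := make([]byte, 0, 2*len(s))
	for i := 0; i < len(s); i++ {
		out = utf8.AppendRune(out, rune(s[i]))
	}
	return string(out)
}

// undoLatin1 reverses one round of misdecodeAsLatin1. It reports false
// when a rune does not fit in a byte or the result is not valid UTF-8.
func undoLatin1(s string) (string, bool) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return "", false
		}
		out = append(out, byte(r))
	}
	if !utf8.Valid(out) {
		return "", false
	}
	return string(out), true
}

func main() {
	broken := misdecodeAsLatin1("Müller")
	fmt.Println(broken) // prints: MÃ¼ller

	fixed, ok := undoLatin1(broken)
	fmt.Println(fixed, ok) // prints: Müller true
}
```

The validity check protects text that was never mangled: undoLatin1("Müller") reports false because the byte 0xFC on its own is not valid UTF-8. Pure ASCII passes through unchanged.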

The string(byteSlice) conversion does not validate UTF-8. It just copies bytes. If the bytes are invalid, the string contains invalid UTF-8. Use utf8.ValidString to check. If the string is invalid, fix the data or replace bad sequences. The golang.org/x/text/encoding package offers two ready-made strategies: encoding.Replacement, whose encoder rewrites invalid UTF-8 as U+FFFD, and encoding.UTF8Validator, a transformer that returns an error at the first invalid byte.

package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// Simulate invalid UTF-8 by mixing bytes.
	// 0xC3 is a valid start byte, but 0x28 is not a continuation byte.
	invalid := []byte{0x48, 0x65, 0x6C, 0x6C, 0xC3, 0x28, 0x6F}

	s := string(invalid)

	// ValidString checks if the string contains only valid UTF-8.
	if !utf8.ValidString(s) {
		fmt.Println("Invalid UTF-8 detected")
	}

	// To fix, replace invalid sequences with the Unicode replacement character.
	// This makes the string safe for processing.
	fixed := strings.ToValidUTF8(s, "\uFFFD")
	fmt.Println(fixed)
}

The compiler rejects utf8.ValidString with undefined: utf8 if you forget the import. The unicode/utf8 package is in the standard library, so there is nothing to install. ValidString takes a string, and passing a []byte is a type error. For byte slices, call utf8.Valid instead of converting first.
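When validation fails, a log line is more useful if it says where. The sketch below locates the first bad byte with utf8.DecodeRune; firstInvalid is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte offset of the first invalid UTF-8
// sequence in b, or -1 if b is valid.
func firstInvalid(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// RuneError with size 1 signals a decoding failure. A literal
		// U+FFFD in the input decodes with size 3 and is not an error.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	invalid := []byte{0x48, 0x65, 0x6C, 0x6C, 0xC3, 0x28, 0x6F}
	fmt.Println(utf8.Valid(invalid), firstInvalid(invalid)) // prints: false 4

	fmt.Println(firstInvalid([]byte("Müller"))) // prints: -1
}
```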

Goroutines are cheap. Channels are not magic. If you spawn a goroutine to decode a stream, ensure the channel closes when the stream ends. A goroutine leak happens when the goroutine waits on a channel that never gets closed. Always have a cancellation path.

Decision matrix

Use string(byteSlice) when the bytes are already valid UTF-8 and you just need a string type.

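Use charmap decoders when reading single-byte legacy formats like ISO-8859-1 or Windows-1252. For multi-byte encodings such as Shift-JIS, use the sibling package golang.org/x/text/encoding/japanese.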

Use utf8.DecodeRuneInString when you need to inspect individual code points without converting the whole string.

Use transform.NewReader when processing streams where you cannot buffer the entire payload in memory.

Use golang.org/x/text/encoding/htmlindex when parsing HTML that might declare its own charset in a meta tag: htmlindex.Get turns the declared label into the matching encoding.

Use plain UTF-8 strings when building new systems: force the encoding at the boundary and never look back.
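As an illustration of the utf8.DecodeRuneInString entry above, here is a sketch that walks a string one code point at a time and records how many bytes each one occupies. runeWidths is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeWidths returns the encoded width in bytes of each code point in s.
// It decodes in place and never converts s to []rune.
func runeWidths(s string) []int {
	var widths []int
	for len(s) > 0 {
		_, size := utf8.DecodeRuneInString(s)
		widths = append(widths, size)
		s = s[size:] // advance past the rune just decoded
	}
	return widths
}

func main() {
	fmt.Println(runeWidths("Müller")) // prints: [1 2 1 1 1 1]
}
```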

Trust gofmt. Argue logic, not formatting. The tool decides indentation and layout. Most editors run it on save.

Where to go next

Go assumes everything is UTF-8, which is the modern standard for text. If you have old files or data in formats like ISO-8859-1, you must translate them into UTF-8 before Go can read them correctly. Think of it like a translator converting a foreign language document into English so your team can understand it.