How to Read and Write Parquet Files in Go

The Go standard library does not natively support Parquet files; you must use a third-party library like `github.com/xitongsys/parquet-go`. Install the library, import its `writer` and `reader` packages, and use `writer.NewParquetWriter` to write and `reader.NewParquetReader` to read data.

The columnar advantage

You have a dataset with ten million rows. You need to calculate the average age. A CSV file forces you to read every single line, parse every column, and discard the ones you don't need. The CPU cycles wasted on parsing names and emails add up. The disk I/O reads the entire file even though you only care about one column. Parquet stores data by column. You can read just the age column and ignore the rest. The file is smaller, faster to read, and easier to compress.
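The difference between the two layouts can be sketched in plain Go, without any Parquet library. The Row and Columns types and the toColumns and averageAge helpers below are hypothetical names invented for this illustration; the point is only that the average is computed from one slice while names and emails are never touched.

```go
package main

import "fmt"

// Row is a hypothetical record type used only for this illustration.
type Row struct {
	Name  string
	Email string
	Age   int32
}

// Columns holds the same data with one slice per field,
// which is roughly how a columnar format arranges values.
type Columns struct {
	Names  []string
	Emails []string
	Ages   []int32
}

// toColumns pivots row-oriented data into column-oriented slices.
func toColumns(rows []Row) Columns {
	c := Columns{
		Names:  make([]string, 0, len(rows)),
		Emails: make([]string, 0, len(rows)),
		Ages:   make([]int32, 0, len(rows)),
	}
	for _, r := range rows {
		c.Names = append(c.Names, r.Name)
		c.Emails = append(c.Emails, r.Email)
		c.Ages = append(c.Ages, r.Age)
	}
	return c
}

// averageAge visits only the age column; it returns 0 for an empty column.
func averageAge(ages []int32) float64 {
	if len(ages) == 0 {
		return 0
	}
	var sum int64
	for _, a := range ages {
		sum += int64(a)
	}
	return float64(sum) / float64(len(ages))
}

func main() {
	rows := []Row{
		{"Alice", "alice@example.com", 30},
		{"Bob", "bob@example.com", 25},
	}
	cols := toColumns(rows)
	fmt.Println(averageAge(cols.Ages)) // prints 27.5
}
```

A real Parquet reader gets the same effect from disk: it seeks to the age column chunks and skips the bytes that hold the other columns.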

The format is complex, with binary encoding, compression codecs, and a rich type system, so you need a third-party library to handle the serialization. A widely used choice is github.com/xitongsys/parquet-go. It maps Go structs to Parquet schemas using struct tags. This approach feels natural to Go developers, but it hides some runtime mechanics that can trip you up.

How Parquet maps to Go structs

Parquet is a columnar storage format. Instead of storing rows sequentially, it groups values by column. This allows aggressive compression because values in a single column often have similar patterns. Go code is usually row-oriented: you define structs that represent rows. The library bridges the gap by inspecting your structs at runtime and building a Parquet schema.

The bridge relies on struct tags. Tags are metadata attached to struct fields. The library reads these tags to determine column names, types, and encoding options. If you miss a tag or use the wrong syntax, the library cannot map the field. The compiler does not validate tags. Tags are just strings. You get runtime errors when the library tries to parse them.
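To make the mechanism concrete, here is a minimal sketch of how a tag can be read with the standard reflect package. The parseTag and fieldTags helpers are hypothetical and much simpler than the library's own parser; they only show that a tag is an opaque string until something splits it at runtime.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Record mirrors the struct used elsewhere in this article.
type Record struct {
	Name string `parquet:"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"`
	Age  int32  `parquet:"name=Age, type=INT32"`
}

// parseTag splits a tag value such as "name=Age, type=INT32" into key/value pairs.
// It is a simplified stand-in, not the library's parser.
func parseTag(tag string) map[string]string {
	out := make(map[string]string)
	for _, part := range strings.Split(tag, ",") {
		kv := strings.SplitN(strings.TrimSpace(part), "=", 2)
		if len(kv) != 2 {
			continue // entries without "=" are skipped in this sketch
		}
		out[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
	}
	return out
}

// fieldTags returns the parsed "parquet" tag for every field of a struct value.
// It expects a struct, not a pointer to one.
func fieldTags(v interface{}) map[string]map[string]string {
	t := reflect.TypeOf(v)
	out := make(map[string]map[string]string)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		out[f.Name] = parseTag(f.Tag.Get("parquet"))
	}
	return out
}

func main() {
	tags := fieldTags(Record{})
	fmt.Println(tags["Name"]["type"], tags["Age"]["type"]) // prints BYTE_ARRAY INT32
}
```

Nothing in this path involves the compiler, which is why a misspelled key only shows up when the program runs.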

One critical convention applies here. Struct fields must be exported. The library lives in another package, and reflection cannot read the values of unexported fields across package boundaries, even though the tags themselves are visible. The compiler won't complain. You get missing data or a panic. Always capitalize the first letter of fields you want to serialize.
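A small check like the following can catch the problem in a unit test before any file is written. This is a sketch using only the standard reflect package; the mixed type and the unexportedTagged helper are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"reflect"
)

// mixed deliberately contains one unexported field that carries a parquet tag.
type mixed struct {
	Name string `parquet:"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"`
	age  int32  `parquet:"name=Age, type=INT32"`
}

// unexportedTagged lists fields that have a parquet tag but are not exported.
// It expects a struct value, not a pointer.
func unexportedTagged(v interface{}) []string {
	t := reflect.TypeOf(v)
	var bad []string
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if _, tagged := f.Tag.Lookup("parquet"); tagged && !f.IsExported() {
			bad = append(bad, f.Name)
		}
	}
	return bad
}

func main() {
	fmt.Println(unexportedTagged(mixed{})) // prints [age]

	// The tag is readable, but the value of the unexported field is not.
	v := reflect.ValueOf(mixed{Name: "Alice", age: 30})
	fmt.Println(v.Field(0).CanInterface(), v.Field(1).CanInterface()) // prints true false
}
```

StructField.IsExported requires Go 1.17 or later; on older toolchains, checking that PkgPath is non-empty gives the same answer.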

Writing a Parquet file

Here's the simplest way to write a Parquet file. You define a struct with tags, create a writer, write records, and stop the writer. The tags tell the library how to map Go types to Parquet types.

package main

import (
	"fmt"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/writer"
)

// Record maps to a Parquet row.
// Tags define column name, type, and conversion.
// The Go type should match the Parquet physical type: INT32 pairs with int32.
type Record struct {
	Name string `parquet:"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"`
	Age  int32  `parquet:"name=Age, type=INT32"`
}

func main() {
	// Open a local file for writing.
	fw, err := local.NewLocalFileWriter("data.parquet")
	if err != nil {
		fmt.Println("Error creating file writer:", err)
		return
	}

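	defer fw.Close()

	// Create a Parquet writer.
	// The third argument (np) is the number of goroutines used to marshal rows,
	// not a row buffer size.
	pw, err := writer.NewParquetWriter(fw, new(Record), 4)
	if err != nil {
		fmt.Println("Error creating parquet writer:", err)
		return
	}

	// Write records one by one, checking each error.
	for _, rec := range []Record{{"Alice", 30}, {"Bob", 25}} {
		if err := pw.Write(rec); err != nil {
			fmt.Println("Error writing record:", err)
			return
		}
	}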

	// Flush buffer and write footer.
	if err := pw.WriteStop(); err != nil {
		fmt.Println("Error stopping write:", err)
		return
	}
}

The tag syntax is strict. name sets the column name in the file. type sets the Parquet physical type. convertedtype sets the logical type. For strings, you must use BYTE_ARRAY with convertedtype=UTF8. Without convertedtype, the library treats the string as raw bytes. This is a common source of bugs. The data writes successfully, but readers interpret the bytes incorrectly.

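The third argument to NewParquetWriter is not a row buffer. It is the parallelism (np): the number of goroutines the writer uses to marshal rows. How much data accumulates in memory before a flush is governed by the writer's RowGroupSize and PageSize fields, which are measured in bytes. Larger row groups mean fewer, bigger writes and usually better compression. Smaller row groups reduce memory usage. Pick values that balance throughput and memory pressure.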

Reading a Parquet file

Reading works in reverse. You create a reader, specify the struct type, and read the data. The reader parses the file footer, finds the schema, and deserializes the binary chunks back into Go structs.

package main

import (
	"fmt"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

// Record must match the schema written to the file.
// INT32 columns pair with Go's int32.
type Record struct {
	Name string `parquet:"name=Name, type=BYTE_ARRAY, convertedtype=UTF8"`
	Age  int32  `parquet:"name=Age, type=INT32"`
}

func main() {
	// Open a local file for reading.
	fr, err := local.NewLocalFileReader("data.parquet")
	if err != nil {
		fmt.Println("Error creating file reader:", err)
		return
	}

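	defer fr.Close()

	// Create a Parquet reader.
	// The third argument (np) is the number of goroutines used to unmarshal rows.
	pqReader, err := reader.NewParquetReader(fr, new(Record), 4)
	if err != nil {
		fmt.Println("Error creating parquet reader:", err)
		return
	}
	defer pqReader.ReadStop()

	// Read fills the slice it is given, so size the slice to the row count first.
	// A nil or empty slice would read nothing.
	out := make([]Record, int(pqReader.GetNumRows()))
	if err := pqReader.Read(&out); err != nil {
		fmt.Println("Error reading parquet:", err)
		return
	}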

	// Process the data.
	for _, r := range out {
		fmt.Println(r.Name, r.Age)
	}
}

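The Read function fills the slice you pass in, so the slice length decides how many rows are loaded. Sizing the slice to the file's total row count (GetNumRows) loads the whole file into memory. This is convenient for small files. It can exhaust memory for large files. For production code, read in chunks: allocate a slice of a fixed batch size, call Read repeatedly, and process each batch before fetching the next. Memory management is your responsibility. The garbage collector can only reclaim rows you no longer reference.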
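Chunked reading can be sketched as a loop over fixed-size batches. The batchReader interface below is a hypothetical abstraction that names only two methods, GetNumRows and Read, on the assumption that the library's ParquetReader exposes them with these signatures; fakeReader is an in-memory stand-in so the sketch runs without a file or the library.

```go
package main

import "fmt"

// Record is a plain row type for this sketch (tags omitted for brevity).
type Record struct {
	Name string
	Age  int32
}

// batchReader is a hypothetical interface covering the two reader methods
// this loop needs. It is assumed that the Parquet reader satisfies it.
type batchReader interface {
	GetNumRows() int64
	Read(dst interface{}) error
}

// readInBatches reads at most batchSize rows at a time and passes each batch
// to handle, so only one batch is referenced at any moment.
func readInBatches(r batchReader, batchSize int, handle func([]Record) error) error {
	if batchSize <= 0 {
		return fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	remaining := int(r.GetNumRows())
	for remaining > 0 {
		n := batchSize
		if remaining < n {
			n = remaining
		}
		batch := make([]Record, n)
		if err := r.Read(&batch); err != nil {
			return fmt.Errorf("read batch: %w", err)
		}
		if err := handle(batch); err != nil {
			return fmt.Errorf("handle batch: %w", err)
		}
		remaining -= n
	}
	return nil
}

// fakeReader serves rows from memory; it exists only to make the sketch runnable.
type fakeReader struct {
	rows []Record
	pos  int
}

func (f *fakeReader) GetNumRows() int64 { return int64(len(f.rows)) }

func (f *fakeReader) Read(dst interface{}) error {
	out, ok := dst.(*[]Record)
	if !ok {
		return fmt.Errorf("unsupported destination %T", dst)
	}
	n := copy(*out, f.rows[f.pos:])
	f.pos += n
	*out = (*out)[:n]
	return nil
}

func main() {
	r := &fakeReader{rows: []Record{{"Alice", 30}, {"Bob", 25}, {"Carol", 41}}}
	err := readInBatches(r, 2, func(batch []Record) error {
		fmt.Println(len(batch), "rows")
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

With a real file you would pass the reader returned by reader.NewParquetReader in place of fakeReader and pick a batch size that fits comfortably in memory.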

What happens under the hood

When you run the code, the library uses reflection to inspect the struct. It reads the tags and builds a Parquet schema. The schema defines the columns, types, and encoding. Each call to Write buffers the struct in memory. When the buffered data reaches the row group size or you call WriteStop, the library marshals the rows into columns, compresses the pages, and writes the row group to the file. WriteStop then writes the footer, which holds the schema and the row group metadata. A file without a footer is unreadable, so WriteStop is not optional.

Reading follows the reverse path. The reader parses the footer to find the schema and the row group locations. It maps the columns onto your struct. If the types don't match, you get a runtime error. The reader deserializes the binary chunks and populates your Go structs. The number of rows fetched per call is the length of the slice you pass to Read; the third constructor argument only sets how many goroutines do the unmarshalling.

This reflection-based approach is flexible. You can change the struct and the file format adapts. It is also slower than static code, because reflection adds overhead on every row. For high-performance pipelines, consider code generation or manual serialization. Some other Go Parquet libraries generate static serialization code from your types, which removes the reflection cost at runtime.

Realistic pipeline with complex types

Real data has optional fields, nested structures, and slices. Parquet handles these with optional (nullable) and repeated fields. The library maps Go pointers to optional fields and, with the right tag, slices to repeated fields. You also need to handle errors properly. The library returns errors for I/O issues and schema mismatches. Wrap errors to provide context.

package main

import (
	"fmt"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/writer"
)

// Event represents a log entry with optional metadata.
// Level is a pointer, so it is marked OPTIONAL explicitly; Tags is a repeated column.
type Event struct {
	Timestamp int64    `parquet:"name=Timestamp, type=INT64"`
	Message   string   `parquet:"name=Message, type=BYTE_ARRAY, convertedtype=UTF8"`
	Level     *string  `parquet:"name=Level, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	Tags      []string `parquet:"name=Tags, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"`
}

func writeEvents(filename string) error {
	// Open file for writing and make sure it is closed on every return path.
	fw, err := local.NewLocalFileWriter(filename)
	if err != nil {
		return fmt.Errorf("open file: %w", err)
	}
	defer fw.Close()

	// Create the writer. The third argument is the marshalling parallelism,
	// not a buffer size.
	pw, err := writer.NewParquetWriter(fw, new(Event), 4)
	if err != nil {
		return fmt.Errorf("create writer: %w", err)
	}

	// Prepare data.
	level := "INFO"
	events := []Event{
		{Timestamp: 1000, Message: "Started", Level: &level, Tags: []string{"init"}},
		{Timestamp: 1001, Message: "Done", Level: nil, Tags: []string{"end", "success"}},
	}

	// Write events.
	for _, e := range events {
		if err := pw.Write(e); err != nil {
			return fmt.Errorf("write event: %w", err)
		}
	}

	// Stop writer.
	if err := pw.WriteStop(); err != nil {
		return fmt.Errorf("stop writer: %w", err)
	}

	return nil
}

func main() {
	if err := writeEvents("events.parquet"); err != nil {
		fmt.Println(err)
	}
}

The Level field is a pointer. This maps to an optional (nullable) column in Parquet. If the pointer is nil, the value is null. The Tags field is a slice. This maps to a repeated field, which stores zero or more values per row. The tag repetitiontype=REPEATED is required for this mapping. Without it, the library expects a slice to be declared as a LIST column, and a BYTE_ARRAY tag no longer matches the field.

Error wrapping uses fmt.Errorf with %w. This preserves the error chain. You can unwrap the error later to check for specific types. The if err != nil pattern is verbose by design. It makes the unhappy path visible. Don't hide errors. Return them immediately.
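Here is a minimal sketch of what the preserved chain buys you, using only the standard library. The openInput and isNotExist helpers and the file path are hypothetical; the wrapped error still answers errors.Is and errors.As queries about the original cause.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// openInput is a hypothetical helper that opens a file and wraps any failure
// with context, the same way writeEvents wraps its errors.
func openInput(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("open file: %w", err)
	}
	return f.Close()
}

// isNotExist reports whether err, or anything it wraps, means "file not found".
func isNotExist(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	// The path is assumed not to exist on the machine running this sketch.
	err := openInput("no-such-dir/missing.parquet")
	fmt.Println("wrapped error:", err)
	fmt.Println("is not-exist:", isNotExist(err))

	// errors.As recovers the concrete error type from the chain.
	var pathErr *fs.PathError
	if errors.As(err, &pathErr) {
		fmt.Println("failed operation:", pathErr.Op, "on", pathErr.Path)
	}
}
```

Had the helper used %v instead of %w, the message would look the same but both checks would fail, because the original error would be flattened into text.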

Pitfalls and runtime errors

The struct tags are the most common source of bugs. If you misspell a tag key or use the wrong type, the library panics or returns a confusing error. The compiler won't catch tag errors because tags are just strings. You get runtime failures. Depending on where the mistake is, the failure surfaces when the writer builds the schema in NewParquetWriter or later, during a write. The exact message varies between library versions.

Another pitfall is schema mismatch. If you read a file with a struct that doesn't match the schema, the reader fails. The library checks types at runtime. The failure may be a returned error or a panic from the reflect package, such as reflect: call of reflect.Value.Type on zero Value. These errors indicate that the struct definition doesn't align with the file. Always verify the schema before reading.

Memory usage is a silent killer. Reading a large file into a single slice loads everything into RAM. The Read function fills whatever slice you allocate, so a slice sized to the full row count holds the full file. For files larger than available memory, the program crashes. Read rows in chunks instead. Process and discard data as you go. Memory problems rarely announce themselves in logs, so monitor heap usage in production.

Tags are contracts. Export your fields. Validate your tags. Test with real data. The library is powerful, but it requires discipline.

When to use Parquet

Use Parquet when you need columnar storage for analytics or large datasets. Use CSV when you need human-readable text and simple tooling. Use JSON when you need nested structures and interoperability with web APIs. Use the standard library encoding/csv when you want zero dependencies.

Parquet shines in data pipelines. It compresses well. It supports schema evolution. It integrates with tools like Spark, DuckDB, and Pandas. Go is a great language for building the pipelines that produce and consume Parquet files. The parquet-go library provides the bridge. Master the tags, manage memory, and handle errors. Your data will flow efficiently.

Where to go next