How to Build a Web Scraper in Go

You need data that isn't in an API

You found a website with exactly the data you need. A list of job postings, product prices, or research papers. There is no JSON endpoint. There is no official API. Just a wall of HTML. You could copy-paste by hand, or you could write a scraper. Go is surprisingly good at this because it treats HTTP as a first-class citizen and gives you a lightweight parser without the bloat of a full browser framework.

Scraping boils down to two steps. First, you send an HTTP request and get a response. Second, you parse the HTML body to find the nodes you care about. Go's standard library handles the request part. For parsing, the community standard is golang.org/x/net/html, which builds a DOM tree you can walk. You don't need a heavy framework unless the site relies on JavaScript to render content. If the HTML is there in the source, Go can grab it.

The DOM tree in Go

HTML is a tree of elements. The parser turns that tree into Go structs. Each node has a type, a tag name, a list of attributes, and pointers to its children and siblings. You walk the tree recursively to find what you need.

Here's the simplest scraper: fetch a page, build the DOM tree, and walk it to find every anchor tag.

package main

import (
	"fmt"
	"net/http"

	"golang.org/x/net/html"
)

func main() {
	// fetch the page
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	// ensure the body closes to free resources
	defer resp.Body.Close()

	// parse the HTML stream into a node tree
	doc, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// recursive function to walk the tree
	var traverse func(*html.Node)
	traverse = func(n *html.Node) {
		// check if this node is an anchor tag
		if n.Type == html.ElementNode && n.Data == "a" {
			// iterate attributes to find href
			for _, a := range n.Attr {
				if a.Key == "href" {
					fmt.Println(a.Val)
				}
			}
		}
		// recurse into children
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			traverse(c)
		}
	}
	traverse(doc)
}

The http.Get function returns a response and an error. Always check the error. Network calls fail. defer resp.Body.Close() is mandatory. The HTTP client reuses connections, and leaving the body open leaks memory and file descriptors. html.Parse takes an io.Reader and returns the root node. The recursive function visits every node. If it finds an <a> tag, it loops through attributes to grab the href.
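One gap worth knowing about: http.Get only returns a non-nil error when the request itself fails. A 404 or 500 response still comes back with err == nil, so a scraper that skips the status check will happily parse an error page. The sketch below shows one way to guard against that. The fetchBody and demo helpers are hypothetical, and a local httptest server stands in for a real site.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchBody is a hypothetical helper: it GETs url and returns the body,
// treating any status other than 200 as an error.
func fetchBody(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err // transport failure: DNS, refused connection, ...
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: unexpected status %s", url, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// demo starts a throwaway local server and fetches one page that exists
// and one that does not, returning the error from each call.
func demo() (okErr, missingErr error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/missing" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, "<html><body>ok</body></html>")
	}))
	defer srv.Close()

	_, okErr = fetchBody(srv.URL + "/")
	_, missingErr = fetchBody(srv.URL + "/missing")
	return okErr, missingErr
}

func main() {
	okErr, missingErr := demo()
	fmt.Println("existing page error:", okErr)
	fmt.Println("missing page error:", missingErr)
}
```

Whether to accept other 2xx codes or follow up on redirects differently is a per-site decision; rejecting everything except 200 is just the simplest starting point.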

Extracting text content

The parser gives you nodes, not strings. golang.org/x/net/html does not provide an innerText property. To get the text inside a tag, you have to walk the tree and collect TextNode values. This is a common stumbling block. You write a helper function that concatenates text nodes while ignoring script and style tags.

Here's a helper that extracts text from a node and its children:

// ExtractText returns the concatenated text content of a node
func ExtractText(n *html.Node) string {
	// skip script and style tags
	if n.Type == html.ElementNode && (n.Data == "script" || n.Data == "style") {
		return ""
	}
	// collect text from this node
	var text string
	if n.Type == html.TextNode {
		text = n.Data
	}
	// recurse into children
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		text += ExtractText(c)
	}
	return text
}

You call this function when you find the element you want. If you find a <div class="price">, you pass that node to ExtractText to get the price string. The function skips script and style tags because you don't want JavaScript code in your data. It concatenates text nodes, which handles cases where text is split across multiple nodes.
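Text nodes keep whatever whitespace the page source had, so the string you get back often carries newlines and indentation around the value. A common follow-up step is to collapse that whitespace. The cleanText helper below is a hypothetical addition, not part of the parser package, and the sample strings are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanText trims leading and trailing whitespace and collapses every
// internal run of spaces, tabs, and newlines into a single space.
func cleanText(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// what ExtractText might return for an indented price element
	fmt.Printf("%q\n", cleanText("\n      $19.99\n    "))
	// text split across lines in the source
	fmt.Printf("%q\n", cleanText("Blue   Widget\n\t(large)"))
}
```

This prints "$19.99" and "Blue Widget (large)". If the layout of the text matters to you (for example preformatted blocks), skip this step for those elements.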

Realistic scraper with structure

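Real scrapers need structure, timeouts, and clean error handling. This function extracts product titles from a hypothetical list page, reusing the ExtractText helper from the previous section. It uses context for timeouts and returns a slice of structs. Only Title is filled in here; Price is left empty.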

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"golang.org/x/net/html"
)

// Product holds scraped data
type Product struct {
	Title string
	Price string
}

// ScrapeProducts fetches a page and extracts product data.
// It relies on the ExtractText helper defined earlier.
func ScrapeProducts(ctx context.Context, url string) ([]Product, error) {
	// context carries the deadline
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// client with timeout prevents hanging
	client := &http.Client{Timeout: 10 * time.Second}
	// execute the request
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	// close body to release connection
	defer resp.Body.Close()

	// a 404 or 500 is not a Go error; reject it before parsing
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch %s: unexpected status %s", url, resp.Status)
	}

	// parse HTML into DOM
	doc, err := html.Parse(resp.Body)
	if err != nil {
		return nil, err
	}

	// slice to collect results
	var products []Product
	// recursive walker
	var traverse func(*html.Node)
	traverse = func(n *html.Node) {
		// match product title class
		if n.Type == html.ElementNode && n.Data == "div" {
			for _, a := range n.Attr {
				if a.Key == "class" && a.Val == "product-title" {
					// extract text content
					title := ExtractText(n)
					// append to results
					products = append(products, Product{Title: title})
					return
				}
			}
		}
		// recurse
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			traverse(c)
		}
	}
	traverse(doc)
	return products, nil
}

The context.Context parameter goes first. This is a Go convention. Functions that take a context should respect cancellation and deadlines. The http.Client also has a Timeout field. With neither a client timeout nor a context deadline, the scraper can hang on a slow server; with both set, whichever expires first cancels the request. The defer resp.Body.Close() call is still there. The recursive function matches a <div> whose class attribute is exactly product-title, extracts the text, appends it to the results slice, and returns without descending into that element's children.
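One caveat about the class comparison in the walker: a.Val == "product-title" only matches when product-title is the element's sole class. Real pages often write class="product-title featured", and the exact comparison misses those. A sketch of a more forgiving check, using a hypothetical hasClass helper that treats the attribute as a space-separated list:

```go
package main

import (
	"fmt"
	"strings"
)

// hasClass reports whether the space-separated class attribute value
// contains want as a whole token. Hypothetical helper for this sketch.
func hasClass(classAttr, want string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == want {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasClass("product-title featured", "product-title")) // true
	fmt.Println(hasClass("product-title-small", "product-title"))    // false
}
```

In the walker you would call hasClass(a.Val, "product-title") in place of the equality test. Splitting on whitespace, instead of using strings.Contains, avoids false matches on longer class names such as product-title-small.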

Convention aside: if err != nil { return err } is verbose by design. The community accepts the boilerplate because it makes the unhappy path visible. You see every error check. You don't have to hunt for where a failure might be swallowed.

Concurrency for multiple URLs

Scraping one page is easy. Scraping a hundred pages sequentially is slow. Go shines here. You can spawn a goroutine for each URL and collect results via a channel. You need to bound concurrency so you don't overwhelm the target server or exhaust your own resources.

Here's a semaphore-based pattern for bounded concurrent scraping:

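// ScrapeURLs fetches multiple URLs concurrently.
// It needs "errors" and "sync" in the file's import block.
func ScrapeURLs(ctx context.Context, urls []string) ([]Product, error) {
	// semaphore to limit concurrency
	sem := make(chan struct{}, 5)
	// result pairs the products from one URL with any error it produced
	type result struct {
		products []Product
		err      error
	}
	// buffered so no goroutine ever blocks on send
	results := make(chan result, len(urls))
	// wait group to track goroutines
	var wg sync.WaitGroup

	// launch one goroutine per URL
	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			// acquire semaphore slot
			sem <- struct{}{}
			defer func() { <-sem }()
			// scrape the URL and report the outcome, success or failure
			products, err := ScrapeProducts(ctx, u)
			results <- result{products: products, err: err}
		}(url)
	}

	// close results channel when done
	go func() {
		wg.Wait()
		close(results)
	}()

	// collect results; keep partial data and remember every failure
	var allProducts []Product
	var errs []error
	for r := range results {
		if r.err != nil {
			errs = append(errs, r.err)
			continue
		}
		allProducts = append(allProducts, r.products...)
	}
	// errors.Join returns nil when errs is empty (Go 1.20+)
	return allProducts, errors.Join(errs...)
}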

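The semaphore channel caps the number of scrapes in flight at five. Every URL still gets its own goroutine, but each one has to acquire a slot before it scrapes and releases the slot when it finishes, so the rest sit idle until a slot frees up. The results channel collects data, and it is buffered to len(urls) so a send never blocks. The wait group ensures the results channel closes only after all goroutines finish. This pattern keeps the load on the target server bounded. For very large URL lists, a fixed set of workers reading from a queue would also bound the number of goroutines.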
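To see the cap at work without touching the network, the sketch below runs the same semaphore pattern over dummy tasks and records the highest number that were ever running at once. The runBounded helper, the task count, and the 10 ms sleep are all stand-ins chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runBounded starts n goroutines but lets at most limit of them do work
// at the same time. It returns the peak number of concurrent workers.
func runBounded(n, limit int) int {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup

	var mu sync.Mutex // guards inFlight and peak
	inFlight, peak := 0, 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on the way out

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			time.Sleep(10 * time.Millisecond) // stand-in for a network call

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 20 tasks, at most 3 at a time
	fmt.Println("peak concurrency:", runBounded(20, 3))
}
```

The printed peak never exceeds the limit, because a goroutine only counts itself as in flight while it holds a slot.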

Pitfalls and compiler errors

Scraping looks simple until it breaks. Here are the common traps.

Regex for HTML. Don't use regular expressions to parse HTML. HTML is nested, attributes appear in different orders, and tags can span multiple lines. Regex breaks on self-closing tags, comments, or malformed markup. Use a parser. The compiler won't stop you, but your scraper will return garbage.
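A small demonstration of how quickly a regex goes wrong, as a sketch: the sample markup holds three valid anchors written three different ways, and the pattern is the kind you might write on a first attempt. Both the sample and the naiveLinks helper are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sample holds three valid anchors: href first, href second, and a tag
// that spans two lines and uses single quotes.
const sample = `<a href="/one">one</a>
<a class="nav" href="/two">two</a>
<a
  href='/three'>three</a>`

// naiveLinks applies a pattern that assumes href is the first attribute,
// double-quoted, and on the same line as the tag name.
func naiveLinks(markup string) []string {
	re := regexp.MustCompile(`<a href="([^"]+)">`)
	var links []string
	for _, m := range re.FindAllStringSubmatch(markup, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	// only /one is found; /two and /three are silently missed
	fmt.Println(naiveLinks(sample))
}
```

You can keep patching the pattern for each new variation, but the tree walker from the first example handles all three forms without any special cases.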

Missing timeouts. Without a timeout, your scraper hangs on slow servers. Use http.Client with Timeout or context.WithTimeout. The compiler won't catch this. http.Get uses the default client, which has no timeout at all, so a server that accepts the connection and never answers can block your program for minutes or indefinitely.
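Here is a minimal sketch of the context.WithTimeout route, run against a deliberately slow local server so the deadline actually fires. The getWithDeadline and timedOut helpers, the 2-second handler delay, and the 100 ms deadline are assumptions picked for the demonstration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// getWithDeadline issues a GET that is abandoned once timeout elapses.
func getWithDeadline(url string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

// timedOut runs getWithDeadline against a server that stalls for two
// seconds and reports whether the call failed with a deadline error.
func timedOut() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(2 * time.Second): // pretend to be slow
		case <-r.Context().Done(): // client gave up; stop early
		}
	}))
	defer srv.Close()

	err := getWithDeadline(srv.URL, 100*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("request timed out:", timedOut())
}
```

errors.Is lets the caller tell a timeout apart from other failures, which is useful if you want to retry slow pages but give up on refused connections.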

Forgetting to close the body. If you forget defer resp.Body.Close(), you leak connections. The HTTP client can only reuse a connection once its body has been read to the end and closed; an open body pins the connection. Eventually you run out of file descriptors and requests start failing with "too many open files". The compiler won't warn you. You have to discipline yourself to close the body.

Compiler errors. If you forget to import golang.org/x/net/html, the compiler rejects the program with undefined: html. If you pass a string to html.Parse instead of an io.Reader, the compiler complains with something like cannot use url (variable of type string) as io.Reader value in argument to html.Parse: string does not implement io.Reader (missing method Read); the exact wording varies between Go versions. If you try to access a field that doesn't exist on html.Node, you get n.Text undefined (type *html.Node has no field or method Text). Read the error. It tells you exactly what's wrong.

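Goroutine leaks. A goroutine that blocks forever on a channel send or receive never exits, and the memory it holds is never freed. Goroutines you don't wait for have the opposite problem: when main returns, they are killed mid-request and their results are lost. Always use a wait group or a channel to coordinate. The worst goroutine bug is the one that never logs.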
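One way a scraper goroutine leaks is by blocking forever on a channel send after its consumer has stopped reading. A sketch of the usual guard: pair every send with a case on ctx.Done() so the goroutine has a way out. The produce and stopsOnCancel functions are hypothetical, and the integers stand in for scraped results.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// produce sends values on out until ctx is cancelled, then closes done
// to signal that it has exited.
func produce(ctx context.Context, out chan<- int, done chan<- struct{}) {
	defer close(done)
	for i := 0; ; i++ {
		select {
		case out <- i: // consumer took the value
		case <-ctx.Done(): // consumer is gone; exit instead of blocking
			return
		}
	}
}

// stopsOnCancel reads one value, stops reading, cancels the context,
// and reports whether the producer exited within a second.
func stopsOnCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan int) // unbuffered: a send blocks until someone reads
	done := make(chan struct{})
	go produce(ctx, out, done)

	<-out    // take one value, then walk away
	cancel() // without this, produce would block on its next send forever

	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("producer exited after cancel:", stopsOnCancel())
}
```

Remove the ctx.Done() case and the producer stays parked on the send for the life of the program, with nothing in the logs to say so.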

Convention aside: gofmt is mandatory. Don't argue about indentation. Let the tool decide. Most editors run it on save. Your scraper code will look like every other Go codebase.

Decision: when to use what

Scraping tools vary by complexity. Pick the right one for your use case.

Use net/http and golang.org/x/net/html when the site serves static HTML and you need a lightweight scraper whose only dependency outside the standard library is the parser.

Use the colly framework when you are building a larger crawler with automatic retries, cookie handling, and storage backends.

Use chromedp or playwright-go when the content is rendered by JavaScript and the HTML source doesn't contain the data you need.

Use the official API when the site has one; scraping is a fallback, not a first choice.

Use plain sequential code when you only need to scrape a few pages; concurrency adds complexity that isn't always worth it.

Scraping is just HTTP and parsing. Keep it simple. Close the body. Don't regex HTML. Trust the tree.

In plain terms

A web scraper is a program that visits websites and automatically extracts specific data like links or prices. You use it when you need to gather information from many pages faster than a human could. Think of it as a robot that reads a book and copies only the sentences you asked for.