The regex sieve: validating text patterns in Go
You are building a signup form. The user types an email address. You need to decide if that string is valid before it hits your database. Or you are parsing a configuration file full of URLs. Or you are extracting IP addresses from a firewall log. Regular expressions are the tool for pattern matching. Go's regexp package provides a fast, safe engine for these tasks.
Regular expressions are a compact language for describing text patterns. You write a pattern, and the engine checks if a string fits. It works like a sieve: you define the shape of the holes, and only strings that match the shape pass through. Go's regexp package implements the RE2 syntax and, like RE2, guarantees matching time linear in the length of the input. A regex cannot hang the program through backtracking, even on malicious input. Backtracking engines can suffer catastrophic backtracking, where a bad pattern causes exponential slowdowns. The RE2 design avoids this by giving up features that require backtracking, such as backreferences and lookaround assertions.
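To see the linear-time guarantee in practice, here is a minimal sketch that feeds a classic "regex bomb" pattern to the regexp package. The names nested and matchHostile are hypothetical, chosen for this illustration, and the input length of 5000 is arbitrary.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// nested has a quantifier inside a quantifier, the classic shape that
// triggers catastrophic backtracking in backtracking engines.
var nested = regexp.MustCompile(`^(a+)+$`)

// matchHostile (hypothetical helper) runs the pattern against n letters
// followed by a character that forces the overall match to fail.
func matchHostile(n int) bool {
	input := strings.Repeat("a", n) + "!"
	return nested.MatchString(input)
}

func main() {
	start := time.Now()
	matched := matchHostile(5000)
	fmt.Println("matched:", matched) // matched: false
	fmt.Println("elapsed:", time.Since(start))
}
```

The elapsed time depends on the machine, so no figure is claimed here. The point is only that the call returns instead of exploring an exponential number of paths.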
How the engine works
The regexp package compiles a pattern into a state machine. Compilation is the expensive step. Matching is fast. If you compile a pattern inside a loop, you pay the compilation cost on every iteration. Compile once, match many times.
The package offers two ways to compile. regexp.MustCompile panics if the pattern has a syntax error. regexp.Compile returns an error instead. Use MustCompile for hardcoded patterns. The panic catches typos during development. Use Compile for patterns that come from user input or configuration files. You need to handle syntax errors gracefully in those cases.
package main

import (
	"fmt"
	"regexp"
)

// emailRegex is compiled once at package initialization.
// Package-level variables are initialized before main runs.
// This avoids recompiling the pattern on every function call.
var emailRegex = regexp.MustCompile(`^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`)

// ValidateEmail checks if a string matches the email pattern.
// It returns true only if the entire string fits the pattern.
func ValidateEmail(email string) bool {
	// MatchString returns true if the pattern matches anywhere in the string.
	// The anchors ^ and $ force the match to cover the whole string.
	return emailRegex.MatchString(email)
}

func main() {
	// Test valid and invalid inputs.
	fmt.Println(ValidateEmail("user@example.com")) // true
	fmt.Println(ValidateEmail("not-an-email"))     // false
	fmt.Println(ValidateEmail("user@.com"))        // false
}
The pattern uses ^ to anchor to the start of the string and $ to anchor to the end. Without anchors, MatchString returns true if the pattern matches any substring. This is a common bug. A pattern like @example.com matches user@bad@example.com. Anchors prevent false positives.
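The difference is easy to demonstrate. The sketch below compares an unanchored pattern with an anchored one on the same input. The variable names and the anchored pattern are illustrative assumptions, not part of the validator above.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// unanchored may match anywhere inside the input.
	unanchored = regexp.MustCompile(`@example\.com`)
	// anchored must cover the whole input, with exactly one @.
	anchored = regexp.MustCompile(`^[^@]+@example\.com$`)
)

func main() {
	input := "user@bad@example.com"
	fmt.Println(unanchored.MatchString(input)) // true
	fmt.Println(anchored.MatchString(input))   // false
}
```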
Breaking down common patterns
Regex patterns can look like gibberish. Breaking them down reveals the logic. The patterns below cover email, URL, IPv4, and phone numbers. They are approximations. Real-world formats are often more complex. Use these as starting points and adjust for your specific needs.
Email validation
Email validation via regex is a trap. The full RFC 5322 specification is massive and includes edge cases like quoted strings and comments. No regex covers every valid email without also accepting garbage. The goal is a "good enough" pattern that catches obvious typos and rejects nonsense.
The pattern ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ works for most cases.
- ^ asserts the start of the string.
- [a-zA-Z0-9._%+-]+ matches one or more allowed characters for the local part. The + quantifier means "one or more".
- @ matches the literal at symbol.
- [a-zA-Z0-9.-]+ matches the domain name.
- \. matches a literal dot. The backslash escapes the dot, which otherwise means "any character".
- [a-zA-Z]{2,} matches the top-level domain, requiring at least two letters.
- $ asserts the end of the string.
This pattern rejects empty strings, missing domains, and single-letter TLDs. It accepts most standard email addresses. It still accepts some malformed ones, such as user@example..com, because the domain character class allows consecutive dots. It rejects internationalized domain names and emails with Unicode characters. If you need full Unicode support, consider a dedicated library or a more permissive pattern followed by server-side verification.
URL validation
URLs have a structured grammar with many optional parts: ports, user info, percent-encoding, query strings, and fragments. Regex is fragile for URLs because the spec allows so many variations. The net/url package parses URLs properly. Use regex only for simple checks, like ensuring a string starts with http:// or https://.
The pattern ^https?://[a-zA-Z0-9.-]+(?:/[a-zA-Z0-9._~:/?#\[\]@!$&'()*+,;=-]*)?$ handles basic HTTP URLs.
- ^https?:// matches http:// or https://. The ? makes the s optional.
- [a-zA-Z0-9.-]+ matches the domain.
- (?:...) is a non-capturing group. It groups parts of the pattern without creating a capture group for extraction.
- /[a-zA-Z0-9._~:/?#\[\]@!$&'()*+,;=-]* matches the path and query string. The * quantifier means "zero or more".
- $ anchors to the end.
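This pattern accepts https://example.com and https://example.com/path?q=1. It rejects ftp://example.com and example.com. It also rejects URLs with a port (https://example.com:8080) or percent-encoded characters (%20), because the host class has no colon and the path class has no percent sign. If you need to manipulate URLs, use url.Parse from the net/url package. It returns a *url.URL with fields for scheme, host, path, and query.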
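As a rough alternative to the regex, the same check can be sketched with the net/url package. The helper name isHTTPURL is hypothetical. Note that url.Parse is lenient and accepts many strings without error, so this sketch inspects the Scheme and Host fields explicitly.

```go
package main

import (
	"fmt"
	"net/url"
)

// isHTTPURL (hypothetical helper) reports whether raw parses as an
// absolute http or https URL with a non-empty host.
func isHTTPURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	// A bare string like "example.com" parses without error but ends up
	// in u.Path, leaving Scheme and Host empty.
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
	fmt.Println(isHTTPURL("https://example.com/path?q=1")) // true
	fmt.Println(isHTTPURL("ftp://example.com"))            // false
	fmt.Println(isHTTPURL("example.com"))                  // false
}
```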
IPv4 addresses
IPv4 addresses are four octets separated by dots. Each octet ranges from 0 to 255. Regex can enforce this range precisely.
The pattern ^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$ validates IPv4 addresses.
- 25[0-5] matches 250 to 255.
- 2[0-4][0-9] matches 200 to 249.
- [01]?[0-9][0-9]? matches 0 to 199. The ? makes the leading digit optional.
- The groups are combined with | for alternation.
- The octet pattern followed by a dot repeats three times.
- The final octet matches without a trailing dot.
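This pattern rejects 256.1.1.1 and 1.1.1.1.1. It accepts 192.168.1.1 and 0.0.0.0. It also accepts octets with leading zeros, such as 192.168.01.001, which some parsers reject or read as octal. IPv6 addresses are much harder to validate with regex. Use net.ParseIP for IP address validation. It handles both IPv4 and IPv6 correctly.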
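A minimal sketch of the net.ParseIP approach follows. The helper name isIPv4 is hypothetical. One assumption to be aware of: To4 also succeeds for IPv4-mapped IPv6 addresses such as ::ffff:192.0.2.1, so add a check for a colon in the input if those must be rejected.

```go
package main

import (
	"fmt"
	"net"
)

// isIPv4 (hypothetical helper) reports whether s parses as an IP address
// that has a 4-byte form.
func isIPv4(s string) bool {
	ip := net.ParseIP(s)
	// ParseIP returns nil for invalid input. To4 returns nil for
	// addresses that cannot be represented in 4 bytes.
	return ip != nil && ip.To4() != nil
}

func main() {
	fmt.Println(isIPv4("192.168.1.1")) // true
	fmt.Println(isIPv4("256.1.1.1"))   // false
	fmt.Println(isIPv4("1.1.1.1.1"))   // false
	fmt.Println(isIPv4("::1"))         // false
}
```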
Phone numbers
Phone numbers vary by country. Formatting rules differ wildly. Regex for phone numbers is usually country-specific. The pattern below handles US numbers in common formats.
The pattern ^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$ matches US phone numbers.
- \(? matches an optional opening parenthesis.
- ([0-9]{3}) captures the area code. Parentheses create a capture group.
- \)? matches an optional closing parenthesis.
- [-.\s]? matches an optional separator: hyphen, dot, or whitespace.
- The pattern repeats for the exchange and subscriber numbers.
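This pattern accepts (555) 123-4567, 555-123-4567, and 555.123.4567. Because every separator is optional, it also accepts 5551234567; drop the ? after each [-.\s] if you require separators. The two parentheses are matched independently, so unbalanced input like (555-123-4567 passes too. Adjust the pattern based on your input requirements. For global phone validation, use a library like libphonenumber. It handles country codes, formatting, and validation rules for every region.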
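The three capture groups make it straightforward to normalize accepted input into one canonical form. The sketch below assumes a hypothetical helper named normalizePhone and an arbitrary target format of NNN-NNN-NNNN.

```go
package main

import (
	"fmt"
	"regexp"
)

// usPhone is the same US phone pattern discussed above.
var usPhone = regexp.MustCompile(`^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$`)

// normalizePhone (hypothetical helper) rewrites any accepted format as
// NNN-NNN-NNNN. The boolean is false when the input does not match.
func normalizePhone(s string) (string, bool) {
	m := usPhone.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	// m[1] is the area code, m[2] the exchange, m[3] the subscriber number.
	return m[1] + "-" + m[2] + "-" + m[3], true
}

func main() {
	for _, in := range []string{"(555) 123-4567", "555.123.4567", "555-1234"} {
		out, ok := normalizePhone(in)
		fmt.Printf("%q -> %q (matched: %v)\n", in, out, ok)
	}
}
```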
Realistic usage: extraction and validation
Validation checks if a string matches a pattern. Extraction pulls specific parts out of the string. Use FindStringSubmatch to extract groups. The function returns a slice of strings. Index 0 is the full match. Index 1 is the first capture group. Index 2 is the second group, and so on.
package main

import (
	"fmt"
	"regexp"
)

// domainRegex captures the domain part of an email address in group 1.
// It is compiled once at package level, not on every call.
var domainRegex = regexp.MustCompile(`^[^@]+@([a-zA-Z0-9.-]+)$`)

// extractDomain pulls the domain part from an email address.
// It returns an empty string if the email doesn't match the pattern.
func extractDomain(email string) string {
	// Index 0 is the full match; index 1 is the first capture group.
	matches := domainRegex.FindStringSubmatch(email)

	// FindStringSubmatch returns nil if there's no match.
	// len(nil) is 0, so one length check covers both cases.
	if len(matches) < 2 {
		return ""
	}

	// Return the captured group.
	return matches[1]
}

func main() {
	fmt.Println(extractDomain("user@example.com")) // example.com
	fmt.Println(extractDomain("bad-email"))        // (empty string)
}
The pattern ^[^@]+@([a-zA-Z0-9.-]+)$ uses [^@]+ to match the local part. The ^ inside the brackets negates the character class. It matches any character except @. This is simpler than listing allowed characters. The domain is captured in parentheses. The function checks the length of the result slice before accessing index 1. This prevents a panic if the match fails.
Go convention favors returning errors over booleans for validation. Return a sentinel error, created once at package level, so callers can inspect the failure with errors.Is.

package main

import (
	"errors"
	"fmt"
	"regexp"
)

// phoneRegex validates US phone numbers.
var phoneRegex = regexp.MustCompile(`^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$`)

// ErrInvalidPhone is a sentinel error. errors.New returns a distinct value
// on every call, so it must be created once for errors.Is to match it.
var ErrInvalidPhone = errors.New("invalid phone number format")

// ValidatePhone checks if a string is a valid US phone number.
// It returns ErrInvalidPhone if the format is invalid.
func ValidatePhone(phone string) error {
	if !phoneRegex.MatchString(phone) {
		// Callers can test for this with errors.Is(err, ErrInvalidPhone).
		return ErrInvalidPhone
	}
	return nil
}

func main() {
	err := ValidatePhone("(555) 123-4567")
	if err != nil {
		fmt.Println("Validation failed:", err)
	}
}
The errors.New function creates a simple error. For more context, use fmt.Errorf with the %w verb to wrap the error. This preserves the error chain. Go convention dictates that if err != nil checks should happen immediately. The boilerplate makes the unhappy path visible. Don't hide errors.
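Here is a small sketch of wrapping with %w. The names phoneRe, errBadPhone, and checkField are hypothetical and exist only for this illustration. The wrapped error carries the field name in its message while errors.Is still finds the underlying sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

var phoneRe = regexp.MustCompile(`^\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$`)

// errBadPhone is a hypothetical sentinel error for this sketch.
var errBadPhone = errors.New("invalid phone number format")

// checkField (hypothetical helper) validates one named form field and
// wraps the sentinel so the message says which input failed.
func checkField(name, value string) error {
	if !phoneRe.MatchString(value) {
		return fmt.Errorf("field %q: %w", name, errBadPhone)
	}
	return nil
}

func main() {
	err := checkField("mobile", "12345")
	if err != nil {
		fmt.Println(err)                         // field "mobile": invalid phone number format
		fmt.Println(errors.Is(err, errBadPhone)) // true
	}
}
```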
Pitfalls and compiler errors
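Regex bugs are subtle. The Go compiler does not inspect regex patterns at all; a pattern is just a string until regexp.Compile or regexp.MustCompile parses it at run time. That parse catches syntax errors, but nothing catches logic errors. A pattern can compile successfully and still match the wrong strings.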
If you use MustCompile with a bad pattern, the call panics. For a package-level variable, that happens at startup, before main runs. The panic message describes the syntax error. For example, a missing closing parenthesis triggers regexp: Compile(...): error parsing regexp: missing closing ): .... This is helpful during development. It stops you from deploying broken code.
If you use Compile with a bad pattern, it returns a nil *regexp.Regexp and an error. You must handle this error. Ignoring it leads to a nil pointer dereference as soon as you call a method on the result. The Go compiler will not stop you: assigning the error to _ compiles without complaint. Only an err variable that is declared and never used produces a compile error (declared and not used).
// Bad: discarding the error from Compile.
// This compiles cleanly, but re is nil when the pattern is invalid,
// and the first method call on it panics.
re, _ := regexp.Compile(userInput)

// Good: handling the error.
re, err := regexp.Compile(userInput)
if err != nil {
	return fmt.Errorf("invalid regex pattern: %w", err)
}
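The fragment above can be turned into a complete program along these lines. The function name compileUserPattern and the sample patterns are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileUserPattern (hypothetical helper) compiles an untrusted pattern
// and wraps any syntax error with context for the caller.
func compileUserPattern(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid regex pattern: %w", err)
	}
	return re, nil
}

func main() {
	// The second pattern is missing its closing parenthesis.
	for _, p := range []string{`^[0-9]+$`, `(unclosed`} {
		re, err := compileUserPattern(p)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("compiled:", re.String())
	}
}
```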
Another pitfall is performance. Compiling a regex is slow compared with matching. Never compile inside a hot loop. Move compilation to package level or cache the compiled *regexp.Regexp. A compiled *regexp.Regexp is safe for concurrent use. Multiple goroutines can call MatchString on the same compiled regex without locks.
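A short sketch of sharing one compiled regex across goroutines follows. The names digitsRe and countMatches are hypothetical. The atomic counter protects only the shared tally; the regex itself needs no synchronization.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
	"sync/atomic"
)

// digitsRe is compiled once and shared by every goroutine.
var digitsRe = regexp.MustCompile(`^[0-9]+$`)

// countMatches (hypothetical helper) checks each input in its own
// goroutine and returns how many inputs matched.
func countMatches(inputs []string) int64 {
	var (
		wg    sync.WaitGroup
		count int64
	)
	for _, in := range inputs {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			// No lock is needed around MatchString.
			if digitsRe.MatchString(s) {
				atomic.AddInt64(&count, 1)
			}
		}(in)
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(countMatches([]string{"123", "abc", "42", "4x2"})) // 2
}
```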
Unicode handling is another gotcha. Go's regexp package supports UTF-8, but character classes like \w match only ASCII word characters. \w is equivalent to [A-Za-z0-9_]. It does not match accented letters or non-Latin scripts. Use explicit character ranges or Unicode properties if you need full Unicode support. The pattern [\p{L}\p{N}] matches any Unicode letter or number. The \p{...} classes are part of the supported syntax and need no extra flag or import.
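The difference shows up quickly with non-ASCII input. In this sketch the variable names are hypothetical, and the underscore is added to the Unicode class by hand so that it mirrors what \w allows.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// asciiWord uses \w, which covers only [0-9A-Za-z_].
	asciiWord = regexp.MustCompile(`^\w+$`)
	// unicodeWord accepts any Unicode letter or number, plus underscore.
	unicodeWord = regexp.MustCompile(`^[\p{L}\p{N}_]+$`)
)

func main() {
	fmt.Println(asciiWord.MatchString("hello_42"))   // true
	fmt.Println(unicodeWord.MatchString("hello_42")) // true

	fmt.Println(asciiWord.MatchString("straße"))   // false
	fmt.Println(unicodeWord.MatchString("straße")) // true

	fmt.Println(asciiWord.MatchString("日本語"))   // false
	fmt.Println(unicodeWord.MatchString("日本語")) // true
}
```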
Decision matrix
Pick the right tool for the job. Regex is powerful, but it adds complexity. Use simpler tools when they suffice.
Use regexp.MustCompile when the pattern is hardcoded and you want to fail fast at startup. The panic catches typos early.
Use regexp.Compile when the pattern comes from user input or configuration, and you need to handle syntax errors gracefully.
Use MatchString when you only need a yes/no answer and don't need to extract parts of the match.
Use FindStringSubmatch when you need to extract specific groups from the input, like the domain from an email.
Use FindAllStringSubmatch when you need to find all occurrences of a pattern in a large text, such as extracting all URLs from a log file. Pass -1 as the second argument to return every match.
Use simple string functions like strings.Contains or strings.HasPrefix when the pattern is trivial. Regex adds overhead and complexity for no gain.
Use url.Parse from net/url when you need to validate or manipulate URLs. The parser handles edge cases and returns a *url.URL. It is lenient, so check the Scheme and Host fields afterward.
Use net.ParseIP when you need to validate IP addresses. It handles both IPv4 and IPv6 correctly.
Use a dedicated library like libphonenumber when you need global phone number validation. Regex cannot handle the complexity of international formats.
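Of the functions in this list, FindAllStringSubmatch is the only one not shown earlier, so here is a minimal sketch. The pattern, the helper name extractHosts, and the log lines are all made up for illustration. The pattern is deliberately unanchored because the goal is to find matches inside a larger text.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlRe captures the scheme in group 1 and the host in group 2.
var urlRe = regexp.MustCompile(`(https?)://([a-zA-Z0-9.-]+)`)

// extractHosts (hypothetical helper) returns the host of every HTTP or
// HTTPS URL found in text.
func extractHosts(text string) []string {
	var hosts []string
	// -1 asks for all matches. Each match is a slice: full match first,
	// then one entry per capture group.
	for _, m := range urlRe.FindAllStringSubmatch(text, -1) {
		hosts = append(hosts, m[2])
	}
	return hosts
}

func main() {
	log := "GET https://example.com/a 200\nGET http://test.org/b 404"
	fmt.Println(extractHosts(log)) // [example.com test.org]
}
```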
Compile once. Match many. Anchors save you from false positives. Linear-time matching keeps your server safe from regex bombs. Regex validates format. Business logic validates truth.