How to Implement Language Detection in Go
You build a comment section for a global app. A user types "¡Qué tal!" and hits submit. Your backend receives the string, but the string itself doesn't carry a label saying "this is Spanish." Your moderation pipeline needs to know the language to apply the right rules, or your search index needs to pick the correct tokenizer. Go doesn't guess languages for you. You have to bring a detector.
Language detection relies on statistical models. The library ships with a table of character-sequence frequencies for each language it supports. When you pass text, the algorithm scores how well the text fits each language and returns the best match. This works well for long texts with distinct vocabulary. It struggles with names, code snippets, and very short strings.
Language detection is not magic. It is pattern matching. The detector compares the input against a profile of each language. A profile is a statistical summary of how often certain character sequences appear. English text has a high frequency of "th", "he", and "in". Spanish text has "qu", "ll", and "de". The detector builds a frequency map of the input and calculates a distance metric to each profile. The closest profile wins. This approach works because languages have distinct distributions of characters and digrams. It fails when the input lacks enough data to build a reliable frequency map, or when the input contains characters shared across many languages, like digits or common punctuation.
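The profile comparison described above fits in a few lines. This is a toy illustration, not whatlanggo's actual algorithm: the two profiles are built from single made-up sample sentences, the names bigramProfile, distance, and closest are hypothetical, and the distance is a plain sum of absolute frequency differences.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// Tiny sample sentences standing in for real training data (assumption).
const (
	engSample = "the weather is nice and the children are playing in the garden with their friends"
	spaSample = "el tiempo es bueno y los niños están jugando en el jardín con sus amigos que llegaron"
)

// bigramProfile returns the relative frequency of each letter pair in text.
// Non-letters reset the pair, so bigrams never span spaces or punctuation.
func bigramProfile(text string) map[string]float64 {
	counts := make(map[string]float64)
	total := 0.0
	var prev rune
	for _, r := range strings.ToLower(text) {
		if !unicode.IsLetter(r) {
			prev = 0
			continue
		}
		if prev != 0 {
			counts[string([]rune{prev, r})]++
			total++
		}
		prev = r
	}
	for k := range counts {
		counts[k] /= total
	}
	return counts
}

// distance sums the absolute frequency differences over every bigram
// that appears in either profile. Smaller means more similar.
func distance(a, b map[string]float64) float64 {
	d := 0.0
	for k, v := range a {
		d += math.Abs(v - b[k])
	}
	for k, v := range b {
		if _, seen := a[k]; !seen {
			d += v
		}
	}
	return d
}

// closest returns the name of the profile nearest to the input.
func closest(input string, profiles map[string]map[string]float64) string {
	in := bigramProfile(input)
	best, bestDist := "", math.Inf(1)
	for name, p := range profiles {
		if d := distance(in, p); d < bestDist {
			best, bestDist = name, d
		}
	}
	return best
}

func main() {
	profiles := map[string]map[string]float64{
		"eng": bigramProfile(engSample),
		"spa": bigramProfile(spaSample),
	}
	fmt.Println(closest("they went there in the morning", profiles)) // prints "eng"
	fmt.Println(closest("ellos llegaron por la mañana", profiles))   // prints "spa"
}
```

Real detectors train their profiles on large corpora and usually use longer sequences, but the shape is the same: count, normalize, compare, pick the nearest.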
Here's the simplest way to detect a language: import a library, pass a string, get back a code.
package main

import (
	"fmt"

	"github.com/abadojack/whatlanggo"
)

func main() {
	// Sample text for detection
	text := "Hello world"

	// Detect returns an Info struct holding the language, script, and confidence score
	info := whatlanggo.Detect(text)

	// Print the ISO 639-3 code of the detected language
	fmt.Println(info.Lang.Iso6393())
}
For English input like this, the expected output is eng. The library reports ISO 639-3 codes, not the two-letter codes you might expect: English is eng, Spanish is spa, and German is deu. whatlanggo's Lang type offers both forms: its Iso6393 method returns the three-letter code and its Iso6391 method returns the two-letter one. If three-letter codes reach your system from somewhere else, you need to map them yourself.
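When three-letter codes arrive from somewhere other than the detector, such as stored records or another service, a small lookup table does the conversion. The sketch below is only an illustration: the helper name toTwoLetter is hypothetical and the table covers a handful of languages. The code pairs themselves come from ISO 639.

```go
package main

import "fmt"

// twoLetter maps a few ISO 639-3 codes to their ISO 639-1 equivalents.
// The table is deliberately incomplete; extend it for the languages you serve.
var twoLetter = map[string]string{
	"eng": "en",
	"spa": "es",
	"deu": "de",
	"fra": "fr",
	"ita": "it",
	"por": "pt",
	"rus": "ru",
	"jpn": "ja",
	"kor": "ko",
}

// toTwoLetter returns the two-letter code and true, or "" and false
// when the three-letter code is not in the table.
func toTwoLetter(code string) (string, bool) {
	short, ok := twoLetter[code]
	return short, ok
}

func main() {
	for _, code := range []string{"eng", "spa", "deu", "tlh"} {
		if short, ok := toTwoLetter(code); ok {
			fmt.Printf("%s -> %s\n", code, short)
		} else {
			fmt.Printf("%s -> not in table\n", code)
		}
	}
}
```

Returning a second boolean keeps the missing case explicit, so callers decide whether to fall back to the three-letter code or reject the input.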
When you call Detect, the library first identifies the script of the text. It then builds a frequency map of character sequences and compares that map against pre-computed profiles for dozens of languages. The result includes a confidence score. High confidence means the patterns match strongly. Low confidence means the text is too short, too ambiguous, or contains a mix of languages. The library returns the language with the highest score, even if the score is low. Detect returns an Info struct with three fields: Lang, Script, and Confidence. Lang identifies the language; its String method gives the human-readable name and its Iso6393 method gives the code. Script is the detected Unicode script. Confidence is a float between 0 and 1. A confidence of 0.9 means the detector is very sure. A confidence of 0.2 means the detector is guessing.
In a real service, you detect language inside a request handler and use the result to route work or store metadata.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/abadojack/whatlanggo"
)

// DetectHandler processes POST requests and returns the detected language
func DetectHandler(w http.ResponseWriter, r *http.Request) {
	// Reject anything other than POST
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}

	// Decode JSON body into a temporary struct
	var payload struct {
		Text string `json:"text"`
	}
	// Return 400 if the JSON is malformed or missing
	if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
		http.Error(w, "invalid JSON", http.StatusBadRequest)
		return
	}

	// Run detection; returns an Info struct with Lang, Script, and Confidence
	info := whatlanggo.Detect(payload.Text)

	// Build response map with the ISO 639-3 code and human-readable name
	result := map[string]string{
		"code": info.Lang.Iso6393(),
		"name": info.Lang.String(),
	}

	// Set header and encode response
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(result); err != nil {
		log.Printf("encode response: %v", err)
	}
}

func main() {
	http.HandleFunc("/detect", DetectHandler)
	// ListenAndServe only returns on failure, so log the error and exit
	log.Fatal(http.ListenAndServe(":8080", nil))
}
The handler decodes the request, runs detection, and returns the result. The error handling follows the standard Go pattern. The community accepts the boilerplate because it makes the unhappy path visible. If the JSON is bad, the function returns early. Once decoding succeeds, detection always yields a result, though the confidence score decides how far to trust it. Trust gofmt. Argue logic, not formatting. Most editors run gofmt on save, so the code stays consistent without debate.
Short strings break detection. If you pass "a", there aren't enough patterns, so the guess is close to arbitrary. Always check the confidence score. If the score is below a threshold, treat the result as unknown. Mixed languages confuse the detector. If the text is "Hello mundo", the detector picks the dominant pattern. This can be wrong. CJK text needs extra care. Chinese and Japanese both use Han characters, while Korean is written mostly in Hangul with only occasional Han characters. Script alone separates Korean and any Japanese that contains kana, but a short Japanese string written entirely in kanji can look like Chinese. Some libraries use word boundaries or specific vocabulary lists to separate such cases. If your app handles East Asian text, test the detector thoroughly.
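Script inspection gives a cheap first cut for East Asian text before any statistical model runs. The sketch below uses only the standard unicode package. The function name guessCJK and its return labels are hypothetical, and the rule "Han characters with no kana" is a heuristic that cannot tell Chinese from Japanese written entirely in kanji, which is why that case gets its own label.

```go
package main

import (
	"fmt"
	"unicode"
)

// guessCJK returns a coarse label based only on which scripts appear in text.
func guessCJK(text string) string {
	var han, kana, hangul bool
	for _, r := range text {
		switch {
		case unicode.Is(unicode.Hangul, r):
			hangul = true
		case unicode.In(r, unicode.Hiragana, unicode.Katakana):
			kana = true
		case unicode.Is(unicode.Han, r):
			han = true
		}
	}
	switch {
	case hangul:
		return "korean"
	case kana:
		return "japanese"
	case han:
		// Could be Chinese or kanji-only Japanese; script alone can't say.
		return "han-only"
	default:
		return "not-cjk"
	}
}

func main() {
	fmt.Println(guessCJK("안녕하세요"))    // Hangul
	fmt.Println(guessCJK("こんにちは、世界")) // kana plus Han
	fmt.Println(guessCJK("你好，世界"))    // Han only
	fmt.Println(guessCJK("hello"))    // Latin
}
```

Anything labeled han-only is a good candidate for a second pass with a statistical detector or for manual review.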
If you forget to import the package, the compiler rejects the build with undefined: whatlanggo. If you import it but don't use it, you get "github.com/abadojack/whatlanggo" imported and not used. Go requires every import to be referenced. The compiler enforces clean code. If you pass the wrong type to Detect, recent Go versions complain with cannot use x (variable of type int) as string value in argument to whatlanggo.Detect. The type system catches mistakes early.
Detection is probabilistic. Trust the score, not just the code.
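Here is one way to put that rule into code. The detection struct below is a hypothetical stand-in for whatever your detector returns, and the 0.5 cutoff is an arbitrary starting point, not a recommended value. The fallback "und" is the ISO 639 code for an undetermined language.

```go
package main

import "fmt"

// detection is a hypothetical stand-in for a detector's result.
type detection struct {
	Code       string  // ISO 639-3 code of the best guess
	Confidence float64 // between 0 and 1
}

// minConfidence is an arbitrary cutoff; tune it against your own data.
const minConfidence = 0.5

// codeOrUnknown returns the detected code, or "und" when the
// confidence falls below the cutoff.
func codeOrUnknown(d detection) string {
	if d.Confidence < minConfidence {
		return "und"
	}
	return d.Code
}

func main() {
	fmt.Println(codeOrUnknown(detection{Code: "spa", Confidence: 0.92})) // prints "spa"
	fmt.Println(codeOrUnknown(detection{Code: "fra", Confidence: 0.18})) // prints "und"
}
```

Storing "und" instead of a shaky guess keeps bad labels out of your data and gives you an easy filter for text that needs a second look.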
Use whatlanggo when you need a lightweight, pure-Go detector that compiles quickly and covers the most common languages. Use go-langdetect when accuracy on short or noisy text matters more than binary size and import speed. Use a confidence threshold when a wrong guess causes data corruption and you prefer to flag text for manual review. Use an external API when you need to detect over a hundred languages or require access to proprietary models without maintaining dependencies. Use a simple prefix check when you only need to distinguish between two languages based on a known marker like a locale tag in the input.
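The prefix check is the simplest of these options. The sketch below assumes a made-up input format where the client prepends a tag such as "[es] " or "[en] " to the text; both the format and the name langFromTag are hypothetical, so adapt them to whatever marker your inputs actually carry.

```go
package main

import (
	"fmt"
	"strings"
)

// knownTags lists the only two markers this sketch recognizes.
var knownTags = []string{"en", "es"}

// langFromTag looks for a leading "[xx] " marker. It returns the tag,
// the text with the marker removed, and whether a marker was found.
func langFromTag(input string) (lang, rest string, ok bool) {
	for _, tag := range knownTags {
		prefix := "[" + tag + "] "
		if strings.HasPrefix(input, prefix) {
			return tag, strings.TrimPrefix(input, prefix), true
		}
	}
	return "", input, false
}

func main() {
	for _, in := range []string{"[es] Hola mundo", "[en] Hello world", "Bonjour"} {
		lang, rest, ok := langFromTag(in)
		fmt.Printf("%q -> lang=%q rest=%q tagged=%v\n", in, lang, rest, ok)
	}
}
```

When no marker is present the function hands the input back untouched, so the caller can fall through to a real detector.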