DEV Community

Cover image for A Minimal Go Toolkit for Cleaning, Validating, and Querying CSV/TSV/Excel/Parquet Files
nchika
nchika

Posted on

A Minimal Go Toolkit for Cleaning, Validating, and Querying CSV/TSV/Excel/Parquet Files

In many software systems, not all data lives inside a database.
Sometimes it’s stored in structured files such as CSVs, TSVs, or spreadsheets. And in practice, these files are not always clean. Data may be entered manually by someone who isn’t familiar with the format, values may be missing, and rows may contain inconsistencies that break downstream processing.

When the file is large, trying to “fix it in Excel” often leads to crashes — or simply makes it difficult to identify where the invalid records are.

I faced these issues repeatedly in real projects, and eventually decided to build a set of libraries to address them. Since I’m a Go developer, adopting the Python ETL ecosystem was never my first choice.

Instead, I created three small Go libraries that focus on preprocessing, validation, and lightweight analysis of structured file data:

  • fileprep — preprocessing + field-level validation using struct tags
  • fileframe — a tiny, immutable DataFrame for filtering and inspection
  • filesql — run SQL directly on CSV/TSV/LTSV/Excel/Parquet via embedded SQLite

These libraries are independent from one another.
What connects them is the shared use of the io.Reader interface.
By passing data as a familiar io.Reader, you don’t have to think about any custom data structures inside the libraries — you can work with the standard Go primitives you already know.

Now, let me explain the characteristics of each library.


fileprep

Features

  • Preprocessing: trim, replace, Unicode normalization, type coercion, etc.
  • Clear error reporting: identifies which row and which column caused the failure
  • Supports composite validation (cross-column rules)
  • Works with CSV/TSV/LTSV/Parquet/Excel — anything fed through io.Reader

Example

type User struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase"`
    Age   string
}

func main() {
    csvData := `name,email,age
  John Doe  ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var users []User

    reader, result, err := processor.Process(strings.NewReader(csvData), &users)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)

    for _, user := range users {
        fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
    }

    // reader can be passed directly to filesql
    _ = reader
}
Enter fullscreen mode Exit fullscreen mode

Output:

Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"
Enter fullscreen mode Exit fullscreen mode

fileframe

Features

  • Immutable DataFrame
  • Filtering, mapping, grouping
  • Ideal for “one-time transformations” on small/medium CSV/TSV datasets

Example

// Sample sales data
csvData := `product,amount,category
Apple,100,Fruit
Banana,150,Fruit
Carrot,80,Vegetable
Orange,120,Fruit
Broccoli,90,Vegetable`

df, err := fileframe.NewDataFrame(strings.NewReader(csvData), fileframe.CSV)
if err != nil {
    fmt.Println("Error:", err)
    return
}

fmt.Printf("Total rows: %d\n", df.Len())
fmt.Printf("Columns: %v\n", df.Columns())

// Filter
filtered := df.Filter(func(row map[string]any) bool {
    amount, ok := row["amount"].(int64)
    return ok && amount > 100
})
fmt.Printf("Rows with amount > 100: %d\n", filtered.Len())

// GroupBy + Sum
groupedDf, err := df.GroupBy("category")
if err != nil {
    fmt.Println("Error:", err)
    return
}
grouped, err := groupedDf.Sum("amount")
if err != nil {
    fmt.Println("Error:", err)
    return
}

for _, row := range grouped.ToRecords() {
    fmt.Printf("  %s: %.0f\n", row["category"], row["sum_amount"])
}
Enter fullscreen mode Exit fullscreen mode

Output:

Total rows: 5
Columns: [product amount category]
Rows with amount > 100: 2
Fruit: 370
Vegetable: 170
Enter fullscreen mode Exit fullscreen mode

filesql

filesql is not “SQL on files” in the literal sense.
Internally, it loads your data into a temporary SQLite database, giving you SQL capabilities without requiring you to manage the database yourself.

Example

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    db, err := filesql.OpenContext(ctx, "data.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    rows, err := db.QueryContext(ctx, "SELECT * FROM data WHERE age > 25")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var name string
        var age int
        if err := rows.Scan(&name, &age); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("Name: %s, Age: %d\n", name, age)
    }
}
Enter fullscreen mode Exit fullscreen mode

If you're a Go developer who doesn’t need the full Apache or Python ETL ecosystems, these three lightweight libraries may fit your workflow well.

The user base is still small, so there may be undiscovered bugs — feedback and issue reports are very welcome.

Top comments (0)