Eitamos Ring
Building a PostgreSQL Parser in Go: What Broke After We Open-Sourced It


postgresparser is a pure-Go PostgreSQL SQL parser. It turns SQL text into structured metadata (tables, columns, joins, filters, DDL actions, parameters) without executing queries.

We thought it was solid. Open source proved we were wrong.

Here is what open source forced us to learn.

The biggest shift was not “more bug reports.” It was use-case expansion.

We built for our workflow. Users showed up with very different workloads.
In the first week after release, most feedback centered on deterministic batch parsing.

Our internal assumptions broke immediately

Inside a single team, ambiguous behavior survives because everyone “knows” the rules. Public users do not have that context.

The first pressure point was multi-statement SQL. We had ParseSQL (single statement) and figured batch parsing was “close enough.” It was not.

People were using the parser for:

  • CI linting pipelines
  • production tools
  • LLM wrappers

People asked practical questions we could not answer cleanly:

  • Which exact statement failed?
  • Is this a warning or a hard failure?
  • Can I map diagnostics to the original SQL text reliably?

Those questions forced us to define strict contracts instead of relying on implied behavior.

If your tool consumes SQL in bulk, batch correlation is everything.

Broken behavior example

This input exposed the issue quickly:

SELECT 1;
SELECT FROM;
SELECT 2;

Early batch behavior made correlation awkward because results were compacted and diagnostics were not statement-first. If you’re building CI checks or migration tooling, “something in the batch failed” is not actionable.

Now each statement has deterministic correlation (Index, RawSQL, Query, Warnings), so downstream code can point to the exact source statement.
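To make that concrete, here is a minimal sketch of how statement-first correlation lets downstream code point at a line in the original SQL. `StatementResult` and `lineOf` are illustrative stand-ins, not the library's actual types: the real `RawSQL` field comes from the parse result, and this mapping assumes the raw text appears verbatim in the source.

```go
package main

import (
	"fmt"
	"strings"
)

// StatementResult is a local stand-in for the statement-first shape
// described above (Index + RawSQL), not the library's real type.
type StatementResult struct {
	Index  int
	RawSQL string
}

// lineOf returns the 1-based line in src where the statement's raw
// text begins, by locating it and counting preceding newlines.
func lineOf(src string, stmt StatementResult) int {
	off := strings.Index(src, stmt.RawSQL)
	if off < 0 {
		return -1 // raw text not found verbatim
	}
	return 1 + strings.Count(src[:off], "\n")
}

func main() {
	src := "SELECT 1;\nSELECT FROM;\nSELECT 2;"
	bad := StatementResult{Index: 1, RawSQL: "SELECT FROM;"}
	fmt.Printf("statement %d failed at line %d\n", bad.Index, lineOf(src, bad))
	// prints: statement 1 failed at line 2
}
```

That is exactly the answer to "which exact statement failed?": an index into the batch plus a position in the source text.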

Before/after API diff

- type ParseBatchResult struct {
-   Queries          []*ParsedQuery
-   Warnings         []ParseWarning
-   TotalStatements  int
-   ParsedStatements int
- }
+ type StatementParseResult struct {
+   Index    int
+   RawSQL   string
+   Query    *ParsedQuery   // nil => IR conversion failure
+   Warnings []ParseWarning // statement-scoped warnings
+ }
+
+ type ParseBatchResult struct {
+   Statements       []StatementParseResult
+   TotalStatements  int
+   ParsedStatements int
+   HasFailures      bool
+ }

That shape is less convenient for quick demos, but much better for real integration.
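For example, a CI check can turn that shape into an explicit warning-versus-failure policy: hard failures always break the build, warnings only under a strict flag. The `Statement` and `gate` types below are hypothetical stand-ins for illustration, not the package's real definitions:

```go
package main

import "fmt"

// Statement mirrors the statement-first result shape for this sketch;
// in the real API a hard failure is signaled by Query == nil.
type Statement struct {
	Index    int
	RawSQL   string
	Failed   bool
	Warnings []string
}

// gate returns a non-nil error when the batch should fail CI:
// always on invalid statements, and on warnings only in strict mode.
func gate(stmts []Statement, strict bool) error {
	for _, s := range stmts {
		if s.Failed {
			return fmt.Errorf("statement %d is invalid: %q", s.Index, s.RawSQL)
		}
		if strict && len(s.Warnings) > 0 {
			return fmt.Errorf("statement %d has warnings: %v", s.Index, s.Warnings)
		}
	}
	return nil
}

func main() {
	stmts := []Statement{
		{Index: 0, RawSQL: "SELECT 1;"},
		{Index: 1, RawSQL: "SELECT FROM;", Failed: true},
		{Index: 2, RawSQL: "SELECT 2;"},
	}
	if err := gate(stmts, false); err != nil {
		fmt.Println("lint failed:", err)
	}
}
```

The same loop answers "is this a warning or a hard failure?" without any guesswork, because the distinction lives in the result shape rather than in documentation.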

Real SQL in the wild is much uglier than test fixtures

Open source usage also brought SQL shapes we did not have in internal tests:

  • trailing semicolons and odd whitespace
  • invalid syntax in the middle of an otherwise valid batch
  • mixed DDL + DML scripts
  • ONLY variants in DDL paths

The parser had to become resilient without becoming vague. That meant:

  • better statement-level warning attribution
  • explicit failure semantics (Query == nil)
  • tighter handling across DDL relation extraction paths
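To illustrate why statement attribution is harder than it looks, here is a toy semicolon splitter that at least respects single-quoted literals. This is a sketch only (no dollar-quoting, comments, or escape handling) and not how the parser actually tokenizes; it just shows why a naive `strings.Split(sql, ";")` misattributes statements:

```go
package main

import (
	"fmt"
	"strings"
)

// splitStatements splits SQL on semicolons while tracking whether we
// are inside a single-quoted string literal, so "SELECT 'a;b';" stays
// one statement instead of being cut in half.
func splitStatements(sql string) []string {
	var stmts []string
	var b strings.Builder
	inString := false
	for i := 0; i < len(sql); i++ {
		c := sql[i]
		switch {
		case c == '\'':
			inString = !inString
			b.WriteByte(c)
		case c == ';' && !inString:
			if s := strings.TrimSpace(b.String()); s != "" {
				stmts = append(stmts, s+";")
			}
			b.Reset()
		default:
			b.WriteByte(c)
		}
	}
	if s := strings.TrimSpace(b.String()); s != "" {
		stmts = append(stmts, s)
	}
	return stmts
}

func main() {
	for i, s := range splitStatements("SELECT 'a;b'; SELECT 2;") {
		fmt.Printf("idx=%d raw=%q\n", i, s)
	}
}
```

Even this tiny sketch has edge cases (PostgreSQL also allows `$$ ... $$` dollar-quoting and `--` comments), which is why warning attribution and `RawSQL` extraction had to be tightened rather than bolted on.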

One concrete snippet (current behavior)

batch, err := postgresparser.ParseSQLAll(sql)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("total=%d parsed=%d has_failures=%t\n",
    batch.TotalStatements, batch.ParsedStatements, batch.HasFailures)

for _, stmt := range batch.Statements {
    fmt.Printf("idx=%d failed=%t warnings=%d raw=%q\n",
        stmt.Index, stmt.Query == nil, len(stmt.Warnings), stmt.RawSQL)
}

That is the integration model people asked for: deterministic, inspectable, and boring in the best way.

Why this matters

Open source removed our ability to hand-wave edge cases.

The loop became:

  1. implement
  2. get challenged
  3. simplify
  4. lock behavior with tests
  5. document the contract

That loop made postgresparser better than it would have been as an internal-only tool.
Internal tools can survive ambiguity. Public libraries cannot.

If you're building something on top of postgresparser, open an issue. Real-world SQL keeps improving the contract.

Top comments (2)

Clara Bennett

Really appreciate the honesty about what broke post open-source. Parsing SQL is deceptively hard — Postgres has so many edge cases in its grammar that even their own parser is ~15k lines of yacc. Building a pure Go parser for this is ambitious. How do you handle dialect-specific extensions and newer syntax like MERGE or JSON path expressions? Also curious if you considered wrapping pg_query_go (which uses the actual Postgres parser via CGO) vs building from scratch. The pure-Go approach is way more portable but the correctness tradeoffs are real.

Eitamos Ring

You nailed the trade-off. We definitely looked at pg_query_go, but CGO is often a dealbreaker for our users. If you're deploying to Lambda, ARM, or scratch images, you really want that CGO_ENABLED=0 portability. go build and nothing else is a huge win for ops.
For things like MERGE and JSON path expressions, we use an ANTLR4 grammar. It's definitely a climb to keep up with the ~15k lines of yacc in the native parser, but it lets us map the syntax directly into a clean Go IR that's easy to work with.

As for correctness: honestly, open-sourcing it was the missing piece. Users throw "uglier" SQL at us than we ever would've written in a test fixture, and that real-world stress test is how we're closing the gap on those edge cases.
The goal of the package has always been to be dead simple to use and to provide a clean IR layer for reading SQL.