
SEN LLC

Writing a Safe SQL INSERT Generator Taught Me Why ORMs Exist


Every engineer writes a CSV → SQL INSERT script at some point. Usually it's a 20-line fgetcsv loop with string interpolation, and usually it's wrong. I wrote one properly — four dialects, type inference, a prepared-statement escape hatch — and learned more about SQL string literals than I wanted to.

📦 GitHub: https://github.com/sen-ltd/csv-to-sql

csv-to-sql converting a CSV with an apostrophe in the data to MySQL, Postgres and prepared-statement output

The problem everyone has

You know the drill. A PM sends you a CSV in Slack. "Can you get this into the database?" The spreadsheet has 400 rows: nothing that would justify a nightly ETL job, and small enough that you could do it by hand if you hated yourself. So you write a quick PHP script.

<?php
$h = fopen('users.csv', 'r');
$headers = fgetcsv($h);
while ($row = fgetcsv($h)) {
    $values = array_map(fn($v) => "'$v'", $row);
    echo "INSERT INTO users VALUES (" . implode(',', $values) . ");\n";
}

Thirty seconds, done. Except:

  • Row 3 contains Carol's. Your script emits 'Carol's', which is invalid SQL. The statement aborts halfway through the file, and you have no idea which rows made it in.
  • The ZIP code column starts with 01234. Cast that to an integer anywhere along the way, in your script or in an INT column someone created, and it silently becomes 1234. Your PM notices a week later.
  • One of the names is Robert'); DROP TABLE users;--. Your PM doesn't notice.

This is the "just a quick script" trap. I decided to stop falling into it every few months and build csv-to-sql properly. Four dialects (MySQL, Postgres, SQLite, standard), four modes (plain insert, insert-ignore, upsert, prepared statements), type inference, batch size control, zero runtime dependencies. PHP 8.2, fgetcsv under the hood, 73 unit tests. The whole tool is about 700 lines.

This post is about the three things I got wrong on my first pass, and why I now think the "prepared statement" mode — the escape hatch that forces the caller to do the work — is actually the honest recommendation.

Mistake #1: using \' for single quotes

My first instinct was "MySQL uses \' to escape single quotes, right?"

$escaped = str_replace("'", "\\'", $value);
echo "'$escaped'";

This is wrong in a subtle way. It works most of the time on MySQL. It breaks on:

  • Any Postgres database. Postgres doesn't recognize \' by default — it treats the backslash as a literal character. 'Carol\'s' is parsed as 'Carol\' followed by s', which is a syntax error.
  • SQLite. Same deal — SQLite has no concept of backslash escapes inside string literals.
  • MySQL with sql_mode = NO_BACKSLASH_ESCAPES. This mode is on by default for some managed MySQL instances, and it changes the meaning of \' to "a literal backslash followed by a closing quote." Your previously-safe SQL is now an injection vector.

The SQL standard has one universal escape for single quotes inside string literals: double them. Carol's becomes 'Carol''s'. Every mainstream engine accepts this — MySQL, Postgres, SQLite, Oracle, SQL Server, and anything else that conforms to the standard. It is the one rule you can rely on.

So the escape function in csv-to-sql is deliberately boring:

public function literal(string|int|float|bool|null $value): string
{
    if ($value === null) {
        return 'NULL';
    }
    if (is_bool($value)) {
        return $this->dialect->boolLiteral($value);
    }
    if (is_int($value)) {
        return (string) $value;
    }
    if (is_float($value)) {
        if (!is_finite($value)) {
            throw new \InvalidArgumentException('NaN/Inf cannot be emitted as SQL literal');
        }
        return (string) $value;
    }

    // String path. Reject embedded NUL — no dialect can encode it in a literal.
    if (strpos($value, "\0") !== false) {
        throw new \InvalidArgumentException('NUL byte in value cannot be escaped');
    }

    // The core rule: double any single quote. Works in every dialect we target.
    $escaped = str_replace("'", "''", $value);
    return "'" . $escaped . "'";
}

There's no addslashes. No mysqli_real_escape_string. No backslash handling. Just double the single quote and wrap the result in single quotes. The dialect differences go in boolLiteral() (MySQL wants TRUE/FALSE, SQLite wants 1/0) and in identifier quoting (backticks vs double-quotes) — not in the string literal escape.
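To make that seam concrete, here's a minimal sketch of the dialect split. boolLiteral() and the dialect class names come from the post; the interface shape and quoteIdentifier() are my illustration, not necessarily the repo's exact API:

```php
<?php

// Sketch of the dialect seam. boolLiteral() is from the post;
// the interface and quoteIdentifier() are illustrative.
interface Dialect
{
    public function boolLiteral(bool $v): string;
    public function quoteIdentifier(string $name): string;
}

final class MySqlDialect implements Dialect
{
    public function boolLiteral(bool $v): string
    {
        return $v ? 'TRUE' : 'FALSE';
    }

    public function quoteIdentifier(string $name): string
    {
        // MySQL identifiers use backticks; double any embedded backtick.
        return '`' . str_replace('`', '``', $name) . '`';
    }
}

final class SqliteDialect implements Dialect
{
    public function boolLiteral(bool $v): string
    {
        return $v ? '1' : '0';
    }

    public function quoteIdentifier(string $name): string
    {
        // SQLite accepts standard double-quoted identifiers.
        return '"' . str_replace('"', '""', $name) . '"';
    }
}
```

Note what the interface does not contain: string escaping. The Escaper owns that, which is what keeps the quote-doubling rule dialect-independent.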

Unit test:

public function testAllDialectsAgreeOnQuoteDoubling(): void
{
    foreach ([new MySqlDialect(), new PostgresDialect(), new SqliteDialect(), new StandardDialect()] as $d) {
        $e = new Escaper($d);
        $this->assertSame("'it''s'", $e->literal("it's"), "dialect {$d->name()} broke '' escaping");
    }
}

Passing this test for all four dialects felt anticlimactic. That's the point: the correct answer is the same everywhere.

Mistake #2: treating int-looking columns as integers

Type inference sounds like it should be simple: scan the first 100 rows, if every value parses as an integer, declare the column INTEGER. Done.

Then your PM sends a CSV with a ZIP code column. 01234, 02111, 90210. intval("01234") is 1234. You just dropped leading zeros from every row.

Or a phone column. 0123456789. That's a phone number. It's not an integer, it's a string that happens to contain digits.
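Both failure modes take one line to reproduce:

```php
<?php

// Numeric coercion silently destroys leading zeros.
var_dump(intval('01234'));      // int(1234): the ZIP's leading zero is gone
var_dump((int) '0123456789');   // int(123456789): the phone number is mangled
var_dump('01234' == 1234);      // bool(true): loose comparison agrees they match
```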

The rule that works: a column with any leading-zero value is not an integer. Standalone 0 is fine, but 007 or 01234 must fall back to text. This is the single most important heuristic in the whole inferrer, and it's the reason for the --type-map override flag — real-world CSVs have columns that look numeric but aren't, and the only general-purpose fix is "let the human override."

public function looksLikeInt(string $s): bool
{
    if ($s === '' || $s === '-' || $s === '+') {
        return false;
    }
    $body = ($s[0] === '-' || $s[0] === '+') ? substr($s, 1) : $s;
    if ($body === '' || !ctype_digit($body)) {
        return false;
    }
    // Leading zero? Keep as text to preserve phone numbers / ZIPs.
    if (strlen($body) > 1 && $body[0] === '0') {
        return false;
    }
    return true;
}

The type inference ladder is conservative and top-down:

  • Every non-null value is an integer → INTEGER.
  • Else, every non-null value is numeric → REAL.
  • Else, every non-null value is a recognized boolean token (true, false, yes, no, t, f, y, n, case-insensitive) → BOOLEAN.
  • Else → TEXT.

One non-matching value drops the column to the next rung. It's an all-or-nothing decision per column because mixed-type columns are almost always bugs in the upstream data, and TEXT is the only type guaranteed to round-trip everything.
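The whole ladder fits in one pass over a column's sampled values. This is an illustrative sketch, not the tool's actual code; inferColumnType() is my name for it, and looksLikeInt() is the function shown above:

```php
<?php

// looksLikeInt() as shown in the post.
function looksLikeInt(string $s): bool
{
    if ($s === '' || $s === '-' || $s === '+') {
        return false;
    }
    $body = ($s[0] === '-' || $s[0] === '+') ? substr($s, 1) : $s;
    if ($body === '' || !ctype_digit($body)) {
        return false;
    }
    // Leading zero? Keep as text.
    return !(strlen($body) > 1 && $body[0] === '0');
}

// Illustrative one-pass ladder. One non-matching value drops a rung;
// a column with no non-null values defaults to TEXT.
function inferColumnType(array $values): string
{
    $allInt = $allReal = $allBool = true;
    $sawValue = false;
    $boolTokens = ['true', 'false', 'yes', 'no', 't', 'f', 'y', 'n'];

    foreach ($values as $v) {
        if ($v === null || $v === '') {
            continue; // nulls and blanks don't vote
        }
        $sawValue = true;
        $allInt = $allInt && looksLikeInt($v);
        // Same leading-zero guard on the REAL rung: is_numeric('01234')
        // is true, but the ZIP must still fall through to TEXT.
        $leadingZero = strlen($v) > 1 && $v[0] === '0' && ctype_digit($v);
        $allReal = $allReal && is_numeric($v) && !$leadingZero;
        $allBool = $allBool && in_array(strtolower($v), $boolTokens, true);
    }

    if (!$sawValue) return 'TEXT';
    if ($allInt)    return 'INTEGER';
    if ($allReal)   return 'REAL';
    if ($allBool)   return 'BOOLEAN';
    return 'TEXT';
}
```

inferColumnType(['1', '2', '3']) gives INTEGER; swap in a single '007' and the whole column drops to TEXT.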

The --type-map override is the escape hatch for when inference is wrong:

csv-to-sql contacts.csv contacts --type-map zip:text,phone:text,score:real

Mistake #3: thinking inference was a substitute for prepared statements

My first version of the tool didn't have a --mode prepared option. Type inference plus proper escaping, I reasoned, is enough — you get safe SQL, and you can pipe it straight into mysql.

But every serious user of the tool asked the same question: "can I get this as a prepared statement instead?" And I understood why. Inference is a guess. The escape function is correct, but the output is a single giant .sql file with every value interpolated into a string. If the escape function has a bug — if someone finds an input shape I didn't think of — it's a production SQL injection. There is no second line of defense.

Prepared statements are the second line of defense. The SQL shape is fixed (INSERT INTO users (id, name) VALUES (?, ?)), the values never touch the SQL parser, and the database driver handles the escaping with knowledge of the actual connection state (including whether NO_BACKSLASH_ESCAPES is set). My escape function and the database's escape function can disagree about edge cases, and only one of them is authoritative.

So --mode prepared emits two things: the SQL template with ? placeholders, and a JSON file of parameter rows. You hand both to your PDO-based loader:

csv-to-sql users.csv users --mode prepared --params-out params.json
-- prepared statement SQL (use parameter array for values):
INSERT INTO `users` (`id`, `name`, `age`, `active`) VALUES (?, ?, ?, ?);
-- parameter rows written to: params.json
$sql = file_get_contents('insert.sql');
$params = json_decode(file_get_contents('params.json'), true);
$stmt = $pdo->prepare($sql);
foreach ($params as $row) {
    $stmt->execute($row);
}
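The loader is also where I'd add a transaction. The tool never opens a database connection itself, so this is caller-side code; here's a sketch against an in-memory SQLite database:

```php
<?php

// Caller-side loader wrapped in a transaction, demoed on in-memory
// SQLite. A failure on row N rolls back rows 1..N-1 instead of
// leaving a half-loaded table, the exact failure mode from the intro.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE users (id INTEGER, name TEXT)');

$sql = 'INSERT INTO users (id, name) VALUES (?, ?)';
$params = [[1, 'alice'], [2, "carol's"]]; // apostrophe: no escaping needed

$pdo->beginTransaction();
try {
    $stmt = $pdo->prepare($sql);
    foreach ($params as $row) {
        $stmt->execute($row);
    }
    $pdo->commit();
} catch (\Throwable $e) {
    $pdo->rollBack();
    throw $e;
}
```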

This is the mode I recommend for anything that touches a production database. The plain --mode insert is for throwaway local bootstrapping — loading a CSV into a local SQLite file you're going to rm in an hour. If you're going anywhere near production, --mode prepared is the answer.

Writing this mode was the moment I realized what I was building. Every piece of functionality in csv-to-sql — the per-dialect escaper, the type inference, the batch sizing, the identifier quoting — exists because the job of safely getting data into a database is annoyingly hard, and ORMs exist so you don't have to do it by hand. My 700-line tool is basically the first 10% of a Doctrine or an Eloquent, with all the hard cases (relations, transactions, migrations) cut out. That's fine — it does one job — but now I see ORMs with different eyes. Every feature in Doctrine that seemed like bloat is a feature I almost reimplemented this week.

Tradeoffs I made explicitly

  • No schema migrations. --include-create emits a bare CREATE TABLE IF NOT EXISTS with the inferred types. That's it. Alembic, this is not.
  • No BLOB/binary support. CSV is a text format; binary data belongs in a different pipeline.
  • Big inputs are memory-heavy. Each tuple is built in memory before a batch streams out, so very wide tables and very large files both hurt. A 10 MB CSV is fine; a 10 GB CSV is not the right tool.
  • Batch size tradeoff. Larger batches = faster ingest but more memory per statement and longer parse time on the server. 100 is a reasonable default; 500 is where I tune it for imports I actually care about.
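Mechanically, batching is just chunking the escaped value tuples. A sketch, under the assumption that $tuples are pre-escaped "(v1, v2)" strings and $table is a pre-quoted identifier; emitBatches() is my name, not the tool's API:

```php
<?php

// Illustrative batching: one multi-row INSERT per chunk of
// $batchSize tuples. Escaping happens before this point.
function emitBatches(string $table, array $tuples, int $batchSize): array
{
    $statements = [];
    foreach (array_chunk($tuples, $batchSize) as $chunk) {
        $statements[] = "INSERT INTO {$table} VALUES " . implode(', ', $chunk) . ';';
    }
    return $statements;
}
```

250 tuples at --batch-size 100 come out as three statements: 100, 100, and 50 rows.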

Try it in 30 seconds

docker build -t csv-to-sql https://github.com/sen-ltd/csv-to-sql.git

mkdir -p /tmp/csv && cat > /tmp/csv/users.csv << 'CSV'
id,name,age,active
1,alice,30,true
2,"bob, jr",25,false
3,carol's,,true
CSV

# MySQL with carol's properly escaped
docker run --rm -v /tmp/csv:/work csv-to-sql users.csv users --dialect mysql

# Postgres upsert
docker run --rm -v /tmp/csv:/work csv-to-sql users.csv users \
  --dialect postgres --mode upsert --key id

# Prepared statements + JSON params
docker run --rm -v /tmp/csv:/work csv-to-sql users.csv users --mode prepared

If you find an input the escape function doesn't handle, open an issue — and in the meantime, use --mode prepared. That's the whole pitch.

Code's on GitHub: sen-ltd/csv-to-sql. MIT licensed.
