SEN LLC

Posted on Apr 16

CSV Sort, Done Right: Why `sort -t,` Is a Trap and What a 300-Line PHP CLI Does Instead

#php #cli #csv #tutorial

CSV Sort, Done Right: Why `sort -t,` Is a Trap and What a 300-Line PHP CLI Does Instead

TL;DR — Repo: https://github.com/sen-ltd/csv-sort. sort -t, looks like it sorts CSV. It doesn't. It splits on literal commas, which breaks the moment you have a quoted "Tokyo, Japan". And everything's a string, so "10" < "2". csv-sort is a small PHP 8.2 CLI that reads CSV with fgetcsv, sorts with a type-aware comparator, writes with fputcsv. Zero Composer runtime deps. Docker image is 51 MB. 57 tests. MIT.

docker run --rm -v "$PWD:/work" ghcr.io/sen-ltd/csv-sort /work/users.csv --key age --type age=int

There's a special kind of bug you only see once you actually look at the output of a shell pipeline. Someone on your team writes:

sort -t, -k2 users.csv > sorted.csv

It runs. It produces output. The output has the right number of bytes and a header row on top. You tick the JIRA ticket. Three weeks later a customer reports that their users list is truncated, and you open the file and discover that the row for alice, "Tokyo, Japan", 30 has been sliced in half right through the quoted cell, because sort -t, doesn't know what a CSV quoted field is. It splits on every literal comma in the line. It doesn't parse CSV. It never did.

csv-sort is my answer to "please, someone, just sort this CSV file without breaking it." It's a small PHP CLI, about 300 lines of source across six files, built on PHP 8.2's stdlib: fgetcsv, fputcsv, usort. Zero runtime Composer deps. The Docker image is 51 MB. I wrote 57 tests. This article is about the three things the tool gets right that sort -t, gets wrong, and the PHP code that makes them fall out almost for free.

Problem 1: quoted commas

Here's the canonical bad case, one that I've personally shipped to production twice:

name,city
alice,"Tokyo, Japan"
bob,Osaka

If you feed this to sort -t, -k2, sort sees three fields on the alice line (alice, "Tokyo, Japan"), sorts by the second, and the row is now silently mangled. Even worse: if your dataset is big enough, you won't notice. The row count is still right. The header is still on top. Somewhere in the middle, a bunch of rows have been sorted as if they had two separate fields where they should have had one. You'll find out from your customer.

The reason this happens isn't that sort is buggy. It's that sort was never told your file is CSV. It's told the delimiter is , and it obeys. The CSV grammar says: inside double quotes, a comma is not a separator. sort does not implement the CSV grammar. No version of sort does. If you want CSV semantics, you have to go through a CSV parser.

PHP has had one in core forever: fgetcsv. It handles quoted commas, escaped quotes (""), quoted newlines, and CRLF line endings correctly. The whole reader in csv-sort is a thin wrapper around it:

final class Reader
{
    public function readAll(): array
    {
        $headers = [];
        $rows = [];
        $first = true;

        while (($row = fgetcsv($this->handle, 0, $this->delimiter, '"', '"')) !== false) {
            if ($row === [null]) {
                continue; // blank line: PHP returns [null]
            }
            $normalized = array_map(
                static fn($c): string => (string) ($c ?? ''),
                $row,
            );
            if ($first && $this->hasHeader) {
                // Strip UTF-8 BOM if present — only ever on the first header cell
                if ($normalized !== [] && str_starts_with($normalized[0], "\xEF\xBB\xBF")) {
                    $normalized[0] = substr($normalized[0], 3);
                }
                $headers = $normalized;
                $first = false;
                continue;
            }
            $rows[] = $normalized;
        }
        return ['headers' => $headers, 'rows' => $rows];
    }
}

Three details you find out only when you write tests for it:

fgetcsv returns [null] on a blank line, not false and not []. If you don't skip it, a blank line at the end of your file becomes an empty "row" in your sort output.
UTF-8 BOM. Files exported from Excel routinely start with \xEF\xBB\xBF. The BOM sticks on the first header cell — so "ID" is now "\ufeffID" and your --key ID fails with "unknown column." You strip it exactly once, exactly on the first cell of the header row.
fputcsv is the mirror of fgetcsv. Round-tripping a row through reader → sorter → writer re-quotes only the cells that need quoting, and handles the "" escape for literal quotes. You do not have to write the writer. Just don't use your own string concatenation; go through fputcsv.

That's quoted commas handled. Zero custom parsing code.

Problem 2: no type awareness

Here's the second classic failure mode. You want to sort by age:

name,age
alice,30
bob,25
carol,100

You run sort -t, -k2 — forgetting for a moment that it's already broken on quoted commas — and you get:

carol,100    # !
alice,30
bob,25

carol,100 comes first, because "100" sorts lexicographically before "25" because '1' < '2'. Everything's a string. sort has -n for numeric sort, but it still doesn't know your CSV has a header row that isn't a number, so it's going to sort the header into the middle of your data (or you use tail -n +2 and lose it entirely). There is no single sort invocation that gives you a sorted CSV with the header preserved and the "age" column treated as an integer.

In PHP we can just write a comparator that knows what type each column is:

final class Comparator
{
    public static function compare(string $a, string $b, string $type): int
    {
        return match ($type) {
            'int'    => self::compareNullable(self::parseInt($a),   self::parseInt($b)),
            'float'  => self::compareNullable(self::parseFloat($a), self::parseFloat($b)),
            'date'   => self::compareNullable(self::parseDate($a),  self::parseDate($b)),
            default  => strcmp($a, $b),
        };
    }

    public static function parseInt(string $s): ?int
    {
        if ($s === '') return null;
        $rest = ($s[0] === '+' || $s[0] === '-') ? substr($s, 1) : $s;
        if ($rest === '' || !ctype_digit($rest)) return null;
        return (int) $s;
    }

    public static function parseFloat(string $s): ?float
    {
        if ($s === '') return null;
        $v = filter_var($s, FILTER_VALIDATE_FLOAT);
        return $v === false ? null : $v;
    }

    public static function parseDate(string $s): ?int
    {
        if ($s === '') return null;
        $t = strtotime($s);
        return $t === false ? null : $t;
    }

    private static function compareNullable(int|float|null $a, int|float|null $b): int
    {
        if ($a === null && $b === null) return 0;
        if ($a === null) return 1;   // nulls sort AFTER real values
        if ($b === null) return -1;
        return $a <=> $b;
    }
}

A few things worth pointing out, because they're the parts I had to think about more than twice:

Why parseInt is stricter than (int) $s. PHP's cast (int) "12abc" silently returns 12. That's the worst possible behavior for a sort: it hides dirty data by pretending it sorted correctly. The strict parser returns null for anything that isn't ^[-+]?\d+$, which pushes the dirty row into the null bucket where it can be seen.

Why parseInt rejects "1.0" and " 1 ". Those strings are not integers. They're floats and padded ints respectively. For an int column they're a type error the user declared themselves into. If they want "1.0" to sort as 1, they should be using --type col=float.

Why filter_var for floats. Same reason, same bug as (int). (float) "12abc" is 12.0; filter_var("12abc", FILTER_VALIDATE_FLOAT) is false. The latter is what you want.

Why nulls sort after real values. Consider a column of ages where 98 rows have a valid integer and 2 rows say "N/A". If I silently coerce "N/A" to 0, those rows sort to the top of an ascending sort and the user thinks the youngest people in their dataset are named N/A. If I sort nulls after valid values, the clean rows come first and the dirty ones accumulate at the bottom where the user will see them. The behavior is predictable and it surfaces data quality problems instead of hiding them.

Why there's a bool type missing. Because "yes/no/true/false/1/0/Y/N/オン/オフ" is a rabbit hole. Either you pick one convention (rude to users of the others) or you accept all of them (rude to users who typed a typo). I picked neither: if you want booleans sorted, use --type col=int or --type col=string and be explicit about the encoding.

The "mixed types" case. compare("10", "9abc", 'int') returns -1 because 10 parses as an int and 9abc doesn't, so the nullable rule kicks in. This is exactly what you want: the parseable value wins, and you don't get the nonsensical "did 9abc come before or after 10? Let's lexicographically compare them and hope" that you'd get from defaulting to string.

Problem 3: multi-key with per-column reverse

Third thing sort can do but clunkily. Suppose you want "sort by name ascending, then by age descending within each name." In sort land that's roughly:

sort -t, -k1,1 -k2,2nr users.csv    # GNU sort extension syntax

Which works if you remember the syntax, which I have never remembered on the first try. In csv-sort, you write:

csv-sort users.csv --key name,-age --type age=int

The - prefix on age means "reverse this column only." Everything else stays ascending. And the usort wrapper that implements it is genuinely short:

final class Sorter
{
    public function sort(array $rows, array $keys, string $globalOrder = 'asc'): array
    {
        if ($keys === []) return $rows;
        $globalDir = $globalOrder === 'desc' ? -1 : 1;

        usort($rows, static function (array $a, array $b) use ($keys, $globalDir): int {
            foreach ($keys as $k) {
                $cmp = Comparator::compare(
                    $a[$k->index] ?? '',
                    $b[$k->index] ?? '',
                    $k->type,
                );
                if ($cmp !== 0) {
                    $dir = $k->reverse ? -1 : 1;
                    return $cmp * $dir * $globalDir;
                }
            }
            return 0;
        });
        return $rows;
    }
}

That's it. The outer foreach walks keys in declaration order and returns on the first non-zero result, which is the standard way to do a multi-key comparator. The per-column $dir multiplies with the global $globalDir, so --order desc --key name,-age reverses everything and then un-reverses age, which gives name descending and age ascending within each name. This is the same truth table as Excel's multi-key sort and it's what users expect when they think in reversals.

One more subtle thing: PHP's usort has been stable since PHP 8.0. In earlier versions it was not, which led to hilarious bugs in code that relied on stable ordering for --unique-style dedupe. I rely on stability for the --unique flag (which keeps the first occurrence of each sort key), and I enforce PHP 8.2 minimum in composer.json, so the assumption holds.

The --unique implementation is a nice little trick that stability buys you:

public function unique(array $rows, array $keys): array
{
    if ($keys === [] || $rows === []) return $rows;
    $out = [];
    $lastKey = null;
    foreach ($rows as $row) {
        $fp = [];
        foreach ($keys as $k) $fp[] = $row[$k->index] ?? '';
        $fpStr = implode("\x1f", $fp);
        if ($fpStr !== $lastKey) {
            $out[] = $row;
            $lastKey = $fpStr;
        }
    }
    return $out;
}

Because the rows are sorted and the sort is stable, duplicates are guaranteed to be adjacent. So I only need to compare each row against the previous kept row — no seen-set, no hash table, linear scan. "Keep the first occurrence" falls out because stable sort preserves input order among ties.

The test strategy

57 PHPUnit tests, roughly split:

Comparator (18 tests): string lex, int parse including "10" vs "2", negative ints, garbage-rejection for each type, empty-string semantics per type, date mixed formats, unknown-type fallback.
Reader (9 tests): quoted commas, escaped quotes (""), quoted newlines, UTF-8 BOM, blank lines, tab delimiter, semicolon delimiter, headerless mode, empty input.
Sorter (8 tests): int ascending, global desc, per-column reverse, date desc, stability, unique keep-first, empty keys, empty rows.
CLI (22 tests): every flag end-to-end through Cli::run, including exit codes, quoted-comma round-trip, --unique --out file, unknown column (exit 1), unknown type (exit 2), --no-header with numeric column reference.

The killer test is testQuotedCommaRoundTrip — it writes a CSV with "Tokyo, Japan", runs the full CLI, and asserts both that the row count is right and that the quoted string survives verbatim in the output. If that test ever fails, the whole tool is broken, and it's the canary I look at first.

Tradeoffs, honestly

Not a streaming sorter. Sorting is inherently non-streaming (you can't emit row 1 until you've seen them all), so the whole file lives in memory during usort. For a 10 GB CSV you want an external merge sort with temp files, which is a genuinely different tool. For the CSVs I actually sort — under 10M rows — the in-memory approach is 10x simpler and fast enough.
No Unicode collation. string comparisons are strcmp on UTF-8 bytes. "á" sorts after "z" because its byte value is higher. If you need locale-aware collation, use Collator::compare — it's one switch in the comparator, but the interface for "which locale?" is a can of worms I didn't want to open for a small tool.
date only accepts what strtotime accepts. That's ISO 8601, RFC 2822, and the common human forms. If your dates are "DD/MM/YYYY" or "令和6年4月15日", you need to preprocess.
Single-character delimiters only. fgetcsv's API constraint, not mine. If you need "||" you're probably not dealing with a real CSV anyway.

None of these are cases I didn't think about. They're cases I decided not to implement because the tool is small enough to modify if and when I actually need them. YAGNI.

Try it in 30 seconds

git clone https://github.com/sen-ltd/csv-sort
cd csv-sort
docker build -t csv-sort .

cat > users.csv << 'EOF'
name,age,city
alice,30,"Tokyo, Japan"
bob,25,Osaka
carol,45,"Kyoto, Japan"
EOF

docker run --rm -v "$PWD:/work" csv-sort /work/users.csv \
  --key age --type age=int

And then, for the one line of proof you're here for — try it against sort -t, -k3:

sort -t, -k3 users.csv    # watch the row split on the quoted comma

That's the whole pitch. CSV is a file format with rules. If you don't use a parser that knows them, you're not sorting CSV — you're corrupting it and calling the output a sort.

Links

Repo: https://github.com/sen-ltd/csv-sort
MIT license, PHP 8.2+, zero Composer runtime deps
Part of SEN's 100-repo portfolio — we build small, focused tools and write up what we learned

DEV Community

CSV Sort, Done Right: Why `sort -t,` Is a Trap and What a 300-Line PHP CLI Does Instead

CSV Sort, Done Right: Why `sort -t,` Is a Trap and What a 300-Line PHP CLI Does Instead

Problem 1: quoted commas

Problem 2: no type awareness

Problem 3: multi-key with per-column reverse

The test strategy

Tradeoffs, honestly

Try it in 30 seconds

Top comments (0)

CSV Sort, Done Right: Why sort -t, Is a Trap and What a 300-Line PHP CLI Does Instead

Problem 1: quoted commas

Problem 2: no type awareness

Problem 3: multi-key with per-column reverse

The test strategy

Tradeoffs, honestly

Try it in 30 seconds

CSV Sort, Done Right: Why `sort -t,` Is a Trap and What a 300-Line PHP CLI Does Instead