If you work with eDiscovery load files, you've probably hit this: you receive a Concordance DAT file, try to import it into your review platform, and it fails silently or produces garbled data. No useful error message. Just broken records.
There's a good chance the problem is the þ character.
## What makes Concordance DAT files special
Concordance DAT is the most common load file format in eDiscovery. It uses two unusual delimiter characters:
- **Field separator:** þ (thorn, Unicode U+00FE)
- **Quote character:** ® (registered sign, Unicode U+00AE)
These were chosen decades ago specifically because they almost never appear in actual document metadata. Smart choice — until encoding enters the picture.
## The trap: þ has two different byte representations
Here's where things break. The þ character is encoded differently depending on whether the file is CP1252 (Windows-1252) or UTF-8:
| Encoding | þ bytes | ® bytes |
|---|---|---|
| CP1252 | FE | AE |
| UTF-8 | C3 BE | C2 AE |
CP1252 uses a single byte. UTF-8 uses two bytes. If your parser assumes the wrong encoding, it will look for the wrong byte sequence and either miss the delimiters entirely or split fields in the wrong place.
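You can confirm the two representations in a few lines of Python (a quick sanity check, not part of any parser):

```python
# The same two delimiter characters, encoded both ways.
thorn, reg = "\u00fe", "\u00ae"  # þ and ®

print(thorn.encode("cp1252").hex())  # fe
print(thorn.encode("utf-8").hex())   # c3be
print(reg.encode("cp1252").hex())    # ae
print(reg.encode("utf-8").hex())     # c2ae
```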
## What this looks like in practice
Say you have a simple DAT file with three fields:
```
þDOCIDþþBEGBATESþþCUSTODIANþ
þDOC-001þþABC-000001þþSmith, Johnþ
```
If this file was saved in CP1252, the first field, þDOCIDþ, looks like this in hex:

```
FE 44 4F 43 49 44 FE
```
But if you read it as UTF-8, your parser sees FE and thinks: "That's not a valid UTF-8 start byte." Depending on the tool, it will either throw a UnicodeDecodeError, replace the character with �, or silently skip it. In any case, your field boundaries are gone.
The reverse is equally broken. A UTF-8 file has:
```
C3 BE 44 4F 43 49 44 C3 BE
```
If you read this as CP1252, C3 becomes Ã and BE becomes ¾. Your parser sees Ã¾ instead of þ and fails to find any delimiters at all. The entire file becomes one giant unparseable line.
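Both failure directions are easy to reproduce on raw bytes, no files involved:

```python
# Direction 1: CP1252 bytes read as UTF-8 -> 0xFE is not a valid start byte.
cp1252_bytes = "\u00feDOCID\u00fe".encode("cp1252")  # b'\xfeDOCID\xfe'
try:
    cp1252_bytes.decode("utf-8", errors="strict")
except UnicodeDecodeError as e:
    print("UTF-8 decode failed:", e.reason)
print(cp1252_bytes.decode("utf-8", errors="replace"))  # �DOCID�

# Direction 2: UTF-8 bytes read as CP1252 -> mojibake, no thorn anywhere.
utf8_bytes = "\u00feDOCID\u00fe".encode("utf-8")  # b'\xc3\xbeDOCID\xc3\xbe'
garbled = utf8_bytes.decode("cp1252")
print(garbled)              # Ã¾DOCIDÃ¾
print("\u00fe" in garbled)  # False
```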
## Why this happens so often
The root cause is that DAT files have no encoding declaration. There's no BOM requirement, no header, no metadata. You receive a .dat file and you have to guess.
Older tools (Concordance itself, many vendor export pipelines) default to CP1252 because they were built on Windows. Newer cloud platforms (Relativity, Everlaw) increasingly expect UTF-8. When files pass through multiple systems — produced on one platform, QC'd on another, loaded into a third — the encoding can get silently changed or corrupted at any step.
The most dangerous scenario: someone opens a CP1252 DAT file in a text editor that assumes UTF-8, makes a small edit, and saves. Now you have a file where most lines are CP1252 but a few lines got re-encoded to UTF-8. Mixed encoding in a single file. Good luck parsing that.
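Here's a sketch of how that corruption plays out, using an in-memory byte string in place of a real file:

```python
# Two rows saved as CP1252, one accidentally re-saved as UTF-8 by an editor.
rows = ["\u00feDOC-001\u00fe", "\u00feDOC-002\u00fe", "\u00feDOC-003\u00fe"]
mixed = (rows[0].encode("cp1252") + b"\n"
         + rows[1].encode("utf-8") + b"\n"  # the re-encoded line
         + rows[2].encode("cp1252") + b"\n")

# Strict UTF-8 decoding chokes on the CP1252 lines...
try:
    mixed.decode("utf-8")
except UnicodeDecodeError:
    print("whole-file UTF-8 decode fails")

# ...while CP1252 "succeeds" but silently mangles the re-encoded line.
print(mixed.decode("cp1252").splitlines()[1])  # Ã¾DOC-002Ã¾
```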
## How to detect the encoding
A reliable detection strategy uses multiple signals:
Step 1: Check for BOM. If the file starts with EF BB BF, it's UTF-8 with BOM. If it starts with FF FE, it's UTF-16 LE. If there's no BOM, keep going.
Step 2: Try strict UTF-8 decode. Read the first 64KB and attempt to decode as UTF-8 with errors='strict'. If it succeeds with no errors, it's very likely UTF-8.
Step 3: Look for the þ character specifically. Search for byte FE (CP1252 thorn) vs byte sequence C3 BE (UTF-8 thorn). If you find FE appearing at regular intervals (field boundaries), it's almost certainly a CP1252 file using þ as delimiter.
Step 4: Fall back to chardet. Statistical detection as a last resort.
In Python:
```python
import chardet  # third-party: pip install chardet

def detect_dat_encoding(filepath):
    with open(filepath, 'rb') as f:
        raw = f.read(65536)

    # Step 1: BOM
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    if raw.startswith(b'\xff\xfe'):
        return 'utf-16-le'

    # Step 2: try a strict UTF-8 decode
    try:
        raw.decode('utf-8', errors='strict')
        return 'utf-8'
    except UnicodeDecodeError:
        pass

    # Step 3: look for the CP1252 thorn as a delimiter.
    # In CP1252, þ is 0xFE. If it appears frequently,
    # it's likely acting as a field delimiter.
    if raw.count(b'\xfe') > 5:
        return 'cp1252'

    # Step 4: statistical fallback
    return chardet.detect(raw)['encoding']
```
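Step 3 above only counts 0xFE bytes. A stricter version of the same idea checks that the count is consistent per line, since real delimiters appear in a regular pattern while stray bytes don't. A sketch of that refinement (the helper name is mine, not a standard API):

```python
def fe_looks_like_delimiter(raw: bytes) -> bool:
    """Heuristic: treat 0xFE as a delimiter only if every non-empty
    line contains the same even, nonzero number of occurrences."""
    counts = [line.count(b"\xfe") for line in raw.splitlines() if line]
    return (bool(counts) and counts[0] > 0 and counts[0] % 2 == 0
            and all(c == counts[0] for c in counts))

# Four identical rows, each with four thorns -> regular pattern.
dat = "\u00feDOC-001\u00fe\u00feSmith, John\u00fe\n".encode("cp1252") * 4
# Prose with a single stray 0xFE byte -> irregular.
text = b"one stray \xfe byte\nanother line\n"

print(fe_looks_like_delimiter(dat))   # True
print(fe_looks_like_delimiter(text))  # False
```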
## How to convert safely
Once you know the encoding, converting to UTF-8 is straightforward — but you need to verify the delimiters survived:
```python
def convert_dat_to_utf8(input_path, output_path, source_encoding='cp1252'):
    # newline='' disables newline translation, so CRLF line endings
    # pass through unchanged.
    with open(input_path, 'r', encoding=source_encoding, newline='') as f:
        content = f.read()

    # Verify the thorn delimiter is present after decoding
    if 'þ' not in content:
        raise ValueError(
            f"No þ delimiter found after decoding as {source_encoding}. "
            f"Wrong encoding?"
        )

    with open(output_path, 'w', encoding='utf-8', newline='') as f:
        f.write(content)

    # Sanity check: count delimiters before and after.
    # Note: this byte-level check assumes a CP1252 source; for other
    # source encodings, count þ in the decoded text instead.
    with open(input_path, 'rb') as f:
        original_count = f.read().count(b'\xfe')       # CP1252 thorn
    with open(output_path, 'rb') as f:
        converted_count = f.read().count(b'\xc3\xbe')  # UTF-8 thorn

    if original_count != converted_count:
        raise ValueError(
            f"Delimiter count mismatch: {original_count} in original, "
            f"{converted_count} in converted. Possible data corruption."
        )
```
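The invariant behind that sanity check can be demonstrated without touching disk: a clean CP1252-to-UTF-8 re-encode preserves the thorn count exactly, just with a different byte pattern.

```python
# Three CP1252 rows, four thorns each -> 12 single-byte delimiters.
source = "\u00feDOC-001\u00fe\u00feSmith, John\u00fe\n".encode("cp1252") * 3
converted = source.decode("cp1252").encode("utf-8")

print(source.count(b"\xfe"))         # 12
print(converted.count(b"\xc3\xbe"))  # 12
assert source.count(b"\xfe") == converted.count(b"\xc3\xbe")
```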
The key insight: always verify delimiter count before and after conversion. If the counts don't match, something went wrong.
## The mixed encoding nightmare
The worst case is a file with mixed encodings — some lines in CP1252, others in UTF-8. This happens when:
- A DAT file is assembled from multiple vendor productions
- Someone edited specific lines in a different tool
- A processing pipeline partially re-encoded the file
To detect this, you need to check each line independently:
```python
def find_mixed_encoding_lines(filepath):
    """Try UTF-8 first for each line. If it fails, that line is
    likely CP1252 (or another single-byte encoding). Note: we
    don't try a cp1252 decode as a positive signal, because CP1252
    accepts almost every byte sequence — a successful CP1252
    decode proves nothing."""
    fallback_lines = []
    with open(filepath, 'rb') as f:
        for line_num, line in enumerate(f, 1):
            try:
                line.decode('utf-8', errors='strict')
            except UnicodeDecodeError:
                fallback_lines.append(line_num)

    if fallback_lines:
        print(f"Mixed encoding: {len(fallback_lines)} non-UTF-8 line(s)")
        for line_num in fallback_lines[:10]:
            print(f"  Line {line_num}: likely CP1252")
    else:
        print("All lines decode as valid UTF-8")
```
## The tool I built to handle this
After dealing with these issues repeatedly, I built an open-source CLI tool called loadfile that automates all of the above:
```bash
pip install loadfile

# Detect encoding
loadfile detect production.dat

# Find mixed encoding lines
loadfile scan-encoding production.dat

# Convert encoding
loadfile convert production.dat --to utf-8

# Validate the file after conversion
loadfile validate production.dat
```
It handles 9 DAT dialect variants, auto-detects delimiters, and does line-level mixed encoding detection. It's free and MIT-licensed.
If you work with eDiscovery load files and want to try it, I'm looking for feedback — especially edge cases from real production data that I haven't seen yet. Open an issue on GitHub if you find something that breaks.
## Summary

- The þ delimiter in Concordance DAT files is encoded as 1 byte (FE) in CP1252 and 2 bytes (C3 BE) in UTF-8
- If your parser assumes the wrong encoding, field boundaries break silently
- DAT files have no encoding declaration, so you must detect it
- Always verify delimiter counts before and after encoding conversion
- Mixed encoding files (some lines CP1252, some UTF-8) exist in the wild and require line-by-line detection