<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: frisk55frisk</title>
    <description>The latest articles on DEV Community by frisk55frisk (@frisk55frisk).</description>
    <link>https://dev.to/frisk55frisk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3853527%2F2bef45c8-9e5a-474f-bbce-ebebdc6a5f4f.jpg</url>
      <title>DEV Community: frisk55frisk</title>
      <link>https://dev.to/frisk55frisk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frisk55frisk"/>
    <language>en</language>
    <item>
      <title>Why Your Concordance DAT File Won't Parse: The þ Delimiter Encoding Trap</title>
      <dc:creator>frisk55frisk</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:26:40 +0000</pubDate>
      <link>https://dev.to/frisk55frisk/why-your-concordance-dat-file-wont-parse-the-th-delimiter-encoding-trap-20mj</link>
      <guid>https://dev.to/frisk55frisk/why-your-concordance-dat-file-wont-parse-the-th-delimiter-encoding-trap-20mj</guid>
      <description>&lt;p&gt;If you work with eDiscovery load files, you've probably hit this: you receive a Concordance DAT file, try to import it into your review platform, and it fails silently or produces garbled data. No useful error message. Just broken records.&lt;/p&gt;

&lt;p&gt;There's a good chance the problem is the &lt;code&gt;þ&lt;/code&gt; character.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Concordance DAT files special
&lt;/h2&gt;

&lt;p&gt;Concordance DAT is the most common load file format in eDiscovery. It uses two unusual delimiter characters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Field separator&lt;/strong&gt;: &lt;code&gt;þ&lt;/code&gt; (thorn, Unicode U+00FE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quote character&lt;/strong&gt;: &lt;code&gt;®&lt;/code&gt; (registered sign, Unicode U+00AE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These were chosen decades ago specifically because they almost never appear in actual document metadata. Smart choice — until encoding enters the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trap: þ has two different byte representations
&lt;/h2&gt;

&lt;p&gt;Here's where things break. The &lt;code&gt;þ&lt;/code&gt; character is encoded differently depending on whether the file is CP1252 (Windows-1252) or UTF-8:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Encoding&lt;/th&gt;
&lt;th&gt;þ bytes&lt;/th&gt;
&lt;th&gt;® bytes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CP1252&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UTF-8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C3 BE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C2 AE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CP1252 uses a single byte. UTF-8 uses two bytes. If your parser assumes the wrong encoding, it will look for the wrong byte sequence and either miss the delimiters entirely or split fields in the wrong place.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Say you have a simple DAT file with three fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;þDOCIDþþBEGBATESþþCUSTODIANþ
þDOC-001þþABC-000001þþSmith, Johnþ
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this file was saved in CP1252, the hex for the first delimiter looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nasm"&gt;&lt;code&gt;&lt;span class="nf"&gt;FE&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nv"&gt;F&lt;/span&gt; &lt;span class="mi"&gt;43&lt;/span&gt; &lt;span class="mi"&gt;49&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt; &lt;span class="nv"&gt;FE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But if you read it as UTF-8, your parser sees &lt;code&gt;FE&lt;/code&gt; and thinks: "That's not a valid UTF-8 start byte." Depending on the tool, it will either throw a &lt;code&gt;UnicodeDecodeError&lt;/code&gt;, replace the character with &lt;code&gt;�&lt;/code&gt;, or silently skip it. In any case, your field boundaries are gone.&lt;/p&gt;

&lt;p&gt;The reverse is equally broken. A UTF-8 file has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nasm"&gt;&lt;code&gt;&lt;span class="nf"&gt;C3&lt;/span&gt; &lt;span class="nv"&gt;BE&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="nv"&gt;F&lt;/span&gt; &lt;span class="mi"&gt;43&lt;/span&gt; &lt;span class="mi"&gt;49&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt; &lt;span class="nv"&gt;C3&lt;/span&gt; &lt;span class="nv"&gt;BE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you read this as CP1252, &lt;code&gt;C3&lt;/code&gt; becomes &lt;code&gt;Ã&lt;/code&gt; and &lt;code&gt;BE&lt;/code&gt; becomes &lt;code&gt;¾&lt;/code&gt;. Your parser sees &lt;code&gt;Ã¾&lt;/code&gt; instead of &lt;code&gt;þ&lt;/code&gt; and fails to find any delimiters at all. The entire file becomes one giant unparseable line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens so often
&lt;/h2&gt;

&lt;p&gt;The root cause is that DAT files have no encoding declaration. There's no BOM requirement, no header, no metadata. You receive a &lt;code&gt;.dat&lt;/code&gt; file and you have to guess.&lt;/p&gt;

&lt;p&gt;Older tools (Concordance itself, many vendor export pipelines) default to CP1252 because they were built on Windows. Newer cloud platforms (Relativity, Everlaw) increasingly expect UTF-8. When files pass through multiple systems — produced on one platform, QC'd on another, loaded into a third — the encoding can get silently changed or corrupted at any step.&lt;/p&gt;

&lt;p&gt;The most dangerous scenario: someone opens a CP1252 DAT file in a text editor that assumes UTF-8, makes a small edit, and saves. Now you have a file where most lines are CP1252 but a few lines got re-encoded to UTF-8. Mixed encoding in a single file. Good luck parsing that.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to detect the encoding
&lt;/h2&gt;

&lt;p&gt;A reliable detection strategy uses multiple signals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Check for BOM.&lt;/strong&gt; If the file starts with &lt;code&gt;EF BB BF&lt;/code&gt;, it's UTF-8 with BOM. If it starts with &lt;code&gt;FF FE&lt;/code&gt;, it's UTF-16 LE. If there's no BOM, keep going.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Try strict UTF-8 decode.&lt;/strong&gt; Read the first 64KB and attempt to decode as UTF-8 with &lt;code&gt;errors='strict'&lt;/code&gt;. If it succeeds with no errors, it's very likely UTF-8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Look for the þ character specifically.&lt;/strong&gt; Search for byte &lt;code&gt;FE&lt;/code&gt; (CP1252 thorn) vs byte sequence &lt;code&gt;C3 BE&lt;/code&gt; (UTF-8 thorn). If you find &lt;code&gt;FE&lt;/code&gt; appearing at regular intervals (field boundaries), it's almost certainly a CP1252 file using &lt;code&gt;þ&lt;/code&gt; as delimiter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Fall back to chardet.&lt;/strong&gt; Statistical detection as a last resort.&lt;/p&gt;

&lt;p&gt;In Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_dat_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 1: BOM
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\xef\xbb\xbf&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8-sig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\xff\xfe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-16-le&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 2: Try UTF-8
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;strict&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;UnicodeDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Look for CP1252 thorn as delimiter
&lt;/span&gt;    &lt;span class="c1"&gt;# In CP1252, þ is 0xFE. If it appears frequently
&lt;/span&gt;    &lt;span class="c1"&gt;# and at consistent intervals, it's likely a delimiter.
&lt;/span&gt;    &lt;span class="n"&gt;fe_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\xfe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fe_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cp1252&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: chardet fallback
&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chardet&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chardet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;encoding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  How to convert safely
&lt;/h2&gt;

&lt;p&gt;Once you know the encoding, converting to UTF-8 is straightforward — but you need to verify the delimiters survived:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_dat_to_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cp1252&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;source_encoding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Verify the thorn delimiter is present after decoding
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;þ&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No þ delimiter found after decoding as &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_encoding&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wrong encoding?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sanity check: count delimiters before and after
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;original_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;converted_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;original_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\xfe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# CP1252 thorn
&lt;/span&gt;    &lt;span class="n"&gt;converted_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converted_raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\xc3\xbe&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# UTF-8 thorn
&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;original_count&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;converted_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delimiter count mismatch: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in original, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;converted_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in converted. Possible data corruption.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;always verify delimiter count before and after conversion&lt;/strong&gt;. If the counts don't match, something went wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mixed encoding nightmare
&lt;/h2&gt;

&lt;p&gt;The worst case is a file with mixed encodings — some lines in CP1252, others in UTF-8. This happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A DAT file is assembled from multiple vendor productions&lt;/li&gt;
&lt;li&gt;Someone edited specific lines in a different tool&lt;/li&gt;
&lt;li&gt;A processing pipeline partially re-encoded the file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To detect this, you need to check each line independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_mixed_encoding_lines&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Try UTF-8 first for each line. If it fails, that line is
    likely CP1252 (or another single-byte encoding).  Note: we
    don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t try cp1252 decode as a positive signal because CP1252
    accepts almost every byte sequence — a successful CP1252
    decode proves nothing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;fallback_lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;strict&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;UnicodeDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;fallback_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fallback_lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mixed encoding: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fallback_lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; non-UTF-8 line(s)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line_num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fallback_lines&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Line &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;line_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: likely CP1252&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All lines decode as valid UTF-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The tool I built to handle this
&lt;/h2&gt;

&lt;p&gt;After dealing with these issues repeatedly, I built an open-source CLI tool called &lt;a href="https://github.com/frisk55frisk/loadfile-tool" rel="noopener noreferrer"&gt;loadfile&lt;/a&gt; that automates all of the above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;loadfile

&lt;span class="c"&gt;# Detect encoding&lt;/span&gt;
loadfile detect production.dat

&lt;span class="c"&gt;# Find mixed encoding lines&lt;/span&gt;
loadfile scan-encoding production.dat

&lt;span class="c"&gt;# Convert encoding&lt;/span&gt;
loadfile convert production.dat &lt;span class="nt"&gt;--to&lt;/span&gt; utf-8

&lt;span class="c"&gt;# Validate the file after conversion&lt;/span&gt;
loadfile validate production.dat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It handles 9 DAT dialect variants, auto-detects delimiters, and does line-level mixed encoding detection. It's free and MIT-licensed.&lt;/p&gt;

&lt;p&gt;If you work with eDiscovery load files and want to try it, I'm looking for feedback — especially edge cases from real production data that I haven't seen yet. Open an issue on &lt;a href="https://github.com/frisk55frisk/loadfile-tool" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; if you find something that breaks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;þ&lt;/code&gt; delimiter in Concordance DAT files is encoded as 1 byte (&lt;code&gt;FE&lt;/code&gt;) in CP1252 and 2 bytes (&lt;code&gt;C3 BE&lt;/code&gt;) in UTF-8&lt;/li&gt;
&lt;li&gt;If your parser assumes the wrong encoding, field boundaries break silently&lt;/li&gt;
&lt;li&gt;DAT files have no encoding declaration, so you must detect it&lt;/li&gt;
&lt;li&gt;Always verify delimiter counts before and after encoding conversion&lt;/li&gt;
&lt;li&gt;Mixed encoding files (some lines CP1252, some UTF-8) exist in the wild and require line-by-line detection&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>ediscovery</category>
      <category>encoding</category>
      <category>legaltech</category>
    </item>
  </channel>
</rss>
