Noel Worden

Identifying and Removing Hidden Characters

I came across an interesting problem this week. There was a hidden character in a CSV preventing me from successfully performing a batch data insert in the Elixir/Phoenix project I've been working on.

The TL;DR: You can use the tool xxd to help identify hidden characters. You can then use the pbcopy and pbpaste shell commands to eliminate that hidden character.

As I said, I was trying to ingest a CSV and batch insert that block of data into a table. You could say CSV imports and exports are the backbone of this project, so we have some pretty well-established patterns for them. The brunt of the import work is handled by the NimbleCSV library for Elixir. When it's configured properly, NimbleCSV parses the CSV, and you can then pipe that into an Enum.each and essentially insert each line of the CSV as its own full changeset. The whole function looks like this:

["path", "to", "file"]
|> CSVUtil.data_from_csv()
|> Enum.each(fn attrs ->
%Product{}
|> Product.changeset(attrs)
|> Repo.insert!()
end)
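
The post doesn't show CSVUtil.data_from_csv/1 itself, but here is a minimal sketch of what such a helper might look like, assuming NimbleCSV's built-in RFC 4180 parser and that the first row of the file holds the headers (the real implementation may differ):

# Hypothetical sketch only; the project's actual helper isn't shown in the post.
defmodule CSVUtil do
  alias NimbleCSV.RFC4180, as: CSVParser

  def data_from_csv(path_segments) do
    [headers | rows] =
      path_segments
      |> Path.join()
      |> File.read!()
      |> CSVParser.parse_string(skip_headers: false)

    # Zip each row with the header names so every line becomes a params map,
    # e.g. %{"vendor_id" => "the_vendor", "name" => "Discount", ...}
    Enum.map(rows, fn row ->
      headers |> Enum.zip(row) |> Map.new()
    end)
  end
end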

When I tried to execute this data insertion I ran into this error:

** (Ecto.InvalidChangesetError) could not perform insert because changeset is invalid.

Errors

%{
  vendor_id: [{"can't be blank", [validation: :required]}]
}

Applied changes

%{
  expense_account_number: 500000,
  income_account_number: 400000,
  name: "Discount"
}

Params

%{
  "expense_account_number" => "500000",
  "income_account_number" => "400000",
  "name" => "Discount",
  "\uFEFFvendor_id" => "the_vendor"
}

The error stated that I couldn't insert the record because the vendor_id attribute can't be blank. And vendor_id was blank because the key extracted from the CSV was coming in with extra characters prepended to it: \uFEFFvendor_id. The keys of the params map are defined by the headers of the CSV, which meant that at some point those characters had been added to the vendor_id column title. It just so happened that vendor_id was the first column and header of the file, so if something like this was going to happen, I'm not surprised it was happening there.

The first thing I did was look at the CSV in VSCode, and it looked fine there. Then I opened it in Excel, but again, it looked perfectly fine. I figured I should find out exactly what these extra characters were, since this was obviously not going to be a simple fix. A detailed answer can be found here, but it's basically a Unicode Byte Order Mark (BOM), which sits at the beginning of a file and tells programs how to interpret the encoding of the rest of the data. That explains why I wasn't seeing it as an extra character in any of the software I was using, but it was also a bit of a head scratcher: how could I remove a Unicode character if I couldn't see it in any GUI I was using?
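
You can also see the BOM for yourself in an IEx session; this is just an illustration of what the \uFEFF in the error output actually is (the three bytes 239, 187, 191 are 0xEF, 0xBB, 0xBF, the UTF-8 encoding of U+FEFF):

iex> :binary.bin_to_list("\uFEFF")
[239, 187, 191]
iex> "\uFEFFvendor_id" == "vendor_id"
false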

Luckily I have awesome colleagues, and one suggested some tools via the terminal. We started with xxd, which gives a hexdump of a given file. I executed this command:

head -n 4 ~/downloads/data_dump.csv | xxd

And got a rather large output, but this was the first line:

00000000: efbb bf76 656e 646f 725f 6964 2c6e 616d ...vendor_id,nam
  • side note: You can execute xxd more simply, like this:

    xxd ~/downloads/data_dump.csv

    But that outputs the hexdump of the entire file. By using head -n 4 and piping its output into xxd, I was only dumping the first 4 lines of the file.

Finally, I was able to see it! It showed up as the bytes ef bb bf at the start of the hexdump and as ... in the readout to the right, the character that wasn't showing in any GUI I had been using. Now, how to remove it? It turns out the macOS shell comes with the pbcopy/pbpaste commands built in. In addition to being useful copy and paste tools on the command line, they also strip out this type of invisible character. Unfortunately, I can't find the search result where we realized this, but I know it worked in my case. These are the two commands I executed:

pbcopy < path/to/data_dump.csv
pbpaste > path/to/data_dump.csv

Then I ran the file through xxd again to confirm, and this was the first line of the output:

00000000: 7665 6e64 6f72 5f69 642c 6e61 6d65 2c69 vendor_id,name,i

The Unicode BOM was indeed removed. I took the updated CSV, ran it through the batch insertion function again, and got no errors.
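
As one last optional sanity check from the Elixir side, you could also peek at the first few bytes of the cleaned file in IEx (the path here is just a placeholder); before the fix, this same call would have returned the BOM itself:

iex> "path/to/data_dump.csv" |> File.read!() |> binary_part(0, 3)
"ven"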


This post is part of an ongoing This Week I Learned series. I welcome any critique, feedback, or suggestions in the comments.
