
Víctor Adrián

Originally published at lobotuerto.com

Elixir and my ISO-8859-1 character encoding problem


You can find the latest version of this article at lobotuerto's notes: Elixir and my ISO-8859-1 character encoding problem.


Being from México, I have been wrestling with character encoding issues for a long time, in several languages…

Now, it’s Elixir’s time.

The problem

While working my way through The Little Elixir & OTP Guidebook (a highly recommended one), I got stuck at the ID3 parser example program:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end

Using Clementine, I edited the ID3 tags of a file named some-song.mp3 and set its title to Éso.

I wanted to know if the program would handle accented characters just fine. It did not.


Everything was all right when the ID3 tags contained only valid ASCII characters. As soon as I put an accented character in the title, artist or album, I got an error like this:

iex(1)> ID3Parser.parse "some-song.mp3"

** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [
      <<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])
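
The byte list in that error hints at what went wrong. The 201 in the middle is É in Latin 1, but a lone 201 is not valid UTF-8, and UTF-8 is what IO.puts expects. A quick check in plain Elixir, as a small sketch (the variable name is made up for illustration):

```elixir
# The three title bytes from the error output: 201, 115, 111.
latin1_title = <<201, 115, 111>>

# In Latin 1 these bytes spell "Éso". Read as UTF-8 they are invalid:
# 201 (0xC9) announces a two-byte sequence, and 115 is not a continuation byte.
IO.inspect(String.valid?(latin1_title))  # false

# The same word as a regular Elixir (UTF-8) string is valid and takes four bytes.
IO.inspect(String.valid?("Éso"))         # true
IO.inspect(byte_size("Éso"))             # 4
```

So the tag data itself is fine. It is just in a different encoding than the one Elixir strings use.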

The solution

After some research here and there, then some error reporting… I found out that ID3v1 tags —the ones the program is trying to parse— should in theory be encoded as ISO-8859-1, also known as Latin 1.

What I needed was a way to convert those bytes from ISO-8859-1 (Latin 1) to UTF-8 (Unicode), and give IO.puts something it could print without problems.

I found exactly that in this Erlang facility:

:unicode.characters_to_binary(your_string, :latin1)
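
As a quick sanity check, here is a minimal sketch of that call applied to the three title bytes from the error message. The variable names are made up for illustration; the only library function involved is :unicode.characters_to_binary/2 from Erlang's standard library, where the second argument names the input encoding and the result is UTF-8.

```elixir
# "Éso" as it is stored in the ID3v1 tag (Latin 1, one byte per character).
latin1_title = <<201, 115, 111>>

# Re-encode from Latin 1 to UTF-8.
utf8_title = :unicode.characters_to_binary(latin1_title, :latin1)

IO.puts(utf8_title)                             # prints Éso
IO.inspect(utf8_title, binaries: :as_binaries)  # <<195, 137, 115, 111>>
```

The single byte 201 becomes the two-byte sequence 195, 137, which is how UTF-8 encodes É.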

This is the final program that correctly parses ID3v1 tags in their expected encoding —careful, the encoding is expected, but in no way guaranteed:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        to_convert = [title, artist, album, year]
        [title, artist, album, year] =
          Enum.map(to_convert, fn tag -> from_latin1(tag) end)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
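
The byte dump in the error message also shows that each field is padded with zero bytes up to its fixed width, and those bytes are printed along with the text. A possible refinement, sketched here under the assumption that the padding is always trailing zero bytes or spaces, is to cut each field at the first zero byte before converting it. The module and function names below are hypothetical and not part of the book's example.

```elixir
defmodule ID3Field do
  # Hypothetical helper: turns one fixed-width ID3v1 field into a clean UTF-8 string.
  def clean(field) do
    field
    |> strip_padding()
    |> :unicode.characters_to_binary(:latin1)
    |> String.trim()
  end

  # Keep only the bytes that come before the first zero byte.
  # If there is no zero byte, the whole field is kept.
  defp strip_padding(field) do
    field
    |> :binary.split(<<0>>)
    |> hd()
  end
end

# A 30-byte title field: "Éso" in Latin 1 followed by 27 zero bytes.
padded_title = <<201, 115, 111>> <> :binary.copy(<<0>>, 27)

IO.inspect(ID3Field.clean(padded_title))  # "Éso"
```

In the parser, a function like this could take the place of from_latin1/1 inside the Enum.map call.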

Hopefully this will help someone else in the same predicament.


Top comments (2)

Rodrigo Nonose

I had similar problems recently, as I'm from Brazil. I used Codepagex for the conversion and it went pretty smoothly.

Víctor Adrián

Thanks for the recommendation!