You can check the latest updated version of this article at lobotuerto's notes - Elixir and my ISO-8859-1 character encoding problem.
Being from México, I have been wrestling with character encoding issues for a long time, in several languages…
Now, it’s Elixir’s time.
The problem
While working my way through The Little Elixir & OTP Guidebook (a highly recommended one), I got stuck at the ID3 parser example program:
defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        # The ID3v1 tag occupies the last 128 bytes of the file.
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        # Fixed-width fields follow the "TAG" marker.
        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end
Using Clementine, I edited the ID3 tags for a file named some-song.mp3 and put Éso as its title.
I wanted to know if the program would handle those just fine. It did not.
It was all right when the ID3 tags contained only valid ASCII characters, but as soon as I put an accented character in the title, artist, or album, I got an error like this:
iex(1)> ID3Parser.parse "some-song.mp3"
** (ArgumentError) argument error
(stdlib) :io.put_chars(:standard_io, :unicode, [
<<89, 111, 112, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 32, 45, 32, 201, 115, 111,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, 10])
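The byte list in that error hints at the cause: 201, 115, 111 is Éso in ISO-8859-1, but a lone 201 byte is not valid UTF-8, and the :unicode argument in the trace suggests that is what IO.puts expects. Here is a minimal sketch to check this in isolation, with the bytes copied from the error output (the variable name is only for illustration):

```elixir
# Bytes taken from the error output: 201, 115, 111 is "Éso" in ISO-8859-1.
latin1_title = <<201, 115, 111>>

# 201 (0xC9) announces a two-byte UTF-8 sequence, but the next byte (115)
# is not a continuation byte, so this binary is not valid UTF-8.
IO.inspect(String.valid?(latin1_title))            # false

# In UTF-8, É (U+00C9) is encoded as the two bytes 195, 137.
IO.inspect(String.valid?(<<195, 137, 115, 111>>))  # true
IO.inspect("\u00C9so" == <<195, 137, 115, 111>>)   # true
```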
The solution
After some research here and there, and then some error reporting, I found out that ID3v1 tags (the ones the program is trying to parse) should, in theory, be encoded as ISO-8859-1, also known as Latin 1.
What I needed was a way to convert those bytes from ISO-8859-1 (Latin 1) to UTF-8 (Unicode), and give IO.puts something it could print without problems.
I found exactly that in this Erlang facility:
:unicode.characters_to_binary(your_string, :latin1)
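To see the conversion on its own, here is a small sketch that feeds the Latin-1 bytes for Éso (taken from the error output earlier) through that function. The variable names are illustrative and not part of the original program.

```elixir
# "Éso" as ISO-8859-1 bytes, as seen in the error output.
latin1_title = <<201, 115, 111>>

# Each Latin-1 byte maps to the Unicode code point with the same number,
# so any binary can be converted; the result is a UTF-8 encoded binary.
utf8_title = :unicode.characters_to_binary(latin1_title, :latin1)

IO.inspect(utf8_title, binaries: :as_binaries)  # <<195, 137, 115, 111>>
IO.puts(utf8_title)                             # Éso
```

The single byte 201 becomes the two bytes 195, 137, which is why the converted string prints without problems.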
This is the final program, which correctly parses ID3v1 tags in their expected encoding. Careful: the encoding is expected, but in no way guaranteed:
defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        # The ID3v1 tag occupies the last 128 bytes of the file.
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        # Convert every field from Latin 1 to UTF-8 before printing.
        to_convert = [title, artist, album, year]
        [title, artist, album, year] =
          Enum.map(to_convert, fn tag -> from_latin1(tag) end)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  # Treats the input as ISO-8859-1 bytes and returns a UTF-8 binary.
  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
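One detail the program leaves alone: the fields are fixed-width, and the error output earlier shows the unused bytes filled with zeros, so the converted strings still carry trailing null bytes. A possible follow-up, sketched here as a hypothetical `ID3Field.clean/1` helper that is not part of the original program, converts a field and then strips that padding. Trimming trailing spaces as well is an assumption, in case a tagger pads with spaces instead.

```elixir
defmodule ID3Field do
  # Hypothetical helper, not part of the original program.
  # Converts a fixed-width ID3v1 field from Latin 1 to UTF-8 and
  # removes trailing padding (null bytes first, then whitespace).
  def clean(field) when is_binary(field) do
    field
    |> :unicode.characters_to_binary(:latin1)
    |> String.trim_trailing(<<0>>)
    |> String.trim_trailing()
  end
end

# A 30-byte title field: "Éso" in Latin 1 followed by 27 null bytes.
padded_title = <<201, 115, 111>> <> :binary.copy(<<0>>, 27)

IO.inspect(byte_size(padded_title))       # 30
IO.inspect(ID3Field.clean(padded_title))  # "Éso"
```

In the program above, this would take the place of from_latin1 inside the Enum.map call.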
Hopefully this will help someone else in the same predicament.
Top comments (2)
I had similar problems recently, as I'm from Brazil. I used Codepagex for the conversion and it went pretty smoothly.
Thanks for the recommendation!