Jeffrey

Posted on Apr 3

Make Your Own Dictionary Suite

#bash #docker #dictionary

Background

Nowadays, most people look up information on the internet, since mobile network coverage keeps improving. To avoid ads and minimize data privacy issues, I normally avoid installing mobile apps on my phone and do most tasks in a browser. If I really need immediate access to a website, I create a shortcut on my home screen or use a PWA.

Having stayed in the UK for a few months, I found that my mobile network connection was very unstable, especially indoors. Therefore I started to look for offline dictionary apps. One disadvantage of typical dictionary apps is that each app only contains entries from a single source. To incorporate more sources, I have to install one app per source. For instance, if I want to have the Oxford, Longman, and Cambridge dictionaries on my phone, I need to install three separate apps, which clutters storage and makes switching between them tedious. This can be a huge inconvenience for language learners.

Solution

As I did more research, it turned out that:

dictionary data and app can be separated
there are several common, well-established open dictionary file formats
open-source conversion tools are available

With these criteria satisfied, essentially any dictionary can be ported into one single app as long as the format is recognizable to the app. Thanks to the open-source community, many people have shared their high-quality dictionary files so we do not need to create our own from scratch. Most of the time we only need to convert the files into a format recognizable to the app.

I have chosen Aard2, an open-source and minimalistic app, as my dictionary app. It accepts ".slob" files. There are many other formats, such as stardict and mdict. In the following sections, I will show you my conversion procedure, which should also apply to other formats. Once the dictionary files are ready, they can be imported into the app, becoming your own dictionary suite.

Setup

Here are a few tools I often use to perform the conversion:

Pyglossary
Docker
Vim
Bash

There might be more but the core tool to do the conversion is pyglossary, which acts like ffmpeg in video format conversion, or pandoc in document format conversion. It supports most of the dictionary formats.

For some reason, I had difficulty running pyglossary directly on my MacBook, so I run it in Docker instead. The Dockerfile is available in the GitHub repository.

To build the docker image,

git clone https://github.com/ilius/pyglossary.git
cd pyglossary
docker build -t pyglossary .

Once the image is built, navigate to the directory containing your dictionary files, then

docker run -it --rm -v "$PWD":/dict -w /dict pyglossary

You will be brought to an interactive environment that guides you through the conversion step by step. The dictionary files are located in "/dict/" because of the "-v "$PWD":/dict" flag, which makes the current directory on the host machine available as /dict inside the container. "-w /dict" replaces the default working directory "/" with "/dict" so you can specify file locations without the hassle of writing a long file path.

Example

The previous section covers the basic conversion workflow, which works in most cases. Unfortunately, you might run into tricky situations during the process. Here I show how I resolved the issues that came up when converting the Merriam-Webster Collegiate Dictionary, which I found on a forum in mdict format, into slob.

An mdict dictionary is composed of two files, an mdx and an mdd. The mdx contains all the entries, and the mdd contains the corresponding media files.

If I input the mdx file into pyglossary, it will automatically find the corresponding mdd file with the same file name under the same directory, as shown in the message:

If I open the dictionary, this is what I will get:

It looks fairly good but there are two problems:

nothing happens when I click the speaker icon
dictionary name is messed up.

My solution is that instead of converting mdict directly into slob, we can first convert it into csv so the content can be read as plain text, then convert it back to slob after making modifications to it.

Since csv can only show text content, an additional directory will be created to store the media files during conversion.

Merriam-Webster_Collegiate_Dictionary_11th.csv_res
Merriam-Webster_Collegiate_Dictionary_11th.csv

Before doing anything, let's take a look at the first 15 lines of the csv file by

head -n 15  ./Merriam-Webster_Collegiate_Dictionary_11th.csv | nl

The "nl" command will print the line number alongside the content.

   1    "#name","Merriam-Webster&apos;s Collegiate Dictionary"
   2    "#description","<table width=""100%"" border=""0"" bgcolor=#80354a><tr><td align=""center""><IMG src=""cover.png""></td></tr></table>
   3    <table width=""100%"" border=""0"" bgcolor=#1e3c72><tr><td align=""center""><font  color=white><b>Merriam-Webster&apos;s Collegiate Dictionary</b></font><br><font  color=#e5e5e5>11th Editon</font><br><font  color=#fcc046><b><i>New Ways to Find the Words You Need Today<i/></b></font></td></tr></table>
   4    <br><b>Number of Entries: </b>119,775/ 97,814(mdd)
   5    <br><b>Features:</b>
   6    <br>·225,000 clear and precise definitions
   7    <br>·More than 40,000 word-use examples
   8    <br>·More than 7500 phrases and idions
   9    <br>·A comprehensive coverage of all fields of knowledge
  10    <br>·165,000 entries with correct spellings and pronunciations
  11    <br>·More than 700 illustrations, tables and diagrams for at-a-glance information
  12    <br><br><b>Data from ABBYY Lingvo .dsl source files; Last converted by Hugh for Mdict: </b>2013 05 25"
  13    "1080","<font style=""font-weight:bold;"">1080</font><br><a href=""sound://10800001.spx""><img src=""Sound.png"" border=""0""></a> <b><font color=#CA0000>noun</font></b><br><i>also</i> <b>ten-eighty</b> \\(ˌ)ten-ˈā-tē\\<div style='display:block;background-color:#f6f0e6;'><span style=""color: #585858; background-color: #E6E6E6;font-weight:bold;"">&nbsp;ETYMOLOGY </SPAN>&nbsp;from its laboratory serial number</font></div><div style='display:block;background-color:#f6f0e6;'><span style=""color: #585858; background-color: #E6E6E6;font-weight:bold;"">&nbsp;DATE </SPAN>&nbsp;1945</font></div><b>:</b> a poisonous preparation of sodium fluoroacetate used as a rodenticide and pesticide"
  14    "12-step","<font style=""font-weight:bold;"">12-step</font><br><a href=""sound://12ste01v.spx""><img src=""Sound.png"" border=""0""></a> \\ˈtwelv-ˌstep\\ <b><font color=#CA0000>adjective</font></b><div style='display:block;background-color:#f6f0e6;'><span style=""color: #585858; background-color: #E6E6E6;font-weight:bold;"">&nbsp;DATE </SPAN>&nbsp;1983</font></div><b>:</b> of, relating to, characteristic of, or being a program that is designed especially to help an individual overcome an addiction, compulsion, serious shortcoming, or traumatic experience by adherence to 12 tenets emphasizing personal growth and dependence on a higher spiritual being"
  15    "18-wheeler","<font style=""font-weight:bold;"">18-wheel·er</font><br><a href=""sound://18_whe01.spx""><img src=""Sound.png"" border=""0""></a> <b><font color=#CA0000>noun</font></b><br><i>or</i> <b>eigh·teen-wheel·er</b> \\ˌā(t)-(ˌ)tēn-ˈwē-lər\\<div style='display:block;background-color:#f6f0e6;'><span style=""color: #585858; background-color: #E6E6E6;font-weight:bold;"">&nbsp;DATE </SPAN>&nbsp;1976</font></div><b>:</b> a trucking rig consisting of a tractor and a trailer and typically having eighteen wheels"

We can see that the first 12 lines contain the dictionary's metadata, including the incorrect dictionary name. For each of the remaining lines, the first column holds the entry, and the second column holds the html with inline css. Therefore, we can think of a dictionary reader as a program that decompresses the dictionary files and then displays the html content of each entry like a browser.

In Merriam-Webster_Collegiate_Dictionary_11th.csv_res, we can run this command to see what kind of media files there are.

ls -1 | awk -F . '!a[$NF]++{print $NF}'

It produces the following output:

spx
jpg
png
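If you also want to know how many files of each type there are, and so how much audio needs converting, a small variation of the command counts them. This is only a sketch; count_by_extension is a hypothetical helper name, not part of any tool mentioned here.

```shell
# Count the files in a directory by extension, most common first.
# count_by_extension is a hypothetical helper name used for illustration.
count_by_extension() {
    ls -1 "${1:-.}" | awk -F . '{n[$NF]++} END {for (e in n) print n[e], e}' | sort -rn
}
```

Run it inside the media directory as "count_by_extension ." to get one "count extension" pair per line.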

Since "spx" (Speex audio) is a format not supported by browsers, the spx files have to be converted into a format like mp3 or ogg, which can be done with ffmpeg. To speed up the process, we can use xargs to run the conversions in parallel.

# Convert every .spx to .ogg in parallel; file names are passed as arguments, so spaces are safe.
find . -name '*.spx' -print0 | xargs -0 -P 0 -I {} sh -c 'ffmpeg -y -hide_banner -loglevel error -i "$1" "${1%.spx}.ogg"' _ {}

rm *.spx
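Deleting all the sources at once is risky if ffmpeg failed on some of them. As a more cautious alternative to the plain rm, here is a minimal sketch that removes an spx file only when a non-empty ogg with the same base name exists next to it. The function name remove_converted_spx is hypothetical.

```shell
# Delete each .spx only if a non-empty .ogg with the same base name exists.
# remove_converted_spx is a hypothetical helper name, not part of any tool.
remove_converted_spx() {
    dir="${1:-.}"
    find "$dir" -name '*.spx' | while IFS= read -r src; do
        ogg="${src%.spx}.ogg"
        if [ -s "$ogg" ]; then
            rm -- "$src"
        else
            # Keep the source and report it so the conversion can be retried.
            echo "missing or empty: $ogg" >&2
        fi
    done
}
```

Calling "remove_converted_spx ." in the media directory leaves behind exactly the files that still need attention.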

Once the conversion is complete, we update the file paths in the <a> tags of the html using sed.

sed 's/\.spx/.ogg/g' ./Merriam-Webster_Collegiate_Dictionary_11th.csv > ./temp.csv

mv temp.csv ./Merriam-Webster_Collegiate_Dictionary_11th.csv
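It is worth checking that every sound link in the csv now points to a file that actually exists. The following is a sketch under the assumption that sound links always look like sound://NAME, as in the excerpt above; list_missing_sounds is a hypothetical helper name.

```shell
# Print every sound file referenced in the csv that is absent from the media directory.
# list_missing_sounds is a hypothetical helper name used for illustration.
list_missing_sounds() {
    csv="$1"
    res_dir="$2"
    # Links look like href=""sound://NAME"", so the name ends at the next double quote.
    grep -o 'sound://[^"]*' "$csv" | sed 's|^sound://||' | sort -u |
    while IFS= read -r name; do
        [ -e "$res_dir/$name" ] || echo "$name"
    done
}
```

For this dictionary the call would be "list_missing_sounds ./Merriam-Webster_Collegiate_Dictionary_11th.csv ./Merriam-Webster_Collegiate_Dictionary_11th.csv_res". Empty output means every referenced file is present.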

To correct the dictionary name, simply change the second column of the first line to whatever you want. You might wonder why I edit the csv exclusively with command-line tools rather than Excel or a text editor. The reason is that a dictionary typically contains a huge number of entries. For some large csv files, even vim takes several seconds to load the file, not to mention Excel, which might crash immediately.
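In keeping with the command-line approach, the name on the first line can be replaced with sed. This is a minimal sketch, assuming the "#name" record fits on a single line as in the excerpt above; set_dict_name is a hypothetical helper name.

```shell
# Replace the value of the "#name" record, which sits on line 1 of the csv.
# set_dict_name is a hypothetical helper; the new name must not contain
# the characters | & \ or ", because it is pasted into a sed replacement.
set_dict_name() {
    csv="$1"
    new_name="$2"
    sed "1s|^\"#name\",.*\$|\"#name\",\"$new_name\"|" "$csv" > "$csv.tmp" &&
        mv "$csv.tmp" "$csv"
}
```

For example, "set_dict_name ./Merriam-Webster_Collegiate_Dictionary_11th.csv "Merriam-Webster's Collegiate Dictionary"" rewrites only the first line and leaves the entries untouched.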

The final step is to convert the csv back to slob so that Aard2 can read it. Make sure the directory containing the media files has the same name as the csv file with "_res" appended, as in the two names listed earlier. Pyglossary only links the media files when the names match.
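A quick check of this naming convention before starting the conversion can save a wasted run. The sketch below only tests that both paths exist; check_res_dir is a hypothetical helper name.

```shell
# Succeed only if the csv file and its media directory "<csv>_res" both exist.
# check_res_dir is a hypothetical helper name used for illustration.
check_res_dir() {
    csv="$1"
    [ -f "$csv" ] && [ -d "${csv}_res" ]
}
```

It can be used as "check_res_dir ./Merriam-Webster_Collegiate_Dictionary_11th.csv && echo ready".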

One more thing I would like to add is that in practice I always convert dictionary files into csv first rather than converting them directly into the final format I want. Since html is embedded inside the dictionary files, it is also possible for someone to inject javascript into them. Usually the javascript is only used for controlling the interface, but I once encountered a case where the javascript connected to some domain to fetch resources for the dictionary. This led to two problems:

the javascript code could be malicious
the dictionary was no longer offline

Therefore, I suggest inspecting the csv to ensure there is nothing suspicious inside the files. To err on the side of caution, you might also want to block internet access for the dictionary app.
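One way to start such an inspection is to list script tags, a few inline event handlers, and external URLs found in the csv. This is a rough sketch, not a complete audit: the patterns are my own assumptions about what is worth flagging, and scan_suspicious is a hypothetical helper name.

```shell
# List distinct <script> tags, a few inline event handlers and http(s) URLs
# found in the csv, with the number of occurrences of each, most frequent first.
# scan_suspicious is a hypothetical helper; the patterns are a starting point only.
scan_suspicious() {
    grep -Eio '<script[^>]*>|on(click|load|error)=|https?://[^" <>]+' "$1" |
        sort | uniq -c | sort -rn
}
```

Empty output from "scan_suspicious ./Merriam-Webster_Collegiate_Dictionary_11th.csv" means none of these patterns matched. Anything it prints deserves a closer look in the csv itself.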

Conclusion

Given the good network coverage in modern cities, most of the time we do not need offline tools. Despite the convenience, I think we should stay alert to over-reliance on the internet and be aware of data privacy. Even today, an offline app still has practical value.

In this article, I have shared some of my experience dealing with dictionary files. It looks a little complicated, but the fundamental idea is simple. There are many cases and techniques I did not cover here, but I hope the example helps illustrate my workflow. If you have better approaches or any thoughts about offline tools, please feel free to share your experience.
