Justin Houghton

Posted on Mar 2, 2020 • Edited on Jul 21, 2021

Lost in Machine Translation. A tiny intro into the field of MT.

#machinelearning #python #datascience

Machine translation or MT is the field of study that works on using software to translate text or speech from one language into another.) If you’ve ever tried learning a new language, you know that word for word translation alone can’t provide great translation — you need to comprehend whole phrases to be able to more accurately define a word in context!

There are some that claim that word ambiguity will by definition prevents MT from ever being able to distinguish two meanings of the same word. However researchers are hard at work developing new systems aimed at tackling this problem head on including deep and shallow MT, which will be discussed in a later post! For the purposes of this introduction, we’ll focus on providing a quick overview of 5 subfields of MT that — while falling victim to the ambiguity and context problems MT currently faces — still manages to provide accurate enough translations to revolutionize communication.

As a subfield of computational linguistics and natural language processing, MT is often grouped into 5 distinct areas.

Rule-based
Statistical
Example-based
Hybrid MT
Neural MT

Rule-based MT

Unlike many other MT approaches, rule-based MT uses syntax rules and morphological rules to assist in translation. Rule based systems link an input sentence with the structure of the output sentence which helps to preserve its meaning.

At its simplest, rule-based MT requires 3 distinct things.

A dictionary that can map each word in a given language with its desired output language.
Grammatical and syntactical rules for the input language.
Grammatical and syntactical rules for the desired output language.

Some notable advantages of this type of MT are that no bilingual texts are required, it’s general domain independent and that it allows for total control over debugging and adding new rules.

Statistical

Like the name suggests, statistical ML attempts to generate translations using statistical methods based on bilingual text. Google Translate one of the most well known translation services, switched to statistical translation in 2007 from SYSTRAN, which it had been using in previous years. These types of translations work by detecting patterns in millions of documents that have already been translated by humans and using those patterns to predict based on its findings.

One clear important downside to statistical MT is that it is dependent on huge amounts of already translated text. Although there are newer approaches like METIS 2 that focus on finding patterns in syntactic structure allowing for smaller corpus size, it is safe to assume right now that the more translated texts a particular language has, the more accurate any new translation will be.

Example-based

Not to dissimilar to statistical MT, example-based MT is based on the idea of an analogy and requires the use of already translated texts in it’s corpus.

Given a particular sentence (A) needing translation, different sentences (B and C) are selected from the corpus that have similar sub components. The sub components of B and C are then used to translate the sub components of the A — which when pieced together form a newly translated sentence.

Hybrid MT

Hybrid MT aims to capitalize on the strength of both rule-based and statistical MT. There have been a number of different approaches proposed on how to build a hybrid MT system, but there is generally two distinct ways it used in practice.

Rules post-processed by statistics.

Just like in a rules based translation, in this first example of hybrid MT, translations are performed by a rules based engine. The different is that statistics are then used to post-process the output in an effort to refine the translation output. Here, statistics helps correct a rules based engine output.

Statistics guided by the syntactic and metamorphic rules.

Rules are used in pre-process to help guide the statistical translation, and also used in the post-process to help adjust/refine the translation output. In this example of hybrid MT, rules are used to help correct a statistics based engine during its pre and post processing phase.

Neural MT

Neural machine translation (NMT) is the latest approach to machine translation that uses neural networks to predict the likelihood of a set of words. One huge advantage of NMT is that they require only a fraction of the memory needed by traditional translation techniques (listed above).

While no current system claims to have found the golden ticket of MT systems — in practice MT has proven to be extremely valuable for changing the way that individuals and businesses across the world communicate. Everyone from the US Military to Facebook is working hard to build their own translation systems that will both bring the world closer together and aid in homeland security. With the release of Google’s pixel buds that claim to provide real time translation across countless languages, will business owners or elementary school kids of the future have little need to learn a foreign language?

This ends the first of many small posts aiming to provide useful summaries of machine learning sub fields, concepts, algorithms and workflows for pythonistas and developers alike who are interesting in dabling or working in ML. If you find a mistake or have feedback for how I summarized a particular topic/field, please don't be afraid let me know in the comments.

“We all need people who will give us feedback. That’s how we improve.” - Bill Gates

Until next time!

DEV Community

Lost in Machine Translation. A tiny intro into the field of MT.

Rule-based MT

Statistical

Example-based

Hybrid MT

Neural MT

Top comments (0)

Your AI Code Assistant

Okay