Discussion on: How would you create a translator app?

Dian Fay

If I had to do this and NLP were out of the question, I'd start by tokenizing the input text at the sentence and word level to get an array of arrays of words (a rough sketch follows the list below). The next task is to identify each word, which is monumentally tricky for a few reasons:

  • tenses, plurals, possessives, conjugations, and declensions all affect spelling
  • the input data isn't necessarily coming from a perfect typist
  • homographs exist
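
For illustration, a minimal tokenization pass might look like this in TypeScript; the regexes are naive placeholders, since real sentence segmentation (abbreviations, quotes, ellipses) is a problem in its own right:

```typescript
// A minimal tokenization pass: split on sentence-ending punctuation,
// then on whitespace, stripping anything that isn't a letter or apostrophe.
function tokenize(text: string): string[][] {
  return text
    .split(/(?<=[.!?])\s+/)                      // naive sentence boundaries
    .map(sentence =>
      sentence
        .split(/\s+/)                            // naive word boundaries
        .map(word => word.replace(/[^\p{L}']/gu, '').toLowerCase())
        .filter(word => word.length > 0)
    )
    .filter(sentence => sentence.length > 0);
}

// tokenize("Where is the library? I can't find it.")
// => [["where", "is", "the", "library"], ["i", "can't", "find", "it"]]
```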

Taking as given the existence of lookup tables for all these combinations (Wiktionary is a godsend), the problem of no or multiple matches can at least be approached by introducing a confidence factor. Exact match = 100% confidence. n matches = 100/n % confidence each. No matches, and things get fun: I'd start by searching the lookup tables for similar values using the Levenshtein distance algorithm, factoring the distance and perhaps word length into match confidence.
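
A rough sketch of that scoring scheme, assuming a hypothetical `lookup` table that maps surface forms to candidate lemmas (something a Wiktionary dump could populate) and rolling our own edit distance:

```typescript
type Match = { lemma: string; confidence: number };

// `lookup` maps surface forms (inflected spellings) to candidate lemmas.
function candidates(word: string, lookup: Map<string, string[]>): Match[] {
  const exact = lookup.get(word);
  if (exact && exact.length > 0) {
    // exact match: split 100% confidence evenly across homographs
    return exact.map(lemma => ({ lemma, confidence: 1 / exact.length }));
  }

  // no match at all: fall back to fuzzy search, discounting confidence by
  // edit distance relative to word length (a crude heuristic, not a tuned formula)
  const fuzzy: Match[] = [];
  for (const [form, lemmas] of lookup) {
    const distance = levenshtein(word, form);
    if (distance <= 2) {
      const confidence =
        Math.max(0, 1 - distance / Math.max(word.length, form.length)) / lemmas.length;
      for (const lemma of lemmas) fuzzy.push({ lemma, confidence });
    }
  }
  return fuzzy.sort((a, b) => b.confidence - a.confidence);
}

// standard dynamic-programming Levenshtein distance
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```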

The result so far is an array of sentences, each an array of word slots, each slot an array of [match, confidence] candidate tuples (types sketched below). And this is probably about where I'd give up on going it alone, because the next step is teaching a computer grammar in both the origin and target languages in order to assign each word a position and part of speech in a mappable sentence structure. That's not even an NLP question, it's a "dedicate your entire career to scratching the surface" question -- just ask Noam Chomsky!
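
In type terms (the names are mine, just to pin down the nesting):

```typescript
// One candidate reading of a single input word.
type Candidate = [match: string, confidence: number];

// A word slot holds all candidates for that word; a sentence is a list of
// word slots; the whole input is a list of sentences.
type WordSlot = Candidate[];
type Sentence = WordSlot[];
type ParsedInput = Sentence[];
```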