Theodend: A High-Efficiency, Small Corpus Size Machine Translator

Us is riht micel ðæt we rodera weard, wereda wuldorcining, wordum herigen, modum lufien.
He is mægna sped, heafod ealra heahgesceafta, frea ælmihtig.
Næs him fruma æfre, or geworden, ne nu ende cymþ ecean drihtnes, ac he bið a rice ofer heofenstolas.

Right is it that we praise the King of heaven, the Lord of hosts, and love Him with all our hearts.
For He is great in power, the Source of all created things, the Lord Almighty.
Never hath He known beginning, neither cometh an end of His eternal glory.

 — Old English Book of Genesis

The problem I've encountered as I've gained an interest in Old English is that there is an extremely limited literary base and there are not very many people studying it. As a result, the best kind of "translator" one can find online for Old English, or Anglo-Saxon, is an online dictionary—which, while helpful, cannot translate complete sentences in the wondrous fashion of Google Translate and friends.

I wanted to change that by designing an accessible algorithm for statistical machine translation that makes it easier to translate dying, extinct, or just defunct languages with limited written text available. I wrote a pretty cool paper on my process, linked below.

Statistical machine translation, the technique used by Google Translate and most other translation programs, relies on parallel texts in the source language (in this case, Anglo-Saxon) and the target language (modern English). It's fairly easy to associate words across the two languages when you align the texts sentence by sentence and apply a bit of solid computing power. For widely spoken modern languages, most systems use live-translated proceedings from the UN as corpora, and you can easily get the 10,000 to 100,000 sentences necessary to produce reliable translations.

But with Old English, or other languages without such accessible, large-volume corpora, you can't use statistical machine translation on this scale. Translations become more reliable as more sentences are fed in to train the model, and those sentences simply aren't available for dying languages or those from long ago. The alternative is a rule-based approach, but that takes dozens of experts, a significant budget, and painstaking effort, and it will probably still miss idioms and oddities of structure.

I used Google's and Amazon's cloud computing services, leveraging 16- and 32-core virtual machines with 200GB solid state drives and 32 to 64GB of RAM for about $2 an hour. That is plenty of resources for parsing through thousands of sentences and creating a database of word associations, then repeating the pass to maximize the association probabilities. After that, I applied some heuristic algorithms to smooth out the resulting databases, so that every word is associated with a translation only once, or more than once only as part of a phrase. The basic concept is a variation of IBM's Model 1 Algorithm with fertility.
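To make the "database of word associations, then repeating to maximize probabilities" idea concrete, here is a minimal sketch of plain IBM Model 1 trained with expectation-maximization. It is only an illustration under stated assumptions: it leaves out the fertility extension and the heuristic smoothing described above, and the toy German-English corpus and the function names `train_model1` and `best_translation` are hypothetical, not taken from the paper.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with plain IBM Model 1 EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict keyed by (target_word, source_word).
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform start: every target word is equally likely for any source word.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected (target, source) co-alignments
        total = defaultdict(float)   # expected alignments per source word
        for src, tgt in pairs:
            for w in tgt:
                # E-step: split one unit of credit for w across the source words.
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts into probabilities.
        for (w, s), c in count.items():
            t[(w, s)] = c / total[s]
    return dict(t)

def best_translation(table, source_word):
    """Pick the most probable target word for one source word."""
    candidates = {w: p for (w, s), p in table.items() if s == source_word}
    return max(candidates, key=candidates.get)

# Hypothetical toy corpus (assumption): three aligned sentence pairs.
pairs = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_model1(pairs)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

Even with three sentence pairs, the repeated passes pull probability toward consistent pairings: "the" is explained by "das" in two sentences, which frees "Haus" to claim "house". A real small-corpus system would still need the extra heuristics mentioned above, since rare words get very few such cross-checks.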

Unfortunately, I haven't yet achieved my goal of creating a publicly-accessible online Anglo-Saxon translator, because in the short time I was working on this paper, I wasn't able to source accurately-aligned parallel texts in Anglo-Saxon and Modern English of at least 3000 words. To show that my algorithm succeeds at small-corpus translation, I used Modern German instead, chosen for its grammatical similarity to Anglo-Saxon (and its equal difficulty of accurate translation).

The next steps in this project, on which I am currently working, are to:

  1. Take the time to optimize and ensure alignment of a roughly 3000-sentence-pair Anglo-Saxon corpus with its Modern English translations (partly done)
  2. Train the algorithm for Anglo-Saxon (this should not take very long, considering its similarity to German; my only worry is advanced character support in Unicode for Anglo-Saxon)
  3. Conduct tests to determine the final BLEU score of the Anglo-Saxon algorithm and compare it with the scores obtained from larger training sets
  4. Port the database to a web server and build a frontend to make the full-sentence translator publicly accessible to researchers, and release the source code on GitHub.
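For step 3, it may help to see roughly what a BLEU score measures. The sketch below is a simplified, sentence-level version with a single reference: clipped n-gram precision combined with a brevity penalty. It is an assumption-laden illustration, not the evaluation setup from the paper. Standard BLEU is computed over a whole test corpus and usually with n-grams up to length 4, and the function names and `max_n=2` default here are my own choices for short example sentences.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against one reference translation."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the house is very small".split()
candidate = "the house is small".split()
print(simple_bleu(candidate, reference))
print(simple_bleu(reference, reference))
```

An exact match scores 1.0, and a candidate sharing no words with the reference scores 0.0. Comparing scores across training set sizes only makes sense when the same test sentences and the same BLEU settings are used for every run.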

Here's a PDF of my paper for the Siemens Competition, should you wish to read more:

Improved Hybrid Machine Translation for Small Sized Corpora (by Jake Glass)