Not long ago our company took part in an intriguing project on the effectiveness of different Neural Machine Translation engines, including systems from Google, Microsoft and IBM. The research was undertaken by Intento, which offers a unified service platform for artificial intelligence services from various providers.
Machine translation technology is improving every year. While automatic translators were widely laughed off in the mid-2000s, the situation changed quickly with the appearance of Neural Machine Translation systems in 2016.
Neural Machine Translation (NMT) uses deep neural networks and in many respects outperforms the statistical machine translation models that were previously considered the most effective.
In 2018 several players in the machine translation market began offering domain adaptation of their engines (i.e. tuning them to a specific subject area). In their study, Intento analyzed the effectiveness of such adapted engines and compared them with several stock ("pre-trained") engines on translation quality, price, the size of the required training sample, training time and data protection policies.
An English → German biomedical dataset was chosen for training the engines. A random selection of segments was held out for testing the trained neural networks. The resulting translations were then scored with the LEPOR algorithm to identify the strongest systems.
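To give a feel for how such automatic scoring works, here is a deliberately simplified, LEPOR-style metric. The real LEPOR algorithm is more elaborate (it uses tunable weights and n-gram matching); this sketch only combines its three core ideas — a length penalty, a word-position penalty, and a unigram precision/recall balance — and all function names are our own illustration, not Intento's implementation.

```python
import math
from collections import Counter

def length_penalty(hyp_len, ref_len):
    # Penalize hypotheses whose length deviates from the reference;
    # equal lengths score 1.0, larger deviations decay exponentially.
    if hyp_len == ref_len:
        return 1.0
    shorter, longer = sorted((hyp_len, ref_len))
    return math.exp(1 - longer / shorter)

def position_penalty(hyp, ref):
    # Average normalized position difference of matched words,
    # turned into a penalty via exp(-NPD), as in LEPOR.
    used, diffs = set(), []
    for i, tok in enumerate(hyp):
        candidates = [j for j, r in enumerate(ref) if r == tok and j not in used]
        if not candidates:
            continue  # unmatched word: contributes no position term here
        j = min(candidates, key=lambda c: abs(c - i))
        used.add(j)
        diffs.append(abs((i + 1) / len(hyp) - (j + 1) / len(ref)))
    npd = sum(diffs) / len(hyp) if hyp else 1.0
    return math.exp(-npd)

def unigram_f(hyp, ref):
    # Harmonic mean of unigram precision and recall.
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    p, r = matches / len(hyp), matches / len(ref)
    return 2 * p * r / (p + r)

def simple_lepor(hypothesis, reference):
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    return (length_penalty(len(hyp), len(ref))
            * position_penalty(hyp, ref)
            * unigram_f(hyp, ref))
```

A perfect match scores 1.0, and the score drops as the machine output drifts from the reference in wording, word order or length — which is exactly the kind of signal used to shortlist engines for manual review.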
But that's not all. The segments that the various engines translated with significant differences were sent for manual examination at Logrus IT. Our linguists received the original text along with the unlabeled translations from the 13 trial systems and a reference human translation. Each translation's source was hidden for the sake of the experiment's purity: we did not know whether a given translation was human or machine, or which particular engine produced it.
For quality control of the translations we used our own methodology, which includes a set of evaluation criteria (adequacy, readability, terminology, style, etc.) along with an error severity scale (from major errors down to a perfect translation). You can read about it in more detail here.
Based on the analysis data, a rating of the systems was compiled according to the number and severity of the errors found in each translation (see figure).
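A rating of this kind typically converts error counts into a single score by weighting each error by its severity and normalizing by text length. The weights and formula below are purely illustrative assumptions, not the actual Logrus IT scale, which is described in the methodology linked above.

```python
# Hypothetical severity weights for illustration only;
# the real scale and weights belong to the Logrus IT methodology.
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}

def quality_score(errors, word_count, max_score=100):
    """Deduct weighted error penalties per 1000 words from a perfect score.

    errors: mapping of severity level -> number of errors found.
    word_count: length of the evaluated text in words.
    """
    penalty = sum(SEVERITY_WEIGHTS[sev] * n for sev, n in errors.items())
    return max(0.0, max_score - penalty * 1000 / word_count)
```

With such a score per system, the translations can be ranked on one axis regardless of how many segments each evaluator reviewed.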
As we can see, the human translation was by no means the best! This was largely due to the imperfect quality of the training samples, on which the results of a neural network's work directly depend. As a result, the systems that best learned to imitate the target-language text of those specific samples came out ahead in the initial automatic analysis.