Leonid Glazychev, Ph.D., CEO, Logrus IT
Part 1: Introduction - Quality measurement landscape fragmentation - Automated and manual approaches used - What we truly need to measure
Let us start with a short digression. A woman is visiting her recently married daughter and her husband for Thanksgiving. The mother watches her daughter prepare the turkey and sees that she starts by cutting the bird in half.
“Why have you cut the turkey?” she asks.
“It’s how I’d always seen you do it growing up.”
“But I only did it because we had a tiny stove, and the turkey simply did not fit inside as is!”
Critical thinking means taking a fresh look at everything you do, including things we all take for granted and processes we repeat endlessly and subconsciously simply because we are used to them.
In our case, it means resisting the temptation to take existing quality evaluation methods or old habits for granted, and instead seeking objective, substantiated answers about their applicability, their reliability, and any correlations among them.
The first steps in this direction were taken in Intento's 2018 research. Logrus IT had the privilege of contributing to that research and provided all manual LQAs and LQA-related statistics.
The experiment compared 14 different NMT engines. The sample text comprised 45 randomly selected long segments with no tags (~2,000 words) from the pharmaceutical patent domain, translated from English into German. The selected segments demonstrated the maximal standard deviation among engines, with an average hLEPOR score of 0.71 per engine.
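To illustrate the selection principle, here is a minimal sketch of picking high-disagreement segments, assuming a precomputed matrix of hLEPOR scores with one row per candidate segment and one column per engine; the data, the pool size, and the random seed are placeholders, not the original study's figures.

```python
import numpy as np

# Hypothetical matrix of hLEPOR scores: one row per candidate segment,
# one column per NMT engine (values here are random placeholders).
rng = np.random.default_rng(42)
hlepor_scores = rng.uniform(0.4, 0.95, size=(500, 14))

# Per-segment standard deviation across the 14 engines:
# a high value means the engines disagree most on that segment.
per_segment_std = hlepor_scores.std(axis=1)

# Pick the 45 segments where the engines diverge the most.
selected = np.argsort(per_segment_std)[-45:]

print("Mean hLEPOR of selected segments:",
      round(hlepor_scores[selected].mean(), 2))
```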
We performed a blind atomic (segment-level) LQA on all 14 translated samples, one of which was an actual human translation taken from an isolated piece of the same TM/corpus used to train some of the NMT engines. In total, five expert reviewers took part. To minimize subjectivity, each of them reviewed the same nine segments (300–500 words) across all 14 engines, so every engine's output for a given segment was judged by the same person.
Here’s the brief quality summary:
Automated evaluation does not reflect loss of translation adequacy, including additions, omissions, wrong word order, etc.
Based on these results, Intento suggested using the share of good MT translations as a quality indicator. Their primary assumption was that major errors equal major rework.
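As a rough illustration of that indicator (the data and the "good" criterion below are hypothetical, not Intento's exact definition), the share of good MT translations can be computed as the fraction of segments whose LQA found no critical or major errors:

```python
# Hypothetical per-segment LQA results: (critical, major) error counts.
lqa_results = [(0, 0), (0, 1), (0, 0), (1, 0), (0, 0), (0, 0), (0, 2), (0, 0)]

# A segment counts as "good" if it contains no critical and no major errors.
good = sum(1 for critical, major in lqa_results if critical == 0 and major == 0)
share_of_good = good / len(lqa_results)

print(f"Share of good MT translations: {share_of_good:.0%}")
```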
This leads us directly to the next section: the remaining big questions.
If an MT sentence contains a single completely irrelevant or unacceptable word, or an omission that mangles or obscures the meaning of the sentence, LQA will register a highest-severity error (big impact), while fixing the issue takes only seconds (replacing or adding a single word).
The above experiment was carried out on standalone, non-consecutive units, so holistic evaluations were out of the question. At the same time, we know that these evaluations are quick and correlate well with human sentiment. Theoretically, it could make a lot of sense to apply holistic evaluations to all contiguous texts. (Please find details on the 3D, hybrid quality approach in a separate article: Reliably Measuring Something That Isn't Completely Objective: The Quality Triangle Approach to Translation Quality Assurance.)
So far there is no direct evidence that LQA results for the MT pre-translation stage actually correlate with human editing (finalization) effort. Moreover, there are reasons to suspect the opposite. Some critical or major errors, like a horrible typo on a home page (say, the word "luck" inadvertently turned into an expletive by a single letter), can be fixed in just a second. That is exactly why we need to include the third, alternative quality measurement approach (editing time) in the comparison.
The experiment we staged centered on the primary question above, i.e., finding an efficient, affordable, and reliable solution for estimating/predicting MT or [MT + crowdsourcing] quality.
We understood that the answer (methodology) may be goal-dependent, i.e., different solutions may be required for materials expected to be used as final and for materials to be edited/improved.
In order to answer this question, we carefully checked the correlations among all three known methods of translation quality measurement:
- atomic (segment-level, error-based) LQA;
- holistic evaluation;
- editing (finalization) time.
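As a simple illustration of how such correlations can be checked (the per-piece numbers below are placeholders, and the choice of Spearman rank correlation is an assumption, since it is robust to the very different scales of the three methods):

```python
from scipy.stats import spearmanr

# Hypothetical per-piece measurements for the three methods
# (one value per 1,000-word piece; all numbers are placeholders).
atomic_lqa = [92.0, 85.5, 97.0, 78.0, 88.5]   # error-based quality score
holistic   = [8.5, 7.0, 9.0, 6.5, 8.0]        # holistic rating, 1-10
edit_time  = [42, 65, 30, 80, 55]             # minutes to finalize

# Editing time should correlate NEGATIVELY with both quality measures
# if LQA results really predict finalization effort.
for name, series in [("holistic", holistic), ("edit time", edit_time)]:
    rho, p = spearmanr(atomic_lqa, series)
    print(f"atomic LQA vs {name}: rho={rho:.2f}, p={p:.3f}")
```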
To provide reliable results, we tried to stay on scientific ground at all times, never relying on anecdotal evidence. The approach we applied can be summarized as follows:
The experiment was carried out on a relatively typical, contiguous IT security text (5,000 words) translated from English into Russian. We used a team of five translation professionals of varying skills and experience, simulating a real-world situation at a translation company.
The whole experiment included three consecutive stages:
This step not only provided additional data; it was also essential as a "safety check." We needed to make sure that the quality of the edited MT was similar to human translation quality; otherwise, any conclusions drawn would be irrelevant.
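One way to formalize such a safety check is a two-sided statistical test on the two sets of quality scores; this is a minimal sketch with placeholder data, and the specific test (Mann-Whitney U) is our assumption, not necessarily what the study used.

```python
from scipy.stats import mannwhitneyu

# Hypothetical segment-level LQA scores (placeholders):
# edited MT vs. reference human translation.
edited_mt = [88, 92, 85, 90, 87, 91, 89, 86, 93, 90]
human_tr  = [90, 91, 87, 89, 92, 88, 90, 85, 94, 89]

# A large p-value means we cannot distinguish the two quality
# distributions, i.e. the safety check passes.
stat, p = mannwhitneyu(edited_mt, human_tr, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.3f}")
```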
We split the total volume into five equally sized pieces of statistically sound size (approximately 1,000 words each) and, to eliminate all artificial correlations, distributed the work as follows:
The division of work among team members is presented in Table 1 below; each row shows what a particular expert did during the experiment (a sketch of generating this kind of rotation schedule follows Table 1).
For example, Expert #3:
Table 1. Work distribution among team members. Table cells contain text piece numbers.
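A rotation schedule of this kind (a Latin-square-style assignment, where no expert handles the same piece at two different stages) is easy to generate programmatically. Below is a minimal sketch; the stage names are illustrative assumptions, not the article's exact workflow.

```python
# Minimal sketch of a cyclic (Latin-square-style) work assignment:
# 5 experts x 3 stages, with a shift per stage so no expert ever
# handles the same piece twice. Stage names are illustrative only.
EXPERTS, PIECES = 5, 5
stages = ["translate/edit", "review", "measure"]

for expert in range(EXPERTS):
    row = [(expert + shift) % PIECES + 1 for shift in range(len(stages))]
    assignments = ", ".join(f"{s}: piece {p}" for s, p in zip(stages, row))
    print(f"Expert #{expert + 1}: {assignments}")
```

Within each stage the five pieces are distributed across all five experts, and within each expert's row the piece numbers never repeat, which is what removes the artificial correlations mentioned above.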