
How to Efficiently & Affordably Evaluate MT and Crowdsourcing Translation Quality (Part 2)

 

Leonid Glazychev, Ph.D., CEO, Logrus IT

Part 1: Introduction - Quality measurement landscape fragmentation - Automated and manual approaches used - What we truly need to measure

 

A brief ode to critical thinking – first steps

Let us start with a short digression. A woman is visiting her recently married daughter and her husband for Thanksgiving. The mother watches her daughter prepare the turkey and sees that she starts by cutting the bird in half.

“Why have you cut the turkey?” she asks.

“It’s how I’d always seen you do it growing up.”

“But I only did it because we had a tiny stove, and the turkey simply did not fit inside as is!”

Critical thinking means taking a fresh look at everything you do, including things we all take for granted and processes we repeat endlessly and subconsciously simply because we are used to them.

In our case, it means resisting the temptation to take existing quality evaluation methods or old habits for granted and seeking objective, substantiated answers about their applicability, reliability, and any correlations between them.

 

Intento research

The first steps in this direction were taken in Intento's 2018 research. Logrus IT had the privilege of contributing to that research by providing all manual LQAs and LQA-related statistics.

The experiment compared 14 different NMT engines. The sample text comprised 45 randomly selected long segments with no tags (~2,000 words) from the pharmaceutical patent domain, translated from English into German. The selected segments demonstrated the maximal standard deviation among engines, with an average hLEPOR score of 0.71 in each case.
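For readers who want to replicate this kind of sampling, segments that maximize disagreement between engines can be found by computing the per-segment standard deviation of automatic scores across all engines and keeping the top of the list. The sketch below only illustrates that idea; the score matrix, numbers, and function name are hypothetical and do not reproduce Intento's actual tooling.

```python
import numpy as np

def pick_high_variance_segments(scores: np.ndarray, n_keep: int = 45) -> np.ndarray:
    """Return indices of the segments on which the engines disagree the most.

    scores: array of shape (num_segments, num_engines) with an automatic
            quality score (e.g. hLEPOR) for every segment/engine pair.
    """
    per_segment_std = scores.std(axis=1)       # spread of scores across engines
    order = np.argsort(per_segment_std)[::-1]  # largest spread first
    return order[:n_keep]

# Hypothetical usage: 500 candidate segments scored against 14 engine outputs.
rng = np.random.default_rng(0)
scores = rng.uniform(0.4, 0.9, size=(500, 14))
selected = pick_high_variance_segments(scores)
print(f"average score of selected segments: {scores[selected].mean():.2f}")
```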

We performed a blind atomic (segment-level) LQA on all 14 translated samples, one of which was actually a human translation taken from an isolated part of the same TM/corpus that was used to train some of the NMT engines. All in all, 5 expert reviewers took part. To minimize subjectivity, each reviewer evaluated the same 9 segments (300–500 words) across all 14 engines.

Here is a brief quality summary:

  1. No correlation was found between LQA results and hLEPOR scores.

Automated evaluation does not reflect the loss of translation adequacy, including additions, omissions, wrong word order, etc.

  2. The human translation came in at #7. This underscores the questionable quality of materials taken mechanically from the huge TMs mentioned earlier (which have never been evaluated or cleaned).
  3. The quality of MT translations in general was not even close to that of a good human translation. Normalized atomistic evaluation produced quality ratings of 35–39% for the best NMT engines (and much, much worse for some others), while typical expectations for human translation are above 60%. (A simplified illustration of such normalization follows this list.)
  4. 60–70% of some MT translations were very good, meaning they contained no errors or only minor ones.
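The exact Logrus IT normalization formula is outside the scope of this article, so the snippet below is only a rough sketch of the general idea behind normalized atomistic scoring: severity-weighted error penalties are summed and related to the volume of reviewed text, so that error-free text scores 100%. The severity weights and numbers are illustrative assumptions, not the real metric.

```python
# Illustrative only: the weights and the normalization are assumptions,
# not the actual Logrus IT 3D, hybrid metric.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

def normalized_atomistic_score(errors, word_count):
    """Turn (severity, count) pairs into a 0-100% quality rating for the given text volume."""
    penalty = sum(SEVERITY_WEIGHTS[severity] * count for severity, count in errors)
    return max(0.0, 1.0 - penalty / word_count) * 100.0

# Hypothetical sample: 1,000 words with 12 minor, 6 major and 2 critical errors.
print(normalized_atomistic_score([("minor", 12), ("major", 6), ("critical", 2)], 1000))
```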

Based on these results, Intento suggested using the share of good MT translations as a quality indicator. Their primary assumption was that major errors equal major rework.
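In code, such an indicator boils down to the proportion of segments whose worst error is minor or none. Here is a minimal sketch under that assumption; the data structure is hypothetical, not Intento's implementation.

```python
def share_of_good_translations(worst_severity_per_segment):
    """worst_severity_per_segment: e.g. ["none", "minor", "major", ...], one entry per segment."""
    good = sum(1 for severity in worst_severity_per_segment if severity in ("none", "minor"))
    return good / len(worst_severity_per_segment)

# Hypothetical engine output: 4 of 6 segments have no errors or only minor ones.
print(share_of_good_translations(["none", "minor", "major", "none", "critical", "none"]))
```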

This leads us directly to the next section, namely, to the remaining big questions.

 

Remaining big questions

  1. Does atomic error severity/number reflect text rework effort? (Do major errors truly mean major rework?)

If an MT sentence contains a single completely irrelevant or unacceptable word, or an omission that mangles or obscures the meaning of the sentence, LQA will register an error of the highest severity (big impact), while fixing the issue takes only seconds (replacing or adding a single word).

  2. Will 3D, hybrid quality evaluation that adds holistic adequacy and intelligibility measurement clarify the picture?

The above experiment was carried out on standalone, non-consecutive units, so holistic evaluations were out of the question. At the same time, we know that these evaluations are quick and correlate well with human sentiment. In theory, it could make a lot of sense to apply holistic evaluations to all contiguous texts. (Please find details on the 3D, hybrid quality approach in a separate article: Reliably Measuring Something That Isn't Completely Objective: The Quality Triangle Approach to Translation Quality Assurance.)

  3. What does automation (n-gram-based methods) really measure?
  • Human perception of MT “as is” (like LQA)?
  • Productivity gain (editing effort)?
  • Something else, or nothing at all?
  4. Are there any correlations?
  • Do better LQA results mean shorter editing time?
  • Does an hLEPOR score of 71% mean anything for LQA or editing effort?

So far there is no direct evidence that LQA results for the MT pre-translation stage actually correlate with human editing (finalization) effort. Moreover, there are reasons to suspect the opposite. Some critical or major errors, such as a horrible typo on a home page (say, the word “luck” turned into an expletive by a single letter), can be fixed in just a second. That is exactly why we need to include the third, alternative quality measurement approach (editing time) in the comparison; a sketch of how such pairwise correlations can be checked follows this list.

  5. Finally, getting back to the title of this article: Is there a RELIABLE, QUICK and EFFICIENT way to estimate/predict MT or crowdsourcing quality?
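Before moving on to the experiment itself, it is worth making the correlation questions concrete. Given per-sample hLEPOR scores, LQA ratings, and editing times, the relationships can be checked with ordinary rank correlation. The snippet below is a minimal sketch with made-up numbers; it assumes SciPy is available and is not the analysis pipeline used in the experiment described next.

```python
from scipy.stats import spearmanr

# Hypothetical per-sample measurements (one value per translated piece).
hlepor_scores = [0.71, 0.68, 0.74, 0.65, 0.70]
lqa_ratings   = [38, 35, 39, 33, 36]    # normalized quality rating, %
editing_mins  = [52, 61, 48, 66, 55]    # time to bring the MT up to human quality

for name, series in [("LQA", lqa_ratings), ("editing time", editing_mins)]:
    rho, p_value = spearmanr(hlepor_scores, series)
    print(f"hLEPOR vs {name}: rho={rho:.2f}, p={p_value:.3f}")

rho, p_value = spearmanr(lqa_ratings, editing_mins)
print(f"LQA vs editing time: rho={rho:.2f}, p={p_value:.3f}")
```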

 

The experiment

Goals and approach

The experiment we staged centered on the primary question above, i.e., finding an efficient, affordable, and reliable way to estimate/predict MT or [MT+] crowdsourcing quality.

We understood that the answer (methodology) might be goal-dependent, i.e., different solutions may be required for materials expected to be used as final and for materials to be edited/improved.

To answer the question, we carefully checked the correlations among all three known methods of translation quality measurement:

  1. Automation (n-gram-based)
  2. Manual, 3D hybrid LQA
  3. Editing Time

To provide reliable results, we tried to stay on scientific ground at all times and never relied on anecdotal evidence. The approach we applied can be summarized as follows:

  1. Conduct experiments on reasonable, convincing volumes
  2. Apply a solid methodology
  3. Ensure all steps are properly followed, eliminate artificial correlations, verify results

Experiment setup

The experiment was carried out on a relatively typical, contiguous text on IT security (5,000 words) that was translated from English into Russian. We used a team of five translation professionals of varying skills and experience, which simulated a real-world situation in a translation company.

The whole experiment included three consecutive stages:

  1. The text was translated manually as well as with three (3) popular generic MT engines: Google NMT, DeepL, and Google SMT (the latter to diversify the field, as well as for SMT vs. NMT comparison purposes). We would have loved to try out more engines, but had natural resource/budget limitations…
  2. All three raw machine translations were edited by humans with the goal of bringing them up to human translation standards. We measured editing time in each case.
  3. Finally, we did a second round of blind LQA evaluations, once again applying the Logrus IT 3D, hybrid model. This time the LQA was done for all four translations (one human and three edited after MT).

This last step not only provided additional data but also served as an essential “safety check”: we needed to make sure that the quality of the edited MT was similar to human translation quality; otherwise, any conclusions drawn would be irrelevant.

Experiment details

We split the total volume into five equally sized pieces of statistically sound size (approximately 1,000 words each) and, to eliminate artificial correlations, divided the work as follows:

  1. Raw MT output from the three engines (Google NMT, DeepL, and Google SMT) was evaluated by one and the same independent, professional LQA expert across all engines. The same 3D metric based on the Logrus IT 3D, hybrid model was applied in all cases.
  2. Each team member manually translated one piece out of five.
  • Timing benchmarks were recorded for each translated piece (5 altogether).
  • To minimize the effect of inevitable variance in individual translation speeds, we used the same team of translators/editors in each case and averaged all timing benchmarks across all experts (see the averaging sketch after this list).
  3. We ran hLEPOR scores for each MT engine (raw MT vs. the human translation produced earlier).
  4. During the editing step, each person edited three (3) different pieces of text after MT, each piece produced by a different engine.
  • Timing benchmarks were recorded for each edited piece (3 MT engines x 5 pieces, 15 altogether).
  • We made sure that nobody edited the same piece of text after MT more than once, because after the first editing session the final translation is already “imprinted” in the translator’s mind, and subsequent sessions go faster irrespective of the MT quality.
  5. We ran hLEPOR scores again for each MT engine. This time we made the following comparisons:
  • Edited MT vs. raw MT
  • Edited MT vs. human translation
  6. Finally, each team member did a blind LQA on all four (4) versions of yet another piece of text: one human translation and three edited MT translations produced by different engines. We made sure that nobody did an LQA on a piece they had previously translated or edited, in order to preserve as much objectivity as possible and minimize artificial correlations (we tend to assign higher scores to translations more similar to our own). All ratings were averaged, thus reducing the effect of each reviewer’s unavoidable personal preferences.
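As a small illustration of the averaging mentioned in items 2 and 4 above, timing benchmarks can be converted into per-piece averages and words-per-hour rates per engine, so that individual speed differences partially cancel out. The figures below are invented; the real benchmarks are discussed in Part 3.

```python
from statistics import mean

# Hypothetical timing benchmarks: minutes each expert spent editing a ~1,000-word
# piece produced by each engine (expert -> engine -> minutes).
editing_minutes = {
    "expert_1": {"google_nmt": 55, "deepl": 50, "google_smt": 75},
    "expert_2": {"google_nmt": 70, "deepl": 62, "google_smt": 90},
    "expert_3": {"google_nmt": 48, "deepl": 45, "google_smt": 68},
    "expert_4": {"google_nmt": 60, "deepl": 58, "google_smt": 80},
    "expert_5": {"google_nmt": 52, "deepl": 49, "google_smt": 72},
}
words_per_piece = 1000

for engine in ("google_nmt", "deepl", "google_smt"):
    avg_minutes = mean(times[engine] for times in editing_minutes.values())
    print(f"{engine}: {avg_minutes:.1f} min per piece, "
          f"{words_per_piece / (avg_minutes / 60):.0f} words/hour")
```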

Work division among team members is presented in Table 1 below. Each row shows what a particular expert did during the experiment.

For example, Expert #3:

  • Translated piece #3
  • Edited pieces #4, #5, and #1 after different MT engines
  • Did LQA on all four (4) versions of piece #2 (one human translation and three post-edited MT translations).

Table 1. Work distribution among team members. Table cells contain text piece numbers.
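For anyone reproducing a similar setup, the rotation behind Table 1 resembles a Latin-square-style assignment: each expert translates one piece, edits the next three (one per engine), and runs a blind LQA on the remaining one. The snippet below only sketches that rotation with hypothetical naming and may not match Table 1 cell for cell; it is not the planning tool we used.

```python
# Sketch of a rotated work assignment for 5 experts, 5 text pieces and 3 MT engines.
ENGINES = ["Google NMT", "DeepL", "Google SMT"]
NUM_EXPERTS = NUM_PIECES = 5

def wrap(piece):
    """Keep piece numbers in the range 1..NUM_PIECES."""
    return ((piece - 1) % NUM_PIECES) + 1

def build_assignment():
    plan = {}
    for expert in range(1, NUM_EXPERTS + 1):
        translates = expert                               # piece translated manually
        edits = [wrap(expert + 1 + i) for i in range(3)]  # next three pieces, one per engine
        reviews = wrap(expert + 4)                        # remaining piece gets the blind LQA
        plan[f"Expert #{expert}"] = {
            "translates": translates,
            "edits": dict(zip(ENGINES, edits)),
            "reviews": reviews,
        }
    return plan

for expert, tasks in build_assignment().items():
    print(expert, tasks)
```

With this rotation, Expert #3 translates piece #3, edits pieces #4, #5, and #1, and reviews piece #2, matching the example above, and every piece ends up edited exactly once per engine.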

 

Part 3: Editing effect – Trustworthiness of results – Holistic vs. atomistic evaluation – An in-depth look at automated (n-gram-based) quality evaluations – MT editing speed and its correlation with LQA results

 
