General Approach to Crowdsourcing-Based LQAs
As already mentioned in Part I, in the vast majority of real-life cases, nobody can afford the luxury of employing an expert panel to evaluate the translation quality of any particular document or web portal. We typically have to use a single reviewer who only looks at a certain percentage of the content. To produce meaningful, reliable results despite this limitation, we need to apply the methodology described in Part I (the quality triangle or square), which separates two primary types of quality indicators:
- Semi-objective, holistic indicators, which are not completely accurate by design and need to be evaluated against acceptance thresholds on a pass/fail basis, and
- Objective, atomistic indicators.
Providing sufficient and uniform training as well as all available reference materials to all reviewers without exception is essential for the latter. Otherwise, atomistic sentence/unit-level quality evaluation would also inevitably lose its objectivity and accuracy, providing misleading and meaningless results.
The seemingly innocuous concept of mandatory training and use of reference materials poses a serious problem for the crowdsourcing environment, because we are asking for unpaid public effort from enthusiasts and cannot expect people to spend too much time on preparations. While many of us are ready to contribute an hour or two to a good cause, or, better yet, spend it on something truly interesting, the idea of logging long hours to meticulously go through mostly boring materials like terminology glossaries, style guides, special requirements, and instructions doesn’t have great appeal. Any project involving serious preparations as a prerequisite is essentially dead on arrival.
Hence, the blasphemous and somewhat mind-boggling rule number one for crowdsourcing-based LQAs: Minimal reviewer training or no training at all. Just explain the task in the simplest terms possible…
The other obstacle facing crowdsourcing LQAs is of a similar nature, and it threatens the very core of atomistic quality assessment. As discussed in Part I, atomistic quality evaluation rests on two pillars: an extensive and thorough classification of quality issues and a set of relative weights (a quality vector) assigned to issues or issue categories. Coming up with an objective quality rating requires logging, counting, and properly classifying all issues, which, in turn, takes time and relies heavily on the reviewer’s thorough knowledge of the whole quality framework. Remove this knowledge from the equation, and each person immediately starts classifying issues in an arbitrary manner, i.e. we can forget about accurate calculations or objectivity. (There is ample evidence of exactly this nightmarish scenario developing everywhere in the industry with frightening regularity.) Mastering all the intricacies of an extensive quality framework and learning how to classify issues properly and without ambiguity takes time and is hardly possible without training. This means that where crowdsourcing is concerned, we need to forget about the objectivity of the atomistic quality rating and approach it from a different angle.
Rule number two for crowdsourcing-based LQAs: Forget the idea of using any quality issue framework unless it is completely trivial.
Finally, we should not expect people who have volunteered to review the content to spend excessive time on logging issues or sharing their findings.
Rule number three for crowdsourcing-based LQAs: Do not require or expect reviewers to log all encountered quality issues. Just ask them to provide some typical examples and make the summary form/template for submitting results as simple and short as possible.
To summarize: complicated requirements, quality frameworks, special rules, strict definitions or other formalities, lengthy training materials, etc. are obviously out of the question for crowdsourcing-based LQAs.
The landscape painted so far looks so barren and ominous that it is time to plant some positive news…
Complete lack of either formality or special reviewer training is a serious impediment, but the crowdsourcing approach has one vital upside as well: statistics. What we lose in one area, we may gain in another.
When doing professional LQAs, we cannot afford to collect statistical data and have to limit ourselves to a single reviewer looking at the content or a part of it. To minimize subjectivity and increase evaluation accuracy, we need to employ the methodology discussed in detail in Part I <@@link>. This methodology takes into account the inevitable statistical variance in results, which would become evident in a clean experiment (review) involving a number of reviewers rather than a single one. Metrics based on this methodology rely heavily on proper reviewer training, extensive issue frameworks, and the availability of ancillary materials.
With the crowdsourcing approach, we face the opposite: Training, issue frameworks, ancillary materials, and everything else that requires serious time investment is taken out of the picture, but we can stage a review carried out by a number of people. This would allow us to compute meaningful averages and standard deviations, producing significantly more accurate results, from a statistical viewpoint.
Important. The number of people reviewing each part of the content must be statistically valid, i.e. at the very least, 10-15; 100+ would be much better, but is not always possible.
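The 10-15 minimum follows from basic sampling arithmetic: the error margin of the averaged rating shrinks with the square root of the number of reviewers. A minimal sketch (the per-reviewer spread of 2.5 points on the 0-10 scale is an assumed, illustrative value):

```python
import math

def standard_error(stddev: float, n: int) -> float:
    """Standard error of the mean: shrinks as the square root of the sample size."""
    return stddev / math.sqrt(n)

# Assumed per-reviewer spread of 2.5 points on the 0-10 scale.
spread = 2.5
for n in (5, 15, 100):
    print(f"n={n:3d}: error margin ~ +/-{standard_error(spread, n):.2f}")
```

With 5 reviewers the margin is over a full point on the 10-point scale; with 100 it drops to a quarter of a point, which is why larger panels are preferable when feasible.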
Simplified Quality Square Metric for Crowdsourcing-Based LQAs
Based on all considerations outlined in the previous section, we can now present both the LQA metric adapted for the crowdsourcing environment and the accompanying process, which is fine-tuned for this metric and produces reliable, statistically valid results.
The Quality Square approach (Adequacy – Readability – Showstopper problems – Atomistic Quality) was selected because major, conspicuous errors typically play an important role for content reviewed through crowdsourcing. In this particular case, it is applied in a simplified form, with no detailed issue definitions or other formal clarifications or requirements. Each review produces four ratings for each text on a 0-10 scale:
- The number of major (showstopper) errors
These are errors that can seriously distort the meaning of the text, make it unclear, introduce factual errors or political incorrectness, use pejorative text, etc. Each such error should be corrected separately with great care, as it may seriously and negatively affect the overall perception by the user and/or result in incorrect user actions. Under normal conditions and according to professional translation standards, NO major errors are typically permissible, irrespective of the volume of text involved. These errors should all be eliminated at the editing/reviewing stage before publication.
Rating: 0 = Two or more major errors, 10 = No major errors. Each reviewer is requested to provide a brief description of each showstopper error or examples of recurring errors if they exceed two.
- Holistic translation readability (fluency). Reflects how easy the text is to read and understand as a whole
Rating: 0 = Completely unreadable/incomprehensible text, 10 = Perfectly intelligible and readable text. Each reviewer is requested to provide a brief explanation.
- Holistic translation adequacy (accuracy). Reflects how suitable the translation is overall for the intended audience, and how accurately the overall message is conveyed as a whole
Rating: 0 = Completely inadequate text, 10 = Perfectly conveyed meaning. Each reviewer is requested to provide a brief explanation.
- Atomistic quality. Reflects pervasiveness of “local”, sentence-level, non-critical issues with country standards, adequacy, readability, syntax, grammar, formatting, tags, links, and similar things.
Rating: 0 = Overabundance of atomistic-level errors, 10 = Completely error-free text. Each reviewer is requested to provide a brief explanation that includes examples of the most frequent errors or error categories.
For public LQAs, the atomistic quality category is not formalized in any way, and each reviewer is expected to simply evaluate the technical quality of the content in the same way as he or she evaluates adequacy and fluency, i.e. on a holistic level, based on the overall impression. We simply list, by example, the typical issues that fall under this category.
Under normal conditions, this area relies most heavily on error classification, error counts, and other technical details; since no formal, time-consuming procedures are applied here, evaluation variance is expected to be strongest for this category, and result reliability lowest. That is the reason for putting it at the end of the list.
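The four ratings above can be sketched as a small submission record. This is a minimal illustration only; the field names and the validity check are ours, not part of the methodology:

```python
from dataclasses import dataclass

@dataclass
class QualitySquareReview:
    """One reviewer's submission under the simplified Quality Square metric."""
    showstoppers: int   # 0 = two or more major errors, 10 = no major errors
    readability: int    # 0 = completely unreadable, 10 = perfectly readable
    adequacy: int       # 0 = completely inadequate, 10 = meaning perfectly conveyed
    atomistic: int      # 0 = overabundance of minor errors, 10 = error-free
    comments: str = ""  # brief explanation / examples, requested from each reviewer

    def valid(self) -> bool:
        """All four ratings must fall on the 0-10 scale."""
        return all(0 <= r <= 10 for r in
                   (self.showstoppers, self.readability,
                    self.adequacy, self.atomistic))
```

Keeping the record this small is deliberate: the summary form a volunteer fills in should be as simple and short as possible.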
Process and Environment
The process is fine-tuned to offset the negatives of the simplified metric and is adapted to the crowdsourcing environment.
- LQA review scope is defined and described briefly, but clearly, to prevent reviewers from deviating into outside areas.
- Translated content needs to be frozen by the time the review starts, because updates and scope changes are incompatible with the crowdsourcing environment.
- Project description and scope definition are published through an online portal that needs to provide simple interactive forms for registering participants and entering feedback (evaluation results and reviewers’ comments).
This is the only way to ensure a minimum level of response uniformity and minimize collateral damage associated with incorrect information entry into offline files, format conversions, etc.
- To partially compensate for lack of special training, or other components of a professional LQA mentioned above, it is strongly recommended to limit the circle of contributors to language professionals only.
Based on extensive knowledge of the industry, we can definitively say that this alone would result in considerably more consistent and informative feedback compared to the case when any enthusiast speaking a particular language can participate, irrespective of their background.
- It is essential to collect sufficient statistics. In practical terms, this means that each area needs to be reviewed by no fewer than ten people (the more, the better). This is required to keep error margins at a reasonable level.
- All results submitted by individual reviewers are pre-processed in order to discard irrelevant or marginal evaluations.
This is a crucial step in an environment with no formal requirements or training. Bypassing it will result in significant distortion of the calculated average ratings, because there is always a certain share of volunteer reviewers who are not qualified for the task, are too inattentive, did not understand the goals or the context, etc. All comments and explanations provided during the review need to be read and analyzed for signs of irrelevance, because the share of irrelevant or completely incorrect entries is expected to be much higher compared to professional LQAs.
The list of pre-processing checks is compiled individually for each project. It must include standard operations carried out during statistical analysis, like eliminating marginal results, but also needs to address project specifics. This includes, for instance:
- Discarding very low or very high ratings provided without any explanation (this situation indicates lack of attention and thus degrades the value of the rating itself)
- Discarding ratings accompanied by signs of obvious misunderstanding on the part of the reviewer (highlighting a major error that is not an error at all, etc.)
Some practical examples of discarded ratings are provided in the next section that discusses a real project carried out using the suggested metric and process.
- After the pre-processing stage, individual reviewer ratings are averaged, and standard deviations are calculated to assess reliability of results.
- Each of the four average ratings is compared to its pre-defined threshold.
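The last three steps of the process can be sketched end to end in Python. This is an illustrative sketch only: the sample submissions, the extreme-rating cutoffs, and the alarm threshold are our assumptions, not prescribed values.

```python
from statistics import mean, stdev

# Hypothetical submissions for one category: (rating, explanation) pairs.
submissions = [
    (7, "Minor grammar slips, otherwise fluent"),
    (6, "Some awkward calques"),
    (10, ""),   # extreme rating with no explanation: discarded in pre-processing
    (1, ""),    # same on the low end
    (5, "Inconsistent terminology"),
    (8, "Reads naturally"),
]

def preprocess(items, low=2, high=9):
    """Drop extreme ratings submitted without any explanation."""
    return [r for r, note in items if note.strip() or low <= r <= high]

ratings = preprocess(submissions)
avg, sd = mean(ratings), stdev(ratings)   # average and reliability estimate

ALARM_THRESHOLD = 5  # an "alarm-raising" rather than acceptance threshold
flagged = avg < ALARM_THRESHOLD
```

In a real project the explanation check would be complemented by a human read-through of all comments, as described above; no automated filter can spot a reviewer who misunderstood the scope.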
The metric and process discussed in the previous sections allow us to obtain reliable, statistically sound LQA results in the crowdsourcing environment. At the same time, it is essential to emphasize that crowdsourcing-based LQAs can by no means serve as a valid replacement for professional LQAs, which are carried out by specially trained professionals, are based on extensive and well-defined formal criteria fine-tuned to the client’s requirements, and provide an unparalleled level of sophistication, accuracy, and consistency.
The goal of a crowdsourcing-based LQA is to obtain quick results at minimal (or zero) cost and get a rough evaluation of translation quality. These results can serve as a good starting point, and can reveal whether there are serious problems with the translated content and whether a subsequent professional LQA is required.
Acceptance thresholds in this case are replaced by “alarm-raising” ones, since we do not expect the results to be highly accurate but need to distinguish between quality that raises significant concerns and quality that does not.
Applying the Simplified Quality Square Metric for a Real-Life Project
The LQA project described below was carried out free of charge by Logrus International for GALA, where the project was embraced and actively promoted by then-CEO, Hans Fenstermacher, and Serge Gladkoff, my colleague and business partner at Logrus and GALA board member. The original request for this review came directly from the US government and dealt with quality evaluation of the translated version of the Affordable Care Act Spanish-Language Website. It was a unique chance for us to try out the aforementioned methodology and process on real-life content of high public importance.
The fact that the website had a significant level of exposure and attracted enough attention to simplify the task of finding qualified volunteers made it a perfect candidate. We are very grateful to all contributors who have volunteered to look at the website and provide their feedback promptly and free of charge!
Strictly following the process outlined earlier, we only invited seasoned professional translators to participate in the review, and created a special mini-portal for them. The portal contained brief error category definitions, as presented earlier in the Metric section, and allowed participants to enter ratings and comments through a simple online form.
Overall, 17 Spanish-speaking volunteers took part in the project. This allowed us to obtain sufficiently reliable results without spending excessive time on data pre-processing and analysis.
Comprehensive data pre-processing was undertaken. All standalone “perfect” evaluations were discarded during pre-processing in cases when nobody else considered the text perfect, as were marginally high or low ratings provided without proper explanation. All reviewer errors and associated skewed ratings caused by incomplete understanding of project specifics (language specifics, project scope; see details below) were also excluded from analysis.
Important Project Specifics
As mentioned earlier, while the metric and process are essentially the same in all cases, all project specifics need to be taken into account during data pre-processing to eliminate unsubstantiated or incorrect ratings. For the project in question, at least two factors stood out:
- Target language specifics. It is critical to point out the singularity of the chosen target language itself, which resulted in less consistent and informative results. Most translation and LQA tasks include translating or reviewing translations into languages targeted for a specific region. Most reviewers have significant experience with translating or reviewing materials targeted at Spanish-speaking countries or regions, such as Latin America in general (LatAm), or specific countries (such as Argentina, Mexico or Spain). Each reviewer may have had a particular language “flavor” in mind, such as LatAm or “Spain Spanish”, but the website is intended for the Spanish‐speaking population in the US, which is extremely diverse. The target audience consists of people with various backgrounds, speaking a wide variety of Spanish, or even “Spanglish”. While the typical approach in this case might be to use the most neutral and universal terminology and grammar possible (this is exactly how one of the reviewers characterized the translation), such a “super‐neutral” version of the language might not sound natural to some native speakers, who may have reacted more critically than the translation merited.
- Understanding the review scope. Some “major errors” were not actually linguistic, but functional issues beyond the LQA scope. This was the case for one reviewer, who considered having to navigate health insurance plans and prices in English a major error. Another reviewer pointed out numerous spelling errors in responses obtained through the chat feature. Our guess is that the chat was actually a conversation with a live person (rather than with automated agent software), and perhaps an online translation engine was used by someone who was not proficient in Spanish.
These and similar errors were disregarded during pre-processing of results, because we were targeting translation quality alone (not portal usability or functionality), but they raised additional questions about the translation scope and the Spanish proficiency of agents supporting the online chat feature (or their use of online translation engines), since the chat is still part of the overall user experience.
LQA Results by Category
After all required pre-processing and analysis, the following results were obtained for each area of feedback:
- Major errors. Out of 17 reviewers, at least 7 found one or more major errors on the website.
This adjusted figure takes into account missing explanations or misclassified major errors that were ignored during analysis (like the use of “Spanglish”, which we considered a minor error in the given context, provided the meaning was correct and clear). It may not be too easy for a non-Spanish speaker to draw a clear line between a comprehensible sentence that is simply poorly written and a sentence with a completely incomprehensible or distorted meaning. That said, even a single case of such an error is too many, and our reviewers identified multiple major errors in the website translation.
Summary for Major Errors: There are strong reasons to believe that there are multiple major, showstopper-level errors on the website.
- Readability Rating
Having discarded all marginal evaluations lacking comments, our results show an average Readability Rating of 6.2 out of 10 with a standard deviation of 2.2. This is a consistent and reliable result: the standard deviation (2.2) is well below the average rating (6.2) and the scale’s upper limit (10), and the bulk of the distribution curve is comfortably confined between 4.0 and 8.4.
Summary for Holistic Readability: The text is readable (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. At this level and for this type of content a proper target for average readability is at least 8 out of 10.
- Adequacy (Accuracy) Rating
Having discarded all marginal evaluations lacking comments, our results show an average Adequacy Rating of 6.5 out of 10 with a standard deviation of 1.9. This is a consistent and reliable result: the standard deviation (1.9) is well below the average rating (6.5) and the scale’s upper limit (10), and the bulk of the distribution curve is comfortably confined between 4.6 and 8.4. The evaluations are relatively close in this case, indicating greater unanimity among reviewers.
Summary for Holistic Adequacy: The translated text adequately conveys the meaning of the source (rating above 5), but barely so, and leaves much to be desired in view of its importance and high level of public exposure. At this level, and for this type of content, a proper target for average adequacy is at least 8 out of 10.
- Atomistic Quality Rating
This area relies most heavily on error classification and count, as well as other technical details. Since our LQA did not involve full-scale, formal, time‐consuming procedures (see LQA methodology in Part I), we expected the evaluation variance to be strongest here.
The average across all evaluations was 5.4, with a standard deviation of 2.8 (i.e. considerable). Eliminating two reviewers who rated the text as completely error‐free (10) (a statistical anomaly, since all others noticed multiple conspicuous errors), we calculated the adjusted value of 4.7 out of 10 for the average atomistic quality rating, with a 2.4 standard deviation.
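The anomaly adjustment described above can be illustrated with hypothetical ratings. These numbers are invented for illustration and are NOT the actual project data:

```python
from statistics import mean

# Hypothetical atomistic-quality ratings on the 0-10 scale (illustrative only).
raw = [3, 4, 4, 5, 5, 5, 6, 6, 7, 10, 10]

# Standalone "perfect" scores are treated as a statistical anomaly when every
# other reviewer reported multiple conspicuous errors, so they are discarded.
adjusted = [r for r in raw if r < 10]

print(round(mean(raw), 1), round(mean(adjusted), 1))
```

Note the direction of the shift: removing unexplained perfect scores always lowers the average, which is exactly why such outliers must be justified before being kept.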
Summary for Atomistic Quality: Despite the fact that our review was by design a less than ideal community feedback-based LQA, resulting in rating inconsistency among reviewers, it is clear that most reviewers found too many noticeable and annoying technical/minor mistakes in the text, as reflected in the low average rating, which is unsatisfactory. Substantial remedial work is clearly called for in this area.
Metric and Process Summary
Despite the fact that a simplified, crowdsourcing-oriented metric and process were used, we can assert the following:
- Both primary quality indicators, Holistic Readability and Holistic Adequacy, can be relied upon with reasonable confidence and provide a good basis for assessing overall translated content quality.
In this particular case, overall Spanish translation quality is acceptable, but barely. Ratings for both Readability and Adequacy were consistently low-to‐average, around 6+ out of 10, which would be in the range expected for Machine Translation (MT) output with slight to medium post-editing.
- We can make a sufficiently reliable judgment about the presence of Showstopper Errors in the text.
- As expected, holistic assessment of Atomistic Quality is not accurate enough for making quantitative conclusions. A professional LQA would have resulted in a more precise and consistent rating, and would have also produced a complete roster of all errors together with their severities, as well as general recommendations. On the other hand, this assessment gives a good general idea of the pervasiveness of non-critical, atomistic-level errors.
For instance, the results strongly suggest that the translation lacked some of the steps considered essential for professional localization work, such as glossary creation, editing, and proofreading. Final language and technical quality checks, including terminology consistency checks, were either not carried out at all, or done without due diligence or the proper process. Professional translations done by reputable companies following industry best practices simply cannot result in so many uncorrected major errors, so many major translation inconsistencies, or such widespread linguistic and technical errors.
Overall, major LQA results look trustworthy and consistent, and they provide a reliable high-level picture of the real translation quality, which serves as experimental proof that the whole model, including methodology and process suggested, works quite well, even in the relatively extreme crowdsourcing environment!