A new study by Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, and Marcello Federico found that web content is often translated into many languages.
The study found that the low quality of these multi-way translations suggests they were created using Machine Translation (MT) by tight-fisted companies hoping to save a buck on human translators.
Multi-way parallel, machine-generated content not only dominates the translations in lower-resource languages; it also makes up a large chunk of the total web content in those languages.
“We also find evidence of a selection bias in the type of content translated into many languages, consistent with rubbish English content being translated in bulk into many lower resource languages via MT,” the paper said.
The issue is that web content gets scraped and reused elsewhere, including to build the datasets used to train AI models. Poor machine translations can leave those datasets muddled or corrupted.
The boffins said, “Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web.”