Few if any events in history have brought the importance of big data to popular awareness more than the COVID-19 pandemic. Statistics gathered from around the world are driving public policy and shaping private behavior. Here we’ll focus on the linguistic dimension of this global struggle: communicating essential information to policymakers, healthcare providers, and the general public. The challenge is how to communicate rapidly changing data across language borders so that essential information is not lost in translation. But there are also more controversial uses of big data, in which the data is translated along the way in order to find users.
Machine Translation Using Big Data by the Leading Corporations
Given the scale of the problem, translation services are increasingly yielding to the efficiencies and throughput of machine translation. There are simply not enough human translators and interpreters to go around. Happily, thanks to the application of neural network methodologies in the last decade, the quality of machine translation has increased dramatically. Developments in this area have been dominated by the biggest tech companies, collectively known by the acronym FAMGA: Facebook, Apple, Microsoft, Google, and Amazon. Each of those corporations has, in its own way, relied on big data to compete on the leading linguistic edge. Instead of crunching numbers, however, they’re crunching words.
Social Media Translation and Privacy Challenges in COVID Tracking
Facebook snagged first place in several categories of the 2019 WMT competition by leveraging large-scale sampled back-translation, a big data technique for Neural Machine Translation. Neural Machine Translation requires vast amounts of bilingual training data, that is, sentences for which reference translations are available. Bilingual data is hard to come by, so the Facebook team used back-translation as a workaround: monolingual text in the target language is machine-translated back into the source language to create synthetic sentence pairs. In the end, the team used roughly 10 billion words of additional data for its task. Facebook has unmatched access to content, using the comments and posts of its 2 billion or so users as training material.
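To make the back-translation idea concrete, here is a minimal sketch in Python. It is not Facebook’s system: the language pair (English to German), the toy word-lookup "model", and every function and variable name are assumptions made purely for illustration. A real pipeline would use a trained target-to-source translation model, and the "sampled" variant would draw varied translations from that model rather than a single fixed output.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from monolingual target text.

    Each human-written target sentence is paired with a machine-generated
    source sentence produced by a target-to-source translator.
    """
    pairs = []
    for target in target_sentences:
        synthetic_source = reverse_translate(target)
        pairs.append((synthetic_source, target))
    return pairs


# Hypothetical stand-in for a trained German-to-English model:
# a word-by-word lookup, used here only so the sketch runs.
TOY_DE_TO_EN = {"hallo": "hello", "welt": "world", "danke": "thanks"}


def toy_reverse_translate(sentence: str) -> str:
    words = sentence.lower().split()
    return " ".join(TOY_DE_TO_EN.get(word, word) for word in words)


# Plentiful monolingual German text (assumed example data).
monolingual_german = ["Hallo Welt", "Danke"]
synthetic_pairs = back_translate(monolingual_german, toy_reverse_translate)

# Scarce genuine bilingual data is combined with the synthetic pairs
# to train the English-to-German model.
real_pairs = [("good morning", "Guten Morgen")]
training_data = real_pairs + synthetic_pairs
```

The point of the design is that the target side of every synthetic pair is genuine human text, so the forward model still learns to produce natural output even when the machine-generated source side is imperfect.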
It’s one thing to use posted language for experimental purposes in a language competition. It’s another altogether to exploit member posts on sensitive health matters like the novel coronavirus and the COVID-19 pandemic. As J. Scott Marcus of the Bruegel Institute has observed, users “volunteer” information in various ways: in their posts to social media, in their use of mobile services that provide location data, and in seeking health information. According to Marcus, big data has been used for strategic planning concerning COVID, for tracing potentially infected persons, and for the provision of guidance, advice, and information to infected individuals and the general public.
Translating Privacy Concerns in Connection with Voluntary Data Collected
Citizens may not be aware that “voluntary” data could be used to track them down, potentially quarantine them, or expose their movements. More than one country, starting with China, then South Korea, Taiwan, Israel, and others, has explicitly used some or all of this information. In general, high tech companies have cooperated with national governments in making their data available, although privacy protections such as the GDPR have deterred such uses in the European Union.
Virus tracking initiatives use machine translation to “normalize” communications and make them accessible in a preferred language to public health officials. For example, in Israel, social media communications in Arabic are auto-translated to Hebrew by machine translation techniques for the purpose of finding potential virus carriers.
Public Uses of Machine Translation and Interpretation on a Massive Scale
Another example of the massive application of machine translation has been the screening of visitors at international airports. In addition to thermal imaging and the now ubiquitous “thermometer pistols”, border officials are using hand-held voice interpreters to question arriving passengers about their travel histories or medical symptoms.
The same considerations hold true for informing sectors of the public which do not speak the dominant language. Providing up-to-date information about coronavirus is a problem for migrants who do not speak the dominant language of the country in which they reside. In the Netherlands, according to a VOA report, volunteers set up a health desk to assist new immigrants who don’t speak Dutch. In Australia, the government sponsors a massive translation program at the nation’s border. The Translating and Interpreting Service (TIS National) is provided by the Department of Immigration and Border Protection for non-English speakers and uses both human interpreters and machine translation.
The need is massive in US hospitals. The New York Times reported in April 2020 on the vast scale of the difficulties facing Hispanic patients with COVID-19 in the United States, who have suffered disproportionately, representing some 34% of casualties from the disease in New York. To cope with the need, New York hospitals are increasingly turning to video remote interpretation, in which health care providers call in to services where an interpreter is available on demand.
Last year, even before the COVID crisis broke, the not-for-profit Translators without Borders (TWB), with support from Cisco, introduced an innovative machine translation initiative called Gamayun, aimed at helping individuals who speak marginalized, minority languages. “People who speak marginalized languages lack access to critical and life-saving information,” said Grace Tang, who manages the program for TWB. Voice interpretation and text translation based on AI and big data tech will help the program scale up to 10 marginalized languages over 5 years, according to a Cisco spokesman.
The Perils and Pitfalls of a Big Data and Machine Translation Project
Perhaps the most famous, or perhaps notorious, case of a project combining big data and machine translation is Project Baseline, an initiative of Alphabet-backed Verily. In March 2020, U.S. President Donald Trump raised a ruckus when he claimed that Google was backing a nationwide initiative to track the novel coronavirus using bilingual screening questions. A similar controversy arose with Vital Software’s COVID-19 symptom checker, translated into 15 languages for the state of Oregon. Although the community-based project was launched, its scale remains at the county level in selected states, not the national level. It’s still going through “teething pains.” To its credit, the project takes data privacy concerns seriously, given the massive amounts of sensitive information being collected from individuals.
The bottom line on the use of big data for machine translation and other purposes in the COVID crisis is that it’s being done “on the fly” and under intense pressure, a fact that almost invariably results in cut corners and high expectations that are not always met. The data is “noisy” and sub-optimal, to quote Facebook’s report on its WMT victory. Let’s hope that efforts to combine big data and machine translation methodologies in these difficult days are nonetheless successful, so that lives are not needlessly lost in translation.