
Sabrina Argoub

June 9th, 2021

The NLP divide: English is not the only natural language

Estimated reading time: 10 minutes


Natural Language Processing (NLP) is an area of artificial intelligence that aims to understand, analyse and make sense of human languages. It deals with systems that can understand language and perform tasks such as translation, grammar checking, and topic classification.

That is why NLP plays a crucial role in the development of AI-powered tools for news organisations.

In the pursuit of innovation and new engaging formats, NLP can be a powerful tool for newsrooms. Agnes Stenbom – Responsible Data and AI Specialist at Schibsted and PhD candidate at KTH Royal Institute of Technology – notes that by investing in research and practical implementations of models in their own languages, newsrooms can unlock great potential in domains such as content analysis and editorial insights, or even content creation.

The development and implementation of NLP technology, however, is not even-handed.

The vast majority of technological advances have been in English-based NLP systems. To understand the implications of the disparity between English and other languages, we asked two newsroom teams from our JournalismAI Collab – La Nación in Argentina and Inkyfada in Tunisia – how they are approaching and implementing NLP to support their journalism.

The missing data

According to Chayma Mehdi – editor-in-chief at Inkyfada – the most prominent disadvantage for non-English languages is the lack of data.

Within the field of NLP, languages are divided into high-resource and low-resource. High-resource languages are those for which large amounts of data are available, along with libraries – collections of functions and resources that make it possible to apply NLP. By far the best-resourced language is English.

Many companies have already taken it upon themselves to collect, annotate, and publish data that can be used to train NLP models in English. For other languages, very little data is available, to the point that news organisations are often forced to find and collect the data for themselves.
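To make that annotation effort concrete, here is a toy sketch of the from-scratch work a low-resource newsroom faces: a small hand-labelled corpus feeding a minimal Naive Bayes topic classifier. The examples and labels are invented for illustration and are not data from either newsroom.

```python
from collections import Counter, defaultdict
import math

# A tiny hand-annotated dataset of the kind a newsroom might have to
# assemble itself when no public corpus exists for its language
# (texts and labels are invented for illustration).
train = [
    ("el gobierno aprobó el presupuesto", "politics"),
    ("el congreso debate la nueva ley", "politics"),
    ("el equipo ganó el partido de fútbol", "sports"),
    ("la selección jugará la final", "sports"),
]

def train_nb(examples):
    """Fit a minimal multinomial Naive Bayes: word counts per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximising log P(label) + sum of log P(word|label)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        n = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

word_counts, label_counts = train_nb(train)
print(classify("el congreso aprobó la ley", word_counts, label_counts))  # politics
```

Even this toy version makes the cost visible: every line of the training set is a human annotation decision, and the classifier is only as good as the labelled data behind it.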

Even when libraries are available, they are no guarantee of success.

For languages such as Spanish, the process is still time-consuming, as the models and pipelines in these libraries are not as well trained as their English counterparts. Data and innovation journalist Delfina Arambillet and her team at La Nación often have to retrain the models themselves, feeding them new and different words to increase their accuracy.

Trial and error

On top of data, time is another scarce resource. The process of implementing NLP technologies in non-English newsrooms can't keep up with the fast-paced news cycle. La Nación therefore uses AI tools for projects that are not tied to breaking news and are not expected to meet strict deadlines.

Flor Coelho – new media research and training manager at La Nación – explains that “people need time to experiment and test these tools and should be given the opportunity to try and fail. That’s how we learn.”

Flor acknowledges that they are in a privileged position to have a team dedicated to researching these new technologies. To move forward in this field, it is important to set the AI strategy as part of the agenda of the newsroom:

In a way, the data team works as an investigative unit with a long-term horizon. But instead of investigating a journalistic story, we are doing a meta-investigation on the new technologies and tools that will improve the journalism we produce.

How to overcome the data and time limitations

To overcome their challenges, the two newsrooms have explored different approaches.

The team at Inkyfada suggests looking first into existing solutions. Duplicating the work done in English is the first step toward applying NLP to your desired language, as Chayma explains:

We look at what models work in English and fine-tune them to our needs. We evaluate the model’s quality and then train it in the Arabic language.

When tuning from English to another language works, this approach opens up the opportunity for the model to be applied to multiple solutions and even to explore multi-language solutions. The best strategy is to explore both these options while continuing to collect and clean data in your desired language for the specific task you want to work on.
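The "evaluate, then fine-tune" loop Chayma describes starts with measuring a candidate model against a small hand-labelled test set in the target language. A minimal sketch of that evaluation step, where the stub model and the labelled examples are hypothetical stand-ins for a real pipeline:

```python
# A small hand-labelled held-out set in the target language
# (invented Arabic examples, for illustration only).
test_set = [
    ("الحكومة تقر الميزانية الجديدة", "politics"),
    ("الفريق يفوز بالمباراة النهائية", "sports"),
]

def stub_model(text):
    """Stand-in for a candidate NLP model: a real pipeline would
    return a predicted topic label for the input text."""
    return "politics" if "الحكومة" in text else "sports"

def accuracy(model, labelled):
    """Share of held-out examples the model labels correctly – the
    basic quality check before investing in further fine-tuning."""
    correct = sum(model(text) == gold for text, gold in labelled)
    return correct / len(labelled)

print(accuracy(stub_model, test_set))  # 1.0
```

The point is the workflow, not the stub: only once accuracy on target-language data is measured can a team decide whether an English or multilingual model is worth fine-tuning for their task.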

For La Nación, collaboration is the key solution. Building relationships with local academic institutions helps provide more resources, both for academic research and for advances in the technologies that the newsroom will be able to implement. Newsrooms can team up with faculties and, ideally, give researchers and students the opportunity to apply their expertise to concrete case studies. The shared workload can make working on NLP less time-consuming.

From a long-term perspective, the collaboration is also beneficial because students are introduced to journalism, which is not the typical field that attracts computer science graduates.

Human oversight and realistic expectations are the keys to success 

Every language brings its own challenges, whether because of the availability of data resources, the structure of the language itself, or cultural dimensions like attitudes and traditions that shape the way we communicate. These challenges arise both when building models that process language and, looking outward, when considering how people will react to or appreciate machine-generated text.

“I think language AI, in general, is such a clear example of how immensely cultural artificial intelligence is. There is no one-size-fits-all approach”, Agnes adds. For this reason, human oversight and realistic expectations are key to the success of AI implementation in low-resource languages.

As part of her research on how journalism may responsibly leverage AI technologies, Agnes recently co-authored a paper with her collaborator Tobias Norlund, in which they sought to evaluate the perceived “human-likeness” and “informativeness” of text produced by an NLP model trained on data from a Swedish online forum.

They found that the model performed quite well, with 68% of its posts deemed plausible to be human-written. But for more than half of the automatically-generated posts, human evaluators disagreed on whether they could pass as human-like or not.
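The two figures reported above correspond to two simple statistics over rater votes: the share of posts a majority judged human-written, and the share on which raters disagreed. A toy sketch with invented votes (not the study's actual data) shows how each is computed:

```python
# Each generated post receives "human"/"machine" votes from several
# raters. These votes are invented for illustration.
votes = [
    ["human", "human", "human"],       # unanimous: human-like
    ["human", "machine", "human"],     # split, majority human
    ["machine", "machine", "human"],   # split, majority machine
    ["machine", "machine", "machine"], # unanimous: machine
]

def majority_human_share(all_votes):
    """Fraction of posts a majority of raters judged human-written."""
    human = sum(v.count("human") > len(v) / 2 for v in all_votes)
    return human / len(all_votes)

def disagreement_share(all_votes):
    """Fraction of posts on which raters were not unanimous."""
    split = sum(len(set(v)) > 1 for v in all_votes)
    return split / len(all_votes)

print(majority_human_share(votes))  # 0.5
print(disagreement_share(votes))    # 0.5
```

In this toy data, half the posts pass as human-written by majority vote, yet raters split on half of them – the same pattern of headline success masking underlying disagreement that the study reports.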

However small this particular evaluation study was, that’s an important insight for journalism. We need to be mindful of the attitudes and preferences of our audiences when implementing NLP solutions.

The experiences of newsrooms trying to make NLP work for journalism in languages other than English show that there are still significant technical roadblocks to overcome.

But the same barriers indicate the need and the potential for more collaboration and co-creation between newsrooms and between journalists and academics.


A first version of this article was published in our newsletter on June 3rd, 2021. Don’t miss our exclusive content and sign up for the monthly newsletter. 

The article was written by Sabrina Argoub, JournalismAI Community Coordinator. JournalismAI is a project of POLIS – the journalism think-tank at the London School of Economics and Political Science – and it's powered by the Google News Initiative.

 
