The buzz around AI technologies such as ChatGPT is palpable. The sheer power and creativity of generative AI have quickly captured the attention and imagination of newsroom professionals. Equally, though, generative AI has thrown up difficult questions about information integrity, disinformation, policy and reliance on Big Tech.
Yet a more optimistic tone seems to be settling around the practical applications of tools such as ChatGPT, Bard and DALL-E 2: reducing manual effort in news production, providing more semi-automated solutions, and expanding the scope of personalisation. The Reuters Institute went so far as to call 2023 the ‘breakthrough year for AI and its application for journalism’.
However, one distinguishing feature very few have touched on is ChatGPT’s ability to operate in multiple languages: 95 in total, to be exact. As a multi-lingual journalist who has produced content in both Arabic and English, I’ve been exploring the opportunities and limitations of multi-lingual journalism with ChatGPT.
It’s important to note that Large Language Models (LLMs) such as ChatGPT work best in English. The tool’s superiority, creativity and linguistic accuracy in English compared with non-English languages are well documented across the web.
ChatGPT works better in English because of the scale and accessibility of the web data used to train the underlying models. A large proportion of the data leveraged by OpenAI (the developer of ChatGPT) comes from heavily English-weighted web sources such as Common Crawl, WebText2, Books1, Books2 and Wikipedia. (Source: the paper that introduced GPT-3 in 2020 (Brown et al., 2020).)
The most significant of these sources is Common Crawl, which accounts for 60% of GPT-3’s final training mix. Common Crawl publishes monthly statistics about its crawl output. Unsurprisingly, English tops the language-distribution league, taking between 40% and 50% of the overall web-scraped data. Russian comes next at 5.9%, followed by Dutch at 5.8%, French at 4.7% and Spanish at 4.6%, while Arabic comes in at just 0.6%.
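To put those proportions in perspective, here is a quick back-of-the-envelope calculation using the Common Crawl figures above (the 45% figure for English is an assumption: the midpoint of the 40–50% range cited):

```python
# Approximate language shares (%) of Common Crawl web-scraped data,
# as cited above. English is given as a 40-50% range; 45% is the
# midpoint, used here as a rough working figure.
shares = {
    "English": 45.0,  # assumed midpoint of the 40-50% range
    "Russian": 5.9,
    "Dutch": 5.8,
    "French": 4.7,
    "Spanish": 4.6,
    "Arabic": 0.6,
}

# On these figures, English is represented roughly 75 times
# more than Arabic in the scraped data.
ratio = shares["English"] / shares["Arabic"]
print(f"English-to-Arabic ratio: {ratio:.0f}x")
```

That rough 75-to-1 imbalance goes a long way towards explaining the gap in output quality between the two languages.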
OpenAI itself does not provide a detailed statistical breakdown of the language distribution of its final model. However, we know that ChatGPT is a deep-learning model that OpenAI continues to refine: the more we use it, the more feedback data is available to keep training and improving it. I, for one, have noticed a significant improvement in its Arabic output in the few months since its launch in November 2022.
Neural network-based models such as ChatGPT operate by learning the patterns, structures and semantics of different languages, so tasks such as translation and text summarisation should be relatively easy for GPT-3 in most languages; no surprise, given that AI translation tools have been plentiful for years. The real shortfalls for non-English-speaking journalists and newsrooms, however, lie in the areas where Western newsrooms will once again be a step ahead.
Rest of The World
Some of the areas where ChatGPT will likely fall short, compared with English or other widely used Western languages, are:
Generating article copy, engaging social posts and quotes, and handling more general cultural nuance and irony. South Asian languages in particular seem to struggle in terms of ‘linguistic diversity and complex pragmatics’, according to Kin So-hyun’s article in the Asian News Network. This could also be because many Asian countries use their own search engines, much of whose data is never scraped. Baidu, the Chinese search engine, for example, announced last week that its ChatGPT competitor, ERNIE Bot, is due to launch in March 2023. Clearly, much of the global south also sees the potential of generative AI and the need to diversify.
In its latest blog post, OpenAI also recognises the need to diversify its models, calling for ‘public input’ to improve fairness and representation and launching a Researcher Access Program.
Global news organisations and smaller non-English newsrooms should get ahead now with forward-thinking, multi-disciplinary experimentation into how they might leverage their own existing data, and other open-source data, to improve LLMs’ language datasets. Although ChatGPT has a long list of limitations in every language, these models will undoubtedly become smarter, faster and more accurate as time passes. Investing in a wider variety of languages now will give global and smaller local newsrooms a competitive edge in a world heading towards ever more automation.
Finally, as news consumers become more discerning about what, how and when they consume news, it seems only natural that they will also want more choice of language. ChatGPT can potentially make articles, headlines and social posts more nuanced and engaging for audiences globally, without the need for human or rule-based translation.
This article is by journalist Fadah Jassem.