
Lakshmi Sivadas

September 15th, 2022

Flushing out the bad agents in big data leaks


Estimated reading time: 5 minutes


The JournalismAI Fellowship began in June 2022 with 46 journalists and technologists from news organisations around the world collaborating on using artificial intelligence to enhance their journalism. At the halfway mark of the six-month programme, our Fellows describe their journey so far, the progress they’ve made, and what they’ve learned along the way. In this blog post, you’ll hear from team Bad Will Hunting.


For the 2022 JournalismAI Fellowship, the Daily Maverick, Follow The Money, and The Guardian teamed up with a shared drive to uncover “bad” agents hidden away within extensive digital corpora, of which whistleblower leaks are the most emblematic use case.

Our vision was to build an AI pipeline to automatically surface and organise the best contenders for “people of interest” in large caches of thousands of documents. This would empower investigative journalists to expedite their work and potentially uncover previously unknown agents or groundbreaking stories.

The first months of the project were spent defining an end-to-end pipeline and deciding which datasets would best allow us to put together a prototype in the allocated time frame. We considered multiple candidate datasets, but ultimately decided to build and trial our method on a set of articles from The Guardian newsroom. This dataset had the advantages of being readily accessible and containing a number of well-known entities against which to develop and validate our approach.

Our goals also demanded substantial knowledge of state-of-the-art techniques in machine learning and natural language processing. Equally, we intended to build up our understanding of entity linking techniques capable of mapping extracted entities to online knowledge bases such as the Google Knowledge Graph or Wikidata.

We therefore sought the guidance of experts like Explosion, the team behind spaCy and other industry-leading NLP products, and established a close collaboration with Friedrich Lindenberg, who is currently working on a related project.

With these pieces in place, we mapped out a conceptual flowchart with the different steps needed to enrich our entities with local semantic information and freely available knowledge base data.

Translating a seemingly straightforward idea into an algorithm required a complex solution. The Bad Will Hunting project pipeline resulted from our discussions on how to achieve the objective of finding “bad agents”, and their connections, hidden throughout multiple documents. The pipeline starts with the extraction of named entities from an initial pool of documents. These entities then need to be enriched with context from knowledge bases and disambiguated into single identities. A final step is to develop metrics to select the most relevant entities, or to present the information graphically. Many of these steps are interrelated and can feed into each other, and we are actively developing the best way of hitting these targets.
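As an illustration of that first stage, here is a minimal sketch of named-entity extraction with spaCy, the library we are working with. The model name, entity filters, and example documents are our own assumptions for demonstration, not the project’s actual configuration:

```python
# Minimal sketch of the first pipeline stage: extracting named
# entities (people and organisations) from a pool of documents.
# Assumes spaCy's small stock English model; the real pipeline's
# model choice and document loader are not specified in this post.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def extract_entities(documents):
    """Count PERSON and ORG mentions across a document pool."""
    mentions = Counter()
    for doc in nlp.pipe(documents):
        for ent in doc.ents:
            if ent.label_ in ("PERSON", "ORG"):
                mentions[(ent.text, ent.label_)] += 1
    return mentions

docs = ["John Doe met executives from Acme Corp in London."]
for (name, label), count in extract_entities(docs).most_common(10):
    print(f"{label:7} {name} ({count} mentions)")
```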


Reliable entity enrichment depends on robust disambiguation (is our “John Doe” a football player, a notorious musician, or someone else entirely?) and entity linking (what is the correct ID of our John Doe in a given knowledge base?). As both tasks are complementary, our team is currently split into two workstreams.
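To make the entity linking half of the task concrete, the sketch below fetches candidate identities for a surface name from Wikidata’s public search API, one of the knowledge bases mentioned above. The endpoint usage follows Wikidata’s documented wbsearchentities action, but the function itself is our illustration rather than the team’s pipeline code:

```python
# Hypothetical illustration of entity linking's candidate-generation
# step: look up possible Wikidata identities for a surface name.
import requests

def wikidata_candidates(name, limit=5):
    """Return (QID, label, description) candidates for a name."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": "en",
            "format": "json",
            "limit": limit,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in resp.json().get("search", [])
    ]

# "John Doe" may map to several QIDs; disambiguation must pick one.
for qid, label, desc in wikidata_candidates("John Doe"):
    print(qid, label, "-", desc)
```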

Workstream 1 has been working on an agent-in-the-loop disambiguation approach (in collaboration with Friedrich Lindenberg) that focuses on building identities around entities. These identities are formed by considering surrounding entities and give insight into the context in which entities appear. This not only gives us information useful for disambiguation and entity linking, but may also highlight previously unknown persons of interest who do not appear in existing knowledge bases.
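A toy version of that co-occurrence idea, with all names and data structures of our own invention, might look like this:

```python
# Sketch of the intuition behind Workstream 1: approximate an
# entity's "identity" by the other entities it appears alongside.
from collections import defaultdict

def cooccurrence_profiles(doc_entities):
    """doc_entities: one set of entity names per document.
    Returns, for each entity, counts of its co-occurring entities."""
    profiles = defaultdict(lambda: defaultdict(int))
    for ents in doc_entities:
        for ent in ents:
            for other in ents:
                if other != ent:
                    profiles[ent][other] += 1
    return profiles

docs = [
    {"John Doe", "Acme Corp", "Jane Roe"},
    {"John Doe", "Acme Corp"},
]
# Mentions of "John Doe" with very different profiles likely belong
# to different identities; mentions with similar profiles can merge.
print(dict(cooccurrence_profiles(docs)["John Doe"]))
```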

Workstream 2 is focused on training a knowledge base entity linker model using spaCy. This model will predict the most likely mapping between each entity and a list of candidate identities extracted from a knowledge base, based on semantic context and candidate notoriety. Known information about the most likely matched identity can thus be assigned to the entity, and entities mapped to the same candidate can be disambiguated into a single identity.
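For readers unfamiliar with spaCy’s entity linking machinery, the sketch below shows how a knowledge base of candidate identities, aliases, and prior probabilities can be assembled before training. The QIDs, frequencies, and vectors are placeholders, and it assumes spaCy 3.5 or later, where the in-memory KB class is called InMemoryLookupKB:

```python
# Placeholder sketch of building a spaCy knowledge base for entity
# linking. Real entity vectors would be derived from descriptions;
# these values are invented for illustration.
import spacy
from spacy.kb import InMemoryLookupKB  # spaCy >= 3.5

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Register candidate identities with prior frequencies and vectors.
kb.add_entity(entity="Q1", freq=100, entity_vector=[1.0, 0.0, 0.0])
kb.add_entity(entity="Q2", freq=10, entity_vector=[0.0, 1.0, 0.0])

# One surface form, two candidates; the prior probabilities encode
# the "candidate notoriety" mentioned above.
kb.add_alias(alias="John Doe", entities=["Q1", "Q2"],
             probabilities=[0.9, 0.1])

for cand in kb.get_alias_candidates("John Doe"):
    print(cand.entity_, cand.prior_prob)
```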

Once this stage is complete, we hope to move on to combining the two workstreams into an open-source toolset. We also intend to research the best methods to present our findings, perhaps through graph visualisation methods.
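As one possible shape for that presentation step, the following sketch renders disambiguated entities and their co-occurrence links as a weighted graph. The networkx and matplotlib choices are ours, purely for illustration; the team has not settled on a visualisation method:

```python
# Illustrative graph view of entities and their connection weights.
import networkx as nx
import matplotlib.pyplot as plt

edges = [
    ("John Doe", "Acme Corp", 5),
    ("John Doe", "Jane Roe", 2),
    ("Jane Roe", "Acme Corp", 1),
]

G = nx.Graph()
for a, b, weight in edges:
    G.add_edge(a, b, weight=weight)

pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos, node_color="lightsteelblue")
nx.draw_networkx_edge_labels(
    G, pos, edge_labels=nx.get_edge_attributes(G, "weight"))
plt.axis("off")
plt.show()
```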

In terms of outcomes for the Fellowship, our ideal scenario would see the team develop an end-to-end prototype capable of extracting and organising the people of interest within our dataset, in particular the “bad agents” of editorial relevance. This pipeline should ideally be flexible enough to accommodate alternative datasets and be adapted for other editorial tasks. In practice, however, there are many challenges to our proposal.

First, the project is quite ambitious given the amount of time allocated for the Fellowship. Second, the idea itself has, due to its complexity, been actively researched for a number of years, and our contribution to the field will likely be a modest increment. Finally, real investigative datasets are quite varied and require extensive processing, so our pipeline would need to be adapted case by case to accommodate these steps.

Nonetheless, even a minimum outcome would be valuable: a clear understanding of the methodology and effort required to build the envisioned AI solution, the skills needed to implement it, and an editorial understanding of how it can be applied by our investigative teams. That alone would place our institutions in a prime position to develop an AI solution for their journalistic needs.

Thus, we are fully confident we will achieve our goals for the Fellowship and use our learnings to help the investigative journalist community in the years to come.


The Bad Will Hunting team is formed by:

Do you have skills and expertise that could help team Bad Will Hunting? Get in touch by sending an email to Fellowship Manager Lakshmi Sivadas at lakshmi@journalismai.info.


JournalismAI is a global initiative of Polis, supported by the Google News Initiative. Our mission is to empower news organisations to use artificial intelligence responsibly.

Header image: Philipp Schmitt & AT&T Laboratories Cambridge / Better Images of AI / Data flock (faces) / CC-BY 4.0

About the author

Lakshmi Sivadas

Posted In: JournalismAI | JournalismAI Fellowship