How a newsroom team of four built their own machine learning model to augment their investigations into government contracts.
Funes is an algorithm developed by Peruvian newsroom Ojo Público for detecting corruption in public contracting. It draws on academic research and newsroom knowledge of Peruvian corruption. The development of Funes raised new use cases and questions: How should the algorithm’s output be interpreted? How can the results be validated, and which cases does it detect or miss? How should these results be communicated to the public? Gianfranco Rossi, software developer at Ojo Público, shared the team’s lessons at the JournalismAI Festival.
AI can be a new source. It gives leads – in the best case, ones we couldn’t see by ourselves – but the story still has to be built through investigation and reporting.
- You need clear and shared working definitions in order to bridge the gap between reporters and technologists in your newsroom.
- Create your own data mappings when the data you use comes from different sources. Data is never in one place, and it is often incomplete, inconsistent, unavailable, and neither uniform nor logically structured.
- It takes journalists to act upon the results generated by a machine learning model. In the best cases, the model might give you leads, but the story still has to be built through investigation and reporting.
It took a team of four (one editor, two journalists, and a statistician) fifteen months to create Funes, an algorithm to detect corruption in public contracting, and to produce the first stories with it. Developing the algorithm not only helped the team validate and learn more about known cases of corruption but also surfaced new story leads, which the team now regularly follows up on. But before Funes could be put to use, the team had to overcome some challenges along the way.
Inception of Funes
Gianfranco Rossi, the statistician on the team, explained how Ojo Público came up with Funes, and how it was based on earlier work they’d done on Fondos de papel (an investigation into private financing of political parties) and Lava Jato (a Latin-American public contract corruption scandal).
They built an in-house look-up system for government contracts called Lisbeth, named after the hacker-investigator protagonist of Stieg Larsson’s Millennium series. One night, over drinks, they wondered what they could do with all the data if only they could analyse more of it, faster. With that, they set out on a complicated but exciting journey.
Building the model
One of their first objectives was to define and agree upon what corruption actually is. As Gianfranco noted: “For this particular topic you’re looking for a hidden action, a latent variable.” The team understood early on that they needed to define proxies for corruption in order to understand what to look for in the data.
Another challenge was that the data was (and still is) stored in many different places, in different formats. It is often incomplete, inconsistent, and unavailable. Building a detailed mapping to get an overview of what data they had and what other data they needed was essential to the team.
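A data mapping of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mapping Ojo Público built: the source names, the column names, and the shared schema are all hypothetical. The idea is to rename each source’s columns to one common vocabulary so that records about the same supplier can be combined.

```python
from typing import Dict, List

# Hypothetical mappings: source name -> {source column: shared column}.
# None of these source or column names come from the Funes project.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "procurement_portal": {
        "ruc_proveedor": "supplier_id",
        "entidad": "entity",
        "monto": "amount",
    },
    "company_registry": {
        "ruc": "supplier_id",
        "razon_social": "supplier_name",
    },
}


def to_common_schema(source: str, record: Dict[str, str]) -> Dict[str, str]:
    """Rename a record's columns to the shared schema and drop unmapped ones."""
    mapping = FIELD_MAPS[source]
    return {
        common: record[raw].strip()
        for raw, common in mapping.items()
        if raw in record
    }


def merge_by_supplier(records: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Combine normalised records that share a supplier_id."""
    merged: Dict[str, Dict[str, str]] = {}
    for rec in records:
        key = rec.get("supplier_id")
        if not key:
            continue  # Records without an identifier cannot be linked.
        merged.setdefault(key, {}).update(rec)
    return merged
```

Writing the mapping down as data, rather than burying it in cleaning scripts, also gives reporters and technologists one place to check which source feeds which field.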
The definition of risk indicators was another essential ingredient. Fortunately, the team could draw on the work of Mihály Fazekas and use parts of his model, the so-called Fazekas indicators, set out in a comprehensive review of corruption proxies (2017). With a list of risk indicators, the team could assign values to contracts, their unit of analysis. Gianfranco explained that these types of contracts usually consist of a tender, a supplier, and a state entity. A risk indicator called “Tender Risk”, for example, would score ‘high’ if the tender had only a single bidder.
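The single-bidder rule can be written as a small function. The sketch below is a hedged illustration rather than the Funes implementation: only the rule “a single bidder scores high” comes from the talk, while the `Contract` fields, the `"low"` label for every other case, and the helper `flag_contracts` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contract:
    """Unit of analysis: a tender, a supplier, and a state entity (hypothetical fields)."""
    tender_id: str
    supplier_id: str
    entity: str
    bidder_count: int


def tender_risk(contract: Contract) -> str:
    """Score 'high' when the tender had exactly one bidder.

    The 'low' label for all other cases is an assumption for this sketch.
    """
    return "high" if contract.bidder_count == 1 else "low"


def flag_contracts(contracts: List[Contract]) -> List[str]:
    """Return the tender ids whose tender risk is 'high'."""
    return [c.tender_id for c in contracts if tender_risk(c) == "high"]
```

A flagged contract is a lead, not a finding: a single bidder can have legitimate explanations, which is why each indicator is only one input among several.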
Results & next steps
After creating mappings for these indicators, the team could start working on their model and feeding it with data. The results so far are promising, but the team takes care to assess the accuracy and value of the algorithm. Funes has already allowed the Ojo Público team to publish some powerful investigations but, promising as the algorithm is, they are still looking for ways to improve it. Gianfranco hinted at the possibility of moving from supervised to unsupervised learning, an important distinction in the field of machine learning, and of looking for new ways to evaluate the accuracy of the model, noting that “the objective for this work is not academic, but to find new cases of corruption and keep holding powerful entities to account.”
In the process of building Funes, the Ojo Público team encountered a series of specific issues they could learn from:
- Data analysis and modelling force you to understand the data, and state systems, inside out.
- Often, datasets are too big to open with a consumer product like Excel. You need tailor-made tools. One of the team members had to learn the R programming language in order to process the data.
- Data analysis brings up new details to consider: at times, the risk indicators taken from the Fazekas model did not seem applicable. “We had to create some new risk indicators, specific to the Peruvian context”, Gianfranco told the audience.
- Challenges to prepare for include invalid data, missing data, security problems on government systems, and many more.
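Two of the issues above, files too large for a spreadsheet and missing or invalid values, can be illustrated together. The team member mentioned in the list worked in R; the sketch below uses Python’s standard library instead, purely as an illustration. It reads a CSV one row at a time, so the whole file never has to fit in memory, and counts the problems it finds. The column names (`tender_id`, `supplier_id`, `amount`) and the rule that a negative amount is invalid are assumptions for the example.

```python
import csv
import io
from collections import Counter
from typing import Iterable, TextIO


def audit_rows(handle: TextIO, required: Iterable[str]) -> Counter:
    """Stream a CSV row by row and count missing or invalid values."""
    required = list(required)
    problems: Counter = Counter()
    for row in csv.DictReader(handle):
        problems["rows"] += 1
        # A field counts as missing if it is absent, empty, or whitespace.
        for field in required:
            if not (row.get(field) or "").strip():
                problems[f"missing:{field}"] += 1
        # Assumed rule: an amount must parse as a non-negative number.
        amount = (row.get("amount") or "").strip()
        if amount:
            try:
                if float(amount) < 0:
                    problems["invalid:amount"] += 1
            except ValueError:
                problems["invalid:amount"] += 1
    return problems


# Usage with a tiny in-memory sample; a real file would be opened with
# open(path, newline="", encoding="utf-8") and passed in the same way.
sample = io.StringIO(
    "tender_id,supplier_id,amount\n"
    "T1,S1,100.5\n"
    "T2,,abc\n"
    "T3,S3,-5\n"
)
report = audit_rows(sample, ["tender_id", "supplier_id", "amount"])
```

Running an audit like this before any modelling gives the team a rough picture of how much of each source is usable.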
In putting Funes to work and evaluating its output, Gianfranco explained, the team realised how fundamental the role of journalists remains in acting upon the results generated by the model: “The model might give you leads, but the story still has to be built through investigation and reporting”.
- Explore the FUNES website
- Check the slides used by Gianfranco at the Festival
- Calculating Corruption: Peru’s Ojo Público Creates Tool to Gauge Contracting Risks
This article was written by Laurens Vreekamp, Design thinker, trainer, and Sprint facilitator.