In this blog post Mike Thelwall, Kayvan Kousha, Paul Wilson, Mahshid Abdoli, Meiko Makita, Emma Stuart and Jonathan Levitt discuss the results of a recent project for UKRI that made recommendations about whether artificial intelligence (AI) could be used as part of the Research Excellence Framework (REF). It assessed whether AI could support or replace the decisions of REF subpanel members in scoring journal articles. The project developed an AI system to predict REF scores and discussed the results with members of sub-panels from most Units of Assessment (UoAs) from REF2021. It was also given temporary access to provisional REF2021 scores to help develop and test this system.
An AI system to assess journal article quality
The main challenge was to design an AI system to assign quality scores to journal articles as accurately as possible. There are two common styles of AI: knowledge-based and machine learning. The former has knowledge about the task it has to perform and applies it to solve problems. For example, a knowledge-based system to recommend careers to school leavers might be fed with a database of career types and the skills needed from them, then ask the pupils their preferences and try to match them with existing careers. In contrast, the machine learning approach has no knowledge but tries to guess answers by pattern matching. For example, a machine learning approach to recommending careers might be to compare pupil CVs to the CVs of people in various careers, recommending the closest match.
For journal article prediction, there is no knowledge base related to quality that could be leveraged to predict REF scores across disciplines, so only the machine learning AI approach is possible. All previous attempts to produce related predictions have used machine learning (or statistical regression, which is also a form of pattern matching). Thus, we decided to build machine learning systems to predict journal article scores. As inputs, based on an extensive literature review of related prior work, we chose: field and year normalised citation rate; authorship team size, diversity, productivity, and field and year normalised average citation impact; journal names and citation rates (similar to the Journal Impact Factor); article length and abstract readability; and words and phrases in the title, keywords and abstract. We used provisional REF2021 scores for journal articles with these inputs and asked the AI to spot patterns that would allow it to accurately predict REF scores.
Fig.1: AI robots making guesses from pattern matching surface-level information
We tried many different technical specifications (see main report) and ran thousands of experiments to assess different strategies. Each experiment built a different AI system on some of the data (REF2021 journal articles and provisional scores) and assessed its accuracy on the remainder. A separate solution was built for each year and UoA.
The basic systems were built on half of the eligible journal articles, predicting the remainder. These achieved a maximum accuracy of 72% in one Unit of Assessment (UoA), with the accuracy being much lower in most UoAs. This level of accuracy is unacceptable for individual articles but the errors tended to cancel out across all of an institution’s outputs to an individual UoA, so the total scores for each institution tended not to be greatly influenced by switching to AI for some of the outputs. In the best case, the Pearson correlation between the total institutional score with and without AI was 0.998, making the total results almost indistinguishable.
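The cancelling-out effect can be illustrated with a small simulation. This is a sketch on synthetic data, not the project's system: the star scores are random, the "AI" is a stand-in that returns the human score 72% of the time and is otherwise wrong by one star, and the names `noisy_score` and `simulate` are hypothetical. Because institutional totals are dominated by submission size in this toy setup, the correlation it produces is only indicative.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def noisy_score(score, accuracy, rng):
    """Assumed error model: correct with probability `accuracy`,
    otherwise off by one star (staying within 1..4)."""
    if rng.random() < accuracy:
        return score
    neighbours = [s for s in (score - 1, score + 1) if 1 <= s <= 4]
    return rng.choice(neighbours)


def simulate(n_institutions=30, accuracy=0.72, seed=0):
    """Return (article-level accuracy, correlation of institutional totals)."""
    rng = random.Random(seed)
    human_totals, mixed_totals = [], []
    correct = predicted = 0
    for _ in range(n_institutions):
        size = rng.randint(20, 400)            # synthetic submission size
        human = [rng.randint(1, 4) for _ in range(size)]
        half = size // 2                       # first half keeps human scores
        ai_part = [noisy_score(s, accuracy, rng) for s in human[half:]]
        correct += sum(a == h for a, h in zip(ai_part, human[half:]))
        predicted += len(ai_part)
        human_totals.append(sum(human))
        mixed_totals.append(sum(human[:half]) + sum(ai_part))
    return correct / predicted, pearson(human_totals, mixed_totals)


if __name__ == "__main__":
    article_accuracy, total_correlation = simulate()
    print(f"article accuracy: {article_accuracy:.2f}")
    print(f"correlation of institutional totals: {total_correlation:.4f}")
```

Even with roughly a quarter of the individual predictions wrong, the institutional totals with and without the simulated AI stay very highly correlated, because over-scores and under-scores largely offset each other within a submission.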
We showed the results to a series of focus groups of REF2021 subpanel members and an overwhelming majority thought that even this very high correlation was not enough for them to be happy with the AI. The reason was that UK academics are also very interested in the average score, or Grade Point Average (GPA), of each institution within each UoA. In fact, despite warnings against it from UKRI, all institutions seem to use REF results to form league tables of each UoA, using them to assess or report their performance. The correlations between fully human GPAs and partly AI GPAs reached 0.906, but smaller institutions still risked substantial ranking shifts. For example, in one of the most accurate UoAs, Clinical Medicine, a small submission had a reasonable chance of dropping 8 places in the league table due to AI mistakes. We told focus groups that the problem was primarily one for smaller submissions and that it could be resolved by using less AI for smaller submissions to reduce their exposure to error, but they were adamant that this should not be done. Thus, there is a statistical solution to the problem of higher error rates for small submissions, but it was not acceptable to REF assessors because it involves submissions receiving unequal treatment.
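Why small submissions are more exposed can be sketched with another toy simulation. Everything here is an assumption made for illustration: scores are random, the stand-in "AI" is right 72% of the time and otherwise off by one star, and `gpa_error_spread` and its `ai_share` parameter are hypothetical names. The function estimates how far a partly AI GPA typically drifts from the fully human GPA for a submission of a given size.

```python
import random
from statistics import pstdev


def noisy_score(score, accuracy, rng):
    """Assumed error model: correct with probability `accuracy`,
    otherwise off by one star (staying within 1..4)."""
    if rng.random() < accuracy:
        return score
    neighbours = [s for s in (score - 1, score + 1) if 1 <= s <= 4]
    return rng.choice(neighbours)


def gpa_error_spread(size, ai_share=0.5, accuracy=0.72, trials=500, seed=1):
    """Standard deviation of (partly AI GPA - human GPA) over many
    synthetic submissions containing `size` outputs each."""
    rng = random.Random(seed)
    n_ai = round(size * ai_share)  # number of outputs scored by the AI
    diffs = []
    for _ in range(trials):
        human = [rng.randint(1, 4) for _ in range(size)]
        mixed = [noisy_score(s, accuracy, rng) for s in human[:n_ai]] + human[n_ai:]
        diffs.append(sum(mixed) / size - sum(human) / size)
    return pstdev(diffs)


if __name__ == "__main__":
    for size in (10, 50, 400):
        print(f"{size:>3} outputs: GPA drift (sd) = {gpa_error_spread(size):.3f}")
    print(f" 10 outputs, 10% AI: {gpa_error_spread(10, ai_share=0.1):.3f}")
```

In this sketch the GPA of a 10-output submission drifts several times more than that of a 400-output submission, and lowering `ai_share` for the small submission shrinks the drift. That is the statistical fix described above, and also the unequal treatment that the focus groups rejected.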
Why could we not generate more accurate predictions? We think it is because the system can only access very surface-level information to identify patterns to guess the quality of a journal article (Fig.1), whereas subpanel members can harness a lifetime of knowledge and experience as well as being able to read and understand the article itself (Fig.2).
Inform, but not replace?
An alternative way of using the AI predictions would be as evidence to inform subpanel members to help them decide difficult cases. This is how bibliometrics seem to have been used in REF2021 by some subpanels. The AI would have two substantial advantages over the bibliometrics: it makes specific score recommendations rather than being an indicator; and it includes an estimate of its confidence in the prediction (e.g., 83% confident of 4*). Thus, it seems clearly better than the bibliometrics and needs less expertise to understand. The REF subpanel members we showed this solution to were mostly happy with it. Whilst it does not save them time, it should help to improve the accuracy of the scores on journal articles that are difficult to classify.
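As a minimal sketch of what such a recommendation could look like, suppose the classifier outputs a probability for each star level (an assumption; this post does not describe how the confidence is calculated). The hypothetical helper `recommend` below turns those probabilities into a single score with a confidence figure.

```python
def recommend(probabilities):
    """Turn per-star probabilities (a dict mapping star level to probability,
    an assumed output format) into a recommendation with its confidence."""
    score = max(probabilities, key=probabilities.get)  # most probable star level
    return f"{probabilities[score]:.0%} confident of {score}*"


if __name__ == "__main__":
    # Made-up probabilities for one article, for illustration only.
    print(recommend({1: 0.02, 2: 0.05, 3: 0.10, 4: 0.83}))
```

A subpanel member would see the specific score and how strongly the system favours it, rather than a raw indicator that still needs interpreting.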
Despite this, we are not recommending this solution because, in our judgement, its benefits are marginally outweighed by the perverse incentive it would generate for institutions to overvalue journal impact factors. UKRI has signed the Declaration on Research Assessment (DORA) against overuse of journal impact factors and is currently attempting to reduce their influence in the sector. An AI system informing REF scores that relied partly on a journal impact calculation would therefore be unwelcome, even though that calculation was only one of the thousand system inputs.
Pilot testing and future work needed
Our final recommendation is to pilot test the AI system in the next REF, but not use it to inform judgements. Why recommend pilot testing if it would create a perverse incentive? Pilot testing would raise the profile of the AI task, allowing researchers to develop innovative and potentially more accurate solutions. Pilot testing would also allow the promotion of publishing journal article full texts in a form suitable for machine learning, which would be of great benefit to AI. Currently, only a small minority of REF2021 journal articles are available online in a suitable form. The production of more accurate AI could then result in time-saving advantages that would outweigh the perverse incentive disadvantages. Alternatively, if the battle against journal impact factors were effectively won, or if pilot testing suggested that the AI system did not generate perverse incentives in practice, then this would open the door to its use in the future, even without substantial increases in accuracy.
Readers can learn more about the project and read the team's final report at: http://cybermetrics.wlv.ac.uk/ai/
Image Credit: In text images reproduced with permission of the authors, featured image Gabriel Heinzer via Unsplash.