Mike Thelwall

Kayvan Kousha

Mahshid Abdoli

Meiko Makita

Emma Stuart

Paul Wilson

Jonathan Levitt

January 16th, 2023

Can artificial intelligence assess the quality of academic journal articles in the next REF?

In this blog post, Mike Thelwall, Kayvan Kousha, Paul Wilson, Mahshid Abdoli, Meiko Makita, Emma Stuart and Jonathan Levitt discuss the results of a recent project for UKRI that made recommendations about whether artificial intelligence (AI) could be used as part of the Research Excellence Framework (REF). It assessed whether AI could support or replace the decisions of REF subpanel members in scoring journal articles. The project developed an AI system to predict REF scores and discussed the results with members of subpanels from most Units of Assessment (UoAs) from REF2021. The project was also given temporary access to provisional REF2021 scores to help develop and test this system.

An AI system to assess journal article quality

The main challenge was to design an AI system to assign quality scores to journal articles as accurately as possible. There are two common styles of AI: knowledge-based and machine learning. The former has knowledge about the task it has to perform and applies it to solve problems. For example, a knowledge-based system to recommend careers to school leavers might be fed with a database of career types and the skills needed from them, then ask the pupils their preferences and try to match them with existing careers. In contrast, the machine learning approach has no knowledge but tries to guess answers by pattern matching. For example, a machine learning approach to recommending careers might be to compare pupil CVs to the CVs of people in various careers, recommending the closest match.

For journal article prediction, there is no knowledge base related to quality that could be leveraged to predict REF scores across disciplines, so only the machine learning AI approach is possible. All previous attempts to produce related predictions have used machine learning (or statistical regression, which is also a form of pattern matching). Thus, we decided to build machine learning systems to predict journal article scores. As inputs, based on an extensive literature review of related prior work, we chose: field and year normalised citation rate; authorship team size, diversity, productivity, and field and year normalised average citation impact; journal names and citation rates (similar to the Journal Impact Factor); article length and abstract readability; and words and phrases in the title, keywords and abstract. We used provisional REF2021 scores for journal articles with these inputs and asked the AI to spot patterns that would allow it to accurately predict REF scores.

Fig.1: AI robots making guesses from pattern matching surface-level information
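To make the first input concrete, here is a minimal sketch of one common way to field and year normalise a citation count: divide it by the average count of articles from the same field and year. This is an illustration only, not the project's code. The function name, the record layout and the citation counts are all invented, and the project may have used a different normalisation formula.

```python
from collections import defaultdict

def normalised_citation_rates(articles):
    """Divide each article's citations by the mean for its field and year."""
    # (field, year) -> [citation sum, article count]
    totals = defaultdict(lambda: [0, 0])
    for a in articles:
        key = (a["field"], a["year"])
        totals[key][0] += a["citations"]
        totals[key][1] += 1

    rates = {}
    for a in articles:
        total, count = totals[(a["field"], a["year"])]
        mean = total / count
        # If nothing in the group is cited there is no baseline; use 0.0.
        rates[a["id"]] = a["citations"] / mean if mean > 0 else 0.0
    return rates

# Invented example data (not REF data).
articles = [
    {"id": "a1", "field": "Medicine", "year": 2015, "citations": 30},
    {"id": "a2", "field": "Medicine", "year": 2015, "citations": 10},
    {"id": "a3", "field": "History", "year": 2015, "citations": 3},
    {"id": "a4", "field": "History", "year": 2015, "citations": 1},
]
rates = normalised_citation_rates(articles)
print(rates)
```

In this toy data the medicine article with 30 citations and the history article with 3 citations receive the same normalised rate, because each has one and a half times the average for its own field and year. This is the point of normalising: raw counts are not comparable across fields or publication years.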

We tried many different technical specifications (see main report) and ran thousands of experiments to assess different strategies. Each experiment built a different AI system on some of the data (REF2021 journal articles and provisional scores) and assessed its accuracy on the remainder. A separate solution was built for each year and UoA.
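The shape of a single experiment, building a system on part of the data and checking it on the rest, can be sketched as follows. This is a hedged illustration, not the project's code. The "model" is a deliberately crude stand-in that predicts the most common score seen for each journal, the helper names are hypothetical, and the data are invented.

```python
import random
from collections import Counter, defaultdict

def split_half(records, seed=0):
    """Shuffle a copy of the records and split it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def train_baseline(train):
    """Stand-in model: most common training score per journal,
    falling back to the most common score overall for unseen journals."""
    by_journal = defaultdict(Counter)
    overall = Counter()
    for r in train:
        by_journal[r["journal"]][r["score"]] += 1
        overall[r["score"]] += 1
    fallback = overall.most_common(1)[0][0]
    lookup = {j: c.most_common(1)[0][0] for j, c in by_journal.items()}
    return lambda r: lookup.get(r["journal"], fallback)

def accuracy(predict, test):
    """Share of held-out articles whose predicted score matches the real one."""
    return sum(predict(r) == r["score"] for r in test) / len(test)

# Invented toy data: 20 articles from two hypothetical journals.
records = (
    [{"journal": "Journal A", "score": 4}] * 6
    + [{"journal": "Journal A", "score": 3}] * 4
    + [{"journal": "Journal B", "score": 2}] * 7
    + [{"journal": "Journal B", "score": 3}] * 3
)
train, test = split_half(records)
predict = train_baseline(train)
print(f"Hold-out accuracy: {accuracy(predict, test):.0%}")
```

The key design point is that accuracy is measured only on articles the system never saw while it was being built, which is what makes the reported figure a fair estimate of how it would perform on new articles.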

System accuracy

The basic systems were built on half of the eligible journal articles, predicting the remainder. These achieved a maximum accuracy of 72% in one Unit of Assessment (UoA), with the accuracy being much lower in most UoAs. This level of accuracy is unacceptable for individual articles but the errors tended to cancel out across all of an institution’s outputs to an individual UoA, so the total scores for each institution tended not to be greatly influenced by switching to AI for some of the outputs. In the best case, the Pearson correlation between the total institutional score with and without AI was 0.998, making the total results almost indistinguishable.
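The way individual errors can cancel out in institutional totals can be illustrated with a small sketch. The scores below are invented purely for illustration (they are not REF data), and the institution names and the `pearson` helper are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Invented scores: institution -> list of (human score, AI score) per article.
scores = {
    "Inst A": [(4, 3), (3, 4), (4, 4), (2, 3), (3, 2)],
    "Inst B": [(3, 3), (2, 3), (3, 2), (2, 2), (1, 2)],
    "Inst C": [(4, 4), (4, 3), (3, 4), (4, 4), (3, 3), (2, 2)],
}

pairs = [p for inst in scores.values() for p in inst]
article_accuracy = sum(h == a for h, a in pairs) / len(pairs)

human_totals = [sum(h for h, _ in inst) for inst in scores.values()]
ai_totals = [sum(a for _, a in inst) for inst in scores.values()]
total_correlation = pearson(human_totals, ai_totals)

print(f"Article-level accuracy: {article_accuracy:.0%}")
print(f"Correlation of institutional totals: {total_correlation:.3f}")
```

In this toy example fewer than half of the individual articles are scored correctly, yet the institutional totals are almost perfectly correlated, because overestimates and underestimates within each institution largely offset each other.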

We showed the results to a series of focus groups of REF2021 subpanel members, and an overwhelming majority thought that even this very high correlation was not enough for them to be happy with the AI. The reason was that UK academics are also very interested in the average score, or Grade Point Average (GPA), of each institution within each UoA. In fact, despite warnings against it from UKRI, all institutions seem to use REF results to form league tables of each UoA, using them to assess or report their performance. The correlations between fully human GPAs and partly AI GPAs reached 0.906, but smaller institutions still risked substantial ranking shifts. For example, in one of the most accurate UoAs, Clinical Medicine, a small submission had a reasonable chance of dropping 8 places in the league table due to AI mistakes. We told focus groups that the problem was primarily one for smaller submissions and that it could be resolved by using less AI for smaller submissions to reduce their exposure to error, but they were adamant that this should not be done. Thus, there is a statistical solution to the problem of higher error rates for small submissions, but it was not acceptable to REF assessors because it involves submissions receiving unequal treatment.
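A little arithmetic shows why GPAs of small submissions are more exposed. The sketch below is an illustration with invented submission sizes. It assumes the worst case, in which a fixed number of AI errors all fall in the same direction and so do not cancel out, and the function names are hypothetical.

```python
def gpa(scores):
    """Grade Point Average: the mean score of a submission's outputs."""
    return sum(scores) / len(scores)

def max_gpa_shift(n_outputs, n_errors, error_size=1):
    """Largest GPA change if n_errors outputs are each mis-scored by
    error_size in the same direction (an assumed worst case)."""
    return n_errors * error_size / n_outputs

# One mis-scored article in a tiny, invented submission of four outputs.
human = [4, 3, 3, 2]
with_ai_error = [3, 3, 3, 2]
print(gpa(human), gpa(with_ai_error))

# The same two uncancelled errors in a small and a large submission.
small_shift = max_gpa_shift(n_outputs=10, n_errors=2)
large_shift = max_gpa_shift(n_outputs=200, n_errors=2)
print(small_shift, large_shift)
```

Because the same errors are divided by far fewer outputs, the small submission's GPA moves twenty times as much as the large one's in this example, and in a tightly packed league table a shift of that size can be worth several places.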

Why could we not generate more accurate predictions? We think it is because the system can only access very surface-level information to identify patterns to guess the quality of a journal article (Fig.1), whereas subpanel members can harness a lifetime of knowledge and experience as well as being able to read and understand the article itself (Fig.2).

Fig.2: REF subpanel members harnessing considerable knowledge to score journal articles.

Inform, but not replace?

An alternative way of using the AI predictions would be as evidence to inform subpanel members to help them decide difficult cases. This is how bibliometrics seem to have been used in REF2021 by some subpanels. The AI would have two substantial advantages over the bibliometrics: it makes specific score recommendations rather than being an indicator; and it includes an estimate of its confidence in the prediction (e.g., 83% confident of 4*). Thus, it seems clearly better than the bibliometrics and needs less expertise to understand. The REF subpanel members we showed this solution to were mostly happy with it. Whilst it does not save them time, it should help to improve the accuracy of the scores on journal articles that are difficult to classify.
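As a sketch of what such a recommendation might look like in practice, the snippet below turns a set of per-score probabilities into a single suggested score with a confidence level. The probabilities and the function name are invented for illustration, and we assume here that confidence is simply the probability of the most likely score, which may not match how the project's system computed it.

```python
def recommend(probabilities):
    """Return the most probable score and its probability as the confidence."""
    score, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return score, confidence

# Invented probabilities for one hypothetical article.
probabilities = {"4*": 0.83, "3*": 0.12, "2*": 0.04, "1*": 0.01}
score, confidence = recommend(probabilities)
message = f"{confidence:.0%} confident of {score}"
print(message)
```

An assessor would see both the suggested score and how sure the system is, and could give a low-confidence suggestion correspondingly little weight.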

Despite this, we are not recommending this solution because, in our judgement, its benefits are marginally outweighed by the perverse incentive it would generate for institutions to overvalue journal impact factors. UKRI has signed the Declaration on Research Assessment (DORA) against overuse of journal impact factors and is currently attempting to reduce their influence in the sector. An AI system informing REF scores that relied partly on a journal impact calculation would therefore be unwelcome, even though this was only one of the thousand system inputs.

Pilot testing and future work needed

Our final recommendation is to pilot test the AI system in the next REF, but not use it to inform judgements. Why recommend pilot testing if it would create a perverse incentive? Pilot testing would raise the profile of the AI task, allowing researchers to develop innovative and potentially more accurate solutions. It would also help to promote the publication of journal article full texts in a form suitable for machine learning, which would be of great benefit to AI: currently, only a small minority of REF2021 journal articles are available online in a suitable form. More accurate AI could then produce time-saving advantages that would outweigh the disadvantage of the perverse incentive. Alternatively, if the battle against journal impact factors is effectively won, or if pilot testing suggests that the AI system does not generate perverse incentives in practice, then this would open the door to its use in the future, even without substantial increases in accuracy.

Readers can learn more about the project and read the team's final report at: http://cybermetrics.wlv.ac.uk/ai/

The content generated on this blog is for information purposes only. This Article gives the views and opinions of the authors and does not reflect the views and opinions of the Impact of Social Science blog (the blog), nor of the London School of Economics and Political Science. Please review our comments policy if you have any concerns on posting a comment below.

Image Credit: In text images reproduced with permission of the authors, featured image Gabriel Heinzer via Unsplash.

About the author

Mike Thelwall

Mike Thelwall, Professor of Data Science, leads the Statistical Cybermetrics Research Group at the University of Wolverhampton, UK. He researches citation analysis and altmetrics and is a member of the UK Forum for Responsible Research Metrics. His books include “Web indicators for research evaluation: A practical guide”.

Kayvan Kousha

Dr. Kayvan Kousha is a senior researcher of scientometrics, webometrics and altmetrics at the University of Wolverhampton, UK. He has developed methods to systematically gather and analyse wider impacts of research outside traditional citation indexes such as from Google Books, Google Patents, online course syllabi, Wikipedia, clinical documents, news stories and online book reviews.

Mahshid Abdoli

Mahshid Abdoli is a social media researcher in the Statistical Cybermetrics Research Group at the University of Wolverhampton, UK. She is studying altmetrics and social media contents to gain qualitative evidence about their use in academia.

Meiko Makita

Dr Meiko Makita is a sociologist currently based at the School of Health Sciences at University of Dundee, UK. From 2015 to 2022 she contributed as a social media analysis researcher to the Statistical Cybermetrics and Research Evaluation Group at the University of Wolverhampton. She is particularly interested in analysing digitally-mediated health information practices and discourses.

Emma Stuart

Dr Emma Stuart is a social media researcher in the Statistical Cybermetrics and Research Evaluation Group at the University of Wolverhampton. She specialises in content analysis and is interested in looking at how social media is used from a wide range of different perspectives.

Paul Wilson

Dr Paul Wilson is a Senior Lecturer in Statistics at the University of Wolverhampton, UK. He is a specialist in statistical modelling, especially of count data, with particular emphasis on zero-modified data and assessment of model fit.

Jonathan Levitt

Dr Jonathan Levitt is Senior Research Fellow in Obstacles Faced by Disadvantaged People at the University of Wolverhampton, UK and a writer on disability. He is researching the under-representation of people with disabilities in academia.
