
Kayvan Kousha

Mike Thelwall

January 15th, 2025

What happens when you let ChatGPT assess impact case studies?


The potential benefits of AI to different parts of the research cycle are currently a matter of significant study. In this post, Kayvan Kousha and Mike Thelwall consider whether large language models, such as ChatGPT, can be applied to assess the quality of REF impact case studies.


Impact case studies in the UK Research Excellence Framework (REF) national university assessments are five-page evidence-based claims about how research carried out by the submitting unit (such as a department) has generated useful societal impacts.

Assessing a case study seems to be substantially more difficult than assessing traditional academic outputs because case studies are highly varied. Assessors need to evaluate the strength of the pathway to impact, the extent (breadth and depth) of the impact, and the amount of credit for the impact that is due to the submitting department.

In this context, a tool that could estimate scores for impact case studies could be useful to support decision-making processes, especially for departments and academics deciding which case studies to choose and how to write them. Ultimately, this is a text-processing task, so it is within scope for Large Language Models (LLMs) like ChatGPT, especially given that they have shown some ability to predict the REF quality of journal articles. We therefore tested it and found that it could do a reasonable job of predicting likely impact case study scores, but with disciplinary differences, as explained below.

What is in an impact case study?

Impact case studies are structured as follows:

• Summary of the Impact (100 words): A brief description of the specific impact described in the case study.
• Underpinning Research (500 words): A description of the research conducted that led to the impact.
• References to the Research (6 references): Citations of key research outputs that underpin the impact.
• Details of the Impact (750 words): An in-depth account of the impact, including how the research contributed to it and who benefited.
• Sources to Corroborate the Impact (10 references): Evidence supporting the claims of impact, such as testimonials or official reports.

Most impact case studies can be read online.

How can ChatGPT evaluate them?

After checking the legality of submitting case studies to ChatGPT in terms of copyright, we used the ChatGPT API, which does not learn from data input. The API includes a system prompt, which can be used to explain the task, and a chat session that can consist of a question to ChatGPT and its response. For the system instructions, we combined impact and quality definitions from the official REF2021 guidelines with instructions for assessors from the panel criteria and working methods document to form a self-contained description of impact and how to assess it. The instructions were slightly reformulated to the style used in OpenAI’s examples for ChatGPT, which mainly consisted of telling ChatGPT what it is pretending to be rather than describing the task abstractly. The system instructions formed three quarters of a page of A4, starting as follows:

“You are an academic expert, assessing impact case studies, which describe specific impacts that have occurred from academic research. You will provide a score of 1* to 4* alongside a detailed justification. …”

We used only the first set of system instructions and did not try any variations for our main results, since previous experience with this type of task suggested that variations would make little difference. We then prompted ChatGPT with “Score the following impact case study:”, followed by the title and then all or part of the case study (see below). We used ChatGPT 4o-mini through the API. We submitted the request five times for each case study and used the mean score as the prediction.
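For readers who want to experiment, the sketch below shows roughly how this scoring loop could look with the OpenAI Python client. It is a minimal illustration rather than the code used in the study: SYSTEM_INSTRUCTIONS abbreviates the three-quarter-page REF-based instructions, and score_from_reply() and predict_score() are hypothetical helper names.

```python
# A minimal sketch of the scoring procedure described above, assuming the
# OpenAI Python client and an API key in the OPENAI_API_KEY environment
# variable. SYSTEM_INSTRUCTIONS abbreviates the three-quarter-page REF-based
# instructions; score_from_reply() is a hypothetical helper that extracts the
# 1*-4* rating from ChatGPT's free-text justification.
import re
import statistics

from openai import OpenAI

client = OpenAI()

SYSTEM_INSTRUCTIONS = (
    "You are an academic expert, assessing impact case studies, which describe "
    "specific impacts that have occurred from academic research. You will "
    "provide a score of 1* to 4* alongside a detailed justification. ..."
)


def score_from_reply(reply: str) -> float | None:
    """Return the first 1*-4* rating mentioned in the model's reply, if any."""
    match = re.search(r"\b([1-4])\s*\*", reply)
    return float(match.group(1)) if match else None


def predict_score(title: str, summary: str, repeats: int = 5) -> float:
    """Query the model `repeats` times and return the mean extracted score."""
    scores = []
    for _ in range(repeats):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_INSTRUCTIONS},
                {
                    "role": "user",
                    "content": "Score the following impact case study:\n"
                    f"{title}\n\n{summary}",
                },
            ],
        )
        score = score_from_reply(response.choices[0].message.content)
        if score is not None:
            scores.append(score)
    return statistics.mean(scores)
```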

Experiments and results

We used the above approach for the 6,220 qualifying public impact case studies. Since the main purpose was score prediction, rather than peer review insights into strengths and weaknesses, we focused on obtaining the best predictions. Since we don’t know what score any individual case study received, we used departmental mean scores calculated from the REF2021 results website as a proxy for individual scores. Our target was to obtain score predictions with the highest correlation with these departmental mean impact case study scores, rather than scores that are closest to them, because ChatGPT is much better at putting scores in the correct order than at getting their exact values in this type of task. Given a high correlation, the ChatGPT scores can easily be corrected for scale with a transformation or lookup table.
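As an illustration of that evaluation step, the sketch below correlates predicted scores with the departmental proxies and applies one possible scale correction (a least-squares linear map); the function and variable names are assumptions, not the study’s code.

```python
# A minimal sketch of the evaluation step, assuming the per-case-study ChatGPT
# means and the departmental mean impact scores are available as two parallel
# sequences. The linear rescaling is just one possible "transformation or
# lookup table"; only the ordering (the correlation) is treated as meaningful.
import numpy as np
from scipy.stats import pearsonr


def evaluate_predictions(chatgpt_means, dept_means):
    """Correlate predictions with departmental proxies and rescale them."""
    x = np.asarray(chatgpt_means, dtype=float)
    y = np.asarray(dept_means, dtype=float)

    r, p_value = pearsonr(x, y)  # the study reports r = 0.337 overall

    # Least-squares linear map of the (inflated) ChatGPT scores onto the
    # proxy scale, as one simple scale correction.
    slope, intercept = np.polyfit(x, y, deg=1)
    rescaled = slope * x + intercept
    return r, p_value, rescaled
```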

We tried entering the entire case study or a subset of the sections. We found that entering the title and summary alone gave much better predictions (higher correlations with departmental averages) than entering the full case study. Strangely, ChatGPT seemed to be so impressed by the full texts that it scored virtually all of them as 4* when the complete text was entered!

The results show that ChatGPT has a genuine but weak capability to detect the quality of impact case studies, as reflected by the indirect Pearson correlation of 0.337. It seems to work best by “trusting” the summary claims, presumably because it is unable to effectively assess the detailed narrative and evidence presented to support them.

Disciplinary differences and prompt differences

We repeated the experiments with strict prompts to try to dampen ChatGPT’s enthusiasm for 4* ratings, but this only made a marginal improvement. We also compared the results between units of assessment to illustrate which disciplines it seems to work best and worst for (Fig. 1).


Fig. 1: Pearson correlations between the average ChatGPT score for an impact case study and the departmental score, by unit of assessment, for the average of 30 iterations of the very strict prompt with half scores, applied to each impact case study Title + Summary. Error bars indicate 95% confidence intervals.

As suggested by the above, if you want to apply generative AI to score your own impact case studies, then you could do it with the ChatGPT API, but it is really only worth doing for the initial summary and not the complete document. You should repeat it at least five times (30 would be better) and take the average. This is likely to be close to 4*, so you should ignore the value itself, but you could use it to compare with the scores of other case studies from the same unit of assessment. This will be just a clever guess, not a proper evaluation, so don’t use it for any real decisions unless you have exhausted all sensible alternatives!
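As a concrete but entirely illustrative version of that advice, the snippet below averages many repeats per case study and then compares only the ranking within a single unit of assessment; it reuses the hypothetical predict_score() helper sketched earlier.

```python
# Illustrative use of the scores as suggested above: average many repeats per
# case study, then compare only the ranking within one unit of assessment,
# ignoring the (inflated) absolute values. Reuses the hypothetical
# predict_score() helper from the earlier sketch.
def rank_within_unit(case_studies, repeats=30):
    """case_studies: list of (label, title, summary) tuples from one unit."""
    means = {
        label: predict_score(title, summary, repeats=repeats)
        for label, title, summary in case_studies
    }
    # Highest mean first; only the ordering is meaningful.
    return sorted(means.items(), key=lambda item: item[1], reverse=True)
```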


This post draws on the authors’ preprint, Assessing the societal influence of academic research with ChatGPT: Impact case study evaluations, published on arXiv.

The content generated on this blog is for information purposes only. This Article gives the views and opinions of the authors and does not reflect the views and opinions of the Impact of Social Science blog (the blog), nor of the London School of Economics and Political Science. Please review our comments policy if you have any concerns on posting a comment below.

Image Credit: Tada Images on Shutterstock



About the authors

Kayvan Kousha

Dr. Kayvan Kousha is a senior researcher of scientometrics, webometrics and altmetrics at the University of Wolverhampton, UK, where he leads the Statistical Cybermetrics Research Group. He has developed methods to systematically gather and analyse wider impacts of research outside traditional citation indexes such as from Google Books, Google Patents, online course syllabi, Wikipedia, clinical documents, news stories and online book reviews.

Mike Thelwall

Mike Thelwall, Professor of Data Science, is in the Information School at the University of Sheffield, UK. He researches artificial intelligence, citation analysis, altmetrics, and social media. His current (free) book is “Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence”.

