The use of AI-generated images, text and code is becoming a normal occurrence in academic work. Mohammad Hosseini and Kristi Holmes reflect on a recent misadventure with AI image generation and suggest that, before using these tools, researchers ask themselves two questions to check whether they have sufficient expertise to protect the integrity of research.
A survey of 3,838 researchers published by Nature revealed that 31% of respondents used generative artificial intelligence (GenAI). Of these, 17% used it daily and 42% weekly. The top three tasks cited among use cases in research included “refining text” (63%), “code generation/editing/troubleshooting” (56%), and “finding/summarizing the literature” (29%).
The world has countless problems that could benefit from researchers’ expertise, and any tool that makes them more efficient should be embraced, right?
In reality, however, adoption of any new technology requires careful consideration of its benefits as well as its potential harms. We know that AI tools are fallible. Specifically, they are subject to different kinds of errors and biases, some of which are easily identifiable and others less so. This significant limitation, together with the fact that research findings impact policy and vital aspects of our lives, raises the ethical question of whether researchers should still use GenAI tools.
As always, the answer is: “it depends…”, but even a clear-cut case can throw up complex considerations.
A cautionary tale with images
GenAI images are increasingly common in academic promotional material and presentations (not everyone can afford paid image libraries, and open libraries often lack specificity). While preparing a presentation in January 2024, the first author used StabilityAI to generate an image for a slide. To highlight one of the less frequently represented groups of researchers, namely Muslims, he used the following prompt: “using artificial intelligence in a research lab with Muslims”. You can see the result for yourself:
Image generated by StabilityAI on January 7th at 4:21PM CST. Mohammad Hosseini used the following prompt: “using artificial intelligence in a research lab with Muslims.” We acknowledge that this image might be disrespectful to many, but we used it in its original form to raise concerns about errors and biases of GenAI.
The model appears to have grafted the face of a man onto the body of a hijab-wearing woman. It is important to mention from the outset that this image could be considered disrespectful by Muslims, particularly because it seems to evoke the stereotype of hijab-wearing women having ‘rough’ or masculine features.
From a technical point of view, this image contains both errors and biases. It is erroneous because the face of a man is grafted onto the head and body of a woman. It is biased because of the (apparent) assumption that a Muslim person is supposed to have a beard and moustache, regardless of their gender. Due to similar errors and insensitivities, Google’s generative model (Gemini) recently paused producing images of people.
These kinds of errors and biases could be due to two factors. The first is training data. These models are trained on large amounts of data, and erroneous or biased training data generates low-quality content; “garbage in, garbage out” is a succinct description. Perhaps the very reason for generating this image, the lack of good available imagery of Muslim men and women in a research context, is itself partly to blame for the final output. Nevertheless, the black-box nature of GenAI tools (the hidden computational weightings that generate content) makes it impossible to know exactly what bias is being replicated. Second, the algorithms that translated the prompt into a command the GenAI model could understand might have been biased. Regardless of the source, errors and biases are prevalent.
It may be relatively easy to identify visual errors and biases, but errors and biases can also undermine research integrity in other contexts, especially when they are more subtle. For instance, when AI is used to summarise the literature, biases could result in crucial aspects of the original texts being watered down, slanted or completely removed in favour of a particular opinion, worldview, or ideology. If a researcher has not read the full text or is unfamiliar with the context, how can they identify these biases? Even tasks such as code generation can be undermined by biases that affect equity and accessibility. For instance, code optimized for high-resource environments will perform poorly in low-resource settings without powerful hardware. When generating user interfaces, biases could result in inequitable environments that perform poorly in certain contexts (e.g., right-to-left languages) or lack proper accessibility features for diverse groups, thereby promoting ableism and excluding neurodiverse users.
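To make the user-interface example a little more concrete, here is a minimal sketch of the kind of check a researcher might run on AI-generated HTML before using it. It is an illustration under stated assumptions, not a complete accessibility audit: the class name `MarkupAudit`, the function `audit_markup`, and the three checks are our own hypothetical choices, and passing them says nothing about subtler problems that only a knowledgeable human reviewer would catch.

```python
from html.parser import HTMLParser


class MarkupAudit(HTMLParser):
    """Hypothetical helper: flags a few easily detectable gaps in HTML markup."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Screen readers rely on the alt attribute to describe images.
        if tag == "img" and "alt" not in attributes:
            self.issues.append("img without alt attribute")
        if tag == "html":
            # A declared language helps assistive technology read text correctly.
            if not attributes.get("lang"):
                self.issues.append("html without lang attribute")
            # An explicit direction is a prompt to consider right-to-left users.
            if not attributes.get("dir"):
                self.issues.append("html without dir attribute")


def audit_markup(markup):
    """Return a list of flagged issues; an empty list is not proof of accessibility."""
    parser = MarkupAudit()
    parser.feed(markup)
    parser.close()
    return parser.issues


if __name__ == "__main__":
    # Assumed sample of generated markup, used only for illustration.
    generated = '<html><body><img src="figure.png"></body></html>'
    for issue in audit_markup(generated):
        print(issue)
```

A check like this only catches what its author already knew to look for, which is the point of the argument: the tool's output still has to be judged by someone with the expertise to recognise what is missing.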
An ethical choice
So, returning to our initial question, can we ethically use GenAI tools?
While efforts to improve GenAI should be encouraged and supported, from a research integrity perspective it is beside the point that these tools will become more accurate or less biased in the future, because they are being used now. Current GenAI models can amplify biases ingrained in their training data and algorithms, and generate results that lend support to harmful, discriminatory, and unfair policies. In particular, when it comes to biases related to gender, race, sexuality, ethnicity, age, ability, and socioeconomic status, GenAI can exacerbate existing biases in hypotheses, models, theories, and policies.
Given their shortcomings, GenAI tools are still in a ‘beta version’ and have yet to pass the safety, security, accuracy and other measures necessary for reliable research tools. However, it is important for GenAI users to remember that it is not always possible to know GenAI’s shortcomings. Our image has blatantly visible errors and biases, but others are harder to spot. Two questions researchers employing GenAI tools should ask themselves are:
– Can I always verify the accuracy of GenAI in a specific context?
– Can I identify all kinds of errors and biases in generated content?
If the answer to these questions is “no” or “not always”, then users should consider alternative methods, because their expertise or judgement is likely insufficient to protect the integrity and trustworthiness of research. This suggestion is in line with established international and institutional codes of conduct, which stress the reliability and robustness of research and of the methods and tools used. For example, one of the four core principles of The European Code of Conduct for Research Integrity pertains to “Reliability in ensuring the quality of research, reflected in the design, methodology, analysis, and use of resources”. Similarly, the Guidelines and Policies for the Conduct of Research in the Intramural Research Program at the National Institutes of Health highlight rigor and define it as “the robust and unbiased application of the scientific method to well-defined research questions”.
The suggestion that GenAI should only be used in contexts in which users’ expertise or judgement is sufficient to protect the integrity of research seems reasonable, and has precedent in research. For instance, Mark Israel and Iain Hay argue against researchers using high-risk and traumatized subjects for research if they are unable to prevent psychological harm.
That said, enforcing this suggestion is much more complicated, due in large part to the widespread accessibility of GenAI. This puts the onus on users to remain within the contours of their own expertise, discipline, and context, and to use GenAI only when their expertise or judgement is sufficient to identify possible biases and errors. Efforts to make this process easier and more reliable are sorely needed to help users mitigate bias and to support responsible use of GenAI tools. Furthermore, demarcating areas where it is fine to use GenAI from forbidden use cases can help to specify how the concept of rigor should apply in uncertain situations involving GenAI. Indeed, given the widespread use of GenAI in research and beyond, promoting its responsible and cautious use is essential to prevent errors and the exacerbation of biases, which can erode society’s trust in science.
Featured image credit: Yasmin Dwiputri & Data Hazards Project, Better Images of AI, Safety Precautions, (CC-BY 4.0).
Thank you for sharing this example, Mohammad and Kristi. I think that Spicer et al’s (2023) framework would be helpful here. I am using it with students, to help them reflect when it is safe/ correct / useful to use GenAi: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4678265