
Irene Pasquetto

Zoë Cullen

Andrea Thomer

Morgan Wofford

November 19th, 2024

Open research data poses real world risks that need to be managed


Estimated reading time: 7 minutes


Drawing on their recent work to identify forms of research data misuse, Irene V. Pasquetto, Zoë Cullen, Andrea Thomer and Morgan Wofford outline seven kinds of research data misuse and provide recommendations for how policymakers, researchers and data professionals can work to mitigate these risks.


The growth of open science has encouraged researchers to share their data to promote transparency, innovation, and collaboration. However, openness brings risks. Data misuse can take various forms, from accidental errors to deliberate manipulation, and it can undermine trust in science, slow progress, and even do harm. To take one example, during the COVID-19 pandemic, leaked medical data containing patients’ private information was used to create fear-based narratives. In several cases, unverified data on infection rates and vaccine effectiveness was spread across social media, leading to confusion and fuelling vaccine skepticism.

So, what exactly constitutes research data misuse, and how can it be prevented or mitigated?

Defining data misuse

Data misuse can encompass a broad range of activities. Misuse can stem from unintentional errors, like methodological mistakes during analysis, or from deliberate action, such as the manipulation of data to support a specific agenda.

One of the key challenges in defining misuse is its relational nature. What one research community views as misuse may be seen as acceptable by another. For example, reusing data without consent may be considered unethical in one field but accepted in another. Similarly, methods once considered fringe may gain acceptance, making it difficult to label their earlier use as misuse.


Moreover, although science is becoming more open and transparent, and non-experts increasingly engage in interpreting scientific data, most existing curatorial guidelines for data sharing and reuse remain tailored to expert data practices. Notable exceptions include educational settings, data literacy programs, and citizen science projects, all of which typically collaborate with publics already supportive of consensus science. Beyond these, there is little understanding of how to release and contextualise science data for public consumption. We identified seven common forms of data misuse:

Analytical Errors

These involve mistakes in how data is analysed, often leading to incorrect conclusions. Such errors may stem from a lack of understanding of the data’s context or limitations. For example, in Data Feminism, D’Ignazio and Klein report on how journalists from a prominent data-driven news site analysed open news data from the GDELT Project, a digital media archive, and mistakenly reported that kidnapping events in Africa were happening at a faster rate than they were. The error stemmed from miscounting instances in the data, a result not only of journalistic oversight but also of the limited contextual information the GDELT Project provides about the data it hosts.

Misinterpretation

This type of misuse results from a misunderstanding of the meaning or implications of research data, potentially leading to incorrect conclusions. For example, in 2019, NASA released wildfire data from Australia, which an artist used to create a map visualizing a month’s worth of fires. The map went viral when celebrities shared it to raise awareness, but many mistakenly presented it as a live depiction of ongoing wildfires. This led to confusion, with some claiming the fire crisis was exaggerated, while others used the misunderstanding to argue that wildfires in Australia were happening at a normal rate.

Misrepresentation

Data can be deliberately or unintentionally manipulated to support specific viewpoints. A recent example of this occurred during the 2020 U.S. presidential election. Claims of voter fraud were widely circulated, with individuals using isolated datasets to create misleading narratives about ballot irregularities. In some cases, datasets were misrepresented to claim that certain voting patterns were statistically impossible, even though further analysis showed these claims were based on a misunderstanding of how election data is collected and reported.

Reputational harm

This kind of misuse entails using or presenting data in a way that damages the reputation of the original data collectors, analysts, or curators. This can happen through failure to cite data, negated or unwarranted authorship, or generating public shame for perceived flaws in data. It can take many forms. A primary example would be the 2009 “Climategate” hacking and release of thousands of private emails from climate scientists. Skeptics of climate change seized upon selective excerpts from released emails, accusing the scientists of manipulating data to exaggerate global warming. Despite multiple independent investigations that cleared the scientists of wrongdoing, the scandal fueled public distrust of climate science and was used to undermine efforts to address climate change.

Privacy and geo-privacy violations

In fields like ecology or biomedicine, data misuse can lead to breaches of privacy, in which personally identifiable information is shared, or geo-privacy, in which the location of sensitive organisms or environments is shared. For example, in a research project in Australia, sharks were tagged by researchers for scientific purposes to gain insights into marine ecology, shark movement, behaviour, and conservation efforts. A government agency identified a tagged shark as an imminent threat, leading to its targeted killing. This misuse of tagging data undermined the conservation and knowledge-generation efforts of the research.

Exploitation

In some cases, research data collected from marginalised communities is reused without fair compensation or benefit-sharing. This can deepen existing inequalities, particularly when data is used to generate commercial value without regard for the communities that contributed to it. For instance, Indigenous land data, collected for conservation purposes, has been misused by commercial entities to plan resource extraction activities, which went against the interests of the local communities.

Uncritical use of biased and offensive data

This misuse consists of the failure to critically evaluate and challenge data sources that contain prejudices, offensive content, or discriminatory elements, perpetuating harmful stereotypes or narratives. For example, local governments have used a biased algorithmic model to predict child abuse risk, disproportionately targeting low-income families because far more data was available on them than on wealthier families.

Preventing Data Misuse: Key Strategies

Given the risks associated with data misuse, what can be done to prevent it or mitigate it?

Invest in Digital Curation: Put a specialised workforce of professional data curators in place who can carefully document data provenance, formats, granularity, quality, and so on, following best practices as well as disciplinary standards.

Self-reflexive research practice: Encourage data reusers to adopt self-reflexive research practices, especially when there is a potential for the data to be biased or harmful, by providing concrete examples of what such practices entail.

Ad hoc Documentation and Metadata: Adopt different forms of data descriptions in addition to metadata, for example, data papers or readme files. In some cases, narrative approaches might be more suitable to give the necessary contextual information to potential data reusers.
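As an illustration of the kind of contextual documentation described above, the sketch below builds a minimal machine-readable metadata record and renders it as a human-readable readme file. All field names, values, and the dataset itself are hypothetical, chosen for illustration rather than drawn from any formal metadata standard.

```python
# Illustrative sketch: a minimal metadata record a data curator might
# publish alongside a dataset, plus a plain-text README rendered from it.
# Field names are hypothetical, not a formal standard.

dataset_metadata = {
    "title": "Example survey of coastal water quality",
    "creators": ["A. Researcher", "B. Curator"],
    "collection_period": "2023-01 to 2023-06",
    "granularity": "one record per sampling site per week",
    "known_limitations": [
        "Sites were selected near population centres; rural coverage is sparse.",
        "Sensor drift after 2023-04 may bias turbidity readings upward.",
    ],
    "license": "CC-BY-4.0",
    "how_to_cite": "Researcher, A., & Curator, B. (2023). Example survey of coastal water quality.",
}

def render_readme(meta: dict) -> str:
    """Render the metadata record as a plain-text README for human reusers."""
    lines = [meta["title"], "=" * len(meta["title"]), ""]
    lines.append(f"Creators: {', '.join(meta['creators'])}")
    lines.append(f"Collected: {meta['collection_period']}")
    lines.append(f"Granularity: {meta['granularity']}")
    lines.append("")
    # Surfacing known limitations is what helps reusers avoid the analytical
    # errors and misinterpretations described earlier in this post.
    lines.append("Known limitations:")
    for item in meta["known_limitations"]:
        lines.append(f"- {item}")
    lines.append("")
    lines.append(f"License: {meta['license']}")
    lines.append(f"Cite as: {meta['how_to_cite']}")
    return "\n".join(lines)

print(render_readme(dataset_metadata))
```

The point of the sketch is that limitations, granularity, and citation instructions travel with the data in both machine-readable and narrative form, rather than living only in a curator's head.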

Credit Data Workers: Be explicit about how to give credit to data workers (licenses, formal data citations, etc.), keeping in mind that the category extends beyond data creators or collectors. Formal data citations, in particular, can address several of the issues related to data misuse discussed in the paper, as they standardize proper attribution of contributors and provide contextual information about a dataset.

Ethical Guidelines for Open Review: Encourage data replication by reusers, while at the same time being explicit about how any errors found in the data or in the analyses should be communicated.

Data Reuse Best Practices: Encourage data stewardship for reusers, for example, by providing guidelines on how to contextualize and document secondary analyses of open data.

Equitable Benefits: When applicable, work with community leaders and consultants to ensure that communities that agree to provide data for research obtain tangible benefits and that local epistemologies are taken seriously.

Balancing Openness and Protection

A key point of this work is to show that misuse of open research data cannot be avoided entirely. However, this reality should not be used as an excuse not to release critical data, or to avoid taking the potential harm caused by data misuse seriously.

Data intermediaries who are responsible for curating, releasing, and managing open research data need to become proactive about preventing, mitigating, and responding to instances of data misuse. In doing so, a strategy that has proven effective is to emphasise the harm that misuse can generate, rather than promoting the superiority of one set of rules and protocols over another.

 


This post draws on the authors’ article, What is research data “misuse”? And how can it be prevented or mitigated? published in JASIST.

The content generated on this blog is for information purposes only. This Article gives the views and opinions of the authors and does not reflect the views and opinions of the Impact of Social Science blog (the blog), nor of the London School of Economics and Political Science. Please review our comments policy if you have any concerns on posting a comment below.

Image Credit: A9 STUDIO on Shutterstock.



About the author

Irene Pasquetto

Dr. Pasquetto is an Assistant Professor at the College of Information at the University of Maryland and a Senior Research Fellow and Senior Editor at the Shorenstein Center on Media, Politics, and Public Policy at the Harvard Kennedy School.

Zoë Cullen

Zoë Cullen is a PhD candidate at the University of Michigan. Her work focuses on understanding how platforms and AI influence our social interactions and information ecosystems.

Andrea Thomer

Dr Thomer is an Associate Professor at the College of Information Science at the University of Arizona. She studies the creation and maintenance of knowledge infrastructures, particularly in the natural sciences. 

Morgan Wofford

Morgan Wofford is a Ph.D. candidate at the University of Michigan's School of Information, studying the politics and practices of open data access and reuse.

Posted In: AI Data and Society | Featured | Libraries | Open Research | Research ethics
