There has been renewed enthusiasm in recent weeks for greater data-sharing practices in the social sciences, due in no small part to the Reinhart-Rogoff controversy. Here, data curation specialist Carly Strasser provides answers to some frequently asked questions from those still sceptical about breaking down the technical, practical, and theoretical barriers to data sharing.
If you are a fan of data sharing, open data, open science, and generally openness in research, you’ve heard them all: excuses for keeping data out of the public domain. If you are NOT a fan of openness, you should be. For both groups (the fans and the haters), I’ve decided to construct a “Frankenstein monster” blog post composed of other people’s suggestions for how to deal with the excuses.
I have drawn some comebacks from Christopher Gutteridge, University of Southampton, and Alexander Dutton, University of Oxford. They created an open Google Doc of excuses for closing off data and appropriate responses, and generously provided access to the document under a CC-BY license. I also reference the UK Data Archive’s list of barriers and solutions to data sharing, available via the Digital Curation Centre’s PDF, “Research Data Management for Librarians”.
People will contact me to ask about stuff
Christopher and Alex (C&A) say: “This is usually an objection of people who feel overworked and that [data sharing] isn’t part of their job…” I would add to this that science is all about learning from each other – if a researcher is opposed to the idea of discussing their datasets, collaborating with others, and generally being a good science citizen, then they should be outed by their community as a poor participant.
People will misinterpret the data
C&A suggest this: “Document how it should be interpreted. Be prepared to help and correct such people; those that misinterpret it by accident will be grateful for the help.” From the UK Data Archive: “Producing good documentation and providing contextual information for your research project should enable other researchers to correctly use and understand your data.”
It’s worth mentioning, however, a second point C&A make: “Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as one can quickly point to the real data on the web to refute the wrong interpretation.”
My data is not very interesting
C&A: “Let others judge how interesting or useful it is — even niche datasets have people that care about them.” I’d also add that it’s impossible to predict whether your dataset will have value for future research. Consider the many datasets collected before “climate change” was a research topic which have now become invaluable for documenting and understanding the phenomenon. From the UK Data Archive: “Who would have thought that amateur gardener’s diaries would one day provide essential data for climate change research?”
I might want to use it in a research paper
Anyone who’s discussed data sharing with a researcher is familiar with this excuse. The operative word here is might. How many papers have we all considered writing, only to have them shift to the back burner due to other obligations? That said, this is a real concern.
C&A suggest the embargo route: “One option is to have an automatic or optional embargo; require people to archive their data at the time of creation but it becomes public after X months. You could even give the option to renew the embargo so only things that are no longer cared about become published, but nothing is lost and eventually everything can become open.” Researchers like to have a say in the use of their datasets, but I would caution that any restrictions should default to sharing. That is, after X months the data are automatically made open by the repository.
I would also add that, as the original collector of the data, you are at a huge advantage compared to others who might want to use your dataset. You have knowledge about your system, the conditions during collection, the nuances of your methods, and so on that could never be fully described in even the best metadata.
I’m not sure I own the data
No doubt, there are a lot of stakeholders involved in data collection: the collector, the PI (if different), the funder, the institution, the publisher, … C&A have the following suggestions:
- Sometimes it’s as easy as just finding out who does own the data
- Sometimes nobody knows who owns the data. This often seems to occur when someone has moved into a post and isn’t aware that they are now the data owner.
- Going up the management chain can help. If you can find someone who clearly has management over the area the dataset belongs to they can either assign an owner or give permission.
- Get someone very senior to appoint someone who can make decisions about apparently “orphaned” data.
My data is too complicated
C&A: “Don’t be too smug. If it turns out it’s not that complicated, it could harm your professional [standing].” I would add that if it’s too complicated to share, then it’s too complicated to reproduce, which means it’s arguably not real scientific progress. This can be solved by more documentation.
My data is embarrassingly bad
C&A: “Many eyes will help you improve your data (e.g. spot inaccuracies)… people will accept your data for what it is.” I agree. All researchers have been on the back end of making the sausage. We know it’s not pretty most of the time, and we can accept that. Plus, knowing your data will be public can motivate you to do a better job of managing and organizing it during your next collection phase.
It’s not a priority and I’m busy
Good news! Funders (in the US and increasingly in the UK) are making it your priority! New sharing mandates in the OSTP memorandum state that any research conducted with federal funds must be made accessible. You can expect these sharing mandates to trickle down to you, the researcher, in the very near future (6-12 months).
This was originally posted on the Data Pub blog from the California Digital Library and is reposted with permission.
Note: This article gives the views of the author, and not the position of the Impact of Social Science blog, nor of the London School of Economics.
Carly Strasser is a data curation specialist at the California Digital Library, part of the University of California system. She has a PhD in Biological Oceanography, which informs her work on helping researchers better manage and share their data. She is involved in the development and implementation of many of the UC Curation Center’s services, including the DMPTool (software to guide researchers in creating a data management plan) and DataUp (an application that helps researchers organize, manage, and archive their tabular data).