A significant shift in how researchers approach their data is needed if transparent and reproducible research practices are to be broadly advanced. Carly Strasser has put together a useful guide to embracing open science, pitched largely at graduate students. But the tips shared will be of interest far beyond the completion of a PhD. If time is spent up front thinking about file organization, sample naming schemes, backup plans, and quality control measures, many hours of heartache can be averted.
I’m guessing that first year graduate students are knee-deep in courses, barely considering their potential thesis project. But for those who can multi-task, I have compiled this list of 10 things that you should undertake in your first year as a grad student. These aren’t just any 10 things… they are 10 steps you can take to make sure you contribute to a culture shift towards open science. Some are big steps, and others are small, but they will all get you (and the rest of your field) one step closer to reproducible, transparent research.
1. Learn to code in some language. Any language.
Here’s the deal: it’s easier to use black-box applications to run your analyses than to create scripts. Everyone knows this. You put in some numbers and out pop your results; you’re ready to write up your paper and get that H-index headed upwards. But this approach will not cut the mustard for much longer in the research world. Researchers need to know how to code. Growing amounts and diversity of data, more interdisciplinary collaborators, and increasing complexity of analyses mean that black-box models, software, and applications can no longer be used in research. The truth is, if you want your research to be reproducible and transparent, you must code. In a 2013 article, “The Big Data Brain Drain: Why Science is in Trouble”, Jake Vanderplas argues that:
In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well.
I learned MATLAB in graduate school, and experimented with R during a postdoc. I wish I’d delved into this world earlier, and had more skills and knowledge about best practices for scientific software. Basically, I wish I had attended a Software Carpentry bootcamp.
The growing number of Software Carpentry (SWC) bootcamps is more evidence that researchers are increasingly aware of the importance of coding and reproducibility. These bootcamps teach researchers the basics of coding, version control, and similar topics, with the potential for customizing the course’s content to the primary discipline of the audience. I’m a big fan of SWC – read more in my blog post on the organization. Check out SWC founder Greg Wilson’s article on some insights from his years of teaching bootcamps: Software Carpentry: Lessons Learned.
2. Stop using Excel. Or at least stop ONLY using Excel.
Most seasoned researchers know that Microsoft Excel can be problematic for data management: there are loads of ways to manipulate, edit, reorder, and change your data without really knowing exactly what you did. In nerd terms, the trail of dataset changes is known as provenance; generally Excel is terrible at documenting provenance. I wrote about this a few years ago on the blog, and we mentioned a few of the more egregious ways people abuse Excel in our F1000Research publication on the DataUp tool. More recently, guest blogger Kara Woo wrote a great post about struggles with dates in Excel.
Of course, everyone uses Excel. In our surveys for the DataUp project, about 88% of the researchers we interviewed used Excel at some point in their research. And we can’t expect folks to stop using it: it’s a great tool! It should, however, be used carefully. For instance, don’t manipulate the sole copy of your raw data in Excel; keep your raw data raw. Use Excel to explore your data, but use other tools to clean and analyze it, such as R, Python, or MATLAB (see #1 above on learning to code). For more help with spreadsheets, see our list of resources and tools: UC3 Spreadsheet Help.
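As a minimal sketch of what “keep your raw data raw” can look like in a script: the raw text is only ever read, the cleaned rows go into a new object, and every change is written to a log so the provenance is not lost. The column names (site, temp_c), the sample values, and the function name clean_rows are all hypothetical, invented for illustration; your own data and cleaning rules will differ.

```python
import csv
import io


def clean_rows(raw_text):
    """Return (cleaned_rows, log) built from CSV text; the input is never modified."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned, log = [], []
    # Data rows start on line 2 of the file; line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        site = row["site"].strip().upper()  # normalise site codes
        temp = row["temp_c"].strip()
        if temp == "":
            # Record what was dropped and why, instead of silently deleting it.
            log.append(f"line {line_no}: dropped, missing temp_c")
            continue
        cleaned.append({"site": site, "temp_c": float(temp)})
    return cleaned, log


# Hypothetical raw data, standing in for a file exported from a spreadsheet.
raw = "site,temp_c\n a1 ,12.5\nb2,\nB2,13.0\n"
rows, log = clean_rows(raw)
print(rows)
print(log)
```

In practice you would read the raw file from disk and write the cleaned rows to a separate file, leaving the original untouched. The point is that the script itself is the record of what was done to the data.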
3. Learn about how to properly care for your data.
You might know more about your data than anyone else, but you aren’t so smart when it comes to stewardship of your data. There are some great guidelines for how best to document, manage, and generally care for your data; I’ve collected some of my favorites here on CiteULike with the tag best_practices. Pick one (or all of them) to read and make sure your data don’t get short shrift.
4. Write a data management plan.
I know, it sounds like the ultimate boring activity for a Friday night. But these three words (data management plan) can make a HUGE difference in the time and energy spent dealing with data during your thesis. Basically, if you spend some time thinking about file organization, sample naming schemes, backup plans, and quality control measures, you can save many hours of heartache later. Creating a data management plan also forces you to better understand best practices related to data (#3 above). Don’t know how to start? Head over to the DMPTool to write a data management plan. It’s free to use, and you can get an idea for the types of things you should consider when embarking on a new project. Most funders require data management plans alongside proposal submissions, so you might as well get the experience now.
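To make the ideas of a naming scheme and a backup check a little more concrete, here is a small sketch. Everything specific in it is an assumption made for illustration: the naming pattern (project, site, ISO date, version), the example filenames, and the helper names follows_scheme and checksum. A real data management plan would define its own conventions.

```python
import hashlib
import re

# Hypothetical naming scheme: <project>_<SITE>_<YYYY-MM-DD>_v<N>.csv
NAME_PATTERN = re.compile(r"^[a-z0-9]+_[A-Z0-9]+_\d{4}-\d{2}-\d{2}_v\d+\.csv$")


def follows_scheme(filename):
    """Return True if the filename matches the agreed naming scheme."""
    return NAME_PATTERN.match(filename) is not None


def checksum(data):
    """Return the SHA-256 hex digest of some bytes, e.g. a file's contents."""
    return hashlib.sha256(data).hexdigest()


# Quality control: flag files that do not follow the scheme.
names = ["larvae_A1_2014-03-15_v1.csv", "final data v2 (new).csv"]
for name in names:
    print(name, follows_scheme(name))

# Backup check: identical contents produce identical checksums.
original = b"site,temp_c\nA1,12.5\n"
backup = b"site,temp_c\nA1,12.5\n"
print(checksum(original) == checksum(backup))
```

Comparing checksums is one simple way to confirm that a backup copy still matches the original; it does not replace keeping more than one backup.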
Image: Ainsley Seago. doi:10.1371/journal.pbio.1001779.g001
5. Read Reinventing Discovery by Michael Nielsen.
Reinventing Discovery: The New Era of Networked Science by Michael Nielsen was published in 2011, and I’ve since heard it referred to as the Bible for Open Science, and the must-read book for anyone interested in engaging in the new era of 4th paradigm research. I’ve only just recently read the book, and wow. I was fist-bumping quite a bit while reading it, which must have made fellow airline passengers wonder what the fuss was about. If they had asked, I would have told them about Nielsen’s stellar explanation of the necessity for and value of openness and transparency in research, the problems with current incentive structures in science, and the steps we should all take towards shifting the culture of research to enable more connectivity and faster progress. Just writing this blog post makes me want to re-read the book.
6. Learn version control.
My blog post, Git/GitHub: a Primer for Researchers, covers much of the importance of version control. Here’s an excerpt:
From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:
The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.
Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Version control systems can make this much easier, but they are not necessarily intuitive for the fledgling coder. It might take a little time (plus attending a Software Carpentry Bootcamp) to understand version control, but it will be well worth your time. As an added bonus, your work can be more reproducible and transparent by using version control. Read Karthik Ram’s great article, Git can facilitate greater reproducibility and increased transparency in science.
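For a feel of what “recording changes to a file” means, here is a short sketch using Python’s standard difflib module to compare two versions of a small data file line by line. This is not how Git works internally and it is no substitute for a real version control system; the two versions and the filenames data_v1.csv and data_v2.csv are made up for illustration.

```python
import difflib

# Two hypothetical versions of the same small data file, as lists of lines.
v1 = ["site,temp_c", "A1,12.5", "B2,13.0"]
v2 = ["site,temp_c", "A1,12.5", "B2,13.4", "C3,11.9"]

# unified_diff yields the lines that were removed (-) and added (+).
diff = list(
    difflib.unified_diff(
        v1, v2, fromfile="data_v1.csv", tofile="data_v2.csv", lineterm=""
    )
)
print("\n".join(diff))
```

Lines starting with a minus sign were in the old version only, and lines starting with a plus sign are new. A version control system keeps this kind of history for you, along with who made each change and why.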
7. Pick a way to communicate your science to the public. Then do it.
You don’t have to have a black belt in Twitter or run a weekly stellar blog to communicate your work. But you should communicate somehow. I have plenty of researcher friends who feel exasperated by the idea that they need to talk to the public about their work. But the truth is, in the US this communication is critical to our research future. My local NPR station recently ran a great piece called Why Scientists are seen as untrustworthy and why it matters. It points out that many (most?) scientists aren’t keen to spend a lot of time engaging with the broader public about their work. However:
…This head-in-the-sand approach would be a big mistake for lots of reasons. One is that public mistrust may eventually translate into less funding and so less science. But the biggest reason is that a mistrust of scientists and science will have profound effects on our future.
Basically, we are avoiding the public at our own peril. Science funding is on the decline, we are facing increasing scrutiny, and it wouldn’t be hyperbole to say that we are at war without even knowing it. Don’t believe me? Read this recent piece in Science (paywall warning): Battle between NSF and House science committee escalates: How did it get this bad?
So start talking. Participate in public lecture series, write a guest blog post, talk about your research to a crotchety relative at Thanksgiving, or write your congressman about the governmental attack on science.
8. Let everyone watch.
Consider going open. That is, do all of your science out in the public eye, so that others can see what you’re up to. One way to do this is by keeping an open notebook. This concept throws out the idea that you should be a hoarder, not telling others of your results until the Big Reveal in the form of a publication. Instead, you keep your lab notebook (you do have one, right?) out in a public place, for anyone to peruse. Most often an open notebook takes the form of a blog or a wiki, and the researcher updates their notebook daily, weekly, or whatever is most appropriate. There are links to data, code, relevant publications, or other content that helps readers, and the researcher themselves, understand the research workflow. Read more in these two blog posts: Open Up and Open Science: What the Fuss is About.
9. Get your ORCID.
ORCID stands for “Open Researcher & Contributor ID”. The ORCID Organization is an open, non-profit group working to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. The endgame is to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors. Basically, researcher identifiers are like social security numbers for scientists. They unambiguously identify you throughout your research life.
Lots of funders, tools, publishers, and universities are buying into the ORCID system. It’s going to make identifying researchers and their outputs much easier. If you have a generic, complicated, compound, or foreign name, you will especially benefit from claiming your ORCID and “stamping” your work with it. It allows you to claim what you’ve done and keeps you from getting mixed up with that weird biochemist who does studies on the effects of bubble gum on pet hamsters. Still not convinced? I wrote a blog post a while back that might help.
10. Publish in OA journals, or make your work OA afterward.
A wonderful post by Michael White, Why I don’t care about open access to research: and why you should, captures this issue well:
It’s hard for me to see why I should care about open access…. My university library can pay for access to all of the scientific journals I could wish for, but that’s not true of many corporate R&D departments, municipal governments, and colleges and schools that are less well-endowed than mine. Scientific knowledge is not just for academic scientists at big research universities.
It’s easy to forget that you are (likely) among the privileged academics. Not all researchers have access to publications, and this is even more true for the general public. Why are we locking our work in the Ivory Tower, allowing for-profit publishers to determine who gets to read our hard-won findings? The Open Access movement is going full throttle these days, as evidenced by increasing media coverage (see “Steal this research paper: you already paid for it” from Mother Jones, or The Guardian’s blog post “University research: if you believe in openness, stand up for it”). So what can you do?
Consider publishing only in open access journals (see the Directory of Open Access Journals). Does this scare you? Are you tied to a disciplinary favorite journal with a high impact factor? Then make your work open access after publishing in a standard journal. Follow my instructions here: Researchers! Make Your Previous Work #OA.
This post originally appeared on Data Pub and is reposted with the author’s permission.
Note: This article gives the views of the author, and not the position of the Impact of Social Science blog, nor of the London School of Economics.
Carly Strasser is Manager of Strategic Partnerships for DataCite. She has a PhD in Biological Oceanography, which informs her work on helping researchers better manage and share their data. At the time of this post, she was a data curation specialist at the California Digital Library, part of the University of California system, and involved in development and implementation of many of the UC Curation Center’s services.
This is a brilliant piece from someone “who has been there”. “… file organization, sample naming schemes, backup plans, and quality control measures …” – this is actually the same all over industry. When doing data migration for large, VERY large ERP projects, you run into the very same problems. One thing I e.g. tell US-Americans always is to use the ISO date format (2015-03-15) not the idiosyncratic “9/11” thingy. I have had world-wide conferences, where half the US, British, Asian participants arrived, to go with the example, on November the ninth, while others came on September 11th. Of course Nov. was cancelled, as on Sept. 12th everyone noticed what had happened to the other half … The same with satellites falling from the skies due to incompatible imperial and metric measures. Even longhand writing is different – US-Americans and British scientists read a hand-written German “one” as a “seven”, while Germans read a US “one” as a … “i” … (maybe they think you are alluding to a “surd”) … It goes on from there and doesn’t get better! Like the German or French scientists who think cutting out the middlemen, i.e. “expensive” publishers is a good idea in open publishing – only, what they never were aware of: these publishers employed knowledgeable correctors to make their French and German “Franglish” or “Denglish” into English. I assume Open Data and Open Publishing will have some rough rides ahead for the next twenty years with many data lost and others cruelly misinterpreted. And we’ll all have a laugh in the history of science two hundred years “down the road”.
Outstanding advice, Carly. Open Science FTW.
Good to find an expert who knows what he’s talking about.
Great Work, there are always very useful tips and information in your articles!
Regarding Thing #1: learning to program may provide great insight into the software you use, but if you choose to use any software that you write yourself, it’s imperative to also learn a bit about software TESTING, lest a bug of your own two hands lead to great professional embarrassment.
Fantastic article. I’d really like to start seeing more mention of open sharing of materials especially via repositories. I don’t think you even mention this in the article. Open data is greatly enhanced by the sharing materials and protocols (like via Addgene or protocols.io).