Open Science

Overview

Teaching: 5 min
Exercises: 5 min
Questions
  • How can version control help me make my work more open?

  • How can I easily preserve and share my code?

Objectives
  • Explain how a version control system can be leveraged as an electronic lab notebook for computational work.

  • Show how to preserve GitHub repos in the CaltechDATA Institutional Repository.

The opposite of “open” isn’t “closed”. The opposite of “open” is “broken”.

— John Wilbanks

Free sharing of information might be the ideal in science, but the reality is often more complicated. Normal practice today looks something like this:

For a growing number of scientists, though, the process looks like this:

This open model accelerates discovery: the more open work is, the more widely it is cited and re-used. However, people who want to work this way need to make some decisions about what exactly “open” means and how to do it. You can find more on the different aspects of Open Science in this book.

This is one of the (many) reasons we teach version control. When used diligently, it answers the “how” question by acting as a shareable electronic lab notebook for computational work:

Code and Data Preservation

GitHub, however, is a commercial service and makes no promises that your code will be available in the future. Anyone who has tried following web links in old publications knows how easily URLs break. To fulfill the promise of open science, you need to deposit your code and data in a trusted repository: a web service that will ensure your files remain available in the future. These repositories provide a DOI, a permanent, registered link that leads to your code or data. You should include this DOI in your publications, because the location it points to can be updated without breaking the link.
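To make the DOI idea concrete, here is a minimal Python sketch that turns a DOI into a clickable link through the public resolver at https://doi.org/. The function name `doi_to_url` and the DOI `10.1234/example-record` are made up for illustration; they are not real CaltechDATA identifiers.

```python
from urllib.parse import quote

# The public DOI resolver; it redirects to wherever the record currently lives.
DOI_RESOLVER = "https://doi.org/"

# Prefixes people commonly paste along with a DOI.
KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a bare or prefixed DOI (hypothetical helper)."""
    cleaned = doi.strip()
    lowered = cleaned.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # A DOI starts with "10." and has a "/" separating prefix and suffix.
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode anything unusual, but keep the "/" separators intact.
    return DOI_RESOLVER + quote(cleaned, safe="/")
```

Calling `doi_to_url("doi:10.1234/example-record")` builds the link you would put in a paper. If the repository later moves the record, the DOI and this link stay the same; only the redirect target changes.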

If your field has a domain-specific repository, this is a good place to store your data. You can find a good listing of subject-specific repositories at PLOS.
A more complete list of repositories is available at re3data.

However, most code and a lot of data don’t fit in an existing repository. This is where institutional repositories like CaltechDATA come in. We’ll use CaltechDATA throughout the rest of this section, but non-Caltech users can use Zenodo, which is similar.

Setting Up Automatic GitHub Preservation

Log in to CaltechDATA by clicking the Login link in the upper right-hand corner of the page. You can click the “Caltech account” link to log in with your Caltech IMSS username and password. Once you’re logged in, click on your profile (the little person icon in the upper right-hand corner of the page) and select GitHub. Click the button and enter your GitHub username and password to connect your CaltechDATA account to GitHub.

You will now see a list of all the repositories you have access to. To turn on preservation for a repository, click the slider to the right of its name. Clicking the slider does not archive anything by itself; it only tells CaltechDATA to watch that repository for new releases.

[Figure: CaltechDATA GitHub interface]

Making Releases

Releases are a good way to organize your development process. They let the scientific community reference a specific version of your code and ensure that everyone is talking about the same thing. There are lots of ways to organize releases, but it’s easiest to make a release every time someone else might be interested in citing your code. This could be when you’re preparing a publication or have finished a new feature.

To make a release, go to GitHub and click the Releases link at the center-right of your main repository page. Then click the “Draft a new release” button on the right-hand side of the screen. You’ll need to provide a version number: use something like v0.1 prior to publication and something like v1.2.1 for changes afterward (see more on versioning). You’ll also need to provide a title and description for your release.
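If you keep several releases around, it helps to treat version numbers as numbers rather than text. The Python sketch below parses tags shaped like the examples above (v0.1, v1.2.1) and picks the newest one. The helper names and the rule "major version 0 means pre-publication" are assumptions for illustration; they are not something GitHub or CaltechDATA enforces.

```python
import re
from typing import Iterable, Tuple

# Matches tags like "v0.1" or "v1.2.1"; the patch number is optional.
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)(?:\.(\d+))?$")


def parse_tag(tag: str) -> Tuple[int, int, int]:
    """Split a tag into (major, minor, patch); a missing patch counts as 0."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_prepublication(tag: str) -> bool:
    """Assumed convention: v0.x releases come before the first publication."""
    return parse_tag(tag)[0] == 0


def latest(tags: Iterable[str]) -> str:
    """Return the newest tag, comparing numerically rather than alphabetically."""
    return max(tags, key=parse_tag)
```

Comparing parsed tuples matters once numbers reach two digits: sorted as plain text, v1.10.0 would come before v1.2.1, while `latest(["v1.2.1", "v1.10.0"])` correctly returns v1.10.0.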

[Figure: Releases]

Once you click Publish, the magic happens! CaltechDATA notices that you have created a new release and automatically saves everything in your repository. It captures your code and files at the point of release, but does not save the full history. The release is automatically assigned a DOI, which leads to a CaltechDATA landing page. There you’ll see a download link for the files, information about authors, the descriptions you wrote in GitHub, and a link back to the GitHub repo. You can edit the metadata for a record by clicking the Edit button. This is useful if you want to include more than one author, ORCID identifiers, or more links.

[Figure: CaltechDATA GitHub landing page]

You also get a badge (found on the GitHub page in CaltechDATA) that you can include in your repository README file on GitHub. The badge will update to the newest DOI when you make additional releases.

[Figure: DOI]

GitHub and Data

GitHub is great for sharing and collaborating on code, but it is not set up for managing scientific data. The maximum upload file size is 100 MB, and the entire repository should be less than 1 GB. An extension to Git, Large File Storage (LFS), supports files up to 2 GB in size, but storing these files on GitHub is fairly expensive ($5 / month for 50 GB of storage). It’s better to store your data files in a disciplinary or institutional data repository. For an overview of uploading data files to a repository, see this AuthorCarpentry lesson.
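Before pushing, you can check whether any file in your working directory would exceed the per-file limit mentioned above. This is a minimal Python sketch: the function name `find_large_files` is made up, and treating 100 MB as 100 × 1024 × 1024 bytes is an assumption about how the limit is measured.

```python
import os

# Per-file limit quoted in the text; the exact byte count is an assumption.
GITHUB_FILE_LIMIT = 100 * 1024 * 1024


def find_large_files(root, limit=GITHUB_FILE_LIMIT):
    """Return (relative_path, size_in_bytes) for files under root larger than limit."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own bookkeeping directory; it is not part of your content.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symbolic links, which may point outside the repo
            size = os.path.getsize(path)
            if size > limit:
                large.append((os.path.relpath(path, root), size))
    # Largest files first, so the worst offenders are at the top.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Running `find_large_files(".")` from the top of a repository lists the files that are better deposited in a data repository than committed to GitHub. An empty list means nothing is over the limit.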

Key Points

  • Open scientific work is more useful and more highly cited than closed.

  • You should preserve and share your GitHub content in a scholarly repository.