Data Analysis, Visualization, and Reproducibility with R Notebooks

This workshop is a Data Carpentry-inspired lesson based on the ‘Data Carpentry: R for Data Analysis and Visualization of Ecological Data’ lesson, maintained by François Michonneau & Auriel Fournier. The most current version of this lesson can be found here.

Parts of this lesson are also derived from Data Science in the tidyverse by Charlotte Wickham, which is licensed under a Creative Commons Attribution 4.0 International License and based on a work at https://github.com/rstudio/master-the-tidyverse.

Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.

This is an introduction to R designed for participants with no programming experience. These lessons will cover an overview of the RStudio interface, basic R syntax, an introduction to the tidyverse package and ggplot, working with files, the structure of data frames, how to deal with factors, how to add/remove rows and columns, and how to calculate summary statistics from a data frame.

What is R? What is RStudio?

The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.

RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

Why learn R?

R does not involve lots of pointing and clicking, and that’s a good thing

The learning curve might be steeper than with other software, but with R, analysis results do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that’s a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again.

Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.

Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.
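As a minimal sketch of what such a script might look like, the lines below summarise the mtcars dataset that ships with R. The dataset is chosen only for illustration and is not part of this lesson's data. If the data change, rerunning the script repeats every step in the same order.

```r
# Use a dataset bundled with R so the script runs anywhere (illustrative only).
cars <- mtcars

# Step 1: keep only the columns needed for the analysis.
cars <- cars[, c("mpg", "cyl")]

# Step 2: mean fuel efficiency (miles per gallon) for each cylinder count.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = mean)

# Step 3: display the result.
print(mpg_by_cyl)
```

Because every step is written down, a colleague can read the script, see exactly what was done, and point out mistakes.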

R code is great for reproducibility

Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis.

R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.

An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.

R is interdisciplinary and extensible

A package is the “fundamental unit of shareable code” and “bundles together code, data, documentation, and tests” (Hadley Wickham, “R Packages”). With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.

R works on data of all shapes and sizes

The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you.

R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.
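The short sketch below illustrates both ideas with made-up values: NA marks a missing measurement, and a factor stores categorical data. The variable names and numbers are invented for this example.

```r
# A numeric vector with one missing measurement, marked as NA.
weights <- c(12.5, NA, 9.8, 11.2)

# By default the missing value propagates, so this mean is NA.
mean(weights)

# na.rm = TRUE drops missing values before computing the mean.
mean_weight <- mean(weights, na.rm = TRUE)

# A factor stores categorical data as a fixed set of levels.
sex <- factor(c("male", "female", "female", "male"))

# Levels are sorted alphabetically by default.
levels(sex)

# Count how many observations fall in each level.
sex_counts <- table(sex)
```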

R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

R produces high-quality graphics

The plotting functionalities in R are extensive, and allow you to adjust almost any aspect of your graph to convey the message from your data most effectively.

R has a large community

Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow.

Not only is R free, but it is also open-source and cross-platform

Anyone can inspect the source code to see how R works. Because of this transparency, mistakes are more likely to be noticed, and if you (or someone else) find a bug, you can report it or fix it.

Chapters

Requirements

Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow. These lessons assume no prior knowledge of the skills or tools, but working through this lesson requires working copies of the software described below. To most effectively use these materials, please make sure to download the data and install everything before working through this lesson.

Data

Data files for the lesson are available and can be downloaded manually here: http://dx.doi.org/10.6084/m9.figshare.1314459

However, we will download them directly from R during the lessons when we need them.
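As a rough sketch of how a download from within R might look, the helper below wraps base R's download.file() so that a file is fetched only once. The function name fetch_if_missing, the data_raw folder, and the URL in the commented example are hypothetical placeholders, not the lesson's actual values.

```r
# Hypothetical helper: download `url` to `destfile` unless the file already exists.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # Create the target folder if it does not exist yet.
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    # mode = "wb" avoids corrupting binary files on Windows.
    download.file(url, destfile, mode = "wb")
  }
  # Return the path invisibly so it can be passed on to a reading function.
  invisible(destfile)
}

# Example call (placeholder URL and path; replace them with the real ones):
# fetch_if_missing("https://example.org/data.csv", "data_raw/data.csv")
```

Skipping the download when the file is already present keeps repeated runs of a script fast and avoids overwriting a local copy.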

Setup instructions

R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio. After installing both programs, you will need to install the tidyverse package from within RStudio. Follow the instructions below for your operating system, and then follow the instructions to install tidyverse.

Windows

If you already have R and RStudio installed

Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio.
To check which version of R you are using, start RStudio; the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it. You can check here for more information on how to remove old versions from your system if you wish to do so.
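For reference, a minimal sketch of reading the version from the console is shown below. It uses only base R; the exact text printed depends on the version you have installed.

```r
# Version string of the running R session.
r_version <- R.version.string
print(r_version)

# sessionInfo() reports the same version, plus the platform and loaded packages.
info <- sessionInfo()
print(info$R.version$version.string)
```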

If you don’t have R and RStudio installed

Download R from the CRAN website.
Run the .exe file that was just downloaded
Go to the RStudio download page
Under Installers select RStudio x.yy.zzz - Windows XP/Vista/7/8 (where x, y, and z represent version numbers)
Double click the file to install it
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

macOS

If you already have R and RStudio installed

Open RStudio, and click on “Help” > “Check for updates”. If a new version is available, quit RStudio, and download the latest version for RStudio.
To check the version of R you are using, start RStudio; the first thing that appears in the console indicates the version of R you are running. Alternatively, you can type sessionInfo(), which will also display which version of R you are running. Go to the CRAN website and check whether a more recent version is available. If so, please download and install it.

If you don’t have R and RStudio installed

Download R from the CRAN website.
Select the .pkg file for the latest R version
Double click on the downloaded file to install R
It is also a good idea to install XQuartz (needed by some packages)
Go to the RStudio download page
Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers)
Double click the file to install RStudio
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

Linux

Follow the instructions for your distribution from CRAN; they provide information to get the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach as the versions provided this way are usually out of date. In any case, make sure you have at least R 3.3.1.
Go to the RStudio download page
Under Installers select the version that matches your distribution, and install it with your preferred method (e.g., with Debian/Ubuntu sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.
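One way to confirm that an installation meets the minimum version mentioned above is sketched below. It relies on base R's getRversion(); the variable names are chosen for this example.

```r
# Minimum R version stated in the instructions above.
minimum_version <- "3.3.1"

# getRversion() returns the running version in a form that compares
# component by component (so "3.10.0" counts as newer than "3.9.0").
version_ok <- getRversion() >= minimum_version

if (version_ok) {
  message("R ", as.character(getRversion()), " meets the minimum requirement.")
} else {
  message("R ", as.character(getRversion()), " is older than ",
          minimum_version, "; please update.")
}
```

Comparing version objects rather than plain strings avoids the mistake of treating "3.10" as older than "3.9".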

For everyone

After installing R and RStudio, you need to install the tidyverse package.

After starting RStudio, at the console type: install.packages(c("tidyverse"))
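If you are unsure whether tidyverse is already installed, a small check like the sketch below can avoid reinstalling it. The helper name missing_packages is made up for this example; it relies on base R's requireNamespace(), which returns FALSE when a package cannot be loaded.

```r
# Return the names in `pkgs` that cannot be loaded from the current library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install tidyverse only when it is absent, then load it.
# (Commented out so that pasting the sketch has no side effects.)
# to_install <- missing_packages("tidyverse")
# if (length(to_install) > 0) install.packages(to_install)
# library(tidyverse)
```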

Contributors

The list of contributors to this lesson is available here.

Data Carpentry, 2017. License. Contributing.

Donna Wrublewski (Instructor)
