Basic data analysis in RStudio

As I bring trainees into the lab, I’ll want them to learn how to do at least some basic data analyses. I realize they may not know exactly where to start, so as I write my own basic scripts for analyzing data relevant to me, I’ll post them here and make sure they’re reasonably well commented to explain what is going on.

OK, the specific backstory here. Back at UW, we had next-day IDT orders. It became very clear that was not going to be the case at CWRU, especially after talking to the IDT rep (who did not seem to really care about having a more thriving business here). So, I priced out my options, and Thermo ended up handily winning the oligo price battle ($0.12 a nucleotide, which is a slight improvement over the $0.15 we seemed to be paying at UW *Shrug*). Thermo also charges nothing for shipping (another bonus), with the downside being that they only deliver on Tuesdays and Thursdays. I wanted to figure out how long it takes to receive primers after ordering them, so I’ve been keeping track of when I made each order and how long it took to arrive. Now that I have a decent number of data points, I decided to start analyzing them to figure out if there were any patterns emerging.

Here’s a link to the data. Here’s a link to the R Markdown file. You’ll want both the data and the R Markdown file in the same directory. I’m also copy-pasting the code below, for ease:

Primer_wait_analysis

This is the first chunk, where I’m setting up my workspace by 1) Starting the workspace fresh 2) Importing the packages I’ll need 3) Importing the data I’ll need

# I always like to start by clearing the memory of existing variables / dataframes / etc.

rm(list = ls())

#Next, let's import the packages we'll need.
#1) Readxl, so that I can import the excel spreadsheet with my data
#2) Tidyverse, for doing the data wrangling (dplyr) and then for plotting the data (ggplot2)

library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Lastly, use readxl to import the data I want

primer_data <- read_excel("How_long_it_takes_to_receive_items_after_ordering.xlsx")
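
If you don't have the spreadsheet handy, here is a minimal stand-in with the three columns the later chunks rely on (day_of_week, days_after_ordering, and holiday). The values below are made up purely for illustration, and the real file may well contain additional columns, so treat this as a sketch of the expected shape rather than the actual data. The name toy_primer_data is hypothetical.

```r
# Hypothetical stand-in for the spreadsheet: the column names match those used
# later in this post, but every value here is invented for illustration.
toy_primer_data <- data.frame(
  day_of_week         = c("M", "M", "T", "W", "R", "F", "Sa", "Su"),
  days_after_ordering = c(1, 3, 2, 1, 5, 4, 3, 2),
  holiday             = c("no", "no", "no", "no", "no", "yes", "no", "no"),
  stringsAsFactors    = FALSE
)

# Quick look at the column types and the first few rows
str(toy_primer_data)
head(toy_primer_data)
```

Running str() on the real primer_data right after importing it is a quick way to check that the column names and types are what the rest of the script expects.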

The idea is to make a bar-graph (or equivalent) showing how many days it takes for the primers to arrive depending on what day I order the primers. I first have to take the raw data and do some wrangling to get it in the format I need for plotting.

# I care about how long it took me to receive primers in a *normal* week, and not about what I encountered during the holidays. Thus, I'm going to filter out the holiday datapoints.

primer_data_normal <- primer_data %>% filter(holiday == "no")

# Next, I want to group data-points based on the day of the week

primer_data_grouped <- primer_data_normal %>%
  group_by(day_of_week) %>%
  summarize(average_duration = mean(days_after_ordering),
            standard_deviation = sd(days_after_ordering),
            n = n())

# Now let's set the "day_of_week" factor to actually follow the days of the week

primer_data_grouped$day_of_week <- factor(primer_data_grouped$day_of_week, levels = c("M","T","W","R","F","Sa","Su"))

# Since I'll want standard error rather than standard deviation, let's get the standard error
primer_data_grouped$standard_error <- primer_data_grouped$standard_deviation / sqrt(primer_data_grouped$n)
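
The wrangling steps above can also be written as a single pipeline, with the standard error (the standard deviation divided by the square root of n) computed directly inside summarize(). This is just a sketch on made-up numbers: toy_orders and toy_summary are hypothetical names, the values are invented, and only the column names are borrowed from the real data.

```r
library(dplyr)

# Made-up example rows (not the real primer data)
toy_orders <- data.frame(
  day_of_week         = c("M", "M", "M", "R", "R", "R", "F"),
  days_after_ordering = c(1, 1, 3, 5, 4, 6, 4),
  holiday             = c("no", "no", "no", "no", "no", "no", "yes"),
  stringsAsFactors    = FALSE
)

toy_summary <- toy_orders %>%
  # Drop the holiday orders
  filter(holiday == "no") %>%
  # Order the days Monday through Sunday instead of alphabetically
  mutate(day_of_week = factor(day_of_week,
                              levels = c("M", "T", "W", "R", "F", "Sa", "Su"))) %>%
  group_by(day_of_week) %>%
  summarize(
    average_duration   = mean(days_after_ordering),
    standard_deviation = sd(days_after_ordering),
    n                  = n(),
    # Standard error of the mean = sd / sqrt(n)
    standard_error     = standard_deviation / sqrt(n)
  )

toy_summary
```

Note that any day with only a single order has an undefined standard deviation (sd() returns NA for one value), so its standard error is NA as well.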

Now that the data is ready, time to plot it in ggplot2.

Primer_plot <- ggplot(data = primer_data_grouped, aes(x = day_of_week)) +
  geom_point(aes(y = average_duration)) +
  geom_errorbar(aes(ymin = average_duration - standard_error,
                    ymax = average_duration + standard_error), width = 0.5) +
  # Print the sample size for each day just above the x-axis
  geom_text(aes(y = 0.2, label = paste0("n=", n))) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(limits = c(0, 7), expand = c(0, 0.01)) +
  theme_bw() +
  xlab("Day primer was ordered") + ylab("Days until primer arrived") +
  theme(panel.grid.major.x = element_blank())

ggsave(filename = "Primer_plot.pdf", plot = Primer_plot, height = 4, width = 6)
Primer_plot

Hmm, probably should have made that figure less tall. Oh well.

So the n values are still rather small, but it looks like I’ll get my primers soonest if I order on Monday or maybe Tuesday. In contrast, ordering on Thursday, Friday, or Saturday gives me the longest wait (though, well, some of that wait is weekend days, which don’t matter as much). Thus, if I have any projects coming up where I have to design primers, it’s probably worth taking the time to do that on a Sunday or Monday night instead of late in the workweek.

UPDATE 2/22/2020: Uhhh, so the plot above was my first draft attempt, and I have since homed in on the right representation of the data:

And that’s because average numbers are only somewhat meaningful, and the distribution of frequencies is much more relevant / accurate. Here’s a link to the updated script for generating the plot with the data file at this other link.
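
To illustrate what plotting the distribution rather than the average might look like, here is a rough sketch that counts how often each wait time occurred for each order day and draws the counts as labeled tiles. This is not the updated script linked above: the data values, the object names (toy_orders, wait_counts, distribution_plot), and the choice of geom_tile() are all assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up example rows (not the real primer data)
toy_orders <- data.frame(
  day_of_week         = c("M", "M", "M", "T", "R", "R", "R", "F", "F"),
  days_after_ordering = c(1, 1, 3, 2, 5, 5, 4, 4, 6),
  stringsAsFactors    = FALSE
)

# Count how many orders fall into each (order day, wait time) combination
wait_counts <- toy_orders %>%
  mutate(day_of_week = factor(day_of_week,
                              levels = c("M", "T", "W", "R", "F", "Sa", "Su"))) %>%
  count(day_of_week, days_after_ordering, name = "orders")

# One tile per combination, shaded and labeled by the number of orders
distribution_plot <- ggplot(wait_counts,
                            aes(x = day_of_week, y = days_after_ordering)) +
  geom_tile(aes(fill = orders), color = "white") +
  geom_text(aes(label = orders)) +
  scale_x_discrete(drop = FALSE) +  # keep days with no orders on the axis
  scale_y_continuous(breaks = 0:7) +
  theme_bw() +
  xlab("Day primer was ordered") + ylab("Days until primer arrived")

distribution_plot
```

A view like this shows every observed wait time, so a day with one very slow order and several fast ones is no longer hidden behind a single mean and error bar.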