Last month, Kenny traveled back to Seattle to give the opening talk in the workshop portion of the Mutational Scanning Symposium 2020. The organizers recorded the talks and put them up on a YouTube channel. It looks like it could be a great resource for anyone wanting to get into this field. You can find all of the talks here, including Kenny’s talk.
Here’s a Python script that converts the MatreyekLab primer Google Sheet into a csv file that is easily imported into Benchling.
1) Go to the lab primer inventory google sheet -> “https://docs.google.com/spreadsheets/d/15RDWrPxZXN34KhymHYkKeOglDYzfkhz5-x2PMkPo0/edit?usp=sharing”
2) Go to file -> download -> Microsoft Excel (.xlsx)
3) Then take above file (MatreyekLab_Primer_Inventory.xlsx) and put it in the same directory as the Google_sheet_to_benchling.py file.
4) Open terminal, go to the right directory, and then enter:
5) It should make a new file called “Matreyeklab_primers_benchling.csv”. The text in this file can be copy-pasted into benchling and imported into the “Primer” folder.
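The Google_sheet_to_benchling.py script itself isn't reproduced in this post. As a rough sketch of the kind of conversion it performs, the snippet below turns already-loaded spreadsheet rows into CSV text with one "name,sequence" line per primer. The column names ("Primer name", "Sequence"), the example primer, and the two-column output layout are all assumptions made for illustration, not the actual sheet or Benchling format. Reading the .xlsx file itself would need a third-party package such as openpyxl or pandas, which is left out here.

```python
import csv
import io

# Hypothetical column names; the real sheet may label these differently.
SHEET_NAME_COLUMN = "Primer name"
SHEET_SEQUENCE_COLUMN = "Sequence"


def primers_to_csv_text(rows):
    """Turn sheet rows (a list of dicts) into CSV text, one 'name,sequence' line per primer."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Treat missing or empty cells as blank strings
        name = str(row.get(SHEET_NAME_COLUMN) or "").strip()
        raw_sequence = str(row.get(SHEET_SEQUENCE_COLUMN) or "")
        # Remove any whitespace inside the sequence and upper-case the bases
        sequence = "".join(raw_sequence.split()).upper()
        if name and sequence:  # skip blank or incomplete rows
            writer.writerow([name, sequence])
    return buffer.getvalue()


# Made-up example rows, not real primers from the inventory
example_rows = [
    {"Primer name": "Example_primer_1", "Sequence": "acgt acgt"},
    {"Primer name": "", "Sequence": ""},
]
print(primers_to_csv_text(example_rows))
```

Skipping incomplete rows keeps empty spreadsheet lines from turning into nameless oligos on import.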
Uploading the list to Benchling
6) Next, log onto Benchling, go into the “MatreyekLab” project and into the “O_Primers” folder. Make a new folder named with the date (e.g. “20200130” for January 30th, 2020). Once in the folder, select “Import Oligos”, and copy-paste the contents of the csv file (opened using a basic text editor like “TextEdit” or “Sublime Text 3”).
7) You’ll have to scroll down all the way to the bottom, but then you can hit insert. Benchling is nice enough to let you get back to work as it uploads.
Using the new primer list
8) Once it finishes uploading, you can go to whatever plasmid map you want to annotate with our current primers. Go to the right-hand side, two icons down to “Primers”. Hit “Attach existing”, add the new folder as the new location, and hit “Find binding sites”. Select all of the primers (top check box), and then hit the “Attach Selected Primers” button in the top right.
9) Now click on the sequence map tab and voila, you can see the plasmid map now annotated. Find the primer you want (sequencing or otherwise) and go do some science.
So as I bring in trainees into the lab, I’ll want them to learn how to do some (at the very least) basic data analyses. I realize they may not know exactly where to start, so as I go about making my own basic analysis scripts for analyzing data relevant to me, I’ll post them here and make sure they’re reasonably well commented to explain what is going on.
OK, the specific backstory here. Back at UW, we had next-day IDT orders. It became very clear that was not going to be the case at CWRU, especially after talking to the IDT rep (who did not seem to really care about having a more thriving business here). So, I priced out my options, and Thermo ended up handily winning the oligo price battle ($0.12 a nucleotide, which is a slight improvement over the $0.15 we seemed to be paying at UW *Shrug*). Thermo also has no shipping cost (another bonus), with the downside being that they only deliver on Tuesdays and Thursdays. I wanted to figure out how long it takes to receive primers after ordering them, so I’ve been keeping track of when I made each order, and how long it took to arrive. Now that I have a decent number of data points, I decided to start analyzing them to figure out if there were any patterns emerging.
This is the first chunk, where I’m setting up my workspace by 1) Starting the workspace fresh 2) Importing the packages I’ll need 3) Importing the data I’ll need
# I always like to start by clearing the memory of existing variables / dataframes / etc.
rm(list = ls())

# Next, let's import the packages we'll need.
# 1) readxl, so that I can import the Excel spreadsheet with my data
# 2) tidyverse, for doing the data wrangling (dplyr) and then for plotting the data (ggplot2)
library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Lastly, use readxl to import the data I want
primer_data <- read_excel("How_long_it_takes_to_receive_items_after_ordering.xlsx")
The idea is to make a bar-graph (or equivalent) showing how many days it takes for the primers to arrive depending on what day I order the primers. I first have to take the raw data and do some wrangling to get it in the format I need for plotting.
# I care about how long it took me to receive primers in a *normal* week, and not about what I encountered during the holidays.
# Thus, I'm going to filter out the holiday datapoints.
primer_data_normal <- primer_data %>% filter(holiday == "no")

# Next, I want to group data-points based on the day of the week
primer_data_grouped <- primer_data_normal %>%
  group_by(day_of_week) %>%
  summarize(average_duration = mean(days_after_ordering),
            standard_deviation = sd(days_after_ordering),
            n = n())

# Now let's set the "day_of_week" factor to actually follow the days of the week
primer_data_grouped$day_of_week <- factor(primer_data_grouped$day_of_week, levels = c("M","T","W","R","F","Sa","Su"))

# Since I'll want standard error rather than standard deviation, let's get the standard error
primer_data_grouped$standard_error <- primer_data_grouped$standard_deviation / sqrt(primer_data_grouped$n)
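For anyone more at home in Python than R, the same filter / group / summarize logic can be sketched with just the standard library. The records below are made-up toy values, not the actual primer data, and the function name is hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Toy records: (day ordered, days until arrival, holiday flag). NOT the real data.
toy_orders = [
    ("M", 2, "no"), ("M", 3, "no"),
    ("R", 5, "no"), ("R", 6, "no"), ("R", 4, "no"),
    ("F", 9, "yes"),
]


def summarize_by_day(records):
    """Drop holiday orders, then compute mean, SD, standard error, and n per weekday."""
    grouped = defaultdict(list)
    for day, days_after_ordering, holiday in records:
        if holiday == "no":
            grouped[day].append(days_after_ordering)

    summary = {}
    for day, durations in grouped.items():
        n = len(durations)
        # The sample SD is undefined for a single observation
        sd = stdev(durations) if n > 1 else float("nan")
        summary[day] = {"mean": mean(durations), "sd": sd, "se": sd / sqrt(n), "n": n}
    return summary


for day, stats in summarize_by_day(toy_orders).items():
    print(day, stats)
```

As in the R version, the standard error is just the standard deviation divided by the square root of the number of orders in that group.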
Now that the data is ready, time to plot it in ggplot2.
Primer_plot <- ggplot() +
  geom_point(data = primer_data_grouped, aes(x = day_of_week, y = average_duration)) +
  geom_errorbar(data = primer_data_grouped,
                aes(x = day_of_week, ymin = average_duration - standard_error, ymax = average_duration + standard_error),
                width = 0.5) +
  geom_text(data = primer_data_grouped, aes(x = day_of_week, y = 0.2, label = paste("n=", n))) +
  geom_hline(yintercept = 0) +
  scale_y_continuous(limits = c(0,7), expand = c(0,0.01)) +
  theme_bw() +
  xlab("Day primer was ordered") +
  ylab("Days until primer arrived") +
  theme(panel.grid.major.x = element_blank())
ggsave(filename = "Primer_plot.pdf", Primer_plot, height = 4, width = 6)
Primer_plot
Hmm, probably should have made that figure less tall. Oh well.
So the n values are still rather small, but it looks like I’ll get my primers soonest if I order on Monday or maybe Tuesday. In contrast, ordering on Thursday, Friday, or Saturday gives me the longest wait (though, well, some of those days are weekend days, which don’t matter as much). Thus, if I have any projects coming up where I have to design primers, it’s probably worth taking the time to do that on a Sunday or Monday night instead of late in the workweek.
UPDATE 2/22/2020: Uhhh, so the plot above was my first draft attempt, and I have since homed in on the right representation of the data:
And that’s because average numbers are only somewhat meaningful, and the distribution of frequencies is much more relevant / accurate. Here’s a link to the updated script for generating the plot with the data file at this other link.
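The updated script lives behind the link, so it isn't shown here. As a minimal sketch of what tabulating a distribution of frequencies (rather than an average) can look like, the snippet below counts how often each wait time occurred for each ordering day. The records are toy values and the function name is hypothetical.

```python
from collections import Counter

# Toy records: (day ordered, days until arrival). NOT the real data.
toy_waits = [("M", 2), ("M", 2), ("M", 3), ("R", 5), ("R", 6)]


def duration_frequencies(records):
    """Return, for each (day, wait) pair, the fraction of that day's orders with that wait."""
    pair_counts = Counter(records)
    day_totals = Counter(day for day, _ in records)
    return {
        (day, wait): count / day_totals[day]
        for (day, wait), count in pair_counts.items()
    }


for (day, wait), fraction in sorted(duration_frequencies(toy_waits).items()):
    print(f"{day}: {wait} days -> {fraction:.2f}")
```

A table like this shows whether a given day has, say, two distinct typical wait times, which a single mean with error bars would blur together.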
Being able to write short scripts that help you do what you need to do is a very empowering ability for modern-day biologists like me.
In the process of thinking about planning a new plasmid that encodes two proteins fused together by a flexible linker, I realized that I should just write a short script that, when run, gives me a random set of glycine and serine residues composing that linker (riffing off of the ol’ GGS-repeat linkers that seem to prevail in synthetic molecular biology). I’m also in the spirit of documenting the things I’ve learned so that future trainees could have them as reference for their own learning. So here goes my attempt at explaining how I’m approaching this:
First is thinking about the scope of the script you’re trying to put together. This should be an extremely simple one, where I’ll have a list of all of the glycine and serine codons, and a series of random numbers will determine which of those codons to use. Maybe I’ll also include in this script a user-input feature so that you can tell the script how many codons it should be randomly stringing together. Since it’s so simple, it shouldn’t require loading too many custom libraries / packages for performing more advanced procedures.
In the spirit of good practice, I’ll try to do this in iPython as well, though I’ll make a simple python script at the end for quick running from the command line. OK, here goes:
Note: I posted the final code at my Github page (link at the bottom). But for people wanting to just follow along here, here is the code in its final form at the outset (so you can see how I went about building it):
from random import randint

codons = ["GGT","GGC","GGA","GGG","TCT","TCC","TCA","TCG","AGT","AGC"]
length = input("What amino acid length flexible linker would you like a nucleotide sequence for?")
linker_sequence = ""
length = int(length)
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    linker_sequence = linker_sequence + new_codon
    x = x + 1
print(linker_sequence)
1) Had to remind myself of this, but first you type in “jupyter notebook” to open the interactive web-browser interface (assuming you’ve already installed it).
2) Go to the right directory, and then make a new iPython3 notebook.
3) Let’s start simple and make the random codon generator. First, let’s make a list of the codons we want to include.
codons = ["GGT","GGC","GGA","GGG","TCT","TCC","TCA","TCG","AGT","AGC"]
4) Next, let’s figure out how to choose a random index in the list, so that a random codon is chosen. I don’t remember how to do this off-hand, so I had to google it until I got to this page.
5) OK, so I guess I do have to load a package. I’m now writing that at the top of the script.
from random import randint
randint(0,9)
It gave the correct output so I saw that it worked and commented out the randint call.
# randint(0,9) #This line won't run
6) Next is having the script randomly pull out a glycine or serine codon. This is simply done by now including:
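The snippet for this step doesn't appear on this page. Judging from the final script above, it was presumably along the lines of the first option below: use the random integer as an index into the codon list. The second option, random.choice, is a standard-library alternative that was not part of the original script and is shown only for comparison.

```python
from random import choice, randint

codons = ["GGT", "GGC", "GGA", "GGG", "TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]

# Option 1: pick a random index. randint includes both endpoints,
# so the upper bound must be the last valid index (9 for ten codons).
new_codon = codons[randint(0, len(codons) - 1)]

# Option 2 (not in the original script): let the random module pick an element directly.
another_codon = choice(codons)

print(new_codon, another_codon)
```

Writing the upper bound as len(codons) - 1 rather than a hard-coded 9 means the line keeps working if codons are ever added to or removed from the list.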
7) Cool. All that has worked so far. Let’s now have the script take in a user-input for the number of residues so we can repeat this process and spit out a linker sequence of desired length. I don’t remember how to ask for user input in a python script, so I had to google this as well.
8) All right, so I need to use the “raw_input” function and assign its result to a variable. Well, I tried that and it said “raw_input” wasn’t defined. I googled that, and found this link saying that advice was deprecated, and that it is now just “input()”.
9) Thus, I typed in:
length = input("What amino acid length flexible linker would you like a nucleotide sequence for?")
And this asked me for a number like I had hoped. Great.
10) To make this process iterable, I went for my trusty “for” loop. Actually, I tried to make a “while” loop first, but realized I didn’t know how to make it work off the top of my head. So I just ended up making a “for” loop and putting an x = x + 1 statement at the end to effectively turn it into a “while” loop. Personally, I think the ends justify the means, and going with what you know works well is a valid option for most basic scripts, when efficiency isn’t a huge priority.
length = 3 #Giving an arbitrary number for testing the script
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    x = x + 1
11) I then had to be able to keep track of the codons that were pulled during each iteration of the loop. I thus created an empty variable called “linker_sequence” and just added the string for the new codon at the back of “linker_sequence” during each iteration.
linker_sequence = "" # A blank variable to keep track of things
length = int(length) # To convert the input text into a number
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    linker_sequence = linker_sequence + new_codon
    x = x + 1
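For reference, here is a sketch of what the “while” version of this loop from step 10 could look like. One thing worth knowing: in a Python “for” loop over range(), the loop variable is reassigned from the range on every pass, so the x = x + 1 line there is harmless but has no effect on how many times the loop runs. In a “while” loop, by contrast, that counter update is what makes the loop stop. The function name build_linker is made up for this example.

```python
from random import randint

codons = ["GGT", "GGC", "GGA", "GGG", "TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]


def build_linker(length):
    """Return `length` random Gly/Ser codons joined into one string, using a while loop."""
    linker_sequence = ""
    x = 0
    while x < length:
        linker_sequence = linker_sequence + codons[randint(0, len(codons) - 1)]
        x = x + 1  # required here: without it the loop would never end
    return linker_sequence


print(build_linker(3))
```

Both versions produce the same kind of output, so sticking with the “for” loop is a perfectly reasonable choice.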
12) Lastly is putting a final print function so it returns the desired string of nucleotides to the user.
Nice, I think that does it for the script. Super easy and simple!
13) Finally, let’s test the script. I typed in 5 amino acids, and it gave me “TCTAGTTCAGGCTCT” as the output string. I google searched “Transeq” to get to the EBI site for a simple codon translator, and translation of the nucleotide sequence above gave “SSSGS” as the protein sequence. So great, it worked! Sure, a little serine heavy, but that’s random chance for you. Should still be perfectly fine as a linker, regardless. Now to finish planning this plasmid…
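The same sanity check can also be done without leaving Python. This is a small sketch rather than a general translator: the lookup table only covers the ten glycine and serine codons used by the script, and the function name is made up for this example.

```python
# Standard genetic code entries for the ten codons in the linker script
GLY_SER_CODONS = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "AGT": "S", "AGC": "S",
}


def translate_linker(nucleotides):
    """Translate a Gly/Ser linker sequence, three bases at a time."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of three")
    return "".join(
        GLY_SER_CODONS[nucleotides[i:i + 3]]
        for i in range(0, len(nucleotides), 3)
    )


print(translate_linker("TCTAGTTCAGGCTCT"))  # SSSGS
```

Any codon outside the table raises a KeyError, which doubles as a check that the linker contains only glycine and serine codons.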
Note: I’ve posted both the iPython notebook file (.ipynb) and a simple python script (.py) on my Github page. Feel free to use them!
Installing Enrich2 / using a Conda environment
I have a couple of new Macs in the lab that need Enrich2 installed. The goal today is to go through the steps of making a Conda environment specifically for running Enrich2 (and installing Enrich2), developed by Alan Rubin.
Alan’s instructions for doing this are pretty good, if I remember correctly. But Alan is a seasoned programmer / computational scientist, while people like me are novices and far less familiar with these steps. Furthermore, depending on the specifics of your computer / system, you may get different errors in the installation process. Thus, here’s my interpretation of this process for the benefit of others like me.
0) This supposes that you have already installed an updated version of Anaconda on your Mac. Go back and do this now, if you haven’t done so already.
1) Download the Anaconda Enrich2 environment file and put it somewhere you can access using Terminal.
2A) Update to the newest version of Anaconda, just in case. Hit “y” if prompted.
conda update conda
2B) OK, so next was installing the right version of pandas that Enrich2 needs. To do this, I first went into Anaconda Navigator and made a new environment called python2, that uses Python 2. Then in Terminal, I called:
conda activate python2
2C) This activated python2, so that the prompt no longer said “(base)”, and instead said “(python2)”. Cool. Environment activated. I then installed pandas 0.19.2 by typing in:
pip install pandas==0.19.2
2D) OK, this seemed to work too. I got an error when trying to re-run the “enrich2_env.yml” file, so I removed the “=0.19” part from the .yml file and ran “conda env create -f enrich2_env.yml”. This actually seemed to work, giving me a list of packages being extracted. Now to actually test it out.
conda activate enrich2
3) Cool, that worked, and the terminal prompt now says “(enrich2)”. Now we’re in business. Next is actually installing Enrich2 now that we’re in the environment. Go to the Enrich2 repository, download the file, unzip it, and then move to its directory in Terminal. Then run:
python setup.py install
That seemed to work since it didn’t throw any errors.
4) Next is actually trying to run the application. Type in the following:
Oh Jesus christ. It literally crashes the Finder due to the following errors:
CGSTrackingRegionSetIsEnabled returned CG error 268435459
CGSTrackingRegionSetIsEnabled returned CG error 268435459
CGSTrackingRegionSetIsEnabled returned CG error 268435459
HIToolbox: received notification of WindowServer event port death.
port matched the WindowServer port created in BindCGSToRunLoop
Looks like this may have been a problem with the MacOS operating system. Updating my OS to Catalina and then trying again.
5) OK, MacOS has been updated to Catalina. Now let’s try running Enrich2 again. (You’ll likely want to run this command after you’ve navigated to the directory with your raw files to simplify the file locating process).
Awesome. It worked!
PS. I followed these instructions for my second Mac and it worked like a charm. I even went straight for the Catalina update early on and didn’t run into the error in step 4.
Part of my goal for this holiday break is to work on an exploratory research grant proposal for a high-throughput investigation studying how protein coding variants in inflammasome components lead to various autoinflammatory diseases. I heard there were supposed to be some cool-looking videos of ASC speck formation (like this video from Kuri et al, 2017, J Cell Biol), so I did a google search for such videos. This led me to some videos at JoVE, the Journal of Visualized Experiments. CWRU has institutional access to tons of journals including JoVE, but always having to log in to watch the video is kind of clunky, so I wanted to be able to download the relevant videos. Thus, I just looked under the hood at the html used to organize the webpage, and found where the video lived so I could download it. Here are some instructions for doing just that:
1) Using Google Chrome (though I’m sure other browsers like Firefox should do this as well), log into your institutional access service to get to the login-protected JoVE page with the full video.
2) Right click on the area with the video and hit “inspect”.
3) In the top inspector pane, go to the area that says something along the lines of…:
<video class="fp-engine" playsinline="" webkit-playsinline="" preload="none" autoplay="" crossorigin="anonymous" src="blob:https://www.jove.com/d4697446-c6f4-4902-ab6b-37580284d671" style="display: block;"><source type="video/mp4" src="https://cloudflare2.jove.com/CDNSource/protected/57463_Fink_051418_P_Web.mp4?verify=1576994641-Y2wt%2BLiPU3Iw9mTSO%2BNHPlX%2BjGEwXuULht5jH92%2FuzY%3D"><track kind="subtitles" label="English" srclang="en" src="/files/vtt/57463/57463.vtt" id="en-English"></video>
… and click on the triangle to open up that section and display all of the sub-sections of it.
4) The first subsection should say something like …:
<source type="video/mp4" src="https://cloudflare2.jove.com/CDNSource/protected/57463_Fink_051418_P_Web.mp4?verify=1576994641-Y2wt%2BLiPU3Iw9mTSO%2BNHPlX%2BjGEwXuULht5jH92%2FuzY%3D">
… and right click on the “src” link and open in a new window.
5) Right click again and download the video file to your hard drive. It will likely be a .mp4 file format. Now you can rewatch it without having to be logged into the JoVE website.
PS-1. Once you’re at the first part of step 3, you can just look in the “src” section, copy the text starting at “https://…” up through “…mp4” (leaving out the “?verify…” part), paste that into a new window, and skip to step 5. Though I suppose this is actually the same amount of effort as actually doing step 4.
PS-2. Yes, I find it kind of funny that I just made a tutorial of saving a video from a visual tutorial.
PS-3. For the record, I’m not supporting / condoning bypassing the gatekeeping code this journal has for accessing the full content. I’m mostly just trying to streamline science so people can get more / better work done without impediments. As far as I can tell, you still do need institutional access to be able to access the full file (doing the above steps at the non-logged in site only links you to the “teaser” video).
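For what it’s worth, the trimming described in PS-1 (keeping everything up through “.mp4” and dropping the “?verify…” part) is a one-liner with Python’s standard urllib.parse module. This is only a sketch of the string handling; it does not download anything, and the function name strip_query is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# The "src" value from the <source> tag shown in step 4
src = (
    "https://cloudflare2.jove.com/CDNSource/protected/57463_Fink_051418_P_Web.mp4"
    "?verify=1576994641-Y2wt%2BLiPU3Iw9mTSO%2BNHPlX%2BjGEwXuULht5jH92%2FuzY%3D"
)
print(strip_query(src))
```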
TL;DR: Statistics is everywhere, and simulating bottlenecks that happen during routine lab procedures such as dilutions of cells can potentially help you increase reproducibility, and at the least, help you better conceptualize what is happening with each step of an experiment.
I’m still working on getting a cell counter for the lab. In the meantime, we’ve been using an old school hemacytometer to count cells before an experiment. Sarah had used a hemacytometer more recently than me, and knew to dilute the cells from a T75 flask 10-fold to get them into a countable range for the hemacytometer. She said she had performed the dilution by putting 10 ul cells in 90 ul media (and then putting 10 ul of the dilution into the hemacytometer). But as she said this, she asked whether it was OK to perform the dilution as described; a grad student in her previous lab had taught her to do it that way, but a postdoc there said it was a bad idea. My immediate response was that if the cells are sufficiently mixed, then it should be fine. And while that was my gut reaction, I realized that it was something I could simulate and answer myself using available data. Would the accuracy of the count be increased if we diluted 100 ul of cells into 900ul of media, or 1ml of cells into 9ml of media?
Here are the methods (skip if you don’t want to dive in and want to save yourself a paragraph of reading): To me, it would seem the answer to whether the dilution matters depends on how the cells are dispersed in the media / how variable the count is when the same volume is sampled numerous times. Sarah’s standard practice is to count four squares of the hemacytometer, so I had four replicate counts for each volume pipetted. She had repeated this process three times by the time I had performed the analysis, giving me a reasonable dataset I could roll with. I got the mean and standard deviations for each of the three instances, all corresponding to a volume of 0.1 ul. They were all quite similar, so I created a hypothetical normal distribution from the average mean and standard deviation. Next was seeing how different ways of performing the same dilution impacted the accuracy of individual readings. I recreated the 10 ul cells by sampling from this distribution 100 times, 100 ul cells by sampling 1,000 times, and 1 ml by sampling 10,000 times, and taking the mean. I repeated this process 5,000 times for each condition, and looked at how wide each distribution was.
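The linked code is the real analysis; what follows is only a stripped-down sketch of the sampling idea described above, so the logic is easier to follow. The mean count of 27.3 comes from later in this post, while the standard deviation of 4.0 is a placeholder assumption (the measured value isn’t given here). The number of repeats is cut from 5,000 to 200 to keep the runtime short in plain Python.

```python
import random
from statistics import stdev

MEAN_COUNT = 27.3  # cells per 0.1 ul square, the value the distributions converge to
SD_COUNT = 4.0     # ASSUMED standard deviation, for illustration only
REPEATS = 200      # the post used 5,000; reduced here for speed

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def simulate_dilution_means(draws_per_dilution, repeats=REPEATS):
    """Each repeat pools `draws_per_dilution` 0.1 ul samples and records their mean count."""
    means = []
    for _ in range(repeats):
        total = sum(rng.gauss(MEAN_COUNT, SD_COUNT) for _ in range(draws_per_dilution))
        means.append(total / draws_per_dilution)
    return means


# 10 ul = 100 draws of 0.1 ul, 100 ul = 1,000 draws, 1 ml = 10,000 draws
conditions = {"10 ul": 100, "100 ul": 1_000, "1 ml": 10_000}
spreads = {label: stdev(simulate_dilution_means(n)) for label, n in conditions.items()}

for label, spread in spreads.items():
    print(f"{label}: SD of simulated mean counts = {spread:.3f}")
```

The spread of the simulated means shrinks as the pipetted volume grows, which is the widening and narrowing of the distributions described above.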
I then turned the counts into concentration (cells / ml):
Instead of stopping there, I thought about the number of cells I was actually trying to plate, which was 250,000. The number the distributions were converging to was ~ 27.3 (black line), so I used that as the “truth”, and saw how many “true” cells would be plated if I had determined the volume needed to be plated based on each of the repeat concentrations calculated by each of the conditions of dilutions. The resulting plot looked like this:
So as you can tell based on the plot, there are slight differences in cells plated depending on the imprecision propagated by the manner in which the same 10-fold dilution was performed: while all distributions are centered around 250k, the 10 ul dilution distribution was quite wide, while the 1 ml in 9 ml dilution resulted in cell counts very close to 250k each time. To phrase it another way, ~15% of the time, a 10 ul in 90 ul dilution would cause the “wrong” number of cells to be plated (less than 240k, or more than 260k). In contrast, due to the increased precision, a 100 ul in 900 ul dilution would never result in the “wrong” number of cells being plated. So speaking solely about the dilution, the way the dilution was being performed could have some light impacts on the accuracy of how many cells would be actually plated.
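To make the arithmetic behind these two plots explicit, here is a short sketch. It assumes each count corresponds to 0.1 ul of the 10-fold diluted cells (as described in the methods), and that the stock concentration is obtained by scaling up to 1 ml and multiplying by the dilution factor. The function names are made up for this example, and the linked code may organize this differently.

```python
SQUARE_VOLUME_UL = 0.1   # volume represented by one count
DILUTION_FACTOR = 10     # cells were diluted 10-fold before counting
TARGET_CELLS = 250_000   # number of cells we want to plate
TRUE_COUNT = 27.3        # the count treated as the "truth"


def count_to_cells_per_ml(mean_count):
    """Convert a mean count per 0.1 ul of diluted cells into cells / ml of the stock."""
    cells_per_ul_diluted = mean_count / SQUARE_VOLUME_UL
    return cells_per_ul_diluted * 1000 * DILUTION_FACTOR


def true_cells_plated(estimated_count):
    """Cells actually plated if the volume is chosen using an imperfect count estimate."""
    volume_ml = TARGET_CELLS / count_to_cells_per_ml(estimated_count)
    return volume_ml * count_to_cells_per_ml(TRUE_COUNT)


print(count_to_cells_per_ml(TRUE_COUNT))  # stock concentration in cells / ml
print(true_cells_plated(26.0))            # an underestimated count leads to plating too many cells
```

Note that the number of cells actually plated reduces to 250,000 × (true count / estimated count), so only the relative error of the count matters.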
I was going to call this exercise complete, but I ran this analysis by Anna, and she mentioned that I wasn’t REALLY recreating the entire process; sure I had recreated the dilution step, but we would have also counted cells from the dilution in the hemacytometer to actually get the cell counts in real life. Thus, I modified the code such that each dilution step was followed by a random sampling of four counts (using the coefficient of variation determined from the initial hemacytometer readings), and taking the mean of those counts; this represented how we would have ACTUALLY followed up each dilution in real life. The results were VERY different:
In effect, the imprecision imparted by the hemacytometer counts seemed to almost completely drown out the imprecision caused by the suboptimal dilution step. This was pretty mind-blowing for me, especially considering that I would have totally missed this effect had I not run this post by Anna. Now fully modeling the counting process, a 10 ul in 90 ul dilution would cause the “wrong” number of cells (less than 240k, or more than 260k) to be plated ~ 42.5% of the time, and a 100 ul in 900 ul dilution would still cause a “wrong” cell number to be plated ~ 42.2% of the time; almost identical! Thus, while a 100 ul in 900 ul dilution does impart some slightly increased accuracy, it’s quite minor / negligible over a 10 ul in 90 ul dilution. So while in a sense this wasn’t the initial question asked, it’s still effectively the real answer.
At the end of the day, I think the more impactful aspect of this exercise is the idea that even routine aspects of wet-lab work are deeply rooted in stats (in this case, propagation of errors caused by poor sampling), and that the power of modern computational simulations can be used to optimize these procedures. There’s something truly empowering about having a new tool / capability that gives you new perspectives on procedures you’ve done a bunch of times, and allows you to fully rationalize them rather than relying on advice given to you by others.
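Here is a rough sketch of that two-stage simulation (dilution, then four hemacytometer counts), to show why the counting step dominates. The coefficient of variation of 0.10 is a placeholder assumption, not the measured value, and “wrong” is taken to mean more than 10,000 cells away from the 250,000 target, which is an assumption about the intended thresholds. Because of those assumptions, the percentages it prints are not expected to reproduce the exact numbers above.

```python
import random

TRUE_COUNT = 27.3        # count per 0.1 ul treated as the "truth"
CV = 0.10                # ASSUMED coefficient of variation of a single count
TARGET_CELLS = 250_000
TOLERANCE = 10_000       # assumed: "wrong" = outside 240k to 260k
REPEATS = 2_000          # reduced from the post's 5,000 for speed

rng = random.Random(1)   # fixed seed so the sketch is reproducible


def fraction_wrong(draws_per_dilution, repeats=REPEATS):
    """Fraction of simulated platings that miss the target by more than TOLERANCE."""
    wrong = 0
    for _ in range(repeats):
        # Step 1: the dilution, modeled as the mean of many 0.1 ul draws
        total = sum(rng.gauss(TRUE_COUNT, CV * TRUE_COUNT) for _ in range(draws_per_dilution))
        diluted_mean = total / draws_per_dilution
        # Step 2: count four hemacytometer squares from that dilution
        counts = [rng.gauss(diluted_mean, CV * diluted_mean) for _ in range(4)]
        estimate = sum(counts) / 4
        # Cells actually plated when the volume is based on the estimate
        plated = TARGET_CELLS * TRUE_COUNT / estimate
        if abs(plated - TARGET_CELLS) > TOLERANCE:
            wrong += 1
    return wrong / repeats


results = {"10 ul in 90 ul": fraction_wrong(100), "100 ul in 900 ul": fraction_wrong(1_000)}
for label, fraction in results.items():
    print(f"{label}: wrong about {fraction:.1%} of the time")
```

Averaging only four counts leaves far more noise than either dilution volume introduces, so the two conditions end up with nearly the same error rate.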
Here’s the code if you want to try running it yourself.
Acknowledgements: Thanks to Sarah for bringing this question to my attention. Also, BIG THANKS to Anna for pointing out where I was being myopic in my analysis, which got me to a qualitatively different (and more real-life relevant) answer. It really is worth having smart people look over your work before you finalize it!
Sarah Roelle joins the lab as an RA2, and will be using her years of experience in the CWRU Department of Biomedical Engineering to help Kenny finish setting up the lab, and work with him to get the first sets of independent research projects moving. Welcome Sarah! We are very happy to have you here!