FlowJo Analysis of GFP-positive cells

We do a lot of flow cytometry in the lab. Inevitably, what ends up being the most practical tool for analysis of flow cytometry data is FlowJo. While I’ve been using FlowJo for a long time, I realize it isn’t super intuitive, and people new to the lab may struggle with it at first. Thus, here’s a short set of instructions for using it to do a basic process, such as determining what percentage of live cells are also GFP positive.

Obviously, if you don’t have FlowJo yet, then download it from the website. Next, log into FlowJo Portal. I’m obviously not going to share my login and password here; ask someone in the lab or consult the lab google docs.

Once logged in, you’ll be starting with a blank analysis workspace, as below.

Before I forget, an annoying default setting of FlowJo is that it only lists two decimal points in most of its values. This can be prohibitively uninformative if you have very low percentages that you’re trying to accurately quantitate. Thus, click on the “Preferences” button:

Then click on the “Workspaces” button.

And finally once in that final window, change the “Decimal Precision” value to something like 8.

With that out of the way, now you can perform your analysis. Before you start dragging in samples, I find it useful to make a group for the specific set of samples you may want to analyze. Thus, I hit the “Create Group” button and type in the name of the group I’ll be analyzing.

Now that the group is made, I select it, and then drag the new sample files into it, like below:

Now to actually start analyzing the flow data. Start by choosing a representative sample (e.g. the first sample) and double-clicking on it. By default, a scatterplot should show up. Set it so forward scatter (FSC-A) is on the X-axis and side scatter (SSC-A) is on the Y-axis. Since we’re mostly using HEK cells, the main thing we will be doing in this screen is gating for the population of cells while excluding debris (small FSC-A but high SSC-A). Thus, make a gate like this:

Once you have made that gate, you’ll want to keep it constant between samples. Thus, right click on the “Live” population in the workspace and hit “Copy to Group”. Once you do that, the population should now be in bold, with the same text color as the group name.

Next is doublet gating. So the live cell population will already be enriched for singlets, but having a second “doublet gating” step will make it that much more pure. Here is the best description of doublet gating I’ve seen to date. To do this, make a scatterplot where FSC-A is on the X-axis, and FSC-H is on the Y-axis. Then only gate the cells directly on the diagonal, thus excluding those that have more FSC-A relative to FSC-H. Name these “Singlets”.

And like before, copy this to the group.

Next is actually setting up the analysis for the response variable we were looking to measure. In this case, it’s GFP positivity, captured by the BL1-A detector. While this can be done in histogram format, I generally also do this with a scatterplot, since it allows me to see even small numbers of events (which would be smashed against the bottom of the plot if it were a smoothed histogram). Of course, a scatterplot needs a second axis, so I just used mCherry fluorescence (or the lack of it, since these were just normal 293T cells), captured by the YL2-A detector.

And of course copy that to the group as well (you should know how to do this by now). Lastly, the easiest way to output this data is to hit the Table Editor button near the top of the screen to open up a new window. Once in this window, select the populations / statistics you want to include from the main workspace and drag them into the table editor, so you have something that looks like this.

Some of those statistics aren’t what we’re looking for. For example, I find it much more informative to have the singlets show total count, rather than Freq of parent. To do this, double click on that row, and select the statistic you want to include.

And you should now have something that looks like this:

With the settings fixed, you can hit the “Create Table” button at the top of the main workspace. This will make a new window, holding the table you wanted. To actually use this data elsewhere (such as with R), export it into a csv format which can be easily imported by other programs.
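Once you’ve exported that csv, pulling it into R is quick. Here’s a minimal sketch of what that might look like, with hypothetical file and column names (the actual headers will depend on how you named your gates and which statistics you added in the Table Editor):

library(tidyverse)

# Hypothetical export; the column names below are placeholders for whatever your Table Editor produced
flow <- read_csv("flowjo_table_export.csv")

flow %>%
  rename(gfp_pct = `Singlets/GFP positive | Freq. of Parent`) %>%   # hypothetical column name
  select(Sample, gfp_pct) %>%                                       # "Sample" is also a placeholder
  arrange(desc(gfp_pct))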

FYI, if you followed everything exactly up to here, you should only have 2 data columns and not 3. I had simplified some things but forgot to update this last image, so it’s no longer 100% right (though the general idea is still correct).

Congratulations. You are now a FlowJo master.

Optimal Laser and Detector Filter Combinations for Fluorescent Proteins

The people at the CWRU flow cytometry core recently did a clean reinstall of one of their instruments, which meant that we had to re-set up our acquisition template. I still ended up eyeballing what would be the best laser / filter sets based on the pages over at FPbase.org, but I had a little bit of free time today, so I decided to work on a project I had been meaning to do for a while.

In short, between the downloadable fluorescence spectra at FPbase and the known instrument lasers and detector bandpass filters, I figured I could write a script that takes in whatever fluorescent protein spectra you have downloaded and makes a table showing you which laser + detector filter combinations give you the highest amount of fluorescence.
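As a rough illustration of the idea (a minimal sketch, not the actual lab script; the column names, filter list, and scoring below are my own stand-ins), the logic is basically: look up how well each laser line excites the protein, sum the emission spectrum that falls within each bandpass filter, and multiply the two.

library(tidyverse)

# Hypothetical FPbase spectra csv with columns: wavelength (nm), ex (0-1), em (0-1)
spectra <- read_csv("EGFP_fpbase_spectra.csv")

# Hypothetical instrument configuration: each detector is a laser line plus a bandpass filter
detectors <- tribble(
  ~laser, ~filter,  ~center, ~width,
  488,    "530/30", 530,     30,
  488,    "610/20", 610,     20,
  561,    "610/20", 610,     20
)

# Score = excitation efficiency at the laser line x emission captured within the bandpass
scores <- detectors %>%
  rowwise() %>%
  mutate(
    excitation = spectra$ex[which.min(abs(spectra$wavelength - laser))],
    emission   = sum(spectra$em[abs(spectra$wavelength - center) <= width / 2], na.rm = TRUE),
    score      = excitation * emission
  ) %>%
  ungroup() %>%
  arrange(desc(score))

scores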

Here’s the R script on the lab GitHub page. I made it for the two flow cytometers and two sorters I use at the CWRU cytometry core, although it would presumably be pretty easy to change the script to make it applicable for whatever instruments are at your place of work. So here’s a screenshot of a compendium of the results for these instruments:

Nothing too surprising here, although it’s still nice / interesting to see the actual results. It’s somewhat obvious, since the standard Aria has no Green or Yellow-Green laser, but we should not do any sorting with mCherry on that instrument. Instead, we should use the Aria-SORP, which has the full complement of lasers we need.

Designing Primers for Targeted Mutagenesis

Now that my lab is fully equipped, I’m taking on rotation students. Unfortunately, with the pandemic, it’s harder to have one-on-one meetings where I can sit down and walk the new students through every method. Furthermore, why repeat teaching the same thing to multiple students when I can just make an initial written record that everyone can reference and just ask me questions about? Thus, here’s my instructional tutorial on how I design primers in the lab. 12/2/24 update: Don’t forget to look at the most up-to-date strategy described at the bottom of the page!

First, it’s good to start out by making a new benchling file for whatever you’re trying to engineer. If you’re just making a missense mutation, then you can start out by copying the map for the plasmid you’re going to use as a template. Today, we’ll be mutating a plasmid called “G619C_AttB_hTrim-hCPSF6(301-358)-IRES-mCherry-P2A-PuroR” to encode the F321N mutation in the CPSF6 region. This should abrogate the binding of this peptide to the HIV capsid protein. Eventually every plasmid in the lab gets a unique identifier based on the order it gets created (this is the GXXXX name). Since we haven’t actually started making this plasmid yet, I usually just stick an “X” in front of the name of the new file, to signify that it’s *planned* to be a new plasmid, with G619C being used as the template. Furthermore, I write in the mutation that I’m planning to make in it. Thus, this new plasmid map is now temporarily being called “XG619C_AttB_hTrim-hCPSF6(301-358)-F321N-IRES-mCherry-P2A-PuroR”.

That’s what the overall plasmid looks like. We’ll be mutating a few nucleotides in the 4,000 nt area of the plasmid.

I’ve now zoomed into the part of the plasmid we actually want to mutate. The residue is Phe321 in the full length CPSF6 protein, but in the case of this Trim-fusion, it’s actually residue 344.

I next like to “write in” the mutation I want to make, as this 1) makes everything easier, and 2) is part of the goal of making a new map that now incorporates that mutation. Thus, I’ve now replaced the first two T’s of the Phe codon “TTT” with two A’s, making the “AAT” codon which encodes Asn (see the image above).

Next is planning the primers. So there are a few ways one could design primers to make the mutation. I like to create a pair of overlapping (~ 17 nt), inverse primers, where one of the primers encodes the new mutation in it. PCR amplification with these primers should result in a single “around-the-circle” amplicon, where there is ~ 17 nt of homology on the terminal ends. These ends can then be brought together and closed using Gibson assembly.

So first to design the forward primer. This is the primer that will go [5' end]-[17 nt homology]-[mutated codon]-[primer binding region]-[3' end]. So the first step is to figure out the primer binding region.

In a cloning scheme like this, I like to start selecting the nucleotides directly 3′ of the codon to be mutated, and select enough nucleotides such that the melting temperature is ~55°C. In actuality, the melting temperature will be slightly higher, since 1) we will end up having 17 nt of matching sequence 5′ of the mutated codon, and 2) the 3rd nt in the codon, T, will actually be matching as well.
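Benchling will show you the melting temperature as you extend the selection, but if you want a rough sanity check outside of Benchling, here’s a minimal R sketch using the basic GC-content approximation (this ignores salt and nearest-neighbor effects, so treat the numbers as ballpark only; the example sequence is hypothetical):

# Rough Tm estimate for a primer binding region (basic GC-content formula)
rough_tm <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]
  gc <- sum(bases %in% c("G", "C"))
  64.9 + 41 * (gc - 16.4) / length(bases)
}

rough_tm("GGTCCAGCAGCAAGGCGTCC")   # hypothetical binding region; extend or trim until this lands near 55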

Now that I’ve determined how long I need that 3′ binding region to be, I select the entire set of nucleotides I want in my full primer. In this case, this ended up being a primer 36 nt in length (see below).

Since this is the forward primer, I can just copy the “sense” version of this sequence of nucleotides.

OK, so next to design the reverse primer. This is simpler, since it’s literally just a series of nucleotides going in the antisense orientation directly 5′ of the codon (as it’s shown in the sense-stranded plasmid map). I shoot for ~55°C to 60°C, usually just doing a little bit under 60°C.

Since this is the reverse primer, we want the REVERSE COMPLEMENT of what we see on the plasmid map.

Voila, we now have the two primers we need. We just now need to order these oligos (we order from ThermoFisher, since it’s the cheapest option at CWRU, and can then perform the standard MatreyekLab cloning workflow).

___________________

12/2/24 update: So the above instructions work fine, but I have since (as in like 3/4 years ago, haha) adopted a slightly different strategy. The original strategy prioritized having one primer (in the above example, the reverse primer, or KAM3402) being non-mutagenic / perfectly matching the WT sequence, so that it could double as a Sanger sequencing primer in other non-cloning circumstances. Well, we barely ever need these anymore, especially with the existence of whole-plasmid nanopore sequencing via Plasmidsaurus. Thus, I now:
1) Just design the primer pairs so that both the forward AND reverse primers are each encoding the mutated nucleotides, since that likely better balances primer Tm’s. To get the ideal 18+ nt of homology necessary for Gibson, I then append ~ 9 nt of matching sequence to the 5′ ends of each primer (so 9 matching nucleotides to the “left” of the mutated portion).
2) I also now like to reduce the anticipated Tm of the 3′ binding portion of the primer (so everything to the “right” of the mutated portion) to ~50-55°C, since there’s going to be some amount of binding energy provided by the ~9 nt of matching sequence on the other side of the primer (see the sketch below).
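Here’s a minimal R sketch of how those two primers get assembled under the updated scheme (the flanking sequences and binding-region lengths are hypothetical placeholders, not taken from the real plasmid; the point is just the structure of each primer and the ~21 nt of mutation-containing overlap it produces):

# Reverse complement helper
revcomp <- function(seq) {
  chartr("ACGT", "TGCA", paste(rev(strsplit(toupper(seq), "")[[1]]), collapse = ""))
}

# Hypothetical sequences flanking the codon being mutated (sense strand)
left_binding  <- "CTTCCGCCTCCTCAGCAG"   # region ending just 5' of the codon; reverse primer's ~50-55C binding portion
right_binding <- "GGAGGTTCAGGCCCTGGC"   # region starting just 3' of the codon; forward primer's ~50-55C binding portion
mutant_codon  <- "AAT"                  # TTT (Phe) -> AAT (Asn)

# ~9 nt of matching sequence appended to each primer's 5' end
left9  <- substr(left_binding, nchar(left_binding) - 8, nchar(left_binding))
right9 <- substr(right_binding, 1, 9)

# Forward primer: [9 nt homology][mutant codon][3' binding region going right]
fwd_primer <- paste0(left9, mutant_codon, right_binding)

# Reverse primer: spans [binding region going left][mutant codon][9 nt homology],
# written as the reverse complement so it reads 5'->3' on the other strand
rev_primer <- revcomp(paste0(left_binding, mutant_codon, right9))

fwd_primer
rev_primer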

Thus, my updated primer strategy for the above reactions would look like this:

Using prior data to optimize the future

As of this posting, we’ve cloned 176 constructs in the lab. I’ve kept pretty meticulous notes about what standard protocol we’ve used each time, how many clones we’ve screened, and how many clones had DNA where the intended insertions / deletions / mutations were present. With this data, I wondered whether a quick retrospective look at my observed success / failure rates could tell me whether my basic workflow / pipeline was optimized to maximize benefit (i.e. getting the recombinant DNA we want) while limiting cost (i.e. time, effort, $$$ for reagents and services). I particularly focused on 2-part Gibsons, since that’s the workhorse approach utilized for most molecular cloning in the lab.

First, here’s a density distribution reflecting reaction-based success rates (X number of correct clones in Y number of total screened clones, or X / Y = success rate).

I then repeatedly sampled from that distribution N times, with N ranging from 1 through 5, effectively pretending that I was screening 1 clone, 2 clones … up to 5 clones for each PCR + Gibson reaction we were performing. Since 1 good clone is really all you need, for each sampling of N clones, I checked whether any of them were a success (giving that reaction a value of “1”) or whether all of them failed (giving that reaction a value of “0”). I repeated this process 100 times, counted the sum of “1” and “0” values, and divided by 100 to get an overall success rate. I then repeated this whole process 50 times to get a sense of the variability of outcome for each condition. Here are the results:

We screen 3 clones per reaction in our standard protocol, and I think that’s a pretty good number. We capture at least 1 successful clone 3/4 of the time. Sure, maybe we increase how often we get the correct clone on the first pass if we instead screen 4 or 5 clones at a time, but the extra effort / time / cost doesn’t really seem worth it, especially since it’s totally possible to screen a larger number on a second pass for those tough-but-worth-it clones. Some of those reactions are also going to be ones that are just bad, period, and need to be re-started from the beginning (perhaps even by designing new primers), which is a screening hill that certainly isn’t worth dying on.
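Here’s roughly what that resampling procedure looks like in R. This is a minimal sketch based on my reading of the steps described above, not the actual analysis script (see the 9/10/20 edit below for that); the success_rates vector is just a stand-in for the real per-reaction distribution.

set.seed(1)

# Stand-in for the observed distribution of per-reaction success rates
# (fraction of screened clones that were correct for each reaction)
success_rates <- c(0, 0, 0.25, 0.33, 0.5, 0.5, 0.67, 1, 1, 1)

# Simulate screening n_clones per reaction: a reaction is "rescued" if any screened clone is correct
screen_sim <- function(n_clones, n_reactions = 100, n_replicates = 50) {
  replicate(n_replicates, {
    rescued <- replicate(n_reactions, {
      p <- sample(success_rates, 1)     # draw a reaction's underlying success rate
      any(runif(n_clones) < p)          # did at least one screened clone come up correct?
    })
    mean(rescued)                       # fraction of reactions rescued at this screening depth
  })
}

sapply(1:5, function(n) mean(screen_sim(n)))   # average rescue rate when screening 1 through 5 clones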

9/10/20 edit: In my effort to make it easier for trainees to learn / recreate what I’m doing, I posted the data and analysis script to the lab GitHub.

Modeling bacterial growth

I do a lot of molecular cloning, which means a lot of transformations of chemically competent E. coli. Using 50 uL of purchased competent bacteria would cost about $10 per transformation, which would be an AWFUL waste of money, especially with this being a highly recurring expense in the lab. I had never made my own competent cells before, so I had to figure this out shortly after starting my lab. It took a couple of days of dedicated effort, but it ended up being quite simple (I’ll link to my protocol a bit later on). Though my frozen stocks ended up working fine, I’ve become quite used to creating fresh cells every time I need to do a transformation. The critical step here is taking a saturated overnight starter culture and diluting it so you can harvest a larger volume of log-phase bacteria some short time later. A range of ODs [optical density, here defined as absorbance at 600 nm] work, though I like to use bacteria at an OD around 0.2. I had gotten pretty good at being able to eyeball when a culture was ready for harvesting (for LB in a 250 mL flask, I found this was right when I started seeing turbidity), but I figured there was a better way to know when it’s worth sampling and harvesting.

I started keeping good notes about 1) the starting density of my prep culture (OD of the overnight culture divided by the dilution factor), 2) the amount of time I left the prep culture growing, and 3) the final OD of the prep culture. I converted everything into cell density, which is a bit more intuitive than OD (I found 1 OD[A600] of my bacteria roughly corresponded to 5e8 bacteria per mL), and worked in those units from there on out. Knowing bacteria exhibit exponential growth, I log base-10 transformed the counts. Much like the increasing number of COVID-19 deaths experienced by the US from early March through early April, exponential growth becomes linear in log-transformed space. I figured I could thus estimate the growth of my prep culture of competent cells by making a multivariable linear model, where the final density of the bacteria is dependent on the starting bacterial density and how long I left it growing. I figured the lag phase from taking the saturated culture and sticking it into cold LB would end up being a constant in the model. Here’s my dataset, and here’s my R Markdown analysis script. My linear model seemed to perform pretty well, as you can see in the below plot. As of writing this, the Pearson’s r was 0.98.
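For anyone following along, the model itself is just a two-predictor linear model on the log-transformed densities. Here’s a minimal sketch of the idea, with hypothetical file and column names (the real data and R Markdown script are linked above):

library(tidyverse)

# Hypothetical columns: start_density and final_density in cells/mL, growth_time in hours
growth <- read_csv("competent_cell_growth.csv") %>%
  mutate(log_start = log10(start_density),
         log_final = log10(final_density))

# Final density as a function of starting density and growth time;
# the intercept soaks up the lag phase from diluting into cold LB
growth_model <- lm(log_final ~ log_start + growth_time, data = growth)
summary(growth_model)

# Example: predicted log10(final density) for a 1000-fold dilution of an OD 4 overnight culture
# (using 1 OD600 ~ 5e8 cells/mL) after 3.5 hours of growth
predict(growth_model,
        newdata = tibble(log_start = log10(4 * 5e8 / 1000), growth_time = 3.5))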

The aforementioned analysis script has a final chunk that allows you to input the starting OD of your starter culture and, assuming a 1000-fold dilution, tells you how long you likely need to wait to hit the right OD for your prep culture. Then again, I don’t think anyone really wants to enter this info into a computer every time they want to set up a culture, so I made a handy little “look-up plot”, shown below, where a lab member can just look at their starter culture OD on the x-axis, choose the dilution they want to do (staying within 2x of a 1000-fold dilution, since I don’t know if smaller dilutions can affect bacterial competency), and figure out when they need to be back to harvest (or at least stick the culture on ice). I’ve now printed this plot out and left it by my bacterial shaker-incubator.

Note: The above data was collected when diluting starter culture bacteria into *COLD* LB that was stored in the fridge. We’ve since shifted to diluting the bacteria into room-temp LB (~25°C), which has somewhat expectedly resulted in slightly faster times to reach the desired OD. If you’re doing that too, I would suggest subtracting ~30 min of incubation time from the above times to make sure you don’t overshoot your desired OD.

I’m still much more of a wet-lab scientist than a computational one. That said, god damn do I still think the moderate amount of computational work I can do is empowering.

Gibson / IVA success rates

I only learned about Gibson when I started my postdoc, and it completely changed how I approached science. In some experiments with Ethan when I was in the lab, I was blown away when we realized that you don’t even need Gibson mix to piece a plasmid back together; this is something we were exploring to try to figure out if we could come up with an easier & more economical library generation workflow. I was disappointed but equally blown away when I realized numerous people had repeatedly “discovered” this fact in the literature already; the most memorable of the names given to it was IVA, or In Vitro Assembly. Ethan had tried some experiments and had said it worked roughly as well as with Gibson. Of course, I can’t recall exactly what his experiment was at this point (although probably a 1-piece DNA recircularization reaction, since this was in the context of inverse PCR-based library building, after all). So the takeaway I had was that it was a possible avenue for molecular cloning in the future.

We’ve done a fair amount of molecular cloning in the lab already, creating ~60 constructs in the first 4 months since Sarah joined. I forget the exact circumstances, but something came up where it made sense to try some cloning where we didn’t add in Gibson mix. I was still able to get a number of intended constructs on that first try, so I stuck to not adding Gibson mix for a few more panels of constructs. I’ve been trying to keep very organized with my molecular cloning pipelines and inventories, which includes keeping track of how often each set of mol cloning reactions yielded correctly pieced-together constructs. I’ve taken this data and broken it down based on two variables: whether it was a 1- or 2-part DNA combination (I hardly ever try more than 2 in a single reaction, for simplicity’s sake, and also because properly combined cloning intermediates may still be useful down the line, anyway), and whether Gibson mix was added or not. Here are the current results:

Note: This is a *stacked* smoothed histogram. Essentially, the only real way to look at this data is to consider the width of a given color across the range of the x-axis, relative to its thickness in other portions.
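For reference, a stacked smoothed histogram like this is just a ggplot density plot with position = "stack". Here’s a minimal sketch of one way to draw it, assuming a hypothetical data frame with one row per reaction and columns success_rate and condition (e.g. “2-part + Gibson”, “2-part, no Gibson”):

library(tidyverse)

# Hypothetical per-reaction summary: success_rate = correct clones / screened clones
cloning <- read_csv("cloning_success_rates.csv")

ggplot(cloning, aes(x = success_rate, fill = condition)) +
  geom_density(position = "stack", alpha = 0.8) +
  geom_vline(xintercept = 0.25, linetype = "dotted", color = "red") +   # the "worth screening 4 colonies" cutoff
  theme_bw() +
  xlab("Fraction of screened clones that were correct") +
  ylab("Stacked density")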

So this was extremely informative. Some points:
1) I’m willing to screen at least 4 colonies for a construct I really want. Thus, I’m counting a success rate > 0.25 as a “successful” attempt at cloning a construct. In the above plot, that means any area above the dotted red line. Thus, 1-part DNA recircularizations have pretty decent success rates, since the area of the colored curve above the red dotted line >> the area below it. Sure, Gibson mix helps, but it’s not a night-and-day difference.
2) 2-part DNA combinations are a completely different story. Lack of Gibson mix means that I have just as many failed attempts at cloning something as successful attempts. Those are not great odds. Adding Gibson mix makes a big difference here, since it definitely pushes things in favor of a good outcome. Thus, I will ALWAYS be adding Gibson mix before attempting any 2-part DNA combinations.

Other notes: I’m using home-grown NEB 10-beta cells, which give me pretty decent transformation rates (high-efficiency 1-part recircularization reactions can definitely yield many hundreds of colonies on the plate from a successful attempt), so there have been relatively few plates where I literally have ZERO colonies; more often there are at least a few colonies that are just hard-to-remove residual template DNA.

Primer Inventory Google Sheet To Benchling

Here’s a Python script that converts the MatreyekLab primer google sheet into a csv file that is easily imported into benchling.

1) Go to the lab primer inventory google sheet -> “https://docs.google.com/spreadsheets/d/15RDWrPxZXN34KhymHYkKeOglDYzfkhz5-x2PMkPo0/edit?usp=sharing”

2) Go to file -> download -> Microsoft Excel (.xlsx)

3) Then take above file (MatreyekLab_Primer_Inventory.xlsx) and put it in the same directory as the Google_sheet_to_benchling.py file.

4) Open terminal, go to the right directory, and then enter:

python3 Google_sheet_to_benchling.py

5) It should make a new file called “Matreyeklab_primers_benchling.csv”. The text in this file can be copy-pasted into benchling and imported into the “Primer” folder.

Uploading the list to Benchling

6) Next, log onto Benchling, go into the “MatreyekLab” project and into the “O_Primers” folder. Make a new folder named with the date (e.g. “20200130” for January 30th, 2020). Once in the folder, select “Import Oligos”, and select the csv for importing.

Using the new primer list

7) Once it finishes uploading, you can go to whatever plasmid map you want to annotate with our current primers. Go to the right-hand side, two icons down to “Primers”. Hit attach existing, add the new folder as the new location, and hit “find binding sites”. Select all of the primers (top check box), and then hit the “Attach Selected Primers” button in the top right.

8) Now click on the sequence map tab and voila, you can see the plasmid map now annotated. Find the primer you want (sequencing or otherwise) and go do some science.

Basic data analysis in R Studio

So as I bring trainees into the lab, I’ll want them to learn how to do some (at the very least) basic data analyses. I realize they may not know exactly where to start, so as I go about making my own basic analysis scripts for analyzing data relevant to me, I’ll post them here and make sure they’re reasonably well commented to explain what is going on.

OK, the specific backstory here. Back at UW, we had next-day IDT orders. It became very clear that was not going to be the case at CWRU, especially after talking to the IDT rep (who did not seem to really care about having a more thriving business here). So I priced out my options, and Thermo ended up handily winning the oligo price battle ($0.12 a nucleotide, which is a slight improvement over the $0.15 we seemed to be paying at UW *shrug*). Thermo also charges no shipping (another bonus), with the downside being that they only deliver on Tuesdays and Thursdays. I wanted to figure out how long it takes to receive primers after ordering them, so I’ve been keeping track of when I made each order and how long it took to arrive. Now that I have a decent number of data points, I decided to start analyzing it to figure out if there were any patterns emerging.

Here’s a link to the data. Here’s a link to the R Markdown file. You’ll want both the data and the R Markdown file in the same directory. I’m also copy-pasting the code below, for ease:

Primer_wait_analysis

This is the first chunk, where I’m setting up my workspace by 1) starting the workspace fresh, 2) importing the packages I’ll need, and 3) importing the data I’ll need.

# I always like to start by clearing the memory of existing variables / dataframes / etc.

rm(list = ls())

#Next, let's import the packages we'll need.
#1) readxl, so that I can import the excel spreadsheet with my data
#2) tidyverse, for doing the data wrangling (dplyr) and then for plotting the data (ggplot)

library(readxl)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Lastly, use readxl to import the data I want

primer_data <- read_excel("How_long_it_takes_to_receive_items_after_ordering.xlsx")

The idea is to make a bar-graph (or equivalent) showing how many days it takes for the primers to arrive depending on what day I order the primers. I first have to take the raw data and do some wrangling to get it in the format I need for plotting.

# I care about how long it took me to receive primers in a *normal* week, and not about what I encountered during the holidays. Thus, I'm going to filter out the holiday datapoints.

primer_data_normal <- primer_data %>% filter(holiday == "no")

# Next, I want to group data-points based on the day of the week

primer_data_grouped <- primer_data_normal %>% group_by(day_of_week) %>% summarize(average_duration = mean(days_after_ordering), standard_deviation = sd(days_after_ordering), n = n())

# Now let's set the "day_of_week" factor to actually follow the days of the week

primer_data_grouped$day_of_week <- factor(primer_data_grouped$day_of_week, levels = c("M","T","W","R","F","Sa","Su"))

# Since I'll want standard error rather than standard deviation, let's get the standard error
primer_data_grouped$standard_error <- primer_data_grouped$standard_deviation / sqrt(primer_data_grouped$n)

Now that the data is ready, time to plot it in ggplot2.

Primer_plot <- ggplot() + geom_point(data = primer_data_grouped, aes(x = day_of_week, y = average_duration)) +
  geom_errorbar(data = primer_data_grouped, 
                aes(x = day_of_week, ymin = average_duration - standard_error, 
                    ymax = average_duration + standard_error), width = 0.5) + 
  geom_text(data = primer_data_grouped, aes(x = day_of_week, y = 0.2, label = paste("n=",primer_data_grouped$n))) + geom_hline(yintercept = 0) + scale_y_continuous(limits = c(0,7), expand = c(0,0.01)) +
  theme_bw() + xlab("Day primer was ordered") + ylab("Days until primer arrived") + theme(panel.grid.major.x = element_blank())

ggsave(file = "Primer_plot.pdf", Primer_plot, height = 4, width = 6)
Primer_plot

Hmm, probably should have made that figure less tall. Oh well.

So the n values are still rather small, but it looks like I’ll get my primers soonest if I order on Monday or maybe Tuesday. In contrast, ordering on Thursday, Friday, or Saturday gives me the longest wait (though, well, some of that is weekend days, which don’t matter as much). Thus, if I have any projects coming up where I have to design primers, it’s probably worth taking the time to do that on a Sunday or Monday night instead of late in the workweek.

UPDATE 2/22/2020: Uhhh, so the plot above was my first draft attempt, and I have since homed in on the right representation of the data:

And that’s because average numbers are only somewhat meaningful, and the distribution of frequencies is much more relevant / accurate. Here’s a link to the updated script for generating the plot with the data file at this other link.
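If you want to reproduce that kind of representation yourself, one simple option (a sketch reusing the primer_data_normal data frame from the chunks above, rather than the exact code in the updated script) is to plot the counts of each wait time, broken out by the day the order was placed:

# Counts of each wait time, faceted by the day the order was placed
Primer_freq_plot <- ggplot(primer_data_normal, aes(x = days_after_ordering)) +
  geom_bar() +
  facet_wrap(~ factor(day_of_week, levels = c("M","T","W","R","F","Sa","Su")), nrow = 1) +
  theme_bw() + xlab("Days until primer arrived") + ylab("Number of orders")

Primer_freq_plot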

Writing a simple Python script (for Biologists)

Being able to write short scripts that help you do what you need to do is a very empowering ability for modern-day biologists like me.

In the process of planning a new plasmid that encodes two proteins fused together by a flexible linker, I realized that I should just write a short script that, when run, gives me a random set of glycine and serine codons composing that linker (riffing off of the ol’ GGS-repeat linkers that seem to prevail in synthetic molecular biology). I’m also in the spirit of documenting the things I’ve learned so that future trainees can have them as a reference for their own learning. So here goes my attempt at explaining how I’m approaching this:

First is thinking about the scope of the script you’re trying to put together. This should be an extremely simple one, where I’ll have a list of all of the glycine and serine codons, and a series of random numbers will determine which of those codons to use. Maybe I’ll also include a user-input feature so you can tell the script how many codons it should be randomly stringing together. Since it’s so simple, it shouldn’t require loading many custom libraries / packages for performing more advanced procedures.

In the spirit of good practice, I’ll try to do this in iPython as well, though I’ll make a simple python script at the end for quick running from the command line. OK, here goes:

Note: I posted the final code at my Github page (link at the bottom). But for people wanting to just follow along here, here is the code in its final form at the outset (so you can see how I went about building it):

from random import randint

codons = ["GGT","GGC","GGA","GGG","TCT","TCC","TCA","TCG","AGT","AGC"]

length = input("What amino acid length flexible linker would you like a nucleotide sequence for?")

linker_sequence = ""
length = int(length)
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    linker_sequence = linker_sequence + new_codon
    x = x + 1

print(linker_sequence)

1) Had to remind myself of this, but first you type in “jupyter notebook” to open the interactive web-browser interface (assuming you’ve already installed it).

2) Go to the right directory, and then make a new iPython3 notebook.

3) Let’s start simple and make the random codon generator. First, let’s make a list of the codons we want to include.

codons = ["GGT","GGC","GGA","GGG","TCT","TCC","TCA","TCG","AGT","AGC"]

4) Next, let’s figure out how to choose a random index in the list, so that a random codon is chosen. I don’t remember how to do this off-hand, so I had to google it until I got to this page.

5) OK, so I guess I do have to load a package. I’m now writing that at the top of the script.

from random import randint   
randint(0,9)

It gave the correct output so I saw that it worked and commented out the randint call.

# randint(0,9) #This line won't run

6) Next is having the script randomly pull out a glycine or serine codon. This is simply done by now including:

codons[randint(0,9)]

7) Cool. All that has worked so far. Let’s now have the script take in a user-input for the number of residues so we can repeat this process and spit out a linker sequence of desired length. I don’t remember how to ask for user input in a python script, so I had to google this as well.

8) Alright, so I need to use the “raw_input” function and assign its output to a variable. Well, I tried that and it said “raw_input” wasn’t defined. Googled that, and found this link saying that advice was deprecated, and that instead it’s just “input()”.

9) Thus, I typed in:

length = input("What amino acid length flexible linker would you like a nucleotide sequence for?")

And this asked me for a number like I had hoped. Great.

10) To make this process iterable, I went for my trusty “for” loop. Actually, I tried to make a “while” loop first, but realized I didn’t know how to make it work off the top of my head. So I just ended up making a “for” loop and putting an x = x + 1 statement at the end to effectively turn it into a “while” loop. Personally, I think the ends justify the means, and going with what you know works well is a valid option for most basic scripts, when efficiency isn’t a huge priority.

length = 3  #Giving an arbitrary number for testing the script
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    x = x + 1 

11) I then had to be able to keep track of the codons that were pulled during each iteration of the loop. I thus created an empty variable called “linker_sequence” and just added the string for the new codon at the back of “linker_sequence” during each iteration.

linker_sequence = ""  # A blank variable to keep track of things
length = int(length)  # To convert the input text into a number
for x in range(0,length):   
    new_codon = codons[randint(0,9)]  
    linker_sequence = linker_sequence + new_codon  
    x = x + 1

12) Lastly is putting a final print function so it returns the desired string of nucleotides to the user.

print(linker_sequence)

Nice, I think that does it for the script. Super easy and simple!

13) Finally, let’s test the script. I typed in 5 amino acids, and it gave me “TCTAGTTCAGGCTCT” as the output string. I Google searched “Transeq” to get to the EBI’s simple codon translator, and translation of the nucleotide sequence above gave “SSSGS” as the protein sequence. So great, it worked! Sure, a little serine heavy, but that’s random chance for you. It should still be perfectly fine as a linker, regardless. Now to finish planning this plasmid…

Note: I’ve posted both the iPython notebook file (.ipynb) and a simple python script (.py) on my Github page. Feel free to use them!