Dual index on Miseq and Nextseq

We’re planning on submitting some dual indexed, paired read, amplicon sequencing samples, and depending on how many we have, we may submit for sequencing on a Miseq or Nextseq. Since this is all custom (and we generated the indices), I had to figure out how exactly how the process plays out for the two instruments to see what we need to the sample sheet for demultiplexing. I figure I’d illustrate it out for people in the lab so they understand the process as well.

The common steps

Everything starts with bridge amplification, where both the forward and reverse strands are physically bound to the flow cell by their 5′ ends. I’m denote where the DNA strands are physically bound to the flow cell with the grey circles on the ends.

Next is denaturing the strands so they are no longer bound, and also cleaving the strand that is bound to the flow cell via the p5 sequence. The result is something that looks like this.

Now the single stranded molecule is ready for sequencing from the read 1 primer. The light blue box is showing the nucleotides we’ll be reading in this particular library.

Once that read is done, next is reading the first index.

The dissimilar steps

We didn’t do a ton of dual indexing of the libraries in my postdoc (and since Jason used to run all of the kits anyway), I didn’t really need to know these steps, but I’ve had to figure them out now setting everything up in my own lab. I’ll go over what happens with the Nextseq first, since that’s conceptually a bit easier. Here’s what the Illumina docs say.

The Nextseq way

So this setup allows for dual indexing read off of two custom primers. So for this specific library, this means that the complementary sequence is going to be synthesized, making a double-stranded bridged molecule again.

And then after everything is denatured and the original template strand removed (presumably by cleaving at the p7 sequence this time), the second index can now be sequenced.

Followed by sequencing of the second read.

The Miseq way

Looking at what Illumina says in their documents for dual indexing on the Miseq, it looks a little different:

So the clear difference here is that the second index for the Miseq system is without a second custom index seq primer, but instead *during* the second strand synthesis step primed by the p5 oligo. With our specific library, it will look like this, starting with sequencing of that second index:

Note: I didn’t realize this until we did the exercise, but the p5 cluster generator we used to use in the Fowler lab is longer than the actual p5 sequence Illumina gives in their manuals. Not quite sure for this discrepancy, though I’m assuming that may mean that the sequences immediately after the p5 oligo during this step won’t be our rather variable index sequences, and instead may be some constant bases preceding the indexes. I’m guessing this is not a deal-killer, but something we’ll still have to be cognizant of when determining the run programming.

After that, is the complete second strand synthesis, resulting in a double stranded bridged molecule again.

Followed by denaturing and cleavage (again, presumably of p7 sequence) as described above, followed by annealing of the read2 primer.

So why does this matter?

Well, I think the practical implications are a few-fold. Firstly, Miseq dual indexing won’t need that second custom index read primer, since it will be reading off of the p5 sequence. And this is further complicated by the fact that the p5 adapter sequence we added onto the amplicons may perhaps be a bit longer than what’s actually on the chip, so we may have to factor this into the run parameters. Secondly, the strand that is reading that second index is different. With the Miseq, it’s being read off that first strand and thus in the same orientation as index 1. On the Nextseq, it’s being read on the second strand, so it will be read in the opposite order as index 1. Thus, I think this matters in terms of whether we’re putting the forward or reverse complement in the sample sheet, which will differ depending on whether we’re going Miseq or Nextseq with these samples.

Estimating coverage for NNK SSM transformations

We were recently doing some small scale Gibson-based NNK site saturation mutagenesis PCR reactions. In this scheme, we are independently transforming each position separately, so the number of transformants (ie. colonies) on a given plate should be directly related to the likelihood that all of the desired variants that we want to see are there at least once.

In fact, there are three parameters that factor into how good the variant coverage is at a given position. This is going to be 1) nucleotide biases in the creation of the NNK degenerate region of the primer, 2) the number of transformants, and 3) the fraction of the number of total transformants are actually variants, rather than undesired molecules such as carryover of the WT plasmid used for the template.

For any given experiment, you’re not going to know what the nucleotide bias is like until you actually Illumina sequence your library…. but at that point, you’ll already know the variant coverage of your library, so no need to estimate it anymore. On the other hand, if you know the nucleotide biases you observed for similar libraries, then you can do this estimation far before you get around to Illumina sequencing. Based on previous libraries, I have a pretty good idea of what the biases from machine-mixed NNK primers from IDT are like. For simplicity sake, I’m using 40% G, 20% C, 20% A, and 20%T as a rough estimate for the nucleotide bias I saw in the most biased NNK libraries.

The other two parameters are going to be very much experiment specific, and can be determined shortly after generating the library. The number of transformants can be determined by counting colonies from the transformation. And the amount of template contamination can be roughly determined by performing Sanger sequencing on a handful of colonies from those plates. Thus, I chose a few reasonable values for each: colony counts ranging from the very small (10 and 20) to quite large (400 and 1000), and template contamination percentages from almost impossibly low (0%) and much more likely (10 or 20%) all the way to possibly prohibitively high (50% and 75%). I then simulated the entire process, bootstrapped 20 times to get a sense of the average output, and made a plot showing what types of variant coverages you get depending on the combinations of those observed parameters. This is what the plot looks like:

So there you go. In a reasonable condition where you have, let’s say 10 or 20% template contamination, then you’d really be hoping to see at least 200 colonies, and hopefully around 400, where you can then really pat yourself on the back. If things went awry with the DPNI step, for example, and you were getting between a quarter to a half of colonies being template, then you’d minimally want 400 or so colonies and don’t feel too safe until you got a fair bit more than that. Though that’s only to make sure you at least have one copy of every variant at that position. If your library is half template, then chances are you’ll be running into a bunch of other problems down the line.

HEK 293T Bxb1 Landing Pad recombination protocol

Due to popular request, I’m going to put my *most current* version of the HEK 293T Landing Pad recombination protocol here for others’ benefit. Much of the credit goes to Sarah Roelle, who wrote up this current version of the protocol.

Recombination:

[This protocol is for a 24-well plate:]
Day 1:

1) Make 2 transfection mixtures per sample:
Tube 1: 23 μL Opti-MEM + 1 μL Fugene6

Tube 2 (If using cells that don’t already express the Bxb1 recombinase enzyme): μL of DNA corresponding to 16 ng Bxb1 plasmid + μL corresponding to 224 ng attB plasmid and OPTI-MEM to a final tube volume of 24 μL.
-OR-
Tube 2 (If using cells that already express the Bxb1 recombinase enzyme): μL of DNA corresponding to 240 ng attB plasmid and OPTI-MEM to a final tube volume of 24 μL.

2) Mix Tube 2 into Tube 1 for each sample. Mix up and down a couple times with pipet. Then let the mixtures sit for 15-30 min while you get the cells ready (Unless you trypsinized and counted the cells first, to know how many you had in case they were limiting).

4) [Meanwhile] Trypsinize and count the cells. Add 120,000 cells per well in a final volume of 300 μL media to each well.

5) Once at least 15 minutes have passed since mixing, add the mixtures dropwise throughout the well of cells being transfected.

Day 2:
Add at least 500 μL media to each well.

Notes about scaling up / down:
We typically pilot new plasmids in a 24-well scale, and transfect larger populations of cells by transfecting 6-wells and increasing the number of plates / wells as needed.

Negative selection (if applicable):

Negative selection with iCasp9 is wonderful, and hopefully you are using one of those landing pad versions. I typically use 10nM final concentration of AP1903, although I’m fairly certain 1nM is just as effective. Death occurs in a couple of hours, so you can come back to your plate / flask later on in the day and change media to get rid of the dying cells. I typically wait until at least 72 hours after recombination to perform this step.

Positive selection (if applicable):

I tend to do the positive selection step AFTER negative selection, since the negative selection step has usually thinned the number of cells down enough such that the positive selection is going to be most effective (I find that over-confluent wells tend not to do super-grant with positive selection). Thus, while it can be done as soon as 72 hours post recombination, I tend to do this a week or later after recombination. You can find effective concentrations for positive selection of landing pad HEK 293T cells here.

Some additional notes:

  1. Due to a phenomenon of (annotated) promoter-less expression from the thousands of un-recombined plasmids that remain in the cell following transfection, I typically wait ~ 5 to 7 days before running any cells in the flow cytometer, to wait for that background signal to die down.
  2. Other transfection methods may work, but will need to be optimized. For example, I have observed transfection with lipofectamine 3000 to result in prolonged promoter-less expression, probably due to increased transfection and greater toxicity preventing cell division and thus plasmid dilution.

Conda virtual environments

If you’re going to do things with the bash command line, you’ll inevitably have to install a number of packages / dependencies. Having a package manager can help with this, along with virtual environments, that allow you to keep the versions of packages you need for different purposes relatively organized. For package managers, I have tended to like Anaconda the most. So install that. I’m not going to go through the steps here, since I already have it installed, so I’d really have to go out of my way to try to describe that. But it should be pretty straightforward.

Well, I want to create and run a script using biopython, but I don’t already have it installed on this computer, so this seems like a nice time to make a virtual environment for it. The steps I did were as follows:

  1. Create a virtual environment for biopython.
    $ conda create -n biopython
  2. Activate the virtual environment
    $ conda activate biopython
  3. Next, install biopython.
    $ conda install -c conda-forge biopython

That should be it. Whenever you want to deactivate the virtual environment, you can type:
$ conda deactivate

Analyzing Illumina Fastq data

We recently got some Illumina sequencing back from GeneWiz, and I realized that this is good opportunity to show people in the lab how to do some really basic operations to handle such types of sequencing data, so I’ll write those instructions here. Since essentially everybody in the lab uses a Mac as their primary computer, these instructions will be directly related to performing these steps on a Mac, though the same basic steps can likely be applied to PCs. Also, since these files are small, everything will be done locally; once the files get big enough and the analyses more complicated, we’ll start doing things on a computer cluster. Now to get to the actual info:

  1. First, find the data we’ll be using for practice today. If you’re in the lab, you can go to the lab GoogleDrive into the Data/Illumina/Amplicon_EZ/30-507925014/00_fastq directory to find the files.

We won’t need to analyze everything there for this tutorial; instead, let’s focus on the “KAM-IDT-Std_R1_001.fastq.gz” and “KAM-IDT-Std_R2_001.fastq.gz” files.

2. Copy the files to a directory on your local computer. You can do the old “drag and drop” using the GUI, or you can do it in the command line like so, once you adjust the paths for your own computer:

$ cp /Volumes/GoogleDrive/My\ Drive/MatreyekLab_GoogleDrive/Data/Illumina/Amplicon_EZ/30-507925014/00_fastq/KAM-IDT-Std_R1_001.fastq.gz /Users/kmatreyek/Desktop/Illumina_data

$ cp /Volumes/GoogleDrive/My\ Drive/MatreyekLab_GoogleDrive/Data/Illumina/Amplicon_EZ/30-507925014/00_fastq/KAM-IDT-Std_R2_001.fastq.gz /Users/kmatreyek/Desktop/Illumina_data

3. Un-gzip the files. You can do this in the GUI by double-clicking the files, or you can do it in the terminal (if you’re now in the right directory) like so.

$ gzip -dk KAM-IDT-Std_R1_001.fastq.gz

$ gzip -dk KAM-IDT-Std_R2_001.fastq.gz

Optional: Take a look at your fastq files. You won’t want to open the files in their entirety, so what makes more sense it just looking at the first 4 or so lines of the file, corresponding to the first read. To do this, type:

$ head -4 KAM-IDT-Std_R1_001.fastq

And you should get an output that looks like so:

4. They won’t always be paired reads, but this time it is. So we’ll pair them. I you don’t already have a method for doing this, then download PEAR and install it like I described here. Once you have it installed, you can type in a command like so:

$ pear -f KAM-IDT-Std_R1_001.fastq.gz -r KAM-IDT-Std_R2_001.fastq.gz -o IDT_HM

It took my desktop a couple minutes for this process to complete. You’ll get an output that looks like this.

Your directory should now have all of these files:

You can look at the first read again (now that it’s been paired), using the following line, it should look like so:

$ head -4 IDT_HM.assembled.fastq

As you can tell, the quality scores in the first line went from mostly F’s (Q-Score of 37) to almost all I’s (Q-Score of 40).

5. Now that we’ve prepped the Illumina data, it’s time for the downstream analysis. This will be far more project or experiment specific, so these next steps won’t apply for every situation. But in this case, we made a library of Kozak variants to try to get a range of expression levels of the protein of interest. Furthermore, the template DNA used for the PCR lacked a Kozak sequence and a start codon, and these will inevitably be in the sequencing data also. So the goal of this next step is to identify the reads that are template vs those that have the Kozak sequence, and if it does have a Kozak and ATG introduced, to extract the Kozak sequence from the read.

I went ahead and wrote a short python script that achieves this. So, grab that file, stick it in the same directory as the data you want to analyze, and run it.

$ python3 Extract_Kozak.py IDT_STD.assembled.fastq

The script should then create a file called “IDT_STD.assembled.tsv” that should look like this:

This can now be easily analyzed with whatever your favorite data analysis language is, whether it’s R or Python. Huzzah!

Using PEAR to deal with paired reads

Got some paired Illumina sequencing back today, and wanted to pair the R1 and R2 fastq files. I vaguely remember using PEAR to do this before, so gave it a shot. Since the files are relatively small, I figured I’d just try to run this locally, which meant trying to install it (and it’s various dependencies) on my Macbook. Here are the steps…

  1. Download “autoconf-2.65.tar.gz” from this website. Uncompress it, and then go into the now uncompressed directory, and type in “./configure && make && make install”

2. Download “automake-1.14.1.tar.gz” from this website. Uncompress it, and then go into the now uncompressed directory, and type in “./configure && make && make install”

3. Get PEAR academic from this website by entering in your info, receiving an email, and downloading “pear-src-0.9.11.tar.gz”. Uncompress it, and then go into the now uncompressed directory, and type in “./configure && make && make install”

Huzzah. It should now be installed.

Now you can go to whatever directory has your R1 and R2 files, and type in something like so:
“pear -f KAM-IDT-HM_R1_001.fastq -r KAM-IDT-HM_R2_001.fastq -o IDT_HM”

You should now have a file called “IDT_HM.assembled.fastq” that has all of the joined paired reads.

5/18/21 Update: I was working on something else that required trimming reads. I decided to use fastp to do this. The instructions to install fastp is as follows (essentially following the instructions here):

  1. Download the zip file (under the “Code” button) from https://github.com/OpenGene/fastp, and unzip it.
  2. Use terminal to go to the directory that is made, and run
    $ make
  3. Run the following command:
    $ sudo make install
    … and then type in your computer password when prompted. Fastp should now be installed.

Landing pad plasmid maps

I still intend to get around to posting most if not all landing pad plasmids to Addgene, but it’s taken me forever to get around to it. Once those are there, the plasmid sequences will obviously be publicly available. In the mean time, I figure I’d post some of the most common plasmid maps here, so other people can benefit form them (or I don’t have to send specific emails to each person who asks for them). So here are some of the most popular, published plasmids, in GenBank (.gb) format.

G384A_LLP-iCasp9-Blast

G274A_attB-sGFP-PTEN_IRES-mCherry-P2A-PuroR

attB-EGFP

attB-mCherry

dAAVS1_TetBxb1_mTagBFP2_rtTa3

dAAVS1_CMV-rtTA3-Term

dAAVS1_Dummy

G371A_LLP-rEF1a

G375F_attB-eGFP-rIRES-mCherry

G382C_attB-eGFP-rIRES-mCherry-2A-PuroR

G163A/B_AttB_PuroR-P2A-mCherry_Bgl2

Chemical transformation efficiencies

We do a lot of molecular cloning in the lab. Standard practice in the workflow is to make your own home-made chemically competent NEB 10-beta bacteria to be used fresh the day of the transformation. This has worked surprisingly well, and we have made ~ 350 different plasmid constructs in the first ~ 600 days). Each time you do the transformation, it’s important to include a positive control (we use 40 ng of attB-mCherry plasmid) to make sure the transformation step was performing properly (to help interpret / troubleshoot what may have gone wrong if you had few / zero colonies in your actual molecular cloning transformation plates). I’ve done this enough times now to know what is “normal”. Thus, especially for new members in the lab, please reference this plot to see how your own transformation results compare to how it has worked for me, where I typically get slightly more than 10,000 transformants (Note: you may get numbers better than me, which is a good thing!).

HEK 293T Landing Pad Dose Response Curves

Positive selection for recombined selection is a pretty useful technique when working with the HEK 293T landing pads, but people might not know what concentrations are best. Well, here is that info, all in one place. In each case, the antibiotic resistance gene had been placed after an IRES element in an attB recombination plasmid, and was linked to mCherry using a 2A stop-start sequence. Thus, the Y axis is showing how well the mCherry+ cells were enriched at various concentrations of antibiotic.

Consistent with the above plots, I generally suggest people use 1 ug/mL Puro, 10 ug/mL Blast, 100 ug/mL Hygro, and 100 ug/mL Zeocin for selections.

Quantitating Western blots

Almost everybody hates doing Western blots since they’re so labor intensive (and somewhat finicky), but they are undeniably useful and will forever have a place in molecular biology / biomedical research. We’re currently putting together a manuscript where we express and test variants of ACE2, which requires some Western blotting quantitation. Since I’m about to do some of the quantitation now, I figured I’d just record the steps I do it so trainees in the lab have a basic set of instructions they can follow in the future.

This already assumes you have a “.tif” file of western blot. While I someday hope to do fluorescent westerns, this will likely be a chemiluminescent image. Hopefully this isn’t a scan of a chemiluminescent image captured on film, b/c film sux. So you presumably took it on a Chemidoc or an equivalent piece of equipment. Who knows; maybe I finally found the time to finish / standardize my chemiluminescent capture using a standard mirrorless camera procedure (unlikely). I digress; all that matters right now is you already have such an image file ready to quantitate.

My favorite method for Western blot quantitation is to use “Image Lab” from BioRad. You know, I like BioRad. And I definitely like the fact that they provide this software for free. Anyway, download and install it as we’ll be using it.

Once you have it installed, start it up and open your image file of interest. The screen will probably look something like this:

First off, stop lying to yourself. By default, the imagine is going to be auto-scaled so that the darkest grey values in your image get turned to black, while the lightest grey values get turned white. But in actuality, the image you see is not going to be the “raw” range of your image, so you may as well turn off the auto-scaling so you see your image for what it really is. Thus, press that “Image Transform” button…

… and get a screen that looks like below:

See where those low and high values are? Awful. Set the low value to 0 and the high value to the max (65535).

And now the image looks like this, which is great, since this is what the actual greyscale values in your image actually look like.

OK, now we can actually start quantitating the bands in the plot. First off, don’t expect your western blot to look 100% clean. Lysates are messy, and you can’t always get pure, discrete bands. Sure, some of the lower-sized bands may be degradation products that happened when you accidentally left the lysates out of ice (you should avoid that, of course). Then again, you may have done everything perfectly bench-wise, and the lower-sized bands may be because you’re overexpressing a transgene, and proteins naturally get degraded, and overexpressed proteins may tax the normal machinery and get degraded more obviously. I say the best thing is to acknowledge that happens, show all your results / work, and make the more educated interpretations that you can. Regardless, in that above plot, we’re going to try to quantitate the density of the highest molecular weight band, since that should be the full-length protein. To do that, first select the “Lane and bands” button on the left.

I then press the “Manual” lane annotation button.

In this case, I know I have 11 lanes, so I enter that and get something that looks like this:

Clearly that’s wrong, so grab the handles on the edges and move the grid such that it actually falls more-or-less on the actual lanes.

Sure the overall grid is pretty good, but maybe it’s not perfect for every single individual lane. The program also lets you adjust those. To do that, click off the “Resize frame” button on the left so it’s no longer blue…

And then adjust the individual lanes so they fit the entire band as your human eyes see them, resulting in an adjust grid that looks like this:

Nice. Now go to the left and select the tab at the top that says “Bands”, and then click on the button that says “Add”.

Once you do that, start clicking on the bands you want to quantitate across all of the lanes. You may have to grab the dotted magenta lines in each lane to adjust them so that the actual band is within them (and presumably somewhere near the solid magenta line which should be somewhere in between them). This is what it looks like after I do that:

It’s good to check how well the bands are being seen by the program. Go to the top and press the “Lane profile” button. It should give you a density plot. This is also the window where you can do background subtraction. Find a number that seems sensible (in this case, a disk size of 20 mm seems reasonable), and make sure you hit the “apply to all lanes” button so it propagates this across lanes. While I’m only showing the picture for lane 5, it’s probably worth scanning across the lanes to make sure the settings are sensible.

Now with those settings fine, close out, and then click on “Analysis table” at the top. Once that is open, go to the bottom and click on “lane statistics”. These should be the numbers you’re looking for.

Now export the statistics (either pressing the “copy analysis table to clipboard” button and pasting in an spreadsheet you want to use, or the “export analysis table to spreadsheet”). The number you’ll be looking to analyze will be those in the “Adj. Total Band Vol” column.

Note: Now that I’m doing this, the “standard curve” button is now ringing a bell. I’m fairly certain that in my PhD work, when I ran a ton of Western blots or just straight up protein gels stained with coomassie, that I would run dilutions of lysates / proteins to make a standard curve of known proteins amounts that I could calibrate the densitometry against. We obviously didn’t do that here, since we didn’t have the space, so these numbers aren’t going to be quite as accurate if we had done so. Still, getting some actual numbers we can compare across replicates is still a major step up than not quantitating and having everything be even more subjective.