Using Inkscape

Vector graphics editor software are SUPER useful when trying to make figures. A proprietary program some people may be used to is Adobe Illustrator. I’ve never actually used it, mostly b/c I started using playing around with vector graphics editor software when I was a grad student, and the idea of paying for an Adobe product with my own money was unpalatable. Inkscape is a free & open source software that works on Mac or Windows. It can be a little buggy, but I’ve learned to absolutely love it over the last 10 years or so. To help people in my lab start using it, I’m writing up this primer describing a series of basic processes I perform when using it.

Step 1: Downloading Inkscape. Obviously, this is downloading the program. I’ll assume this is pretty self-explanatory unless someone tells me otherwise.

Step 2: Starting up the program. You should get a blank screen like this, which is the basic workspace. The white rectangle with the drop-shadow is the blank “page”.

Step 3: Adjusting the properties of your page. While you don’t have to start here, I like to make some basic adjustments to the page out of the get-go. To do this, go to File -> Document Properties… or just hit “Command + Shift + D” if you’re on a Mac, like me. A new pane should open up to the right of the program, like below (if it doesn’t, try adjusting the size of your window and that should force a refresh).

The main things I like to change are as follows:
1) The dimensions. Most journal have maximum figure widths between 174 and 180 mm (though handful allow 190 or even 200 mm). It’s probably easier to plan for the smallest size of 174 mm, and just space things apart a bit / give the panels more breathing room if you end up in a situation where larger dimensions are allowed. Thus, adjust the width to 174 mm, and hit enter, and the width of that page should now shrink.
2) So the page settings defaults to *NO* background color (this can be revealed by clicking on that “Checkerboard background” button). So if you put in some shapes / plots, they will be floating in emptiness, rather than a white background. While this isn’t problematic per-se, it usually makes sense to make figures that actually have a white background. Thus, I next click on that box next to the text that says “Background color:” and in the bottom field that has a 0 next to “A:”, I change that to 100 (I’m assuming this is opacity, and we are turning the opacity from 0 to 100). This will make the background white.

Step 3: Saving the file. Now that you’ve made some modifications to your file, you may as well save it (since you’re going to want to start saving the file soon anyway). Go to File -> Save As… and save the file in whatever directory you want to keep it. The file type is will save is a “.svg” file, apparently for an “Inkscape svg” file, which is seemingly a slight derivation of a pretty genetic / standard file type for vector graphics (svg). You can just keep these default settings.

Step 4: Import some existing vector graphics data. So while you can start from scratch making shapes in the program itself, I’m most often using Inkscape to make adjustments to vector graphics plots generated in another program, like R. In this case, I’ll be importing a phylogenetic tree based on the spike protein receptor binding domain sequences from various coronaviruses. Thus, go to File -> Import… or hit “Command + I”, or press that arrow-point-to-a-page button at the top of the screen, to prompt the import screen,.

In the ensuing screen, point to the file you’re trying to import. If it’s a vector graphics -compatible file, then this will most likely be a “.pdf” type file. After you select a file, you’ll get a screen like this:

You can probably just click “OK”, since I’ve found these default settings to generally work fine. All of the shapes will all get imported fine, though I’ve found that the text gets imported kind of weird. Like, the letters are fine if you don’t make any adjustments, but if you try to rewrite any of the letters, you’ll notice that the spacing gets all screwed up. I usually just get around that by replacing the imported text with replacement text that I’ve created using Inkscape, though that gets a bit tedious. I suppose maybe one day I’ll learn if there is a better way to import text from a PDF, but for now, this will do.

Step 5: Adjust the positions and sizes of things. Depending on the dimensions of your image as it was outputted by the preceding program (like R), your image may or may not fit well within your blank page. In my case here, the image was indeed roughly the right size, though I wanted to zoom in a bit to see it in a bit more detail. To do this, I pressed the “+” button my keyboard a couple of times to get to the desired zoom, though you can adjust the Zoom in more detail on the bottom right of the screen (see below):

Another thing you may want to do is adjust the size and position of the image / plot you just imported. If you want to keep everything in the right dimensions, the first thing you’ll want to do is click on the “lock” button making sure the height and width stay in proportion.

Then you can type in whatever number you want into either the W: or H: fields, hit enter, and see if the size was adjusted to what you’d like. Another thing that’s useful is to adjust the position of the image. For example, perhaps this is the first panel of your figure, and you want it in the top left of the page. If that’s the case, enter 0 for both the X: and Y: fields, and hit enter (In many cases, you’ll want at least a couple pixels worth of blank space around the borders, in which case typing in something like 2 is better, but I’m using 0 today to prove a later point). Now the plot should look like so:

Step 6: Ungrouping and adjusting vector objects. As you can see, the border is now missing. Why is that? Well, it’s because in this case, the phylogenetic tree pdf I imported also had a white background, so that white background is now obstructing the borders of the page under it. While this isn’t necessarily problematic, this extra white background is superfluous and may complicate things downstream, so we will now get rid of it. To do so, click on the object, and go to Object -> Ungroup, or just hit the “Command + U” keys like I do. That one object should now have been turned into a bunch of objects, each with its own highlighting, showing that the ungroup command worked.

Unselect everything, and then select only the white background. Once you seem to have selected it, hit delete. If you missed and you deleted the wrong thing, you can undo. If you’re having a hard time selecting the white background because there is something in the foreground that inevitably gets selected first, you can take that foreground element and then hit the “send to back” button at the top of the screen to get that foreground element out of the way.

Once you do that, you should be able to more easily select the background element and hit delete. (In this case, there were apparently two background elements I had to individually select and delete). The end result should now look like this, where you can now see the Inkscape “page” borders again.

Step 7: Adjusting individual elements. In this case, there’s a lot of wasted space because the lines for the outgroup (MERS) is so long. This is also kind of pointless wasted space, since I’m not going to try to make any point about how similar the MERS RBD sequencer is to the rest of them… I’m just using MERS as a “outlier” sample to orient everything else with. To fix this waste of space, I can adjust the lines leading to MERS (the power of vector graphics editors!). To do this, click on the line in question. I did that, and noticed all of the lines in the image were still getting selected. Thus, I hit “ungroup” a few more times until EACH of the individual lines were selectable. I then selected the two lines I wanted to reduce in length: the one leading to the MERS outgroup, and the one leading to the tree. Once you do that, hit the “Edit paths by node” button that’s toward the top left of the screen (see the orange highlight below):

Once you do that, the start and end points of each of those lines are now selectable. Since I want to adjust both lines on the left size (and make it smaller by moving them to the right), click on one of those points, and then Shift-Click on the other point. If they’re both selected, they should look different from the points on the right side of the line.

Once they’re selected, you can move them to the right. I like to do fine manipulating at this scale using the arrow keys on the keyboard. So, I *could* hit the right arrow to move those nodes to the right and thus shortening the lines. Then again, I’d probably have to hit that button ~ 100 times to move it to the right the amount I want. That’s b/c by default, the arrows are “fine” adjustments. To turn the arrows into “coarse” adjustments, hold down the shift key as you press the arrow buttons. I only had to hit “Shift + [Right Arrow]” four times to get those lines to the length I wanted them.

That worked well, but now that connecting line is orphaned and confusing. We can now move that to the right by selecting that, and since we’re not selecting individual nodes but just moving the entire line itself, we can hit “Shift + [Right Arrow]” four times *without* first hitting that “Edit paths by nodes” button. If you already have the “Edit paths by nodes” button enabled (and thus shaded gray), you may need to turn it into the other “Select and transform objects” mode by clicking on the button above the “Edit paths by nodes” button. I’m also now moving everything to the left by selecting everything and entering X: 1 and Y: 1 as I mentioned above. It should now look like so:

Progress! Though adjusting the tree like we did can be disingenuous, since both the adjust lines we shorted (for the outgroup) is the same color & texture as the lines we didn’t adjust (for the rest of the tree). To distinguish between the two, we can click on the lines leading to the outgroup and change its style / texture to a dotted or interval one. That way, we can write in the figure legend later on that the dotted lines leading to the outgroup were adjusted to conserve figure space. To do this, click on the lines in question (holding down shift while doing so, so that all four individual lines can be selected), and then hit the color / button associated with the “Stroke:” label at the very bottom-left of your screen (see orange highlighted circle in the image below). This should open up a new page to the right. If you still have the “Document Properties” pane open from before, you may need to close it out (by hitting the “X” button at the top right of that pane), or just scroll down until you get to the “Fill and Stroke” pane. Apparently one can also get to this pane by hitting “Command + Shift + F”.

While you still have those lines selected, you can click on the tab that says “Stroke style” and then go to the button that says Dashes, and click on a style that has dots or dashes. You may also need to adjust the width of the line, so that it’s a bit more pronounced (I set it to 1 pixel).

The end result should look like this. A much better use of space!

Since this is potentially panel A, let’s make a panel label. To do this, click on the button labeled with “A” on the left side (the “Create and edit text objects” button), and then *drag* a text box roughly in the right area of where you’ll want it. Note: While you can *click* and make a test object, dragging is better because then you can modify the size of that text object later (less relevant for things like figure labels, but *definitely* important for things like figure legends). In this case, I typed in the letter “A”, turned it into bold. Once I did that, I clicked on the now bolded A, and then adjusted the X: and Y: to 1 so it’s now in the top left corner, like so:

Step 8: Aligning some object together. OK, well this figure doesn’t actually need this label, but let’s pretend we want to make a short one-line label / title at the bottom, and we want it right at the center of the tree. Make a text box like before, and type in your label. If you select the text box and are in text editing mode (you may need to click on the “A” button on the left again), you have the option to change fonts, bold / italics, size, line spacing, and line justification / centering in a series of boxes at the top. The one I adjust ALL the time is the line spacing. It defaults to 1.25, but it’s kind of waste of space like that, so I set it to 1.

Once you’ve done all that, you should have a screen that looks like below:

Now for centering the label right in the middle of the tree. To do this, select all of the tree elements (I would click and drag the selection around the tree). If it gets too many things, like the “A” panel label at the top left, do a “Shift + [Click]” to unselect just that. Then you can group all of the tree stuff together for easy manipulation by going to Object -> Group or hitting “Command + G” like me. To now align the figure label in the center of the tree, open the alignment pane. You can get to it by going to Object -> Align and Distribute… , or by hitting “Command + Shift + A” like I normally do.

This pane will allow you to align things based on the right edge, left edge, the center. Or if you have multiple objects that you want to space equally, you can go to the Distribute button options in the middle. Looks like it’s defaulted to “Relative to: Last selected”. That means that if you’re aligning two objects, the object you select first will be moved to the objected you selected second. You can of course change that order if you change the “Relative to:” to “First selected”. But in this case, I’m leaving it as the default, so I first selected the legend, and then selected the tree, and then hit the “Center by Vertical Axis” button that I highlighted in orange. The resulting image looks like this, where now the elements are aligned!

Step 9: Exporting your image. OK, so I’ve done everything I’ve wanted to do for this tutorial, so that last thing will be exporting the image. You have two options here. You can make a PDF that can be imported into whatever program later, by going to File -> Save As… and then instead of saving it as a .svg file, saving it as a .pdf file. (Note: this will change the workspace you’re working in to be a “.pdf”. If you want to have the workspace be an “.svg”, you’ll need to resave into that format. Or if you don’t want to do this roundabout process, you can also do File -> Print, and then select “Print to File”; all this is up to you). But, let’s pretend you want to actually submit this as a figure for a manuscript submission. Usually the desired filetype is as an image file rather than a vector graphic. This means something like an uncompressed .tif or .png file (instead of a compressed .jpg file). To do this, click on File -> Export PNG Image… , or click on the “Arrow coming out of page” button next to the import button at the top of the screen. This will open up another pane to the right:

For novices, I suggest you always click on “Page” for the Export area; this means it will export that full canvas you have been looking at (with the full width of 174 mm or whatever you had set before), rather than just the selected object, etc. You can adjust the dots per inch or DPI, but the default 300 is usually pretty good. And of course, you can and should change where the file is being saved. Once you do that, hit the “Export” button on the bottom. A little pop-up screen will show the export process (should be relatively quick for files like this). And Voila! You’ve made yourself a nice little customized image!

Update 1: Now with Western Blots.
I told myself that I would document the next time I was going to insert some image files into Inkscape for the purposes of this post, and here we are. So let’s now set the scene:

OK, so we have a multi-panel figure in the works here. The only panel relevant to today’s post is panel B, which is going to be a couple of Western Blotting images. In short, we have a negative control lysate, as well as 5 experimental samples where a different protein has been HA-tagged in each lysate. Looks like the first 3 proteins express quite highly, while the last two are expressed at pretty low levels. Thus, I’m going to try putting a lighter exposure image on the left side, and probably a cropped right portion of a longer exposure image showing the last few samples to the right.

Well, first thing to do is press that import button (circled in orange) at the top of the page. Doing that will open up a window like below, where you’ll want to select the image you want to import into your canvas.

Once you choose a file, you should get another screen like the one below. Default settings are fine in my experience, so you can just hit “OK”.

Do this for all of the images you may want. The image files, if they are of pretty normal quality, will be imported as HUGE dimensions on the canvas. Reduce the dimensions of the file to something more reasonable, like 150 pixels. You’ll see yourself in a situation like this:

Now to start aligning the images, the main trick you’ll to using today is “setting a clip”. In other words, instead of cropping the edges off of an image like one would normally do in Powerpoint, you instead make an area that you save from being cropped. Thus, to do this, I start by making a partially transparent box above the relevant bands, like these actin bands here.

I have multiple exposures I’m eventually going to want to line up on top of each other, so now that I’ve the box to be the size I want, I’m first going to make a duplicate of the box by hitting command D. I then move that duplicated box elsewhere on the canvas, so we can get back to it later.

Then, select both the box and the image (note: the box needs to be “above” the blot image), and then right click and hit “Set Clip”.

Doing so reduces the bottom image to the size of the top object. So now it’s the Blot image that has effectively been cropped.

Well, so I didn’t do a great job centering this crop, so I went back and fixed it but didn’t bother to retake the images. But if you’re curious, you can actually do this by right clicking on the object and hitting “Release Clip” to turn them into two separate objects again. Then just resize the box, select everything again, and then right click and hit “Set Clip” again.

Alright; now we can get one of the other exposures cropped and lined up. So now, I’ve moved in one of the other exposures in proximity to the red box. Since I’m going to have to make a larger field of view for this cropping, how straight the image is lined up is going to matter a lot more. For that reason, I drew in a black vertical line to visually inspect how straight it all seems. For the most part looks pretty good.

Allright, so I don’t need that line anymore so I went ahead and deleted it. But as I had mentioned; I want to get a larger window of the blot as compared to actin (a single horizontal band). Thus, I want to keep the width of the box the same but change the height of the box. To do this, click on the image, and you’ll see it have these arrows coming out of it.

Well, just click on the top and bottom arrows and drag to the height that you want it. It should now look something like this.

I can then crop and get the most relevant part of the image. Now that I have the cropped image, I can now align the two images together.

Now move everything in place. If you have another image, like a darker exposure, then you can just go back and repeat some of the previous steps for that one. You can even give them an outline, by duplicating the image, releasing the clip, deleting the repeated image, but turning hte red box into a completely clear box with a solid stroke; essentially, turning it into a see-through frame. Anyway, at this point, you should have a plot like above.

Next, making a bunch of labels to denote what each sample is. You can do that by dragging a text box, and then typing in your label. Once you’ve written the text you want for the first columns, duplicate that text box and replace the existing text with the text for the next sample, and keep doing this until everything is correctly labeled. When trying to make row labels, you may encounter needing to get some special greek characters like I did. In that case, while in the text box, go to Text -> Unicode Characters… and select it (see below).

You’ll get a box that looks like below, where you’ll want to click and select “Greek and Coptic”.

End result after you finish fixing the text should look like so:

Last bit is adding in the size labels, at least for the blot in the middle, since it spans a pretty large size. Alright, well, how would I go about doing this? Well, it’s pretty easy for the imager that we use, since it seems to spit out a combined image of the marker and the exposure, into a file that looks like this when imported into Inkscape (and resized to be the same dimensions of the lower exposure blot from earlier, and aligned on the vertical axis); here they are side-by-side.

If you want to prove to yourself that they are perfectly aligned, you can click both the actual lower exposure image and the marker-exposure overlay image, align them on both the vertical and horizontal axis, and then toggle the order back and forth (make the top image the bottom, and them make the bottom image top, etc). I did this, and they indeed seem to be perfectly aligned / superimposable (as one would expect). So knowing this, I’ll move forward by leaving tick marks where each of those marker labels are. Looks like the ladder we used is this one from GenScript.

Cool, now I can delete the overlaid picture, reset the clip on the original low exposure image, and then move the marker labels to a convenient spot to the left. Voila! (see below).

Additional resources:
I haven’t vetted this tutorial page, but it looks like it could potentially be useful.

Simulating Sampling during Recombination

Once in a while, someone will ask me how many cells they will need to be recombine to get full coverage of their variant library. And I always tell them that they should just simulate it for their specific situation. I essentially refuse to give a “[some number] -fold coverage of the total number of variants in the library” answer, since it really oversimplifies what is going on at this step.

More recently, I was asked how I would actually do this; hence this post. This prompted me to dig up a script I had previously made for the 2020 Mutational Scanning Symposium talk.

But it really all starts in the wet-lab. You should Illumina sequence your plasmid library. Now, practically speaking, that may be easier for some than others. I now realize just how convenient things like spike-ins and multi-user runs were back in Genome Sciences, since this would be a perfect utilization of a spike-in. In a place without as much sequencing bandwidth, I suppose this instead requires a small Miseq kit, which will probably be 0.5k or something. I suppose if the library was small enough, then perhaps one or a few Amplicon-EZ -type submissions would do it. Regardless, both the PTEN and NUDT15 libraries had this (well, PTEN did for sure; I think NUDT15 did b/c I have data that looks like it should be it). This is what the distributions of those libraries looked like. Note: I believe I normalized the values from both libraries to the frequency one would have gotten for a perfectly uniform library (so 1 / the total number of variants):

We definitely struggled a bit when we were making the PTEN library (partially b/c it was my first time doing it), resulting in some molecular gymnastics (like switching antibiotic resistance markers). While I don’t know exactly what went into making the NUDT15 library, I do know it was ordered from Twist and it looked great aside from a hefty share of WT sequences (approx 25%).

Anyway, I’m bringing these up because we can use these distributions as the starting population for the simulation, where we sample (with replacement) as many times as we want the number of recombined cells to be, and do this for a bunch of cell numbers we may consider (and then some, to get a good sense of the full range of outcomes).

So yea, the NUDT15 library had a tighter distribution, and it got covered way faster than the PTEN one, which was rather broad in distribution. Looks like the millions of recombined cells is nicely within the plateau, which is good, since this is the absolute minimum here: you definitely want your variant to get into more than one cell, and the plot above is only one. The curve moves to the right if you ask for a higher threshold, like 10 cells.

But maybe you don’t have your library sequenced? Well, I suppose the worst you could do is just assume your library is complete and uniform, and go from there knowing full well that what you see will be the unattainable best case scenario. I suppose another option is to try to use approximations of the NUDT15 or PTEN libraries as guides for potentially realistic distributions.

So I ran some function that told me what a hypothetical log-normal distribution fitted to the real data was (I’m glossing over this part for today), and apparently the NUDT15 library log-normal mean was -8.92 and log-normal sd was 0.71. On the other hand, the PTEN library log-normal distribution had a mean of -9.58 and sd of 0.99. So using those numbers, one could create NUDT15 or PTEN library-esque distributions of variants of whatever sizes you want (presumably the size of your library). But in this case, I created fake PTEN and NUDT15 datasets, so I can see how good the approximation is compared to the real data.

So it definitely had much more trouble trying to approximate the NUDT15 library, but eh, it will have to do. The PTEN library is spot on though. Anyway, using those fake approximations of the distributions allows us to sample from that distribution, n number of times, with n being one from a bunch of cell numbers that you could consider recombining. So like, one-hundred thousand, quarter million, etc. Then you can see how many members of the library was successfully recombined (in silico).

Well, so the fake NUDT15 library actually performed much worse out of the gate (as compared to the real data), but evened out over increased numbers. PTEN had the opposite happen, where the fake data did better than the real data, though this evened out over increased numbers too (like 1e5). But to be honest, I think the more realistic numbers are all pretty similar (ie. If you’re thinking of recombining a library of ~5,000 members into 10,000 cells, then you are nuts).

So based on this graph, what would I say is the minimum number of recombined cells allowed? Probably something like 300k. Though really, I’d definitely plan to get a million recombinants, and just know that if something goes wrong somewhere that it won’t put the number below a crucial region of sampling. Well, libraries of approximately 5,000 variants, and a million recombinants. I guess I’m recommending something like 100 to 200 -fold coverage. Damn, way to undercut this whole post. Or did I?

BTW, I’ve posted the code and data to recreate all of this at

Also, I dug up old code and tried to modify / update it for this post, but something could have gotten lost along the way, so there may be sone errors. If anyone points out some major ones, I’ll make the corresponding fix. If they’re minor, I’ll likely just leave it alone.

Dual index on Miseq and Nextseq

We’re planning on submitting some dual indexed, paired read, amplicon sequencing samples, and depending on how many we have, we may submit for sequencing on a Miseq or Nextseq. Since this is all custom (and we generated the indices), I had to figure out how exactly how the process plays out for the two instruments to see what we need to the sample sheet for demultiplexing. I figure I’d illustrate it out for people in the lab so they understand the process as well.

The common steps

Everything starts with bridge amplification, where both the forward and reverse strands are physically bound to the flow cell by their 5′ ends. I’m denote where the DNA strands are physically bound to the flow cell with the grey circles on the ends.

Next is denaturing the strands so they are no longer bound, and also cleaving the strand that is bound to the flow cell via the p5 sequence. The result is something that looks like this.

Now the single stranded molecule is ready for sequencing from the read 1 primer. The light blue box is showing the nucleotides we’ll be reading in this particular library.

Once that read is done, next is reading the first index.

The dissimilar steps

We didn’t do a ton of dual indexing of the libraries in my postdoc (and since Jason used to run all of the kits anyway), I didn’t really need to know these steps, but I’ve had to figure them out now setting everything up in my own lab. I’ll go over what happens with the Nextseq first, since that’s conceptually a bit easier. Here’s what the Illumina docs say.

The Nextseq way

So this setup allows for dual indexing read off of two custom primers. So for this specific library, this means that the complementary sequence is going to be synthesized, making a double-stranded bridged molecule again.

And then after everything is denatured and the original template strand removed (presumably by cleaving at the p7 sequence this time), the second index can now be sequenced.

Followed by sequencing of the second read.

The Miseq way

Looking at what Illumina says in their documents for dual indexing on the Miseq, it looks a little different:

So the clear difference here is that the second index for the Miseq system is without a second custom index seq primer, but instead *during* the second strand synthesis step primed by the p5 oligo. With our specific library, it will look like this, starting with sequencing of that second index:

Note: I didn’t realize this until we did the exercise, but the p5 cluster generator we used to use in the Fowler lab is longer than the actual p5 sequence Illumina gives in their manuals. Not quite sure for this discrepancy, though I’m assuming that may mean that the sequences immediately after the p5 oligo during this step won’t be our rather variable index sequences, and instead may be some constant bases preceding the indexes. I’m guessing this is not a deal-killer, but something we’ll still have to be cognizant of when determining the run programming.

After that, is the complete second strand synthesis, resulting in a double stranded bridged molecule again.

Followed by denaturing and cleavage (again, presumably of p7 sequence) as described above, followed by annealing of the read2 primer.

So why does this matter?

Well, I think the practical implications are a few-fold. Firstly, Miseq dual indexing won’t need that second custom index read primer, since it will be reading off of the p5 sequence. And this is further complicated by the fact that the p5 adapter sequence we added onto the amplicons may perhaps be a bit longer than what’s actually on the chip, so we may have to factor this into the run parameters. Secondly, the strand that is reading that second index is different. With the Miseq, it’s being read off that first strand and thus in the same orientation as index 1. On the Nextseq, it’s being read on the second strand, so it will be read in the opposite order as index 1. Thus, I think this matters in terms of whether we’re putting the forward or reverse complement in the sample sheet, which will differ depending on whether we’re going Miseq or Nextseq with these samples.

Estimating coverage for NNK SSM transformations

We were recently doing some small scale Gibson-based NNK site saturation mutagenesis PCR reactions. In this scheme, we are independently transforming each position separately, so the number of transformants (ie. colonies) on a given plate should be directly related to the likelihood that all of the desired variants that we want to see are there at least once.

In fact, there are three parameters that factor into how good the variant coverage is at a given position. This is going to be 1) nucleotide biases in the creation of the NNK degenerate region of the primer, 2) the number of transformants, and 3) the fraction of the number of total transformants are actually variants, rather than undesired molecules such as carryover of the WT plasmid used for the template.

For any given experiment, you’re not going to know what the nucleotide bias is like until you actually Illumina sequence your library…. but at that point, you’ll already know the variant coverage of your library, so no need to estimate it anymore. On the other hand, if you know the nucleotide biases you observed for similar libraries, then you can do this estimation far before you get around to Illumina sequencing. Based on previous libraries, I have a pretty good idea of what the biases from machine-mixed NNK primers from IDT are like. For simplicity sake, I’m using 40% G, 20% C, 20% A, and 20%T as a rough estimate for the nucleotide bias I saw in the most biased NNK libraries.

The other two parameters are going to be very much experiment specific, and can be determined shortly after generating the library. The number of transformants can be determined by counting colonies from the transformation. And the amount of template contamination can be roughly determined by performing Sanger sequencing on a handful of colonies from those plates. Thus, I chose a few reasonable values for each: colony counts ranging from the very small (10 and 20) to quite large (400 and 1000), and template contamination percentages from almost impossibly low (0%) and much more likely (10 or 20%) all the way to possibly prohibitively high (50% and 75%). I then simulated the entire process, bootstrapped 20 times to get a sense of the average output, and made a plot showing what types of variant coverages you get depending on the combinations of those observed parameters. This is what the plot looks like:

So there you go. In a reasonable condition where you have, let’s say 10 or 20% template contamination, then you’d really be hoping to see at least 200 colonies, and hopefully around 400, where you can then really pat yourself on the back. If things went awry with the DPNI step, for example, and you were getting between a quarter to a half of colonies being template, then you’d minimally want 400 or so colonies and don’t feel too safe until you got a fair bit more than that. Though that’s only to make sure you at least have one copy of every variant at that position. If your library is half template, then chances are you’ll be running into a bunch of other problems down the line.

HEK 293T Bxb1 Landing Pad recombination protocol

Due to popular request, I’m going to put my *most current* version of the HEK 293T Landing Pad recombination protocol here for others’ benefit. Much of the credit goes to Sarah Roelle, who wrote up this current version of the protocol.


[This protocol is for a 24-well plate:]
Day 1:

1) Make 2 transfection mixtures per sample:
Tube 1: 23 μL Opti-MEM + 1 μL Fugene6

Tube 2 (If using cells that don’t already express the Bxb1 recombinase enzyme): μL of DNA corresponding to 16 ng Bxb1 plasmid + μL corresponding to 224 ng attB plasmid and OPTI-MEM to a final tube volume of 24 μL.
Tube 2 (If using cells that already express the Bxb1 recombinase enzyme): μL of DNA corresponding to 240 ng attB plasmid and OPTI-MEM to a final tube volume of 24 μL.

2) Mix Tube 2 into Tube 1 for each sample. Mix up and down a couple times with pipet. Then let the mixtures sit for 15-30 min while you get the cells ready (Unless you trypsinized and counted the cells first, to know how many you had in case they were limiting).

4) [Meanwhile] Trypsinize and count the cells. Add 120,000 cells per well in a final volume of 300 μL media to each well.

5) Once at least 15 minutes have passed since mixing, add the mixtures dropwise throughout the well of cells being transfected.

Day 2:
Add at least 500 μL media to each well.

Notes about scaling up / down:
We typically pilot new plasmids in a 24-well scale, and transfect larger populations of cells by transfecting 6-wells and increasing the number of plates / wells as needed.

Negative selection (if applicable):

Negative selection with iCasp9 is wonderful, and hopefully you are using one of those landing pad versions. I typically use 10nM final concentration of AP1903, although I’m fairly certain 1nM is just as effective. Death occurs in a couple of hours, so you can come back to your plate / flask later on in the day and change media to get rid of the dying cells. I typically wait until at least 72 hours after recombination to perform this step.

Important note: If you’re doing a library-based experiment, then make sure you leave some cells aside (even 100k to 500k cells will be plenty) that DO NOT go through any selection steps, since this will allow us to estimate the number of recombined cells (and thus estimate the coverage of your library; see below for the calculation). If you’re just doing individual samples, this isn’t nearly as big of a deal, since you’ll be able to visually inspect the number of recombinants (ie. do you see only a handful of surviving cells, or is it 100+ individual cells surviving?).

Positive selection (if applicable):

I tend to do the positive selection step AFTER negative selection, since the negative selection step has usually thinned the number of cells down enough such that the positive selection is going to be most effective (I find that over-confluent wells tend not to do super-grant with positive selection). Thus, while it can be done as soon as 72 hours post recombination, I tend to do this a week or later after recombination. You can find effective concentrations for positive selection of landing pad HEK 293T cells here.

Some additional notes:

  1. Due to a phenomenon of (annotated) promoter-less expression from the thousands of un-recombined plasmids that remain in the cell following transfection, I typically wait ~ 5 to 7 days before running any cells in the flow cytometer, to wait for that background signal to die down.
  2. Other transfection methods may work, but will need to be optimized. For example, I have observed transfection with lipofectamine 3000 to result in prolonged promoter-less expression, probably due to increased transfection and greater toxicity preventing cell division and thus plasmid dilution.
  3. Calculating the number of recombinants. OK, so as I alluded to above, it’s definitely worth estimating the number of recombinants of any library-based experiments. To do this, take your observed recombination rate from running flow on your unselected cells (say, you see 5% of cells as being mCherry+ / recombined when you ran flow on unselected cells 7 days after transfection), and multiply that fraction with the number of cells you transfected. So, if we pretend that you transfected 20 million HEK 293T cells and you had observed 5% of cells recombined in your unselected well, then the rough calculation is 2e7 cells * 0.05, amounting to roughly 1 million cells estimated to have been recombined. Of course, directly sequencing what’s there in the recombined library is a more direct measurement of library coverage at the recombined cell step, but doing this calculation is still usually worthwhile as another line of evidence.

Conda virtual environments

If you’re going to do things with the bash command line, you’ll inevitably have to install a number of packages / dependencies. Having a package manager can help with this, along with virtual environments, that allow you to keep the versions of packages you need for different purposes relatively organized. For package managers, I have tended to like Anaconda the most. So install that. I’m not going to go through the steps here, since I already have it installed, so I’d really have to go out of my way to try to describe that. But it should be pretty straightforward.

Well, I want to create and run a script using biopython, but I don’t already have it installed on this computer, so this seems like a nice time to make a virtual environment for it. The steps I did were as follows:

  1. Create a virtual environment for biopython.
    $ conda create -n biopython
  2. Activate the virtual environment
    $ conda activate biopython
  3. Next, install biopython.
    $ conda install -c conda-forge biopython

That should be it. Whenever you want to deactivate the virtual environment, you can type:
$ conda deactivate

5/17/22 edit: I’ve been doing some image analysis in python, that requires installing some of the following packages:
$ conda config –env –add channels conda-forge
$ conda install numpy
$ conda install matplotlib
$ conda install scipy
$ conda install opencv
$ conda install -c anaconda scikit-image
$ conda install -c conda-forge gdal

Analyzing Illumina Fastq data

We recently got some Illumina sequencing back from GeneWiz, and I realized that this is good opportunity to show people in the lab how to do some really basic operations to handle such types of sequencing data, so I’ll write those instructions here. Since essentially everybody in the lab uses a Mac as their primary computer, these instructions will be directly related to performing these steps on a Mac, though the same basic steps can likely be applied to PCs. Also, since these files are small, everything will be done locally; once the files get big enough and the analyses more complicated, we’ll start doing things on a computer cluster. Now to get to the actual info:

  1. First, find the data we’ll be using for practice today. If you’re in the lab, you can go to the lab GoogleDrive into the Data/Illumina/Amplicon_EZ/30-507925014/00_fastq directory to find the files.

We won’t need to analyze everything there for this tutorial; instead, let’s focus on the “KAM-IDT-Std_R1_001.fastq.gz” and “KAM-IDT-Std_R2_001.fastq.gz” files.

2. Copy the files to a directory on your local computer. You can do the old “drag and drop” using the GUI, or you can do it in the command line like so, once you adjust the paths for your own computer:

$ cp /Volumes/GoogleDrive/My\ Drive/MatreyekLab_GoogleDrive/Data/Illumina/Amplicon_EZ/30-507925014/00_fastq/KAM-IDT-Std_R1_001.fastq.gz /Users/kmatreyek/Desktop/Illumina_data

$ cp /Volumes/GoogleDrive/My\ Drive/MatreyekLab_GoogleDrive/Data/Illumina/Amplicon_EZ/30-507925014/00_fastq/KAM-IDT-Std_R2_001.fastq.gz /Users/kmatreyek/Desktop/Illumina_data

3. Un-gzip the files. You can do this in the GUI by double-clicking the files, or you can do it in the terminal (if you’re now in the right directory) like so.

$ gzip -dk KAM-IDT-Std_R1_001.fastq.gz

$ gzip -dk KAM-IDT-Std_R2_001.fastq.gz

Optional: Take a look at your fastq files. You won’t want to open the files in their entirety, so what makes more sense it just looking at the first 4 or so lines of the file, corresponding to the first read. To do this, type:

$ head -4 KAM-IDT-Std_R1_001.fastq

And you should get an output that looks like so:

4. They won’t always be paired reads, but this time it is. So we’ll pair them. I you don’t already have a method for doing this, then download PEAR and install it like I described here. Once you have it installed, you can type in a command like so:

$ pear -f KAM-IDT-Std_R1_001.fastq.gz -r KAM-IDT-Std_R2_001.fastq.gz -o IDT_HM

It took my desktop a couple minutes for this process to complete. You’ll get an output that looks like this.

Your directory should now have all of these files:

You can look at the first read again (now that it’s been paired), using the following line, it should look like so:

$ head -4 IDT_HM.assembled.fastq

As you can tell, the quality scores in the first line went from mostly F’s (Q-Score of 37) to almost all I’s (Q-Score of 40).

5. Now that we’ve prepped the Illumina data, it’s time for the downstream analysis. This will be far more project or experiment specific, so these next steps won’t apply for every situation. But in this case, we made a library of Kozak variants to try to get a range of expression levels of the protein of interest. Furthermore, the template DNA used for the PCR lacked a Kozak sequence and a start codon, and these will inevitably be in the sequencing data also. So the goal of this next step is to identify the reads that are template vs those that have the Kozak sequence, and if it does have a Kozak and ATG introduced, to extract the Kozak sequence from the read.

I went ahead and wrote a short python script that achieves this. So, grab that file (ie. Right click, and it “Save link as…”), stick it in the same directory as the data you want to analyze, and run it with the following command.

$ python3 IDT_STD.assembled.fastq

The script should then create a file called “IDT_STD.assembled.tsv” that should look like this:

This can now be easily analyzed with whatever your favorite data analysis language is, whether it’s R or Python. Huzzah!

Dealing with paired fastq reads

10/25/2022 update:
Ran into some paired sequences that didn’t work well with fastq-join, so went back to pear. Found an easy way to install it via bioconda, using the following command. So simple.

conda install -c bioconda pear

July 19, 2022 update:
OK, I’m trying to reinstall PEAR on my new-ish laptop and running into errors. After struggling a bit, I decided to switch over to using Fastq-join. This is way easier to install and run, as you can see below:

Based on the information on this website, and assuming you have anaconda installed, run:
conda install -c bioconda fastq-join

After that, actually pair your files running a command like:
fastq-join ACE2-Kozak-mini-library-pooled_R1_001.fastq ACE2-Kozak-mini-library-pooled_R2_001.fastq -o ACE2-Kozak-mini-library-pooled.fastq

Note 1: You can run fastq-join on gzipped files, like so:
fastq-join No-infection-maybe_R1_001.fastq.gz No-infection-maybe_R2_001.fastq.gz -o No-infection-maybe.fastq.gz

Note 2: If working with gzipped fastq data, it’s a little trickier to look at your files compressed files. To look at the first read, for example, you need to use a command like this:
gunzip -c SARS2-MOI-0pt1-1_R1_001.fastq.gz | head -4

Here’s the original (now deprecated) text in this post:

Got some paired Illumina sequencing back today, and wanted to pair the R1 and R2 fastq files. I vaguely remember using PEAR to do this before, so gave it a shot. Since the files are relatively small, I figured I’d just try to run this locally, which meant trying to install it (and it’s various dependencies) on my Macbook. Here are the steps…

  1. Download “autoconf-2.65.tar.gz” from this website. Uncompress it, and then go into the now uncompressed directory, and type in “./configure && make && make install”

2. Download “automake-1.14.1.tar.gz” from this website. Uncompress it, and then go into the now uncompressed directory, and type in “./configure && make && make install”

3. Get PEAR academic from this website by entering in your info, receiving an email, and downloading “pear-src-0.9.11.tar.gz”. Uncompress it, and then go into the now uncompressed directory, and type in “./configure && make && make install”

Huzzah. It should now be installed.

Now you can go to whatever directory has your R1 and R2 files, and type in something like so:
“pear -f KAM-IDT-HM_R1_001.fastq -r KAM-IDT-HM_R2_001.fastq -o IDT_HM”

You should now have a file called “IDT_HM.assembled.fastq” that has all of the joined paired reads.

5/18/21 Update: I was working on something else that required trimming reads. I decided to use fastp to do this. The instructions to install fastp is as follows (essentially following the instructions here):

  1. Download the zip file (under the “Code” button) from, and unzip it.
  2. Use terminal to go to the directory that is made, and run
    $ make
  3. Run the following command:
    $ sudo make install
    … and then type in your computer password when prompted. Fastp should now be installed.

Landing pad plasmid maps

I still intend to get around to posting most if not all landing pad plasmids to Addgene, but it’s taken me forever to get around to it. Once those are there, the plasmid sequences will obviously be publicly available. In the mean time, I figure I’d post some of the most common plasmid maps here, so other people can benefit form them (or I don’t have to send specific emails to each person who asks for them). So here are some of the most popular, published plasmids, in GenBank (.gb) format.

G542A_pLenti-Tet-coBxb1-2A-BFP_IRES_iCasp9-2A-Blast_rtTA3 (Addgene #171588)

G698C_AttB_ACE2_IRES_mCherry-H2A-P2A-PuroR (Addgene #171594)

G758A_AttB_ACE2(del)-IRES-mCherry-H2A-P2A-PuroR (Addgene #171596)





attB-EGFP (Addgene #171597)

attB-mCherry (Addgene #171598)














Chemical transformation efficiencies

We do a lot of molecular cloning in the lab. Standard practice in the workflow is to make your own home-made chemically competent NEB 10-beta bacteria to be used fresh the day of the transformation. This has worked surprisingly well, and we have made ~ 350 different plasmid constructs in the first ~ 600 days). Each time you do the transformation, it’s important to include a positive control (we use 40 ng of attB-mCherry plasmid) to make sure the transformation step was performing properly (to help interpret / troubleshoot what may have gone wrong if you had few / zero colonies in your actual molecular cloning transformation plates). I’ve done this enough times now to know what is “normal”. Thus, especially for new members in the lab, please reference this plot to see how your own transformation results compare to how it has worked for me, where I typically get slightly more than 10,000 transformants (Note: you may get numbers better than me, which is a good thing!).

Edit 2/24/2022: It recently dawned on me that we haven’t necessarily been using a standard number of bacteria per transformation (ie. we typically prep the same amount of bacteria, and that bacteria normally just gets split into however many transformations are being done at the time. Thus, if there are 10 plasmids being transformed, the number of bacteria per transformation is ~ one fifth of the number of bacteria for when two plasmids are being transformed.

Well, since I had a bunch of data recorded for our positive control transformations (40ng of attB-mCherry plasmid), I decided to see what number of bacteria we had used and if there were any patterns in the transformation efficiencies:

So it looks like, depending on the instance, we used anywhere between 300 million to 10 billion bacteria per transformation. Notably, there was no real pattern in the transformation efficiency, were on average, it worked just as well with the half a billion as it was for 5 billion. Thus, it’s likely that the limiting factor at this point is the DNA and the bacteria are in vast excess.

For those curious, I had previously mentioned that 1mL of our e.coli at an OD of 1 corresponded to 5e8 bacterial cells / colony forming units. Thus, for a 180 mL culture at an OD of 0.2, I would do [5e8 cells/mL at OD1] * [0.2 of OD1] * [180 mL] * [1 billion cells / 1e9 cells] = ~ 18 billion cells. So if we were doing 9 transformations with this culture, it would be about 2 billion cells for each transformation.