Example Barcoded Variant Library Counts

As part of the AVE-ETS, we had been discussing barcoded variant libraries. I don’t quite remember the context, but I think I suggested we look at some real sequencing data from a barcoded variant library, and I offered to dig up the PTEN VAMP-Seq library data. Of course, I had other things to do in the month following this statement, so I didn’t look for those files until the long weekend immediately prior to the next meeting. My original plan was to just find this blog post [https://www.matreyeklab.com/simulating-sampling-during-recombination/1175/] and say we could use that, but then I realized that those data tables were for variant frequencies, not barcode counts. All of my old data from my postdoc was on flash drives in the office at work, but I didn’t feel like making the commute in just for that. So I decided to try to find and re-process the raw data uploaded to GEO / SRA.

First, a number of steps that aren’t explicitly coded in this markdown file.

  1. I downloaded the files from the relevant GEO sites. The first dataset from the NatGenet paper can be found here [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108727]. The second dataset from the Genome Med paper where we published on a “fill-in” library we created to reintroduce some missing variants can be found here [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159469].
  2. I counted the barcodes using the original method we had used, which was just having Enrich2 do it, with a minimum quality score filter of 30.
  3. I imported the data into R to do some analyses, shown below.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_bw())
theme_update(panel.grid.minor = element_blank())
first_1 <- read.delim(file = "Data/SRR6437841.tsv.gz", sep = "\t")
first_2 <- read.delim(file = "Data/SRR6437842.tsv.gz", sep = "\t")

first <- merge(first_1, first_2, by = "X")
first$count = rowMeans(first[,c("count.x","count.y")])

ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = first %>% filter(count > 3), aes(x = count)) + geom_vline(xintercept = 100)
## As a rough but effective approach, I like to look at the histogram of read counts and find the relative minimum between the population of barcodes with counts near 1 (largely erroneous barcodes arising from sequencing error) and the next non-zero population (assuming the sample was sequenced to sufficient depth). For this sample, since it was sequenced so deeply, that minimum is around a count of 100.

first_filtered <- first %>% filter(count > 100)
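
As an aside, that threshold could also be picked programmatically rather than by eye. Here’s a minimal sketch of one way to do it, assuming the log10 count distribution really is bimodal; this wasn’t part of the original analysis.

# Sketch only: find the local minimum of the log10 read-count density and use
# it as the count threshold, instead of eyeballing the histogram.
d <- density(log10(first$count[first$count > 3]))
local_min <- which(diff(sign(diff(d$y))) == 2) + 1  # indices of local minima in the density
threshold <- 10^d$x[local_min[1]]                   # first minimum, back on the count scale
threshold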

#first_key <- read.delim(file = "Data/GSE108727_PTEN_barcodeInsertAssignments.tsv", sep = "\t", header = F)
first_key <- read.delim(file = "Data/first_key.tsv", sep = "\t", header = F)
colnames(first_key) <- c("X","variant")

first_df <- merge(first_filtered, first_key, by = "X", all.x = T)

First_lib_histogram <- ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = first_df, aes(x = count), fill = "grey90", bins = 50) +
  geom_histogram(data = first_df %>% filter(!is.na(variant)), aes(x = count), bins = 50) + 
  geom_vline(xintercept = 100) + 
  labs(x = "Num of reads", y = "Count", title = "Lib1: Grey, all barcodes; Black, subassembled variants") +
  theme(panel.grid.minor = element_blank()) + 
  NULL; First_lib_histogram
ggsave(file = "Output/First_lib_histogram.pdf", First_lib_histogram, height = 4, width = 5)

Some notes for the above plot. The grey bars of the histogram denote counts for all barcodes that were observed. The black bars are barcodes that were linked to a particular PTEN coding variant via PacBio subassembly.
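
For what it’s worth, the same grey-vs-black comparison can be summarized as a single number; a quick sketch (not part of the original post):

# Sketch only: fraction of count-filtered barcodes that were linked to a
# variant by the PacBio subassembly (the black portion of the histogram).
mean(!is.na(first_df$variant))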

Now, let’s look at the next library.

second_1 <- read.delim(file = "Data/SRR12818211.tsv.gz", sep = "\t")
second_2 <- read.delim(file = "Data/SRR12818212.tsv.gz", sep = "\t")

second <- merge(second_1, second_2, by = "X")
second$count = rowMeans(second[,c("count.x","count.y")])

ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = second %>% filter(count > 1), aes(x = count)) + geom_vline(xintercept = 10)
## As a rough but effective approach, I like to look at the histogram of read counts and find the relative minimum between the population of barcodes with counts near 1 (largely erroneous barcodes arising from sequencing error) and the next non-zero population (assuming the sample was sequenced to sufficient depth). For this sample, it's around 10.

second_filtered <- second %>% filter(count > 10)

second_key <- read.delim(file = "Data/Second_key.tsv", sep = "\t", header = T)
colnames(second_key)[2] <- "variant"

second_df <- merge(second_filtered, second_key, by = "X", all.x = T)

Second_lib_histogram  <- ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = second_df, aes(x = count), fill = "grey90") +
  geom_histogram(data = second_df %>% filter(!is.na(variant)), aes(x = count)) + 
  geom_vline(xintercept = 10) +
  labs(x = "Num of reads", y = "Count", title = "Lib2: Grey, all barcodes; Black, subassembled variants") +
  theme(panel.grid.minor = element_blank()) + 
  NULL; Second_lib_histogram
ggsave(file = "Output/Second_lib_histogram.pdf", Second_lib_histogram, height = 4, width = 5)

## I have no idea why it looks bimodal, btw. Probably a problem with library mixing.
write.table(file = "Output/First_df.tsv", first_df, quote = F, row.names = F)
write.table(file = "Output/Second_df.tsv", second_df, quote = F, row.names = F)

For anyone who wants to test this, the GitHub repo for the above scripts and data can be found here: https://github.com/MatreyekLab/Barcodes

The Hall of Unexpectedly Cloned Plasmids

Testing out Plasmidsaurus’s zeroprep service (re)opened my eyes to the really weird and unexpected recombinant DNA plasmids that can be made through our standard cloning pipeline (largely utilizing PCR amplification and Gibson assembly). Here are some fun examples.

s/o to pLannotate for making the annotated plasmid maps I’m using as the visuals.

The intended construct "L149"
What I got. # The head-to-head fusion is kind of unusual. But also, where it did the split is funny, where it didn't even give me the full PDCH19 sequence after all that.
The intended construct "L153"
What I got. # I think I've seen this happen before. One intended construct fused in head-to-tail orientation with a linerized template plasmid.

Barcoding the epilepsy vector

We have a redesigned attB vector purposefully made to carry and barcode a bunch of epilepsy-associated membrane protein genes (and to a lesser extent, cytoplasmic and secreted protein genes). We’ll eventually need to make a number of barcoded libraries from it, so we’ve been figuring out the kinks of barcoding at a rare-cutter site. I’m realizing that if I want things to scale (whether it’s across labs, or even in the lab), it probably makes sense to make things as easy for the next person to pick up as possible. So, I’m showing my work in terms of how barcoding can be analyzed with this vector.

Alright, so to QC the barcoding vector, we’re trying two things: an initial Plasmidsaurus run (quick turnaround, hundreds to low thousands of reads, sufficient to estimate unbarcoded contamination), and a submission to the Genewiz / Azenta (formerly Brooks Life Sciences) 2x250nt MiSeq Amp-EZ service (2-week turnaround, hundreds of thousands of reads, likely enough to fully analyze small barcoded libraries). I’ll talk about the Plasmidsaurus reads on another day.

Today is figuring out how many barcodes seem to exist per prep with the Amp-EZ data.

First was pairing the reads.

% pear -f BC-L062-G2_R1_001.fastq.gz -r BC-L062-G2_R2_001.fastq.gz -o L062_G2
Next was treating the sequence next to the barcode as adapters to identify the barcodes themselves.
% cutadapt -a ATAAGATCTGGTCCTCTGATCCGA...CTATCGGTAACGCATTCGCC -o G2_linked.fastq L062_G2.assembled.fastq

% sh Tally_sequences.sh G2_linked.fastq G2_linked_tally.csv
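
For context, here is a rough base-R sketch of what a tallying step like this amounts to; this is not the actual Tally_sequences.sh script, just the same idea of counting each unique trimmed sequence.

# Sketch only: tally how many times each trimmed sequence appears in the FASTQ.
lines <- readLines("G2_linked.fastq")
seqs <- lines[seq(2, length(lines), by = 4)]  # every 4th line, starting at line 2, is a sequence
tally <- sort(table(seqs), decreasing = TRUE)
write.csv(data.frame(sequence = names(tally), count = as.integer(tally)),
          "G2_linked_tally.csv", row.names = FALSE)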

At this point, I have a csv file, which I imported into R since I'm more nimble there (obviously one can do something similar in Python). If the adapters were indeed found, then the resulting read is returned as the 20nt barcode sequence between them. If the adapters weren't found (e.g. if the plasmid was unbarcoded, or if the read had so many errors that the adapters weren't identified), then the full sequence is returned. Thus, I can subset for reads that were 20nt (barcoded) vs those that were not (unbarcoded), resulting in the following histogram based on that designation.
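
In case it’s useful, here’s a minimal sketch of that designation step; the column names follow the hypothetical tally file sketched above rather than my actual code, and it assumes the tidyverse is already loaded.

# Sketch only: flag 20nt sequences as barcodes, everything else as unbarcoded,
# then plot the read-count histogram split by that designation.
tally <- read.csv("G2_linked_tally.csv")
tally$barcoded <- nchar(tally$sequence) == 20

ggplot(tally, aes(x = count, fill = barcoded)) +
  scale_x_log10() +
  geom_histogram(bins = 50)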

For AEZ035_G1, 83% of the reads were barcoded, whereas 17% were not. This included 11.5% of the reads being the unbarcoded template. While we could get a bit fancier with things like error-correcting the barcodes (e.g. to account for sequencing error, which presumably shows up as low-count barcodes a small Hamming distance away from true barcodes with much higher counts), for today’s purpose, I’m just going to use a minimum threshold value, such as 7, to distinguish likely true barcodes from the likely erroneous non-barcodes in the “barcoded” subset. This yields a plot like so:
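
Code-wise, the thresholding itself is a one-liner; a sketch, continuing from the hypothetical tally data frame above (whether the cutoff is inclusive or exclusive matters little here):

# Sketch only: keep 20nt barcodes at or above the minimum count threshold and
# count how many unique barcodes remain.
true_barcodes <- subset(tally, barcoded & count >= 7)
nrow(true_barcodes)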

Great, so based on this relatively simple analysis scheme, there seem to be ~5,400 unique barcodes in this sample. We also repeated this process in another independently derived sample. What does that look like?

For this independent barcoding attempt, 92% of all reads were barcoded, and the remaining 8% were something else (with 3.5% being clearly unbarcoded plasmid).

Based on the same analysis scheme, there are ~5,100 unique barcodes in this sample. So despite the slight difference in the fraction of unbarcoded reads between the two attempts using Nidhi’s Gibson barcoding protocol, the total number of unique barcodes seemed to be similar.

Finally, since we have two independent barcoding attempts with a highly diverse oligo (N20, so a potential diversity of ~1.1 trillion barcodes), we would expect very little overlap in barcodes between these two datasets. Does that actually play out? Well, here are the actual results: G1 barcoded library only, 5,439; G2 barcoded library only, 5,095; and barcodes found in both, 34. So, pretty non-overlapping… that’s good! The vast majority of these overlapping barcodes had reasonably high counts in the G2 library, but counts barely above the threshold filter (9, 10, 11, 12) in the G1 library. Thus, they are likely due to sequencing errors, which could be handled either by being more stringent with the threshold, or by taking a more involved barcode error-correcting approach (perhaps to come in a future blog post).
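
The overlap comparison itself is just set operations; here’s a sketch, where g1_barcodes and g2_barcodes are placeholder names for the thresholded barcode sequences from the two preps (not objects from the original code):

# Sketch only: compare the two thresholded barcode sets.
length(setdiff(g1_barcodes, g2_barcodes))    # barcodes unique to G1
length(setdiff(g2_barcodes, g1_barcodes))    # barcodes unique to G2
length(intersect(g1_barcodes, g2_barcodes))  # barcodes found in both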

Plasmidsaurus decision tree

Plasmidsaurus whole-plasmid nanopore sequencing is a fantastic service. As of today (3/25/25), we’ve sent 1,357 samples to them(!!), and they’ve essentially been worth every penny. But there are a bunch of different reasons to use them, and I wanted to make sure everyone in the lab was on the same page in terms of which reasons are well justified vs which are more arguable, so I made the following decision tree (this also codifies the lab policy of sequencing plasmids from unverified sources before working with them). For anyone in the lab: let me know if you think we should make any specific changes to it (I tried to remember what we discussed in lab meeting, but I may have forgotten something).

Time investments

As mentioned in a previous post, starting January 20th, 2023, I began doing detailed accounting of how many minutes I was spending each day performing different types of work. This was largely motivated by my involvement on a departmental “committee” where there were around 3 other members assigned (so, theoretically, everybody could be doing ~25% of the necessary work), but it became clear that nobody was really going to do anything, so I had to do >80% of the necessary work to keep the whole thing from failing outright. As of today (March 21, 2025), this time-stamped spreadsheet of my activity log has 5,075 rows. Aside from allowing me to keep “receipts” of how much of my actual effort was going toward this piece of departmental service (rather than the assumed ~25%), it had the secondary benefit of letting me actually quantitate how long I was spending on any given work-related item (whether it was service-related or not). Here are some activities I was able to assess.

Manuscripts: So for the last 4 manuscripts from the lab, it’s taken me, on average, ~150 hours to go from “OK, time to start putting together the manuscript” to having it done. Actually, the number is probably closer to 200 hours on average, since two of these are still ongoing, and at least one of them I’ve been working on since like 2021. But yeah, it really is still a big lift for me to get a paper published from the lab. Which, in one sense, is probably what it should be. But it is also exhausting.

Presentations: There were four presentations during that span, and on average it took me about 20 to 25 hours to prep for each. I suppose this is because each presentation was on a different topic (necessitating starting from “scratch”).

Grant writing: This is going to be a completely anomalous number, since all of my previous major grant applications were written before I started collecting this data. Of the applications during this span, two were letters of intent (LOIs), and thus just short ~2-page descriptions, and one was a full application in an LOI-like condensed format (that one was also a joint grant application, so I didn’t need to pull all of the weight on it). Next time I need to write something that is R01-sized, I imagine it will take me double that number, if not more.

Rotations: Each PhD student rotation seems to take about 15-20 hours of my time. Probably a somewhat moot point for next year, though, since it’s unclear whether I could / should take another student at that point.

Teaching, hiring, and editing fellowship applications: These seem to take between 15 and 22 hours of my time per new lecture, new staff hire, or fellowship application, respectively. But n = 1 for each, so we’ll see what happens in the future (maybe).

RPPRs: 9 hours on average. Doesn’t seem unreasonable.

Reviewing manuscripts: 5 hours on average. Again, seems pretty reasonable.

The dark corners of the plasmid

I was trying to streamline our existing attB vector. I was prompted to do this for a few reasons: 1) I recently identified a previously unappreciated T7 (bacteriophage) promoter and potential cryptic bacterial promoter in our standard plasmid, 2) There are presumably some weak cryptic eukaryotic promoters hidden somewhere in the plasmid too, and 3) I was trying to “domesticate” the plasmid to get rid of some Type II and Type IIS restriction enzyme sites.

As part of the most recent plan, I decided to delete two different sections of the plasmid; one was the segment of DNA between the attB site and the Amp promoter driving the AmpR gene, and the second was the segment between the origin of replication and the SV40 PolyA signal. The first one worked fine (which I knew it would, since I eventually remembered that I had previously made this deletion back in like… 2015, but never used it again for some reason). The second one proved very problematic. Here’s the section in question:

So I had seen those annotations for the lac promoter and lac operator, but assuming the directionality of the map was true, it seemed like they weren’t really pointed at enough bacterial sequence to matter, so I just assumed they were vestiges of something. Well, this is what happened to my plasmid yields in this lineage of plasmid.

I’m not going to read into the slightly higher concentration of L036 too much, but man, what really smacks you in the face is just how bad yields became with L048 (a derivative of L036). Well, so I tested the panel for their ability to recombine into landing pad cells, and the phenotype there was obvious as well; all plasmids up to and including L036 recombined at high rates, whereas L048 and one of its sibling plasmids with the same deletion had nonzero but *severely* diminished recombination. So not only is the DNA yield bad, but the “quality” in some sense seems to be much worse in that the DNA that is there is not resulting in good recombination.

I’ve learned my lesson, and I’m now just trying to take out that last BspQI/SapI site with a nucleotide substitution.

But still, this raises the question: what in the world is in that DNA section, and why is it so important for plasmid propagation? I’m sure some bacteriologists and perhaps some old-school molecular biologists would know, but I’ve always lamented how much of a black box the bacterial portions of plasmids are (my expertise is in eukaryotic / mammalian cell biology). Will I ever figure this out?

Well, I will first try with pLannotate. That’s what told me there was a T7 primer site in this plasmid after all.

Well, so that didn’t really uncover anything new. Hmmm…

PyMOL figures

I end up having to Google the relevant commands every time I need to make publication-quality figures in PyMOL, so I’m just going to note them here to save myself some time.

  1. set bg_rgb, white

    This sets the background to white. Usually a safe bet for any image meant for a normal presentation (i.e. slides with white backgrounds), or a publication (where it’s text and images against a white page).

  2. set ray_trace_mode, 1

    This renders with the “black outline” representation, which I think looks a little nicer than the default.

  3. ray [Some number, usually between 900 and 1500]

    This makes a static fully-rendered image that is much higher quality than the fast render in the interactive interface.

See above for an example of what a resulting image looks like.