Example Barcoded Variant Library Counts

As part of the AVE-ETS, we had been discussing barcoded variant libraries. I don’t quite remember the context, but I think I suggested we look at some real sequencing data from a barcoded variant library, and I offered to dig up the PTEN VAMP-Seq library data. Of course, I had other things to do in the month following this statement, so I didn’t look for those files until the long weekend immediately prior to the next meeting. My original plan was to just find this blog post [https://www.matreyeklab.com/simulating-sampling-during-recombination/1175/] and say we could use that, but then I realized that those data tables were for variant frequencies, not barcode counts. All of my old data from my postdoc was on flash drives in the office at work, but I didn’t feel like making the commute in just for that. So I decided to try to find and re-process the raw data uploaded to GEO / SRA.

First, a number of steps that aren’t explicitly coded in this markdown file.

  1. I downloaded the files from the relevant GEO sites. The first dataset from the NatGenet paper can be found here [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE108727]. The second dataset from the Genome Med paper where we published on a “fill-in” library we created to reintroduce some missing variants can be found here [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159469].
  2. I counted the barcodes using the original method we had used, which was just having Enrich2 do it, with a minimum quality score filter of 30.
  3. I imported the data into R to do some analyses, shown below.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_bw())
theme_update(panel.grid.minor = element_blank())
first_1 <- read.delim(file = "Data/SRR6437841.tsv.gz", sep = "\t")
first_2 <- read.delim(file = "Data/SRR6437842.tsv.gz", sep = "\t")

first <- merge(first_1, first_2, by = "X")
first$count = rowMeans(first[,c("count.x","count.y")])

ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = first %>% filter(count > 3), aes(x = count)) + geom_vline(xintercept = 100)
## As a rough but effective approach, I like to look at the histogram of read counts and find the relative minimum between the population of barcodes with counts near 1 (largely erroneous barcodes arising from sequencing error) and the next non-zero population (assuming the sample was sequenced to sufficient depth). For this sample, since it was sequenced so deeply, that minimum is around a count of 100.

first_filtered <- first %>% filter(count > 100)
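
As an aside, that threshold could also be picked programmatically rather than by eye. Here’s a minimal sketch of one way to do it, assuming the log10 count distribution really is bimodal; this wasn’t part of the original analysis.

# Sketch only: find the local minimum of the log10 read-count density and use
# it as the count threshold, instead of eyeballing the histogram.
d <- density(log10(first$count[first$count > 3]))
local_min <- which(diff(sign(diff(d$y))) == 2) + 1  # indices of local minima in the density
threshold <- 10^d$x[local_min[1]]                   # first minimum, back on the count scale
threshold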

#first_key <- read.delim(file = "Data/GSE108727_PTEN_barcodeInsertAssignments.tsv", sep = "\t", header = F)
first_key <- read.delim(file = "Data/first_key.tsv", sep = "\t", header = F)
colnames(first_key) <- c("X","variant")

first_df <- merge(first_filtered, first_key, by = "X", all.x = T)

First_lib_histogram <- ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = first_df, aes(x = count), fill = "grey90", bins = 50) +
  geom_histogram(data = first_df %>% filter(!is.na(variant)), aes(x = count), bins = 50) + 
  geom_vline(xintercept = 100) + 
  labs(x = "Num of reads", y = "Count", title = "Lib1: Grey, all barcodes; Black, subassembled variants") +
  theme(panel.grid.minor = element_blank()) + 
  NULL; First_lib_histogram
ggsave(file = "Output/First_lib_histogram.pdf", First_lib_histogram, height = 4, width = 5)

Some notes for the above plot. The grey bars of the histogram denote counts for all barcodes that were observed. The black bars are barcodes that were linked to a particular PTEN coding variant via PacBio subassembly.
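
For what it’s worth, the same grey-vs-black comparison can be summarized as a single number; a quick sketch (not part of the original post):

# Sketch only: fraction of count-filtered barcodes that were linked to a
# variant by the PacBio subassembly (the black portion of the histogram).
mean(!is.na(first_df$variant))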

Now, let’s look at the next library.

second_1 <- read.delim(file = "Data/SRR12818211.tsv.gz", sep = "\t")
second_2 <- read.delim(file = "Data/SRR12818212.tsv.gz", sep = "\t")

second <- merge(second_1, second_2, by = "X")
second$count = rowMeans(second[,c("count.x","count.y")])

ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = second %>% filter(count > 1), aes(x = count)) + geom_vline(xintercept = 10)
## As a rough but effective approach, I like to look at the histogram of read counts and find the relative minimum between the population of barcodes with counts near 1 (largely erroneous barcodes arising from sequencing error) and the next non-zero population (assuming the sample was sequenced to sufficient depth). For this sample, it's around 10.

second_filtered <- second %>% filter(count > 10)

second_key <- read.delim(file = "Data/Second_key.tsv", sep = "\t", header = T)
colnames(second_key)[2] <- "variant"

second_df <- merge(second_filtered, second_key, by = "X", all.x = T)

Second_lib_histogram  <- ggplot() + scale_x_log10() + #scale_y_log10() +
  geom_histogram(data = second_df, aes(x = count), fill = "grey90") +
  geom_histogram(data = second_df %>% filter(!is.na(variant)), aes(x = count)) + 
  geom_vline(xintercept = 10) +
  labs(x = "Num of reads", y = "Count", title = "Lib2: Grey, all barcodes; Black, subassembled variants") +
  theme(panel.grid.minor = element_blank()) + 
  NULL; Second_lib_histogram
ggsave(file = "Output/Second_lib_histogram.pdf", Second_lib_histogram, height = 4, width = 5)

## I have no idea why it looks bimodal, btw. Probably a problem with library mixing.
write.table(file = "Output/First_df.tsv", first_df, quote = F, row.names = F)
write.table(file = "Output/Second_df.tsv", second_df, quote = F, row.names = F)

For anyone who wants to test this, the GitHub repo for the above scripts and data can be found here: https://github.com/MatreyekLab/Barcodes

The Hall of Unexpectedly Cloned Plasmids

Testing out Plasmidsaurus’s zeroprep service (re)opened my eyes to the really weird and unexpected recombinant DNA plasmids that can be made through our standard cloning pipeline (largely utilizing PCR amplification and Gibson assembly). Here are some fun examples.

s/o to pLannotate for making the annotated plasmid maps I’m using as the visuals.

The intended construct "L149"
What I got. # The head-to-head fusion is kind of unusual. But also, where it did the split is funny, where it didn't even give me the full PDCH19 sequence after all that.
The intended construct "L153"
What I got. # I think I've seen this happen before. One intended construct fused in head-to-tail orientation with a linerized template plasmid.

Barcoding the epilepsy vector

We have a redesigned attB vector purposefully made to carry and barcode a bunch of epilepsy-associated membrane protein genes (and to a lesser extent, cytoplasmic and secreted protein genes). We’ll eventually need to make a number of barcoded libraries from it, so we’ve been figuring out the kinks of barcoding at a rare-cutter site. I’m realizing that if I want things to scale (whether it’s across labs, or even in the lab), it probably makes sense to make things as easy for the next person to pick up as possible. So, I’m showing my work in terms of how barcoding can be analyzed with this vector.

Alright, so to QC the barcoding vector, we’re trying two things: an initial Plasmidsaurus run (quick turnaround, hundreds to low thousands of reads, sufficient to estimate unbarcoded contamination), and a submission to the Genewiz / Azenta (formerly Brooks Life Sciences) 2x250nt MiSeq Amp-EZ service (2-week turnaround, hundreds of thousands of reads, likely enough to fully analyze small barcoded libraries). I’ll talk about the Plasmidsaurus reads on another day.

Today is figuring out how many barcodes seem to exist per prep with the Amp-EZ data.

First was pairing the reads.

% pear -f BC-L062-G2_R1_001.fastq.gz -r BC-L062-G2_R2_001.fastq.gz -o L062_G2
Next was treating the sequence next to the barcode as adapters to identify the barcodes themselves.
% cutadapt -a ATAAGATCTGGTCCTCTGATCCGA...CTATCGGTAACGCATTCGCC -o G2_linked.fastq L062_G2.assembled.fastq

% sh Tally_sequences.sh G2_linked.fastq G2_linked_tally.csv
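
For context, here is a rough base-R sketch of what a tallying step like this amounts to; this is not the actual Tally_sequences.sh script, just the same idea of counting each unique trimmed sequence.

# Sketch only: tally how many times each trimmed sequence appears in the FASTQ.
lines <- readLines("G2_linked.fastq")
seqs <- lines[seq(2, length(lines), by = 4)]  # every 4th line, starting at line 2, is a sequence
tally <- sort(table(seqs), decreasing = TRUE)
write.csv(data.frame(sequence = names(tally), count = as.integer(tally)),
          "G2_linked_tally.csv", row.names = FALSE)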

At this point, I have a csv file, which I imported into R since I'm more nimble there (obviously one can do something similar in Python). If the adapters were indeed found, then the resulting read is returned as the 20nt barcode sequence between them. If the adapters weren't found (e.g. if the plasmid was unbarcoded, or if the read had so many errors that the adapters weren't identified), then the full sequence is returned. Thus, I can subset for reads that were 20nt (barcoded) vs those that were not (unbarcoded), resulting in the following histogram based on that designation.
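
In case it’s useful, here’s a minimal sketch of that designation step; the column names follow the hypothetical tally file sketched above rather than my actual code, and it assumes the tidyverse is already loaded.

# Sketch only: flag 20nt sequences as barcodes, everything else as unbarcoded,
# then plot the read-count histogram split by that designation.
tally <- read.csv("G2_linked_tally.csv")
tally$barcoded <- nchar(tally$sequence) == 20

ggplot(tally, aes(x = count, fill = barcoded)) +
  scale_x_log10() +
  geom_histogram(bins = 50)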

For AEZ035_G1, 83% of the reads were barcoded, whereas 17% were not. This included 11.5% of the reads being the unbarcoded template. While we could get a bit fancier with things like error-correcting the barcodes (e.g. to account for sequencing error, which presumably shows up as low-count barcodes a small Hamming distance away from true barcodes with much higher counts), for today’s purpose, I’m just going to use a minimum threshold value, such as 7, to distinguish likely true barcodes from the likely erroneous non-barcodes in the “barcoded” subset. This yields a plot like so:
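
Code-wise, the thresholding itself is a one-liner; a sketch, continuing from the hypothetical tally data frame above (whether the cutoff is inclusive or exclusive matters little here):

# Sketch only: keep 20nt barcodes at or above the minimum count threshold and
# count how many unique barcodes remain.
true_barcodes <- subset(tally, barcoded & count >= 7)
nrow(true_barcodes)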

Great, so based on this relatively simple analysis scheme, there seem to be ~5,400 unique barcodes in this sample. We also repeated this process in another independently derived sample. What does that look like?

For this independent barcoding attempt, 92% of all reads were barcoded, and the remaining 8% were something else (with 3.5% being clearly unbarcoded plasmid).

Based on the same analysis scheme, there are ~5,100 unique barcodes in this sample. So despite the slight difference in the fraction of unbarcoded reads between the two attempts using Nidhi’s Gibson barcoding protocol, the total number of unique barcodes seemed to be similar.

Finally, since we have two independent barcoding attempts with a highly diverse oligo (N20, so a potential diversity of ~1.1 trillion barcodes), we would expect very little overlap in barcodes between these two datasets. Does that actually play out? Well, here are the actual results: G1 barcoded library only, 5,439; G2 barcoded library only, 5,095; and barcodes found in both, 34. So, pretty non-overlapping… that’s good! The vast majority of these overlapping barcodes had reasonably high counts in the G2 library, but counts barely above the threshold filter (9, 10, 11, 12) in the G1 library. Thus, they are likely due to sequencing errors, which could be handled either by being more stringent with the threshold, or by taking a more involved barcode error-correcting approach (perhaps to come in a future blog post).
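
The overlap comparison itself is just set operations; here’s a sketch, where g1_barcodes and g2_barcodes are placeholder names for the thresholded barcode sequences from the two preps (not objects from the original code):

# Sketch only: compare the two thresholded barcode sets.
length(setdiff(g1_barcodes, g2_barcodes))    # barcodes unique to G1
length(setdiff(g2_barcodes, g1_barcodes))    # barcodes unique to G2
length(intersect(g1_barcodes, g2_barcodes))  # barcodes found in both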

Plasmidsaurus decision tree

Plasmidsaurus whole-plasmid nanopore sequencing is a fantastic service. As of today (3/25/25), we’ve sent 1,357 samples to them(!!), and they’ve essentially been worth every penny. But there are a bunch of different reasons to use them, and I wanted to make sure everyone in the lab was on the same page in terms of which reasons are well justified vs which are more arguable, so I made the following decision tree (this also codifies the lab policy of sequencing plasmids from unverified sources before working with them). For anyone in the lab: let me know if you think we should make any specific changes to it (I tried to remember what we discussed in lab meeting, but I may have forgotten something).

Time investments

As mentioned in a previous post, starting January 20th, 2023, I began doing detailed accounting of how many minutes I was spending each day performing different types of work. This was largely motivated by my involvement on a departmental “committee” where there were around 3 other members assigned (so, theoretically, everybody could be doing ~25% of the necessary work), but it became clear that nobody was really going to do anything, so I had to do >80% of the necessary work to keep the whole thing from failing outright. As of today (March 21, 2025), this time-stamped spreadsheet of my activity log has 5,075 rows. Aside from allowing me to keep “receipts” of how much of my actual effort was going toward this piece of departmental service (rather than the assumed ~25%), it had the secondary benefit of letting me actually quantitate how long I was spending on any given work-related item (whether it was service-related or not). Here are some activities I was able to assess.

Manuscripts: So for the last 4 manuscripts from the lab, it’s taken me, on average, ~150 hours to go from “OK, time to start putting together the manuscript” to having it done. Actually, the number is probably closer to 200 hours on average, since two of these are still ongoing, and at least one of them I’ve been working on since like 2021. But yeah, it really is still a big lift for me to get a paper published from the lab. Which, in one sense, is probably what it should be. But it is also exhausting.

Presentations: There were four presentations during that span, and on average it took me about 20 to 25 hours to prep for each. I suppose this is because each presentation was on a different topic (necessitating starting from “scratch”).

Grant writing: This is going to be a completely anomalous number, since all of my previous major grant applications were written before I started collecting this data. Of the applications during this span, two were letters of intent (LOIs), and thus just short ~2-page descriptions, and one was a full application in an LOI-like condensed format (that one was also a joint grant application, so I didn’t need to pull all of the weight on it). Next time I need to write something that is R01-sized, I imagine it will take me double that number, if not more.

Rotations: Each PhD student rotation seems to take about 15-20 hours of my time. Probably a somewhat moot point for next year, though, since it’s unclear whether I could / should take another student at that point.

Teaching, hiring, and editing fellowship applications: These seem to take between 15 and 22 hours of my time per new lecture, new staff hire, or fellowship application, respectively. But n = 1 for each, so we’ll see what happens in the future (maybe).

RPPRs: 9 hours on average. Doesn’t seem unreasonable.

Reviewing manuscripts: 5 hours on average. Again, seems pretty reasonable.

The dark corners of the plasmid

I was trying to streamline our existing attB vector. I was prompted to do this for a few reasons: 1) I recently identified a previously unappreciated T7 (bacteriophage) promoter and potential cryptic bacterial promoter in our standard plasmid, 2) There are presumably some weak cryptic eukaryotic promoters hidden somewhere in the plasmid too, and 3) I was trying to “domesticate” the plasmid to get rid of some Type II and Type IIS restriction enzyme sites.

As part of the most recent plan, I decided to delete two different sections of the plasmid; one was the segment of DNA between the attB site and the Amp promoter driving the AmpR gene, and the second was the segment between the origin of replication and the SV40 PolyA signal. The first one worked fine (which I knew it would, since I eventually remembered that I had previously made this deletion back in like… 2015, but never used it again for some reason). The second one proved very problematic. Here’s the section in question:

So I had seen those annotations for the lac promoter and lac operator, but assuming the directionality of the map was true, it seemed like they weren’t really pointed at enough bacterial sequence to matter, so I just assumed they were vestiges of something. Well, this is what happened to my plasmid yields in this lineage of plasmid.

I’m not going to read into the slightly higher concentration of L036 too much, but man, what really smacks you in the face is just how bad yields became with L048 (a derivative of L036). Well, so I tested the panel for their ability to recombine into landing pad cells, and the phenotype there was obvious as well; all plasmids up to and including L036 recombined at high rates, whereas L048 and one of its sibling plasmids with the same deletion had nonzero but *severely* diminished recombination. So not only is the DNA yield bad, but the “quality” in some sense seems to be much worse in that the DNA that is there is not resulting in good recombination.

I’ve learned my lesson, and I’m now just trying to take out that last BspQI/SapI site with a nucleotide substitution.

But still, this raises the question: what in the world is in that DNA section, and why is it so important for plasmid propagation? I’m sure some bacteriologists and perhaps some old-school molecular biologists would know, but I’ve always lamented how much of a black box the bacterial portions of plasmids are (my expertise is in eukaryotic / mammalian cell biology). Will I ever figure this out?

Well, I will first try with pLannotate. That’s what told me there was a T7 primer site in this plasmid after all.

Well, so that didn’t really uncover anything new. Hmmm…

PyMOL figures

I end up having to Google the relevant commands every time I need to make publication-quality figures in PyMOL, so I’m just going to note them here to save myself some time.

  1. set bg_rgb, white

    This sets the background to white. Usually a safe bet for any image meant for a normal presentation (i.e. slides with white backgrounds), or a publication (where it’s text and images against a white page).

  2. set ray_trace_mode, 1

    This renders with the “black outline” representation, which I think looks a little nicer than the default.

  3. ray [Some number, usually between 900 and 1500]

    This makes a static fully-rendered image that is much higher quality than the fast render in the interactive interface.

See above for an example of what a resulting image looks like.