Barcoding the epilepsy vector

We have a redesigned attB vector purpose-built to carry and barcode a bunch of epilepsy-associated membrane protein genes (and, to a lesser extent, cytoplasmic and secreted protein genes). We’ll eventually need to make a number of barcoded libraries from it, so we’ve been working out the kinks of barcoding at a rare-cutter site. I’m realizing that if I want things to scale (whether across labs, or even within our own), it probably makes sense to make things as easy as possible for the next person to pick up. So, I’m showing my work in terms of how barcoding can be analyzed with this vector.

Alright, so to QC the barcoding vector, we’re trying two things: an initial Plasmidsaurus run (quick turnaround; hundreds to low thousands of reads; sufficient to estimate unbarcoded contamination), and a submission to the Genewiz / Azenta (formerly Brooks Life Sciences) 2x250 nt MiSeq Amp-EZ service (two-week turnaround; hundreds of thousands of reads; likely enough to fully analyze small barcoded libraries). I’ll talk about the Plasmidsaurus reads another day.

Today is about figuring out how many barcodes seem to exist per prep, using the Amp-EZ data.

First was merging the paired-end reads with PEAR:

% pear -f BC-L062-G2_R1_001.fastq.gz -r BC-L062-G2_R2_001.fastq.gz -o L062_G2
Next was treating the constant sequences flanking the barcode as a linked adapter pair, so cutadapt can pull out the barcodes themselves (reads in which the adapter pair isn’t found are passed through at full length, which becomes useful below):
% cutadapt -a ATAAGATCTGGTCCTCTGATCCGA...CTATCGGTAACGCATTCGCC -o G2_linked.fastq L062_G2.assembled.fastq

% sh Tally_sequences.sh G2_linked.fastq G2_linked_tally.csv
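
Tally_sequences.sh isn’t reproduced here, but conceptually it just counts how many times each sequence appears in the FASTQ and writes a sequence,count CSV. For anyone without the script, here’s a minimal stand-in in R (my sketch, not the actual script; it assumes the FASTQ fits in memory):

# Hypothetical stand-in for Tally_sequences.sh: tally identical reads into a CSV
fq <- readLines("G2_linked.fastq")
seqs <- fq[seq(2, length(fq), by = 4)]  # FASTQ sequence lines: every 4th line, starting at line 2
tally <- sort(table(seqs), decreasing = TRUE)
write.csv(data.frame(sequence = names(tally), count = as.integer(tally)),
          "G2_linked_tally.csv", row.names = FALSE)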

At this point, I have a CSV file, which I imported into R since I’m more nimble there (obviously one can do something similar in Python). If the adapters were indeed there, then the resulting read is returned as the 20 nt barcode sequence. If the adapters weren’t there (e.g., if the plasmid was unbarcoded, or if the read had so many errors that the adapters weren’t identified), then the full-length sequence is returned instead. Thus, I can subset for reads that were 20 nt (barcoded) vs. those that were not (unbarcoded); the sketch below shows that subsetting, and the histogram that follows is based on that designation.
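
Here’s a minimal sketch of that subsetting step in R (the sequence/count column names match the stand-in tally script above; adjust to however your CSV is labeled):

tally <- read.csv("G2_linked_tally.csv", stringsAsFactors = FALSE)
tally$barcoded <- nchar(tally$sequence) == 20   # adapter pair found -> a 20 nt barcode was returned
sum(tally$count[tally$barcoded]) / sum(tally$count)   # fraction of reads that were barcoded
hist(log10(tally$count[tally$barcoded]), breaks = 50,
     xlab = "log10(reads per unique barcode sequence)",
     main = "Read counts, barcoded subset")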

For AEZ035_G1, 83% of the reads were barcoded, whereas 17% were not; this included 11.5% of reads matching the unbarcoded template. While we could get a bit fancier with things like error-correcting the barcodes (e.g., to accommodate sequencing error, which presumably produces low-count sequences a small Hamming distance away from true barcodes with much higher counts), for today’s purpose I’m just going to use a minimum count threshold, such as 7, to distinguish likely true barcodes from likely erroneous non-barcodes in the “barcoded” subset. This yields a plot like so:

Great, so based on this relatively simple analysis scheme, there appear to be ~5,400 unique barcodes in this sample.
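
For concreteness, that number comes from a simple minimum-count filter on the barcoded subset (continuing the sketch above, with 7 as the example cutoff):

bc <- subset(tally, barcoded & count >= 7)   # drop low-count, likely-erroneous sequences
nrow(bc)   # number of unique barcodes passing the threshold (~5,400 for this sample)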

We also repeated this process with another, independently derived sample. For this independent barcoding attempt, 92% of all reads were barcoded, and the remaining 8% were something else (with 3.5% being clearly unbarcoded plasmid).

Based on the same analysis scheme, there are ~5,100 unique barcodes in this sample. So despite the slight difference in the fraction of unbarcoded reads using Nidhi’s Gibson barcoding protocol, the total number of unique barcodes came out similar.

Finally, since we have two independent barcoding attempts with a highly diverse oligo (N20, so a potential diversity of ~1.1 trillion barcodes), we would expect very little overlap in barcodes between these two datasets. Does that actually play out? Well, here are the actual results: 5,439 barcodes only in the G1 library, 5,095 only in the G2 library, and 34 found in both. So, pretty non-overlapping… so that’s good! The vast majority of these overlapping barcodes had reasonably high counts in the G2 library, but counts barely above the threshold filter (9, 10, 11, 12) in the G1 library. Thus, these are likely due to sequencing errors, which could be handled either by being more stringent with the threshold or by adopting a more involved barcode error-correction scheme (perhaps to come in a future blog post).
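
For completeness, the overlap numbers above come from simple set operations on the thresholded barcode lists; a sketch, assuming g1_bc and g2_bc are the filtered tallies (as built above) for the two libraries:

# Compare the thresholded barcode sets from the two independent preps
shared <- intersect(g1_bc$sequence, g2_bc$sequence)
c(G1_only = length(setdiff(g1_bc$sequence, g2_bc$sequence)),
  G2_only = length(setdiff(g2_bc$sequence, g1_bc$sequence)),
  both    = length(shared))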