Lentivector supe collection

Most lentiviral production protocols (usually with VSV-G pseudotyped particles) tell the user to collect the supe at 48 or 72 hours. My protocols tend to say to collect the supe twice a day (once when coming into the lab, and once when leaving), starting at 24 hours and ending a few days later (96 hours? more?).

This largely stems from a point in my PhD when I was generating a bunch of VSV-G pseudotyped lentiviral particles to study the “early stage” of the HIV life cycle (i.e. the points preceding integration into the genome, such as the trafficking steps to get into the nucleus). After thinking about the protocol a bit, I realized that there’s really nothing stopping the produced VSV-G pseudotyped particles from attaching to and re-entering the cells they emerged from, which is useless for viral production purposes. Even the particles that are lucky enough not to re-enter the producer cells are going to be more stable in a lower-energy environment (such as 4°C) than floating around in the supe at 37°C in the incubator.

But, well, data is always better to back such ideas. So back in April 2014 (I know this since I incorporated the date into the resulting data file name), I did an experiment where I produced VSV-G pseudotyped lentiviral particles as normal, and collected the supe at ~12 hour intervals, keeping them separate in the fridge. After they were all collected, I took ~ 10uL from each collected supe, put them on target cells, and measured luciferase activity a couple of days later (these particles had a lentiviral vector genome encoding firefly luciferase). Here’s the resulting data.

Some observations:

  • Particles are definitely being produced by 24 hours, and seemingly reach a peak production rate between 24 and 48 hours.
  • The producer cells kept producing particles at a reasonably constant rate. Sure, there was some loss between 48 and 72 hours, but still a ton being produced.
  • I stopped this experiment at 67 hours, but one can imagine extrapolating that curve out, and presumably there’s still ample production happening after 72 hours.

So yea, I suppose if the goal is to have the highest singular concentration, then taking a single collection at 48 or 72 hours will probably give you that. That said, if the goal is to have the highest total yield (which is usually the situation I’m in), then it makes much more sense to collect at various intervals, and then use the filtered, pooled supe in downstream experiments.

Also, I consider being able to dig up and discuss 10-year-old data as a win!

Trimming and tabulating fastq reads

Anna has been testing her transposon sequencing pipeline, and needed some help processing some of her Illumina reads. In short, she needed to remove the sequenced invariant transposon region (essentially a 5′ adapter sequence), trim the remaining (hopefully genomic) sequence to a reasonable 40 nt, and then tabulate the reads, since there were likely going to be duplicates in there that don’t need to be considered independently. Here is what I did.

# For removing the adapter and trimming the reads down, I used a program called cutadapt. Here's information for it, as well as how I installed and used it below.
# https://cutadapt.readthedocs.io/en/stable/installation.html
# https://bioconda.github.io/

## Run the commands below in Bash (they tell conda where else to look for the program)
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict

## Since my laptop uses an M1 processor
$ CONDA_SUBDIR=osx-64 conda create -n cutadaptenv cutadapt

## Activate the conda environment
$ conda activate cutadaptenv

## Now trying this for the actual transposon sequencing files
$ cutadapt -g AGAATGCATGCGTCAATTTTACGCAGACTATCTTTGTAGGGTTAA -l 40 -o sample1_trimmed.fastq sample1.assembled.fastq

This should have created a file called “sample1_trimmed.fastq”. OK, next is tabulating the reads that are there. I used a program called 2fast2q for this.

## I liked to do this in the same cutadaptenv environment, so in case it was deactivated, here I am activating it again.
$ conda activate cutadaptenv

## Installing with pip, which is easy.
$ pip install fast2q

## Now running it on the actual file. I think you have to already be in the directory with the file you want (since you don't specify the file in the command).
$ python -m fast2q -c --mo EC --m 2

## Note: the "python -m fast2q -c" is to run it on the command line, rather than the graphical interface. "--mo EC" is to run it in the Extract and Count mode. "--m 2" is to allow 2 nucleotides of mismatches.
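To make the trim-then-tabulate logic concrete, here is a minimal pure-Python sketch of the two steps. It is not how cutadapt or 2fast2q work internally: it only does exact adapter matching (no mismatches or partial matches), and it drops reads lacking the adapter, which I believe differs from cutadapt's default of keeping them untrimmed. The function names and the toy reads are made up for illustration.

```python
from collections import Counter

# Invariant transposon sequence taken from the cutadapt command above.
ADAPTER = "AGAATGCATGCGTCAATTTTACGCAGACTATCTTTGTAGGGTTAA"


def trim_read(seq, adapter=ADAPTER, keep=40):
    """Drop everything up to and including the adapter, then keep `keep` nt.

    Returns None when the adapter is absent (exact matching only,
    which is a simplification).
    """
    start = seq.find(adapter)
    if start == -1:
        return None
    insert_start = start + len(adapter)
    return seq[insert_start:insert_start + keep]


def tabulate_reads(seqs, adapter=ADAPTER, keep=40):
    """Count identical trimmed reads, skipping reads without the adapter."""
    counts = Counter()
    for seq in seqs:
        trimmed = trim_read(seq, adapter, keep)
        if trimmed:  # skips None and empty strings
            counts[trimmed] += 1
    return counts


if __name__ == "__main__":
    genomic = "ACGT" * 15  # hypothetical 60-nt genomic flank
    reads = ["GG" + ADAPTER + genomic, ADAPTER + genomic, "TTTTTTTT"]
    for seq, n in tabulate_reads(reads).most_common():
        print(n, seq)
```

The real tools are still the better choice for actual data, since they tolerate sequencing errors in the adapter and handle quality strings.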

Nanopore denovo assembly

Plasmidsaurus is great, but it looks like some additional companies aside from Primordium are trying to develop a “nanopore for plasmid sequencing” service. We just tried the Plasmid-EZ service from Genewiz / Azenta(?), partially b/c there’s daily pickup from a dropbox on our campus. At first glance, the results were rather mixed. Read numbers seemed decent, but 4 of the 8 plasmid submissions didn’t yield a consensus sequence. Instead, all we were given were the “raw” fastq files. To even see whether these reads were useful or not, I had to figure out how to derive my own consensus sequence from the raw fastq files.

After some googling and a failed attempt or two, I ended up using a program called “flye”. Here’s what I did to install and use it, following the instructions here.

## Make a conda environment for this purpose
$ conda create -n flye

## Activate it, so the install below goes into this environment
$ conda activate flye

## Pretty easy to install with the below command
$ conda install flye

## It didn't like my file in a path that had spaces (darn google drive default naming), so I ended up dragging my files into a folder on my desktop. Just based on the above commands, you should already be able to run it on your file, like so:
$ flye --nano-raw J202B.fastq --out-dir assembled

This worked for all four of those fastq files returned without consensus sequences. Two worked perfectly and gave sequences of the expected size. The remaining two returned consensus sequences that were twice as large as the expected plasmid. Looking at the read lengths, these plasmids did show a few reads that were twice as long as expected. That said, those reads being in there wasn’t what made the program return that doubly-long consensus sequence, as it still made that consensus seq even after the long reads were filtered out (I did this with fastq-filter; “pip install fastq-filter”, “fastq-filter -l 500 -L 9000 -o J203G_filtered.fastq J203G.fastq.gz”). So ya, I still haven’t figured out why this happened and whether it’s real or not, but even a potentially incorrectly assembled consensus read was helpful, as I could import it into Benchling, align it with my expected sequence, and see if there were any errors.
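For anyone curious what the length filtering amounts to, here is a rough pure-Python sketch of the same idea: parse the fastq four lines at a time and keep reads within a length window. This is not fastq-filter's code; the function names are made up, and I'm assuming the 500 and 9000 bounds are inclusive and that the file uses the plain four-lines-per-read layout.

```python
import gzip
from itertools import islice


def parse_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines.

    Assumes the simple four-lines-per-read layout.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:  # end of input (or a truncated last record)
            return
        header, seq, _plus, qual = record
        yield header, seq, qual


def filter_by_length(records, min_len=500, max_len=9000):
    """Keep reads whose length falls within [min_len, max_len]."""
    for header, seq, qual in records:
        if min_len <= len(seq) <= max_len:
            yield header, seq, qual


def filter_fastq_file(in_path, out_path, min_len=500, max_len=9000):
    """Filter a (possibly gzipped) fastq file; returns the number of reads kept."""
    opener = gzip.open if str(in_path).endswith(".gz") else open
    kept = 0
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        records = filter_by_length(parse_fastq(src), min_len, max_len)
        for header, seq, qual in records:
            dst.write(f"{header}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept
```

Something like filter_fastq_file("J203G.fastq.gz", "J203G_filtered.fastq") would then mimic the command above, and collecting len(seq) from parse_fastq is an easy way to eyeball the read length distribution.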

After this experience, I’ve come to better appreciate how well Plasmidsaurus is run (and how good their pipeline for returning data is). We’ll probably try the Genewiz Plasmid-EZ another couple of times, but so far, in terms of quality of the service, it doesn’t seem as good.

Plasmidsaurus fasta standardizer

I really like Plasmidsaurus, and it’s an integral part of our molecular cloning pipeline. That said, I’ve found analyzing the resulting consensus fasta file to be somewhat cumbersome, since where they start their sequence string in the fasta file is inevitably rather arbitrary (which I don’t blame them for at all, since these are circular plasmids with no particular starting nucleotide, and every plasmid they’re getting is unique), and obviously doesn’t match where my sequence starts on my plasmid map in Benchling.

For the longest time (the past year?) I dealt with each file / analysis individually, where I would either 1) reindex my plasmid map on Benchling to match up with how the Plasmidsaurus fasta file is aligning, or 2) manually copy-paste sequence in the Plasmidsaurus fasta file, after seeing how things match up after aligning.

Anyway, I got tired of doing this, so I wrote a Python script that standardizes things. This will still require some up-front work in 1) Running the script on each plasmidsaurus file, and 2) Making sure all of our plasmid maps in Benchling start at the “right” location, but I still think it will be easier than what I’ve been doing.

1) Reordering the plasmid map.

I wrote the script so that it reorders the Plasmidsaurus fasta file based on the junction between the stop codon of the AmpR gene and the sequence directly after it. Thus, you’ll have to reindex your Benchling plasmid map so it exhibits that same break at that junction point. If your plasmid has AmpR in the forward direction, it should look like so on the 5′ end of your sequence:

And like this on the 3′ end of your sequence:

While if AmpR is in the reverse direction, it should look like this on the 5′ end of your sequence:

And like this on the 3′ end of your map:

Easy ‘nuf.

2) Running the Python script on the Plasmidsaurus fasta file.

The python script can be found at this GitHub link: https://github.com/MatreyekLab/Sequence_design_and_analysis/tree/main/Plasmidsaurus_fasta_reordering

If you’re in my lab (and have access to the lab Google Drive), you don’t have to go to the GitHub repo. Instead, it will already be in the “[…additional_text_here…]/_MatreyekLab/Data/Plasmidsaurus” directory.

Open up Terminal, and go to that directory. Then type in “python3 Plasmidsaurus_fasta_standardizer.py”. Before hitting return to run, you’ll have to tell it which file to perform this on. Because of the highly nested structure of how the actual data is stored, it will probably be easier just to navigate to the relevant folder in Finder, and then drag the intended file into the Terminal window. The absolute path of where the file sits in your directory will be copied, so the command will now look something like “python3 Plasmidsaurus_fasta_standardizer.py /[…additional_text_here…]/_MatreyekLab/Data/Plasmidsaurus/PSRS033/Matreyek_f6f_results/Matreyek_f6f_5_G1131C.fasta”

It will make a new fasta file suffixed with “_reordered” (such as “Matreyek_f6f_1_G1118A_reordered.fasta”), which you can now easily use for alignment in Benchling.
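For intuition, the reordering boils down to rotating a circular sequence so that it starts right after a chosen landmark. Below is a minimal sketch of that idea, and not the actual script in the GitHub repo: the marker argument stands in for a hypothetical stretch of sequence ending at the AmpR stop codon, and this version always returns the strand on which the marker reads forward, which may not match how the real script treats reverse-oriented AmpR.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T, either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def rotate_after_marker(plasmid, marker):
    """Rotate a circular sequence so it starts right after `marker`.

    Both strands are searched. The sequence is doubled before searching,
    so a marker that spans the arbitrary linear break is still found.
    Returns None if the marker is on neither strand.
    """
    for candidate in (plasmid, reverse_complement(plasmid)):
        idx = (candidate + candidate).find(marker)
        if idx != -1:
            start = (idx + len(marker)) % len(candidate)
            return candidate[start:] + candidate[:start]
    return None


if __name__ == "__main__":
    toy_plasmid = "TAACCCGGGAAA"  # toy circular sequence
    toy_marker = "AAATAA"         # hypothetical landmark, spans the break here
    print(rotate_after_marker(toy_plasmid, toy_marker))
```

With this convention the rotated sequence ends with the marker, so the break sits exactly at the junction between the landmark and whatever follows it.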

Note: Currently, the script only works for ampicillin resistant plasmids, since that’s somewhere between 95 and 99% of all of the plasmids that we use in the lab. This means Plasmidsaurus sequencing of the rare KanR plasmid won’t work with this method. Perhaps one day I’ll update the script to also work with KanR plasmids (i.e. the first time I need to run Plasmidsaurus data analysis on a KanR plasmid, haha).

Flow cytometry compensation

So I tend to use fluorescent protein combinations that are not spectrally overlapping (eg. BFP, GFP, mCherry, miRFP670), so that circumvents the need for any compensation (at least on the flow cytometer configurations that we normally use). That being said, I apparently started using mScarlet-I in some of our vectors, and there is some bleedover into the green channel if it’s bright enough.

These cells only express mScarlet-I, and yet green signal is seen when red MFI values > 10^4… booo……

Well, that’s annoying, but the concept of compensation seems pretty straightforward to me, so I figure we can also do it post hoc if necessary. The idea here is to take the value of red fluorescence, multiply it by a constant fraction (< 1) representing the amount of bleedover that is happening into the second channel, and then subtract this product from the original amount of green fluorescence to make the compensated green measurement.

To actually work with the data, I exported the above cell measurements shown in the Flowjo plot above as a csv and imported it into R. Very easy to execute the above formula, but how does one figure out the relevant constant value that should be used for this particular type of bleedover? Well, I wrote a for-loop testing values from 0.0010 to 0.1, and saw whether the adjusted values now resulted in a straight horizontal line with ~ zero slope (since then, regardless of red fluorescence, green fluorescence would be unchanged).

Now as that value becomes too large, more will be subtracted than should be, resulting in an inverse relationship between red and green fluorescence. To make it easier to find the best value, I took the absolute value of the resulting slope in the points, which pointed me to a value of 0.0046 as the minimum, for mScarlet-I red fluorescence bleeding over into my green channel on this particular flow cytometer with these particular settings.
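I did this in R, but the same scan is easy to sketch in Python. The version below is an illustration and not my original code: it assumes the slope is fit on linear-scale values with ordinary least squares, and the data are synthetic (a made-up 0.0046 spillover plus noise), so it only shows that the scan recovers a known constant.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def compensate(green, red, k):
    """Subtract the red-derived bleedover (k * red) from each green value."""
    return [g - k * r for g, r in zip(green, red)]


def best_spillover(red, green, candidates):
    """Return the candidate k giving the flattest compensated green vs. red."""
    return min(candidates,
               key=lambda k: abs(slope(red, compensate(green, red, k))))


if __name__ == "__main__":
    random.seed(0)
    # Synthetic red-only cells: green is background + bleedover + noise.
    red = [random.uniform(1e2, 1e5) for _ in range(500)]
    green = [50 + 0.0046 * r + random.gauss(0, 5) for r in red]
    candidates = [i / 10000 for i in range(10, 1001)]  # 0.0010 to 0.1000
    print(best_spillover(red, green, candidates))
```

One thing worth noting: on a linear scale, the slope after subtracting k * red is just the original slope minus k, so the scan should land on (the nearest candidate to) the plain regression slope of green on red. If the fit were done on log-transformed values instead, that shortcut would not hold.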

Great, so what does the data actually look like once I compensate for this bleedover? Well, with this control data, this is the before and after (on a random subset of 1000 datapoints)

Hurrah. Crisis averted. Assuming we now have a sample with both actual green and red fluorescence (previously confounded by the red to green bleedover from mScarlet), we can presumably now analyze that data in peace.

Just for fun, here’s a couple of additional samples and their before and after this compensation is performed.

First, here are cells that express both EGFP and mScarlet-I at high levels. You can see that the compensation does almost nothing. This makes sense, since the bleedover is contributing such a small total percentage to the total green signal (EGFP itself is contributing most of the signal), that removing that small portion is almost imperceptible.

Here’s a sample that’s a far better example. Here, there’s a bunch of mScarlet-I positive cells (as well as some intermediates), and a smattering of lightly EGFP positive cells throughout. But aside from the shape of the mScarlet-I positive, GFP negative population changing from a 45-degree line to a circular cloud, the overall effects aren’t huge. Still, even that is useful though, b/c if one didn’t look at this scatterplot (and know about the concept of bleedover and compensation), one might interpret that slight uptick in green fluorescence in that aforementioned population as a real biologically meaningful difference.

Common Plasmidsaurus Errors

OK, so we all know that Plasmidsaurus nanopore sequencing isn’t perfect. Every time I see the mistake at the 5′ end of the IRES sequence I know to ignore it, but there are a bunch of other ones that I still repeatedly run into (but not quite as frequently), such that I don’t have them memorized and am not sure if I should be ignoring them right off the bat. Thus, I’m going to keep a list of repeated erroneous calls here on this page so I’m reminded to ignore them in the future.

Visual evidence of individual examples is listed below. But here’s a summary of Plasmidsaurus errors to ignore:

  1. IRES – deletions near the 5′ end
  2. mCherry – W63R or Q114R
  3. mScarlet – errors at R71 (sometimes R71G) and S113/L114 (including L114P).
  4. mKG – L96P
  5. Puromycin resistance gene PAC – R18G or L125P
  6. shBleR – Q56R
  7. Silent or frameshift mutations at the NPGP motif at the 3′ end of the P2A sequence

In fact, be very suspicious of any unexpected L -> P mutant through Plasmidsaurus seq. And maybe Q -> R muts too.
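If you wanted to triage calls automatically, the list above could be turned into a small lookup. This is only a sketch: the function name and return labels are made up, it covers just the amino acid substitutions from the list (not the IRES deletions or P2A indels), and the "suspicious" rule simply encodes the L -> P and Q -> R hunch.

```python
# Substitutions from the list above, keyed by feature name.
KNOWN_MISCALLS = {
    "mCherry": {"W63R", "Q114R"},
    "mScarlet": {"R71G", "L114P"},
    "mKG": {"L96P"},
    "PAC": {"R18G", "L125P"},
    "shBleR": {"Q56R"},
}

# Substitution types to distrust even outside the listed features.
SUSPICIOUS_SUBSTITUTIONS = {("L", "P"), ("Q", "R")}


def classify_call(feature, change):
    """Label a call such as ("mCherry", "W63R").

    Returns "known miscall", "suspicious", or "unlisted".
    """
    if change in KNOWN_MISCALLS.get(feature, set()):
        return "known miscall"
    if len(change) >= 3 and (change[0], change[-1]) in SUSPICIOUS_SUBSTITUTIONS:
        return "suspicious"
    return "unlisted"


if __name__ == "__main__":
    for call in [("mCherry", "W63R"), ("EGFP", "L64P"), ("EGFP", "A10T")]:
        print(call, classify_call(*call))
```

Anything that comes back "unlisted" still deserves a real look, of course.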

Since almost all of my plasmids have this IRES sequence in them, I almost always run across this error (although it’s usually a 1nt miscall rather than 2nt like this example).
This Puromycin R18G error is annoying b/c it looks like it could be really problematic.
I don’t use mScarlet-I all that often, but when I do, Plasmidsaurus sometimes gives me this L114P erroneous call.
Here it gets screwed up in the same area but mysteriously called an A insertion, making it an S113fs.
It also has issues with mScarlet-I R71. Sometimes it calls it a silent mutation, but other times it calls it as R71G.
An insertion (which, if true, would make a frameshift) in the NPGP motif toward the 3′ end of P2A.
Puro L125P
mCherry W63R
mCherry Q114R
mKG L96P
shBleR Q56R.

Edit 2/9/24: Here’s another one. A single-nt insertion in the Asp residue at around position 4 or so of the histone 2A protein.

Ordering oligos at CWRU

Here’s a price comparison I did back in 2019 (presumably still correct?). But in short, per nt price was cheapest through ThermoFisher.

Thus, we’ve been almost exclusively buying oligos from them, with $7,220 spent (as of June 2022) since our first orders starting December 2019.

Here’s what the histogram of oligo costs has shaped up as.

But, well, don’t order degenerate-nucleotide oligos from them, as they’ll likely be T biased.

If anyone sees anything better on campus, let me know!

Consistent Plasmidsaurus sequencing miscalls

As I noted in this Twitter exchange, plasmid nanopore sequencing via Plasmidsaurus is great, but not perfect. For example, there seem to be some “Achilles heel” sequences, where nanopore reproducibly (like 100% of the time with different plasmid submissions) miscalls certain parts of our plasmids. How do we know they’re miscalls? B/c the Sanger traces of the same exact plasmids show the expected sequence very clearly. Here are two that we commonly see:

A single deleted C nucleotide in the beginning of our IRES sequence:

A phantom T>C base miscall that incorrectly tells us we have a W566R nonsynonymous change in every single one of our human ACE2 constructs.

Both are related to C repeats, but there are plenty of other C repeats in the plasmids we submit and it’s ALWAYS these sequences that give Plasmidsaurus problems. Once I figured this out, it’s really NBD, since I know to ignore these changes, although it did inform our current molecular biology workflow in the lab of 1) Screen colony minipreps via Sanger -> 2) Sequence candidate good constructs with Plasmidsaurus / nanopore -> 3) Sanger to resolve unexpected discrepancies between the expected / intended and Plasmidsaurus sequences.

Command line BLAST

One of the pseudo-projects in the lab requires looking for a particular peptide motif in genomic data. While small scale searches can be done using the web interface, the idea is to do this in a pretty comprehensive / high throughput manner, so shifting to the command line makes sense for this work. I last did this back in 2018 for some preliminary studies, so I’m going to have to re-install the software on my new computer and re-run some of those analyses. I figure I’ll write down my notes as I re-do this, so that I (and others) can use this post as a reference.

Installing BLAST+

The instructions on how to download the program can be found here. I’m on a mac, so I downloaded “ncbi-blast-2.13.0+.dmg” and double clicked and ran the package installer.

Assuming it’s been correctly installed, writing the command …

blastp -task blastp-short -query <(echo -e ">Name\nAAWLIEKGVASAEE") -db nr -remote -outfmt 1

… into the terminal should actually reveal some BLAST-specific output, rather than throw an error.

Running protein motif-specific blast searches

Type in the following into your terminal:

psiblast -phi_pattern PHI-Blast_2A_pattern.txt -db nr -remote -query <(echo -e ">Name\nGATNFSLLKQAGDVEENPGP") -max_hsps 1 -max_target_seqs 10000 -out phi_blast_output.csv -outfmt 10

Note: The above command will require having a text file specifying the pattern constraint (“PHI-Blast_2A_pattern.txt” above), which can be found here. This should yield a 25 KB csv output file, like so.

Extracting just the accession numbers

I don’t remember if there are other BLAST+ outputs that give you the full hit sequence. If so, the method I ended up taking back in 2018 would seem to be unnecessarily roundabout. But, until I figure that out, I’ll follow the old method. As you can see in the aforementioned output format, it doesn’t output the hit protein sequence, and instead just gives the accession number. Thus, the next step is using the accession number to actually figure out the protein sequence. To do this, we’ll use Entrez Direct. To install Entrez Direct, follow the instructions here. Briefly, type in the following into the terminal:

sh -c "$(curl -fsSL ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

In order to complete the configuration process, execute the following:

echo "source ~/.bash_profile" >> $HOME/.bashrc
echo "export PATH=\${PATH}:/Users/kmatreyek/edirect" >> $HOME/.bash_profile

OK, now that it’s installed, here’s how I’ve used it:

First, the output file above has more info than the accession number. To have it pare down to only the accession number, I used this script, which can be run by entering the following into the terminal, assuming you have the previous output csv file somewhere in the directory with the script (can even be in other folders within that directory):

python3 3_Blast_to_accession.py

This will create a file called “3A_prot_accession_list_complete.txt” (example output file here) which will be the unique-ified list of accession numbers to give to Entrez Direct. (Uniquifying is important if you have multiple .csv outputs you wanted to compile into a single master list).
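For those without the script handy, the paring-down step can be approximated in a few lines. This is a sketch of the general idea and not the contents of 3_Blast_to_accession.py: it assumes the subject accession sits in the second column of the "-outfmt 10" csv (as in the default tabular column layout), and it takes already-opened files rather than crawling the directory.

```python
import csv
import io


def unique_accessions(handles, column=1):
    """Collect unique accessions from BLAST csv outputs, in first-seen order.

    `handles` is an iterable of open text files (or file-like objects);
    `column` is the 0-based index of the subject accession (assumed to be 1).
    """
    seen = set()
    ordered = []
    for handle in handles:
        for row in csv.reader(handle):
            if len(row) <= column:
                continue  # skip blank or malformed lines
            accession = row[column].strip()
            if accession and accession not in seen:
                seen.add(accession)
                ordered.append(accession)
    return ordered


if __name__ == "__main__":
    # Made-up rows imitating two csv outputs with an overlapping hit.
    first = io.StringIO("Name,ACC_0001.1,100.0,20\nName,ACC_0002.1,95.0,20\n")
    second = io.StringIO("Name,ACC_0002.1,95.0,20\nName,ACC_0003.1,90.0,20\n")
    print("\n".join(unique_accessions([first, second])))
```

Writing the returned list to a text file, one accession per line, gives something in the same spirit as the accession list described above.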

This can be fed into Entrez Direct using this shell script, which you can run by typing in:

sh 4_Accession_to_fasta.sh

You should now have an output file called “4A_prot_fasta.txt” with the resulting protein sequences in fasta format, like so.

Now you can search for your desired sequence (in its full protein context) within the resulting file.
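As a rough illustration of that search, here is a small sketch that reads fasta-formatted text and reports where a regular expression matches within each protein. The function names are made up, and the plain "NPGP" pattern in the example is just a stand-in; the real constraint lives in the PHI-BLAST pattern file, which isn't reproduced here.

```python
import re


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_motif(records, pattern):
    """Yield (header, 1-based start position, matched text) for each hit."""
    regex = re.compile(pattern)
    for header, seq in records:
        for match in regex.finditer(seq):
            yield header, match.start() + 1, match.group()


if __name__ == "__main__":
    # Toy records; the first embeds the query peptide from the psiblast command.
    toy = [">p1 test", "MKTGATNFSLLKQAGDVEENPGPMSK", ">p2", "MAAAA"]
    for hit in find_motif(parse_fasta(toy), "NPGP"):
        print(hit)
```

To run it on the real output, pass an open file handle for “4A_prot_fasta.txt” to parse_fasta in place of the toy list.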

To be continued…

Are there other steps in this process related to this project? Sure. Like what do you do with all of these full sequences containing the hits? Well, that’s beyond the scope of this post.

ODs on the spec and nanodrop

So there are two ways to measure bacterial culture ODs in the lab. The first is to use the nearby ~ $10,000 Thermofisher Nanodrop One (no cuvette option). The second option is to use a relatively cheaply made cuvette-based spectrophotometer I bought off of Amazon for ~ $100. To make it clear, this comparison is not a statement about the value of a Nanodrop (though I will say that having an instrument like a Nanodrop is essentially a must in a mol biol lab). This is more about whether, if the Nanodrop is already being used by someone and waiting would get in the way of some bacterial speccing timepoints, I can purchase a $100 piece of equipment to relieve such a conflict. This is especially relevant for bacterial cultures, where volume isn’t really an issue and the measurement is simply the reading at 600 nm, not even requiring some algebra to make a conversion to more practical units (like ng/uL for DNA).

So to do this comparison, over a number of independent instances, I took the same bacterial culture and put 1mL into a cuvette and ran it on the old spec, and took 2 uL and put it on the Nanodrop pedestal and measured there. I made a table of the results, and graphed it in the plot below.

So the readings on the two instruments certainly correlate (that’s good), although it’s not an exact 1:1 relationship. In fact, the nanodrop gave numbers roughly 1.5 times higher than the spec. But if the two instruments give two different readings, then the question becomes “which is right?”

And to that, I essentially say there is no right answer. Each is a proxy for bacterial cell density (i.e. billions of bacteria / mL), but there’s no “absolute” information encoded in the OD number that tells us that specifically for our bacteria, and we’d still have to come up with a conversion factor either way (i.e. by doing limiting dilutions of specc’d cultures and counting colonies), and once we have that, both will be right within that context. Sure, it would be nice if we had a method that was the most in line with whatever ODs were being described by various papers in the literature, but who knows what they used (recent papers may be using ODs from the nanodrop [with some perhaps using the cuvette option but many others not], while the older publications certainly didn’t have one and instead likely used some old-school form of spec). But even that’s going to be heterogeneous, and will only give limited information anyway.

Well, good record-keeping to the rescue. We’ve transformed the positive control plasmid enough times to sample a range of various ODs just by chance, to see if certain bacterial ODs correlate with transformation efficiency. And boy, there’s been a whole lot of nothing there so far (which is actually quite notable; see below).

(FYI: I don’t remember which instrument I used to measure the OD A600 readings. Probably mostly the old spec, tho).

So yea, I’ve generally used cultures with ODs at the time of collection between 0.1 and 0.45, and they’ve collectively given me transformation rates of ~ 20,000 using our standard “positive control” plasmid. So there seems to be a pretty wide window of workable ODs. But generally speaking, I see no issue with having a culture of 0.1 to 0.4 OD as measured with either machine for use with chemical transformation.