Manuscript acceptance timing

I’ve now been an author on enough papers to have a reasonable sampling of what the experience can be like. In short, it takes a minimum of about 2 months to go from initial submission to eventual manuscript acceptance, and those fastest cases are typically the ones that require little to no additional experiments for revision. The process, especially when it requires hefty experiments or multiple rounds of revision, can easily stretch to half a year. In some cases it can take *much* longer (in one case I’ve seen, a journal gave the “rejected but amenable to resubmission if sufficient additional impact is added” decision, which resulted in an informal span of ~800 days!!!). Anyway, at least for the manuscript submissions I was involved with where I had access to an author portal (or received emails when things happened), I noted when the reviews were returned and revisions submitted, to keep track of how much of that time was technically under one’s control (manuscript with authors; red) or completely out of one’s control (manuscript with journal; blue). See below:

This is also part of the reason it makes sense to post manuscripts to bioRxiv: why have a completed manuscript that is essentially publication-worthy sit in the dark for half a year?

Note: Obviously this data is for manuscripts that eventually got accepted. Rather pleasantly, I’ve never been first or corresponding author for a paper that got rejected, so I don’t have nearly as good a sampling of that experience. But sitting on the sideline as a middle-author for a handful of such occasions, it would seem to take anywhere between a week (eg. immediate desk rejection) to a couple of months (eg. rejected upon peer review) per submission; when sequentially shopping across multiple journals to find a taker, this would seem to add up.

Lentivector supe collection

Most lentiviral production protocols (usually with VSV-G pseudotyped particles) tell the user to collect the supe at 48 or 72 hours. My protocols tend to say to collect the supe twice a day (once when coming into the lab, and once when leaving), starting at 24 hours and ending a few days later (96 hours? more?).

This largely stems from a point in my PhD when I was generating a bunch of VSV-G pseudotyped lentiviral particles to study the “early stage” of the HIV life cycle (ie. the points preceding integration into the genome, such as the trafficking steps to get into the nucleus). After thinking about the protocol a bit, I realized that there’s really nothing stopping the produced VSV-G pseudotyped particles from attaching to and re-entering the cells they emerged from, which is useless for viral production purposes. Even the particles that are lucky enough not to re-enter the producer cells are going to be more stable in a lower-energy environment (such as 4°C) than floating around in the supe at 37°C in the incubator.

But, well, data is always better to back such ideas. So back in April 2014 (I know this since I incorporated the date into the resulting data file name), I did an experiment where I produced VSV-G pseudotyped lentiviral particles as normal, and collected the supe at ~12 hour intervals, keeping them separate in the fridge. After they were all collected, I took ~ 10uL from each collected supe, put them on target cells, and measured luciferase activity a couple of days later (these particles had a lentiviral vector genome encoding firefly luciferase). Here’s the resulting data.

Some observations:

  • Particles definitely being produced by 24 hours, and seemingly reaching a peak production rate between 24 and 48 hours.
  • The producer cells kept producing particles at a reasonably constant rate. Sure, there was some loss between 48 and 72 hours, but still a ton being produced.
  • I stopped this experiment at 67 hours, but one can imagine extrapolating that curve out, and presumably there’s still ample production happening after 72 hours.

So yea, I suppose if the goal is to have the highest singular concentration, then taking a single collection at 48 or 72 hours will probably give you that. That said, if the goal is to have the highest total yield (which is usually the situation I’m in), then it makes much more sense to collect at various intervals, and then use the filtered, pooled supe in downstream experiments.

Also, I consider being able to dig up and discuss 10-year-old data a win!

Neutralizing Supe

Purchasing purified antibodies is expensive. Furthermore, purchased antibodies are a black box, from an amino-acid sequence perspective. For example, you may have a favorite anti-HA antibody from a company (my favorite from my PhD was an HRP direct conjugate of 3F10), but you likely have no clue what the sequence of that antibody is. Then again, if one had the sequence of the antibody, then they could order DNA encoding that antibody themselves, and then produce unlimited supplies of the antibody protein.

Well, I was curious enough to try this as a proof-of-principle. Thus, I ordered a DNA sequence encoding Bamlanivimab, which originally had an EUA to treat people infected with SARS-CoV-2, although the EUA eventually got pulled once variants resistant to it started circulating. I then engineered cells stably expressing it. To test it, I ran a neutralization experiment, where I mixed SARS-CoV-2 spike pseudotyped lentiviral particles (encoding GFP) with high ACE2 expressing cells, and simultaneously added various amounts of the presumed Bamlanivimab-containing supernatant, or just supernatant from unmodified 293T cells as a control. Here are the results:

So definitely a dose-dependent decrease in pseudotyped virus infection, with the max amount used in this experiment (I believe 4 mLs of supe out of 6 total mLs in the well, with the cells and virus each also taking a mL) giving a greater than 10-fold neutralizing effect. Cool.


So this doesn’t quite count as “synbiofun” since it didn’t work, so it’s not that fun. But figure I may as well post negative data here when we have it…

Based on this paper, I felt compelled to test some of the tricks they had published on that improved recombination efficiency. First up was DNA sequences that may help with nuclear targeting / import. Tried the NFkB DTS (DNA nuclear-targeting sequences) since that seemed to perform the best for them.

Didn’t exactly reproduce what they did, since I wanted to 1) Use my own construct that we normally use for recombination reactions, and 2) insert the sequence at a convenient location that I could put in with a single molecular cloning step and didn’t get in the way of any other elements we already had in the plasmid.

We cloned the sequences into the “G1180C_AttB_ACE2(del)-IRES-mScarletI-H2A-P2A-PuroR” backbone, where we could use the percentage of mScarlet fluorescent cells to tell us if recombination efficiency increased. Because of the repetitive nature of the NFkB DTS sequence, we ended up getting two different clones, neither of which exactly matched the intended sequence: clone D had the indicated NFkB DTS sequence plus an additional repeat (for 5 repeats total), while clone E had the NFkB DTS sequence missing a repeat (for 3 repeats total).

Clone D
Clone E

Sarah recombined these into landing pad HEK 293T cells, and these were the results of red+ cells.
Negative control: 0.002%
G1180C (unmodified): 17.4%
Clone D (5 repeats): 4.78%
Clone E (3 repeats): 17.8%

So yea. Really didn’t seem to do anything. Not sure why Clone D is worse, although this is an n=1 experiment. If we really wanted to continue this, we would probably need to re-miniprep the plasmids to make sure it’s nothing about that specific prep. That said, nothing about the above results makes me optimistic that this will actually help our current system, so this avenue is likely going on ice.
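To put the numbers above on a common scale, here is a minimal sketch that expresses each sample's red+ percentage relative to the unmodified G1180C plasmid. The percentages are the ones listed above; the dictionary and variable names are just illustrative, and since this is an n=1 experiment the ratios should be read as rough indications only.

```python
# Percent mScarlet-positive (red+) cells, copied from the results listed above.
percent_red = {
    "Negative control": 0.002,
    "G1180C (unmodified)": 17.4,
    "Clone D (5 repeats)": 4.78,
    "Clone E (3 repeats)": 17.8,
}

# Express every sample relative to the unmodified construct.
baseline = percent_red["G1180C (unmodified)"]
relative = {name: value / baseline for name, value in percent_red.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x of unmodified")
```

By this measure, Clone D comes out at roughly 0.27x of the unmodified plasmid and Clone E at roughly 1.02x, which matches the impression that the DTS didn't help.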

Trimming and tabulating fastq reads

Anna has been testing her transposon sequencing pipeline, and needed some help processing some of her Illumina reads. In short, she needed to remove the sequenced invariant transposon region (essentially a 5′ adapter sequence), trim the remaining (hopefully genomic) sequence to a reasonable 40nt, and then tabulate the reads, since there were likely going to be duplicates in there that don’t need to be considered independently. Here is what I did.

# For removing the adapter and trimming the reads down, I used a program called cutadapt. Here's information for it, as well as how I installed and used it below.

## Run the commands below in Bash (they tell conda where else to look for the program)
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
$ conda config --set channel_priority strict

## Since my laptop uses an M1 processor
$ CONDA_SUBDIR=osx-64 conda create -n cutadaptenv cutadapt

## Activate the conda environment
$ conda activate cutadaptenv

## Now trying this for the actual transposon sequencing files
$ cutadapt -g AGAATGCATGCGTCAATTTTACGCAGACTATCTTTGTAGGGTTAA -l 40 -o sample1_trimmed.fastq sample1.assembled.fastq

This should have created a file called “sample1_trimmed.fastq”. OK, next is tabulating the reads that are there. I used a program called 2fast2q for this.

## I liked to do this in the same cutadaptenv environment, so in case it was deactivated, here I am activating it again.
$ conda activate cutadaptenv

## Installing with pip, which is easy.
$ pip install fast2q

## Now running it on the actual file. I think you have to already be in the directory with the file you want (since you don't specify the file in the command).
$ python -m fast2q -c --mo EC --m 2

## Note: the "python -m fast2q -c" is to run it on the command line, rather than the graphical interface. "--mo EC" is to run it in the Extract and Count mode. "--m 2" is to allow 2 nucleotides of mismatches.
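For anyone who wants to see the logic of the two steps in one place, below is a minimal pure-Python sketch of "strip the 5′ adapter, keep 40 nt, count identical reads". It is not a replacement for cutadapt or 2fast2q, and it is simpler than both: it assumes an exact adapter match (no mismatches allowed) and drops reads where the adapter isn't found, whereas cutadapt tolerates some errors and by default keeps untrimmed reads. The function names are made up for illustration.

```python
from collections import Counter

# The invariant transposon sequence given to "cutadapt -g" in the command above.
ADAPTER = "AGAATGCATGCGTCAATTTTACGCAGACTATCTTTGTAGGGTTAA"


def trim_read(seq, adapter=ADAPTER, keep=40):
    """Return up to `keep` nt following an exact adapter match, or None if absent."""
    idx = seq.find(adapter)
    if idx == -1:
        return None
    start = idx + len(adapter)
    return seq[start:start + keep]


def tabulate(fastq_lines, adapter=ADAPTER, keep=40):
    """Count identical trimmed reads from an iterable of fastq lines."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of every 4-line fastq record is the sequence
            trimmed = trim_read(line.strip(), adapter, keep)
            if trimmed:  # skip reads with no adapter or nothing left after it
                counts[trimmed] += 1
    return counts
```

To try it on a real file, something like `with open("sample1.assembled.fastq") as handle: counts = tabulate(handle)` followed by `counts.most_common(10)` should list the most abundant trimmed reads.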

Nanopore denovo assembly

Plasmidsaurus is great, but it looks like some additional companies aside from Primordium are trying to develop a “nanopore for plasmid sequencing” service. We just tried the Plasmid-EZ service from Genewiz / Azenta(?), partially b/c there’s daily pickup from a dropbox on our campus. At first glance, the results were rather mixed. Read numbers seemed decent, but 4 of the 8 plasmid submissions didn’t yield a consensus sequence. Instead, all we were given were the “raw” fastq files. To even see whether these reads were useful or not, I had to figure out how to derive my own consensus sequence from the raw fastq files.

After some googling and a failed attempt or two, I ended up using a program called “flye”. Here’s what I did to install and use it, following the instructions here.

## Make a conda environment for this purpose
$ conda create -n flye

## Pretty easy to install with the below command
$ conda install flye

## It didn't like my file in a path that had spaces (darn google drive default naming), so I ended up dragging my files into a folder on my desktop. Just based on the above two commands, you should already be able to run it on your file, like so:
$ flye --nano-raw J202B.fastq --out-dir assembled

This worked for all four of those fastq files returned without consensus sequences. Two worked perfectly and gave sequences of the expected size. The remaining two returned consensus sequences that were twice as large as the expected plasmid. Looking at the read lengths, these plasmids did show a few reads that were twice as long as expected. That said, those reads being in there wasn’t what made the program return that doubly-long consensus sequence, as it still made that consensus seq even after the long reads were filtered out (I did this with fastq-filter; “pip install fastq-filter”, “fastq-filter -l 500 -L 9000 -o J203G_filtered.fastq J203G.fastq.gz”). So ya, still haven’t figured out why this happened and if it’s real or not, but even a potentially incorrectly assembled consensus read was helpful, as I could import it into Benchling and align it with my expected sequence, and see if there were any errors.
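One way to probe whether a double-length consensus is simply two copies of the plasmid joined head-to-tail is to compare its first half against its second half. The sketch below does that position by position. This is an assumption-laden shortcut rather than a proper alignment: a single insertion or deletion between the two copies will throw the comparison off, so a low score doesn't rule out a dimer. The function names are made up for illustration.

```python
def read_first_fasta_record(path):
    """Return the sequence of the first record in a FASTA file as one uppercase string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if parts:  # reached a second record, so stop
                    break
                continue
            parts.append(line.strip().upper())
    return "".join(parts)


def half_identity(seq):
    """Fraction of positions at which the first and second halves of `seq` agree."""
    half = len(seq) // 2
    if half == 0:
        return 0.0
    first, second = seq[:half], seq[half:2 * half]
    matches = sum(a == b for a, b in zip(first, second))
    return matches / half
```

Flye writes its consensus to a file called assembly.fasta inside the output directory, so with the command above, `half_identity(read_first_fasta_record("assembled/assembly.fasta"))` returning a value near 1.0 would suggest a tandem duplicate, while a value near 0.25 (what random sequence would give) would suggest something else is going on.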

After this experience, I’ve come to better appreciate how well Plasmidsaurus is run (and how good their pipeline for returning data is). We’ll probably try the Genewiz Plasmid-EZ another couple of times, but so far, in terms of quality of the service, it doesn’t seem as good.

Plasmidsaurus fasta standardizer

I really like Plasmidsaurus, and it’s an integral part of our molecular cloning pipeline. That said, I’ve found analyzing the resulting consensus fasta file to be somewhat cumbersome, since where they start their sequence string in the fasta file is inevitably rather arbitrary (which I don’t blame them for at all, since these are circular plasmids with no particular starting nucleotide, and every plasmid they’re getting is unique), and obviously doesn’t match where my sequence starts on my plasmid map in Benchling.

For the longest time (the past year?) I dealt with each file / analysis individually, where I would either 1) reindex my plasmid map on Benchling to match up with how the Plasmidsaurus fasta file is aligning, or 2) manually copy-paste sequence in the Plasmidsaurus fasta file, after seeing how things match up after aligning.

Anyway, I got tired of doing this, so I wrote a Python script that standardizes things. This will still require some up-front work in 1) Running the script on each plasmidsaurus file, and 2) Making sure all of our plasmid maps in Benchling start at the “right” location, but I still think it will be easier than what I’ve been doing.

1) Reordering the plasmid map.

I wrote the script so that it reorders the Plasmidsaurus fasta file based on the junction between the stop codon of the AmpR gene and the sequence directly after it. Thus, you’ll have to reindex your Benchling plasmid map so it exhibits that same break at that junction point. If your plasmid has AmpR in the forward direction, it should look like so on the 5′ end of your sequence:

And like this on the 3′ end of your sequence:

While if AmpR is in the reverse direction, it should look like this on the 5′ end of your sequence:

And like this on the 3′ end of your map:

Easy ‘nuf.

2) Running the Python script on the Plasmidsaurus fasta file.

The python script can be found at this GitHub link:

If you’re in my lab (and have access to the lab Google Drive), you don’t have to go to the GitHub repo. Instead, it will already be in the “[…additional_text_here…]/_MatreyekLab/Data/Plasmidsaurus” directory.

Open up Terminal, and go to that directory. Then type in “python3” followed by the name of the script and a space. Before hitting return to run, you’ll have to tell it which file to perform this on. Because of the highly nested structure of how the actual data is stored, it will probably be easier just to navigate to the relevant folder in Finder, and then drag the intended file into the Terminal window. The absolute path of where the file sits in your directory will be copied, so the command will now look something like “python3 [script name] /[…additional_text_here…]/_MatreyekLab/Data/Plasmidsaurus/PSRS033/Matreyek_f6f_results/Matreyek_f6f_5_G1131C.fasta”

It will make a new fasta file suffixed with “_reordered” (such as “Matreyek_f6f_1_G1118A_reordered.fasta”), which you can now easily use for alignment in Benchling.

Note: Currently, the script only works for ampicillin resistant plasmids, since those make up somewhere between 95 and 99% of all of the plasmids that we use in the lab. That means Plasmidsaurus sequencing of the rare KanR plasmid won’t work with this method. Perhaps one day I’ll update the script to also work with KanR plasmids (ie. the first time I need to run Plasmidsaurus data analysis on a KanR plasmid, haha).
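For readers curious about the general idea, here is a minimal sketch of re-indexing a circular sequence so that it starts right after a chosen marker. This is not the script from the GitHub repo, just an illustration of the technique under some assumptions: `marker` stands in for a short, unique stretch at the very end of the AmpR coding sequence (which you would have to supply yourself), matching is exact, and the sketch always returns the strand on which the marker reads forward, which may differ from how the actual script treats reverse-oriented AmpR.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (characters other than ACGT are left as-is)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def reorder_circular(seq, marker):
    """Rotate circular `seq` so it begins immediately after `marker`.

    Both strands are searched. Returns None if the marker is not found.
    """
    seq, marker = seq.upper(), marker.upper()
    for candidate in (seq, reverse_complement(seq)):
        # Doubling the sequence lets us find a marker that spans the linear break point.
        idx = (candidate + candidate).find(marker)
        if idx != -1:
            start = (idx + len(marker)) % len(candidate)
            return candidate[start:] + candidate[:start]
    return None
```

With this convention the marker ends up at the 3′ end of the returned sequence, so the same junction is used as the break point regardless of where the original fasta string happened to start.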

Network drive for file storage

I’m so tired of being jerked around by various cloud storage services. Institutional Google Drive was unlimited storage, but now the University has capped it at 100GB (how uselessly small…). I believe the institutional Box account was also unlimited, but apparently it is now 1 TB. Which, I mean, is better than 100GB, but also the Box interface is painfully awful and IMO is not useful for anything other than “cold storage” of files, where 1 TB isn’t going to cut it. Regardless, I solved this problem for actively used shared file storage for my lab by purchasing 2TB of GoogleDrive cloud storage for $100 a year (out of my personal bank account). Still doesn’t solve the “cold storage” problem for microscopy data, which can easily run much more than a terabyte.

Well, I have an iMac at work that is hooked up to a landline, and I have plenty of external hard drives, so I’m going to try to allow one of those external hard drives to be discoverable on the network, and see if that works as a way to store our microscopy data. To access this network drive (on a mac, at least), follow these instructions:

  1. If you’re not directly hooked up to the landline on campus, you’ll need to VPN in. This is done through FortiClient. This will require DUO authentication (so check your phone to accept).
  2. If you’re now connected to the VPN, you can now try to load that external network drive. To do this, hit Command + K while in the MacOS Finder (or go to the top menu and hit Go > Connect to Server…) and then enter the following:
    smb://[See the IP address in the Matreyeklab_Overview_Googledoc]/MLab_5TB
  3. You might have to put in a username and password. This will be the standard lab username and password (refer to the Matreyeklab_Overview_Googledoc if you’ve forgotten, but you must have memorized this by now…)
  4. Voila! You should now have a Finder window for the external hard drive, which will (at the least) have a folder called “Microscopy”.
  5. Now what if you want to do things on the Terminal, since you’re a power user (perhaps aspiring power user?). In that case, load up a new Terminal instance, back out of your own account directory into the root directory of your computer (ie. do “cd ../..”) and then go into the “Volumes” directory. If you’ve accessed the network drive as described above, you should now be able to go into the external hard drive directory off the network (ie. do “cd MLab_5TB”), and start doing what you need to do there.

Yay, problem solved. Obviously the above instructions are only for lab members. If you run into problems, LMK, and I’ll try to help troubleshoot.
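If you end up scripting against the drive, it can help to check that it is actually mounted before reading or writing anything. Below is a minimal sketch of such a check. It assumes the share shows up under /Volumes with the name MLab_5TB and contains the “Microscopy” folder mentioned above; the function name is made up for illustration.

```python
from pathlib import Path


def microscopy_dir(volumes_root="/Volumes", share_name="MLab_5TB"):
    """Return the Microscopy folder on the mounted share, or None if it isn't there."""
    candidate = Path(volumes_root) / share_name / "Microscopy"
    return candidate if candidate.is_dir() else None
```

A script could then start with something like `if microscopy_dir() is None:` and print a reminder to connect to the VPN and mount the drive first.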

Various links / references, mostly for my own use:

NIH grant expenditures

At least in my SOM-based department, one’s general status is supposedly correlated with the amount of indirect costs they bring in. It looks like I’m currently capped at 4 desks for my personnel, and it’s not clear if it’s worth trying to expand here. Regardless, this is apparently the directs / indirects space I’m operating in with my current setup.

What does this come out to in terms of indirect costs generated per month? Here’s the plot based on my budget reports, below:

So, essentially at my steady state (which I’ll probably be at for at least the next 2.5 years), the lab is generating ~ $20k a month in indirects for the department, amounting to ~ $5k per desk per month.
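The per-desk figure is just the monthly indirects divided by the 4 desks, which the short sketch below spells out. The second function is an extrapolation that is not stated in the post: it assumes the ~ $20k a month stays flat over the 2.5 years, which is only an approximation. Function names are made up for illustration.

```python
def per_desk_indirects(monthly_indirects, desks):
    """Indirect costs generated per desk per month."""
    return monthly_indirects / desks


def projected_indirects(monthly_indirects, years):
    """Total indirects over a period, assuming a flat monthly rate (an assumption)."""
    return monthly_indirects * 12 * years


print(per_desk_indirects(20_000, 4))     # 5000.0
print(projected_indirects(20_000, 2.5))  # 600000.0
```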

Ah, the business of SOM-based academic research. Well, I’m nothing if not transparent.

Submitted DNA amounts and reads returned

In this previous post, I showed how many reads we’ve gotten from our Plasmidsaurus and AMP-EZ submissions. Well, now’s also time to see whether the amount of DNA that we gave correlated with the number of reads we got back.

Submissions to Plasmidsaurus. Red vertical line denotes the minimum value asked for submission (>= 10uL at 30 ng/uL). Blue line is a linear model based on the datapoints.

As you can see above, since this is miniprepped DNA, it’s usually quite easy to reach the 300 ng needed for submission. One time, when we submitted closer to 200ng, it worked perfectly fine. One other time, when we submitted ~ 100ng, it did not, albeit this was not plasmid DNA and instead was a PCR product, so it’s an outlier for that reason as well.

Submissions to Genewiz / Azenta AMP-EZ. Red vertical line is the minimum amount of DNA asked for, while the horizontal red line is the number of reads they “guarantee” returned. Blue line is a linear model based on the data.

This is the more important graph though, since all of our AMP-EZ submissions are from gel extracted PCR amplifications, and it can be quite difficult to do it in such a way that we have the 500 ng of total qubitted DNA available for submission. Well, turns out that it’s probably not all that important for us to hit 500 ng of DNA, since it’s worked perfectly fine in our attempts between 200 and 500 ng. I imagine people in my lab will simultaneously be happy (knowing they don’t have to hit 500 ng) and sad (knowing they had spent a bunch of extra effort in the past unnecessarily trying to reach that number) seeing the above data, but hey, it’s good to finally know this, and better late than never!