Analyzing Illumina Fastq data

We recently got some Illumina sequencing back from GeneWiz, and I realized that this is a good opportunity to show people in the lab how to do some really basic operations to handle this type of sequencing data, so I’ll write those instructions here. Since essentially everybody in the lab uses a Mac as their primary computer, these instructions are written for a Mac, though the same basic steps can likely be applied to PCs. Also, since these files are small, everything will be done locally; once the files get big enough and the analyses more complicated, we’ll start doing things on a computer cluster. Now to get to the actual info:

1. First, find the data we’ll be using for practice today. If you’re in the lab, you can find the files in the Data/Illumina/Amplicon_EZ/30-507925014/00_fastq directory on the lab GoogleDrive.

We won’t need to analyze everything there for this tutorial; instead, let’s focus on the “KAM-IDT-Std_R1_001.fastq.gz” and “KAM-IDT-Std_R2_001.fastq.gz” files.

2. Copy the files to a directory on your local computer. You can do the old “drag and drop” using the GUI, or you can do it in the command line like so, once you adjust the paths for your own computer:

$ cp /Volumes/GoogleDrive/My\ Drive/MatreyekLab_GoogleDrive/Data/Illumina/Amplicon_EZ/30-507925014/00_fastq/KAM-IDT-Std_R1_001.fastq.gz /Users/kmatreyek/Desktop/Illumina_data

$ cp /Volumes/GoogleDrive/My\ Drive/MatreyekLab_GoogleDrive/Data/Illumina/Amplicon_EZ/30-507925014/00_fastq/KAM-IDT-Std_R2_001.fastq.gz /Users/kmatreyek/Desktop/Illumina_data

3. Un-gzip the files. You can do this in the GUI by double-clicking the files, or you can do it in the terminal (if you’re now in the right directory) like so. The -d flag decompresses, and -k keeps the original .gz file alongside the decompressed copy:

$ gzip -dk KAM-IDT-Std_R1_001.fastq.gz

$ gzip -dk KAM-IDT-Std_R2_001.fastq.gz
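
If you’d rather peek at a file without decompressing it at all, Python’s built-in gzip module can read the .gz directly. Here’s a minimal sketch of that idea; the function name first_record is just a label I made up for illustration, and it assumes a standard FASTQ file where every read takes up exactly 4 lines.

```python
import gzip
import os
from itertools import islice


def first_record(path):
    """Return the 4 lines of the first FASTQ record in a gzipped file."""
    # "rt" opens the compressed file in text mode, so we get str, not bytes.
    with gzip.open(path, "rt") as handle:
        # islice stops after 4 lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, 4)]


if __name__ == "__main__":
    # Adjust this path for your own computer.
    path = "KAM-IDT-Std_R1_001.fastq.gz"
    if os.path.exists(path):
        for line in first_record(path):
            print(line)
```

This only reads as much of the file as it needs, so it should stay quick even when the files get big.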

Optional: Take a look at your fastq files. You won’t want to open the files in their entirety, so it makes more sense to look at just the first 4 lines of the file, which correspond to the first read. To do this, type:

$ head -4 KAM-IDT-Std_R1_001.fastq

And you should get an output that looks like so:

4. The reads won’t always be paired, but this time they are, so we’ll pair (merge) them. If you don’t already have a method for doing this, then download PEAR and install it like I described here. Once you have it installed, you can type in a command like so:

$ pear -f KAM-IDT-Std_R1_001.fastq.gz -r KAM-IDT-Std_R2_001.fastq.gz -o IDT_HM

It took my desktop a couple minutes for this process to complete. You’ll get an output that looks like this.

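Your directory should now have the original files plus the four files that PEAR writes for the IDT_HM prefix: IDT_HM.assembled.fastq, IDT_HM.discarded.fastq, IDT_HM.unassembled.forward.fastq, and IDT_HM.unassembled.reverse.fastq.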

You can look at the first read again, now that it’s been paired, using the following line:

$ head -4 IDT_HM.assembled.fastq

As you can tell, the quality scores for the first read went from mostly F’s (Q-Score of 37) to almost all I’s (Q-Score of 40).
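
In case the letter-to-score conversion seems mysterious: each quality character encodes a score as its ASCII code minus an offset. Here’s a small sketch of that arithmetic, assuming the Phred+33 encoding (which is consistent with F being 37 and I being 40); the function name phred_scores is my own for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Q-scores.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return [ord(ch) - offset for ch in quality_line]


if __name__ == "__main__":
    # 'F' is ASCII 70 and 'I' is ASCII 73, so this prints [37, 37, 40].
    print(phred_scores("FFI"))
```

You can paste the fourth line of any read into phred_scores to see its per-base scores as numbers.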

5. Now that we’ve prepped the Illumina data, it’s time for the downstream analysis. This will be far more project- or experiment-specific, so these next steps won’t apply to every situation. In this case, we made a library of Kozak variants to try to get a range of expression levels of the protein of interest. Furthermore, the template DNA used for the PCR lacked a Kozak sequence and a start codon, and some of that template will inevitably be in the sequencing data too. So the goal of this next step is to distinguish the reads that are template from those that have a Kozak sequence and, if a read does have a Kozak and ATG introduced, to extract the Kozak sequence from it.
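
To make that logic concrete, here’s a rough sketch of one way the classification could work. To be clear, this is not the Extract_Kozak.py script itself. The flanking sequences (UPSTREAM, DOWNSTREAM) and the 6-nucleotide Kozak length are made-up placeholders, and the function names are hypothetical, so you’d need to swap in the real values for an actual amplicon.

```python
import io
import re

# Hypothetical placeholders: NOT the real amplicon sequences.
UPSTREAM = "GCTAGC"    # assumed constant sequence just 5' of the Kozak
DOWNSTREAM = "GTGAGC"  # assumed constant sequence just 3' of the ATG
KOZAK_LEN = 6          # assumed length of the variable Kozak region

# A Kozak read has: upstream flank, variable bases, ATG, downstream flank.
KOZAK_PATTERN = re.compile(
    UPSTREAM + "([ACGT]{" + str(KOZAK_LEN) + "})ATG" + DOWNSTREAM
)
# A template read has the two flanks directly adjacent (no Kozak, no ATG).
TEMPLATE_JUNCTION = UPSTREAM + DOWNSTREAM


def read_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) from a FASTQ handle."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def classify_read(seq):
    """Return (label, kozak) where label is 'kozak', 'template', or 'other'."""
    match = KOZAK_PATTERN.search(seq)
    if match:
        return ("kozak", match.group(1))
    if TEMPLATE_JUNCTION in seq:
        return ("template", "")
    return ("other", "")


if __name__ == "__main__":
    # Tiny fake FASTQ with one Kozak-containing read and one template read.
    fake = io.StringIO(
        "@r1\nAA" + UPSTREAM + "GCCACC" + "ATG" + DOWNSTREAM + "\n+\nIIII\n"
        "@r2\nAA" + UPSTREAM + DOWNSTREAM + "\n+\nIIII\n"
    )
    for seq in read_sequences(fake):
        print(classify_read(seq))
```

Anchoring on constant flanking sequence keeps the search simple, but it will miss reads with sequencing errors in the flanks, so those end up in the “other” bin.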

I went ahead and wrote a short Python script that achieves this. So, grab that file, stick it in the same directory as the data you want to analyze, and run it.

$ python3 Extract_Kozak.py IDT_HM.assembled.fastq

The script should then create a file called “IDT_HM.assembled.tsv” that should look like this:

This can now be easily analyzed with whatever your favorite data analysis language is, whether it’s R or Python. Huzzah!
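
As one example of what that analysis could look like in Python, here’s a minimal sketch that tallies how often each value appears in one column of a tab-separated file. The function name tally_column is hypothetical, and the column layout is an assumption, so check which column of the .tsv actually holds the Kozak sequence and whether there’s a header row before using it.

```python
import csv
from collections import Counter


def tally_column(path, column=0, has_header=False):
    """Count how many times each value occurs in one column of a TSV file.

    `column` is a zero-based index; which column holds the Kozak sequence
    is an assumption you need to check against the real file.
    """
    counts = Counter()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row if there is one
        for row in reader:
            if len(row) > column:  # ignore short or blank rows
                counts[row[column]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python3 tally.py <path to the .tsv from step 5>
    if len(sys.argv) > 1:
        for value, n in tally_column(sys.argv[1]).most_common(10):
            print(value, n, sep="\t")
```

The most_common call sorts by count, so the most abundant variants show up first.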