Writing a simple Python script (for Biologists)

Being able to write short scripts that help you do what you need to do is a very empowering ability for modern-day biologists like me.

In the process of thinking about planning a new plasmid that encodes two proteins fused together by a flexible linker, I realized that I should just write a short script that, when run, gives me a random set of glycine and serine residues composing that linker (riffing off of the ol’ GGS-repeat linkers that seem to prevail in synthetic molecular biology). I’m also in the spirit of documenting the things I’ve learned so that future trainees could have them as reference for their own learning. So here goes my attempt at explaining how I’m approaching this:

First is thinking about the scope of the script you’re trying to put together. This should be an extremely simple one, where I’ll have a list of all of the glycine and serine codons, and a series of random numbers will determine which of those codons to use. Maybe I’ll also include a user-input feature so you can tell the script how many codons it should randomly string together. Since it’s so simple, it shouldn’t require loading many extra libraries / packages for performing more advanced procedures.

In the spirit of good practice, I’ll try to do this in iPython as well, though I’ll make a simple python script at the end for quick running from the command line. OK, here goes:

Note: I posted the final code at my Github page (link at the bottom). But for people wanting to just follow along here, here is the code in its final form at the outset (so you can see how I went about building it):

from random import randint

codons = ["GGT","GGC","GGA","GGG","TCT","TCC","TCA","TCG","AGT","AGC"]

length = input("What amino acid length flexible linker would you like a nucleotide sequence for?")

linker_sequence = ""
length = int(length)
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    linker_sequence = linker_sequence + new_codon
    x = x + 1

print(linker_sequence)

1) Had to remind myself of this, but first you type in “jupyter notebook” to open the interactive web-browser interface (assuming you’ve already installed it).

2) Go to the right directory, and then make a new iPython3 notebook.

3) Let’s start simple and make the random codon generator. First, let’s make a list of the codons we want to include.

codons = ["GGT","GGC","GGA","GGG","TCT","TCC","TCA","TCG","AGT","AGC"]

4) Next, let’s figure out how to choose a random index in the list, so that a random codon is chosen. I don’t remember how to do this off-hand, so I had to google it until I got to this page.

5) OK, so I guess I do have to load a package. I’m now writing that at the top of the script.

from random import randint   
randint(0,9)

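It returned a whole number between 0 and 9 (randint includes both ends of the range), which matches the ten positions in the codon list, so I saw that it worked and commented out the randint call.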

# randint(0,9) #This line won't run

6) Next is having the script randomly pull out a glycine or serine codon. This is simply done by now including:

codons[randint(0,9)]
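As a side note, the hard-coded 9 only works while the list holds exactly ten codons. A minimal sketch of two alternatives from the same standard-library random module, neither of which is used in the script above, could look like this:

```python
import random

codons = ["GGT", "GGC", "GGA", "GGG", "TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]

# random.choice picks one element directly, so there is no index
# that has to be kept in sync with the length of the list.
new_codon = random.choice(codons)

# Index-based form that adapts if codons are added to or removed from the list.
same_idea = codons[random.randint(0, len(codons) - 1)]

print(new_codon, same_idea)
```

Either form gives each codon in the list the same chance of being picked, just like codons[randint(0,9)] does for a ten-item list.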

7) Cool. All that has worked so far. Let’s now have the script take in a user-input for the number of residues so we can repeat this process and spit out a linker sequence of desired length. I don’t remember how to ask for user input in a python script, so I had to google this as well.

8) All right, so I need to use the “raw_input” function and assign its result to a variable. Well, I tried that and it said “raw_input” wasn’t defined. I googled that and found this link saying that advice was out of date: “raw_input” only exists in Python 2, and in Python 3 it is just “input()”.

9) Thus, I typed in:

length = input("What amino acid length flexible linker would you like a nucleotide sequence for?")

And this asked me for a number like I had hoped. Great.
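One thing worth knowing here is that input() always hands back text, even when a number is typed, so it has to be converted before it can be used as a count. The sketch below shows one way to do that conversion with a basic sanity check. The helper name parse_length is hypothetical and not part of the original script.

```python
def parse_length(text):
    """Convert the text returned by input() into a positive whole number.

    parse_length is a hypothetical helper name used only for this illustration.
    """
    length = int(text.strip())  # raises ValueError for text such as "five"
    if length < 1:
        raise ValueError("linker length must be at least 1 amino acid")
    return length

# input() would return the string "5" if the user typed 5.
typed = "5"
print(type(typed).__name__, type(parse_length(typed)).__name__)  # prints: str int
```

In the actual script, the value of typed would come from the input() call shown above.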

10) To repeat this process, I went for my trusty “for” loop. Actually, I tried to make a “while” loop first, but realized I didn’t know how to make it work off the top of my head. So I just ended up making a “for” loop. I also put an x = x + 1 statement at the end, thinking of how a “while” loop counter works, though range() already hands the loop the next value of x on each pass, so that line is harmless but not actually needed. Personally, I think the ends justify the means, and going with what you know works well is a valid option for most basic scripts, when efficiency isn’t a huge priority.

length = 3  #Giving an arbitrary number for testing the script
for x in range(0,length):
    new_codon = codons[randint(0,9)]
    x = x + 1 
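For anyone curious about the difference between the two loop styles, here is a small side-by-side sketch. It assumes the same codon list as above. The names for_picks, while_picks and count are made up for this comparison.

```python
from random import randint

codons = ["GGT", "GGC", "GGA", "GGG", "TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]
length = 3

# "for" loop: range() supplies the next value of x on every pass,
# so changing x inside the loop does not change how many times it runs.
for_picks = []
for x in range(length):
    for_picks.append(codons[randint(0, len(codons) - 1)])
    x = x + 100  # no effect on the number of iterations

# "while" loop: here the counter really does have to be advanced by hand,
# otherwise the condition never becomes false and the loop never ends.
while_picks = []
count = 0
while count < length:
    while_picks.append(codons[randint(0, len(codons) - 1)])
    count = count + 1

print(len(for_picks), len(while_picks))  # prints: 3 3
```

Both loops pick exactly three codons, which is why the “for” version works fine for this script.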

11) I then had to be able to keep track of the codons that were pulled during each iteration of the loop. I thus created an empty string called “linker_sequence” and added the new codon to the end of “linker_sequence” during each iteration.

linker_sequence = ""  # A blank variable to keep track of things
length = int(length)  # To convert the input text into a number
for x in range(0,length):   
    new_codon = codons[randint(0,9)]  
    linker_sequence = linker_sequence + new_codon  
    x = x + 1
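The same accumulation step can also be wrapped in a reusable function. This is only a sketch of one possible arrangement, not the version posted with the script. The names make_linker, CODONS and rng are hypothetical.

```python
import random

CODONS = ["GGT", "GGC", "GGA", "GGG", "TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]

def make_linker(length, rng=random):
    """Return a nucleotide string made of `length` random Gly/Ser codons."""
    picks = [rng.choice(CODONS) for _ in range(length)]
    return "".join(picks)  # glue the codons together into one string

# A seeded generator makes a run repeatable, which helps when testing.
seeded = random.Random(42)
linker = make_linker(5, seeded)
print(linker, len(linker))
```

Passing in a seeded random.Random object is optional. Calling make_linker(5) on its own gives a different linker each time, like the loop above.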

12) Lastly, add a final print call so the script shows the desired string of nucleotides to the user.

print(linker_sequence)

Nice, I think that does it for the script. Super easy and simple!

13) Finally, let’s test the script. I typed in 5 amino acids, and it gave me “TCTAGTTCAGGCTCT” as the output string. I google searched “Transeq” to get to the EBI site for a simple codon translator, and translation of the nucleotide sequence above gave “SSSGS” as the protein sequence. So great, it worked! Sure, a little serine heavy, but that’s random chance for you. Should still be perfectly fine as a linker, regardless. Now to finish planning this plasmid…
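The translation check can also be done without leaving Python. The sketch below uses a small lookup table covering only the ten glycine and serine codons from the script, so it is not a general translator. The names GLY_SER_CODONS and translate_linker are made up for this example.

```python
# Lookup table for the ten codons used in the script (standard genetic code).
GLY_SER_CODONS = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "AGT": "S", "AGC": "S",
}

def translate_linker(seq):
    """Translate a Gly/Ser linker sequence, three nucleotides at a time."""
    protein = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        protein.append(GLY_SER_CODONS[codon])  # KeyError for any other codon
    return "".join(protein)

print(translate_linker("TCTAGTTCAGGCTCT"))  # prints: SSSGS
```

This agrees with the Transeq result for the test sequence above.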

Note: I’ve posted both the iPython notebook file (.ipynb) and a simple python script (.py) on my Github page. Feel free to use them!