|
|
||||||||||||||||||
|
|
||||||||||||||||||
![]() |
![]() |
Issue 8 - Revision 4 / October 20, 2005
|
|||
|
Slithering Through Molecules: An Overview of Python in Bioinformatics - - - - - - - - - - - - By Vineeth Surendranath | February 28, 2005
Introduction
Mixing two disparate disciplines like biology and computation yields extremist views and a middle ground. At one extreme, some believe that computation will ultimately spew answers to every known biological question. At the other extreme, some believe computation cannot begin to ask the right questions in molecular biology. Yet, as with all extreme views, neither of these opinions are very practical. Some problems probably cannot be solved with computation, and still other problems are ideally suited to the application of computers. As an example of the latter, in research where huge volumes of data exist, a number of tasks need to be repeated often, and there's a need to find patterns in data -- patterns difficult for humans to analyze because of our own limitations -- algorithms and computers are an invaluable aid. This article describes some common uses of bioinformatics — the application of computer and information science to biology — and explains how Python has been of considerable use in bioinformatics. Though advocacy is not the intention of this article, it should nonetheless give Python programmers a taste of bioinformatics, and show those interested in bioinformatics how easily Python can solve substantial problems.
The Central Dogma
Before setting off to mix bits and bytes with molecules, let's take a quick look at the central dogma of molecular biology. The central dogma talks about DNA, RNA, proteins, and the mechanisms that dictate how those basic building blocks are formed. DNA is made up of four nucleotides, labelled A, G, T, and C. The information contained in the DNA is transferred to synthesized RNA, which is then translated to a protein. A protein is made up of amino acids, where each amino acid is represented by 20 letters, where each letter stands for one of the 20 standard amino acids. The transfer of information from DNA to RNA is called transcription, and from transcription to translation involves translation of codons, elementrary units of genetic code, with three nucleotides in each unit. So, there are 64 (4 * 4 * 4) possible amino acids, but because of redundancy, the 64 sets code only for the 20 amino acids. (For example, both the codons CAA and CAG code for the same amino acid glutamine.) (In fact, this process has been studied in great detail, and the state of the art model of the central dogma cannot be described as simply. One such exception is that not all of the DNA codes for proteins. There are parts of the genome — all of the genetic information of an organism is called its genome, and to put it formally for informatics, it is the entire set of nucleotides that make up all of the dna for a particular individual — that do not form proteins, and this leads us to two principal players in the DNA: introns and exons. Introns are those parts of the DNA that do not code for proteins, and exons do. After the generation of RNA, a process called splicing removes the introns and brings the exons together. Many such events have been discovered, and the central dogma remains an evolving one, though the basic framework holds true.) If there are four nucleotides and 20 amino acids, where does Python — and if not Python, then programming in general — figure in the analysis? Ultimately, it's the size of the data that makes the application of computers to biology inevitable. For example, the smallest known genetic framework has 600,000 nucleotide pairs, and the human and mouse genomes (genetic framework) have roughly three billion nucleotide pairs. Any attempt at analysis by hand is virtually impossible. So, if there are thousands and billions of nucleotides, where exactly does programming figure in the interplay of such a large set of molecules? The wow factor is that these sets of molecules are formally represented as strings, and the processes involving these molecules can be expessed as mathematical models. That brings biology directly into the domain of computation, which has large toolkits and methods for dealing with both strings and numbers. (The down side is that, for any given biological system, the model can never be completely comprehensive. This is because there are too many variables influencing the system. Reasonable approximations are plausible and are very much used in the analysis.) "What do I need to analyze?" is the obvious next question. Biologists look for signals in the sequences. Signals are specific patterns in DNA sequences or in protein sequences (where they are called motifs). For example, the place in the genome where transcription begins is usually preceded by the occurrence of the specific pattern TATA. This is an example of a signal: all TATA occurences need not indicate a transcription starting somewhere after it, but are suggestive. Read that preceding sentence until you understand it completely — not to understand the TATA mechanism, but to realize the ambiguous nature of genetic "rules." A bioinformatician doesn't live in a purely deterministic world. Similarly, when dealing with protein sequences the occurrence of a particular pattern of amino acids, such as KLXXYXR, where the X indicates any amino acid and K, L, and R are particular amino acids, might suggest a specific function that's been attributed to this motif. Bioinformaticians use these motifs to classify different proteins into families based on the occurrence of specific motifs.
Bioinformatics 101
You now have enough background to look at some real bioinformatics. As cliched as it may have become, learning by doing is still the best method. Let's take a look at a signal. Genes are parts of the DNA that code for specific proteins, but there's no "treasure map" or Rosetta stone that documents where the genes are in a complete DNA setup. (Moreover, Nature seems to have made it a point to preemptively strike down any formulated rule.) But there are some classic tell-tale signs and one of them is called a TATA box. A TATA box occurs upstream of where the actual transcription begins, but not at a specific upstream location. The occurence of a TATA box is only indicative and not conclusive, as was mentioned earlier, and so it helps to look for them. Let's make a quick calculation. What is the probability that a TATA box should occur? Since there are four nucleotides and the occurence of T, A, T and A are indpendent, the probability is 1/4 * 1/4 * 1/4 * 1/4, or 1/256. Quite high for an entire genome. So, how can you test this out? You first need to get the genome. Specifically, you should get the human genome. You can download it from the National Center for Biotechnology Information (NCBI) at ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/. Download all of the files ending in *.fa.gz. (Each zipped file is around 40 MB, and unzipped files are aound 150 MB.) When the downloads are finished and you expand all of the files, the complete genome spans all of the files that end with the extension .fa. (.fa signifies a fasta file, which is a common file format for storing sequences.) For example, this is a fasta file, with the sequence greatly truncated: > 9 dna_rm:chromosome chromosome:NCBI35:9:1:138429268:1 ... TTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTCT TTTTGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGC TGTCTCTTAGCCCAGACTTCCCGTGTCCTTTCCACCAAGCCTTTGAGAGGTCACAGGGTC TTGATGCTGTGGTCTTGATCTGCAGGTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAG GTGCAAGCTGAGCACTGGAGTGGAGTTTTCCTGTGG... Each fasta file has an identifier line at the start, and the sequence appears from the next line on. Additional identifiers and sequences can appear in the file, separated by newlines. (In the sample data you downloaded, the data -- the chromosomes -- has only one identifier line, with the sequence for the chromosome occupying the rest of the file. The sequence in this case is a 150 MB string!) Now, with a host of .fa files in a folder, you're ready to find the count of TATA boxes in the entire genome. First, you need to enumerate all of the .fa files. You could do that in many ways: 1. You could use the listdir() method from the os module and check if the last two characters of each filename are fa. (Assuming the filename is an a variabled called filename, the code filename[-2:] does the trick.) 2. You could also use regular expressions to test the filename for the proper extension. The regular expression re.compile("(.*?)fa") can be used to check a filename for a match. 3. Or, to better reflect what one typically does on the command-line, you could use glob.glob("*.fa"), which returns a list of the filenames with extension .fa. Assuming that you use the latter option, you must then open the files and read them in sets of fixed size, since you don't want to read all 150 MB in one shot. Let's read 4,000 bytes at a time, or in our case, 4,000 nucleotides at a time. (You could read more depending on how you want to use the memory of your system.) But remember: you cannot read in discrete chunks, since the string you're looking for might occur across two chunks. Instead, you must use a sliding window along the sequence, with an overlap of 3. Why 3? You are looking for TATA, which implies sets of 4 nucleotides. For example, if you're reading IAMASTRINGTOBEWINDOWED using an overlap of 3, a sliding window of 8, looking for the string RIN, the reads would be: (Read 1) IAMASTRI (Read 2) RINGTOBE (Read 3) BEWINDOW And so on... Once the string is in memory, you can do a native string find operation using the find() method on a string object. Here's a look at the actual Python code that does the counting:
import glob
fastaFiles = glob.glob("*.fa")
tataCount = 0 # getting ready for the counting, Python does not
auto-initialize, sensible
slideWindow = 4000
overlap = 3 #overlap could begin with TAT | A, or TA | TA or T | ATA,
| indicates split
for fastaFile in fastaFiles:
currFile = open(fastaFile, "r")
idxLine = currFile.readline() # we take the index line off
currSeq = currFile.read(slideWindow).strip().replace("\n","")
while currSeq:
tataCount += currSeq.count('TATA')
currFile.seek(-3,1) # seek tells file pointer to move 3 back
from current position
currSeq = currFile.read(slideWindow).strip().replace("\n","")
print tataCount
You could modify this script to yield the positions of the TATA boxes in the individual chromosomes. Then you could move downstream from those locations in the genome and look for any other signals that indicate transcription start sites.
On to Proteins
The previous problem is one of the simpler problems that faces a bioinformatician. Let's go one step further into the realm of proteins. Among proteins, there are signature motifs that have been identified as having a specific function. Some motifs are positional too — that is, the occurrence of the motif has to be at a particular position in the protein's sequence of amino acids. Let's consider one such case. For this analysis, you will need another fasta file, but this is just one file with all known protein sequences in the species Homo sapiens (humans). You can get the file at ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/protein/protein.fa.gz. Now let's say that you're interested in a particular motif at the end of each protein sequence. You want to specify the motif as a set of possible amino acids for each of a few end positions of the protein. For example, you'd want to say the pattern at the end should be H/Y/L, F/W, V/I/C/Y, which are the possible amino acids for the last 3 positions of the protein sequence. Assume that you have a built-in function to collect input, such that you get the motif in the format ['HYL', 'FW', 'VICY'], which allows you to enter the variable parts of the motif as strings. Given such a list of strings, the naive approach would be to build all possible endings from these strings and then to search for each of these endings in each of the proteins. However, the very sound of this naive idea should suggest its exclusion. (If it isn't simple, you are doing something wrong.) Since hubris should be the virtue of any programmer, let's use regular expressions instead, as this is a classic example of using text patterns anyway. motifParts = ['HYL','FW','VICY'] motifRegex = re.compile(reduce(lambda a, b: a + b, ['['+motifPart+']' for motifPart in motifParts]) + '$') Don't be frightened. Let's walk through the code. ['['+motifPart+']' for motifPart in motifParts] is a list comprehension that turns every element of the list into the form [HYL]. So, at the end of this operation. you have a list which looks like ['[HYL]','[FW]','[VICY]']. Next, the code adds them up using reduce(), which applies a function iteratively to the elements of a list. At the end of the reduce(), you're left with '[HYL][FW][VICY]'. Now just add the $, and that makes the regular expression: [HYL][FW][VICY]$ And voila! You are done! Open the file, read sequence by sequence (in the case of protein sequences you do not have to use sliding windows, since individual protein sequences aren't as large as chromosome sequences), and try to match the regular expression against the sequence. When you find a match, return the fasta identifier line. The beauty of this idea is that, if at some time, the researcher wants to introduce more variability, such as position 1 can be H or Y or L, position 2 can be any amino acid, and position 3 can be K or R, you could easily incorporate this into the model.
Summary
We have only scratched the surface of bioinformatics, and what Python itself can do for bioinformatics. Using Python it is easy to take an idea from concept to implementation without having to coerce the language or the data. Vineeth Surendranath |
|||||||||||||||||||||||||||||||||||||||
|
Py is committed to bringing you great Python Articles. | ||||||||||||||||||||||||||||||||||||||||
![]() |
Reproduction of material from any of PyZine's pages without prior written permission is strictly prohibited. Copyright 2003 - 2005 PyZine |
|