PyZine
 


Article Finder
People
Issue 6 - Revision 7  /   October 18, 2004 


 
  Py Links:
Latest Issue
Issue 08
Issue 07
Issue 06
Issue 05
Issue 04
Issue 02
Issue 01
 
 
Downloads
     
  Articles:
Throughout the quarter we cover topics of interest to Python developers.

  Python MIDI

  Python in Bioinformatics

  Getting Twisted

  optparse

  jEdit

  Python RFID

  3d Graphics in Python

 
 
 
     


Cartwheel/FamilyJewels:
- a framework for analysis and visualization of DNA
- - - - - - - - - - - -

By C. Titus Brown | June 6, 2004

print
Abstract

Python is a language that is widely used in scientific research. Not only is it easy to learn and simple to use, it is particularly well suited to rapid development and can be used as a "glue" language to integrate a variety of different services. Over the last several years, our lab has constructed a variety of programs and toolkits using Python, including: our main sequence analysis processing pipeline; a Web interface to the processing system, including a Web service API; a prototype of our current GUI sequence visualization system (written in Jython); and Python wrappers for two C++ toolkits that are in development. Each of these projects has made significant contributions to our research, and together they form a flexible and powerful system for sequence analysis, annotation, and visualization. In this article, I describe the projects and discuss the motivations behind some of the technology decisions we made.

Introduction

In the Davidson Lab at Caltech we study the molecular basis of the information flow that leads to tissue-specific gene expression during early development in sea urchins. That is, we study the way in which regulatory information is encoded in genomic DNA; this regulatory information governs and directs the expression of specific genes in specific places at specific times. Because this information is encoded not only in genes (which, in sea urchins, as in many animals, make up less than 5% of the genomic sequence) but in genomic DNA, we have had to learn to process and analyze large amounts of sequence data. While most genes can be encoded in 2,000-5,000 bases (2-5 kilobases, or kb) of DNA, the genomic region surrounding that gene -- which usually contains the control regions -- can be spread across 50,000-80,000 bases.

DNA consists of only four letters, A, C, G, and T, representing the individual bases in the sequence. When confronted with a sequence of 100,000 of these bases, it can be difficult to know what is in the sequence, much less pull out of the sequence those features that are of interest to the biology work you're doing. Biologists, in particular, have a difficult time dealing with such large files: unlike genes, which are usually less than 5kb in length, these files cannot be easily manipulated by existing programs.

In 1999, the Davidson Lab initiated a targeted sequencing project that began to systematically collect and sequence 100kb fragments of genomic DNA containing genes and control regions of interest to our lab. Over the years, we have built several systems to analyze, annotate, and display these fragments of DNA, and we have faced several challenges.

Our first challenge was to build a novel analysis and visualization system for analyzing similarities between genomic sequence from two or three different sea urchin species, while making the system both flexible and easy to use. We also needed to produce a visualization system that could consolidate the results of several different analyses in a form accessible to biologists. Finally, we needed to distribute the execution of several different computationally intensive analysis programs across a Beowulf; this was complicated by the closed source nature of many of the programs.

Pairwise Sequence Analyses

One of the easiest ways to figure out which genomic regions are important is to look at what regions are conserved as species diverge from a common ancestor. Closely related species such as mouse and human not only have many genes in common but also share control regions. Once genomic sequence surrounding a shared gene in the two species is available, these control regions can be found by comparing the two sequences, taking what is shared, and subtracting regions that code for genes. The remaining shared sequence almost always contains the control region.

In December of 2000, my advisor, Eric Davidson, asked me to write a program to compare two sequences from different sea urchins. We needed to read in 50-100kb of genomic DNA from two species, compare the DNA sequences exhaustively, and display the results. It was particularly important to make the results accessible to the biologists working on these genes, so that they could decide what experiments to do. This meant I had to build an easy-to-use GUI to display the results of the analysis.

The initial decision I had to make was what GUI framework to use. I had been developing in Python for about a year, and I was hoping to use a Python-based GUI toolkit to build the visualization system. I wanted it to be cross-platform, because our lab uses Windows, and I preferred (and prefer) to develop under Linux. I first considered Tkinter, but rejected it based on my prior experience with Tk/Tcl. For me, the main strength of Tk was the ability to extend it with C/C++ widgets, but I didn't want to have to compile extension modules for multiple platforms. I briefly investigated and rejected several other platforms, including Qt, wxPython, and GTK. I rejected Qt because it wasn't free on the main user platform, Windows, and my advisor wasn't particularly keen on spending money for software development. wxWindows wouldn't compile or install on my stock RedHat development system, so I never tried out wxPython. PyGTK was fairly immature at the time, and the libraries seemed to be undergoing rapid change, which was a definite negative.

I finally settled on Jython, an implementation of Python built on Java, for several reasons: despite some prior bad experiences with Java/AWT, the Java class libraries had matured significantly in the intervening years. Java was built to be cross-platform, so I could build a single version for both Windows and Linux (Mac OS X did not yet exist). Also, Java/Swing looked really nice, and the documentation was astonishingly solid. Most importantly, everything "just worked"; within two hours of starting to play with Jython, I had a basic GUI interface up and running.

Figure 1: The two main views

Figure 1b: The two main views

Figure 1 The two main views from the Jython/Swing version of FamilyRelations (on Windows NT). The genomic region shown is a short segment of the mec-3 regulatory region from the C. elegans (top) and C. briggsae (bottom) genomes.

The left view is an overall view of matches between the two sequences, with matches drawn in red. The plot at the top shows the relation of the current similarity threshold to the best match, graphed against the top sequence. Control widgets are on the right.

The right view is a closeup of some of the matches. The top pane shows distinct clusters of matches; the middle pane shows a closeup of the region; and the bottom pane shows the base-to-base matchups for the region selected in the middle pane.

I also spent some time writing a simple NxM comparison algorithm, which I called seqcomp, to do the basic data analysis. The first version was written in Python and was abysmally slow; the naive algorithm is quite simple and involves a lot of looping:

## naive implementation of seqcomp algorithm

## INPUTS: two DNA sequences, seq1 & seq2.

seq1_len = len(seq1) - windowsize + 1 # account for windowsize in
seq2_len = len(seq2) - windowsize + 1 # the sequence length...


# DNA can be read in two directions: forward, and reverse complement,
# in which the sequence is reversed in order, and A --> T, C --> G, G --> C,
# and T --> A. Both forward & reverse complement must be searched.


seq2_rc = make_reverse_complement(seq2)

for i in range(0, seq1_len): # iterate over both sequences
  for j in range(0, seq2_len):
    n = 0
    for k in range(0, windowsize): # iterate across window,
      if seq1[i + k] == seq2[j + k]: # and count matches;
        n += 1


    if n >= threshold: # if sufficient, record.
       record_forward_match(i, j, n)


  for j in range(0, seq2_len): # search reverse complement
    n = 0
    for k in range(0, windowsize): # count matches in window
      if seq1[i + k] == seq2_rc[j + k]:
        n += 1


    if n >= threshold: # & record, for reverse.
      record_rc_match(i, j, n)

When I showed the GUI and seqcomp to my friend Tristan De Buysscher, a graduate student in the Wold Lab, he pulled out some C code he'd been working on independently that did the same comparison 1000 times faster than my Python code. We agreed to leave the GUI to me and seqcomp to him!

We combined the two projects under the name "FamilyJewels" and I named the GUI "FamilyRelations".

FamilyRelations and seqcomp were an immediate success. Within 6 months of the first analyses being shown to biologists, experimental (wet-lab) validation of the results of an analysis had been done by Cathy Yuh, a postdoctoral fellow in the Davidson Lab. Several other local labs were also interested in using the software. So, in addition to continuing work on the basic GUI, I started to work on an analysis server that would run the seqcomp program with user-specified parameters. We also began to integrate other views into the GUI. These projects are described in the next two sections.

Both the GUI program, FamilyRelations, and the command-line analysis program, seqcomp, have evolved considerably since the original versions. My Jython implementation of FamilyRelations proved too slow to parse and analyze the amount of data generated by seqcomp for large comparisons (comparisons of two 50k sequences generate 4-8mb of data, on average), and I have re-written FamilyRelations in C++ using the cross-platform FLTK graphics library. While this requires me to compile packages on multiple platforms, both Cygwin and Mac OS X have made this considerably easier than I'd feared: I now use the same Makefile and exact same code, without any #ifdefs, to compile standalone binaries for Linux, Windows, and OS X. seqcomp has also evolved: Tristan has moved it into a new project, Mussa, in which comparisons between sequence from three or more species can be done. I have recently put several different implementations of seqcomp-style algorithms into paircomp, a C++/Python library for parsing, generating, and manipulating analyses (discussed below).

Figure 2: The main view from FamilyRelations II, the C++/FLTK rewrite of FamilyRelations (on Mac OS X); the genomic region shown is the same as in Figure 1.

Annotating and Viewing Features on Sequences

At the same time as I was developing the FamilyRelations GUI program and Tristan was writing seqcomp, two colleagues, Pete Clarke and Alistair Rust, were working on building a system to annotate the sequences that were being generated by our directed sequencing project.

Their goal was to build a system that took sequences around 100kb in length, annotated them with various features, and then displayed them in a way that made the annotations accessible to biologists. Because we were working on the sea urchin, we had to run certain analyses that were specific to the sea urchin; this meant that most commonly available annotation programs needed to be modified for our use.

Pete and Alistair initially built a system on top of Genotator, a Tkperl-based program that is the forerunner of many of the annotation systems in use today. It allowed the results of a variety of sequence analysis programs to be displayed on different tracks and users could scan about and investigate the features however they wanted.

However, Genotator had several problems. First and foremost, it was not easy to install on Windows, and the biologists in our lab had to access it through a VNC-based login system running on Linux. It was written in Perl, which meant that its maintainability and extensibility -- especially compared to the Python software we were building -- was relatively low. And, finally, it did not have a distributed analysis pipeline built into it. In order to run jobs on our Beowulf cluster, we had to manually distribute the jobs across the cluster and then integrate the results; this process occupied much of Pete Clarke's time for the first year.

After Genotator had been running for about a year, these problems became increasingly apparent. I suggested that we switch over to using Jython to write the client part of a client-server system that would run and display the analyses we wanted. While this suggestion was not greeted with much enthusiasm -- Genotator was already up and running -- it required only a small extension of the FamilyRelations program to provide much of the basic functionality of the Genotator display. More work needed to be done on the server-side execution pipeline (discussed below) but within another 6 months we had an automated system that distributed our analyses across a Beowulf, integrated them on a database server, and served them up to a modified FamilyRelations in a locally built XML dialect.

Figure 3: The single-sequence analysis/annotation view in FRII (on Mac OS X). The line at the top represents 150kb of genomic sequence surrounding the Beta-catenin gene in S. purpuratus, the California purple sea urchin. The 8 sets of features below show the results of 8 different analyses, including matches to known genes, known genomic features, and predicted genes. Each of these analyses was run on our Beowulf cluster, stored in the Cartwheel database, and exported via the Web services API to FamilyRelations II.

As with FamilyRelations and seqcomp, this system has evolved since its beginning. The display part is still integrated with the FamilyRelations program, but has moved to C++ and FLTK along with the rest of the FamilyRelations GUI framework. By virtue of a Web services API, the analyses can now be established via a script interface in addition to a browser interface. This makes the system a generally useful way to annotate sequence with a set of stereotyped analyses.

Building a Distributed Execution Pipeline for Closed-Source Programs

For both the comparative sequence analysis project and the sequence annotation project, we needed a central server which could both do computation and serve clients with the results of the computation. We wanted to distribute the computation across several compute-only systems, and some of the computation could only be done by closed-source programs. We therefore began to develop a simple queueing system that could fit these needs.

The simplest model for distributed computation that accomodated our needs was the Linda tuple space model, in which "messages" (jobs to be done) are deposited in a common object space and individually reserved by job processors. After processing, results are returned to the space. This model requires very little in the way of inter-process coordination; in fact, the logic can be implemented in about 5 lines of Python, as long as it is possible to lock the message space.

global tuple_space # a tuple space shared across machines


## job entry function for a Linda tuple space


def enter_job(job):
  tuple_space.add(job) # no locking needed, if 'add' is atomic


## client processing function


def process_job():
  iftuple_space.has_available(): # first check availability;
    tuple_space.lock() # then lock the space,
    job = tuple_space.get_next() # grab the job,
    tuple_space.unlock() # unlock the space,
    job.execute() # and execute.

I settled on PostgreSQL, one of the top two open-source RDBMS available, for the core of the system. PostgreSQL had two essential features: it had a mature, robust transaction system, and the server could be accessed over the Internet by clients. It also had a unique object-relational system which allowed inheritance relations between tables; this greatly simplified certain types of programming issues. I chose PostgreSQL over MySQL becase MySQL did not yet support transactions, and does not have any object-relational features. Today I would consider using ZODB, which at the time did not exist as a standalone database.

This resulted in a very simple system, in which a request to run a program was deposited into the database in a given table. This table was polled at regular intervals by queue programs running on compute nodes in our Beowulf; when a new job became available, a compute node would lock the database, extract the job information, and execute it. The results of the job were deposited back into the database. Because of the object-relational table structure possible in PostgreSQL, only one table -- the base "batchqueue_requests" table, from which all job request tables descended -- needed to be polled by the queue program, and any job-specific details could be delegated to the code responsible for running the various requests. This made it easy to wrap essentially any command-line program for which a Linux binary was available; parameters and input/output could be translated as needed and stored in the database in whatever format seemed appropriate to the program. The programs were then run on individual compute nodes by queue processes. Because all of the details could be neatly encapsulated in a single class, the system was very extensible -- essential for research programming.

Only one thing was missing: an easy way to establish the analyses without making the biologists in the lab resort to SQL statements or Python code.

Putting it all together: Cartwheel at the hub

From the moment FamilyRelations became a useful way to look at comparative analyses, individual biologists had been asking me to run analyses for them using the command-line seqcomp program. This was not a scalable solution! It was not practical to introduce a seqcomp-style comparison directly into FamilyRelations -- among other things, Java is too slow to do large combinatorial computations, and the Java/C bridge system seemed too complicated to be sustainable for a small project -- and it soon became clear that we needed a system to coordinate and customize the execution of seqcomp for individuals.

So, I attached a Web interface to the batchqueue analysis system described above, and named the whole system "Cartwheel". I ended up writing two separate implementations of the Web interface, based on two different Web toolkits.

My first Web interface used the PyWX embedding module for AOLserver. AOLserver is a fantastically scalable open-source Web server used by AOL.com, among others, to serve Web applications; it's probably best known as the original foundation for the ArsDigita Community System. My primary problem with AOLserver is that it is written in C and uses Tcl as a (tightly integrated) extension language; PyWX was a project initiated by Brent Fulgham and continued by Brent, Michael Haggerty, and me, to embed Python as an alternate scripting language within AOLserver. On top of the PyWX API, I developed a simple object-page system (essentially a badly-written version of WebKit/JSP) to give out user accounts, group users into labs, and provide a way to establish sequence analyses.

The system worked fairly well from the user's point of view, but was ugly and overcomplicated in design.

About 6 months after the initial version of Cartwheel was made available, the Quixote project made its public debut. Quixote was an effort to develop a lean system for publishing objects, and borrowed many ideas from Zope. I immediately fell in love with it, and rewrote the Cartwheel website and batchqueue system from scratch using Quixote; this turned out to be a good decision, and two years later I am still happy with the modularity and extensibility supported by the Quixotic way of building Web applications.

One particularly convenient aspect of Quixote is its ability to run on top of, or inside of, several different Web servers. My initial port was built on top of PyWX, and when the maintenance of the Cartwheel site at Caltech was taken over by our sysadmin, Ian Lipsky, we moved it over to an Apache server using mod_scgi.

Figure 4: A diagram of how the various components of the Cartwheel/FamilyJewels system are connected.

Libraries and APIs: Building out the toolkit

Once the basic infrastructure was in place, it became clear that the system could be extended to provide a general framework for annotation and analysis of sequence by individuals. However, because the system was designed and built around the needs of the end user -- biologists -- it wasn't very useful for programmers: we had no scriptable interfaces to the batchqueue or the Cartwheel analysis server, and even in places where easily separable functionality existed, it was tightly integrated with the application code.

In the summer of 2002, an undergraduate student named Yuan Xie started work on an XML-RPC-based Web services API to the batchqueue, and built both Python and Java client libraries. These APIs allowed programs to connect to a running Cartwheel server and inject job requests into the queue via a script. Since then we have also built Python, Java, and C++ interfaces to the Web frontend system layered on top of the queueing system, which allows programs to set up new analyses and retrieve the data from previous analyses via XML-RPC.

In conjunction with C++ and Python XML parsers for the data files returned by Cartwheel, we can now do things like set up analyses, let them run, and produce PDF pictures of the output -- all automatically.

We have also started to extract useful functionality into separate toolkits. Two such toolkits are now at a stable stage of development: the motility toolkit provides a C++ interface (and a Python wrapper) for searching DNA sequences for sequence motifs, and the paircomp toolkit provides both a C++ and a Python interface to creating and manipulating comparative sequence analyses.

Current Status and Future Prognosis

The FamilyJewels and Cartwheel projects have been quite successful for projects developed within a single lab. Over 200 people from more than 40 labs have signed up to use our local Cartwheel server, and there are usually 20-30 active users at any one time. The availability of a tutorial for FamilyRelations has reduced the barrier to entry, and the interface is easy enough to use that a manual is not required. Currently there are 6 scientific papers published that use FR/Cartwheel, and another 5-10 are on the way.

For better or for worse, Cartwheel and FamilyJewels remain Caltech-based projects. Development is centered in the Davidson Lab, and new development takes place primarily when dictated by scientific research. One positive consequence is that the system has not become bloated with features; neither, however, has much extension of basic functionality happened.

In the future, I hope to extend FamilyRelations to have more facilities for annotating sequences in general, a better interface for searching for motifs, and additional ways to link sequence features to personal research and other data sources. The need to understand the increasing wealth of genomic sequence is only increasing, and the Cartwheel/FamilyJewels projects serve a niche that is not well represented by other projects: it is a relatively simple, extensible, and easy-to-use system for small-scale sequence analysis and annotation that has been battle-tested by many users.



Using Python

This article is mostly about the tools we built and why we built them, which would seem to be somewhat independent of the languages we used to build them. Yet Python has had a significant effect on our development effort, for several reasons.

First of all, Python is easy to learn and use: staff with previous experience in other languages can quickly pick up on existing code, and people with no programming experience at all can learn the language well enough to use existing libraries. Python also has a plethora of tools and frameworks and toolkits from which to pick and choose, which allows us to make technology decisions on merit rather than availability of packages. The integration of Jython with the Java/Swing libraries is particularly good for quickly building a nice, portable user interface. The reusability of code is better with Python than with any other language I've used, which means that code can be shared between projects and even substantial refactoring can often be done simply by moving blocks of code around. And finally, the ability to wrap C++ functions in Python quickly and easily is particularly useful for building fast analysis code into reusable packages.

For all of these reasons, Python is an excellent language for rapidly building and extending research software -- especially in bioinformatics.


Project Statistics

Projectlanguagelines of code# of filesdescriptionSourceForge project home
CartwheelPython10,000200queue processing & Web serverCartwheel
FamilyRelationsJython11,000100graphical interfaceFamilyJewels
FamilyRelations IIC++660095graphical interfaceFamilyJewels
motilityC++100023motif searching toolkitCartwheel
paircompC++200017sequence comparison toolkitFamilyJewels

Simple Python wrappers exist for paircomp and motility.



Availability

All of the Caltech software described in this article is available under either the GPL or LGPL, as is most of the other software. Please see the individual packages for details.



Acknowledgements

Eric Davidson, my advisor, and Barbara Wold, Tristan De Buysscher's advisor, provided much of the early impetus for the projects; Eric Davidson continues to support development. Andy Cameron and Erich Schwarz have been involved in design since the beginning, and were responsible for quite a bit of the early bug testing. Pete Clarke and Alistair Rust worked under Hamid Bolouri, a collaborator now at the Institute for Systems Biology in Seattle. A full list of contributors can be found on the Cartwheel Web site.

URLs

Davidson Lab:
http://www.its.caltech.edu/~mirsky/

FamilyJewels (FamilyRelations, seqcomp, FRII, paircomp):
http://family.caltech.edu/

Cartwheel (batchqueue + Web site server, client APIs, and motility):
http://cartwheel.caltech.edu/

Mussa (N-way sequence comparison):
http://mussa.caltech.edu

Genotator:
http://www.fruitfly.org/~nomi/genotator/

Quixote:
http://www.mems-exchange.org/software/quixote/

PostgreSQL:
http://www.postgresql.org

PyWX (Python in AOLserver):
http://pywx.idyll.org/ :

FLTK (C++ GUI toolkit):
http://www.fltk.org/


Titus Brown

Titus Brown is a PhD student in Biology at Caltech. He has been programming since 1988, and doing experimental molecular biology since 1999. Unfortunately for his graduate career, he's still much better at programming than experiments.


shim
shim

 Py is committed to bringing you great Python Articles.

shim
shim


Home   Subscribe   Migration FAQ   Contact PyZine   Write for PyZine   ZopeMag   opensourcexperts.com  

Reproduction of material from any of PyZine's pages without prior written permission is strictly prohibited. Copyright 2003 - 2005 PyZine Zope/Plone hosting by Nidelven IT