MoSDi -- Motif Statistics and Discovery

Biological sequence analysis is often concerned with the search for structure in long strings such as DNA, RNA, or amino acid sequences. Frequently, "search for structure" means looking for patterns that occur very often. Two difficulties arise here: many of the score functions proposed in recent years are ad hoc, and many of the motif discovery algorithms are heuristic. In our research, we seek to address the motif discovery problem in an exact and statistically rigorous manner. To this end, we recently introduced probabilistic arithmetic automata, a theoretical framework that allows fast and exact computation of motif statistics (see [1]). In [2], we present an algorithm that discovers the optimal motif with respect to its p-value.
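
To make "exact motif statistics" more concrete, the following Java sketch computes the exact distribution of the number of (overlapping) occurrences of a single pattern in a random text, and from it a p-value. It is only an illustration and not MoSDi's API or the construction from [1]: it assumes an i.i.d. text model over the DNA alphabet and a single fixed pattern, and it uses a plain dynamic program over (automaton state, occurrence count). All class and method names are hypothetical.

```java
// Illustrative sketch only; names are hypothetical and not part of MoSDi.
final class OccurrenceCountSketch {
    static final char[] ALPHABET = {'A', 'C', 'G', 'T'};

    // Builds a pattern matching automaton: delta[q][a] is the length of the
    // longest prefix of the pattern that is a suffix of pattern[0..q) + a.
    // State m (the pattern length) means "an occurrence ends here".
    static int[][] buildAutomaton(String pattern) {
        int m = pattern.length();
        int[][] delta = new int[m + 1][ALPHABET.length];
        for (int q = 0; q <= m; q++) {
            for (int a = 0; a < ALPHABET.length; a++) {
                String s = pattern.substring(0, q) + ALPHABET[a];
                int k = Math.min(m, s.length());
                while (k > 0 && !s.endsWith(pattern.substring(0, k))) {
                    k--;
                }
                delta[q][a] = k;
            }
        }
        return delta;
    }

    // Returns d with d[k] = P(exactly k occurrences) for k < maxCount and
    // d[maxCount] = P(at least maxCount occurrences), for an i.i.d. random
    // text of length n. charProbs follows the order of ALPHABET.
    static double[] countDistribution(String pattern, int n,
                                      double[] charProbs, int maxCount) {
        int m = pattern.length();
        int[][] delta = buildAutomaton(pattern);
        double[][] cur = new double[m + 1][maxCount + 1];
        cur[0][0] = 1.0; // empty text: start state, zero occurrences
        for (int i = 0; i < n; i++) {
            double[][] next = new double[m + 1][maxCount + 1];
            for (int q = 0; q <= m; q++) {
                for (int k = 0; k <= maxCount; k++) {
                    double p = cur[q][k];
                    if (p == 0.0) {
                        continue;
                    }
                    for (int a = 0; a < ALPHABET.length; a++) {
                        int q2 = delta[q][a];
                        // Entering the final state adds one occurrence;
                        // counts above maxCount are lumped together.
                        int k2 = (q2 == m) ? Math.min(k + 1, maxCount) : k;
                        next[q2][k2] += p * charProbs[a];
                    }
                }
            }
            cur = next;
        }
        double[] result = new double[maxCount + 1];
        for (int q = 0; q <= m; q++) {
            for (int k = 0; k <= maxCount; k++) {
                result[k] += cur[q][k];
            }
        }
        return result;
    }

    // p-value: probability of observing at least k occurrences by chance.
    static double pValue(String pattern, int n, double[] charProbs, int k) {
        return countDistribution(pattern, n, charProbs, k)[k];
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        // Chance of seeing "ACGT" at least 10 times in 1000 random bases.
        System.out.println(pValue("ACGT", 1000, uniform, 10));
    }
}
```

The running time of this sketch is proportional to n times the pattern length times the count cap times the alphabet size. The result is exact up to floating-point rounding, because no approximating distribution is involved.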

Software

All motif statistics and motif discovery algorithms explained in [1] and [2] (and more) have been implemented in Java in a software package called MoSDi. The software is currently in an experimental (alpha) stage. All algorithms from [1] and [2] are, however, fully usable and tested.

Download

Source Code

A source code release is planned for the near future.

How to use MoSDi for DNA motif discovery

Contact

If you have any questions, suggestions, or comments, please feel free to contact Tobias Marschall.
E-mail: tobias.marschall$tu-dortmund.de

References

[1] Tobias Marschall and Sven Rahmann. Probabilistic arithmetic automata and their application to pattern matching statistics. In Paolo Ferragina and Gad Landau, editors, Combinatorial Pattern Matching (CPM'08), volume 5029 of LNCS, pages 95-106. Springer, 2008.
[2] Tobias Marschall and Sven Rahmann. Efficient Exact Motif Discovery. Submitted.

Last changed 2009/06/09