The primary goal of
msprime is to efficiently and conveniently
generate coalescent trees for a sample under a range of evolutionary
scenarios. The library is a reimplementation of Hudson’s seminal
ms program, and aims to eventually reproduce all its functionality.
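To make "a coalescent tree for a sample" concrete, here is a minimal sketch of the basic coalescent without recombination, written with the Python standard library only. It is not the algorithm msprime uses, and the function names `simulate_coalescent_tree` and `to_newick` are illustrative assumptions rather than part of any API. Time is measured in coalescent units, and samples are labelled from 1 in the Newick output, as ms does.

```python
import random


def simulate_coalescent_tree(sample_size, seed=None):
    """Simulate one coalescent tree without recombination.

    Returns (parent, time). Samples are nodes 0..n-1 at time 0, internal
    nodes are n..2n-2, and the root has parent -1. While k lineages
    remain, the waiting time to the next coalescence is exponential
    with rate k * (k - 1) / 2.
    """
    rng = random.Random(seed)
    n = sample_size
    parent = [-1] * (2 * n - 1)
    time = [0.0] * (2 * n - 1)
    lineages = list(range(n))
    t = 0.0
    for next_node in range(n, 2 * n - 1):
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)
        # Choose two distinct lineages uniformly at random and merge them.
        i, j = sorted(rng.sample(range(k), 2))
        for child in (lineages.pop(j), lineages.pop(i)):
            parent[child] = next_node
        time[next_node] = t
        lineages.append(next_node)
    return parent, time


def to_newick(parent, time):
    """Serialise the tree as a Newick string with branch lengths."""
    children = [[] for _ in parent]
    for child, p in enumerate(parent):
        if p != -1:
            children[p].append(child)
    text = [""] * len(parent)
    # Children always have smaller ids than their parent, so a single pass
    # in id order builds every subtree string before it is needed.
    for u, p in enumerate(parent):
        if children[u]:
            label = "(" + ",".join(text[c] for c in children[u]) + ")"
        else:
            label = str(u + 1)
        if p == -1:
            return label + ";"
        text[u] = "{}:{:.3f}".format(label, time[p] - time[u])


parent, time = simulate_coalescent_tree(5, seed=42)
print(to_newick(parent, time))
```

The tree is kept as two flat arrays (parent and time) and only converted to text at the end, which keeps the simulation step independent of any output format.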
msprime differs from
ms in some important ways:
msprime is much more efficient than ms, both in terms of memory usage and simulation time. In fact, msprime is also much more efficient than simulators based on approximations to the coalescent with recombination model, especially for simulations with very large sample sizes.
msprime can easily simulate chromosome-sized regions for hundreds of thousands of samples.
msprime is primarily designed to be used through its Python API to simplify the workflow associated with running and analysing simulations. (However, we do provide an ms-compatible command line interface to plug in to existing workflows.) With a command line simulator such as ms, we typically first write a script to generate the command line parameters we want to run, then fork shell processes to run the simulations, and then parse the results to obtain the genealogies in a form we can use. With msprime all of this can be done directly in Python, which is both simpler and far more efficient.
msprime does not use Newick trees for interchange, as they are extremely inefficient in terms of the time required to generate and parse them, as well as the space required to store them. Instead, we use a well-defined format based on the powerful HDF5 standard. This format allows us to store genealogical data very concisely, particularly for large sample sizes.
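One reason a per-tree text format is costly is that neighbouring trees along a sequence usually share most of their branches, yet each Newick string writes every branch out again. The toy sketch below illustrates the alternative of storing each branch once together with the interval it covers. The record layout `(left, right, parent, child)`, the names `EDGES`, `NODE_TIME`, `tree_at` and `newick`, and the example values are assumptions made for illustration; this is not the HDF5 schema used by msprime.

```python
# Toy interval-annotated edge records: (left, right, parent, child).
# Hypothetical layout for illustration only; not msprime's file format.
EDGES = [
    (0, 10, 4, 0), (0, 10, 4, 1),  # present in both trees, stored once
    (0, 5, 5, 2), (0, 5, 5, 3), (0, 5, 6, 4), (0, 5, 6, 5),
    (5, 10, 7, 4), (5, 10, 7, 2), (5, 10, 6, 7), (5, 10, 6, 3),
]
# Node times: samples 0..3 at time 0, internal nodes above them.
NODE_TIME = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 2.0, 6: 3.0, 7: 2.5}


def breakpoints(edges):
    """Return the sorted positions at which the tree can change."""
    return sorted({e[0] for e in edges} | {e[1] for e in edges})


def tree_at(edges, position):
    """Return {child: parent} for the tree covering `position`."""
    return {
        child: parent
        for left, right, parent, child in edges
        if left <= position < right
    }


def newick(tree, times):
    """Render one {child: parent} tree as a Newick string."""
    children = {}
    for child, parent in tree.items():
        children.setdefault(parent, []).append(child)
    # The root is the only parent that is not itself a child.
    root = next(p for p in children if p not in tree)

    def render(u):
        if u in children:
            inner = ",".join(render(c) for c in sorted(children[u]))
            label = "(" + inner + ")"
        else:
            label = str(u)
        if u == root:
            return label + ";"
        return "{}:{:.1f}".format(label, times[tree[u]] - times[u])

    return render(root)


points = breakpoints(EDGES)
total_branches = 0
for left in points[:-1]:
    tree = tree_at(EDGES, left)
    total_branches += len(tree)
    print(left, newick(tree, NODE_TIME))
print(len(EDGES), "records versus", total_branches, "Newick branches")
```

In this tiny example the two Newick strings contain 12 branches between them, while the record list holds 10 entries, because the two branches below node 4 span both trees and are stored only once. The saving is small here; the sketch is only meant to show where the redundancy in repeated Newick trees comes from.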
The msprime library has also evolved to support data
from external sources, and can work with data conforming to
the Tree sequence interchange definitions. In the near future, the
efficient algorithms and data structures used to process tree
sequence data will be moved into a new library, provisiononally
tskit. Once this transition is complete,
will depend on this library, and will become primarily concerned
with simulating backwards-in-time population processes.