The primary goal of
msprime is to efficiently and conveniently
generate coalescent trees for a sample under a range of evolutionary
scenarios. The library is a reimplementation of Hudson’s seminal
ms program, and aims to eventually reproduce all its functionality.
msprime differs from
ms in some important ways:
msprime is much more efficient than
ms, both in terms of memory usage and simulation time. In fact,
msprime is also much more efficient than simulators based on approximations to the coalescent with recombination model, especially for simulations with very large sample sizes.
msprime can easily simulate chromosome-sized regions for hundreds of thousands of samples.
msprime is primarily designed to be used through its Python API, which simplifies the workflow of running and analysing simulations. (However, we do provide an
ms-compatible command line interface to plug in to existing workflows.) With command line simulators, we typically first write a script to generate the command line parameters we want to run, then fork shell processes to run the simulations, and finally parse the results to obtain the genealogies in a form we can use. With
msprime all of this can be done directly in Python, which is both simpler and far more efficient.
msprime does not use Newick trees for interchange, as they are extremely inefficient both in the time required to generate and parse them and in the space required to store them. Instead, we use a well-defined format based on the powerful HDF5 standard. This format allows us to store genealogical data very concisely, particularly for large sample sizes.
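One reason a Newick string per tree is wasteful is that adjacent trees along a recombining sequence usually share most of their branches, yet each string repeats all of them. The sketch below is a hypothetical illustration only and is not msprime's actual HDF5 layout: the helper names (`to_newick`, `shared_edge_records`) and the two toy trees are assumptions made for this example. It compares writing each tree in full against storing each branch once, together with the genomic interval over which it persists.

```python
def to_newick(parent, time, root):
    """Serialise one tree, given as a child -> parent map, as a Newick string."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    def subtree(node):
        if node not in children:  # leaf node
            label = str(node)
        else:
            inner = ",".join(subtree(c) for c in sorted(children[node]))
            label = "(" + inner + ")"
        if node == root:
            return label
        branch_length = time[parent[node]] - time[node]
        return f"{label}:{branch_length:.3f}"

    return subtree(root) + ";"


def shared_edge_records(trees):
    """Store each (child, parent) branch once as (left, right, child, parent)."""
    open_edges = {}  # (child, parent) -> coordinate where the branch first appears
    records = []
    last_right = None
    for (left, right), parent in trees:
        current = set(parent.items())
        # Close branches that are absent from this tree.
        for edge in list(open_edges):
            if edge not in current:
                records.append((open_edges.pop(edge), left, *edge))
        # Open branches that are new in this tree.
        for edge in current:
            open_edges.setdefault(edge, left)
        last_right = right
    for edge, start in open_edges.items():
        records.append((start, last_right, *edge))
    return sorted(records)


# Two toy trees over samples 0..3; nodes 4..7 are ancestors (hypothetical data).
time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.1, 5: 0.2, 6: 0.5, 7: 0.3}
tree_a = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
tree_b = {0: 4, 1: 4, 2: 7, 4: 7, 7: 6, 3: 6}
trees = [((0, 10), tree_a), ((10, 20), tree_b)]

newick_a = to_newick(tree_a, time, root=6)
newick_b = to_newick(tree_b, time, root=6)
records = shared_edge_records(trees)

print(newick_a)
print(newick_b)
print(len(tree_a) + len(tree_b), "branches written in Newick form")
print(len(records), "interval records when shared branches are stored once")
```

With only two small trees the saving is modest, but the gap widens as the number of trees and the sample size grow, because each recombination event typically changes only a few branches.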