API Documentation¶
This is the API documentation for msprime
, and provides detailed information
on the Python programming interface. See the Tutorial for an
introduction to using this API to run simulations.
See the tskit documentation for
information on how to use the
tskit Python API
to analyse simulation results.
Simulation model¶
The simulation model in msprime
closely follows the classical ms
program. Unlike ms
, however, time is measured in generations rather than
in units of \(4 N_e\) generations, i.e., “coalescent units”.
This means that when simulating a population with diploid effective size \(N_e\),
the mean time to coalescence between two samples
in an msprime
simulation will be around \(2 N_e\),
while in an ms
simulation, the mean time will be around \(0.5\).
Internally, msprime
uses the same algorithm as ms
,
and so the Ne
parameter to the simulate()
function
still acts as a time scaling, and can be set to 0.5
to match many theoretical results,
or to 0.25
to match ms
. Population sizes for each
subpopulation and for past demographic events are also defined as absolute values, not
scaled by Ne
. All migration rates and growth rates are also per generation.
When running simulations we define the length \(L\) of the sequence in
question using the length
parameter. This defines the coordinate space
within which trees and mutations are defined. \(L\) is a continuous value,
so units are arbitrary, and coordinates can take any value from \(0\) to
\(L\). (So, although we recommend setting the units of length to be
analogous to “bases”, events can occur at fractional positions.)
Mutations occur in an infinite sites process along this sequence,
and mutation rates are specified per generation, per unit of sequence length.
Thus, given the pergeneration mutation rate \(\mu\), the rate of mutation
over the entire sequence in coalescent time units is \(\theta = 4 N_e \mu
L\). It is important to remember these scaling factors when comparing with
analytical results!
Similarly, recombination rates are per unit of sequence length and per
generation in msprime
. Thus, given the per generation crossover rate
\(r\), the overall rate of recombination between the ends of the sequence
in coalescent time units is \(\rho = 4 N_e r L\). Although breakpoints do
not necessarily occur at integer locations, the underlying recombination model
is finite, and the behaviour of a small number of loci can be modelled using
the RecombinationMap
class. However, this is considered an advanced
feature and the majority of cases should be well served with the default
recombination model and number of loci.
Population structure is modelled by specifying a fixed number of subpopulations \(d\), and a \(d \times d\) matrix \(M\) of per generation migration rates. Each element of the matrix \(M_{j,k}\) defines the fraction of population \(j\) that consists of migrants from population \(k\) in each generation. Each subpopulation has an initial absolute population size \(s\) and a per generation exponential growth rate \(\alpha\). The size of a given population at time \(t\) in the past (measured in generations) is therefore given by \(s e^{\alpha t}\). Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.
Warning
This parameterisation of recombination, mutation and
migration rates is different to ms, which states these
rates over the entire region and in coalescent time units. The
motivation for this is to allow the user change the size of the simulated
region without having to rescale the recombination and mutation rates,
and to also allow users directly state times and rates in units of
generations. However, the mspms
command line application is
fully ms compatible.
Running simulations¶
The simulate()
function provides the primary interface to running
coalescent simulations in msprime.
Population structure¶
Population structure is modelled in msprime
by specifying a fixed number of
subpopulations, with the migration rates between those subpopulations defined by a migration
matrix. Each subpopulation has an initial_size
that defines its absolute diploid size at
time zero and a pergeneration growth_rate
which specifies the exponential
growth rate of the subpopulation. We must also define the number of genomes to
sample from each subpopulation. The number of populations and their initial
configuration is defined using the population_configurations
parameter to
simulate()
, which takes a list of PopulationConfiguration
instances. Population IDs are zero indexed, and correspond to their position in
the list.
Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 6, and specify that 2 samples are drawn from population 0 and 4 from population 1, then samples 0 and 1 will be initially located in population 0, and samples 2, 3, 4, and 5 will be drawn from population 2.
Given \(N\) populations, migration matrices are specified using an \(N
\times N\) matrix of betweensubpopulation migration rates. See the
documentation for simulate()
and the Simulation model section for
more details on the migration rates.
Demographic Events¶
Demographic events change some aspect of the population configuration
at some time in the past, and are specified using the demographic_events
parameter to simulate()
. Each element of this list must be an
instance of one of the following demographic events
that are currently supported. Note that all times are measured in
generations, all sizes are absolute (i.e., not relative to \(N_e\)),
and all rates are pergeneration.
Debugging demographic models¶
Warning
The DemographyDebugger
class is preliminary, and the API
is likely to change in the future.
Variable recombination rates¶
Initialising simulations from a tree sequence¶
By default msprime
simulations are initialised by specifying a set of samples,
using the sample_size
or samples
parameters to simulate()
. This
initialises the simulation with segments of ancestral material covering the
whole sequence. Simulation then proceeds backwards in time until a most recent
common ancestor has been found at all points along this sequence. We can
also start simulations from different initial conditions by using the
from_ts
argument to simulate()
. Informally, we take an ‘unfinished’
tree sequence as a parameter to simulate, initialise the simulation
from the state of this tree sequence and then run the simulation until
coalescence. The returned tree sequence is then the result of taking the
input tree sequence and completing the trees using the coalescent.
This is useful for forwardstime simulators such as
SLiM that can output tree sequences. By running
forwardtime simulation for a certain number of generations we obtain a
tree sequence, but these trees may not have had sufficient time to
reach a most recent common ancestor. By using the from_ts
argument
to simulate()
we can combine the best of both forwards and
backwardstime simulators. The recent past can be simulated forwards
in time and the ancient past by the coalescent. The coalescent
simulation is initialised by the root segments of the
input tree sequence, ensuring that the minimal amount of ancestral
material possible is simulated.
Please see the tutorial for an example of how to use this feature with a simple forwardstime WrightFisher simulator
Input requirements¶
Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forwardtime simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes. (Please see the tutorial for further explanation of this point and an example.)
Recombination map limitations¶
Because of the way that msprime
handles recombination internally, care must
be taken when specifying recombination when using the from_ts
argument.
If recombination positions are generated in the same way in both the initial
tree sequence and the coalescent simulation, then everything should work.
However, the fine scale details of the underlying recombination model matter,
so matching nonuniform recombination maps between simulators may not be
possible at present. (To make it work, we must ensure that every recombination
breakpoint in from_ts
matches exactly to a possible recombination
breakpoint in msprime’s recombination map, which is not guaranteed because of
msprime’s discrete recombination model.)
One case in which it is guaranteed to work is if from_ts
has integer
coordinates, and we want to simulate a coalescent with a uniform recombination
rate. In this case, to have a uniform recombination rate r
use:
L = int(from_ts.sequence_length)
recomb_map = msprime.RecombinationMap.uniform_map(L, r, L)
final_ts = mpsrime.simulate(from_ts=from_ts, recomb_map=recomb_map)
Simulating mutations¶
When running coalescent simulations it’s usually most convenient to use the
mutation_rate
argument to the simulate()
function to throw neutral
mutations down on the trees. However, sometimes we wish to throw mutations
down on an existing tree sequence: for example, if we want to see the outcome
of different random mutational processes on top of a single simulated topology,
or if we have obtained the tree sequence from another program and wish to
overlay neutral mutations on this tree sequence.

msprime.BINARY == 0
The binary mutation alphabet where ancestral states are always “0” and derived states “1”.

msprime.NUCLEOTIDES == 1
The nucleotides mutation alphabet in which ancestral and derived states are chosen from the characters “A”, “C”, “G” and “T”.