API Documentation¶
This is the API documentation for msprime
, and provides detailed information
on the Python programming interface. See the Tutorial for an
introduction to using this API to run simulations.
See the tskit documentation for
information on how to use the
tskit Python API
to analyse simulation results.
Simulation model¶
The simulation model in msprime
closely follows the classical ms
program. Unlike ms
, however, time is measured in generations rather than
in units of \(4 N_e\) generations, i.e., “coalescent units”.
This means that when simulating a population with diploid effective size \(N_e\),
the mean time to coalescence between two samples
in an msprime
simulation will be around \(2 N_e\),
while in an ms
simulation, the mean time will be around \(0.5\).
Internally, msprime
uses the same algorithm as ms
,
and so the Ne
parameter to the simulate()
function
still acts as a time scaling, and can be set to 0.5
to match many theoretical results,
or to 0.25
to match ms
. Population sizes for each
subpopulation and for past demographic events are also defined as absolute values, not
scaled by Ne
. All migration rates and growth rates are also per generation.
When running simulations we define the length \(L\) of the sequence in
question using the length
parameter. This defines the coordinate space
within which trees and mutations are defined. \(L\) is a continuous value,
so units are arbitrary, and coordinates can take any value from \(0\) to
\(L\). (So, although we recommend setting the units of length to be
analogous to “bases”, events can occur at fractional positions.)
Mutations occur in an infinite sites process along this sequence,
and mutation rates are specified per generation, per unit of sequence length.
Thus, given the pergeneration mutation rate \(\mu\), the rate of mutation
over the entire sequence in coalescent time units is \(\theta = 4 N_e \mu
L\). It is important to remember these scaling factors when comparing with
analytical results!
Similarly, recombination rates are per unit of sequence length and per
generation in msprime
. Thus, given the per generation crossover rate
\(r\), the overall rate of recombination between the ends of the sequence
in coalescent time units is \(\rho = 4 N_e r L\). Although breakpoints do
not necessarily occur at integer locations, the underlying recombination model
is finite, and the behaviour of a small number of loci can be modelled using
the RecombinationMap
class. However, this is considered an advanced
feature and the majority of cases should be well served with the default
recombination model and number of loci.
Population structure is modelled by specifying a fixed number of subpopulations \(d\), and a \(d \times d\) matrix \(M\) of per generation migration rates. Each element of the matrix \(M_{j,k}\) defines the fraction of population \(j\) that consists of migrants from population \(k\) in each generation. Each subpopulation has an initial absolute population size \(s\) and a per generation exponential growth rate \(\alpha\). The size of a given population at time \(t\) in the past (measured in generations) is therefore given by \(s e^{\alpha t}\). Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.
Warning
This parameterisation of recombination, mutation and
migration rates is different to ms, which states these
rates over the entire region and in coalescent time units. The
motivation for this is to allow the user change the size of the simulated
region without having to rescale the recombination and mutation rates,
and to also allow users directly state times and rates in units of
generations. However, the mspms
command line application is
fully ms compatible.
Running simulations¶
The simulate()
function provides the primary interface to running
coalescent simulations in msprime.

msprime.
simulate
(sample_size=None, Ne=1, length=None, recombination_rate=None, recombination_map=None, mutation_rate=None, population_configurations=None, migration_matrix=None, demographic_events=[], samples=None, model=None, record_migrations=False, random_seed=None, mutation_generator=None, num_replicates=None, from_ts=None, start_time=None, record_full_arg=False, __tmp_max_time=None)¶ Simulates the coalescent with recombination under the specified model parameters and returns the resulting
tskit.TreeSequence
. Note that Ne is the effective diploid population size (so the effective number of genomes in the population is 2*Ne), butsample_size
is the number of (monoploid) genomes sampled.Parameters:  sample_size (int) – The number of sampled monoploid genomes. If not
specified or None, this defaults to the sum of the subpopulation sample
sizes. Either
sample_size
,population_configurations
orsamples
must be specified.  Ne (float) – The effective (diploid) population size for the reference population. This defaults to 1 if not specified.
 length (float) – The length of the simulated region in bases.
This parameter cannot be used along with
recombination_map
. Defaults to 1 if not specified.  recombination_rate (float) – The rate of recombination per base
per generation. This parameter cannot be used along with
recombination_map
. Defaults to 0 if not specified.  recombination_map (
RecombinationMap
) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with therecombination_rate
orlength
parameters, as these values are encoded within the map. Defaults to a uniform rate as described in therecombination_rate
parameter if not specified.  mutation_rate (float) – The rate of infinite sites
mutations per unit of sequence length per generation.
If not specified, no mutations are generated. This option only
allows for infinite sites mutations with a binary (i.e., 0/1)
alphabet. For more control over the mutational process, please
use the
mutate()
function.  population_configurations (list or None.) – The list of
PopulationConfiguration
instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of sizesample_size
is assumed.  migration_matrix (list) – The matrix describing the rates
of migration between all pairs of populations. If \(N\)
populations are defined in the
population_configurations
parameter, then the migration matrix must be an \(N \times N\) matrix with 0 on the diagonal, consisting of \(N\) lists of length \(N\) or an \(N \times N\) numpy array.  demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in nondecreasing order of time in the past. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.
 samples (list) – The list specifying the location and time of
all samples. This parameter may be used to specify historical
samples, and cannot be used in conjunction with the
sample_size
parameter. Each sample is a (population
,time
) pair such that the sample in positionj
in the list of samples is drawn in the specified population at the specfied time. Time is measured in generations ago, as elsewhere.  random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\).
 num_replicates (int) – The number of replicates of the specified
parameters to simulate. If this is not specified or None,
no replication is performed and a
tskit.TreeSequence
object returned. Ifnum_replicates
is provided, the specified number of replicates is performed, and an iterator over the resultingtskit.TreeSequence
objects returned.  from_ts (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see here for details on the required properties of this tree sequence and its interactions with other parameters. (Default: None).
 start_time (float) – If specified, set the initial time that the
simulation starts to this value. If not specified, the start
time is zero if performing a simulation of a set of samples,
or is the time of the oldest node if simulating from an
existing tree sequence (see the
from_ts
parameter).  record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False.
Returns: The
tskit.TreeSequence
object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if thenum_replicates
parameter has been used.Return type: tskit.TreeSequence
or an iterator overtskit.TreeSequence
replicates.Warning: If using replication, do not store the results of the iterator in a list! For performance reasons, the same underlying object may be used for every TreeSequence returned which will most likely lead to unexpected behaviour.
 sample_size (int) – The number of sampled monoploid genomes. If not
specified or None, this defaults to the sum of the subpopulation sample
sizes. Either
Population structure¶
Population structure is modelled in msprime
by specifying a fixed number of
subpopulations, with the migration rates between those subpopulations defined by a migration
matrix. Each subpopulation has an initial_size
that defines its absolute diploid size at
time zero and a pergeneration growth_rate
which specifies the exponential
growth rate of the subpopulation. We must also define the number of genomes to
sample from each subpopulation. The number of populations and their initial
configuration is defined using the population_configurations
parameter to
simulate()
, which takes a list of PopulationConfiguration
instances. Population IDs are zero indexed, and correspond to their position in
the list.
Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 6, and specify that 2 samples are drawn from population 0 and 4 from population 1, then samples 0 and 1 will be initially located in population 0, and samples 2, 3, 4, and 5 will be drawn from population 2.
Given \(N\) populations, migration matrices are specified using an \(N
\times N\) matrix of betweensubpopulation migration rates. See the
documentation for simulate()
and the Simulation model section for
more details on the migration rates.

class
msprime.
PopulationConfiguration
(sample_size=None, initial_size=None, growth_rate=0.0)¶ The initial configuration of a population (or deme) in a simulation.
Parameters:  sample_size (int) – The number of initial samples that are drawn from this population.
 initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).
 growth_rate (float) – The forwardstime exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. Defaults to 0.
Demographic Events¶
Demographic events change some aspect of the population configuration
at some time in the past, and are specified using the demographic_events
parameter to simulate()
. Each element of this list must be an
instance of one of the following demographic events
that are currently supported. Note that all times are measured in
generations, all sizes are absolute (i.e., not relative to \(N_e\)),
and all rates are pergeneration.

class
msprime.
PopulationParametersChange
(time, initial_size=None, growth_rate=None, population=None, population_id=None)¶ Changes the demographic parameters of a population at a given time.
This event generalises the
eg
,eG
,en
andeN
options fromms
. Note that unlikems
we do not automatically set growth rates to zero when the population size is changed.Parameters:  time (float) – The length of time ago at which this event occurred.
 initial_size (float) – The absolute diploid size of the population
at the beginning of the time slice starting at
time
. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.  growth_rate (float) – The new pergeneration growth rate. If None, the growth rate is not changed. Defaults to None.
 population (int) – The ID of the population affected. If
population
is None, the changes affect all populations simultaneously.

class
msprime.
MigrationRateChange
(time, rate, matrix_index=None)¶ Changes the rate of migration to a new value at a specific time.
Parameters:  time (float) – The time at which this event occurs in generations.
 rate (float) – The new pergeneration migration rate.
 matrix_index (tuple) – A tuple of two population IDs descibing
the matrix index of interest. If
matrix_index
is None, all nondiagonal entries of the migration matrix are changed simultaneously.

class
msprime.
MassMigration
(time, source, dest=None, proportion=1.0, destination=None)¶ A mass migration event in which some fraction of the population in one deme simultaneously move to another deme, viewed backwards in time. Each lineage currently present in the source population moves to the destination population with probability equal to
proportion
.This event class generalises the population split (
ej
) and admixture (es
) events fromms
. Note that MassMigrations do not have any side effects on the migration matrix.Parameters:
Debugging demographic models¶
Warning
The DemographyDebugger
class is preliminary, and the API
is likely to change in the future.

class
msprime.
DemographyDebugger
(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=[])¶ A class to facilitate debugging of population parameters and migration rates in the past.

print_history
(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF8'>)¶ Prints a summary of the history of the populations.

Variable recombination rates¶

class
msprime.
RecombinationMap
(positions, rates, num_loci=None)¶ A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers:
positions
andrates
, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation isrates[j]
over the intervalpositions[j]
topositions[j + 1]
. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).Warning
The chromosome is divided into
num_loci
regions of equal recombination distance, between which recombinations occur. This means that if recombination is constant and the genome is lengthn
, thennum_loci=n
will produce breakpoints only at integer values. If recombination rate is not constant, breakpoints will still only occur atn
distinct positions, but these will probably not coincide with thepositions
provided.Parameters:  positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.
 rates (list) – The list of rates corresponding to the supplied
positions
. Recombination rates are specified per base, per generation.  num_loci (int) – The maximum number of nonrecombining loci in the underlying simulation. By default this is set to the largest possible value, allowing the maximum resolution in the recombination process. However, for a finite sites model this can be set to smaller values.

DEFAULT_NUM_LOCI
= 4294967295¶ The default number of nonrecombining loci in a RecombinationMap.

classmethod
read_hapmap
(filename)¶ Parses the specified file in HapMap format. These files must contain a single header line (which is ignored), and then each subsequent line denotes a position/rate pair. Positions are in units of bases, and recombination rates in centimorgans/Megabase. The first column in this file is ignored, as are subsequence columns after the Position and Rate. A sample of this format is as follows:
Chromosome Position(bp) Rate(cM/Mb) Map(cM) chr1 55550 2.981822 0.000000 chr1 82571 2.082414 0.080572 chr1 88169 2.081358 0.092229 chr1 254996 3.354927 0.439456 chr1 564598 2.887498 1.478148
Parameters: filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.

classmethod
uniform_map
(length, rate, num_loci=None)¶ Returns a
RecombinationMap
instance in which the recombination rate is constant over a chromosome of the specified length. The optionalnum_loci
controls the number of discrete loci in the underlying simulation, and is by default large enough to be effectively be a continuous model.The following map can be used to simulate a true finite locus model with a fixed number of loci
m
:>>> recomb_map = RecombinationMap.uniform_map(m, rate, num_loci=m)
Parameters:
Initialising simulations from a tree sequence¶
By default msprime
simulations are initialised by specifying a set of samples,
using the sample_size
or samples
parameters to simulate()
. This
initialises the simulation with segments of ancestral material covering the
whole sequence. Simulation then proceeds backwards in time until a most recent
common ancestor has been found at all points along this sequence. We can
also start simulations from different initial conditions by using the
from_ts
argument to simulate()
. Informally, we take an ‘unfinished’
tree sequence as a parameter to simulate, initialise the simulation
from the state of this tree sequence and then run the simulation until
coalescence. The returned tree sequence is then the result of taking the
input tree sequence and completing the trees using the coalescent.
This is useful for forwardstime simulators such as
SLiM that can output tree sequences. By running
forwardtime simulation for a certain number of generations we obtain a
tree sequence, but these trees may not have had sufficient time to
reach a most recent common ancestor. By using the from_ts
argument
to simulate()
we can combine the best of both forwards and
backwardstime simulators. The recent past can be simulated forwards
in time and the ancient past by the coalescent. The coalescent
simulation is initialised by the root segments of the
input tree sequence, ensuring that the minimal amount of ancestral
material possible is simulated.
Please see the tutorial for an example of how to use this feature with a simple forwardstime WrightFisher simulator
Input requirements¶
Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forwardtime simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes. (Please see the tutorial for further explanation of this point and an example.)
Recombination map limitations¶
Because of the way that msprime
handles recombination internally, care must
be taken when specifying recombination when using the from_ts
argument.
If recombination positions are generated in the same way in both the initial
tree sequence and the coalescent simulation, then everything should work.
However, the fine scale details of the underlying recombination model matter,
so matching nonuniform recombination maps between simulators may not be
possible at present. (To make it work, we must ensure that every recombination
breakpoint in from_ts
matches exactly to a possible recombination
breakpoint in msprime’s recombination map, which is not guaranteed because of
msprime’s discrete recombination model.)
One case in which it is guaranteed to work is if from_ts
has integer
coordinates, and we want to simulate a coalescent with a uniform recombination
rate. In this case, to have a uniform recombination rate r
use:
L = int(from_ts.sequence_length)
recomb_map = msprime.RecombinationMap.uniform_map(L, r, L)
final_ts = mpsrime.simulate(from_ts=from_ts, recomb_map=recomb_map)
Node flags¶
For standard coalescent simulations, all samples are marked with the
tskit.NODE_IS_SAMPLE
flag; internal nodes all have a flags value of 0.
When using the record_full_arg
argument to simulate()
, the following
flags values are defined:

msprime.
NODE_IS_RE_EVENT
¶ The node is an ARG recombination event. Each recombination event is marked with two nodes, one identifying the individual providing the genetic material to the left of the breakpoint and the other providing the genetic material the right.

msprime.
NODE_IS_CA_EVENT
¶ The node is an ARG common ancestor event that did not result in marginal coalescence.

msprime.
NODE_IS_MIG_EVENT
¶ The node is an ARG migration event identifying the individual that migrated. Can be used in combination with the
record_migrations
argument tosimulate()
.
Simulating mutations¶
When running coalescent simulations it’s usually most convenient to use the
mutation_rate
argument to the simulate()
function to throw neutral
mutations down on the trees. However, sometimes we wish to throw mutations
down on an existing tree sequence: for example, if we want to see the outcome
of different random mutational processes on top of a single simulated topology,
or if we have obtained the tree sequence from another program and wish to
overlay neutral mutations on this tree sequence.

class
msprime.
InfiniteSites
(alphabet=0)¶ The “infinitely many sites” mutation model. In this model each mutation corresponds to a unique site, which has a floatingpoint position chosen uniformly along the sequence. As a result, each site is associated with exactly one mutation.
By default, the ancestral and derived states in this model are represented by the characters “0” and “1”. Thus, the ancestral state at a site is always “0” and the derived state for a mutation is always “1”. However, by specifying the
alphabet=NUCLEOTIDES
we can generate mutations from the nucleotide characters ACGT. In this case, for each mutation an ancestral state is chosen uniformly from these letters. The derived state is then chosen uniformly from the remaining characters so that the ancestral and derived states are always distinct.

msprime.BINARY == 0
The binary mutation alphabet where ancestral states are always “0” and derived states “1”.

msprime.NUCLEOTIDES == 1
The nucleotides mutation alphabet in which ancestral and derived states are chosen from the characters “A”, “C”, “G” and “T”.

msprime.
mutate
(tree_sequence, rate=None, random_seed=None, model=None, keep=False, start_time=None, end_time=None)¶ Simulates mutations on the specified ancestry and returns the resulting
tskit.TreeSequence
. Mutations are generated at the specified rate in measured generations. Mutations are generated under the infinite sites model, and so the rate of new mutations is per unit of sequence length per generation.If a random seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.
If the
model
parameter is specified, this determines the model under which mutations are generated. Currently only theInfiniteSites
mutation model is supported. This parameter is useful if you wish to obtain sequences with letters from the nucleotide alphabet rather than the default 0/1 states. By default mutations from the infinite sites model with a binary alphabet are generated.By default, sites and mutations in the parameter tree sequence are discarded. If the
keep
parameter is true, however, additional mutations are simulated. Under the infinite sites mutation model, all new mutations generated will occur at distinct positions from each other and from any existing mutations (by rejection sampling).The time interval over which mutations can occur may be controlled using the
start_time
andend_time
parameters. Thestart_time
defines the lower bound (in timeago) on this interval andmax_time
the upper bound. Note that we may have mutations associated with nodes with time <=start_time
since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.Parameters:  tree_sequence (tskit.TreeSequence) – The tree sequence onto which we wish to throw mutations.
 rate (float) – The rate of mutation per generation. (Default: 0).
 random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\).
 model (MutationModel) – The mutation model to use when generating
mutations. If not specified or None, the
InfiniteSites
mutation model is used.  keep (bool) – Whether to keep existing mutations (default: False).
 start_time (float) – The minimum time at which a mutation can occur. (Default: no restriction.)
 end_time (float) – The maximum time at which a mutation can occur (Default: no restriction).
Returns: The
tskit.TreeSequence
object resulting from overlaying mutations on the input tree sequence.Return type: