API Documentation¶
This is the API documentation for msprime
, and provides detailed information
on the Python programming interface. See the Tutorial for an
introduction to using this API to run simulations.
See the tskit documentation for
information on how to use the
tskit Python API
to analyse simulation results.
Simulation model¶
The simulation model in msprime
closely follows the classical ms
program. Unlike ms
, however, time is measured in generations rather than
in units of \(4 N_e\) generations, i.e., “coalescent units”.
This means that when simulating a population with diploid effective size \(N_e\),
the mean time to coalescence between two samples
in an msprime
simulation will be around \(2 N_e\),
while in an ms
simulation, the mean time will be around \(0.5\).
Internally, msprime
uses the same algorithm as ms
,
and so the Ne
parameter to the simulate()
function
still acts as a time scaling, and can be set to 0.5
to match many theoretical results,
or to 0.25
to match ms
. Population sizes for each
subpopulation and for past demographic events are also defined as absolute values, not
scaled by Ne
. All migration rates and growth rates are also per generation.
When running simulations we define the length \(L\) of the sequence in
question using the length
parameter. This defines the coordinate space
within which trees and mutations are defined. \(L\) is a continuous value,
so units are arbitrary, and coordinates can take any continuous value from \(0\) up to
(but not including) \(L\). (So, although we recommend setting the units of length to be
analogous to “bases”, events can occur at fractional positions.)
Mutations occur in an infinite sites process along this sequence,
and mutation rates are specified per generation, per unit of sequence length.
Thus, given the pergeneration mutation rate \(\mu\), the rate of mutation
over the entire sequence in coalescent time units is \(\theta = 4 N_e \mu
L\). It is important to remember these scaling factors when comparing with
analytical results!
Similarly, recombination rates are per unit of sequence length and per
generation in msprime
. Thus, given the per generation crossover rate
\(r\), the overall rate of recombination between the ends of the sequence
in coalescent time units is \(\rho = 4 N_e r L\). Although breakpoints do
not necessarily occur at integer locations, the underlying recombination model
is finite, and the behaviour of a small number of loci can be modelled using
the RecombinationMap
class. However, this is considered an advanced
feature and the majority of cases should be well served with the default
recombination model and number of loci.
For gene conversion there are two parameters. The gene conversion rate determines the initiation
and is again per unit of sequence length and per generation in msprime
.
Thus, given the per generation gene conversion rate \(g\), the overall rate of
gene conversion initiation between the ends of the sequence is \(\rho = 4 N_e g L\) in
coalescent time units. The second parameter \(track\_len\) is the expected track length
of a gene conversion. At each gene conversion initiation site the track of the conversion
extends to the right and the length of the track is geometric distributed with parameter
\(1/track\_len\). Currently recombination maps for gene conversion are not supported.
However, recombination (with or without recombination maps) and a constant gene conversion
rate along the genome can be combined in msprime
.
Population structure is modelled by specifying a fixed number of subpopulations \(d\), and a \(d \times d\) matrix \(M\) of per generation migration rates. Each element of the matrix \(M_{j,k}\) defines the fraction of population \(j\) that consists of migrants from population \(k\) in each generation. Note that migration rates are specified in a way that makes sense for the coalescence process, so that the \((j, k)^{th}\) entry of the migration matrix, \(M_{j, k}\), gives the rate at which an ancestral lineage moves from population \(j\) to population \(k\) as one follows it back through time. In terms of the demographic process of the populations, if a total of \(n\) migrants move from population \(k\) to population \(j\) per generation, and population \(j\) has a total of \(N_{j}\) individuals, then the migration matrix has \(M_{j,k} = n/N_j\). This differs from the migration matrix one usually uses in population demography: if \(m_{k,j}\) is the proportion of individuals (in the usual sense; not lineages) in population \(k\) that move to population \(j\) per generation, then translating this proportion of population \(k\) to a proportion of population \(j\), we have \(M_{j,k} = m_{k,j} \times N_k / N_j\).
Each subpopulation has an initial absolute population size \(s\) and a per generation exponential growth rate \(\alpha\). The size of a given population at time \(t\) in the past (measured in generations) is therefore given by \(s e^{\alpha t}\). Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.
Warning
This parameterisation of recombination, mutation and
migration rates is different to ms, which states these
rates over the entire region and in coalescent time units. The
motivation for this is to allow the user change the size of the simulated
region without having to rescale the recombination, gene conversion, and mutation rates,
and to also allow users directly state times and rates in units of
generations. However, the mspms
command line application is
fully ms compatible.
If recombination and gene conversion are combined the gene conversion
rate in ms is determined by the ratio \(f\), which corresponds to
setting \(g = f r\). In msprime
the gene conversion rate \(g\) is
set independently and does not depend on the recombination rate. However,
mspms
mimics the ms behaviour.
Running simulations¶
The simulate()
function provides the primary interface to running
coalescent simulations in msprime.

msprime.
simulate
()[source]¶ Simulates the coalescent with recombination under the specified model parameters and returns the resulting
tskit.TreeSequence
. Note that Ne is the effective diploid population size (so the effective number of genomes in the population is 2*Ne), butsample_size
is the number of (monoploid) genomes sampled.Parameters:  sample_size (int) – The number of sampled monoploid genomes. If not
specified or None, this defaults to the sum of the subpopulation sample
sizes. Either
sample_size
,population_configurations
orsamples
must be specified.  Ne (float) – The effective (diploid) population size for the reference population. This defaults to 1 if not specified. Please see the Simulation models section for more details on specifying simulations models.
 length (float) – The length of the simulated region in bases.
This parameter cannot be used along with
recombination_map
. Defaults to 1 if not specified.  recombination_rate (float) – The rate of recombination per base
per generation. This parameter cannot be used along with
recombination_map
. Defaults to 0 if not specified.  recombination_map (
RecombinationMap
) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with therecombination_rate
orlength
parameters, as these values are encoded within the map. Defaults to a uniform rate as described in therecombination_rate
parameter if not specified.  mutation_rate (float) – The rate of infinite sites
mutations per unit of sequence length per generation.
If not specified, no mutations are generated. This option only
allows for infinite sites mutations with a binary (i.e., 0/1)
alphabet. For more control over the mutational process, please
use the
mutate()
function.  population_configurations (list or None.) – The list of
PopulationConfiguration
instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of sizesample_size
is assumed.  migration_matrix (list) – The matrix describing the rates
of migration between all pairs of populations. If \(N\)
populations are defined in the
population_configurations
parameter, then the migration matrix must be an \(N \times N\) matrix with 0 on the diagonal, consisting of \(N\) lists of length \(N\) or an \(N \times N\) numpy array, with the [j, k]th element giving the fraction of population j that consists of migrants from population k in each generation.  demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in nondecreasing order of time in the past. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.
 samples (list) – The list specifying the location and time of
all samples. This parameter may be used to specify historical
samples, and cannot be used in conjunction with the
sample_size
parameter. Each sample is a (population
,time
) pair such that the sample in positionj
in the list of samples is drawn in the specified population at the specfied time. Time is measured in generations ago, as elsewhere.  random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\).
 num_replicates (int) – The number of replicates of the specified
parameters to simulate. If this is not specified or None,
no replication is performed and a
tskit.TreeSequence
object returned. Ifnum_replicates
is provided, the specified number of replicates is performed, and an iterator over the resultingtskit.TreeSequence
objects returned.  replicate_index (int) – Return only a specific tree sequence from the set of replicates. This is used to recreate a specific tree sequence from e.g. provenance. This argument only makes sense when used with random seed, and is not compatible with num_replicates. Note also that msprime will have to create and discard all the tree sequences up to this index.
 from_ts (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see here for details on the required properties of this tree sequence and its interactions with other parameters. (Default: None).
 start_time (float) – If specified, set the initial time that the
simulation starts to this value. If not specified, the start
time is zero if performing a simulation of a set of samples,
or is the time of the oldest node if simulating from an
existing tree sequence (see the
from_ts
parameter).  end_time (float) – If specified, terminate the simulation at the
specified time. In the returned tree sequence, all rootward paths from
samples with time < end_time will end in a node with one child with
time equal to end_time. Sample nodes with time >= end_time will
also be present in the output tree sequence. If not specified or
None
, run the simulation until all samples have an MRCA at all positions in the genome.  record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False.
 model (str or simulation model instance) – The simulation model to use.
This can either be a string (e.g.,
"smc_prime"
) or an instance of a simulation model class (e.g,msprime.DiscreteTimeWrightFisher(100)
. Please see the Simulation models section for more details on specifying simulations models.  record_provenance (bool) – If True, record all configuration and parameters
required to recreate the tree sequence. These can be accessed
via
TreeSequence.provenances()
).
Returns: The
tskit.TreeSequence
object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if thenum_replicates
parameter has been used.Return type: tskit.TreeSequence
or an iterator overtskit.TreeSequence
replicates.Warning: If using replication, do not store the results of the iterator in a list! For performance reasons, the same underlying object may be used for every TreeSequence returned which will most likely lead to unexpected behaviour.
 sample_size (int) – The number of sampled monoploid genomes. If not
specified or None, this defaults to the sum of the subpopulation sample
sizes. Either
Population structure¶
Population structure is modelled in msprime
by specifying a fixed number of
subpopulations, with the migration rates between those subpopulations defined by a migration
matrix. Each subpopulation has an initial_size
that defines its absolute diploid size at
time zero and a pergeneration growth_rate
which specifies the exponential
growth rate of the subpopulation. We must also define the number of genomes to
sample from each subpopulation. The number of populations and their initial
configuration is defined using the population_configurations
parameter to
simulate()
, which takes a list of PopulationConfiguration
instances. Population IDs are zero indexed, and correspond to their position in
the list.
Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 6, and specify that 2 samples are drawn from population 0 and 4 from population 1, then samples 0 and 1 will be initially located in population 0, and samples 2, 3, 4, and 5 will be drawn from population 2.
Given \(N\) populations, migration matrices are specified using an \(N
\times N\) matrix of betweensubpopulation migration rates. See the
documentation for simulate()
and the Simulation model section for
more details on the migration rates.

class
msprime.
PopulationConfiguration
(sample_size=None, initial_size=None, growth_rate=0.0, metadata=None)[source]¶ The initial configuration of a population (or deme) in a simulation.
Parameters:  sample_size (int) – The number of initial samples that are drawn from this population.
 initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).
 growth_rate (float) – The forwardstime exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. Defaults to 0.
 metadata (dict) – A JSONencodable dictionary of metadata to associate
with the corresponding Population in the output tree sequence.
If not specified or None, no metadata is stored (i.e., an empty bytes array).
Note that this metadata is ignored when using the
from_ts
argument tosimulate()
, as the population definitions in the tree sequence that is used as the starting point take precedence.
Demographic Events¶
Demographic events change some aspect of the population configuration
at some time in the past, and are specified using the demographic_events
parameter to simulate()
. Each element of this list must be an
instance of one of the following demographic events
that are currently supported. Note that all times are measured in
generations, all sizes are absolute (i.e., not relative to \(N_e\)),
and all rates are pergeneration.
Model change events are also considered to be demographic events; see the Simulation model changes section for details.

class
msprime.
PopulationParametersChange
(time, initial_size=None, growth_rate=None, population=None, population_id=None)[source]¶ Changes the demographic parameters of a population at a given time.
This event generalises the
eg
,eG
,en
andeN
options fromms
. Note that unlikems
we do not automatically set growth rates to zero when the population size is changed.Parameters:  time (float) – The length of time ago at which this event occurred.
 initial_size (float) – The absolute diploid size of the population
at the beginning of the time slice starting at
time
. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.  growth_rate (float) – The new pergeneration growth rate. If None, the growth rate is not changed. Defaults to None.
 population (int) – The ID of the population affected. If
population
is None, the changes affect all populations simultaneously.

class
msprime.
MigrationRateChange
(time, rate, source=1, dest=1, matrix_index=None)[source]¶ Changes the rate of migration from one deme to another to a new value at a specific time. Migration rates are specified in terms of the rate at which lineages move from population
source
todest
during the progress of the simulation. Note thatsource
anddest
are from the perspective of the coalescent process; please see the Simulation model section for more details on the interpretation of this migration model.By default,
source=1
anddest=1
, which results in all nondiagonal elements of the migration matrix being changed to the new rate. Ifsource
anddest
are specified, they must refer to valid population IDs.Parameters:

class
msprime.
MassMigration
(time, source, dest=None, proportion=1.0, destination=None)[source]¶ A mass migration event in which some fraction of the population in one deme (the
source
) simultaneously move to another deme (dest
) during the progress of the simulation. Each lineage currently present in the source population moves to the destination population with probability equal toproportion
. Note thatsource
anddest
are from the perspective of the coalescent process; please see the Simulation model section for more details on the interpretation of this migration model.This event class generalises the population split (
ej
) and admixture (es
) events fromms
. Note that MassMigrations do not have any side effects on the migration matrix.Parameters:

class
msprime.
CensusEvent
(time)[source]¶ An event that adds a node to each branch of every tree at a given time during the simulation. This may be used to record all ancestral haplotypes present at that time, and to extract other information related to these haplotypes: for instance to trace the local ancestry of a sample back to a set of contemporaneous ancestors, or to assess whether a subset of samples has coalesced more recently than the census time. See the tutorial for an example.
Parameters: time (float) – The time at which this event occurs in generations.
Parsing species trees¶
Species trees hold information about the sequence and the times at which species
diverged from each other. Viewed backwards in time, divergence events are equivalent
to mass migration events in which all lineages from one population move to another
population. The history of a set of populations can thus be modelled according to
a given species tree. To faciliate the specification of the model,
parse_species_tree()
parses a species tree and returns the mass migration
events corresponding to all species divergence events in the tree, together with
population configurations that specify population sizes and names.
When species trees are estimated with a program like StarBEAST they can further
contain estimates on the population sizes of extant and ancestral species.
parse_starbeast()
parses species trees estimated with StarBEAST and uses
these estimates to define the population configurations.
Note that when the species tree has branch lengths not in units of generations but in units of years or millions of years (which is common), a generation time in years is required for parsing.

msprime.
parse_species_tree
()[source]¶ This function parses a species tree in Newick format and defines a simulation model according to the species tree. The tree is assumed to be rooted and ultrametric and branch lengths must be included and correspond to time, either in units of millions of years, years, or generations. Leaves must be named.
After reading the input tree, this function defines a
PopulationConfiguration
instance for each terminal node in the tree, corresponding to extant species. These population configurations store the species’ name and population size. The specified Ne is used as the size of all populations. Additionally, one or moreMassMigration
instances are defined for each internal node, with the time of the mass migration set according to the age of the node in the species tree. \(n  1\) mass migration events are defined for internal nodes with n descendants, meaning that a single event is defined for bifurcating nodes. For each internal node, the leftmost of the descendant populations is arbitrarily selected as the destination in all mass migrations defined for that node.Parameters:  tree (str) – The tree string in Newick format, with named leaves and branch lengths.
 Ne (float) – The effective population size.
 branch_length_units (str) – The units of time in which the species tree’s
branch lengths are measured. Allowed branch length units are millions of
years, years, and generations; these should be specified with the strings
"myr"
,"yr"
, or"gen"
, respectively. This defaults to"gen"
.  generation_time (float or None) – The number of years per generation. If and only if the branch lengths are not in units of generations, the generation time must be specified. This defaults to None.
Returns: A tuple of two lists of which the first contains
PopulationConfiguration
instances and the second containsMassMigration
instances. The population configurations specify the size of each population and the species name corresponding to each population. Species names are stored as metadata in eachPopulationConfiguration
instance, with the metadata tag “species_name”. Sampling configurations and growth rates are not specified in the population configurations. The list of population configurations is ordered according to the order of the corresponding extant species in a postorder tree traversal. The list of mass migration events is ordered by the time of the events, from young to old events.Return type: Warning: This function does not modify migration matrices. When the population configurations and mass migration events returned by this function are used to simulate with the
simulate()
function, it should be ensured that migration rates to source populations of mass migration events are zero after the mass migration (viewed backwards in time).

msprime.
parse_starbeast
()[source]¶ This function parses a species tree produced by the program TreeAnnotator based on a posterior tree distribution generated with StarBEAST and defines a simulation model according to the species tree. Species trees produced by TreeAnnotator are written in Nexus format and are rooted, bifurcating, and ultrametric. Branch lengths usually are in units of millions of years, but the use of other units is permitted by StarBEAST (and thus TreeAnnotator). This function allows branch length units of millions of years or years. Leaves must be named and the tree must include information on population sizes of leaf and ancestral species in the form of annotation with the “dmv” tag, which is the case for trees written by TreeAnnotator based on StarBEAST posterior tree distributions.
After reading the input tree, this function defines a
PopulationConfiguration
instance for each terminal node in the tree, corresponding to extant species. These population configurations store the species’ name and population size, both according to information from the input tree. Additionally, aMassMigration
instance is defined for each internal node, with the time of the mass migration set according to the age of the node in the species tree. For each internal node, the left one of the two descendant populations is arbitrarily selected as the destination in the mass migration defined for that node. APopulationParametersChange
instance is also added for each internal node to adjust the population size of the destination population according to the information given in the tree for the population size of the species that is ancestral to the node. Like the mass migration event defined for the same node, the time of the population parameter change is also set according to the age of the node.Parameters:  tree (str) – The tree string in Nexus format, with named leaves, branch lengths, and branch annotation. Typically, this string is the entire content of a file written by TreeAnnotator.
 generation_time (float) – The number of years per generation.
 branch_length_units (str) – The units of time in which the species tree’s
branch lengths are measured. Allowed branch length units are millions of
years, and years; these should be specified with the strings
"myr"
or"yr"
, respectively. This defaults to"myr"
.
Returns: A tuple of two lists of which the first contains
PopulationConfiguration
instances and the second containsMassMigration
andPopulationParametersChange
instances. The population configurations specify the size of each population according to the information from the input species tree and the species name corresponding to each population. Species names are stored as metadata in eachPopulationConfiguration
instance, with the metadata tag “species_name”. Sampling configurations and growth rates are not specified in the population configurations. The list of population configurations is ordered according to the order of the corresponding extant species in a postorder tree traversal. The list of mass migration events and population parameter changes is ordered by the time of the events, from young to old events.Return type: Warning: This function does not modify migration matrices. When the population configurations and mass migration events returned by this function are used to simulate with the
simulate()
function, it should be ensured that migration rates to source populations of mass migration events are zero after the mass migration (viewed backwards in time).
Debugging demographic models¶
Warning
The DemographyDebugger
class is preliminary, and the API
is likely to change in the future.

class
msprime.
DemographyDebugger
(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=None, model='hudson')[source]¶ A class to facilitate debugging of population parameters and migration rates in the past.

coalescence_rate_trajectory
(steps, num_samples, min_pop_size=1, double_step_validation=True)[source]¶ This function will calculate the mean coalescence rates and proportions of uncoalesced lineages between the lineages of the sample configuration provided in num_samples, at each of the times ago listed by steps, in this demographic model. The coalescence rate at time t in the past is the average rate of coalescence of asyetuncoalesed lineages, computed as follows: let \(p(t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\), let \(p(z,t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\) and are both in population \(z\), and let \(N(z,t)\) be the diploid effective population size of population \(z\) at time \(t\). Then the mean coalescence rate at time \(t\) is \(r(t) = (\sum_z p(z,t) / (2 * N(z,t)) / p(t)\).
The computation is done by approximating population size trajectories with piecewise constant trajectories between each of the steps. For this to be accurate, the distance between the steps must be small enough so that (a) short epochs (e.g., bottlenecks) are not missed, and (b) populations do not change in size too much over that time, if they are growing or shrinking. This function optionally provides a simple check of this approximation by recomputing the coalescence rates on a grid of steps twice as fine and throwing a warning if the resulting values do not match to a relative tolerance of 0.001.
Parameters:  steps (list) – The times ago at which coalescence rates will be computed.
 num_samples (list) – A list of the same length as the number of populations, so that num_samples[j] is the number of sampled chromosomes in subpopulation j.
 min_pop_size (int) – The smallest allowed population size during computation of coalescent rates (i.e., coalescence rates are actually 1 / (2 * max(min_pop_size, N(z,t))). Spurious very small population sizes can occur in models where populations grow exponentially but are unused before some time in the past, and lead to floating point error. This should be set to a value smaller than the smallest desired population size in the model.
 double_step_validation (bool) – Whether to perform the check that step sizes are sufficiently small, as described above. This is highly recommended, and will take at most four times the computation.
Returns: A tuple of arrays whose jth elements, respectively, are the coalescence rate at the jth time point (denoted r(t[j]) above), and the probablility that a randomly chosen pair of lineages has not yet coalesced (denoted p(t[j]) above).
Return type: (numpy.array, numpy.array)

epoch_times
¶ Returns array of epoch times defined by the demographic model

lineage_probabilities
(steps, sample_time=0)[source]¶ Returns an array such that P[j, a, b] is the probability that a lineage that started in population a at time sample_time is in population b at time steps[j] ago.
This function reports sampling probabilities _before_ mass migration events at a step time, if a mass migration event occurs at one of those times. Migrations will then effect the next time step.
Parameters:  steps (list) – A list of times to compute probabilities.
 sample_time – The time of sampling of the lineage. For any times in steps that are more recent than sample_time, the probability of finding the lineage in any population is zero.
Returns: An array of dimension len(steps) by num pops by num_pops.

mean_coalescence_time
(num_samples, min_pop_size=1, steps=None, rtol=0.005, max_iter=12)[source]¶ Compute the mean time until coalescence between lineages of two samples drawn from the sample configuration specified in num_samples. This is done using
coalescence_rate_trajectory
to compute the probability that the lineages have not yet coalesced by time t, and using these to approximate \(E[T] = \int_t^\infty P(T > t) dt\), where \(T\) is the coalescence time. Seecoalescence_rate_trajectory
for more details.To compute this, an adequate time discretization must be arrived at by iteratively extending or refining the current discretization. Debugging information about numerical convergence of this procedure is logged using the Python
logging
infrastructure. To make it appear, using thedaiquiri
module, do for instance:import daiquiri daiquiri.setup(level="DEBUG") debugger.mean_coalescence_time([2])
will print this debugging information to stderr. Briefly, this outputs iteration number, mean coalescence time, maximum difference in probabilty of not having coalesced yet, difference to last coalescence time, probability of not having coalesced by the final time point, and whether the last iteration was an extension or refinement.
Parameters:  num_samples (list) – A list of the same length as the number of populations, so that num_samples[j] is the number of sampled chromosomes in subpopulation j.
 min_pop_size (int) – See
coalescence_rate_trajectory
.  steps (list) – The time discretization to start out with (by default, picks something based on epoch times).
 rtol (float) – The relative tolerance to determine mean coalescence time to (used to decide when to stop subdividing the steps).
 max_iter (int) – The maximum number of times to subdivide the steps.
Returns: The mean coalescence time (a number).
Return type:

num_epochs
¶ Returns the number of epochs defined by the demographic model.

population_size_history
¶ Returns a (num_pops, num_epochs) numpy array giving the starting population size for each population in each epoch.

population_size_trajectory
(steps)[source]¶ This function returns an array of perpopulation effective population sizes, as defined by the demographic model. These are the initial_size parameters of the model, modified by any population growth rates. The sizes are computed at the time points given by steps.
Parameters: steps (list) – List of times ago at which the population size will be computed. Returns: Returns a numpy array of population sizes, with one column per population, whose [i,j]th entry is the size of population j at time steps[i] ago.

possible_lineage_locations
(samples=None)[source]¶ Given the sampling configuration, this function determines when lineages are possibly found within each population over epochs defined by demographic events and sampling times. If no sampling configuration is given, we assume we sample lineages from every population at time zero. The samples are specified by a list of msprime Sample objects, so that possible ancient samples may be accounted for.
Parameters: samples (list) – A list of msprime Sample objects, which specify their populations and times. Returns: Returns a dictionary with epoch intervals as keys whose values are a list with length equal to the number of populations with True and False indicating which populations could possibly contain lineages over that epoch. The epoch intervals are given by tuples: (epoch start, epoch end). The first epoch necessarily starts at time 0, and the final epoch has end time of infinity.

Variable recombination rates¶

class
msprime.
RecombinationMap
(positions, rates, num_loci=None, discrete=False, map_start=0)[source]¶ A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers:
positions
andrates
, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation isrates[j]
over the intervalpositions[j]
topositions[j + 1]
. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).Warning
The
num_loci
parameter is deprecated. To set a discrete number of possible recombination sites along the sequence, scalepositions
to the desired number of sites and setdiscrete=True
to ensure recombination occurs only at integer values.Parameters:  positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.
 rates (list) – The list of rates corresponding to the supplied
positions
. Recombination rates are specified per base, per generation.  num_loci (int) – This parameter is deprecated. The maximum number of nonrecombining loci in the underlying simulation. By default this is set to the largest possible value, allowing the maximum resolution in the recombination process. However, for a finite sites model this can be set to smaller values.
 discrete (bool) – Whether recombination can occur only at integer
positions. When
False
, recombination sites can take continuous values. To simulate a fixed number of loci, set this parameter toTrue
and scalepositions
to span the desired number of loci.

mean_recombination_rate
¶ Return the weighted mean recombination rate across all windows of the entire recombination map.

classmethod
read_hapmap
(filename)[source]¶ Parses the specified file in HapMap format. These files must be whitespacedelimited, and contain a single header line (which is ignored), and then each subsequent line contains the starting position and recombination rate for the segment from that position (inclusive) to the starting position on the next line (exclusive). Starting positions of each segment are given in units of bases, and recombination rates in centimorgans/Megabase. The first column in this file is ignored, as are additional columns after the third (Position is assumed to be the second column, and Rate is assumed to be the third). If the first starting position is not equal to zero, then a zerorecombination region is inserted at the start of the chromosome.
A sample of this format is as follows:
Chromosome Position(bp) Rate(cM/Mb) Map(cM) chr1 55550 2.981822 0.000000 chr1 82571 2.082414 0.080572 chr1 88169 2.081358 0.092229 chr1 254996 3.354927 0.439456 chr1 564598 2.887498 1.478148 ... chr1 182973428 2.512769 122.832331 chr1 183630013 0.000000 124.482178
Parameters: filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.

slice
(start=None, end=None, trim=False)[source]¶ Returns a subset of this recombination map between the specified end points. If start is None, it defaults to 0. If end is None, it defaults to the end of the map. If trim is True, remove the flanking zero recombination rate regions such that the sequence length of the new recombination map is end  start.

classmethod
uniform_map
(length, rate, num_loci=None, discrete=False)[source]¶ Returns a
RecombinationMap
instance in which the recombination rate is constant over a chromosome of the specified length. The optionaldiscrete
controls whether recombination sites can occur only on integer positions or can take continuous values. The legacynum_loci
option is no longer supported and should not be used.The following map can be used to simulate a true finite locus model with a fixed number of loci
m
:>>> recomb_map = RecombinationMap.uniform_map(m, rate, discrete=True)
Parameters:  length (float) – The length of the chromosome.
 rate (float) – The rate of recombination per unit of sequence length along this chromosome.
 num_loci (int) – This parameter is no longer supported.
 discrete (bool) – Whether recombination can occur only at integer
positions. When
False
, recombination sites can take continuous values. To simulate a fixed number of loci, set this parameter toTrue
and setlength
to the desired number of loci.
Simulation models¶
The default simulation model in msprime
is the standard coalescent with recombination
model. We also support a number of different models, which are documented in this section.
Simulations models are specified using the model
parameter to
simulate()
. This parameter can either take the form of a
string describing the model (e.g. model="dtwf"
) or an instance of a
model definition class (e.g model=msprime.DiscreteTimeWrightFisher(1000)
).
The available models are documented in the following subsections.
A key element of simulation models in msprime
is the concept
of a “reference population size”. The Ne
argument to
simulate()
can also be used to define this parameter
when combined with a string shorthand for a model.
msprime.simulate(10, Ne=1000, model="dtwf")
and
msprime.simulate(10, model=msprime.DiscreteTimeWrightFisher(1000))
define the same simulation.
Note
When Ne
and an instance of SimulationModel
with
reference_size
set are both provided as arguments to simulate()
,
the Ne
parameter is ignored.
Todo
Add a discussion of population sizes here, describing what Ne/model.reference_size really means, and how it interacts with the individual population sizes.
We are often interested in simulating mixtures of models: for example,
using the DiscreteTimeWrightFisher
model to simulate the
recent past and then using the standard coalescent to complete the
simulation of the ancient past. This can be achieved using the
SimulationModelChange
event. See the
Hybrid simulations for an example of this
approach.
Coalescent and approximations¶

class
msprime.
StandardCoalescent
(reference_size=None)[source]¶ The classical coalescent with recombination model (i.e., Hudson’s algorithm). The string
"hudson"
can be used to refer to this model.This is the default simulation model.

class
msprime.
SmcApproxCoalescent
(reference_size=None)[source]¶ The original SMC model defined by McVean and Cardin. This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.
The string
"smc"
can be used to refer to this model.

class
msprime.
SmcPrimeApproxCoalescent
(reference_size=None)[source]¶ The SMC’ model defined by Marjoram and Wall as an improvement on the original SMC. model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.
The string
"smc_prime"
can be used to refer to this model.
Multiple merger coalescents¶
Some evolutionary scenarios, such as a skewed offspring distribution combined with a type III survivorship curve, range expansion, and rapid adaptation, can predict diploid genealogies with up to four simultaneous multiple mergers. Msprime provides the option to simulate from two classes of such genealogical processes.
See Multiple merger coalescents for examples of how to use these multiple merger coalescents.

class
msprime.
BetaCoalescent
(reference_size=None, alpha=None, truncation_point=1)[source]¶ A diploid Xicoalescent with up to four simultaneous multiple mergers and crossover recombination.
There are two main differences between the BetaXicoalescent and the standard coalescent. Firstly, the number of lineages that take part in each common ancestor event is random, with distribution determined by moments of the \(Beta(2  \alpha, \alpha)\)distribution. In particular, when there are \(n\) lineages, each set of \(k \leq n\) of them participates in a common ancestor event at rate
\[\frac{8}{B(2  \alpha, \alpha)} \int_0^1 x^{k  \alpha  1} (1  x)^{n  k + \alpha  1} dx,\]where \(B(2  \alpha, \alpha)\) is the Betafunction.
In a common ancestor event, all participating lineages are randomly split into four groups, corresponding to the four parental chromosomes in a diploid, biparental reproduction event. All lineages within each group merge simultaneously.
Warning
The prefactor of 8 in the common ancestor event rate arises as the product of two terms. A factor of 4 compensates for the fact that only one quarter of binary common ancestor events result in a merger due to diploidy. A further factor of 2 is included for consistency with the implementation of the Hudson model, in which \(n\) lineages undergo binary mergers at rate \(n (n  1)\).
Secondly, the time scale predicted by the BetaXicoalescent is proportional to \(N_e^{\alpha  1}\) generations, where \(N_e\) is the effective population size. Specifically, one unit of coalescent time corresponds to a number of generations given by
\[\frac{m^{\alpha} N_e^{\alpha  1}}{\alpha B(2  \alpha, \alpha)},\]where
\[m = 2 + \frac{2^{\alpha}}{3^{\alpha  1} (\alpha  1)}.\]Note that the time scale depends both on the effective population size \(N_e\) and \(\alpha\), and can be dramatically shorter than the timescale of the standard coalescent. Thus, effective population sizes must often be many orders of magnitude larger than census population sizes. The pergeneration recombination rate is rescaled similarly to obtain the populationrescaled recombination rate.
See Schweinsberg (2003) for the derivation of the common ancestor event rate, as well as the time scaling. Note however that the model of Schweinsberg (2003) is haploid, so that all participating lineages merge in a common ancestor event without splitting into four groups.
Parameters:  alpha (float) – Determines the degree of skewness in the family size distribution, and must satisfy \(1 < \alpha < 2\). Smaller values of \(\alpha\) correspond to greater skewness, and \(\alpha = 2\) would coincide with the standard coalescent.
 truncation_point (float) – Determines the maximum fraction of the population replaced by offspring in one reproduction event, and must satisfy \(0 < \tau \leq 1\), where \(\tau\) is the truncation point. The default is \(\tau = 1\), which corresponds to the standard BetaXicoalescent. When \(\tau < 1\), the number of lineages participating in a common ancestor event is determined by moments of the \(Beta(2  \alpha, \alpha)\) distribution conditioned on not exceeding \(\tau\), and the Betafunction in the expression for the time scale is also replaced by the incomplete Beta function \(Beta(\tau; 2  \alpha, \alpha)\).

class
msprime.
DiracCoalescent
(reference_size=None, psi=None, c=None)[source]¶ A diploid Xicoalescent with up to four simultaneous multiple mergers and crossover recombination.
The DiracXicoalescent is an implementation of the model of Blath et al. (2013) The simulation proceeds similarly to the standard coalescent. In addition to binary common ancestor events at rate \(n (n  1)\) when there are \(n\) lineages, potential multiple merger events take place at rate \(2 c > 0\). Each lineage participates in each multiple merger event independently with probability \(0 < \psi \leq 1\). All participating lineages are randomly split into four groups, corresponding to the four parental chromosomes present in a diploid, biparental reproduction event, and the lineages within each group merge simultaneously.
Warning
The DiracXicoalescent is obtained as a scaling limit of Moran models, rather than WrightFisher models. As a consequence, one unit of coalescent time is proportional to \(N_e^2\) generations, rather than \(N_e\) generations as in the standard coalescent. However, the coalescent recombination rate is obtained from the pergeneration recombination probability by rescaling with \(N_e\). See Multiple merger coalescents for an illustration of how this affects simulation output in practice.
Parameters:  c (float) – Determines the rate of potential multiple merger events. We require \(c > 0\).
 psi (float) – Determines the fraction of the population replaced by offspring in one large reproduction event, i.e. one reproduction event giving rise to potential multiple mergers when viewed backwards in time. We require \(0 < \psi \leq 1\).
Discrete time WrightFisher¶
Msprime provides the option to perform discretetime WrightFisher simulations for scenarios when the coalescent model is not appropriate, including large sample sizes, multiple chromosomes, or recent migration.
To use this option, set the flag model="dtwf"
as in the following example:
>>> tree_sequence = msprime.simulate(
... sample_size=6, Ne=1000, length=1e4, recombination_rate=2e8,
... model="dtwf")
All other parameters can be set as usual.

class
msprime.
DiscreteTimeWrightFisher
(reference_size=None)[source]¶ A discrete backwardstime WrightFisher model, with diploid backandforth recombination. The string
"dtwf"
can be used to refer to this model.WrightFisher simulations are performed very similarly to coalescent simulations, with all parameters denoting the same quantities in both models. Because events occur at discrete times however, the order in which they occur matters. Each generation consists of the following ordered events:
 Migration events. As in the Hudson coalescent, these move single extant lineages between populations. Because migration events occur before lineages choose parents, migrant lineages choose parents from their new population in the same generation.
 Demographic events. All events with previous_generation < event_time <= current_generation are carried out here.
 Lineages draw parents. Each (monoploid) extant lineage draws a parent from their current population.
 Diploid recombination. Each parent is diploid, so all child lineages recombine backandforth into the same two parental genome copies. These become two independent lineages in the next generation.
 Historical sampling events. All historical samples with previous_generation < sample_time <= current_generation are inserted.
Selective sweeps¶
Todo
Document the selective sweep models.
Simulation model changes¶
Todo
Describe the subtleties of how simulation model changes work along with how times are computed.

class
msprime.
SimulationModelChange
(time=None, model=None)[source]¶ An event representing a change of underlying simulation model.
Parameters:  time (float) – The time at which the simulation model changes to the new model, in generations. After this time, all internal tree nodes, edges and migrations are the result of the new model. If time is set to None (the default), the model change will occur immediately after the previous model has completed. If time is a callable, the time at which the simulation model changes is the result of calling this function with the time that the previous model started with as a parameter.
 model (str or simulation model instance) – The new simulation model to use.
This can either be a string (e.g.,
"smc_prime"
) or an instance of a simulation model class (e.g,msprime.DiscreteTimeWrightFisher(100)
. Please see the Simulation models section for more details on specifying simulations models. If the argument is a string, the reference population size is set from the top levelNe
parameter tosimulate()
. If this is None (the default) the model is changed to the standard coalescent with a reference_size of Ne (if model was not specified).
Initialising simulations from a tree sequence¶
By default msprime
simulations are initialised by specifying a set of samples,
using the sample_size
or samples
parameters to simulate()
. This
initialises the simulation with segments of ancestral material covering the
whole sequence. Simulation then proceeds backwards in time until a most recent
common ancestor has been found at all points along this sequence. We can
also start simulations from different initial conditions by using the
from_ts
argument to simulate()
. Informally, we take an ‘unfinished’
tree sequence as a parameter to simulate, initialise the simulation
from the state of this tree sequence and then run the simulation until
coalescence. The returned tree sequence is then the result of taking the
input tree sequence and completing the trees using the coalescent.
This is useful for forwardstime simulators such as
SLiM that can output tree sequences. By running
forwardtime simulation for a certain number of generations we obtain a
tree sequence, but these trees may not have had sufficient time to
reach a most recent common ancestor. By using the from_ts
argument
to simulate()
we can combine the best of both forwards and
backwardstime simulators. The recent past can be simulated forwards
in time and the ancient past by the coalescent. The coalescent
simulation is initialised by the root segments of the
input tree sequence, ensuring that the minimal amount of ancestral
material possible is simulated.
Please see the tutorial for an example of how to use this feature with a simple forwardstime WrightFisher simulator
Input requirements¶
Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forwardtime simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes. (Please see the tutorial for further explanation of this point and an example.)
Recombination map limitations¶
Because of the way that msprime
handles recombination internally, care must
be taken when specifying recombination when using the from_ts
argument.
If recombination positions are generated in the same way in both the initial
tree sequence and the coalescent simulation, then everything should work.
However, the fine scale details of the underlying recombination model matter,
so matching nonuniform recombination maps between simulators may not be
possible at present. (To make it work, we must ensure that every recombination
breakpoint in from_ts
matches exactly to a possible recombination
breakpoint in msprime’s recombination map, which is not guaranteed because of
msprime’s discrete recombination model.)
One case in which it is guaranteed to work is if from_ts
has integer
coordinates, and we want to simulate a coalescent with a uniform recombination
rate. In this case, to have a uniform recombination rate r
use:
L = int(from_ts.sequence_length)
recomb_map = msprime.RecombinationMap.uniform_map(L, r, L)
final_ts = mpsrime.simulate(from_ts=from_ts, recomb_map=recomb_map)
Node flags¶
For standard coalescent simulations, all samples are marked with the
tskit.NODE_IS_SAMPLE
flag; internal nodes all have a flags value of 0.
When using the record_full_arg
argument to simulate()
, the following
flags values are defined:

msprime.
NODE_IS_RE_EVENT
¶ The node is an ARG recombination event. Each recombination event is marked with two nodes, one identifying the individual providing the genetic material to the left of the breakpoint and the other providing the genetic material the right.

msprime.
NODE_IS_CA_EVENT
¶ The node is an ARG common ancestor event that did not result in marginal coalescence.

msprime.
NODE_IS_MIG_EVENT
¶ The node is an ARG migration event identifying the individual that migrated. Can be used in combination with the
record_migrations
argument tosimulate()
.

msprime.
NODE_IS_CEN_EVENT
¶ The node was created by a
msprime.CensusEvent
.
Simulating mutations¶
When running coalescent simulations it’s usually most convenient to use the
mutation_rate
argument to the simulate()
function to throw neutral
mutations down on the trees. However, sometimes we wish to throw mutations
down on an existing tree sequence: for example, if we want to see the outcome
of different random mutational processes on top of a single simulated topology,
or if we have obtained the tree sequence from another program and wish to
overlay neutral mutations on this tree sequence.

msprime.
InfiniteSites
¶

msprime.BINARY == 0
The binary mutation alphabet where ancestral states are always “0” and derived states “1”.

msprime.NUCLEOTIDES == 1
The nucleotides mutation alphabet in which ancestral and derived states are chosen from the characters “A”, “C”, “G” and “T”.

msprime.
mutate
(tree_sequence, rate=None, random_seed=None, model=None, keep=False, start_time=None, end_time=None, discrete=False)[source]¶ Simulates mutations on the specified ancestry and returns the resulting
tskit.TreeSequence
. Mutations are generated at the specified rate in measured generations. Mutations are generated under the infinite sites model, and so the rate of new mutations is per unit of sequence length per generation.If a random seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.
If the
model
parameter is specified, this determines the model under which mutations are generated. Currently only theInfiniteSites
mutation model is supported. This parameter is useful if you wish to obtain sequences with letters from the nucleotide alphabet rather than the default 0/1 states. By default mutations from the infinite sites model with a binary alphabet are generated.By default, sites and mutations in the parameter tree sequence are discarded. If the
keep
parameter is true, however, additional mutations are simulated. Under the infinite sites mutation model, all new mutations generated will occur at distinct positions from each other and from any existing mutations (by rejection sampling).The time interval over which mutations can occur may be controlled using the
start_time
andend_time
parameters. Thestart_time
defines the lower bound (in timeago) on this interval andmax_time
the upper bound. Note that we may have mutations associated with nodes with time <=start_time
since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.Parameters:  tree_sequence (tskit.TreeSequence) – The tree sequence onto which we wish to throw mutations.
 rate (float) – The rate of mutation per generation, as either a
single number (for a uniform rate) or as a
MutationMap
. (Default: 0).  random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\).
 model (MutationModel) – The mutation model to use when generating
mutations. If not specified or None, the
BinaryMutations
mutation model is used.  keep (bool) – Whether to keep existing mutations (default: False).
 start_time (float) – The minimum time ago at which a mutation can occur. (Default: no restriction.)
 end_time (float) – The maximum time ago at which a mutation can occur (Default: no restriction).
 discrete (bool) – Whether to generate mutations at only integer positions along the genome. Default is False, which produces infinitesites mutations at floatingpoint positions.
Returns: The
tskit.TreeSequence
object resulting from overlaying mutations on the input tree sequence.Return type:
Evaluating sampling probabilities¶
msprime
provides the capability to evaluate the sampling probabilities:
that of a stored tree sequence for a given diploid effective population size
\(N_e\) and perlink, pergeneration recombination probability \(r\)
under the standard ancestral recombination graph; and that of a pattern of
mutations given a tree sequence and persite, pergeneration mutation
probability \(\mu\) under the infinite sites model.

msprime.
log_arg_likelihood
(ts, recombination_rate, Ne=1)[source]¶ Returns the log probability of the stored tree sequence under the Hudson ARG. An exact expression for this probability is given in equation (1) of Kuhner et al. (2000).
We assume branch lengths stored in generations, resulting in a coalescence rate of \(1 / (2 N_e)\) per pair of lineages.
Warning
The stored tree sequence must store the full realisation of the ARG, including all recombination events and all common ancestor events, regardless of whether the recombinations cause a change in the ancestral tree or whether the common ancestor events cause coalescence of ancestral material. See Recording the full ARG for details of this data structure, and how to generate them using
msprime
.Parameters:  ts (tskit.TreeSequence) – The tree sequence object.
 recombination_rate (float) – The perlink, pergeneration recombination probability. Must be nonnegative.
 Ne (float) – The diploid effective population size.
Returns: The log probability of the tree sequence under the Hudson ancestral recombination graph model. If the recombination rate is zero and the tree sequence contains at least one recombination event, then returns float(“inf”).

msprime.
unnormalised_log_mutation_likelihood
(ts, mu)[source]¶ Returns the unnormalised log probability of the stored pattern of mutations on the stored tree sequence, assuming infinite sites mutation. In particular, each stored site must only contain a single mutation. The omitted normalising constant depends on the pattern of mutations, but not on the tree sequence or the mutation rate.
The function first computes the probability of the overall number of mutations \(M\) from the Poisson probability mass function
\[e^{T \mu / 2} \frac{(T \mu / 2)^M}{M!},\]where \(T\) is the total area of ancestral material in the tree sequence, stored in units of generations. Each mutation then contributes an individual factor of \(l / T\), where \(l\) is the total branch length on which the mutation could have arisen while appearing on all of the required lineages, again stored in generations.
Warning
If the tree sequence contains unary nodes, then \(l\) could span more than one edge. In particular, we do not constrain mutations to take place on the edge directly above the node on which they have been recorded, but rather on any edge which would yield the same configuration of SNPs at the leaves of the tree sequence.
Parameters:  ts (tskit.TreeSequence) – The tree sequence object with mutations.
 mu (float) – The persite, pergeneration mutation probablity. Must be nonnegative.
Returns: The unnormalised log probability of the observed SNPs given the tree sequence. If the mutation rate is set to zero and the tree sequence contains at least one mutation, then returns float(“inf”).