API Documentation¶
This is the API documentation for msprime
, and provides detailed information
on the Python programming interface. See the Tutorial for an
introduction to using this API to run simulations and analyse the results.
Simulation model¶
The simulation model in msprime
closely follows the classical ms
program. Unlike ms
, however, time is measured in generations rather than
“coalescent units”. Internally the same simulation algorithm is used, but
msprime
provides a translation layer to allow the user input times and
rates in generations. Similarly, the times associated with the trees produced
by msprime
are in measured generations. To enable this translation from
generations into coalescent units and viceversa, a reference effective
population size must be provided, which is given by the Ne
parameter in the
simulate()
function. (Note that we assume diploid population sizes
thoughout, since we scale by \(4 N_e\).) Population sizes for individual
demes and for past demographic events are defined as absolute values, not
scaled by Ne
. All migration rates and growth rates are also per generation.
When running simulations we define the length in bases \(L\) of the
sequence in question using the length
parameter. This defines the
coordinate space within which trees and mutations are defined. \(L\) is a
continuous value, and coordinates can take any value from \(0\) to
\(L\). Mutations occur in an infinite sites process along this sequence,
and mutation rates are specified per generation, per unit of sequence length.
Thus, given the pergeneration mutation rate \(\mu\), the rate of mutation
over the entire sequence in coalescent time units is \(\theta = 4 N_e \mu
L\). It is important to remember these scaling factors when comparing with
analytical results!
Similarly, recombination rates are per base, per generation in msprime
.
Thus, given the per generation crossover rate \(r\), the overall rate
of recombination between the ends of the sequence in coalescent time units
is \(\rho = 4 N_e r L\). Recombination events occur in a continuous
coordinate space, so that breakpoints do not necessarily occur at integer
locations. However, the underlying recombination model is finite, and the
behaviour of a small number of loci can be modelled using the
RecombinationMap
class. However, this is considered an advanced
feature and the majority of cases should be well served with the default
recombination model and number of loci.
Population structure is modelled by specifying a fixed number of demes \(d\), and a \(d \times d\) matrix \(M\) of per generation migration rates. Each element of the matrix \(M_{j,k}\) defines the fraction of population \(j\) that consists of migrants from population \(k\) in each generation. Each deme has an initial absolute population size \(s\) and a per generation exponential growth rate \(\alpha\). The size of a given population at time \(t\) in the past (measured in generations) is therefore given by \(s e^{\alpha t}\). Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.
Warning
This parameterisation of recombination, mutation and
migration rates is different to ms, which states these
rates over the entire region and in coalescent time units. The
motivation for this is to allow the user change the size of the simulated
region without having to rescale the recombination and mutation rates,
and to also allow users directly state times and rates in units of
generations. However, the mspms
command line application is
fully ms compatible.
Running simulations¶
The simulate()
function provides the primary interface to running
coalescent simulations in msprime.

msprime.
simulate
(sample_size=None, Ne=1, length=None, recombination_rate=None, recombination_map=None, mutation_rate=None, population_configurations=None, migration_matrix=None, demographic_events=[], samples=None, model=None, record_migrations=False, random_seed=None, mutation_generator=None, num_replicates=None)¶ Simulates the coalescent with recombination under the specified model parameters and returns the resulting
TreeSequence
.Parameters:  sample_size (int) – The number of individuals in our sample.
If not specified or None, this defaults to the sum of the
subpopulation sample sizes. Either
sample_size
,population_configurations
orsamples
must be specified.  Ne (float) – The effective (diploid) population size for the reference population. This determines the factor by which the pergeneration recombination and mutation rates are scaled in the simulation. This defaults to 1 if not specified.
 length (float) – The length of the simulated region in bases.
This parameter cannot be used along with
recombination_map
. Defaults to 1 if not specified.  recombination_rate (float) – The rate of recombination per base
per generation. This parameter cannot be used along with
recombination_map
. Defaults to 0 if not specified.  recombination_map (
RecombinationMap
) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with therecombination_rate
orlength
parameters, as these values are encoded within the map. Defaults to a uniform rate as described in therecombination_rate
parameter if not specified.  mutation_rate (float) – The rate of mutation per base per generation. If not specified, no mutations are generated.
 population_configurations (list or None.) – The list of
PopulationConfiguration
instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of sizesample_size
is assumed.  migration_matrix (list) – The matrix describing the rates
of migration between all pairs of populations. If \(N\)
populations are defined in the
population_configurations
parameter, then the migration matrix must be an \(N\times N\) matrix consisting of \(N\) lists of length \(N\).  demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in nondecreasing order of time. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.
 samples (list) – The list specifying the location and time of
all samples. This parameter may be used to specify historical
samples, and cannot be used in conjunction with the
sample_size
parameter. Each sample is a (population_id
,time
) pair such that the sample in positionj
in the list of samples is drawn in the specified population at the specfied time. Time is measured in generations, as elsewhere.  random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32}  1\).
 num_replicates (int) – The number of replicates of the specified
parameters to simulate. If this is not specified or None,
no replication is performed and a
TreeSequence
object returned. Ifnum_replicates
is provided, the specified number of replicates is performed, and an iterator over the resultingTreeSequence
objects returned.
Returns: The
TreeSequence
object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if thenum_replicates
parameter has been used.Return type: TreeSequence
or an iterator overTreeSequence
replicates.Warning: If using replication, do not store the results of the iterator in a list! For performance reasons, the same underlying object may be used for every TreeSequence returned which will most likely lead to unexpected behaviour.
 sample_size (int) – The number of individuals in our sample.
If not specified or None, this defaults to the sum of the
subpopulation sample sizes. Either
Population structure¶
Population structure is modelled in msprime
by specifying a fixed number of
demes, with the migration rates between those demes defined by a migration
matrix. Each deme has an initial_size
that defines its absolute size at
time zero and a pergeneration growth_rate
which specifies the exponential
growth rate of the subpopulation. We must also define the size of the sample
to draw from each deme. The number of populations and their initial
configuration is defined using the population_configurations
parameter to
simulate()
, which takes a list of PopulationConfiguration
instances. Population IDs are zero indexed, and correspond to their position in
the list.
Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 5, and specify that 2 samples are drawn from population 0 and 3 from population 1, then individuals 0 and 1 will be initially located in population 0, and individuals 2, 3 and 4 will be drawn from population 2.
Given \(N\) populations, migration matrices are specified using an \(N
\times N\) matrix of demetodeme migration rates. See the documentation for
simulate()
and the Simulation model section for more details on the
migration rates.

class
msprime.
PopulationConfiguration
(sample_size=None, initial_size=None, growth_rate=0.0)¶ The initial configuration of a population (or deme) in a simulation.
Parameters:  sample_size (int) – The number of initial samples that are drawn from this population.
 initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).
 growth_rate (float) – The exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size. Defaults to 0.
Demographic Events¶
Demographic events change some aspect of the population configuration
at some time in the past, and are specified using the demographic_events
parameter to simulate()
. Each element of this list must be an
instance of one of the following demographic events
that are currently supported. Note that all times are measured in
generations, all sizes are absolute (i.e., not relative to \(N_e\)),
and all rates are pergeneration.

class
msprime.
PopulationParametersChange
(time, initial_size=None, growth_rate=None, population_id=None)¶ Changes the demographic parameters of a population at a given time.
This event generalises the
eg
,eG
,en
andeN
options fromms
. Note that unlikems
we do not automatically set growth rates to zero when the population size is changed.Parameters:  time (float) – The time at which this event occurs in generations.
 initial_size (float) – The absolute size of the population
at the beginning of the time slice starting at
time
. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.  growth_rate (float) – The new pergeneration growth rate. If None, the growth rate is not changed. Defaults to None.
 population_id (int) – The ID of the population affected. If
population_id
is None, the changes affect all populations simultaneously.

class
msprime.
MigrationRateChange
(time, rate, matrix_index=None)¶ Changes the rate of migration to a new value at a specific time.
Parameters:  time (float) – The time at which this event occurs in generations.
 rate (float) – The new pergeneration migration rate.
 matrix_index (tuple) – A tuple of two population IDs descibing
the matrix index of interest. If
matrix_index
is None, all nondiagonal entries of the migration matrix are changed simultaneously.

class
msprime.
MassMigration
(time, source, destination, proportion=1.0)¶ A mass migration event in which some fraction of the population in one deme simultaneously move to another deme, viewed backwards in time. For each lineage currently present in the source population, they move to the destination population with probability equal to
proportion
.This event class generalises the population split (
ej
) and admixture (es
) events fromms
. Note that MassMigrations do not have any side effects on the migration matrix.Parameters:  time (float) – The time at which this event occurs in generations.
 source (int) – The ID of the source population.
 destination (int) – The ID of the destination population.
 proportion (float) – The probability that any given lineage within the source population migrates to the destination population.
Debugging demographic models¶
Warning
The DemographyDebugger
class is preliminary, and the API
is likely to change in the future.

class
msprime.
DemographyDebugger
(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=[])¶ A class to facilitate debugging of population parameters and migration rates in the past.

print_history
(output=<open file '<stdout>', mode 'w'>)¶ Prints a summary of the history of the populations.

Variable recombination rates¶

class
msprime.
RecombinationMap
(positions, rates, num_loci=None)¶ A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers:
positions
andrates
, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation isrates[j]
over the intervalpositions[j]
topositions[j + 1]
. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it does not used).Parameters:  positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.
 rates (list) – The list of rates corresponding to the supplied
positions
. Recombination rates are specified per base, per generation.  num_loci (int) – The maximum number of nonrecombining loci in the underlying simulation. By default this is set to the largest possible value, allowing the maximum resolution in the recombination process. However, for a finite sites model this can be set to smaller values.

classmethod
read_hapmap
(filename)¶ Parses the specified file in HapMap format. These files must contain a single header line (which is ignored), and then each subsequent line denotes a position/rate pair. Positions are in units of bases, and recombination rates in centimorgans/Megabase. The first column in this file is ignored, as are subsequence columns after the Position and Rate. A sample of this format is as follows:
Chromosome Position(bp) Rate(cM/Mb) Map(cM) chr1 55550 2.981822 0.000000 chr1 82571 2.082414 0.080572 chr1 88169 2.081358 0.092229 chr1 254996 3.354927 0.439456 chr1 564598 2.887498 1.478148
Parameters: filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.
Processing results¶
The TreeSequence
class represents a sequence of correlated trees
output by a simulation. The SparseTree
class represents a single
tree in this sequence.

msprime.NULL_NODE = 1
Special reserved value, representing the null node. If the parent of a given node is null, then this node is a root. Similarly, if the children of a node are null, this node is a leaf.

msprime.NULL_POPULATION = 1
Special reserved value, representing the null population ID. If the population associated with a particular tree node is not defined, or population information was not available in the underlying tree sequence, then this value will be returned by
SparseTree.get_population()
.

msprime.FORWARD = 1
Constant representing the forward direction of travel (i.e., increasing coordinate values).

msprime.REVERSE = 1
Constant representing the reverse direction of travel (i.e., decreasing coordinate values).

msprime.
load
(path)¶ Loads a tree sequence from the specified file path. This file must be in the HDF5 file format produced by the
TreeSequence.dump()
method.Parameters: path (str) – The file path of the HDF5 file containing the tree sequence we wish to load. Returns: The tree sequence object containing the information stored in the specified file path. Return type: msprime.TreeSequence

class
msprime.
TreeSequence
¶ A TreeSequence represents the information generated in a coalescent simulation. This includes all the trees across the simulated region, along with the mutations (if any are present).

Y
(leaf_sets, windows)¶ Finds the ‘Y’ statistic between the three groups of leaves in leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap). If the leaf_sets are A, B, and C, then the result gives the mean total length of any edge in the tree between a and the most recent common ancestor of b and c, where a, b, and c are random draws from A, B, and C respectively.
Parameters:  leaf_sets (list) – A list of three sets of IDs of leaves: (A,B,C).
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
Returns: A list of numeric values computed separately on each window.

Y_vector
(leaf_sets, windows, indices)¶ Finds the ‘Y’ statistic between three leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap). If the leaf_sets are A, B, and C, then the result gives the mean total length of any edge in the tree between a and the most recent common ancestor of b and c, where a, b, and c are random draws from A, B, and C respectively.
 The result is, for each window, a vector whose kth entry is
 Y(leaf_sets[indices[k][0]], leaf_sets[indices[k][1]],
 leaf_sets[indices[k][2]]).
Parameters:  leaf_sets (list) – A list of three sets of IDs of leaves: (A,B,C).
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
 indices (list) – A list of triples of indices of leaf_sets.
Returns: A list of numeric vectors of length equal to the length of indices, computed separately on each window.

branch_stats
(leaf_sets, weight_fun)¶ Here leaf_sets is a list of lists of leaves, and weight_fun is a function whose argument is a list of integers of the same length as leaf_sets that returns a boolean. A branch in a tree is weighted by weight_fun(x), where x[i] is the number of leaves in leaf_sets[i] below that branch. This finds the sum of all counted branches for each tree, and averages this across the tree sequence, weighted by genomic length.

branch_stats_vector
(leaf_sets, weight_fun, windows=None)¶ Here leaf_sets is a list of lists of leaves, and weight_fun is a function whose argument is a list of integers of the same length as leaf_sets that returns a boolean. A branch in a tree is weighted by weight_fun(x), where x[i] is the number of leaves in leaf_sets[i] below that branch. This finds the sum of all counted branches for each tree, and averages this across the tree sequence, weighted by genomic length.
It does this separately for each window [windows[i], windows[i+1]) and returns the values in a list.

branch_stats_windowed
(leaf_sets, weight_fun, windows=None)¶ Here leaf_sets is a list of lists of leaves, and weight_fun is a function whose argument is a list of integers of the same length as leaf_sets that returns a boolean. A branch in a tree is weighted by weight_fun(x), where x[i] is the number of leaves in leaf_sets[i] below that branch. This finds the sum of all counted branches for each tree, and averages this across the tree sequence, weighted by genomic length.

breakpoints
()¶ Returns an iterator over the breakpoints along the chromosome, including the two extreme points 0 and L. This is equivalent to
>>> [0] + [t.get_interval()[1] for t in self.trees()]
although we do not build an explicit list.
Returns: An iterator over all the breakpoints along the simulated sequence. Return type: iter

diffs
()¶ Returns an iterator over the differences between adjacent trees in this tree sequence. Each diff returned by this method is a tuple of the form (length, records_out, records_in). The length is the length of the genomic interval covered by the current tree, and is equivalent to the value returned by
msprime.SparseTree.get_length()
. The records_out value is list of \((u, c, t)\) tuples, and corresponds to the coalescence records that have been invalidated by moving to the current tree. As in therecords()
method, \(u\) is a tree node, \(c\) is a tuple containing its children, and \(t\) is the time the event occurred. These records are returned in timedecreasing order, such that the record affecting the highest parts of the tree (i.e., closest to the root) are returned first. The records_in value is also a list of \((u, c, t)\) tuples, and these describe the records that must be applied to create the tree covering the current interval. These records are returned in timeincreasing order, such that the records affecting the lowest parts of the tree (i.e., closest to the leaves) are returned first.Returns: An iterator over the diffs between adjacent trees in this tree sequence. Return type: iter

dump
(path, zlib_compression=False)¶ Writes the tree sequence to the specified file path.
Parameters:

dump_samples_text
(samples, precision=6)¶ Writes a text representation of the entries in the NodeTable corresponding to samples to the specified connections.
Parameters:  samples (stream) – The filelike object to write the subset of the NodeTable describing the samples to, with an extra column, id.
 precision (int) – The number of digits of precision.

dump_text
(nodes=None, edgesets=None, sites=None, mutations=None, precision=6)¶ Writes a text representation of the tables underlying the tree sequence to the specified connections.
Parameters:  nodes (stream) – The filelike object (having a .write() method) to write the NodeTable to.
 edgesets (stream) – The filelike object to write the EdgesetTable to.
 sites (stream) – The filelike object to write the SiteTable to.
 mutations (stream) – The filelike object to write the MutationTable to.
 precision (int) – The number of digits of precision.

f2
(leaf_sets, windows)¶ Finds the Patterson’s f2 statistics between the three groups of leaves in leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap).
f2(A;B) is f4(A,B;A,B) corrected to not include self comparisons.
Parameters:  leaf_sets (list) – A list of two sets of IDs of leaves: (A,B), each having at least two samples.
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
Returns: A list of values of f2(A,B) computed separately on each window.

f2_vector
(leaf_sets, windows, indices)¶ Finds the Patterson’s f2 statistics between multiple subsets of pairs of leaves in leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap).
f2(A;B) is f4(A,B;A,B) corrected to not include self comparisons.
Parameters:  leaf_sets (list) – A list of sets of IDs of leaves, each having at least two samples.
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
 indices (list) – A list of pairs of indices of leaf_sets.
Returns: A list of values of f2(A,C) computed separately on each window.

f3
(leaf_sets, windows)¶ Finds the Patterson’s f3 statistics between the three groups of leaves in leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap).
f3(A;B,C) is f4(A,B;A,C) corrected to not include self comparisons.
Parameters:  leaf_sets (list) – A list of three sets of IDs of leaves: (A,B,C), with the first set having at least two samples.
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
Returns: A list of values of f3(A,B,C) computed separately on each window.

f3_vector
(leaf_sets, windows, indices)¶ Finds the Patterson’s f3 statistics between multiple subsets of three groups of leaves in leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap).
f3(A;B,C) is f4(A,B;A,C) corrected to not include self comparisons.
If A does not contain at least three samples, the result is NaN.
Parameters:  leaf_sets (list) – A list of sets of IDs of leaves.
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
 indices (list) – A list of triples of indices of leaf_sets.
Returns: A list of values of f3(A,B,C) computed separately on each window.

f4
(leaf_sets, windows)¶ Finds the Patterson’s f4 statistics between the four groups of leaves in leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap).
Parameters:  leaf_sets (list) – A list of four sets of IDs of leaves: (A,B,C,D)
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
Returns: A list of values of f4(A,B;C,D) computed separately on each window.

f4_vector
(leaf_sets, windows, indices)¶ Finds the Patterson’s f4 statistics between multiple subsets of four groups of leaf_sets. The leaf_sets should be disjoint (the computation works fine, but if not the result depends on the amount of overlap).
Parameters:  leaf_sets (list) – A list of four sets of IDs of leaves: (A,B,C,D)
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
 indices (list) – A list of 4tuples of indices of leaf_sets.
Returns: A list of values of f4(A,B;C,D) of length equal to the length of indices, computed separately on each window.

get_num_mutations
()¶ Returns the number of mutations in this tree sequence. See the
msprime.TreeSequence.mutations()
method for details on how mutations are defined.Returns: The number of mutations in this tree sequence. Return type: int

get_num_nodes
()¶ Returns the number of nodes in this tree sequence. This 1 + the largest value \(u\) such that u is a node in any of the constituent trees.
Returns: The total number of nodes in this tree sequence. Return type: int

get_num_records
()¶ Returns the number of coalescence records in this tree sequence. See the
records()
method for details on these objects.Returns: The number of coalescence records defining this tree sequence. Return type: int

get_num_trees
()¶ Returns the number of distinct trees in this tree sequence. This is equal to the number of trees returned by the
trees()
method.Returns: The number of trees in this tree sequence. Return type: int

get_pairwise_diversity
(samples=None)¶ Returns the value of pi, the pairwise nucleotide site diversity, which is the average number of mutations that differ between a randomly chosen pair of samples. If samples is specified, calculate the diversity within this set.
Parameters: samples (iterable) – The set of samples within which we calculate the diversity. If None, calculate diversity within the entire sample. Returns: The pairwise nucleotide site diversity. Return type: float

get_population
(u)¶ Returns the population ID for the specified sample ID.
Parameters: u (int) – The individual ID of interest. Returns: The population ID where the specified individual lived. Returns NULL_POPULATION
if no population information is available.Return type: int

get_sample_size
()¶ Returns the sample size for this tree sequence. This is the number of leaf nodes in each tree.
Returns: The number of leaf nodes in the tree sequence. Return type: int

get_samples
(population_id=None)¶ Returns the samples matching the specified population ID.
Parameters: population_id (int) – The population of interest. If None, return all samples. Returns: The ID of the population we wish to find samples from. If None, return samples from all populations. Return type: list

get_sequence_length
()¶ Returns the sequence length in this tree sequence. This defines the genomic scale over which tree coordinates are defined. Given a tree sequence with a sequence length \(L\), the constituent trees will be defined over the halfclosed interval \((0, L]\). Each tree then covers some subset of this interval — see
msprime.SparseTree.get_interval()
for details.Returns: The length of the sequence in this tree sequence in bases. Return type: float

get_time
(u)¶ Returns the time that the specified ID was alive at.
Parameters: u (int) – The individual ID of interest. Returns: The time at which the specified individual was alive at. Return type: int

haplotypes
()¶ Returns an iterator over the haplotypes resulting from the trees and mutations in this tree sequence as a string of ‘1’s and ‘0’s. The iterator returns a total of \(n\) strings, each of which contains \(s\) characters (\(n\) is the sample size returned by
msprime.TreeSequence.get_sample_size()
and \(s\) is the number of mutations returned bymsprime.TreeSequence.get_num_mutations()
). The first string returned is the haplotype for sample 0, and so on.Returns: An iterator over the haplotype strings for the samples in this tree sequence. Return type: iter

mean_pairwise_tmrca
(leaf_sets, windows)¶ Finds the mean time to most recent common ancestor between pairs of samples as described in mean_pairwise_tmrca_matrix (which uses this function). Returns the upper triangle (including the diagonal) in rowmajor order, so if the output is x, then:
>>> k=0 >>> for w in range(len(windows)1): >>> for i in range(len(leaf_sets)): >>> for j in range(i,len(leaf_sets)): >>> trmca[i,j] = tmrca[j,i] = x[w][k] >>> k += 1
will fill out the matrix of mean TMRCAs in the i`th window between (and within) each group of leaves in `leaf_sets in the matrix tmrca. Alternatively, if names labels the leaf_sets, the output labels are:
>>> [".".join(names[i],names[j]) for i in range(len(names)) >>> for j in range(i,len(names))]
Parameters:  leaf_sets (list) – A list of sets of IDs of leaves.
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
Returns: A list of the upper triangle of mean TMRCA values in rowmajor order, including the diagonal.

mean_pairwise_tmrca_matrix
(leaf_sets, windows)¶ Finds the mean time to most recent common ancestor between pairs of samples from each set of leaves and in each window. Returns a numpy array indexed by (window, leaf_set, leaf_set). Diagonal entries are corrected so that the value gives the mean pairwise TMRCA for distinct samples, but it is not checked whether the leaf_sets are disjoint (so offdiagonals are not corrected). For this reason, if an element of leaf_sets has only one element, the corresponding diagonal will be NaN.
The mean TMRCA between two samples is defined to be onehalf the length of all edges separating them in the tree at a uniformly chosen position on the genome.
Parameters:  leaf_sets (list) – A list of sets of IDs of leaves.
 windows (iterable) – The breakpoints of the windows (including start and end, so has one more entry than number of windows).
Returns: A list of the upper triangle of mean TMRCA values in rowmajor order, including the diagonal.

mutations
()¶ Returns an iterator over the mutations in this tree sequence. Each mutation is represented as a tuple \((x, u, j)\) where \(x\) is the position of the mutation in the sequence in chromosome coordinates, \(u\) is the node over which the mutation occurred and \(j\) is the zerobased index of the mutation within the overall tree sequence. Mutations are returned in nondecreasing order of position and increasing index.
Each mutation returned is an instance of
collections.namedtuple()
, and may be accessed via the attributesposition
,node
andindex
as well as the usual positional approach. This is the recommended interface for working with mutations as it is both more readable and also ensures that code is forward compatible with future extensions.Returns: An iterator of all \((x, u, j)\) tuples defining the mutations in this tree sequence. Return type: iter

records
()¶ Returns an iterator over the coalescence records in this tree sequence in timesorted order. Each record is a tuple \((l, r, u, c, t, d)\) defining the assignment of a tree node across an interval. The range of this record is the halfopen genomic interval \([l, r)\), such that it applies to all positions \(l \leq x < r\). Each record represents the assignment of a pair of children \(c\) to a parent parent \(u\). This assignment happens at \(t\) generations in the past within the population with ID \(d\). If population information was not stored for this tree sequence then the population ID will be
NULL_POPULATION
.Each record returned is an instance of
collections.namedtuple()
, and may be accessed via the attributesleft
,right
,node
,children
,time
andpopulation
, as well as the usual positional approach. For example, if we wished to print out the genomic length of each record, we could write:>>> for record in tree_sequence.records(): >>> print(record.right  record.left)
Returns: An iterator of all \((l, r, u, c, t, d)\) tuples defining the coalescence records in this tree sequence. Return type: iter

trees
(tracked_leaves=None, leaf_counts=True, leaf_lists=False)¶ Returns an iterator over the trees in this tree sequence. Each value returned in this iterator is an instance of
SparseTree
.The
leaf_counts
andleaf_lists
parameters control the features that are enabled for the resulting trees. Ifleaf_counts
is True, then it is possible to count the number of leaves underneath a particular node in constant time using theget_num_leaves()
method. Ifleaf_lists
is True a more efficient algorithm is used in theSparseTree.leaves()
method.The
tracked_leaves
parameter can be used to efficiently count the number of leaves in a given set that exist in a particular subtree using theSparseTree.get_num_tracked_leaves()
method. It is an error to use thetracked_leaves
parameter when theleaf_counts
flag is False.Warning: Do not store the results of this iterator in a list! For performance reasons, the same underlying object is used for every tree returned which will most likely lead to unexpected behaviour.
Parameters:  tracked_leaves (list) – The list of leaves to be tracked and
counted using the
SparseTree.get_num_tracked_leaves()
method.  leaf_counts (bool) – If True, support constant time leaf counts
via the
SparseTree.get_num_leaves()
andSparseTree.get_num_tracked_leaves()
methods.  leaf_lists (bool) – If True, provide more efficient access
to the leaves beneath a give node using the
SparseTree.leaves()
method.
Returns: An iterator over the sparse trees in this tree sequence.
Return type:  tracked_leaves (list) – The list of leaves to be tracked and
counted using the

variants
(as_bytes=False)¶ Returns an iterator over the variants in this tree sequence. Each variant corresponds to a single mutation and is represented as a tuple \((x, u, j, g)\). The values of \(x\), \(u\) and \(j\) are identical to the values returned by the
TreeSequence.mutations()
method, and \(g\) represents the sample genotypes for this variant. Thus, \(g[k]\) is the observed state for sample \(k\) at this site; zero represents the ancestral type and one the derived type.Each variant returned is an instance of
collections.namedtuple()
, and may be accessed via the attributesposition
,node
,index
andgenotypes
as well as the usual positional approach. This is the recommended interface for working with variants as it is both more readable and also ensures that code is forward compatible with future extensions.The returned genotypes may be either a numpy array of 1 byte unsigned integer 0/1 values, or a Python bytes object of ‘0’/‘1’ ASCII characters. This behaviour is controller by the
as_bytes
parameter. The default behaviour is to return a numpy array, which is substantially more efficient.Warning: The same numpy array is used to represent genotypes between iterations, so if you wish the store the results of this iterator you must take a copy of the array. This warning does not apply when as_bytes
is True, as a new bytes object is allocated for each variant.Parameters: as_bytes (bool) – If True, the genotype values will be returned as a Python bytes object. This is useful in certain situations (i.e., directly printing the genotypes) or when numpy is not available. Otherwise, genotypes are returned as a numpy array (the default). Returns: An iterator of all \((x, u, j, g)\) tuples defining the variants in this tree sequence.

write_vcf
(output, ploidy=1, contig_id='1')¶ Writes a VCF formatted file to the specified filelike object. If a ploidy value is supplied, allele values are combined among adjacent samples to form a phased genotype of the required ploidy. For example, if we have a ploidy of 2 and a sample of size 6, then we will have 3 diploid samples in the output, consisting of the combined alleles for samples [0, 1], [2, 3] and [4, 5]. If we had alleles 011110 at a particular variant, then we would output the genotypes 01, 11 and 10 in VCF. Sample names are generated by appending the index to the prefix
msp_
such that we would have the sample namesmsp_0
,msp_1
andmsp_2
in the running example.Example usage:
>>> with open("output.vcf", "w") as vcf_file: >>> tree_sequence.write_vcf(vcf_file, 2)
Parameters:


class
msprime.
SparseTree
¶ A SparseTree is a single tree in a
TreeSequence
. In a sparse tree for a sample of size \(n\), the leaves are nodes \(0\) to \(n  1\) inclusive and internal nodes are integers \(\geq n\). The value of these nodes is strictly increasing as we ascend the tree and the root of the tree is the node with the largest value that is reachable from the leaves. Each node in the tree has a parent which is obtained using theget_parent()
method. The parent of the root node is theNULL_NODE
, \(1\). Similarly, each internal node has a pair of children, which are obtained using theget_children()
method. Each node in the tree has a time associated with it in generations. This value is obtained using theSparseTree.get_time()
method.Sparse trees are not intended to be instantiated directly, and are obtained as part of a
TreeSequence
using thetrees()
method.
draw
(path=None, width=200, height=200, show_times=False, show_mutation_labels=False, show_internal_node_labels=True, show_leaf_node_labels=True)¶ Returns a representation of this tree in SVG format.
Parameters:  path (str) – The path to the file to write the SVG. If None, do not write to file.
 width (int) – The width of the image in pixels.
 height (int) – The height of the image in pixels.
 show_times (bool) – If True, show time labels at each internal node.
 show_mutation_labels (bool) – If True, show labels for mutations.
 show_internal_node_labels (bool) – If True, show labels for internal nodes.
 show_leaf_node_labels (bool) – If True, show labels for leaf nodes.
Returns: A representation of this tree in SVG format.
Return type:

get_branch_length
(u)¶ Returns the length of the branch (in generations) joining the specified node to its parent. This is equivalent to
>>> tree.get_time(tree.get_parent(u))  tree.get_time(u)
Note that this is not related to the value returned by
get_length()
, which describes the length of the interval covered by the tree in genomic coordinates.Parameters: u (int) – The node of interest. Returns: The branch length from u to its parent. Return type: float

get_children
(u)¶ Returns the children of the specified node as a tuple \((v, w)\). For internal nodes, this tuple is always in sorted order such that \(v < w\). If u is a leaf or is not a node in the current tree, return the tuple (
NULL_NODE
,NULL_NODE
).Parameters: u (int) – The node of interest. Returns: The children of u as a pair of integers Return type: tuple

get_index
()¶ Returns the index this tree occupies in the parent tree sequence. This index is zero based, so the first tree in the sequence has index 0.
Returns: The index of this tree. Return type: int

get_interval
()¶ Returns the coordinates of the genomic interval that this tree represents the history of. The interval is returned as a tuple \((l, r)\) and is a halfopen interval such that the left coordinate is inclusive and the right coordinate is exclusive. This tree therefore applies to all genomic locations \(x\) such that \(l \leq x < r\).
Returns: A tuple (l, r) representing the leftmost (inclusive) and rightmost (exclusive) coordinates of the genomic region covered by this tree. Return type: tuple

get_length
()¶ Returns the length of the genomic interval that this tree represents. This is defined as \(r  l\), where \((l, r)\) is the genomic interval returned by
get_interval()
.Returns: The length of the genomic interval covered by this tree. Return type: int

get_mrca
(u, v)¶ Returns the most recent common ancestor of the specified nodes.
Parameters: Returns: The most recent common ancestor of u and v.
Return type:

get_num_leaves
(u)¶ Returns the number of leaves in this tree underneath the specified node.
If the
TreeSequence.trees()
method is called withleaf_counts=True
this method is a constant time operation. If not, a slower traversal based algorithm is used to count the leaves.Parameters: u (int) – The node of interest. Returns: The number of leaves in the subtree rooted at u. Return type: int

get_num_mutations
()¶ Returns the number of mutations on this tree.
Returns: The number of mutations on this tree. Return type: int

get_num_tracked_leaves
(u)¶ Returns the number of leaves in the set specified in the
tracked_leaves
parameter of theTreeSequence.trees()
method underneath the specified node. This is a constant time operation.Parameters: u (int) – The node of interest. Returns: The number of leaves within the set of tracked leaves in the subtree rooted at u. Return type: int Raises: RuntimeError – if the TreeSequence.trees()
method is not called withleaf_counts=True
.

get_parent
(u)¶ Returns the parent of the specified node. Returns the
NULL_NODE
1 if u is the root or is not a node in the current tree.Parameters: u (int) – The node of interest. Returns: The parent of u. Return type: int

get_population
(u)¶ Returns the population associated with the specified node. For leaf nodes this is the population of the sample, and for internal nodes this is the population where the corresponding coalescence occured. If the specified node is not a member of this tree or population level information was not stored in the tree sequence,
NULL_POPULATION
is returned.Parameters: u (int) – The node of interest. Returns: The ID of the population associated with node u. Return type: int

get_sample_size
()¶ Returns the sample size for this tree. This is the number of leaf nodes in the tree.
Returns: The number of leaf nodes in the tree. Return type: int

get_time
(u)¶ Returns the time of the specified node in generations. Returns 0 if u is a leaf or is not a node in the current tree.
Parameters: u (int) – The node of interest. Returns: The time of u. Return type: float

get_tmrca
(u, v)¶ Returns the time of the most recent common ancestor of the specified nodes. This is equivalent to:
tree.get_time(tree.get_mrca(u, v))
Parameters: Returns: The time of the most recent common ancestor of u and v.
Return type:

get_total_branch_length
()¶ Returns the sum of all the branch lengths in this tree (in units of generations). This is equivalent to
>>> sum( >>> tree.get_branch_length(u) for u in tree.nodes() >>> if u != tree.get_root())
Returns: The sum of all the branch lengths in this tree.

is_internal
(u)¶ Returns True if the specified node is not a leaf.
Parameters: u (int) – The node of interest. Returns: True if u is not a leaf node. Return type: bool

is_leaf
(u)¶ Returns True if the specified node is a leaf. A node \(u\) is a leaf if it has zero children.
Parameters: u (int) – The node of interest. Returns: True if u is a leaf node. Return type: bool

leaves
(u)¶ Returns an iterator over all the leaves in this tree underneath the specified node.
If the
TreeSequence.trees()
method is called withleaf_lists=True
, this method uses an efficient algorithm to find the leaves. If not, a simple traversal based method is used.Parameters: u (int) – The node of interest. Returns: An iterator over all leaves in the subtree rooted at u. Return type: iterator

mutations
()¶ Returns an iterator over the mutations in this tree. Each mutation is represented as a tuple \((x, u, j)\) where \(x\) is the position of the mutation in the sequence in chromosome coordinates, \(u\) is the node over which the mutation occurred and \(j\) is the zerobased index of the mutation within the overall tree sequence. Mutations are returned in nondecreasing order of position and increasing index.
Each mutation returned is an instance of
collections.namedtuple()
, and may be accessed via the attributesposition
,node
andindex
as well as the usual positional approach. This is the recommended interface for working with mutations as it is both more readable and also ensures that code is forward compatible with future extensions.Returns: An iterator of all \((x, u, j)\) tuples defining the mutations in this tree. Return type: iter

nodes
(root=None, order='preorder')¶ Returns an iterator over the nodes in this tree. If the root parameter is provided, iterate over the nodes in the subtree rooted at this node. If this is None, iterate over all nodes. If the order parameter is provided, iterate over the nodes in required tree traversal order.
Parameters: Return type: iterator

Calculating statistics¶
The msprime
API provides methods for efficiently calculating
population genetics statistics from a given TreeSequence
.

class
msprime.
LdCalculator
(tree_sequence)¶ Class for calculating linkage disequilibrium coefficients between pairs of mutations in a
TreeSequence
. This class requires the numpy library.This class supports multithreaded access using the Python
threading
module. Separate instances ofLdCalculator
referencing the same tree sequence can operate in parallel in multiple threads. See the Working with threads section in the Tutorial for an example of how use multiple threads to calculate LD values efficiently.Parameters: tree_sequence (TreeSequence) – The tree sequence containing the mutations we are interested in. 
get_r2
(a, b)¶ Returns the value of the \(r^2\) statistic between the pair of mutations at the specified indexes. This method is not an efficient method for computing large numbers of pairwise; please use either
get_r2_array()
orget_r2_matrix()
for this purpose.Parameters: Returns: The value of \(r^2\) between the mutations at indexes
a
andb
.Return type:

get_r2_array
(a, direction=1, max_mutations=None, max_distance=None)¶ Returns the value of the \(r^2\) statistic between the focal mutation at index \(a\) and a set of other mutations. The method operates by starting at the focal mutation and iterating over adjacent mutations (in either the forward or backwards direction) until either a maximum number of other mutations have been considered (using the
max_mutations
parameter), a maximum distance in sequence coordinates has been reached (using themax_distance
parameter) or the start/end of the sequence has been reached. For every mutation \(b\) considered, we then insert the value of \(r^2\) between \(a\) and \(b\) at the corresponding index in an array, and return the entire array. If the returned array is \(x\) anddirection
ismsprime.FORWARD
then \(x[0]\) is the value of the statistic for \(a\) and \(a + 1\), \(x[1]\) the value for \(a\) and \(a + 2\), etc. Similarly, ifdirection
ismsprime.REVERSE
then \(x[0]\) is the value of the statistic for \(a\) and \(a  1\), \(x[1]\) the value for \(a\) and \(a  2\), etc.Parameters:  a (int) – The index of the focal mutation.
 direction (int) – The direction in which to travel when
examining other mutations. Must be either
msprime.FORWARD
ormsprime.REVERSE
. Defaults tomsprime.FORWARD
.  max_mutations (int) – The maximum number of mutations to return \(r^2\) values for. Defaults to as many mutations as possible.
 max_distance (float) – The maximum absolute distance between the focal mutation and those for which \(r^2\) values are returned.
Returns: An array of double precision floating point values representing the \(r^2\) values for mutations in the specified direction.
Return type: numpy.ndarray
Warning: For efficiency reasons, the underlying memory used to store the returned array is shared between calls. Therefore, if you wish to store the results of a single call to
get_r2_array()
for later processing you must take a copy of the array!

get_r2_matrix
()¶ Returns the complete \(m \times m\) matrix of pairwise \(r^2\) values in a tree sequence with \(m\) mutations.
Returns: An 2 dimensional square array of double precision floating point values representing the \(r^2\) values for all pairs of mutations. Return type: numpy.ndarray
