API Documentation

This is the API documentation for msprime, and provides detailed information on the Python programming interface. See the Tutorial for an introduction to using this API to run simulations and analyse the results.

Simulation model

The simulation model in msprime closely follows the classical ms program. Unlike ms, however, time is measured in generations rather than “coalescent units”. Internally the same simulation algorithm is used, but msprime provides a translation layer to allow the user input times and rates in generations. Similarly, the times associated with the trees produced by msprime are in measured generations. To enable this translation from generations into coalescent units and vice-versa, a reference effective population size must be provided, which is given by the Ne parameter in the simulate() function. (Note that we assume diploid population sizes thoughout, since we scale by \(4 N_e\).) Population sizes for individual demes and for past demographic events are defined as absolute values, not scaled by Ne. All migration rates and growth rates are also per generation.

When running simulations we define the length in bases \(L\) of the sequence in question using the length parameter. This defines the coordinate space within which trees and mutations are defined. \(L\) is a continuous value, and coordinates can take any value from \(0\) to \(L\). Mutations occur in an infinite sites process along this sequence, and mutation rates are specified per generation, per unit of sequence length. Thus, given the per-generation mutation rate \(\mu\), the rate of mutation over the entire sequence in coalescent time units is \(\theta = 4 N_e \mu L\). It is important to remember these scaling factors when comparing with analytical results!

Similarly, recombination rates are per base, per generation in msprime. Thus, given the per generation crossover rate \(r\), the overall rate of recombination between the ends of the sequence in coalescent time units is \(\rho = 4 N_e r L\). Recombination events occur in a continuous coordinate space, so that breakpoints do not necessarily occur at integer locations. However, the underlying recombination model is finite, and the behaviour of a small number of loci can be modelled using the RecombinationMap class. However, this is considered an advanced feature and the majority of cases should be well served with the default recombination model and number of loci.

Population structure is modelled by specifying a fixed number of demes \(d\), and a \(d \times d\) matrix \(M\) of per generation migration rates. Each element of the matrix \(M_{j,k}\) defines the fraction of population \(j\) that consists of migrants from population \(k\) in each generation. Each deme has an initial absolute population size \(s\) and a per generation exponential growth rate \(\alpha\). The size of a given population at time \(t\) in the past (measured in generations) is therefore given by \(s e^{-\alpha t}\). Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.

Warning

This parameterisation of recombination, mutation and migration rates is different to ms, which states these rates over the entire region and in coalescent time units. The motivation for this is to allow the user change the size of the simulated region without having to rescale the recombination and mutation rates, and to also allow users directly state times and rates in units of generations. However, the mspms command line application is fully ms compatible.

Running simulations

The simulate() function provides the primary interface to running coalescent simulations in msprime.

msprime.simulate(sample_size=None, Ne=1, length=None, recombination_rate=None, recombination_map=None, mutation_rate=None, population_configurations=None, migration_matrix=None, demographic_events=[], samples=None, model=None, record_migrations=False, random_seed=None, mutation_generator=None, num_replicates=None)

Simulates the coalescent with recombination under the specified model parameters and returns the resulting TreeSequence.

Parameters:
  • sample_size (int) – The number of individuals in our sample. If not specified or None, this defaults to the sum of the subpopulation sample sizes. Either sample_size, population_configurations or samples must be specified.
  • Ne (float) – The effective (diploid) population size for the reference population. This determines the factor by which the per-generation recombination and mutation rates are scaled in the simulation. This defaults to 1 if not specified.
  • length (float) – The length of the simulated region in bases. This parameter cannot be used along with recombination_map. Defaults to 1 if not specified.
  • recombination_rate (float) – The rate of recombination per base per generation. This parameter cannot be used along with recombination_map. Defaults to 0 if not specified.
  • recombination_map (RecombinationMap) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with the recombination_rate or length parameters, as these values are encoded within the map. Defaults to a uniform rate as described in the recombination_rate parameter if not specified.
  • mutation_rate (float) – The rate of mutation per base per generation. If not specified, no mutations are generated.
  • population_configurations (list or None.) – The list of PopulationConfiguration instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of size sample_size is assumed.
  • migration_matrix (list) – The matrix describing the rates of migration between all pairs of populations. If \(N\) populations are defined in the population_configurations parameter, then the migration matrix must be an \(N\times N\) matrix consisting of \(N\) lists of length \(N\).
  • demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in non-decreasing order of time. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.
  • samples (list) – The list specifying the location and time of all samples. This parameter may be used to specify historical samples, and cannot be used in conjunction with the sample_size parameter. Each sample is a (population_id, time) pair such that the sample in position j in the list of samples is drawn in the specified population at the specfied time. Time is measured in generations, as elsewhere.
  • random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).
  • num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a TreeSequence object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resulting TreeSequence objects returned.
Returns:

The TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.

Return type:

TreeSequence or an iterator over TreeSequence replicates.

Warning:

If using replication, do not store the results of the iterator in a list! For performance reasons, the same underlying object may be used for every TreeSequence returned which will most likely lead to unexpected behaviour.

Population structure

Population structure is modelled in msprime by specifying a fixed number of demes, with the migration rates between those demes defined by a migration matrix. Each deme has an initial_size that defines its absolute size at time zero and a per-generation growth_rate which specifies the exponential growth rate of the sub-population. We must also define the size of the sample to draw from each deme. The number of populations and their initial configuration is defined using the population_configurations parameter to simulate(), which takes a list of PopulationConfiguration instances. Population IDs are zero indexed, and correspond to their position in the list.

Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 5, and specify that 2 samples are drawn from population 0 and 3 from population 1, then individuals 0 and 1 will be initially located in population 0, and individuals 2, 3 and 4 will be drawn from population 2.

Given \(N\) populations, migration matrices are specified using an \(N \times N\) matrix of deme-to-deme migration rates. See the documentation for simulate() and the Simulation model section for more details on the migration rates.

class msprime.PopulationConfiguration(sample_size=None, initial_size=None, growth_rate=0.0)

The initial configuration of a population (or deme) in a simulation.

Parameters:
  • sample_size (int) – The number of initial samples that are drawn from this population.
  • initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).
  • growth_rate (float) – The exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size. Defaults to 0.

Demographic Events

Demographic events change some aspect of the population configuration at some time in the past, and are specified using the demographic_events parameter to simulate(). Each element of this list must be an instance of one of the following demographic events that are currently supported. Note that all times are measured in generations, all sizes are absolute (i.e., not relative to \(N_e\)), and all rates are per-generation.

class msprime.PopulationParametersChange(time, initial_size=None, growth_rate=None, population_id=None)

Changes the demographic parameters of a population at a given time.

This event generalises the -eg, -eG, -en and -eN options from ms. Note that unlike ms we do not automatically set growth rates to zero when the population size is changed.

Parameters:
  • time (float) – The time at which this event occurs in generations.
  • initial_size (float) – The absolute size of the population at the beginning of the time slice starting at time. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.
  • growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None.
  • population_id (int) – The ID of the population affected. If population_id is None, the changes affect all populations simultaneously.
class msprime.MigrationRateChange(time, rate, matrix_index=None)

Changes the rate of migration to a new value at a specific time.

Parameters:
  • time (float) – The time at which this event occurs in generations.
  • rate (float) – The new per-generation migration rate.
  • matrix_index (tuple) – A tuple of two population IDs descibing the matrix index of interest. If matrix_index is None, all non-diagonal entries of the migration matrix are changed simultaneously.
class msprime.MassMigration(time, source, destination, proportion=1.0)

A mass migration event in which some fraction of the population in one deme simultaneously move to another deme, viewed backwards in time. For each lineage currently present in the source population, they move to the destination population with probability equal to proportion.

This event class generalises the population split (-ej) and admixture (-es) events from ms. Note that MassMigrations do not have any side effects on the migration matrix.

Parameters:
  • time (float) – The time at which this event occurs in generations.
  • source (int) – The ID of the source population.
  • destination (int) – The ID of the destination population.
  • proportion (float) – The probability that any given lineage within the source population migrates to the destination population.

Debugging demographic models

Warning

The DemographyDebugger class is preliminary, and the API is likely to change in the future.

class msprime.DemographyDebugger(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=[])

A class to facilitate debugging of population parameters and migration rates in the past.

print_history(output=<open file '<stdout>', mode 'w'>)

Prints a summary of the history of the populations.

Variable recombination rates

class msprime.RecombinationMap(positions, rates, num_loci=None)

A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers: positions and rates, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation is rates[j] over the interval positions[j] to positions[j + 1]. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it does not used).

Parameters:
  • positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.
  • rates (list) – The list of rates corresponding to the supplied positions. Recombination rates are specified per base, per generation.
  • num_loci (int) – The maximum number of non-recombining loci in the underlying simulation. By default this is set to the largest possible value, allowing the maximum resolution in the recombination process. However, for a finite sites model this can be set to smaller values.
classmethod read_hapmap(filename)

Parses the specified file in HapMap format. These files must contain a single header line (which is ignored), and then each subsequent line denotes a position/rate pair. Positions are in units of bases, and recombination rates in centimorgans/Megabase. The first column in this file is ignored, as are subsequence columns after the Position and Rate. A sample of this format is as follows:

Chromosome  Position(bp)    Rate(cM/Mb)     Map(cM)
chr1        55550   2.981822        0.000000
chr1        82571   2.082414        0.080572
chr1        88169   2.081358        0.092229
chr1        254996  3.354927        0.439456
chr1        564598  2.887498        1.478148
Parameters:filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.

Processing results

The TreeSequence class represents a sequence of correlated trees output by a simulation. The SparseTree class represents a single tree in this sequence.

msprime.NULL_NODE = -1

Special reserved value, representing the null node. If the parent of a given node is null, then this node is a root. Similarly, if the children of a node are null, this node is a leaf.

msprime.NULL_POPULATION = -1

Special reserved value, representing the null population ID. If the population associated with a particular tree node is not defined, or population information was not available in the underlying tree sequence, then this value will be returned by SparseTree.get_population().

msprime.FORWARD = 1

Constant representing the forward direction of travel (i.e., increasing coordinate values).

msprime.REVERSE = -1

Constant representing the reverse direction of travel (i.e., decreasing coordinate values).

msprime.load(path)

Loads a tree sequence from the specified file path. This file must be in the HDF5 file format produced by the TreeSequence.dump() method.

Parameters:path (str) – The file path of the HDF5 file containing the tree sequence we wish to load.
Returns:The tree sequence object containing the information stored in the specified file path.
Return type:msprime.TreeSequence
class msprime.TreeSequence

A TreeSequence represents the information generated in a coalescent simulation. This includes all the trees across the simulated region, along with the mutations (if any are present).

branch_stats(leaf_sets, weight_fun)

Here leaf_sets is a list of lists of leaves, and weight_fun is a function whose argument is a list of integers of the same length as leaf_sets that returns a boolean. A branch in a tree is weighted by weight_fun(x), where x[i] is the number of leaves in leaf_sets[i] below that branch. This finds the sum of all counted branches for each tree, and averages this across the tree sequence, weighted by genomic length.

branch_stats_vector(leaf_sets, weight_fun, windows=None)

Here leaf_sets is a list of lists of leaves, and weight_fun is a function whose argument is a list of integers of the same length as leaf_sets that returns a boolean. A branch in a tree is weighted by weight_fun(x), where x[i] is the number of leaves in leaf_sets[i] below that branch. This finds the sum of all counted branches for each tree, and averages this across the tree sequence, weighted by genomic length.

It does this separately for each window [windows[i], windows[i+1]) and returns the values in a vector.

branch_stats_windowed(leaf_sets, weight_fun, windows=None)

Here leaf_sets is a list of lists of leaves, and weight_fun is a function whose argument is a list of integers of the same length as leaf_sets that returns a boolean. A branch in a tree is weighted by weight_fun(x), where x[i] is the number of leaves in leaf_sets[i] below that branch. This finds the sum of all counted branches for each tree, and averages this across the tree sequence, weighted by genomic length.

breakpoints()

Returns an iterator over the breakpoints along the chromosome, including the two extreme points 0 and L. This is equivalent to

>>> [0] + [t.get_interval()[1] for t in self.trees()]

although we do not build an explicit list.

Returns:An iterator over all the breakpoints along the simulated sequence.
Return type:iter
diffs()

Returns an iterator over the differences between adjacent trees in this tree sequence. Each diff returned by this method is a tuple of the form (length, records_out, records_in). The length is the length of the genomic interval covered by the current tree, and is equivalent to the value returned by msprime.SparseTree.get_length(). The records_out value is list of \((u, c, t)\) tuples, and corresponds to the coalescence records that have been invalidated by moving to the current tree. As in the records() method, \(u\) is a tree node, \(c\) is a tuple containing its children, and \(t\) is the time the event occurred. These records are returned in time-decreasing order, such that the record affecting the highest parts of the tree (i.e., closest to the root) are returned first. The records_in value is also a list of \((u, c, t)\) tuples, and these describe the records that must be applied to create the tree covering the current interval. These records are returned in time-increasing order, such that the records affecting the lowest parts of the tree (i.e., closest to the leaves) are returned first.

Returns:An iterator over the diffs between adjacent trees in this tree sequence.
Return type:iter
dump(path, zlib_compression=False)

Writes the tree sequence to the specified file path.

Parameters:
  • path (str) – The file path to write the TreeSequence to.
  • zlib_compression (bool) – If True, use HDF5’s native compression when storing the data leading to smaller file size. When loading, data will be decompressed transparently, but load times will be significantly slower.
get_num_mutations()

Returns the number of mutations in this tree sequence. See the msprime.TreeSequence.mutations() method for details on how mutations are defined.

Returns:The number of mutations in this tree sequence.
Return type:int
get_num_nodes()

Returns the number of nodes in this tree sequence. This 1 + the largest value \(u\) such that u is a node in any of the constituent trees.

Returns:The total number of nodes in this tree sequence.
Return type:int
get_num_records()

Returns the number of coalescence records in this tree sequence. See the records() method for details on these objects.

Returns:The number of coalescence records defining this tree sequence.
Return type:int
get_num_trees()

Returns the number of distinct trees in this tree sequence. This is equal to the number of trees returned by the trees() method.

Returns:The number of trees in this tree sequence.
Return type:int
get_pairwise_diversity(samples=None)

Returns the value of pi, the pairwise nucleotide site diversity, which is the average number of mutations that differ between a randomly chosen pair of samples. If samples is specified, calculate the diversity within this set.

Parameters:samples (iterable) – The set of samples within which we calculate the diversity. If None, calculate diversity within the entire sample.
Returns:The pairwise nucleotide site diversity.
Return type:float
get_population(u)

Returns the population ID for the specified sample ID.

Parameters:u (int) – The individual ID of interest.
Returns:The population ID where the specified individual lived. Returns NULL_POPULATION if no population information is available.
Return type:int
get_sample_size()

Returns the sample size for this tree sequence. This is the number of leaf nodes in each tree.

Returns:The number of leaf nodes in the tree sequence.
Return type:int
get_samples(population_id=None)

Returns the samples matching the specified population ID.

Parameters:population_id (int) – The population of interest. If None, return all samples.
Returns:The ID of the population we wish to find samples from. If None, return samples from all populations.
Return type:list
get_sequence_length()

Returns the sequence length in this tree sequence. This defines the genomic scale over which tree coordinates are defined. Given a tree sequence with a sequence length \(L\), the constituent trees will be defined over the half-closed interval \((0, L]\). Each tree then covers some subset of this interval — see msprime.SparseTree.get_interval() for details.

Returns:The length of the sequence in this tree sequence in bases.
Return type:float
get_time(u)

Returns the time that the specified ID was alive at.

Parameters:u (int) – The individual ID of interest.
Returns:The time at which the specified individual was alive at.
Return type:int
haplotypes()

Returns an iterator over the haplotypes resulting from the trees and mutations in this tree sequence as a string of ‘1’s and ‘0’s. The iterator returns a total of \(n\) strings, each of which contains \(s\) characters (\(n\) is the sample size returned by msprime.TreeSequence.get_sample_size() and \(s\) is the number of mutations returned by msprime.TreeSequence.get_num_mutations()). The first string returned is the haplotype for sample 0, and so on.

Returns:An iterator over the haplotype strings for the samples in this tree sequence.
Return type:iter
mutations()

Returns an iterator over the mutations in this tree sequence. Each mutation is represented as a tuple \((x, u, j)\) where \(x\) is the position of the mutation in the sequence in chromosome coordinates, \(u\) is the node over which the mutation occurred and \(j\) is the zero-based index of the mutation within the overall tree sequence. Mutations are returned in non-decreasing order of position and increasing index.

Each mutation returned is an instance of collections.namedtuple(), and may be accessed via the attributes position, node and index as well as the usual positional approach. This is the recommended interface for working with mutations as it is both more readable and also ensures that code is forward compatible with future extensions.

Returns:An iterator of all \((x, u, j)\) tuples defining the mutations in this tree sequence.
Return type:iter
records()

Returns an iterator over the coalescence records in this tree sequence in time-sorted order. Each record is a tuple \((l, r, u, c, t, d)\) defining the assignment of a tree node across an interval. The range of this record is the half-open genomic interval \([l, r)\), such that it applies to all positions \(l \leq x < r\). Each record represents the assignment of a pair of children \(c\) to a parent parent \(u\). This assignment happens at \(t\) generations in the past within the population with ID \(d\). If population information was not stored for this tree sequence then the population ID will be NULL_POPULATION.

Each record returned is an instance of collections.namedtuple(), and may be accessed via the attributes left, right, node, children, time and population, as well as the usual positional approach. For example, if we wished to print out the genomic length of each record, we could write:

>>> for record in tree_sequence.records():
>>>     print(record.right - record.left)
Returns:An iterator of all \((l, r, u, c, t, d)\) tuples defining the coalescence records in this tree sequence.
Return type:iter
trees(tracked_leaves=None, leaf_counts=True, leaf_lists=False)

Returns an iterator over the trees in this tree sequence. Each value returned in this iterator is an instance of SparseTree.

The leaf_counts and leaf_lists parameters control the features that are enabled for the resulting trees. If leaf_counts is True, then it is possible to count the number of leaves underneath a particular node in constant time using the get_num_leaves() method. If leaf_lists is True a more efficient algorithm is used in the SparseTree.leaves() method.

The tracked_leaves parameter can be used to efficiently count the number of leaves in a given set that exist in a particular subtree using the SparseTree.get_num_tracked_leaves() method. It is an error to use the tracked_leaves parameter when the leaf_counts flag is False.

Warning:

Do not store the results of this iterator in a list! For performance reasons, the same underlying object is used for every tree returned which will most likely lead to unexpected behaviour.

Parameters:
Returns:

An iterator over the sparse trees in this tree sequence.

Return type:

iter

variants(as_bytes=False)

Returns an iterator over the variants in this tree sequence. Each variant corresponds to a single mutation and is represented as a tuple \((x, u, j, g)\). The values of \(x\), \(u\) and \(j\) are identical to the values returned by the TreeSequence.mutations() method, and \(g\) represents the sample genotypes for this variant. Thus, \(g[k]\) is the observed state for sample \(k\) at this site; zero represents the ancestral type and one the derived type.

Each variant returned is an instance of collections.namedtuple(), and may be accessed via the attributes position, node, index and genotypes as well as the usual positional approach. This is the recommended interface for working with variants as it is both more readable and also ensures that code is forward compatible with future extensions.

The returned genotypes may be either a numpy array of 1 byte unsigned integer 0/1 values, or a Python bytes object of ‘0’/‘1’ ASCII characters. This behaviour is controller by the as_bytes parameter. The default behaviour is to return a numpy array, which is substantially more efficient.

Warning:The same numpy array is used to represent genotypes between iterations, so if you wish the store the results of this iterator you must take a copy of the array. This warning does not apply when as_bytes is True, as a new bytes object is allocated for each variant.
Parameters:as_bytes (bool) – If True, the genotype values will be returned as a Python bytes object. This is useful in certain situations (i.e., directly printing the genotypes) or when numpy is not available. Otherwise, genotypes are returned as a numpy array (the default).
Returns:An iterator of all \((x, u, j, g)\) tuples defining the variants in this tree sequence.
write_vcf(output, ploidy=1, contig_id='1')

Writes a VCF formatted file to the specified file-like object. If a ploidy value is supplied, allele values are combined among adjacent samples to form a phased genotype of the required ploidy. For example, if we have a ploidy of 2 and a sample of size 6, then we will have 3 diploid samples in the output, consisting of the combined alleles for samples [0, 1], [2, 3] and [4, 5]. If we had alleles 011110 at a particular variant, then we would output the genotypes 0|1, 1|1 and 1|0 in VCF. Sample names are generated by appending the index to the prefix msp_ such that we would have the sample names msp_0, msp_1 and msp_2 in the running example.

Example usage:

>>> with open("output.vcf", "w") as vcf_file:
>>>     tree_sequence.write_vcf(vcf_file, 2)
Parameters:
  • output (File) – The file-like object to write the VCF output.
  • ploidy (int) – The ploidy of the individual samples in the VCF. This sample size must be divisible by ploidy.
  • contig_id (str) – The value of the CHROM column in the output VCF.
class msprime.SparseTree

A SparseTree is a single tree in a TreeSequence. In a sparse tree for a sample of size \(n\), the leaves are nodes \(0\) to \(n - 1\) inclusive and internal nodes are integers \(\geq n\). The value of these nodes is strictly increasing as we ascend the tree and the root of the tree is the node with the largest value that is reachable from the leaves. Each node in the tree has a parent which is obtained using the get_parent() method. The parent of the root node is the NULL_NODE, \(-1\). Similarly, each internal node has a pair of children, which are obtained using the get_children() method. Each node in the tree has a time associated with it in generations. This value is obtained using the SparseTree.get_time() method.

Sparse trees are not intended to be instantiated directly, and are obtained as part of a TreeSequence using the trees() method.

draw(path, width=200, height=200, show_times=False)

Draws a representation of this tree to the specified path in SVG format.

Parameters:
  • path (str) – The path to the file to write the SVG.
  • width (int) – The width of the image in pixels.
  • height (int) – The height of the image in pixels.
  • show_times (bool) – If True, show time labels at each internal node.
get_branch_length(u)

Returns the length of the branch (in generations) joining the specified node to its parent. This is equivalent to

>>> tree.get_time(tree.get_parent(u)) - tree.get_time(u)

Note that this is not related to the value returned by get_length(), which describes the length of the interval covered by the tree in genomic coordinates.

Parameters:u (int) – The node of interest.
Returns:The branch length from u to its parent.
Return type:float
get_children(u)

Returns the children of the specified node as a tuple \((v, w)\). For internal nodes, this tuple is always in sorted order such that \(v < w\). If u is a leaf or is not a node in the current tree, return the tuple (NULL_NODE, NULL_NODE).

Parameters:u (int) – The node of interest.
Returns:The children of u as a pair of integers
Return type:tuple
get_index()

Returns the index this tree occupies in the parent tree sequence. This index is zero based, so the first tree in the sequence has index 0.

Returns:The index of this tree.
Return type:int
get_interval()

Returns the coordinates of the genomic interval that this tree represents the history of. The interval is returned as a tuple \((l, r)\) and is a half-open interval such that the left coordinate is inclusive and the right coordinate is exclusive. This tree therefore applies to all genomic locations \(x\) such that \(l \leq x < r\).

Returns:A tuple (l, r) representing the left-most (inclusive) and right-most (exclusive) coordinates of the genomic region covered by this tree.
Return type:tuple
get_length()

Returns the length of the genomic interval that this tree represents. This is defined as \(r - l\), where \((l, r)\) is the genomic interval returned by get_interval().

Returns:The length of the genomic interval covered by this tree.
Return type:int
get_mrca(u, v)

Returns the most recent common ancestor of the specified nodes.

Parameters:
  • u (int) – The first node.
  • v (int) – The second node.
Returns:

The most recent common ancestor of u and v.

Return type:

int

get_num_leaves(u)

Returns the number of leaves in this tree underneath the specified node.

If the TreeSequence.trees() method is called with leaf_counts=True this method is a constant time operation. If not, a slower traversal based algorithm is used to count the leaves.

Parameters:u (int) – The node of interest.
Returns:The number of leaves in the subtree rooted at u.
Return type:int
get_num_mutations()

Returns the number of mutations on this tree.

Returns:The number of mutations on this tree.
Return type:int
get_num_tracked_leaves(u)

Returns the number of leaves in the set specified in the tracked_leaves parameter of the TreeSequence.trees() method underneath the specified node. This is a constant time operation.

Parameters:u (int) – The node of interest.
Returns:The number of leaves within the set of tracked leaves in the subtree rooted at u.
Return type:int
Raises:RuntimeError – if the TreeSequence.trees() method is not called with leaf_counts=True.
get_parent(u)

Returns the parent of the specified node. Returns the NULL_NODE -1 if u is the root or is not a node in the current tree.

Parameters:u (int) – The node of interest.
Returns:The parent of u.
Return type:int
get_population(u)

Returns the population associated with the specified node. For leaf nodes this is the population of the sample, and for internal nodes this is the population where the corresponding coalescence occured. If the specified node is not a member of this tree or population level information was not stored in the tree sequence, NULL_POPULATION is returned.

Parameters:u (int) – The node of interest.
Returns:The ID of the population associated with node u.
Return type:int
get_root()

Returns the root of this tree.

Returns:The root node.
Return type:int
get_sample_size()

Returns the sample size for this tree. This is the number of leaf nodes in the tree.

Returns:The number of leaf nodes in the tree.
Return type:int
get_time(u)

Returns the time of the specified node in generations. Returns 0 if u is a leaf or is not a node in the current tree.

Parameters:u (int) – The node of interest.
Returns:The time of u.
Return type:float
get_tmrca(u, v)

Returns the time of the most recent common ancestor of the specified nodes. This is equivalent to:

tree.get_time(tree.get_mrca(u, v))
Parameters:
  • u (int) – The first node.
  • v (int) – The second node.
Returns:

The time of the most recent common ancestor of u and v.

Return type:

float

get_total_branch_length()

Returns the sum of all the branch lengths in this tree (in units of generations). This is equivalent to

>>> sum(
>>>    tree.get_branch_length(u) for u in tree.nodes()
>>>    if u != tree.get_root())
Returns:The sum of all the branch lengths in this tree.
is_internal(u)

Returns True if the specified node is not a leaf.

Parameters:u (int) – The node of interest.
Returns:True if u is not a leaf node.
Return type:bool
is_leaf(u)

Returns True if the specified node is a leaf. A node \(u\) is a leaf if it has zero children.

Parameters:u (int) – The node of interest.
Returns:True if u is a leaf node.
Return type:bool
leaves(u)

Returns an iterator over all the leaves in this tree underneath the specified node.

If the TreeSequence.trees() method is called with leaf_lists=True, this method uses an efficient algorithm to find the leaves. If not, a simple traversal based method is used.

Parameters:u (int) – The node of interest.
Returns:An iterator over all leaves in the subtree rooted at u.
Return type:iterator
mutations()

Returns an iterator over the mutations in this tree. Each mutation is represented as a tuple \((x, u, j)\) where \(x\) is the position of the mutation in the sequence in chromosome coordinates, \(u\) is the node over which the mutation occurred and \(j\) is the zero-based index of the mutation within the overall tree sequence. Mutations are returned in non-decreasing order of position and increasing index.

Each mutation returned is an instance of collections.namedtuple(), and may be accessed via the attributes position, node and index as well as the usual positional approach. This is the recommended interface for working with mutations as it is both more readable and also ensures that code is forward compatible with future extensions.

Returns:An iterator of all \((x, u, j)\) tuples defining the mutations in this tree.
Return type:iter
nodes(root=None, order='preorder')

Returns an iterator over the nodes in this tree. If the root parameter is provided, iterate over the nodes in the subtree rooted at this node. If this is None, iterate over all nodes. If the order parameter is provided, iterate over the nodes in required tree traversal order.

Parameters:
  • root (int) – The root of the subtree we are traversing.
  • order (str) – The traversal ordering. Currently ‘preorder’, ‘inorder’, ‘postorder’ and ‘levelorder’ (‘breadthfirst’) are supported.
Return type:

iterator

Calculating statistics

The msprime API provides methods for efficiently calculating population genetics statistics from a given TreeSequence.

class msprime.LdCalculator(tree_sequence)

Class for calculating linkage disequilibrium coefficients between pairs of mutations in a TreeSequence. This class requires the numpy library.

This class supports multithreaded access using the Python threading module. Separate instances of LdCalculator referencing the same tree sequence can operate in parallel in multiple threads. See the Working with threads section in the Tutorial for an example of how use multiple threads to calculate LD values efficiently.

Parameters:tree_sequence (TreeSequence) – The tree sequence containing the mutations we are interested in.
get_r2(a, b)

Returns the value of the \(r^2\) statistic between the pair of mutations at the specified indexes. This method is not an efficient method for computing large numbers of pairwise; please use either get_r2_array() or get_r2_matrix() for this purpose.

Parameters:
  • a (int) – The index of the first mutation.
  • b (int) – The index of the second mutation.
Returns:

The value of \(r^2\) between the mutations at indexes a and b.

Return type:

float

get_r2_array(a, direction=1, max_mutations=None, max_distance=None)

Returns the value of the \(r^2\) statistic between the focal mutation at index \(a\) and a set of other mutations. The method operates by starting at the focal mutation and iterating over adjacent mutations (in either the forward or backwards direction) until either a maximum number of other mutations have been considered (using the max_mutations parameter), a maximum distance in sequence coordinates has been reached (using the max_distance parameter) or the start/end of the sequence has been reached. For every mutation \(b\) considered, we then insert the value of \(r^2\) between \(a\) and \(b\) at the corresponding index in an array, and return the entire array. If the returned array is \(x\) and direction is msprime.FORWARD then \(x[0]\) is the value of the statistic for \(a\) and \(a + 1\), \(x[1]\) the value for \(a\) and \(a + 2\), etc. Similarly, if direction is msprime.REVERSE then \(x[0]\) is the value of the statistic for \(a\) and \(a - 1\), \(x[1]\) the value for \(a\) and \(a - 2\), etc.

Parameters:
  • a (int) – The index of the focal mutation.
  • direction (int) – The direction in which to travel when examining other mutations. Must be either msprime.FORWARD or msprime.REVERSE. Defaults to msprime.FORWARD.
  • max_mutations (int) – The maximum number of mutations to return \(r^2\) values for. Defaults to as many mutations as possible.
  • max_distance (float) – The maximum absolute distance between the focal mutation and those for which \(r^2\) values are returned.
Returns:

An array of double precision floating point values representing the \(r^2\) values for mutations in the specified direction.

Return type:

numpy.ndarray

Warning:

For efficiency reasons, the underlying memory used to store the returned array is shared between calls. Therefore, if you wish to store the results of a single call to get_r2_array() for later processing you must take a copy of the array!

get_r2_matrix()

Returns the complete \(m \times m\) matrix of pairwise \(r^2\) values in a tree sequence with \(m\) mutations.

Returns:An 2 dimensional square array of double precision floating point values representing the \(r^2\) values for all pairs of mutations.
Return type:numpy.ndarray