API Documentation¶

This is the API documentation for msprime, and provides detailed information on the Python programming interface. See the Tutorial for an introduction to using this API to run simulations and analyse the results.

Simulation model¶

The simulation model in msprime closely follows the classical ms program. Unlike ms, however, time is measured in generations rather than in units of $$4 N_e$$ generations, i.e., “coalescent units”. This means that when simulating a population with diploid effective size $$N_e$$, the mean time to coalescence between two samples in an msprime simulation will be around $$2 N_e$$, while in an ms simulation, the mean time will be around $$0.5$$. Internally, msprime uses the same algorithm as ms, and so the Ne parameter to the simulate() function still acts as a time scaling, and can be set to 0.5 to match many theoretical results, or to 0.25 to match ms. Population sizes for each subpopulation and for past demographic events are also defined as absolute values, not scaled by Ne. All migration rates and growth rates are also per generation.

When running simulations we define the length $$L$$ of the sequence in question using the length parameter. This defines the coordinate space within which trees and mutations are defined. $$L$$ is a continuous value, so units are arbitrary, and coordinates can take any value from $$0$$ to $$L$$. (So, although we recommend setting the units of length to be analogous to “bases”, events can occur at fractional positions.) Mutations occur in an infinite sites process along this sequence, and mutation rates are specified per generation, per unit of sequence length. Thus, given the per-generation mutation rate $$\mu$$, the rate of mutation over the entire sequence in coalescent time units is $$\theta = 4 N_e \mu L$$. It is important to remember these scaling factors when comparing with analytical results!

Similarly, recombination rates are per unit of sequence length and per generation in msprime. Thus, given the per generation crossover rate $$r$$, the overall rate of recombination between the ends of the sequence in coalescent time units is $$\rho = 4 N_e r L$$. Although breakpoints do not necessarily occur at integer locations, the underlying recombination model is finite, and the behaviour of a small number of loci can be modelled using the RecombinationMap class. However, this is considered an advanced feature and the majority of cases should be well served with the default recombination model and number of loci.

Population structure is modelled by specifying a fixed number of subpopulations $$d$$, and a $$d \times d$$ matrix $$M$$ of per generation migration rates. Each element of the matrix $$M_{j,k}$$ defines the fraction of population $$j$$ that consists of migrants from population $$k$$ in each generation. Each subpopulation has an initial absolute population size $$s$$ and a per generation exponential growth rate $$\alpha$$. The size of a given population at time $$t$$ in the past (measured in generations) is therefore given by $$s e^{-\alpha t}$$. Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.

Warning

This parameterisation of recombination, mutation and migration rates is different to ms, which states these rates over the entire region and in coalescent time units. The motivation for this is to allow the user change the size of the simulated region without having to rescale the recombination and mutation rates, and to also allow users directly state times and rates in units of generations. However, the mspms command line application is fully ms compatible.

Running simulations¶

The simulate() function provides the primary interface to running coalescent simulations in msprime.

msprime.simulate(sample_size=None, Ne=1, length=None, recombination_rate=None, recombination_map=None, mutation_rate=None, population_configurations=None, migration_matrix=None, demographic_events=[], samples=None, model=None, record_migrations=False, random_seed=None, mutation_generator=None, num_replicates=None, from_ts=None, start_time=None, __tmp_max_time=None)

Simulates the coalescent with recombination under the specified model parameters and returns the resulting TreeSequence. Note that Ne is the effective diploid population size (so the effective number of genomes in the population is 2*Ne), but sample_size is the number of (monoploid) genomes sampled.

Parameters: sample_size (int) – The number of sampled monoploid genomes. If not specified or None, this defaults to the sum of the subpopulation sample sizes. Either sample_size, population_configurations or samples must be specified. Ne (float) – The effective (diploid) population size for the reference population. This defaults to 1 if not specified. length (float) – The length of the simulated region in bases. This parameter cannot be used along with recombination_map. Defaults to 1 if not specified. recombination_rate (float) – The rate of recombination per base per generation. This parameter cannot be used along with recombination_map. Defaults to 0 if not specified. recombination_map (RecombinationMap) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with the recombination_rate or length parameters, as these values are encoded within the map. Defaults to a uniform rate as described in the recombination_rate parameter if not specified. mutation_rate (float) – The rate of infinite sites mutations per unit of sequence length per generation. If not specified, no mutations are generated. This option only allows for infinite sites mutations with a binary (i.e., 0/1) alphabet. For more control over the mutational process, please use the mutate() function. population_configurations (list or None.) – The list of PopulationConfiguration instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of size sample_size is assumed. migration_matrix (list) – The matrix describing the rates of migration between all pairs of populations. If $$N$$ populations are defined in the population_configurations parameter, then the migration matrix must be an $$N \times N$$ matrix with 0 on the diagonal, consisting of $$N$$ lists of length $$N$$ or an $$N \times N$$ numpy array. demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in non-decreasing order of time in the past. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step. samples (list) – The list specifying the location and time of all samples. This parameter may be used to specify historical samples, and cannot be used in conjunction with the sample_size parameter. Each sample is a (population, time) pair such that the sample in position j in the list of samples is drawn in the specified population at the specfied time. Time is measured in generations ago, as elsewhere. random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and $$2^{32} - 1$$. num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a TreeSequence object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resulting TreeSequence objects returned. from_ts (TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see here for details on the required properties of this tree sequence and its interactions with other parameters. (Default: None). start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the from_ts parameter). The TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used. TreeSequence or an iterator over TreeSequence replicates. If using replication, do not store the results of the iterator in a list! For performance reasons, the same underlying object may be used for every TreeSequence returned which will most likely lead to unexpected behaviour.

Population structure¶

Population structure is modelled in msprime by specifying a fixed number of subpopulations, with the migration rates between those subpopulations defined by a migration matrix. Each subpopulation has an initial_size that defines its absolute diploid size at time zero and a per-generation growth_rate which specifies the exponential growth rate of the sub-population. We must also define the number of genomes to sample from each subpopulation. The number of populations and their initial configuration is defined using the population_configurations parameter to simulate(), which takes a list of PopulationConfiguration instances. Population IDs are zero indexed, and correspond to their position in the list.

Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 6, and specify that 2 samples are drawn from population 0 and 4 from population 1, then samples 0 and 1 will be initially located in population 0, and samples 2, 3, 4, and 5 will be drawn from population 2.

Given $$N$$ populations, migration matrices are specified using an $$N \times N$$ matrix of between-subpopulation migration rates. See the documentation for simulate() and the Simulation model section for more details on the migration rates.

class msprime.PopulationConfiguration(sample_size=None, initial_size=None, growth_rate=0.0)

The initial configuration of a population (or deme) in a simulation.

Parameters: sample_size (int) – The number of initial samples that are drawn from this population. initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size $$N_e$$. growth_rate (float) – The forwards-time exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. Defaults to 0.

Demographic Events¶

Demographic events change some aspect of the population configuration at some time in the past, and are specified using the demographic_events parameter to simulate(). Each element of this list must be an instance of one of the following demographic events that are currently supported. Note that all times are measured in generations, all sizes are absolute (i.e., not relative to $$N_e$$), and all rates are per-generation.

class msprime.PopulationParametersChange(time, initial_size=None, growth_rate=None, population=None, population_id=None)

Changes the demographic parameters of a population at a given time.

This event generalises the -eg, -eG, -en and -eN options from ms. Note that unlike ms we do not automatically set growth rates to zero when the population size is changed.

Parameters: time (float) – The length of time ago at which this event occurred. initial_size (float) – The absolute diploid size of the population at the beginning of the time slice starting at time. If None, this is calculated according to the initial population size and growth rate over the preceding time slice. growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None. population (int) – The ID of the population affected. If population is None, the changes affect all populations simultaneously.
class msprime.MigrationRateChange(time, rate, matrix_index=None)

Changes the rate of migration to a new value at a specific time.

Parameters: time (float) – The time at which this event occurs in generations. rate (float) – The new per-generation migration rate. matrix_index (tuple) – A tuple of two population IDs descibing the matrix index of interest. If matrix_index is None, all non-diagonal entries of the migration matrix are changed simultaneously.
class msprime.MassMigration(time, source, dest=None, proportion=1.0, destination=None)

A mass migration event in which some fraction of the population in one deme simultaneously move to another deme, viewed backwards in time. Each lineage currently present in the source population moves to the destination population with probability equal to proportion.

This event class generalises the population split (-ej) and admixture (-es) events from ms. Note that MassMigrations do not have any side effects on the migration matrix.

Parameters: time (float) – The time at which this event occurs in generations. source (int) – The ID of the source population. dest (int) – The ID of the destination population. proportion (float) – The probability that any given lineage within the source population migrates to the destination population.

Debugging demographic models¶

Warning

The DemographyDebugger class is preliminary, and the API is likely to change in the future.

class msprime.DemographyDebugger(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=[])

A class to facilitate debugging of population parameters and migration rates in the past.

print_history(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)

Prints a summary of the history of the populations.

Variable recombination rates¶

class msprime.RecombinationMap(positions, rates, num_loci=None)

A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers: positions and rates, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation is rates[j] over the interval positions[j] to positions[j + 1]. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).

Warning

The chromosome is divided into num_loci regions of equal recombination distance, between which recombinations occur. This means that if recombination is constant and the genome is length n, then num_loci=n will produce breakpoints only at integer values. If recombination rate is not constant, breakpoints will still only occur at n distinct positions, but these will probably not coincide with the positions provided.

Parameters: positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values. rates (list) – The list of rates corresponding to the supplied positions. Recombination rates are specified per base, per generation. num_loci (int) – The maximum number of non-recombining loci in the underlying simulation. By default this is set to the largest possible value, allowing the maximum resolution in the recombination process. However, for a finite sites model this can be set to smaller values.
DEFAULT_NUM_LOCI = 4294967295

The default number of non-recombining loci in a RecombinationMap.

classmethod read_hapmap(filename)

Parses the specified file in HapMap format. These files must contain a single header line (which is ignored), and then each subsequent line denotes a position/rate pair. Positions are in units of bases, and recombination rates in centimorgans/Megabase. The first column in this file is ignored, as are subsequence columns after the Position and Rate. A sample of this format is as follows:

Chromosome  Position(bp)    Rate(cM/Mb)     Map(cM)
chr1        55550   2.981822        0.000000
chr1        82571   2.082414        0.080572
chr1        88169   2.081358        0.092229
chr1        254996  3.354927        0.439456
chr1        564598  2.887498        1.478148

Parameters: filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.
classmethod uniform_map(length, rate, num_loci=None)

Returns a RecombinationMap instance in which the recombination rate is constant over a chromosome of the specified length. The optional num_loci controls the number of discrete loci in the underlying simulation, and is by default large enough to be effectively be a continuous model.

The following map can be used to simulate a true finite locus model with a fixed number of loci m:

>>> recomb_map = RecombinationMap.uniform_map(m, rate, num_loci=m)

Parameters: length (float) – The length of the chromosome. rate (float) – The rate of recombination per unit of sequence length along this chromosome. num_loci (int) – The number of discrete loci in the underlying simulation. By default this is set to a large number.

Initialising simulations from a tree sequence¶

By default msprime simulations are initialised by specifying a set of samples, using the sample_size or samples parameters to simulate(). This initialises the simulation with segments of ancestral material covering the whole sequence. Simulation then proceeds backwards in time until a most recent common ancestor has been found at all points along this sequence. We can also start simulations from different initial conditions by using the from_ts argument to simulate(). Informally, we take an ‘unfinished’ tree sequence as a parameter to simulate, initialise the simulation from the state of this tree sequence and then run the simulation until coalescence. The returned tree sequence is then the result of taking the input tree sequence and completing the trees using the coalescent.

This is useful for forwards-time simulators such as SLiM that can output tree sequences. By running forward-time simulation for a certain number of generations we obtain a tree sequence, but these trees may not have had sufficient time to reach a most recent common ancestor. By using the from_ts argument to simulate() we can combine the best of both forwards- and backwards-time simulators. The recent past can be simulated forwards in time and the ancient past by the coalescent. The coalescent simulation is initialised by the root segments of the input tree sequence, ensuring that the minimal amount of ancestral material possible is simulated.

Please see the tutorial for an example of how to use this feature with a simple forwards-time Wright-Fisher simulator

Input requirements¶

Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forward-time simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes. (Please see the tutorial for further explanation of this point and an example.)

Recombination map limitations¶

Because of the way that msprime handles recombination internally, care must be taken when specifying recombination when using the from_ts argument. If recombination positions are generated in the same way in both the initial tree sequence and the coalescent simulation, then everything should work. However, the fine scale details of the underlying recombination model matter, so matching nonuniform recombination maps between simulators may not be possible at present. (To make it work, we must ensure that every recombination breakpoint in from_ts matches exactly to a possible recombination breakpoint in msprime’s recombination map, which is not guaranteed because of msprime’s discrete recombination model.)

One case in which it is guaranteed to work is if from_ts has integer coordinates, and we want to simulate a coalescent with a uniform recombination rate. In this case, to have a uniform recombination rate r use:

L = int(from_ts.sequence_length)
recomb_map = msprime.RecombinationMap.uniform_map(L, r, L)
final_ts = mpsrime.simulate(from_ts=from_ts, recomb_map=recomb_map)


Simulating mutations¶

When running coalescent simulations it’s usually most convenient to use the mutation_rate argument to the simulate() function to throw neutral mutations down on the trees. However, sometimes we wish to throw mutations down on an existing tree sequence: for example, if we want to see the outcome of different random mutational processes on top of a single simulated topology, or if we have obtained the tree sequence from another program and wish to overlay neutral mutations on this tree sequence.

class msprime.InfiniteSites(alphabet=0)

The “infinitely many sites” mutation model. In this model each mutation corresponds to a unique site, which has a floating-point position chosen uniformly along the sequence. As a result, each site is associated with exactly one mutation.

By default, the ancestral and derived states in this model are represented by the characters “0” and “1”. Thus, the ancestral state at a site is always “0” and the derived state for a mutation is always “1”. However, by specifying the alphabet=NUCLEOTIDES we can generate mutations from the nucleotide characters ACGT. In this case, for each mutation an ancestral state is chosen uniformly from these letters. The derived state is then chosen uniformly from the remaining characters so that the ancestral and derived states are always distinct.

msprime.BINARY == 0

The binary mutation alphabet where ancestral states are always “0” and derived states “1”.

msprime.NUCLEOTIDES == 1

The nucleotides mutation alphabet in which ancestral and derived states are chosen from the characters “A”, “C”, “G” and “T”.

msprime.mutate(tree_sequence, rate=None, random_seed=None, model=None, keep=False, start_time=None, end_time=None)

Simulates mutations on the specified ancestry and returns the resulting TreeSequence. Mutations are generated at the specified rate in measured generations. Mutations are generated under the infinite sites model, and so the rate of new mutations is per unit of sequence length per generation.

If a random seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.

If the model parameter is specified, this determines the model under which mutations are generated. Currently only the InfiniteSites mutation model is supported. This parameter is useful if you wish to obtain sequences with letters from the nucleotide alphabet rather than the default 0/1 states. By default mutations from the infinite sites model with a binary alphabet are generated.

By default, sites and mutations in the parameter tree sequence are discarded. If the keep parameter is true, however, additional mutations are simulated. Under the infinite sites mutation model, all new mutations generated will occur at distinct positions from each other and from any existing mutations (by rejection sampling).

The time interval over which mutations can occur may be controlled using the start_time and end_time parameters. The start_time defines the lower bound (in time-ago) on this interval and max_time the upper bound. Note that we may have mutations associated with nodes with time <= start_time since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.

Parameters: tree_sequence (TreeSequence) – The tree sequence onto which we wish to throw mutations. rate (float) – The rate of mutation per generation. (Default: 0). random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and $$2^{32} - 1$$. model (MutationModel) – The mutation model to use when generating mutations. If not specified or None, the InfiniteSites mutation model is used. keep (bool) – Whether to keep existing mutations (default: False). start_time (float) – The minimum time at which a mutation can occur. (Default: no restriction.) end_time (float) – The maximum time at which a mutation can occur (Default: no restriction). The TreeSequence object resulting from overlaying mutations on the input tree sequence. TreeSequence

Using simulation results¶

The TreeSequence class represents a sequence of correlated trees output by a simulation. The SparseTree class represents a single tree in this sequence. These classes are the interfaces used to interact with the trees and mutational information stored in a tree sequence returned from a simulation. There are also methods for loading data into these objects, either from the native format using msprime.load(), or from another sources using msprime.load_text() or TableCollection.tree_sequence().

Top level-classes¶

class msprime.TreeSequence

A single tree sequence, as defined by the data model. A TreeSequence instance can be created from a set of tables using TableCollection.tree_sequence(); or loaded from a set of text files using load_text(); or, loaded from a native binary file using load().

TreeSequences are immutable. To change the data held in a particular tree sequence, first get the table information as a TableCollection instance (using dump_tables()), edit those tables using the tables api, and create a new tree sequence using TableCollection.tree_sequence().

The trees() method iterates over all trees in a tree sequence, and the variants() method iterates over all sites and their genotypes.

breakpoints()

Returns an iterator over the breakpoints along the chromosome, including the two extreme points 0 and L. This is equivalent to

>>> [0] + [t.get_interval()[1] for t in self.trees()]


although we do not build an explicit list.

Returns: An iterator over all the breakpoints along the simulated sequence. iter
dump(path, zlib_compression=False)

Writes the tree sequence to the specified file path.

Parameters: path (str) – The file path to write the TreeSequence to. zlib_compression (bool) – This parameter is deprecated and ignored.
dump_tables()

A copy of the tables defining this tree sequence.

Returns: A TableCollection containing all tables underlying the tree sequence. TableCollection
dump_text(nodes=None, edges=None, sites=None, mutations=None, individuals=None, populations=None, provenances=None, precision=6, encoding='utf8', base64_metadata=True)

Writes a text representation of the tables underlying the tree sequence to the specified connections.

If Base64 encoding is not used, then metadata will be saved directly, possibly resulting in errors reading the tables back in if metadata includes whitespace.

Parameters: nodes (stream) – The file-like object (having a .write() method) to write the NodeTable to. edges (stream) – The file-like object to write the EdgeTable to. sites (stream) – The file-like object to write the SiteTable to. mutations (stream) – The file-like object to write the MutationTable to. individuals (stream) – The file-like object to write the IndividualTable to. populations (stream) – The file-like object to write the PopulationTable to. provenances (stream) – The file-like object to write the ProvenanceTable to. precision (int) – The number of digits of precision. encoding (string) – Encoding used for text representation. base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text.
edge_diffs()

Returns an iterator over all the edges that are inserted and removed to build the trees as we move from left-to-right along the tree sequence. The iterator yields a sequence of 3-tuples, (interval, edges_out, edges_in). The interval is a pair (left, right) representing the genomic interval (see SparseTree.interval). The edges_out value is a tuple of the edges that were just-removed to create the tree covering the interval (hence, edges_out will always be empty for the first tree). The edges_in value is a tuple of edges that were just inserted to contruct the tree convering the current interval.

Returns: An iterator over the (interval, edges_out, edges_in) tuples. iter(tuple, tuple, tuple)
edges()

Returns an iterator over all the edges in this tree sequence. Edges are returned in the order required for a valid tree sequence. So, edges are guaranteed to be ordered such that (a) all parents with a given ID are contiguous; (b) edges are returned in non-descreasing order of parent time ago; (c) within the edges for a given parent, edges are sorted first by child ID and then by left coordinate.

Returns: An iterator over all edges. iter(Edge)
first()

Returns the first tree in this TreeSequence. To iterate over all trees in the sequence, use the trees() method.

Currently does not support the extra options for the trees() method.

Returns: The first tree in this tree sequence. SparseTree.
genotype_matrix()

Returns an $$m \times n$$ numpy array of the genotypes in this tree sequence, where $$m$$ is the number of sites and $$n$$ the number of samples. The genotypes are the indexes into the array of alleles, as described for the Variant class. The value 0 always corresponds to the ancestal state, and values > 0 represent distinct derived states.

Warning

This method can consume a very large amount of memory! If all genotypes are not needed at once, it is usually better to access them sequentially using the variants() iterator.

Returns: The full matrix of genotypes. numpy.ndarray (dtype=np.uint8)
haplotypes()

Returns an iterator over the haplotypes resulting from the trees and mutations in this tree sequence as a string. The iterator returns a total of $$n$$ strings, each of which contains $$s$$ characters ($$n$$ is the sample size returned by msprime.TreeSequence.num_samples and $$s$$ is the number of sites returned by msprime.TreeSequence.num_sites). The first string returned is the haplotype for sample 0, and so on. For a given haplotype h, the value of h[j] is the observed allelic state at site j.

See also the variants() iterator for site-centric access to sample genotypes.

This method is only supported for single-letter alleles.

Returns: An iterator over the haplotype strings for the samples in this tree sequence. iter LibraryError if called on a tree sequence containing multiletter alleles.
individual(id_)

Returns the individual in this tree sequence with the specified ID.

Return type: Individual
individuals()

Returns an iterator over all the individuals in this tree sequence.

Returns: An iterator over all individuals. iter(Individual)
migrations()

Returns an iterator over all the migrations in this tree sequence.

Migrations are returned in nondecreasing order of the time value.

Returns: An iterator over all migrations. iter(Migration)
mutation(id_)

Returns the mutation in this tree sequence with the specified ID.

Return type: Mutation
mutations()

Returns an iterator over all the mutations in this tree sequence. Mutations are returned in order of nondecreasing site ID. See the Mutation class for details on the available fields for each mutation.

The returned iterator is equivalent to iterating over all sites and all mutations in each site, i.e.:

>>> for site in tree_sequence.sites():
>>>     for mutation in site.mutations:
>>>         yield mutation

Returns: An iterator over all mutations in this tree sequence. iter(Mutation)
node(id_)

Returns the node in this tree sequence with the specified ID.

Return type: Node
nodes()

Returns an iterator over all the nodes in this tree sequence.

Returns: An iterator over all nodes. iter(Node)
num_edges

Returns the number of edges in this tree sequence.

Returns: The number of edges in this tree sequence. int
num_individuals

Returns the number of individuals in this tree sequence.

Returns: The number of individuals in this tree sequence. int
num_migrations

Returns the number of migrations in this tree sequence.

Returns: The number of migrations in this tree sequence. int
num_mutations

Returns the number of mutations in this tree sequence.

Returns: The number of mutations in this tree sequence. int
num_nodes

Returns the number of nodes in this tree sequence.

Returns: The number of nodes in this tree sequence. int
num_populations

Returns the number of populations in this tree sequence.

Returns: The number of populations in this tree sequence. int
num_provenances

Returns the number of provenances in this tree sequence.

Returns: The number of provenances in this tree sequence. int
num_samples

Returns the number of samples in this tree sequence. This is the number of sample nodes in each tree.

Returns: The number of sample nodes in this tree sequence. int
num_sites

Returns the number of sites in this tree sequence.

Returns: The number of sites in this tree sequence. int
num_trees

Returns the number of distinct trees in this tree sequence. This is equal to the number of trees returned by the trees() method.

Returns: The number of trees in this tree sequence. int
pairwise_diversity(samples=None)

Returns the value of $$\pi$$, the pairwise nucleotide site diversity, the average number of mutations per unit of genome length that differ between a randomly chosen pair of samples. If samples is specified, calculate the diversity within this set.

Note

This method does not currently support sites that have more than one mutation. Using it on such a tree sequence will raise a LibraryError with an “Unsupported operation” message.

Parameters: samples (iterable) – The set of samples within which we calculate the diversity. If None, calculate diversity within the entire sample. The pairwise nucleotide site diversity. float
population(id_)

Returns the population in this tree sequence with the specified ID.

Return type: Population
populations()

Returns an iterator over all the populations in this tree sequence.

Returns: An iterator over all populations. iter(Population)
provenances()

Returns an iterator over all the provenances in this tree sequence.

Returns: An iterator over all provenances. iter(Provenance)
samples(population=None, population_id=None)

Returns an array of the sample node IDs in this tree sequence. If the population parameter is specified, only return sample IDs from this population.

Parameters: population (int) – The population of interest. If None, return all samples. population_id (int) – Deprecated alias for population. A numpy array of the node IDs for the samples of interest. numpy.ndarray (dtype=np.int32)
sequence_length

Returns the sequence length in this tree sequence. This defines the genomic scale over which tree coordinates are defined. Given a tree sequence with a sequence length $$L$$, the constituent trees will be defined over the half-closed interval $$[0, L)$$. Each tree then covers some subset of this interval — see msprime.SparseTree.get_interval() for details.

Returns: The length of the sequence in this tree sequence in bases. float
simplify(samples=None, filter_zero_mutation_sites=None, map_nodes=False, reduce_to_site_topology=False, filter_populations=True, filter_individuals=True, filter_sites=True)

Returns a simplified tree sequence that retains only the history of the nodes given in the list samples. If map_nodes is true, also return a numpy array mapping the node IDs in this tree sequence to their node IDs in the simplified tree tree sequence. If a node u is not present in the new tree sequence, the value of this mapping will be NULL_NODE (-1).

In the returned tree sequence, the node with ID 0 corresponds to samples[0], node 1 corresponds to samples[1], and so on. Besides the samples, node IDs in the returned tree sequence are then allocated sequentially in time order.

If you wish to simplify a set of tables that do not satisfy all requirements for building a TreeSequence, then use TableCollection.simplify().

If the reduce_to_site_topology parameter is True, the returned tree sequence will contain only topological information that is necessary to represent the trees that contain sites. If there are zero sites in this tree sequence, this will result in an output tree sequence with zero edges. When the number of sites is greater than zero, every tree in the output tree sequence will contain at least one site. For a given site, the topology of the tree containing that site will be identical (up to node ID remapping) to the topology of the corresponding tree in the input tree sequence.

If filter_populations, filter_individuals or filter_sites is True, any of the corresponding objects that are not referenced elsewhere are filtered out. As this is the default behaviour, it is important to realise IDs for these objects may change through simplification. By setting these parameters to False, however, the corresponding tables can be preserved without changes.

Parameters: samples (list) – The list of nodes for which to retain information. This may be a numpy array (or array-like) object (dtype=np.int32). filter_zero_mutation_sites (bool) – Deprecated alias for filter_sites. map_nodes (bool) – If True, return a tuple containing the resulting tree sequence and a numpy array mapping node IDs in the current tree sequence to their corresponding node IDs in the returned tree sequence. If False (the default), return only the tree sequence object itself. reduce_to_site_topology (bool) – Whether to reduce the topology down to the trees that are present at sites. (Default: False) filter_populations (bool) – If True, remove any populations that are not referenced by nodes after simplification; new population IDs are allocated sequentially from zero. If False, the population table will not be altered in any way. (Default: True) filter_individuals (bool) – If True, remove any individuals that are not referenced by nodes after simplification; new individual IDs are allocated sequentially from zero. If False, the individual table will not be altered in any way. (Default: True) filter_sites (bool) – If True, remove any sites that are not referenced by mutations after simplification; new site IDs are allocated sequentially from zero. If False, the site table will not be altered in any way. (Default: True) The simplified tree sequence, or (if map_nodes is True) a tuple consisting of the simplified tree sequence and a numpy array mapping source node IDs to their corresponding IDs in the new tree sequence. TreeSequence or a (TreeSequence, numpy.array) tuple
site(id_)

Returns the site in this tree sequence with the specified ID.

Return type: Site
sites()

Returns an iterator over all the sites in this tree sequence. Sites are returned in order of increasing ID (and also position). See the Site class for details on the available fields for each site.

Returns: An iterator over all sites. iter(Site)
tables

A copy of the tables underlying this tree sequence. See also dump_tables().

Warning

This propery currently returns a copy of the tables underlying a tree sequence but it may return a read-only view in the future. Thus, if the tables will subsequently be updated, please use the dump_tables() method instead as this will always return a new copy of the TableCollection.

Returns: A TableCollection containing all a copy of the tables underlying this tree sequence. TableCollection
trees(tracked_samples=None, sample_counts=True, sample_lists=False, tracked_leaves=None, leaf_counts=None, leaf_lists=None)

Returns an iterator over the trees in this tree sequence. Each value returned in this iterator is an instance of SparseTree.

The sample_counts and sample_lists parameters control the features that are enabled for the resulting trees. If sample_counts is True, then it is possible to count the number of samples underneath a particular node in constant time using the num_samples() method. If sample_lists is True a more efficient algorithm is used in the SparseTree.samples() method.

The tracked_samples parameter can be used to efficiently count the number of samples in a given set that exist in a particular subtree using the SparseTree.get_num_tracked_samples() method. It is an error to use the tracked_samples parameter when the sample_counts flag is False.

Warning: Do not store the results of this iterator in a list! For performance reasons, the same underlying object is used for every tree returned which will most likely lead to unexpected behaviour. tracked_samples (list) – The list of samples to be tracked and counted using the SparseTree.get_num_tracked_samples() method. sample_counts (bool) – If True, support constant time sample counts via the SparseTree.num_samples() and SparseTree.get_num_tracked_samples() methods. sample_lists (bool) – If True, provide more efficient access to the samples beneath a give node using the SparseTree.samples() method. An iterator over the sparse trees in this tree sequence. iter
variants(as_bytes=False, samples=None)

Returns an iterator over the variants in this tree sequence. See the Variant class for details on the fields of each returned object. By default the genotypes for the variants are numpy arrays, corresponding to indexes into the alleles array. If the as_bytes parameter is true, these allelic values are recorded directly into a bytes array.

Note

The as_bytes parameter is kept as a compatibility option for older code. It is not the recommended way of accessing variant data, and will be deprecated in a later release. Another method will be provided to obtain the allelic states for each site directly.

Parameters: as_bytes (bool) – If True, the genotype values will be returned as a Python bytes object. This is useful in certain situations (i.e., directly printing the genotypes) or when numpy is not available. Otherwise, genotypes are returned as a numpy array (the default). An iterator of all variants this tree sequence. iter(Variant)
write_vcf(output, ploidy=1, contig_id='1')

Writes a VCF formatted file to the specified file-like object. If a ploidy value is supplied, allele values are combined among adjacent samples to form a phased genotype of the required ploidy. For example, if we have a ploidy of 2 and a sample of size 6, then we will have 3 diploid samples in the output, consisting of the combined alleles for samples [0, 1], [2, 3] and [4, 5]. If we had alleles 011110 at a particular variant, then we would output the genotypes 0|1, 1|1 and 1|0 in VCF. Sample names are generated by appending the index to the prefix msp_ such that we would have the sample names msp_0, msp_1 and msp_2 in the running example.

Example usage:

>>> with open("output.vcf", "w") as vcf_file:
>>>     tree_sequence.write_vcf(vcf_file, 2)


Warning

This output function does not currently use information in the IndividualTable, and so will only correctly produce non-haploid output if the nodes corresponding to each individual are contiguous as described above.

Parameters: output (File) – The file-like object to write the VCF output. ploidy (int) – The ploidy of the individuals to be written to VCF. This sample size must be evenly divisible by ploidy. contig_id (str) – The value of the CHROM column in the output VCF.
class msprime.SparseTree

A SparseTree is a single tree in a TreeSequence. The SparseTree implementation differs from most tree implementations by using integer node IDs to refer to nodes rather than objects. Thus, when we wish to find the parent of the node with ID ‘0’, we use tree.parent(0), which returns another integer. If ‘0’ does not have a parent in the current tree (e.g., if it is a root), then the special value NULL_NODE ($$-1$$) is returned. The children of a node are found using the children() method. To obtain information about a particular node, one may either use tree.tree_sequence.node(u) to obtain the corresponding Node instance, or use the time() or population() shorthands. Tree traversals in various orders is possible using the SparseTree.nodes() iterator.

Sparse trees are not intended to be instantiated directly, and are obtained as part of a TreeSequence using the trees() method.

branch_length(u)

Returns the length of the branch (in generations) joining the specified node to its parent. This is equivalent to

>>> tree.time(tree.parent(u)) - tree.time(u)


Note that this is not related to the value returned by length, which describes the length of the interval covered by the tree in genomic coordinates.

Parameters: u (int) – The node of interest. The branch length from u to its parent. float
children(u)

Returns the children of the specified node u as a tuple of integer node IDs. If u is a leaf, return the empty tuple.

Parameters: u (int) – The node of interest. The children of u as a tuple of integers tuple(int)
draw(path=None, width=None, height=None, node_labels=None, node_colours=None, mutation_labels=None, mutation_colours=None, format=None)

Returns a drawing of this tree.

When working in a Jupyter notebook, use the IPython.display.SVG function to display the SVG output from this function inline in the notebook:

>>> SVG(tree.draw())


The unicode format uses unicode box drawing characters to render the tree. This allows rendered trees to be printed out to the terminal:

>>> print(tree.draw(format="unicode"))
6
┏━┻━┓
┃   5
┃ ┏━┻┓
┃ ┃  4
┃ ┃ ┏┻┓
3 0 1 2


The node_labels argument allows the user to specify custom labels for nodes, or no labels at all:

>>> print(tree.draw(format="unicode", node_labels={}))
┃
┏━┻━┓
┃   ┃
┃ ┏━┻┓
┃ ┃  ┃
┃ ┃ ┏┻┓
┃ ┃ ┃ ┃

Parameters: path (str) – The path to the file to write the output. If None, do not write to file. width (int) – The width of the image in pixels. If not specified, either defaults to the minimum size required to depict the tree (text formats) or 200 pixels. height (int) – The height of the image in pixels. If not specified, either defaults to the minimum size required to depict the tree (text formats) or 200 pixels. node_labels (map) – If specified, show custom labels for the nodes that are present in the map. Any nodes not specified in the map will not have a node label. node_colours (map) – If specified, show custom colours for nodes. (Only supported in the SVG format.) format (str) – The format of the returned image. Currently supported are ‘svg’, ‘ascii’ and ‘unicode’. A representation of this tree in the requested format. str
index

Returns the index this tree occupies in the parent tree sequence. This index is zero based, so the first tree in the sequence has index 0.

Returns: The index of this tree. int
interval

Returns the coordinates of the genomic interval that this tree represents the history of. The interval is returned as a tuple $$(l, r)$$ and is a half-open interval such that the left coordinate is inclusive and the right coordinate is exclusive. This tree therefore applies to all genomic locations $$x$$ such that $$l \leq x < r$$.

Returns: A tuple (l, r) representing the left-most (inclusive) and right-most (exclusive) coordinates of the genomic region covered by this tree. tuple
is_internal(u)

Returns True if the specified node is not a leaf. A node is internal if it has one or more children in the current tree.

Parameters: u (int) – The node of interest. True if u is not a leaf node. bool
is_leaf(u)

Returns True if the specified node is a leaf. A node $$u$$ is a leaf if it has zero children.

Parameters: u (int) – The node of interest. True if u is a leaf node. bool
is_sample(u)

Returns True if the specified node is a sample. A node $$u$$ is a sample if it has been marked as a sample in the parent tree sequence.

Parameters: u (int) – The node of interest. True if u is a sample. bool
leaves(u=None)

Returns an iterator over all the leaves in this tree that are underneath the specified node. If u is not specified, return all leaves in the tree.

Parameters: u (int) – The node of interest. An iterator over all leaves in the subtree rooted at u. iterator
length

Returns the length of the genomic interval that this tree represents. This is defined as $$r - l$$, where $$(l, r)$$ is the genomic interval returned by interval.

Returns: The length of the genomic interval covered by this tree. int
mrca(u, v)

Returns the most recent common ancestor of the specified nodes.

Parameters: u (int) – The first node. v (int) – The second node. The most recent common ancestor of u and v. int
mutations()

Returns an iterator over all the mutations in this tree. Mutations are returned in order of nondecreasing site ID. See the Mutation class for details on the available fields for each mutation.

The returned iterator is equivalent to iterating over all sites and all mutations in each site, i.e.:

>>> for site in tree.sites():
>>>     for mutation in site.mutations:
>>>         yield mutation

Returns: An iterator over all mutations in this tree. iter(Mutation)
newick(precision=14, root=None, node_labels=None)

Returns a newick encoding of this tree. If the root argument is specified, return a representation of the specified subtree, otherwise the full tree is returned. If the tree has multiple roots then seperate newick strings for each rooted subtree must be found (i.e., we do not attempt to concatenate the different trees).

By default, leaf nodes are labelled with their numerical ID + 1, and internal nodes are not labelled. Arbitrary node labels can be specified using the node_labels argument, which maps node IDs to the desired labels.

Warning

Node labels are not Newick escaped, so care must be taken to provide labels that will not break the encoding.

Parameters: precision (int) – The numerical precision with which branch lengths are printed. root (int) – If specified, return the tree rooted at this node. node_labels (map) – If specified, show custom labels for the nodes that are present in the map. Any nodes not specified in the map will not have a node label. A newick representation of this tree. str
nodes(root=None, order='preorder')

Returns an iterator over the nodes in this tree. If the root parameter is provided, iterate over the nodes in the subtree rooted at this node. If this is None, iterate over all nodes. If the order parameter is provided, iterate over the nodes in required tree traversal order.

Parameters: root (int) – The root of the subtree we are traversing. order (str) – The traversal ordering. Currently ‘preorder’, ‘inorder’, ‘postorder’ and ‘levelorder’ (‘breadthfirst’) are supported. An iterator over the nodes in the tree in some traversal order. iterator
num_mutations

Returns the total number of mutations across all sites on this tree.

Returns: The total number of mutations over all sites on this tree. int
num_nodes

Returns the number of nodes in the sparse tree.

Return type: int
num_roots

The number of roots in this tree, as defined in the roots attribute.

Requires O(number of roots) time.

Return type: int
num_samples(u=None)

Returns the number of samples in this tree underneath the specified node (including the node itself). If u is not specified return the total number of samples in the tree.

If the TreeSequence.trees() method is called with sample_counts=True this method is a constant time operation. If not, a slower traversal based algorithm is used to count the samples.

Parameters: u (int) – The node of interest. The number of samples in the subtree rooted at u. int
num_sites

Returns the number of sites on this tree.

Returns: The number of sites on this tree. int
num_tracked_samples(u=None)

Returns the number of samples in the set specified in the tracked_samples parameter of the TreeSequence.trees() method underneath the specified node. If the input node is not specified, return the total number of tracked samples in the tree.

This is a constant time operation.

Parameters: u (int) – The node of interest. The number of samples within the set of tracked samples in the subtree rooted at u. int RuntimeError – if the TreeSequence.trees() method is not called with sample_counts=True.
parent(u)

Returns the parent of the specified node. Returns the NULL_NODE if u is the root or is not a node in the current tree.

Parameters: u (int) – The node of interest. The parent of u. int
population(u)

Returns the population associated with the specified node. Equivalent to tree.tree_sequence.node(u).population.

Parameters: u (int) – The node of interest. The ID of the population associated with node u. int
root

The root of this tree. If the tree contains multiple roots, a ValueError is raised indicating that the roots attribute should be used instead.

Returns: The root node. int ValueError if this tree contains more than one root.
roots

The list of roots in this tree. A root is defined as a unique endpoint of the paths starting at samples. We can define the set of roots as follows:

roots = set()
for u in tree_sequence.samples():
while tree.parent(u) != msprime.NULL_NODE:
u = tree.parent(u)
# roots is now the set of all roots in this tree.
assert sorted(roots) == sorted(tree.roots)


The roots of the tree are returned in a list, in no particular order.

Requires O(number of roots) time.

Returns: The list of roots in this tree. list
sample_size

Returns the sample size for this tree. This is the number of sample nodes in the tree.

Returns: The number of sample nodes in the tree. int
samples(u=None)

Returns an iterator over all the samples in this tree that are underneath the specified node. If u is a sample, it is included in the returned iterator. If u is not specified, return all samples in the tree.

If the TreeSequence.trees() method is called with sample_lists=True, this method uses an efficient algorithm to find the samples. If not, a simple traversal based method is used.

Parameters: u (int) – The node of interest. An iterator over all samples in the subtree rooted at u. iterator
sites()

Returns an iterator over all the sites in this tree. Sites are returned in order of increasing ID (and also position). See the Site class for details on the available fields for each site.

Returns: An iterator over all sites in this tree. iter(Site)
time(u)

Returns the time of the specified node in generations. Equivalent to tree.tree_sequence.node(u).time.

Parameters: u (int) – The node of interest. The time of u. float
tmrca(u, v)

Returns the time of the most recent common ancestor of the specified nodes. This is equivalent to:

>>> tree.time(tree.mrca(u, v))

Parameters: u (int) – The first node. v (int) – The second node. The time of the most recent common ancestor of u and v. float
total_branch_length

Returns the sum of all the branch lengths in this tree (in units of generations). This is equivalent to

>>> sum(
>>>    tree.branch_length(u) for u in tree.nodes()
>>>    if u not in self.roots)

Returns: The sum of all the branch lengths in this tree. float
tree_sequence

Returns the tree sequence that this tree is from.

Returns: The parent tree sequence for this tree. TreeSequence

Constants¶

These constants are used in TreeSequence and SparseTree instances to define special values of various variables.

msprime.NULL_NODE == -1

Special reserved value, representing the null node. If the parent of a given node is null, then this node is either a root (the ancestor of one or more samples), or the node is not present in the current tree.

msprime.NULL_POPULATION == -1

Special reserved value, representing the null population ID. This value is returned if the population associated with a particular node is not defined, or population information was not available in the underlying tree sequence.

msprime.NULL_INDIVIDUAL == -1

Special reserved value, representing the null individual ID. This value is returned if the individual associated with a particular node is not defined, or individual information was not available in the underlying tree sequence.

msprime.NULL_MUTATION == -1

Special reserved value, representing the null mutation ID. If there is a single mutation at a site, or if a mutation at a given site does not have an ancestal mutations it inherited from, this is the value returned by mutation.parent. See also Mutation.

msprime.FORWARD == 1

Constant representing the forward direction of travel (i.e., increasing genomic coordinate values).

msprime.REVERSE == -1

Constant representing the reverse direction of travel (i.e., decreasing genomic coordinate values).

Simple container classes¶

These classes are simple shallow containers representing the entities defined in the Data model. These classes are not intended to be instantiated directly, but are the return types for the various iterators provided by the TreeSequence and SparseTree classes.

class msprime.Individual

An individual in a tree sequence. Since nodes correspond to genomes, individuals are associated with a collection of nodes (e.g., two nodes per diploid). See Nodes, Genomes, or Individuals? for more discussion of this distinction.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: id (int) – The integer ID of this individual. Varies from 0 to TreeSequence.num_individuals - 1. flags (int) – The bitwise flags for this individual. location (numpy.ndarray) – The spatial location of this individual as a numpy array. The location is an empty array if no spatial location is defined. nodes – The IDs of the nodes that are associated with this individual as a numpy array (dtype=np.int32). If no nodes are associated with the individual this array will be empty. metadata (bytes) – The metadata for this individual.
class msprime.Node

A node in a tree sequence, corresponding to a single genome. The time and population are attributes of the Node, rather than the Individual, as discussed in Nodes, Genomes, or Individuals?.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: id (int) – The integer ID of this node. Varies from 0 to TreeSequence.num_nodes - 1. flags (int) – The bitwise flags for this node. time (float) – The birth time of this node. population (int) – The integer ID of the population that this node was born in. individual (int) – The integer ID of the individual that this node was a part of. metadata (bytes) – The metadata for this node.
is_sample()

Returns True if this node is a sample. This value is derived from the flag variable.

Return type: bool
class msprime.Edge

An edge in a tree sequence.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: left (float) – The left coordinate of this edge. right (float) – The right coordinate of this edge. parent (int) – The integer ID of the parent node for this edge. To obtain further information about a node with a given ID, use TreeSequence.node(). child (int) – The integer ID of the child node for this edge. To obtain further information about a node with a given ID, use TreeSequence.node().
class msprime.Site

A site in a tree sequence.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: id (int) – The integer ID of this site. Varies from 0 to TreeSequence.num_sites - 1. position (float) – The floating point location of this site in genome coordinates. Ranges from 0 (inclusive) to TreeSequence.sequence_length (exclusive). ancestral_state (str) – The ancestral state at this site (i.e., the state inherited by nodes, unless mutations occur). metadata (bytes) – The metadata for this site. mutations (list[Mutation]) – The list of mutations at this site. Mutations within a site are returned in the order they are specified in the underlying MutationTable.
class msprime.Mutation

A mutation in a tree sequence.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: id (int) – The integer ID of this mutation. Varies from 0 to TreeSequence.num_mutations - 1. site (int) – The integer ID of the site that this mutation occurs at. To obtain further information about a site with a given ID use TreeSequence.site(). node (int) – The integer ID of the first node that inherits this mutation. To obtain further information about a node with a given ID, use TreeSequence.node(). derived_state (str) – The derived state for this mutation. This is the state inherited by nodes in the subtree rooted at this mutation’s node, unless another mutation occurs. parent (int) – The integer ID of this mutation’s parent mutation. When multiple mutations occur at a site along a path in the tree, mutations must record the mutation that is immediately above them. If the mutation does not have a parent, this is equal to the NULL_MUTATION (-1). To obtain further information about a mutation with a given ID, use TreeSequence.mutation(). metadata (bytes) – The metadata for this site.
class msprime.Variant

A variant represents the observed variation among the samples for a given site. A variant consists (a) of a reference to the Site instance in question; (b) the alleles that may be observed at the samples for this site; and (c) the genotypes mapping sample IDs to the observed alleles.

Each element in the alleles tuple is a string, representing the actual observed state for a given sample. The first element of this tuple is guaranteed to be the same as the site’s ancestral_state value. The list of alleles is also guaranteed not to contain any duplicates. However, allelic values may be listed that are not referred to by any samples. For example, if we have a site that is fixed for the derived state (i.e., we have a mutation over the tree root), all genotypes will be 1, but the alleles list will be equal to ('0', '1'). Other than the ancestral state being the first allele, the alleles are listed in no particular order, and the ordering should not be relied upon.

The genotypes represent the observed allelic states for each sample, such that var.alleles[var.genotypes[j]] gives the string allele for sample ID j. Thus, the elements of the genotypes array are indexes into the alleles list. The genotypes are provided in this way via a numpy array to enable efficient calculations.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: site (Site) – The site object for this variant. alleles (tuple(str)) – A tuple of the allelic values that may be observed at the samples at the current site. The first element of this tuple is always the sites’s ancestral state. genotypes (numpy.ndarray) – An array of indexes into the list alleles, giving the state of each sample at the current site.
class msprime.Migration

A migration in a tree sequence.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: left (float) – The left end of the genomic interval covered by this migration (inclusive). right (float) – The right end of the genomic interval covered by this migration (exclusive). node (int) – The integer ID of the node involved in this migration event. To obtain further information about a node with a given ID, use TreeSequence.node(). source (int) – The source population ID. dest (int) – The destination population ID. time (float) – The time at which this migration occured at.
class msprime.Population

A population in a tree sequence.

Modifying the attributes in this class will have no effect on the underlying tree sequence data.

Variables: id (int) – The integer ID of this population. Varies from 0 to TreeSequence.num_populations - 1. metadata (bytes) – The metadata for this population.

There are several methods for loading data into a TreeSequence instance. The simplest and most convenient is the use the msprime.load() function to load a tree sequence file. For small scale data and debugging, it is often convenient to use the msprime.load_text() to read data in the text file format. The TableCollection.tree_sequence() function efficiently creates a TreeSequence object from a set of tables using the Tables API.

msprime.load(path)

Loads a tree sequence from the specified file path. This file must be in the tree sequence file format produced by the TreeSequence.dump() method.

Parameters: path (str) – The file path of the .trees file containing the tree sequence we wish to load. The tree sequence object containing the information stored in the specified file path. msprime.TreeSequence
msprime.load_text(nodes, edges, sites=None, mutations=None, individuals=None, populations=None, sequence_length=0, strict=True, encoding='utf8', base64_metadata=True)

Parses the tree sequence data from the specified file-like objects, and returns the resulting TreeSequence object. The format for these files is documented in the Text file formats section, and is produced by the TreeSequence.dump_text() method. Further properties required for an input tree sequence are described in the Valid tree sequence requirements section. This method is intended as a convenient interface for importing external data into msprime; the binary file format using by msprime.load() is many times more efficient than this text format.

The nodes and edges parameters are mandatory and must be file-like objects containing text with whitespace delimited columns, parsable by parse_nodes() and parse_edges(), respectively. sites, mutations, individuals and populations are optional, and must be parsable by parse_sites(), parse_individuals(), parse_populations(), and parse_mutations(), respectively.

TODO: there is no method to parse the remaining tables at present, so only tree sequences not requiring Population and Individual tables can be loaded. This will be fixed: https://github.com/tskit-dev/msprime/issues/498

The sequence_length parameter determines the TreeSequence.sequence_length of the returned tree sequence. If it is 0 or not specified, the value is taken to be the maximum right coordinate of the input edges. This parameter is useful in degenerate situations (such as when there are zero edges), but can usually be ignored.

The strict parameter controls the field delimiting algorithm that is used. If strict is True (the default), we require exactly one tab character separating each field. If strict is False, a more relaxed whitespace delimiting algorithm is used, such that any run of whitespace is regarded as a field separator. In most situations, strict=False is more convenient, but it can lead to error in certain situations. For example, if a deletion is encoded in the mutation table this will not be parseable when strict=False.

After parsing the tables, sort_tables() is called to ensure that the loaded tables satisfy the tree sequence ordering requirements. Note that this may result in the IDs of various entities changing from their positions in the input file.

Parameters: nodes (stream) – The file-like object containing text describing a NodeTable. edges (stream) – The file-like object containing text describing an EdgeTable. sites (stream) – The file-like object containing text describing a SiteTable. mutations (stream) – The file-like object containing text describing a MutationTable. individuals (stream) – The file-like object containing text describing a IndividualTable. populations (stream) – The file-like object containing text describing a PopulationTable. sequence_length (float) – The sequence length of the returned tree sequence. If not supplied or zero this will be inferred from the set of edges. strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used. encoding (string) – Encoding used for text representation. base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text. The tree sequence object containing the information stored in the specified file paths. msprime.TreeSequence

Calculating statistics¶

The msprime API provides methods for efficiently calculating population genetics statistics from a given TreeSequence.

class msprime.LdCalculator(tree_sequence)

Class for calculating linkage disequilibrium coefficients between pairs of mutations in a TreeSequence. This class requires the numpy library.

This class supports multithreaded access using the Python threading module. Separate instances of LdCalculator referencing the same tree sequence can operate in parallel in multiple threads. See the Working with threads section in the Tutorial for an example of how use multiple threads to calculate LD values efficiently.

Note

This class does not currently support sites that have more than one mutation. Using it on such a tree sequence will raise a LibraryError with an “Unsupported operation” message.

Parameters: tree_sequence (TreeSequence) – The tree sequence containing the mutations we are interested in.
r2(a, b)

Returns the value of the $$r^2$$ statistic between the pair of mutations at the specified indexes. This method is not an efficient method for computing large numbers of pairwise; please use either r2_array() or r2_matrix() for this purpose.

Parameters: a (int) – The index of the first mutation. b (int) – The index of the second mutation. The value of $$r^2$$ between the mutations at indexes a and b. float
r2_array(a, direction=1, max_mutations=None, max_distance=None)

Returns the value of the $$r^2$$ statistic between the focal mutation at index $$a$$ and a set of other mutations. The method operates by starting at the focal mutation and iterating over adjacent mutations (in either the forward or backwards direction) until either a maximum number of other mutations have been considered (using the max_mutations parameter), a maximum distance in sequence coordinates has been reached (using the max_distance parameter) or the start/end of the sequence has been reached. For every mutation $$b$$ considered, we then insert the value of $$r^2$$ between $$a$$ and $$b$$ at the corresponding index in an array, and return the entire array. If the returned array is $$x$$ and direction is msprime.FORWARD then $$x[0]$$ is the value of the statistic for $$a$$ and $$a + 1$$, $$x[1]$$ the value for $$a$$ and $$a + 2$$, etc. Similarly, if direction is msprime.REVERSE then $$x[0]$$ is the value of the statistic for $$a$$ and $$a - 1$$, $$x[1]$$ the value for $$a$$ and $$a - 2$$, etc.

Parameters: a (int) – The index of the focal mutation. direction (int) – The direction in which to travel when examining other mutations. Must be either msprime.FORWARD or msprime.REVERSE. Defaults to msprime.FORWARD. max_mutations (int) – The maximum number of mutations to return $$r^2$$ values for. Defaults to as many mutations as possible. max_distance (float) – The maximum absolute distance between the focal mutation and those for which $$r^2$$ values are returned. An array of double precision floating point values representing the $$r^2$$ values for mutations in the specified direction. numpy.ndarray For efficiency reasons, the underlying memory used to store the returned array is shared between calls. Therefore, if you wish to store the results of a single call to get_r2_array() for later processing you must take a copy of the array!
r2_matrix()

Returns the complete $$m \times m$$ matrix of pairwise $$r^2$$ values in a tree sequence with $$m$$ mutations.

Returns: An 2 dimensional square array of double precision floating point values representing the $$r^2$$ values for all pairs of mutations. numpy.ndarray

Tables API¶

The tables API provides an efficient way of working with and interchanging tree sequence data. Each table class (e.g, NodeTable, EdgeTable) has a specific set of columns with fixed types, and a set of methods for setting and getting the data in these columns. The number of rows in the table t is given by len(t). Each table supports accessing the data either by row or column. To access the row j in table t simply use t[j]. The value returned by such an access is an instance of collections.namedtuple(), and therefore supports either positional or named attribute access. To access the data in a column, we can use standard attribute access which will return a numpy array of the data. For example:

>>> import msprime
>>> t = msprime.EdgeTable()
0
1
>>> print(t)
id      left            right           parent  child
0       0.00000000      1.00000000      10      11
1       1.00000000      2.00000000      9       11
>>> t[0]
EdgeTableRow(left=0.0, right=1.0, parent=10, child=11)
>>> t[-1]
EdgeTableRow(left=1.0, right=2.0, parent=9, child=11)
>>> t.left
array([ 0.,  1.])
>>> t.parent
array([10,  9], dtype=int32)
>>> len(t)
2
>>>


Tables also support the pickle protocol, and so can be easily serialised and deserialised (for example, when performing parallel computations using the multiprocessing module).

>>> serialised = pickle.dumps(t)
>>> print(t2)
id      left            right           parent  child
0       0.00000000      1.00000000      10      11
1       1.00000000      2.00000000      9       11


However, pickling will not be as efficient as storing tables in the native format.

Tables support the equality operator == based on the data held in the columns:

>>> t == t2
True
>>> t is t2
False
2
>>> print(t2)
id      left            right           parent  child
0       0.00000000      1.00000000      10      11
1       1.00000000      2.00000000      9       11
2       0.00000000      1.00000000      2       3
>>> t == t2
False


Text columns¶

As described in the Encoding ragged columns, working with variable length columns is somewhat more involved. Columns encoding text data store the encoded bytes of the flattened strings, and the offsets into this column in two separate arrays.

Consider the following example:

>>> t = msprime.SiteTable()
>>> print(t)
0       0.00000000      A
1       1.00000000      BB
2       2.00000000
3       3.00000000      CCC
>>> t[0]
>>> t[1]
>>> t[2]
>>> t[3]


Here we create a SiteTable and add four rows, each with a different ancestral_state. We can then access this information from each row in a straightforward manner. Working with the data in the columns is a little trickier, however:

>>> t.ancestral_state
array([65, 66, 66, 67, 67, 67], dtype=int8)
>>> t.ancestral_state_offset
array([0, 1, 3, 3, 6], dtype=uint32)
>>> msprime.unpack_strings(t.ancestral_state, t.ancestral_state_offset)
['A', 'BB', '', 'CCC']


Here, the ancestral_state array is the UTF8 encoded bytes of the flattened strings, and the ancestral_state_offset is the offset into this array for each row. The unpack_strings() function, however, is a convient way to recover the original strings from this encoding. We can also use the pack_strings() to insert data using this approach:

>>> a, off = msprime.pack_strings(["0", "12", ""])
>>> t.set_columns(position=[0, 1, 2], ancestral_state=a, ancestral_state_offset=off)
>>> print(t)
0       0.00000000      0
1       1.00000000      12
2       2.00000000


When inserting many rows with standard infinite sites mutations (i.e., ancestral state is “0”), it is more efficient to construct the numpy arrays directly than to create a list of strings and use pack_strings(). When doing this, it is important to note that it is the encoded byte values that are stored; by default, we use UTF8 (which corresponds to ASCII for simple printable characters).:

>>> t_s = msprime.SiteTable()
>>> m = 10
>>> a = ord("0") + np.zeros(m, dtype=np.int8)
>>> off = np.arange(m + 1, dtype=np.uint32)
>>> t_s.set_columns(position=np.arange(m), ancestral_state=a, ancestral_state_offset=off)
>>> print(t_s)
0       0.00000000      0
1       1.00000000      0
2       2.00000000      0
3       3.00000000      0
4       4.00000000      0
5       5.00000000      0
6       6.00000000      0
7       7.00000000      0
8       8.00000000      0
9       9.00000000      0
>>> t_s.ancestral_state
array([48, 48, 48, 48, 48, 48, 48, 48, 48, 48], dtype=int8)
>>> t_s.ancestral_state_offset
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=uint32)


Here we create 10 sites at regular positions, each with ancestral state equal to “0”. Note that we use ord("0") to get the ASCII code for “0” (48), and create 10 copies of this by adding it to an array of zeros. We have done this for illustration purposes: it is equivalent (though slower for large examples) to do a, off = msprime.pack_strings(["0"] * m).

Mutations can be handled similarly:

>>> t_m = msprime.MutationTable()
>>> site = np.arange(m, dtype=np.int32)
>>> d, off = msprime.pack_strings(["1"] * m)
>>> node = np.zeros(m, dtype=np.int32)
>>> t_m.set_columns(site=site, node=node, derived_state=d, derived_state_offset=off)
>>> print(t_m)
id      site    node    derived_state   parent  metadata
0       0       0       1       -1
1       1       0       1       -1
2       2       0       1       -1
3       3       0       1       -1
4       4       0       1       -1
5       5       0       1       -1
6       6       0       1       -1
7       7       0       1       -1
8       8       0       1       -1
9       9       0       1       -1
>>>


Binary columns¶

Columns storing binary data take the same approach as Text columns to encoding variable length data. The difference between the two is only raw bytes values are accepted: no character encoding or decoding is done on the data. Consider the following example:

>>> t = msprime.NodeTable()
b'raw bytes'
b'\x80\x03}q\x00X\x01\x00\x00\x00xq\x01G?\xf1\x99\x99\x99\x99\x99\x9as.'
{'x': 1.1}
>>> print(t)
0       0       -1      0.00000000000000        cmF3IGJ5dGVz
1       0       -1      0.00000000000000        gAN9cQBYAQAAAHhxAUc/8ZmZmZmZmnMu
array([ 114,   97,  119,   32,   98,  121,  116,  101,  115, -128,    3,
125,  113,    0,   88,    1,    0,    0,    0,  120,  113,    1,
71,   63,  -15, -103, -103, -103, -103, -103, -102,  115,   46], dtype=int8)
array([ 0,  9, 33], dtype=uint32)


Here we add two rows to a NodeTable, with different metadata. The first row contains a simple byte string, and the second contains a Python dictionary serialised using pickle. We then show several different (and seemingly incompatible!) different views on the same data.

When we access the data in a row (e.g., t[0].metadata) we are returned a Python bytes object containing precisely the bytes that were inserted. The pickled dictionary is encoded in 24 bytes containing unprintable characters, and when we unpickle it using pickle.loads(), we obtain the original dictionary.

When we print the table, however, we see some data which is seemingly unrelated to the original contents. This is because the binary data is base64 encoded to ensure that it is print-safe (and doesn’t break your terminal). (See the Metadata section for more information on the use of base64 encoding.).

Finally, when we print the metadata column, we see the raw byte values encoded as signed integers. As for Text columns, the metadata_offset column encodes the offsets into this array. So, we see that the first metadata value is 9 bytes long and the second is 24.

The pack_bytes() and unpack_bytes() functions are also useful for encoding data in these columns.

Table classes¶

This section describes the methods and variables available for each table class. For description and definition of each table’s meaning and use, see the table definitions.

class msprime.IndividualTable

A table defining the individuals in a tree sequence. Note that although each Individual has associated nodes, reference to these is not stored in the individual table, but rather reference to the individual is stored for each node in the NodeTable. This is similar to the way in which the relationship between sites and mutations is modelled.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. flags (numpy.ndarray, dtype=np.uint32) – The array of flags values. location (numpy.ndarray, dtype=np.float64) – The flattened array of floating point location values. See Encoding ragged columns for more details. location_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the location column. See Encoding ragged columns for more details. metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details. metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
add_row(flags=0, location=None, metadata=None)

Adds a new row to this IndividualTable and returns the ID of the corresponding individual.

Parameters: flags (int) – The bitwise flags for the new node. location (array-like) – A list of numeric values or one-dimensional numpy array describing the location of this individual. If not specified or None, a zero-dimensional location is stored. metadata (bytes) – The binary-encoded metadata for the new node. If not specified or None, a zero-length byte string is stored. The ID of the newly added node. int
append_columns(flags, location=None, location_offset=None, metadata=None, metadata_offset=None)

Appends the specified arrays to the end of the columns in this IndividualTable. This allows many new rows to be added at once.

The flags array is mandatory and defines the number of extra individuals to add to the table. The location and location_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns. The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information.

Parameters: flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each individual. Required. location (numpy.ndarray, dtype=np.float64) – The flattened location array. Must be specified along with location_offset. If not specified or None, an empty location value is stored for each individual. location_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the location array. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each individual. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

set_columns(flags, location=None, location_offset=None, metadata=None, metadata_offset=None)

Sets the values for each column in this IndividualTable using the values in the specified arrays. Overwrites any data currently stored in the table.

The flags array is mandatory and defines the number of individuals the table will contain. The location and location_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns. The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information.

Parameters: flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each individual. Required. location (numpy.ndarray, dtype=np.float64) – The flattened location array. Must be specified along with location_offset. If not specified or None, an empty location value is stored for each individual. location_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the location array. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each individual. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.NodeTable

A table defining the nodes in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a node table to be a part of a valid tree sequence.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. time (numpy.ndarray, dtype=np.float64) – The array of time values. flags (numpy.ndarray, dtype=np.uint32) – The array of flags values. population (numpy.ndarray, dtype=np.int32) – The array of population IDs. individual (numpy.ndarray, dtype=np.int32) – The array of individual IDs that each node belongs to. metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details. metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
add_row(flags=0, time=0, population=-1, individual=-1, metadata=None)

Adds a new row to this NodeTable and returns the ID of the corresponding node.

Parameters: flags (int) – The bitwise flags for the new node. time (float) – The birth time for the new node. population (int) – The ID of the population in which the new node was born. Defaults to the NULL_POPULATION. individual (int) – The ID of the individual in which the new node was born. Defaults to the NULL_INDIVIDUAL. metadata (bytes) – The binary-encoded metadata for the new node. If not specified or None, a zero-length byte string is stored. The ID of the newly added node. int
append_columns(flags, time, population=None, individual=None, metadata=None, metadata_offset=None)

Appends the specified arrays to the end of the columns in this NodeTable. This allows many new rows to be added at once.

The flags, time and population arrays must all be of the same length, which is equal to the number of nodes that will be added to the table. The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information.

Parameters: flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each node. Required. time (numpy.ndarray, dtype=np.float64) – The time values for each node. Required. population (numpy.ndarray, dtype=np.int32) – The population values for each node. If not specified or None, the NULL_POPULATION value is stored for each node. individual (numpy.ndarray, dtype=np.int32) – The individual values for each node. If not specified or None, the NULL_INDIVIDUAL value is stored for each node. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each node. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

set_columns(flags, time, population=None, individual=None, metadata=None, metadata_offset=None)

Sets the values for each column in this NodeTable using the values in the specified arrays. Overwrites any data currently stored in the table.

The flags, time and population arrays must all be of the same length, which is equal to the number of nodes the table will contain. The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns. See Binary columns for more information.

Parameters: flags (numpy.ndarray, dtype=np.uint32) – The bitwise flags for each node. Required. time (numpy.ndarray, dtype=np.float64) – The time values for each node. Required. population (numpy.ndarray, dtype=np.int32) – The population values for each node. If not specified or None, the NULL_POPULATION value is stored for each node. individual (numpy.ndarray, dtype=np.int32) – The individual values for each node. If not specified or None, the NULL_INDIVIDUAL value is stored for each node. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each node. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.EdgeTable

A table defining the edges in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for an edge table to be a part of a valid tree sequence.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. left (numpy.ndarray, dtype=np.float64) – The array of left coordinates. right (numpy.ndarray, dtype=np.float64) – The array of right coordinates. parent (numpy.ndarray, dtype=np.int32) – The array of parent node IDs. child (numpy.ndarray, dtype=np.int32) – The array of child node IDs.
add_row(left, right, parent, child)

Adds a new row to this EdgeTable and returns the ID of the corresponding edge.

Parameters: left (float) – The left coordinate (inclusive). right (float) – The right coordinate (exclusive). parent (int) – The ID of parent node. child (int) – The ID of child node. The ID of the newly added edge. int
append_columns(left, right, parent, child)

Appends the specified arrays to the end of the columns of this EdgeTable. This allows many new rows to be added at once.

All four parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of additional edges to add to the table).

Parameters: left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive). right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive). parent (numpy.ndarray, dtype=np.int32) – The parent node IDs. child (numpy.ndarray, dtype=np.int32) – The child node IDs.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

set_columns(left, right, parent, child)

Sets the values for each column in this EdgeTable using the values in the specified arrays. Overwrites any data currently stored in the table.

All four parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of edges the table will contain).

Parameters: left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive). right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive). parent (numpy.ndarray, dtype=np.int32) – The parent node IDs. child (numpy.ndarray, dtype=np.int32) – The child node IDs.
truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.MigrationTable

A table defining the migrations in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a migration table to be a part of a valid tree sequence.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. left (numpy.ndarray, dtype=np.float64) – The array of left coordinates. right (numpy.ndarray, dtype=np.float64) – The array of right coordinates. node (numpy.ndarray, dtype=np.int32) – The array of node IDs. source (numpy.ndarray, dtype=np.int32) – The array of source population IDs. dest (numpy.ndarray, dtype=np.int32) – The array of destination population IDs. time (numpy.ndarray, dtype=np.float64) – The array of time values.
add_row(left, right, node, source, dest, time)

Adds a new row to this MigrationTable and returns the ID of the corresponding migration.

Parameters: left (float) – The left coordinate (inclusive). right (float) – The right coordinate (exclusive). node (int) – The node ID. source (int) – The ID of the source population. dest (int) – The ID of the destination population. time (float) – The time of the migration event. The ID of the newly added migration. int
append_columns(left, right, node, source, dest, time)

Appends the specified arrays to the end of the columns of this MigrationTable. This allows many new rows to be added at once.

All six parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of additional migrations to add to the table).

Parameters: left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive). right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive). node (numpy.ndarray, dtype=np.int32) – The node IDs. source (numpy.ndarray, dtype=np.int32) – The source population IDs. dest (numpy.ndarray, dtype=np.int32) – The destination population IDs. time (numpy.ndarray, dtype=np.int64) – The time of each migration.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

set_columns(left, right, node, source, dest, time)

Sets the values for each column in this MigrationTable using the values in the specified arrays. Overwrites any data currently stored in the table.

All six parameters are mandatory, and must be numpy arrays of the same length (which is equal to the number of migrations the table will contain).

Parameters: left (numpy.ndarray, dtype=np.float64) – The left coordinates (inclusive). right (numpy.ndarray, dtype=np.float64) – The right coordinates (exclusive). node (numpy.ndarray, dtype=np.int32) – The node IDs. source (numpy.ndarray, dtype=np.int32) – The source population IDs. dest (numpy.ndarray, dtype=np.int32) – The destination population IDs. time (numpy.ndarray, dtype=np.int64) – The time of each migration.
truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.SiteTable

A table defining the sites in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a site table to be a part of a valid tree sequence.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. position (numpy.ndarray, dtype=np.float64) – The array of site position coordinates. ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened array of ancestral state strings. See Text columns for more details. ancestral_state_offset (numpy.ndarray, dtype=np.uint32) – The offsets of rows in the ancestral_state array. See Text columns for more details. metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details. metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
add_row(position, ancestral_state, metadata=None)

Adds a new row to this SiteTable and returns the ID of the corresponding site.

Parameters: position (float) – The position of this site in genome coordinates. ancestral_state (str) – The state of this site at the root of the tree. metadata (bytes) – The binary-encoded metadata for the new node. If not specified or None, a zero-length byte string is stored. The ID of the newly added site. int
append_columns(position, ancestral_state, ancestral_state_offset, metadata=None, metadata_offset=None)

Appends the specified arrays to the end of the columns of this SiteTable. This allows many new rows to be added at once.

The position, ancestral_state and ancestral_state_offset parameters are mandatory, and must be 1D numpy arrays. The length of the position array determines the number of additional rows to add the table. The ancestral_state and ancestral_state_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information).

Parameters: position (numpy.ndarray, dtype=np.float64) – The position of each site in genome coordinates. ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened ancestral_state array. Required. ancestral_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the ancestral_state array. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each node. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

set_columns(position, ancestral_state, ancestral_state_offset, metadata=None, metadata_offset=None)

Sets the values for each column in this SiteTable using the values in the specified arrays. Overwrites any data currently stored in the table.

The position, ancestral_state and ancestral_state_offset parameters are mandatory, and must be 1D numpy arrays. The length of the position array determines the number of rows in table. The ancestral_state and ancestral_state_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information).

Parameters: position (numpy.ndarray, dtype=np.float64) – The position of each site in genome coordinates. ancestral_state (numpy.ndarray, dtype=np.int8) – The flattened ancestral_state array. Required. ancestral_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the ancestral_state array. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each node. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.MutationTable

A table defining the mutations in a tree sequence. See the definitions for details on the columns in this table and the tree sequence requirements section for the properties needed for a mutation table to be a part of a valid tree sequence.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. site (numpy.ndarray, dtype=np.int32) – The array of site IDs. node (numpy.ndarray, dtype=np.int32) – The array of node IDs. derived_state (numpy.ndarray, dtype=np.int8) – The flattened array of derived state strings. See Text columns for more details. derived_state_offset (numpy.ndarray, dtype=np.uint32) – The offsets of rows in the derived_state array. See Text columns for more details. parent (numpy.ndarray, dtype=np.int32) – The array of parent mutation IDs. metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details. metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
add_row(site, node, derived_state, parent=-1, metadata=None)

Adds a new row to this MutationTable and returns the ID of the corresponding mutation.

Parameters: site (int) – The ID of the site that this mutation occurs at. node (int) – The ID of the first node inheriting this mutation. derived_state (str) – The state of the site at this mutation’s node. parent (int) – The ID of the parent mutation. If not specified, defaults to the NULL_MUTATION. metadata (bytes) – The binary-encoded metadata for the new node. If not specified or None, a zero-length byte string is stored. The ID of the newly added mutation. int
append_columns(site, node, derived_state, derived_state_offset, parent=None, metadata=None, metadata_offset=None)

Appends the specified arrays to the end of the columns of this MutationTable. This allows many new rows to be added at once.

The site, node, derived_state and derived_state_offset parameters are mandatory, and must be 1D numpy arrays. The site and node (also parent, if supplied) arrays must be of equal length, and determine the number of additional rows to add to the table. The derived_state and derived_state_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information).

Parameters: site (numpy.ndarray, dtype=np.int32) – The ID of the site each mutation occurs at. node (numpy.ndarray, dtype=np.int32) – The ID of the node each mutation is associated with. derived_state (numpy.ndarray, dtype=np.int8) – The flattened derived_state array. Required. derived_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the derived_state array. parent (numpy.ndarray, dtype=np.int32) – The ID of the parent mutation for each mutation. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each node. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

set_columns(site, node, derived_state, derived_state_offset, parent=None, metadata=None, metadata_offset=None)

Sets the values for each column in this MutationTable using the values in the specified arrays. Overwrites any data currently stored in the table.

The site, node, derived_state and derived_state_offset parameters are mandatory, and must be 1D numpy arrays. The site and node (also parent, if supplied) arrays must be of equal length, and determine the number of rows in the table. The derived_state and derived_state_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Text columns for more information). The metadata and metadata_offset parameters must be supplied together, and meet the requirements for Encoding ragged columns (see Binary columns for more information).

Parameters: site (numpy.ndarray, dtype=np.int32) – The ID of the site each mutation occurs at. node (numpy.ndarray, dtype=np.int32) – The ID of the node each mutation is associated with. derived_state (numpy.ndarray, dtype=np.int8) – The flattened derived_state array. Required. derived_state_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the derived_state array. parent (numpy.ndarray, dtype=np.int32) – The ID of the parent mutation for each mutation. metadata (numpy.ndarray, dtype=np.int8) – The flattened metadata array. Must be specified along with metadata_offset. If not specified or None, an empty metadata value is stored for each node. metadata_offset (numpy.ndarray, dtype=np.uint32.) – The offsets into the metadata array.
truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.PopulationTable

A table defining the populations referred to in a tree sequence. The PopulationTable stores metadata for populations that may be referred to in the NodeTable and MigrationTable”. Note that although nodes may be associated with populations, this association is stored in the NodeTable: only metadata on each population is stored in the population table.

Warning: The numpy arrays returned by table attribute accesses are copies of the underlying data. In particular, this means that you cannot edit the values in the columns by updating the attribute arrays. NOTE: this behaviour may change in future. metadata (numpy.ndarray, dtype=np.int8) – The flattened array of binary metadata values. See Binary columns for more details. metadata_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the metadata column. See Binary columns for more details.
add_row(metadata=None)

Adds a new row to this PopulationTable and returns the ID of the corresponding population.

Parameters: metadata (bytes) – The binary-encoded metadata for the new population. If not specified or None, a zero-length byte string is stored. The ID of the newly added population. int
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.
class msprime.ProvenanceTable

A table recording the provenance (i.e., history) of this table, so that the origin of the underlying data and sequence of subsequent operations can be traced. Each row contains a “record” string (recommended format: JSON) and a timestamp.

Todo

The format of the record field will be more precisely specified in the future.

Variables: record (numpy.ndarray, dtype=np.int8) – The flattened array containing the record strings. Text columns for more details. record_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the record column. See Text columns for more details. timestamp (numpy.ndarray, dtype=np.int8) – The flattened array containing the timestamp strings. Text columns for more details. timestamp_offset (numpy.ndarray, dtype=np.uint32) – The array of offsets into the timestamp column. See Text columns for more details.
add_row(record, timestamp=None)

Adds a new row to this ProvenanceTable consisting of the specified record and timestamp. If timestamp is not specified, it is automatically generated from the current time.

Parameters: record (str) – A provenance record, describing the parameters and environment used to generate the current set of tables. timestamp (str) – A string timestamp. This should be in ISO8601 form.
clear()

Deletes all rows in this table.

copy()

Returns a deep copy of this table.

truncate(num_rows)

Truncates this table so that the only the first num_rows are retained.

Parameters: num_rows (int) – The number of rows to retain in this table.

Table Collections¶

Each of the table classes defines a different aspect of the structure of a tree sequence. It is convenient to be able to refer to a set of these tables which together define a tree sequence. We refer to this grouping of related tables as a TableCollection. The TableCollection and TreeSequence classes are deeply related. A TreeSequence instance is based on the information encoded in a TableCollection. Tree sequences are immutable, and provide methods for obtaining trees from the sequence. A TableCollection is mutable, and does not have any methods for obtaining trees. The TableCollection class essentially exists to allow the dynamic creation of tree sequences.

class msprime.TableCollection(sequence_length=0)

A collection of mutable tables defining a tree sequence. See the Data model section for definition on the various tables and how they together define a TreeSequence. Arbitrary data can be stored in a TableCollection, but there are certain requirements that must be satisfied for these tables to be interpreted as a tree sequence.

To obtain a TreeSequence instance corresponding to the current state of a TableCollection, please use the tree_sequence() method.

Variables: individuals (IndividualTable) – The individual table. nodes (NodeTable) – The node table. edges (EdgeTable) – The edge table. migrations (MigrationTable) – The migration table. sites (SiteTable) – The site table. mutations (MutationTable) – The mutation table. populations (PopulationTable) – The population table. provenances (ProvenanceTable) – The provenance table. sequence_length (float) – The sequence length defining the coordinate space. file_uuid (str) – The UUID for the file this TableCollection is derived from, or None if not derived from a file.
asdict()

Returns this TableCollection as a dictionary mapping the keys “nodes”, “edges”, etc to their respective table objects.

compute_mutation_parents()

Modifies the tables in place, computing the parent column of the mutation table. For this to work, the node and edge tables must be valid, and the site and mutation tables must be sorted (see TableCollection.sort()). This will produce an error if mutations are not sorted (i.e., if a mutation appears before its mutation parent) unless the two mutations occur on the same branch, in which case there is no way to detect the error.

The parent of a given mutation is the ID of the next mutation encountered traversing the tree upwards from that mutation, or NULL_MUTATION if there is no such mutation.

deduplicate_sites()

Modifies the tables in place, removing entries in the site table with duplicate position (and keeping only the first entry for each site), and renumbering the site column of the mutation table appropriately. This requires the site table to be sorted by position.

simplify(samples=None, filter_zero_mutation_sites=None, reduce_to_site_topology=False, filter_populations=True, filter_individuals=True, filter_sites=True)

Simplifies the tables in place to retain only the information necessary to reconstruct the tree sequence describing the given samples. This will change the ID of the nodes, so that the node samples[k] will have ID k in the result. The resulting NodeTable will have only the first len(samples) individuals marked as samples. The mapping from node IDs in the current set of tables to their equivalent values in the simplified tables is also returned as a numpy array. If an array a is returned by this function and u is the ID of a node in the input table, then a[u] is the ID of this node in the output table. For any node u that is not mapped into the output tables, this mapping will equal -1.

Tables operated on by this function must: be sorted (see TableCollection.sort())), have children be born strictly after their parents, and the intervals on which any individual is a child must be disjoint. Other than this the tables need not satisfy remaining requirements to specify a valid tree sequence (but the resulting tables will).

Please see the TreeSequence.simplify() method for a description of the remaining parameters.

Parameters: samples (list[int]) – A list of node IDs to retain as samples. If not specified or None, use all nodes marked with the IS_SAMPLE flag. filter_zero_mutation_sites (bool) – Deprecated alias for filter_sites. reduce_to_site_topology (bool) – Whether to reduce the topology down to the trees that are present at sites. (default: False). filter_populations (bool) – If True, remove any populations that are not referenced by nodes after simplification; new population IDs are allocated sequentially from zero. If False, the population table will not be altered in any way. (Default: True) filter_individuals (bool) – If True, remove any individuals that are not referenced by nodes after simplification; new individual IDs are allocated sequentially from zero. If False, the individual table will not be altered in any way. (Default: True) filter_sites (bool) – If True, remove any sites that are not referenced by mutations after simplification; new site IDs are allocated sequentially from zero. If False, the site table will not be altered in any way. (Default: True) A numpy array mapping node IDs in the input tables to their corresponding node IDs in the output tables. numpy array (dtype=np.int32)
sort(edge_start=0)

Sorts the tables in place. This ensures that all tree sequence ordering requirements listed in the Valid tree sequence requirements section are met, as long as each site has at most one mutation (see below).

If the edge_start parameter is provided, this specifies the index in the edge table where sorting should start. Only rows with index greater than or equal to edge_start are sorted; rows before this index are not affected. This parameter is provided to allow for efficient sorting when the user knows that the edges up to a given index are already sorted.

The individual, node, population and provenance tables are not affected by this method.

Edges are sorted as follows:

• time of parent, then
• parent node ID, then
• child node ID, then
• left endpoint.

Note that this sorting order exceeds the edge sorting requirements for a valid tree sequence. For a valid tree sequence, we require that all edges for a given parent ID are adjacent, but we do not require that they be listed in sorted order.

Sites are sorted by position, and sites with the same position retain their relative ordering.

Mutations are sorted by site ID, and mutations with the same site retain their relative ordering. This does not currently rearrange tables so that mutations occur after their mutation parents, which is a requirement for valid tree sequences.

Parameters: edge_start (int) – The index in the edge table where sorting starts (default=0; must be <= len(edges)).
tree_sequence()

Returns a TreeSequence instance with the structure defined by the tables in this TableCollection. If the table collection is not in canonical form (i.e., does not meet sorting requirements) or cannot be interpreted as a tree sequence an exception is raised. The sort() method may be used to ensure that input sorting requirements are met.

Returns: A TreeSequence instance reflecting the structures defined in this set of tables. TreeSequence

Table functions¶

msprime.parse_nodes(source, strict=True, encoding='utf8', base64_metadata=True, table=None)

Parse the specified file-like object containing a whitespace delimited description of a node table and returns the corresponding NodeTable instance. See the node text format section for the details of the required format and the node table definition section for the required properties of the contents.

See load_text() for a detailed explanation of the strict parameter.

Parameters: source (stream) – The file-like object containing the text. strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used. encoding (string) – Encoding used for text representation. base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text. table (NodeTable) – If specified write into this table. If not, create a new NodeTable instance.
msprime.parse_edges(source, strict=True, table=None)

Parse the specified file-like object containing a whitespace delimited description of a edge table and returns the corresponding EdgeTable instance. See the edge text format section for the details of the required format and the edge table definition section for the required properties of the contents.

See load_text() for a detailed explanation of the strict parameter.

Parameters: source (stream) – The file-like object containing the text. strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used. table (EdgeTable) – If specified, write the edges into this table. If not, create a new EdgeTable instance and return.
msprime.parse_sites(source, strict=True, encoding='utf8', base64_metadata=True, table=None)

Parse the specified file-like object containing a whitespace delimited description of a site table and returns the corresponding SiteTable instance. See the site text format section for the details of the required format and the site table definition section for the required properties of the contents.

See load_text() for a detailed explanation of the strict parameter.

Parameters: source (stream) – The file-like object containing the text. strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used. encoding (string) – Encoding used for text representation. base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text. table (SiteTable) – If specified write site into this table. If not, create a new SiteTable instance.
msprime.parse_mutations(source, strict=True, encoding='utf8', base64_metadata=True, table=None)

Parse the specified file-like object containing a whitespace delimited description of a mutation table and returns the corresponding MutationTable instance. See the mutation text format section for the details of the required format and the mutation table definition section for the required properties of the contents.

See load_text() for a detailed explanation of the strict parameter.

Parameters: source (stream) – The file-like object containing the text. strict (bool) – If True, require strict tab delimiting (default). If False, a relaxed whitespace splitting algorithm is used. encoding (string) – Encoding used for text representation. base64_metadata (bool) – If True, metadata is encoded using Base64 encoding; otherwise, as plain text. table (MutationTable) – If specified, write mutations into this table. If not, create a new MutationTable instance.
msprime.pack_strings(strings, encoding='utf8')

Packs the specified list of strings into a flattened numpy array of 8 bit integers and corresponding offsets using the specified text encoding. See Encoding ragged columns for details of this encoding of columns of variable length data.

Parameters: data (list[str]) – The list of strings to encode. encoding (str) – The text encoding to use when converting string data to bytes. See the codecs module for information on available string encodings. The tuple (packed, offset) of numpy arrays representing the flattened input data and offsets. numpy.array (dtype=np.int8), numpy.array (dtype=np.uint32)
msprime.unpack_strings(packed, offset, encoding='utf8')

Unpacks a list of strings from the specified numpy arrays of packed byte data and corresponding offsets using the specified text encoding. See Encoding ragged columns for details of this encoding of columns of variable length data.

Parameters: packed (numpy.ndarray) – The flattened array of byte values. offset (numpy.ndarray) – The array of offsets into the packed array. encoding (str) – The text encoding to use when converting string data to bytes. See the codecs module for information on available string encodings. The list of strings unpacked from the parameter arrays. list[str]
msprime.pack_bytes(data)

Packs the specified list of bytes into a flattened numpy array of 8 bit integers and corresponding offsets. See Encoding ragged columns for details of this encoding.

Parameters: data (list[bytes]) – The list of bytes values to encode. The tuple (packed, offset) of numpy arrays representing the flattened input data and offsets. numpy.array (dtype=np.int8), numpy.array (dtype=np.uint32)
msprime.unpack_bytes(packed, offset)

Unpacks a list of bytes from the specified numpy arrays of packed byte data and corresponding offsets. See Encoding ragged columns for details of this encoding.

Parameters: packed (numpy.ndarray) – The flattened array of byte values. offset (numpy.ndarray) – The array of offsets into the packed array. The list of bytes values unpacked from the parameter arrays.

Provenance API¶

We provide some preliminary support for validating JSON documents against the provenance schema. Programmatic access to provenance information is planned for future versions.

msprime.validate_provenance(provenance)

Validates the specified dict-like object against the tskit provenance schema. If the input does not represent a valid instance of the schema an exception is raised.

Parameters: provenance (dict) – The dictionary representing a JSON document to be validated against the schema. ProvenanceValidationError
exception msprime.ProvenanceValidationError

A JSON document did non validate against the provenance schema.