Tutorial

This is the tutorial for the Python interface to the msprime library. Detailed API Documentation is also available for this library. An ms compatible command line interface is also available if you wish to use msprime directly within an existing work flow.

Simulating trees

Running simulations is very straightforward in msprime:

>>> import msprime
>>> tree_sequence = msprime.simulate(sample_size=5, Ne=1000)
>>> tree = tree_sequence.first()
>>> print(tree.draw(format="unicode"))
    8
 ┏━━┻━┓
 ┃    7
 ┃  ┏━┻┓
 6  ┃  ┃
┏┻┓ ┃  ┃
┃ ┃ ┃  5
┃ ┃ ┃ ┏┻┓
2 3 4 0 1

Here, we simulate the coalescent for a sample of size 5 with an effective population size of 1000, and then print out a depiction of the resulting tree. The simulate() function returns a TreeSequence object, which provides a very efficient way to access the correlated trees in simulations involving recombination. In this example we know that there can only be one tree because we have not provided a value for recombination_rate, and it defaults to zero. Therefore, we access the only tree in the sequence using the call tree_sequence.first(). Finally, we draw a simple depiction of the tree to the terminal using the draw() method.

Trees are represented within msprime in a slightly unusual way. In the majority of libraries dealing with trees, each node is represented as an object in memory and the relationship between nodes as pointers between these objects. In msprime, however, nodes are integers. In the tree above, we can see that the leaves of the tree are labelled with 0 to 4, and all the internal nodes of the tree are also integers with the root of the tree being 8.

We can easily trace our path back to the root for a particular sample using the parent() method:

>>> u = 0
>>> while u != msprime.NULL_NODE:
>>>     print("node {}: time = {}".format(u, tree.time(u)))
>>>     u = tree.parent(u)
node 0: time = 0.0
node 5: time = 93.8536165443679
node 7: time = 770.0314897144547
node 8: time = 1950.6090018042169

In this code chunk we iterate up the tree starting at node 0 and stop when we get to the root. We know that a node is a root if its parent is msprime.NULL_NODE, which is a special reserved node. (The value of the null node is -1, but we recommend using the symbolic constant to make code more readable.) We also use the time() method to get the time for each node, which corresponds to the time in generations at which the coalescence event happened during the simulation. We can also obtain the length of a branch joining a node to its parent using the branch_length() method:

>>> print(tree.branch_length(5))
676.1778731700867

The branch length for node 5 is about 676 generations as the time for node 5 is 94 and the time of its parent is around 770 generation. It is also often useful to obtain the total branch length of the tree, i.e., the sum of the lengths of all branches:

>>> print(tree.total_branch_length)
4926.506226955324

Recombination

Simulating the history of a single locus is a very useful, but we are most often interesting in simulating the history of our sample across large genomic regions under the influence of recombination. The msprime API is specifically designed to make this common requirement both easy and efficient. To model genomic sequences under the influence of recombination we have two parameters to the simulate() function. The length parameter specifies the length of the simulated sequence in bases, and may be a floating point number. If length is not supplied, it is assumed to be 1. The recombination_rate parameter specifies the rate of crossing over per base per generation, and is zero by default. See the API Documentation for a discussion of the precise recombination model used.

Here, we simulate the trees across over a 10kb region with a recombination rate of \(2 \times 10^{-8}\) per base per generation, with an effective population size of 1000:

>>> tree_sequence = msprime.simulate(
...     sample_size=5, Ne=1000, length=1e4, recombination_rate=2e-8)
>>> for tree in tree_sequence.trees():
...     print("-" * 20)
...     print("tree {}: interval = {}".format(tree.index, tree.interval))
...     print(tree.draw(format="unicode"))
--------------------
tree 0: interval = (0.0, 6016.224463474058)
    9
 ┏━━┻━┓
 ┃    8
 ┃  ┏━┻┓
 7  ┃  ┃
┏┻┓ ┃  ┃
┃ ┃ ┃  6
┃ ┃ ┃ ┏┻┓
1 2 0 3 4

--------------------
tree 1: interval = (6016.224463474058, 10000.0)
     9
  ┏━━┻━━┓
  8     ┃
┏━┻━┓   ┃
┃   6   ┃
┃ ┏━┻┓  ┃
┃ ┃  5  ┃
┃ ┃ ┏┻┓ ┃
0 4 1 3 2

In this example, we use the trees() method to iterate over the trees in the sequence. For each tree we print out its index (i.e., it’s position in the sequence) and the interval the tree covers (i.e., the genomic coordinates which all share precisely this tree) using the index and interval attributes. Thus, the first tree covers the first 6kb of sequence and the second tree covers the remaining 4kb. We can see that these trees share a great deal of their structure, but that there are also important differences between the trees.

Warning

Do not store the values returned from the trees() iterator in a list and operate on them afterwards! For efficiency reasons msprime uses the same instance of SparseTree for each tree in the sequence and updates the internal state for each new tree. Therefore, if you store the trees returned from the iterator in a list, they will all refer to the same tree.

Mutations

Mutations are generated in msprime by throwing mutations down on the branches of trees at a particular rate. The mutations are generated under the infinite sites model, and so each mutation occurs at a unique (floating point) point position along the genomic interval occupied by a tree. The mutation rate for simulations is specified using the mutation_rate parameter of simulate(). For example, the following chunk simulates 50kb of nonrecombining sequence with a mutation rate of \(2 \times 10^{-8}\) per base per generation:

>>> tree_sequence = msprime.simulate(
...    sample_size=5, Ne=1000, length=50e3, mutation_rate=2e-8, random_seed=30)
>>> tree = tree_sequence.first()
>>> for site in tree.sites():
...     for mutation in site.mutations:
...         print("Mutation @ position {:.2f} over node {}".format(
...             site.position, mutation.node))
Mutation @ position 1179.94 over node 0
Mutation @ position 4485.17 over node 5
Mutation @ position 9788.56 over node 5
Mutation @ position 11949.32 over node 7
Mutation @ position 31454.99 over node 7
Mutation @ position 34832.55 over node 0
Mutation @ position 36747.63 over node 7
Mutation @ position 46692.51 over node 5
>>> print(tree.draw(format="unicode"))
    8
 ┏━━┻━┓
 ┃    7
 ┃  ┏━┻┓
 ┃  ┃  6
 ┃  ┃ ┏┻┓
 5  ┃ ┃ ┃
┏┻┓ ┃ ┃ ┃
0 3 1 2 4

Variants

We are often interesting in accessing the sequence data that results from simulations directly. The most efficient way to do this is by using the TreeSequence.variants() method, which returns an iterator over all the Variant objects arising from the trees and mutations. Each variant contains a reference to the site object, as well as the alleles and the observed sequences for each sample in the genotypes field:

>>> tree_sequence = msprime.simulate(
...     sample_size=20, Ne=1e4, length=5e3, recombination_rate=2e-8,
...     mutation_rate=2e-8, random_seed=10)
>>> for variant in tree_sequence.variants():
...     print(
...         variant.site.id, variant.site.position,
...         variant.alleles, variant.genotypes, sep="\t")
0       2432.768327416852       ('0', '1')      [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
1       2577.6939414924095      ('0', '1')      [1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1]
2       2844.682702049562       ('0', '1')      [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
3       4784.266628557816       ('0', '1')      [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]

In this example we simulate some data and then print out the observed sequences. We loop through each variant and print out the observed state of each sample as an array of zeros and ones, along with the index and position of the corresponding mutation. In this example, the alleles are always '0' (the ancestral state) and '1' (the derived state), because we are simulating simple infinite sites mutations. More complex models are possible, however.

This way of working with the sequence data is quite efficient because we do not need to keep the entire genotype matrix in memory at once. However, if we do want the full genotype matrix it is simple to obtain:

>>> A = tree_sequence.genotype_matrix()
>>> A
array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

In this example, we run the same simulation but this time store entire variant matrix in a two-dimensional numpy array. This is useful for integrating with tools such as scikit allel.

Historical samples

Simulating coalescent histories in which some of the samples are not from the present time is straightforward in msprime. By using the samples argument to msprime.simulate() we can specify the location and time at which all samples are made.

def historical_samples_example():
    samples = [
        msprime.Sample(population=0, time=0),
        msprime.Sample(0, 0),  # Or, we can use positional arguments.
        msprime.Sample(0, 1.0)
    ]
    tree_seq = msprime.simulate(samples=samples)
    tree = tree_seq.first()
    for u in tree.nodes():
        print(u, tree.parent(u), tree.time(u), sep="\t")
    print(tree.draw(format="unicode"))

In this example we create three samples, two taken at the present time and one taken 1.0 generations in the past. There are a number of different ways in which we can describe the samples using the msprime.Sample object (samples can be provided as plain tuples also if more convenient). Running this example, we get:

>>> historical_samples_example()
4       -1      1.3805592647990812
1       4       0.0
3       4       1.052473878566028
0       3       0.0
2       3       1.0
  4
┏━┻┓
┃  3
┃ ┏┻┓
┃ ┃ 2
┃ ┃
1 0

Because nodes 0 and 1 were sampled at time 0, their times in the tree are both 0. Node 2 was sampled at time 1.0, and so its time is recorded as 1.0 in the tree.

Replication

A common task for coalescent simulations is to check the accuracy of analytical approximations to statistics of interest. To do this, we require many independent replicates of a given simulation. msprime provides a simple and efficient API for replication: by providing the num_replicates argument to the simulate() function, we can iterate over the replicates in a straightforward manner. Here is an example where we compare the analytical results for the number of segregating sites with simulations:

import msprime
import numpy as np

def segregating_sites_example(n, theta, num_replicates):
    S = np.zeros(num_replicates)
    replicates = msprime.simulate(
                    Ne=0.5,
        sample_size=n,
        mutation_rate=theta / 2,
        num_replicates=num_replicates)
    for j, tree_sequence in enumerate(replicates):
        S[j] = tree_sequence.num_sites
    # Now, calculate the analytical predictions
    S_mean_a = np.sum(1 / np.arange(1, n)) * theta
    S_var_a = (
        theta * np.sum(1 / np.arange(1, n)) +
        theta**2 * np.sum(1 / np.arange(1, n)**2))
    print("              mean              variance")
    print("Observed      {}\t\t{}".format(np.mean(S), np.var(S)))
    print("Analytical    {:.5f}\t\t{:.5f}".format(S_mean_a, S_var_a))

Running this code, we get:

>>> segregating_sites_example(10, 5, 100000)
              mean              variance
Observed      14.17893          53.0746740551
Analytical    14.14484          52.63903

Note that in this example we set \(N_e = 0.5\) and the mutation rate to \(\theta / 2\) when calling simulate(). This is essential to convert the diploid per-generation time units of msprime into the the haploid coalescent units required for many theoretical results. However, it is important to note that conventions vary widely, and great care is needed with such factor-of-two rescalings.

Population structure

Todo

Update this example to use Ne=0.5 as recommended elsewhere.

Population structure in msprime closely follows the model used in the ms simulator: we have \(N\) demes with an \(N\times N\) matrix describing the migration rates between these subpopulations. The sample sizes, population sizes and growth rates of all demes can be specified independently. Migration rates are specified using a migration matrix. Unlike ms however, all times and rates are specified in generations and all populations sizes are absolute (that is, not multiples of \(N_e\)).

In the following example, we calculate the mean coalescence time for a pair of lineages sampled in different demes in a symmetric island model, and compare this with the analytical expectation.

import msprime
import numpy as np

def migration_example(num_replicates=10**6):
    # M is the overall symmetric migration rate, and d is the number
    # of demes.
    M = 0.2
    d = 3
    m = M / (4 * (d - 1))
    # Allocate the initial sample. Because we are interested in the
    # between deme coalescence times, we choose one sample each
    # from the first two demes.
    population_configurations = [
        msprime.PopulationConfiguration(sample_size=1),
        msprime.PopulationConfiguration(sample_size=1),
        msprime.PopulationConfiguration(sample_size=0)]
    # Now we set up the migration matrix. Since this is a symmetric
    # island model, we have the same rate of migration between all
    # pairs of demes. Diagonal elements must be zero.
    migration_matrix = [
        [0, m, m],
        [m, 0, m],
        [m, m, 0]]
    # We pass these values to the simulate function, and ask it
    # to run the required number of replicates.
    replicates = msprime.simulate(
        population_configurations=population_configurations,
        migration_matrix=migration_matrix,
        num_replicates=num_replicates)
    # And then iterate over these replicates
    T = np.zeros(num_replicates)
    for i, tree_sequence in enumerate(replicates):
        tree = tree_sequence.first()
        T[i] = tree.time(tree.root) / 4
    # Finally, calculate the analytical expectation and print
    # out the results
    analytical = d / 2 + (d - 1) / (2 * M)
    print("Observed  =", np.mean(T))
    print("Predicted =", analytical)

Running this example we get:

>>> migration_example()
Observed  = 6.50638181614
Predicted = 6.5

Demography

Msprime provides a flexible and simple way to model past demographic events in arbitrary combinations. Here is an example describing the Gutenkunst et al. out-of-Africa model. See Figure 2B for a schematic of this model, and Table 1 for the values used.

Todo

Add a diagram of the model for convenience.

import math
def out_of_africa():
    # First we set out the maximum likelihood values of the various parameters
    # given in Table 1.
    N_A = 7300
    N_B = 2100
    N_AF = 12300
    N_EU0 = 1000
    N_AS0 = 510
    # Times are provided in years, so we convert into generations.
    generation_time = 25
    T_AF = 220e3 / generation_time
    T_B = 140e3 / generation_time
    T_EU_AS = 21.2e3 / generation_time
    # We need to work out the starting (diploid) population sizes based on
    # the growth rates provided for these two populations
    r_EU = 0.004
    r_AS = 0.0055
    N_EU = N_EU0 / math.exp(-r_EU * T_EU_AS)
    N_AS = N_AS0 / math.exp(-r_AS * T_EU_AS)
    # Migration rates during the various epochs.
    m_AF_B = 25e-5
    m_AF_EU = 3e-5
    m_AF_AS = 1.9e-5
    m_EU_AS = 9.6e-5
    # Population IDs correspond to their indexes in the population
    # configuration array. Therefore, we have 0=YRI, 1=CEU and 2=CHB
    # initially.
    population_configurations = [
        msprime.PopulationConfiguration(
            sample_size=0, initial_size=N_AF),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_EU, growth_rate=r_EU),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_AS, growth_rate=r_AS)
    ]
    migration_matrix = [
        [      0, m_AF_EU, m_AF_AS],
        [m_AF_EU,       0, m_EU_AS],
        [m_AF_AS, m_EU_AS,       0],
    ]
    demographic_events = [
        # CEU and CHB merge into B with rate changes at T_EU_AS
        msprime.MassMigration(
            time=T_EU_AS, source=2, destination=1, proportion=1.0),
        msprime.MigrationRateChange(time=T_EU_AS, rate=0),
        msprime.MigrationRateChange(
            time=T_EU_AS, rate=m_AF_B, matrix_index=(0, 1)),
        msprime.MigrationRateChange(
            time=T_EU_AS, rate=m_AF_B, matrix_index=(1, 0)),
        msprime.PopulationParametersChange(
            time=T_EU_AS, initial_size=N_B, growth_rate=0, population_id=1),
        # Population B merges into YRI at T_B
        msprime.MassMigration(
            time=T_B, source=1, destination=0, proportion=1.0),
        # Size changes to N_A at T_AF
        msprime.PopulationParametersChange(
            time=T_AF, initial_size=N_A, population_id=0)
    ]
    # Use the demography debugger to print out the demographic history
    # that we have just described.
    dd = msprime.DemographyDebugger(
        population_configurations=population_configurations,
        migration_matrix=migration_matrix,
        demographic_events=demographic_events)
    dd.print_history()

The DemographyDebugger provides a method to debug the history that you have described so that you can be sure that the migration rates, population sizes and growth rates are all as you intend during each epoch:

=============================
Epoch: 0 -- 848.0 generations
=============================
     start     end      growth_rate |     0        1        2
   -------- --------       -------- | -------- -------- --------
0 |1.23e+04 1.23e+04              0 |     0      3e-05   1.9e-05
1 |2.97e+04   1e+03           0.004 |   3e-05      0     9.6e-05
2 |5.41e+04    510           0.0055 |  1.9e-05  9.6e-05     0

Events @ generation 848.0
   - Mass migration: lineages move from 2 to 1 with probability 1.0
   - Migration rate change to 0 everywhere
   - Migration rate change for (0, 1) to 0.00025
   - Migration rate change for (1, 0) to 0.00025
   - Population parameter change for 1: initial_size -> 2100 growth_rate -> 0
==================================
Epoch: 848.0 -- 5600.0 generations
==================================
     start     end      growth_rate |     0        1        2
   -------- --------       -------- | -------- -------- --------
0 |1.23e+04 1.23e+04              0 |     0     0.00025     0
1 | 2.1e+03  2.1e+03              0 |  0.00025     0        0
2 |   510   2.27e-09         0.0055 |     0        0        0

Events @ generation 5600.0
   - Mass migration: lineages move from 1 to 0 with probability 1.0
===================================
Epoch: 5600.0 -- 8800.0 generations
===================================
     start     end      growth_rate |     0        1        2
   -------- --------       -------- | -------- -------- --------
0 |1.23e+04 1.23e+04              0 |     0     0.00025     0
1 | 2.1e+03  2.1e+03              0 |  0.00025     0        0
2 |2.27e-09 5.17e-17         0.0055 |     0        0        0

Events @ generation 8800.0
   - Population parameter change for 0: initial_size -> 7300
================================
Epoch: 8800.0 -- inf generations
================================
     start     end      growth_rate |     0        1        2
   -------- --------       -------- | -------- -------- --------
0 | 7.3e+03  7.3e+03              0 |     0     0.00025     0
1 | 2.1e+03  2.1e+03              0 |  0.00025     0        0
2 |5.17e-17     0            0.0055 |     0        0        0

Warning

The output of the DemographyDebugger.print_history() method is intended only for debugging purposes, and is not meant to be machine readable. The format is also preliminary; if there is other information that you think would be useful, please open an issue on GitHub

Once you are satisfied that the demographic history that you have built is correct, it can then be simulated by calling the simulate() function.

Recombination maps

The msprime API allows us to quickly and easily simulate data from an arbitrary recombination map. In this example we read a recombination map for human chromosome 22, and simulate a single replicate. After the simulation is completed, we plot histograms of the recombination rates and the simulated breakpoints. These show that density of breakpoints follows the recombination rate closely.

import numpy as np
import scipy.stats
import matplotlib.pyplot as pyplot

def variable_recomb_example():
    infile = "hapmap/genetic_map_GRCh37_chr22.txt"
    # Read in the recombination map using the read_hapmap method,
    recomb_map = msprime.RecombinationMap.read_hapmap(infile)

    # Now we get the positions and rates from the recombination
    # map and plot these using 500 bins.
    positions = np.array(recomb_map.get_positions()[1:])
    rates = np.array(recomb_map.get_rates()[1:])
    num_bins = 500
    v, bin_edges, _ = scipy.stats.binned_statistic(
        positions, rates, bins=num_bins)
    x = bin_edges[:-1][np.logical_not(np.isnan(v))]
    y = v[np.logical_not(np.isnan(v))]
    fig, ax1 = pyplot.subplots(figsize=(16, 6))
    ax1.plot(x, y, color="blue")
    ax1.set_ylabel("Recombination rate")
    ax1.set_xlabel("Chromosome position")

    # Now we run the simulation for this map. We assume Ne=10^4
    # and have a sample of 100 individuals
    tree_sequence = msprime.simulate(
        sample_size=100,
        Ne=10**4,
        recombination_map=recomb_map)
    # Now plot the density of breakpoints along the chromosome
    breakpoints = np.array(list(tree_sequence.breakpoints()))
    ax2 = ax1.twinx()
    v, bin_edges = np.histogram(breakpoints, num_bins, density=True)
    ax2.plot(bin_edges[:-1], v, color="green")
    ax2.set_ylabel("Breakpoint density")
    ax2.set_xlim(1.5e7, 5.3e7)
    fig.savefig("hapmap_chr22.svg")
Density of breakpoints along the chromosome.

Calculating LD

The msprime API provides methods to efficiently calculate population genetics statistics. For example, the LdCalculator class allows us to compute pairwise linkage disequilibrium coefficients. Here we use the get_r2_matrix() method to easily make an LD plot using matplotlib. (Thanks to the excellent scikit-allel for the basic plotting code used here.)

import msprime
import matplotlib.pyplot as pyplot

def ld_matrix_example():
    ts = msprime.simulate(100, recombination_rate=10, mutation_rate=20,
            random_seed=1)
    ld_calc = msprime.LdCalculator(ts)
    A = ld_calc.r2_matrix()
    # Now plot this matrix.
    x = A.shape[0] / pyplot.rcParams['figure.dpi']
    x = max(x, pyplot.rcParams['figure.figsize'][0])
    fig, ax = pyplot.subplots(figsize=(x, x))
    fig.tight_layout(pad=0)
    im = ax.imshow(A, interpolation="none", vmin=0, vmax=1, cmap="Blues")
    ax.set_xticks([])
    ax.set_yticks([])
    for s in 'top', 'bottom', 'left', 'right':
        ax.spines[s].set_visible(False)
    pyplot.gcf().colorbar(im, shrink=.5, pad=0)
    pyplot.savefig("ld.svg")
An example LD matrix plot.

Working with threads

When performing large calculations it’s often useful to split the work over multiple processes or threads. The msprime API can be used without issues across multiple processes, and the Python multiprocessing module often provides a very effective way to work with many replicate simulations in parallel.

When we wish to work with a single very large dataset, however, threads can offer better resource usage because of the shared memory space. The Python threading library gives a very simple interface to lightweight CPU threads and allows us to perform several CPU intensive tasks in parallel. The msprime API is designed to allow multiple threads to work in parallel when CPU intensive tasks are being undertaken.

Note

In the CPython implementation the Global Interpreter Lock ensures that only one thread executes Python bytecode at one time. This means that Python code does not parallelise well across threads, but avoids a large number of nasty pitfalls associated with multiple threads updating data structures in parallel. Native C extensions like numpy and msprime release the GIL while expensive tasks are being performed, therefore allowing these calculations to proceed in parallel.

In the following example we wish to find all mutations that are in approximate LD (\(r^2 \geq 0.5\)) with a given set of mutations. We parallelise this by splitting the input array between a number of threads, and use the LdCalculator.r2_array() method to compute the \(r^2\) value both up and downstream of each focal mutation, filter out those that exceed our threshold, and store the results in a dictionary. We also use the very cool tqdm module to give us a progress bar on this computation.

import threading
import numpy as np
import tqdm
import msprime

def find_ld_sites(
        tree_sequence, focal_mutations, max_distance=1e6, r2_threshold=0.5,
        num_threads=8):
    results = {}
    progress_bar = tqdm.tqdm(total=len(focal_mutations))
    num_threads = min(num_threads, len(focal_mutations))

    def thread_worker(thread_index):
        ld_calc = msprime.LdCalculator(tree_sequence)
        chunk_size = int(math.ceil(len(focal_mutations) / num_threads))
        start = thread_index * chunk_size
        for focal_mutation in focal_mutations[start: start + chunk_size]:
            a = ld_calc.r2_array(
                focal_mutation, max_distance=max_distance,
                direction=msprime.REVERSE)
            rev_indexes = focal_mutation - np.nonzero(a >= r2_threshold)[0] - 1
            a = ld_calc.r2_array(
                focal_mutation, max_distance=max_distance,
                direction=msprime.FORWARD)
            fwd_indexes = focal_mutation + np.nonzero(a >= r2_threshold)[0] + 1
            indexes = np.concatenate((rev_indexes[::-1], fwd_indexes))
            results[focal_mutation] = indexes
            progress_bar.update()

    threads = [
        threading.Thread(target=thread_worker, args=(j,))
        for j in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    progress_bar.close()
    return results

def threads_example():
    ts = msprime.simulate(
        sample_size=1000, Ne=1e4, length=1e7, recombination_rate=2e-8,
        mutation_rate=2e-8)
    counts = np.zeros(ts.num_sites)
    for tree in ts.trees():
        for site in tree.sites():
            assert len(site.mutations) == 1
            mutation = site.mutations[0]
            counts[site.id] = tree.get_num_samples(mutation.node)
    doubletons = np.nonzero(counts == 2)[0]
    results = find_ld_sites(ts, doubletons, num_threads=8)
    print(
        "Found LD sites for", len(results), "doubleton sites out of",
        ts.num_sites)

In this example, we first simulate 1000 samples of 10 megabases and find all doubleton mutations in the resulting tree sequence. We then call the find_ld_sites() function to find all mutations that are within 1 megabase of these doubletons and have an \(r^2\) statistic of greater than 0.5.

The find_ld_sites() function performs these calculations in parallel using 8 threads. The real work is done in the nested thread_worker() function, which is called once by each thread. In the thread worker, we first allocate an instance of the LdCalculator class. (It is critically important that each thread has its own instance of LdCalculator, as the threads will not work efficiently otherwise.) After this, each thread works out the slice of the input array that it is responsible for, and then iterates over each focal mutation in turn. After the \(r^2\) values have been calculated, we then find the indexes of the mutations corresponding to values greater than 0.5 using numpy.nonzero(). Finally, the thread stores the resulting array of mutation indexes in the results dictionary, and moves on to the next focal mutation.

Running this example we get:

>>> threads_example()
100%|████████████████████████████████████████████████| 4045/4045 [00:09<00:00, 440.29it/s]
Found LD sites for 4045 doubleton mutations out of 60100

Editing tree sequences

Sometimes we wish to make some minor modifications to a tree sequence that has been generated by a simulation. However, tree sequence objects are immutable and so we cannot edit a them in place. The answer is to use the Tables API: we export the tree sequence to a set of tables, edit these tables, and then create a new tree sequence from them. In the following example, we use this approach to remove all singleton sites from a given tree sequence.

def strip_singletons(ts):
    sites = msprime.SiteTable()
    mutations = msprime.MutationTable()
    for tree in ts.trees():
        for site in tree.sites():
            assert len(site.mutations) == 1  # Only supports infinite sites muts.
            mut = site.mutations[0]
            if tree.num_samples(mut.node) > 1:
                site_id = sites.add_row(
                    position=site.position,
                    ancestral_state=site.ancestral_state)
                mutations.add_row(
                    site=site_id, node=mut.node, derived_state=mut.derived_state)
    tables = ts.dump_tables()
    new_ts = msprime.load_tables(
        nodes=tables.nodes, edges=tables.edges, sites=sites, mutations=mutations)
    return new_ts

This function takes a tree sequence containing some infinite sites mutations as input, and returns a copy in which all singleton sites have been removed. The approach is very simple: we allocate SiteTable and MutationTable instances to hold the new sites and mutations that we define, and then consider each site in turn. If the allele frequency of the mutation is greater than one, we add the site and mutation to our output tables using SiteTable.add_row() and MutationTable.add_row(). (In this case we consider only simple infinite sites mutations, where we cannot have back or recurrent mutations. These would require a slightly more involved approach where we keep a map of mutation IDs so that mutation parent values could be computed.)

After considering each site, we then create a new tree sequence using load_tables() using the node and edge tables from the original tree sequence and the just-created site and mutation tables. Using this function then, we get:

>>> ts = msprime.simulate(10, mutation_rate=10)
>>> ts.num_sites
50
>>> ts_new = strip_singletons(ts)
>>> ts_new.num_sites
44
>>>

Thus, we have removed 6 singleton sites from the tree sequence.

Todo

Add another example here where we use the array oriented API to edit the nodes and edges of a tree sequence. Perhaps decapitating would be a good example?

Working with Tables

Todo

This section is a work in progress and needs to be revised.

Tables provide a convenient method for viewing, importing and exporting tree sequences. msprime provides direct access to the the columns of a table as numpy arrays: for instance, if n is a NodeTable, then n.time will return an array containing the birth times of the individuals in the table. However, it is important to note that this is not a shallow copy: modifying n.time will not change the node table n. This may change in the future, but currently there are two ways to modify tables: .add_row() and .set_columns() (and also .clear(), which empties the table).

The example node table above would be constructed using .add_row() as follows:

n = msprime.NodeTable()
sv = [True, True, True, False, False, False, False]
tv = [0.0, 0.0, 0.0, 0.4, 0.5, 0.7, 1.0]
pv = [0, 0, 0, 0, 0, 0, 0]
for s, t, p in zip(sv, tv, pv):
    n.add_row(flags=s, population=p, time=t)

print(n)

The .add_row() method is natural (and should be reasonably efficient) if new records appear one-by-one. In the example above it would have been more natural to use .set_columns():

n = msprime.NodeTable()
n.set_columns(flags=sv, population=pv, time=tv)

Finally, here is an example where we add 1.4 to every time except the first in the NodeTable constructed above using numpy indexing:

fn = n.flags
pn = n.population
tn = n.time
tn[1:] = tn[1:] + 1.4
n.set_columns(flags=fn, population=pn, time=tn)

Here is an example. Consider the following sequence of trees:

time
----
1.0                6
0.7               / \                                       5
                 /   x                                     / \
0.5             /     4                 4                 /   4
               /     / \               / x               /   / \
0.4           /     /   \             /   3             /   /   \
             /     /     \           /   / \           /   /     \
            /     /       \         /   /   x         /   /       \
           /     /         \       /   /     \       /   /         \
0.0       0     1           2     1   0       2     0   1           2

position 0.0                  0.2               0.8                1.0

First, we specify the nodes:

NodeTable:

id      is_sample    population   time
0       1            0            0
1       1            0            0
2       1            0            0
3       0            0            0.4
4       0            0            0.5
5       0            0            0.7
6       0            0            1.0

Importantly, the first column, id, is not actually recorded, and is only shown when printing out node tables (as here) for convenience. This has three samples: nodes 0, 1, and 2, and lists their birth times. Then, we specify the edges:

EdgeTable:

left    right   parent  children
0.2     0.8     3       0
0.2     0.8     3       2
0.0     0.2     4       1
0.0     0.2     4       2
0.2     0.8     4       1
0.2     0.8     4       3
0.8     1.0     4       1
0.8     1.0     4       2
0.8     1.0     5       0
0.8     1.0     5       4
0.0     0.2     6       0
0.0     0.2     6       4

Since node 3 is most recent, the edge that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edges from node 4: there are three of these, for each of the three genomic intervals over which node 4 is ancestor to a distinct set of nodes. At this point, we know the full tree on the middle interval. Finally, edges specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.

In the depiction above, x denotes mutations. Suppose that the first mutation occurs at position 0.1 and the mutations in the second tree both occurred at the same position, at 0.5 (with a back mutation). The positions are recorded in the sites table:

SiteTable:

id  position        ancestral_state
0   0.1             0
1   0.5             0

As with node tables, the id column is not actually recorded, but is implied by the position in the table. The acutal mutations are then recorded:

MutationTable:

site        node    derived_state
0       4       1
1       3       1
1       2       0

This would then result in the following (two-locus) haplotypes for the three samples:

sample  haplotype
------  ---------
0       01
1       10
2       11