Tutorial

This is the tutorial for the Python interface to the msprime library. Detailed API Documentation is also available for this library. An ms compatible command line interface is also available if you wish to use msprime directly within an existing work flow. Please see the tskit documentation for more information on how to use the tskit Python API to analyse simulation results.

If this is your first time using msprime, head to the Getting started section to learn how to specify basic models of recombination and mutation. Next, the Demography section will show you how to add demographic models and population structure into your simulations. The Advanced features section covers more specific features of msprime, including ancient sampling, hybrid forward-backward histories and full ARG recording.

Getting started

Tree sequences - the data structure

msprime outputs simulated datasets in the tree sequence format. This is an encoding of a complete genealogy for a sample of chromosomes at each chromosomal location. They offer a few benefits to population geneticists compared with traditional genetic file formats:

  • They can store large simulated datasets extremely compactly. (Often many times smaller than VCFs for real-sized datasets!)
  • As they hold rich detail about the history of the sample, many important processes can be observed directly from the tree structure. So a tree sequence is often more informative than raw genotype/haplotype data, even though it is also more compact.
  • They can be queried and modified extremely quickly. This enables quick calculation of many important population statistics.

If you’ve never dealt with tree sequences before, you may wish to read through some of the documentation for tskit, a Python package with a bunch of useful tools for working with tree sequences.

A basic example

Running simulations is very straightforward in msprime. Here, we simulate the coalescent for a sample of size six with an effective population size of 1000 diploids:

import msprime
ts = msprime.simulate(sample_size=6, Ne=1000)

The msprime library uses tskit to return the simulation result as a tskit.TreeSequence object. This provides a very efficient way to access the correlated trees in simulations involving recombination. Using the tskit.TreeSequence.num_trees attribute, we can see that there is only a single tree in our simulated tree sequence:

print("Number of trees in tree sequence:", ts.num_trees)
# 1

This is because we have not yet provided a value for the recombination rate, and it defaults to zero.

We can access this tree using the first() method, and can draw a simple depiction of the tree to the terminal using the draw() method:

tree = ts.first()
print(tree.draw(format="unicode"))

#     10
#  ┏━━┻━┓
#  ┃    9
#  ┃  ┏━┻━┓
#  8  ┃   ┃
# ┏┻┓ ┃   ┃
# ┃ ┃ ┃   7
# ┃ ┃ ┃ ┏━┻┓
# ┃ ┃ ┃ ┃  6
# ┃ ┃ ┃ ┃ ┏┻┓
# 3 5 0 4 1 2

Random seeds

In general, running the same msprime commands multiple times will produce different outputs. To ensure the same output every time, you can specify a random seed using the random_seed argument.

ts = msprime.simulate(sample_size = 6, random_seed = 184)

Sequence length

Because we haven’t yet specified a sequence length, our simulated sequence will have length 1:

ts.sequence_length
# 1.0

It is usually most convenient to set the sequence length to be the number of nucleotide bases in the desired simulated sequence. We use the length argument to specify this:

ts = msprime.simulate(sample_size = 6, random_seed = 1, length = 1000)
print(ts.sequence_length)
# 1000.0

Effective population size

Recall that each tree sequence has an equivalent representation as a set of tables. Let’s have a look at one of these tables now:

print(ts.tables.nodes)

# id  flags   population  individual  time    metadata
# 0   1   0   -1  0.00000000000000
# 1   1   0   -1  0.00000000000000
# 2   1   0   -1  0.00000000000000
# 3   1   0   -1  0.00000000000000
# 4   1   0   -1  0.00000000000000
# 5   1   0   -1  0.00000000000000
# 6   0   0   -1  0.07194744353492
# 7   0   0   -1  0.61124301112428
# 8   0   0   -1  0.73124726040958
# 9   0   0   -1  0.91078323219376
# 10  0   0   -1  1.32301250012150

The first six nodes with time=0.0 correspond to the samples. All other nodes correspond to ancestors of the samples, and so have positive times.

Note

The ‘time’ of a node records how long ago the node was born. Since msprime is a coalescent simulator, it looks “backwards in time”, i.e., all ‘time’ attributes in msprime are measured in units of time ago.

The reason why the node times in our simple example are so small is because, by default, msprime assumes a constant (diploid) effective population size of Ne = 1, which is equivalent to measuring time in units of Ne generations.

While this scaling can be useful when comparing simulations against analytic results from coalescent theory, it’s often simpler to think of time in units of generations backwards-in-time. We can do this by specifying our desired effective population size using the Ne input into simulate:

ts = msprime.simulate(sample_size = 6, random_seed = 1, Ne = 10000)
print(ts.tables.nodes)

# id  flags   population  individual  time    metadata
# 0   1   0   -1  0.00000000000000
# 1   1   0   -1  0.00000000000000
# 2   1   0   -1  0.00000000000000
# 3   1   0   -1  0.00000000000000
# 4   1   0   -1  0.00000000000000
# 5   1   0   -1  0.00000000000000
# 6   0   0   -1  719.47443534915067
# 7   0   0   -1  6112.43011124283566
# 8   0   0   -1  7312.47260409581213
# 9   0   0   -1  9107.83232193760159
# 10  0   0   -1  13230.12500121500307

Recall that under the coalescent model, each simulated ancestral node represents a coalescence event at which two lineages converge. These coalescences should occur less frequently in a larger population. As expected, rescaling our effective population size has also rescaled our coalescence times by the same factor!

Hopefully, you can already see that simulations with msprime can help us clarify our intuition about how the coalescent model works.

Recombination

Simulating the history of a single locus is a very useful, but we are most often interested in simulating the history of our sample across large genomic regions that are under the influence of recombination. The msprime API is specifically designed to make this easy and efficient, and supports both uniform and variable models of recombination.

By default, recombination in msprime is simulated under an infinite sites model. The sequence_length parameter is a floating point number, so recombination (and mutation) can occur at any location along the sequence. To learn how to use a finite sites model instead, see the Finite-site recombination section.

Uniform recombination

To simulate with a uniform recombination rate, we specify two extra inputs to msprime.simulate(): a sequence_length, usually specified as a number of bases, and a recombination_rate, specified as the rate of crossovers per unit of length per generation.

Here, we simulate a tree sequence across over a 10kb region with a recombination rate of \(2 \times 10^{-8}\) per base per generation, and with a diploid effective population size of 1000:

ts = msprime.simulate(
    sample_size=6, Ne=1000, length=1e4, recombination_rate=2e-8)

We’ll then use the tskit.TreeSequence.trees() method to iterate over the trees in the sequence. For each tree we print out its index (i.e., its position in the sequence) and the interval the tree covers (i.e., the genomic coordinates which all share precisely this tree) using the index and interval attributes:

for tree in ts.trees():
    print("-" * 20)
    print("tree {}: interval = {}".format(tree.index, tree.interval))
    print(tree.draw(format="unicode"))

# --------------------
# tree 0: interval = (0.0,  6016.224463474058)
#    11
# ┏━━┻━━┓
# ┃     10
# ┃  ┏━━┻━┓
# ┃  ┃    9
# ┃  ┃  ┏━┻┓
# ┃  7  ┃  ┃
# ┃ ┏┻┓ ┃  ┃
# ┃ ┃ ┃ ┃  6
# ┃ ┃ ┃ ┃ ┏┻┓
# 3 0 1 2 4 5
#
# --------------------
# tree 1: interval = (6016.224463474058, 10000.0)
#      10
#   ┏━━┻━━┓
#   9     ┃
# ┏━┻┓    ┃
# ┃  ┃    8
# ┃  ┃  ┏━┻┓
# ┃  ┃  ┃  7
# ┃  ┃  ┃ ┏┻┓
# ┃  6  ┃ ┃ ┃
# ┃ ┏┻┓ ┃ ┃ ┃
# 2 4 5 3 0 1

Thus, the first tree covers the first 6kb of sequence and the second tree covers the remaining 4kb. We can see that these trees share a great deal of their structure, but that there are also important differences between the trees.

Warning

Do not store the values returned from the trees() iterator in a list and operate on them afterwards! For efficiency reasons tskit uses the same instance of tskit.Tree for each tree in the sequence and updates the internal state for each new tree. Therefore, if you store the trees returned from the iterator in a list, they will all refer to the same tree.

Non-uniform recombination

The msprime API allows us to quickly and easily simulate data from an arbitrary recombination map. To do this, we can specify an external recombination map as a msprime.RecombinationMap() object. We need to supply a list of positions in the map, and a list showing rates of crossover between each specified position.

In the example below, we specify a recombination map with distinct recombination rates between each 100th base.

# Making a simple RecombinationMap object.
map_positions = [i*100 for i in range(0, 11)]
map_rates = [0, 1e-4, 5e-4, 1e-4, 0, 0, 0, 5e-4, 6e-4, 1e-4, 0]
my_map = msprime.RecombinationMap(map_positions, map_rates)
# Simulating with the recombination map.
ts = msprime.simulate(sample_size = 6, random_seed = 12, recombination_map = my_map)

The resulting tree sequence has no interval breakpoints between positions 400 and 700, as our recombination map specified a crossover rate of 0 between these positions.

for tree in ts.trees():
    print("-" * 20)
    print("tree {}: interval = {}".format(tree.index, tree.interval))
    print(tree.draw(format="unicode"))

# --------------------
# tree 0: interval = (0.0, 249.0639823488891)
#    11
#  ┏━━┻━━┓
#  ┃     9
#  ┃   ┏━┻━┓
#  8   ┃   ┃
# ┏┻┓  ┃   ┃
# ┃ ┃  ┃   7
# ┃ ┃  ┃  ┏┻┓
# ┃ ┃  6  ┃ ┃
# ┃ ┃ ┏┻┓ ┃ ┃
# 2 5 0 1 3 4
#
# --------------------
# tree 1: interval = (249.0639823488891, 849.2285335049714)
#    12
# ┏━━━┻━━━┓
# ┃      11
# ┃    ┏━━┻━┓
# ┃    9    ┃
# ┃  ┏━┻━┓  ┃
# ┃  ┃   7  ┃
# ┃  ┃  ┏┻┓ ┃
# ┃  6  ┃ ┃ ┃
# ┃ ┏┻┓ ┃ ┃ ┃
# 5 0 1 3 4 2
#
# --------------------
# tree 2: interval = (849.2285335049714, 1000.0)
#   12
# ┏━━┻━━┓
# ┃    11
# ┃  ┏━━┻━┓
# ┃  ┃   10
# ┃  ┃  ┏━┻┓
# ┃  ┃  ┃  7
# ┃  ┃  ┃ ┏┻┓
# ┃  6  ┃ ┃ ┃
# ┃ ┏┻┓ ┃ ┃ ┃
# 5 0 1 2 3 4

A more advanced example is included below. In this example we read a recombination map for human chromosome 22, and simulate a single replicate. After the simulation is completed, we plot histograms of the recombination rates and the simulated breakpoints. These show that density of breakpoints follows the recombination rate closely.

import numpy as np
import scipy.stats
import matplotlib.pyplot as pyplot

def variable_recomb_example():
    infile = "hapmap/genetic_map_GRCh37_chr22.txt"
    # Read in the recombination map using the read_hapmap method,
    recomb_map = msprime.RecombinationMap.read_hapmap(infile)

    # Now we get the positions and rates from the recombination
    # map and plot these using 500 bins.
    positions = np.array(recomb_map.get_positions()[1:])
    rates = np.array(recomb_map.get_rates()[1:])
    num_bins = 500
    v, bin_edges, _ = scipy.stats.binned_statistic(
        positions, rates, bins=num_bins)
    x = bin_edges[:-1][np.logical_not(np.isnan(v))]
    y = v[np.logical_not(np.isnan(v))]
    fig, ax1 = pyplot.subplots(figsize=(16, 6))
    ax1.plot(x, y, color="blue")
    ax1.set_ylabel("Recombination rate")
    ax1.set_xlabel("Chromosome position")

    # Now we run the simulation for this map. We simulate
    # 50 diploids (100 sampled genomes) in a population with Ne=10^4.
    tree_sequence = msprime.simulate(
        sample_size=100,
        Ne=10**4,
        recombination_map=recomb_map)
    # Now plot the density of breakpoints along the chromosome
    breakpoints = np.array(list(tree_sequence.breakpoints()))
    ax2 = ax1.twinx()
    v, bin_edges = np.histogram(breakpoints, num_bins, density=True)
    ax2.plot(bin_edges[:-1], v, color="green")
    ax2.set_ylabel("Breakpoint density")
    ax2.set_xlim(1.5e7, 5.3e7)
    fig.savefig("hapmap_chr22.svg")
Density of breakpoints along the chromosome.

Finite-site recombination

Todo

Add this.

Mutations

Mutations are generated in msprime by throwing mutations down on the branches of trees at a particular rate.

Infinite sites mutations

By default, the mutations are generated under the infinite sites model, and so each mutation occurs at a unique (floating point) point position along the genomic interval occupied by a tree. The mutation rate for simulations is specified using the mutation_rate parameter of simulate(). For example, the following chunk simulates 50kb of nonrecombining sequence with a mutation rate of \(1 \times 10^{-8}\) per base per generation:

ts = msprime.simulate(
   sample_size=6, Ne=1000, length=50e3, mutation_rate=1e-8, random_seed=30)
tree = ts.first()
for site in tree.sites():
    for mutation in site.mutations:
        print("Mutation @ position {:.2f} over node {}".format(
            site.position, mutation.node))

# Mutation @ position 1556.54 over node 9
# Mutation @ position 4485.17 over node 6
# Mutation @ position 9788.56 over node 6
# Mutation @ position 11759.03 over node 6
# Mutation @ position 11949.32 over node 6
# Mutation @ position 14321.77 over node 9
# Mutation @ position 31454.99 over node 6
# Mutation @ position 45125.69 over node 9
# Mutation @ position 49709.68 over node 6

print(tree.draw(format="unicode"))

#     10
#  ┏━━┻━━┓
#  ┃     9
#  ┃   ┏━┻━┓
#  ┃   ┃   8
#  ┃   ┃  ┏┻┓
#  ┃   7  ┃ ┃
#  ┃  ┏┻┓ ┃ ┃
#  6  ┃ ┃ ┃ ┃
# ┏┻┓ ┃ ┃ ┃ ┃
# 0 4 2 5 1 3

It is also possible to add mutations to an existing tree sequence using the msprime.mutate() function.

Finite sites mutations

Todo

Add details about simulating mutations under a finite sites model.

Variants

We are often interesting in accessing the sequence data that results from simulations directly. The most efficient way to do this is by using the tskit.TreeSequence.variants() method, which returns an iterator over all the tskit.Variant objects arising from the trees and mutations. Each variant contains a reference to the site object, as well as the alleles and the observed sequences for each sample in the genotypes field.

In the following example we loop through each variant in a simulated dataset. We print the observed state of each sample, along with the index and position of the corresponding mutation:

ts = msprime.simulate(
    sample_size=20, Ne=1e4, length=5e3, recombination_rate=2e-8,
    mutation_rate=2e-8, random_seed=10)
for variant in ts.variants():
    print(
        variant.site.id, variant.site.position,
        variant.alleles, variant.genotypes, sep="\t")

# 0       2432.768327416852       ('0', '1')      [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
# 1       2577.6939414924095      ('0', '1')      [1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1]
# 2       2844.682702049562       ('0', '1')      [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
# 3       4784.266628557816       ('0', '1')      [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]

Note that variant.alleles[variant.genotypes[j]] gives the allele of sample ID j at variant variant. In this example, the alleles are always '0' (the ancestral state) and '1' (the derived state), because we are simulating with the infinite sites mutation model, in which each mutation occurs at a unique position in the genome. More complex models are possible, however.

This way of working with the sequence data is quite efficient because we do not need to store all the sample genotypes at all variant sites in memory at once. However, if we do want the full genotype matrix as a numpy array, it is simple to obtain:

A = ts.genotype_matrix()
A

# array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#        [1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
#        [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
#          [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

This is useful for integrating with tools such as scikit allel, but note that what we call a genotype matrix corresponds to a scikit-allel haplotype array.

Warning

Beware that this matrix might be very big (bigger than the tree sequence it’s extracted from, in most realistically-sized simulations!)

Demography

So far, we’ve been simulating samples from a single population of a constant size, which isn’t particularly exciting! One of the strengths of msprime is that it can be used to specify quite complicated models of demography and population history with a simple Python API.

Note

A lot of the material in this section was first presented in a workshop at the SMBE Speciation meeting in June 2019. You can download the exercises as a standalone Jupyter notebook over here, or run the exercises in an online binder session by following the instructions at the bottom of this page.

Population structure

msprime supports simulation from multiple discrete populations, each of which is initialized with a msprime.PopulationConfiguration() object. For each population, you can specify a sample size, an effective population size at time = 0 and an exponential growth rate.

Note

Population structure in msprime closely follows the model used in the ms simulator. Unlike ms however, all times and rates are specified in generations and all populations sizes are absolute (that is, not multiples of \(N_e\)).

Suppose we wanted to simulate three sequences each from two populations with a constant effective population size of 500.

pop0 = msprime.PopulationConfiguration(sample_size=3, initial_size = 500)
pop1 = msprime.PopulationConfiguration(sample_size=3, initial_size = 500)

You can give these to msprime.simulate() as a list using the population_configurations argument. (Note that we no longer need to specify Ne as we have provided a separate size for each population).

# ts = msprime.simulate(population_configurations = [pop0, pop1],
#       random_seed = 12, length = 1000, recombination_rate = 1e-4,
#       mutation_rate = 7e-4)

However, this simulation will run forever unless we also specify some migration between the groups! To understand why, recall that msprime is a coalescent-based simulator. The simulation will run backwards-in-time, simulating until all samples have coalesced to a single common ancestor at each genomic location. However, with no migration between our two populations, samples in one population will never coalesce with samples in another population. To fix this, let’s add some migration events to the specific demographic history.

Migrations

With msprime, you can specify continual rates of migrations between populations, as well as one-off mass migrations.

Constant migration

2 populations with constant migration rate.

Migration rates between the populations can be specified as the elements of an N by N numpy array, and given to msprime.simulate() via the migration_matrix argument. The diagonal elements of this array must each be 0, and the (i, j) th element specifies the fraction of population i that consists of new migrants from population j in each generation.

For instance, the following migration matrix specifies that in each generation, 5% of population 0 consists of migrants from population 1, and 2% of population 1 consists of migrants from population 0.

M = np.array([
[0, 0.05],
[0.02, 0]])

ts = msprime.simulate(
        population_configurations = [pop0, pop1],
        migration_matrix = M,
        length = 1000,
        random_seed = 17,
        recombination_rate = 1e-7)

One consequence of specifying msprime.PopulationConfiguration() objects is that each of the simulated nodes will now belong to one of our specified populations:

print(ts.tables.nodes)

# id  flags   population  individual  time    metadata
# 0   1   0   -1  0.00000000000000
# 1   1   0   -1  0.00000000000000
# 2   1   0   -1  0.00000000000000
# 3   1   1   -1  0.00000000000000
# 4   1   1   -1  0.00000000000000
# 5   1   1   -1  0.00000000000000
# 6   0   0   -1  11.88714489632197
# 7   0   1   -1  224.72850970133027
# 8   0   1   -1  471.21813561520798
# 9   0   1   -1  539.93458624531195
# 10  0   1   -1  1723.16029992759240
# 11  0   1   -1  3813.34990584180423

Notice that the population column of the node table now contains values of 0 and 1. If you are working in a Jupyter notebook, you can draw the tree sequence with nodes coloured by population label using SVG:

from IPython.display import SVG

colour_map = {0:"red", 1:"blue"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
for tree in ts.trees():
    print("Tree on interval:", tree.interval)
    # The code below will only work in a Jupyter notebook with SVG output enabled.
    display(SVG(tree.draw(node_colours=node_colours)))
2 populations with constant migration rate.

More coalescences are happening in population 1 than population 0. This makes sense given that population 1 is specifying more migrants to population 0 than vice versa.

Changing migration rates

We can change any of the migration rates at any time in the simulation. To do this, we just need to add a msprime.MigrationRateChange() object specifying the index of the migration matrix to be changed, the time of the change and the new migration rate.

For instance, say we wanted to specify that in each generation prior to time = 100, 1% of population 0 consisted of migrants from population 1.

migration_rate_change = msprime.MigrationRateChange(
            time = 100, rate = 0.01, matrix_index=(0, 1))

A list of these changes can be supplied to msprime.simulate() via the demographic_events input: (If there is more than 1 change, ensure they are ordered by backwards-time!)

ts = msprime.simulate(
        population_configurations = [pop0, pop1],
        migration_matrix = M,
        length = 1000,
        demographic_events = [migration_rate_change],
        random_seed = 25,
        recombination_rate = 1e-6)

Mass migrations

msprime.MassMigration() objects are used to specify one-off events in which some fraction of a population moves into another population. These are useful for specifying divergence and admixture events.

2 populations with a mass migration.

You’ll need to provide the time of the event in generations, the ID of the source and destination populations, and a migration proportion (which defaults to 1.0). For example, the following specifies that 50 generations ago, 30% of population 0 was a migrant from population 1.

admixture_event  = msprime.MassMigration(time = 50, source = 0, dest = 1, proportion = 0.3)

Note that these are viewed as backwards-in-time events, so source is the population that receives migrants from dest.

Any mass migrations can be added into the list of demographic_events supplied to msprime.simulate().

ts = msprime.simulate(
        population_configurations = [pop0, pop1],
        migration_matrix = M,
        demographic_events = [admixture_event],
        random_seed = 12,
        length = 1000,
        recombination_rate = 1e-4,
        mutation_rate = 7e-4)

msprime.MassMigration() objects can also be used to specify divergence events, but we must take some care.

2 populations with a divergence.

The following specifies that 200 generations ago, 100% of population 1 was a migrant from population 0.

divergence_event = msprime.MassMigration(
        time = 200, source = 1, dest = 0, proportion = 1)

We’ll add this to our list of demographic_events.

ts = msprime.simulate(
        population_configurations = [pop0, pop1],
        migration_matrix = M,
        demographic_events = [admixture_event, divergence_event],
        random_seed = 28,
        length = 1000,
        recombination_rate = 1e-7)

However, when we look at the population IDs corresponding to the the nodes from more than 200 generations ago, there are still some nodes from both populations. This is not what what we’d expect to see if we’d correctly simulated a divergence event!

[u.population for u in ts.nodes() if u.time > 200]
# [1, 0, 1, 1]

The reason is that at present, we are simulating a situation in which population 1 exists prior to generation 200, but is completely replaced by migrants from population 0 at time = 200. And because we’ve specified a migration matrix, there will still be some migrants from population 0 to population 1 in prior generations.

2 populations with a divergence and migration beforehand.

We can fix this by also specifying that prior to time = 200, population 1 had no migration from population 0.

rate_change = msprime.MigrationRateChange(
    time = 200, rate = 0, matrix_index=None)

ts = msprime.simulate(
        population_configurations = [pop0, pop1],
        migration_matrix = M,
        demographic_events = [admixture_event, divergence_event, rate_change],
        random_seed = 28,
        length = 1000,
        recombination_rate = 1e-7)

Now all ancestral nodes prior to generation 200 are exclusively from population 0. Hooray!

[u.population for u in ts.nodes() if u.time > 200]
# [0, 0, 0, 0, 0]

# This only works in a Jupyter notebook.
from IPython.display import SVG

colour_map = {0:"red", 1:"blue"}
node_colours = {u.id: colour_map[u.population] for u in ts.nodes()}
for tree in ts.trees():
    display(SVG(tree.draw(node_colours=node_colours)))
2 populations with a divergence, SVG output.

Changing population sizes or growth rates

2 populations with a complex demographic history.

We may wish to specify changes to rates of population growth, or sudden changes in population size at a particular time. Both of these can be specified with msprime.PopulationParametersChange() objects in the supplied list of demographic_events.

# Bottleneck in Population 0 between 50 - 150 generations ago.
pop0_bottleneck_ends = msprime.PopulationParametersChange(
    time = 50, initial_size = 250, population = 0)
pop0_bottleneck_starts = msprime.PopulationParametersChange(
    time = 150, initial_size = 500, population = 0)

# Exponential growth in Population 1 starting 50 generations ago.
pop1_growth = msprime.PopulationParametersChange(
    time = 100, growth_rate = 0.01, population = 1)

ts = msprime.simulate(
        population_configurations = [pop0, pop1],
        migration_matrix = M,
        length = 1000,
        demographic_events = [pop0_bottleneck_ends, pop1_growth, pop0_bottleneck_starts],
        random_seed = 17,
        recombination_rate = 1e-6)

Note

Since msprime simulates backwards-in-time, parameter changes must be interpreted backwards-in-time as well. For instance, the pop1_growth event in the example above specifies continual growth in the early history of population 1 up until 100 generations in the past.

Census events

Census events allow you to add a node to each branch of the tree sequence at a given time during the simulation. This can be useful when you wish to study haplotypes that are ancestral to your simulated sample, or when you wish to know which lineages were present in which populations at specified times.

For instance, the following code specifies a simulation with two samples drawn from each of two populations. There are two demographic events: a migration rate change and a census event. At generation 100 and earlier, the two populations exchange migrants at a rate of 0.05. At generation 5000, a census is performed:

pop_config = msprime.PopulationConfiguration(sample_size=2, initial_size=1000)
mig_rate_change = msprime.MigrationRateChange(time=100, rate=0.05)
ts = msprime.simulate(
            population_configurations=[pop_config, pop_config],
            length=1000,
            demographic_events=[mig_rate_change, msprime.CensusEvent(time=5000)],
            recombination_rate=1e-7,
            random_seed=141)

The resulting tree sequence has nodes on each tree at the specified census time. These are the nodes with IDs 8, 9, 10, 11, 12 and 13:

# This will only work in a Jupyter notebook
from IPython.display import SVG
display(SVG(ts.draw_svg()))
A tree sequence with census nodes.

This tells us that the genetic material ancestral to the present day sample was held within 5 haplotypes at time 5000. The node table shows us that four of these haplotypes (nodes 8, 9, 10 and 11) were in population 0 at this time, and two of these haplotypes (nodes 12 and 13) were in population 1 at this time.

print(ts.tables.nodes)

# id  flags   population  individual  time    metadata
# 0   1       0   -1  0.00000000000000
# 1   1       0   -1  0.00000000000000
# 2   1       1   -1  0.00000000000000
# 3   1       1   -1  0.00000000000000
# 4   0       1   -1  2350.08685279051815
# 5   0       1   -1  3759.20387382847684
# 6   0       0   -1  4234.97992185234671
# 7   0       1   -1  4598.83898042243527
# 8   1048576 0   -1  5000.00000000000000
# 9   1048576 0   -1  5000.00000000000000
# 10  1048576 0   -1  5000.00000000000000
# 11  1048576 0   -1  5000.00000000000000
# 12  1048576 1   -1  5000.00000000000000
# 13  1048576 1   -1  5000.00000000000000
# 14  0       1   -1  5246.90282987397495
# 15  0       0   -1  8206.73121309170347

If we wish to study these ancestral haplotypes further, we can simplify the tree sequence with respect to the census nodes and perform subsequent analyses on this simplified tree sequence. In this example, ts_anc is a tree sequence obtained from the original tree sequence ts by labelling the census nodes as samples and removing all nodes and edges that are not ancestral to these census nodes.

nodes = [i.id for i in ts.nodes() if i.flags==msprime.NODE_IS_CEN_EVENT]
ts_anc = ts.simplify(samples=nodes)

Debugging demography

As we’ve seen, it’s pretty easy to make mistakes when specifying demography!

To help you spot these, msprime provides a debugger that prints out your population history in a more human-readable form. It’s good to get into the habit of running the msprime.DemographyDebugger() before running your simulations.

my_history = msprime.DemographyDebugger(
population_configurations=[pop0, pop1], migration_matrix = M,
demographic_events=[admixture_event, divergence_event, rate_change])

my_history.print_history()

# Model =  hudson(reference_size=1)
# ============================
# Epoch: 0 -- 50.0 generations
# ============================
#      start     end      growth_rate |     0        1
#    -------- --------       -------- | -------- --------
# 0 |   500      500                0 |     0      0.05
# 1 |   500      500                0 |   0.02       0
#
# Events @ generation 50.0
#    - Mass migration: Lineages moved with probability 0.3 backwards in time with source 0 & dest 1
#                      (equivalent to migration from 1 to 0 forwards in time)
# ================================
# Epoch: 50.0 -- 200.0 generations
# ================================
#      start     end      growth_rate |     0        1
#    -------- --------       -------- | -------- --------
# 0 |   500      500                0 |     0      0.05
# 1 |   500      500                0 |   0.02       0
#
# Events @ generation 200.0
#    - Mass migration: Lineages moved with probability 1 backwards in time with source 1 & dest 0
#                      (equivalent to migration from 0 to 1 forwards in time)
#    - Migration rate change to 0 everywhere
# ===============================
# Epoch: 200.0 -- inf generations
# ===============================
#      start     end      growth_rate |     0        1
#    -------- --------       -------- | -------- --------
# 0 |   500      500                0 |     0        0
# 1 |   500      500                0 |     0        0

A complete example

To illustrate msprime’s demography API on a real example from the literature, we implement the Gutenkunst et al. out-of-Africa model. The parameter values used are taken from Table 1. Here is an illustration of the model using the demography package (see also Figure 2B of the Gutenkunst et. al paper):

Schematic of Gutenkunst et al. (2009) out-of-Africa model.

The code below is provided as an example to help you develop your own models. If you want to use this precise model in your analyses we strongly recommend using stdpopsim, which provides a community maintained catalog of simulation species information and demographic models. The model given here is identical to the HomSam/OutOfAfrica_3G09 model.

Warning

The version of this model in the tutorial from 31 May 2016 to 29 May 2020 (on the stable branch) was incorrect. Specifically, it mistakenly allowed for migration to continue beyond the merger of the African and Eurasian bottleneck populations. This has now been fixed, but if you had copied this model from the tutorial for your own analyses, you should update your model code or use the implementation that has been verified in stdpopsim project. See here for more details on the faulty model and its likely effects on downstream analyses.

Coalescent simulation moves from the present back into the past, so times are in units of generations ago, and we build the model with most recent events first.

import math

import msprime


def out_of_africa():
    # First we set out the maximum likelihood values of the various parameters
    # given in Table 1.
    N_A = 7300
    N_B = 2100
    N_AF = 12300
    N_EU0 = 1000
    N_AS0 = 510
    # Times are provided in years, so we convert into generations.
    generation_time = 25
    T_AF = 220e3 / generation_time
    T_B = 140e3 / generation_time
    T_EU_AS = 21.2e3 / generation_time
    # We need to work out the starting population sizes based on the growth
    # rates provided for these two populations
    r_EU = 0.004
    r_AS = 0.0055
    N_EU = N_EU0 / math.exp(-r_EU * T_EU_AS)
    N_AS = N_AS0 / math.exp(-r_AS * T_EU_AS)
    # Migration rates during the various epochs.
    m_AF_B = 25e-5
    m_AF_EU = 3e-5
    m_AF_AS = 1.9e-5
    m_EU_AS = 9.6e-5
    # Population IDs correspond to their indexes in the popupulation
    # configuration array. Therefore, we have 0=YRI, 1=CEU and 2=CHB
    # initially.
    population_configurations = [
        msprime.PopulationConfiguration(sample_size=0, initial_size=N_AF),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_EU, growth_rate=r_EU
        ),
        msprime.PopulationConfiguration(
            sample_size=1, initial_size=N_AS, growth_rate=r_AS
        ),
    ]
    migration_matrix = [
        [0, m_AF_EU, m_AF_AS],
        [m_AF_EU, 0, m_EU_AS],
        [m_AF_AS, m_EU_AS, 0],
    ]
    demographic_events = [
        # CEU and CHB merge into B with rate changes at T_EU_AS
        msprime.MassMigration(time=T_EU_AS, source=2, destination=1, proportion=1.0),
        msprime.MigrationRateChange(time=T_EU_AS, rate=0),
        msprime.MigrationRateChange(time=T_EU_AS, rate=m_AF_B, matrix_index=(0, 1)),
        msprime.MigrationRateChange(time=T_EU_AS, rate=m_AF_B, matrix_index=(1, 0)),
        msprime.PopulationParametersChange(
            time=T_EU_AS, initial_size=N_B, growth_rate=0, population_id=1
        ),
        # Population B merges into YRI at T_B
        msprime.MassMigration(time=T_B, source=1, destination=0, proportion=1.0),
        msprime.MigrationRateChange(time=T_B, rate=0),
        # Size changes to N_A at T_AF
        msprime.PopulationParametersChange(
            time=T_AF, initial_size=N_A, population_id=0
        ),
    ]
    return {
        "population_configurations": population_configurations,
        "migration_matrix": migration_matrix,
        "demographic_events": demographic_events,
    }

Once we have defined the model, it is a very good idea to check the implementation using the DemographyDebugger:

# Use the demography debugger to print out the demographic history
# that we have just described.
dd = msprime.DemographyDebugger(**out_of_africa())
dd.print_history()

# =============================
# Epoch: 0 -- 848.0 generations
# =============================
#      start     end      growth_rate |     0        1        2
#    -------- --------       -------- | -------- -------- --------
# 0 |1.23e+04 1.23e+04              0 |     0      3e-05   1.9e-05
# 1 |2.97e+04   1e+03           0.004 |   3e-05      0     9.6e-05
# 2 |5.41e+04    510           0.0055 |  1.9e-05  9.6e-05     0
#
# Events @ generation 848.0
#    - Mass migration: Lineages moved with probability 1.0 backwards in time with source 2 & dest 1
#                      (equivalent to migration from 1 to 2 forwards in time)
#    - Migration rate change to 0 everywhere
#    - Migration rate change for (0, 1) to 0.00025
#    - Migration rate change for (1, 0) to 0.00025
#    - Population parameter change for 1: initial_size -> 2100 growth_rate -> 0
# ==================================
# Epoch: 848.0 -- 5600.0 generations
# ==================================
#      start     end      growth_rate |     0        1        2
#    -------- --------       -------- | -------- -------- --------
# 0 |1.23e+04 1.23e+04              0 |     0     0.00025     0
# 1 | 2.1e+03  2.1e+03              0 |  0.00025     0        0
# 2 |   510   2.27e-09         0.0055 |     0        0        0
#
# Events @ generation 5600.0
#    - Mass migration: Lineages moved with probability 1.0 backwards in time with source 1 & dest 0
#                    (equivalent to migration from 0 to 1 forwards in time)
#    - Migration rate change to 0 everywhere
# ===================================
# Epoch: 5600.0 -- 8800.0 generations
# ===================================
#      start     end      growth_rate |     0        1        2
#    -------- --------       -------- | -------- -------- --------
# 0 |1.23e+04 1.23e+04              0 |     0        0        0
# 1 | 2.1e+03  2.1e+03              0 |     0        0        0
# 2 |2.27e-09 5.17e-17         0.0055 |     0        0        0
#
# Events @ generation 8800.0
#    - Population parameter change for 0: initial_size -> 7300
# ================================
# Epoch: 8800.0 -- inf generations
# ================================
#      start     end      growth_rate |     0        1        2
#    -------- --------       -------- | -------- -------- --------
# 0 | 7.3e+03  7.3e+03              0 |     0        0        0
# 1 | 2.1e+03  2.1e+03              0 |     0        0        0
# 2 |5.17e-17     0            0.0055 |     0        0        0

Once you are satisfied that the demographic history that you have built is correct, it can then be simulated by calling the simulate() function:

ts = msprime.simulate(**out_of_africa())

Advanced features

Logging

Msprime uses the Python logging infrastructure to help debugging complex simulations. Messages at the INFO level are high-level information about the state of the current simulation, such as switches in simulation model, etc. While it is straightforward to set up logging messages using the built-in Python methods, the daiquiri is a bit more convenient. For example,

import daiquiri

import msprime


def logging_info_example():
    daiquiri.setup(level="INFO")
    msprime.simulate(10, Ne=1000, model=["dtwf", (100, "hudson")])

Running this code we would get output something like the following:

2020-06-24 16:02:20,983 [11188] INFO     msprime.ancestry: Running model {'name': 'dtwf'} until max time: 100.000000
2020-06-24 16:02:20,984 [11188] INFO     msprime.ancestry: Running model {'name': 'hudson'} until max time: inf
2020-06-24 16:02:20,984 [11188] INFO     msprime.ancestry: Completed at time=1449.4 nodes=19 edges=18

When running larger simulations and trying to figure out when they might finish, it can be helpful to use the DEBUG logging output. For example,

import daiquiri

import msprime


def logging_debug_example():
    daiquiri.setup(level="DEBUG")
    msprime.simulate(
        10 ** 5, Ne=10000, recombination_rate=2e-8, length=1e6, random_seed=32
    )

which gives us:

2020-06-24 16:11:33,373 [11729] INFO     msprime.ancestry: Running model {'name': 'hudson'} until max time: inf
2020-06-24 16:11:33,396 [11729] DEBUG    msprime.ancestry: time=0.0444792 ancestors=90162
2020-06-24 16:11:33,418 [11729] DEBUG    msprime.ancestry: time=0.0979881 ancestors=80314
2020-06-24 16:11:33,438 [11729] DEBUG    msprime.ancestry: time=0.167461 ancestors=70518
2020-06-24 16:11:33,466 [11729] DEBUG    msprime.ancestry: time=0.258912 ancestors=60734
2020-06-24 16:11:33,501 [11729] DEBUG    msprime.ancestry: time=0.386571 ancestors=51002
2020-06-24 16:11:33,542 [11729] DEBUG    msprime.ancestry: time=0.575475 ancestors=41292
2020-06-24 16:11:33,596 [11729] DEBUG    msprime.ancestry: time=0.872121 ancestors=31704
2020-06-24 16:11:33,668 [11729] DEBUG    msprime.ancestry: time=1.42071 ancestors=22228
2020-06-24 16:11:33,763 [11729] DEBUG    msprime.ancestry: time=2.75437 ancestors=13042
2020-06-24 16:11:33,906 [11729] DEBUG    msprime.ancestry: time=8.80469 ancestors=4752
2020-06-24 16:11:34,018 [11729] DEBUG    msprime.ancestry: time=277.713 ancestors=416
2020-06-24 16:11:34,034 [11729] DEBUG    msprime.ancestry: time=6352.95 ancestors=142
2020-06-24 16:11:34,042 [11729] DEBUG    msprime.ancestry: time=20955.5 ancestors=104
2020-06-24 16:11:34,050 [11729] DEBUG    msprime.ancestry: time=40321.8 ancestors=83
2020-06-24 16:11:34,056 [11729] DEBUG    msprime.ancestry: time=72735.7 ancestors=74
2020-06-24 16:11:34,059 [11729] INFO     msprime.ancestry: Completed at time=235063 nodes=207110 edges=233940

In this example we run a reasonably large simulation and turn on the DEBUG output. This will then periodically (every 10,000 simulation events) print out the current time in the simulation, and the number of extant ancestral lineages.

Warning

The format of these logging messages is not fixed and may change arbitrarily in the future. If you need to obtain the information within them, please open an issue on GitHub so that we can provide a documented API for this.

Parsing species trees

Models used in msprime for simulation can be designed to approximate the diversification history of a group of diverging species, by defining, for each species divergence, a mass migration event at which all lineages from one population move into another population. To faciliate the specification of these mass migration events it is possible to parse a species tree and generate the set of mass migration events automatically.

To be parseable, a species tree must be encoded in Newick format, with named leaves and branch lengths. One example of a parseable species tree in Newick format is (((human:5.6,chimpanzee:5.6):3.0,gorilla:8.6):9.4,orangutan:18.0). When visualized in a software like FigTree, this tree appears as shown below:

A species tree with branch lengths.

In the above figure, numbers written on branches indicate the lengths of these branches. In this case, the units of the branch lengths are millions of years, which is common for species trees; however, trees with branch lengths in units of years or generations can also be parsed. When the branch lengths are in units of millions of years or in units of years (i.e., not in units of generations), a generation time in number of years must be provided so that the simulation model can be set up. In addition, a population size is required. With the species tree, a generation time, and the population size, the model can be generated:

import msprime
pop_configs, demographic_events = msprime.parse_species_tree(
        tree="(((human:5.6,chimpanzee:5.6):3.0,gorilla:8.6):9.4,orangutan:18.0)",
        Ne=10000,
        branch_length_units="myr",
        generation_time=28)

The parse_species_tree() method returns a tuple of two lists:

print(type(pop_configs))
# <class 'list'>
print(type(demographic_events))
# <class 'list'>

The first of these two lists contains instances of PopulationConfiguration and the second contains instances of MassMigration:

print(type(pop_configs[0]))
# <class 'msprime.simulations.PopulationConfiguration'>
print(type(demographic_events[0]))
# <class 'msprime.simulations.MassMigration'>

The mass migration events are ordered by the time of the event and they thus specify the sequence in which lineages are moved from source to destination populations:

print(demographic_events[0])
# Mass migration: Lineages moved with probability 1.0 backwards in time with source 1 & dest 0
                     (equivalent to migration from 0 to 1 forwards in time)
print(demographic_events[1])
# Mass migration: Lineages moved with probability 1.0 backwards in time with source 2 & dest 0
                     (equivalent to migration from 0 to 2 forwards in time)
print(demographic_events[2])
# Mass migration: Lineages moved with probability 1.0 backwards in time with source 3 & dest 0
                     (equivalent to migration from 0 to 3 forwards in time)

The above output indicates that — viewed backwards in time — lineages from populations 1, 2, and 3 are consecutively moved into population 0. Viewed forwards in time instead, this means that population 3 is the first to diverge, followed by population 2 and finally the divergence between populations 0 and 1. This sequence of divergences corresponds to the species tree if population 3 is orangutan and populations 2, 1, and 0 are gorilla, chimpanzee, and human, respectively. While the parsed species names are not used as population labels, they are included in the population configurations in the form of metadata, with the “species_name” tag:

print([pop_config.metadata for pop_config in pop_configs])
# [{'species_name': 'human'}, {'species_name': 'chimpanzee'}, {'species_name': 'gorilla'}, {'species_name': 'orangutan'}]

As the above output shows, the information on the topology of the species tree is fully included in the set of population configurations and mass migration events. It also illustrates that it is always the left one (when viewed from the root towards the tips of the species tree in the tree figure above) of the two populations descending from a divergence event that is used as the destination population in mass migration events.

The population configurations also define the population size:

print([pop_config.initial_size for pop_config in pop_configs])
# [10000.0, 10000.0, 10000.0, 10000.0]

The population size is 10,000 because this value was specified for Ne when calling parse_species_tree(). The growth rates are zero for all populations, meaning that they all have constant population sizes:

print([pop_config.growth_rate for pop_config in pop_configs])
# [0.0, 0.0, 0.0, 0.0]

To simulate under the model corresponding to the species tree, the population configurations and mass migration events are used as input for simulate(). We can specify the genomes to sample either by using the samples parameter or by setting a sample size for each population:

for pop_config in pop_configs:
    pop_config.sample_size = 2
# pop_configs now has 2 genomes sampled from each population at time 0. Equivalent to
# msprime.simulate(samples=[(0,0), (0,0), (1,0), (1,0), (2,0), (2,0), (3,0), (3,0)], ...)
tree_sequence = msprime.simulate(
        population_configurations=pop_configs,
        demographic_events=demographic_events)
tree = tree_sequence.first()
print(tree.draw(format="unicode"))
#    14
#  ┏━━┻━━━┓
#  ┃     13
#  ┃   ┏━━┻━━┓
#  ┃   ┃    12
#  ┃   ┃   ┏━┻━┓
#  ┃   ┃   ┃  11
#  ┃   ┃   ┃  ┏┻┓
# 10   ┃   ┃  ┃ ┃
# ┏┻┓  ┃   ┃  ┃ ┃
# ┃ ┃  ┃   9  ┃ ┃
# ┃ ┃  ┃  ┏┻┓ ┃ ┃
# ┃ ┃  8  ┃ ┃ ┃ ┃
# ┃ ┃ ┏┻┓ ┃ ┃ ┃ ┃
# 6 7 4 5 2 3 0 1

The correspondence between the model and the species tree can also be verified by using the demography debugger:

dd = msprime.DemographyDebugger(
    population_configurations=pop_configs,
    demographic_events=demographic_events)
dd.print_history()
# Model =  hudson(reference_size=1)
# ================================
# Epoch: 0 -- 200000.0 generations
# ================================
#      start     end      growth_rate |     0        1        2        3
#    -------- --------       -------- | -------- -------- -------- --------
# 0 |  1e+04    1e+04               0 |     0        0        0        0
# 1 |  1e+04    1e+04               0 |     0        0        0        0
# 2 |  1e+04    1e+04               0 |     0        0        0        0
# 3 |  1e+04    1e+04               0 |     0        0        0        0
#
# Events @ generation 200000.0
#    - Mass migration: Lineages moved with probability 1.0 backwards in time with source 1 & dest 0
#                      (equivalent to migration from 0 to 1 forwards in time)
# =================================================
# Epoch: 200000.0 -- 307142.85714285716 generations
# =================================================
#      start     end      growth_rate |     0        1        2        3
#    -------- --------       -------- | -------- -------- -------- --------
# 0 |  1e+04    1e+04               0 |     0        0        0        0
# 1 |  1e+04    1e+04               0 |     0        0        0        0
# 2 |  1e+04    1e+04               0 |     0        0        0        0
# 3 |  1e+04    1e+04               0 |     0        0        0        0
#
# Events @ generation 307142.85714285716
#    - Mass migration: Lineages moved with probability 1.0 backwards in time with source 2 & dest 0
#                      (equivalent to migration from 0 to 2 forwards in time)
# =========================================================
# Epoch: 307142.85714285716 -- 642857.142857143 generations
# =========================================================
#      start     end      growth_rate |     0        1        2        3
#    -------- --------       -------- | -------- -------- -------- --------
# 0 |  1e+04    1e+04               0 |     0        0        0        0
# 1 |  1e+04    1e+04               0 |     0        0        0        0
# 2 |  1e+04    1e+04               0 |     0        0        0        0
# 3 |  1e+04    1e+04               0 |     0        0        0        0
#
# Events @ generation 642857.142857143
#    - Mass migration: Lineages moved with probability 1.0 backwards in time with source 3 & dest 0
#                      (equivalent to migration from 0 to 3 forwards in time)
# ==========================================
# Epoch: 642857.142857143 -- inf generations
# ==========================================
#      start     end      growth_rate |     0        1        2        3
#    -------- --------       -------- | -------- -------- -------- --------
# 0 |  1e+04    1e+04               0 |     0        0        0        0
# 1 |  1e+04    1e+04               0 |     0        0        0        0
# 2 |  1e+04    1e+04               0 |     0        0        0        0
# 3 |  1e+04    1e+04               0 |     0        0        0        0

The epoch boundaries 200000, 307142.9, and 642857.1 correspond to the species divergence times 5.6, 8.6, and 18.0 after converting the branch length units of the species tree from millions of years to generations with the specified generation time of 28 years.

Historical samples

Simulating coalescent histories in which some of the samples are not from the present time is straightforward in msprime. By using the samples argument to msprime.simulate() we can specify the location and time at which all samples are made.

def historical_samples_example():
    samples = [
        msprime.Sample(population=0, time=0),
        msprime.Sample(0, 0),  # Or, we can use positional arguments.
        msprime.Sample(0, 1.0),
        msprime.Sample(0, 1.0)
    ]
    tree_seq = msprime.simulate(samples=samples)
    tree = tree_seq.first()
    for u in tree.nodes():
        print(u, tree.parent(u), tree.time(u), sep="\t")
    print(tree.draw(format="unicode"))

In this example we create four samples, two taken at the present time and two taken 1.0 generations in the past, as might represent one modern and one ancient diploid individual. There are a number of different ways in which we can describe the samples using the msprime.Sample object (samples can be provided as plain tuples also if more convenient). Running this example, we get:

historical_samples_example()
# 6    -1    2.8240255501413247
# 4    6    0.0864109319103291
# 0    4    0.0
# 1    4    0.0
# 5    6    1.9249243960710336
# 2    5    1.0
# 3    5    1.0
#    6
#  ┏━┻━┓
#  ┃   5
#  ┃  ┏┻┓
#  ┃  2 3
#  ┃
#  4
# ┏┻┓
# 0 1

Because nodes 0 and 1 were sampled at time 0, their times in the tree are both 0. Nodes 2 and 3 were sampled at time 1.0, and so their times are recorded as 1.0 in the tree.

Replication

A common task for coalescent simulations is to check the accuracy of analytical approximations to statistics of interest. To do this, we require many independent replicates of a given simulation. msprime provides a simple and efficient API for replication: by providing the num_replicates argument to the simulate() function, we can iterate over the replicates in a straightforward manner. Here is an example where we compare the analytical results for the number of segregating sites with simulations:

import numpy as np

import msprime


def segregating_sites(n, theta, num_replicates):
    S = np.zeros(num_replicates)
    replicates = msprime.simulate(
        sample_size=n, mutation_rate=theta / 4, num_replicates=num_replicates
    )
    for j, tree_sequence in enumerate(replicates):
        S[j] = tree_sequence.get_num_mutations()
    S_mean_a = np.sum(1 / np.arange(1, n)) * theta
    S_var_a = theta * np.sum(1 / np.arange(1, n)) + theta ** 2 * np.sum(
        1 / np.arange(1, n) ** 2
    )
    print("              mean              variance")
    print(f"Observed      {np.mean(S):.5f}\t\t{np.var(S):.5f}")
    print(f"Analytical    {S_mean_a:.5f}\t\t{S_var_a:.5f}")

Running this code, we get:

segregating_sites(10, 5, 100000)
#          mean              variance
# Observed      14.17893          53.0746740551
# Analytical    14.14484          52.63903

Note that in this example we set \(N_e = 0.5\) and the mutation rate to \(\theta / 2\) when calling simulate(). This works because msprime simulates Kingman’s coalescent, for which \(N_e\) is only a time scaling; since \(N_e\) is the diploid effective population size, setting \(N_e = 0.5\) means that the mean time for two samples to coalesce is equal to one time unit in the resulting trees. This is helpful for converting the diploid per-generation time units of msprime into the haploid coalescent units used in many theoretical results. However, it is important to note that conventions vary widely, and great care is needed with such factor-of-two rescalings.

Multiple chromosomes

Warning

This approach is somewhat hacky; hopefully we will have a more elegant solution soon!

Multiple chromosomes can be simulated by specifying a recombination map with hotspots between chromosomes. For example, to simulate two chromosomes each 1 Morgan in length:

rho = 1e-8
positions = [0, 1e8-1, 1e8, 2e8-1]
rates = [rho, 0.5, rho, 0]
num_loci = int(positions[-1])

recombination_map = msprime.RecombinationMap(
    positions=positions, rates=rates, num_loci=num_loci)

tree_sequence = msprime.simulate(
    sample_size=100, Ne=1000, recombination_map=recombination_map,
    model="dtwf")

Care must be taken when simulating whole genomes this way, as the rescaling required to model such large fluctuations in recombination rate can result in the following error: Bad edge interval where right <= left

This can be avoided by discretizing the genome into 100bp chunks by changing the above to:

rho = 1e-8
positions = [0, 1e8-1, 1e8, 2e8-1]
rates = [rho, 0.5, rho, 0]
num_loci = positions[-1] // 100 # Discretize into 100bp chunks

Also note that recombinations will still occur in the gaps between chromosomes, with corresponding trees in the tree sequence. This will be fixed in a future release.

Hybrid simulations

In some situations Wright-Fisher simulations are desireable but less computationally efficient than coalescent simulations, for example simulating a small sample in a recently admixed population. In these cases, a hybrid model offers an excellent tradeoff between simulation accuracy and performance.

This is done through a SimulationModelChange event, which is a special type of demographic event.

For example, here we switch from the discrete-time Wright-Fisher model to the standard Hudson coalescent 500 generations in the past:

ts = msprime.simulate(
    sample_size=6, Ne=1000, model="dtwf", random_seed=2,
    demographic_events=[
        msprime.SimulationModelChange(time=500, model="hudson")])
print(ts.tables.nodes)
# id      flags   population      individual      time    metadata
# 0       1       0       -1      0.00000000000000
# 1       1       0       -1      0.00000000000000
# 2       1       0       -1      0.00000000000000
# 3       1       0       -1      0.00000000000000
# 4       1       0       -1      0.00000000000000
# 5       1       0       -1      0.00000000000000
# 6       0       0       -1      78.00000000000000
# 7       0       0       -1      227.00000000000000
# 8       0       0       -1      261.00000000000000
# 9       0       0       -1      272.00000000000000
#10      0       0       -1      1629.06982528980075

Because of the integer node times, we can see here that most of the coalescent happened during the Wright-Fisher phase of the simulation, and as-of 500 generations in the past, there were only two lineages left. The continuous time standard coalescent model was then used to simulate the ancient past of these two lineages.

Completing forwards simulations

The msprime simulator generates tree sequences using the backwards in time coalescent model. But it is also possible to output tree sequences from forwards-time simulators such as SLiM. There are many advantages to using forward-time simulators, but they are usually quite slow compared to similar coalescent simulations. In this section we show how to combine the best of both approaches by simulating the recent past using a forwards-time simulator and then complete the simulation of the ancient past using msprime. (We sometimes refer to this “recapitation”, as we can think of it as adding a “head” onto a tree sequence.)

First, we define a simple Wright-Fisher simulator which returns a tree sequence with the properties that we require (please see the API section for a formal description of these properties):

import random

import numpy as np
import tskit


def wright_fisher(N, T, L=100, random_seed=None):
    """
    Simulate a Wright-Fisher population of N haploid individuals with L
    discrete loci for T generations. Based on Algorithm W from
    https://www.biorxiv.org/content/biorxiv/early/2018/01/16/248500.full.pdf
    """
    random.seed(random_seed)
    tables = tskit.TableCollection(L)
    P = np.arange(N, dtype=int)
    # Mark the initial generation as samples so that we remember these nodes.
    for _ in range(N):
        tables.nodes.add_row(time=T, flags=tskit.NODE_IS_SAMPLE)
    t = T
    while t > 0:
        t -= 1
        Pp = P.copy()
        for j in range(N):
            u = tables.nodes.add_row(time=t, flags=0)
            Pp[j] = u
            a = random.randint(0, N - 1)
            b = random.randint(0, N - 1)
            x = random.randint(1, L - 1)
            tables.edges.add_row(0, x, P[a], u)
            tables.edges.add_row(x, L, P[b], u)
        P = Pp

    # Now do some table manipulations to ensure that the tree sequence
    # that we output has the form that msprime needs to finish the
    # simulation. Much of the complexity here is caused by the tables API
    # not allowing direct access to memory, which will change soon.

    # Mark the extant population as samples also
    flags = tables.nodes.flags
    flags[P] = tskit.NODE_IS_SAMPLE
    tables.nodes.set_columns(flags=flags, time=tables.nodes.time)
    tables.sort()
    # Simplify with respect to the current generation, but ensuring we keep the
    # ancient nodes from the initial population.
    tables.simplify()
    # Unmark the initial generation as samples
    flags = tables.nodes.flags
    time = tables.nodes.time
    flags[:] = 0
    flags[time == 0] = tskit.NODE_IS_SAMPLE
    # The final tables must also have at least one population which
    # the samples are assigned to
    tables.populations.add_row()
    tables.nodes.set_columns(
        flags=flags, time=time, population=np.zeros_like(tables.nodes.population)
    )
    return tables.tree_sequence()

We then run a tiny forward simulation of 10 two-locus individuals for 5 generations, and print out the resulting trees:

num_loci = 2
N = 10
wf_ts = wright_fisher(N, 5, L=num_loci, random_seed=3)
for tree in wf_ts.trees():
    print("interval = ", tree.interval)
    print(tree.draw(format="unicode"))

We get:

interval =  (0.0, 1.0)
       0                 7
       ┃                 ┃
       25                ┃
  ┏━━━━┻━━━━┓            ┃
  23        24           ┃
┏━┻━┓    ┏━━╋━━━┓        ┃
┃   21   ┃  ┃   22       20
┃  ┏┻━┓  ┃  ┃  ┏┻━┓   ┏━━╋━━┓
10 14 19 11 18 15 17  12 13 16

interval =  (1.0, 2.0)
        0          8    4     7
        ┃          ┃   ┏┻━┓   ┃
        21         ┃   ┃  ┃   ┃
┏━━┳━━┳━┻┳━━┳━━┓   ┃   ┃  ┃   ┃
14 19 10 13 16 18  11  15 17  12

Because our Wright Fisher simulation ran for only 5 generations, there has not been enough time for the trees to fully coalesce. Therefore, instead of having one root, the trees have several — the first tree has 2 and the second 4. Nodes 0 to 9 in this simulation represent the initial population of the simulation, and so we can see that all samples in the first tree trace back to one of two individuals from the initial generation. These unary branches joining samples and coalesced subtrees to the nodes in the initial generation are essential as they allow use to correctly assemble the various fragments of ancestral material into chromosomes when creating the initial conditions for the coalescent simulation. (Please see the API section for more details on the required properties of input tree sequences.)

The process of completing this tree sequence using a coalescent simulation begins by first examining the root segments on the input trees. We get the following segments:

[(0, 2, 0), (0, 2, 7), (1, 2, 8), (1, 2, 4)]

where each segment is a (left, right, node) tuple. As nodes 0 and 7 are present in both trees, they have segments spanning both loci. Nodes 8 and 4 are present only in the second tree, and so they have ancestral segments only for the second locus. Note that this means that we do not simulate the ancestry of the entire initial generation of the simulation, but rather the exact minimum that we need in order to complete the ancestry of the current generation. For instance, root 8 has not coalesced over the interval from 1.0 to 2.0, while root 0 has not coalesced over the entire segment from 0.0 to 2.0.

We run the coalescent simulation to complete this tree sequence using the from_ts argument to simulate(). Because we have simulated a two locus system with a recombination rate of 1 / num_loci per generation in the Wright-Fisher model, we want to use the same system in the coalescent simulation. To do this we create recombination map using the RecombinationMap.uniform_map() class method to easily create a discrete map with the required number of loci. (Please see the API section for more details on the restrictions on recombination maps when completing an existing simulation.) We also use a Ne value of N / 2 since the Wright-Fisher simulation was haploid and msprime is diploid.

recomb_map = msprime.RecombinationMap.uniform_map(num_loci, 1 / num_loci, num_loci)
coalesced_ts = msprime.simulate(
    Ne=N / 2, from_ts=wf_ts, recombination_map=recomb_map, random_seed=5)

After running this simulation we get the following trees:

interval =  (0.0, 1.0)
                26
       ┏━━━━━━━━┻━━━━━━━┓
       0                7
       ┃                ┃
       25               ┃
  ┏━━━━┻━━━━┓           ┃
  23        24          ┃
┏━┻━┓    ┏━━╋━━━┓       ┃
┃   21   ┃  ┃   22      20
┃  ┏┻━┓  ┃  ┃  ┏┻━┓  ┏━━╋━━┓
10 14 19 11 18 15 17 12 13 16

interval =  (1.0, 2.0)
                  28
             ┏━━━━┻━━━━━┓
             ┃          27
             ┃        ┏━┻━━┓
             26       ┃    ┃
        ┏━━━━┻━━━━┓   ┃    ┃
        0         7   4    8
        ┃         ┃  ┏┻━┓  ┃
        21        ┃  ┃  ┃  ┃
┏━━┳━━┳━┻┳━━┳━━┓  ┃  ┃  ┃  ┃
14 19 10 13 16 18 12 15 17 11

The trees have fully coalesced and we’ve successfully combined a forwards-time Wright-Fisher simulation with a coalescent simulation: hooray!

Why record the initial generation?

We can now see why it is essential that the forwards simulator records the initial generation in a tree sequence that will later be used as a from_ts argument to simulate(). In the example above, if node 7 was not in the tree sequence, we would not know that the segment that node 20 inherits from on [0.0, 1.0) and the segment that node 12 inherits from on [1.0, 2.0) both exist in the same node (here, node 7).

However, note that although the intial generation (above, nodes 0, 4, 7, and 8) must be in the tree sequence, they do not have to be samples. The easiest way to do this is to (a) retain the initial generation as samples throughout the forwards simulation (so they persist through simplify()), but then (b) before we output the final tree sequence, we remove the flags that mark them as samples, so that simulate() does not simulate their entire history as well. This is the approach taken in the toy simulator provided above (although we skip the periodic simplify() steps which are essential in any practical simulation for simplicity).

Topology gotchas

The trees that we output from this combined forwards and backwards simulation process have some slightly odd properties that are important to be aware of. In the example above, we can see that the old roots are still present in both trees, even through they have only one child and are clearly redundant. This is because the tables of from_ts have been retained, without modification, at the top of the tables of the output tree sequence. While this redundancy is not important for many tasks, there are some cases where they may cause problems:

  1. When computing statistics on the number of nodes, edges or trees in a tree sequence, having these unary edges and redundant nodes will slightly inflate the values.
  2. If you are computing the overall tree “height” by taking the time of the root node, you may overestimate the height because there is a unary edge above the “real” root (this would happen if one of the trees had already coalesced in the forwards-time simulation).

For these reasons it is usually better to remove this redundancy from your computed tree sequence which is easily done using the tskit.TreeSequence.simplify() method:

final_ts = coalesced_ts.simplify()

for tree in final_ts.trees():
    print("interval = ", tree.interval)
    print(tree.draw(format="unicode"))

giving us:

interval =  (0.0, 1.0)
      17
  ┏━━━┻━━━━┓
  ┃        15
  ┃     ┏━━┻━━┓
  ┃     13    14
  ┃   ┏━┻┓  ┏━╋━━┓
  10  ┃  11 ┃ ┃  12
┏━╋━┓ ┃ ┏┻┓ ┃ ┃ ┏┻┓
2 3 6 0 4 9 1 8 5 7

interval =  (1.0, 2.0)
          19
    ┏━━━━━┻━━━━━┓
    ┃           18
    ┃         ┏━┻┓
    17        ┃  ┃
┏━━━┻━━┓      ┃  ┃
┃      ┃      ┃  16
┃      ┃      ┃ ┏┻┓
┃      11     ┃ ┃ ┃
┃ ┏━┳━┳┻┳━┳━┓ ┃ ┃ ┃
2 4 9 0 3 6 8 1 5 7

This final tree sequence is topologically identical to the original tree sequence, but has the redundant nodes and edges removed. Note also that he node IDs have been reassigned so that the samples are 0 to 9 — if you need the IDs from the original tree sequence, please set map_nodes=True when calling simplify to get a mapping between the two sets of IDs.

Recording the full ARG

In msprime we usually want to simulate the coalescent with recombination and represent the output as efficiently as possible. As a result, we don’t store individual recombination events, but rather their effects on the output tree sequence. We also do not explicitly store common ancestor events that do not result in marginal coalescences. For some purposes, however, we want to get information on the full history of the simulation, not just the minimal representation of its outcome. The record_full_arg option to simulate() provides this functionality, as illustrated in the following example:

def full_arg_example():
    ts = msprime.simulate(
        sample_size=5, recombination_rate=0.1,
        record_full_arg=True, random_seed=42)
    print(ts.tables.nodes)
    print()
    for tree in ts.trees():
        print("interval:", tree.interval)
        print(tree.draw(format="unicode"))

Running this code we get:

id      flags   population      individual      time    metadata
0       1       0       -1      0.00000000000000
1       1       0       -1      0.00000000000000
2       1       0       -1      0.00000000000000
3       1       0       -1      0.00000000000000
4       1       0       -1      0.00000000000000
5       0       0       -1      0.31846010419674
6       0       0       -1      0.82270149120229
7       0       0       -1      1.21622732856555
8       131072  0       -1      1.51542116580501
9       131072  0       -1      1.51542116580501
10      262144  0       -1      2.12814260094490
11      0       0       -1      2.16974122606933

interval: (0.0, 0.7323522972251177)
      11
   ┏━━┻━┓
   ┃    10
   ┃    ┃
   ┃    8
   ┃    ┃
   7    ┃
 ┏━┻━┓  ┃
 ┃   6  ┃
 ┃  ┏┻┓ ┃
 5  ┃ ┃ ┃
┏┻┓ ┃ ┃ ┃
0 4 2 3 1

interval: (0.7323522972251177, 1.0)
      11
   ┏━━┻━┓
   ┃    10
   ┃    ┃
   ┃    9
   ┃    ┃
   7    ┃
 ┏━┻━┓  ┃
 ┃   6  ┃
 ┃  ┏┻┓ ┃
 5  ┃ ┃ ┃
┏┻┓ ┃ ┃ ┃
0 4 2 3 1

After running the simulation we first print out the node table, which contains information on all the nodes in the tree sequence. Note that flags column contains several different values: all of the sample nodes (at time 0) have a flag value of 1 (tskit.NODE_IS_SAMPLE). Other internal nodes have a flag value of 0, which is the standard for internal nodes in a coalescent simulations.

Nodes 8 and 9 have flags equal to 131072 (NODE_IS_RE_EVENT), which tells us that they correspond to a recombination event in the ARG. A recombination event results in two extra nodes being recorded, one identifying the individual providing the genetic material to the left of the breakpoint and the other identifying the individuals providing the genetic material to the right. The effect of this extra node can be seen in the trees: node 8 is present as a ‘unary’ node in the left hand tree and node 9 in the right.

Node 10 has a flags value of 262144 (NODE_IS_CA_EVENT), which tells us that it is an ARG common ancestor event that did not result in marginal coalescence. This class of event also results in unary nodes in the trees, which we can see in the example.

If we wish to reduce these trees down to the minimal representation, we can use tskit.TreeSequence.simplify(). The resulting tree sequence will have all of these unary nodes removed and will be equivalent to (but not identical, due to stochastic effects) calling simulate() without the record_full_arg argument.

Migrations nodes are also recording in the ARG using the NODE_IS_MIG_EVENT flag. See the Node flags section for more details.

Evaluating likelihoods

msprime can be used to evaluate the sampling probability of a tree sequence for a given effective population size and per-site, per-generation recombination rate, as well as the probability of a configuration of infinite sites mutations given a tree sequence and a per-site, per-generation mutation probability. In both cases, the tree sequence must conform to the record_full_arg option of the simulate() function. The following example illustrates the evaluation of these log likelihoods:

def likelihood_example():
    ts = msprime.simulate(
        sample_size=5, recombination_rate=0.1, mutation_rate=0.1,
        record_full_arg=True, random_seed=42)
    print(msprime.log_arg_likelihood(ts, recombination_rate=0.1, Ne=1))
    print(msprime.log_arg_likelihood(ts, recombination_rate=1, Ne=1))
    print(msprime.log_arg_likelihood(ts, recombination_rate=1, Ne=10))
    print(msprime.unnormalised_log_mutation_likelihood(ts, mu=0))
    print(msprime.unnormalised_log_mutation_likelihood(ts, mu=0.1))
    print(msprime.unnormalised_log_mutation_likelihood(ts, mu=1))

Running this code we get:

-11.22279995534112
-14.947399100839986
-22.154011926066893
-inf
-5.665181028073889
-7.087195080578711

Multiple merger coalescents

msprime can simulate several multiple merger coalescent models, in which any number of lineages can coalesce in up to four simultaneous mergers. These can often be appropriate models when the distribution of offspring numbers among individuals in the population is highly skewed. Provided are the classes of Beta-Xi-coalescents and Dirac-Xi-coalescents.

The Beta-Xi-coalescent can be simulated as follows:

def beta_multiple_merger_example():
    ts = msprime.simulate(
        sample_size=10, random_seed=1,
        model=msprime.BetaCoalescent(alpha=1.001, truncation_point=1))
    tree = ts.first()
    print(tree.draw(format="unicode"))

Running this code we get:

     16
 ┏━━━━┻━━━┓
 ┃       15
 ┃    ┏━━━┻━━━┓
 ┃    ┃      14
 ┃    ┃     ┏━┻━┓
12    ┃     ┃  13
┏┻┓   ┃     ┃ ┏━╋━┓
┃ ┃  11     ┃ ┃ ┃ ┃
┃ ┃ ┏━┻━┓   ┃ ┃ ┃ ┃
┃ ┃ ┃  10   ┃ ┃ ┃ ┃
┃ ┃ ┃ ┏━╋━┓ ┃ ┃ ┃ ┃
8 9 2 0 4 6 1 3 5 7

The specified value of \(\alpha = 1.001\) corresponds to a heavily skewed offspring distribution. Values closer to \(\alpha = 2\) result in trees whose distribution is closer to that of the standard coalescent, often featuring no multiple mergers for small sample sizes:

def beta_few_multiple_mergers_example():
    ts = msprime.simulate(
        sample_size=10, random_seed=1,
        model=msprime.BetaCoalescent(alpha=1.8, truncation_point=1))
    tree = ts.first()
    print(tree.draw(format="unicode"))

Running this code we get:

     18
  ┏━━━┻━━━┓
  ┃      17
  ┃     ┏━┻━┓
 16     ┃   ┃
┏━┻━┓   ┃   ┃
┃   ┃   ┃  15
┃   ┃   ┃ ┏━┻━┓
┃  14   ┃ ┃   ┃
┃ ┏━┻┓  ┃ ┃   ┃
┃ ┃ 13  ┃ ┃   ┃
┃ ┃ ┏┻┓ ┃ ┃   ┃
┃ ┃ ┃ ┃ ┃ ┃  12
┃ ┃ ┃ ┃ ┃ ┃ ┏━┻━┓
┃ ┃ ┃ ┃ ┃ ┃ ┃  11
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏━┻┓
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ 10
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┏┻┓
5 2 3 7 8 0 6 9 1 4

The timescale of the Beta-Xi-coalescent depends nonlinearly on both \(\alpha\) and the effective population size \(N_e\) as detailed in Multiple merger coalescents. For a fixed \(\alpha\), a unit of coalescent time is proportional to \(N_e^{\alpha - 1}\) generations, albeit with a complicated constant of proportionality that depends on \(\alpha\). The dependence on \(\alpha\) for fixed \(N_e\) is not monotone. Since two lineages merge in 0.5 units of coalescent time on average regadless of \(N_e\) and \(\alpha\), the time to their most recent common ancestor depends on both of these parameters when measured in generations.

To illustrate, for \(\alpha\) close to 2 the relationship between effective population size and number of generations is almost linear:

def beta_high_scaling_example():
    ts = msprime.simulate(
        sample_size=2, random_seed=1,
        model=msprime.BetaCoalescent(reference_size=1,
            alpha=1.9, truncation_point=1))
    tree = ts.first()
    print(tree.tmrca(0,1))
    ts = msprime.simulate(
        sample_size=2, random_seed=1,
        model=msprime.BetaCoalescent(reference_size=100,
            alpha=1.9, truncation_point=1))
    tree = ts.first()
    print(tree.tmrca(0,1))

which results in:

1.5705367504768712
99.09416974894381

For \(\alpha\) close to 1 the effective population size has little effect:

def beta_low_scaling_example():
    ts = msprime.simulate(
        sample_size=2, random_seed=1,
        model=msprime.BetaCoalescent(reference_size=1,
            alpha=1.1, truncation_point=1))
    tree = ts.first()
    print(tree.tmrca(0,1))
    ts = msprime.simulate(
        sample_size=2, random_seed=1,
        model=msprime.BetaCoalescent(reference_size=100,
            alpha=1.1, truncation_point=1))
    tree = ts.first()
    print(tree.tmrca(0,1))

which gives:

7.058990342664795
11.18774573973818

The Dirac-Xi-coalescent is simulated similarly:

def dirac_multiple_merger_example():
    ts = msprime.simulate(
        sample_size=10, random_seed=1,
        model=msprime.DiracCoalescent(psi=0.9, c=10))
    tree = ts.first()
    print(tree.draw(format="unicode"))

which gives:

       14
   ┏━━━━┻━━━┓
   ┃       13
   ┃      ┏━┻━━┓
  12      ┃    ┃
┏━┳┻━━┓   ┃    ┃
┃ ┃  11   ┃   10
┃ ┃ ┏━╋━┓ ┃ ┏━┳┻┳━┓
2 6 3 4 8 1 0 5 7 9

Larger values of the parameter \(c > 0\) result in more frequent multiple mergers, while larger values of \(0 < \psi \leq 1\) result in multiple mergers with more participating lineages. Setting either parameter to 0 would correspond to the standard coalescent.

The Dirac-Xi-coalescent is obtained as the infinite population scaling limit of Moran models, and therefore coalescent time is measured in units of \(N_e^2\) generations. However, under a Moran model, the population-rescaled recombination rate is still obtained from the per-generation recombination probability by rescaling with \(N_e\). The overall effect is that coalescent branch lengths scale with the square of the effective population size in units of generations, and thus so do other quantities which depend on branch lengths such as the number of mutations, while the number of recombinations scales linearly.

def dirac_scaling_example():
    ts = msprime.simulate(
        sample_size=10, random_seed=1,
        recombination_rate=1e-2, mutation_rate=1e-2,
        model=msprime.DiracCoalescent(reference_size=1000, psi=0.01, c=1),
        num_replicates=100)
    trees = 0
    sites = 0
    for t in ts:
        trees = trees + t.num_trees
        sites = sites + t.num_sites
    print(sites / trees)

Running this code results in:

1241.8241770462635

which is larger than \(N_e = 1000\) because not every recombination results in a new tree.

This behaviour is a consequence of the time and parameter scalings under the Moran model, as well as the fact that msprime simulates recombinations on the coalescent time scale, but mutations on trees in units of generations. The rates of mutations and recombinations can be made commensurate by either dividing the mutation rate by \(N_e\), or multiplying the recombination rate by \(N_e\).

Old stuff

Todo

The material in this section are leftovers from an older version of the tutorial. Let’s figure out where it goes best.

In the following example, we calculate the mean coalescence time for a pair of lineages sampled in different subpopulations in a symmetric island model, and compare this with the analytical expectation.

import numpy as np

import msprime


def migration_example(num_replicates=10 ** 5):
    # M is the overall symmetric migration rate, and d is the number
    # of subpopulations.
    M = 0.2
    d = 3
    # We rescale m into per-generation values for msprime.
    m = M / (4 * (d - 1))
    # Allocate the initial sample. Because we are interested in the
    # between subpopulation coalescence times, we choose one sample each
    # from the first two subpopulations.
    population_configurations = [
        msprime.PopulationConfiguration(sample_size=1),
        msprime.PopulationConfiguration(sample_size=1),
        msprime.PopulationConfiguration(sample_size=0),
    ]
    # Now we set up the migration matrix. Since this is a symmetric
    # island model, we have the same rate of migration between all
    # pairs of subpopulations. Diagonal elements must be zero.
    migration_matrix = [[0, m, m], [m, 0, m], [m, m, 0]]
    # We pass these values to the simulate function, and ask it
    # to run the required number of replicates.
    replicates = msprime.simulate(
        population_configurations=population_configurations,
        migration_matrix=migration_matrix,
        num_replicates=num_replicates,
    )
    # And then iterate over these replicates
    T = np.zeros(num_replicates)
    for i, tree_sequence in enumerate(replicates):
        tree = tree_sequence.first()
        # Convert the TMRCA to coalecent units.
        T[i] = tree.get_time(tree.get_root()) / 4
    # Finally, calculate the analytical expectation and print
    # out the results
    analytical = d / 2 + (d - 1) / (2 * M)
    print("Observed  =", np.mean(T))
    print("Predicted =", analytical)

Again, we set \(N_e = 0.5\) to agree with convention in theoretical results, where usually one coalescent time unit is, in generations, the effective number of haploid individuals. Running this example we get:

migration_example()
# Observed  = 3.254904176088153
# Predicted = 3.25