API Documentation

This is the API documentation for msprime, and provides detailed information on the Python programming interface. See the Tutorial for an introduction to using this API to run simulations. See the tskit documentation for information on how to use the tskit Python API to analyse simulation results.

Simulation model

The simulation model in msprime closely follows the classical ms program. Unlike ms, however, time is measured in generations rather than in units of \(4 N_e\) generations, i.e., “coalescent units”. This means that when simulating a population with diploid effective size \(N_e\), the mean time to coalescence between two samples in an msprime simulation will be around \(2 N_e\), while in an ms simulation, the mean time will be around \(0.5\). Internally, msprime uses the same algorithm as ms, and so the Ne parameter to the simulate() function still acts as a time scaling, and can be set to 0.5 to match many theoretical results, or to 0.25 to match ms. Population sizes for each subpopulation and for past demographic events are also defined as absolute values, not scaled by Ne. All migration rates and growth rates are also per generation.

When running simulations we define the length \(L\) of the sequence in question using the length parameter. This defines the coordinate space within which trees and mutations are defined. \(L\) is a continuous value, so units are arbitrary, and coordinates can take any continuous value from \(0\) up to (but not including) \(L\). (So, although we recommend setting the units of length to be analogous to “bases”, events can occur at fractional positions.) Mutations occur in an infinite sites process along this sequence, and mutation rates are specified per generation, per unit of sequence length. Thus, given the per-generation mutation rate \(\mu\), the rate of mutation over the entire sequence in coalescent time units is \(\theta = 4 N_e \mu L\). It is important to remember these scaling factors when comparing with analytical results!

Similarly, recombination rates are per unit of sequence length and per generation in msprime. Thus, given the per generation crossover rate \(r\), the overall rate of recombination between the ends of the sequence in coalescent time units is \(\rho = 4 N_e r L\). Although breakpoints do not necessarily occur at integer locations, the underlying recombination model is finite, and the behaviour of a small number of loci can be modelled using the RecombinationMap class. However, this is considered an advanced feature and the majority of cases should be well served with the default recombination model and number of loci.

Population structure is modelled by specifying a fixed number of subpopulations \(d\), and a \(d \times d\) matrix \(M\) of per generation migration rates. Each element of the matrix \(M_{j,k}\) defines the fraction of population \(j\) that consists of migrants from population \(k\) in each generation. Each subpopulation has an initial absolute population size \(s\) and a per generation exponential growth rate \(\alpha\). The size of a given population at time \(t\) in the past (measured in generations) is therefore given by \(s e^{-\alpha t}\). Demographic events that occur in the history of the simulated population alter some aspect of this population configuration at a particular time in the past.

Warning

This parameterisation of recombination, mutation and migration rates is different to ms, which states these rates over the entire region and in coalescent time units. The motivation for this is to allow the user change the size of the simulated region without having to rescale the recombination and mutation rates, and to also allow users directly state times and rates in units of generations. However, the mspms command line application is fully ms compatible.

Running simulations

The simulate() function provides the primary interface to running coalescent simulations in msprime.

msprime.simulate()

Simulates the coalescent with recombination under the specified model parameters and returns the resulting tskit.TreeSequence. Note that Ne is the effective diploid population size (so the effective number of genomes in the population is 2*Ne), but sample_size is the number of (monoploid) genomes sampled.

Parameters:
  • sample_size (int) – The number of sampled monoploid genomes. If not specified or None, this defaults to the sum of the subpopulation sample sizes. Either sample_size, population_configurations or samples must be specified.
  • Ne (float) – The effective (diploid) population size for the reference population. This defaults to 1 if not specified.
  • length (float) – The length of the simulated region in bases. This parameter cannot be used along with recombination_map. Defaults to 1 if not specified.
  • recombination_rate (float) – The rate of recombination per base per generation. This parameter cannot be used along with recombination_map. Defaults to 0 if not specified.
  • recombination_map (RecombinationMap) – The map describing the changing rates of recombination along the simulated chromosome. This parameter cannot be used along with the recombination_rate or length parameters, as these values are encoded within the map. Defaults to a uniform rate as described in the recombination_rate parameter if not specified.
  • mutation_rate (float) – The rate of infinite sites mutations per unit of sequence length per generation. If not specified, no mutations are generated. This option only allows for infinite sites mutations with a binary (i.e., 0/1) alphabet. For more control over the mutational process, please use the mutate() function.
  • population_configurations (list or None.) – The list of PopulationConfiguration instances describing the sampling configuration, relative sizes and growth rates of the populations to be simulated. If this is not specified, a single population with a sample of size sample_size is assumed.
  • migration_matrix (list) – The matrix describing the rates of migration between all pairs of populations. If \(N\) populations are defined in the population_configurations parameter, then the migration matrix must be an \(N \times N\) matrix with 0 on the diagonal, consisting of \(N\) lists of length \(N\) or an \(N \times N\) numpy array.
  • demographic_events (list) – The list of demographic events to simulate. Demographic events describe changes to the populations in the past. Events should be supplied in non-decreasing order of time in the past. Events with the same time value will be applied sequentially in the order that they were supplied before the simulation algorithm continues with the next time step.
  • samples (list) – The list specifying the location and time of all samples. This parameter may be used to specify historical samples, and cannot be used in conjunction with the sample_size parameter. Each sample is a (population, time) pair such that the sample in position j in the list of samples is drawn in the specified population at the specfied time. Time is measured in generations ago, as elsewhere.
  • random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).
  • num_replicates (int) – The number of replicates of the specified parameters to simulate. If this is not specified or None, no replication is performed and a tskit.TreeSequence object returned. If num_replicates is provided, the specified number of replicates is performed, and an iterator over the resulting tskit.TreeSequence objects returned.
  • from_ts (tskit.TreeSequence) – If specified, initialise the simulation from the root segments of this tree sequence and return the completed tree sequence. Please see here for details on the required properties of this tree sequence and its interactions with other parameters. (Default: None).
  • start_time (float) – If specified, set the initial time that the simulation starts to this value. If not specified, the start time is zero if performing a simulation of a set of samples, or is the time of the oldest node if simulating from an existing tree sequence (see the from_ts parameter).
  • end_time (float) – If specified, terminate the simulation at the specified time. In the returned tree sequence, all rootward paths from samples with time < end_time will end in a node with one child with time equal to end_time. Sample nodes with time >= end_time will also be present in the output tree sequence. If not specified or None, run the simulation until all samples have an MRCA at all positions in the genome.
  • record_full_arg (bool) – If True, record all intermediate nodes arising from common ancestor and recombination events in the output tree sequence. This will result in unary nodes (i.e., nodes in marginal trees that have only one child). Defaults to False.
  • model (str or simulation model instance) – The simulation model to use. This can either be a string (e.g., "smc_prime") or an instance of a simulation model class (e.g, msprime.DiscreteTimeWrightFisher(100). Please see the Simulation models section for more details on specifying simulations models.
Returns:

The tskit.TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.

Return type:

tskit.TreeSequence or an iterator over tskit.TreeSequence replicates.

Warning:

If using replication, do not store the results of the iterator in a list! For performance reasons, the same underlying object may be used for every TreeSequence returned which will most likely lead to unexpected behaviour.

Population structure

Population structure is modelled in msprime by specifying a fixed number of subpopulations, with the migration rates between those subpopulations defined by a migration matrix. Each subpopulation has an initial_size that defines its absolute diploid size at time zero and a per-generation growth_rate which specifies the exponential growth rate of the sub-population. We must also define the number of genomes to sample from each subpopulation. The number of populations and their initial configuration is defined using the population_configurations parameter to simulate(), which takes a list of PopulationConfiguration instances. Population IDs are zero indexed, and correspond to their position in the list.

Samples are drawn sequentially from populations in increasing order of population ID. For example, if we specified an overall sample size of 6, and specify that 2 samples are drawn from population 0 and 4 from population 1, then samples 0 and 1 will be initially located in population 0, and samples 2, 3, 4, and 5 will be drawn from population 2.

Given \(N\) populations, migration matrices are specified using an \(N \times N\) matrix of between-subpopulation migration rates. See the documentation for simulate() and the Simulation model section for more details on the migration rates.

class msprime.PopulationConfiguration(sample_size=None, initial_size=None, growth_rate=0.0, metadata=None)

The initial configuration of a population (or deme) in a simulation.

Parameters:
  • sample_size (int) – The number of initial samples that are drawn from this population.
  • initial_size (float) – The absolute size of the population at time zero. Defaults to the reference population size \(N_e\).
  • growth_rate (float) – The forwards-time exponential growth rate of the population per generation. Growth rates can be negative. This is zero for a constant population size, and positive for a population that has been growing. Defaults to 0.
  • metadata (dict) – A JSON-encodable dictionary of metadata to associate with the corresponding Population in the output tree sequence. If not specified or None, no metadata is stored (i.e., an empty bytes array). Note that this metadata is ignored when using the from_ts argument to simulate(), as the population definitions in the tree sequence that is used as the starting point take precedence.

Demographic Events

Demographic events change some aspect of the population configuration at some time in the past, and are specified using the demographic_events parameter to simulate(). Each element of this list must be an instance of one of the following demographic events that are currently supported. Note that all times are measured in generations, all sizes are absolute (i.e., not relative to \(N_e\)), and all rates are per-generation.

class msprime.PopulationParametersChange(time, initial_size=None, growth_rate=None, population=None, population_id=None)

Changes the demographic parameters of a population at a given time.

This event generalises the -eg, -eG, -en and -eN options from ms. Note that unlike ms we do not automatically set growth rates to zero when the population size is changed.

Parameters:
  • time (float) – The length of time ago at which this event occurred.
  • initial_size (float) – The absolute diploid size of the population at the beginning of the time slice starting at time. If None, this is calculated according to the initial population size and growth rate over the preceding time slice.
  • growth_rate (float) – The new per-generation growth rate. If None, the growth rate is not changed. Defaults to None.
  • population (int) – The ID of the population affected. If population is None, the changes affect all populations simultaneously.
class msprime.MigrationRateChange(time, rate, matrix_index=None)

Changes the rate of migration to a new value at a specific time.

Parameters:
  • time (float) – The time at which this event occurs in generations.
  • rate (float) – The new per-generation migration rate.
  • matrix_index (tuple) – A tuple of two population IDs descibing the matrix index of interest. If matrix_index is None, all non-diagonal entries of the migration matrix are changed simultaneously.
class msprime.MassMigration(time, source, dest=None, proportion=1.0, destination=None)

A mass migration event in which some fraction of the population in one deme simultaneously move to another deme, viewed backwards in time. Each lineage currently present in the source population moves to the destination population with probability equal to proportion.

This event class generalises the population split (-ej) and admixture (-es) events from ms. Note that MassMigrations do not have any side effects on the migration matrix.

Parameters:
  • time (float) – The time at which this event occurs in generations.
  • source (int) – The ID of the source population.
  • dest (int) – The ID of the destination population.
  • proportion (float) – The probability that any given lineage within the source population migrates to the destination population.
class msprime.SimulationModelChange(time, model=None)

An event representing a change of underlying simulation model.

Parameters:
  • time (float) – The time at which the simulation model changes to the new model, in generations. After this time, all internal tree nodes, edges and migrations are the result of the new model.
  • model (str or simulation model instance) – The new simulation model to use. This can either be a string (e.g., "smc_prime") or an instance of a simulation model class (e.g, msprime.DiscreteTimeWrightFisher(100). Please see the Simulation models section for more details on specifying simulations models. If the argument is a string, the reference population size is set from the top level Ne parameter to simulate(). If this is None (the default) we revert to the standard coalescent with the reference population size set by Ne.

Debugging demographic models

Warning

The DemographyDebugger class is preliminary, and the API is likely to change in the future.

class msprime.DemographyDebugger(Ne=1, population_configurations=None, migration_matrix=None, demographic_events=None, model='hudson')

A class to facilitate debugging of population parameters and migration rates in the past.

coalescence_rate_trajectory(steps, num_samples, min_pop_size=1, double_step_validation=True)

This function will calculate the mean coalescence rates and proportions of uncoalesced lineages between the lineages of the sample configuration provided in num_samples, at each of the times ago listed by steps, in this demographic model. The coalescence rate at time t in the past is the average rate of coalescence of as-yet-uncoalesed lineages, computed as follows: let \(p(t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\), let \(p(z,t)\) be the probability that the lineages of a randomly chosen pair of samples has not yet coalesced by time \(t\) and are both in population \(z\), and let \(N(z,t)\) be the diploid effective population size of population \(z\) at time \(t\). Then the mean coalescence rate at time \(t\) is \(r(t) = (\sum_z p(z,t) / (2 * N(z,t)) / p(t)\).

The computation is done by approximating population size trajectories with piecewise constant trajectories between each of the steps. For this to be accurate, the distance between the steps must be small enough so that (a) short epochs (e.g., bottlenecks) are not missed, and (b) populations do not change in size too much over that time, if they are growing or shrinking. This function optionally provides a simple check of this approximation by recomputing the coalescence rates on a grid of steps twice as fine and throwing a warning if the resulting values do not match to a relative tolerance of 0.001.

Parameters:
  • steps (list) – The times ago at which coalescence rates will be computed.
  • num_samples (list) – A list of the same length as the number of populations, so that num_samples[j] is the number of sampled chromosomes in subpopulation j.
  • min_pop_size (int) – The smallest allowed population size during computation of coalescent rates (i.e., coalescence rates are actually 1 / (2 * max(min_pop_size, N(z,t))). Spurious very small population sizes can occur in models where populations grow exponentially but are unused before some time in the past, and lead to floating point error. This should be set to a value smaller than the smallest desired population size in the model.
  • double_step_validation (bool) – Whether to perform the check that step sizes are sufficiently small, as described above. This is highly recommended, and will take at most four times the computation.
Returns:

A tuple of arrays whose jth elements, respectively, are the coalescence rate at the jth time point (denoted r(t[j]) above), and the probablility that a randomly chosen pair of lineages has not yet coalesced (denoted p(t[j]) above).

Return type:

(numpy.array, numpy.array)

epoch_times

Returns array of epoch times defined by the demographic model

num_epochs

Returns the number of epochs defined by the demographic model.

population_size_history

Returns a (num_pops, num_epochs) numpy array giving the starting population size for each population in each epoch.

population_size_trajectory(steps)

This function returns an array of per-population effective population sizes, as defined by the demographic model. These are the initial_size parameters of the model, modified by any population growth rates. The sizes are computed at the time points given by steps.

Parameters:steps (list) – List of times ago at which the population size will be computed.
Returns:Returns a numpy array of population sizes, with one column per population, whose [i,j]th entry is the size of population j at time steps[i] ago.
print_history(output=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)

Prints a summary of the history of the populations.

Variable recombination rates

class msprime.RecombinationMap(positions, rates, num_loci=None)

A RecombinationMap represents the changing rates of recombination along a chromosome. This is defined via two lists of numbers: positions and rates, which must be of the same length. Given an index j in these lists, the rate of recombination per base per generation is rates[j] over the interval positions[j] to positions[j + 1]. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).

Warning

The chromosome is divided into num_loci regions of equal recombination distance, between which recombinations occur. This means that if recombination is constant and the genome is length n, then num_loci=n will produce breakpoints only at integer values. If recombination rate is not constant, breakpoints will still only occur at n distinct positions, but these will probably not coincide with the positions provided.

Parameters:
  • positions (list) – The positions (in bases) denoting the distinct intervals where recombination rates change. These can be floating point values.
  • rates (list) – The list of rates corresponding to the supplied positions. Recombination rates are specified per base, per generation.
  • num_loci (int) – The maximum number of non-recombining loci in the underlying simulation. By default this is set to the largest possible value, allowing the maximum resolution in the recombination process. However, for a finite sites model this can be set to smaller values.
DEFAULT_NUM_LOCI = 4294967295

The default number of non-recombining loci in a RecombinationMap.

mean_recombination_rate

Return the weighted mean recombination rate across all windows of the entire recombination map.

classmethod read_hapmap(filename)

Parses the specified file in HapMap format. These files must be white-space-delimited, and contain a single header line (which is ignored), and then each subsequent line contains the starting position and recombination rate for the segment from that position (inclusive) to the starting position on the next line (exclusive). Starting positions of each segment are given in units of bases, and recombination rates in centimorgans/Megabase. The first column in this file is ignored, as are additional columns after the third (Position is assumed to be the second column, and Rate is assumed to be the third). If the first starting position is not equal to zero, then a zero-recombination region is inserted at the start of the chromosome.

A sample of this format is as follows:

Chromosome  Position(bp)    Rate(cM/Mb)     Map(cM)
chr1        55550           2.981822        0.000000
chr1        82571           2.082414        0.080572
chr1        88169           2.081358        0.092229
chr1        254996          3.354927        0.439456
chr1        564598          2.887498        1.478148
...
chr1        182973428       2.512769        122.832331
chr1        183630013       0.000000        124.482178
Parameters:filename (str) – The name of the file to be parsed. This may be in plain text or gzipped plain text.
classmethod uniform_map(length, rate, num_loci=None)

Returns a RecombinationMap instance in which the recombination rate is constant over a chromosome of the specified length. The optional num_loci controls the number of discrete loci in the underlying simulation, and is by default large enough to be effectively be a continuous model.

The following map can be used to simulate a true finite locus model with a fixed number of loci m:

>>> recomb_map = RecombinationMap.uniform_map(m, rate, num_loci=m)
Parameters:
  • length (float) – The length of the chromosome.
  • rate (float) – The rate of recombination per unit of sequence length along this chromosome.
  • num_loci (int) – The number of discrete loci in the underlying simulation. By default this is set to a large number.

Simulation models

The default simulation model in msprime is the standard coalescent with recombination model. We also support a number of different models, which are documented in this section.

Simulations models are specified using the model parameter to simulate(). This parameter can either take the form of a string describing the model (e.g. model="dtwf") or an instance of a model definition class (e.g model=msprime.DiscreteTimeWrightFisher(1000)). The available models are documented in the following subsections.

A key element of simulation models in msprime is the concept of a “reference population size”. The Ne argument to simulate() can also be used to define this parameter when combined with a string shorthand for a model.

msprime.simulate(10, Ne=1000, model="dtwf")

and

msprime.simulate(10, model=msprime.DiscreteTimeWrightFisher(1000))

define the same simulation.

Todo

Add a discussion of population sizes here, describing what Ne/model.reference_size really means, and how it interacts with the individual population sizes.

We are often interested in simulating mixtures of models: for example, using the DiscreteTimeWrightFisher model to simulate the recent past and then using the standard coalescent to complete the simulation of the ancient past. This can be achieved using the SimulationModelChange event. See the Hybrid simulations for an example of this approach.

Coalescent and approximations

class msprime.StandardCoalescent(reference_size=None)

The classical coalescent with recombination model (i.e., Hudson’s algorithm). The string "hudson" can be used to refer to this model.

This is the default simulation model.

class msprime.SmcApproxCoalescent(reference_size=None)

The original SMC model defined by McVean and Cardin. This model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.

The string "smc" can be used to refer to this model.

class msprime.SmcPrimeApproxCoalescent(reference_size=None)

The SMC’ model defined by Marjoram and Wall as an improvement on the original SMC. model is implemented using a naive rejection sampling approach and so it may not be any more efficient to simulate than the standard Hudson model.

The string "smc_prime" can be used to refer to this model.

Discrete time Wright-Fisher

Msprime provides the option to perform discrete-time Wright-Fisher simulations for scenarios when the coalescent model is not appropriate, including large sample sizes, multiple chromosomes, or recent migration.

To use this option, set the flag model="dtwf" as in the following example:

>>> tree_sequence = msprime.simulate(
...     sample_size=6, Ne=1000, length=1e4, recombination_rate=2e-8,
...     model="dtwf")

All other parameters can be set as usual.

class msprime.DiscreteTimeWrightFisher(reference_size=None)

A discrete backwards-time Wright-Fisher model, with diploid back-and-forth recombination. The string "dtwf" can be used to refer to this model.

Wright-Fisher simulations are performed very similarly to coalescent simulations, with all parameters denoting the same quantities in both models. Because events occur at discrete times however, the order in which they occur matters. Each generation consists of the following ordered events:

  • Migration events. As in the Hudson coalescent, these move single extant lineages between populations. Because migration events occur before lineages choose parents, migrant lineages choose parents from their new population in the same generation.
  • Demographic events. All events with previous_generation < event_time <= current_generation are carried out here.
  • Lineages draw parents. Each (monoploid) extant lineage draws a parent from their current population.
  • Diploid recombination. Each parent is diploid, so all child lineages recombine back-and-forth into the same two parental genome copies. These become two independent lineages in the next generation.
  • Historical sampling events. All historical samples with previous_generation < sample_time <= current_generation are inserted.

Initialising simulations from a tree sequence

By default msprime simulations are initialised by specifying a set of samples, using the sample_size or samples parameters to simulate(). This initialises the simulation with segments of ancestral material covering the whole sequence. Simulation then proceeds backwards in time until a most recent common ancestor has been found at all points along this sequence. We can also start simulations from different initial conditions by using the from_ts argument to simulate(). Informally, we take an ‘unfinished’ tree sequence as a parameter to simulate, initialise the simulation from the state of this tree sequence and then run the simulation until coalescence. The returned tree sequence is then the result of taking the input tree sequence and completing the trees using the coalescent.

This is useful for forwards-time simulators such as SLiM that can output tree sequences. By running forward-time simulation for a certain number of generations we obtain a tree sequence, but these trees may not have had sufficient time to reach a most recent common ancestor. By using the from_ts argument to simulate() we can combine the best of both forwards- and backwards-time simulators. The recent past can be simulated forwards in time and the ancient past by the coalescent. The coalescent simulation is initialised by the root segments of the input tree sequence, ensuring that the minimal amount of ancestral material possible is simulated.

Please see the tutorial for an example of how to use this feature with a simple forwards-time Wright-Fisher simulator

Input requirements

Any tree sequence can be provided as input to this process, but there is a specific topological requirement that must be met for the simulations to be statistically correct. To ensure that ancestral segments are correctly associated within chromosomes when constructing the initial conditions for the coalescent simulation, forward-time simulators must retain the nodes corresponding to the initial generation. Furthermore, for every sample in the final generation (i.e. the extant population at the present time) there must be a path to one of the founder population nodes. (Please see the tutorial for further explanation of this point and an example.)

Recombination map limitations

Because of the way that msprime handles recombination internally, care must be taken when specifying recombination when using the from_ts argument. If recombination positions are generated in the same way in both the initial tree sequence and the coalescent simulation, then everything should work. However, the fine scale details of the underlying recombination model matter, so matching nonuniform recombination maps between simulators may not be possible at present. (To make it work, we must ensure that every recombination breakpoint in from_ts matches exactly to a possible recombination breakpoint in msprime’s recombination map, which is not guaranteed because of msprime’s discrete recombination model.)

One case in which it is guaranteed to work is if from_ts has integer coordinates, and we want to simulate a coalescent with a uniform recombination rate. In this case, to have a uniform recombination rate r use:

L = int(from_ts.sequence_length)
recomb_map = msprime.RecombinationMap.uniform_map(L, r, L)
final_ts = mpsrime.simulate(from_ts=from_ts, recomb_map=recomb_map)

Node flags

For standard coalescent simulations, all samples are marked with the tskit.NODE_IS_SAMPLE flag; internal nodes all have a flags value of 0. When using the record_full_arg argument to simulate(), the following flags values are defined:

msprime.NODE_IS_RE_EVENT

The node is an ARG recombination event. Each recombination event is marked with two nodes, one identifying the individual providing the genetic material to the left of the breakpoint and the other providing the genetic material the right.

msprime.NODE_IS_CA_EVENT

The node is an ARG common ancestor event that did not result in marginal coalescence.

msprime.NODE_IS_MIG_EVENT

The node is an ARG migration event identifying the individual that migrated. Can be used in combination with the record_migrations argument to simulate().

Simulating mutations

When running coalescent simulations it’s usually most convenient to use the mutation_rate argument to the simulate() function to throw neutral mutations down on the trees. However, sometimes we wish to throw mutations down on an existing tree sequence: for example, if we want to see the outcome of different random mutational processes on top of a single simulated topology, or if we have obtained the tree sequence from another program and wish to overlay neutral mutations on this tree sequence.

class msprime.InfiniteSites(alphabet=0)

The “infinitely many sites” mutation model. In this model each mutation corresponds to a unique site, which has a floating-point position chosen uniformly along the sequence. As a result, each site is associated with exactly one mutation.

By default, the ancestral and derived states in this model are represented by the characters “0” and “1”. Thus, the ancestral state at a site is always “0” and the derived state for a mutation is always “1”. However, by specifying the alphabet=NUCLEOTIDES we can generate mutations from the nucleotide characters ACGT. In this case, for each mutation an ancestral state is chosen uniformly from these letters. The derived state is then chosen uniformly from the remaining characters so that the ancestral and derived states are always distinct.

msprime.BINARY == 0

The binary mutation alphabet where ancestral states are always “0” and derived states “1”.

msprime.NUCLEOTIDES == 1

The nucleotides mutation alphabet in which ancestral and derived states are chosen from the characters “A”, “C”, “G” and “T”.

msprime.mutate(tree_sequence, rate=None, random_seed=None, model=None, keep=False, start_time=None, end_time=None)

Simulates mutations on the specified ancestry and returns the resulting tskit.TreeSequence. Mutations are generated at the specified rate in measured generations. Mutations are generated under the infinite sites model, and so the rate of new mutations is per unit of sequence length per generation.

If a random seed is specified, this is used to seed the random number generator. If the same seed is specified and all other parameters are equal then the same mutations will be generated. If no random seed is specified then one is generated automatically.

If the model parameter is specified, this determines the model under which mutations are generated. Currently only the InfiniteSites mutation model is supported. This parameter is useful if you wish to obtain sequences with letters from the nucleotide alphabet rather than the default 0/1 states. By default mutations from the infinite sites model with a binary alphabet are generated.

By default, sites and mutations in the parameter tree sequence are discarded. If the keep parameter is true, however, additional mutations are simulated. Under the infinite sites mutation model, all new mutations generated will occur at distinct positions from each other and from any existing mutations (by rejection sampling).

The time interval over which mutations can occur may be controlled using the start_time and end_time parameters. The start_time defines the lower bound (in time-ago) on this interval and max_time the upper bound. Note that we may have mutations associated with nodes with time <= start_time since mutations store the node at the bottom (i.e., towards the leaves) of the branch that they occur on.

Parameters:
  • tree_sequence (tskit.TreeSequence) – The tree sequence onto which we wish to throw mutations.
  • rate (float) – The rate of mutation per generation. (Default: 0).
  • random_seed (int) – The random seed. If this is None, a random seed will be automatically generated. Valid random seeds must be between 1 and \(2^{32} - 1\).
  • model (MutationModel) – The mutation model to use when generating mutations. If not specified or None, the InfiniteSites mutation model is used.
  • keep (bool) – Whether to keep existing mutations (default: False).
  • start_time (float) – The minimum time at which a mutation can occur. (Default: no restriction.)
  • end_time (float) – The maximum time at which a mutation can occur (Default: no restriction).
Returns:

The tskit.TreeSequence object resulting from overlaying mutations on the input tree sequence.

Return type:

tskit.TreeSequence