Command line interface

Two command-line applications are provided with msprime: msp and mspms. The msp program is an experimental, POSIX-compliant interface for interacting with the library. The mspms program is an ms-compatible interface, which is useful for those who wish to get started quickly with the library and as a means of plugging msprime into existing workflows. However, there is a substantial overhead involved in translating data from msprime's native history file into legacy formats, so new code should use the Python API where possible.

msp

The msp program provides a convenient interface to the msprime API. It is based on subcommands that either generate or consume a history file. The simulate subcommand runs a simulation storing the results in a file. The other commands are concerned with converting this file into other formats.

Warning

This tool is very new, and the interface may need to change over time. This should be considered an alpha feature!

msp simulate

msp simulate provides a command line interface to the msprime.simulate() API function. Using the parameters provided at the command line, we run a simulation and then save the resulting tree sequence to the file provided as an argument.

usage: msp simulate [-h] [--length LENGTH]
                    [--recombination-rate RECOMBINATION_RATE]
                    [--mutation-rate MUTATION_RATE]
                    [--effective-population-size EFFECTIVE_POPULATION_SIZE]
                    [--random-seed RANDOM_SEED] [--compress]
                    sample_size history_file
Positional arguments:
sample_size The number of individuals in the sample
history_file The msprime history file in HDF5 format
Options:
--length, -L The length of the simulated region in base pairs.
--recombination-rate, -r
 The recombination rate per base per generation
--mutation-rate, -u
 The mutation rate per base per generation
--effective-population-size, -N
 The effective population size Ne
--random-seed, -s
 The random seed. If not specified one is chosen randomly
--compress, -z Enable HDF5’s transparent zlib compression
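
As an illustration of how the usage line above maps onto a concrete invocation, the following Python sketch assembles an argument list for msp simulate. The helper name build_simulate_command and the file name example.hdf5 are hypothetical and are not part of msprime; only the flags listed above are used.

```python
import shlex


def build_simulate_command(sample_size, history_file, length=None,
                           recombination_rate=None, mutation_rate=None,
                           effective_population_size=None, random_seed=None,
                           compress=False):
    """Return the argument list for an `msp simulate` invocation.

    Hypothetical helper for illustration; it only formats arguments and
    does not run anything.
    """
    args = ["msp", "simulate"]
    # Optional flags are emitted only when a value is supplied, so that
    # msp falls back to its own defaults otherwise.
    optional = [
        ("--length", length),
        ("--recombination-rate", recombination_rate),
        ("--mutation-rate", mutation_rate),
        ("--effective-population-size", effective_population_size),
        ("--random-seed", random_seed),
    ]
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    if compress:
        args.append("--compress")
    # Positional arguments come last, in the order shown in the usage line.
    args += [str(sample_size), str(history_file)]
    return args


command = build_simulate_command(
    10, "example.hdf5", length=10000, recombination_rate=1e-8,
    mutation_rate=1e-8, effective_population_size=10000, random_seed=1)
# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

If msp is installed and on the PATH, the resulting list could be passed to subprocess.run(command, check=True) to execute the simulation.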

Note

The way in which recombination and mutation rates are specified is different to ms. In ms these rates are scaled by the length of the simulated region, whereas we use rates per unit distance. The rationale for this change is to simplify running simulations on a variety of sequence lengths, so that we need to change only one parameter and not three simultaneously.
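
The relationship between the two conventions can be sketched in Python as follows. This is an approximation under stated assumptions: the region is treated as continuous, so the total rate is the per-base rate multiplied by the length, and the effective population size plays the role of N0 in the ms definitions theta=4*N0*mu and rho=4*N0*r. The function name ms_scaled_parameters is hypothetical.

```python
def ms_scaled_parameters(length, recombination_rate, mutation_rate,
                         effective_population_size):
    """Convert per-base rates into ms-style scaled parameters (theta, rho)."""
    # Total per-generation rates over the whole region (assumption: the
    # region is treated as continuous, so total = per-base rate * length).
    total_mutation_rate = mutation_rate * length
    total_recombination_rate = recombination_rate * length
    theta = 4 * effective_population_size * total_mutation_rate
    rho = 4 * effective_population_size * total_recombination_rate
    return theta, rho


# The per-base rates stay fixed; only the length changes between runs.
for length in (10_000, 20_000):
    theta, rho = ms_scaled_parameters(
        length, recombination_rate=1e-8, mutation_rate=2e-8,
        effective_population_size=10_000)
    print(f"length={length}: theta={theta:.3g}, rho={rho:.3g}")
```

Doubling the length doubles both scaled values, which is why an ms command line needs several arguments adjusted together, whereas with per-base rates only the length changes.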

msp upgrade

msp upgrade is a command line tool to convert tree sequence files written by older versions of msprime to the latest version. This tool requires h5py, so please ensure that it is installed. The upgrade process involves creating a new tree sequence file from the records stored in the older file and is non-destructive.

usage: msp upgrade [-h] source destination
Positional arguments:
source The source msprime history file in legacy HDF5 format
destination The filename of the upgraded copy.

msp records

msp records is a command line interface to the msprime.TreeSequence.write_records() method. It prints out the coalescence records stored in a history file in a tab-delimited text format.

usage: msp records [-h] [--header] [--precision PRECISION] history_file
Positional arguments:
history_file The msprime history file in HDF5 format
Options:
--header, -H Print a header line in the output.
--precision, -p
 The number of decimal places to print in records

msp vcf

msp vcf is a command line interface to the msprime.TreeSequence.write_vcf() method. It prints out the variants stored in a history file in VCF format.

usage: msp vcf [-h] [--ploidy PLOIDY] history_file
Positional arguments:
history_file The msprime history file in HDF5 format
Options:
--ploidy, -P The ploidy level of samples

msp mutations

msp mutations is a command line interface to the msprime.TreeSequence.mutations() method. It prints out the mutations stored in a history file in a tab-delimited text format.

usage: msp mutations [-h] [--header] [--precision PRECISION] history_file
Positional arguments:
history_file The msprime history file in HDF5 format
Options:
--header, -H Print a header line in the output.
--precision, -p
 The number of decimal places to print in records

msp newick

msp newick prints out the marginal genealogies in the tree sequence in Newick format.

usage: msp newick [-h] [--header] [--precision PRECISION] history_file
Positional arguments:
history_file The msprime history file in HDF5 format
Options:
--header, -H Print a header line in the output.
--precision, -p
 The number of decimal places to print in records

mspms

The mspms program is an ms-compatible command line interface to the msprime library. This interface should be useful for legacy applications, where it can be used as a drop-in replacement for ms. It is not recommended for new applications, particularly if the simulated trees are required as part of the output, because Newick is a very inefficient format. The Python API is the recommended interface, providing direct access to the structures used within msprime.

Supported Features

mspms supports a subset of ms's functionality. Please open an issue on GitHub if there is a feature of ms that you would like to see added. We currently support:

  • Basic functionality (sample size, replicates, tree and haplotype output);
  • Recombination (via the -r option);
  • Spatial structure with arbitrary migration matrices;
  • ms demographic events (the implementation of the -es option is limited, and there are restrictions on how it may be combined with other options).

Gene conversion is not currently supported, but is planned for a future release.

Argument details

This section provides the detailed listing of the arguments to mspms (also available via mspms --help). See the documentation for ms for details on how these values should be interpreted.

mspms is an ms-compatible interface to the msprime library. It simulates the coalescent with recombination for a variety of demographic models and outputs the results in a text-based format. It supports a subset of the functionality available in ms and aims for full compatibility.

usage: mspms [-h] [-V] [--mutation-rate theta] [--trees]
             [--recombination rho num_loci] [--structure value [value ...]]
             [--migration-matrix-entry dest source rate]
             [--migration-matrix entry [entry ...]]
             [--migration-rate-change t x]
             [--migration-matrix-entry-change time dest source rate]
             [--migration-matrix-change entry [entry ...]]
             [--growth-rate alpha]
             [--population-growth-rate population_id alpha]
             [--population-size population_id size]
             [--growth-rate-change t alpha]
             [--population-growth-rate-change t population_id alpha]
             [--size-change t x] [--population-size-change t population_id x]
             [--population-split t dest source]
             [--admixture t population_id proportion]
             [--random-seeds x1 x2 x3] [--precision PRECISION]
             sample_size num_replicates
Positional arguments:
sample_size The number of individuals in the sample
num_replicates Number of independent replicates
Options:
-V, --version show program’s version number and exit
--mutation-rate, -t
 Mutation rate theta=4*N0*mu
--trees, -T Print out trees in Newick format
--recombination, -r
 Recombination at rate rho=4*N0*r where r is the rate of recombination between the ends of the region being simulated; num_loci is the number of sites between which recombination can occur
--structure, -I
 Sample from populations with the specified deme structure. The arguments are of the form ‘num_populations n1 n2 ... [4N0m]’, specifying the number of populations, the sample configuration, and optionally, the migration rate for a symmetric island model
--migration-matrix-entry, -m
 Sets an entry M[dest, source] in the migration matrix to the specified rate. source and dest are (1-indexed) population IDs. Multiple options can be specified.
--migration-matrix, -ma
 Sets the migration matrix to the specified value. The entries are in the order M[1, 1], M[1, 2], ..., M[2, 1], M[2, 2], ..., M[N, N], where N is the number of populations.
--migration-rate-change, -eM
 Set the symmetric island model migration rate to x / (npop - 1) at time t
--migration-matrix-entry-change, -em
 Sets an entry M[dest, source] in the migration matrix to the specified rate at the specified time. source and dest are (1-indexed) population IDs.
--migration-matrix-change, -ema
 Sets the migration matrix to the specified value at time t. The entries are in the order M[1, 1], M[1, 2], ..., M[2, 1], M[2, 2], ..., M[N, N], where N is the number of populations.
--growth-rate, -G
 Set the growth rate to alpha for all populations.
--population-growth-rate, -g
 Set the growth rate to alpha for a specific population.
--population-size, -n
 Set the size of a specific population to size*N0.
--growth-rate-change, -eG
 Set the growth rate for all populations to alpha at time t
--population-growth-rate-change, -eg
 Set the growth rate for a specific population to alpha at time t
--size-change, -eN
 Set the population size for all populations to x * N0 at time t
--population-size-change, -en
 Set the population size for a specific population to x * N0 at time t
--population-split, -ej
 Move all lineages in population dest to source at time t. Forwards in time, this corresponds to a population split in which lineages in source split into dest. All migration rates for population source are set to zero.
--admixture, -es
 Split the specified population into a new population, such that the specified proportion of lineages remains in the population population_id. Forwards in time this corresponds to an admixture event. The new population has ID num_populations + 1. Migration rates to and from the new population are set to 0, its growth rate is 0, and its population size is N0.
--random-seeds, -seeds
 Random seeds (must be three integers)
--precision, -p
 Number of values after decimal place to print
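
To illustrate the ordering of entries expected by -ma and how it combines with -I, here is a small Python sketch that flattens a square migration matrix row by row and assembles an mspms argument list. The helper names are hypothetical and not part of msprime. The sketch assumes, as in ms, that the first positional argument is the total sample size across all populations; the diagonal entries are written as 0 here, which is an assumption about how they are treated.

```python
def flatten_migration_matrix(matrix):
    """Flatten a square matrix in the order M[1, 1], M[1, 2], ..., M[N, N]."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("migration matrix must be square")
    # Row-major order: all entries of row 1, then all entries of row 2, etc.
    return [str(entry) for row in matrix for entry in row]


def build_mspms_command(sample_sizes, num_replicates, theta, matrix):
    """Return an mspms argument list for a structured population model."""
    num_populations = len(sample_sizes)
    if len(matrix) != num_populations:
        raise ValueError("matrix size must match the number of populations")
    # Positional arguments: total sample size, then number of replicates.
    args = ["mspms", str(sum(sample_sizes)), str(num_replicates)]
    args += ["-t", str(theta)]
    # -I takes the number of populations followed by the sample configuration.
    args += ["-I", str(num_populations)] + [str(n) for n in sample_sizes]
    args += ["-ma"] + flatten_migration_matrix(matrix)
    return args


# Two populations with asymmetric migration; M[1, 2] = 1.0 and M[2, 1] = 0.5.
command = build_mspms_command(
    sample_sizes=[3, 2], num_replicates=1, theta=2.0,
    matrix=[[0, 1.0], [0.5, 0]])
print(" ".join(command))
```

Keeping the matrix as a nested list until the last moment makes it harder to transpose source and destination by accident, since M[dest, source] is indexed explicitly.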

If you use msprime in your work, please cite the following paper: Jerome Kelleher, Alison M Etheridge and Gilean McVean (2016), “Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes”, PLoS Comput Biol 12(5): e1004842. doi: 10.1371/journal.pcbi.1004842