Tree Sequence Formats

The correlated genealogical trees that describe the shared ancestry of set of samples are stored concisely in msprime as a sequence of coalescent events, represented by a collection of easy-to-understand tables. These are output by coalescent simulation in msprime or can be read in from another source. To make this information as efficient and easy as possible to use, we store the data on disk in a HDF5 based file format. This page fully documents these tables and the associated HDF5 format, allowing efficient and convenient access to the genealogical data generated by msprime outside of the native Python API. Using the specification defined here, it should be straightforward to access tree sequence information in any language with HDF5 support, or to store genealogies from other sources efficiently using msprime.

Definitions

To begin, here are definitions of some key ideas encountered later. This will define the terminology, as well as giving properties of the tables that these are stored in. These are properties that can be assumed when writing methods that operate on an msprime tree sequence; the function sort_tables is provided to put unsorted tables in the proper order.

First are those that describe genealogical relationships:

Todo

Update the definitions here to change Edgeset to Edge.

tree
A “gene tree”, i.e., the genealogical tree describing how each of the individuals at the tips of the tree are related to each other. A “tree sequence” contains information sufficient to reconstruct the genealogical tree relating all samples to each other at any point along the genome.
node

Each branching point in each tree is associated with a particular ancestor, called “nodes”. Since each node represents a certain ancestor, it has a unique time, thought of as her birth time, which determines the height of any branching points she is associated with. A given node will be associated with branching points of all trees across a region if that node is the most recent common ancestor to the subtending tips across that region. For each node, we record:

(flags, population, time)

where flags records information about the ancestor; population is the integer ID of the ancestor’s (birth) population, and time is how long ago the ancestor was born. Each node also has a unique (integer) ID, but this is not recorded explicitly - rather, the individual’s ID is given by their position in the tree sequence’s node table.

samples
Those nodes in the tree that we have obtained data from. These are distinguished from other nodes by the fact that a tree sequence must describe the genealogical history of all samples at every point on the genome. These are a special kind of node, having flags set to 1 (as a binary mask).
edgeset

Tree sequences are constructed by specifying over which segments of genome which nodes inherit from which other nodes. This information is stored by recording:

(left, right, parent, children)

where each node in the list children inherits from the node parent on the half-open interval of chromosome [left, right).

Here are the formal requirements for a set of nodes and edgesets to make sense, and to allow msprime‘s algorithms to work properly.

To disallow time travel and multiple inheritance:

  1. Offspring must be born after their parents (and hence, no loops).
  2. The set of intervals on which each individual is a child must be disjoint.

and for algorithmic reasons:

  1. The leftmost endpoint of each chromosome is 0.0.
  2. Node times must be strictly greater than zero.
  3. The list of offspring in an edgeset must be sorted.
  4. Edgesets must be sorted in nondecreasing time order.
  5. The set of intervals on which each individual is a parent must be disjoint.
  6. Each edgeset must contain at least two children.

A set of tables satisfying requirements 1-4 can be transformed into a completely valid set of tables by applying first sort_tables() (which ensures 5 and 6) and then simplify_tables() (which ensures 7 and 8).

Note that since each node time is equal to the (birth) time of the corresponding parent, time is measured in clock time (not meioses).

Mutations

In addition to genealogical relationships, msprime generates and stores mutations. Associating these with nodes means that a variant shared by many individuals need only be stored once, allowing retrieval and processing of variant information much more efficiently than if every individual’s genotype was stored directly.

mutation

This type records a mutation that has occurred at some point in the genealogical history. Each mutation is associated with a particular node (i.e., a particular ancestor), so that any sample which inherits from that node will also inherit that mutation, unless another mutation intervenes. The type records:

site    node    derived_state
0           14      1

Here site is the index of the site at which the mutation occurred, node records the ID of the ancestral node associated with the mutation, and derived_state is the allele that any sample inheriting from that node at this site will have if another mutation does not intervene. The node is not necessarily the ancestor in whom the mutation occurred, but rather the ancestor at the bottom of the branch in the tree at that site on which the mutation occurred.

site

Rather than storing a position on the genome directly, a mutation stores the index of a site, that describes that position. This is to allow efficient processing of multiple mutations at the same genomic position. A site records a position on the genome where a mutation has occurred along with the ancestral state (i.e., the state at the root of the tree at that position):

id      position        ancestral_state
0       0.1             0

As with nodes, the id is not stored directly, but is implied by its index in the site table.

To allow for efficent algorithms, it is required that

  1. Sites are sorted by increasing position,
  2. and mutations are sorted by site.

Mutations at the same site may currently be in any order, but this may change.

Migrations

In simulations trees can be thought of as spread across space, and it is helpful for inferring demographic history to record this history. This is stored using the following type.

migration

Migrations are performed by individual ancestors, but most likely not by an individual tracked as a node (as in a discrete-deme model they are unlikely to be both a migrant and a most recent common ancestor). So, msprime records when a segment of ancestry has moved between populations:

left    right   node    source  dest    time
0.0     0.3     3       0       1       2.1

This migration records that the ancestor who was alive 2.1 time units in the past from which node 3 inherited the segment of genome between 0.0 and 0.3 migrated from population 0 to population 1.

A valid migration:

  1. Has time strictly between the time of its node and the time of any ancestral node from which that node inherits on the segment [left, right).
  2. Has the population of any such ancestor matching source, if another migration does not intervene.

Working with Tables

Here is an example. Consider the following sequence of trees:

time
----
1.0                6
0.7               / \                                       5
                 /   x                                     / \
0.5             /     4                 4                 /   4
               /     / \               / x               /   / \
0.4           /     /   \             /   3             /   /   \
             /     /     \           /   / \           /   /     \
            /     /       \         /   /   x         /   /       \
           /     /         \       /   /     \       /   /         \
0.0       0     1           2     1   0       2     0   1           2

position 0.0                  0.2               0.8                1.0

First, we specify the nodes:

NodeTable:

id      is_sample    population   time
0       1            0            0
1       1            0            0
2       1            0            0
3       0            0            0.4
4       0            0            0.5
5       0            0            0.7
6       0            0            1.0

Importantly, the first column, id, is not actually recorded, and is only shown when printing out node tables (as here) for convenience. This has three samples: nodes 0, 1, and 2, and lists their birth times. Then, we specify the edgesets:

EdgesetTable:

left    right   parent  children
0.2     0.8     3       0,2
0.0     0.2     4       1,2
0.2     0.8     4       1,3
0.8     1.0     4       1,2
0.8     1.0     5       0,4
0.0     0.2     6       0,4

Since node 3 is most recent, the edgeset that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edgesets from node 4: there are three of these, for each of the three genomic intervals over which node 4 is ancestor to a distinct set of nodes. At this point, we know the full tree on the middle interval. Finally, edgesets specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.

In the depiction above, x denotes mutations. Suppose that the first mutation occurs at position 0.1 and the mutations in the second tree both occurred at the same position, at 0.5 (with a back mutation). The positions are recorded in the sites table:

SiteTable:

id  position        ancestral_state
0   0.1             0
1   0.5             0

As with node tables, the id column is not actually recorded, but is implied by the position in the table. The acutal mutations are then recorded:

MutationTable:

site        node    derived_state
0       4       1
1       3       1
1       2       0

This would then result in the following (two-locus) haplotypes for the three samples:

sample  haplotype
------  ---------
0       01
1       10
2       11

Tables API

Tables provide a convenient method for viewing, importing and exporting tree sequences. msprime provides direct access to the the columns of a table as numpy arrays: for instance, if n is a NodeTable, then n.time will return an array containing the birth times of the individuals in the table. However, it is important to note that this is not a shallow copy: modifying n.time will not change the node table n. This may change in the future, but currently there are two ways to modify tables: .add_row() and .set_columns() (and also .reset(), which empties the table).

The example node table above would be constructed using .add_row() as follows:

n = msprime.NodeTable()
sv = [True, True, True, False, False, False, False]
tv = [0.0, 0.0, 0.0, 0.4, 0.5, 0.7, 1.0]
pv = [0, 0, 0, 0, 0, 0, 0]
for s, t, p in zip(sv, tv, pv):
    n.add_row(flags=s, population=p, time=t)

print(n)

The .add_row() method is natural (and should be reasonably efficient) if new records appear one-by-one. In the example above it would have been more natural to use .set_columns():

n = msprime.NodeTable()
n.set_columns(flags=sv, population=pv, time=tv)

Finally, here is an example where we add 1.4 to every time except the first in the NodeTable constructed above using numpy indexing:

fn = n.flags
pn = n.population
tn = n.time
tn[1:] = tn[1:] + 1.4
n.set_columns(flags=fn, population=pn, time=tn)

Sorting and simplifying tables

Tables that are noncontradictory but do not satisfy all algorithmic requirements listed above may be converted to a TreeSequence by first sorting, then simplifying them (both operate on the tables in place):

msprime.sort_tables(nodes, edgesets[, migrations, sites, mutations, edge_start])

Sorts the given tables in place, as follows:

Edges are ordered by

  • time of parent, then
  • parent node ID, then
  • child node ID, then
  • left endpoint.

Sites are ordered by position, and Mutations are ordered by site.

If the edge_start parameter is provided, this specifies the index in the edge table where sorting should start. Only rows with index greater than or equal to edge_start are sorted; rows before this index are not affected. This parameter is provided to allow for efficient sorting when the user knows that the edges up to a given index are already sorted.

Todo

Update this documentation to describe the keyword arguments and combinations that are allowed.

Parameters:

Note: the following function is more general than TreeSequence.simplify(), since it can be applied to tables not satisfying all criteria above (and that hence could not be loaded into a TreeSequence).

msprime.simplify_tables(samples, nodes, edgesets[, migrations, sites, mutations])

Simplifies the tables, in place, to retain only the information necessary to reconstruct the tree sequence describing the given samples. This will change the ID of the nodes, so that the individual samples[k]] will have ID k in the result. The resulting NodeTable will have only the first len(samples) individuals marked as samples.

Tables operated on by this function must: be sorted (see sort_tables), have children be born strictly after their parents, and the intervals on which any individual is a child must be disjoint; but other than this the tables need not satisfy remaining requirements to specify a valid tree sequence (but the resulting tables will).

Parameters:
  • samples (list) – A list of Node IDs of individuals to retain as samples.
  • nodes (NodeTable) – The NodeTable to be simplified.
  • edges (EdgeTable) – The NodeTable to be simplified.
  • migrations (MigrationTable) – The MigrationTable to be simplified.
  • sites (SiteTable) – The SiteTable to be simplified.
  • mutations (MutationTable) – The MutationTable to be simplified.
  • filter_invariant_sites (bool) – Whether to remove sites that have no mutations from the output (default: True).

NodeTable

msprime.NodeTable

EdgeTable

msprime.EdgeTable

SiteTable

msprime.SiteTable

MutationTable

msprime.MutationTable

Import and export

This section describes how to extract tables from a TreeSequence, and how to construct a TreeSequence from tables. Since tree sequences are immutible, often the best way to modify a TreeSequence is something along the lines of (for ts a TreeSequence):

nodes = msprime.NodeTable()
edges = msprime.EdgeTable()
ts.dump_tables(nodes=nodes, edges=edges)
# (modify nodes and edges)
ts.load_tables(nodes=nodes, edges=edges)
classmethod TreeSequence.load_tables(**kwargs)
TreeSequence.dump_tables(nodes=None, edges=None, migrations=None, sites=None, mutations=None)

Copy the contents of the tables underlying the tree sequence to the specified objects.

Parameters:
  • nodes (NodeTable) – The NodeTable to load the nodes into.
  • edges (EdgeTable) – The EdgeTable to load the edges into.
  • migrations (MigrationTable) – The MigrationTable to load the migrations into.
  • sites (SiteTable) – The SiteTable to load the sites into.
  • mutations (MutationTable) – The NodeTable to load the mutations into.
Returns:

A TableCollection containing all tables underlying the tree sequence.

Return type:

TableCollection

HDF5 Format

The file format is broken into a number of groups. Each group contains datasets to define the data along with attributes to provide necessary contextual information.

The root group contains one attributes, format_version. This is a pair (major, minor) describing the file format version. This document describes version 3.2.

Path Type Dim Description
/format_version H5T_STD_U32LE 2 The (major, minor) file format version.

Provenance dataset

The provenance dataset records information relating the the provenance of a particular tree sequence file. When a tree sequence file is generated all the information required to reproduce the file should be encoded as a string and stored in this dataset. Subsequent modifications to the file should be also be recorded and appended to the list of strings.

The format of these strings is implementation defined. In the current version of msprime provenance information is encoded as JSON. This information is incomplete, and will be updated in future versions.

Path Type Dim Description
/provenance H5T_STRING Scalar Provenance information.

Mutations group

The mutations group is optional, and describes the location of mutations with respect to tree nodes and their positions along the sequence. Each mutation consists of a node (which must be defined in the trees group) and a position. Positions are defined as a floating point value to allow us to express infinite sites mutations. A mutation position \(x\) is defined on the same scale as the genomic coordinates for trees, and so we must have \(0 \leq x < L\), where \(L\) is the largest value in the /trees/breakpoints dataset.

As for the coalescence records in the trees group, mutation records are stored as seperate vectors for efficiency reasons. Mutations must be stored in nondecreasing order of position.

Path Type Dim
/mutations/node H5T_STD_U32LE M
/mutations/position H5T_IEEE_F64LE M

Trees group

The trees group is mandatory and describes the topology of the tree sequence. The trees group contains a number of nested groups and datasets, which we will describe in turn.

Breakpoints dataset

The /trees/breakpoints dataset records the floating point positions of the breakpoints between trees in the tree sequence, and the flanking positions \(0\) and \(L\). Positions in the /trees/records group refer to (zero based) indexes into this array. The first breakpoint must be zero, and they must be listed in increasing order.

Path Type
/trees/breakpoints H5T_IEEE_F64LE

Nodes group

The /trees/nodes group records information about the individual nodes in a tree sequence. Leaf nodes (from \(0\) to \(n - 1\)) represent the samples and internal nodes (\(\geq n\)) represent their ancestors. Each node corresponds to a particular individual that lived at some time time in the history of the sample. The nodes group is used to record information about these individuals.

Path Type
/trees/nodes/population H5T_STD_U32LE
/trees/nodes/time H5T_IEEE_F64LE

Records group

The /trees/records group stores the individual coalesence records. Each record consists of four pieces of information: the left and right coordinates of the coalescing interval, the list of child nodes and the parent node.

The left and right datasets are indexes into the /trees/breakpoints dataset and define the genomic interval over which the record applies. The interval is half-open, so that the left coordinate is inclusive and the right coordinate is exclusive.

The node dataset records the parent node of the record, and is an index into the /trees/nodes group.

The num_children dataset records the number of children for a particular record. The children dataset then records the actual child nodes for each coalescence record. This 1-dimensional array lists the child nodes for every record in order, and therefore by using the num_children array we can efficiently recover the actual children involved in each event. Within a given event, child nodes must be sorted in increasing order. The records must be listed in time increasing order.

Path Type Dim
/trees/left H5T_STD_U32LE N
/trees/right H5T_STD_U32LE N
/trees/node H5T_STD_U32LE N
/trees/num_children H5T_STD_U32LE N
/trees/children H5T_STD_U32LE \(\leq 2 \times\) N

Indexes group

The /trees/indexes group records information required to efficiently reconstruct the individual trees from the tree sequence. The insertion_order dataset contains the order in which records must be applied and the removal_order dataset the order in which records must be removed for a left-to-right traversal of the trees.

Path Type
/trees/indexes/insertion_order H5T_STD_U32LE
/trees/indexes/removal_order H5T_STD_U32LE