Tree Sequence File Format

The correlated trees output by a coalescent simulation are stored very concisely in msprime as a sequence of coalescent records. To make this information as efficient and easy as possible to use, we store the data in a HDF5 based file format. This page fully documents this format allowing efficient and convenient access to the genealogical data generated by msprime outside of the native Python API. Using the specification defined here, it should be straightforward to access tree sequence information in any language with HDF5 support.

Structure

The file format is broken into a number of groups. Each group contains datasets to define the data along with attributes to provide necessary contextual information.

The root group contains one attributes, format_version. This is a pair (major, minor) describing the file format version. This document describes version 3.2.

Path Type Dim Description
/format_version H5T_STD_U32LE 2 The (major, minor) file format version.

Provenance dataset

The provenance dataset records information relating the the provenance of a particular tree sequence file. When a tree sequence file is generated all the information required to reproduce the file should be encoded as a string and stored in this dataset. Subsequent modifications to the file should be also be recorded and appended to the list of strings.

The format of these strings is implementation defined. In the current version of msprime provenance information is encoded as JSON. This information is incomplete, and will be updated in future versions.

Path Type Dim Description
/provenance H5T_STRING Scalar Provenance information.

Mutations group

The mutations group is optional, and describes the location of mutations with respect to tree nodes and their positions along the sequence. Each mutation consists of a node (which must be defined in the trees group) and a position. Positions are defined as a floating point value to allow us to express infinite sites mutations. A mutation position \(x\) is defined on the same scale as the genomic coordinates for trees, and so we must have \(0 \leq x < L\), where \(L\) is the largest value in the /trees/breakpoints dataset.

As for the coalescence records in the trees group, mutation records are stored as seperate vectors for efficiency reasons. Mutations must be stored in nondecreasing order of position.

Path Type Dim
/mutations/node H5T_STD_U32LE M
/mutations/position H5T_IEEE_F64LE M

Trees group

The trees group is mandatory and describes the topology of the tree sequence. The trees group contains a number of nested groups and datasets, which we will describe in turn.

Breakpoints dataset

The /trees/breakpoints dataset records the floating point positions of the breakpoints between trees in the tree sequence, and the flanking positions \(0\) and \(L\). Positions in the /trees/records group refer to (zero based) indexes into this array. The first breakpoint must be zero, and they must be listed in increasing order.

Path Type
/trees/breakpoints H5T_IEEE_F64LE

Nodes group

The /trees/nodes group records information about the individual nodes in a tree sequence. Leaf nodes (from \(0\) to \(n - 1\)) represent the samples and internal nodes (\(\geq n\)) represent their ancestors. Each node corresponds to a particular individual that lived at some time time in the history of the sample. The nodes group is used to record information about these individuals.

Path Type
/trees/nodes/population H5T_STD_U32LE
/trees/nodes/time H5T_IEEE_F64LE

Records group

The /trees/records group stores the individual coalesence records. Each record consists of four pieces of information: the left and right coordinates of the coalescing interval, the list of child nodes and the parent node.

The left and right datasets are indexes into the /trees/breakpoints dataset and define the genomic interval over which the record applies. The interval is half-open, so that the left coordinate is inclusive and the right coordinate is exclusive.

The node dataset records the parent node of the record, and is an index into the /trees/nodes group.

The num_children dataset records the number of children for a particular record. The children dataset then records the actual child nodes for each coalescence record. This 1-dimensional array lists the child nodes for every record in order, and therefore by using the num_children array we can efficiently recover the actual children involved in each event. Within a given event, child nodes must be sorted in increasing order. The records must be listed in time increasing order.

Path Type Dim
/trees/left H5T_STD_U32LE N
/trees/right H5T_STD_U32LE N
/trees/node H5T_STD_U32LE N
/trees/num_children H5T_STD_U32LE N
/trees/children H5T_STD_U32LE \(\leq 2 \times\) N

Indexes group

The /trees/indexes group records information required to efficiently reconstruct the individual trees from the tree sequence. The insertion_order dataset contains the order in which records must be applied and the removal_order dataset the order in which records must be removed for a left-to-right traversal of the trees.

Path Type
/trees/indexes/insertion_order H5T_STD_U32LE
/trees/indexes/removal_order H5T_STD_U32LE