Tree Sequence File Format
The correlated trees output by a coalescent simulation are stored very
concisely in msprime as a sequence of coalescence records. To make this
information as efficient and easy to use as possible, we store the data in an
HDF5-based file format. This page fully documents this format, allowing
efficient and convenient access to the genealogical data generated by
msprime outside of the native Python API. Using the specification defined
here, it should be straightforward to access tree sequence information in any
language with HDF5 support.
The file format is broken into a number of groups. Each group contains datasets to define the data along with attributes to provide necessary contextual information.
The root group contains one attribute, format_version, which is a pair
(major, minor) describing the file format version. This
document describes version 3.1.
| Path | Type | Dim | Description |
|------|------|-----|-------------|
| /format_version | H5T_STD_U32LE | 2 | The (major, minor) file format version. |
The provenance dataset records information relating to the provenance of a particular tree sequence file. When a tree sequence file is generated, all the information required to reproduce the file should be encoded as a string and stored in this dataset. Subsequent modifications to the file should also be recorded and appended to the list of strings.
The format of these strings is implementation defined. In the
current version of msprime, provenance information is encoded
as JSON. This information is incomplete, and will be updated in future
versions.
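As a rough illustration of the append-only list of provenance strings described above, the sketch below serialises each step as a JSON string and adds it to the end of the list. The helper name `append_provenance` and the record keys (`command`, `random_seed`) are hypothetical and chosen only for this example; they are not the schema that msprime actually writes.

```python
import json


def append_provenance(provenance, record):
    """Return a new provenance list with `record` appended as a JSON string.

    `record` is any JSON-serialisable dict; its keys are hypothetical here.
    """
    return list(provenance) + [json.dumps(record, sort_keys=True)]


# Hypothetical history: the file is generated, then modified once.
provenance = append_provenance([], {"command": "simulate", "random_seed": 1})
provenance = append_provenance(provenance, {"command": "add_mutations"})

# A reader can decode every entry back into a dict, oldest first.
decoded = [json.loads(entry) for entry in provenance]
```

Keeping each step as its own string means that later tools only ever append, and never need to parse or rewrite earlier entries.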
The mutations group is optional, and describes the location of mutations
with respect to tree nodes and their positions along the sequence. Each mutation
consists of a node (which must be defined in the trees group) and a
position. Positions are defined as floating point values to allow us to
express infinite sites mutations. A mutation position \(x\) is defined on the same
scale as the genomic coordinates for trees, and so we must have
\(0 \leq x < L\), where \(L\) is the largest value in the
/trees/breakpoints dataset.
As for the coalescence records in the trees group, mutation records are
stored as separate vectors for efficiency reasons. Mutations must be stored
in nondecreasing order of position.
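The constraints on mutations stated above can be checked mechanically. The following is a minimal sketch that takes the two mutation vectors as plain Python lists; the function name and its parameter names (`mutation_nodes`, `mutation_positions`, `num_nodes`) are assumptions made for this example, and the node check simply assumes that valid node IDs run from 0 to `num_nodes - 1`.

```python
def check_mutations(mutation_nodes, mutation_positions, sequence_length, num_nodes):
    """Validate mutation vectors and return the number of mutations.

    Raises ValueError if the vectors differ in length, if a position lies
    outside [0, sequence_length), if positions decrease, or if a node ID is
    outside the assumed range 0 .. num_nodes - 1.
    """
    if len(mutation_nodes) != len(mutation_positions):
        raise ValueError("node and position vectors must have equal length")
    previous = 0.0
    for node, x in zip(mutation_nodes, mutation_positions):
        if not 0 <= x < sequence_length:
            raise ValueError("position %r is outside [0, L)" % (x,))
        if x < previous:
            raise ValueError("positions must be in nondecreasing order")
        if not 0 <= node < num_nodes:
            raise ValueError("node %r is not defined" % (node,))
        previous = x
    return len(mutation_positions)
```

For example, `check_mutations([0, 3], [0.5, 2.25], 10.0, 5)` accepts both mutations, whereas swapping the two positions raises `ValueError` because the order is no longer nondecreasing.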
The trees group is mandatory and describes the topology of the tree
sequence. The trees group contains a number of nested groups and datasets,
which we will describe in turn.
The /trees/breakpoints dataset records the floating point positions of the
breakpoints between trees in the tree sequence, and the flanking positions
\(0\) and \(L\). Positions in the /trees/records group refer to
(zero based) indexes into this array. The first breakpoint must be zero, and
the breakpoints must be listed in increasing order.
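The two requirements on the breakpoints array can be sketched as a small check over a plain Python list. The function name `check_breakpoints` is hypothetical, and reading "increasing" as strictly increasing is an assumption of this sketch rather than something the text states.

```python
def check_breakpoints(breakpoints):
    """Validate a breakpoints array and return the sequence length L.

    Assumes "increasing" means strictly increasing, so that no tree covers
    an empty interval.
    """
    if len(breakpoints) < 2:
        raise ValueError("need at least the flanking positions 0 and L")
    if breakpoints[0] != 0:
        raise ValueError("the first breakpoint must be zero")
    for a, b in zip(breakpoints, breakpoints[1:]):
        if not a < b:
            raise ValueError("breakpoints must be in increasing order")
    # The last entry is the flanking position L.
    return breakpoints[-1]
```

With breakpoints `[0.0, 2.5, 10.0]` the function returns `10.0`, which is the value \(L\) used to bound mutation positions.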
The /trees/nodes group records information about the individual
nodes in a tree sequence. Leaf nodes (from \(0\) to \(n - 1\))
represent the samples and internal nodes (\(\geq n\)) represent
their ancestors. Each node corresponds to a particular individual that
lived at some time in the history of the sample. This
group is used to record information about these individuals.
The /trees/records group stores the individual coalescence records.
Each record consists of four pieces of information: the left and
right coordinates of the coalescing interval, the list of child nodes
and the parent node.
The left and right datasets are indexes into the /trees/breakpoints
dataset and define the genomic interval over which the record applies. The
interval is half-open, so that the left coordinate is inclusive and the right
coordinate is exclusive.
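To make the indirection through the breakpoints array concrete, the sketch below resolves the coordinates of a record to genomic positions and finds the records that cover a given position under the half-open convention. All arrays are plain Python lists holding made-up values, and the function names are hypothetical.

```python
def record_interval(breakpoints, left, right, j):
    """Return the genomic interval [start, end) covered by record j."""
    return breakpoints[left[j]], breakpoints[right[j]]


def records_covering(breakpoints, left, right, x):
    """Return the indexes of all records whose half-open interval contains x."""
    return [
        j
        for j in range(len(left))
        if breakpoints[left[j]] <= x < breakpoints[right[j]]
    ]


# Made-up example: three records over a sequence of length 10.
breakpoints = [0.0, 2.5, 10.0]
left = [0, 1, 0]   # indexes into breakpoints
right = [1, 2, 2]  # indexes into breakpoints

first = record_interval(breakpoints, left, right, 0)
at_boundary = records_covering(breakpoints, left, right, 2.5)
```

Because the right coordinate is exclusive, position 2.5 belongs to the record starting at 2.5 and not to the one ending there, so `at_boundary` contains records 1 and 2 only.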
The node dataset records the parent node of the record, and is
an index into the /trees/nodes group.
The num_children dataset records the number of children for a particular
record. The children dataset then records the actual child nodes for each
coalescence record. This 1-dimensional array lists the child nodes for every
record in order, and therefore by using the num_children array we can
efficiently recover the actual children involved in each event. Within a given
event, child nodes must be sorted in increasing order. The records must be
listed in order of increasing time.
| Path | Type | Dim |
|------|------|-----|
| /trees/children | H5T_STD_U32LE | \(\leq 2 \times N\) |
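Recovering the children of each record from the flattened array amounts to walking through it with a running offset. The following is a minimal sketch over plain Python lists; the function name `split_children` is hypothetical and the example values are made up.

```python
def split_children(num_children, children):
    """Split the flat children array into one list of child nodes per record."""
    if sum(num_children) != len(children):
        raise ValueError("num_children does not account for every child")
    per_record = []
    offset = 0
    for count in num_children:
        # Record j owns the next `count` entries of the flat array.
        per_record.append(children[offset:offset + count])
        offset += count
    return per_record


# Made-up example: two binary records and one record with three children.
per_record = split_children([2, 3, 2], [0, 1, 2, 3, 4, 5, 6])
```

Here `per_record` is `[[0, 1], [2, 3, 4], [5, 6]]`. If random access to a single record is needed, the same offsets can be precomputed once as a cumulative sum of `num_children`.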
The /trees/indexes group records information required to efficiently
reconstruct the individual trees from the tree sequence. The
insertion_order dataset contains the order in which records must be applied,
and the removal_order dataset the order in which records must be
removed, for a left-to-right traversal of the trees.
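One way these two index arrays could drive a left-to-right traversal is sketched below. The text above does not spell out the sort keys, so this sketch assumes that insertion_order lists record indexes sorted by left coordinate and removal_order lists them sorted by right coordinate, and that a tree changes at every breakpoint. The generator name `iter_parent_maps`, the per-record children lists and the example data are all hypothetical.

```python
def iter_parent_maps(left, right, node, children, insertion_order,
                     removal_order, num_breakpoints):
    """Yield (tree_index, parent_map) for each tree, from left to right.

    `children` holds one list of child nodes per record. `parent_map` maps
    each child node to its parent in the current tree.
    """
    parent = {}
    num_records = len(left)
    i = 0  # next position in insertion_order
    r = 0  # next position in removal_order
    for tree_index in range(num_breakpoints - 1):
        # Remove the records whose interval ends at this breakpoint.
        while r < num_records and right[removal_order[r]] == tree_index:
            for child in children[removal_order[r]]:
                del parent[child]
            r += 1
        # Apply the records whose interval starts at this breakpoint.
        while i < num_records and left[insertion_order[i]] == tree_index:
            record = insertion_order[i]
            for child in children[record]:
                parent[child] = node[record]
            i += 1
        yield tree_index, dict(parent)


# Made-up example: three samples (0, 1, 2) and two trees.
left = [0, 0, 1]
right = [2, 1, 2]
node = [3, 4, 5]
children = [[0, 1], [2, 3], [2, 3]]
insertion_order = [0, 1, 2]  # assumed: sorted by left coordinate
removal_order = [1, 0, 2]    # assumed: sorted by right coordinate

parent_maps = list(iter_parent_maps(
    left, right, node, children, insertion_order, removal_order, 3))
```

Each step only touches the records that start or end at the current breakpoint, which is why adjacent trees that share most of their structure can be visited without rebuilding each tree from scratch.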