Tree Sequence Formats¶
The correlated genealogical trees that describe the shared ancestry of set of
samples are stored concisely in
msprime as a sequence of coalescent events,
represented by a collection of easy-to-understand tables. These are output by
coalescent simulation in
msprime or can be read in from another source. To
make this information as efficient and easy as possible to use, we store the
data on disk in a HDF5 based file format.
This page fully documents these tables and the associated HDF5 format, allowing
efficient and convenient access to the genealogical data generated by
msprime outside of the native Python API. Using the
specification defined here, it should be straightforward to access tree
sequence information in any language with HDF5 support, or to
store genealogies from other sources efficiently using
To begin, here are definitions of some key ideas encountered later. This will
define the terminology, as well as giving properties of the tables that these
are stored in. These are properties that can be assumed when writing methods
that operate on an
msprime tree sequence; the function
provided to put unsorted tables in the proper order.
First are those that describe genealogical relationships:
Update the definitions here to change Edgeset to Edge.
- A “gene tree”, i.e., the genealogical tree describing how each of the individuals at the tips of the tree are related to each other. A “tree sequence” contains information sufficient to reconstruct the genealogical tree relating all samples to each other at any point along the genome.
Each branching point in each tree is associated with a particular ancestor, called “nodes”. Since each node represents a certain ancestor, it has a unique
time, thought of as her birth time, which determines the height of any branching points she is associated with. A given node will be associated with branching points of all trees across a region if that node is the most recent common ancestor to the subtending tips across that region. For each node, we record:
(flags, population, time)
flagsrecords information about the ancestor;
populationis the integer ID of the ancestor’s (birth) population, and
timeis how long ago the ancestor was born. Each node also has a unique (integer) ID, but this is not recorded explicitly - rather, the individual’s ID is given by their position in the tree sequence’s node table.
- Those nodes in the tree that we have obtained data from. These are
distinguished from other nodes by the fact that a tree sequence must
describe the genealogical history of all samples at every point on the
genome. These are a special kind of node, having
flagsset to 1 (as a binary mask).
Tree sequences are constructed by specifying over which segments of genome which nodes inherit from which other nodes. This information is stored by recording:
(left, right, parent, children)
where each node in the list
childreninherits from the node
parenton the half-open interval of chromosome
Here are the formal requirements for a set of nodes and edgesets to make sense,
and to allow
msprime‘s algorithms to work properly.
To disallow time travel and multiple inheritance:
- Offspring must be born after their parents (and hence, no loops).
- The set of intervals on which each individual is a child must be disjoint.
and for algorithmic reasons:
- The leftmost endpoint of each chromosome is 0.0.
- Node times must be strictly greater than zero.
- The list of offspring in an edgeset must be sorted.
- Edgesets must be sorted in nondecreasing time order.
- The set of intervals on which each individual is a parent must be disjoint.
- Each edgeset must contain at least two children.
A set of tables satisfying requirements 1-4 can be transformed into a completely
valid set of tables by applying first
sort_tables() (which ensures 5 and 6)
simplify_tables() (which ensures 7 and 8).
Note that since each node time is equal to the (birth) time of the corresponding parent, time is measured in clock time (not meioses).
In addition to genealogical relationships,
msprime generates and stores
mutations. Associating these with nodes means that a variant shared by many
individuals need only be stored once, allowing retrieval and processing of
variant information much more efficiently than if every individual’s genotype
was stored directly.
This type records a mutation that has occurred at some point in the genealogical history. Each mutation is associated with a particular
node(i.e., a particular ancestor), so that any sample which inherits from that node will also inherit that mutation, unless another mutation intervenes. The type records:
site node derived_state 0 14 1
siteis the index of the
siteat which the mutation occurred,
noderecords the ID of the ancestral node associated with the mutation, and
derived_stateis the allele that any sample inheriting from that node at this site will have if another mutation does not intervene. The
nodeis not necessarily the ancestor in whom the mutation occurred, but rather the ancestor at the bottom of the branch in the tree at that site on which the mutation occurred.
Rather than storing a position on the genome directly, a
mutationstores the index of a
site, that describes that position. This is to allow efficient processing of multiple mutations at the same genomic position. A
siterecords a position on the genome where a mutation has occurred along with the ancestral state (i.e., the state at the root of the tree at that position):
id position ancestral_state 0 0.1 0
As with nodes, the
idis not stored directly, but is implied by its index in the site table.
To allow for efficent algorithms, it is required that
- Sites are sorted by increasing position,
- and mutations are sorted by site.
Mutations at the same site may currently be in any order, but this may change.
In simulations trees can be thought of as spread across space, and it is helpful for inferring demographic history to record this history. This is stored using the following type.
Migrations are performed by individual ancestors, but most likely not by an individual tracked as a
node(as in a discrete-deme model they are unlikely to be both a migrant and a most recent common ancestor). So,
msprimerecords when a segment of ancestry has moved between populations:
left right node source dest time 0.0 0.3 3 0 1 2.1
migrationrecords that the ancestor who was alive 2.1 time units in the past from which
node3 inherited the segment of genome between 0.0 and 0.3 migrated from population 0 to population 1.
timestrictly between the time of its
nodeand the time of any ancestral node from which that node inherits on the segment
- Has the
populationof any such ancestor matching
source, if another
migrationdoes not intervene.
Working with Tables¶
Here is an example. Consider the following sequence of trees:
time ---- 1.0 6 0.7 / \ 5 / x / \ 0.5 / 4 4 / 4 / / \ / x / / \ 0.4 / / \ / 3 / / \ / / \ / / \ / / \ / / \ / / x / / \ / / \ / / \ / / \ 0.0 0 1 2 1 0 2 0 1 2 position 0.0 0.2 0.8 1.0
First, we specify the nodes:
NodeTable: id is_sample population time 0 1 0 0 1 1 0 0 2 1 0 0 3 0 0 0.4 4 0 0 0.5 5 0 0 0.7 6 0 0 1.0
Importantly, the first column,
id, is not actually recorded, and is
only shown when printing out node tables (as here) for convenience. This has
three samples: nodes 0, 1, and 2, and lists their birth times. Then, we
specify the edgesets:
EdgesetTable: left right parent children 0.2 0.8 3 0,2 0.0 0.2 4 1,2 0.2 0.8 4 1,3 0.8 1.0 4 1,2 0.8 1.0 5 0,4 0.0 0.2 6 0,4
Since node 3 is most recent, the edgeset that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edgesets from node 4: there are three of these, for each of the three genomic intervals over which node 4 is ancestor to a distinct set of nodes. At this point, we know the full tree on the middle interval. Finally, edgesets specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.
In the depiction above,
x denotes mutations. Suppose that the first
mutation occurs at position 0.1 and the mutations in the second tree both
occurred at the same position, at 0.5 (with a back mutation). The positions
are recorded in the sites table:
SiteTable: id position ancestral_state 0 0.1 0 1 0.5 0
As with node tables, the
id column is not actually recorded, but is
implied by the position in the table. The acutal mutations are then recorded:
MutationTable: site node derived_state 0 4 1 1 3 1 1 2 0
This would then result in the following (two-locus) haplotypes for the three samples:
sample haplotype ------ --------- 0 01 1 10 2 11
Tables provide a convenient method for viewing, importing and exporting tree
msprime provides direct access to the the columns of a table as
numpy arrays: for instance, if
n is a
will return an array containing the birth times of the individuals in the
table. However, it is important to note that this is not a shallow copy:
n.time will not change the node table
n. This may change in
the future, but currently there are two ways to modify tables:
.set_columns() (and also
.reset(), which empties the table).
The example node table above would be constructed using
n = msprime.NodeTable() sv = [True, True, True, False, False, False, False] tv = [0.0, 0.0, 0.0, 0.4, 0.5, 0.7, 1.0] pv = [0, 0, 0, 0, 0, 0, 0] for s, t, p in zip(sv, tv, pv): n.add_row(flags=s, population=p, time=t) print(n)
.add_row() method is natural (and should be reasonably efficient) if
new records appear one-by-one. In the example above it would have been more
natural to use
n = msprime.NodeTable() n.set_columns(flags=sv, population=pv, time=tv)
Finally, here is an example where we add 1.4 to every
time except the first
in the NodeTable constructed above using
fn = n.flags pn = n.population tn = n.time tn[1:] = tn[1:] + 1.4 n.set_columns(flags=fn, population=pn, time=tn)
Sorting and simplifying tables¶
Tables that are noncontradictory but do not satisfy all algorithmic requirements listed above may be converted to a TreeSequence by first sorting, then simplifying them (both operate on the tables in place):
sort_tables(nodes, edgesets[, migrations, sites, mutations, edge_start])¶
Sorts the given tables in place, as follows:
Edges are ordered by
- time of parent, then
- parent node ID, then
- child node ID, then
- left endpoint.
Sites are ordered by position, and Mutations are ordered by site.
edge_startparameter is provided, this specifies the index in the edge table where sorting should start. Only rows with index greater than or equal to
edge_startare sorted; rows before this index are not affected. This parameter is provided to allow for efficient sorting when the user knows that the edges up to a given index are already sorted.
Update this documentation to describe the keyword arguments and combinations that are allowed.
Note: the following function is more general than
TreeSequence.simplify(), since it can be applied to tables not satisfying
all criteria above (and that hence could not be loaded into a TreeSequence).
simplify_tables(samples, nodes, edgesets[, migrations, sites, mutations])¶
Simplifies the tables, in place, to retain only the information necessary to reconstruct the tree sequence describing the given
samples. This will change the ID of the nodes, so that the individual
samples[k]]will have ID
kin the result. The resulting NodeTable will have only the first
len(samples)individuals marked as samples.
Tables operated on by this function must: be sorted (see
sort_tables), have children be born strictly after their parents, and the intervals on which any individual is a child must be disjoint; but other than this the tables need not satisfy remaining requirements to specify a valid tree sequence (but the resulting tables will).
- samples (list) – A list of Node IDs of individuals to retain as samples.
- nodes (NodeTable) – The NodeTable to be simplified.
- edges (EdgeTable) – The NodeTable to be simplified.
- migrations (MigrationTable) – The MigrationTable to be simplified.
- sites (SiteTable) – The SiteTable to be simplified.
- mutations (MutationTable) – The MutationTable to be simplified.
- filter_invariant_sites (bool) – Whether to remove sites that have no mutations from the output (default: True).
Import and export¶
This section describes how to extract tables from a
TreeSequence, and how
to construct a
TreeSequence from tables. Since tree sequences are
immutible, often the best way to modify a
TreeSequence is something along
the lines of (for
nodes = msprime.NodeTable() edges = msprime.EdgeTable() ts.dump_tables(nodes=nodes, edges=edges) # (modify nodes and edges) ts.load_tables(nodes=nodes, edges=edges)
dump_tables(nodes=None, edges=None, migrations=None, sites=None, mutations=None)
Copy the contents of the tables underlying the tree sequence to the specified objects.
- nodes (NodeTable) – The NodeTable to load the nodes into.
- edges (EdgeTable) – The EdgeTable to load the edges into.
- migrations (MigrationTable) – The MigrationTable to load the migrations into.
- sites (SiteTable) – The SiteTable to load the sites into.
- mutations (MutationTable) – The NodeTable to load the mutations into.
TableCollectioncontaining all tables underlying the tree sequence.
The file format is broken into a number of groups. Each group contains datasets to define the data along with attributes to provide necessary contextual information.
The root group contains one attributes,
is a pair
(major, minor) describing the file format version. This
document describes version 3.2.
|/format_version||H5T_STD_U32LE||2||The (major, minor) file format version.|
The provenance dataset records information relating the the provenance of a particular tree sequence file. When a tree sequence file is generated all the information required to reproduce the file should be encoded as a string and stored in this dataset. Subsequent modifications to the file should be also be recorded and appended to the list of strings.
The format of these strings is implementation defined. In the
current version of
msprime provenance information is encoded
as JSON. This information is incomplete, and will be updated in future
mutations group is optional, and describes the location of mutations
with respect to tree nodes and their positions along the sequence. Each mutation
consists of a node (which must be defined in the
trees group) and a
position. Positions are defined as a floating point value to allow us to
express infinite sites mutations. A mutation position \(x\) is defined on the same
scale as the genomic coordinates for trees, and so we must have
\(0 \leq x < L\), where \(L\) is the largest value in the
As for the coalescence records in the
trees group, mutation records are
stored as seperate vectors for efficiency reasons. Mutations must be stored
in nondecreasing order of position.
trees group is mandatory and describes the topology of the tree
trees group contains a number of nested groups and datasets,
which we will describe in turn.
/trees/breakpoints dataset records the floating point positions of the
breakpoints between trees in the tree sequence, and the flanking positions
\(0\) and \(L\). Positions in the
/trees/records group refer to
(zero based) indexes into this array. The first breakpoint must be zero, and
they must be listed in increasing order.
/trees/nodes group records information about the individual
nodes in a tree sequence. Leaf nodes (from \(0\) to \(n - 1\))
represent the samples and internal nodes (\(\geq n\)) represent
their ancestors. Each node corresponds to a particular individual that
lived at some time time in the history of the sample. The
group is used to record information about these individuals.
/trees/records group stores the individual coalesence records.
Each record consists of four pieces of information: the left and
right coordinates of the coalescing interval, the list of child nodes
and the parent node.
right datasets are indexes into the
dataset and define the genomic interval over which the record applies. The
interval is half-open, so that the left coordinate is inclusive and the right
coordinate is exclusive.
node dataset records the parent node of the record, and is
an index into the
num_children dataset records the number of children for a particular
children dataset then records the actual child nodes for each
coalescence record. This 1-dimensional array lists the child nodes for every
record in order, and therefore by using the
num_children array we can
efficiently recover the actual children involved in each event. Within a given
event, child nodes must be sorted in increasing order. The records must be
listed in time increasing order.
|/trees/children||H5T_STD_U32LE||\(\leq 2 \times\) N|
/trees/indexes group records information required to efficiently
reconstruct the individual trees from the tree sequence. The
insertion_order dataset contains the order in which records must be applied
removal_order dataset the order in which records must be
removed for a left-to-right traversal of the trees.