Tree Sequence Formats¶
The correlated genealogical trees that describe the shared ancestry of set of
samples are stored concisely in msprime
as a sequence of coalescent events,
represented by a collection of easytounderstand tables. These are output by
coalescent simulation in msprime
or can be read in from another source. To
make this information as efficient and easy as possible to use, we store the
data on disk in a HDF5 based file format.
This page fully documents these tables and the associated HDF5 format, allowing
efficient and convenient access to the genealogical data generated by
msprime
outside of the native Python API. Using the
specification defined here, it should be straightforward to access tree
sequence information in any language with HDF5 support, or to
store genealogies from other sources efficiently using msprime
.
Definitions¶
To begin, here are definitions of some key ideas encountered later. This will
define the terminology, as well as giving properties of the tables that these
are stored in. These are properties that can be assumed when writing methods
that operate on an msprime
tree sequence; the function sort_tables
is
provided to put unsorted tables in the proper order.
First are those that describe genealogical relationships:
Todo
Update the definitions here to change Edgeset to Edge.
 tree
 A “gene tree”, i.e., the genealogical tree describing how each of the individuals at the tips of the tree are related to each other. A “tree sequence” contains information sufficient to reconstruct the genealogical tree relating all samples to each other at any point along the genome.
 node
Each branching point in each tree is associated with a particular ancestor, called “nodes”. Since each node represents a certain ancestor, it has a unique
time
, thought of as her birth time, which determines the height of any branching points she is associated with. A given node will be associated with branching points of all trees across a region if that node is the most recent common ancestor to the subtending tips across that region. For each node, we record:(flags, population, time)
where
flags
records information about the ancestor;population
is the integer ID of the ancestor’s (birth) population, andtime
is how long ago the ancestor was born. Each node also has a unique (integer) ID, but this is not recorded explicitly  rather, the individual’s ID is given by their position in the tree sequence’s node table. samples
 Those nodes in the tree that we have obtained data from. These are
distinguished from other nodes by the fact that a tree sequence must
describe the genealogical history of all samples at every point on the
genome. These are a special kind of node, having
flags
set to 1 (as a binary mask).  edge
Tree sequences are constructed by specifying over which segments of genome which nodes inherit from which other nodes. This information is stored by recording:
(left, right, parent, child)
where each node in
child
inherits from the nodeparent
on the halfopen interval of chromosome[left, right)
.
Here are the formal requirements for a set of nodes and edges to make sense,
and to allow msprime
’s algorithms to work properly.
To disallow time travel and multiple inheritance:
 Offspring must be born after their parents (and hence, no loops).
 The set of intervals on which each individual is a child must be disjoint.
and for algorithmic reasons:
 The leftmost endpoint of each chromosome is 0.0.
 Node times must be strictly greater than zero.
 Edges must be sorted in nondecreasing time order.
 The set of intervals on which each individual is a parent must be disjoint.
A set of tables satisfying requirements 14 can be transformed into a completely
valid set of tables by applying first sort_tables()
(which ensures 5)
and then simplify_tables()
(which ensures 6).
Note that since each node time is equal to the (birth) time of the corresponding parent, time is measured in clock time (not meioses).
Mutations¶
In addition to genealogical relationships, msprime
generates and stores
mutations. Associating these with nodes means that a variant shared by many
individuals need only be stored once, allowing retrieval and processing of
variant information much more efficiently than if every individual’s genotype
was stored directly.
 mutation
This type records a mutation that has occurred at some point in the genealogical history. Each mutation is associated with a particular
node
(i.e., a particular ancestor), so that any sample which inherits from that node will also inherit that mutation, unless another mutation intervenes. The type records:site node derived_state 0 14 1
Here
site
is the index of thesite
at which the mutation occurred,node
records the ID of the ancestral node associated with the mutation, andderived_state
is the allele that any sample inheriting from that node at this site will have if another mutation does not intervene. Thenode
is not necessarily the ancestor in whom the mutation occurred, but rather the ancestor at the bottom of the branch in the tree at that site on which the mutation occurred. site
Rather than storing a position on the genome directly, a
mutation
stores the index of asite
, that describes that position. This is to allow efficient processing of multiple mutations at the same genomic position. Asite
records a position on the genome where a mutation has occurred along with the ancestral state (i.e., the state at the root of the tree at that position):id position ancestral_state 0 0.1 0
As with nodes, the
id
is not stored directly, but is implied by its index in the site table.
To allow for efficent algorithms, it is required that
 Sites are sorted by increasing position,
 and mutations are sorted by site.
Mutations at the same site may currently be in any order, but this may change.
Migrations¶
In simulations trees can be thought of as spread across space, and it is helpful for inferring demographic history to record this history. This is stored using the following type.
 migration
Migrations are performed by individual ancestors, but most likely not by an individual tracked as a
node
(as in a discretedeme model they are unlikely to be both a migrant and a most recent common ancestor). So,msprime
records when a segment of ancestry has moved between populations:left right node source dest time 0.0 0.3 3 0 1 2.1
This
migration
records that the ancestor who was alive 2.1 time units in the past from whichnode
3 inherited the segment of genome between 0.0 and 0.3 migrated from population 0 to population 1.
A valid migration
:
 Has
time
strictly between the time of itsnode
and the time of any ancestral node from which that node inherits on the segment[left, right)
.  Has the
population
of any such ancestor matchingsource
, if anothermigration
does not intervene.
Working with Tables¶
Here is an example. Consider the following sequence of trees:
time

1.0 6
0.7 / \ 5
/ x / \
0.5 / 4 4 / 4
/ / \ / x / / \
0.4 / / \ / 3 / / \
/ / \ / / \ / / \
/ / \ / / x / / \
/ / \ / / \ / / \
0.0 0 1 2 1 0 2 0 1 2
position 0.0 0.2 0.8 1.0
First, we specify the nodes:
NodeTable:
id is_sample population time
0 1 0 0
1 1 0 0
2 1 0 0
3 0 0 0.4
4 0 0 0.5
5 0 0 0.7
6 0 0 1.0
Importantly, the first column, id
, is not actually recorded, and is
only shown when printing out node tables (as here) for convenience. This has
three samples: nodes 0, 1, and 2, and lists their birth times. Then, we
specify the edges:
EdgesetTable:
left right parent children
0.2 0.8 3 0
0.2 0.8 3 2
0.0 0.2 4 1
0.0 0.2 4 2
0.2 0.8 4 1
0.2 0.8 4 3
0.8 1.0 4 1
0.8 1.0 4 2
0.8 1.0 5 0
0.8 1.0 5 4
0.0 0.2 6 0
0.0 0.2 6 4
Since node 3 is most recent, the edgeset that says that nodes 0 and 2 inherit from node 3 on the interval between 0.2 and 0.8 comes first. Next are the edges from node 4: there are three of these, for each of the three genomic intervals over which node 4 is ancestor to a distinct set of nodes. At this point, we know the full tree on the middle interval. Finally, edges specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 and 5 respectively) allow us to construct all trees across the entire interval.
In the depiction above, x
denotes mutations. Suppose that the first
mutation occurs at position 0.1 and the mutations in the second tree both
occurred at the same position, at 0.5 (with a back mutation). The positions
are recorded in the sites table:
SiteTable:
id position ancestral_state
0 0.1 0
1 0.5 0
As with node tables, the id
column is not actually recorded, but is
implied by the position in the table. The acutal mutations are then recorded:
MutationTable:
site node derived_state
0 4 1
1 3 1
1 2 0
This would then result in the following (twolocus) haplotypes for the three samples:
sample haplotype
 
0 01
1 10
2 11
Tables API¶
Tables provide a convenient method for viewing, importing and exporting tree
sequences. msprime
provides direct access to the the columns of a table as
numpy
arrays: for instance, if n
is a NodeTable
, then n.time
will return an array containing the birth times of the individuals in the
table. However, it is important to note that this is not a shallow copy:
modifying n.time
will not change the node table n
. This may change in
the future, but currently there are two ways to modify tables: .add_row()
and .set_columns()
(and also .reset()
, which empties the table).
The example node table above would be constructed using .add_row()
as
follows:
n = msprime.NodeTable()
sv = [True, True, True, False, False, False, False]
tv = [0.0, 0.0, 0.0, 0.4, 0.5, 0.7, 1.0]
pv = [0, 0, 0, 0, 0, 0, 0]
for s, t, p in zip(sv, tv, pv):
n.add_row(flags=s, population=p, time=t)
print(n)
The .add_row()
method is natural (and should be reasonably efficient) if
new records appear onebyone. In the example above it would have been more
natural to use .set_columns()
:
n = msprime.NodeTable()
n.set_columns(flags=sv, population=pv, time=tv)
Finally, here is an example where we add 1.4 to every time
except the first
in the NodeTable constructed above using numpy
indexing:
fn = n.flags
pn = n.population
tn = n.time
tn[1:] = tn[1:] + 1.4
n.set_columns(flags=fn, population=pn, time=tn)
Sorting and simplifying tables¶
Tables that are noncontradictory but do not satisfy all algorithmic requirements listed above may be converted to a TreeSequence by first sorting, then simplifying them (both operate on the tables in place):

msprime.
sort_tables
(nodes, edges[, migrations, sites, mutations, edge_start])¶ Sorts the given tables in place, as follows:
Edges are ordered by
 time of parent, then
 parent node ID, then
 child node ID, then
 left endpoint.
Sites are ordered by position, and Mutations are ordered by site.
If the
edge_start
parameter is provided, this specifies the index in the edge table where sorting should start. Only rows with index greater than or equal toedge_start
are sorted; rows before this index are not affected. This parameter is provided to allow for efficient sorting when the user knows that the edges up to a given index are already sorted.Todo
Update this documentation to describe the keyword arguments and combinations that are allowed.
Parameters:  nodes (NodeTable) –
 edges (EdgeTable) –
 migrations (MigrationTable) –
 sites (SiteTable) –
 mutations (MutationTable) –
 edge_start (int) – The index in the edge table where sorting starts.
Note: the following function is more general than
TreeSequence.simplify()
, since it can be applied to tables not satisfying
all criteria above (and that hence could not be loaded into a TreeSequence).

msprime.
simplify_tables
(samples, nodes, edges[, migrations, sites, mutations])¶ Simplifies the tables, in place, to retain only the information necessary to reconstruct the tree sequence describing the given
samples
. This will change the ID of the nodes, so that the individualsamples[k]]
will have IDk
in the result. The resulting NodeTable will have only the firstlen(samples)
individuals marked as samples. The mapping from node IDs in the current set of tables to their equivalent values in the simplified tables is returned as a numpy array. If an arraya
is returned by this function andu
is the ID of a node in the input table, thena[u]
is the ID of this node in the output table. For any nodeu
that is not mapped into the output tables, this mapping will equal1
.Tables operated on by this function must: be sorted (see
sort_tables
), have children be born strictly after their parents, and the intervals on which any individual is a child must be disjoint; but other than this the tables need not satisfy remaining requirements to specify a valid tree sequence (but the resulting tables will).Parameters:  samples (list) – A list of Node IDs of individuals to retain as samples.
 nodes (NodeTable) – The NodeTable to be simplified.
 edges (EdgeTable) – The EdgeTable to be simplified.
 migrations (MigrationTable) – The MigrationTable to be simplified.
 sites (SiteTable) – The SiteTable to be simplified.
 mutations (MutationTable) – The MutationTable to be simplified.
 filter_invariant_sites (bool) – Whether to remove sites that have no mutations from the output (default: True).
Returns: A numpy array mapping node IDs in the input tables to their corresponding node IDs in the output tables.
Return type: numpy array (dtype=np.int32)
Import and export¶
This section describes how to extract tables from a TreeSequence
, and how
to construct a TreeSequence
from tables. Since tree sequences are
immutible, often the best way to modify a TreeSequence
is something along
the lines of (for ts
a TreeSequence
):
nodes = msprime.NodeTable()
edges = msprime.EdgeTable()
ts.dump_tables(nodes=nodes, edges=edges)
# (modify nodes and edges)
ts.load_tables(nodes=nodes, edges=edges)

classmethod
TreeSequence.
load_tables
(**kwargs)¶

TreeSequence.
dump_tables
(nodes=None, edges=None, migrations=None, sites=None, mutations=None, provenances=None) Copy the contents of the tables underlying the tree sequence to the specified objects.
Parameters:  nodes (NodeTable) – The NodeTable to load the nodes into.
 edges (EdgeTable) – The EdgeTable to load the edges into.
 migrations (MigrationTable) – The MigrationTable to load the migrations into.
 sites (SiteTable) – The SiteTable to load the sites into.
 mutations (ProvenanceTable) – The MutationTable to load the mutations into.
 mutations – The ProvenanceTable to load the provenances into.
Returns: A
TableCollection
containing all tables underlying the tree sequence.Return type: TableCollection
HDF5 Format¶
The file format is broken into a number of groups. Each group contains datasets to define the data along with attributes to provide necessary contextual information.
The root group contains one attributes, format_version
. This
is a pair (major, minor)
describing the file format version. This
document describes version 3.2.
Path  Type  Dim  Description 

/format_version  H5T_STD_U32LE  2  The (major, minor) file format version. 
Provenance dataset¶
The provenance dataset records information relating the the provenance of a particular tree sequence file. When a tree sequence file is generated all the information required to reproduce the file should be encoded as a string and stored in this dataset. Subsequent modifications to the file should be also be recorded and appended to the list of strings.
The format of these strings is implementation defined. In the
current version of msprime
provenance information is encoded
as JSON. This information is incomplete, and will be updated in future
versions.
Path  Type  Dim  Description 

/provenance  H5T_STRING  Scalar  Provenance information. 
Mutations group¶
The mutations
group is optional, and describes the location of mutations
with respect to tree nodes and their positions along the sequence. Each mutation
consists of a node (which must be defined in the trees
group) and a
position. Positions are defined as a floating point value to allow us to
express infinite sites mutations. A mutation position \(x\) is defined on the same
scale as the genomic coordinates for trees, and so we must have
\(0 \leq x < L\), where \(L\) is the largest value in the
/trees/breakpoints
dataset.
As for the coalescence records in the trees
group, mutation records are
stored as seperate vectors for efficiency reasons. Mutations must be stored
in nondecreasing order of position.
Path  Type  Dim 

/mutations/node  H5T_STD_U32LE  M 
/mutations/position  H5T_IEEE_F64LE  M 
Trees group¶
The trees
group is mandatory and describes the topology of the tree
sequence. The trees
group contains a number of nested groups and datasets,
which we will describe in turn.
Breakpoints dataset¶
The /trees/breakpoints
dataset records the floating point positions of the
breakpoints between trees in the tree sequence, and the flanking positions
\(0\) and \(L\). Positions in the /trees/records
group refer to
(zero based) indexes into this array. The first breakpoint must be zero, and
they must be listed in increasing order.
Path  Type 

/trees/breakpoints  H5T_IEEE_F64LE 
Nodes group¶
The /trees/nodes
group records information about the individual
nodes in a tree sequence. Leaf nodes (from \(0\) to \(n  1\))
represent the samples and internal nodes (\(\geq n\)) represent
their ancestors. Each node corresponds to a particular individual that
lived at some time time in the history of the sample. The nodes
group is used to record information about these individuals.
Path  Type 

/trees/nodes/population  H5T_STD_U32LE 
/trees/nodes/time  H5T_IEEE_F64LE 
Records group¶
The /trees/records
group stores the individual coalesence records.
Each record consists of four pieces of information: the left and
right coordinates of the coalescing interval, the list of child nodes
and the parent node.
The left
and right
datasets are indexes into the /trees/breakpoints
dataset and define the genomic interval over which the record applies. The
interval is halfopen, so that the left coordinate is inclusive and the right
coordinate is exclusive.
The node
dataset records the parent node of the record, and is
an index into the /trees/nodes
group.
The num_children
dataset records the number of children for a particular
record. The children
dataset then records the actual child nodes for each
coalescence record. This 1dimensional array lists the child nodes for every
record in order, and therefore by using the num_children
array we can
efficiently recover the actual children involved in each event. Within a given
event, child nodes must be sorted in increasing order. The records must be
listed in time increasing order.
Path  Type  Dim 

/trees/left  H5T_STD_U32LE  N 
/trees/right  H5T_STD_U32LE  N 
/trees/node  H5T_STD_U32LE  N 
/trees/num_children  H5T_STD_U32LE  N 
/trees/children  H5T_STD_U32LE  \(\leq 2 \times\) N 
Indexes group¶
The /trees/indexes
group records information required to efficiently
reconstruct the individual trees from the tree sequence. The
insertion_order
dataset contains the order in which records must be applied
and the removal_order
dataset the order in which records must be
removed for a lefttoright traversal of the trees.
Path  Type 

/trees/indexes/insertion_order  H5T_STD_U32LE 
/trees/indexes/removal_order  H5T_STD_U32LE 