Fossils and networks 1 – Farris and Felsenstein


Over 60 years ago, Robert Sokal and Peter Sneath changed the way we quantitatively study evolution, by providing the first numerical approach to inferring a phylogenetic tree. At about the same time, but writing in German, Willi Hennig established the importance of distinguishing primitive from advanced character states, rather than treating all states as equal. This established the distinction between phenetics and phylogenetics; the latter is the basis of all modern studies, whether explicitly acknowledged or not.

More than two decades later, Steve Farris and the Willi Hennig Society (WHS) established parsimony as the standard approach for evaluating character-state changes in tree inference. In this approach, morphological traits are scored and arranged in data matrices, and then the most parsimonious solution explaining the data is sought. The result, usually a collection of most-parsimonious trees (MPTs), was considered to be the best approximation of the true tree. Clades in the trees were synonymized with monophyly sensu Hennig (1950; short English version published 1965), and grades with paraphyly: Cladistics was born (see also: Let's distinguish between Hennig and Cladistics).

Why parsimony? Joe Felsenstein, who was not a member of the WHS but brought us, among many other things, the nonparametric bootstrap (Felsenstein 1985), put it like this (Felsenstein 2001):
History: William of Ockham told Popper to tell Hennig to use parsimony
Soon, parsimony and cladistics came under threat from advances in computer technology and Kary Mullis' development of the polymerase chain reaction (PCR; Mullis & Faloona 1987) in the early 1980s (note: Mullis soon moved on to more fun stuff, outside science). While data analysis took ages (literally) in the early days, ever-speedier heuristics were invented for probabilistic inferences. PCR marked the beginning of the Molecular Revolution, and genetic data became easy to access. Soon, many researchers realized that parsimony trees perform badly with this new kind of data, a notion bitterly rejected by the parsimonists, organized mainly in the WHS: the "Phylogenetic Wars" raged.

The parsimony faction lost. Today, when we analyze (up to) terabytes of molecular data, we use probabilistic methods such as maximum likelihood (ML) and Bayesian inference (BI). However, one parsimony bastion has largely remained unfazed: palaeontology.

In a series of new posts, we will try to change that; and outline what easy-to-compute networks have to offer when analyzing non-molecular data.

It's just similarity, stupid!

One collateral damage of the Phylogenetic Wars was distance-based methods, which, still today, are sometimes classified as "phenetic" in opposition to the "phylogenetic" character-based methods (parsimony, ML, BI). Yet the first numerical phylogenetic trees were based not on character matrices but on distance matrices (eg. Michener & Sokal 1957, using a cluster algorithm; Cavalli-Sforza & Edwards 1965, using parsimony; see also Felsenstein 2004, pp. 123ff).

But no matter which method, optimality criterion or data type we use, in principle we do the analysis under the same basic assumptions:
  1. the more closely related two taxa are, the more similar they should be;
  2. the more similar two taxa are, the more recent their split.
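To make these assumptions operational, all we need is a pairwise distance: the share of scored characters in which two taxa differ (the simple, mean character distance used throughout this post). A minimal Python sketch, with an invented binary matrix that loosely mimics the A–D/Z/O setup of the figures (not the actual matrix behind them):

```python
# Invented binary character matrix (0 = primitive, 1 = derived state),
# loosely mimicking the example: A, B = modern ingroup; C, D = ingroup
# fossils; Z = outgroup fossil; O = modern outgroup.
matrix = {
    "A": [1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 0],
    "C": [1, 1, 0, 0, 0, 0],
    "D": [1, 1, 0, 0, 1, 0],
    "Z": [0, 0, 0, 0, 0, 0],
    "O": [0, 0, 0, 0, 0, 1],
}

def mean_distance(x, y):
    """Mean character distance: proportion of characters that differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

# All pairwise distances, most similar pairs first.
taxa = sorted(matrix)
pairs = sorted(
    (mean_distance(matrix[p], matrix[q]), p, q)
    for i, p in enumerate(taxa) for q in taxa[i + 1:]
)
```

Sorting these distances already recovers the structure described below: the smallest distances are within the pairs B + D, C + D and O + Z, while the largest separate the modern ingroup from the outgroup.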


The ingroup (blue clade) and outgroup (red clade) are most distant from each other. Placing the fossils is trivial: Z is closer to O than to C, the member of the ingroup with the fewest advanced character states. Ingroup sister taxa A + C and B + D are most similar to each other. The monophyly of the ingroup and its two subclades is perfectly reflected by the inter-taxon distances.


Assuming that the character distance reflects the phylogenetic distance (ie. the distance along the branches of the true tree), any numerical phylogenetic approach will succeed in finding the true tree. The Neighbor-Joining method (using either the Least Squares or Minimum Evolution criterion) will be the quickest calculation. The signal from such matrices is trivial: we are in the so-called "Farris Zone" (defined below).
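The criterion at the heart of Neighbor-Joining is itself easy to sketch. The Python toy below (plain NJ, not PAUP*'s BioNJ; the additive distance matrix is invented to match the true tree ((A,C),(B,D))) shows a single agglomeration step: build Saitou & Nei's Q-matrix and join the pair minimizing it.

```python
# One agglomeration step of plain Neighbor-Joining: compute Saitou & Nei's
# Q criterion and pick the pair of taxa to join. Distances are invented,
# additive on the true tree ((A,C),(B,D)).
def nj_pick(labels, d):
    """Return the pair minimizing Q(i,j) = (n-2)*d(i,j) - r(i) - r(j),
    where r(i) is the sum of distances from i to all other taxa."""
    n = len(labels)
    r = {a: sum(d[frozenset((a, b))] for b in labels if b != a) for a in labels}
    best_q, best_pair = None, None
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            q = (n - 2) * d[frozenset((a, b))] - r[a] - r[b]
            if best_q is None or q < best_q:
                best_q, best_pair = q, (a, b)
    return best_pair

labels = ["A", "B", "C", "D"]
d = {frozenset(p): v for p, v in [
    (("A", "C"), 2), (("B", "D"), 2), (("A", "B"), 6),
    (("A", "D"), 6), (("B", "C"), 6), (("C", "D"), 6),
]}
```

Because the distances are additive, the minimal Q is attained by a true cherry: here nj_pick joins A with C (B + D ties and would be joined in the next step).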

We wouldn't even have to infer a tree to get it right (ie. nested monophyly of A + C, B + D, A–D); we could just look at the heat map sorted by inter-taxon distance.


Just from the distance distributions, visualized in the form of a "heat-map", it is obvious that A–D are monophyletic, and that fossil Z is part of the outgroup lineage. As expected for members of the same phylogenetic lineage (because changes accumulate over time), the fossils C and D are still relatively close to each other, having few advanced character states, while the modern-day members A and B have diverged from each other (based on derived character states). Taxon B is most similar to D, while C is most similar to A. So, we can hypothesize that C is either a sister or precursor of A, and D of B. If C and D are stem group taxa (ie. they are paraphyletic), then we would expect both to show similar distances to A vs. B, and to be closer to the outgroup. If they represent an extinct sister lineage (ie. CD is monophyletic), they should be more similar to each other than to A or B. In both cases (CD either paraphyletic or monophyletic), A and B would be monophyletic, and so they should be relatively more similar to each other than to the fossils as well.

Having a black hole named after you

The Farris Zone is that part of the tree-space where the signals from the data are trivial: we have no branching artifacts, and any inference (tree or network) gives us the true tree.

Its opposite has been, unsurprisingly perhaps, labeled the "Felsenstein Zone". This is the part of the tree-space where branching artifacts are important — the inferred tree deviates from the true tree. Clades and grades (structural aspects of the tree) are no longer synonymous with monophyly and paraphyly (their evolutionary interpretation).

We can easily shift our example from the Farris into the Felsenstein Zone, by halving the distances between the fossils and the first (FCA) and last (LCA) common ancestors of ingroup and outgroup and adding some (random = convergence; lineage-restricted = homoiology) homoplasy to the long branches leading to the modern-day genera.


The difference between distance, parsimony and probabilistic methods lies in how we evaluate alternative tree topologies when similarity patterns become ambiguous — ie. when we approach or enter the Felsenstein Zone. Do all inferred mutations have the same probability? How clock-like is evolution? Are there convergence/saturation effects? How do we deal with missing data?

For our example, any tree inference method will infer a wrong AB clade, because the fossils lack enough traits shared only with their sisters but not with the other part of the ingroup. Only the roots are supported by exclusively shared (unique) derived traits (Hennigian "synapomorphies"). The long-branch attraction (LBA) between A and B is effectively caused by:
  • 'short-branch culling' between C and D: the fossils are too similar to each other; and their modern relatives too modified;
  • the character similarity between A and B underestimates the phylogenetic distance between A and B, due to derived traits that evolved in parallel (homoiologies).
While ML and NJ make the same decision, the three maximum-parsimony trees permute the options for placing C and D — excluding the correct ones, and including an impossible one (a hard trichotomy). Standard phylogenetic trees are (by definition) dichotomous, being based on the concept of cladogenesis: one lineage splits into two lineages.

MPTs inferred using PAUP*'s branch-and-bound algorithm (no heuristics: this algorithm is guaranteed to find the most parsimonious solution); NJ/LS using PAUP*'s BioNJ implementation and simple (mean) pairwise distances; ML using RAxML, corrected (asc.) and uncorrected (unc.*) for ascertainment bias (the character matrix has no invariable sites). None of the trees is rooted using a defined outgroup; all are mid-point rooted. If rooted with the known outgroup O, the fossil Z would be misinterpreted as an early member of the ingroup.

We have no ingroup-outgroup LBA, because the three convergent traits shared by O and A or B, respectively, compete with a total of eight lineage-unique and conserved traits (synapomorphies) — six characters are compatible with a O-A or O-B sister-relationship (clade in a rooted tree) but eight are incompatible. We correctly infer an A–D | O + Z split (ie. A–D clade when rooted with O) simply because A and B are still more similar to C and D than to Z and O; not some method- or model-inflicted magic.


The magic of non-parametric bootstrapping

When phylogeneticists perform bootstrapping, they usually do it to evaluate branch support — a clade alone is hardly sufficient to infer an inclusive common origin (Hennig's monophyly), so we add branch support to quantify its quality (Some things you probably don't know about the bootstrap). In palaeontology, however, this is not a general standard (Ockham's Razor applied but not used), for one simple reason: bootstrap values for critical branches in the trees are usually much lower than the (generally accepted) molecular threshold of 70+ for "good support" (All solved a decade ago).
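Mechanically, the non-parametric bootstrap is simple: resample the characters (columns) of the matrix with replacement, re-analyze each pseudoreplicate, and count how often a grouping reappears. A toy Python sketch (the three-taxon matrix is invented, and a nearest-neighbour check stands in for the per-replicate tree inference a real analysis would do):

```python
import random

def bootstrap_columns(matrix, rng):
    """One pseudoreplicate: resample character columns with replacement."""
    n_chars = len(next(iter(matrix.values())))
    cols = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: [states[c] for c in cols] for taxon, states in matrix.items()}

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

# Invented matrix in which C, not B, is clearly A's closest relative.
matrix = {
    "A": [1, 1, 1, 0, 1, 1, 0, 1],
    "B": [0, 0, 1, 1, 0, 1, 1, 1],
    "C": [1, 1, 1, 0, 1, 0, 0, 1],
}

rng = random.Random(42)
replicates = 1000
support = 0
for _ in range(replicates):
    rep = bootstrap_columns(matrix, rng)
    # Toy stand-in for tree inference: is C still A's nearest neighbour?
    if hamming(rep["A"], rep["C"]) < hamming(rep["A"], rep["B"]):
        support += 1
```

support / replicates is the bootstrap proportion for the A + C grouping; with a signal this clear it comes out near 100%, whereas matrices with conflicting signal yield much lower values, as in our example.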

When we bootstrap the Felsenstein Zone matrix that gives us the "wrong" (paraphyletic) AB clade, no matter which tree-inference method we use, we can see why this standard approach undervalues the potential of bootstrapping to explore the signal in our matrices.

Consensus networks, each based on 10,000 BS pseudoreplicates; only splits found in at least 15% of the replicate trees are shown (trivial splits collapsed). Reddish – false splits (paraphyletic clades); green – true splits (monophyletic clades).

While parsimony and NJ bootstrap pseudoreplicates either fall prey to LBA or don't provide any viable alternative (the bootstrap replicate matrix lacks critical characters), in our example a substantial proportion of ML pseudoreplicates did escape the A/B long-branch attraction.

Uncorrected, the correct splits A + C vs. rest and B + D vs. rest can be found in 19% of the 10,000 computed pseudoreplicate trees. When correcting for ascertainment bias, their number increases to 41%, while the support for the wrong A + B "clade" collapses to BSML = 49. Our BS supports are quite close to what Felsenstein writes in his 2004 book: for the four-taxon case, ML has a 50:50 chance of escaping LBA (BSML = true: 41 vs. false: 49), while MP and distance methods will always get it wrong (BSMP = 88, BSNJ = 86).

The inferred tree may get it wrong, but the (ML) bootstrap samples tell us the matrix's signal is far from clear.

Side-note: Bayesian Inference cannot escape such signal-inherent artifacts, because its purpose is to find the tree that best matches all signals in the data — which, in our case, is the wrong alternative with the AB clade: supported by five characters, rejected by four, three of which are largely incompatible with the true tree. Posterior Probabilities will quickly converge to 1.0 for all branches, good and bad ones (see also Zander 2004); unless there is very little discriminating signal in the matrix — a CD clade, C as sister to ABD, or D as sister to ABC, ie. the topological alternatives not conflicting with the wrong AB clade, will have PP << 1.0 because these alternatives give similar likelihoods.


Long-edge attraction

When it comes to LBA artifacts and the Felsenstein Zone, our preferred basic network-inference method, the Neighbor-Net (NNet), has its limitations, too. The NNet algorithm is in principle a 2-dimensional extension of the NJ algorithm. The latter is prone to LBA; hence, the NNet can be affected by LEA: long-edge attraction. The relative dissimilarity of A and B to C and D, and the (relative) similarity within the pairs A/B and C/D, respectively, will be expressed in the form of a network neighborhood.


Note, however, the absence of a C/D neighbourhood. If C is a sister of D (as seen in the NJ tree), then there should be at least a small neighbourhood. It's missing because C has a different second-best neighbour than B within the A–D neighbourhood. While the tree forces us into a sequence of dichotomies, the network visualizes the two competing differentiation patterns: general advancement on the one hand (ABO | CDZ split), and on the other potential LBA/LEA, vs. similarity due to a shared common ancestry (ABCD | OZ split; BD neighborhood).

Just from the network, we would conclude that C and D are primitive relatives of A and B, potentially precursors. The same could be inferred from the trees; but if we map the character changes onto the net (Why we may want to map trait evolution on networks), we can notice there may be more to it.

Character splits ('cliques') mapped on the NNet. Green, derived traits shared by all descendants of a common ancestor ('synapomorphies'); blue, lineage-restricted derived traits, ie. shared by some but not all descendants (homoiologies); pink, convergences between in- and outgroup; black, unique traits ('autapomorphies')

Future posts

In each of the upcoming posts in this (irregular) series, we will look at a specific problem with non-molecular data, and test to what extent exploratory data analysis can save us from misleading clades — eg. clades in morphology-informed trees (or the morphology-informed parts of total-evidence trees) that are not monophyletic.



* The uncorrected ML tree shows branch lengths that are unrealistic (note the scale) and highly distorted. The reason is that the taxon set includes (very) primitive fossils and (highly) derived modern-day genera, but the matrix has no invariable sites telling the ML optimization that changing from 0 ↔ 1 is not that big of a deal. This is where the ascertainment bias correction(s) step in (RAxML-NG, the replacement for the classic RAxML 8 used here, has more than one implementation to correct for ascertainment bias. A tip for programmers and coders: the effects of these corrections have so far not been evaluated for non-molecular data).



Cited literature
Cavalli-Sforza LL, Edwards AWF. 1965. Analysis of human evolution. In: Geerts SJ, ed. Proceedings of the XI International Congress of Genetics, The Hague, The Netherlands, 1963. Genetics Today, vol. 3. Oxford: Pergamon Press, p. 923–933.
Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
Felsenstein J. 2001. The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.
Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc., 664 pp. (in chapter 10, Felsenstein provides an entertaining "digression on history and philosophy" of phylogenetics and systematics).
Hennig W. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag, 370 pp.
Hennig W. 1965. Phylogenetic systematics. Annual Review of Entomology 10:97–116.
Michener CD, Sokal RR. 1957. A quantitative approach to a problem in classification. Evolution 11:130–162.
Mullis KB, Faloona F. 1987. Specific synthesis of DNA in vitro via a polymerase catalyzed chain reaction. Methods in Enzymology 155:335–350.




Topological restrictions: some comments


In a recent post (Different topological restrictions of rooted phylogenetic networks. Which make biological sense?), Leo and Steven synthesized a lot of the issues that have been raised in recent years, both in this blog and in the literature, about the underlying models for rooted phylogenetic networks. This is an excellent summary (and explanation) of the current situation; please read it if you have not already done so.

The key issue for biologists is: what are we trying to model biologically? In one sense the answer is trivial: "evolutionary history". However, this answer is uninformative, in the sense that it is too broad to be practical. We do not know enough about evolutionary history for it to be obvious which model(s) we should use, and we presumably never will know (we were not there to study it, and it happens too slowly to study it now).

Nevertheless, we need to make decisions about models, and in many cases quite detailed decisions. Therefore, we need public discussion about the various possible models that have been suggested, as well as suggestions for new models. Unfortunately, the impetus for this has almost always come from the computational side rather than the biological side, if only from practical necessity — a mathematician cannot proceed without a model. (A biologist can't either, but they seem to be much more comfortable with vague word models rather than quantitative mathematical ones.)

So, I will provide an initial response to Leo and Steven's points, as a starting point for discussion, in the order in which they raised them. Hopefully, someone else will have something to say, as well.

Rooted

If there is assumed to be a single origin of life, and evolution is assumed to be acyclic, then there will always be at least one root in every evolutionary network. Indeed, if we are studying a monophyletic group of organisms then there will always be precisely one root. But if we are studying a polyphyletic group then technically there are multiple roots, in the sense that we cannot reconstruct the most recent common ancestor of the group. In practice, we have ignored this issue, and simply reconstructed the ancestor anyway, as best we can, since that is what the current tree-building algorithms do. In this sense, the only difference between an evolutionary tree and an evolutionary network is the complication caused by horizontal gene transfer, in which we might consider a species' genome to be polyphyletic, as well.

So, a single artificial root will probably suffice — see Steven's earlier comment: "the multiple-root situation can probably be reduced to the single-root situation by having some kind of high-degree (i.e. unrefined) artificial root which is the parent of the real roots."

Acyclic

Acyclicity is an obvious criterion for evolution, because a descendant cannot be its own ancestor. So, in a network we should treat this as "sacred" — an evolutionary network in its final form cannot show any directed cycles.

This does not, I guess, preclude algorithms that do not themselves require acyclicity, as biology always violates mathematical assumptions (eg. normality, homoscedasticity, etc). However, I think that we should require the output to be acyclic.

This is not necessarily an issue for reconciliation between gene trees and species trees, as raised in the original blog post. In this scenario the resulting phylogeny is a tree, and therefore cannot show cycles, by definition.

Time-consistent

Time consistency is a stronger form of acyclicity, in the sense that the acyclic parts of a network are also restricted by time's arrow. It is another obvious criterion for evolution, because a descendant cannot reticulate with its predecessors.

However, this requirement is confounded by incomplete taxon sampling, as noted by Leo and Steven. To deal with this, we can allow "ghost" lineages representing hypothesized missing taxa. This does, nevertheless, raise the issue of where we draw the line. Any two trees can be reconciled by allowing enough ghost lineages, just as is done by algorithms for the gene duplication-loss problem, where minimizing the numbers of unobserved duplications and losses is the optimality criterion. Similarly, how many ghost lineages should we allow in a reticulating network? Perhaps we should be minimizing them, too.

There is also the more fundamental issue of whether time-consistency should actually be a requirement at all. It has been pointed out a number of times that this is not a requirement for anthropology, for example, where transfer of non-genetic information through time is not only possible but quite common. This has been discussed in earlier blog posts.

So, in a general sense we cannot make time consistency a mandatory requirement of evolutionary networks. However, we could do so for strictly biological networks, where information transfer occurs with gene flow, which must follow time's arrow.

Degree-restricted

Nodes with indegree 1 and outdegree 1 are usually considered to be unnecessary, because they do not represent anything more than would an edge (or arc). The only times they have been seriously suggested are when placing known fossils onto a tree, and the fossil is claimed to represent an ancestor of one of the nodes. However, this is considered to be inappropriate, because we can never be sure that any specified fossil is actually an ancestor, as opposed to being a sister to one of the ancestors (see the blog post at Evolving Thoughts).

Indegree >2 has also been considered to be problematic in phylogenetics. If a sexually reproducing organism has two parents, then theoretically we cannot observe indegree>2. In practice, however, evolutionary events may occur over a short enough time-scale that we cannot distinguish a series of individual indegree-2 events, so that for practical purposes a network might end up showing indegree >2. This is analogous to "soft polytomies" in trees, where outdegree >2 represents uncertainty.

Tree-child, Tree-sibling, Reticulation-visible

These network restrictions all refer to so-called "invisible" nodes (nodes from which all paths end up in reticulations). All nodes in a tree are visible, and so there is no analogous situation in tree-building from which we might derive a response (as I have done several times above).

The basic issue with invisible nodes is restricting their occurrence. Theoretically, we could add an infinite number of invisible nodes to a network, but few, if any, of them would be based on any inference from the data. So, tree-child networks ban them entirely, whereas tree-sibling and reticulation-visible networks allow them under particular circumstances.

It is difficult for me to see exactly how I should interpret invisible nodes biologically. How can I reconstruct their features, for example? Invisible nodes tend to appear when combining multiple trees into a network, and so they are not usually constructed directly from character data. Biologically, they might exist, but there seems to be little reason to consider them as well-supported inferences, as we do for visible nodes.

In this sense, they have much more of the feel of mathematical artifacts than do visible nodes.

Galled trees, Galled networks, Level-k

These are all artificial restrictions on networks that do nothing more than limit the complexity (and thus make the algorithms more tractable). There seems to be no good biological basis for using any of these criteria as restrictions (as opposed to using them as network descriptors).

There is, however, a good logical basis for their use — simpler networks are to be preferred over more complex ones, at least as working hypotheses for evolutionary history. The problem with this attitude seems to be that sometimes simpler networks look less biologically realistic than do more complex ones (eg. level-1 networks might spread the reticulations across the network whereas level-3 networks can concentrate them in one place).

Regular and normal

These always seem to me to be of mathematical interest for characterizing networks, rather than being of any biological interest. But maybe that is just my ignorance.

Directed Acyclic Graph or tree-with-edges-added?

To me, this is one of the BIG questions. Until biologists come to grips with this, I do not see how the computational people are going to make the helpful contributions that they obviously desire (or, at least, the ones I know). I have thought (and read) about this a lot, and I have come to the (perhaps unfortunate) conclusion that the answer is: "both".

From the mathematical point of view, the problem is simply that certain network topologies will not be considered when we add reticulation nodes to a tree. So, we automatically exclude those topologies if our algorithm starts with a tree. From the biological point of view, the issue is whether we consider evolutionary history to be approximately tree-like or whether it is fundamentally reticulated.

My reading of the literature is that those people dealing with hybridization in eukaryotes are leaning towards the "tree obscured by vines" model, whereas those dealing with horizontal gene transfer in prokaryotes prefer the "anastomosing plexus" model. Perhaps it is too soon to tell whether this dichotomy will continue; but, certainly, discussion about the hybridization issue goes back to the early 1980s, and since then the opinion about the fundamentally tree-like nature of history has been repeatedly expressed. Moreover, those people using reduced median networks and median joining networks for mitochondrial data seem to be interested in simplifying their initially complex network as much as possible, and then interpreting the resulting network in terms of a few conflicting parsimony trees; I interpret this as a preference for trees over networks.

Of course, preferring a reticulated tree model may just be historical inertia, whose importance in the daily lives of scientists should not be underestimated.

Different topological restrictions of rooted phylogenetic networks. Which make biological sense?


Those readers active in the field of evolutionary phylogenetic networks will know that there are many different definitions of "rooted phylogenetic network" in circulation. While certain features are almost universal (e.g. rooted, acyclicity), many are not. Why do these differences arise? There are multiple answers to this. Some arise because of differing opinions on what is biologically realistic, and (relatedly) what the correct balance is between biological detail and mathematical abstraction. Others arise because they make optimization problems on networks easier to solve. This should not automatically be viewed as mathematics prescribing reality to biology, but rather as the observation that if evolution looks like this, then certain optimization problems can be solved well. Finally, some differences are superficial; they do not lead to any intrinsic differences in the model or its underlying mathematical structure. Of course, "superficial" is highly context dependent, as some of the examples below will show.

Here we list some well-known and less-well known properties that have surfaced in definitions of rooted phylogenetic network. We will take as given that evolution is directed i.e. that the arcs in the network have an explicit orientation (representing the flow of time). Below most properties we show a figure of a network violating it.

The question to our readers, particularly those from the biological side of the spectrum: which of the following properties make sense? And which other restrictions would make more sense biologically?

Rooted
A root is a node with indegree 0, meaning that it does not have any ancestors. If evolution is assumed to be acyclic (see below) then there will always be at least one root. Most articles writing about rooted phylogenetic networks assume a single root (which is necessarily an ancestor of every node in the network). Some time ago on this blog David raised the question of whether it would not sometimes be better to allow multiple roots. This is an interesting point both from the perspective of interpretation (what does it mean?) and its impact on algorithmic efficiency.

Acyclic
Most models assume acyclicity: it is not possible to walk along the arcs of the network (respecting their orientation) such that you end up back where you started. The most intuitive argument for this is the passing of time: if arcs represent forward motion through evolutionary time, how can you end up back at an earlier point in time? Recently someone pointed out to me that in the reconciliation literature (where one shows how to reconcile a given gene tree with a given species tree), acyclicity is actually not sacred at all. The reason for this is that, without the acyclicity constraint, the problem becomes computationally tractable. See this recent RECOMB 2013 article [3] for an example of this.

Time-consistent
This is an interesting property. It was introduced to prevent reticulation events between non-contemporaneous taxa. That is, to avoid absurd situations such as an organism hybridizing with its ancestor. The most common mathematical articulation of time-consistency is this: it should be possible to put a time-stamp on each node of the network such that (a) time always moves strictly forward along tree-edges, and (b) all the nodes involved in a reticulation event have the same time-stamp. The figure here shows a network that is not time-consistent. For many contextualisations of "reticulation event", time-consistency seems to make a lot of sense. But there is a catch. A network might fail to be time-consistent only for the rather artificial reason that we failed to sample a taxon that was, in fact, part of the network (incomplete taxon sampling). Given the reality of incomplete data, demanding time-consistency might be too prohibitive. However, as with all restrictions in this blog, it is perhaps useful as a selection criterion for preferring one network over another.

Figure 1: A network that is not time-consistent. The red and blue node are the two parents of a reticulation node but cannot have coexisted in time.
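The definition translates directly into a check: treat every arc entering a reticulation node as a "same time-stamp" constraint, merge those nodes into classes, and then require the strictly forward arcs between the classes to be acyclic. A Python sketch (toy networks invented; the failing one mirrors Figure 1, with one parent of the reticulation an ancestor of the other):

```python
from collections import defaultdict

def is_time_consistent(arcs):
    """Time-consistency check: arcs into a reticulation (indegree >= 2)
    force equal time-stamps; all other arcs must move strictly forward.
    Merge the equal-time classes, then look for a cycle among strict arcs."""
    indeg = defaultdict(int)
    for _, v in arcs:
        indeg[v] += 1
    parent = {}                       # union-find over nodes
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(x, y):
        parent[find(x)] = find(y)
    for u, v in arcs:
        if indeg[v] >= 2:             # reticulation arc: u and v coexist
            union(u, v)
    strict = set()
    for u, v in arcs:
        if indeg[v] == 1:             # tree arc: strictly forward in time
            ru, rv = find(u), find(v)
            if ru == rv:              # strict arc inside one time-stamp class
                return False
            strict.add((ru, rv))
    # the strict arcs between merged classes must form a DAG (Kahn's sort)
    children, cindeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in strict:
        children[u].append(v)
        cindeg[v] += 1
        nodes.update((u, v))
    queue = [n for n in nodes if cindeg[n] == 0]
    removed = 0
    while queue:
        u = queue.pop()
        removed += 1
        for v in children[u]:
            cindeg[v] -= 1
            if cindeg[v] == 0:
                queue.append(v)
    return removed == len(nodes)

# Both parents of r branch off the root independently: time-consistent.
consistent = [("root", "a"), ("root", "b"), ("a", "r"), ("b", "r"), ("r", "leaf")]
# As in Figure 1: parent a is also an ancestor of parent b, so the two
# parents of r cannot have coexisted in time.
inconsistent = [("root", "a"), ("a", "b"), ("a", "r"), ("b", "r"),
                ("r", "leaf"), ("b", "x")]
```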


Degree-restricted
The indegree of a node is the number of parents of it, and the outdegree of a node is the number of children of it. In a bifurcating, rooted phylogenetic tree all nodes have indegree-1 (except the root: indegree-0) and outdegree-2. Polytomies have outdegree-3 or higher.  In a rooted phylogenetic network we also have reticulation nodes i.e. nodes with indegree-2 or higher. Articles differ in the degree-restrictions they place on nodes; there is an entire zoo of different permutations possible. Many articles agree that nodes with indegree-1 and outdegree-1 should not be allowed, because they are phylogenetically uninformative (and indeed such nodes are also rarely encountered in the phylogenetic tree literature). But what about polytomies? And what about reticulation nodes: should they be permitted to have indegree-3 or higher, and if so how should such “reticulate polytomies” be interpreted? From a parsimony perspective it is usual to argue that reticulate polytomies should be more “expensive” than indegree-2 reticulations, because a single reticulation polytomy can explain far more discordance than a single indegree-2 reticulation. Interestingly, some optimization problems are not really affected at all by degree-restrictions on phylogenetic networks (e.g. the hybridization number problem) while others are highly sensitive to degree-restrictions (e.g. the small parsimony problem).

Tree-child
Every node has at least one non-reticulate child. This restriction makes sure that there are no "invisible nodes", i.e. nodes from which all paths end up in reticulations. As a result, this restriction makes many computational problems more tractable and mathematical reasoning easier. For example, consider the basic Tree Containment problem, i.e. given a phylogenetic network and a phylogenetic tree, decide if the network displays the tree; see [8]. This problem was shown to be tractable for tree-child networks, while it is NP-hard for most other classes of networks (time-consistent, tree-sibling, regular). This seems important, because if it is already hard to tell whether a given tree is in a given network, then it seems a daunting task to try to build networks of that class from any kind of data.

It should be noted that there is of course no guarantee that "real" evolutionary histories are tree-child. In fact, simulation studies show that only under very low recombination rates can one expect tree-child networks [1][2].
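The tree-child condition itself is easy to check on an arc list: every node with children must have at least one child of indegree 1 (i.e. a non-reticulate child). A Python sketch (toy networks invented; the failing one mirrors Figure 2):

```python
from collections import defaultdict

def is_tree_child(arcs):
    """Tree-child check: every node with children has at least one child
    that is not a reticulation (i.e. a child of indegree 1)."""
    indeg = defaultdict(int)
    children = defaultdict(list)
    for u, v in arcs:
        indeg[v] += 1
        children[u].append(v)
    return all(
        any(indeg[c] == 1 for c in kids)
        for kids in children.values()
    )

# Both parents of the reticulation r also have tree-node children: tree-child.
tc = [("root", "a"), ("root", "b"), ("a", "r"), ("b", "r"),
      ("a", "leaf1"), ("b", "leaf2"), ("r", "leaf3")]
# As in Figure 2: x's children r1 and r2 are both reticulations, so x has
# no non-reticulate child and the network is not tree-child.
not_tc = [("root", "x"), ("root", "y"), ("x", "r1"), ("x", "r2"),
          ("y", "r1"), ("y", "r2"), ("r1", "leaf1"), ("r2", "leaf2")]
```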

Figure 2: A network that is not tree-child. The red node does not have a non-reticulate child.
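For small examples, the tree-child condition is easy to test directly. A minimal sketch, again assuming the illustrative dict-of-children encoding:

```python
def is_tree_child(net):
    """True iff every internal node has at least one child that is
    not a reticulation (i.e. has indegree < 2)."""
    indeg = {v: 0 for v in net}
    for children in net.values():
        for c in children:
            indeg[c] += 1
    for children in net.values():
        if children and all(indeg[c] >= 2 for c in children):
            return False
    return True

# Tree-child: every internal node has a tree-node or leaf child.
good = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"],
        "h": ["z"], "x": [], "y": [], "z": []}

# Not tree-child: both children of "v" are reticulations,
# like the red node in Figure 2.
bad = {"r": ["v", "a", "b"], "v": ["p", "q"],
       "a": ["p", "x"], "b": ["q", "y"],
       "p": ["s"], "q": ["t"],
       "x": [], "y": [], "s": [], "t": []}

print(is_tree_child(good), is_tree_child(bad))  # True False
```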

Tree-sibling
Every reticulate node has at least one non-reticulate sibling. This class was originally introduced by Cardona, Llabrés, Rosselló and Valiente, who write: "Biologically, this condition means that for each of the reticulation events, at least one of the species involved in it has also some descendant through mutation". They also showed that networks can be compared efficiently (with a polynomial-time computable metric) if they are both tree-sibling and time-consistent.

An advantage of the class of tree-sibling networks is that it is much larger than the class of tree-child networks. If a reticulation has no non-reticulate sibling, then its parents have no non-reticulate children; hence, every tree-child network is tree-sibling. However, it is easy to see that there are many tree-sibling networks that are not tree-child. Computationally, the tree-sibling restriction does not seem to help as much as the tree-child restriction. Again, there is no guarantee that the "real" network is tree-sibling, but it might be more likely, see [1][2].

Figure 3: A network that is not tree-sibling (the red node does not have a non-reticulate sibling).
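The tree-sibling condition can be checked in the same spirit. A minimal sketch (illustrative encoding):

```python
def is_tree_sibling(net):
    """True iff every reticulation node (indegree >= 2) has at least
    one sibling that is not itself a reticulation."""
    parents = {v: [] for v in net}
    for u, children in net.items():
        for c in children:
            parents[c].append(u)
    for v in net:
        if len(parents[v]) >= 2:  # v is a reticulation
            siblings = {s for p in parents[v] for s in net[p] if s != v}
            if not any(len(parents[s]) < 2 for s in siblings):
                return False
    return True

# Tree-sibling: reticulation "h" has tree-node siblings "x" and "y".
with_sibling = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"],
                "h": ["z"], "x": [], "y": [], "z": []}

# Not tree-sibling: the only siblings of reticulations "p" and "q"
# are each other (compare Figure 3).
no_sibling = {"r": ["a", "b"], "a": ["p", "q"], "b": ["p", "q"],
              "p": ["x"], "q": ["y"], "x": [], "y": []}

print(is_tree_sibling(with_sibling), is_tree_sibling(no_sibling))  # True False
```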

Reticulation-visible
From every reticulation node there is a path to some leaf or cut-arc (a branch whose removal disconnects the network) such that this path does not pass through any reticulations. This is another attempt to weaken the tree-child restriction, thus obtaining a larger class of networks for which many computational problems are still tractable. It is incomparable with tree-sibling (a reticulation-visible network does not have to be tree-sibling, and vice versa), but it clearly contains the class of tree-child networks, and is in fact much larger. It does not forbid all invisible nodes, just invisible reticulation nodes. It has been shown in the book by Huson, Rupp and Scornavacca that the so-called Cluster Containment problem becomes tractable for reticulation-visible networks. Whether the same holds for the Tree Containment problem is still an open question.

Figure 4: A network that is not reticulation-visible. The red reticulation node is not "visible".
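This condition can also be tested directly, although the sketch below simplifies matters: it only checks for reticulation-free paths to leaves and ignores the cut-arc clause (encoding illustrative):

```python
def is_reticulation_visible(net):
    """Simplified check: from every reticulation there must be a path
    to a leaf that passes through no further reticulations
    (the cut-arc clause of the full definition is omitted here)."""
    indeg = {v: 0 for v in net}
    for children in net.values():
        for c in children:
            indeg[c] += 1

    def reaches_leaf(v):
        stack, seen = [v], {v}
        while stack:
            u = stack.pop()
            if u != v and indeg[u] >= 2:
                continue  # the path may not pass through reticulations
            if not net[u]:
                return True  # reached a leaf
            for c in net[u]:
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return False

    return all(reaches_leaf(v) for v in net if indeg[v] >= 2)

# Visible: from reticulation "h" the path h -> z ends at a leaf.
visible_net = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"],
               "h": ["z"], "x": [], "y": [], "z": []}

# Not visible: every path from reticulation "h" immediately runs
# into the second reticulation "p" (compare Figure 4).
hidden_net = {"r": ["a", "b"], "a": ["h", "x"], "b": ["h", "c"],
              "c": ["p", "y"], "h": ["p"], "p": ["z"],
              "x": [], "y": [], "z": []}

print(is_reticulation_visible(visible_net), is_reticulation_visible(hidden_net))  # True False
```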

Galled trees
All reticulation cycles are disjoint. Galled trees were introduced by Gusfield, Eddhu and Langley [6], although they had been studied before under different names. The restriction makes computational problems much easier, but seems biologically unrealistic. On the other hand, galled trees could have a future in the context of data-display networks, by using the galls to show where the reticulate activity is in the network, rather than claiming that each gall represents exactly one reticulation event.

Galled networks
Each arc leaving a reticulation node is a cut-arc. Galled networks were introduced by Huson and Klöpper [7] (who gave a different but equivalent definition) as a generalization of galled trees, together with a fast algorithm for constructing galled networks from clusters. They haven't been studied much since. (Be aware that there is also an article using "galled network" as an alternative term for galled tree.)

Figure 5: A network that is not a galled network. The red arc leaves a reticulation node but is not a cut-arc.
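Testing this property is slightly more work, because we need to decide whether an arc is a cut-arc. A minimal sketch (illustrative encoding; each cut-arc test is a brute-force connectivity check on the underlying undirected graph):

```python
def is_galled_network(net):
    """True iff every arc leaving a reticulation node is a cut-arc
    (removing it disconnects the underlying undirected graph)."""
    indeg = {v: 0 for v in net}
    adj = {v: set() for v in net}
    for u, children in net.items():
        for c in children:
            indeg[c] += 1
            adj[u].add(c)
            adj[c].add(u)

    def still_connected_without(u, v):
        # BFS on the undirected graph with the edge {u, v} removed
        start = next(iter(net))
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if {x, y} == {u, v} or y in seen:
                    continue
                seen.add(y)
                stack.append(y)
        return len(seen) == len(net)

    for u, children in net.items():
        if indeg[u] >= 2:  # u is a reticulation
            for c in children:
                if still_connected_without(u, c):
                    return False  # this arc is not a cut-arc
    return True

# Galled: the only arc leaving reticulation "h" is the cut-arc to leaf "z".
galled = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"],
          "h": ["z"], "x": [], "y": [], "z": []}

# Not galled: the arc from reticulation "h" to "d" lies on an
# undirected cycle, so it is not a cut-arc (compare Figure 5).
not_galled = {"r": ["a", "b"], "a": ["h", "x"], "b": ["h", "c"],
              "c": ["d", "y"], "h": ["d"], "d": ["z"],
              "x": [], "y": [], "z": []}

print(is_galled_network(galled), is_galled_network(not_galled))  # True False
```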

Level-k (not a restriction, but a measurement)
Also a generalization of galled trees (which are basically level-1 networks). Every network is a level-k network for some k. Hence, “level-k” should not really be viewed as a topological restriction, but rather as a measure of how “tangled” (intensely concentrated) the islands of reticulation in the network are. The higher k is, the more tangled the network; level-0 networks are simply trees. For more information, see a previous blog post. Other proposed measures of tangledness include k-nested, r-reticulation and depth-k.

Regular and normal
If you see a phylogenetic network as a representation of a set of clusters, then it makes sense to consider regular networks (introduced by Baroni, Semple and Steel). A network is regular if it is the so-called "cover digraph" (Hasse diagram) of its set of clusters. Hence, for each set of clusters, there exists a unique regular network with precisely that set of (hardwired) clusters. Normal networks have the additional requirement of being tree-child, thus forming a very restricted class of networks. For example, the network in Figure 1 is not regular and hence also not normal.
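The cluster view can be made concrete: the hardwired clusters of a network are simply the sets of leaves below its nodes. A minimal sketch, in the same illustrative dict-of-children encoding:

```python
def hardwired_clusters(net):
    """Return the set of hardwired clusters of a rooted network:
    for each node, the set of leaf labels reachable from it."""
    memo = {}
    def below(v):
        if v not in memo:
            if not net[v]:                 # a leaf yields its own label
                memo[v] = frozenset([v])
            else:
                memo[v] = frozenset().union(*(below(c) for c in net[v]))
        return memo[v]
    return {below(v) for v in net}

# One reticulation "h" above leaf "z": "z" ends up in the clusters
# of both of h's parents.
net = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"],
       "h": ["z"], "x": [], "y": [], "z": []}

for cluster in sorted(hardwired_clusters(net), key=lambda s: (len(s), sorted(s))):
    print(set(cluster))
```

Here the network has six distinct clusters, including the overlapping {x, z} and {y, z}; no tree on the same leaves could contain both, which is exactly why cluster sets call for networks.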

DAG or tree-with-edges-added?
This is a rather subtle one. Viewing a phylogenetic network as a tree with edges added leads to a different space of networks than the “a phylogenetic network is essentially a directed acyclic graph” definition encountered in other articles. The point is that if you insist that a network has to be “grown” from some tree starting point (by adding edges in a certain way), certain topologies cannot be reached that can be reached if we do not anchor the network in this way. The following figure shows an example.

Figure 6: A network that cannot be obtained by adding edges to a tree (for common edge-adding rules).

It can be shown that any tree-sibling network can be constructed by adding edges to a tree, but the network in Figure 3 shows that the converse is not true (the shown network can be constructed by adding edges to a tree, but is not tree-sibling). 

This restriction touches on the fundamental question of whether there exists something like a species tree, and whether it might be possible to reconstruct this species tree before starting a network analysis.


A related but stronger (?) restriction was recently used by Wu [9]. In his RECOMB 2013 article, he writes “(R1) For a network N, when only one of the incoming edges of each reticulation node is kept and the other is deleted, we always derive a tree T'.”

Conclusion
We see from this list that there are already quite a few different topological properties and restrictions in circulation for rooted phylogenetic networks. To biologists, these discussions might appear to be a strange side-show to keep computer scientists in work. But it runs deeper than that, because it touches on three fundamental issues. Firstly, what exactly are we trying to model? Secondly, it is important to understand which networks your favourite software for constructing rooted phylogenetic networks will never build, however biologically relevant, because they are a priori excluded from its search space. Finally, since the total number of networks is huge, it may be inevitable to focus on certain restricted classes of networks if one wants to search through network-space efficiently.

Note: There is a follow-up post Topological restrictions: some comments.

REFERENCES

[1] Miguel Arenas, Mateus Patricio, David Posada and Gabriel Valiente. Characterization of phylogenetic networks with NetTest. BMC Bioinformatics 11: 268, 2010.

[2] Miguel Arenas, Gabriel Valiente and David Posada. Characterization of reticulate networks based on the coalescent with recombination. Molecular Biology and Evolution 25(12): 2517-2520, 2008.

[3] Mukul S. Bansal, Eric J. Alm, Manolis Kellis, Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss, RECOMB 2013.

[4] Gabriel Cardona, Mercè Llabrés, Francesc Rosselló and Gabriel Valiente. The comparison of tree-sibling time consistent phylogenetic networks is graph isomorphism-complete. arXiv:0902.4640 [q-bio.PE], 2009.

[5] Gabriel Cardona, Mercè Llabrés, Francesc Rosselló and Gabriel Valiente. A distance metric for a class of tree-sibling phylogenetic networks. Bioinformatics 24(13): 1481-1488, 2008.

[6] Dan Gusfield, Satish Eddhu and Charles Langley. Efficient reconstruction of phylogenetic networks with constrained recombination. In CSB03, Pages 363-374, 2003.

[7] Daniel H. Huson and Tobias Klöpper. Beyond Galled Trees - Decomposition and Computation of Galled Networks. In RECOMB 2007, Vol. 4453:211-225 of LNCS.

[8] Leo van Iersel, Charles Semple and Mike Steel, Locating a Tree in a Phylogenetic Network, Information Processing Letters, 110 (23), pp. 1037-1043 (2010).

[9] Yufeng Wu. An Algorithm for Constructing Parsimonious Hybridization Networks with Multiple Phylogenetic Trees. RECOMB 2013.

Are mathematical constraints biologically realistic?


Mathematicians and other computational scientists have produced their own definitions of phylogenetic networks, independently of biologists. For the evolutionary type of phylogenetic network, the definition usually looks something like this:

A phylogenetic network is a rooted, directed graph (consisting of nodes, plus edges that connect each parent node to its child nodes) such that:
(1) There is exactly one node having indegree 0, the root
        - all other nodes have indegree 1 or 2
(2) All nodes with indegree 1 have outdegree 2 or 0
        - nodes with outdegree 2 are tree nodes
        - nodes with outdegree 0 are leaves, distinctly labelled
(3) The root has outdegree 2, and
(4) Nodes with indegree 2 have outdegree 1, called reticulation nodes.
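For small examples, these four conditions can be checked mechanically. A minimal sketch, assuming the network is stored as a dict mapping each node to its list of children (all names illustrative):

```python
def is_valid_network(net):
    """Check conditions (1)-(4) above for a rooted binary network."""
    indeg = {v: 0 for v in net}
    for children in net.values():
        for c in children:
            indeg[c] += 1
    roots = [v for v in net if indeg[v] == 0]
    if len(roots) != 1:
        return False                 # (1) exactly one root
    for v in net:
        i, o = indeg[v], len(net[v])
        if i == 0:
            if o != 2:
                return False         # (3) the root has outdegree 2
        elif i == 1:
            if o not in (0, 2):
                return False         # (2) tree node or labelled leaf
        elif i == 2:
            if o != 1:
                return False         # (4) reticulation node
        else:
            return False             # indegree > 2 is excluded
    return True

# Valid: one root, one reticulation ("h"), binary throughout.
ok_net = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "y"],
          "h": ["z"], "x": [], "y": [], "z": []}

# Invalid: the root has outdegree 3.
poly_net = {"r": ["x", "y", "z"], "x": [], "y": [], "z": []}

print(is_valid_network(ok_net), is_valid_network(poly_net))  # True False
```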

An obvious question of interest is how (or whether?) this definition connects to what biologists have in mind when they use the term "phylogenetic network". Clearly, this definition places considerable restrictions on the networks that will be inferred by any mathematical algorithm, which in turn affects their use as models for biological inference.

The first thing to note is that unrooted networks are excluded, because the graph is directed. Thus, many (if not most) of the phylogenetic networks that have appeared in the literature are excluded from the discussion. Furthermore, a tree is considered to have all internal nodes with indegree 1 and outdegree 2 (i.e. no reticulation nodes), and we know this to be biologically unrealistic, in general. (Otherwise, this blog would be redundant!)

Biologically, the other parts of the definition imply:
One node of indegree 0
- the network has no previous ancestry that is to be inferred
Nodes with outdegree 0 are labelled
- observed (contemporary) taxa occur only at the leaves
All nodes with indegree 2 have outdegree 1
- reticulation and speciation cannot occur simultaneously
No nodes with indegree >2
- reticulation events cannot involve input from more than 2 parents simultaneously
No nodes with outdegree >2
- speciation involves only two children at a time.

These do not appear to be onerous biological restrictions. Indeed, the first two have been standard characteristics of tree-building for several decades. The other three are also logical extensions of the restrictions that have previously been placed on trees. However, phylogenetic history is unlikely to have been as simple as implied by these features. Thus, biologists will need to keep a careful eye on whether the simplifications are affecting the networks inferred for their particular group of organisms.

Other restrictions

In addition to the restrictions created by the definition, other topological restrictions have been used to make the mathematical inference algorithms computationally tractable. Thus, only certain sub-families of possible networks are considered by most of the computer programs. These include:
  • tree-child network, tree-sibling network
  • level-k network, galled tree
  • binary input trees for hybridization and HGT networks
  • binary characters for recombination networks.
These restrictions are largely unrelated to each other, so we can consider them separately.

Tree-child, Tree-sibling

In a tree-child network, every internal node has at least one child node that is a tree node
- ie. a reticulation event cannot be followed immediately by another reticulation event
In a tree-sibling network, every reticulation node has at least one sibling node that is a tree node
- ie. a parent cannot be directly involved in two separate reticulation events
Note that every tree-child network will also be a tree-sibling network, but not vice versa.

Algorithmically, these two restrictions may involve the addition of extra tree nodes to an inferred network, in order to satisfy the restrictions. Biologically, the question is whether real networks are this simple. Arenas et al. (2008) simulated data under the coalescent with recombination, and found that even at small recombination rates most of the networks produced were already more complex than a tree-sibling network. On the other hand, Arenas et al. (2010) analyzed real population-level data from the PopSet and Polymorphix databases using the TCS program, and found that >98% of the resulting networks could be characterized as tree-sibling. So, there is cause for optimism, in the sense that the "optimum" networks algorithmically are not necessarily complex, at least for closely related organisms (ie. within species).

Level-k network, Galled tree

A network has level k if each tangled part of the network (ie. each biconnected component) contains at most k reticulation nodes (see this previous post). This is a generalization of the older notion of galled trees, in which reticulation cycles do not overlap (ie. do not share edges or nodes), as galled trees are level-1 networks. Level-k networks can also be seen as a generalization of networks with k reticulation nodes, although there may be a difference between a network with minimum level and one with a minimum number of reticulations.

Algorithmically, these restrictions have been used to guide the search for (or choice of) the "optimal" inferred network. Biologically, these notions do not seem to have been investigated, but basically they restrict how complex inferred reticulation histories can be. In particular, they restrict the complexity of any given subset of each network. It has been noted that optimizing k can easily lead to networks that look biologically unrealistic (Huson et al. 2011).

Binary input

The requirements for binary input trees and binary characters are restrictions that have been applied in the past, because they greatly reduce the complexity of the input to the network algorithms, but they are now being relaxed. Effectively, the restrictions are to fully dichotomous trees and SNP characters. These are not unusual restrictions in evolutionary analysis, but they are obviously unrealistic.

As I noted in an earlier blog post, non-binary data often reflect uncertainty in the input, rather than a strictly bifurcating history, and this is not taken into account in the network inference if the input is restricted to a binary state. In particular, it may be unnecessarily hard to construct a network (because not all of the data signals relate to reticulation), and the resulting networks may have far too many reticulation nodes.

Conclusion

It is still an open question to what extent we can use these topologically restricted families of mathematical networks as a basis for reconstructing biological histories. Clearly, much more work is needed to understand the connections between the mathematical restrictions and the requirements of biological modelling.

References

M. Arenas, M. Patricio, D. Posada, G. Valiente (2010) Characterization of phylogenetic networks with NetTest. BMC Bioinformatics 11: 268.

M. Arenas, G. Valiente, D. Posada (2008) Characterization of reticulate networks based on the coalescent with recombination. Molecular Biology and Evolution 25: 2517-2520.

D. H. Huson, R. Rupp, C. Scornavacca (2011) Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press.