Hack and fish … for recombination in the SARS group


Following the current flow, we have had a few recent coronavirus posts here on the Genealogical World of Phylogenetic Networks. In this post, I'll show the results of a little experiment coming back to David's original post on the topic. Can we use trees to "fish" for evidence of recombination?

As David pointed out, even when we use a phylogenetic-tree inference method to analyze virus genomes, we don't really end up with a phylogenetic tree. Instead, we have a tree reflecting genetic similarity, which will reflect the phylogeny to some unknown extent. The main problem with virus genomes, however, is that they easily recombine — and thus different parts of a virus genome may have different evolutionary histories. A single tree cannot reflect this.

This does not mean that trees cannot tell is something about virus evolution. However, these trees become part of a fishing exercise, looking for different possible historical pathways, which may reflect recombination events.

The tree

Our SARS harvest matrix includes about a dozen sequence groups, which we have labeled Type 1 (the original SARS-CoV) to 9b. Type 7 is the new SARS-CoV-2. For my experiment here, I picked one place-holder sequence per main type (to speed up calculation time). I added two more types: the newly found direct sister of SARS-CoV-2; and some "unclassified" SARS-like viruses from pangolins, which earlier were proposed as sisters, as shown in this tree from the GISAID web page.

The phylogenetic neighborhood of SARS-CoV-2 (GISAID, screenshot captured 3/6/2020). Note the flatness of the CoV(-1; yellow) and CoV-2 (red) subtrees.

GISAID doesn't give the GenBank accession numbers, so we cannot easily say whether our sample matches theirs. However, the tree we can infer from the complete genomes (high-divergent, non-alignable regions excluded) looks very similar, as shown next, and some of the labels match up.

Fig. 1 Maximum likelihood (ML) tree inferred for our sample using (old, v.8.0.20) RAxML. Roman numbers refer to earlier defined Types 1–9 (Tree and viruses – the SARS group), Arabic numbers give nonparametric bootstrap (BS) support based on 100 BS pseudoreplicates (number of neccessary BS replicates determined by the extended majority rule criterion). Branches without Arabic number are unambiguous (BS = 100).

Most importantly, all but three branches have unambiguous support: the phylogeny of this sample is resolved. Unfortunately, as our recurring readers already know, this nearly resolved tree simplifies a much more complex situation.

The Neighbor-net with recombinations and mutational trends (arrows, connectives; cf. Tree and viruses – the SARS group).

Hack and slash

A simple method to fish for different evolutionary histories in a genome is to cut the virus genomes into sub-sequences, infer a tree for each sub-sequence, and then compare the trees. Most researchers compare trees by showing them and discussing which one makes most sense. Here is an example from Corman et al. (2014), who searched for the root of MERS (Middle East Respiratory Syndrome) virus, an illness closely related to SARS.

Reprint of Corman et al. 2014, fig. 3 with colors added to EriCoV (green) and HKU/BtCoV (olive) groups

Each tree in their Fig. 4A and B (Bayesian majority rule consensus trees) was inferred from a different part of the genome. Corman et al.'s focus was to root the MERS viruses by identifying a better outgroup. However, note that the new sister-group (red, green stars – sister to MERS; orange stars – sister to someone else) moves, and so does the green EriCoV clade and the olive HKU/BtCoV group (clade in some trees and grade in others). Do some of these trees get it wrong? Or is, eg. NeoCoV the product of reticulate evolution (here: ancient recombination)? Some parts of its genome might be derived from a common ancestor with MERS (blues), and others from a common ancestor with KW2E (black) and EriCoV (green).

Our complete matrix has 27,333 characters, providing nearly 6,000 distinct alignment patterns (abbreviated DAP, below), which is a lot — the GISAID link above also provides a graphical representation of site divergence. However, probabilistic tree inference methods (ML, Bayes) can handle moderate to high levels of divergence in the data. On the other hand, they also need a certain amount of data to perform well (see also: Inferring a tree with 12000 [or more] virus genomes). So, for my experiment, I hacked the matrix into nine bits of equal size, ie. each submatrix has a bit more than 3,000 nucleotides, providing between 615 (bit #5) and 1029 (bit #1) DAPs.

Fig. 2 Nine ML trees with BS support annotated along branches, each based on a ~3000 nucleotide long bit of the genomes (ordered left-right, top-bottom). Purple highlights branches conflicting with the complete genome tree.

Our nine trees (shown above) are not badly resolved, as most branches get substantial support. But they are not congruent. If we are dealing with recombination, then we might assume that all of these trees do show an actual aspect of the evolutionary history of the genomes. That is, they are all right and wrong, at the same time.

Moreover, we have high supported clades conflicting with the complete genome tree's (Fig. 1) topology. The signal issues, due to recombination (see Trees and viruses...), did not decrease branch support. That is, 6,000+ DAP is a lot, and recombination only affects a part of the complete genome, possibly quite a small part.

Non-trivial evolution needs more than trivial graphs

To depict the reticulate phylogeny of the virus sample, we need to consider the differences seen in the hacked-and-slashed matrix trees. This can easily be illustrated using a network, instead of a set of trees, as shown here.

Fig. 3 A (strict) consensus network of all nine trees, in which the edge lengths give the sum of the branch lenghts in the tree sample. The gray brackets give the topology of the near-fully resolved complete genome tree.

The graph above is a phylogenetic network: the competing edge bundles represent the different inferred histories of bits of the genomes. The SARS-CoV-2 lineage seems to be the product of (ancient) recombination, and recombination also played a role in forming the members of the original SARS-CoV group.

Fig. 4 Pruned consensus network showing only the CoV(-1) lineage exhibiting various levels of recombination within and between clades as defined by the complete genome tree (tree sample sames as in Fig. 3).

Consensus networks can also be used to summarize the support for alternative splits, as shown next.

Fig. 5 Sum-support consensus network based on the bit-wise BS analyses (111/112 pseudoreplicates generated per bit). Only splits are shown occurring in at least 20% of all BS replicates, i.e. splits supported by at least two bits, trivial splits are collapsed. Colored splits represent according groups/clades in the full-genome tree (Fig. 1). Inlet: 'splits rose' showing competing splits patterns within Types II and III (cf. according subtrees/-trunks in Fig. 2 and Fig. 4).

In contrast to the networks before (Figs. 3, 4), generated using the same algorithm*, the BS consensus network in Fig. 5 is not a phylogenetic network. The boxes don't reflect disparate histories of parts of the genomes but the varying support for competing topological alternatives. By summing up the bit-wise BS analyses instead of bootstrapping the entire data (the BS consensus network for the full data, Fig. 1, shows only two boxes), we get a better idea which aspects of the all-genome tree find robust support across the genome.**

Conclusion

Sub-dividing an alignment is a really quick way to fish for evidence of recombination, especially when one then uses a consensus network to summarise the resulting partial trees.

For interpretation, a tree is a very simple, trivial, and hence appealing graph: A is sister to B and so on. Even a child can interpret a tree. Networks are already visually more challenging, but whenever an organism's evolution doesn't follow a tree (as for viruses), we shouldn't use a tree to depict its phylogeny (or reconstruct its evolution).



Data availability

The dataset used for our experiment is a taxon subset of the original data set, available via figshare (with a permanent, hence, citable DOI):
Grimm GW, Morrison D. 2020. Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare Dataset. https://doi.org/10.6084/m9.figshare.12046581 

References

Corman VM, Ithete NL, Richards LR, Schoeman MC, Preiser W, Drosten C, Drexlera JF (2014) Rooting the phylogenetic tree of Middle East respiratory syndrome coronavirus by characterization of a conspecific virus from an African Bat. Journal of Virology 88: 11297–11303.



* SplitsTree includes five options to determine "edge weights" (= edge-lengths) in case of Consensus networks: "median" and "mean" average the branch-lengths in the tree sample; "count", the setting used to generate Support consensus networks, counts how often a certain taxon bipartition (split) is found in the tree sample – an edge length is proportional to the frequency of a split; "sum", used here to generate the first network, summarizes the branch-lengths; and "none" discards both branch-lengths and split frequency.

** A split supported only by one of the nine bits, even if unambiguous, ie. present in all 111 (112) per bit BS replicates, will not be represented in the sum-Support consenus network using a cut-off of 20%.

† The complete set of ML analyses took 20 min on a stand-alone computer; consensus networks are generated in a blink, and take hardly a minute even when using trees with many leaves.

Fossils and networks 1 – Farris and Felsenstein


Over 60 years ago, Robert Sokal and Peter Sneath changed the way we quantitatively study evolution, by providing the first numerical approach to infer a phylogenetic tree. About the same time, but in German, Willi Hennig established the importance of distinguishing primitive and advanced character states, rather than treating all states as equal. This established a distinction between phenetics and phylogenetics; and the latter is the basis of all modern studies, whether it is explicitly acknowledged or not.

More than two decades later, Steve Farris and the Willi Hennig Society (WHS) established parsimony as the standard approach for evaluating character-state changes for tree inference. In this approach, morphological traits are scored and arranged data matrices, and then the most parsimonious solution is found to explain the data. This tree, usually a collection of most-parsimonious trees (MPT), was considered to be the best approximation of the true tree. Clades in the trees were synonymized with monophyly sensu Hennig (1950, short English version published 1965), and grades with paraphyly: Cladistics was born (see also: Let's distinguish between Hennig and Cladistics).

Why parsimony? Joe Felsenstein, who was not a member of the WHS but brought us, among many other things, the nonparametic bootstrap (Felsenstein 1985), put it like this (Felsenstein 2001):
History: William of Ockham told Popper to tell Hennig to use parsimony
Soon, parsimony and cladistics came under threat by advances in computer technology and Kary Mullis' development of the polymerase-chain-reaction (PCR; Mullis & Faloona 1987) in the early 80s (note: Mullis soon went on with more fun stuff, outside science). While the data analysis took ages (literally) in the early days, more and more speedy heuristics were invented for probabilistic inferences. PCR marked the beginning of the Molecular Revolution, and genetic data became easy to access. Soon, many researchers realized that parsimony trees perform badly for this new kind of data, a notion bitterly rejected by the parsimonists, organized mainly in the WHS: the "Phylogenetic Wars" raged.

The parsimony faction lost. Today, when we analyze (up to) terabytes of molecular data, we use probabilistic methods such as maximum likelihood (ML) and Bayesian inference (BI). However, one parsimony bastion has largely remained unfazed: palaeontology.

In a series of new posts, we will try to change that; and outline what easy-to-compute networks have to offer when analyzing non-molecular data.

It's just similarity, stupid!

One collateral damage of the Phylogenetic Wars was distance-based methods, which, still today, are sometimes classified as "phenetic" in opposite to the "phylogenetic" character-based methods (parsimony, ML, BI). The first numerical phylogenetic trees were not based on character matrices but distance matrices (eg. Michener & Sokal 1957 using a cluster algorithm; Cavalli-Sforza & Edwards 1965 using parsimony; see also Felsenstein 2004, pp.123ff).

But no matter which method, optimality criterion or data-type we use, in principal we do the analysis under the same basic assumptions:
  1. the more closely related two taxa are, then the more similar they should be.
  2. the more similar two taxa are, then the more recent is their split.


The ingroup (blue clade) and outgroup (red clade) are most distant from each other. Placing the fossils is trivial: Z is closer to O than to C, the member of the ingroup with the fewest advanced character states. Ingroup sister taxa A + C and B + D are most similar to each other. The monophyly of the ingroup and its two subclades is perfectly reflected by the inter-taxon distances.


Assuming that the character distance reflects the phylogenetic distance (ie. the distance along the branches of the true tree), any numerical phylogenetic approach will succeed in finding the true tree. The Neighbor-Joining method (using either the Least Squares or Minimum Evolution criteria) will be the quickest calculation. The signal from such matrices is trivial, we are in the so-called "Farris Zone" (defined below).

We wouldn't even have to infer a tree to get it right (ie. nested monophyly of A + C, B + D, A–D), we could just look at the heat map sorted by inter-taxon distance.


Just from the distance distributions, visualized in the form of a "heat-map", it is obvious that A–D are monophyletic, and fossil Z is part of the outgroup lineage. As expected for the same phylogenetic lineage (because changes accumulate over time), its fossils C and D are still relatively close, having few advanced character states, while the modern-day members A and B are have diverged from each other (based on derived character states). Taxon B is most similar to D, while C is most similar A. So, we can hypothesize that C is either a sister or precursor of A, and D is the same of B. If C and D are stem group taxa (ie. they are paraphyletic), then we would expect that both would show similar distances to A vs. B, and be closer to the outgroup. If representing an extinct sister lineage (ie. CD is monophyletic), they should be more similar to each other than to A or B. In both cases (CD either paraphyletic or monophyletic), A and B would be monophyletic, and so they should be relatively more similar to each other than to the fossils as well.

Having a black hole named after you

The Farris Zone is that part of the tree-space where the signals from the data are trivial, we have no branching artifacts, and any inference (tree or network), gives us the true tree.

It's opposite has been, unsurprisingly perhaps, labeled the "Felsenstein Zone". This is the part of the tree-space where branching artifacts are important — the inferred tree deviates from the true tree. Clades and grades (structural aspects of the tree) are no longer synonymous with monophyly and paraphyly (their evolutionary interpretation).

We can easily shift our example from the Farris into the Felsenstein Zone, by halving the distances between the fossils and the first (FCA) and last (LCA) common ancestors of ingroup and outgroup and adding some (random = convergence; lineage-restricted = homoiology) homoplasy to the long branches leading to the modern-day genera.


The difference between distance, parsimony and probabilistic methods is how we evaluate alternative tree topologies when similarity patterns become ambiguous — ie. when we approach or enter the Felsenstein Zone. Have all inferred mutations the same probability; how clock-like is evolution; are their convergence/ saturation effects; how do we deal with missing data?

For our example, any tree inference method will infer a wrong AB clade, because the fossils lack enough traits shared only with their sisters but not with the other part of the ingroup. Only the roots are supported by exclusively shared (unique) derived traits (Hennigian "synapomorphies"). The long-branch attraction (LBA) between A and B is effectively caused by:
  • 'short-branch culling' between C and D: the fossils are too similar to each other; and their modern relatives too modified;
  • the character similarity between A and B underestimates the phylogenetic distance between A and B, due to derived traits that evolved in parallel (homoiologies).
While ML and NJ make the same decision, the three maximum-parsimony trees permute options for placing C and D, except for the correct options, and including an impossible one (a hard trichotomy). Standard phylogenetic trees are (by definition) dichotomous being based on the concept of cladogenesis — one lineage splits into two lineages.

MPTs inferred using PAUP*'s branch-and-bound algorithm (no heuristics, this algorithm will find the actual most parsimonious solution); NJ/LS using PAUP* BioNJ implementation and simple (mean) pairwise distances; ML using RAxML, corrected (asc.) and not (unc.*) for ascertainment bias (the character matrix has no invariable sites). All trees are not rooted using a defined outgroup but mid-point rooted. If rooted with the known outgroup O, the fossil Z would be misinterpreted as early member of the ingroup.

We have no ingroup-outgroup LBA, because the three convergent traits shared by O and A or B, respectively, compete with a total of eight lineage-unique and conserved traits (synapomorphies) — six characters are compatible with a O-A or O-B sister-relationship (clade in a rooted tree) but eight are incompatible. We correctly infer an A–D | O + Z split (ie. A–D clade when rooted with O) simply because A and B are still more similar to C and D than to Z and O; not some method- or model-inflicted magic.


The magic of non-parametric bootstrapping

When phylogeneticists perform bootstrapping, they usually do it to try to evaluate branch support values — a clade alone is hardly sufficient to infer an inclusive common origin (Hennig's monophyly), so we add branch support to quantify its quality (Some things you probably don't know about the bootstrap). In palaeontology, however, this is not a general standard (Ockhams Razor applied but not used), for one simple reason: bootstrapping values for critical branches in the trees are usually much lower than the molecular-based (generally accepted) threshold of 70+ for "good support" (All solved a decade ago).

When we bootstrap the Felsenstein Zone matrix that gives us the "wrong" (paraphyletic) AB clade, no matter which tree-inference method we use, we can see why this standard approach undervalues the potential of bootstrapping to explore the signal in our matrices.

Consensus networks based on each 10,000 BS pseudoreplicates, only splits are shown that are at least found in 15% of the replicate trees (trivial splits collapsed). Reddish – false splits (paraphyletic clades), green – true splits (monophyletic clades).

While parsimony and NJ bootstrap pseudoreplicates either fall prey to LBA or don't provide any viable alternative (the bootstrap replicate matrix lacks critical characters), in the example a significant amount of ML pseudoreplicates did escape the A/B long-branch attraction.

Uncorrected, the correct splits A + C vs. rest and B + D vs. rest can be found in 19% of the 10,000 computed pseudoreplicate trees. When correcting for ascertainment bias, their number increases to 41%, while the support for the wrong A + B "clade" collapses to BSML = 49. Our BS supports are quite close to what Felsenstein writes in his 2004 book: For the four taxon case, ML has a 50:50 chance to escape LBA (BSML = true: 41 vs. false: 49), while MP and distance-methods will get it always wrong (BSMP = 88, BSNJ = 86).

The inferred tree may get it wrong but the (ML) bootstrap samples tell us the matrix' signal is far from clear.

Side-note: Bayesian Inference cannot escape such signal-inherent artifacts because its purpose is to find the tree that best matches all signals in the data, which, in our case, is the wrong alternative with the AB clade — supported by five characters, rejected by four each including three that are largely incompatible with the true tree. Posterior Probabilities will quickly converge to 1.0 for all branches, good and bad ones (see also Zander 2004); unless there is very little discriminating signal in the matrix — a CD clade, C sister to ABD, D sister to ABC, ie. the topological alternatives not conflicting with the wrong AB clade, will have PP << 1.0 because these topological alternatives give similar likelihoods.


Long-edge attraction

When it comes to LBA artifacts and the Felsenstein Zone, our preferred basic network-inference method, the Neighbor-Net (NNet), has its limitations, too. The NNet algorithm is in principle a 2-dimensional extension of the NJ algorithm. The latter is prone to LBA, hence, and the NNet can be affected by LEA: long edge attraction. The relative dissimilarity of A and B to C and D, and (relative) similarity of A/B and C/D, respectively, will be expressed in the form of a network neighborhood.


Note, however, the absence of a C/D neighbourhood. If C is a sister of D (as seen in the NJ tree), then there should be at least a small neighbourhood. It's missing because C has a different second-best neighbour than B within the A–D neighbourhood. While the tree forces us into a sequence of dichotomies, the network visualizes the two competing differentiation patterns: general advancement on the one hand (ABO | CDZ split), and on the other potential LBA/LEA, vs. similarity due to a shared common ancestry (ABCD | OZ split; BD neighborhood).

Just from the network, we would conclude that C and D are primitive relatives of A and B, potentially precursors. The same could be inferred from the trees; but if we map the character changes onto the net (Why we may want to map trait evolution on networks), we can notice there may be more to it.

Character splits ('cliques') mapped on the NNet. Green, derived traits shared by all descendants of a common ancestor ('synapomorphies'); blue, lineage-restricted derived traits, ie. shared by some but not all descendants (homoiologies); pink, convergences between in- and outgroup; black, unique traits ('autapomorphies')

Future posts

In each of the upcoming posts in this (irregular) series, we will look at a specific problem with non-molecular data, and test to what end exploratory data analysis can save us from misleading clades; eg. clades in morphology-informed (parts of, in case of total evidence) trees that are not monophyletic.



* The uncorrected ML tree shows branch lengths that are unrealistic (note the scale), and highly distorted. The reason for this is that the taxon set includes (very) primitive fossils and (highly) derived modern-day genera, but the matrix has no invariable sites telling the ML optimization that changing from 0 ↔ 1 is not that big of a deal. This is where the ascertainment bias correction(s) step(s) in (RAxML-NG, the replacement for classic RAxML 8 used here, has more than one implementation to correct for ascertainment bias. A tip for programmers and coders: effects of corrections have so far not been evaluated for non-molecular data).



Cited literature
Cavalli-Sforza LL, Edwards AWF. 1965. Analysis of human evolution. In: Geerts SJ, ed. Proceedings of the XI International Congress of Genetics, The Hague, The Netherlands, 1963. Genetics Today, vol. 3. Oxford: Pergamon Press, p. 923–933.
Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.
Felsenstein J. 2001. The troubled growth of statistical phylogenetics. Systematic Biology 50:465–467.
Felsenstein J. 2004. Inferring Phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc., 664 pp. (in chapter 10, Felsenstein provides an entertaining "degression on history and philosophy" of phylogenetics and systematics).
Hennig W. 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag, 370 pp.
Hennig W. 1965. Phylogenetic systematics. Annual Review of Entomology 10:97–116.
Michener CD, Sokal RR. 1957. A quantitative approach to a problem in classification. Evolution 11:130–162.
Mullis KB, Faloona F. 1987. Specific synthesis of DNA in vitro via a polymerase catalyzed chain reaction. Methods in Enzymology 155:335–350.




Why we may want to map trait evolution on networks, pt. 2 – Topological ambiguity


In last week's Part 1, I gave an introduction to the problem of categorizing the polarity of morphological traits. How can we reconstruct which characters are primitive, or plesiomorphic according to Hennig, and which are derived, or apomorphic? This is something we need to do to reconstruct evolution, because most of the past is only preserved in the form of fossils, usually lacking any DNA. In this second part of the discussion, I'm going to take apart my own tree and show why we inevitably need networks, not trees.

There may be more than one tree

Even with more and more data at hand, some molecular phylogenies refuse to be unambiguous. Even worse, different, well-sampled molecular data sets may tell different stories — ie. there is more than one molecular tree to explain the diversity patterns. The ML tree used for the ML character mapping in Part 1 was pretty well supported, but not telling the entire truth.

For a start, there is no reason to assume that oaks are not monophyletic even though the data fail to resolve them as a clade (evolving something unique like the oaks twice would be a striking trick, even for gambling Mother Nature) — molecular trees may have misleading, sometimes just wrong, branches, even when they are highly supported.

In this case, one complication is that the oligogene dataset combines plastid and nuclear gene regions that not only differ in their information content but also infer different phylogenetic scenarios (and mask a lot of intra-generic and sub-generic incongruencies). This is illustrated in the following tanglegram.

Fig. 3 – A tanglegram, on the left the ML tree inferred from only the plastid gene regions (1406 DAP, alignment 15254 bp long), and on the right the corresponding nuclear data based tree (1691 DAP per only 4983 bp).

Even though the support along the backbone of the plastid tree is lowa (to non-existent), it well reflects the general diversification patterns in Fagaceae plastomes (see also the tree in Manos et al. 2008, Madroño 55:181–190; and Yan et al. 2019, BMC Evol. Biol. 19: 202, for an oak global picture). Plastid signatures show a strong geographic sorting (eg. New World vs. Old World), while the nuclear data provides most of the lineage-differentiating signal expressed in the combined tree (Part 1, Fig. 2).

Mapping along networks

How do we decide what is a real synapomorphy, a homoiology, or a good symplesiomorphy? Mapping the traits along all possible rooted trees is one option. Another option is to just map them along a consensus network of all trees, as shown next.

Fig. 4 – Map of the seven characters on the consensus network of the nuclear and plastid trees shown in Fig. 3. Blue – genus autapomorphies, dark green – synapomorphies/terminal homoiologies, light green – symplesiomorphies, orange – deep homoiologies, red – randomly distributed trait, pink – genus-restricted reversals.

According to the mapping, the newly described South American Castanopsis rothwellii, assigned to the modern (Souteast Asian) genus Castanopsis, is a stem Castanoideae / Fagaceae, while the "extinct" North American genus Castanopsoidea (then the "earliest megafossil evidence of Fagaceae": Crepet & Nixon 1989, Am. J. Bot. 76: 842–855) could be a stem / crown member of the Castanea-Castanopsis lineage. The difference to the ML trait mapping (Fig. 3 in Part 1) on the combined tree is that we get a better picture what is a lineage-specific trait set in Castanea-Castanopsis, because the interference of the monophyletic(!) oak grade is minimized.

Another possibility is to map the characters directly along a distance-based network, and then compare the latter with the molecular-based topological alternatives. This is quite puzzling in this case, because the morphology (Fig. 1 in Part 1) matches neither the nuclear tree nor the plastid tree (Figs. 2–4) — the traits scored for the fossils cover largely morphological Play-Doh of the Fagaceae.

Fig. 5 – Neighbor-nets based on mean morphological distances. Top graph – polymorphisms treated as ambiguities (standard approach), bottom graph – polymorphism treated as additional states (experimental approach). Text coloring as in Fig. 4, light blue – potential autapomorphy of the fossil American castaneoid lineage. Edge colors: green – edge representing a molecular clade/likley monophyletic group; orange – edge representing a paraphyletic group; red – edge rejected by molecular data; blue – edges supporting a distinct fossil American castaneoid lineage.

The likely primitive characters, irrespective of the evolutionary scenario we prefer, are those also found in the Eocene fossilsb. There are no derived traits/character suites pinning the fossils to Castanopsis. The fossils are a bit derived on their own terms (note their position in Fig. 5), and hence we can deduce that the fossils are either: (a) representing a relatively primitive extinct American sister lineage or (b) surviving, somewhat evolved members of the precursors of modern-day core Fagaceae. Note that the derived oaks evolved nearly 60 myrs ago, ie. 8 myrs before the oldest (Patagonian) Castanoideae fossil was deposited. The earliest (known) Fagaceae and castaneoid pollen are from 80+ Ma old Upper Cretaceous sediments in western North America (Grímsson et al. 2016, Acta Palaeobot. 56: 247–305; open access) and Japan (Takahashi et al. 2008, Intl. J. Plant Sci. 169:899–907), giving them plenty of time to migrate into North and then South America during the Paleocene-Eocene green house episode.

Fig. 6 – Earliest fossil record of Fagaceae and Castanoideae mapped on Scotese's Paleoglobes (© Scotese 2013, GoogleEarth layover files are available from here). Note that although there was no continuous land bridge, North and South America were already connected by a chain of large and high islands, providing a corridor for intercontinental dispersal of near- and extra-tropical plant lineages. A potential  crown-group Castanopsis (C. kaulii, cupule with associated seeds and pollen) has been recently recovered from the Baltic Amber (Sadowski et al. 2018 Am. J. Bot 105: 2025–2036).

Both of the mapping procedures described above are crude, in the sense that they ignore the molecular branch lengths, and use Ockham's Razor. But it strikes me as being not a bad start. They are better than just mapping along a single preferred molecular tree (as is done in many neontological papers; see Part 1) or along a morphology-based strict consensus cladogram (as is done in far too many paleontological papers; many palaeobotanical papers do neither the one nor the other: eg. Wilf et al., 2019, Science 364: eaaw5139). It's important to realize that if one taxon or subtree of our modern taxon set is characterized solely by the lack of shared derived traits or unstable expression of derived traits (like Castanopsis here, see position in both graphs in Fig. 5), ie. represents living fossils or little-evolved lineages, any ancient and primitive fossil, stem group, sister group or precursor, will be attracted by them in a total evidence or any other tree-based approach, especially when we rely on change-probability-naive parsimony as inference criterion. As we pointed out repeatedly: forming a clade in tree is neither a necessary nor a sufficient criterion for monophyly.

All gone, what to do when we have no molecular data?

Morphology alone, like genes on their own, will inevitably get some things wrong (compare Fig. 4 with Fig. 5). Without molecular data, one may have little reason to reject the monophyly of the Castaneoideae (when using more than the seven characters scored by Wilf et al. 2019; see eg. the cladogram in Crepet & Nixon 1989, fig. 1 based on an undocumented 25-character matrix). In the process, we would misinterpret overall similarity, due to shared primitive character suites and the lack of shared derived traits as evidence for an inclusive common originc.

What can we do if we have no or very few extant taxa, when we only have one set of data prone to circular reasoning? Then using networks is inevitable as well (see Fig. 5; and some examples provided in the reading list below). We need to explore in-depth the signal in our data matrix. Only extremely biased morphological matrices provide clear tree-like signals, comprehensive ones will have internal conflict and allow for inferring many, partly very different but more or less equally optimal trees.

Exploratory data analysis will not eliminate all possible errors — based only on the graph in Fig. 5, we would get the inter-generic phylogenetic relationships in Fagaceae partly wrong. However, this may lead to an informed decision as to which of the many equally probable evolutionary scenarios make more sense than others. It will help to reduce the alternatives, without eliminating those that are equally valid (which every tree does). If the time-coverage is good, exploring morphological differentiation over time can be an asset, too (see eg. Stacking neighbor-nets – a real-world example).

Data

The matrices used, networks etc. can be accessed via figshare.

Selection of related posts on The Genealogical World of Phylogenetic Networks

Clades, Cladograms, Cladistics, and why networks are inevitableillustrates why paleontologists should also be less tree-naive (see example in footnote c).
Has homoiology be neglected in phylogenetics? — why we should try to assess the phylogenetic quality of our traits.
Let distinguish between Hennig and Cladisticsas said in the title, the post provides reasons why we should distinguish between Hennig's concepts and clades in phylogenetic trees.
Ockham's Razor applied, but not used: can we do DNA-scaffolding with seven characters? — the original post dealing with Wilf et al.'s (2019) "phylogenetic analysis", which obviously was not scrutinized during review.
Please stop use cladograms!No matter whether you think evolution is tree-like or not, cladograms should be a matter of the past.
Should we try to infer trees on tree-unlikely matrices? —  using well-known (among paleobotanists) examples, I show why networks reveal much more than any tree when we deal with fossils.
More non-treelike data forced into trees: a glimpse into the dinosaursthe same but for a thunder lizard matrix.
Trivial data, but not so trivial graphsan inference experiment using very simple artificial binary matrices.



a The main reason for the lack of branch support is that individuals of different genera growing in the same area can share plastid haplotypes, while individuals of the same genus / infra-generic lineage, even species, can be quite different. [Note that the standard 4x4 ML nucleotide model treats polymorphisms as such, not as missing data.] Plus, the different lineages show different levels of plastid diversity (highest in Quercus subgenus Cerris, but low in subgenus Quercus, the North American castanoids and Lithocarpus outside Borneo, Castanea-Castanopsis appear to be in-between the extremes), and there is a tendency to preferably mutate sequence patterns within a lineage that otherwise differentiate between lineages (for instance, inversions that distinguish two genera, can be found as intra-lineage variation in the third genus or one of the oak sections).

b The striking similarity between the newly found South American and long-known slightly older North American fossils is likely the reason for not discussing the latter in the original paper or including them in the "DNA-scaffold" analysis. As is obvious from the graphs, the slightly younger North American fossil could easily be a slightly more derived of the same lineage than the South American fossil (Planchard et al. 2016 Paleont. Electr. 19.3.51A give a revised age of ≥ 49 Ma for the plant-bearing strata), and thus would have been at odds with the narrative of the authors (see also comment by Denk et al. 2019, Science 10.1126/science.aaz2189).

c As done by Wilf et al. (see also the argumentation in Wilf et al.'s response, Science 10.1126/science.aaz2297, to Denk et al.'s 2019 comment). The combination of circular reasoning, systematic bias, and (parsimony) tree-naivity is well expressed in Wilf et al.'s own words:
Fourth, Denk et al. erroneously contend that Castanopsis rothwellii, a fossil with so many diagnostic characters preserved that it could only be assigned to Castanopsis if “found alive” today (1), has plesiomorphic features and cannot be placed confidently in the extant genus [see Figs. 1–5 in this two-part post]. ... Denk et al.’s phylogenetic conclusions from their emended tree and matrix are misleading, in that any morphological matrix includes characters that are relevant only for the taxa included in the analysis. ... Because the fossils are castaneoid in all features, we did not include all Fagaceae in our original analysis (1) and likewise did not include all characters relevant to non-castaneoid fagaceous taxa. ... By adding just three relevant characters to the Denk et al. scaffold to accommodate the genera they added (Table 1), the fossil Castanopsis rothwellii is placed only with Castanopsis in the single [ie. the strict consensus of two equally parsimonious trees] most parsimonious tree (Fig. 1).
One of the three added traits ("expanded stigma") is exclusively shared by all five Castaneoideae genera, the second ("nut generally rounded in cross section") shared by all but one Castaneoideae and Quercus, and thus are symplesiomorphies of core Fagaceae: shared primitive traits that can be expected in a precursor of several or all modern genera or their less evolved extinct sister lineages. Or positively selected homoiologies, ie. evolved multiple times within the core Fagaceae. The third ("asymmetrical cupule") is an unstable convergence / deep parallelism and a trait of little phylogenetic value, since expressed as intra-generic (intraspecific?) variation in two distantly related genera: the monotypic Formanodendron, a trigonobalanoid, and Castanopsis. These are two genera that share only a very distant (and exclusive fide Hennig) common origin (see Part 1) but inhabit overlapping climate envelopes and ecological niches in modern-day East Asia.

Despite adding three hand-picked characters (from a set of at least 25 at hand, Crepet & Nixon 1989) and accepting a phylogeny closer to the reality, the Castanopsis "clade" in the new "scaffold tree" including the Patagonian fossil remains unsupported by any exclusive or even shared and stable derived trait/set of traits (as in the original study, Wilf et al. refrain from establishing any sort of node or branch support, or test of alternative placements).

Moreover, it is safe to assume that when one adds the extinct genus Castanopsoidea to the scaffold (Wilf et al. deliberately chose not to do so), it would compete with Castanopsis rothwellii for the placement next to the modern-day Castanopsis. According to Crepet & Nixon 1989, fig. 1, one possible placement of Castanopsoidea is a sister to "Castanopsis (1)". This is not necessarily because they share a direct common origin but because these fossils also lack uniquely derived characters or a clearly derived character suite defining all Fagaceae genera except for Castanopsis (which in Crepet & Nixon's morpho-tree, is paraphyletic to Lithocarpus, which, back then, included the potential oak sister genus Notholithocarpus — literally: the 'false Lithocarpus'). Personally, for the same reasons as outlined and applied in Bomfleur et al. 2017, PeerJ 5: e3433 (and like Denk et al. 2019), I would have no problem calling all these fossils Castanopsis by defining the genus as explicitly paraphyletic, which could include the modern-day species of Castanopsis (which are probably monophyletic) and Castanopsis-like fossils that may be more or less related to them and/or other core Fagaceae: the precursors and extinct but similar, underived sister lineages.

Why the emporer has no clothes on – conflict or not?


In the final part of this series dissecting angiosperm gene trees (see: Why the emporer has no clothes on — part 1 and part 2), we will enter muddy ground. Using our example data set, we will try to make a call on whether or not there has been any (detectable) major reticulation in the deep branches of the angiosperm tree.

What triggers conflicting gene histories

Before we look at the data, it may be a good idea to set the scene using simple theoretical examples of what we may look at.


Our two genes, represented by circle and pentagon (could be multigene regions or entire genomes), both follow the same evolutionary history (the gray background tree). In the left lineage, we have a bit of incomplete lineage sorting, because the ancestor was polymorphic for the circles. In the right lineage, we have different fixation rates: the circles evolve faster than the pentagons. With molecular data we usually don't have the ancestors, making any inference straightforward; we only have the tips.


Because of incomplete lineage sorting and different fixation rates in the left and right lineages, the circle gene tree gets the phylogeny pretty wrong. The pentagon gene tree comes closer to the reality – we only infer two sister clades where there is a grade. (With real-world data, the branch support values could give one a clue that three of the inferred blue clades have a higher quality than the fourth supporting a pseudo-monophylum.) The circle and pentagon trees are largely incongruent despite sharing the same history; and we may infer a pseudo-hybrid (the first diverging lineage within the right clade).

Combining these data may allow us to infer a tree that fits the real tree much better. In the left clade the trivial pentagon signal can out-compete the misleading circle signal, and avoid the misplacement of the first diverging lineage of the right clade. In the right clade, the circle signal can help to correct for the pseudo-clade.

Now we can add a late reticulation, and re-infer the gene trees.


Because of the reticulation (the circles are biparentally inherited, the pentagons maternally), the gene trees are more congruent then in the example above (circle and pentagon get it a bit wrong in the left clade), except for the hybrid and its pseudo-hybrid parent. The gene conflict in placing the lineage cross (part of the left clade in the circle-based tree, part of the right clade in the pentagon tree) well reflects its hybrid origin.

Different histories of nuclear genes vs. plastid / mitochondrial genes?

The easiest way to catch reticulation is to compare trees based on plastid / mitochondrial data (maternally inherited) vs. nuclear data (biparentally inherited). If reticulation happened in the past, we can expect that the maternal and biparental genealogy diverge from each other (see part 2).

Strict Consensus network of the plastid (data from 3 protein-coding genes +1 partly coding gene region), mitochondrial (3 protein-coding genes) and nuclear trees (2 nrDNAs). The bold lines represent generally accepted phylogenetic splits (APG IV tree, see also Steven's comprehensive Angiosperm Phylogeny Website).

This network is much more box-like compared to what one would have expected based on the combined tree that can be inferred from the data (Part 1). But are we looking on largely decoupled histories?

This mess is hardly surprising. The combined tree is constrained by the plastid tree, specifically by the signal from the matK gene (Part 1), while the remaining plastid genes (from a different part of the plastome) fall into line. The mitochondrial tree combines genes that on their own inform poorly resolved trees riddled with branching artifacts (Part 2). The nuclear tree, on the other hand, combines the most and least divergent nuclear genes widely known. Because of this, they show topological conflict between each other.

18S-25S rDNA tanglegram. The branch numbers show each gene's bootstrap support (BS) deviating from the combined BS support for the respective branch (indicated by line thickness): green, increased BS support when combining both genes, red, decreased BS support.

However, they are part of the same multi-copy coding unit (the 35S nuclear rDNA) that has very particular evolutionary constraints, such as structural constraints, affected by completeness of concerted evolution and intra-genomic recombination. Polyploid grasses, for example, can have up to three different collections of 35S rDNA, reflecting four different evolutionary origins, being part of the A, B, C or D genomes. You end up with what is called a multi-labelled tree: the A, B, C and D-genome variants of the same taxon pop up (consistently) in different parts of the tree, and you can have recombinants. If we look into the 18S vs. 25S data, however, we find no consistent sequence patterns supporting the topological conflicts between the two trees, or examples for recombination.

As in our theoretical example, each of the trees has certain strengths, and its own set of weaknesses, some of which can be overcome when combining the data (eg. branches with increased combined support in the 18S-25S tanglegram)

Bootstrap (BS) Consensus networks for the combined cp (upper left), mt (upper right), nc (lower left) and full data (lower right). Branches without numbers: BS = 100. Splits conflicting with those present based on the full data highlighted by red font (all with BS < 100).

In contrast to the boxy network appearance and the substantial conflict between the single gene trees (Part 2), most of the relationships (eg. the major clade roots but also many intra-clade relationships) receive high or unambiguous support in all three trees*. Aside from the disparate signals, the data seem to converge on a coalescent. If the genomes had different histories, they wouldn't converge so easily. Also, we would expect to see more consistent conflict between the "genome" trees than between the single-gene trees of the same genome, since the nuclear rDNA is biparentally inherited while the plastid and mitochondrial DNAs are passed on via the mothers only. Many of the angiosperms in our data reproduce sexually.

So far, no conclusive evidence for reticulation

Mere gene-tree incongruence is a poor basis to conclude about decoupled gene histories. We need to dig for sequence-based evidence for reticulation and recombination. For instance, we might find a clearly derived sequence pattern exclusive to the right clade in a member of the left clade.

The importance of rare genomic changes when interpreting conflicting gene trees. The left and right clades obtained a unique and conserved gene or sequence feature before they diversified. The hybrid is the only taxon showing both.

This is where the Walker et al. (2019) and Sullivan et al. (2017) studies seem to fall short — they don't give any example, gene, gene region, or recognizable lineage-diagnostic sequence pattern that could be used as direct evidence for decoupled gene histories and/or reticulation.

For my data set, I cannot pinpoint such evidence either. All high(er)-supported conflict seems to be related to lineage sorting and data/signal issues, the inability of certain gene regions to resolve relationships in parts of the angiosperm tree, or falling prey to (more local than global) long-branch attraction. When looking at the sequences, there's no reason to question, for example, the assumed monophyly of the main lineages and orders, in spite of the topological conflict we face when analyzing these data. If there was reticulation between the ancestors of angiosperm lineages, or later on between the already formed lineages, it left no obvious imprint in the data.

Thus, after having investigated aspects of the seeming conflict by going back to the data (checking highly divergent and conserved sequence patterns, tabulating the partly competing BS support of the single genes, and minus-one gene analyses), I did not hesitate to combine these data and use a Bayesian total-evidence dating procedure. (We never published the results because mid-Cretaceaous angiosperm fossils have much too derived morphologies for total evidence dating; when left unconstrained, MrBayes optimized towards an angiosperm root age of 4.5 Ba, which was the in-built maximum).

A total-evidence Bayes tree based on the full data set. Stars indicate the position of fossil taxa (mid-Cretaceaous). Note their relative long terminal branches, a situation total-evidence dating cannot handle. The matrix can be found at figshare: A basic total evidence matrix for basal angiosperms — combining Soltis et al (2011) with Doyle & Endress (2010).

An example for actual reticulation resulting in gene tree conflict

Working at the coal-face of evolution, I have encountered examples of apparently real reticulation (when analysing biparentally inherited nuclear data). The most compelling was probably the ancient relictual genotypes and pseudogenes that point towards ancient reticulation in the widely known plane trees, Platanus. Platanus subgenus Platanus (which includes all but one species, P. kerrii, a relict of a distant lineage growing in tropical-hot subtropical lowland forests of North Vietnam) falls into two main lineages characterized by unique sets of genotypes, the ANA clade (Atlantic-facing North and Mesoamerica) and the PNA-E clade (NW. Mexico, California and Mediterranean).

Haplo/-genotypic composition of Platanus (Grimm & Denk, Taxon, 2010, ES2 [PDF]). Platanus kerrii represent the sole surviving relative within the Platanaceae (genetically very distinct), an old lineage of angiosperm trees (going back deep into the Cretaceous). Their next kin today are, according to angiosperm molecular trees, the enigmatic Proteaceae, a Gondwanan relict (represented in our angiosperm data by Petrophile). For an even more comprehensive genotypic study that also covers plastid markers check out De Castro et al., Ann. Bot., 2013 [open access])

Individuals in the contact zone between species of the two main lineages (including hybrids) can be heterozygotic / polymorphic for at least one of the sequenced nuclear regions, so that identification of recent hybrids is straightforward. Beyond this, genetically inconspicious members of the ANA clade may show ITS pseudogenes from the PNA-E clade (stippled line in the figures above and below). Furthermore, two of the ANA clade species show (predominately), a PNA-E LEAFY genotype — P. palmeri (pa) and P. rzedowskii (rz), which grow closest to the populations of the PNA-E clade. However, this is not the genotype found in the close-by American PNA-E species (ra, ge), which is one that's sequence is phylogenetically closer to the Mediterranean species, P. orientalis (or), on the other side of the globe.

Overlay of the LEAFY, 5S-IGS and ITS histories in Platanus. This doodle is based on tree- and network-inferences coupled with PCR-RFLP-based genotyping and in-depth analysis of mutation patterns in length-polymorphic sequence regions (Grimm & Denk 2010, ES1). P. x hispanica is the well-known ornamental alley/park tree, the 'London plane'. A cultivated historical hybrid (mid 18th century) of the most hardy North American plane, P. occidentalis, and the frost-vulnerable Mediterranean plane, P. orientalis. In the Mediterranean, due to frequent backcrossing, one can find morphologically mixed individuals showing only the P. orientalis genotypes or homogenous (American or European) type individuals showing occidenatlis and orientalis genotypes (see eg. Pilotti et al., Euphytica, 2009

Further reading

An animal example, of seemingly incongruent single-gene trees that may well be the product of a largely shared evolutionary history, is the autosomal intron data compiled for bears by Kutschera et al. (2014. Bears in a forest of gene trees: Phylogenetic inference is complicated by incomplete lineage sorting and gene flow. Mol. Biol. Evol. 31:2004–2017). Rather than a "forest of trees", each gene tree is poorly resolved but, when combined, allows inferring a phylogeny that matches quite well the parental genealogy based on Y-chromosome data, both in strong conflict with the maternal genealogy inferred from mitochondriomes (see Part 2).

In Supplement File S6 [PDF] of Grímsson et al. (2018, Grana 57:16–116), I outline how ambiguous signal from combined gene regions relate to the poor support of critical branches in the Loranthaceae tree; see also the related posts: Using consensus networks to understand poor roots and Trivial but illogical – reconstructing the biogeographic history of the Loranthaceae (again). Some gene-tree conflicts are possibly linked to different histories (nuclear vs. chloroplast data), while others are a mix of insufficient signal and missing data (between chloroplast genes).

In a previous post (All solved a decade ago: the asterisk branch in the Fagales phylogeny), I give another example using an old Fagales matrix, which resulted in a tree that, even today, is the gold standard of Fagales phylogeny. The matrix combines a highly conserved nuclear gene (18S) conflicting with the plastid genes and complemented by an entirely uninformative mitochondrial gene (matR) to provide a "tree based on all three genomes". Also in this case the three-genome tree is essentially the matK tree.



* That doesn't mean that all highly supported, unconflicted relationships must be true. Note that just by combining a few genes, we obtain a near-unambiguous support for the split between Mesangiosperms and the ANA-grade + gymnosperms, one of the splits defining the root and "basal" part of the angiosperm tree. The outgroup-inferred root is well fixed. Even when using nuclear data, despite the fact that the 18S signal (the one showing the least ingroup-outgroup genetic distance) doesn't support such a root but the 25S does (see part 2), being more divergent and prone to ingroup-outgroup long branch attraction (LBA). That we have LBA issues with the data is obvious from a tiny detail: Ginkgo is supported with BS > 70 as sister of Podocarpus, which is wrong, based on all we know about gymnosperms,(see also Earle's gymnosperm database and literature cited therein). The likely correct split, Ginkgo as sister to Cycas, is present in the nc tree, but represents a much less supported alternative (BS <= 25). It is also obvious when one looks at the alignment(s): Cycas and Ginkgo share some potential genetic 'synapomorphies' in the low-divergent, generally conserved regions (eg. 18S, stem-regions of 25S), but there are essentially none for Gingko + Podocarpus.

Why the emperor has no clothes on – a thicket of trees


A critical question in phylogenetics, and this applies to both the detection and inference of reticulation, is: How much trust do we put in the inferred tree? A phylogenetic tree is just the simplest of all possible phylogenetic networks. Let's assume that there was some phylogenetic reticulation in the past (lineage mixing and crossing), then, in the best-case scenario, our inferred tree shows one of the intertwining pathways but misses the tangles, the crossroads.

An example of simple reticulate evolution: pink is the product of very recent lineage crossing between an early diverged (and otherwise lost) member of the blue lineage and the more recently, hence genetically more coherent, red lineage. Bold lines show the tree we would likely infer in such a situation.

In the worst case, summarizing data with substantially different signals will give us branching artifacts such as:
  • terminal branches that are too long,
  • too long internal branches with conspicuously low support (ie. BS << 100, PP < 1.0),
  • artificial branches representing the least-conflicting solution for the conflicting data,
  • low branch support in general.
See eg. the bear data we used as a real-world example for our Intertwining trees and networks paper (Schliep et al. 2017, open access).

Three possible trees for bears, (a), Y-chromosome, paternal, and (c), nuclear-encoded autosomal introns, biparentally inherited, are congruent but disagree with the maternal genealogy (b), based on the mitochondrial genes. When fusing all three data sets, we get a (low) supported sister relationship for Sloth and Sun bears (red clade), not supported by any of the three fused data set – a branching artifact.

Topological incongruence between gene trees and parental genealogies (as above) is commonly taken as evidence for reticulation. If one gene provides high support for taxon A as sister to B, and another gene has high support for B as sister to C, then B is likely the product of reticulation (eg. hybridization)

One simple possibility to put together a phylogenetic network is to summarize all of the trees in the form of a Consensus network, as shown next. (Technically this is a splits graph, it becomes a phylogenetic network as soon as we determine a root, which, here, would be at the edge leading to the Giant Panda.)

A strict Consensus network of the paternal, biparental, and maternal bear genealogies.
The numbers show the non-parametric bootstrap support for each (competing) split.

In this case, low support for a branch in a combined tree (the values on top) can result from strong conflict. For instance, the brownish splits, which are poorly supported using the combined data (BS = 21, 29), receive near unambiguous support from the mitochondrial genes, but are largely or entirely rejected by the Y-chromosome and nc-intron data. In the combined tree, this deep conflict is resolved by introducing the artificial red clade, with similarly low support: the signal in the data is ambiguous and they support splits between equally possible alternatives.

We know lineage crossing took place in bears (the mitochondrial and Y-chromosome tree are very much in conflict). However, does the above mean that earliest bear-ish creatures hybridized, too? Note that the conflict is associated with a short-branched part of the graph, where apparently little evolution happened. Fast ancient radiations usually come with incomplete lineage sorting and diffuse signals. The only data set producing longer roots, but with notably lower support, are the biparentally inherited introns.

We are closing in our own tail and have to ask again: Is this low support in the autosomal intron tree due to internal conflict, (sets of) introns preferring different topologies, supporting an ancient mixing hypothesis, or just reflecting lack of resolution? Check out the original paper by Kutschera et al. (2014, Mol. Biol. Evol. 31: 2004–2017), and make up your own mind.

On to the angiosperms

In my last post, I exemplified what Walker et al. (PeerJ, 2019) found in their angiosperm study: when we look at a plastome tree we are not looking at a summary of all gene trees but instead at a topology forced by very few of the genes in the chloroplast genome, such as the matK. We also have seen that one misplaced sequence (outgroup Podocarpus-matK) doesn't affect at all the combined analysis — it didn't even reduce the ingroup - outgroup split support. Also, I noted that the low-supported part of the combined tree goes hand in hand with lack of decisive signal from the matK.

It's time to take a look at what the other genes in this example data set come up with.

The eight gene trees. Terminal subtrees collapsed. Scales fit to size, scale bar = 0.1 expected substitutions per site. Upper left, matK tree which is very similar to the combined tree using all gene regions (cp = chloroplast, mt = mitochondrial, nc = nuclear genes). Note the low performance of the mt genes.

One thing is obvious: for most genes (except the nuclear-encoded rRNA genes) including the outgroup taxa adds little ingroup information of use — they are just too distant to any of the ingroup taxa. Outgroup rooting is tricky for angiosperms. Outgroup taxa will always be attracted to the ingroup taxon that is the least similar to any other part of the ingroup: Amborella in this case.

Generally, all of the ANA-grade water plants are genetically distinct and topologically isolated; any outgroup-inferred root must be placed in this part of the tree (all other living seed plants are very distant relatives of angiosperms looking back at, at least, ~250 million years of independent evolution, see eg. Age of Angiosperms... and What is an angiosperm pt. 2). The relatively conserved plastid rbcL and mitochondrial matR prefer an Amborella-Nympheales clade as sister to all other angiosperms, while the more divergent atpB, plastid, nad5, mitochondrial, and 25S, nuclear, prefer the Amborella-root — this is a direct indication for ingroup-outgroup long-branch attraction. Any other placement of the outgroup subtree within the ingroup would necessarily decrease the likelihood of the tree (but note the position of the root in the 18S tree, lower-left the tree based on the most-conserved, evolution-constrained gene in our sample; see also All solved a decade ago..., fig. 4A).

We can look at these trees with the strict consensus network, using uninformed edge lengths— that is, the network counterpart to the strict consensus cladograms still common in plant phylogenetic literature.



This is a nice piece of computer-art, but is scientifically quite useless (the boxiness and general graph structure is, however, reminiscent of strict consensus networks of most-parsimonious tree samples inferred for extinct animals, one example, and plants).

We can add some discriminatory information by counting how often each split occurs in the set of gene trees.

Same set of tree, different way of summarizing it. Note how the main clades emerge: one or two genes may have misplaced the one or other OTU but the others get it right.

Alternatively, we can average the actual tree branch lengths to inform the edge length of the consensus network.

The light green, sand-colored, light brown and dark olive (clockwise) splits are likely branching artifacts. The light blue split is the one that supports the ANA-grade when the (combined) tree is rooted with the very distant outgroups.

A pretty little thicket of trees. Some agreement is found towards the leaves, but even here we have conflict among the gene trees. In some trees, there are long branches grouping non-related OTUs, obvious tree inference artifacts. The general rule is that the deeper we go (ie. the farther back in time), the messier it gets. Adding to this is that, irrespective of which gene is used, some OTUs are much closer to the hypothetical common ancestor (of Mesangiosperms, ie. all but ANA grade) than others – in the eudicots, the least-evolved taxa are Platanus (very old tree genus) and Euptelea (the basalmost Ranunculales); in the Magnoliales, the only angiosperm clade that lacks synapomorphies, it's Magnolia and Liriodendron (again, very old and primitive tree genera). Darwin's Abominable Mystery, the sudden appearance and quick dominance of angiosperms, resulted in an abominable chaos of gene trees and signals. How can they possibly converge to a single tree with amply high support along most branches?

The combined tree from the first post.
When compared to the bears, the answer may well be: because there has been very little to no reticulation between these lineages. Our thicket may be not a forest of trees but just a poorly trimmed, wildly overgrown bush. They genes share the same history, but when being analyzed one-by-one, each of their trees get some aspects right, and some others (severely) wrong. Misplacing one OTU (e.g. the light green, dark olive, sand-colored and dark yellow splits in the averaged Consensus network) may have further topological effects; it didn't matter for the matK gene, because we misplaced only one very alien OTU in a data set that otherwise is hardly affected by adding or removing OTUs.

I argue here that, if there had been substantial reticulation messing up the signal of contemporary lineages and reflecting decoupled histories (like in the case of bears), we would expect at least some (artificial) branching patterns with low support in the combined tree, as well. This would also be the case looking at the gene-tree consensus networks, not only in the deepest parts but also closer to the leaves.

We will be explore this alternative hypothesis in the next (and final) post of this mini-series.

Why the emperor has no clothes on – the mighty matK


In a recent paper published in PeerJ, Walker et al. (2019) take a close look at the complete plastome data of angiosperms. Although they don't find anything fundamentally new — well, at least not for those of us who have looked at the oligogene datasets we worked with — it's nice to see that somebody has been willing to do it in a very comprehensive way, and thereby published what some of us have long known:
  • A combined tree is not the sum of the genes that have been combined;
  • Single-gene trees can tell you very different stories.
Even if the overall branch support is pretty high, we always should be aware of internal data conflict.

When looked at closely, the emperor, in this case the Angiosperm Phylogeny Group (APG) complete plastome tree, maybe not be entirely naked, but is clothed in very few of the many garments at his disposal. Effectively the branches in the plastome reference tree draw their support from very few of the 79 genes/gene regions in the plastome.As Walker et al. note:
"Of the most commonly used markers, matK, greatly outperforms rbcL; however, the rarely used gene rpoC2 is the top-performing gene in every analysis. We find that rpoC2 reconstructs angiosperm phylogeny as well as the entire concatenated set of protein-coding chloroplast genes."

Fig. 1 from Walker et al. showing the (lack of) individual gene support for the angiosperm reference phylogeny.

However, there is one aspect of the paper that calls for a network-based blog post:
"Following the typical assumptions of chloroplast inheritance [i.e. that the entire plastome shares a common history being passed on solely by the mother in angiosperms], we would expect all genes in the plastomes to share the same evolutionary history. We would also expect all plastid genes to show similar patterns of conflict when compared to non-plastid inferred phylogenies ... Our results, however, discussed below, frequently conflict with these common assumptions about chloroplast inheritance and evolutionary history."
Getting incongruent branches in the single-gene trees, including a few highly supported ones, is taken as evidence for different histories potentially mixed within the plastome. Walker et al. give references for (potential) recombination and and reticulation in plastomes.

I asked a question about whether this logic isn't a bit naive about tree inference. In their response, they pointed to the paper by Sullivan et al. (Mol. Biol. Evol. 2017) — these authors made test for recombination in Picea (spruce) plastomes, then split the complete plastomes into three structural units, and found two embedded conflicting phylogenies, as shown in the next figure.

Fig. 4 from Sullivan et al. (2017). F1 and F2 are structural regions comprising most of the large single-copy unit, the F3 the two (duplicate) inverted-repeat regions and the small single-copy unit of the Picea plastomes.

This seems to be a compelling case (but note the BS < 100 for conflicting critical branches). It is also quite possible, since gymnosperm plastomes, in contrast to angiosperms, may be paternally or bipartentally inherited. But, is it a valid assumption that each single-gene tree (or, in Sullivan et al.'s case, trees based on multigene regions) reflects the true tree of that gene or gene complex? That is, even if I assume that all of the genes in my matrix share the same history, must they support the same inferred tree?

Since I have worked a lot at low taxonomic levels, and often with other people's (plastid) data (during my entire career, I remained faithful to the nuclear-encoded ribosomal DNA spacers), my spontaneous answer would be: Absolutely not! Topological conflict may hint towards decoupled gene histories — it is a neccessary criterion but not a sufficient criterion.

There are quite a lot evolutionary scenarios that will lead to data inevitably supporting wrong branches, or false positives (see also Walker et al.'s discussion). Even if evolution is a strictly dichotomous process (which it clearly isn't):
  • low divergence may result in primitive (underived) sequences ('genetic symplesiomorphies') being shared by distant taxa
  • high divergence may result in saturation, which ultimately triggers branching artifacts
  • long isolation coupled with small active population sizes, repeated bottleneck / massive extinction events and/or lack of radiations will lead to sequences that are different from anything else in our data (in angiosperms, this phenomenon has a name: Ceratophyllum).
In fact, the very argument for angiosperm molecular phylogeneticists to move away from using single-gene phylogenies was that these first single-gene trees had branches that made little sense, especially when based on plastid data.

Single-gene trees will get things wrong. The more signal we add, usually by adding additional gene regions, the more we will reduce these errors (this is best-case scenario, but see Delsuc et al., Nature Rev. Genet. 2005). Thus, if some gene-trees conflict more with the combined tree than do others, it can be for two possible reasons:
  1. The conflicting genes had indeed different evolutionary histories. However, this would have to involve intra-plastome recombination and heteroplasmy, which so far have been very rarely documented in angiosperms.
  2. All genes had the same evolutionary history, but some of the data get more aspects of this true tree right than do others (and, of course, some are wrong that others get right).

And the matK said: "I'm your lord, follow my lead"

Walker et al. (all their scripts and results files can be found on github) find that it's only a few of the genes that essentially make up the combined tree. One of them is an old reliable pal of angiosperm phylogeneticists, the chloroplast matK gene. The literature is full of "multigene" trees that are effectively matK gene-trees using enlarged matrices. The matK determines a topology, and by adding genes that cannot compete with it (being too conserved, too variable or just inconsistently different), we re-inforce this topology. Only branches unresolved by matK will be further optimized using the added data.

Let's look at an example.

For the purpose of this post (and the follow-up), I'll use an old angiosperm matrix on stock (I know the quirks of this matrix). For analysis, I eliminated all of the OTUs with missing gene partitions, mainly to make sure that all of the trees and bootstrap (BS) pseudoreplicate trees have the same set of leaves, so I can summarize the tree samples using consensus networks.

Here's the my combined tree, unpartitioned.

Gray – current APG IV classification, "gold tree" (primary relationships within Mesangiospermae still a matter of debate)
And here is the fully partitioned one (over-parametrized; with each gene/codon position treated as data partition).

Essentially the same tree (some branches elongated, others shortened), eudicot clade and the Ceratophyllum-monocot clade swapped positions. Both trees have the same scale.

Even though my matrix includes only relatively few genes (just 21,550 sites), the tree gets the main aspects of the APG IV standard tree. The support for most of the branches is nearly unambiguous (irrespective of data partition), with the exception of some deep-down relationships within the Mesangiospermae (a long-standing issue, called the "dirty dozen"). The fact that the unpartioned and partitioned analysis agree for most part, indicates the signal in my matrix has no model-related issues (at least, none we could fix by using "better" models).

And the matK tree mirrors the fully partitioned tree, as shown here.

A tanglegram of the matK and the combined trees. Shown is the matK BS support for shared and conflicting edges. Orange asterisks, the monocot subtrees have the same structure but when using only matK, the conifer outgroup Podocarpus is nested deep within.

The similarity is indeed striking, in particular since the gene sample in the matrix comprises data from:
  • two of the nuclear-encoded ribosomal RNA genes (18S, 25S; biparentally inherited) that did follow partly different evolutionary trajectories, as e.g. well-studied in the case of Fagales (being a derived eudicot, not included in my matrix)
  • six chloroplast genes/gene regions (maternally inherited including the classics rbcL and matK but also the rpoC2, the most informative gene identified by Walker et al.)
  • three mitochondrial genes (also maternally inherited, but most mutations are, amino-acid-wise, synomymous, being concentrated at the third codon position).
The main things that matK get's wrong* in contrast to the combined tree are deep divergences represented by (very) short branches, in the part of the graph following the (very rapid) split of the mesangiosperm common ancestor (known as "Darwin's abominable mystery").

Also, it nests Podocarpus, the conifer in the outgroup, with unambiguous support in the monocots — which clearly is wrong, a false positive. Looking into the alignment, we can see that the reason for this is a mix of moderate-LBA (long-branch attraction) with missing-data-culling. To minimize LBA artifacts in the matrix originally used, I blanked out parts of the matK in the outgroup (which included a more derived conifer, Pinus, but also the extremely divergent gnetophytes); parts that were not straightforwardly alignable with the angiosperm matK.

The best way to illustrate internal signal conflict is, however, to directly show the BS Consensus network, not mapping support on two alternative topologies as seen in the tanglegram.

BS Consensus network based on 150 matK BS pseudoreplicates (numbers of necessary BS replicates determined by Pattengale et al.'s extended majority rule bootstop criterion implemented in RAxML)

When looking at BS << 100 and the boxes of competing splits in BS-support networks, it is important to keep in mind that low support can have two reasons:
  • Lack of decisive signal, because the BS pseudoreplicates will have (semi-)random or biased branching patterns; in the tree this surfaces usually as low (when random) to moderately high (when biased) support associated with (very) short branches.
  • Conflicting signals, ie. signals incompatible with a single tree; depending which site is eliminated or duplicated during resampling, the BS pseudoreplicate will show one or another topology; strong, deep conflict can surface in a tree by low support associated with (normally) long internal branches but also relatively high support for one alternative topology, the other only manifesting in very long terminal branches.
Regarding Walker et al.'s results, we now need to ask:
  1. Are the non-conflicted branches in the combined tree (major clades equal to the gold tree) the result of shared history of all of the included genes, or just that of the matK?
  2. Is the conflict with the combined tree and locally ambiguous signals due to a different history of the matK, located in the large single-copy unit, and the other genes, or just matK's inability to get certain things right?
In this case, all relatively high-supported conflicting matK splits are associated either with: (i) very short internal branches in the tree, the non-discriminative product of a fast ancient radiation, or (ii) are the result of an obvious data/branching artifact, ie. the misplaced Podocarpus.

So far, nothing challenges the assumption that the combined genes didn't follow the same history. Whether the other genes reveal something else, we'll see in my next post.



* or right: APG IV treats Ceratophyllum as the "probable sister of the eudicots" (see also Stevens' Angiosperm Phylogeny Website).

Losing information in phylogenetic consensus


Any summary loses information, by definition. That is, a summary is used to extract the "main" information from a larger set of information. Exactly how "main" is defined and detected varies from case to case, and some summary methods work better for certain purposes than for others.

A thought experiment that I used to play with my experimental-design students was to imagine that they were all given the same scientific publication, and were asked to provide an abstract of it. Our obvious expectation is that there would be a lot of similarity among those abstracts, which would represent the "important points" from the original — that is, those points of most interest to the majority of the students. However, there would also be differences among the abstracts, as each student would find different points that they think should also be included in the summary. In one sense, the worst abstract would be the one that has the least in common with the other abstracts, since it would be summarizing things that are of less general interest.

The same concept applies to mathematical summaries (aka "averages"), such as the mean, median and mode, which reduce the central location of a dataset to a single number. It also applies to summaries of the variation in a dataset, such as the variance and inter-quartile range. (Note that a confidence interval or standard error is an indication of the precision of the estimate of the central location, not a summary of the dataset variation — this is a point that seems to confuse many people.)

So, it is easy to summarize data and thereby lose important information. For example, if my dataset has two exactly opposing time patterns, then the data average will appear to remain constant through time. I might thus conclude from the average that "nothing is happening" through time when, in fact, two things are happening. I will never find out about my mistake by simply looking at the data summary — I also need to look at the original data patterns.


So, what has this got to do with phylogenetics? Well, a phylogenetic tree is a summary of a dataset, and that summary is, by definition, missing some of the patterns in the data. These patterns might be of interest to me, if I knew about them.

Even worse, phylogenetic data analyses often produce multiple phylogenetic trees, all of which are mathematically equal as summaries of the data. What are we then to do?

One thing that people often do is to compute a Consensus Tree (eg. the majority consensus), which is a summary of the summaries — that is, it is a tree that summarizes the other trees. It would hardly be surprising if that consensus tree is an inadequate summary of the original data. In spite of this, how often do you see published papers that contain any evaluation of their consensus tree as a summary of the original data?

This issue has recently been addressed in a paper uploaded to the BioRxiv:
Anti-consensus: detecting trees that have an evolutionary signal that is lost in consensus
Daniel H. Huson, Benjamin Albrecht, Sascha Patz, Mike Steel
Not unexpectedly, given the background of the authors, they explore this issue in the context of phylogenetic networks. As they note:
A consensus tree, such as the majority consensus, is based on the set of all splits that are present in more than 50% of the input trees. A consensus network is obtained by lowering the threshold and considering all splits that are contained in 10% of the trees, say, and then computing the corresponding splits network. By construction and in practice, a consensus network usually shows the majority tree, extended by a number of rectangles that represent local rearrangements around internal nodes of the consensus tree. This may lead to the false conclusion that the input trees do not differ in a significant way because "even a phylogenetic network" does not display any large discrepancies.
That is, sometimes authors do attempt to evaluate their consensus tree, by looking at a network. However, even the network may turn out to be inadequate, because a phylogenetic tree is a much more complex summary than is a simple mathematical average. This is sad, of course.

So, the new suggestion by the authors is:
To harness the full potential of a phylogenetic network, we introduce the new concept of an anti-consensus network that aims at representing the largest interesting discrepancies found in a set of trees.
This should reveal multiple large patterns, if they exist in the original dataset. Phylogenetic analyses keep moving forward, fortunately.

Phylogenetic ambiguity: data gaps, indifference and internal conflict

A tweet by my favourite journal (not only because they insist that authors make their data available) pointed me to their most viewed paper of 2018, with a nice title (for a network-fan):
Genus-level phylogeny of cephalopods using molecular markers: current status and problematic areas, by Sanchez et al. (PeerJ, 6:e4331).
"Problematic areas" are exactly my cup of tea. However, the graphical representation of these falls a bit short. The authors show three maximum-likelihood phylograms, one for the Cephalopoda with support annotated at some branches (their Fig. 1), and one each for two of the constituent lineages, the Decabrachia (their Fig. 2) and the Octobrachia (Fig. 3, reproduced below, because we will take a look at the data behind it).

Original: "Figure 3: Maximum-likelihood tree of the Octobrachia under the
GTR + Gamma model with the morphological character set mapped onto the tree.
Taxa highlighted in red represents discrepancy to previously published studies."

Unfortunately, we don't know the actual support for each of the branches — there is a legend in the lower right, but no signatures etc. associated with it. You will find some information throughout the text, of course. For example:
The use of concatenated sequences of all markers (Fig. 2) resulted in a resolved topology for monophyly of the Octobrachia (BS = 58%), and strong support for monophyly of the Decabrachia (BS = 98%), with both clades strongly supported by the Bayesian approach with PP = 0.78 and 0.75 respectively
The latter is quite strange, as PP are expected (methodologically) to be ≥ BS.
Although monophyly was demonstrated for several families contained within both superorders, the relationships of the families contained within Octobrachia were better supported than those in Decabrachia (Fig. 2). Of the 37 nodes in the Octobrachia portion of the general tree containing all taxa, the majority were resolved above the 50% level (31 nodes with BS > 50%); but only 28 out of 80 nodes in the Decabrachia were resolved at BS >50%, most of which were located at family level.
BS = 51 could be lack of signal (all other alternatives BS ~ 0) or conflict (one alternative has a BS = 49).

What we can infer directly from the alignment

Let's have a look at the first three gene regions in the matrix provided, using Mesquite's bird-view option.


We can see from the alignment that the first gene (left; mitochondrial 12S rDNA) splits the taxon set (the taxon order seems to be arbitrary) into two (three if we include those with no data) main groups with substantially divergent 12S rDNAs. However, in the second, much more homogeneous gene, no such differentiation is obvious, with the exception of two accessions that remain very different from the rest. This is quite puzzling, because the second gene is the (close-by) mitochondrial 16S rDNA.

Without going into details, the 12S rDNA unambiguously supports (and enforces) an Octopodia core clade defined by a 12S rDNA entirely different from that of other taxa, and comprising five of this order's families, in which Amphioctopus and Octopus make up a subclade with strongly derivating 16S rDNA.

With respect to the tree, we also have to assume that the 12S rDNA of the Octopodia core clade is derived, strongly evolved, whereas it remained largely unmodified (ie. is primitive) in the other, earlier diverged (according to the tree) lineages. However, some of these lineages have equally long terminal branches: there has been more evolution going on in other genes.



The third gene, the nuclear-encoded gene for the 18S rRNA (18S nrDNA), shows another pattern (and quite typical). Large stretches with very little variation, hence, devoid of differentiating signal that would allow the tree algorithm to make a decision (and letting Bayes get lost in the treespace resulting in PP < 1.0).


For half of the taxa, no information is available, but this hardly matters because even genera with strongly different mt 12S rDNA have nearly the same 18S nrDNA. There is a little hickup in the second part in one accession (a gap in Cirrothauma with a small, off-alignment strand in between), but this could just be a sequencing artefact. Limited to a single taxon, it has no topological effect (we at least need four to make a call), it will only increase the length of the terminal branch.

The remainder of the matrix mirrors the situation in the first three partitions, eg. in the well-sampled (only six taxa missing) mt coI gene, Callistoctopus is visibly distinct from all other genera, while most general variation is concentrated at the 3rd codon position. All other mt-genes, accounting for 58% of the matrix' characters, are covered for four of the taxa (the sister taxon used to root, Vampyrotheutis, and three of the core Octopodia, hence, can only support a single split within this group and be used to test for its alternatives.



What networks could have shown

The matrix provided for the shown tree (made available via figshare) has 40 taxa and 16104 characters, quick to run these days. Here's the tree with branch support annotated along branches.

ML phylogram inferred from Sanchez et al.'s matrix, taxa ordered as in the original fig. 3. Members of the same taxon (order, superfamily, family, as annotated in Sanchez et al.'s fig. 3) colored accordingly. Values at branches indicate ML-BS  support using a single partition for the entire data ("unpart.") or using the gene-wise partition scheme provided in the figshare submission ("part.")

Even though I run an unpartitioned analysis, my tree is very similar to the original tree, with a near identical topology except for Ameloctopus being moved one node up and placed as sister to Hapalochlaena (ML, unpartioned-BS = 52 vs. 46[!] for the alternative seen in Sanchez et al.'s fig. 3). I never understood the fuzz about model and partition testing, when we usually work with data where any model will inevitably be suboptimal (see alignments). As a geneticist, I also believe data partitions should be informed by function, not computer programmes (eg. one for 1st and 2nd codon position, another for the 3rd codon position, and one for the rDNAs).

We have unambiguously supported branches (BS ~ 100), and others, the "problematic areas" (BS << 100). Ambiguity in support values for branches of a tree can have two reasons:
  1. Lack of signal, the data is indifferent regarding the placements of certain taxa and/or subtrees (PP < 1.0 are indicative for lack of signal).
  2. Conflicting signal, parts of the data (data partitions) prefer one topological alternative, others a (partly) conflicting one (keep in mind that even in the presence of substantial signal conflict, PP ~ 1).
Short branches with low (BS) support point to the former, long branches with low (BS) support are a direct indication of the latter. Two apparent sources of conflict would be that the data include gene regions from the biparentally inherited nucleome and the (usually maternally inherited, not sure how this is in squids) mitochondriome and combine protein-coding genes (amino-acids coded by codons) with rRNA genes (directly encoding a certain secondary, tertiary structure).

In our tree here, we notice a general correlation between the branch lengths and the support; the shorter the branch, the lower the support. With a few exceptions, eg. the Octopodida core clade, triggered by the unique, strongly diverged sequences of the 12S rDNA, has a long root branch with compartively low support (ML-BS = 63; collapses when using the authors' partitioning scheme that treats each gene region as individual partition).

Full BS Consensus network based on 450 ML pseudoreplicates (result of the unpartitioned analysis). Edge lengths are proportional to the BS support (frequency of the splits in the BS tree sample), trivial splits not collapsed. Arrow points to the root (cf. Sanchez et al.'s fig. 1).

The BS Consensus network shows us that some of the "problematic areas", ie. branches with ambiguous support, are not really problematic (alternatives have no to very little support), but others are. Including the 12S rDNA-based Octopodida core clade, and connected to this, the division of the Megaleledonidae, as annotated in Sanchez et al.'s fig. 3, into two clades (not discussed in the paper). A clade including all Megaleledonidae has a BSunpart./part. = 34/55 and competes with the 12S rDNA split (BS = 63/37) and the placement of Cistopus as sister to the Octopodida core clade (BS = 52/34). It doesn't conflict with the alternative topology placing Cistopus as sister to all of them (BS = 38/50). The reason for this is, of course, that by using a different partion for the highly divergent mt-12S rDNA, we allow RAxML to estimate high probabilities for all mutations, effectively down-weighting each mutation in this gene compared to those in other, more conservatively structured gene regions, which seem to prefer alternative splits.

Vice versa, the poorly supported sister relationship (BS = 45/21) of Bathypolypus with the Enteroctopodidae (light green) + part of the Argonautoidea (pink) stands unopposed, alternative splits have BSunpart. < 10. In the partitioned analysis, however, there is an equally poor supported alternative sticking out a bit: Bathypolypus as sister to the (all-including) Megaleledonidae clade (BSpart. = 23).

While we see little effect on the tree topology, partitioning affects some of the support values. An nice example is the structure of the Megaleledonidae s.str. subtree. The root is unambiguously supported, as is the sister relationship of Graneldone and Bentheledone. The remaining branches have ambiguous support.


Here, the partitioning scheme is a game changer. Unpartioned, the favored alternative is a Adelieledone-Pareledone-Megaeledone (APM) grade "basal" to Graneldone and Bentheledone (BS = 68/49); using the authors' partitioning scheme, the data favors an APM clade sister to the latter two (quite a difference, since we often equal clades with monophyly and grades with paraphyly).

It doesn't matter whether a clade has a BS support of 30, 50 or 70. We need to know, if the remaining 70%, 50%, or 30% of bootstrap replicates show random or the same alternative(s). When a tree has ambiguously support branches, BS Consensus networks should be obligatory.

Instead of reading sentences like this:
Benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) together with the Megaleledonidae (possessing a single row of suckers) formed a well-supported monophyletic group (BS = 72%, PP = 0.61).
we should read this:
A clade including all benthic families possessing a double row of suckers (i.e., Enteroctopodidae, Octopodidae and Bathypolypodidae) and the Megaleledonidae (possessing a single row of suckers) received ambiguous support (BS = 72%, PP = 0.61), but potential alternatives received no support at all. The combination of a relative high BS but low PP points towards a faint, but consistent signal in the available data.
And include the Consensus networks at least in the supplement.

When we aim to map morphological traits (which a nice touch of Sanchez et al.'s paper), why not consider the topological alternatives we see there?

Running single-gene trees is never wrong, too. But, in the case of these data, that would be the topic of another post, using a different type of network: a super-network.

Final note. This post is not intended to criticize Sanchez et al.'s paper (my squid-expertise ends with having seen them in aquaria). My impression is they put a lot of effort into getting the matrix together. Having been forced to harvest molecular data myself in the past, I know how important and tedious this work is. Instead, this post stresses and shows, using an easy-to-access example that raised a lot of interest (attracted many views), that we often have to work with suboptimal data not providing trivial results in the form of fully resolved trees. This is a situation in which easy to generate networks offer a lot. No peer reviewer should, in such a case, be content with seeing just a tree (although they, to my experience, always are).

Please stop using cladograms!


I really like the journal PeerJ, not only because it is open access and publishes the peer review process, but also because it's one of the few that adhere to strict policies when it comes to data documentation. In my last (on my own) 2-piece post (part 1, part 2), I showed what networks could have offered for historical and more recent studies in Cladistics, the journal of the Willi Hennig Society. In this one, I'll illustrate why paleontology in general needs to stop using cladograms.

An example

In a recent article, Atterholt et al. (PeerJ 6: e5910, 2018) describe and discuss "the most complete enantiornithine from North America and a phylogenetic analysis of the Avisauridae". I'm not a paleozoologist and "stuff of legend", but their first 17 figures seem to make a good point about the beauty of the fossil and its relevance; and it is interesting to read about it. This makes me envy paleozoologists a bit — the reason I exchanged chemistry for paleontology was my childhood love for the thunder lizards; I specialized in zoology not botany for graduate biology courses, and I fell in love with social insects, especially bees; but then more general circumstances pushed me into plant phylogenetics.

The result of Atterholt et al.'s phylogenetic analysis is presented in their figure 18, as shown here.

Figure 18 of Atterholt et al. (2018): "A cladogram depicting the hypothetical phylogenetic position of Mirarce eatoni." [the beautiful fossil is highlighted in bold font]
This looks very familiar — graphs like this can be seen in many paleontological studies, not only those in Cladistics. However, this is a phylogeneticist's "nightmare" (but a cladist's "dream").

First, phylogenetic trees, especially those that were weighted post-analysis several times to get a more or less resolved tree, should be depicted as phylograms — trees with branch lengths. Phylogenetic hypotheses are not only about clades, and what is sister to what, but about the amount of (inferred) evolutionary change between the hypothetical ancestors, the internal nodes, and their descendants, the labelled tips. For example, we may want to know how long is the root of the clade (Avisauridae, Avisaurus s.l.) comprising the focus taxon compared to the lengths of the terminal branches within the clade. Prominent roots and short terminals are a good sign for monophyly (inclusive common origin), or at least a fossil well placed, whereas short roots and long terminals are not.

The above tree as phylogram (using PAUP*'s AccTran optimization). The beauty of cladistic classification is that the new specimen could have just been described as another species of Avisaurus (but read the author's discussion).

In this example, we seem to be on the safe side, although one may question the general taxonomic concept for extinct birds. Are the differences enough to erect a new genus for every specimen? This is hard to decide based on this matrix.

Second, a tree without branch support is just a naked line graph, telling us nothing about the quality (strengths and weaknesses) of the backing data. Neontologists are not allowed to publish naked trees. In molecular phylogenetics, we are not uncommonly asked by reviewers to drop all branches (internodes) below an arbitrary threshold: a bootstrap (BS) support value < 70 and posterior probability (PP) < 0.95. In palaentology, it has become widely accepted to not show support values at all. The reason is simple: the branch support is always low, because of data gaps and homoplasy. This is a problem the authors are well aware of:
The modified matrix consists of 43 taxa (26 enantiornithines, 10 ornithuromorphs) scored across 252 morphological characters [the provided matrix lists 253], which we analyzed using TNT (Goloboff, Farris & Nixon, 2008a). Early avian evolution is extremely homoplastic (O’Connor, Chiappe & Bell, 2011; Xu, 2018) thus we utilized implied weighting (without implied weights Pygostylia was resolved as a polytomy due to the placement of Mystiornis) (Goloboff et al., 2008b); we explored k values from one to 25 (see Supplemental Information) and found that the tree stabilized at k values higher than 12. In the presented analysis we conducted a heuristic search using tree-bisection reconnection retaining the single shortest tree from every 1,000 replications with a k-value of 13. This produced six most parsimonious trees with a score of 25.1. These trees differed only in the relative placement of five enantiornithines closely related to the Avisauridae, forming a polytomy with this clade in the strict consensus tree (Consistency Index = 0.453; Retention Index = 0.650; Fig. 18).
I've seen much worse CI and RI values in the paleophylogenetic literature (some of them are plotted in this post). For a phylogenetic inference, homoplasy equals internally incompatible signals — many characters show different, partly or fully conflicting, taxon bipartitions; or, in other words, they prefer different trees. The signal in the matrix is thus not tree-like — it doesn't fit a single tree. That's why we have to choose one using TNT's iterated reweighting procedures. (Note: an alternative "phenetic" Neighbor-joining tree has a computation time < 1s, and produces the same tree for the Ornithumorpha and the root-proximal, 'basal' part of the tree, except that Jeholornis is moved two nodes up; but it shuffles a lot in the Longirostravis–Avisauridae clade.)

Another point is that the more homoplasy we have, then the higher must have been the rate of change (here: visible anatomical mutation). The higher the rate of change, the higher the statistical inconsistency of parsimony.

In short, paleontologists (Atterholt et al. just follow the standard in paleophylogenetic publications) use data with tree-unlike signal to infer trees (see also David's last post on illogicality in phylogenetics) under a possibly invalid optimality criterion, which are then used to downweight characters (eliminate noise due to homoplasy) to infer less noisy, "better" trees.

The basic signal

We can't change the data, but we can explore and show its signal. And the basic signal from the unfiltered matrix is best visualized using a Neighbor-net splits graph.

Neighbor-net based on mean pairwise taxon distances. Thick edges correspond to branches in the published tree.

Some differentiation patterns that explain the clades in the tree can be traced, but it becomes difficult in the group that is of most interest: the (inferred) clade(s) comprising the newly described fossil. In the Neighbor-net this is placed close to another member of the Avisauridae, but not all. The matrix is not optimal for the task at hand.

The data properties

The matrix is a multistate matrix with up to six states in the definition line (although only five are used, as state "5" is not present). The taxa have variable gappyness (i.e. the proportion of completely undetermined cells), between 2% (extant birds: Anas and Gallus) and 94% (Intiornis, an Avisauridae) — the median is 56%, and the average close to it (54%). The "hypothetically" placed fossil Mirarce eatoni (in the matrix it is under its old designation: "Kaiparowits") lacks a bit more of the scored characters (61%). That may strike one as a lot, but note that the matrix has 253 characters! However, we may well ask: if I want to place a fossil for which I can score 99 characters, why bother to include another ~150 that tell me nothing about its affinity? (Note: paleobotanists struggle hard even to get such numbers, we usually have at best 50 characters.)

Its closest putative relatives, the Avisaurus s.l., lack 90% of the characters; leaving us with max. 25 characters supporting the relevant clade (assuming that the 10% are all found in Mirarce as well). Coverage is not much better in the next-closest relatives (phylogenetically speaking).

Data coverage in the phylogenetic neighborhood of Mirarce eatoni

The missing data percentage may have mislead the Neighbor-net a bit, because we will have fed it with unrepresentative or highly ambiguous pairwise distances. In the the network, the focus fossil comes close to Neuquenornis, the only other Avisauridae with some data coverage. Looking at the heat map below, we see that missing data is indeed a problem in this matrix — we have zero distances between several pairs that show different distances to the better-covered taxa.

The distance matrix drawn as a heat map: green = similar, red = dissimilar (values range between 0 and 0.8). Red arrows: taxa with too many (and ambiguous) zero pairwise distances.

The closest relative of Mirarce is, indeed, Avisaurus/Gettya gloriae, but the latter has zero distances to various other poorly covered taxa from the phylogenetic neighborhood, in contrast to the much better-covered Mirarce. Neighbor-nets are very good at getting the obvious out of a morphological matrix, but they don't perform miracles. However, why should we include poorly known taxa at all during phylogenetic inference? Wouldn't it be better to infer a backbone tree (or network showing the alternative hypotheses) based on a less gappy matrix, and then find the optimal position of the poorly known taxa within that tree (network)?

Estimating the actual character support

Some characters cover just 10–20% of the taxa, whereas others are scored for most of them — more than half of the characters are missing for more than half of the taxa. Using TNT's iterative weight-to-fit option means that we infer a tree, ideally one fitting the well-covered data (taxon- and character-wise), and then downweight all conflicting characters elsewhere to fit this tree. We then end up with a tree where we have no idea about actual character support. Since the matrix is a Swiss cheese, we only can re-affirm the first-inferred tree.

Let's check the raw character support, using non-parametric bootstrapping and maximum likelihood as the optimality criterion (corrected for ascertainment bias, as implemented in RAxML).

ML-BS Consensus Network (using Lewis' 2-parameter Mk+G model). Edge lengths are proportional to the BS support values of taxon bipartitions (= phylogenetic splits, internodes, branches in phylogenetic trees). Only splits are shown that occurred in at least 10% of 900 BS pseudoreplicates (number of necessary BS replicates determined by the Extended Majority Rule Bootstrap criterion), trivial splits collapsed. Thick edges correspond with branches in Atterholt et al.'s iterative parsimony tree; coloring as before.

The ML bootstrap Consensus Network bears not a few similarities to the distance-based Neighbor-net. The characters do not support the Avisauridae subtree, as depicted in the published TNT tree, but there are faint signals associating some of them to each other, despite the missing data. Keep in mind: a BS support of 20 for one alternative and < 10 for all others means (ideally) one fifth of the characters support the split, and the rest have no (coherent) information. Some sister pairs have quite high support (for this kind of data set), and Gettya gloriae is resolved as sister of Mirarce (unambiguously, with a BS support = 67). But, the matrix hardly has the capacity to resolve deeper relationships within the group of interest, the Enantiornithes — the polytomy with the next relatives seen in the tree and the corresponding clade dissolve. This confirms what we saw in the Neighbor-Net (despite missing data distortion).

The matrix and the tree show something that could have been deduced directly from the distance matrix: the poorly known Gettya (Avisaurus) gloriae is (literally) the closest relative of the enigmatic new genus / species Mirarce (morphological distance of 0.08 compared to 0.1–0.64 for all other taxa). But is this overall similarity enough to conclude Avisaurus, Gettya and Mirarce are a monophyletic group within the Avisauridae?

What the authors (and all paleontologists doing phylogenetics) should have done

(I would have skipped all trees, naturally, but peer reviewers and most readers probably need to see them.)

  • Trimmed the matrix to include only those characters preserved in the fossil of interest, in order to minimize missing data artefacts during inference.
  • Shown the Neighbor-net to visualize the primary signal situation, including and excluding poorly covered taxa. From the Neighbor-net it is already obvious that the fossil is an Enantiornithes, so any subsequent optimization / inference could have focussed on this group alone.
  • Then inferred a backbone tree excluding poorly covered taxa, and shown the resulting phylogram. In case one needs to test the Enantiornithes root (the Neighbor-net gives us two alternatives for the Enantiornithes root: Pengornis + Eopengornis or Protopteryx + Iberomesornis), there is no point in including the poorly covered Enantiornithes or the worst-covered taxa outside this clade.
  • Then optimized the position of the poorly covered taxa in the backbone tree. I recommend using RAxML's evolutionary placement algorithm (EPA) for this, but you can also do this in a parsimony framework if you wish. (EPA can also be used to test outgroup roots: here, one would search the branch at which all non-Enantiornithes fit best.)
  • Shown the resulting phylogram including all taxa — that is, read in the topology to the analysis, and then re-optimize branch lengths.
  • Shown a Support Consensus Network to illustrate the support for the branches in the preferred tree and their competing alternatives. (There may be one or more, as there are many options to estimate branch support.) How sure can we be about relationships within the Avisauridae and their relationships to other Enantiornithes?



Postscriptum. For those who are curious about how the ML tree would look like, here it is:


I have no idea about birds, but from a methodological point of view this is an equally (if not more, because unforced) valid hypothesis for the data set. And demonstrating its limitations: note the relatively long branches with very low support making up the backbone of the Enantiornithes clade. This is typical for matrices lacking coherent discriminatory signal and/or struggling with internal conflict.

Please stop using cladograms!


I really like the journal PeerJ, not only because it is open access and publishes the peer review process, but also because it's one of the few that adhere to strict policies when it comes to data documentation. In my last (on my own) 2-piece post (part 1, part 2), I showed what networks could have offered for historical and more recent studies in Cladistics, the journal of the Willi Hennig Society. In this one, I'll illustrate why paleontology in general needs to stop using cladograms.

An example

In a recent article, Atterholt et al. (PeerJ 6: e5910, 2018) describe and discuss "the most complete enantiornithine from North America and a phylogenetic analysis of the Avisauridae". I'm not a paleozoologist and "stuff of legend", but their first 17 figures seem to make a good point about the beauty of the fossil and its relevance; and it is interesting to read about it. This makes me envy paleozoologists a bit — the reason I exchanged chemistry for paleontology was my childhood love for the thunder lizards; I specialized in zoology not botany for graduate biology courses, and I fell in love with social insects, especially bees; but then more general circumstances pushed me into plant phylogenetics.

The result of Atterholt et al.'s phylogenetic analysis is presented in their figure 18, as shown here.

Figure 18 of Atterholt et al. (2018): "A cladogram depicting the hypothetical phylogenetic position of Mirarce eatoni." [the beautiful fossil is highlighted in bold font]
This looks very familiar — graphs like this can be seen in many paleontological studies, not only those in Cladistics. However, this is a phylogeneticist's "nightmare" (but a cladist's "dream").

First, phylogenetic trees, especially those that were weighted post-analysis several times to get a more or less resolved tree, should be depicted as phylograms — trees with branch lengths. Phylogenetic hypotheses are not only about clades, and what is sister to what, but about the amount of (inferred) evolutionary change between the hypothetical ancestors, the internal nodes, and their descendants, the labelled tips. For example, we may want to know how long is the root of the clade (Avisauridae, Avisaurus s.l.) comprising the focus taxon compared to the lengths of the terminal branches within the clade. Prominent roots and short terminals are a good sign for monophyly (inclusive common origin), or at least a fossil well placed, whereas short roots and long terminals are not.

The above tree as phylogram (using PAUP*'s AccTran optimization). The beauty of cladistic classification is that the new specimen could have just been described as another species of Avisaurus (but read the author's discussion).

In this example, we seem to be on the safe side, although one may question the general taxonomic concept for extinct birds. Are the differences enough to erect a new genus for every specimen? This is hard to decide based on this matrix.

Second, a tree without branch support is just a naked line graph, telling us nothing about the quality (strengths and weaknesses) of the backing data. Neontologists are not allowed to publish naked trees. In molecular phylogenetics, we are not uncommonly asked by reviewers to drop all branches (internodes) below an arbitrary threshold: a bootstrap (BS) support value < 70 and posterior probability (PP) < 0.95. In palaentology, it has become widely accepted to not show support values at all. The reason is simple: the branch support is always low, because of data gaps and homoplasy. This is a problem the authors are well aware of:
The modified matrix consists of 43 taxa (26 enantiornithines, 10 ornithuromorphs) scored across 252 morphological characters [the provided matrix lists 253], which we analyzed using TNT (Goloboff, Farris & Nixon, 2008a). Early avian evolution is extremely homoplastic (O’Connor, Chiappe & Bell, 2011; Xu, 2018) thus we utilized implied weighting (without implied weights Pygostylia was resolved as a polytomy due to the placement of Mystiornis) (Goloboff et al., 2008b); we explored k values from one to 25 (see Supplemental Information) and found that the tree stabilized at k values higher than 12. In the presented analysis we conducted a heuristic search using tree-bisection reconnection retaining the single shortest tree from every 1,000 replications with a k-value of 13. This produced six most parsimonious trees with a score of 25.1. These trees differed only in the relative placement of five enantiornithines closely related to the Avisauridae, forming a polytomy with this clade in the strict consensus tree (Consistency Index = 0.453; Retention Index = 0.650; Fig. 18).
I've seen much worse CI and RI values in the paleophylogenetic literature (some of them are plotted in this post). For a phylogenetic inference, homoplasy equals internally incompatible signals — many characters show different, partly or fully conflicting, taxon bipartitions; or, in other words, they prefer different trees. The signal in the matrix is thus not tree-like — it doesn't fit a single tree. That's why we have to choose one using TNT's iterated reweighting procedures. (Note: an alternative "phenetic" Neighbor-joining tree has a computation time < 1s, and produces the same tree for the Ornithumorpha and the root-proximal, 'basal' part of the tree, except that Jeholornis is moved two nodes up; but it shuffles a lot in the Longirostravis–Avisauridae clade.)

Another point is that the more homoplasy we have, then the higher must have been the rate of change (here: visible anatomical mutation). The higher the rate of change, the higher the statistical inconsistency of parsimony.

In short, paleontologists (Atterholt et al. just follow the standard in paleophylogenetic publications) use data with tree-unlike signal to infer trees (see also David's last post on illogicality in phylogenetics) under a possibly invalid optimality criterion, which are then used to downweight characters (eliminate noise due to homoplasy) to infer less noisy, "better" trees.

The basic signal

We can't change the data, but we can explore and show its signal. And the basic signal from the unfiltered matrix is best visualized using a Neighbor-net splits graph.

Neighbor-net based on mean pairwise taxon distances. Thick edges correspond to branches in the published tree.

Some differentiation patterns that explain the clades in the tree can be traced, but it becomes difficult in the group that is of most interest: the (inferred) clade(s) comprising the newly described fossil. In the Neighbor-net this is placed close to another member of the Avisauridae, but not all. The matrix is not optimal for the task at hand.

The data properties

The matrix is a multistate matrix with up to six states in the definition line (although only five are used, as state "5" is not present). The taxa have variable gappyness (i.e. the proportion of completely undetermined cells), between 2% (extant birds: Anas and Gallus) and 94% (Intiornis, an Avisauridae) — the median is 56%, and the average close to it (54%). The "hypothetically" placed fossil Mirarce eatoni (in the matrix it is under its old designation: "Kaiparowits") lacks a bit more of the scored characters (61%). That may strike one as a lot, but note that the matrix has 253 characters! However, we may well ask: if I want to place a fossil for which I can score 99 characters, why bother to include another ~150 that tell me nothing about its affinity? (Note: paleobotanists struggle hard even to get such numbers, we usually have at best 50 characters.)

Its closest putative relatives, the Avisaurus s.l., lack 90% of the characters; leaving us with max. 25 characters supporting the relevant clade (assuming that the 10% are all found in Mirarce as well). Coverage is not much better in the next-closest relatives (phylogenetically speaking).

Data coverage in the phylogenetic neighborhood of Mirarce eatoni

The missing data percentage may have mislead the Neighbor-net a bit, because we will have fed it with unrepresentative or highly ambiguous pairwise distances. In the the network, the focus fossil comes close to Neuquenornis, the only other Avisauridae with some data coverage. Looking at the heat map below, we see that missing data is indeed a problem in this matrix — we have zero distances between several pairs that show different distances to the better-covered taxa.

The distance matrix drawn as a heat map: green = similar, red = dissimilar (values range between 0 and 0.8). Red arrows: taxa with too many (and ambiguous) zero pairwise distances.

The closest relative of Mirarce is, indeed, Avisaurus/Gettya gloriae, but the latter has zero distances to various other poorly covered taxa from the phylogenetic neighborhood, in contrast to the much better-covered Mirarce. Neighbor-nets are very good at getting the obvious out of a morphological matrix, but they don't perform miracles. However, why should we include poorly known taxa at all during phylogenetic inference? Wouldn't it be better to infer a backbone tree (or network showing the alternative hypotheses) based on a less gappy matrix, and then find the optimal position of the poorly known taxa within that tree (network)?

Estimating the actual character support

Some characters cover just 10–20% of the taxa, whereas others are scored for most of them — more than half of the characters are missing for more than half of the taxa. Using TNT's iterative weight-to-fit option means that we infer a tree, ideally one fitting the well-covered data (taxon- and character-wise), and then downweight all conflicting characters elsewhere to fit this tree. We then end up with a tree where we have no idea about actual character support. Since the matrix is a Swiss cheese, we only can re-affirm the first-inferred tree.

Let's check the raw character support, using non-parametric bootstrapping and maximum likelihood as the optimality criterion (corrected for ascertainment bias, as implemented in RAxML).

ML-BS Consensus Network (using Lewis' 2-parameter Mk+G model). Edge lengths are proportional to the BS support values of taxon bipartitions (= phylogenetic splits, internodes, branches in phylogenetic trees). Only splits are shown that occurred in at least 10% of 900 BS pseudoreplicates (number of necessary BS replicates determined by the Extended Majority Rule Bootstrap criterion), trivial splits collapsed. Thick edges correspond with branches in Atterholt et al.'s iterative parsimony tree; coloring as before.

The ML bootstrap Consensus Network bears not a few similarities to the distance-based Neighbor-net. The characters do not support the Avisauridae subtree, as depicted in the published TNT tree, but there are faint signals associating some of them to each other, despite the missing data. Keep in mind: a BS support of 20 for one alternative and < 10 for all others means (ideally) one fifth of the characters support the split, and the rest have no (coherent) information. Some sister pairs have quite high support (for this kind of data set), and Gettya gloriae is resolved as sister of Mirarce (unambiguously, with a BS support = 67). But, the matrix hardly has the capacity to resolve deeper relationships within the group of interest, the Enantiornithes — the polytomy with the next relatives seen in the tree and the corresponding clade dissolve. This confirms what we saw in the Neighbor-Net (despite missing data distortion).

The matrix and the tree show something that could have been deduced directly from the distance matrix: the poorly known Gettya (Avisaurus) gloriae is (literally) the closest relative of the enigmatic new genus / species Mirarce (morphological distance of 0.08 compared to 0.1–0.64 for all other taxa). But is this overall similarity enough to conclude Avisaurus, Gettya and Mirarce are a monophyletic group within the Avisauridae?

What the authors (and all paleontologists doing phylogenetics) should have done

(I would have skipped all trees, naturally, but peer reviewers and most readers probably need to see them.)

  • Trimmed the matrix to include only those characters preserved in the fossil of interest, in order to minimize missing data artefacts during inference.
  • Shown the Neighbor-net to visualize the primary signal situation, including and excluding poorly covered taxa. From the Neighbor-net it is already obvious that the fossil is an Enantiornithes, so any subsequent optimization / inference could have focussed on this group alone.
  • Then inferred a backbone tree excluding poorly covered taxa, and shown the resulting phylogram. In case one needs to test the Enantiornithes root (the Neighbor-net gives us two alternatives for the Enantiornithes root: Pengornis + Eopengornis or Protopteryx + Iberomesornis), there is no point in including the poorly covered Enantiornithes or the worst-covered taxa outside this clade.
  • Then optimized the position of the poorly covered taxa in the backbone tree. I recommend using RAxML's evolutionary placement algorithm (EPA) for this, but you can also do this in a parsimony framework if you wish. (EPA can also be used to test outgroup roots: here, one would search the branch at which all non-Enantiornithes fit best.)
  • Shown the resulting phylogram including all taxa — that is, read in the topology to the analysis, and then re-optimize branch lengths.
  • Shown a Support Consensus Network to illustrate the support for the branches in the preferred tree and their competing alternatives. (There may be one or more, as there are many options to estimate branch support.) How sure can we be about relationships within the Avisauridae and their relationships to other Enantiornithes?



Postscriptum. For those who are curious about how the ML tree would look like, here it is:


I have no idea about birds, but from a methodological point of view this is an equally (if not more, because unforced) valid hypothesis for the data set. And demonstrating its limitations: note the relatively long branches with very low support making up the backbone of the Enantiornithes clade. This is typical for matrices lacking coherent discriminatory signal and/or struggling with internal conflict.