Just try it for your data – a last first-of-its-kind Neighbor-net using FTIR data


This is likely to be my last post for this blog.

Some thoughts

When I joined the Genealogical World of Phylogenetic Networks three years ago, I didn't know how much fun it is to blog about science. Blogging, or writing essays, has several advantages against the traditional way to get a researcher's ideas out into the world — writing a scientific paper. The most important one is, one can just try out something without having to worry how this would get past the peer-reviewers and editors (or as I like to call them: the Mighty Beasts lurking in the Forest of Reviews). When I was still a (sort-of) career scientist (ie. paid by tax-payers to do science), I had my share of discouraging experiences, whenever we tried to leave the beaten (and worn out) paths to try something new; to look into the dark places and not right under the street-lights.


Before we submitted papers, we hence put a considerable effort into them, pondering what our peers may criticize, or what might alienate them (being likely unfamiliar with our methodological and philosophical approaches), and thus to minimize the chance all our work would be for nothing. In a couple of cases, where we expected fierce resistance, we opted for low-impact journals with no manuscript length restrictions and more welcoming editors and peers, to be able to put in everything that we had. Some of my best bits are buried in journals where you'd never expect them!

But it was increasingly annoying, nevertheless;. It was no fun anymore to formally publish research, and so I let my career as smoothly run out in the 2010s as it started in the Zeroes.

David's encouraging me to write blog-posts, just after I early retired, thus revitalized my interest in science, to "boldly go where no-one has gone before". The amount of effort is typically much lower, although some of my posts do involve the same work that I put into the papers that I co-authored. More importantly, there are no beasts in the World Wide Web that can bite you from the shadows; they have to do it in the open. It's an ideal way to get an idea out, without having to think about the consequences. None of the work I put into a post has been for vain. What a difference: before, for every graph / analysis result published, two ended in the bin, many devoured by the Mighty Beasts.

And, maybe somebody will find the work interesting enough to try it out; and eventually my idea finds a place in the sanctionized, peer-reviewed scientific world, anyway. Since I'm out-of-business, I can afford to not cash in the credit (no-one formally cites a blog post).

My last Neighbor-net for the Genealogical World

Neighbor-nets (NNets) and myself was love at the first sight (this was, in my case, ~2005, when my boss Vera Hemleben, a geneticist, sent me over to the new professor in our bioinformatics department, named Daniel Huson, who had just released a new software package, SplitsTree). These networks are...
  • ... most versatile: any kind of data can be transformed into a distance matrix;
  • ... quick-and-easy to infer.
And even if they are not phylogenetic networks in the strict sense – NNets are unrooted and their edge-bundles do not necessarily reflect evolutionary pathways – they more often than not point towards common origins and down-scale ± complex phylogenetic relationships more comprehensively than any phylogenetic tree (coalescent or not) that we could infer. The Genealogical World is full of examples, and the writers of this blog such as David [homepage], Mattis [GoogleScholar/ homepage], myself [GoogleScholar/ homepage], and like-minded researches have published quite a few of them (in high- and low-impact journals). For a comprehensive, permanently updated list see Philippe Gambette's Who's Who in Phylogenetic Networks page.

For my final post, I decided on a fascinating new data source in paleobiology: Fourier transformed infrared spectra (FTIR) of fossil cuticles.

The cuticle is a plant's skin, and it's composition and structure show a lot of variation, down to species level. Thus, their morphological-anatomical features have long been used as taxonomic markers to identify fossil material. Using infrared spectroscopy, one can look at the chemical composition of cuticles. Like any other spectrum, an FTIR-spectrum can be broken down in sets of quantitative (discrete, binned) or qualitative (continuous) characters; and one can then create a dissimilarity matrix for the investigated material. This is what Vajda, Pucetaite et al. (Nature Ecol. Evol. 1: 1093–1099, 2017) did for long-death (Mesozoic) but enigmatic seed plants and their equally enigmatic modern counterparts.

A UPGMA dendrogram based on FTIR data of fossil taxa (Vajda et al. 2017, fig. 4). Brackets to the right give the topology of the UPGMA dendrogram including extant material and data (Vajda et al. 2017, fig. 3).
PCA plots of the first and second (a), and first and third (b) coordinates, with the main seed plant lineages indicated (modified after Vajda et al. 2017, suppl.-fig. 4)

PCA and UPGMA are not phylogenetic inference methods, but there is obviously some phylogenetic signal encoded in these FTIR spectra, as shown above.

When I first saw the paper, I contacted the authors (including former colleagues of mine at Naturhistoriska riksmuseet in Stockholm), and the first author gave the second author, Milda Pucetaite (a Ph.D. student), a green light to share and convert her FTIR data into a simple distance matrix for me to run a NNet, as shown below.

Neighbor-net based on the combined distance matrix provided by Milda (pers. comm. July 2017).


Note that this NNet is a partly impossible graph, phylogenetically. The chemical composition naturally changes after the foliage (in this case) gets buried in sediment, and its cuticle is then conserved for millions of years by various taphonomic and diagenetic processes. As pointed out by the experienced biochemist among the authors during our correspondence: it is hence pointless to combine the data from extant and extinct taxa.

Well, since this is a post and not a paper, I combined them anyway. I find the result quite compelling, supporting the paper's conclusions including more speculative follow-up ones. The NNet reflects every aspect that these kind of data can provide for phylogenetic and systematic purposes.

The prominent central edge bundle reflects the taphonomic-diagenetic change separating the living from fossil samples. The basic sequence within the subgraphs is the same: gingkoes are closest to cycads, and cycads bridge to Araucariaceae, which is a relict lineage of the "needle" trees, the conifers (many of which don't have needles but leaves). Bennettitales and Nilssoniales are extinct groups of seed plants, which are here resolved as a distinct lineage. Especially, the Bennettitales have been have long puzzled scientists. They may represent a third major lineage of seed plants that are neither angiosperms (flowering plants) nor gymnosperms (ginkgoes, cycads, conifers, gnetids), or perhaps an early side lineage of either one (or lineages, as their two main groups are quite different).

As for pretty much any kind of data, just try it out for yourself. This is exploratory data analysis (EDA), particularly useful to get a first, fast impression of the primary signal in your data. This is true even if you keep it to yourself, having to watch out for the Mighty Beasts of the Forest of Reviews (especially the ones that call themselves "cladists"). Who are quick in telling you, what you can't do, but not so straightforward, when it comes pointing you to other options for analyzing your data.



My dive-in list for some more (im-)possible NNets
With David retiring, the Genealogical Worlds of Phylogenetic Networks will fall dormant, the next and final post will be a farewell from David. Like Mattis (Von Wörtern und Bäumen), I will keep on science-blogging (in spite of the new buggy Blogger-editing interface forcing me to draft directly in HTML) for a little while (and irregularly) on my Res.I.P. blog, which also includes a tag for "phylo-networks" for any future NNets and the like.

Xenoplasy

A major obstacle in studying morphological evolution is homoplasy. This occurs when the same (or similar) traits are evolved independently in different lineages (convergences), and are positively selected for or incompletely sorted within a lineage (parallelisms, homoiologies). Traits that not sort following the true tree create incompatible signal patterns, and, eventually, topological ambiguity. No matter which inference method we use, we end up with several alternative trees that combine aspects of the true tree with artificial branching patterns.

Homoplasy is the rule, while trait sorting is the exception. Consequently, we have to expect that any morphology-based tree will have more wrong branches than correct ones.

For extant group of organisms, a simple solution to the problem is to analyse morphological traits in the framework of a molecular phylogeny. The genetic data provides us with an independent, best-possible tree. By mapping the morphological traits on this tree, we can evaluate their potency as phylogenetic markers.

But what if our group of organisms is not the product of a simple repeated dichotomous splitting pattern? What if there were anastomoses as well? That is, the morphological traits are not the product of mere (incomplete) lineage or incomplete gene sorting (the latter is called "hemiplasy") but fusion of traits in different lineages. Thus, a tree is not enough to explain the genetic data? What does this imply for the morphological differentiation we observe?

Take the London Plane (Platanus x acerifolia or P. x hispanica), for example, which is a tree that many of us are familiar with. In case you don't know the name: they are the large trees with a patterned bark and deeply lobed leaves and fluffy fruiting bodies found in abundance in parks and alleys throughout the world. It's a cultivation-hybrid (17th—18th century) of the North American plane tree, Platanus occidentalis, and its distant eastern Mediterranean relative, P. orientalis. These are genetically and morphologically distinct species. Their history is summarized in the following doodle (Grimm & Denk 2010).

Each line represents a semi-sorted nuclear gene region. The split between proto-PNA-E (SW. U.S., NW. Mexico, E. Mediterranean) and proto-ANA (Atlantic-facing Central America, E. U.S.) must have been > 12 myrs ago (last Platanus of Iceland). The minimum air distance between the sister species P. orientalis and P. racemosa is ~11,500 km (via the Arctic). Interestingly, fossils from that time and later (including western Eurasia) have more ANA-clade morphologies: P. orientalis- and P. racemosa-types pop up ~5 Ma. Both ANA and PNA-E clade have distinct morphologies. With respect to the individual gene trees, those exclusively shared by P. palmeri and P. rzedowski with P. occidentalis s.str. and P. mexicana of the ANA clade could be adressed as "hemiplasies".

If you look at the leaves and fruits of London Planes, you can find everything in between the two endpoints; and the same holds for their genetics. The London Plane is much hardier than Europe's own P. orientalis and more drought-resistant than its hardier North American parent. With climate change going on, the hybrid will eventually meld with the European species entirely. And, thanks to what we call "hybrid vigor", given a few millions of years, it might consume its other parent, too. London Planes have been re-introduced into the Americas; and P. orientalis has become an invasive species in California, where it has started to hybridize with its local sister species P. racemosa. Now imagine a future researcher of Platanus evolution having to deal with a highly complex accumulation of Platanus fossils in the Northern Hemisphere, while being able to study only the left-over complex genetics of a single species that replaced two.

This is where a recently coined new concept comes in: xenoplasy.
Yaxuan Wang, Zhen Cao, Huw A. Ogilvie, Luay Nakhleh (2020). Phylogenomic assessment of the role ofhybridization and introgression in trait evolution. bioRxiv doi: 10.1101/2020.09.16.300343
Xenoplasies are traits that originate from hybridization and subsequent introgression. In standard phylogenetics, they would act like any homoplasious character, but their distinction is that they are not independently involved. They are captured via lineage crossing, and reflect a common ancestry.

Example for a trait incongruent with the species tree, representing a xenoplasy obtained by introgression of I1-A lineage which evolved the trait into I3-B lineage, part of the I2 clade. Pending how far they are affected by incomplete lineage sorting (ILS) and introgression, individual gene may result in any of the three possible genealogies. Modified after Wang et al. (2020), fig. 1.

As such, their phylogenetic weight (information content) equals that of the anyhow rare classic autapomorphies or synapomorphies (fide Hennig), and this weight is higher than that of the more common homoiologies, shared apomorphies or symplesiomorphies. Note, in the palaeozoological cladistic literature, sorted versions of the latter three are often called synapomorphies – any lineage-specific, derived trait ("synapomorphy") may be lost / modified in some sublineage(s), or rarely pop-up outside the lineage.

Wang et al. provide an analytical framework for identifying a trait as xenoplasy, and assessing the probability for it ("xenoplasy risk factor"). If you're interested in the mechanics, check out the pre-print. The mathematical part of my brain has been dormant for most of the last two decades (when I exchanged chemistry for geology-biology), so I'm more into possible applications to explore this new concept.

Where to look next

The Wang et al. real-world example (Jaltomata) is, however, not very appealing. The problem is that, to look for xenoplasy, we need data that requires us to infer an explicit phylogenetic network (in the strict sense) to start with. In addition, we could use a morphological partition: scored morphological traits; which is usually absent. Last, identifying xenoplasies would make most sense for traits that can be traced in the fossil record, not only to identify potential products of past reticulation but have a better grip on placing critical fossils. Often overlooked by neontologists, fossils are the only physical proof that a lineage was at a certain place at a certain point in time. So, here's two examples: beeches and bears.

Beeches are a small genus of extra-tropical angiosperm trees with a pretty well understood fossil record. Morphologically, their differentiation is very hard to put into a tree, as shown here.

A morpholgy-based Neigbor-net of fossil (open circles) and extant beech (closed circles) taxa. Coloration gives the (paleo-)geographic distribution (abbreviated as three letters). For more background and information see my Res.I.P. post: The challenging and puzzling ordinary beech – a (hi)story

Mapping species-discriminating traits on a tree would be of little help here, because the modern species are the product of recurrent phases of mixing and incomplete sorting. I have summarized  this in the following doodle, depicting the diversification and propagation of 5S-IGS variants (a non-transcribed, poly-copy, multi-array intergenic nuclear spacer) in a still very small sample.

A doodle summarizing differentiation patterns in a sample of 686 "representative" 5S-IGS variants obtained using high-throughput sequencing of six beech populations of western Eurasia and Japan (Simone Cardoni et al., to be submitted in the near future; see Piredda et al. 2020 for a similar analytical set-up).

The people involved in researching this project (drawn by passion rather than resources) don't have the resources to generate the NGS data needed to construct a species network for all of the species of beech, like Wang et al.'s Jaltomata data. But given that there are only 9–10 species, it would be easy prey for a well-funded research group. If you are interested, but don't know how to get the material and are unfamiliar with beeches, feel free to contact the senior author of Piredda et al. 2020, Marco Simeone — new beech-enthusiasts are always welcomed by this group.

Bears are one of the best-studied extant mammal predators, and they also have a decent fossil record. This is probably the reason that Heath et al. (2014) used bears as the case study when introducing their new molecular dating approach: the fossilized-birth-death dating.

A fossilized birth-death dated tree of bears (modified from Heath et al. 2014, fig. 4). The numbers in brackets give the number of fossil taxa (extinct genera, Ursus spp.) listed on Wikipedia.

As nice as it looks (and done), their analysis is pretty flawed from an evolutionary point of view. Their dated tree only reflects a single aspect of bear evolution and may involve branch-length artifacts. Heath et al. relied on complete mitochondrial genomes, which they combined with a single nuclear protein-coding gene. Mitochondrial genes reflect only the maternal lineage; they did not date a species tree but a mitochondrial genealogy. Paternal and biparentally inherited gene markers (which includes nuclear genes) tell very different stories about species relationships (this is why we also used the bears as example data for Schliep et al. 2017).

Strict, branch-length ignorant Consensus network of three trees inferred using species-consensus sequences generated from three sets of data: biparentally inherited nuclear-encoded autosomal introns (ncAI), paternally inherited Y-chromosomes (YCh) and maternally inherited mitochondrial genes (complete set; mtG). This is clearly not the product of a strictly dichotomous evolution. Thick lines: edges found in Heath et al.'s chronogram (= mitochondrial genealogy).

And while it may be that morphology reflects more the maternal than the paternal side, it has never been tested. Neither how morphology fits with the coalescent species tree. Which would be a network, as shown below.

Gene flow in bears within the last 5 myrs (estimate; from Kumar et al. 2017).

How Heath et al. linked the fossils to clades might have been just as wrong as it was right (note that FBD dating is much less biased by mis- or unoptimal placed fossils than traditional node dating). Hemi- and xenoplasy must be considered here. In addition to the highly incongruent paternal and maternal genealogies, we know that even the morphologically most distinct sister species (grizzlies, a special form of Brown Bear, and polar bears) can produce vital offspring ("Grolar") with morphological traits from either side of the family (usually, the Grizzly-side dominates).

Wildlife services usually kill these hybrids as they are considered to speed up the decline of polar bears (they are food competitors). However, with the (possibly inevitable) melting of the polar caps, these hybrids could be instrumental in the survival of a bit of Polar Bear legacy, in the form of genetic diversity not found in brown bears, and xenoplasies. If two highly distinct bear species hybridize today in the wild due to (in this case: human-induced) environmental pressure, their ancestors probably have done so in the past in reaction to shifting habitats and migration patterns.

Given how long bears have intrigued researchers, there are plenty of classic morphological studies involving fossils; and, in the light of the vast amount of molecular data (including ancient DNA!) that have been collected for bears, it should be pretty easy to apply Wang et al.'s new approach to bears. For example, is the Cave Bear a dead-end side lineage, intrograde or hybrid dead-end? Mitochondrial-wise Cave bears are placed as sister to Brown and Polar bears but that's just because of their provenance. Like chloroplast genealogies in plants, mitochondrial genealogies in animals typically show a strong geographic correlation. Especially in bears, the mothers and daughters don't migrate as much as the fathers and sons.

Mitochondrial genealogy of bears including Cave bears (Kumar et al. 2017, fig. 3), the famous European bears of the Ice Ages. ABC bears are insular brown bears living on the subarctic Admirality, Baranof and Chichagof islands of the Alexander archipelago known as natural example for gene flow between Brown and Polar bears (Kumar et al. 2017, fig. 1, provides a map of current distribution of bears).

Postscriptum

Birds are another animal group that likes to diversify into many species, some of which love to transgress recently established species barriers, forming hybrid swarms. These are actually dinosaurs, a group exclusively studied using cladistic analyses of morphological traits providing non-tree-like signals — mostly homoplasies, a lot of not-really-synapomorphies (good deal are probably homoiologies), and, it wouldn't surprise me, one or another xenoplasy. Or can we assume they were much to advanced to hybridize and intrograde?

Cited literature
  • Grimm GW, Denk T. 2010. The reticulate origin of modern plane trees (Platanus, Platanaceae) - a nuclear marker puzzle. Taxon 59:134–147.
  • Heath TA, Huelsenbeck JP, Stadler T. 2014. The fossilized birth–death process for coherent calibration of divergence-time estimates. PNAS 111:E2957–E2966.
  • Kumar V, Lammers F, Bidon T, Pfenninger M, Kolter L, Nilsson MA, Janke A. 2017. The evolutionary history of bears is characterized by gene flow across species. Scientific Reports 7:46487 [e-pub].
  • Piredda R, Grimm GW, Schulze E-D, Denk T, Simeone MC. 2020. High-throughput sequencing of 5S-IGS in oaks: Exploring intragenomic variation and algorithms to recognize target species in pure and mixed samples. Molecular Ecology Resources doi:10.1111/1755-0998.13264.
  • Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution 8:1212–1220.

Xenoplasy

A major obstacle in studying morphological evolution is homoplasy. This occurs when the same (or similar) traits are evolved independently in different lineages (convergences), and are positively selected for or incompletely sorted within a lineage (parallelisms, homoiologies). Traits that not sort following the true tree create incompatible signal patterns, and, eventually, topological ambiguity. No matter which inference method we use, we end up with several alternative trees that combine aspects of the true tree with artificial branching patterns.

Homoplasy is the rule, while trait sorting is the exception. Consequently, we have to expect that any morphology-based tree will have more wrong branches than correct ones.

For extant group of organisms, a simple solution to the problem is to analyse morphological traits in the framework of a molecular phylogeny. The genetic data provides us with an independent, best-possible tree. By mapping the morphological traits on this tree, we can evaluate their potency as phylogenetic markers.

But what if our group of organisms is not the product of a simple repeated dichotomous splitting pattern? What if there were anastomoses as well? That is, the morphological traits are not the product of mere (incomplete) lineage or incomplete gene sorting (the latter is called "hemiplasy") but fusion of traits in different lineages. Thus, a tree is not enough to explain the genetic data? What does this imply for the morphological differentiation we observe?

Take the London Plane (Platanus x acerifolia or P. x hispanica), for example, which is a tree that many of us are familiar with. In case you don't know the name: they are the large trees with a patterned bark and deeply lobed leaves and fluffy fruiting bodies found in abundance in parks and alleys throughout the world. It's a cultivation-hybrid (17th—18th century) of the North American plane tree, Platanus occidentalis, and its distant eastern Mediterranean relative, P. orientalis. These are genetically and morphologically distinct species. Their history is summarized in the following doodle (Grimm & Denk 2010).

Each line represents a semi-sorted nuclear gene region. The split between proto-PNA-E (SW. U.S., NW. Mexico, E. Mediterranean) and proto-ANA (Atlantic-facing Central America, E. U.S.) must have been > 12 myrs ago (last Platanus of Iceland). The minimum air distance between the sister species P. orientalis and P. racemosa is ~11,500 km (via the Arctic). Interestingly, fossils from that time and later (including western Eurasia) have more ANA-clade morphologies: P. orientalis- and P. racemosa-types pop up ~5 Ma. Both ANA and PNA-E clade have distinct morphologies. With respect to the individual gene trees, those exclusively shared by P. palmeri and P. rzedowski with P. occidentalis s.str. and P. mexicana of the ANA clade could be adressed as "hemiplasies".

If you look at the leaves and fruits of London Planes, you can find everything in between the two endpoints; and the same holds for their genetics. The London Plane is much hardier than Europe's own P. orientalis and more drought-resistant than its hardier North American parent. With climate change going on, the hybrid will eventually meld with the European species entirely. And, thanks to what we call "hybrid vigor", given a few millions of years, it might consume its other parent, too. London Planes have been re-introduced into the Americas; and P. orientalis has become an invasive species in California, where it has started to hybridize with its local sister species P. racemosa. Now imagine a future researcher of Platanus evolution having to deal with a highly complex accumulation of Platanus fossils in the Northern Hemisphere, while being able to study only the left-over complex genetics of a single species that replaced two.

This is where a recently coined new concept comes in: xenoplasy.
Yaxuan Wang, Zhen Cao, Huw A. Ogilvie, Luay Nakhleh (2020). Phylogenomic assessment of the role ofhybridization and introgression in trait evolution. bioRxiv doi: 10.1101/2020.09.16.300343
Xenoplasies are traits that originate from hybridization and subsequent introgression. In standard phylogenetics, they would act like any homoplasious character, but their distinction is that they are not independently involved. They are captured via lineage crossing, and reflect a common ancestry.

Example for a trait incongruent with the species tree, representing a xenoplasy obtained by introgression of I1-A lineage which evolved the trait into I3-B lineage, part of the I2 clade. Pending how far they are affected by incomplete lineage sorting (ILS) and introgression, individual gene may result in any of the three possible genealogies. Modified after Wang et al. (2020), fig. 1.

As such, their phylogenetic weight (information content) equals that of the anyhow rare classic autapomorphies or synapomorphies (fide Hennig), and this weight is higher than that of the more common homoiologies, shared apomorphies or symplesiomorphies. Note, in the palaeozoological cladistic literature, sorted versions of the latter three are often called synapomorphies – any lineage-specific, derived trait ("synapomorphy") may be lost / modified in some sublineage(s), or rarely pop-up outside the lineage.

Wang et al. provide an analytical framework for identifying a trait as xenoplasy, and assessing the probability for it ("xenoplasy risk factor"). If you're interested in the mechanics, check out the pre-print. The mathematical part of my brain has been dormant for most of the last two decades (when I exchanged chemistry for geology-biology), so I'm more into possible applications to explore this new concept.

Where to look next

The Wang et al. real-world example (Jaltomata) is, however, not very appealing. The problem is that, to look for xenoplasy, we need data that requires us to infer an explicit phylogenetic network (in the strict sense) to start with. In addition, we could use a morphological partition: scored morphological traits; which is usually absent. Last, identifying xenoplasies would make most sense for traits that can be traced in the fossil record, not only to identify potential products of past reticulation but have a better grip on placing critical fossils. Often overlooked by neontologists, fossils are the only physical proof that a lineage was at a certain place at a certain point in time. So, here's two examples: beeches and bears.

Beeches are a small genus of extra-tropical angiosperm trees with a pretty well understood fossil record. Morphologically, their differentiation is very hard to put into a tree, as shown here.

A morpholgy-based Neigbor-net of fossil (open circles) and extant beech (closed circles) taxa. Coloration gives the (paleo-)geographic distribution (abbreviated as three letters). For more background and information see my Res.I.P. post: The challenging and puzzling ordinary beech – a (hi)story

Mapping species-discriminating traits on a tree would be of little help here, because the modern species are the product of recurrent phases of mixing and incomplete sorting. I have summarized  this in the following doodle, depicting the diversification and propagation of 5S-IGS variants (a non-transcribed, poly-copy, multi-array intergenic nuclear spacer) in a still very small sample.

A doodle summarizing differentiation patterns in a sample of 686 "representative" 5S-IGS variants obtained using high-throughput sequencing of six beech populations of western Eurasia and Japan (Simone Cardoni et al., to be submitted in the near future; see Piredda et al. 2020 for a similar analytical set-up).

The people involved in researching this project (drawn by passion rather than resources) don't have the resources to generate the NGS data needed to construct a species network for all of the species of beech, like Wang et al.'s Jaltomata data. But given that there are only 9–10 species, it would be easy prey for a well-funded research group. If you are interested, but don't know how to get the material and are unfamiliar with beeches, feel free to contact the senior author of Piredda et al. 2020, Marco Simeone — new beech-enthusiasts are always welcomed by this group.

Bears are one of the best-studied extant mammal predators, and they also have a decent fossil record. This is probably the reason that Heath et al. (2014) used bears as the case study when introducing their new molecular dating approach: the fossilized-birth-death dating.

A fossilized birth-death dated tree of bears (modified from Heath et al. 2014, fig. 4). The numbers in brackets give the number of fossil taxa (extinct genera, Ursus spp.) listed on Wikipedia.

As nice as it looks (and done), their analysis is pretty flawed from an evolutionary point of view. Their dated tree only reflects a single aspect of bear evolution and may involve branch-length artifacts. Heath et al. relied on complete mitochondrial genomes, which they combined with a single nuclear protein-coding gene. Mitochondrial genes reflect only the maternal lineage; they did not date a species tree but a mitochondrial genealogy. Paternal and biparentally inherited gene markers (which includes nuclear genes) tell very different stories about species relationships (this is why we also used the bears as example data for Schliep et al. 2017).

Strict, branch-length ignorant Consensus network of three trees inferred using species-consensus sequences generated from three sets of data: biparentally inherited nuclear-encoded autosomal introns (ncAI), paternally inherited Y-chromosomes (YCh) and maternally inherited mitochondrial genes (complete set; mtG). This is clearly not the product of a strictly dichotomous evolution. Thick lines: edges found in Heath et al.'s chronogram (= mitochondrial genealogy).

And while it may be that morphology reflects more the maternal than the paternal side, it has never been tested. Neither how morphology fits with the coalescent species tree. Which would be a network, as shown below.

Gene flow in bears within the last 5 myrs (estimate; from Kumar et al. 2017).

How Heath et al. linked the fossils to clades might have been just as wrong as it was right (note that FBD dating is much less biased by mis- or unoptimal placed fossils than traditional node dating). Hemi- and xenoplasy must be considered here. In addition to the highly incongruent paternal and maternal genealogies, we know that even the morphologically most distinct sister species (grizzlies, a special form of Brown Bear, and polar bears) can produce vital offspring ("Grolar") with morphological traits from either side of the family (usually, the Grizzly-side dominates).

Wildlife services usually kill these hybrids as they are considered to speed up the decline of polar bears (they are food competitors). However, with the (possibly inevitable) melting of the polar caps, these hybrids could be instrumental in the survival of a bit of Polar Bear legacy, in the form of genetic diversity not found in brown bears, and xenoplasies. If two highly distinct bear species hybridize today in the wild due to (in this case: human-induced) environmental pressure, their ancestors probably have done so in the past in reaction to shifting habitats and migration patterns.

Given how long bears have intrigued researchers, there are plenty of classic morphological studies involving fossils; and, in the light of the vast amount of molecular data (including ancient DNA!) that have been collected for bears, it should be pretty easy to apply Wang et al.'s new approach to bears. For example, is the Cave Bear a dead-end side lineage, intrograde or hybrid dead-end? Mitochondrial-wise Cave bears are placed as sister to Brown and Polar bears but that's just because of their provenance. Like chloroplast genealogies in plants, mitochondrial genealogies in animals typically show a strong geographic correlation. Especially in bears, the mothers and daughters don't migrate as much as the fathers and sons.

Mitochondrial genealogy of bears including Cave bears (Kumar et al. 2017, fig. 3), the famous European bears of the Ice Ages. ABC bears are insular brown bears living on the subarctic Admirality, Baranof and Chichagof islands of the Alexander archipelago known as natural example for gene flow between Brown and Polar bears (Kumar et al. 2017, fig. 1, provides a map of current distribution of bears).

Postscriptum

Birds are another animal group that likes to diversify into many species, some of which love to transgress recently established species barriers, forming hybrid swarms. These are actually dinosaurs, a group exclusively studied using cladistic analyses of morphological traits providing non-tree-like signals — mostly homoplasies, a lot of not-really-synapomorphies (good deal are probably homoiologies), and, it wouldn't surprise me, one or another xenoplasy. Or can we assume they were much to advanced to hybridize and intrograde?

Cited literature
  • Grimm GW, Denk T. 2010. The reticulate origin of modern plane trees (Platanus, Platanaceae) - a nuclear marker puzzle. Taxon 59:134–147.
  • Heath TA, Huelsenbeck JP, Stadler T. 2014. The fossilized birth–death process for coherent calibration of divergence-time estimates. PNAS 111:E2957–E2966.
  • Kumar V, Lammers F, Bidon T, Pfenninger M, Kolter L, Nilsson MA, Janke A. 2017. The evolutionary history of bears is characterized by gene flow across species. Scientific Reports 7:46487 [e-pub].
  • Piredda R, Grimm GW, Schulze E-D, Denk T, Simeone MC. 2020. High-throughput sequencing of 5S-IGS in oaks: Exploring intragenomic variation and algorithms to recognize target species in pure and mixed samples. Molecular Ecology Resources doi:10.1111/1755-0998.13264.
  • Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution 8:1212–1220.

Rogue dinosaurs, an example from the Aetosauria


In several earlier posts (a non-comprehensive link list can be found at the end of the post), I outlined how networks, tree-sample (Consensus networks, SuperNetworks) or distance-based (Neighbor-nets) may be of practical help, especially when we study phylogenetic relationships of extinct organisms.

In this post, I will further explore this by looking at a matrix for Aetosauria (Parker 2016, PeerJ) that provides an overall (relatively) strong and unambiguous signal. [NB: The reason, I prefer to use PeerJ papers as examples is that it is one of the very few journals that is open access and has a strict open data policy — to publish there, authors have to give access to the used data.]

In the abstract of the original paper, we read the following:
Nonetheless, aetosaur phylogenetic relationships are still poorly understood, owing to an overreliance on osteoderm characters, which are often poorly constructed and suspected to be highly homoplastic. A new phylogenetic analysis of the Aetosauria, comprising 27 taxa and 83 characters, includes more than 40 new characters that focus on better sampling the cranial and endoskeletal regions, and represents the most comprehensive phylogeny of the clade to date. Parsimony analysis recovered three most parsimonious trees; the strict consensus of these trees finds an Aetosauria that is divided into two main clades: Desmatosuchia, which includes the Desmatosuchinae and the Stagonolepidinae, and Aetosaurinae, which includes the Typothoracinae.
Parker's (2016) fig. 6 shows the results of the "initial analysis" (click to enlarge, colored annotations added by me).

Systematic groups based on clades are abbreviated (see next graph for full names).

A is a "Strict component consensus" of the 30 inferred MPTs (most parsimonious trees), B the Adams consensus. C the Majority rule consensus, branch labels give percentages for branches not found in all MPTs. D a "Maximum agreement subtree after a priori pruning of one taxon (black star) within the upper clade.

Parker's (2016) fig. 7 then shows the preferred result: a "reduced strict consensus of 3 MPTs" with the red star taxon removed, and (rarely seen in dinosaur phylogeny papers) branch-support — including Bootstrap support values below 70, which are very rarely reported in the literature (from my own experience it seems that editors of systematic biology journals don't like them).


Removal of one rogue taxon (called a "wildcard" in paleozoology), Aetobarbakinoides brasiliensis, substantially reduced the number of MPTs. Nonetheless, many branches have low support, and hence also the clades (used here as synonym for monophyla) derived from them – Parker uses branch-based ("stem"-based, brackets on his tree), and node-based taxa (dots).

Low branch support may or may not matter

There are two possible reasons for low branch-support:
  • non-discriminatory signal: any alternative branching pattern receives diminishing support
  • internal signal conflict: two (or more) alternatives receive similar support.
Mapping the support on the preferred (inferred) optimal tree cannot tell us whether it's the one or the other — only Support consensus networks can visualize this. Since we are interested in the rogue, I re-ran the parsimony BS analysis (10,000 quick-and-dirty replicates, following Müller 2005, BMC Evol. Biol. 5:58) including Aetobarbakinoides brasiliensis.

Support consensus network based on 10,000 parsimony BS pseudoreplicates. Trivial splits collapsed, only splits are shown the occured in at least 20% of the BS replicates.

The decreased/low BS support within the most terminal (root-distant) subtrees, the Des'ini and Par'ini, relates to conflicting alternatives involving one or two OTUs. In the case of Des'ini, it is the affinity of Lucasuchus and NCSM 21723, while in the case of Par'ini an alternative (recognizing Tecovasuchus as sister to the remainder) is found in 1 out of three BS pseudoreplicate trees. The diminishing support for basal relationships (root-proximal branches/edges) is due to the general lack of discriminatory signal (BS any alternative < 25). However, there are very few situations in which the best-supported alternative differs much from that in the preferred tree. For instance, any alternative to a Stag'inae sister relationship has even less than BS = 24 (BS = 27 in Parker's "reduced" tree).

Our rogue, however, is not really a 'wildcard'. The scored characters simply put it much closer to the outgroup than is any other ingroup taxon. A simple explanation could be that it is a most primitive (least derived) member of the Aetosauria. Another possibility is that it lacks any critical trait needed to place it within the ingroup. Since the deep splits within the Aetosauria rely on very few character changes, we can put it in different position down here and the tree will still have the same number of inferred changes.

Trivial and non-trivial taxa

The cladograms typically shown provide limited information about the signal in the underlying matrix, its strength and weaknesses, even when not "naked" but annotated using branch-support values. Given that there are no severe overlap gaps in the data, a very quick alternative is the Neighbor-net (a necessary addition, in my opinion).

Bold edges correspond to branches (hence: clades) in Parker's preferred tree.

Using this, we can directly depict which groups, potential clades, draw substantial (partly trivial) character support.

For instance, according to Parker's tree and following cladistic classification, Stagonolepis is an invalid taxon: one species (St. robertsoni) is part of the Stag'inae clade, the other (St. olenki) is of the Des'inae clade. Character support is, however, nearly non-existent (Bremer value = 1 and BS = 7 in the original analysis; BS ≤ 20 for any competing alternative in our re-analysis). The distance network shows us why — indeed, both species are closest to each other; but, while St. robertsoni shares a critical Stag'inae character suite and, consequently, shows the highest similarity to Polesinuchus, St. olenki does not share this (note the lack of a corresponding neighborhood). Furthermore, any alternative placement fits even less. Parker's tree only resolved it at sister to all other Des'inae because it didn't fit into any of the well-supported, terminal clades (prominent edge-bundles).

We can also see where we may have to deal with internal signal conflict, and how this may affect the tree inference and lead to ambiguous branch support. Take, for instance, the NCSM 21723 individual (= Gorgetosuchus pekinensis). It's clearly a Des'inae. The reason, we have ambiguous branch support for this staircase-like subtree is that NCSM 21723 is substantially more similar to the distant, equally evolved sister lineage, the Par'ini (purple edge bundle). Hence, it must be placed as sister to all other Des'inae, although it appears to represent a more derived form than Longosuchus, representing the next step towards the most-derived crown-taxon Desmatosuchus. Tecovachus is the source of topological conflict within the Par'ini because it is the least-derived taxon. Its primitiveness will be expressed by placing it as sister to all other Par'ini, while few shared, non-exclusive apomorphies are behind its position in the preferred tree (Bremer value = 1, BS = 48 in Parker's fig. 7).

While it is obvious that the matrix has no clear tree-like signal for resolving any OTU that is not part of the terminal Des'ini and Typ'inae lineages, our 'wildcard' (Aetobarkinoides) is particularly close to the outgroup while showing no affinity to anything else. If it is part of the ingroup, it represents the ancestral form, ie. shows a character suite that is primitive (derived traits may be missing because they are simply not preserved: see description of the taxon in Parker 2016). This is the reason why it acted rogue-ish in tree inferences even though it's favored phylogenetic position is clear.

Data

Parker's original matrix can be found in the supplement to the paper. An annotated ready-to-use NEXUS-formatted version (including my standard codelines for parsimony and distance bootstrapping) and the inference results used here can be found in this figshare submission, which I generated for a technical Q&A.



Here is the promised list of previous posts dealing with fossils and networks.

Exploring the oak phylogeny

Neighbor-nets are a most versatile tools for exploratory data analysis, including phylogenetics. They are not only fast to infer, but possibly most straightforward in depicting the signal in one's data matrix — this is called Exploratory Data Analysis. EDA makes them useful additions to any phylogenetic paper, because it gives the reader (and peers and editors during review) a good idea what the data can possibly show, and where there may be problems.

A nice example of this use is the Neighbor-net in a recent paper on Chinese oaks:
Yang J, Guo Y-F, Chen X-D, Zhang X, Ju M-M, Bai G-Q, Liu Z-L, Zhao G-F. Framework Phylogeny, Evolution and Complex Diversification of Chinese Oaks. Plants 2020: 1024.
[Note: The paper is, from a purely methodological point-of-view, pretty well done, but has probably not experienced any real peer-review.**]
Oaks (Quercus L.) are ideal models to assess patterns of plant diversity. We integrated the sequence data of five chloroplast and two nuclear loci from 50 Chinese oaks to explore the phylogenetic framework, evolution and diversification patterns of the Chinese oak’s lineage. The framework phylogeny strongly supports two subgenera Quercus and Cerris comprising four infrageneric sections Quercus, Cerris, Ilex and Cyclobalanopsis for the Chinese oaks.
None of this is new. My colleagues and I published an updated classification for oaks a few years ago (Denk et al. 2017) that took into account molecular phylogenies, and introduced the systematic concept referred to by Yang et al., and recently followed by a many-species global oak phylogenomic study (Hipp et al. 2020). All of this is based on nuclear data only, because any researcher who ever studies oak genetics soon realizes that the plastomes are largely decoupled from speciation processes, but are geographically highly constrained (eg. Simeone et al. 2016, Yan et al. 2019). This is the reason why oaks are indeed "ideal models to assess patterns of plant diversity" – they provide a worst-case scenario not the (trivial) best-case one.

As can be seen in the Yang et al. tree, members of section Ilex, a monophyletic lineage forming highly supported clades in trees based on nuclear data, are scattered all across the subgenus Cerris subtree. I have annotated a copy of this tree here.

Yang et al.'s fig. 1a, with some clades newly labeled for orientation

Because of the plastid incongruence, the subgenus Cerris subtree has a wrong root (section Cylcobalanopsis diverged before sister sections Cerris and Ilex split). Also, the reciprocally monophyletic, genetically coherent sections Cerris (green) and Cyclobalanopsis (blue) are embedded in the much more diverse Ilex 3 and Ilex 4 clades. The remaining Ilex species are placed in two early diverged clades, which I have labeled Ilex 1 and Ilex 2 in the above tree (note: the taxon set only includes Chinese oak species). The only indication the tree gives that we have a data conflict issue is the low support (gray circles represent branches with Maximum likelihood bootstrap support > 60).

The network

When interpreting the phylogenetic implications of a Neighbor-net, we have to keep in mind that it is not a phylogenetic network in the strict sense (ie. displaying an evolutionary history), but is instead a meta-phylogenetic graph: a summary of incompatible splits patterns. Incompatibility can have different origins: reticulation, recombination, diffuse or poorly sorted signals, etc. Consequently, when looking at a Neighbor-nets and their neighborhoods (Splits and neighborhoods in splits graphs), we need to keep in mind what kind of data we used to calculate the underlying distance matrix in the first place.

If the data follows two incongruent trees ("phylogenies"), as in this case for the oaks, the Neighbor-net has a good chance of capturing the incompatible splits of both genealogies. Here is the graph from the paper.

Wang et al.'s fig. 1b.

The central inflated portion of the graph reflects the incongruence between the combined data sets: we have overlapping nuclear-informed and plastid-informed neighborhoods.

The authors' brackets (shown in black) refer to neighborhoods triggered by the two nuclear markers in the data set: these are neighborhoods reflecting the common origin and speciation within the oak lineages. We can even see that this signal, which is incompatible with all deep splits in the combined tree, is unambiguous in part of the data (the nuclear partitions): section Ilex spans out as a wide fan, but there is a relatively prominent edge bundle defining the according neighborhood (the blue split).

The net shows additional, even more prominent edge bundles defining partly overlapping or distinct neighborhoods (the red splits). These neighborhoods are represented as clades in Yang et al.'s phylogenetic tree (fig.1a). They write (p. 11 of 20):
However, the conflict between the two datasets seems to be recovered by the neighbor-net method in this study, as the neighbor-net network based on combined plastid–nuclear data strongly shows the presence of two subgenera and four infrageneric species groups for the Chinese oak’s lineage (Figure 1b).
Interestingly, the authors nonetheless used the substantially incongruent combined data for downstream dating and trait mapping analysis (p. 7/20):
Bayesian evolutionary analyses provided a concordant infrageneric phylogeny for the Chinese oak’s lineage at the species level (Figure 2).
This uses a taxon-filtered, obviously constrained (fixed) topology, fitted to the current synopsis outlined in Denk et al. (2017). [Note: the supplement includes the extremely incongruent nuclear and plastid trees, each of which has further incongruence issues because they combine fast- and very slow-evolving sequence regions.]

Postscript

More posts on oaks, plastid data and networks can be found here in the Genealogical World and in my Res.I.P. blog.

Cited papers

Denk T, Grimm GW, Manos PS, Deng M, Hipp AL. (2017) An updated infrageneric classification of the oaks: review of previous taxonomic schemes and synthesis of evolutionary patterns. In: Gil-Pelegrín E, Peguero-Pina JJ, and Sancho-Knapik D, eds. Oaks Physiological Ecology. Cham: Springer, pp. 13–38. Open access Pre-Print [major change: Ponticae and Virentes accepted as additional sections in final version].

Hipp AL, Manos PS, Hahn M, Avishai M, + 20 more authors. (2020) Genomic landscape of the global oak phylogeny. New Phytologist 229: 1198–1212. Open access.

Simeone MC, Grimm GW, Papini A, Vessella F, Cardoni S, Tordoni E, Piredda R, Franc A, Denk T. (2016) Plastome data reveal multiple geographic origins of Quercus Group Ilex. PeerJ 4:e1897. Open access.

Yan M, Liu R, Li Y, Hipp AL, Deng M, Xiong Y. (2019) Ancient events and climate adaptive capacity shaped distinct chloroplast genetic structure in the oak lineages. BMC Evolutionary Biology 19:202. Open access.



** The publisher, MDPI, thrives in the gray zone between predatory and accredited publishing. Originally included in the recently reactivated Beall's List (new homepage), it has been tentatively dropped (see the linked Wikipedia article; but see also this post by Mats Widgren). Personally, I have encountered articles published in MDPI journals only where the review process must have been, at least, strongly compromised. But it's always quick: Yang et al.'s paper was submitted July 24th, accepted August 12th, and published a day later. Three weeks is about the length of time that the editors of my first oak paper needed to find a peer reviewer at all.

Fossils and Networks 3 – (deleting and) adding one tip


In the last Fossils and Networks post, we explored the use of SuperNetworks to identify both safe and problematic branching patterns by removing one OTU and re-evaluating the analysis. Here, we'll take the opposite approach, and see what we can learn from adding one OTU to our analysis.

Breaking and supporting wrong branches

We start again with the artificial Felsenstein Zone matrix that results in a wrong AB clade. Here's the original true tree used to generate the matrix.


Because of convergent/parallel evolution in the modern taxa (genera O, A and B) and primitive characters of their fossil sisters, any phylogenetic inference method will find the wrong, tree with a A + B | rest split.

In the Felsenstein Zone, parsimony will always get the wrong tree due to long-branch attraction (LBA), while Maximum likelihood has a 50:50 chance to escape LBA. To break down the LBA between A and B, we need a fossil that is, from an evolutionary point of view, intermediate between D and B.

If we add a fossil E that features 1 out of 3 derived traits found in the BD lineage (including the only synapomorphy of BD), we end up with two alternative parsimony trees: one with a wrong topology and the other the correct topology, as shown here.


By adding a fossil F featuring 2 out of 3 derived traits, we increase the number of most-parsimonious trees (MPTs) to three alternatives, all of which fall prey to A-B+F LBA, as shown next.


Convergent evolution is a problem for tree inference but selection bias and homoiologies are worse, involving accumulation of the same advanced trait within some but not all members of a lineage (Has homoiology been neglected in phylogenetics?). This is worse because the characters will enforce attraction between long-branching, highly evolved (more modern) taxa. A and B are siblings, but by enforcing an ABF clade, we will inevitably misinterpret the most primitive members of the ingroup, C and D. Hence, we may draw wrong conclusions about evolution in the A–F lineage.

Because E is virtually half-way evolved between D and F, and F is the next step towards B, the all-inclusive tree gets it right. We infer a single optimal tree, shown here.


PS: Also, in this case we could use any other optimality criterion (Maximum Likelihood, Least-squares, Minimum Evolution) and we would end up with the same tree.

Missing the important bits

That last observation is encouraging: the more fossils we include in our matrix and the better they reflect the evolutionary trends within a group (here from a D-like ancestor via E to F and B), the greater the chance of ending up with the true tree. There's only one drawback: in real-world data sets, we may miss exactly those traits in the fossil sample that are needed in order to infer (or stabilize) the true tree.


(Paleo-)Parsimonists have frequently argued that missing data are unproblematic, which is true in one sense, as shown in the above example. The commonly used strict consensus tree has no wrong branches, because it only has one, which is the trivial ingroup-outgroup split. The much less commonly used Adams consensus tree has one more branch, which is wrong: the ABF clade.

As always in such cases, the strict Consensus network visualizes the MPT sample best (again exemplifying why we should stop using cladograms).


The price for not having false positives is that we cannot infer a most-parsimonious tree or a few alternative trees any more, but could easily end up with scores of them. Here, we have 41 MPTs for a 8-taxon dataset that include fairly wrong trees*, although some of them are closer to the true tree (green and olive edges in the strict Consensus network above). For large matrices, or matrices lacking tree-like signals, the number of MPTs can easily reach tens or hundreds of thousands. Lacking critical traits in E (14 out of 46 characters missing) and F (7 missing), we may escape LBA at the cost of decisiveness. If we do have those traits only in F but not E, we will enforce LBA between A and B.

Plus-1-trees (and SuperNetworks)

Before adding a taxon as an additional leaf to our tree, we may be interested in what that taxon does to our tree: can it trigger a topological change or does it fall in line? We will again take the dinosaur-to-bird-matrix of Hartman et al. (2019, PeerJ 7: e7247) as a real-world example. This includes everything from well-covered highly derived and most primitive taxa, to those that lack discriminatory signal in general (ie. are unresolved), plus the one or two rogue taxa, with ambiguous phylogenetic affinities creating topological conflict. (Note: the commonly reported strict consensus trees cannot distinguish between those two alternatives.)

The best-covered 15 taxa provide us with a single optimal tree that is in agreement with current opinion (shown below). However, this struggles to resolve the clade of modern birds because the extinct Lithornis is being attracted by Anas, the duck. When we remove Dromiceiomimus (as shown in Fossil and Networks 2), we end up with a putatively wrong Dromaeosauridae grade, because of LBA between the most distinct Dromaesauridae, Velociraptor and Bambiraptor, and the distantly related (to flying dinosaurs) Allosaurus, Tyrannosaurus and the IGM 10042 skeleton.

Two of the Minus-1 trees generated for the last post of this series.

For our experiment, we will take this (partly) wrong tree, and add every other taxon included in the Hartman et al. (2019) matrix as 15th tip. We can then perform a branch-and-bound search to infer these 14-Plus-1 tree(s). When we browse through the inferred MPTs, we can see that many taxa fall in line with the wrong topology, including a few that, in addition, increase uncertainty for branches correctly resolved in the minus-Dromiceiomimus tree.

Out of the 485 candidate trees, only 10** have a set of characters that can compensate for the missing Dromiceiomimus, leading to Plus-1 trees that show a Dromaesauridae clade, as shown here.

Two of the ten Plus-1 trees, where the added tip saves the inference from LBA. Numbers give the amount of defined characters (scored traits). Both Halszkaraptor and Zhenyuanlong are classified as Dromaeosauridae, however only the better covered taxon is placed as sister to the Dromaeosauridae included in the original 14-taxon tree.

The presence of the deep-branching Compsognathus (Tyrannoraptora: ... :Neocoelurosauria: †Compsognathidae) triggers an Archaeopteryx-Dromaesauridae clade.


In the case of relative deep-branching Garudimus (... :Neocoelurosauria: Maniraptoriformes: †Ornithomimosauria: †Deinocheiridae) and Epidexipteryx (... : Maniraptoriformes: ... : : ... : Paraves: †Scansoriopterygidae) one or two of the two or three MPTs show the wrong grade except the last the clade.

Note: the relative low number of scored traits for Epidexipteryx can avoid LBA leading to a Dromaeosauridae grade but misplace the taxon within the Plus-1 MPTs: its family, the Scansoriopterygidae, are considered to represent the sister lineage (Wikipedia, referring to Godefroit et al. 2013 Nature 498: 359–362) of the Eumaniraptora which include the Dromaeosauridae as first-diverging branch.

We can also summarize the outcome, a collection of 640 Plus-1 MPTs, in form of a z-closure SuperNetwork, as we did for the Minus-1 trees in the previous Fossils and Networks post (shown next).


This SuperNetwork is quite boxy, and may be only semi-comprehensive (I used only 20 runs, which took half a day). Matching 485 tips into a 14-taxon backbone tree is not the kind of tree sample that the SuperNetwork has originally been designed for!

Only four edges, fat and blue, are without alternatives. In all other cases, the added tip triggered the creation of several alternatives: the highest dimension for the boxes is five, but most have four or less dimensions. Regarding our problem of saving the Dromaeosauridae clade, we can see that the topological change depends on very few characters, with Microraptor being very close to the divergence but a bit more bird-like (in a very broad sense), while the other two are much more derived.

Close-up on the Dromaeosauridae part of the network, with all tips labeled. Pie charts give the percentage of scored traits/missing data. * – Tips that saved the inference from LBA (see above).

Note the length of some of the colored edges, especially the light green which represent edges reflecting a Dromaeosauridae clade. Other Dromaeosauridae taxa increase not only the diversity but also may create substantial topological ambiguity (bluish and greenish edge bundles; same color = same split) and branching bias.

Take-home message

Creating morphological supermatrixes makes a lot of sense, because it ensures normalization and facilitates universal comparability, which is crucial also for paleobiology. However, even more than molecular phylogenies, paleophylogenies are affected by character and taxon sampling. This is nothing new, and much debate has dealt with which parsimony strict consensus cladogram is the better one.

I suggest taking a new route. Instead of using morphological supermatrixes to infer trees – for this matrix, Hartman et al. found millions of equally optimal parsimony trees further filtered by post-analysis, initial tree topology informed character weighting (as implemented in TNT) – we should use it to generate subsets and engage in exploratory data analysis. This will pinpoint strengths and weaknesses of the data and its individual taxa. Rather than producing evolutionary meaningless soft polytomies, one should study the reasons for any topological ambiguity. After all, one simple reason for unstable branching patterns may be that all so-far inferred trees are biased, only differently.

The SuperNetwork can assist us in putting together taxon sets that could allow not only a simple tree inference but also topology testing.
  • If we want to test the stability of, e.g., the Dromaeosauridae clade against taxon sampling, it will be of little use to include the most primitive (anything outside Maniraptora) and much more advanced taxa (Avialae including modern birds) of the 501-taxon matrix. On one had, the most primitive taxa will only increase the computational load, because our inferred tree not only optimizes branches we are interested in, but also irrelevant ones, using taxa that largely lack discriminative signal for the branches of interest or at all. On the other hand, the most derived taxa may bias the tree inference by providing strong terminal signals outcompeting potentially conflicting weak basal signals.
  • If we want to test the stability of the backbone phylogeny against adding taxa and entire lineages, we may prefer short-branched over long-branched taxa, in order to avoid (local) LBA (especially when we want to stick to parsimony). The terminal edges in the SuperNetwork indicate the minimum number of unique changes for each tip added to the 14-taxon tree. As seen also in our hypothetical example: E and F only break down the wrong AB clade because both are either identical (or very similar) to the last common ancestor of E+F+B and F+B, respectively.
In a future post, I'll come back to the issue of identifying taxa that are game changers, using a simple and quick tree-based approach: the so-called "evolutionary placement algorithm", first implemented in RAxML.

PS.
For any of you who really don't like networks, but still find no comfort in comb-like strict consensus cladograms either: just tick the SuperTree option when inferring the SuperNetwork. But only if your input trees converge to a shared topology. Otherwise the result may look like this:

A SuperTree based on the 640 Plus-1 MPTs.



* Somebody familiar with Consensus networks and morphological data partitions providing complex signal, can extract a phylogenetic hypothesis from this boxy network for the included taxa. In general, the distance along the network edges represents a phylogenetic distance, and thus gives a direct measure of how derived a taxon is.

For example, C, D are closer to the ougroup and placed close to the centre of the graph, which is exactly where a primitive ingroup taxon, with an ancestral morphology, would be placed. F is most likely a sister of B. The olive EF | rest split supports a potential common origin of E, F, and B (long green edge bundle). Hence, A can only represent a distant, strongly evolved sister lineage (both the alternative AB and ABF clade have less character support). Also, since the graph depicts E as least derived of the four (irrespective of the topological alternatives), its affinity to F and B has more value than the affinity between A and B, both being long-branched, and hence susceptible to LBA. D fits into the picture, the olive DE edge either: (1) represents a common origin, which would make D an early member of the red lineage; or (2) has similarity due to shared primitive traits within the ingroup, which would make D an early member of an ABEF lineage. C, in contrast to D, has no clear affinities with any other ingroup member, and so can only be interpreted as an early, very primitive form with uncertain phylogenetic relationships. The (true tree) mutual monophyly of the red and blue ingroup lineages has very little character support in the matrix, and hence cannot possibly be resolved.

** Systematically they cover a range of maniraptoran ('hand hunters') families 'below' the Avialae ('flying' dinosaurs) including, in addition to two Dromaeosauridae (Halszkaraptor, Zhenyuanlong, trees shown above), members of †Alvarezsauroidea (Haplocheirus), †Caudipteridae (Caudipteryx), †Sinovenatorinae (Sinovenator), †Therizinosauroidea or related (Beipiaosaurus, Jianchangosaurus) and †Troodontidae (Gobivenator, Sinornithoides). Caihong is a member of the †Anchiornithidae, which Wikipedia flags as "Avialae ?". These OTUs show data coverage far above the median (74% missing), with 278 (Caihong) to 558 (Caudipteryx) defined characters (out of a total of 700).

Fossils and networks 2 – deleting (and adding) one tip


A general assumption in phylogenetics is: the more the better. The more data my matrix includes, the better will be my tree. The more taxa I include, the better will be my phylogenetic analysis. But is this true when we include (or rely on) fossils? After all, there is an old saying: less is more; and in this post I will show you that it is often true here, too.

Perfect data – how to recognize unproblematic topologies

In the first post of this series (Farris and Felsenstein), I introduced two matrices, a Farris Zone matrix and a Felsenstein Zone matrix, with the same set of tip taxa: three extant genera and three early fossils, one for each generic lineage.

The Farris Zone matrix provides a perfect signal. No matter which inference criterion one uses, one always gets the true tree. In such a case, the taxon sampling should be irrelevant; and it is. Any 5-taxon sub-tree correctly shows only splits found in the 6-taxon true tree — shown below are the actual most parsimonious trees (MPT) of each inference using the branch-and-bound algorithm.

Six most-parsimonious trees showing the topology of the true tree; trees are midpoint-rooted and have the same scale.
Note: NJ/LS and ML would give the same result for this experiment.

Consequently, for the perfect case, the SuperNetwork of the six 5-taxon trees is the 6-taxon true tree.

Z-closure SuperNetwork (Huson et al. 2004) of the 5-taxon MPTs generated with SplitsTree (walkthrough at the end of the post) depicting the true tree.

Therefore, the simplest test to check for potential topological issues in any set of data is to sub-sample the taxa by sequentially pruning a single taxon, infer the resulting group of trees (which I will call minus-one trees), and then summarize this tree sample in the form of a SuperNetwork. If the data have no signal issues – and the inferred all-inclusive tree is unbiased – all minus-one trees will be congruent with the all-inclusive inferred tree. The resulting SuperNetwork will then be a tree matching the inferred all-inclusive tree.

On the other hand, if removing a single taxon has a significant effect on the inferred tree, then this either means you need this taxon to get the right tree or that this taxon is causing bias. We cannot assume that trees with many taxa are better than trees with fewer taxa. Only if a topology is independent of taxon sampling can we be sure that we are looking at a true tree (or one inevitable with the data at hand).

Taxon-sampling matters? Then the all-inclusive tree may be biased

Real data matrices are far from perfect. Paleophylogenetic matrices, for instance, not only include a lot of missing data limiting the decision capacity of any phylogenetic inference, but, being restricted to morphological traits, usually high levels of homoplasy — that is, similarity in conflict or only partial agreement with the phylogeny (here are some related posts: Has homoiology been neglected in phylogeny? Should we bother about character dependency? Please stop using cladograms! The curious case[s] of tree-like matrices with no synapomorphies and More non-treelike data forced into trees: a glimpse into the dinosaurs). While some OTUs are primitive in their character suites, others are highly derived. We often, without realizing it, are infering within or close to the Felsenstein Zone.

If we repeat the same minus-one experiment, but now use the Felsenstein Zone matrix, instead, we end up with something quite different. We get three most-parsimonious tree (MPT) solutions when eliminating the outgroup genus O or its fossil Z; and eliminating the genera A and B and their fossils C and D, respectively, each leads to a single MPT. This yields a total of 10 MPTs.

First row rooted with Z, all other trees mid-point rooted. All trees have the same scale.

By pruning the long-branching genera A or B, even parsimony analysis gets the correct tree because we have eliminated the source of the long-branch attraction. Adding fossils to break down long branches can be effective (classic paper: Wiens 2005), but dropping long-branching tip taxa works just as well. Changing between a close outgroup (fossil Z) and a distant outgroup (fossil O) has little benefit here.

In this case, the resulting SuperNetwork of our 10 MPTs is not a tree but a network including alternative clades, wrong ones (orange), ie. not monophyletic, and correct ones (green) — ie. branches (internodes, bipartitions) reflecting the monophyletic lineages of the true tree.

Comprehensive Z-closure SuperNetwork of the 10 minus-one MPT inferred based on the Felsenstein Zone matrix. The network includes all split patterns found in the MPT sample.

A real world example

To give an example of how sequentially dropping one taxon works with real-world data, we'll use the exhaustive 700 character matrix for bird-related dinosaurs provided by Hartman et al. (2019).

With its total of 501 taxa (OTUs), the apparent rationale behind the matrix is that, by including as many taxa as possible, one gets the best-possible (parsimony) trees, irrespective of the signal quality provided by individual OTUs. However, the full matrix cannot be forced into a single-optimal parsimony tree, due to missing data (72% of the matrix' cells are undefined or ambiguous, ie. 255969 cells) and a scarcity of synapomorphies (in a Hennigian sense) — this is discussed in Hartman et al.; see also the related Q&A.

Here, in light of the computational effort and to avoid heuristics when searching the MPTs, we'll use a pruned sub-matrix. For our first experiment, we take 15 out of the 19 best-covered OTUs. Thus, OTU pairs / triplets that are much more similar to each other than to any other OTU, are reduced to the best-covered representative.

The 19-taxon matrix that I used in a previous post (Large morphomatrices – trivial signal) had only one most-parsimonious tree solution, showing only clades in agreement with current opinion, which assumes a largely staircase-like evolution from dinosaurs to modern birds (Tree of Life). In contrast to the full matrix, the 19-taxon matrix provided high support for most clades (method-independent), reflecting the number of scored traits. The extant taxa, representatives of modern birds (duck, turkey and ostrich, all edible), have many derived cgaracters, with the extinct bird genus Lithornis being placed in-between ostrich and duck + turkey.

The optimal topologies for the 19 best-covered taxon matrix. Green, the single most-parsimonious tree. Clade names copied from Wikipedia/Tree of Life.

The ML and NJ/LS (except for one branch) trees were topologically identical; each branch is supported by about 100 inferred changes. The signal from the matrix should be straightforward.

The tree-size weighted mean (default in SplitsTree) SuperNetwork, summarizing the result of an exhaustive branch-and-bound search using the 15-dropped-1-taxon matrices (each one resulting in a single optimal MPT) has a tree-like structure.

Allosaurus-rooted SuperNetwork of the 15 minus-one MPTs. Green – clades also found in the all-inclusive tree representing monophyla; orange – conflicting clades, blue – the all-inclusive tree doesn't resolve the assumed monophyly of modern birds, but places Lithornis as sister to Neognathae.

Conflicting clades are found in only two of the 15 inferred MPTs, being represented by short branches (their length in the other 14 trees is counted as zero).

Nonetheless, these conflicts received considerable character support. The frequency of a split in the minus-1 tree sample is irrelevant (see the A-B LBA problem discussed above — any tree including A and B showed the wrong clade). When summarizing our tree sample (especially when using MPTs), we should hence opt for a SuperNetwork, in which the edge lengths give the minimum branch lengths found in the MPT collection, ie. the edge length reflects the minimum length of the branch in all trees showing that branch.

Same SuperNetwork as above, but using the "Min" option instead of the default setting for computing edge lengths.

Without Dromiceiomimus – representing an earlier diverged lineage and step in bird evolution – the Dromaeosauridae clade, which is probably monophyletic (Wikipedia), flips and dissolves into a grade. By removing the intermediate step, we seem to create some ingroup-outgroup (long-branch) attraction.

Anas, the duck, forms the morphological link to Lithornis – with a mean morphological pairwise Hamming distance (MD) of 0.23, Anas is the most-similar OTU; and, hence, the MPT places Lithornis as sister to Anas + Meleagris (turkey; MD = 0.17). By eliminating Anas, the remaining contemporary birds form a clade — the modern birds (Neornithes) are assumed to be monophyletic but do not form a clade in the all-inclusive MPT (Struthio, the ostrich, is morphologically more distant from duck, turkey and Lithornis).

Conclusion

Even the most comprehensive, least gappy of paleophylogenetic matrices have substantial signal issues. If a tree inference is dependent on which OTUs are sampled, we cannot assume that we will automatically get better trees simply by including everything we have. Some OTUs (in our experiment: Dromiceiomimus) will stabilize correct aspects of a tree, while others will manifest bias or error (here: Anas). It's unlikely that a wrong, ie. not monophyletic, clade created by the attraction of two well-sampled taxa can be broken down by adding numerous taxa showing only a fraction of defined characters. SuperNetworks of minus-one trees can point you to the critical OTUs and unstable branching patterns of your (backbone) phylogeny.

PS. Personally, I would analyze a matrix with these properties, and a taxon sample spanning more than 150 myrs of evolution (from Allosaurus to modern birds), using ML not MP. I used MP in this post only because paleontologists are still very fond of it (not a few still discard anything else as unfit for their data). ML is less prone to long-branch attraction, results in a single tree (easier to compare when using larger taxon samples), and is speedy these days, allowing for more in-depth experiments towards the end of the exploratory data analysis. Both IQ-Tree (homepage; includes links to online servers) and RAxML-NG (open access paper providing essential links / github; implemented on various online servers) can quickly infer ML trees and establish branch support (including but not restricted to nonparametric bootstrapping) using binary and multistate data.



Walk-through for computing Z-closure SuperNetworks (Huson et al. 2004) in SplitsTree (v. 4, since v. 5 is still not fully functional):
  1. Make sure the tree sample for reading is in Newick format, including branch-length information. The trees can be in a single file or multiple files.
  2. Start SplitsTree.
  3. To read in the tree sample:
    • File > Open, if your trees are in one file;
    • File > Tools > Load multiple trees, if your files (eg. minus-1 MPTs) are in different files.
  4. Go to Networks > SuperNetwork. Choose "Min" for "Edge Weight" in the pop-up analysis window for the first graph. You can also try out "Mean"/"Sum" (short, rare alternatives will be less prominent), "AverageRelative" (trade-off) or "None" (branch-lengths in the minus-one tree sample are ignored). When using simple tree samples (little topological variation, matrix with fairly stringent signals), a single run (default) suffices. Increasing the number (eg. to 100) ensures no branching pattern in the minus-one tree sample gets lost. For instance, for the Felsenstein Zone matrix, a single run will give you a SuperNetwork capturing the major conflicting aspects, while 100 runs will lead to a higher dimensional graph that includes the correct BD and AC clades as alternatives. If you like to view the overall best-fitting tree instead of a network, tick "SuperTree".


Cited papers

Hartman​ S, Mortimer M, Wahl WR, Lomax DR, Lippincott J, Lovelace DM (2019) A new paravian dinosaur from the Late Jurassic of North America supports a late acquisition of avian flight. PeerJ 7: e7247.

Huson DH, Dezulian T, Kloepper T, Steel MA (2004) Phylogenetic super-networks from partial trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1: 151–158.


Wiens JJ (2005) Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Systematic Biology 54: 731–742.

Can we dig too deep? Signal conflict in mitochondrial genes of land plants


In an earlier post, Supernetworks and gene tree incongruence, I illustrated what Supernetworks can tell us about incongruent mitochondrial gene trees, using the dataset of Sousa et al. (PeerJ 8: e8995, 2020). Here, I will take a closer look at these data, in order to illustrate another point.

Fig. 1 A Supernetwork based on 34 individual mitochondrial gene trees (atp1 and atp8 missing due an alignment glitch). Groups (clades) referring to splits not found in any of the gene trees, including the "Septaphyta", are shown in gray font, in blue, groups referring to clades seen in Sousa et al.'s preferred tree.

Sousa et al.'s set of analyses aimed to filter signal in order to get a better all-inclusive tree, and succeeded to produce support for a "Septaphyta" clade, comprising liverworts and mosses, which is a split not found in any of the inferred (Bayesian) gene trees.

Fig. 2 Comprehensive but branch-length and frequency ignorant Supernetwork of Sousa et al.'s Bayesian MRC gene trees (trees are provided as supplementary online data on zenodo), inferred from nucleotide sequences. The trees show several alternatives (colored and labeled) regarding the sister lineages of mosses and liverworts. Any split found in at least one gene tree, is represented in this Supernetwork.

This split did occur, however, when the amino acid sequences were used, instead of the nucleotide sequences.

Fig. 3 Comprehensive branch-length and frequency ignorant Supernetwork of Sousa et al.'s amino acid Bayesian MRC gene trees. The sister taxon of liverworts are either the mosses (= Septaphyta clade) or the outgroup green algae Coleochaete (further inclusive splits include subsequently more outgroups, placing the liverworts as sister to all other land plants). Realized alternatives for mosses include further being a sister of hornworts (in which case liverworts would be sister to higher land plants), or sister to all land plants (brown split: green algae + mosses | rest).

The alternative to the Septaphyta clade, which does appear in the gene trees, recognizes the liverworts as the closest relative of the vascular plants, while the mosses are resolved as the first branch. As Sousa et al. point out:
The tree inferred from the concatenated nucleotide data set of 36 mitochondrial genes shows mosses as the sister-group to the remaining land plants, as previous analyses of mitochondrial nucleotide data have shown (Liu et al., 2014). ...
The concatenated tree hence only reflects a minor aspect of the Supernetwork (Fig. 1) of the individual gene trees:
... However, the mosses are replaced by the liverworts in the same position when analysing codon-degenerate recoded data.
This seems to be the preferred placement when summarizing the gene trees using the Supernetwork.

In this post, we will take a closer look. Is there a deep, easily obscured signal for the Septaphyta clade in the mitochondria of plants? A signal that only surfaces in some amino acid gene trees (Fig. 2) and the filtered concatenated tree (Sousa et al.'s fig. 2), or is it just a branching artifact?

An example for a (low-supported) artificial clade (cf. Schliep et al., Methods Ecol. Evol. 8: 1212–1220, 2017). The trees in (a), (b) and (c) give the paternal (Y-chromosome), maternal (mt genes) and biparental (nuclear autosomal introns) genealogies. While the paternal and biparental genealogies are compatible (congruent), the maternal is in strong conflict. When combining these three data sets, this substantial conflict decreases branch support and results in the artificial red clade. The support for the artificial clade is low but worrisome: the tips differ substantially from each other and the hypothetical, alternative common ancestors and there are literally no patterns in the data supporting a sister relationship of Sloth and Sun bears. The artificial red clade is a secondary product of the inference: the sister relationship of Polar and Brown Bear is trivial (data and inference wise), the American Black Bear the more likely sister (note the length of competing branches in a/c vs. b). The signal for a clade of Asian Black and Sloth bears is less pronounced, here the mt genealogy clade is strongly incongruent and forces the combined tree to resolve the conflict by introducing the artificial red clade.

Starting simply

The Supernetwork in Fig. 1 shows that, no matter which gene we look at, liverworts and mosses were originally most similar to each other, and, absolutely speaking, still close to the (hypothetical) mitochondrion of the ancestor of all land plants. We can illustrate the general situation about the signals using a Neighbor-net inferred from the concatenated data of all 36 genes.

Fig. 4 Neighbor-net based on uncorrected p-distances inferred from the concatenated gene data.

Note that we used a substitution model via a naive distance matrix for a set of coding genes that include saturated third codon positions. Some phylogenetic relationships are obviously based on trivial signals: the Neighbor-net in Fig. 4 includes ± prominent edge bundles defining neighborhoods in line with generally accepted clades (in bold). To capture these evolutionary lineages (some going back nearly half a billion of years), we just need the raw data but no sophisticated phylogenetic analysis.

In the case of the probably monophyletic gymnosperms, the gymnosperm neighborhood competes with a neighborhood excluding the Gnetidae Welwitschia, which is the most distinct of the seed plants in this taxon set (this applies to Gnetidae in general, no matter which data are used). In addition, we see a neighborhood defined by the pink edge bundle: a split of green algae + Welwitschia versus all other land plants. This is a case of obvious long-edge attraction, enforced (here) by missing data (Welwitschia lacks data for 12 out of the 34 genes).

The center of the graph with respect to all tips would be a candidate for the ancestral mitochondrion of the common ancestor of all land plants. Closest to this point are the Septaphyta (mosses + liverworts) and the lycophyte Huperzia (the better represented taxon only missing out on five genes, while Isoetum miss 15).

One can depict a phylogenetic hypothesis by just dropping the less pronounced neighborhoods in Fig. 4:
  • Most prominent edge bundles define three main cluster (= lineages): green algae, seed plants, and other land plants.
  • Within green algae:
    • Closterium is sister to Gonatozygon, next is Roya → Zygnematales
    • there is no prominent edge bundle connecting Chaetospharidium with the Zygnematales; the closest relative is however the last green algae (→ Coleochatales; only group without a neighborhood).
  • Within seed plants:
    • Brassica may be the sister of Liriodendron (more prominent edge bundle), Oryza complements the clade as first diverged member → angiosperms
    • Cycas is the sister of Gingko, the two are sister to the angiosperms
    • this leaves Welwitschia as the first diverged branch.
  • Taking the green algae as outgroups:
    • the ferns are the sister group of the seed plants (edges longer than the alternative of a primitive land plant clade)
    • mosses are sister to liverworts (→ Septaphyta); Huperzia shares the same edge bundle but is apparently sister of Isoetes (→ lycophytes), and the lycophytes appear to be ± primitive sisters of ferns and seed plants
    • this leaves the hornworts, a highly coherent group sharing no prominent edge bundles with any other member of the land plant cluster, and hence are a candidate for the first diverging land plant lineage.
This is a tree hypothesis that is strikingly similar with Sousa et al.'s preferred tree.

Sousa et al.'s fig. 2

The only differences lie in terminal subtrees (Oryza as sister to Liriodendron; Marchantia-Treubia grade, the position of the latter two within liverworts being unclear based on the Neighbor-net).

Something that is easily overlooked in Sousa et al.'s rooted tree, but that is apparent from the Neighbor-net, is that we should be aware of ingroup-outgroup long-branch attraction (LBA). The green algae are not only highly divergent but also very distant from all ingroup taxa, the land plants.; and the first ingroup branch in Sousa et al.'s tree has the longest root.

Additive and subtractive support

In principal, when comparing single gene tree samples to combined trees, we face four sorts of signals in our data:
  • Very strong signals imprinted in one or a few genes; they will outcompete, and possibly even be re-enforced by any conflicting signal. Walker et al. (PeerJ 7: e7747, 2019), studied this phenomenon for the case of angiosperm plastomes (see also our miniseries The emperor has no clothes on).
  • Phylogenetically sorted, weak but consistent signals; they will add up, as branch support will increase with each gene added. In this category fall signals reflecting deep splits obscured by terminal noise, when analyzing a single gene or few genes – like the one found by Sousa et al. supporting a Septaphyta clade.
  • Disparate gene histories; eg. because of intergenomic recombination. The support will be diminished with every added gene not sharing the same history.
  • General conflict; eg. when combining data from different genomes reflecting different genealogies, such as combining chloroplast (product of biogeographic history) and nuclear data (product of speciation processes) of tree genera. This will be expressed by split bootstrap (BS) support, and may result in artificial clades in the combined/concatenated tree (eg. bear example shown above).
Adding to this is the absence of signal: short-branch culling, a special case of long-branch attraction, which could also explain the inference of a (paraphyletic) Septaphyta clade. If there are few tips in the data that are close (absolute, not only regarding their phylogenetic distance) to the all-ancestor without clear affinities, they may be collected in a subtree, being leftovers from optimizing all other tips with certain affinities and higher distance to the all-ancestor.


Fig. 5 Short-branch culling. Let's assume liverworts are the sister clade of higher land plants (an alternative with near-unambiguous support from cox1). The signal for this in mitochondrial data is weak (short root). On the other hand, there is a high risk for ingroup-outgroup long-branch attraction (LBA) leading eventually to an artificial Septaphyta clade. Because of (inevitable) LBA, even though the false branch is very short, its support can be high (unambiguous when using Bayesian inference).

By compiling the support for all alternatives, we can assess where the support is additive or subtractive. We do this using my re-analysis not Sousa et al.'s Bayesian analysis because:
  1. BS support is more sensitive to internal signal conflict than Bayesian PP,
  2. to extract this information, we need the tree samples used to establish the branch support.
When doing this, we find that the split defining the Septaphyta clade is not only missing from the nucleotide genes trees but also rarely found in the BS pseudoreplicate samples. Only for seven gene regions (atp4, atp8, nad2, rpl16, rps2, rps3, rps13) do we find BS ≥ 25; the highest support comes from rps3 (BS = 65; however, the split is not found in the corresponding Bayesian MRC of Sousa et al.).

On the other hand, the main alternatives find much higher and more consistent support, as shown here.

Fig. 6 Competing support for (purple) and against a Septaphyta clade (greens and yellows). Placing the hornworts as sister to all other land plants (pink) is compatible with the hypothesis of a Septaphyta clade as well as the competing alternative of placing the liverworts as sister to higher land plants; note the high support from cox1 gene for an according tree. *, for these genes no hornwort data were included/have been available.

Short-branch culling, a special form of ingroup-outgroup LBA

Now, my BS analyses were deliberately naive, because they did not apply any data partitioning. However, both liverworts and mosses have short-branches while the outgroup, the green algae, are extremely long-branched. If substitution saturation is an issue for misplacing either liverworts or mosses as sisters to all other land plants, then there should also be ingroup-outgroup LBA. A false split of liverwort + outgroup versus the rest, or moss + outgroup versus the rest, has a lower chance to be supported than would a false hornwort + outgroup versus the rest split. The latter directly opens the door for a Septaphyta clade (see Fig. 5).

Let's have a look at the trees of the four genes supporting the Septaphyta split, as the best alternative. ("AA tree/PP" is the amino acid tree provided by Sousa et al.; BS support refers to my unpartitioned ML analyses)
  • atp8 — The AA tree is a star tree (comb), strongly distorted by LBA: a Coleochaetales + seed plants | Zygnematales + all other land plants splits has a PP = 1; the short- and long-branched lycophytes are not resolved as sisters.
  • rpl16 — Also here, the AA tree is star-like regarding deep relationships: (i) green algae (unresolved), PP = 1; (ii) liverworts, very long root, little internal resolution PP = 1; (iii) mosses (unresolved), root half as long as for liverworts, PP = 1; (iv) higher land plants, short root, PP = 0.88.
  • rps3 — No ingroup-outgroup LBA, shortest-branched ingroup, liverworts, resolved as sister to mosses + rest (PP = 0.77); thus, AA tree, not affected by saturation issues, rejects the Septaphyta (PP < 0.23).
  • rps13 — Again, the AA tree is star-like, with five tips: (i) green algae (PP = 1); (ii) long-rooted hornworts (PP =1); (iii) liverworts, relatively short root (PP =1); mosses, longer root (PP = 1); (v) higher land plants, shortest root (PP = 0.89).
The Septaphyta root is either extremely short or non-existent, as we would expect for a false clade, because there are no character splits in the matrix that support the taxon split.

Fig. 7 Sousa et al.'s amino acid Bayesian MRC trees (top row) compared to codon-naive nucleotide ML trees (bottom row) producing highest BS support for a Septaphyta clade. Note that in two cases, rpl16 and rps13, the 'best-known' ML tree shows a competing split with much lower support.

Typically, since we are looking at a deep split, we would expect that support increases when shifting from (codon-naive) nucleotide to amino acid analysis, because we eliminate terminal noise. However, we observe the opposite (Bayesian PP more easily converges to unambiguous support than BS values). The difference between our codon-naive nucleotide ML and Sousa et al.'s amino acid MRC trees tells us that it is mostly information from the 3rd codon position that triggers a Septaphyta versus the rest split for these four genes — ie. potentially synonymous substitutions that Sousa et al. filtered against.

Where does the high support comes from for the Septaphyta clade in their combined tree? That tree is based on a matrix, that should have a signal in-between our codon-naive nucleotide and their amino acid analysis.

A five-taxon problem with a glitch

Sousa et al.'s study is exemplary, in that it provides a careful, and well documented, analysis of the combined data. If you want to infer a potentially good tree, this is one way to do it.

However, their Septaphyta clade is most likely a branching artifact. It still combines data that, genuinely, provides not only diffuse but conflicting information about how the main lineages of land plants diverged from each other (Fig. 6). No analysis, no matter how sophisticated and well-crafted, can compensate for the deficits of the underlying data. By filtering out "noise", one also filters out actual conflicting signal. In this case, this is about how liverworts, mosses, and hornworts stand in relation to the extremely long-branched and divergent outgroup, the green algae, and their increasingly evolved siblings, the higher land plants (lycophytes, ferns, and seed plants). It is another example of what I pointed out in last week's post: Big Data invites big (ie. well supported) errors.

It is important to realize that, although we use many more OTUs, we are still looking at a five-taxon problem. When our data supports one split (or prefers it, being biased or not), there are only three more alternatives to select from.

Ingroup-outgroup LBA draws the hornworts, as the genetically most distinct (longest-rooted) lineage of the "bryophytes", away from liverworts and the lycophyte Huperzia, which connects the much more diverged higher land plants to the bryophytes. This leaves three alternatives:
  1. Liverworts are the sister of higher land plants. Their mitochondria show some affinity, but only to the lycophytes, mostly the low-divergent and better sampled Huperzia; and often together with the hornworts, ie. a split incompatible with the hornwort-green algae versus the rest split.
  2. Mosses are the sister of higher land plants, but their mitochondria show very little affinity to any of them (including Huperzia). In fact, they seem to have the most primitive of all land plant mitochondria.
  3. Septaphyta are monophyletic, as the trade-off with the least conflict. Being (much) less diverged than the higher seed plants, they are genetically closer, and ± equally close, to the hornworts and the least-evolved higher land plant, the lycophyte Huperzia.
Sousa et al.'s codon-degenerate approach enforced ingroup-outgroup LBA between the hornworts (the worst-sampled ingroup) and the green algae, while decreasing the absolute distance between liverworts and mosses, and increasing their distance to the higher land plants. That is, Alternative 3 outcompetes Alternative 1. Alternative 2 has no support in the data.

Are the mosses sister to all land plants?

Probably not. Just because the Septaphyta clade is an artifact, it doesn't mean the Septaphyta cannot be monophyletic — it just means the mitochondrial genes don't provide any clear signal to support or reject such a hypothesis, or any other alternative. The same applies to the mosses as the first diverging lineage; their position in earlier trees is likely also to be an artifact — not a branching, but a data artifact. If their mitochondrial genomes are still very similar to that of the common ancestor of all land plants, then they should be placed like an ancestor in the tree — as a short-branched sister to all of their "offspring", the remaining land plant mitochondria.

Eight of the nine genes that support a moss + outgroup versus the rest split, fail to resolve a moss clade. This is a clear indication that the moss mitochondria are simply primitive (at all gene positions that matter). What divides them from most (or all) other land plants are symplesiomorphies — shared but ancestral sequence patterns. The only gene that prefers both splits at once, mosses as sister to all other lands plants as well as a moss clade, is nad4 (BS = 67 and 62, respectively); but only when using nucleotides.

Fig. 8 A small but important difference: the codon-naive ML nucleotide (nt) tree (left) shows a moss clade as sister to all other land plants. The Bayesian amino acid (aa) MRC tree for the same gene shows a wrong split (purple internode) between green algae + ferns + angiosperms (long-branch, prominent roots) and bryophytes + lycophytes (mostly short-branched, short roots). By translating nucleotides into amino acids one may eliminate genuine discriminative signal encoded in synonymous substitutions, while in other, faster evolving parts of the tree, the same site/gene is oversaturated/biased. The poorly supported sister relationship of Roya and Gonatozygon within the green algae in the nt ML tree is an artifact, correctly resolved by the aa tree based on the same gene.

The shift from nucleotide data (ML / BS) to amino acid data (Bayesian MRC) triggers ingroup-outgroup LBA between green algae and ferns + seed plants (PP = 0.53; 'short-branch culling' of bryophytes and lycophyte Huperzia), and results in a branching artifact — the monophyly of higher land plants is well established, and hence they should form a clade.

By contrast, the genes providing strong support for a moss clade (such as atp1, atp8, ccmB, cob, cox1, cox3, nad2, nad5, rpl6, rpl16, rps3, rps13, and rps14) fail to resolve any deep relationships at all, or prefer different alternatives (including the Septaphyta hypothesis: atp8, rps3, rps13). The combined tree's solution is therefore a least-conflicting one, again — a moss clade (based on a consistent signal in the majority of genes: 13 with BS ≥ 90; in total 24 with BS ≥ 58) as sister to the rest of the land plants (based on a signal found in other genes not reflecting the monophyly of mosses). This solution adds to the phenomenon that moss mitochondria are generally primitive (ie. show a variant basically ancestral to all other land plants), and doesn't conflict with a wide range of otherwise conflicting splits strongly supported by individual genes (in contrast to the Septaphyta clade, see Fig. 6).

Conclusion

Having spent some time with the data and gene trees, I have little hope that mitochondrial data can be used to resolve the deep relationships between land plants. Each tweaking may result in something different, and the support-after-tweaking will be inflated.

Nevertheless, it will be worthwhile to close the data gaps, especially for the hornworts. This may not solve the 5-taxon problem,* but may give unique insights in how the mitochondrial genome evolved and sorted during the initial radiation of land plants.

Notably, the mitochondriomes of land plants can differ in the arrangement of their genes; which means that they recombined with or within the nucleome (or even plastome). While in some plants the mitochondriome is passed on via both parents (like in Ginkgo or Cycas), in others it is only the mother (most, maybe all, angiosperms). Plants may have colonized land more than once, and expanded quickly, so that lineage crossing and also lineage sorting may be an issue — marine species can be cosmopolitan and genetically heterogeneous (cryptic speciation). Thus, some mitochondrial genes may tell different stories from others. Instead of trying to solve which of the alternatives is correct (which is what most phylogenetic literature revolves around), we should find out which gene or part of the genome agrees with which alternative, as they may be all true.

The question to address with mitochondrial data cannot be whether mosses, liverworts or hornworts are the first diverging branch of extant land plants, but should be why moss, liverwort and hornwort mitochondriomes show different stages of evolution, as exemplified by the nad4 trees in Fig. 8.

Data availability

An archive including the support consensus networks (in Splits-NEXUS format) and inferred gene ML trees (plain NEWICK), as well as the comprehensive split support table (XLSX format), has been uploaded to figshare.



* It may help to have an in-depth analysis of a more focused taxon set with no data gaps that minimizes the risk of LBA. This starts with a better selection of taxa representing the higher land plants:
  • Oryza (the rice) is a domesticated, much cultivated, and thus extremely evolved and polyploid monocot. If there is any deep signal embedded in the mitochondria of seed plants, the mitochondrion of rice is probably the last place to look for it.
  • When trying to resolve the deepest land relationships, including a Gnetidae like Welwitschia (a genus that is an evolutionary oddball to start with), makes equally little sense — like any of the three surviving genera of this unique gymnosperm lineage, it is genetically the outer-most tip of an iceberg. Each mutation in its genome is the product of an unknown number of divergences in the past.
  • If any seed plant should be included at all, would be more than sufficient to have: Liriodendron, a magnoliid, and thus a member of the least-diverged angiosperm lineage, plus Cycas, as a representative of an ancient, slow-evolving gymnosperm lineage. These are much more recent additions to the plant Tree of Life.
  • Being a tip of an iceberg applies even more to Isoetum. It is strikingly similar only to the other lycophyte, but it has more data gaps and is much more diverged, and thus can invite branching artifacts. When one wants to dig deep, the much more primitive Huperzia is obviously the better representative.
  • Last, the green algae are the only possible outgroups for inference, but they are poor for this – apparently, their mitochondria have evolved much farther from the common ancestor than those of the land plants. Rather than inferring trees including them, one should infer trees without them, and then optimize their position within trees that will then potentially be unbiased by outgroup-LBA — eg. using the evolutionary placement algorithm, to test the land plant root. An interesting experiment could also be to infer the sequence of the common ancestor(s) of modern-day green algae (lacking a time machine to sample it), and use them instead. The new RAxML-NG, for example, allows for ancestral state reconstruction of nucleotides.
In addition, standard 4-base substitution models are not the best choice when analyzing matrices with a high proportion of ambiguous base calls, like Sousa et al.'s codon-degenerate matrix (note that Sousa et al. already applied models that compensate for substitutional bias). This is especially so, given the importance of synonymous mutations to resolve relationships in the slow-evolving lineages, and slow evolving genes. One could try to use ambiguity-aware substitution models instead. The newest releases of RAxML-NG (Kozlov et al. 2019, Bioinformatics 35: 4453–4455) include models for "phased" and heterozygous data — ie. models that can make use of ambiguity codes as additional information during tree inference (see also Potts et al. 2014, Syst. Biol. 63:1–16).

    How I would (realistically) analyze SARS-CoV-2 (or similar) phylogenetic data


    While writing this post, the Gisaid database reported over 40,000 SARS-CoV-2 genomes (a week before it was only 32,000), which is rather a lot for a practical data analysis. There have been a few posts on the RAxML Google group about how to analyze such large datasets, and speed up the analysis:
    How to run ML search and BS 100 replicates most rapidly for a 30000 taxa * 30000 bp DNA dataset
    In response, Alexandros Stamatakis, the developer of RAxML, expressed the basic problem this way:
    Nonetheless, the dataset has insufficient phylogenetic signal, and thus it can and should not be analyzed using some standard command line that we provide you here; but requires a more involved analysis, carefully exploring if there is sufficient signal to even represent the result as a binary/bifurcating tree, which I personally seriously doubt.
    As demonstrated in our current collection of recent blog posts, we also doubt this. One user, having read some of our posts, wondered whether we can't just use the NETWORK program to infer a haplotype network, instead. Typically, the answer to such a question is "Yes, but..."

    So, here's a post about how I would design an experiment to get the most information out of thousands of virus genomes (see also: Inferring a tree with 12000 [or more] virus genomes).

    Why trees struggle with resolving virus phylogenies and reconstructing their evolution. X, the genotype of Patient Zero (first host, not first-diagnosed host) spread into five main lineages. All splits (internodes, taxon bipartitions) in this graph are trivial, ie. one tip is seperated from all others. Thus, they, and the underlying data, cannot provide any information to infer a tree, which is a sequence of non-trivial taxon bipartitions. For instance, an outgroup (O)-defined root would require to sample the 'Source' (S), the all-ancestor, hence, defining a split O+S | X+A+B+C+D+E. All permutations of X+descendant | rest should have the same probability, leading to a 5-way split support (BS = 20, PP = 0.2). In reality, however, tree-analyses, Bayesian inference more than ML bootstrapping, may prefer one split over any other, eg. because of long-branch attraction between C and D and 'short-branch culling' of X and E. See also: Problems with the phylogeny of coronaviruses and A new SARS-CoV-2 variant.


    Start small

    Having a large set of data doesn't mean that you have to analyze it all at once. Big Data does not mean that we must start with a big analysis! The reason we have over 40,000 CoV-2 genomes is simply the recent advances in DNA sequencing, and that we have effectively spread the virus globally, to provide a lot of potential samples.

    The first step would thus be:
    1. Take one geographical region at a time, and infer its haplotype network.
    This will allow us to define the main virus types infecting each region. It will also eliminate all satellite types (local or global) that are irrelevant for reconstructing the evolution of the virus, as they evolved from a designated ancestor, which is also included in our data.

    We can also search the regional data for recombinants — virus may recombine, but to do so they need to come into contact, ie. be sympatric.

    C/G→U mutations seen in several of the early sampled CoV-2 genomes: note their mixing-up within haplotypes collected from the cruiseship 'Diamond Princess' (from Using Median-networks to study SARS-CoV-2)


    Go big

    Once the main virus variants in each region are identified, we can filter them and then use them to infer both:
    1. a global haplotype network, and
    2. a global, bootstrapped maximum-likelihood (ML) tree.
    The inference of the latter will now be much faster, because we have eliminated a lot of the non-tree-like signal ("noise") in our data set. The ML tree, and its bootstrap Support consensus network, will give us an idea about phylogenetic relationships under the assumption that not all mutations are equally probable (which they clearly aren't) — this provides a phylogenetic hypothesis that is not too biased by convergence or mutational preferences, eg. replacing A, C, and G by U (Finding the CoV-2 root).

    On the other hand, the haplotype network (Median-joining or Statistical parsimony) may be biased, but it can inform us about ancestor-descendant relationships. Using the ML tree as guide, we may even be able to eliminate saturated sites or weigh them for the network inference, provided that the filtered, pruned-down dataset provides enough signal.

    With the ML tree, bootstrapping analysis and haplotype networks at hand, it is easy to do things like compare the frequency of the main lineages, and assess their global distribution. This also facilitates the depiction of potential recombination, we can sub-divide the complete genome and infer trees/networks for the different bits, and then compare them.

    Only based on nearly 80 CoV-2 genomes stored in gene banks by March 2020. The same can be done for any number of accessions, provided tools are used taking into account the reality of the data. The "x" indicate recombination, arrows ancestor-descendant relationships (from: Using Median networks to study SARS-CoV-2)


    Change over time

    The most challenging problem for tree inference and haplotype-network inference, is the fact that virus genomes evolve steadily through time. That is, the CoV-2 data will include both the earliest variants of the virus as well as its many, diverse offspring — both ancestors and descendants are included among the (now) 40,000+ genomes. We have shown a number of examples where trees cannot handle ancestor-descendant relationships very well. Haplotype networks, on the other hand, are vulnerable to homoplasy (random convergences). So:
    1. Take one time-slice and establish the amount of virus divergence at that time.
    Depending on the virus diversity, one can use haplotype networks or distance-based Neighbor-nets (RAxML can export model-based distances). Even traditional trees are an option — by focusing on one time slice, we fulfill the basic requirement of standard tree-inference: that all tips are of the same age.
    1. Then stack the time-slice graphs together, for a general overview.
    It will be straightforward to establish which subsequent virus variant is most similar to which one in the slice before.

    Based on such networks, we can also easily filter the main variants for each time slice, to compile a reduced set for further explicit dating analysis, for example via the commonly used dating software BEAST (it was actually designed originally for use with virus phylogenies).

    A stack of time-filtered Neighbor-nets (from: Stacking neighbour-nets: a real world example; see Stacking neighbour-nets: ancestors and descendants for an introduction)


    Networks and trees go hand-in-hand

    With the analyses above, it should be straightforward to model not only the spread of the virus (as GISAID tries to do using Nextstrain) but also its evolution – global and general, local and in-depth, and linear and reticulate.

    The set of reconstructions will allow for exploratory data analysis. Conflicts between trees and networks are often a first hint towards reticulate history — in the case of viruses this will be recombination. Keep in mind that deep recombinants will not necessarily create conflict in either trees (eg. decreased bootstrap support) or networks (eg. boxes), but may instead result in long terminal branches.

    There may be haplotypes in the regional networks that are oddly different, or create parallel edge-bundles. Using the ML guide-tree, we can assess their relationship within the global data set — whether they show patterns diagnostic for more than one lineage or are the result of homoplasy.

    Likewise, there may be branches in the ML tree with ambiguous support, which can be understood when using haplotype networks (see eg., Tree informing networks explaining trees).

    Era of Big Data, and Big Error

    SARS-CoV-2 data form a very special dataset, but there are parallels to other Big Data phylogenomic studies. Many of these studies produce fully resolved trees: and it is often assumed that the more data are used then the more correct is the result. Further examination is thus unnecessary (and it may be impossible, because of the amount of compiled data).

    As somebody who worked at the coal-face of evolution, I have realized that the more data we have then the more complex will be the patterns we can extract from them. The risk of methodological bias will not vanish, but may even increase; and the more I then need to check which part of my data resolves which aspect of a taxonomic group's evolution.

    This can mean that, rather than a single tree of 10,000 samples, it is better to infer 100 graphs that each reflects variation among 100 samples and one overall graph that includes only the main sample types. Make use of supernetworks (eg. Supernetworks and gene incongruence) and consensus networks to explore all aspects of a group's evolution. In particular, when you leave CoV-2 behind and task larger groups of coronaviruses (Hack and fish...for recombination in coronaviruses).

    Hack and fish … for recombination in the SARS group


    Following the current flow, we have had a few recent coronavirus posts here on the Genealogical World of Phylogenetic Networks. In this post, I'll show the results of a little experiment coming back to David's original post on the topic. Can we use trees to "fish" for evidence of recombination?

    As David pointed out, even when we use a phylogenetic-tree inference method to analyze virus genomes, we don't really end up with a phylogenetic tree. Instead, we have a tree reflecting genetic similarity, which will reflect the phylogeny to some unknown extent. The main problem with virus genomes, however, is that they easily recombine — and thus different parts of a virus genome may have different evolutionary histories. A single tree cannot reflect this.

    This does not mean that trees cannot tell is something about virus evolution. However, these trees become part of a fishing exercise, looking for different possible historical pathways, which may reflect recombination events.

    The tree

    Our SARS harvest matrix includes about a dozen sequence groups, which we have labeled Type 1 (the original SARS-CoV) to 9b. Type 7 is the new SARS-CoV-2. For my experiment here, I picked one place-holder sequence per main type (to speed up calculation time). I added two more types: the newly found direct sister of SARS-CoV-2; and some "unclassified" SARS-like viruses from pangolins, which earlier were proposed as sisters, as shown in this tree from the GISAID web page.

    The phylogenetic neighborhood of SARS-CoV-2 (GISAID, screenshot captured 3/6/2020). Note the flatness of the CoV(-1; yellow) and CoV-2 (red) subtrees.

    GISAID doesn't give the GenBank accession numbers, so we cannot easily say whether our sample matches theirs. However, the tree we can infer from the complete genomes (high-divergent, non-alignable regions excluded) looks very similar, as shown next, and some of the labels match up.

    Fig. 1 Maximum likelihood (ML) tree inferred for our sample using (old, v.8.0.20) RAxML. Roman numbers refer to earlier defined Types 1–9 (Tree and viruses – the SARS group), Arabic numbers give nonparametric bootstrap (BS) support based on 100 BS pseudoreplicates (number of neccessary BS replicates determined by the extended majority rule criterion). Branches without Arabic number are unambiguous (BS = 100).

    Most importantly, all but three branches have unambiguous support: the phylogeny of this sample is resolved. Unfortunately, as our recurring readers already know, this nearly resolved tree simplifies a much more complex situation.

    The Neighbor-net with recombinations and mutational trends (arrows, connectives; cf. Tree and viruses – the SARS group).

    Hack and slash

    A simple method to fish for different evolutionary histories in a genome is to cut the virus genomes into sub-sequences, infer a tree for each sub-sequence, and then compare the trees. Most researchers compare trees by showing them and discussing which one makes most sense. Here is an example from Corman et al. (2014), who searched for the root of MERS (Middle East Respiratory Syndrome) virus, an illness closely related to SARS.

    Reprint of Corman et al. 2014, fig. 3 with colors added to EriCoV (green) and HKU/BtCoV (olive) groups

    Each tree in their Fig. 4A and B (Bayesian majority rule consensus trees) was inferred from a different part of the genome. Corman et al.'s focus was to root the MERS viruses by identifying a better outgroup. However, note that the new sister-group (red, green stars – sister to MERS; orange stars – sister to someone else) moves, and so does the green EriCoV clade and the olive HKU/BtCoV group (clade in some trees and grade in others). Do some of these trees get it wrong? Or is, eg. NeoCoV the product of reticulate evolution (here: ancient recombination)? Some parts of its genome might be derived from a common ancestor with MERS (blues), and others from a common ancestor with KW2E (black) and EriCoV (green).

    Our complete matrix has 27,333 characters, providing nearly 6,000 distinct alignment patterns (abbreviated DAP, below), which is a lot — the GISAID link above also provides a graphical representation of site divergence. However, probabilistic tree inference methods (ML, Bayes) can handle moderate to high levels of divergence in the data. On the other hand, they also need a certain amount of data to perform well (see also: Inferring a tree with 12000 [or more] virus genomes). So, for my experiment, I hacked the matrix into nine bits of equal size, ie. each submatrix has a bit more than 3,000 nucleotides, providing between 615 (bit #5) and 1029 (bit #1) DAPs.

    Fig. 2 Nine ML trees with BS support annotated along branches, each based on a ~3000 nucleotide long bit of the genomes (ordered left-right, top-bottom). Purple highlights branches conflicting with the complete genome tree.

    Our nine trees (shown above) are not badly resolved, as most branches get substantial support. But they are not congruent. If we are dealing with recombination, then we might assume that all of these trees do show an actual aspect of the evolutionary history of the genomes. That is, they are all right and wrong, at the same time.

    Moreover, we have high supported clades conflicting with the complete genome tree's (Fig. 1) topology. The signal issues, due to recombination (see Trees and viruses...), did not decrease branch support. That is, 6,000+ DAP is a lot, and recombination only affects a part of the complete genome, possibly quite a small part.

    Non-trivial evolution needs more than trivial graphs

    To depict the reticulate phylogeny of the virus sample, we need to consider the differences seen in the hacked-and-slashed matrix trees. This can easily be illustrated using a network, instead of a set of trees, as shown here.

    Fig. 3 A (strict) consensus network of all nine trees, in which the edge lengths give the sum of the branch lenghts in the tree sample. The gray brackets give the topology of the near-fully resolved complete genome tree.

    The graph above is a phylogenetic network: the competing edge bundles represent the different inferred histories of bits of the genomes. The SARS-CoV-2 lineage seems to be the product of (ancient) recombination, and recombination also played a role in forming the members of the original SARS-CoV group.

    Fig. 4 Pruned consensus network showing only the CoV(-1) lineage exhibiting various levels of recombination within and between clades as defined by the complete genome tree (tree sample sames as in Fig. 3).

    Consensus networks can also be used to summarize the support for alternative splits, as shown next.

    Fig. 5 Sum-support consensus network based on the bit-wise BS analyses (111/112 pseudoreplicates generated per bit). Only splits are shown occurring in at least 20% of all BS replicates, i.e. splits supported by at least two bits, trivial splits are collapsed. Colored splits represent according groups/clades in the full-genome tree (Fig. 1). Inlet: 'splits rose' showing competing splits patterns within Types II and III (cf. according subtrees/-trunks in Fig. 2 and Fig. 4).

    In contrast to the networks before (Figs. 3, 4), generated using the same algorithm*, the BS consensus network in Fig. 5 is not a phylogenetic network. The boxes don't reflect disparate histories of parts of the genomes but the varying support for competing topological alternatives. By summing up the bit-wise BS analyses instead of bootstrapping the entire data (the BS consensus network for the full data, Fig. 1, shows only two boxes), we get a better idea which aspects of the all-genome tree find robust support across the genome.**

    Conclusion

    Sub-dividing an alignment is a really quick way to fish for evidence of recombination, especially when one then uses a consensus network to summarise the resulting partial trees.

    For interpretation, a tree is a very simple, trivial, and hence appealing graph: A is sister to B and so on. Even a child can interpret a tree. Networks are already visually more challenging, but whenever an organism's evolution doesn't follow a tree (as for viruses), we shouldn't use a tree to depict its phylogeny (or reconstruct its evolution).



    Data availability

    The dataset used for our experiment is a taxon subset of the original data set, available via figshare (with a permanent, hence, citable DOI):
    Grimm GW, Morrison D. 2020. Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare Dataset. https://doi.org/10.6084/m9.figshare.12046581 

    References

    Corman VM, Ithete NL, Richards LR, Schoeman MC, Preiser W, Drosten C, Drexlera JF (2014) Rooting the phylogenetic tree of Middle East respiratory syndrome coronavirus by characterization of a conspecific virus from an African Bat. Journal of Virology 88: 11297–11303.



    * SplitsTree includes five options to determine "edge weights" (= edge-lengths) in case of Consensus networks: "median" and "mean" average the branch-lengths in the tree sample; "count", the setting used to generate Support consensus networks, counts how often a certain taxon bipartition (split) is found in the tree sample – an edge length is proportional to the frequency of a split; "sum", used here to generate the first network, summarizes the branch-lengths; and "none" discards both branch-lengths and split frequency.

    ** A split supported only by one of the nine bits, even if unambiguous, ie. present in all 111 (112) per bit BS replicates, will not be represented in the sum-Support consenus network using a cut-off of 20%.

    † The complete set of ML analyses took 20 min on a stand-alone computer; consensus networks are generated in a blink, and take hardly a minute even when using trees with many leaves.