Rooted phylogenetic networks for coronaviruses

In a previous post, Guido constructed trees for coronaviruses in the SARS group to search for evidence of recombination. He also constructed unrooted data-display networks using SplitsTree. Here, we discuss our attempts to construct rooted genealogical phylogenetic networks for the same dataset [6] but with some modifications.

In particular, we deleted some sequences, giving a smaller data set with only 12 taxa. These taxa include, next to SARS-CoV-2 (the virus causing COVID-19) and SARS-CoV (responsible for the SARS epidemic in 2002/2003), the viruses MP789 and PCoV_GX-P1E sampled from Malayan pangolins from two different Chinese provinces and several viruses found in different bat species in the horseshoe bat genus (Rhinolophus), all from China.

This research was done by Rosanne Wallin, an MSc student at VU Amsterdam and UvA. Her full thesis as well as all data and results can be found on github.

The first algorithm we applied to this data set was the TreeChild Algorithm [1], a method that takes a number of discordant (rooted, binary) trees as input and finds a rooted network containing each input tree while minimizing the number of reticulate events in the network. To filter out some noise, we contracted some poorly-supported branches and then resolved multifurcations consistently across the trees (using a tool within the TreeChild Algorithm). This gave the network below. Note that the method is restricted to so-called tree-child networks, meaning that certain complex scenarios are excluded (those where a network node has only reticulate children). Also note that this is not necessarily the only optimal tree-child network and that not all topological differences can be distinguished based on the trees [5].
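The tree-child restriction can be made concrete with a small check. The following is only a sketch under stated assumptions, not part of the TreeChild Algorithm [1]: the network is given as a list of (parent, child) edges, a reticulation is any node with more than one parent, and the two toy networks are hypothetical rather than taken from the coronavirus data.

```python
from collections import defaultdict

def is_tree_child(edges):
    """Return True if every internal node has at least one non-reticulate child."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    for node, kids in children.items():
        # A reticulation is a node with more than one parent.
        if all(indegree[kid] > 1 for kid in kids):
            return False
    return True

# Hypothetical toy networks (not the networks shown in the figures).
tree_child_net = [("root", "u"), ("root", "v"), ("u", "a"), ("u", "h"),
                  ("v", "h"), ("v", "b"), ("h", "c")]
# Here node u (and v) has only reticulate children h1 and h2.
not_tree_child = [("root", "u"), ("root", "v"), ("u", "h1"), ("u", "h2"),
                  ("v", "h1"), ("v", "h2"), ("h1", "a"), ("h2", "b")]

print(is_tree_child(tree_child_net))
print(is_tree_child(not_tree_child))
```

In the second toy network, both children of u are reticulations, which is exactly the kind of scenario the restriction rules out.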

Figure 1: Phylogenetic network constructed by the Tree-Child algorithm (blocks_A_len0.01_supp70).

The network shows no reticulation in the SARS-CoV-2 clade (the bottom four taxa) and puts SARS-CoV-2 right next to RaTG13. Furthermore, it shows a reticulation between an ancestor of HKU3-1 and a common ancestor of SARS-CoV-2 and RaTG13 leading to bat-SL-CoVZC45. However, it cannot exactly identify which common ancestor of SARS-CoV-2 and RaTG13 is the parent, leading to multiple branches (in red) leading into this reticulation. All these observations are consistent with previous research [2].

Importantly, we cannot directly conclude that each reticulation corresponds to a recombination event. See Table 2.1 of David’s book [10] for a nice overview of possible causes of reticulation. Nevertheless, based on [2], it does look like at least the reticulation leading to bat-SL-CoVZC45 corresponds to a recombination event.

The second algorithm we applied was TriLoNet [3], which constructs a rooted network directly from sequence data. It is restricted to so-called level-1 networks, meaning that it cannot construct overlapping cycles. This method produced the network below.

Figure 2: Phylogenetic network constructed by TriLoNet.

At first sight, the network may look a bit different from the previous one (Figure 1). However, note that the three observations above also hold for this second network. Moreover, the SARS-CoV-2 clade is identical in both networks. This network contains only one reticulation, which is most likely due to the level-1 restriction.

Nevertheless, we can still use this method to find more putative recombination events. To do so, we simply exclude the recombinant bat-SL-CoVZC45 from the analysis and rerun the algorithm. This gives the following network.

Figure 3: Phylogenetic network constructed by TriLoNet, after omitting bat-SL-CoVZC45.

We have now found a second putative recombination event with Rf1 as recombinant. Note that this is also consistent with the network in Figure 1. On the other hand, also note that the branching order in the SARS-CoV clade (the bottom 7 taxa in Figure 3) has changed a bit. This could mean that more recombination events are present in the SARS-CoV clade, as we also see in Figure 1.

One interesting follow-up question is whether the two (or more) networks produced by TriLoNet can be combined into a single higher-level network, in order to show multiple reticulations simultaneously (see [4] for an algorithm that could be useful).

Another interesting observation from these networks is that there is no sign of recombination involving the pangolin coronaviruses MP789 and PCoV_GX-P1E. It rather looks like these viruses evolved from common ancestors of SARS-CoV-2 and RaTG13, but it is important to note that we cannot exclude a recombination event on the basis of these networks. The relationship between SARS-CoV-2 and pangolin coronaviruses is still being debated in the literature [2,7,8,9].

Some limitations of the algorithms were noticed during this study. Firstly, the depicted networks are purely topological, i.e., the branch lengths do not represent anything. Adapting these algorithms to take branch length information into account could possibly improve their accuracy for this data set since the extant taxa have precise time stamps and for recent divergence events these times can be estimated quite accurately, see [2].

Another limitation is that we had to remove several taxa from the original data set [6] before the TreeChild algorithm could find a solution. By removing taxa, we reduced the number of reticulations needed to display the trees, making the TreeChild algorithm run in reasonable time. We made sure to include a diverse set of taxa (based on their pairwise distances [6]) to represent as much of the subgenus as possible. 
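Beyond saying that pairwise distances were used, the text above does not describe how the diverse subset was chosen. Purely as an illustration of one way such a selection could be made, here is a greedy "farthest-point" sketch. The function name, the toy taxa and the toy distances are all hypothetical, and this should not be read as the procedure actually used.

```python
def select_diverse(taxa, dist, k):
    """Greedy max-min selection of k taxa from pairwise distances (k >= 2).

    dist maps frozenset({a, b}) to the distance between taxa a and b.
    """
    # Start with the two most distant taxa.
    first, second = max(
        ((a, b) for i, a in enumerate(taxa) for b in taxa[i + 1:]),
        key=lambda pair: dist[frozenset(pair)],
    )
    chosen = [first, second]
    while len(chosen) < k:
        remaining = [t for t in taxa if t not in chosen]
        # Add the taxon whose nearest already-chosen taxon is farthest away.
        best = max(remaining,
                   key=lambda t: min(dist[frozenset((t, c))] for c in chosen))
        chosen.append(best)
    return chosen

# Hypothetical toy distances, not values from the data set [6].
toy_taxa = ["A", "B", "C", "D"]
toy_dist = {
    frozenset(("A", "B")): 0.01, frozenset(("A", "C")): 0.20,
    frozenset(("A", "D")): 0.25, frozenset(("B", "C")): 0.21,
    frozenset(("B", "D")): 0.24, frozenset(("C", "D")): 0.05,
}
print(select_diverse(toy_taxa, toy_dist, 3))  # ['A', 'D', 'C']
```

The near-duplicate taxon B is dropped first, which is the behaviour one would want when trimming a data set while keeping its spread.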

Rosanne also tried several other algorithms and taxon selections, and used trees based on genes rather than on fixed-length blocks (which we used above, following Guido’s post); see her thesis on github.

Conclusion
Although rooted phylogenetic network methods are often limited in the number of taxa that can be analysed and/or the complexity of the networks that can be constructed, we have seen that these methods can be useful for constructing hypothetical evolutionary histories. Moreover, although the constructed networks are not identical, we have seen that they share certain key properties, which are also consistent with previous research.  

Rosanne Wallin, Leo van Iersel, Mark Jones, Steven Kelk and Leen Stougie


[1] Leo van Iersel, Remie Janssen, Mark Jones, Yukihiro Murakami and Norbert Zeh. A Practical Fixed-Parameter Algorithm for Constructing Tree-Child Networks from Multiple Binary Trees. arXiv:1907.08474 [cs.DM] (2019).

[2] Maciej F. Boni, Philippe Lemey, Xiaowei Jiang, Tommy Tsan-Yuk Lam, Blair W. Perry, Todd A. Castoe, Andrew Rambaut and David L. Robertson. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol 5, 1408–1417 (2020). https://doi.org/10.1038/s41564-020-0771-4

[3] James Oldman, Taoyang Wu, Leo van Iersel and Vincent Moulton. TriLoNet: Piecing together small networks to reconstruct reticulate evolutionary histories. Molecular Biology and Evolution, 33 (8): 2151-2162 (2016). http://dx.doi.org/10.1093/molbev/msw068 (postprint)

[4] Yukihiro Murakami, Leo van Iersel, Remie Janssen, Mark Jones and Vincent Moulton. Reconstructing Tree-Child Networks from Reticulate-Edge-Deleted Subnetworks. Bulletin of Mathematical Biology, 81(10):3823–3863 (2019).

[5] Fabio Pardi and Celine Scornavacca. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput Biol, 11(4), e1004135 (2015).

[6] Grimm, Guido; Morrison, David (2020): Harvest and phylogenetic network analysis of SARS virus genomes (CoV-1 and CoV-2). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12046581.v3

[7]  Lam, Tommy Tsan-Yuk, Marcus Ho-Hin Shum, Hua-Chen Zhu, Yi-Gang Tong, Xue-Bing Ni, Yun-Shi Liao, Wei Wei, et al. Identifying SARS-CoV-2 Related Coronaviruses in Malayan Pangolins. Nature, 583, 282–285 (2020). https://doi.org/10.1038/s41586-020-2169-0

[8] Wang, Hongru, Lenore Pipes, and Rasmus Nielsen. Synonymous Mutations and the Molecular Evolution of SARS-Cov-2 Origins. [Preprint] Evolutionary Biology, April 21, 2020. https://doi.org/10.1101/2020.04.20.052019

[9] Li, Xiaojun, Elena E. Giorgi, Manukumar Honnayakanahalli Marichannegowda, Brian Foley, Chuan Xiao, Xiang-Peng Kong, Yue Chen, S. Gnanakaran, Bette Korber, and Feng Gao. Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection. Science Advances, Vol. 6, no. 27 (2020). https://doi.org/10.1126/sciadv.abb9153 

[10] David Morrison, Introduction to Phylogenetic Networks. RJR Productions, Uppsala, Sweden (2011). http://www.rjr-productions.org/Networks/index.html


Why don’t people draw evolutionary networks sensibly?


In phylogenetics there are two types of network:
  • those where the network edges have a time direction, whether explicit or implied; and
  • those where the edges are undirected.
The latter networks are among the most valuable tools ever devised for the exploration of multivariate data patterns; and this blog is replete with examples drawn from all fields that produce quantitative data (see the Analyses blog page). The first type of network, however, is the only one that can display hypothesized evolutionary histories — that is, they can truly be called evolutionary networks.

Evolutionary networks have a set of characteristics that are essential in order to successfully display biological histories, such as:
  • no directed cycles, because otherwise one of the descendants would be its own ancestor;
  • time consistency, meaning that reticulations in the network only occur between contemporaries.
The latter requirement is not needed for the history of human artifacts, because the ideas on which those artifacts are based can be recorded, and then not used until much later — ideas can "leap forward" in time. There are a number of examples of this in this blog, as discussed in last week's post (A phylogenetic network outside science).

However, time consistency is pretty much universal in biology (see the post on Time inconsistency in evolutionary networks). Natural hybridization and introgression require two living organisms in order to occur, as does horizontal gene transfer. This is basic biology, at least outside the laboratory.

So, the question posed in this post's title refers to the fact that so many people draw their evolutionary networks in a manner that appears to violate time consistency.

Consider this example (from: Interspecies hybrids play a vital role in evolution. Quanta Magazine):


Note that the reticulation edges (the dashed lines) represent gene transfers by introgression or hybridization, and yet none of them are drawn vertically, as they would need to be in order to be time consistent (since time travels from left to right).

It might be argued that most of these are not all that important in practice, but the one to the left quite definitely matters very much. It shows gene transfer between: (i) an organism that speciated 3.65 million years ago and (ii) an organism that is the descendant of one that speciated 3.47 million years ago. The 180,000 years between those two events are not irrelevant; and they make the claimed gene transfer impossible.
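The argument can be written as a simple interval check. This is a minimal sketch, assuming each lineage is represented by the time span over which it existed (origin and end, in millions of years ago); the function name, the 4.00 Ma origin of the first lineage and the 0.0 Ma end of the second are hypothetical placeholders, and only the 3.65 and 3.47 figures come from the diagram discussed above.

```python
def coexist(branch_a, branch_b):
    """Branches are (origin_ma, end_ma) with origin_ma >= end_ma.

    Two lineages can exchange genes only if their time spans overlap.
    """
    origin_a, end_a = branch_a
    origin_b, end_b = branch_b
    # They overlap if the younger origin is still older than the older end.
    return min(origin_a, origin_b) > max(end_a, end_b)

donor = (4.00, 3.65)      # ended by speciation 3.65 Ma (origin is hypothetical)
recipient = (3.47, 0.0)   # arose by speciation 3.47 Ma (end is hypothetical)

print(coexist(donor, recipient))                     # False
gap_years = round((donor[1] - recipient[0]) * 1_000_000)
print(gap_years)                                     # 180000
```

Whatever origin is assumed for the first lineage, the two spans cannot overlap as long as it ended before the second one began.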

One might think that this is simply the general media misunderstanding the network requirements, but this is not so. The diagram is actually a quite accurate representation of the one from the original scientific publication (from: Genome-wide signatures of complex introgression and adaptive evolution in the big cats. Science Advances 3: e1700299; 2017.):


The network shows the same series of hybridizations / introgressions. However, this time three sets of gene transfers are shown to be time consistent, represented by the horizontal arrows (since time flows from top to bottom). Two of the three diagonal arrows (light blue and orange) could be made time consistent (ie. drawn horizontally), although the authors have chosen not to do so, apparently for artistic reasons. However, the first reticulation cannot be made time consistent, for the reason outlined above.

So, people, please think about what you are drawing, and don't show things that are biologically impossible.

Can we depict the evolution of highly conserved genes, such as the ribosomal RNA genes?


Median networks have been designed to put within-species haplotypes into an explicit evolutionary framework. They are exclusively parsimony-based, but differ from traditional trees by treating operational taxonomic units (OTUs) as both potential tips and ancestors. Ancestors are placed at internal nodes ('medians'). The latter makes them interesting for hypotheses about sequence evolution; but, like all parsimony-based methods, they suffer from high levels of homoplasy, which is a common feature of genetic data sets.
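To illustrate what a 'median' is here: in the usual description of median networks, the median of three aligned sequences is obtained by taking the majority state at each site. The sketch below assumes that simplified picture; the function name, the "?" placeholder for sites where all three sequences differ, and the toy sequences are my own illustrative choices, not output of any program.

```python
from collections import Counter

def median_sequence(s1, s2, s3):
    """Per-site majority of three aligned sequences; '?' where all three differ."""
    if not (len(s1) == len(s2) == len(s3)):
        raise ValueError("sequences must be aligned to the same length")
    median = []
    for column in zip(s1, s2, s3):
        base, count = Counter(column).most_common(1)[0]
        median.append(base if count >= 2 else "?")
    return "".join(median)

# The median coincides with a sampled type: that OTU sits at an internal node.
print(median_sequence("ACGT", "ACGA", "TCGA"))  # ACGA
# The median is not among the samples: an inferred (unsampled) ancestor.
print(median_sequence("AAC", "ATT", "TAT"))     # AAT
```

The first case shows how a sampled OTU can end up as an ancestor; the second shows how an unsampled median is introduced.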

Can we use median networks to better understand evolution far above the species level?

In order to test this, I generated a median network using data on the nuclear-encoded 5.8S rDNA of Fagales. This is a flowering plant (angiosperm) order, which includes well-known trees such as oaks, beeches, chestnuts, walnuts, alder, birch and hazel, but also the enigmatic 'false beech' (Nothofagus s.l., the traditional four subgenera have been elevated to genera by Heenan & Smissen 2013), a Gondwanan element that (for some time) has intrigued biogeographers.

Why I have always loved nrDNA

As a young (phylo-)geneticist, my boss, a geneticist who sequenced genes such as the rRNA genes before PCR made it easy, pointed me to the works of Mark Hershkovitz, Louise Lewis, and Edith Zimmer about evolution of the nuclear-encoded ribosomal RNA genes (nrDNA) in angiosperms. Long pre-dating the era of big data and self-evident, trivial phylogenies (ie. data sets allowing for the inference of a fully resolved, unambiguously supported tree), Hershkovitz and co-workers sought to extract as much information as possible from the best-known gene region available back then (mid-late 90s): the internal transcribed spacers (ITS1, ITS2) of the 35S rDNA, the cistron encoding the genes for the 18S, 5.8S and 25S (or 28S, but not "26S") nuclear ribosomal RNA.
  • Hershkovitz MA, Lewis LA. 1996. Deep-level diagnostic value of the rDNA-ITS region. Molecular Biology and Evolution 13:1276–1295.
  • Hershkovitz MA, Zimmer EA. 1996. Conservation patterns in angiosperm rDNA ITS2 sequences. Nucleic Acids Research 24:2857–2867.
  • Hershkovitz MA, Zimmer EA, Hahn WJ. 1999. Ribosomal DNA sequences and angiosperm systematics. In: Hollingsworth PM, Bateman RM, and Gornall RJ, eds. Molecular Systematics and Plant Evolution. London: Taylor & Francis, pp. 268–326.
The ITS1 and ITS2 are highly divergent, non-coding but transcribed intergenic spacers within the structurally and sequentially much more conserved nrDNA, which distinguishes them from nearly all other non-coding regions. More often than not, their sequences are impossible to align across high-ranking taxa such as families or orders. The brilliance of Hershkovitz et al.'s work was to just go a level-up by identifying shared general sequence patterns, and to put them in an evolutionary context.

Bird's-eye view of the ITS region (consensed for sequence groups) in Fagales including sequences of the two outgroups used in Li et al. 2004 (zoom-in and try to figure out where they are). The position of the ITS(1) cleavage site is indicated, a highly conserved, AT-dominated sequence motif within the ITS1. The "Nothofagus deletion" (Manos 1997), gray area seen in some of the topmost variants in the 5.8S rDNA, is a sequencing/editing artifact (newer sequences all have a complete 5.8S rDNA). Most of these data are more than 15 years old (see references provided at the end of the post) and may include more data artifacts, especially in the length-polymorphic portions. Nonetheless, part of the data were included in the dating studies of Sauquet et al. (2012) and Xing et al. (2014) to compensate for the lack of resolution of the also included plastid regions towards the tips of the Fagales tree (intrafamily and -generic relationships).

Accordingly, in my (open access) Ph.D. thesis you'll find not a few figures depicting the potential evolution of sequence patterns in the ITS1 and ITS2 of maples and the beech trees.

I could probably write a book taking up where Hershkovitz et al. stopped, but this would be: a) very subjective, and b) too complex and marginal for the 21st century. Very few people would read it. We have grown accustomed to simple graphs as metaphors of evolution and, thanks to big data, we have become reluctant to discuss the results ex machina. Also, I would have needed a score of students to pursue all the avenues that I glimpsed into; e.g. the following pic:

Evolution of the 5'-end of the ITS1 in basal eudicots (looking at divergences that happened, at least, 100 myrs ago).

The other way around

If the more conserved sequence patterns within the ITS1 and ITS2 can be informative about evolution at a much higher level (which they are), the next question is: what can we learn from the sequence patterns in the highly-conserved portions of the rDNA linked with the ITS1 and ITS2? Historic-genetically, the ITS1 is fundamentally different from the ITS2. The former, ITS1, is an intergenic spacer, which has no secondary structure (although you can find reconstructions in the literature) as it is split into two parts right after transcription (the ITS1 cleavage site is quite conserved, and a main topic in the papers by Hershkovitz and Zimmer). The latter, ITS2, has been evolutionarily derived from the first variable portion of the large ribosomal subunit (LSU), the 25S (28S) rDNA. In primitive organisms, there is hence no 5.8S rDNA and ITS2.

This geno-evolutionary history is also the reason for the structural linkage between the 5.8S rRNA and the 5' end of the 25S (28S) rRNA. Here's a zoom-in on the part that we are interested in.


For better orientation, I have named some of the extremely conserved secondary structure elements of the (mature) 5.8S rRNA. Note that the "Gingerbread Man" structure is very conserved in angiosperm sequences although it only contains three very short stems. The "Pimple" and the "Needle" are so-called hairpins — a strictly complementary stem part is capped by a short, non-complementary tip ('semi-loop'): a 3- and 4-nt long motif, respectively, in Arabidopsis and all Fagales (in some species of Lithocarpus, the tropical 'stone nut' and relative of oaks, the "Needle" has two extra nucleotides).

5.8S rDNA in Fagales

I chose the Fagales because I have worked on them a lot, they are a pretty small group, and except for one "asterisk branch" their inter-family relationships are solved.

Basic signal in Li et al. (2004)'s matrix. Inter-family relationships are, data-wise, fairly trivial, hence, the tree-like Neighbor-net. Only the placement of the Myricaceae with respect to Juglandaceae (now incl. Rhoipteleaceae) and Betulaceae + allies is not unambiguously resolved (see this post).

Oaks have received a lot of attention from population geneticists, like other widespread species or species complexes. Those studies, using Median networks and related methods such as Statistical Parsimony, revealed very complex genetic diversity patterns. On the other hand, the Fagales lineage has been fairly neglected by plant phylogeneticists, although it comprises many of the dominant, ecologically and economically most important trees of the Northern Hemisphere (and the enigmatic Gondwanan Nothofagaceae). The early studies found evidence for deep nuclear-plastid incongruences, but only in recent years have the first (non-comprehensive) complete plastome phylogenies and dated all-Fagales trees surfaced (which do contain one or other common error and misinterpretation of results).

For one family, the southern hemispheric, tropical-subtropical Casuarinaceae, we have no (reliable) ITS data at all; also missing is one of the genera of the Juglandaceae: Engelhardia (s.str.; most data in gene banks labelled as Engelhardia is from Alfaropsis; cf. Manchester 1987 and Manos et al. 2007, but see Zhang et al. 2013).

In total, we find 17 variable sites at and above the genus level in the 5.8S rDNA of Fagales. There are three in the core parts, structurally linked to the 5' 25S rRNA, two in the 'Gingerbread Man', three in the 5' and 3' trails, and the rest are in the 'Needle'.

Unique mutations and mutational trends (arrows) in the 5.8S rDNA in Fagales. Circles highlight the basepairs differing from the reference (Arabidopsis 5.8S rRNA). Blue, mutations found within more than one major lineage, pink, lineage-conserved (diagnostic) mutations; red, mutations restricted to a single genus; green, genetic (syn)apomorphies of the 5.8S rDNA of Fagales. Be = Betulaceae; Ju = Juglandaceae; My = Myricaceae; No = Nothofagaceae; Fagaceae include Fagus (Fa, the beech) and the remainder ("Quercaceae": Qu), which are genetically substantially distinct from Fagus.

Many mutations are genus-coherent; increased intrageneric variation is found in the 5'-tail and the part encoding the 4(6)-nt long 'semi-loop' sequence of the "Needle" (pos. 120–142 in the rRNA of Arabidopsis thaliana):

A (near-)full Median network for the tip of the 'Needle'. In a few Lithocarpus (a "Quercaceae" genus) the sequence is 6-nt-long, which would result in an elongated hairpin (paired basepairs are underlined). The ATTC is a genetic symplesiomorphy.

Exceptions are Fagus and Quercus, which can show substantial intragenomic ITS divergence, Lithocarpus (the most divergent genus, ITS-wise), and Nothofagus s.l. (between the former subgenera, now genera). In these cases, the intra-(sub)generic variation includes the putatively ancestral nucleotide and/or nucleotide shared with other genera of the family; eg. at pos. 123, all Fagales have a C, Fagus can have either C or T (= Y), and Quercus can show any of the four nucleotides (= N).
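The ambiguity codes used here (Y for C or T, N for any nucleotide) are the standard IUPAC codes. As a small sketch of how a set of observed nucleotides at one site maps to such a code, the following uses the standard IUPAC table; the function name is just illustrative.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus_code(observed):
    """Return the IUPAC code covering all nucleotides observed at one site."""
    return IUPAC[frozenset(base.upper() for base in observed)]

print(consensus_code("CT"))    # Y, as for Fagus at pos. 123
print(consensus_code("ACGT"))  # N, as for Quercus at pos. 123
```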

A Median-network for the 5.8S rDNA

Ambiguities can be detrimental for resolution in standard parsimony implementations. The NETWORK program, for instance, warns that a code of "N" may render the result less reliable, and this applies also to the other ambiguity codes. If we include the intra-generic polymorphisms as ambiguity codes, NETWORK runs for quite a long time: too many solutions are equally parsimonious (for this experiment I used genus-consensus data, being interested in the deep splits).

But when we resolve the intra-generic polymorphisms prior to analysis by treating them as satellite types, ie. assuming the family-shared nucleotide represents the ancestral state within the according lineage, we quickly get the following result:
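The resolution step can be sketched as a simple rule. This is only an illustration of the assumption stated above (the family-shared nucleotide is taken as ancestral); the function name and the choice to return None when the family-shared nucleotide is not among the observed states are my own, and the actual data editing was done by hand rather than with this code.

```python
def resolve_polymorphism(observed, family_base):
    """Pick a single state for a site that is polymorphic within a genus.

    observed:    nucleotides found within the genus at this site
    family_base: nucleotide shared by the other genera of the family
    """
    states = set(observed)
    if len(states) == 1:
        return next(iter(states))   # not polymorphic, nothing to resolve
    if family_base in states:
        return family_base          # family-shared nucleotide taken as ancestral
    return None                     # left unresolved in this sketch

# Position 123 from the text: the other Fagales have C.
print(resolve_polymorphism("CT", "C"))    # Fagus (Y) -> C
print(resolve_polymorphism("ACGT", "C"))  # Quercus (N) -> C
```

The variants that are set aside this way are the 'satellite types': they are treated as derived from the family-shared state rather than coded as ambiguities.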

Edges colored to trace the same mutational step. Bubbles indicate the position of the (basic) 5.8S rDNA genotypes for the genera in each family-level lineage.

This is still not too trivial a graph, but it:
  • provides a framework on which we can develop our evolutionary scenario;
  • visualizes how mutational patterns may be linked;
  • tells us directly how derived (genetically) and unique (isolated) the genera are.
Since the 5.8S rDNA is part of a multi-copy (potentially multi-loci, Ribeiro et al. 2011) gene region, uniqueness gives us an idea about how reduced a lineage is. Bottlenecks will eliminate intra-lineage diversity and unique mutational patterns are more likely to accumulate in a species-poor lineage with small population sizes.

But since it is a vital gene region underlying strong sequential and structural constraints, evolution is not neutral: the graph has little tree-likeness. However, the graph looks like graphs that one expects for fast ancient radiations.

There are more interesting details. For instance, we have no mutation separating consistently the earliest diverging lineages (given the currently accepted root), the Nothofagaceae and the Fagaceae (s.l.) and the remainder of the order (called "higher hamamelids" in classic systematic literature). We also see that the 5.8S rDNA shows the Fagaceae should be monotypic: Fagus is more different from its siblings, the 'Quercaceae', than it is from the first-diverging Nothofagaceae or the common ancestor of the "higher hamamelids". Fagaceae s.str. and 'Quercaceae' are without a doubt sister lineages but this also applies to Betulaceae and Ticodendraceae (differing only by three point mutations), with the Betulaceae being just one point mutation away from its more distant sibling (phylogenetically speaking), the Juglandaceae. Furthermore, for Ticodendron-Betulaceae we can postulate a sequentially unique common ancestor, but we can't do the same for Fagus-'Quercaceae'.

Either the 5.8S rDNA evolved much faster in Fagus than in most other lineages, or Fagus split away from its sisters prior to the radiation of the "higher hamamelids" and shortly after their respective ancestors isolated. This second scenario coincides nicely with recent fossil findings tracing the Fagus lineage back to the late Cretaceous (at least 80 Ma; Grímsson et al. 2016, supplement includes a digression on all-Fagales dating attempts).

Reconstruction of ancestral genepools

Using the split patterns in the network to extract an evolutionary tree could be hazardous, since we are looking at strongly interconnected mutational patterns filtered by selective pressure (maintaining a functional structure) in a gene region that evolves very slowly: some sites can or did accumulate mutations (the 'Needle' and the trails), others can't and did not (the remainder of the 5.8S rDNA) in the Fagales lineage. At least mutations were not fixed over a long evolutionary time: the data includes at least as many variable sites where within a single genus, species or genome, the shared, family-typical nucleotide (or even shared with Arabidopsis, a quite distant relative of Fagales) is occasionally replaced.

But since we know the phylogeny of the Fagales, we can, based on the Median(-joining) network(s), infer the evolution of the 5.8S rDNA (i.e. the rDNA gene pool) over time:

Results of the Median-joining analysis mapped on the currently accepted Fagales tree. Clade-characteristic mutations are highlighted by according colors; black, homoplastic mutations that occurred independently in two lineages, gray, in more than two.

Regarding the 'asterisk branch', the 5.8S rDNA provides few extra clues, unless we want to re-include a third hypothesis: that the Myricaceae are sister to Juglandaceae + Betulaceae and allies. This would be the most fitting explanation for the 5.8S rDNA diversity. It also would explain why they can be either sister to Betulaceae and allies or Juglandaceae. Ancestors, or slower evolving sisters diverging shortly before a radiation, will do such a thing.

In this context, one should point out that unequivocal fossils representing various modern genera of all families are known from the early Paleogene, many pop up in early Eocene (~ 50 Ma) intramontane basins of northwestern North America. The oldest modern genus and a possible living fossil is the first diverging Juglandaceae: Rhoiptelea. Its pollen can be found from the Maastrichtian onwards in North America and elsewhere, and a fossil showing the unique Rhoiptelea-flower and fitting pollen can be found in the late Turonian-Santonian (~90 Ma) of Bohemia (Heřmanová et al. 2011; the authors, however, decided to name it Budvaricarpus and tone down the striking resemblance to modern-day Rhoiptelea).

Of course, since we use network-based approaches, we can conceptualize the 5.8S rDNA sequence patterns and inferred evolution as a subsequent breaking up and sorting of once-shared gene pools:

A 'coral' tree metaphor for the evolution of the 5.8S rDNA in Fagales (using an alternative, one-node-shifted root).

I chose an alternative root because it is the one that makes most sense regarding the fossil-morphological, palaeoclimatological/-vegetation and high-conserved genetic patterns (thinking of the 18S rDNA). The labels are, of course, a gross simplification — it is likely that the all-ancestor was a tropical-subtropical plant as well (the genetically most unique and potentially earliest isolated genera of the 'Quercaceae' are exclusively tropical-subtropical) and Myricaceae, Betulaceae and Juglandoideae can today be found deep into the temperate zone, some even thriving in boreal and polar climates. But posts can afford to trigger discussion.

The vertical axis reflects not only the derivedness of the 5.8S rDNA, but also the potential sequence of divergences back in time. The horizontal axis represents the taxonomic-geographic breadth over time (very roughly, tapering means higher diversity/greater range in the past than today) and towards the tips the genetic within-lineage diversity seen in the ITS1 and ITS2 (in Myricaceae, it would be close to a point, if it were not for one species: Myrica gale, the bog myrtle or sweetgale, beloved in Scotland and Scandinavia – see this Dane's video for how to use it).

Just a curious experiment?

Now, to most readers this post may just be a strange example with little general relevance for phylogenetics. But consider the following.
  1. When we infer deeper phylogenetic relationships, we usually rely on sequence differentiation in coding-gene regions. Like the rRNA genes, the tRNA genes need to fulfill secondary (and tertiary) structural constraints to maintain their vital functions. All other genes code for proteins, which also need to fulfill structural constraints (secondary, tertiary and quaternary structures). Their essential functions rely on keeping a specific amino-acid sequence, which is translated from DNA sequences.
  2. We do this inference under the assumption that molecular evolution is neutral, which, as can be seen in the case of the 5.8S rDNA, is apparently not the case. Mutations that would negatively affect the function of the DNA-transcripts are strongly selected against.
Many of our trees make sense nonetheless, but we should keep a wary eye on all of those branches that draw their support from only one or two gene regions (a common issue of oligo-gene trees like the one by Li et al. 2004), or very few mutations, especially when we are producing an ultrametric tree. How sensible can a divergence age estimate be when the data behind it are four mutations in the monotypic lineage and zero in its more diverse sister clade?



    Cited literature and further reading (with comments).

    ITS studies (some mixed with further data and results that were ignored by all-Fagales dating studies that included the data)
    • Acosta MC, Premoli AC. 2010. Evidence of chloroplast capture in South American Nothofagus (subgenus Nothofagus, Nothofagaceae). Molecular Phylogenetics and Evolution 54:235–242. See also Premoli AC, Mathiasen P, Acosta MC, Ramos VA. 2012. Phylogeographically concordant chloroplast DNA divergence in sympatric Nothofagus s.s. How deep can it be? New Phytologist 193:261–275. — Just two brilliant papers that only leave one question open: is this different in the Australasian genera of the Nothofagaceae?
    • Cannon CH, Manos PS. 2003. Phylogeography of the Southeast Asian stone oaks (Lithocarpus). Journal of Biogeography 30:211–226. — A very well-done paper that still doesn't need to fear comparison with more recent biogeographic papers on Fagales genera that have access to more elaborate inference methods while using much poorer data samples.
    • Denk T, Grimm GW. 2010. The oaks of western Eurasia: traditional classifications and evidence from two nuclear markers. Taxon 59:351–366. — Since this is mine, I should not give myself an assessment. Just some info: it was the sloppiest draft we ever submitted, and it passed the review process rather smoothly. But it used 600+ new ITS and 900+ new 5S-IGS sequences, and although it provided a comprehensive ITS tree (new and all data stored in gene banks), the conclusions relied mostly on networks based on inter-clonal and inter-individual distances and ML bootstrap pseudoreplicate samples. I'm pretty sure it's still hard to find a similar paper.
    • Denk T, Grimm G, Stögerer K, Langer M, Hemleben V. 2002. The evolutionary history of Fagus in western Eurasia: Evidence from genes, morphology and the fossil record. Plant Systematics and Evolution 232:213–236. — My first phylogenetic paper (using only about 100 ITS sequences) and one of my most-cited papers; published only because the editor ignored the opinions of two reviewers.
    • Denk T, Grimm GW, Hemleben V. 2005. Patterns of molecular and morphological differentiation in Fagus: implications for phylogeny. American Journal of Botany 92:1006–1016. — the follow-up paper, including all beech species.
    • Forest F, Bruneau A. 2000. Phylogenetic analysis, organization, and molecular evolution of the non-transcribed spacer of 5S ribosomal RNA genes in Corylus (Betulaceae). International Journal of Plant Sciences 161:793–806. — Likely the reason for the 2005 study by Forest et al., a great paper (especially when compared to other phylogenetic papers published in the same journal back then and much later). The reason why the 5S-IGS has rarely been studied is that it is difficult to handle (usually one needs to clone because of intraindividual length polymorphism). But it provides an unsurpassed resolution at the intrageneric level, matched only in recent years by the accumulation of NGS SNP data.
    • Forest F, Savolainen V, Chase MW, Lupia R, Bruneau A, Crane PR. 2005. Teasing apart molecular- versus fossil-based error estimates when dating phylogenetic trees: a case study in the birch family (Betulaceae). Systematic Botany 30:118–133. — A pivotal, still valid study using ITS and 5S-IGS data, even though the divergence age estimates are probably much too old (an aspect demonstrating the quality of the study: back then, molecular age estimates were usually much too young). Forest and Bruneau published several other papers of equal quality on other plant groups, and I suspect there is an interesting publication story given the author list and the dissemination platform.
    • Grimm GW, Denk T, Hemleben V. 2007. Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5:291–309. — A crazy experiment, but one that, years later, would bring me my first paper in Systematic Biology [PDF] (10-times higher impact factor) because it was the only piece of science providing a way-out for a young researcher in South Africa.
    • Manos PS. 1997. Systematics of Nothofagus (Nothofagaceae) based on rDNA spacer sequences (ITS): taxonomic congruence with morphology and plastid sequences. American Journal of Botany 84:1137–1155. — A typical study for the time, maybe not ground-breaking but opening an interesting path and still the basis for molecular systematics of Nothofagaceae (getting such data in the late 90s was not easy). Interestingly, no-one in Australia or New Zealand ever took the thread up (but see Knapp et al. 2005); the only properly studied genus (then a subgenus) of Nothofagaceae is Nothofagus s.str. (Acosta & Premoli 2010; Premoli et al. 2012).
    • Manos PS, Doyle JJ, Nixon KC. 1999. Phylogeny, biogeography, and processes of molecular differentiation in Quercus subgenus Quercus (Fagaceae). Molecular Phylogenetics and Evolution 12:333–349. [PDF] — The counterpart to the above for oaks, it took nearly two decades to assemble more data on American oaks than used for this study.
    • Manos PS, Stone DE. 2001. Evolution, phylogeny, and systematics of the Juglandaceae. Annals of the Missouri Botanical Garden 88:231–269. — An exemplary paper for two reasons (and despite the fact that it just shows cladograms): 1) it combined morphological and chemotaxonomic data with ITS and plastid data (rbcL-atpB and trnL-trnF intergenic spacer); 2) pretty much got the still accepted tree. Also proof-of-point that, even 20 years ago, studies in low-impact journals were not rarely better than those in high-fly ones. (Note the number of pages; decent research needs space!)
    • Manos PS, Zhou ZK, Cannon CH. 2001. Systematics of Fagaceae: Phylogenetic tests of reproductive trait evolution. International Journal of Plant Sciences 162:1361–1379. — For years to come the basis for Fagaceae systematics.
    • Muir G, Fleming CC, Schlötterer C. 2001. Three divergent rDNA clusters predate the species divergence in Quercus petraea (Matt.) Liebl. and Quercus robur L. Molecular Biology and Evolution 18:112–119. — Only about two species, but setting the scene: ITS evolution in Fagales (and probably any other wind-pollinated tree) can be very complex at the very basic level.
    • Ribeiro T, Loureiro J, Santos C, Morais-Cecílio L. 2011. Evolution of rDNA FISH patterns in the Fagaceae. Tree Genetics and Genomes 7:1113–1122. — A must-read for everyone using ITS data in Fagales.
    Phylogenetic studies at and above family level
    Betulaceae: see Forest et al. (2005) and Grimm & Renner (2013, following section).
    Casuarinaceae: see 'Phylogeny' section on Stevens' Angiosperm Phylogeny Website (never bothered myself with them, since they lack ITS data).
    Fagaceae: see Manos et al. (2001), tree in Denk & Grimm (2010)
    • Oh S-H, Manos PS. 2008. Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear CRABS CLAW sequences. Taxon 57:434–451. — The molecular basis for Fagaceae systematics.
    • Manos PS, Cannon CH, Oh S-H. 2008. Phylogenetic relationships and taxonomic status of the paleoendemic Fagaceae of Western North America: recognition of a new genus, Notholithocarpus. Madroño 55:181–190. The only paper providing a tangible plastid-informed phylogeny.
    Juglandaceae:
    • Manos PS, Soltis PS, Soltis DE, Manchester SR, Oh S-H, Bell CD, Dilcher DL, Stone DS. 2007. Phylogeny of extant and fossil Juglandaceae inferred from the integration of molecular and morphological data sets. Systematic Biology 56:412–430. — I would have used a different set of analyses but the paper (and used data) provides the basis for Juglandaceae phylogenetics and systematics (see Manos & Stone 2001)
    Nothofagaceae: Manos (1997), Knapp et al. (2005, following section).
      Fagales dating studies (naturally including phylogenies)
      • Grimm GW, Renner SS. 2013. Harvesting GenBank for a Betulaceae supermatrix, and a new chronogram for the family. Botanical Journal of the Linnean Society 172:465–477. [PDF] — a little experiment we made and submitted to a respectable but low-impact journal because the results were not really ground-shaking. Exemplifies how I think one should harvest gene banks for dating studies (check out the supplement files), hence, providing a striking contrast to the much more ambitious papers by Xiang et al. (2014) and Xing et al. (2014). In that aspect, possibly a must-read for reviewers and editors of large-scale, harvest papers.
      • Knapp M, Stöckler K, Havell D, Delsuc F, Sebastiani F, Lockhart PJ. 2005. Relaxed molecular clock provides evidence for long-distance dispersal of Nothofagus (Southern Beech). PLoS Biology 3:e14. — A very interesting paper, because it rejects two of the scenarios later tested by Sauquet et al. (2012) and found to produce strange estimates; also, it provides some new sequences of higher quality, none of which was included for the 2012 paper. The author list is quite interesting, too: the last author (GoogleScholar) was the only botanist who challenged tree-thinking from the very start and embraced splits graphs as an alternative to trees. The fourth author wrote a classic paper everyone should have read working with big data: Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361–375.
      • Sauquet H, Ho SY, Gandolfo MA, Jordan GJ, Wilf P, Cantrill DJ, Bayly MJ, Bromham L, Brown GK, Carpenter RJ, Lee DM, Murphy DJ, Sniderman JM, Udovicic F. 2012. Testing the impact of calibration on molecular divergence times using a fossil-rich group: the case of Nothofagus (Fagales). Systematic Biology 61:289–313 — in principle, an interesting idea, unfortunately the instability of dating estimates observed may be mostly due to data artifacts. The authors use unrepresentative, old data (which is puzzling, since the understudied Nothofagaceae grow in Australia, New Zealand and the French New Caledonia, and the authors are from France, Australia and New Zealand) including not a few editing/sequencing artifacts, insufficient sampling and internal signal conflict by combination of low-divergent plastid genes and introns with high-divergent ITS data. The main test compares apples (Nothofagaceae) with pears (the rest of Fagales as sister clade); for details see this draft [PDF], which I put together for applications (the data documentation of Sauquet et al. is exemplary, hence, it was very easy to look into the data basis).
      • Xiang X-G, Wang W, Li R-Q, Lin L, Liu Y, Zhou Z-K, Li Z-Y, Chen Z-D. 2014. Large-scale phylogenetic analyses reveal fagalean diversification promoted by the interplay of diaspores and environments in the Paleogene. Perspectives in Plant Ecology, Evolution and Systematics 16:101–110 — an ambitious experiment, with even more data-related problems than the study of Sauquet et al. While Sauquet et al. used placeholder sequences for each included genus (and dropped some because their data inflicted too much topological ambiguity), Xiang et al. blindly harvested all data of commonly sequenced plastid "barcodes" (rbcL, matK, trnL/LF region, rbcL-atpB spacer) to infer a species-level tree. Outdated, invalid taxa were not corrected for; the used gene sample can show little to no variation below the genus level (which makes dating, and barcoding, impossible). Furthermore, plastid diversification is partly or fully decoupled from speciation processes in the four genera that have been studied using more than a single individual per species (Nothofagus s.str., Fagus, Quercus, Ostryopsis).
      • Xing Y, Onstein RE, Carter RJ, Stadler T, Linder HP. 2014. Fossils and large molecular phylogeny show that the evolution of species richness, generic diversity, and turnover rates are disconnected. Evolution 68:2821–2832 — very similar to the Xiang et al. approach but even more flawed (poor control over used data, poor selection of markers, several problems with the dating approach, which is the basis to estimate the crucial turnover rates). Xiang et al. and Xing et al. show what happens when large-scale meta-analyses are conducted by researchers with no idea about the studied organisms.
      • Zhang J-B, Li R-Q, Xiang X-G, Manchester SR, Lin L, Wang W, Wen J, Chen Z-D. 2013. Integrated fossil and molecular data reveal the biogeographic diversification of the eastern Asian-eastern North American disjunct hickory genus (Carya Nutt.). PLoS ONE 8:e70449. — Focuses on one genus but includes data from all Juglandaceae and gives a typical example for plant biogeographic studies using dated trees (the fourth author is the expert on the fossil record of Juglandaceae, so there are few data issues). It's open access, quite short, give it a read and then try to figure out what is the point of the paper (I looked at the provided data matrix, too, and found quite interesting genetic patterns that completely escaped the authors; it is never wrong to look over your alignment when this is still possible).
      Other cited literature
      • Grímsson F, Grimm GW, Zetter R, Denk T. 2016. Cretaceous and Paleogene Fagaceae from North America and Greenland: evidence for a Late Cretaceous split between Fagus and the remaining Fagaceae. Acta Palaeobotanica 56:247–305.
      • Heenan PB, Smissen RD. 2013. Revised circumscription of Nothofagus and recognition of the segregate genera Fuscospora, Lophozonia, and Trisyngyne (Nothofagaceae). Phytotaxa 146:1–31.
      • Heřmanová Z, Kvaček J, Friis EM. 2011. Budvaricarpus serialis Knobloch & Mai, an unusual new member of the Normapolles complex from the Late Cretaceous of the Czech Republic. International Journal of Plant Sciences 172:285–293.
      • Manchester SR. 1987. The fossil history of the Juglandaceae. St. Louis: Missouri Botanical Garden. [book-like paper]

      Bayesian inference of phylogenetic networks


      Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

      Network from Radice (2012)

      The earliest work on this topic seems to be the thesis of:
      Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
      Apparently, the only part of this work to be published has been:
      Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
      The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

      More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.
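To give a feel for how ILS alone generates gene-tree incongruence, the standard three-taxon result under the multispecies coalescent can be computed directly. This is only a minimal sketch and is not taken from any of the packages mentioned in this post; the function name and the example branch lengths are hypothetical. For a species tree ((A,B),C) with an internal branch of length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)exp(-T), and each of the two discordant topologies occurs with probability (1/3)exp(-T).

```python
import math


def three_taxon_gene_tree_probs(t):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (assumed >= 0).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Probability of each of the two discordant topologies.
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


# Hypothetical branch lengths: short branches give nearly equal frequencies.
for t in (0.1, 1.0, 3.0):
    probs = three_taxon_gene_tree_probs(t)
    print(t, {topology: round(p, 3) for topology, p in probs.items()})
```

Under pure ILS the two discordant topologies are expected to be equally frequent; a marked asymmetry between them is one of the signals that methods modelling horizontal gene flow try to explain.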

      The first of these publications was:
      Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
      The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.

      In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
      Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
      Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
      The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

      Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
      Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
      This method has also been implemented in PhyloNet.

      Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

      Connecting tree and network edges


      I have struggled over the years to try to understand the relationship between trees and networks. In one sense, networks are generalizations of trees, and in another sense a tree is just a simplified network. But it is not always that simple.

      For example, not all networks can be created by adding edges to a tree (see Networks vs augmented trees); so the connection between trees and networks is not always obvious. Moreover, it is not always easy to determine which tree edges are present in any given network, or which network edges are present in a given tree.

      Nevertheless, this should be basic information in phylogenetics — otherwise, how can we know when a tree is adequate for our purposes, or when a network is needed?

      It turns out that I have not been alone in struggling to connect trees and networks. Fortunately, some of these other people decided to actually do something about it, rather than simply struggling on. As a result, a computerized way to relate much of the important information connecting trees with networks now exists.
      Klaus Schliep, Alastair J. Potts, David A. Morrison and Guido W. Grimm
      Intertwining phylogenetic trees and networks.
      Methods in Ecology and Evolution (Early View)
      To quote the authors:
      Here we provide a framework, implemented in the PHANGORN library in R, to transfer information between trees and networks. This includes: (i) identifying and labelling equivalent tree branches and network edges, (ii) transferring tree branch-support to network edges, and (iii) mapping bipartition support from a sample of trees (e.g. from bootstrapping or Bayesian inference) onto network edges.
      These three functions are illustrated in this figure, taken from the paper. It should be self-explanatory to anyone who has tried to relate the edges of trees and networks; but if it is not, then you can read an explanation in the paper.
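The idea behind the third function (mapping bipartition support from a sample of trees onto network edges) can be sketched in a few lines. The following is a toy illustration in Python, not the PHANGORN implementation or its API: each tree is assumed to be given simply as a list of its non-trivial bipartitions, and all names and data are hypothetical.

```python
from collections import Counter


def normalise(side, taxa):
    """Represent a bipartition by the side that excludes a reference taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def split_support(tree_sample, network_splits, taxa):
    """Fraction of trees in the sample that contain each network split."""
    counts = Counter()
    for tree in tree_sample:
        # Each tree contributes at most once to every split it contains.
        counts.update({normalise(s, taxa) for s in tree})
    n_trees = len(tree_sample)
    support = {}
    for s in network_splits:
        key = normalise(s, taxa)
        support[tuple(sorted(key))] = counts[key] / n_trees
    return support


taxa = {"A", "B", "C", "D"}
# Three hypothetical bootstrap trees, each given by its one non-trivial split.
trees = [[{"A", "B"}], [{"A", "B"}], [{"A", "C"}]]
# A hypothetical splits graph that displays two conflicting splits.
network = [{"C", "D"}, {"B", "D"}]
print(split_support(trees, network, taxa))
```

The point of the normalisation step is that AB|CD and CD|AB are the same bipartition, so a network edge and a tree branch are matched whenever they induce the same split of the taxa.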


      The R library referred to, including the source code, along with some examples and vignettes, can be accessed on the PHANGORN CRAN page.

      Note that PHANGORN (originally created by Klaus Schliep) also contains other functions related to estimating phylogenetic trees and networks, using maximum likelihood, maximum parsimony, distance methods and Hadamard conjugation. Specifically, it allows you to: estimate phylogenies, compare trees and models, explore tree space, and visualize phylogenetic trees and split graphs.

      Why are splits graphs still called phylogenetic networks?


      This is an issue that has long concerned me, and which I think causes a lot of confusion among biologists. A phylogenetic tree is usually a clear concept — to a biologist, it is a diagram that displays a hypothesis of evolutionary history. The expectation, then, is that a phylogenetic network does the same thing for reticulate evolutionary histories. However, this is not true of splits graphs; and so there is potential confusion.

      Mathematically, of course, a phylogenetic tree is a directed acyclic line graph. It is usually constructed, in practice, by first producing an undirected graph based on some pattern-analysis procedure, and then nominating one of the nodes or edges as the root (say, by specifying an outgroup). So, the mathematics is not really connected to the biological interpretation. To a mathematician, the tree is a set of nodes connected by directed edges, and the nodes could represent anything at all, as could the edges. It is the biologist who artificially imposes the idea that the nodes represent real historical organisms connected by the flow of evolution — ancestors connected to descendants by evolutionary events.
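The rooting step described above is purely mechanical, which a small sketch makes clear. The code below is a hypothetical illustration (the function name, node labels and example tree are all made up): it takes the undirected edges of an unrooted tree and orients every edge away from a nominated root node, here the node to which the outgroup attaches.

```python
from collections import deque


def root_tree(edges, root):
    """Orient the edges of an undirected tree away from the chosen root."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        parent = queue.popleft()
        # Every unvisited neighbour becomes a child of the current node.
        for child in sorted(neighbours[parent] - seen):
            seen.add(child)
            directed.append((parent, child))
            queue.append(child)
    return directed


# Hypothetical unrooted tree: tips A, B, C and Out; internal nodes x and y.
unrooted = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("Out", "y")]
for parent, child in root_tree(unrooted, "y"):
    print(parent, "->", child)
```

Nothing in the procedure knows about ancestors or descendants; the same code would orient any undirected tree, which is the sense in which the biological interpretation is imposed from outside.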

      A phylogenetic network should logically be a generalization of this idea of a phylogenetic tree, adding the possibility of evolutionary relationships due to gene flow, in addition to the ancestor-descendant relationships. This can be done, but it is only partly done by splits graphs.

      That is, a splits graph generalizes the idea of an undirected line graph (an unrooted tree), but not a directed acyclic graph (a rooted tree). It follows the same logic of using a pattern-analysis procedure to produce an undirected graph, although the graph can have reticulations, and thus is a network rather than necessarily being a bifurcating tree. However, it is not straightforward to specify a root in a way that will turn this into an acyclic graph. So, in general it does not represent a phylogeny.
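Whether a collection of splits needs a network at all can be checked with the classical pairwise compatibility criterion: two splits A|B and C|D fit on the same tree exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty, and a set of splits is tree-like when every pair is compatible. The sketch below illustrates that criterion only; it is not how any particular splits-graph program is implemented, and the function names and example splits are hypothetical.

```python
from itertools import combinations


def compatible(split1, split2, taxa):
    """True if two bipartitions (each given by one side) fit on one tree."""
    taxa = frozenset(taxa)
    a, c = frozenset(split1), frozenset(split2)
    b, d = taxa - a, taxa - c
    # Compatible when at least one of the four intersections is empty.
    return any(not (x & y) for x, y in ((a, c), (a, d), (b, c), (b, d)))


def is_tree_like(splits, taxa):
    """True if every pair of splits is compatible (no reticulation needed)."""
    return all(compatible(s, t, taxa) for s, t in combinations(splits, 2))


taxa = {"A", "B", "C", "D", "E"}
print(is_tree_like([{"A", "B"}, {"A", "B", "C"}], taxa))  # nested splits
print(is_tree_like([{"A", "B"}, {"A", "C"}], taxa))       # conflicting splits
```

In the second example the two splits contradict each other, so a splits graph would draw them as a box of parallel edges rather than as branches of a tree.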

      Indeed, splits graphs are simply one form of multivariate pattern analysis, along with clustering and ordination techniques, which are familiar as data-display methods in phenetics (see Morrison D.A. 2014. Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312). In this sense, it makes no difference whatsoever what the data represent — they can be data used for phylogenetics, or they could be any other form of multivariate data. Indeed, this point is illustrated in many of the posts in this blog, which can be accessed in the Analyses page.

      So, unlike unrooted trees, unrooted splits graphs are not a route to producing a phylogenetic diagram. Mind you, they are a very useful form of multivariate data analysis in their own right, and I value them highly as a form of exploratory data analysis. But that doesn't make them phylogenetic networks in the biological sense.

      So, isn't it about time we stopped calling splits graphs "phylogenetic networks"? They aren't, to a biologist, so why call them that?

      Capturing phylogenetic algorithms for linguistics


      A little over a week ago I was at a workshop "Capturing phylogenetic algorithms for linguistics" at the Lorentz Centre in Leiden (NL). This is, as some of you will recall, the venue that hosted two earlier workshops on phylogenetic networks in 2012 and 2014.

      I had been invited to participate and to give a talk and I chose to discuss the possible relevance of phylogenetic networks (as opposed to phylogenetic trees) for linguistics. (My talk is here). This turned out to be a good choice because, although phylogenetic trees are now a firmly established part of contemporary linguistics, networks are much less prominent. Data-display networks (which visualize incongruence in a data-set, but do not model the genealogical process that gave rise to it) have found their way into some linguistic publications, and a number of the presentations earlier in the week showed various flavours of split networks. However, the idea of constructing "evolutionary" phylogenetic networks - e.g. modeling linguistic analogues of horizontal gene transfer - has not yet gained much traction in the field. In many senses this is not surprising, since tools for constructing evolutionary phylogenetic networks in biology are not yet widely used, either. As in biology, much of the reticence concerning these tools stems from uncertainty about whether models for reticulate evolution are sufficiently mature to be used 'out of the box'.

      As far as this blog is concerned the relevant word in linguistics is 'borrowing'. My layman's interpretation of this is that it denotes the process whereby words or terms are transferred horizontally from one language to another. (Mattis, feel free to correct me...) There were many discussions of how this process can confound the inference of concept and language trees, but other than it being a problem there was not a lot said about how to deal with it methodologically (or model it). One of the issues, I think, is that linguists are nervous about the interface between micro and macro levels of evolution and at what scale of (language) evolution horizontal events could and should be modelled. To cite a biological analogue: if you study populations at the most microscopic level evolution is usually reticulate (because of e.g. meiotic recombination) but at the macro level large parts of mammalian evolution are uncontroversially tree-like. In this sense whether reticulate events are modeled depends on the event itself and the scale of the phylogenetic model concerned.

      Are there analogues of population-genetic phenomena in linguistics, and are they foundations for phenomena observed at the macro level? Is there a risk of over-stating the parallels with biology? One participant told me that, while she felt that there was definitely scope for incorporating analogies of species and gene trees within linguistics - and many of the participants immediately recognized these concepts - comparisons quickly break down at more micro levels of evolution.

      I'm not the right person to comment on this of course, or to answer these questions, but in any case it's clear that linguistics has plenty of scope for continuing the horizontal/vertical discussions that have already been with us for many years in biology...

      Last, but not least: it was a very enjoyable workshop and I'm grateful to the organizers for inviting me!

      Studying gene flow using genomes


      Continuing the recent blog theme of researchers analyzing potentially reticulate relationships without explicitly using networks (Are networks actually used to explore reticulate histories?; Problems with manually constructing networks), there is this just-published paper:
      Nater A, Burri R, Kawakami T, Smeds L, Ellegren H (2015) Resolving evolutionary relationships in closely related species with whole-genome sequencing data. Systematic Biology 64: 1000-1017.
      The authors note:
      Using genetic data to resolve the evolutionary relationships of species is of major interest in evolutionary and systematic biology. However, reconstructing the sequence of speciation events, the so-called species tree, in closely related and potentially hybridizing species is very challenging. Processes such as incomplete lineage sorting and interspecific gene flow result in local gene genealogies that differ in their topology from the species tree, and analyses of few loci with a single sequence per species are likely to produce conflicting or even misleading results ... Although gene tree incongruences caused by ILS are still fully compatible with a strictly bifurcating species tree, gene flow among species requires a more complex representation of evolutionary histories, resembling reticulate networks rather than trees.
      Unfortunately, this is the sole mention of the word "network" in the text.


      The authors addressed the issues of incomplete lineage sorting and interspecific gene flow using whole-genome sequence data from 198 individuals of four flycatcher species, plus two outgroup genomes. They found that, for most genomic regions, none of the 15 possible rooted gene tree topologies appeared consistently at high frequencies — the most frequent gene tree occurred 17.7% of the time, with the second at 14.3% and the third at 10.5%.
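
      The figure of 15 follows from the standard count of rooted, binary, leaf-labelled tree topologies on n taxa, (2n-3)!!, which for the four ingroup species gives 5 x 3 x 1 = 15. A minimal Python sketch of that arithmetic (the function name is ours, not something from the paper):

```python
def count_rooted_binary_topologies(n_taxa: int) -> int:
    """Number of rooted, binary, leaf-labelled topologies on n taxa: (2n-3)!!"""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2n - 3); the product is empty for n <= 2.
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


n_topologies = count_rooted_binary_topologies(4)
# Frequency each topology would have if all were equally common.
uniform_share = 1 / n_topologies
print(n_topologies, f"{uniform_share:.1%}")
```

      If all 15 topologies were equally frequent, each would occur about 6.7% of the time, so the most frequent gene tree at 17.7% is favoured but far from dominant.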

      They investigated this gene-tree diversity using four programs that attempt to resolve a species tree in the context of incomplete lineage sorting and the coalescent: MP-EST, SNAPP, Fastsimcoal2, and ABC. The latter two approaches also allow for post-divergence gene flow. All four methods have limited applicability when applied to 200 genomes, and so in each case only a subset of the data was analyzed or a subset of the possible species trees was tested. All four methods produced the same species tree, which was also the same as the most commonly encountered gene tree.

      Unfortunately, the authors found almost no evidence of gene flow using these methods, although their detailed gene-tree analyses do suggest its existence. This indicates that there are problems with these methods. Perhaps the main problem is that the authors approached their analyses almost exclusively in the context of a species tree rather than a network. There are other methods that one could try, including the one used by researchers studying introgression in archaic hominoids (as discussed in Are networks actually used to explore reticulate histories?).
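
      One widely used test of this kind is the ABBA-BABA test (Patterson's D statistic), which has been applied in studies of archaic introgression; that it is the specific method meant above is our assumption. A minimal sketch, with a hypothetical function name and toy data, for three ingroup taxa with species tree ((P1,P2),P3) plus an outgroup:

```python
from typing import Iterable, Tuple

# Alleles at one biallelic site, ordered as (P1, P2, P3, outgroup).
Site = Tuple[str, str, str, str]


def patterson_d(sites: Iterable[Site]) -> float:
    """D = (nABBA - nBABA) / (nABBA + nBABA), with the outgroup allele taken as 'A'."""
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if p1 == out and p2 == p3 and p1 != p2:
            abba += 1  # P2 and P3 share the non-outgroup allele
        elif p1 == p3 and p2 == out and p1 != p2:
            baba += 1  # P1 and P3 share the non-outgroup allele
    if abba + baba == 0:
        raise ValueError("no informative ABBA/BABA sites")
    return (abba - baba) / (abba + baba)


# Toy data (invented for illustration): three ABBA sites, one BABA site,
# and one uninformative site.
toy_sites = [
    ("A", "G", "G", "A"),
    ("A", "G", "G", "A"),
    ("A", "G", "G", "A"),
    ("G", "A", "G", "A"),
    ("A", "A", "G", "A"),
]
d_value = patterson_d(toy_sites)
print(d_value)
```

      Under incomplete lineage sorting alone, the two patterns are expected to be equally frequent (D near 0), whereas an excess of ABBA sites suggests gene flow between P2 and P3. A real analysis would also need a significance test, such as a block jackknife, which this sketch omits.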

      In addition, the authors seem to be unclear about what they consider a species to be. For example, they note that "gene flow among lineages in the species tree can confound the true order of speciation events", which seems to preclude use of the biological species concept. Furthermore, they note that "lack of species monophyly is common in this study system", which seems to preclude the phylogenetic species concept. What then constitutes speciation?

      Finally, the authors seem to share a common misconception about ancestral character states. Their approach includes this statement: "If both outgroup individuals were monomorphic for the same allele, this allele was considered ancestral." This argument has been repeatedly rejected in the literature. See, for example, Crisp MD, Cook LG (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.