Limitations of the new book about HGT networks


This is a joint post by David Morrison and Ajith Harish.

There has been a flurry of reviewing activity recently about the new book:

The Tangled Tree: a Radical New History of Life
David Quammen. 2018. Simon & Schuster.


This book has received glowing reviews.

The book is intended for the general public, rather than for specialists, explaining the "new view" of evolutionary history that includes extensive horizontal gene transfer (HGT), especially in the microbial world. Quammen describes himself as a science, nature and travel writer, so his book is more than just a record of science, and is as much about the people involved as about the scientific theory. In particular, it contains a biography of Carl Woese.

Quammen’s recent New York Times feature article The scientist who scrambled Darwin’s Tree of Life is a very good primer to his book. For us, it indicates that the book has many overlaps with Jan Sapp's earlier book The New Foundations of Evolution: on the Tree of Life (2009. Oxford University Press). The publisher’s advertised selling point of that book is: "This is the first book on (and first history of) microbial evolutionary biology, and that it puts forth a new theory of evolution", with HGT being the new theory. In this sense, the "radical new view" is simply that genetic material can be transferred without sexual reproduction, an idea that goes back rather a long way in history (see The history of HGT), and which is often seen as anti-Darwinian.

Bill Hanage, in his review of Sapp’s book (2010. The trouble with trees. Science 327: 645-646), argues that the book does not put forward a new theory, that the debate is not actually about horizontal gene transfer, and that the Tree of Life is thus far from settled. There are many other interesting points discussed in that review. Furthermore, even after almost 10 years, Hanage’s review of Sapp’s 2009 book can be substituted verbatim as a review of Quammen’s 2018 book! This PDF shows how the book review would read if the author and book names in Hanage’s review were to be substituted [reproduced with the permission of the original author].

The debate allegedly involving HGT is, at heart, about explaining the pattern of extensively mixed genetic material found in the akaryotes. However, simply looking at a pattern does not tell you about the process that created the pattern. In order to study processes, we need a model, in this case a model about how evolution occurs. The "HGT model" is that the Last Universal Common Ancestor (LUCA) of life was a relatively simple organism genetically, and that subsequent evolutionary history has involved complexification of that ancestor, both by diversification and by HGT.

What the two books do not explore is the other major model for the current distribution of genetic material among akaryotes. This alternative scenario is that the LUCA was genetically complex, and that the subsequent evolutionary history involved independent losses of parts of the genetic material — the sporadically shared material is basically coincidental. All that this model requires is that there be evolutionary history prior to the LUCA, during which it became a complex organism from its simple beginnings — the LUCA is merely as far back as we can see into the past, with the prior history being unrecoverable by us (ie. we cannot see past the LUCA bottleneck).

Over the past couple of decades, a number of papers have explored the evidence for the latter idea, from both the RNA and protein perspectives, including:
  • Anthony Poole, Daniel Jeffares, David Penny (1999) Early evolution: prokaryotes, the new kids on the block. BioEssays 21: 880-889.
  • Christos A. Ouzounis, Victor Kunin, Nikos Darzentas, Leon Goldovsky (2006) A minimal estimate for the gene content of the last universal common ancestor — exobiology from a terrestrial perspective. Research in Microbiology 157: 57-68.
  • Miklós Csűrös, István Miklós (2009) Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Molecular Biology and Evolution 26: 2087-2095.
  • Kyung Mo Kim, Gustavo Caetano-Anollés (2011) The proteomic complexity and rise of the primordial ancestor of diversified life. BMC Evolutionary Biology 11: 140.
  • Ajith Harish, Charles G. Kurland (2017) Akaryotes and Eukaryotes are independent descendants of a universal common ancestor. Biochimie 138: 168-183.
Finally, even from the perspective of phylogenetic networks, Quammen's book is very one-sided. In particular, the other processes that lead to reticulate evolution (eg. introgression and hybridization) are pretty much ignored. That is, the focus is on akaryotes not eukaryotes. The latter are also of phylogenetic interest.

Bayesian inference of phylogenetic networks


Over the years, a number of methods have been explored for constructing evolutionary networks, starting with parsimony criteria for optimization, and moving on to likelihood-based inference. However, the development of Bayesian methods has been somewhat delayed by the computational complexities involved.

Network from Radice (2012)

The earliest work on this topic seems to be the thesis of:
Rosalba Radice (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.
Apparently, the only part of this work to be published has been:
Rosalba Radice (2012) A Bayesian approach to modelling reticulation events with application to the ribosomal protein gene rps11 of flowering plants. Australian & New Zealand Journal of Statistics 54: 401-426.
The method described requires the prior specification of the species tree (phylogeny), and the position and number of the reticulation events. The algorithm was implemented in the R language.

More recently, methods have been developed that infer phylogenies by using (i) incomplete lineage sorting (ILS) to model gene-tree incongruence arising from vertical inheritance, and (ii) introgression / hybridization to model gene-tree incongruence attributable to horizontal gene flow. ILS has been addressed using the multispecies coalescent.

The first of these publications was:
Dingqiao Wen, Yun Yu, Luay Nakhleh (2016) Bayesian inference of reticulate phylogenies under the multispecies network coalescent. PLoS Genetics 12(5): e1006006. [Correction: 2017 PLoS Genetics 13(2): e1006598]
The method requires the set of gene trees as input, along with the number of reticulations. The algorithm was implemented in the PhyloNet package.

In the past few months, two manuscripts have appeared that try to co-estimate the gene trees and the species network, using the original sequence data (assumed to be without recombination) as input:
Dingqiao Wen, Luay Nakhleh (2017) Co-estimating reticulate phylogenies and gene trees from multi-locus sequence data. bioRxiv 095539. [v.2; v.1: 2016]
Chi Zhang, Huw A Ogilvie, Alexei J Drummond, Tanja Stadler (2017) Bayesian inference of species networks from multilocus sequence data. bioRxiv 124982.
The algorithm for the first method has been implemented in the PhyloNet package, while the second has been implemented in the Beast2 package.

Finally, another manuscript describes a method utilizing data based on single nucleotide polymorphisms (SNPs) and/or amplified fragment length polymorphisms (AFLPs), which thus sidesteps the assumption of no recombination:
Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi Meudt, Luay Nakhleh (2017) Bayesian inference of phylogenetic networks from bi-allelic genetic markers. bioRxiv 143545.
This method has also been implemented in PhyloNet.

Due to the computational complexity of likelihood inference, all of these methods are currently severely restricted in the number of OTUs that can be analyzed, irrespective of whether these involve multiple samples from the same species or not. In this sense, parsimony-based inference or approximate likelihood methods are still useful for constructing evolutionary networks of any size. However, progress is clearly being made to alleviate the computational restrictions.

Why do we need Bayesian phylogenetic information content?


There are many ways to construct a phylogenetic tree, and after we have done so we are usually expected to indicate something about "branch support", such as bootstrap values or Bayesian posterior probabilities. Rarely, however, do people indicate whether there is much tree-like phylogenetic information in their dataset in the first place; it is simply assumed that there must be (fingers crossed, touch wood).

Recently, this latter issue has been addressed for Bayesian analysis by:
Paul O. Lewis, Ming-Hui Chen, Lynn Kuo, Louise A. Lewis, Karolina Fučíková, Suman Neupane, Yu-Bo Wang, Daoyuan Shi. (2016) Estimating Bayesian phylogenetic information content. Systematic Biology 65: 1009-1023.
They develop a methodology for "measuring information about tree topology using marginal posterior distributions of tree topologies", and apply it to two small empirical datasets. That is, we can now work out something about "[substitution] saturation and detecting conflict among data partitions that can negatively affect analyses of concatenated data."

However, we have long been able to do this with data-display phylogenetic networks. More to the point, we can do it in a second or two, without ever constructing a tree. More pedantically, if the network construction produces a tree, then we know there is tree-like phylogenetic information in the dataset; if we get a network then there is little such information. Equally importantly, the network might tell us something about the patterns of non-tree-likeness, which a single-number measurement cannot.
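One way to make "tree-like information" concrete is the four-point condition: a distance matrix fits a tree exactly when, for every set of four taxa, the two largest of the three pairwise sums are equal. The sketch below is not NeighborNet, only a minimal check of that condition; the function names and the two toy matrices are assumptions made for illustration.

```python
from itertools import combinations

def quartet_delta(d, a, b, c, e):
    """Gap between the two largest pairwise sums for one quartet (0 means tree-like)."""
    sums = sorted([d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]])
    return sums[2] - sums[1]

def max_delta(d):
    """Largest quartet gap over all quartets in a symmetric distance matrix."""
    return max(quartet_delta(d, *q) for q in combinations(range(len(d)), 4))

# Hypothetical distances for the tree ((A,B),(C,D)): pendant edges 1, internal edge 2.
tree_like = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Same matrix with the A-D distance shortened, so no single tree fits.
conflicting = [
    [0, 2, 4, 3],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [3, 4, 2, 0],
]

print(max_delta(tree_like), max_delta(conflicting))
```

A gap of zero corresponds to the case where a splits graph would collapse to a tree; a non-zero gap is the kind of conflict that shows up as boxes in a network.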

Let's take the first empirical dataset, as described by the authors:
The five sequences of rps11 composing the data set BLOODROOT [three taxa from the angiosperm family Papaveraceae and two monocots] ... were chosen because they represent a case in which horizontal transfer of half of the gene results in different true tree topologies for the 5′ (219 nucleotide sites) and 3′ (237 nucleotide sites) subsets, which allows investigation of information content estimation in the presence of true conflicting phylogenetic signal. We analyzed each half of the data separately and measured phylogenetic dissonance, which is expected to be high in this case.
Here is the NeighborNet based on uncorrected distances. The idea that there is something non-tree-like about Sanguinaria seems hard to avoid. Indeed, the network pattern makes recombination an obvious first choice, with part of the sequence matching the Papaveraceae (on the left) and part matching the monocots (on the right). This recombination may be due to HGT.
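The pattern described here, with one part of a sequence matching one group and the other part matching another, can be illustrated by computing uncorrected distances separately for each half of an alignment. This is a minimal sketch on invented toy sequences, not the BLOODROOT data; the names `donor_a`, `donor_b` and `recomb` are hypothetical.

```python
def p_distance(s, t):
    """Uncorrected distance: proportion of aligned sites that differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(s, t)) / len(s)

# Hypothetical toy alignment: 'recomb' copies donor_a in its first half
# and donor_b in its second half.
donor_a = "AAAAAAAACCCCCCCC"
donor_b = "GGGGGGGGTTTTTTTT"
recomb = "AAAAAAAATTTTTTTT"

half = len(recomb) // 2
donors = {"donor_a": donor_a, "donor_b": donor_b}

whole = {k: p_distance(recomb, v) for k, v in donors.items()}
first = {k: p_distance(recomb[:half], v[:half]) for k, v in donors.items()}
second = {k: p_distance(recomb[half:], v[half:]) for k, v in donors.items()}

print("whole :", whole)
print("first :", first)
print("second:", second)
```

Over the whole sequence the toy recombinant is equidistant from both donors, which is what places a taxon in the middle of a box in a splits graph; the two halves taken separately each point unambiguously to one donor.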


Now for the second dataset:
The data set ALGAE comprises chloroplast psaB sequences from 33 taxa of green algae (phylum Chlorophyta, class Chlorophyceae, order Sphaeropleales) ... The alignments of just the psaB gene ... were chosen because of their deep divergence, which invites hasty judgements of saturation, especially of third codon position sites. We analyzed second and third codon position sites separately ... to assess which subset has more phylogenetic information.
Here are the two NeighborNets based on uncorrected distances. Once again, it is immediately obvious that the third-codon positions have almost no information at all, even for a network, let alone a tree — the terminal branches do not connect in any coherent way. The second-codon positions do have some information, but it is so contradictory that one could not construct a reliable tree. Saturation of nucleotide substitutions is a likely candidate for this situation; and some correction for this saturation would be needed even to construct a reasonable network from these data.
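To see why uncorrected distances stop being informative under saturation, consider the simplest textbook correction, the Jukes-Cantor model. It is used here only as an illustration; nothing above says which correction would suit these data. With p the observed proportion of differing sites, the estimated number of substitutions per site d is:

```latex
d = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,p\right), \qquad 0 \le p < \frac{3}{4}
```

For p = 0.3 this gives d of approximately 0.38, but for p = 0.6 it gives approximately 1.21, and d grows without bound as p approaches 3/4. Near that limit, small sampling differences in p translate into very large differences in d, which is one way of seeing why saturated third-codon positions carry so little usable signal.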

2nd positions:

3rd positions:

Hybridization in the world of duplication-transfer-loss


It seems to me that the study of reticulate evolutionary histories currently boils down to two options:
(1) reconstructing a species "tree" from multiple gene trees using a coalescent model that includes hybridization (either homoploid or polyploid);
(2) reconciling multiple gene trees with a known [sic] species tree using a model that includes gene duplication, loss and transfer (as well as speciation) - a DTL model.

This often leads me to wonder where hybridization fits into option (2) and where gene transfer fits into option (1). They must fit somewhere. For example, Jacox et al. (2016. ecceTERA: comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics 32: 2056-2058) describe their DTL as:
comprehensive as it includes the following evolutionary events: speciation, speciation-loss (speciation followed by a loss of one gene copy), gene duplication, gene loss, gene transfer and transfer-loss (gene transfer with loss of the original gene) between two sampled species, and gene transfer and transfer-loss from/to an unsampled species (i.e. a species that is not represented in the dataset) to/from a sampled one.

If the model is "comprehensive", then hybridization must be included. The only parts of the model that include reticulate histories are gene transfer and transfer-loss, so this is where hybridization must be. Possibly, polyploid hybridization is included in "gene transfer" (an increase in the number of gene copies), and homoploid hybridization is included in "transfer-loss" (maintaining the same number of genes).

This seems to be a simple example of the idea that different types of reticulation events cannot be distinguished from each other. Genomic material moves from one place to another in contemporaneous organisms, either sexually (introgression, hybridization) or asexually (lateral gene transfer). There is nothing intrinsic about gene trees to tell us which mechanism is involved in any given reticulation, other than the relative positions of the donor and recipient in the "species tree" and the possibility of time inconsistency.

This leads to the question of why horizontal gene movement is called "transfer" in one model (2) and "hybridization" in the other (1).

The history of HGT


Because it seems to be an interesting topic, I have written a number of posts about the history of horizontal gene transfer (HGT) in phylogenetics, including:
The first gene transfer (HGT) network (1910)
The first paper on HGT in plants (1971)
HGT networks
The first HGT network
Recently, Nathalie Gontier has produced a comprehensive history of HGT, which makes a major contribution to the field:
N. Gontier (2015) Historical and epistemological perspectives on what horizontal gene transfer mechanisms contribute to our understanding of evolution. In: N. Gontier (ed.) Reticulate Evolution, pp. 121-178. Springer, Switzerland.
In this book chapter, she contemplates why the evidence for HGT was ignored for most of the 20th century:
Many of the mechanisms whereby genes can become transferred laterally have been known from the early twentieth century onward. The temporal discrepancy between the first historical observations of the processes, and the rather recent general acceptance of the documented data, poses an interesting epistemological conundrum: Why have incoming results on HGT been widely neglected by the general evolutionary community and what causes a more favorable reception today? Five reasons are given:
(1) HGT was first observed in the biomedical sciences and these sciences did not endorse an evolutionary epistemic stance because of the ontogeny / phylogeny divide adhered to by the founders of the Modern Synthesis.
(2) Those who did entertain an evolutionary outlook associated research on HGT with a symbiotic epistemic framework.
(3) That HGT occurs across all three domains of life was demonstrated by modern techniques developed in molecular biology, a field that itself awaits full integration into the general evolutionary synthesis.
(4) Molecular phylogenetic studies of prokaryote evolution were originally associated with exobiology and abiogenesis, and both fields developed outside the framework provided by the Modern Synthesis.
(5) Because HGT brings forth a pattern of reticulation, it contrasts the standard idea that evolution occurs solely by natural selection that brings forth a vertical, bifurcating pattern in the “tree” of life.
These are important points, and it is interesting to have so much of the history and epistemology gathered into one place.

Gontier notes:
In prokaryotes, HGT occurs via bacterial transformation, phage-mediated transduction, plasmid transfer via bacterial conjugation, via Gene Transfer Agents (GTAs), or via the movement of transposable elements such as insertion sequences ... In eukaryotes, HGT is mediated by processes such as endosymbiosis, phagocytosis and eating, infectious disease, and hybridization or divergence with gene flow, which facilitates the movement of mobile genetic elements such as transposons and retrotransposons between different organisms.
In this context, knowledge of HGT extends back a long way. Transformation was first observed by Griffith (1928), conjugation was discovered by Lederberg and Tatum (1946), and Freeman (1951) reported on HGT from a bacteriophage. Information about endosymbiosis and phagocytosis extends back even further.

Unfortunately, the history presented is incomplete, because it focuses on microbiology (possibly because the timeline around which the chapter is written "is based upon the timeline provided by the American Society for Microbiology"). The possibility that the asexual transfer of genetic units may be of more general occurrence than just prokaryotes dates back to at least Ravin (1955), who is not mentioned. Thus, for example, the early phylogenetic work of Jones & Sneath (1970) on bacteria is included, but the works of Went (1971) on plants and Benveniste & Todaro (1974) on animals are not referenced. Similarly, the discussion of gene trees versus species trees in bacteria by Hilario and Gogarten (1993) is quoted but not that of Doyle (1992) regarding plants. Thus, there is more history to be written.

The book itself (Reticulate Evolution) is mostly about the broader fields of symbiosis and symbiogenesis, rather than about more specific topics like lateral gene transfer and hybridization.

References

Benveniste RE, Todaro GJ (1974) Evolution of C-type viral genes: inheritance of exogenously acquired viral genes. Nature 252: 456-459.

Doyle JJ (1992) Gene trees and species trees: molecular systematics as one-character taxonomy. Systematic Botany 17: 144-163.

Freeman VJ (1951) Studies on the virulence of bacteriophage-infected strains of Corynebacterium diphtheriae. Journal of Bacteriology 61: 675-688.

Griffith F (1928) The significance of pneumococcal types. Journal of Hygiene 27: 113-159.

Hilario E, Gogarten JP (1993) Horizontal transfer of ATPase genes — the tree of life becomes a net of life. Biosystems 31: 111-119.

Jones D, Sneath PH (1970) Genetic transfer and bacterial taxonomy. Bacteriological Reviews 34: 40-81.

Lederberg J, Tatum EL (1946) Gene recombination in Escherichia coli. Nature 158: 558.

Ravin AW (1955) Infection by viruses and genes. American Scientist 43: 468-478.

Went FW (1971) Parallel evolution. Taxon 20: 197-226.

An unusual genealogy


"Genealogies" produced on the web are frequently no such thing; they are merely timelines. However, the following alleged Genealogy of Automobile Companies seems to really be one, and it has a number of odd characteristics. These characteristics are quite common among manufactured products.


It is described as "A flowing history of more than 100 automobile companies across the complete time span of the automobile industry." Actually, it focuses on companies in the USA, up to 2012. You can zoom in on the details by visiting the original image at HistoryShots InfoArt.

First, note that the genealogy has multiple roots. Second, lineages coalesce forwards through time rather than diverging, so that the lineages become clustered. Moreover, some lineages do not connect to any others. Finally, there is horizontal transfer, because parts of companies get sold to other companies.
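These four features can be read directly off a directed graph of the genealogy. The following is a minimal sketch; the company labels and edges are invented for illustration and are not taken from the infographic.

```python
from collections import defaultdict

def describe_genealogy(nodes, edges):
    """Summarize roots, isolated lineages and merge points in a directed genealogy."""
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
    return {
        # lineages with no ancestor that do lead somewhere
        "roots": sorted(n for n in nodes if indeg[n] == 0 and outdeg[n] > 0),
        # lineages connected to nothing at all
        "isolated": sorted(n for n in nodes if indeg[n] == 0 and outdeg[n] == 0),
        # nodes with more than one incoming lineage (mergers or transfers)
        "merges": sorted(n for n in nodes if indeg[n] > 1),
    }

# Hypothetical companies: A and B merge into AB, D continues as D2 but also
# sells a division to AB (a horizontal transfer), and C never connects to anyone.
nodes = ["A", "B", "AB", "C", "D", "D2"]
edges = [("A", "AB"), ("B", "AB"), ("D", "D2"), ("D", "AB")]

summary = describe_genealogy(nodes, edges)
print(summary)
```

In a conventional phylogeny there would be exactly one root, no isolated nodes, and no merge nodes at all, so any non-trivial entry in this summary marks a departure from a tree.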

There is also a similar Genealogy of US Airlines, and a Genealogy of International Airlines.

Representing macro- and micro-evolution in a network


In biology we often distinguish microevolutionary events, which occur at the population level, from macroevolutionary events, which involve species. We have traditionally treated phylogenetics as a study of macroevolution. However, more recently there has been a trend to include population-level events, such as incomplete lineage sorting and introgression.


This is of particular importance for the resulting display diagrams. A phylogenetic tree was originally conceived to represent macroevolution. For example, speciation and extinction occur as single events at particular times, and these events apply to discrete groups of organisms. The taxa can be represented as distinct lineages in a tree graph, and the events by having these lineages stop or branch in the graph.

This idea is easily extended to phylogenetic networks, where the gene-flow events are also treated as singular, so that hybridization or horizontal gene transfer can be represented as single reticulations among the lineages.

These are sometimes called "pulse" events. However, there are also "press" events that are ongoing. That is, a lot of genetic variation is generated where populations repeatedly mix, so that every gene-flow instance is part of a continuous process of mixing. This often occurs, for example, in the context of isolation by distance, such as ring species or clinal variation. Under these circumstances, processes like introgression and HGT can involve ongoing events.

For instance, in an earlier life I once studied three species of plant in the Sydney region (Morrison DA, McDonald M, Bankoff P, Quirico P, Mackay D. 1994. Reproductive isolation mechanisms among four closely-related species of Conospermum (Proteaceae). Botanical Journal of the Linnean Society 116: 13-31). One of the species was ecologically isolated from the other two (it occurred in dry rather than damp habitats), and the other two were geographically isolated from each other (they occurred on separate sandstone uplands with a large valley in between). These species look very different from each other, as shown in the picture above, but looks are deceiving. Where the ecological isolation was incomplete, introgression occurred and admixed populations could be found.

These dynamics are more difficult to represent in a phylogenetic tree or network. We do not have discrete groups that can be represented by lines on a graph, but instead have fuzzy groups with indistinct boundaries. Furthermore, we do not have discrete events, but instead have ongoing (repeated) processes.

Nevertheless, it seems clear that there is a desire in modern biology to integrate macroevolutionary and microevolutionary dynamics in a single network diagram. That is, some parts of the diagram will represent pulse events involving discrete groups and other parts will represent press events among fuzzy groups. This situation seems to be currently addressed by practitioners by first creating a tree to represent the pulse events (and possibly their times), and then adding imprecisely located dashed lines as a representation of ongoing gene flow — see the example in Producing trees from datasets with gene flow. This particular mixture of precision and imprecision seems rather unsatisfactory.

Perhaps someone might like to have a think about this aspect of phylogenetic networks, to see if there is some way we can do better.

Current methods for evolutionary networks


It has been noted before that we have a wide range of mathematical techniques available for producing data-display networks, most notably the many variants of splits graphs (see Huson & Scornavacca 2011). For example, NeighborNets and Consensus networks are commonly encountered in the phylogenetics literature, and Reduced median networks and Median-joining networks are commonly used for haplotype networks in population biology.

However, there are few techniques used to produce evolutionary networks. Studies of reticulate evolutionary histories, which include recombination networks, hybridization networks, introgression networks and HGT networks, have no unifying theme as yet. So, the biological literature has many papers in which biologists struggle with reticulate evolutionary histories using ad hoc collections of techniques, which often boil down to simply presenting incongruent phylogenetic trees from different datasets (see Morrison 2014a).

So, maybe a brief look at the current state of play with evolutionary networks would be useful. There are enough worthwhile techniques out there for people to be using them more often than they are.

Assumptions

Almost all current phylogenetic methods assume that the basic building unit is a non-recombining sequence block, for which the evolutionary history is strictly tree-like. We tend to call these blocks "genes" and their history "gene trees", but this is just for semantic convenience. In practice, we first collect data for various loci, and we then simply make the assumption that there is recombination between the loci but not within them. This is basically the assumption of independence between loci. At the limit, each nucleotide along a chromosome has a tree-like history, but for aggregations of nucleotides it is all assumptions.

Furthermore, we assume that there are no data errors that will confound any reconstruction of the phylogenetic trees. Possible sources of error include: incorrect data (e.g. contamination), inappropriate sampling (taxa or characters), and model mis-specification. Any of these errors will lead to stochastic variation at best and to bias at worst.

Gene-tree incongruence

Reticulate evolutionary processes lead to gene trees that are not all congruent. However, there are two other processes that have been widely recognized as also producing gene-tree incongruence, but which do not involve reticulation in the strict sense: incomplete lineage sorting (deep coalescence; ancestral polymorphism), and gene duplication-loss.

Many studies have now shown that stochastic variation due to ILS can be very large (see Degnan & Rosenberg 2009), and that this varies in relation to both the population sizes of the taxa and the times between divergence events. The expectation of completely congruent gene trees is thus very naive, even when the evolutionary history of the taxa has been strictly tree-like. A number of methods have been developed to reconstruct species trees in the face of ILS (Nakhleh 2013).
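The dependence on population size and divergence times can be made concrete with the standard three-taxon result from coalescent theory: with one allele sampled per species, the probability that the gene-tree topology matches the species tree is 1 - (2/3)exp(-T), where T is the length of the internal branch in coalescent units. The sketch below assumes a diploid population, so that T = t / (2N); the function name and the example values are illustrative assumptions.

```python
import math

def prob_congruent(t_generations, n_effective):
    """P(gene-tree topology matches the species tree), three taxa, one allele each.

    Assumes a diploid population of constant effective size on the internal branch.
    """
    T = t_generations / (2.0 * n_effective)  # internal branch in coalescent units
    return 1.0 - (2.0 / 3.0) * math.exp(-T)

# Hypothetical scenarios: the same internal branch duration, different population sizes.
for n in (1_000, 10_000, 100_000):
    print(n, round(prob_congruent(20_000, n), 3))
```

As the internal branch shrinks or the population grows, the probability falls towards 1/3, at which point the three possible topologies are equally likely and the gene trees say nothing about the species tree.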

DL involves gene duplication (which can be repeated to create gene families) followed by selective gene loss. The phylogenetic history of the genes is usually presented as an unfolded species tree, where each gene copy has its own part of the tree. A number of methods have been developed to reconstruct gene DL histories given a "known" species tree, which is called gene-tree reconciliation (Szöllősi et al 2015). However, our interest here is in the reverse process, in which reconstructed but incongruent gene trees are combined into a single species tree, given a model of duplication and selective loss, which is called species-tree inference (which is the same as cophylogeny reconstruction; Drinkwater & Charleston 2014).

Reticulations

Known biological processes such as recombination, reassortment, hybridization, introgression and horizontal gene transfer all create reticulate phylogenetic histories. However, it is a moot point as to whether these processes can be distinguished from each other solely in the context of an evolutionary network (Holder et al. 2001; Morrison 2015). These evolutionary processes operate by distinct biological mechanisms, but the evolutionary patterns that they create can all be rather similar. The processes all result in gene flow among contemporaneous organisms (usually called horizontal flow or transfer), whereas other evolutionary processes involve gene flow from parent to offspring (usually called vertical inheritance), including ILS and DL. These gene flows create incongruent gene histories, which we may detect directly in the data or via reconstructed gene trees. The patterns of incongruence do not necessarily allow us to infer the causal process.

There are a number of differences in pattern, but the consistency of these is doubtful. Polyploid hybridization produces the most distinctive pattern, because there is duplication of the genome in the hybrid. However, subsequent aneuploidy will serve to obscure this pattern. Homoploid hybridization nominally involves 50% of the genome coming from different sources, while introgression ultimately involves a smaller percentage. However, in practice, genome mixtures vary continuously from 0 to 50%. HGT also involves a small percentage of the genome, but in theory it also can vary from 0 to 50%. Reassortment produces mixtures of viral genes, which can occur in such a great number that reconstructing the history is severely problematic.

So, in the absence of independent experimental evidence, distinguishing one form of evolutionary network from another is almost a matter of definition. This has become increasingly obvious in the methodological literature, where semantic confusion abounds.

For example, a network produced directly from a set of characters has usually been called a "recombination network", while one produced from a set of trees has usually been called a "hybridization network", irrespective of what processes the gene trees represent. Furthermore, models that add reticulation events to DL trees have usually referred to the horizontal gene flow as "HGT", whereas models that add reticulation events to ILS trees have usually referred to the horizontal gene flow as "hybridization" (Morrison 2014a). Studies of horizontal gene flow during human evolution have usually referred to "admixture", which is a more process-neutral term.

In many, if not most, cases we might all be better off if network methods simply distinguish gene flow among contemporaries (horizontal) from gene inheritance between generations (vertical), rather than trying to infer a process — process inference can often best take place after network construction. This does not help anthropologists, of course, who are dealing with evolutionary networks where oblique gene flow is possible (so that they do not have Time inconsistency in evolutionary networks).

Methods

There seems to be a dichotomy of purposes to current method development, which is neatly summarized by the contrasting theoretical views of Mindell (2013) and Morrison (2014b). These views each recognize that evolutionary history involves both vertical and horizontal processes, but they reconstruct the resulting evolutionary patterns as a species tree and a species network, respectively. Obviously, this blog is dedicated to the latter point of view, but it is the former one (the so-called Tree of Life) that seems to currently dominate the literature.

Focussing on gene-tree inference, Szöllősi et al (2015) provide a comprehensive review of the various models that have been used to describe the dependence between gene trees and species trees. Essentially, gene trees are contained within the species tree, and they may differ from it in relative branch lengths and/or topology. The differences between genes and species are the result of population-level processes, often modeled using the coalescent. These authors recognize four current classes of probabilistic model that combine different evolutionary processes:
  • the DLCoal model, which combines coalescence and DL
  • the DTLSR model and the ODT model, both of which combine gene transfer and DL
  • models that combine hybridization and ILS
  • models of allopolyploidization.
When inferring species trees from gene trees (species-tree inference), we basically combine the scores for all of the gene trees, and then search for the species tree with the best overall score. This involves adding the scores in parsimony analyses, or multiplying the conditional probabilities in likelihood analyses (i.e. in a maximum-likelihood or bayesian context). Many methods have been developed for inferring a species tree based on multi-locus data. These differ in whether the gene and species trees are estimated simultaneously or sequentially, and in how the gene trees are used to infer the species tree. Nakhleh (2013) and Szöllősi et al (2015) discuss both parsimony and likelihood methods for species-tree inference based on either ILS or DL models.
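The score-combining step can be illustrated with a minimal Python sketch. It assumes that the per-gene scores have already been computed for each candidate species tree by some other program; the function names, the candidate trees and all of the numbers are hypothetical, chosen only to show the arithmetic (adding costs for parsimony, multiplying probabilities for likelihood, done here by summing logarithms).

```python
import math

def combined_parsimony_score(per_gene_costs):
    """Parsimony: add the per-gene costs for one candidate species tree."""
    return sum(per_gene_costs)

def combined_log_likelihood(per_gene_probs):
    """Likelihood: multiply the conditional probabilities, via a sum of logs."""
    return sum(math.log(p) for p in per_gene_probs)

def best_species_tree(candidates, score, maximize):
    """Return the candidate whose combined score is best (lowest or highest)."""
    pick = max if maximize else min
    return pick(candidates, key=lambda name: score(candidates[name]))

# Hypothetical per-gene scores for two candidate species trees (three genes each).
parsimony_costs = {"((A,B),C)": [0, 1, 0], "((A,C),B)": [1, 0, 2]}
gene_probs = {"((A,B),C)": [0.6, 0.2, 0.7], "((A,C),B)": [0.2, 0.6, 0.1]}

best_mp = best_species_tree(parsimony_costs, combined_parsimony_score, maximize=False)
best_ml = best_species_tree(gene_probs, combined_log_likelihood, maximize=True)
print(best_mp, best_ml)
```

A real analysis searches a very large space of species trees rather than comparing two fixed candidates, and the per-gene scores come from an explicit ILS or DL model.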

Extending these ideas to infer networks (rather than species trees) is a bit more tricky, and most of the work to date has involved combining hybridization and ILS. There has been no recent summary of the ideas. However, calculating the parsimony score of a network, given a set of gene-tree topologies, has been addressed by Yu et al (2011); and Yu et al (2013a) have extended these ideas to heuristically search the network space for the optimal network (the one that minimizes the number of extra reticulation lineages in a species tree). Furthermore, methods for computing the likelihood of a phylogenetic network, given a set of gene-tree topologies, have been devised by Yu et al (2012, 2013b); and Yu et al (2014) have extended these ideas to heuristically search for the maximum-likelihood network for limited cases of introgression or hybridization (since they differ only in degree).

There are also several methods that simply use gene-tree incongruence to infer reticulation events in a species network (Huson et al. 2010). Basically, these methods combine gene trees into "hybridization networks" by minimizing the number of reticulations required for reconciliation, measured either by counting the reticulations or calculating the network level. The combinatorial optimization can be based on trees, triplets or clusters, using parsimony as the optimality criterion. These methods model homoploid hybridization by assuming that reticulation is the sole cause of all gene-tree incongruence. This means that they are likely to overestimate the amount of reticulation in a dataset when other processes are co-occurring.
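To make the idea of gene-tree incongruence concrete, here is a small Python sketch of a cluster-compatibility check. It is not one of the published hybridization-network algorithms, and it does not compute the minimum number of reticulations; it only detects the conflicts that those methods then try to explain. Representing each gene tree as a list of clusters (sets of taxa below each internal node) is an assumption made for the example, as are the function names and the two toy trees.

```python
def compatible(c1, c2):
    """Two clusters fit on one tree if they are disjoint or one contains the other."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def incompatible_pairs(tree1_clusters, tree2_clusters):
    """List every pair of clusters (one per gene tree) that cannot share a tree."""
    return [(a, b)
            for a in tree1_clusters
            for b in tree2_clusters
            if not compatible(a, b)]

# Two hypothetical rooted gene trees on taxa A-D, given by their non-trivial clusters.
gene_tree_1 = [frozenset("AB"), frozenset("ABC")]   # (((A,B),C),D)
gene_tree_2 = [frozenset("BC"), frozenset("ABC")]   # ((A,(B,C)),D)

conflicts = incompatible_pairs(gene_tree_1, gene_tree_2)
print(len(conflicts))
```

Here the clusters {A,B} and {B,C} overlap without nesting, so no single tree displays both. A method that treats reticulation as the sole cause of incongruence would add a reticulation to reconcile them, even if the conflict actually arose from ILS or DL.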

The most completely developed network methods involve data for allopolyploid hybrids. Here, there are multiple copies of each gene, one in each copy of the genome, so that allopolyploid hybrids have more copies than do their diploid parent taxa. To construct a hybridization network topology, Huber et al (2006) developed a parsimony method based on first estimating a multi-labeled gene tree, and then searching for the single-labeled network that best accommodates the multiple gene patterns. The model has been extended to heuristically include ILS (Marcussen et al 2012), as well as dates for the internal nodes (Marcussen et al 2015). Jones et al. (2013) have also developed models that incorporate ILS in a bayesian context, but only for the case of a single hybridization event between two diploid species (an allotetraploid).

Species-tree inference for a pair of gene phylogenies that may be networks rather than trees has been considered in terms of parsimony by Drinkwater & Charleston (2014).

This brings us to the matter of introgression. The massive recent influx of genome-scale data for hominids has led to the development of methods explicitly for the analysis of what is termed admixture among the lineages. These methods basically work by constructing a phylogenetic tree that includes admixture events, the topology inference being based on allele frequencies. There has been no formal comparison of the methods, and not much application to non-humans. Three such methods have been produced so far (Patterson et al 2012; Pickrell & Pritchard 2012; Lipson et al 2013).
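As an illustration of how allele frequencies can carry a signal of admixture, the following Python sketch computes a naive version of the f3 statistic described by Patterson et al (2012), namely the average over loci of (c - a)(c - b) for a target population C and two reference populations A and B. A clearly negative value suggests that C is a mixture of lineages related to A and B. The sketch uses population frequencies directly and omits the sample-size bias correction and the significance testing that the published method applies; the function name and the frequencies are hypothetical.

```python
def f3(c, a, b):
    """Naive f3(C; A, B): mean over loci of (c - a) * (c - b)."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("all populations need the same number of loci")
    return sum((ci - ai) * (ci - bi) for ai, bi, ci in zip(a, b, c)) / len(c)

# Hypothetical allele frequencies at four loci in two source populations.
freq_a = [0.1, 0.9, 0.2, 0.8]
freq_b = [0.9, 0.1, 0.8, 0.2]
# A toy admixed population: an equal mixture of A and B, with no later drift.
freq_c = [0.5 * x + 0.5 * y for x, y in zip(freq_a, freq_b)]

admixed_signal = f3(freq_c, freq_a, freq_b)   # C as the target population
control_signal = f3(freq_a, freq_b, freq_c)   # A as the target population
print(admixed_signal, control_signal)
```

In this toy case the statistic is negative when the mixed population is the target and positive when a source population is the target. Genetic drift after the admixture event adds a positive term, so real data can hide the signal.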

Recombination has somewhat been the poor cousin to other causes of reticulation, as most network methods assume it to be absent. Nevertheless, Gusfield (2014) has recently provided an ample survey of the study methods available to date.

References

Degnan JH, Rosenberg NA (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution 24: 332-340.

Drinkwater B, Charleston MA (2014) An improved node mapping algorithm for the cophylogeny reconstruction problem. Coevolution 2: 1-17.

Gusfield D (2014) ReCombinatorics: the Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Cambridge.

Holder MT, Anderson JA, Holloway AK (2001) Difficulties in detecting hybridization. Systematic Biology 50: 978-982.

Huber KT, Oxelman B, Lott M, Moulton V (2006) Reconstructing the evolutionary history of polyploids from multilabeled trees. Molecular Biology & Evolution 23: 1784-1791.

Huson D, Rupp R, Scornavacca C (2010) Phylogenetic Networks: Concepts, Algorithms, and Applications. Cambridge University Press, Cambridge.

Huson DH, Scornavacca C (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biology & Evolution 3: 23-35.

Jones G, Sagitov S, Oxelman B (2013) Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Systematic Biology 62: 467-478.

Lipson M, Loh P-R, Levin A, Reich D, Patterson N, Berger B (2013) Efficient moment-based inference of population admixture parameters and sources of gene flow. Molecular Biology & Evolution 30: 1788-1802.

Marcussen T, Heier L, Brysting AK, Oxelman B, Jakobsen KS (2015) From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Systematic Biology 64: 84-101.

Marcussen T, Jakobsen KS, Danihelka J, Ballard HE, Blaxland K, Brysting AK, Oxelman B (2012) Inferring species networks from gene trees in high-polyploid north American and Hawaiian violets (Viola, Violaceae). Systematic Biology 61: 107-126.

Mindell DP (2013) The Tree of Life: metaphor, model, and heuristic device. Systematic Biology 62: 479-489.

Morrison DA (2014a) Phylogenetic networks: a review of methods to display evolutionary history. Annual Research and Review in Biology 4: 1518-1543.

Morrison DA (2014b) Is the Tree of Life the best metaphor, model or heuristic for phylogenetics? Systematic Biology 63: 628-638.

Morrison DA (2015, in press) Pattern recognition in phylogenetics: trees and networks. In: Elloumi M, Iliopoulos CS, Wang JTL, Zomaya AY (eds) Pattern Recognition in Computational Molecular Biology: Techniques and Approaches. Wiley, New York.

Nakhleh L (2013) Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology & Evolution 28: 719-728.

Patterson NJ, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, Reich D (2012) Ancient admixture in human history. Genetics 192: 1065-1093.

Pickrell JK, Pritchard JK (2012) Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics 8: e1002967.

Szöllősi GJ, Tannier E, Daubin V, Boussau B (2015) The inference of gene trees with species trees. Systematic Biology 64: e42-e62.

Yu Y, Barnett RM, Nakhleh L (2013a) Parsimonious inference of hybridization in the presence of incomplete lineage sorting. Systematic Biology 62: 738-751.

Yu Y, Degnan JH, Nakhleh L (2012) The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genetics 8: e1002660.

Yu Y, Dong J, Liu KJ, Nakhleh L (2014) Maximum likelihood inference of reticulate evolutionary histories. Proceedings of the National Academy of Sciences of the USA 111: 16448-16453.

Yu Y, Ristic N, Nakhleh L (2013b) Fast algorithms and heuristics for phylogenomics under ILS and hybridization. BMC Bioinformatics 14: S6.

Yu Y, Than C, Degnan JH, Nakhleh L (2011) Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology 60: 138-149.

The first HGT network


I have published a number of blog posts about early phylogenetics involving horizontal gene transfer (HGT). The historical issue is that all of the early publications about HGT of individual genes were about mechanisms and evidence, rather than about the phylogeny, and so explicit network illustrations were rare. It is therefore difficult to pinpoint the first illustrated network.

For example, if we consider HGT to be a subset of genome transfer (or genome fusion) then the first explicit phylogenetic network illustrating this was by Constantin Mereschkowsky (1910) Theorie der zwei Plasmaarten als Grundlage der Symbiogenese, einer neuen Lehre von der Entstehung der Organismen. Biologisches Centralblatt 30: 278–303, 321–347, 353–367 (see The first gene transfer network). However, HGT is conventionally treated as involving a small collection of genes, not whole genomes.

Alternatively, if we consider unrooted phenograms to represent HGT networks, then the first explicit illustration of relationships based on individual genes was by Dorothy Jones & Peter H. Sneath (1970) Genetic transfer and bacterial taxonomy. Bacteriological Reviews 34: 40-81 (see HGT networks). However, phenetics is not really phylogenetics.

It seems that if we insist upon an illustration showing a rooted phylogenetic network, then we must turn to the paper by Raoul E. Benveniste & George J. Todaro (1974) Evolution of C-type viral genes: inheritance of exogenously acquired viral genes. Nature 252: 456-459. The summary of this paper is:
Genes related to the nucleic acid of an endogenous domestic cat C-type virus (RD114) are found in the cellular DNA of anthropoid primates while many members of the cat family Felidae lack these sequences. Endogenous viruses from primates are thus concluded to have infected and become part of the germ line of an evolutionarily distant group, the ancestors of the domestic cat.
The authors discuss HGT explicitly in the context of a phylogeny:
When the virogenes of two species are more closely related to each other than are the cellular genes, one must suspect horizontal transmission and subsequent perpetuation of the viral genes through the germ line. Figure 3 shows models which could account for the data.

There are three distinct phylogenetic models in this figure, and the third one has three alternative possibilities. The authors conclude that "model cII is most likely." This then appears to be the first HGT network that fits the conventional specifications.

HGT networks


Introgression is the transfer of genetic material from one species to another via sexual reproduction, and this process has been recognized for a long time. If sex is not involved (such as between distantly related organisms) then we usually refer to it as horizontal gene transfer (HGT), and this has only relatively recently come to the general attention of biologists.

During the 1990s, HGT among prokaryotes began to be taken seriously in phylogenetics (Smith et al. 1992; Syvanen 1994), and more than a decade later also in eukaryotes (see Bock 2010; Boto 2010; Renner & Bellot 2012). However, the question still remains as to when it was first considered within phylogenetics, as opposed to other areas of biology.

It seems that the first report of what was probably HGT in prokaryotes is due to Flu (1927), who of course did not recognize it as such. Indeed, Lederberg & Tatum (1946) also apparently observed HGT, but mistakenly attributed it to sexual recombination (in prokaryotes). This emphasizes just how difficult it can be to identify processes from looking at data patterns.

Further observations were reported by Freeman (1951) and Lederberg et al. (1951). Shortly afterwards, experimental work was published concerning mechanisms for the transfer of genetic material between micro-organisms via what we now call transduction (Zinder & Lederberg 1952; Stocker et al. 1953). The effect of this on phylogenetics was soon considered (Stocker 1955), although no diagrams representing reticulation were presented at this time. The focus was still on elucidating the processes rather than illustrating the phylogenies.

It seems that the first people to actually illustrate HGT among species were Jones & Sneath (1970). In their review of HGT, they not only considered the accumulating evidence for the processes, they explicitly illustrated all of the known cases. These were presented as a series of 18 unrooted phenetic diagrams with known HGT connections linking the bacterial taxa. A single example is shown here.


For eukaryotes, the possibility that the asexual transfer of genetic units may be of more general occurrence was considered early on (Ravin 1955). Indeed, Went (1971) presented a strong case for HGT among plants, based on morphological and anatomical data (i.e. phenotypic rather than genotypic evidence). Benveniste & Todaro (1974) then suggested the possibility of exogenously acquired viral genes in mammals. However, it was not until molecular sequencing became available in the 1980s that biologists really started presenting evidence for gene transfer among eukaryotes (Shilo & Weinberg 1981; Singh et al. 1981; Busslinger et al. 1982; Hyldig-Nielson et al. 1982; Engels 1983).

Most of these suggestions turned out to be spurious, once more evidence accumulated (Smith et al. 1992; Syvanen 1994). However, this did not stop Syvanen (1987) from explicitly considering the effect of HGT on the assessment of evolutionary relationships, apparently being the first to do so. Interestingly, he concluded that "horizontal gene flow would not necessarily preclude a linear molecular clock or change the rate of molecular evolution (assuming the neutral allele theory)."

References

Benveniste RE, Todaro GJ (1974) Evolution of C-type viral genes: inheritance of exogenously acquired viral genes. Nature 252: 456-459.

Bock R (2010) The give-and-take of DNA: horizontal gene transfer in plants. Trends in Plant Science 15: 11-22.

Boto L (2010) Horizontal gene transfer in evolution: facts and challenges. Proceedings of the Royal Society of London B: Biological Sciences 277: 819-827.

Busslinger M, Rusconi S, Birnstiel ML (1982) An unusual evolutionary behaviour of a sea urchin histone gene cluster. EMBO Journal 1: 27-33.

Engels WR (1983) The P family of transposable elements in Drosophila. Annual Review of Genetics 17: 315-344.

Flu P-C (1927) Sur la nature du bactériophage. Comptes Rendus Hebdomadaires des Séances et Mémoires de la Société de Biologie 96(1): 1148-1149.

Freeman VJ (1951) Studies on the virulence of bacteriophage-infected strains of Corynebacterium diphtheriae. Journal of Bacteriology 61: 675-688.

Hyldig-Nielson JJ, Jensen EØ, Paludan K, Wiburg O, Garrett R, Jørgensen P, Marcker KA (1982) The primary structures of two leghemoglobin genes from soybean. Nucleic Acids Research 10: 689-701.

Jones D, Sneath PH (1970) Genetic transfer and bacterial taxonomy. Bacteriological Reviews 34: 40-81.

Lederberg J, Lederberg EM, Zinder ND, Lively ER (1951) Recombination analysis of bacterial heredity. Cold Spring Harbor Symposium on Quantitative Biology 16: 413-443.

Lederberg J, Tatum EL (1946) Gene recombination in Escherichia coli. Nature 158: 558.

Ravin AW (1955) Infection by viruses and genes. American Scientist 43: 468-478.

Renner SS, Bellot S (2012) Horizontal gene transfer in eukaryotes: fungi-to-plant and plant-to-plant transfers of organellar DNA. Advances in Photosynthesis and Respiration 35: 223-235.

Shilo BZ, Weinberg RA (1981) DNA sequences homologous to vertebrate oncogenes are conserved in Drosophila melanogaster. Proceedings of the National Academy of Sciences of the USA 78: 6789-6792.

Singh L, Purdom IF, Jones KW (1981) Conserved sex chromosome-associated nucleotide sequences in eukaryotes. Cold Spring Harbor Symposium on Quantitative Biology 45: 805-813.

Smith MW, Feng D-F, Doolittle RF (1992) Evolution by acquisition: the case for horizontal gene transfers. Trends in Biochemical Sciences 17: 489-493.

Stocker BAD (1955) Bacteriophage and bacterial classification. Journal of General Microbiology 12: 375-379.

Stocker BAD, Zinder ND, Lederberg J (1953) Transduction of flagellar characters in Salmonella. Journal of General Microbiology 9: 410-433.

Syvanen M (1987) Molecular clocks and evolutionary relationships: possible distortions due to horizontal gene flow. Journal of Molecular Evolution 26: 16-23.

Syvanen M (1994) Horizontal gene transfer: evidence and possible consequences. Annual Review of Genetics 28: 237-261.

Went FW (1971) Parallel evolution. Taxon 20: 197-226.

Zinder ND, Lederberg J (1952) Genetic exchange in Salmonella. Journal of Bacteriology 64: 679-699.