The pedigree of grape varieties


We are all familiar with the concept of a family tree (formally called a pedigree). People have been compiling them for at least a thousand years, as the first known illustration is from c.1000 CE (see the post on The first royal pedigree). However, these are not really tree-like, in spite of their name, unless we exclude most of the ancestors from the diagram. After all, family histories consist of males and females inter-breeding in a network of relationships, and this cannot be represented as a simple tree-like diagram without leaving out most of the people. I have written blog posts about quite a few famous people who have really quite complex and non-tree-like family histories (including Cleopatra, Tutankhamun, Charles II of Spain, Charles Darwin, Henri Toulouse-Lautrec, and Albert Einstein).

A history of disease within an Amish community

Clearly, the history of domesticated organisms is even more complex than that of humans. After all, in most cases we have gone to a great deal of trouble to make these histories complex, by deliberately cross-breeding current varieties (of plants) and breeds (of animals) to make new ones. So, I have previously raised the question: Are phylogenetic trees useful for domesticated organisms? The answer is the same: no, unless you leave out most of the ancestry.

In most cases, we have no recorded history for domesticated organisms, because most of the breeding and propagating was undocumented. Until recently, it was effectively impossible to reconstruct the pedigrees. This has changed with modern access to genetic information; and there is now quite a cottage industry within biology, trying to work out how we got our current varieties of cats, dogs, cows and horses, as well as wheat, rye and grapes, etc. I have previously looked at some of these histories, including Complex hybridizations in wheat, and Complex hybridizations in barley and its relatives.

Grapes

One example of particular interest has been grape varieties. I have discussed some of the issues in a previous post: Grape genealogies are networks, not trees, including the effects of unsampled ancestors when trying to perform the reconstruction.

There are a number of places around the web where you can see heavily edited summaries of what is currently known about the grape pedigree. However, these simplifications defeat the purpose of this blog post, which is to emphasize the historical complexity. The only diagram that I know of that shows you the full network (as currently known) is one provided by Pop Chart (The Genealogy of Wine), a commercial group who provide infographic posters for just about anything. They will sell you a full-sized poster of the pedigree (3' by 2'), but here I have provided a simple overview (which you can click on to see somewhat larger).

Grape variety genealogy from Pop Chart

You can actually zoom in on the diagram on the Pop Chart web page to see all of the details. This allows you to spend a few happy hours finding your favorite varieties, and to see how they are related. You will presumably get lost among the maze of lines, as I did.

The curiously converted logic of phylogenetics


Phylogenetic analysis involves describing patterns, not studying processes. That is, we cannot conduct a manipulative experiment to study evolutionary history. All we can do is collect naturally occurring data, and then try to detect relevant patterns in it. Thus, in a descriptive study we investigate processes by examining the patterns they produce, not by manipulating the processes themselves, which is what we would do in an experimental study.

Obviously, one of the limitations of this procedure is that the patterns we need may not be in the data we have at hand. It is this limitation that leads some scientists to claim that descriptive studies are not part of science. However, this is not the majority view. [See Mattis' later post, on Patterns, processes, abduction, and consilience]


Equally importantly, there is a logical limitation to descriptive studies, as well, which I have rarely seen mentioned. In the world of logic, propositions cannot be converted; and yet converting propositions is exactly what is done by all descriptive analyses. [The four terms used in logic are defined at the bottom of this post.]

Our initial logic works from process to pattern (if p, then q), but we interpret it the other way around, that a specified pattern must be created by a particular process (if q, then p). Thus:
  • we expect this specific process to produce that particular pattern
  • therefore, when we see that particular pattern we can infer this specific process.
The problem here is the second statement, which is the logical converse of the first statement (the proposition). The inference is illogical, because other processes might also create the same pattern, in which case our inference can be wrong.

The Monty Python comedy team had a go at this in their Logician skit on "The Holy Grail" album (but not in the movie of the same name). Their example concerned a 1950s-60s singer called Alma Cogan, who died in 1966. Their inference was:
  • all of Alma Cogan is dead
  • therefore, all dead people are Alma Cogan.
This is illogical, because there is more to being dead than simply being Alma Cogan — logical propositions can be only partially converted.

The same logical fallacy has also been pointed out in the application of statistics to ecology. Stuart Hurlbert (1990. Spatial distribution of the Montane Unicorn. Oikos 58: 257-271) assessed the use of the Poisson probability distribution as evidence for random spatial distributions of organisms. The inference is:
  • for a Poisson distribution, the variance equals the mean
  • therefore, if the variance equals the mean we can infer a Poisson distribution.
His paper points out many real datasets where the variance equals the mean but the data do not fit a Poisson distribution. He concluded: "Each population showed a different pattern of aggregation and none corresponded to a Poisson distribution. The variance:mean ratio is useless as a measure of departure from randomness, though it is widely recommended as such."
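The failure of the converse here can be shown with a very small calculation. The following Python sketch uses a made-up distribution of counts per quadrat (it is an illustrative assumption, not one of Hurlbert's datasets): half of the quadrats contain no individuals and half contain two. The variance equals the mean, and yet the distribution looks nothing like a Poisson distribution with the same mean.

```python
from fractions import Fraction
from math import exp, factorial

# Hypothetical distribution of counts per quadrat (an assumption for
# illustration only): half the quadrats hold 0 individuals, half hold 2.
distribution = {0: Fraction(1, 2), 2: Fraction(1, 2)}


def mean(dist):
    """Expected count, given a mapping of count -> probability."""
    return sum(k * p for k, p in dist.items())


def variance(dist):
    """Variance of the count, given a mapping of count -> probability."""
    m = mean(dist)
    return sum(p * (k - m) ** 2 for k, p in dist.items())


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)


m = mean(distribution)
v = variance(distribution)
ratio = v / m
print("mean =", m, "variance =", v, "variance:mean =", ratio)

# Compare the made-up distribution with a Poisson that has the same mean.
for k in range(4):
    observed = float(distribution.get(k, 0))
    expected = poisson_pmf(k, float(m))
    print(f"count {k}: observed {observed:.3f}, Poisson {expected:.3f}")
```

The variance:mean ratio is exactly 1, but a Poisson distribution with that mean would put more than a third of the quadrats at a count of one, where the made-up distribution has none.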

These are simply examples of a general problem: we cannot convert a proposition and expect to be right all of the time, or even most of the time. The issue applies to all phylogenetic analyses, whether they involve the assessment of homology, or the construction of trees and networks: we are inferring particular evolutionary processes from the observation of particular patterns in our data. For example, our model of the process of speciation implies a tree model of evolution, and therefore every time we get a "well-supported tree" we treat it as the true phylogeny. This will not work if other processes are occurring, such as hybridization.

I will finish with one specific example from network analysis. The D-statistic is used in the so-called ABBA-BABA test for detecting introgression among taxa (see Networks of admixture or introgression). The logic works from process to pattern (introgression would create a particular gene-tree pattern), but we interpret it the other way around — we see the specified gene pattern and we thereby infer the presence of introgression.
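For readers who have not met the statistic, here is a minimal Python sketch of the pattern side of this test. It assumes the usual four-taxon arrangement (P1, P2, P3, outgroup), with "A" as the ancestral allele and "B" as the derived allele at each site, and the standard form D = (ABBA - BABA) / (ABBA + BABA). The function name and the site counts are made up for illustration, and the significance testing that a real analysis needs is left out.

```python
from collections import Counter


def d_statistic(sites):
    """Compute D from site patterns.

    Each site is a 4-character string giving the allele in P1, P2, P3 and
    the outgroup, in that order ('A' = ancestral, 'B' = derived).
    """
    counts = Counter(sites)
    abba = counts["ABBA"]
    baba = counts["BABA"]
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites, so D is undefined")
    return (abba - baba) / (abba + baba)


# Made-up data: 30 ABBA sites, 10 BABA sites, and 60 sites that match the
# species tree (BBAA), which do not contribute to D.
toy_sites = ["ABBA"] * 30 + ["BABA"] * 10 + ["BBAA"] * 60
d = d_statistic(toy_sites)
print("D =", d)
```

The calculation only summarizes a pattern (an excess of one site pattern over the other). Reading a non-zero D as introgression is the converted inference discussed above, since the code itself says nothing about which process produced the excess.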

This issue of illogic is definitely a limitation of phylogenetic analysis.



The terms of logical analysis:
  • Proposition: if p, then q
  • Inverse: if not p, then not q
  • Converse: if q, then p
  • Contrapositive: if not q, then not p
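
The relationships among these four forms can be checked by brute force. The following Python sketch (the names are illustrative, and "if ..., then ..." is treated as ordinary material implication) builds the truth table for each form over all combinations of p and q.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: 'if a, then b' fails only when a is true and b is false."""
    return (not a) or b


# The four forms, each as a function of the truth values of p and q.
forms = {
    "proposition": lambda p, q: implies(p, q),
    "inverse": lambda p, q: implies(not p, not q),
    "converse": lambda p, q: implies(q, p),
    "contrapositive": lambda p, q: implies(not q, not p),
}


def truth_table(form):
    """Truth values for (p, q) = (T, T), (T, F), (F, T), (F, F), in that order."""
    return tuple(form(p, q) for p, q in product([True, False], repeat=2))


tables = {name: truth_table(form) for name, form in forms.items()}
for name, table in tables.items():
    print(f"{name:15s} {table}")

# Which forms always agree with the original proposition?
same_as_proposition = [n for n, t in tables.items() if t == tables["proposition"]]
print(same_as_proposition)
```

Only the contrapositive has the same truth table as the proposition. The converse agrees with the inverse instead, which is why inferring the process from the pattern is not guaranteed to be right.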

Tattoo Monday XVI — ambitious Darwin trees


Perhaps the most popular tattoo for phylogeneticists has been a small one based on Charles Darwin's best-known sketch from his Notebooks (the "I think" tree) — see Tattoo Monday III, Tattoo Monday VI, Tattoo Monday IX, and Tattoo Monday XII.

However, some people are more ambitious artists than this. Below is a collection of tattoos that incorporate the tree as one element in a much larger Darwin-related picture. You can click on any image to see it at the original size.






A proper network of Europeans


Back in May this year, Iosif Lazaridis submitted a paper to the arXiv, called: "The evolutionary history of human populations in Europe". It is now online as part of the December 2018 issue of Current Opinion in Genetics & Development (53: 21-27).

Its interest for readers of this blog is the one and only figure that the paper contains. It is a genealogical network, showing the obvious — that the human "family tree" has quite a few reticulations, mostly due to introgression (or admixture, as human geneticists like to call it). Here is the figure, along with the legend. Note that not all of the edges in the network have a direction, so that it is not really a directed acyclic graph (see also First-degree relationships and partly directed networks).


A sketch of European evolutionary history based on ancient DNA
Bronze Age Europeans (~4.5-3kya) were a mixture of mainly two proximate sources of ancestry: (i) the Neolithic farmers of ~8-5kya who were themselves variable mixtures of farmers from Anatolia and hunter-gatherers of mainland Europe (WHG), and (ii) Bronze Age steppe migrants of ~5kya who were themselves a mixture of hunter-gatherers of eastern Europe (EHG) and southern populations from the Near East. Thus, we only have to go ~8 thousand years backwards in time to find at least four sources of ancestry for Europeans. But, each of these sources was also admixed: European hunter-gatherers received genetic input from Siberia and ultimately also from archaic Eurasians, and Near Eastern populations interacted in unknown ways with Europe and Siberia and also had ancestry from ‘Basal Eurasians’, a sister group of the main lineage of all other non-African populations. Dates correspond to sampled populations; in the case of a cluster of populations (such as the WHG), they correspond to the earliest attestation of the group.

Which airlines are the best?


Scientists are known to get about a bit. They attend conferences and give workshops, they go on sabbatical, and sometimes they even have holidays. Many of these activities require them to be in other places than their home city; and to get there they often resort to air travel. This makes it of interest to them to know which airlines are considered to be "good". Scientists may not have much choice about which airlines they can choose to fly, depending on where they live, but they can at least try to fly on one of the good ones.


They are not alone in this desire, and so inevitably there are web sites that provide the necessary information. These include AirHelp Airline Worldwide Rankings; but the best-known listing is the annual one from Skytrax, a UK-based consumer aviation agency.

Each year, Skytrax conducts a survey in which "airline customers around the world" vote for the best airline. The survey results are released at the beginning of each year, and they thus refer to the previous year's survey. Skytrax note that "over 275 airlines were featured in the [current] customer survey but we only feature the top 100 listing."

The Skytrax top-100 data currently exist online for the years 2012-2018 inclusive, which cover the years 2011-2017. It can be useful to consider data for multiple years, because some airlines have greatly improved their ranking through time, while others have slipped back. There are 80 airlines with top-100 data for each of the years 2011-2017, and another 45 airlines that have appeared in the top 100 at least once. A few airlines have also merged during these years.

We can explore the multi-year data for the 80 airlines using a network analysis, to visualize the overall pattern. I first calculated the Manhattan distances pairwise between the airlines, and then plotted these using a NeighborNet graph, as shown in the figure below. Airlines that have similar rankings across the years are near each other in the network; the further apart two airlines are in the network, the more their overall rankings differ.
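
For anyone wanting to try the distance step, here is a minimal Python sketch. The airline names and rankings are made up for illustration (they are not the Skytrax data), and the NeighborNet step is not reproduced; the resulting pairwise distances would need to be passed to a program that implements it.

```python
from itertools import combinations

# Hypothetical rankings for each airline across four survey years
# (illustrative values only, not the Skytrax data).
rankings = {
    "Airline A": [1, 2, 1, 3],
    "Airline B": [2, 1, 3, 2],
    "Airline C": [50, 48, 55, 60],
}


def manhattan(x, y):
    """Manhattan distance: the sum of absolute year-by-year rank differences."""
    if len(x) != len(y):
        raise ValueError("both airlines need a ranking for every year")
    return sum(abs(a - b) for a, b in zip(x, y))


# Pairwise distances between all airlines.
distances = {
    (a, b): manhattan(rankings[a], rankings[b])
    for a, b in combinations(rankings, 2)
}

for (a, b), dist in distances.items():
    print(f"{a} vs {b}: {dist}")
```

Airlines A and B swap places near the top and end up close together, while Airline C is far from both, which is the sort of separation that the network then displays.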


As you can see, this is pretty much a linear network, with the best-ranked airlines at the top-right, and then continuing down to the bottom-left. A simple list of the average rankings across the years would be almost as informative. In particular, the top-ranked airlines have remained at the top across the years; and it is only in the middle and especially at the bottom that there has been movement among the rankings (that is, the network broadens out at one spot in the middle and then again at the end).

Note that the top end of the list consists mainly of airlines from the Middle East and Asia. Australia has only two airlines, both of which do well in the network, along with the only one from New Zealand. The presence near the top of both Turkish Airlines and Garuda Indonesia may surprise some people.

You will also note that the US airlines are generally closer to the bottom of the network — they are marked in red in the network. The airlines from China are mostly there, also (except Hainan Airlines). It is not a coincidence that neither of the world's two biggest economies runs a high-quality airline. It seems that the only way to do this is actually to rely on government subsidies, which is how most of the top-ranked airlines are doing it.

Finally, there are few discount airlines that make it into the top 50. Put simply, the economics of running an all-economy-class plane do not allow much in the way of customer service (see How Budget Airlines Work). It is actually the first-class and business-class passengers on any given plane that allow it to take off at all, in terms of making money for the airline — a classic example of the 80/20 rule: 80% of the money comes from 20% of the passengers (see The Economics of Airline Class).

In a similar vein, you could also contemplate the sites pertaining to airport quality (eg. AirHelp Airport Worldwide Rankings, World Airport Awards), as well as the Guide to Sleeping in Airports. There are also sites that tell you which seats to choose in any given plane (eg. SeatGuru).

Getting the wrong tree when reticulations are ignored


One issue that has long intrigued me is what happens when someone constructs a phylogenetic tree under circumstances where there are reticulate evolutionary events in the actual (ie. true) phylogeny itself. That is, a network is required to accurately represent the phylogeny, but a tree is used as the model, instead. How accurate is the tree?

By this, I mean that, if the phylogeny can be thought of as a "tree with reticulations", do we simply get that tree but miss the reticulations, or do we get a different (ie. wrong) tree?


Sometimes, people refer to this situation as having a "backbone tree" — the phylogeny is basically tree-like, but there are a few extra branches, perhaps representing occasional hybridizations or horizontal gene transfers. The phylogenetic tree can then be treated as a close approximation to the true phylogeny, representing the diversification events but not the (rarer) reticulation events.

I have argued against this approach (2014. Systematic Biology 63: 628-638.). Instead of seeing a network as a generalization of a tree, we should see a tree as a simplification of a network. If we do this, then we would construct a network every time; and sometimes that network would be a tree, because there are no reticulation events in the phylogeny. It cannot work the other way around, because we can never get a network if all we ask for is a tree!

Presumably, if there are no reticulations then we should get the same answer (phylogenetic tree) irrespective of whether we simply construct a tree or instead construct a network that turns out to be a tree. But what about the "backbone tree" situation? Here, it has always seemed to me to be possible that we do not get the same tree. If this is so, then constructing a tree and then adding a few reticulations to it (as is often done in the literature) would not work — we would be adding reticulations to the wrong backbone tree.

There are two possible ways in which we can get the wrong backbone tree: the topology might be incorrect, or the branch-lengths might be incorrect (or both). For example, if there are true reticulations and yet we do not include them in our model, I have argued that the branches will be too short (2014. Systematic Biology 63: 847-849.) — two taxa will be genetically similar because of the reticulation events, but the tree-building algorithm can only make them similar on the tree by shortening the branches (not by adding a reticulation).

Fortunately, for at least one tree-building model, Luay Nakhleh and his group have now done some simulations to answer my questions. You may not yet have noticed their results, because they are not necessarily in the most obvious place; so I will highlight them here. The analyses involve the Multispecies Coalescent (MSC) model, which accounts for incomplete lineage sorting during the tree-like part of evolution, as compared to the Multispecies Network Coalescent (MSNC), which adds reticulations (eg. hybridization) to the model.

1.
Dingqiao Wen, Yun Yu, Matthew W. Hahn, Luay Nakhleh (2016) Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. Molecular Ecology 25: 2361-2372.

This paper compares a tree-based analysis (construct a tree first then add reticulations) with a network-based analysis (construct a network) for an empirical genomic dataset. The two results differ.

2.
Dingqiao Wen, Luay Nakhleh (2018) Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Systematic Biology 67: 439-457.

Tucked away in the Supplementary Information are the results of a set of simulations comparing the MSC (using *Beast) and the MSNC (using PhyloNet), with (section 3) and without (section 2) reticulations. The basic conclusion is that, in the presence of reticulation, tree-based methods either get the tree completely wrong, or they get the tree topology right but the branch lengths are "forced" to be very short. A summary of the latter result is shown in the figure above. In the absence of reticulation, both methods produce the same tree.

3.
R.A. Leo Elworth, Huw A. Ogilvie, Jiafan Zhu, and Luay Nakhleh (ms.) Advances in computational methods for phylogenetic networks in the presence of hybridization. (chapter for a forthcoming book)

A summary of the group's work to date. Section 6.3 summarizes the results from paper 2.

Limitations of the new book about HGT networks


This is a joint post by David Morrison and Ajith Harish.

There has been a flurry of reviewing activity recently about the new book:

The Tangled Tree: a Radical New History of Life
David Quammen. 2018. Simon & Schuster.


This book has received glowing reviews, including:

The book is intended for the general public, rather than for specialists, explaining the "new view" of evolutionary history that includes extensive horizontal gene transfer (HGT), especially in the microbial world. Quammen describes himself as a science, nature and travel writer, so his book is more than just a record of science, and is as much about the people involved as about the scientific theory. In particular, it contains a biography of Carl Woese.

Quammen’s recent New York Times feature article The scientist who scrambled Darwin’s Tree of Life is a very good primer to his book. For us, it indicates that the book has many overlaps with Jan Sapp's earlier book The New Foundations of Evolution: on the Tree of Life (2009. Oxford University Press). The publisher’s advertised selling point of that book is: "This is the first book on (and first history of) microbial evolutionary biology, and that it puts forth a new theory of evolution", with HGT being the new theory. In this sense, the "radical new view" is simply that genetic material can be transferred without sexual reproduction, an idea that goes back rather a long way in history (see The history of HGT), and which is often seen as anti-Darwinian.

Bill Hanage, in his review of Sapp’s book (2010. The trouble with trees. Science 327: 645-646), argues that the book does not put forward a new theory, that the debate is not actually about horizontal gene transfer, and that the Tree of Life is thus far from settled. There are many other interesting points discussed in that review. Furthermore, even after almost 10 years, Hanage’s review of Sapp’s 2009 book can be substituted verbatim as a review of Quammen’s 2018 book! This PDF shows how the book review would read if the author and book names in Hanage’s review were to be substituted [reproduced with the permission of the original author].

The debate allegedly involving HGT is, at heart, about explaining the pattern of extensively mixed genetic material found in the akaryotes. However, simply looking at a pattern does not tell you about the process that created the pattern. In order to study processes, we need a model, in this case a model about how evolution occurs. The "HGT model" is that the Last Universal Common Ancestor (LUCA) of life was a relatively simple organism genetically, and that subsequent evolutionary history has involved complexification of that ancestor, both by diversification and by HGT.

What the two books do not explore is the other major model for the current distribution of genetic material among akaryotes. This alternative scenario is that the LUCA was genetically complex, and that the subsequent evolutionary history involved independent losses of parts of the genetic material — the sporadically shared material is basically coincidental. All that this model requires is that there be evolutionary history prior to the LUCA, during which it became a complex organism from its simple beginnings — the LUCA is merely as far back as we can see into the past, with the prior history being unrecoverable by us (ie. we cannot see past the LUCA bottleneck).

Over the past couple of decades, a number of papers have explored the evidence for the latter idea, from both the RNA and protein perspectives, including:
  • Anthony Poole, Daniel Jeffares, David Penny (1999) Early evolution: prokaryotes, the new kids on the block. BioEssays 21: 880-889.
  • Christos A. Ouzounis, Victor Kunin, Nikos Darzentas, Leon Goldovsky (2006) A minimal estimate for the gene content of the last universal common ancestor — exobiology from a terrestrial perspective. Research in Microbiology 157: 57-68.
  • Miklós Csűrös, István Miklós (2009) Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Molecular Biology and Evolution 26: 2087-2095.
  • Kyung Mo Kim, Gustavo Caetano-Anollés (2011) The proteomic complexity and rise of the primordial ancestor of diversified life. BMC Evolutionary Biology 11: 140.
  • Ajith Harish, Charles G. Kurland (2017) Akaryotes and Eukaryotes are independent descendants of a universal common ancestor. Biochimie 138: 168-183.
Finally, even from the perspective of phylogenetic networks, Quammen's book is very one-sided. In particular, the other processes that lead to reticulate evolution (eg. introgression and hybridization) are pretty much ignored. That is, the focus is on akaryotes not eukaryotes. The latter are also of phylogenetic interest.

Distinguishability in Phylogenetic Networks, report


We have now completed the workshop, as you can tell from the previous post with some photos. Here is a brief report on what seem to me to be some of the more useful points covered.


We had 10 formal presentations, but we also focused on group discussions for several hours each day. It may be the latter that were the most productive. However, I will briefly summarize the talks first.

I spent my time in the opening talk emphasizing the different viewpoints of network computations, which focus on the patterns that can be detected in the data, and the network users, who are generally more interested in the processes that create those patterns (or are, indeed, absent from the patterns but present in the phylogenetic history, anyway). This highlights the two essential points of the workshop title, that both the patterns and the processes are much harder to untangle for networks than for trees.

Céline Scornavacca then bravely tried to tackle the combined problem, anyway, by trying to produce networks from analyzing the patterns in terms of their processes. The issues immediately become obvious, but she seems to be determined to proceed, regardless. Later in the week, Luay Nakhleh reduced the issue simply to vertical processes (including incomplete lineage sorting but not gene duplication-loss) versus horizontal processes. This creates a tractable problem for parsimony and likelihood, but the current challenge remains the limited number of taxa.

Vincent Moulton, Cécile Ané and Charles Semple dodged the issue by focusing on computations. Charles took on the challenge of trying to create a network version of Neighbor-Joining, which would address the issues of computational speed and taxon sampling, and Vince tackled super-networks, and the conditions required for building networks from a collection of smaller (ie. incomplete) trees. Both topics remain open questions. Cécile, on the other hand, discussed network models for trait evolution, which is important for the use of phylogenetic comparative methods when using networks.

On the user side, the presentations focused on examples, and the issues encountered when dealing with them. James Whitfield and Axel Janke talked about biology (mostly phylogenomics), while Johann-Mattis List talked about linguistics, and Tiago Tresoldi talked about stemmatology. In some ways, historical linguistics seems to be the odd one out, since many of the issues dealt with are somewhat removed from those in the other fields. However, in biology there are actually two options for producing networks: directly from the data or via "gene trees" (trees derived from non-recombining blocks of sequences). For the humanities, much of the current discussion is about the nature of the data, and how to code it for quantitative analysis.

This brings us to the discussions. While some time was spent on trying to establish whether biologists think that there is a difference between lateral gene transfer and horizontal gene transfer, or between incomplete lineage sorting, ancestral polymorphism and deep coalescence, some productive interchanges also occurred. Here is a summary of four of the most important ones.

There was general agreement that there are several barriers to widespread adoption of network analyses in phylogenetics. These include the development of suitable methods (in the face of indistinguishability), but also an understanding of what methods are currently available, what data are required to apply those methods, what taxon sampling is required to benefit from the methods, and how to use the programs that implement those methods.

One popular suggestion was therefore to produce some sort of "cookbook", to address the complexity of producing networks, given that there are many methods and programs. From the users' point of view this would illustrate what network analyses can do, in terms of finding reticulation patterns in the data; and from the computational point of view it would outline what needs to be done to get the programs to work. The consensus idea was to choose two suitable datasets (yet to be determined), and then have each program author provide analyses of them (including any scripts that are needed).

Following on from this latter point, it was agreed that the programs need easy user interfaces, if they are to become more widely used. Here, the word "widely" includes casual users from outside of phylogenetics, who use phylogenies as only one of many tools in their work. So, users will range from those who need nothing more than a "point and click" control panel (which may be >90% of potential users) to those who would benefit from scripting control of the analyses. The interface needs both a front end, to specify the particular analysis, and a back end, to allow exploration of the output.

Another long-discussed issue was how to popularize networks, which is clearly a major topic. A phylogenetic tree is nothing more than one of the possible networks for any given dataset, and yet the focus is often on trees rather than networks.

To this end, it was noted that the current Wikipedia entry on phylogenetic networks is inadequate, especially compared to the corresponding entry for phylogenetic trees. Not only is this entry out of date, it is in a number of ways misleading. In particular, there needs to be a discussion of the fact that, if a network is a "tree with reticulations", then ignoring the reticulations can result in the wrong tree, and the branch lengths may be severely under-estimated. There are challenges to getting Wikipedia entries changed, especially the wholesale re-writing of an entry, but this will be necessary.

Finally, it was noted that Philippe Gambette's Who is Who in Phylogenetic Networks website is extremely useful but is still poorly known, even within the phylogenetic networks community. We had a long discussion about how to enhance this site, to make it a more general-purpose repository of information about phylogenetic networks. This included a more inclusive database, more comprehensive tagging of keywords, enhanced descriptions of those keywords, and ways to keep the database up to date.


Steven Kelk has the notes from the final session, which was a review of what we achieved during the workshop, and which contains the To Do list. Both he and Philippe have the notes about modifications for the Who is Who in Phylogenetic Networks website, which is likely to be the first outcome-project tackled.

Thank you to everybody who participated in the workshop. It seemed to be very productive, with a number of concrete outcomes that will be interesting to review at the next workshop.

Distinguishability in Phylogenetic Networks, photos


Evidence that we were in the Netherlands.



Evidence that we did some work.



Left to right: Steven Kelk, David Morrison, Mike Steel, Philippe Gambette (obscured), Tiago Tresoldi, Claudia Solis-Lemus, Fabio Pardi, Simone Linz, Mark Jones.


Left to right: David Morrison, Cécile Ané, Philippe Gambette (obscured), Katharina Huber, Leen Stougie, Remie Janssen, Yukihiro Murakami, Mattis List, Gereon Kaiping and Charles Semple.


Left to right: David Morrison (obscured), Axel Janke, Steven Kelk, Charles Semple, Claudia Solis-Lemus, Mark Jones (obscured), Fabio Pardi, Leo van Iersel, Simone Linz and Vincent Moulton.


Céline Scornavacca lectures Cécile Ané.


Axel Janke and Leo van Iersel contemplate methods for inferring hybridization.


Philippe Gambette and Guido Grimm.


Mozes Blom and Jim Whitfield.


Mike Steel and Luay Nakhleh.


Luay delivers his Final Message, to Mozes Blom, Cécile Ané, Katharina Huber and Charles Semple.