Next-generation neighbor-nets


Neighbor-nets are a most versatile tool for exploratory data analysis (EDA). Next-generation sequencing (NGS) allows us to tap into an unprecedented wealth of information that can be used for phylogenetics. Hence, it is a natural step to combine the two.

I have been waiting for this (actively-passively), and the time has now come. Getting NGS data has become cheaper and easier, but one still needs considerable resources and fresh material. Hence, NGS papers usually not only use a lot of data but also have many authors. You can now find neighbor-nets based on phylogenomic pairwise distances computed from NGS data — for example, in these two recently published open-access preprints:
  • Pérez Escobar OA, Bogarín D, Schley R, Bateman R, Gerlach G, Harpke D, Brassac J, Fernández-Mazuecos M, Dodsworth S, Hagsater E, Gottschling M, Blattner F. 2018. Resolving relationships in an exceedingly young orchid lineage using genotyping-by-sequencing data. PeerJ Preprints 6:e27296v1.
  • Hipp AL, Manos PS, Hahn M, Avishai M, Bodénès C, Cavender-Bares J, Crowl A, Deng M, Denk T, Fitz-Gibbon S, Gailing O, González Elizondo MS, González Rodríguez A, Grimm GW, Jiang X-L, Kremer A, Lesur I, McVay JD, Plomion C, Rodríguez-Correa H, Schulze E-D, Simeone MC, Sork VL, Valencia Avalos S. 2019. Genomic landscape of the global oak phylogeny. bioRxiv DOI:10.1101/587253.

Example 1: A young species aggregate of orchids

Pérez Escobar et al.'s neighbor-nets are based on uncorrected p-distances inferred from a matrix of 13,000 GBS ("genotyping-by-sequencing") loci (see the short introduction to the method on Wikipedia, or the comprehensive PDF of a talk by researchers at Cornell) covering 29 accessions of six orchid species and subspecies.
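To make the distance side of this concrete, here is a minimal sketch (not the authors' pipeline) of how uncorrected p-distances can be computed from a concatenated SNP/GBS alignment; the sample names and sequences are purely hypothetical, and the resulting matrix is what a neighbor-net program such as SplitsTree would take as input.

```python
# Minimal sketch: uncorrected p-distances from a concatenated SNP alignment,
# ignoring sites where either sequence has a gap or an undetermined base.
import itertools

def p_distance(seq1, seq2, missing="-N?"):
    """Proportion of differing sites among sites scored in both sequences."""
    scored = diffs = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in missing or b in missing:
            continue
        scored += 1
        if a != b:
            diffs += 1
    return diffs / scored if scored else float("nan")

# Toy alignment standing in for a 13,000-locus GBS matrix (hypothetical data).
alignment = {
    "acc1": "ACGTACGTAC",
    "acc2": "ACGTACGTAA",
    "acc3": "ACGAACG-AC",
}

for x, y in itertools.combinations(alignment, 2):
    print(x, y, round(p_distance(alignment[x], alignment[y]), 3))
```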

They also inferred maximum likelihood trees, and did a coalescent analysis to consider possible tree-incompatible signal: gene-tree incongruence due to potential reticulation and incomplete lineage sorting. They applied the neighbor-net to their data because "split graphs are considered more suitable than phylograms or ultrametric trees to represent evolutionary histories that are still subject to reticulation (Rutherford et al., 2018)" – which is true, although neighbor-nets do not explicitly show a reticulate history.

Here's a fused image of the ML trees (their fig. 1) and the corresponding neighbor-nets (their fig. 2):

Not so "phenetic": NGS-data neighbor-nets (NNet) show essentially the same as the ML trees — the distance matrices reflect putative common origin(s) as much as the ML phylograms do. The numbers at branches and edges show bootstrap support under ML and for the NNet, respectively.

Groups resolved as clades in the ML trees (Groups I and III), or as grades or clades (Group II; compare A vs. B and C), form simple neighborhoods (relating to one edge bundle) or more complex ones (defined by two partly compatible edge bundles; Group I in A) in the neighbor-net splits graphs. The evolutionary unfolding (we are looking at closely related biological units) probably did not follow a simple dichotomizing tree; hence the ambiguous branch support (left) and competing edge support (right) for some of the groups. Furthermore, each part of a genome will be more discriminating for some aspects of the coalescent and less for others, which is another source of topological ambiguity (ambiguous BS support) and incompatible signal (as seen in, and handled by, the neighbor-nets). The reconstructions under A, B and C differ in the breadth and gappiness of the included data (all NGS analyses involve data-filtering steps): A includes only loci covered for all taxa, B includes all loci with less than 50% missing data, and C all loci with at least 15% coverage.

PS: I contacted the first author; the paper is still under review (four peers), a revision is (about to be) submitted, and, with a bit of luck, we'll see it in print soon.


Example 2: The oaks of the world

The Hipp et al. neighbor-net (note that I am an author) is based on model-based distances. The reason I opted (here) for model-based distances instead of uncorrected p-distances is the depth of our phylogeny: our data cover splits that go back to the Eocene, yet many of the species found today are relatively young. The dated-tree analyses show substantial shifts in diversification rates in the lineages that are species-rich today, and possibly also in the past (see the lines in the following graph); in those with few species (*, #) we may be looking at the left-overs of ancient radiations.
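A small illustration (not from the paper) of why uncorrected p-distances become problematic at this depth: observed differences saturate as divergence increases, so deep splits get compressed relative to young ones. The sketch below uses the simplest possible correction, the Jukes-Cantor model, purely for demonstration; the actual distances in the study were computed under the same model setup as the ML tree (see the PS at the end of this section).

```python
# Illustration: uncorrected p-distances saturate with increasing divergence;
# a model-based correction (here Jukes-Cantor, the simplest case) restores
# additivity, which matters when old (Eocene) and very young splits are mixed.
import math

def jc_correct(p):
    """Jukes-Cantor distance from an observed proportion of differences p."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.05, 0.15, 0.30, 0.50):
    print(f"p = {p:.2f}  ->  JC distance = {jc_correct(p):.3f}")
# p = 0.05 -> 0.052, but p = 0.50 -> 0.824: the deeper the split, the more an
# uncorrected p-distance understates the true amount of change.
```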

A lineages-through-time (LTT) plot for the oaks (Hipp et al. 2019, fig. 2). Generic diversification probably started in the Eocene, around 50 Ma, and between 10 and 5 Ma parts (usually a single sublineage) of these long-isolated intrageneric lineages (sections) underwent increased speciation.

The data basis is otherwise similar: SNPs (single-nucleotide polymorphisms) generated using a different NGS method, in our case RAD-tagging (RAD-seq), for c. 450 oak individuals covering the entire range of this common tree genus — the most diverse extra-tropical genus of the Northern Hemisphere. There are differences between GBS and RAD-seq SNP data sets — a rule of thumb is that the latter can provide more signal and more SNPs, but the single-locus trees are usually less decisive, which can be a problem for coalescent methods and for tests of reticulation and incomplete lineage sorting that require a lot of single-locus (or single-gene) trees (see the paper for a short introduction and discussion, and further references).

We also inferred an ML tree, and my lead co-authors did the other necessary and fancy analyses. Here, I will focus on the essential information needed to interpret the neighbor-net that we show (and why we included it at all).

Our fig. 6. Coloring of main lineages (oak sections) same as in the LTT plot. Bluish, the three sections traditionally included in the white oaks (s.l.); red, red oaks; purple, the golden-cup or 'intermediate' (between white and red) oaks — these three groups (five sections) form subgenus Quercus, which except for the "Roburoids" and one species of sect. Ponticae is restricted to the Americas. Yellow to green, the sections and main clades (in our and earlier ML trees) of the exclusively Eurasian subgenus Cerris.

Like Pérez Escobar et al., we noted a very good fit between the distance-matrix-based neighbor-net and the optimized ML tree. Clades with high branch support and intra-clade coherence form distinct clusters, here distinct neighborhoods associated with certain edge bundles (thick colored lines). This tells us that the distance matrix is representative: it captures the primary phylogenetic signal that also informs the tree.

The first thing we can infer from the network is that missing data are not much of an issue in our matrix. Distance-based methods are prone to missing-data artifacts, and RAD-seq data are (inevitably) rather gappy. It is important to keep in mind that neighbor-nets cannot replace tree analysis in the case of NGS data; they are "just" a tool to explore the overall signal in the matrix. If the network has neighborhoods that contrast with what can be seen in the tree, this can be an indication that one's data are not sufficiently tree-like at all. But it can also simply mean that the data are not sufficient to obtain a representative distance matrix.
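One quick way to check whether gappiness, rather than phylogeny, is driving part of the distance matrix is to count, for each pair of samples, how many alignment sites are actually scored in both. This is an assumed diagnostic, not part of the published pipeline, and the sample names and sequences below are toy stand-ins; pairs that share very few scored sites get distances of the poorly supported kind discussed next.

```python
# Diagnostic sketch: per-pair shared coverage in a concatenated SNP matrix.
import itertools

def shared_sites(seq1, seq2, missing="-N?"):
    """Number of alignment columns scored (non-missing) in both sequences."""
    return sum(1 for a, b in zip(seq1, seq2)
               if a not in missing and b not in missing)

# Toy stand-in for a concatenated RAD-seq matrix (hypothetical data).
alignment = {
    "Q_alba":   "ACGTACGTACGT",
    "Q_lobata": "ACNN--GTACNN",
    "Q_robur":  "ACGTACGTACGA",
}

for x, y in itertools.combinations(alignment, 2):
    n = shared_sites(alignment[x], alignment[y])
    print(f"{x} vs {y}: {n} shared scored sites")
```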

Did you notice the little isolated blue dot (Q. lobata)? This is such a case — it has nothing to do with reticulation between the blue and the yellow edges; it is just that the available data do not produce an equally discriminative distance pattern. According to its pairwise distances, this sample is generally much closer to all other oak individuals included in the matrix, in contrast to the other members of its Dumosae clade, which are generally more similar to each other, and to the remainder of the white oaks (s.str., dark blue, and s.l., all bluish).

Close-up on the white oaks s.str. neighborhood (sect. Quercus) and plot of the preferred dated tree.

In the tree it is hence placed as sister to all other members and, being closer to the inferred common ancestor, it triggers a deep Dumosae crown age, c. 10 myr older than the subsequent radiation(s) and as old as the divergence of the rest of the white oaks s.str.

The second observation, which can assist in the interpretation of the ML tree (especially the dated one), is the principal structure (ordering) within each subgenus and section. The neighbor-net is a planar (i.e. 2-dimensional) graph, so the taxa are put into a circular order. The algorithm essentially identifies the closest relative (a candidate for a direct sister, as a tree does) and the second-closest relative. Towards the leaves of the Tree of Life, this is usually a cousin or, in the case of reticulation, the intermixing lineage. Towards the roots, it can reflect the general level of derivation, i.e. the distance to the (hypothetical all-)ancestor.
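The following sketch only illustrates that intuition; it is not the neighbor-net algorithm itself (which agglomerates pairs of neighboring clusters before computing split weights). The taxon labels and distance values are hypothetical, chosen to mimic two subgenera of three sections each.

```python
# Illustration: for every taxon, list its closest and second-closest relative
# according to a (hypothetical, model-based) distance matrix.
import numpy as np

taxa = ["lobatae", "quercus_ss", "ponticae", "cyclobalanopsis", "cerris", "ilex"]
D = np.array([
    [0.00, 0.06, 0.05, 0.10, 0.11, 0.09],
    [0.06, 0.00, 0.03, 0.11, 0.12, 0.10],
    [0.05, 0.03, 0.00, 0.10, 0.11, 0.09],
    [0.10, 0.11, 0.10, 0.00, 0.05, 0.04],
    [0.11, 0.12, 0.11, 0.05, 0.00, 0.03],
    [0.09, 0.10, 0.09, 0.04, 0.03, 0.00],
])

for i, name in enumerate(taxa):
    order = np.argsort(D[i])        # index 0 is the taxon itself (distance 0)
    first, second = order[1], order[2]
    print(f"{name}: closest = {taxa[first]}, second closest = {taxa[second]}")
```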

Knowing the primary split (between the two subgenera), we can interpret the graph with regard to the general level of (phylogenetic) derivedness.

The overall least derived groups are placed to the left in each subgenus, and the most derived to the right. The reason is long-branch attraction (LBA) stepping in: the red and the green group are the most isolated/unique within their subgenera, and hence they attract each other. This is important to keep in mind when looking at the tree and judging whether (local) LBA may be an issue (parsimony and distance methods will always get the wrong tree in the Felsenstein Zone, but probabilistic methods have a 50% chance of escaping it). In our oak data, we are on the safe side. The red group (sect. Lobatae, the red oaks) is indeed resolved as the first-branching lineage within subgenus Quercus, but within subgenus Cerris it is the yellow group, sect. Cyclobalanopsis. If this were LBA, Cyclobalanopsis would need to be on the right side, next to the red oaks.

The third obvious pattern is the distinct form of each subgraph: we have neighborhoods with long, slim root trunks and others that look like broad fans.

Long, narrow trunks, i.e. distances showing high intra-group coherence and high inter-group distinctness, can be expected for long-isolated lineages with small (founder) population sizes, e.g. lineages that underwent severe or repeated bottlenecks in the past. Unique genetic signatures are quickly accumulated (increasing the overall distance to sister lineages), and extinction ensures that only one signature (or very similar ones) survives (low intra-group diversity until the final radiation).

Fans represent gradual, undisturbed accumulation of diversity over a long period of time, e.g. frequent radiation and formation of new species during range and niche expansion. In the absence of stable barriers we get a very broad, rather unstructured fan, like the one of the white oaks (s.str.; blue); along a relatively narrow (today and probably also in the past) geographic east-west corridor (here: the 'Himalayan corridor') we get a more structured, elongated one, as in the case of section Ilex (olive).

Close-up on the sect. Ilex neighborhood, again with the tree plotted. In the tree we see just sister clades; in the network we see the strong correlation between geography and genetic diversity patterns, indicating a gradual expansion of the lineage towards the west until it finally reached the Mediterranean. Only a sophisticated, explicit ancestral-area analysis could possibly come to a similar result (often without certainty), whereas it is obvious from comparing the tree with the network.

This can go along with higher population sizes and/or more permeable species barriers, both of which will lead to lower intragroup diversity and less tree-compatible signal. Knowing that both section Quercus (white oaks s.str., blue) and section Ilex (olive) evolved and started to radiate at about the same time, it is obvious from the structure of the two fans that the (mostly and originally temperate) white oaks always produced more, but probably less stable, species than the mid-latitude (subtropical to temperate) Ilex oaks, which today span an arc from the Mediterranean via the southern flanks of the Himalayas into the mountains of China and the subtropics of Japan.

Networks can be used to understand, interpret and confirm aspects of the (dated) NGS tree.

The much older stem and young crown ages seen in dated trees may be indicative of bottlenecks, too. But since we typically use relaxed clock models, which allow for rate changes and rely on very few fixed points (e.g. fossil age constraints), we may get (too?) old stem and (much too) young crown ages, especially for poorly sampled groups or unrepresentative data. By looking at the neighbor-net, we can directly see that the relatively old crown ages of the lineages with (today) few species fit with their distinctness, both within the lineage and overall.

The deepest splits: the tree mapped on the neighbor-net.

By mapping the tree onto the network, and thus directly comparing the tree to the network, we can see that different evolutionary processes may need to be considered to explain what we see in the data. It also shows us how much of our tree is (data-wise) trivial, and where it could be worth taking a deeper look, e.g. applying coalescent networks, generating more data, or recruiting additional data. Last, but not least, it is quick to infer and makes pretty figures.

So, try it out with your NGS data, too.

PS: Model-based distances can be inferred with the same program that many of us use to infer the ML tree: RAxML. We can hence use the same model assumptions for the neighbor-net that we optimized for inferring the tree and establishing branch support.
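As a rough sketch of what that looks like in practice: RAxML offers a "-f x" algorithm that writes pairwise maximum-likelihood distances under the chosen model. The invocation below is an assumption-laden example, not the exact command used for the paper; the binary name, file names and thread count are placeholders, and flags should be checked against your RAxML version.

```python
# Sketch (assumed setup): call RAxML to compute pairwise ML distances under the
# same model used for the tree inference, then feed them to a neighbor-net.
import subprocess

cmd = [
    "raxmlHPC-PTHREADS-SSE3",      # assumed binary name; adjust to your install
    "-f", "x",                     # pairwise ML distance computation
    "-m", "GTRGAMMA",              # same model family as used for the ML tree
    "-s", "oaks_rad_snps.phy",     # hypothetical concatenated SNP alignment
    "-t", "oaks_ml_tree.newick",   # tree on which model parameters are estimated
    "-p", "12345",
    "-n", "pairwise_distances",
    "-T", "8",
]
subprocess.run(cmd, check=True)
# The distances should end up in a RAxML_distances.* output file, which can be
# reformatted into a distance matrix for a neighbor-net program such as SplitsTree.
```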

How CDC Is Using Advanced Molecular Detection Technology To Better Fight Flu!


Flu (influenza) is a serious disease caused by influenza viruses. Flu viruses change constantly. They are among the fastest mutating viruses known. These changes can impact how well the flu vaccine works, or can also result in the emergence of new influenza viruses against which people have no preexisting immunity, triggering a pandemic. Year round, scientists from CDC, World Health Organization (WHO), and other partners monitor the influenza viruses that are infecting people. These scientists study the viruses in the laboratory to see how they are changing.

CDC is using next-generation gene sequencing tools to analyze flu viruses as part of CDC’s Advanced Molecular Detection (AMD) initiative. The technology allows CDC to study more influenza viruses faster and in more detail than ever before. AMD technology uses genomic sequencing, high-performance computing, and epidemiology to study pathogens and improve disease detection. CDC is using this next-generation sequencing (NGS) technology to monitor genetic changes in influenza viruses in order to better understand and improve the effectiveness of influenza vaccines.

To share more information about this revolutionary NGS technology and its impactful work, CDC expert John Barnes, PhD, Team Lead of the Influenza Genomics Team within the Virology, Surveillance, and Diagnosis Branch of CDC’s Influenza Division, took part in a Reddit Ask Me Anything digital Q&A to answer the public’s questions on AMD technology and how these tools are helping to improve influenza virus monitoring and the development of better-performing influenza vaccines. This post includes some highlights from that discussion.

Question 1: What exactly does the AMD technology platform do that is different from the current approaches used to guide vaccine development? And what are the most common reasons that we “guess wrong” in terms of which viral strains will be responsible for the next season’s flu?

Dr. Barnes: One example of how AMD technology is used in vaccine development is to address mutations that may occur in vaccine viruses during growth in the eggs used in the production of vaccine viruses. These mutations can change the vaccine virus so much that the immune response to vaccination may not protect as well against circulating viruses. This means that vaccinated people may still get sick. CDC is using AMD technology to try to solve this problem. Scientists are looking at the genetic sequences of 10 generations of H3N2 flu viruses as they grow and evolve in eggs. CDC will test all of the viruses to find out what genetic changes cause a good immune response and good growth in eggs. Once the “good” genetic changes are identified, CDC will then synthesize H3N2 viruses with those properties that can be used to make vaccine that offers better protection against H3N2 flu infection. One of the main reasons that the virus is challenging is its RNA polymerase. The polymerase of influenza is very mistake-prone and causes the virus to mutate rapidly. For example, in some years certain influenza viruses may not appear and spread until later in the influenza season, making it difficult to prepare a candidate vaccine virus in time for vaccine production. This can make vaccine virus selection very challenging. We are currently using AMD techniques to sequence all clinical specimens that come into the CDC to improve our ability to find and track mutations that may be of concern.

Question 2: Why are chicken embryos typically the go-to for flu vaccine cultivation?

Dr. Barnes: Thanks for this question – it’s one we get a lot! Flu vaccines have been made using an egg-based manufacturing process for more than 70 years. In the past, when making a vaccine for production, manufacturers used eggs as a safe host that provides high yield. As birds are the natural reservoir host for flu, influenza typically grows well in eggs, and eggs maintain a safe distance between the species used to make the vaccine and the target species. Mammalian cell lines were subjected to extensive safety testing to establish a cell line that is free of human pathogens while maintaining sufficient vaccine yield. You can learn more about how AMD technology is improving the development of flu vaccines made using egg-based technology here.

Question 3: What about the flu virus causes it to mutate so quickly from year to year requiring a new vaccine every season? For example with chickenpox there is one virus and one vaccine, why then with the flu are there countless strains and a new vaccine every year?

Dr. Barnes: As you know, influenza is a virus and can only replicate in living cells. Influenza viruses survive by infecting host cells, multiplying, and then exiting host cells. The enzyme influenza uses to copy itself is very error prone which causes the virus to rapidly mutate. Each host has its own defense mechanisms and these defenses are collectively referred to as environmental pressures. It’s difficult to predict how a virus will mutate when attempting to get around a host’s immune defenses, but the changes can happen rapidly, as you said.

Because flu viruses are constantly changing, the formulation of the flu vaccine is reviewed each year and sometimes updated to keep up with changing flu viruses. More information about how influenza viruses can change is available here.

Question 4: Do you have any insight on the universal vaccine that was developed?

Dr. Barnes: Great question! Yes, I can provide some insight. A longer-term goal for seasonal flu vaccines is the development of a single vaccine, or universal vaccine, that provides safe, effective, and long-lasting immunity against a broad spectrum of different flu viruses (both seasonal and novel). Right now, CDC is part of an inter-agency partnership coordinated by the Biomedical Advanced Research and Development Authority (or BARDA) that supports the advanced development of new and better flu vaccines. These efforts have already yielded important successes (e.g. a high-dose flu vaccine specifically for people 65 years and older that creates a stronger antibody response), but a part of this effort is the eventual development of a universal vaccine. A number of government agencies and private companies have already begun work to advance this type of vaccine development, but, as you can imagine, this task poses an enormous scientific and programmatic challenge.

Question 5: How would you convince someone who is staunchly against flu vaccines that they’re a good thing?

Dr. Barnes: Help address misconceptions about the flu. Remind people that a flu shot cannot cause flu illness. They should understand that anyone can get the flu, and that each year thousands of people in the United States die from flu and many more are hospitalized. It’s important to stress that the flu vaccine can keep people from getting flu, make flu illness less severe if they do get it, AND keep them from spreading flu to their family and other people who could be at high risk of severe flu complications.

Interested in learning more? Check out Dr. Barnes’ full Reddit AMA here.

John Barnes, Ph.D., is Team Lead of the Influenza Genomics Team (IGT) in the Virology, Surveillance, and Diagnosis Branch of CDC’s Influenza Division. He earned his Ph.D. in Biochemistry and Molecular Biology from the University of Georgia in Athens, Georgia. Dr. Barnes began his career at CDC in the Influenza Division in 2007, after working as a postdoctoral fellow in the Emory University Department of Human Genomics. His current work includes managing a staff of nine to serve the sequencing and genetic analysis needs of the Influenza Division. The number of viruses sequenced by the IGT makes CDC’s Influenza Division the largest contributor of influenza sequence data among the WHO Influenza Collaborating Centers.

Phylogenomics: the effect of data quality on sampling bias


Sampling bias refers to a statistical sample that has been collected in such a way that some members of the intended statistical population are less likely to be included than are others. The resulting biased sample does not necessarily represent the population (which it should), because the population members were not all equally likely to have been selected for the sample.
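A minimal illustration of this definition, with toy numbers: if some members of the population are less likely to be sampled than others, the sample statistic is systematically off, no matter how large the sample is. Everything in the sketch (the population distribution, the inclusion weights) is invented for demonstration.

```python
# Toy demonstration: unequal inclusion probabilities bias the sample mean.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)

# Unequal inclusion probabilities: larger values are more likely to be sampled.
weights = np.interp(population, (population.min(), population.max()), (0.2, 1.0))
weights /= weights.sum()

unbiased = rng.choice(population, size=100_000)            # equal probability
biased = rng.choice(population, size=100_000, p=weights)   # unequal probability

print(f"population mean: {population.mean():.2f}")
print(f"unbiased sample: {unbiased.mean():.2f}")
print(f"biased sample:   {biased.mean():.2f}")   # systematically too high
```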

This affects scientific work because all scientific questions are about the population, not the sample (i.e. we infer from the sample to the population), and we can only answer these questions if the samples we have collected truly represent the populations we are interested in. That is, our results could be due to the method of sampling but erroneously be attributed to the phenomenon under study instead. Bias needs to be accounted for, but it cannot be assessed by looking at the sampled data alone. Bias can only be addressed via the sampling protocol itself.

In genome sequencing, sampling bias is often referred to as ascertainment bias, but clearly it is simply an example of the more general phenomenon. This is potentially a big problem for next generation sequencing (NGS) because there are multiple steps at which sampling is performed during genome sequencing and assembly. These include the initial collection of sequence reads, assembling sequence reads into contigs, and the assembly of these into orthologous loci across genomes. (NB: For NGS technologies, sequence reads are of short lengths, generally <500 bp and often <100 bp.)

The potential for sampling (ascertainment) bias has long been recognized for the detection of SNPs. This bias occurs because SNP panels are often developed using only a small group of samples from which to choose the polymorphic markers. The resulting collection of markers samples only a small fraction of the diversity that exists in the population, and this can mis-estimate phylogenetic relationships.
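A toy simulation (illustrative numbers only, not drawn from any real data set) makes the mechanism explicit: markers ascertained in a small discovery panel over-represent common variants and under-represent rare ones, so the frequency spectrum, and anything computed from it, is distorted.

```python
# Toy simulation of SNP ascertainment bias.
import numpy as np

rng = np.random.default_rng(1)
n_loci, panel_size = 50_000, 4

# True allele frequencies skewed towards rare variants (as most real SNPs are).
freqs = rng.beta(0.3, 3.0, size=n_loci)

# Ascertainment: a locus becomes a "SNP marker" only if it is polymorphic
# in a small discovery panel of `panel_size` diploid individuals.
panel_counts = rng.binomial(2 * panel_size, freqs)
ascertained = (panel_counts > 0) & (panel_counts < 2 * panel_size)

het_all = np.mean(2 * freqs * (1 - freqs))
het_asc = np.mean(2 * freqs[ascertained] * (1 - freqs[ascertained]))
print(f"mean heterozygosity, all loci:         {het_all:.4f}")
print(f"mean heterozygosity, ascertained loci: {het_asc:.4f}")
# The ascertained markers paint a systematically distorted picture of diversity,
# and distances or trees built from them inherit that distortion.
```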


However, it is entirely possible that any attempt to collect high-quality NGS data actually results in poor-quality sampling — that is, we end up with high-quality but biased genome sequences. Genome sequencing is all about the volume of data collected, and yet data volume cannot be used to address bias (it can only be used to address stochastic variation). It would be ironic if phylogenomics turns out to have poorer data than traditional sequence-based phylogenetics, but biased genomic data are unlikely to be any more use than non-genome sequences.

The basic issue is that attempts to get high-quality genome data usually involve leaving out data, either because the initial sequencing protocol never collects the data in the first place, or because the subsequent assembly protocols delete the data as being below a specified quality threshold. If these data are left out in a non-random manner, which is very likely, then sampling bias inevitably results. Unfortunately, both the sequencing and bioinformatic protocols are usually beyond the control of the phylogeneticist, and sampling bias can thus go undetected.
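A sketch of this point, with invented numbers: if coverage depends on GC content (as the first paper below reports) and we drop sites below a coverage threshold, the retained sites are a non-random subset of the genome, so the summary statistics computed from them shift.

```python
# Toy illustration: a seemingly neutral coverage filter retains a biased
# subset of sites when coverage correlates with GC content.
import numpy as np

rng = np.random.default_rng(7)
n_sites = 100_000

gc = rng.random(n_sites)                       # per-site GC context, 0..1
coverage = rng.poisson(lam=5 + 30 * (1 - gc))  # GC-rich sites get lower coverage

kept = coverage >= 10                          # a typical hard quality filter
print(f"mean GC, all sites:  {gc.mean():.3f}")
print(f"mean GC, kept sites: {gc[kept].mean():.3f}")
# The filter looks like routine quality control, but the sample of sites that
# survives it is biased -- and no amount of extra sequencing volume fixes that.
```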

Two recent papers highlight common NGS steps that potentially result in biased samples.

First, Robert Ekblom, Linnéa Smeds and Hans Ellegren (2014. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics 15: 467) discuss genome coverage bias, using mtDNA as an example. They note:
It is known that the PCR step involved in sequencing-by-synthesis methods introduces coverage bias related to GC content, possibly due to the formation of secondary structures of single stranded DNA. Such GC dependent bias is seen on a wide variety of scales ranging from individual nucleotides to complete sequencing reads and even large (up to 100 kb) genomic regions. Systematic bias could also be introduced during the DNA fragmentation step or caused by DNA isolation efficacy, local DNA structure, variation in sequence quality and mapability of sequence reads.
In addition to variation in coverage, there may be sequence dependent variation in nucleotide specific error rates. Such systematic patterns of sequencing errors can also have consequences for downstream applications as errors may be taken for low frequency SNPs, even when sequencing coverage is high. GC rich regions and sites close to the ends of sequence reads typically show elevated errors rates and it has also been shown that certain sequence patterns, especially inverted repeats and "GGC" motifs are associated with an elevated rate of Illumina sequencing errors. Such sequence specific miscalls probably arise due to specific inhibition of polymerase binding. Homopolymer runs cause problems for technologies utilising a terminator free chemistry (such as Roche 454 and Ion Torrent), and specific error profiles exist for other sequencing technologies as well.
Sequencing coverage showed up to six-fold variation across the complete mtDNA and this variation was highly repeatable in sequencing of multiple individuals of the same species. Moreover, coverage in orthologous regions was correlated between the two species and was negatively correlated with GC content. We also found a negative correlation between the site-specific sequencing error rate and coverage, with certain sequence motifs "CCNGCC" being particularly prone to high rates of error and low coverage.
The second paper is by Frederic Bertels, Olin K. Silander, Mikhail Pachkov, Paul B. Rainey and Erik van Nimwegen (2014. Automated reconstruction of whole-genome phylogenies from short-sequence reads. Molecular Biology and Evolution 31: 1077-1088). They discuss the situation where raw short-sequence reads from each DNA sample are directly mapped to the genome sequence of a single reference genome. They note:
There are reasons to suspect that such reference-mapping-based phylogeny reconstruction methods might introduce systematic errors. First, multiple alignments are traditionally constructed progressively, that is, starting by aligning the most closely related pairs and iteratively aligning these subalignments. Aligning all sequences instead to a single reference is likely to introduce biases. For example, reads with more SNPs are less likely to successfully and unambiguously align to the reference sequence, as is common in alignments of more distantly related taxa. This mapping asymmetry between strains that are closely and distantly related to the reference sequence may affect the inferred phylogeny, and this has indeed been observed. Second, as maximum likelihood methods explicitly estimate branch lengths, including only alignment columns that contain SNPs and excluding (typically many) columns that are nonpolymorphic, may also affect the topology of the inferred phylogeny. This effect has been described before for morphological traits and is one reason long-branch attraction can be alleviated with maximum likelihood methods when nonpolymorphic sites are included in the alignment.
We identify parameter regimes where the combination of single-taxon reference mapping and SNP extraction generally leads to severe errors in phylogeny reconstruction. These simulations also show that even when including nonpolymorphic sites in an alignment, the effect of mapping to a single reference can lead to systematic errors. In particular, we find that when some taxa are diverged by more than 5-10% from the reference, the distance to the reference is systematically underestimated. This can generate incorrect tree topologies, especially when other branches in the tree are short.


These issues are part of the current "gee-whizz" phase of phylogenomics, in which over-optimism prevails over realism, and volume of data is seen as the important thing. David Roy Smith (2014. Last-gen nostalgia: a lighthearted rant and reflection on genome sequencing culture. Frontiers in Genetics 5: 146) has recently commented on this:
The promises of NGS have, at least for me, not lived up to their hype and often resulted in disappointment, frustration, and a loss of perspective.
I was taught to approach research with specific hypotheses and questions in mind. In the good ol' Sanger days it was questions that drove me toward the sequencing data. But now it’s the NGS data that drive my questions ... I'm trapped in a cycle where hypothesis testing is a postscript to senseless sequencing.
As we move toward a world with infinite amounts nucleotide sequence information, beyond bench-top sequencers and hundred-dollar genomes, let’s take a moment to remember a simpler time, when staring at a string of nucleotides on a screen was special, worthy of celebration, and something to give us pause. When too much data were the least of our worries, and too little was what kept us creative. When the goal was not to amass but to understand genetic data.