Visualization research for non-researchers

Reading visualization research papers can often feel like a slog. As a necessity, there’s usually a lot of jargon, references to William Cleveland and Robert McGill, and sometimes perception studies that lack a bit of rigor. So for practitioners or people generally interested in data communication, worthwhile research falls into a “read later” folder never to be seen again.

Multiple Views, started by visualization researchers Jessica Hullman, Danielle Szafir, Robert Kosara, and Enrico Bertini, aims to explain the findings and the studies to a more general audience. (The UW Interactive Data Lab’s feed comes to mind.) Maybe the “read later” becomes read.

I’m looking forward to learning more. These projects have a tendency to start with a lot of energy and then fizzle out, so I’m hoping we can nudge this a bit to urge them on. Follow along here.



More heretic bits: networks for (more) recent matrices published in Cladistics

This is Part 2 of a 2-part blog series. Part 1 covered some history, while this post has three (more) recently published matrices, and the take-home message.

Jumping forward in time, welcome to the 21st century

In Part 1, I showed several networks generated based on some early phylogenetic matrices published in the first volumes of the journal Cladistics. In this post, we will look at the most recent data matrices and trees uploaded to TreeBASE, covering the past seven years.

Nearly a generation later, and facing the "molecular revolution", some researchers (fortunately) still compile morphological matrices. This is often overlooked but important work: genes and genomes can be sequenced by machines, and the only thing we need to do is to feed these machine-generated data into other powerful machines (and programs) to get a phylogenetic tree, or network. But no software or computer cluster can (so far) study anatomy and generate a morphological matrix. The latter is paramount when we want to put fossils, usually devoid of DNA, in a (molecular) phylogenetic context. We need to do this when we aim to reconstruct histories in space and time.

Nevertheless, we can't ignore the fact that these important data are (still) far from tree-like. What holds for the matrices of the 80's (see the end of Part 1), still applies now.

So, let's have a look at the three most recent data sets (one morphological, two molecular) published in Cladistics that have their data matrix in TreeBASE.

The morphological dataset

Beutel et al. (2011; submission S11976) provided a "robust phylogeny of ... Holometabola", and note in their abstract: "Our results show little congruence with studies based on rRNA, but confirm most clades retrieved in a recent study based on nuclear genes."

Without having read the study, I can guess which clades (likely used here as a synonym for monophyletic group; but see David's post on Hennig and Cladistics) were confirmed, just by looking at the Neighbor-net inferred from this matrix. The data matrix contains 356 multistate characters (with up to six states), scored and annotated for 34 taxa, including polymorphisms and some gaps ("–") as distinct from missing data ("?"). (Standard tree- or network-inference doesn't differentiate between gaps and missing data, but some people find it important to distinguish between "not applicable" and "not known" in a matrix.)

Neighbor-net inferred from simple pairwise distances computed based on Beutel et al.'s matrix. Brackets show my ad hoc assessment of candidates for monophyla (here: likely represented by clades in no matter how optimized trees).

How did I postulate the monophyla? By deduction: if two or more OTUs are much more similar to each other than to anything else in the matrix, they likely are part of the same evolutionary lineage, i.e. have a common origin (= monophyletic in a pre-Hennigian sense). When the matrix covers the group and morphospace well, such a lineage has a good chance of being inclusive (= monophyletic fide Hennig; for the covered OTUs). This is especially so when there is a good deal of homoplasy (the provided tree has a CI of 0.44 and RC of 0.33): convergences should be more randomly distributed than lineage-specific/-conserved traits. The latter don't need to be (or to have been, at some point in time) synapomorphies, shared derived unique traits, but could be diagnostic suites of characters that evolved in parallel within a lineage and were passed on to all (or most) of the descendants.
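The deduction above can be roughed out in code. The following is only a minimal sketch, not the procedure actually used for the figure: it flags pairs of OTUs whose mutual distance is much smaller than their distance to anything else. The function name `tight_pairs`, the `ratio` threshold, and the toy distances are all hypothetical.

```python
from itertools import combinations


def tight_pairs(taxa, dist, ratio=0.5):
    """Return pairs that are much closer to each other than to any other OTU.

    dist maps frozenset({a, b}) to a pairwise distance; ratio is an
    arbitrary, illustrative cut-off for "much more similar".
    """
    def d(a, b):
        return dist[frozenset((a, b))]

    pairs = []
    for a, b in combinations(taxa, 2):
        others = [t for t in taxa if t not in (a, b)]
        # Smallest distance from either member of the pair to a third OTU.
        nearest_other = min(min(d(a, t), d(b, t)) for t in others)
        if d(a, b) <= ratio * nearest_other:
            pairs.append((a, b))
    return pairs


# Hypothetical distances for four OTUs (not taken from Beutel et al.'s matrix).
taxa = ["W", "X", "Y", "Z"]
dist = {
    frozenset(("W", "X")): 0.1,
    frozenset(("Y", "Z")): 0.5,
    frozenset(("W", "Y")): 0.6,
    frozenset(("W", "Z")): 0.6,
    frozenset(("X", "Y")): 0.6,
    frozenset(("X", "Z")): 0.7,
}
candidates = tight_pairs(taxa, dist)
print(candidates)
```

In this toy example only W and X qualify; Y and Z are nearest neighbours of each other, but not by a wide enough margin. A Neighbor-net shows the same kind of pattern graphically, without needing a fixed threshold.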

The first molecular dataset

Let's look at the signal in the two molecular matrices.

In 2016, Gaspar and Almeida (submission S19167) tested generic circumscriptions in a group of ferns by "assembl[ing] the broadest dataset thus far, from three plastid regions (rbcL, rps4-trnS, trnL-trnF) ... includ[ing] 158 taxa and 178 newly generated sequences". They found: "three subfamilies each corresponding to a highly supported clade across all analyses (maximum parsimony, Bayesian inference, and maximum likelihood)."

The total matrix has 3250 characters, of which 1641 are constant and 1189 are parsimony-informative. This is quite a lot for such a matrix and, by itself, rules out parsimony for tree-inference. If half of the nucleotide sites are variable, then the rate of character change was high, and parsimony is statistically robust only when the rate of change was low. High mutation rates or high levels of divergence may also pose problems for distance methods and other optimality criteria, all closely related to parsimony.
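To make the "half of the sites are variable" statement concrete, here is the arithmetic behind it as a small sketch. The three counts are the ones quoted above; the variable names and the helper `share` are made up for this illustration.

```python
# Site counts reported above for the Gaspar & Almeida matrix.
total_sites = 3250
constant_sites = 1641
parsimony_informative = 1189

# Everything that is not constant is variable.
variable_sites = total_sites - constant_sites
# Variable sites that are not parsimony-informative.
variable_uninformative = variable_sites - parsimony_informative


def share(count, total=total_sites):
    """Return count as a fraction of all alignment columns."""
    return count / total


print(f"variable sites:          {variable_sites} ({share(variable_sites):.1%})")
print(f"parsimony-informative:   {parsimony_informative} ({share(parsimony_informative):.1%})")
print(f"variable, uninformative: {variable_uninformative} ({share(variable_uninformative):.1%})")
```

So 1609 of the 3250 columns (just under 50%) are variable, and more than a third of all columns are parsimony-informative.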

The file includes three trees, labelled "vero" (which, in Italian, means "true"), "Fig._1" and "MPT". "Vero" and "Fig._1" come with branch lengths; judging from the values (<< 1), they are probabilistic trees (of some sort); the "MPT" is (as usual) provided as a cladogram without branch-lengths. It may be that the authors had to add the parsimony tree just to fulfill editorial policies, while being convinced "vero" is the much better tree. "Vero" is a fully resolved tree (the ML tree?), while "Fig._1" (Bayesian?) and "MPT" include polytomies.

Using PAUP*'s "describe" function, we learn that the "MPT" is 5101 steps long and has a CI of 0.41 and RC of 0.33. Nucleotide sequence data can be notoriously homoplasious, as we repeat the same four states into infinity and have to deal with an unknown but usually significant amount of back mutations. This adds to the other problems for parsimony:
  • transitions are more likely to happen than transversions; and
  • in coding gene regions, such as the rbcL, some sites (3rd codon positions) mutate much faster than others.
Still, parsimony trees are not necessarily wrong. Neither are NJ trees; and there are also datasets where probabilistic methods struggle, eg. when the likelihood surface of the treespace is flat.
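For readers less familiar with these indices, a short sketch of how they relate. CI is the minimum conceivable number of steps divided by the observed tree length, and the rescaled consistency index is RC = CI × RI, so the retention index (RI) can be backed out from the two values quoted. The inputs are the rounded values from this post, so the results are only approximate; the function names are made up for this illustration.

```python
def implied_retention_index(ci, rc):
    """Back out RI from the relation RC = CI * RI."""
    return rc / ci


def implied_minimum_steps(ci, tree_length):
    """Back out the minimum conceivable steps from CI = min_steps / tree_length."""
    return ci * tree_length


# Rounded values quoted in this post.
ri_beutel = implied_retention_index(ci=0.44, rc=0.33)
ri_gaspar = implied_retention_index(ci=0.41, rc=0.33)
min_steps_gaspar = implied_minimum_steps(ci=0.41, tree_length=5101)

print(f"Beutel et al. tree, implied RI: {ri_beutel:.2f}")
print(f"Gaspar & Almeida MPT, implied RI: {ri_gaspar:.2f}")
print(f"Gaspar & Almeida MPT, implied minimum steps: about {min_steps_gaspar:.0f}")
```

In other words, the "MPT" needs roughly two and a half times as many steps as a perfectly homoplasy-free fit to the same data would require.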

So, the first question is: how different are the three trees provided? Rather than having to show three graphs, we can show the (strict) Consensus network of those trees.

A strict consensus network summarizing the topologies of the three trees provided in the TreeBASE submission of
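What such a network summarizes can be sketched in a few lines. The assumption here is that the network displays every split (bipartition of the taxa) found in at least one of the input trees, so that conflicting splits show up as boxes. Trees are written as nested tuples of made-up taxon names; this is a toy illustration, not how SplitsTree or any other program implements it.

```python
from collections import Counter


def leaves(node):
    """Return the set of taxon names below a node (a name or a tuple of nodes)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Return the non-trivial splits of a tree, each stored as one side of it."""
    taxa = leaves(tree)
    anchor = min(taxa)  # always store the side that lacks this taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(taxa) - 1:
            found.add(side if anchor not in side else taxa - side)
        for child in node:
            visit(child)

    visit(tree)
    return found


# Three hypothetical trees on five taxa; the third contains a polytomy.
trees = [
    (("A", "B"), ("C", "D"), "E"),
    (("A", "B"), ("C", "E"), "D"),
    (("A", "B"), "C", "D", "E"),
]
counts = Counter(s for tree in trees for s in splits(tree))
shared = {s for s, c in counts.items() if c == len(trees)}
partial = set(counts) - shared

print("in all trees: ", [sorted(s) for s in shared])
print("in some trees:", sorted(sorted(s) for s in partial))
```

Here the split separating A and B from the rest is found in every tree, while the C+D and C+E groupings each occur in only one tree and contradict each other; a consensus network would draw them as a box rather than collapsing them into a polytomy.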

The main difference is between "vero" and the other two — "Fig. 1" and the "MPT" are very similar (and both include polytomies). There are three main scenarios for a Consensus network like this with respect to the high portion of variable sites:
  1. "Fig. 1" is a Jukes-Cantor model-based tree,
  2. "Fig. 1" is an uncorrected p-distance based tree, or
  3. most of the variation is between ingroup (the subtree including all Blechnum) and outgroup (the other subtree).
"Vero" is still quite congruent, so the model used here can't be too much different, either.

What should ring alarm bells, however, are the many grade-like / staircase subtrees, which are unusual for a molecular data set. Staircases imply that each subsequent dichotomous speciation event resulted in a single species and a further diversifying lineage: multiple, consistently occurring budding events.

The same graph, with arrows showing grade evolution. Often found in morpho-data-based trees with ancestral, more ancient, and derived (from them), modern forms, but should ring an alarm bell when common in a molecular tree. Major clades (found in all three trees) are labelled for comparison with the next graph.

Let's compare this to the Neighbor-net (usually, I would use model-based distances in such a case, but here we can do with uncorrected p-distances).

A Neighbor-net inferred from uncorrected p-distances based on Gaspar & Almeida's matrix; the major clades are labelled as in the preceding graph. Note the isolated, long-branch blue dots with asterisks, indicating the position of the first diverged species in the large clades G and I. Genuine signal or missing data artefact?

The Neighbor-net shows only a limited number of tree-like portions, but does correspond with the main clades above. Only A and B are dissolved, which are the two first diverging clades in the original trees (preceding graph). Some OTUs are placed close to the centre of the graph, or even along a tree-like portion (purple dots), a behaviour known from actual ancestors: some OTUs apparently have sequences that may be literally ancestral to others. This explains the grade structure seen in the original trees. Others (violet dots) create boxes, which may reflect a genuine ambiguous signal, or just be missing data leading to ambiguous pairwise distances. The latter (missing data artefact) is behind the misplacement of the four OTUs (red dots): missing data can inflate pairwise distances severely. And, like parsimony, distance-based methods are more vulnerable to long-branch(edge)-attraction than probabilistic methods.
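As an illustration of the missing-data problem, here is a minimal sketch of an uncorrected p-distance with pairwise deletion, i.e. ignoring every site where either sequence has a gap or missing data. The sequences are made up, and treating "-", "?" and "N" alike is an assumption of this sketch; actual programs may handle these cases differently.

```python
MISSING = {"-", "?", "N"}


def p_distance(seq_a, seq_b):
    """Return (p-distance, number of compared sites) using pairwise deletion."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # skip sites that are undefined in either sequence
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        return float("nan"), 0
    return differences / compared, compared


# Hypothetical toy sequences: the same single difference, with and without gaps.
reference = "ACGTACGTAC"
complete = "ACGTTCGTAC"
fragment = "????TCGT--"

print(p_distance(reference, complete))
print(p_distance(reference, fragment))
```

The same single substitution gives a distance of 0.1 when all ten sites can be compared, but 0.25 when only four sites overlap. With poorly overlapping sequences, the distances feeding the Neighbor-net rest on few sites and can be inflated accordingly.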

Model-based distances may help clean this up a bit, but the networks needed for this kind of data are Support consensus networks (see e.g. Schliep et al., MEE, 2017). The split appearance of the Neighbor-net hints at internal signal conflict and, with respect to the high number of variable sites (note the sometimes extremely long terminal edges), saturation issues. Two major questions would be:
  1. How do the different markers (coding gene vs. inter-genic spacers with different levels of diversity; rps4-trnS is typically more divergent than the trnL-trnF spacer) resolve relationships, which clades / topological alternatives receive unanimous support?
  2. Does it make a difference to run a fully partitioned (ML) analysis vs. an unpartitioned one vs. one excluding the 3rd codon position in the gene?
For intra-clade evolutionary pathways, it would be worthwhile to give median networks and suchlike a try, as parsimony methods that can discern ancestor-descendant relationships.

The second molecular dataset

The most recent data are from Kuo et al. (2017; submission S20277), who inferred a "robust ... phylogeny" (see Part 1, Jamieson et al. 1987, and Beutel et al., above) for a group of ferns, focusing on the taxonomy of a single genus, Deparia, that now includes five traditionally recognized genera. In the abstract it says: "... seven major clades were identified, and most of them were characterized by inferring synapomorphies using 14 morphological characters".

The matrix includes the molecular characters used to infer the major clades plus two trees, labelled "bestREP1" and "rep9BEST", both with branch lengths. Branch length values indicate that "bestREP1" could be parsimony-optimized (with averaged or weighted branch lengths), while "rep9BEST" is either a ML or Bayesian tree (technically, it could be a distance-based tree, too, but I don't think such "phenetics" are condoned by Cladistics).

Re-calculated, the first tree ("bestREP1") is shorter (3024 steps) than the one of Gaspar & Almeida, reflecting the much lower number of parsimony-informative sites (979). Many of the sites differ only between the focal genus and the outgroups, which is clearly visible in the Neighbor-net. [For those of you unfamiliar with Neighbor-nets: a parsimony analysis of these data takes hours or days, depending on the software and computer, while the distance matrix and the resultant Neighbor-net are inferred in a blink.]

The Neighbor-net based on Kuo et al.'s data. Why do we need to include long-branching, distant outgroups when we just want to bring order in a genus? Because to test monophyly, we need a rooted tree (ambiguous or not, or even biased by branching artefacts).

Let's remove the distant, long-branching outgroups, which (as we can see in the Neighbor-net) at best provide ambiguous signal for rooting the ingroup — at worst, they trigger ingroup-outgroup branching artefacts. What could a Neighbour-net have contributed regarding taxonomy and the seven major monophyletic intrageneric groups ("clades")? Pretty much everything needed for the paper, I guess (judging from the abstract).

Same data as above, but outgroups removed. The structure of this Neighbour-net allows us to identify seven likely candidates for monophyla ("1"–"7"), with "1" and "2" being obvious sister lineages. Colours refer to the clusters ("A"–"E") annotated above.

On a side note: by removing the long-branching, distant outgroups, taxon "T" is resolved as a probable member of the putative monophyletic group "5" (= "E" in the full graph with outgroups, and surely a highly supported subtree in any ingroup-only reconstruction, method-independent). Placing the root between "T" and the rest of the genus implies that "5" is a paraphyletic group comprising species that haven't evolved and diversified at all (i.e. are genetically primitive), in stark contrast to the other main intra-generic lineages. This is not impossible, but quite unlikely. More likely is the second scenario (primary split between "1"–"3" and "4"–"7"). Having "4" as sister to the rest could be an alternative, too.

This is where Hennig's logic could be of help: find and tabulate putative synapomorphies to argue for a set and root that makes the most sense regarding morphological evolution and molecular differentiation.

The take-home message(s)

We have argued before that it is in the ultimate interest of science and scientists to give access to phylogenetic data. No matter where one stands regarding phylogenetic philosophy, we should publish our data, so that people can do analyses of their own. Discussion should be based on results, not philosophies.

When you deal with morphological data, you should never be content with inferring a single tree (parsimony or other). You have to use networks.

The Neighbor-net was born as late as 2002 (Bryant & Moulton, 2002, in: Guigó R, and Gusfield D, eds, Algorithms in Bioinformatics, Second International Workshop, WABI, p. 375–391; paywalled) and made known to biologists in 2004 (same authors, same title, in Mol. Biol. Evol. 21:255–265), so that authors before this time did not have access to its benefits. Similarly, Consensus networks arrived around about the same time (Holland & Moulton 2003, in: Benson G, and Page R, eds, Algorithms in Bioinformatics: Third International Workshop, WABI, p. 165–176). However, the Genealogical World of Phylogenetic Networks has been here for six years now (first post February 2012). So there is now no excuse for publishing a cladogram without having explored the tree-likeness of your matrix' signal!

Neighbor-nets like the ones I showed in this 2-piece post (or that can be found in many of our other posts) are a quick and essential tool to explore the basic signal in your matrix:
  • How tree-like is it?
  • Where are the potential conflicts, obscurities?
  • What are the principal evolutionary alternatives (competing topologies)?
  • What is well supported (especially regarding taxonomy and the question of monophyly)?
Even if you don't use it in your paper, the network will tell you what you are dealing with when you start inferring trees.

The second essential tool is the much under-used Support consensus network, not shown in this post but in plenty of our other posts (and many papers I co-authored; for a comprehensive collection of network-related literature see Who's who in phylogenetic networks by Philippe Gambette). Support consensus networks estimate and visualize the robustness of the signal for competing topological (tree) alternatives.

Consensus networks should also be obligatory for those molecular data where even probabilistic methods fail to find a single fully resolved, highly supported tree.

If the editors of Cladistics are really dedicated to parsimony, they should insist not only on a parsimony tree (often provided as a cladogram), but also on parsimony-based networks:
  • strict Consensus networks to summarize the MPT samples instead of the standard strict Consensus cladograms;
  • bootstrap Support consensus networks showing the signal strength and support for alternative trees/competing clades (TNT has many bootstrapping options to play around with); and
  • Median networks and such-like for datasets with few mutations, and low levels of expected homoplasy.
This is what the 2016 #parsimonygate uproar (see Part 1) should have been about (12 years after Neighbor-nets, and 11 years after Consensus networks). Not the prioritizing of parsimony, but the naivety or ignorance towards pitfalls of (parsimony or other) trees inferred from data not providing tree-like signal or riddled by internal conflict.
This is a problem not limited to Cladistics but found, in my modest experience in professional science (c. 20 years), in many other journals as well (e.g. Bot. J. Linn. Soc., Taxon, Mol. Phyl. Evol., J. Biogeogr., Syst. Biol., Nature, Science).

Hence, here are my suggestions for future conference buttons, instead of those shown in Part 1.

No Cladograms!
Use Neighbour-nets!
Support Consensus Networks as obligatory!

Further reading for those who mistrust trees or become network-curious in general

Weekend reads: A debate over journal editors; academic corruption in China; a poisoning in a lab

Before we present this week’s Weekend Reads, a question: Do you enjoy our weekly roundup? If so, we could really use your help. Would you consider a tax-deductible donation to support Weekend Reads, and our daily work? Thanks in advance. The week at Retraction Watch featured the retraction and replacement of a paper on whether gun …

NCBI to assist in Virus Hunting Data Science Hackathon January 9-11, 2019

We are pleased to announce the second installment of the SoCal Bioinformatics Hackathon. From January 9-11, 2019, the NCBI will help run a bioinformatics hackathon in Southern California hosted by the Computational Sciences Research Center at San Diego State University! …

Celebrating 50 years of Neutral Theory

The importance of Neutral Theory and Nearly-Neutral Theory cannot be exaggerated. It has radically transformed the way experts think about evolution, especially at the molecular level. Unfortunately, the average scientist is not aware of the revolution that took place 50 years ago and they still think of evolution as a synonym for natural selection. I suspect that 80% of biology undergraduates in North American universities are graduating without a deep understanding of the importance of Neutral Theory.1

The journal Molecular Biology and Evolution has published a special issue: Celebrating 50 years of the Neutral Theory. The key paper published 50 years ago was Motoo Kimura's paper on “Evolutionary rate at the molecular level” (Kimura, 1968) followed shortly after by a paper from Jack Lester King and Thomas Jukes on "Non-Darwinian Evolution" (King and Jukes, 1969).

The special issue contains reprints of two classic papers published in Molecular Biology and Evolution in 1983 and 2005. In addition, there are 14 reviews and opinions written by editors of the journal and published earlier this year (see below). It's interesting that several of the editors of a leading molecular evolution journal are challenging the importance of Neutral Theory and one of them (senior editor Matthew Hahn) is downright hostile.
Kimura, M. (1983) Rare variant alleles in the light of the neutral theory. Molecular Biology and Evolution, 1:84-93. [doi: 10.1093/oxfordjournals.molbev.a040305]
Based on the neutral theory of molecular evolution and polymorphism, and particularly assuming "the model of infinite alleles," a method is proposed which enables us to estimate the fraction of selectively neutral alleles (denoted by $P_{\mathrm{neut}}$) among newly arisen mutations. It makes use of data on the distribution of rare variant alleles in large samples together with information on the average heterozygosity. The formula proposed is $P_{\mathrm{neut}} = \left[\frac{H_e}{1-H_e}\right]\left[\frac{\log_e(2nq)}{n_\alpha(x<q)}\right]$, where $n_\alpha(x<q)$ is the average number of rare alleles per locus whose frequency, $x$, is less than $q$; $n$ is the average sample size used to count rare alleles; $H_e$ is the average heterozygosity per locus; and $q$ is a small preassigned number such as $q = 0.01$. The method was applied to observations on enzyme and other protein loci in plaice, humans (European and Amerindian), Japanese monkeys, and fruit flies. Estimates obtained for them range from 0.064 to 0.21 with the mean and standard error $P_{\mathrm{neut}} = 0.14 \pm 0.06$. It was pointed out that these estimates are consistent with the corresponding estimate $P_{\mathrm{neut}}(\mathrm{Hb}) = 0.14$ obtained independently based on the neutral theory and using data on the evolutionary rate of nucleotide substitutions in globin pseudogenes together with those in the normal globins.

Nei, M. (2005) Selectionism and neutralism in molecular evolution. Molecular Biology and Evolution, 22:2318-2342. [doi: 10.1093/molbev/msi242]
Charles Darwin proposed that evolution occurs primarily by natural selection, but this view has been controversial from the beginning. Two of the major opposing views have been mutationism and neutralism. Early molecular studies suggested that most amino acid substitutions in proteins are neutral or nearly neutral and the functional change of proteins occurs by a few key amino acid substitutions. This suggestion generated an intense controversy over selectionism and neutralism. This controversy is partially caused by Kimura's definition of neutrality, which was too strict (|2Ns|≤1).
If we define neutral mutations as the mutations that do not change the function of gene products appreciably, many controversies disappear because slightly deleterious and slightly advantageous mutations are engulfed by neutral mutations. The ratio of the rate of nonsynonymous nucleotide substitution to that of synonymous substitution is a useful quantity to study positive Darwinian selection operating at highly variable genetic loci, but it does not necessarily detect adaptively important codons. Previously, multigene families were thought to evolve following the model of concerted evolution, but new evidence indicates that most of them evolve by a birth-and-death process of duplicate genes. It is now clear that most phenotypic characters or genetic systems such as the adaptive immune system in vertebrates are controlled by the interaction of a number of multigene families, which are often evolutionarily related and are subject to birth-and-death evolution. Therefore, it is important to study the mechanisms of gene family interaction for understanding phenotypic evolution. Because gene duplication occurs more or less at random, phenotypic evolution contains some fortuitous elements, though the environmental factors also play an important role. The randomness of phenotypic evolution is qualitatively different from allele frequency changes by random genetic drift. However, there is some similarity between phenotypic and molecular evolution with respect to functional or environmental constraints and evolutionary rate. It appears that mutation (including gene duplication and other DNA changes) is the driving force of evolution at both the genic and the phenotypic levels.

Kumar, S., and Patel, R. (2018) Neutral Theory, Disease Mutations, and Personal Exomes. Molecular Biology and Evolution, 35:1297-1303. [doi: 10.1093/molbev/msy085]
Genetic differences between species and within populations are two sides of the same coin under the neutral theory of molecular evolution. This theory posits that a vast majority of evolutionary substitutions, which appear as differences between species, are (nearly) neutral, that is, these substitutions are permitted without a significantly adverse impact on a species’ survival. We refer to them as evolutionarily permissible (ePerm) variation. Evolutionary permissibility of any possible variant can be inferred from multispecies sequence alignments by applying sophisticated statistical methods to the evolutionary tree of species. Here, we explore the evolutionary permissibility of amino acid variants associated with genetic diseases and those observed in personal exomes. Consistent with the predictions of the neutral theory, disease associated amino acid variants are rarely ePerm, much more biochemically radical, and found predominantly at more conserved positions than their non-disease counterparts. Only 10% of amino acid mutations are ePerm, but these variants rise to become two-thirds of all substitutions in the human lineage (a 6-fold enrichment). In contrast, only a minority of the variants in a personal exome are ePerm, a seemingly counterintuitive pattern that results from a combination of mutational and evolutionary processes that are, in fact, broadly consistent with the neutral theory. Evolutionarily forbidden variants outnumber detrimental variants in individual exomes and may play an underappreciated role in protecting against disease. We discuss these observations and conclude that the long-term evolutionary history of species can illuminate functional biomedical properties of variation present in personal exomes.

Austerlitz, F., and Heyer, E. (2018) Neutral Theory: From Complex Population History to Natural Selection and Sociocultural Phenomena in Human Populations. Molecular Biology and Evolution, 35:1304-1307. [doi: 10.1093/molbev/msy067]
Here, we present a synthetic view on how Kimura’s Neutral theory has helped us gaining insight on the different evolutionary forces that shape human evolution. We put this perspective in the frame of recent emerging challenges: the use of whole genome data for reconstructing population histories, natural selection on complex polygenic traits, and integrating cultural processes in human

Niida, A., Iwasaki, W.M., and Innan, H. (2018) Neutral Theory in Cancer Cell Population Genetics. Molecular Biology and Evolution, 35:1316-1321. [doi: 10.1093/molbev/msy091]
Kimura’s neutral theory provides the whole theoretical basis of the behavior of mutations in a Wright–Fisher population. We here discuss how it can be applied to a cancer cell population, in which there is an increasing interest in genetic variation within a tumor. We explain a couple of fundamental differences between cancer cell populations and asexual organismal populations. Once these differences are taken into account, a number of powerful theoretical tools developed for a Wright–Fisher population could be readily contribute to our deeper understanding of the evolutionary dynamics of cancer cell population.

Cannataro, V.L., and Townsend, J.P. (2018) Neutral Theory and the Somatic Evolution of Cancer. Molecular Biology and Evolution, 35:1308-1315. [doi: 10.1093/molbev/msy079]
Kimura’s neutral theory argued that positive selection was not responsible for an appreciable fraction of molecular substitutions. Correspondingly, quantitative analysis reveals that the vast majority of substitutions in cancer genomes are not detectably under selection. Insights from the somatic evolution of cancer reveal that beneficial substitutions in cancer constitute a small but important fraction of the molecular variants. The molecular evolution of cancer community will benefit by incorporating the neutral theory of molecular evolution into their understanding and analysis of cancer evolution—and accepting the use of tractable, predictive models, even when there is some evidence that they are not perfect.

Yoder, A.D., Poelstra, J.W., Tiley, G.P., and Williams, R.C. (2018) Neutral Theory Is the Foundation of Conservation Genetics. Molecular Biology and Evolution, 35:1322-1326. [doi: 10.1093/molbev/msy076]
Kimura’s neutral theory of molecular evolution has been essential to virtually every advance in evolutionary genetics, and by extension, is foundational to the field of conservation genetics. Conservation genetics utilizes the key concepts of neutral theory to identify species and populations at risk of losing evolutionary potential by detecting patterns of inbreeding depression and low effective population size. In turn, this information can inform the management of organisms and their habitat providing hope for the long-term preservation of both. We expand upon Avise’s “inventorial” and “functional” categories of conservation genetics by proposing a third category that is linked to the coalescent and that we refer to as “process-driven.” It is here that connections between Kimura’s theory and conservation genetics are strongest. Process-driven conservation genetics can be especially applied to large genomic data sets to identify patterns of historical risk, such as population bottlenecks, and accordingly, yield informed intuitions for future outcomes. By examining inventorial, functional, and process-driven conservation genetics in sequence, we assess the progression from theory, to data collection and analysis, and ultimately, to the production of hypotheses that can inform conservation policies.

Zhang, J. (2018) Neutral Theory and Phenotypic Evolution. Molecular Biology and Evolution, 35:1327-1331. [doi: 10.1093/molbev/msy065]
Although the neutral theory of molecular evolution was proposed to explain DNA and protein sequence evolution, in principle it could also explain phenotypic evolution. Nevertheless, overall, phenotypes should be less likely than genotypes to evolve neutrally. I propose that, when phenotypic traits are stratified according to a hierarchy of biological organization, the fraction of evolutionary changes in phenotype that are adaptive rises with the phenotypic level considered. Consistently, molecular traits are frequently found to evolve neutrally whereas a large, random set of organismal traits were recently reported to vary largely adaptively. Many more studies of unbiased samples of phenotypic traits are needed to test the general validity of this hypothesis.

Rocha, E.P.C. (2018) Neutral Theory, Microbial Practice: Challenges in Bacterial Population Genetics. Molecular Biology and Evolution, 35:1338-1347. [doi: 10.1093/molbev/msy078]
I detail four major open problems in microbial population genetics with direct implications to the study of molecular evolution: the lack of neutral polymorphism, the modeling of promiscuous genetic exchanges, the genetics of ill-defined populations, and the difficulty of untangling selection and demography in the light of these issues. Together with the historical focus on the study of single nucleotide polymorphism and widespread non-random sampling, these problems limit our understanding of the genetic variation in bacterial populations and their adaptive effects. I argue that we need novel theoretical approaches accounting for pervasive selection and strong genetic linkage to better understand microbial evolution.

Arkhipova, I.R. (2018) Neutral Theory, Transposable Elements, and Eukaryotic Genome Evolution. Molecular Biology and Evolution, 35:1332-1337. [doi: 10.1093/molbev/msy083]
Among the multitude of papers published yearly in scientific journals, precious few publications may be worth looking back in half a century to appreciate the significance of the discoveries that would later become common knowledge and get a chance to shape a field or several adjacent fields. Here, Kimura’s fundamental concept of neutral mutation-random drift, which was published 50 years ago, is re-examined in light of its pervasive influence on comparative genomics and, more specifically, on the contribution of transposable elements to eukaryotic genome evolution.

Leitner, T. (2018) The Puzzle of HIV Neutral and Selective Evolution. Molecular Biology and Evolution, 35:1355-1358. [doi: 10.1093/molbev/msy089]
HIV is one of the fastest evolving organisms known. It evolves about 1 million times faster than its host, humans. Because HIV establishes chronic infections, with continuous evolution, its divergence within a single infected human surpasses the divergence of the entire humanoid history. Yet, it is still the same virus, infecting the same cell types and using the same replication machinery year after year. Hence, one would think that most mutations that HIV accumulates are neutral. But the picture is more complicated than that. HIV evolution is also a clear example of strong positive selection, that is, mutants have a survival advantage. How do these facts come together?

Frost, S.D.W., Magalis, B.R., and Kosakovsky Pond, S.L. (2018) Neutral Theory and Rapidly Evolving Viral Pathogens. Molecular Biology and Evolution, 35:1348-1354. [doi: 10.1093/molbev/msy088]
The evolution of viral pathogens is shaped by strong selective forces that are exerted during jumps to new hosts, confrontations with host immune responses and antiviral drugs, and numerous other processes. However, while undeniably strong and frequent, adaptive evolution is largely confined to small parts of information-packed viral genomes, and the majority of observed variation is effectively neutral. The predictions and implications of the neutral theory have proven immensely useful in this context, with applications spanning understanding within-host population structure, tracing the origins and spread of viral pathogens, predicting evolutionary dynamics, and modeling the emergence of drug resistance. We highlight the multiple ways in which the neutral theory has had an impact, which has been accelerated in the age of high-throughput, high-resolution genomics.

Satta, Y., Fujito, N.T., and Takahata, N. (2018) Nonequilibrium Neutral Theory for Hitchhikers. Molecular Biology and Evolution, 35:1362-1365. [doi: 10.1093/molbev/msy093]
Selective sweep is a phenomenon of reduced variation at presumably neutrally evolving sites (hitchhikers) in the genome that is caused by the spread of a selected allele at a linked focal site, and is widely used to test for action of positive selection. Nonetheless, selective sweep may also provide an unprecedented opportunity for studying nonequilibrium properties of the neutral variation itself. We have demonstrated this possibility in relation to ancient selective sweep for modern human-specific changes and ongoing selective sweep for local population-specific changes.

Charlesworth, B., and Charlesworth, D. (2018) Neutral Variation in the Context of Selection. Molecular Biology and Evolution, 35:1359-1361. [doi: 10.1093/molbev/msy062]
In its initial formulation by Motoo Kimura, the neutral theory was concerned solely with the level of variability maintained by random genetic drift of selectively neutral mutations, and the rate of molecular evolution caused by the fixation of such mutations. The original theory considered events at a single genetic locus in isolation from the rest of the genome. It did not take long, however, for theoreticians to wonder whether selection at one or more loci might influence neutral variability at linked sites. Once DNA sequence variability could be studied, and especially when resequencing of whole genomes became possible, it became clear that patterns of neutral variability in genomes are affected by selection at linked sites, and that these patterns could advance our understanding of natural selection, and can be used to detect the action of selection in genomic regions, including selection much weaker than could be detected by direct measurements of the relative fitnesses of different genotypes. We outline the different types of processes that have been studied, in approximate order of their historical development.

Kern, A.D., and Hahn, M.W. (2018) The Neutral Theory in Light of Natural Selection. Molecular Biology and Evolution, 35:1366-1371. [doi: 10.1093/molbev/msy092]
In this perspective, we evaluate the explanatory power of the neutral theory of molecular evolution, 50 years after its introduction by Kimura. We argue that the neutral theory was supported by unreliable theoretical and empirical evidence from the beginning, and that in light of modern, genome-scale data, we can firmly reject its universality. The ubiquity of adaptive variation both within and between species means that a more comprehensive theory of molecular evolution must be sought.

Nekrutenko, A., Galaxy Team, Goecks, J., Taylor, J., and Blankenberg, D. (2018) Biology Needs Evolutionary Software Tools: Let’s Build Them Right. Molecular Biology and Evolution, 35:1372-1375. [doi: 10.1093/molbev/msy084]
Research in population genetics and evolutionary biology has always provided a computational backbone for life sciences as a whole. Today evolutionary and population biology reasoning are essential for interpretation of large complex datasets that are characteristic of all domains of today’s life sciences ranging from cancer biology to microbial ecology. This situation makes algorithms and software tools developed by our community more important than ever before. This means that we, developers of software tools for molecular evolutionary analyses, now have a shared responsibility to make these tools accessible using modern technological developments as well as provide adequate documentation and training.

1. I don't know if this is also true of undergraduates in Asia, Africa, South America, Europe, Australia, and Antarctica.

Kimura, M. (1968) Evolutionary rate at the molecular level. Nature, 217:624-626. [PDF]

King, J.L., and Jukes, T.H. (1969) Non-darwinian evolution. Science, 164:788-798. [PDF]

October 2018 RefSeq annotations include honey bee, butterfly & more

In October, the NCBI Eukaryotic Genome Annotation Pipeline released new annotations in RefSeq for the following organisms: Apis mellifera (honey bee), Athene cunicularia (burrowing owl), Ceratina calcarata (bee), Ciona intestinalis (vase tunicate), Ctenocephalides felis (cat flea), Diaphorina citri (Asian citrus psyllid), Galleria mellonella (greater …

✚ How I Made That: Animated Difference Charts in R

A combination of a bivariate area chart, animation, and a population pyramid, with a sprinkling of detail and annotation.

A collection of Charles-Joseph Minard’s statistical graphics

Charles-Joseph Minard, best known for a graphic he made (during retirement, one year before his death) showing Napoleon’s March, made many statistical graphics over his career. The Minard System from Sandra Rendgen is a collection of these works. The first section is background on Minard, his famed graphic, and his process, but really, you get it for the collection of vintage graphic goodness. [Amazon link]
