How I would (realistically) analyze SARS-CoV-2 (or similar) phylogenetic data


While writing this post, the GISAID database reported over 40,000 SARS-CoV-2 genomes (a week before, it was only 32,000), which is rather a lot for a practical data analysis. There have been a few posts on the RAxML Google group about how to analyze such large datasets and speed up the analysis:
How to run ML search and BS 100 replicates most rapidly for a 30000 taxa * 30000 bp DNA dataset
In response, Alexandros Stamatakis, the developer of RAxML, expressed the basic problem this way:
Nonetheless, the dataset has insufficient phylogenetic signal, and thus it can and should not be analyzed using some standard command line that we provide you here; but requires a more involved analysis, carefully exploring if there is sufficient signal to even represent the result as a binary/bifurcating tree, which I personally seriously doubt.
As demonstrated in our recent collection of blog posts, we also doubt this. One user, having read some of our posts, wondered whether we couldn't just use the NETWORK program to infer a haplotype network instead. Typically, the answer to such a question is "Yes, but..."

So, here's a post about how I would design an experiment to get the most information out of thousands of virus genomes (see also: Inferring a tree with 12000 [or more] virus genomes).

Why trees struggle with resolving virus phylogenies and reconstructing their evolution. X, the genotype of Patient Zero (the first host, not the first-diagnosed host), spread into five main lineages. All splits (internodes, taxon bipartitions) in this graph are trivial, ie. one tip is separated from all others. Thus, they, and the underlying data, cannot provide any information to infer a tree, which is a sequence of non-trivial taxon bipartitions. For instance, an outgroup (O)-defined root would require sampling the 'Source' (S), the all-ancestor, hence defining a split O+S | X+A+B+C+D+E. All permutations of X+descendant | rest should have the same probability, leading to a 5-way split support (BS = 20, PP = 0.2). In reality, however, tree analyses, Bayesian inference more than ML bootstrapping, may prefer one split over any other, eg. because of long-branch attraction between C and D and 'short-branch culling' of X and E. See also: Problems with the phylogeny of coronaviruses and A new SARS-CoV-2 variant.


Start small

Having a large set of data doesn't mean that you have to analyze it all at once. Big Data does not mean that we must start with a big analysis! The reason we have over 40,000 CoV-2 genomes is simply due to recent advances in DNA sequencing, combined with the fact that we have effectively spread the virus globally, providing a lot of potential samples.

The first step would thus be:
  1. Take one geographical region at a time, and infer its haplotype network.
This will allow us to define the main virus types infecting each region. It will also eliminate all satellite types (local or global) that are irrelevant for reconstructing the evolution of the virus, as they evolved from an ancestral type that is itself included in our data.
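
As a minimal first sketch of this step (my illustration, not the pipeline used in the posts referenced here), the regional grouping and haplotype collapsing could be done in Python with Biopython; the FASTA file name, the GISAID-style header format, and the region_of() helper are assumptions.

```python
# Minimal sketch: collapse each region's genomes into unique haplotypes
# before any network inference. The FASTA file name, the header format,
# and the region_of() helper are assumptions, not part of the original post.
from collections import defaultdict
from Bio import SeqIO  # Biopython

def region_of(record):
    # Hypothetical parser: GISAID-style names such as "hCoV-19/Germany/.../2020"
    # carry the country in the second '/'-separated field.
    parts = record.id.split("/")
    return parts[1] if len(parts) > 1 else "unknown"

# region -> sequence string -> list of genome ids sharing that haplotype
haplotypes = defaultdict(lambda: defaultdict(list))
for rec in SeqIO.parse("cov2_genomes.fasta", "fasta"):
    haplotypes[region_of(rec)][str(rec.seq).upper()].append(rec.id)

for region, haps in sorted(haplotypes.items()):
    n_genomes = sum(len(ids) for ids in haps.values())
    print(f"{region}: {n_genomes} genomes, {len(haps)} unique haplotypes")
    # One representative per haplotype can now be exported and fed to a
    # haplotype-network program (eg. NETWORK / median-joining networks).
```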

We can also search the regional data for recombinants: viruses may recombine, but to do so they need to come into contact, ie. be sympatric.

C/G→U mutations seen in several of the early sampled CoV-2 genomes: note how they are mixed within haplotypes collected from the cruise ship 'Diamond Princess' (from Using Median-networks to study SARS-CoV-2)


Go big

Once the main virus variants in each region are identified, we can filter them and then use them to infer both:
  1. a global haplotype network, and
  2. a global, bootstrapped maximum-likelihood (ML) tree.
The inference of the latter will now be much faster, because we have eliminated a lot of the non-tree-like signal ("noise") in our data set. The ML tree, and its bootstrap support consensus network, will give us an idea about phylogenetic relationships under the assumption that not all mutations are equally probable (which they clearly aren't); this provides a phylogenetic hypothesis that is not too biased by convergence or mutational preferences, eg. replacing A, C, and G by U (Finding the CoV-2 root).
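
As a sketch of how that global ML plus bootstrap analysis might be launched (here via raxml-ng, the successor of classic RAxML; the flags follow the raxml-ng documentation as I recall them but should be checked against your installed version, and the file names are placeholders):

```python
# Sketch: run the global ML search plus bootstrapping from Python by shelling
# out to raxml-ng (assumed to be on the PATH; verify the flags against your
# installed version, and replace the file names with your own).
import subprocess

cmd = [
    "raxml-ng",
    "--all",                         # ML search + bootstraps + support mapping
    "--msa", "global_main_variants.fasta",
    "--model", "GTR+G",
    "--bs-trees", "100",             # 100 bootstrap replicates
    "--seed", "42",
    "--threads", "4",
]
subprocess.run(cmd, check=True)
# The resulting best tree with support values and the bootstrap tree sample can
# then be summarised as a bootstrap support consensus network, eg. in SplitsTree.
```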

On the other hand, the haplotype network (Median-joining or Statistical parsimony) may be biased, but it can inform us about ancestor-descendant relationships. Using the ML tree as a guide, we may even be able to eliminate saturated sites, or down-weight them for the network inference, provided that the filtered, pruned-down dataset provides enough signal.
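
One crude way to flag candidate saturated sites is a simple per-column state count; this filter is my own illustration rather than the method implied above, the alignment file name and the two-state threshold are assumptions, and the final decision should be guided by the ML tree.

```python
# Crude sketch: flag alignment columns with many character states as candidates
# for down-weighting or exclusion in the haplotype-network inference.
# File name and threshold are assumptions; tune against the ML guide tree.
from Bio import SeqIO

records = list(SeqIO.parse("global_main_variants.aln.fasta", "fasta"))
aln_length = len(records[0].seq)

noisy_columns = []
for col in range(aln_length):
    states = {str(rec.seq[col]).upper() for rec in records} - {"-", "N"}
    if len(states) > 2:              # more than two observed states per site
        noisy_columns.append(col)

print(f"{len(noisy_columns)} of {aln_length} columns flagged as potentially "
      f"saturated / homoplastic")
```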

With the ML tree, bootstrap analysis, and haplotype networks at hand, it is easy to do things like compare the frequency of the main lineages and assess their global distribution. This also facilitates the depiction of potential recombination: we can sub-divide the complete genome, infer trees/networks for the different parts, and then compare them.
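
A sketch of that genome sub-division step follows; the non-overlapping 5-kb window size and the file names are arbitrary choices of mine, not values from the post.

```python
# Sketch: cut the genome alignment into windows so that trees/networks can be
# inferred per window and compared; topological conflict between windows is a
# hint of recombination. Window size and file names are assumptions.
from Bio import SeqIO

records = list(SeqIO.parse("global_main_variants.aln.fasta", "fasta"))
aln_length = len(records[0].seq)
window = 5000                        # non-overlapping 5-kb windows (arbitrary)

for start in range(0, aln_length, window):
    end = min(start + window, aln_length)
    chunk = [rec[start:end] for rec in records]   # SeqRecord slicing keeps ids
    SeqIO.write(chunk, f"window_{start + 1}_{end}.fasta", "fasta")
    # Each window file can now go through the same tree / haplotype-network
    # pipeline as the full genome.
```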

Based only on the nearly 80 CoV-2 genomes stored in gene banks by March 2020. The same can be done for any number of accessions, provided the tools used take into account the reality of the data. The "x" marks indicate recombination, the arrows ancestor-descendant relationships (from: Using Median networks to study SARS-CoV-2)


Change over time

The most challenging problem for tree inference and haplotype-network inference is the fact that virus genomes evolve steadily through time. That is, the CoV-2 data will include both the earliest variants of the virus and its many, diverse offspring: both ancestors and descendants are included among the (now) 40,000+ genomes. We have shown a number of examples where trees cannot handle ancestor-descendant relationships very well. Haplotype networks, on the other hand, are vulnerable to homoplasy (random convergences). So:
  1. Take one time-slice and establish the amount of virus divergence at that time.
Depending on the virus diversity, one can use haplotype networks or distance-based Neighbor-nets (RAxML can export model-based distances); a slicing sketch follows after this list. Even traditional trees are an option: by focusing on one time slice, we fulfill the basic requirement of standard tree inference, that all tips are of the same age.
  2. Then stack the time-slice graphs together, for a general overview.
It will be straightforward to establish which subsequent virus variant is most similar to which one in the slice before.
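
Here is the slicing sketch referred to above: binning genomes into monthly time slices from a metadata table, with one alignment written per slice for haplotype-network or Neighbor-net inference. The CSV file, its column names, and the monthly binning are assumptions.

```python
# Sketch: bin genomes into monthly time slices using a metadata table, writing
# one alignment per slice for haplotype-network or Neighbor-net inference.
# The CSV file, its column names, and the date format are assumptions.
import csv
from collections import defaultdict
from Bio import SeqIO

collection_month = {}
with open("metadata.csv") as handle:
    for row in csv.DictReader(handle):
        collection_month[row["strain"]] = row["collection_date"][:7]  # "2020-03"

slices = defaultdict(list)
for rec in SeqIO.parse("global_main_variants.aln.fasta", "fasta"):
    if rec.id in collection_month:
        slices[collection_month[rec.id]].append(rec)

for month, recs in sorted(slices.items()):
    SeqIO.write(recs, f"slice_{month}.fasta", "fasta")
    print(month, len(recs), "genomes")
    # Each slice gets its own network; the graphs are then stacked in
    # chronological order to trace which variant follows which.
```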

Based on such networks, we can also easily filter the main variants for each time slice, to compile a reduced set for further explicit dating analysis, for example via the commonly used dating software BEAST (which was actually originally designed for use with virus phylogenies).

A stack of time-filtered Neighbor-nets (from: Stacking neighbour-nets: a real world example; see Stacking neighbour-nets: ancestors and descendants for an introduction)


Networks and trees go hand-in-hand

With the analyses above, it should be straightforward to model not only the spread of the virus (as GISAID tries to do using Nextstrain) but also its evolution – global and general, local and in-depth, and linear and reticulate.

The set of reconstructions will allow for exploratory data analysis. Conflicts between trees and networks are often a first hint towards reticulate history — in the case of viruses this will be recombination. Keep in mind that deep recombinants will not necessarily create conflict in either trees (eg. decreased bootstrap support) or networks (eg. boxes), but may instead result in long terminal branches.
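
As an illustration of that last point, one could screen the ML tree for conspicuously long terminal branches; in this sketch DendroPy is assumed to be installed, the tree file name is a placeholder, and the 95th-percentile cut-off is arbitrary.

```python
# Sketch: flag tips with unusually long terminal branches in the ML tree as
# candidates for a closer look (possible deep recombinants). DendroPy assumed;
# the tree file name and the 95th-percentile cut-off are arbitrary choices.
import statistics
import dendropy

tree = dendropy.Tree.get(path="global_main_variants.raxml.bestTree",
                         schema="newick")
tip_lengths = {leaf.taxon.label: (leaf.edge.length or 0.0)
               for leaf in tree.leaf_node_iter()}

cutoff = statistics.quantiles(tip_lengths.values(), n=20)[-1]  # ~95th percentile
for label, length in sorted(tip_lengths.items(), key=lambda kv: -kv[1]):
    if length > cutoff:
        print(f"long terminal branch: {label} ({length:.5f})")
```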

There may be haplotypes in the regional networks that are oddly different, or create parallel edge-bundles. Using the ML guide-tree, we can assess their relationship within the global data set — whether they show patterns diagnostic for more than one lineage or are the result of homoplasy.

Likewise, there may be branches in the ML tree with ambiguous support, which can be understood when using haplotype networks (see eg., Tree informing networks explaining trees).

Era of Big Data, and Big Error

SARS-CoV-2 data form a very special dataset, but there are parallels to other Big Data phylogenomic studies. Many of these studies produce fully resolved trees, and it is often assumed that the more data are used, the more correct the result will be, so that further examination is unnecessary (and it may be impossible, because of the amount of compiled data).

As somebody who has worked at the coal-face of evolution, I have realized that the more data we have, the more complex the patterns we can extract from them will be. The risk of methodological bias does not vanish, and may even increase; and the more data there are, the more I need to check which part of them resolves which aspect of a taxonomic group's evolution.

This can mean that, rather than a single tree of 10,000 samples, it is better to infer 100 graphs, each reflecting variation among 100 samples, plus one overall graph that includes only the main sample types. Make use of supernetworks (eg. Supernetworks and gene incongruence) and consensus networks to explore all aspects of a group's evolution, in particular when you leave CoV-2 behind and tackle larger groups of coronaviruses (Hack and fish...for recombination in coronaviruses).

BigDat 2016 where men (and only men) will teach you about big data #YAOMM

Just got an email invitation to the following

2nd INTERNATIONAL WINTER SCHOOL ON BIG DATA

BigDat 2016

Bilbao, Spain

February 8-12, 2016

Organized by:
DeustoTech, University of Deusto
Rovira i Virgili University



I confess, I was intrigued enough to look because it was in Bilbao, and, well, my kids are completely obsessed with soccer and we are thinking of a trip to Barcelona, so why not a trip to Bilbao too. And then, well, I got sick to my stomach. I looked at the list of speakers and instructors and did a bunch of Googling to make inferences about their gender. And, well, everyone associated with the School appears to be male. That is 24 of 24 slots (4 keynote speaker slots and 20 professor slots). See below for the rundown. People I identified as male are highlighted in yellow. Sad and disappointing. Needless to say, I will not be going.

-----------------------------

Keynote Speakers
  1. Nektarios Benekos (European Organization for Nuclear Research)
  2. Chih-Jen Lin (National Taiwan University)
  3. Jeffrey Ullman (Stanford University)
  4. Alexandre Vaniachine (Argonne National Laboratory)
Professors and Courses
  1. Nektarios Benekos (European Organization for Nuclear Research)
  2. Hendrik Blockeel (KU Leuven)
  3. Edward Y. Chang (HTC Health, Taipei)
  4. Nello Cristianini (University of Bristol)
  5. Ernesto Damiani (University of Milan)
  6. Francisco Herrera (University of Granada)
  7. Chih-Jen Lin (National Taiwan University)
  8. George Karypis (University of Minnesota)
  9. Geoff McLachlan (University of Queensland)
  10. Wladek Minor (University of Virginia)
  11. Raymond Ng (University of British Columbia)
  12. Sankar K. Pal (Indian Statistical Institute)
  13. Erhard Rahm (University of Leipzig)
  14. Hanan Samet (University of Maryland)
  15. Jaideep Srivastava (Qatar Computing Research Institute)
  16. Jeffrey Ullman (Stanford University)
  17. Alexandre Vaniachine (Argonne National Laboratory)
  18. Xiaowei Xu (University of Arkansas, Little Rock)
  19. Fuli Yu (Baylor College of Medicine)
  20. Mohammed J. Zaki (Rensselaer Polytechnic Institute)

Network Effect overwhelms with data

Network Effect

Network Effect by Jonathan Harris and Greg Hochmuth is a gathering of the emotions, non-emotion, and everyday-ness of life online. It hits you all at once and overwhelms your senses.

We gathered a vast amount of data, which is presented in a classically designed data visualization environment — all real, all impeccably annotated, all scientifically accurate, all “interesting,” and yet all basically absurd. In this way, the project calls into question the current cult of Big Data, which has become a kind of religion for atheists.

Harris and Hochmuth gathered tweets that mentioned 100 behaviors, such as hug, cry, blow, and meditate, and paired them with corresponding YouTube videos. They then employed workers on Amazon's Mechanical Turk to read the tweets aloud and to gather data on when the behaviors occurred. Tweets are continually collected to gather data on why people perform such behaviors, and Google Ngram provides historical usage context.

There is a lot going on at once.

I could go on, but it's better if you experience it for yourself. You're given about seven minutes per day to view, depending on the life expectancy of where you live. The weird thing is that even though it's an overwhelming view into online life, you're left wanting more, which is exactly what the creators were going for.


Science for the People: Dataclysm

This week Science for the People is looking at how powerful computers and massive data sets are changing the way we study each other, scientifically and socially. We're joined by machine learning researcher Hannah Wallach, to talk about the definition of "big data," and social science research techniques that use data about individual people to model patterns in human behavior. Then, we speak to Christian Rudder, co-founder of OkCupid and author of the OkTrends blog, about his book Dataclysm: Who We Are (When We Think No One's Looking).

*Josh provides research & social media help to Science for the People and is, therefore, completely biased.



Why exploring big data is hard

The talks from OpenVisConf 2015 went up, so I'm slowly making my way through. In this one Danyel Fisher from Microsoft Research talks about the challenges of working with data that doesn't quite fit into your standard CSV data model. The visualization has to account for the mess.


This is big data.

Big data ocean

A one-off tumblr that catalogs stock images that depict the tumultuous, rising sea of big data. Nice.

Still though, nothing beats Big Star Trek Data.



Data science, big data, and statistics – all together now

Terry Speed, an emeritus professor of statistics at the University of California, Berkeley, gave an excellent talk on how statisticians can play nice with big data and data science. Usually these talks go in the direction of saying that data science is statistics. This one is more on the useful, non-snarky side.

Share the Data

Data-sharing is often much easier said than done. In the past, researchers created large and valuable databases that would often languish on a university server, fading into oblivion after the particular post-doc or graduate student who created them had moved on. It has actually been shown that, for the field of ecology, the likelihood of accessing data ever again decreases by 17% every year.

While that study is specific to a particular field, I can imagine some level of data loss in every field. Even if data was described in a publication, there is no easy way for an outside researcher to access it, or even know if that particular data would be useful in their new study. The times they are a-changing. There are now multiple venues to openly share data – one example is Figshare, where you can publicly share as much data as you like and also have a private data storage option. This service truly represents open access and the DIY mentality with “collaboration spaces” where multiple groups can work together on projects.

Now the journal Nature is launching a new initiative to catalog and describe data resources and make them widely available.

Scientific Data is the name of their new open-access, online-only publication of scientifically valuable datasets. This new system will allow researchers to publish their datasets outside of the traditional publishing system. They will be able to get a citation for their data even if it didn't lead to a paper otherwise. These publications will include something called a Data Descriptor, which is a detailed account of the collection and analysis of the data, so that it can be combined with other similar data or replicated by following the detailed description. These publications are also associated with the Nature brand name… which I think will definitely influence the number and variety of data contributors.

The data will be peer-reviewed by at least one expert in the experimental science and one expert in data standards. This may help to keep poor-quality data out of the collection. It will also raise the cachet of the Nature data-sharing enterprise. Whether that is for the better of the scientific enterprise remains to be seen. All of the collected datasets will be set up to be fully searchable, to help researchers identify other relevant or complementary data that they could be using in their experiments.

By making this resource open-access, Nature is encouraging replication and access by all, including those outside of bench research. I think this is one area where Nature will excel, attracting non-scientists to the data. They are also committing themselves to a high standard of speedy publication and high-quality data management. This system seems like another way to increase the transparency of data analysis and provide more opportunities for replication with fewer grant dollars. In our current system, there is no way to know if someone has a data set hanging around that would be perfectly suited to help you in your research. A system like Scientific Data can also help researchers meet the data-sharing expectations of their grant funders; for example, most, if not all, NIH grants require sharing of data.

More data for everyone!!



Project Tycho: Vaccines prevent diseases!

Tycho Brahe, image from Wikipedia

I just heard about a new "big data" project called Project Tycho. They chose the name in honor of Tycho Brahe, who made tons of detailed observations of the stars and planets; after his death, his data were used by Kepler to formulate the laws of planetary motion. This project wants to connect the vast amounts of public health data to scientists and policy researchers, to improve their understanding of contagious diseases and their spread. Their undertaking is incredible; they digitized the weekly National Notifiable Diseases Surveillance System reports from 1888 to 2013. Now that all of the data is digitized, they are working their way through standardizing it and making it amenable to analysis. The entire dataset is available to search online.

Recently, the New England Journal of Medicine published a description of the project, along with data from the first analysis done on this new resource. The authors looked at 8 different vaccine-preventable diseases (smallpox, polio, measles, rubella, mumps, hepatitis A, diphtheria, pertussis) and at their rates of incidence before the introduction of the respective vaccine. They assumed that there were no other major reasons for the rates of infection of these diseases to change other than the increase in vaccination. They estimated that 103.1 million cases of these 8 diseases had been prevented since 1924. When you consider that these diseases can sometimes be fatal, these vaccination programs have made a huge difference in child health, and in population health in general.

The data also expose increased rates of 4 of these diseases (measles, mumps, rubella, and pertussis) in recent years. This could be attributed to pockets where vaccination rates have dropped for personal or religious reasons; it's not really possible to know definitively with this particular data. While the rates of these diseases seem incredibly low, and the perceived risk of infection seems low, the current rates of infection are only low because of years and years of vaccination. The risk to the unvaccinated is actually much higher than it would appear.

This large data project will be an invaluable resource for evaluating vaccine program effectiveness, and it will help to guide and record future vaccination programs.



“Big Data” – tool not philosophy

To me, the take home message from David Brooks’ article “What You’ll Do Next” and Tyler Cowen’s follow-on comment is that “Big Data” is a potentially useful tool, but alone it is not a coherent or inspiring approach to life.