New paper: SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent

New paper published today in PLoS Computational Biology: Understanding how infectious disease spreads and where it originates is essential for devising policies to prevent and limit outbreaks. Whole genome sequencing of pathogens has proved an extremely promising tool for identifying transmission, particularly when combined with classical epidemiological data. Several statistical and computational approaches are available for exploiting genomics for epidemiological investigation. These methods have seen applications to dozens of outbreak studies. However, they have a number of serious drawbacks.

In this new paper Nicola De Maio, Jessie Wu and I introduce SCOTTI, a method for quickly and accurately inferring who-infected-whom from genomic and epidemiological data. SCOTTI addresses widespread but generally neglected problems in joint epidemiological and genomic inference, notably the presence of non-sampled and undetected intermediate cases and within-host pathogen variation caused by microevolution. Using real examples and simulations, we show that these problems strongly mislead existing popular inference methods. SCOTTI is based on BASTA, our recent breakthrough method for phylogeographic inference, and offers new standards of accuracy, calibration, and computational efficiency. SCOTTI is distributed as an open source package within BEAST2.

BASTA: Improved method for phylogeography

This week sees publication of our paper New Routes to Phylogeography: a Bayesian Structured Coalescent Approximation in PLoS Genetics.

Phylogeography is the recovery of migration history from genome sequences, and has exploded as a field in recent years. Over a thousand papers have used contemporary sequences and ancient DNA to reconstruct migratory trends, locate the origin of outbreaks and track the spread of infectious diseases. In many high profile examples phylogeography has informed our understanding of how major human pathogens spread.

In our new paper we solve a severe and apparently widely unappreciated problem: that the most popular approaches to phylogeography are heavily biased, extremely sensitive to sampling structure and substantially underestimate statistical uncertainty. The problems stem from the treatment of migration as equivalent to mutation (discrete trait analysis; DTA), and the assumption that sampling locations are phylogeographically informative.

To solve these problems we introduce and demonstrate a new method, BASTA, implemented in the phylogenetic software package BEAST2, that employs a novel approximation to enable inference under the structured coalescent, the bottom-up population genetics model of migration. Previously, methods for exact inference under the structured coalescent have proven too slow for many practical purposes, hence the need for a fast and accurate approximation.
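For readers unfamiliar with the structured coalescent, here is a minimal sketch of the model itself in Python. It is not BASTA and not the approximation described in the paper. It only simulates, backward in time, how sampled lineages coalesce within demes and migrate between them. The function name, the symmetric per-lineage migration rate, and the parameter values are illustrative assumptions.

```python
import random

def simulate_structured_coalescent(samples, pop_sizes, mig_rate, rng):
    """Simulate lineage counts backward in time under a structured coalescent.

    samples   -- number of sampled lineages in each deme
    pop_sizes -- haploid effective population size of each deme
    mig_rate  -- backward migration rate per lineage to each other deme
                 (assumed symmetric here for simplicity)
    Returns (time to the most recent common ancestor, number of migrations).
    """
    if mig_rate <= 0 and len(samples) > 1:
        raise ValueError("mig_rate must be positive so lineages can meet")
    counts = list(samples)
    n_demes = len(counts)
    t = 0.0
    migrations = 0
    while sum(counts) > 1:
        events, weights = [], []
        for d, (k, n) in enumerate(zip(counts, pop_sizes)):
            # Pairs of lineages in the same deme coalesce at rate k(k-1)/(2N).
            events.append(("coalesce", d))
            weights.append(k * (k - 1) / (2.0 * n))
            # Each lineage migrates at rate mig_rate to each other deme.
            events.append(("migrate", d))
            weights.append(k * mig_rate * (n_demes - 1))
        total = sum(weights)
        t += rng.expovariate(total)  # waiting time to the next event
        kind, d = rng.choices(events, weights=weights)[0]
        counts[d] -= 1
        if kind == "migrate":
            dest = rng.choice([j for j in range(n_demes) if j != d])
            counts[dest] += 1
            migrations += 1
    return t, migrations

if __name__ == "__main__":
    rng = random.Random(42)
    tmrca, n_mig = simulate_structured_coalescent([5, 5], [100, 100], 0.01, rng)
    print("TMRCA (generations):", tmrca, "migrations:", n_mig)
```

Note that coalescence can only happen between lineages in the same deme, so two lineages sampled in different demes must experience at least one migration event before they find a common ancestor.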

The biases we highlight with popular phylogeography methods are much more important than might appear from what is at one level a question of model choice. To underline this, we present an analysis of around 100 Ebola virus genome sequences to investigate the emergence of human outbreaks. Epidemiological studies have found that animals act as a reservoir, maintaining the virus between the sporadic human outbreaks that have unfolded over the past four decades, a scenario that our structured coalescent-based model correctly identifies.

Remarkably, DTA, the de facto standard method for phylogeography, wrongly concluded with high confidence that Ebola has been maintained since 1976 by undetected human-to-human transmission between outbreaks. Although such a conclusion would never be believed in the case of Ebola, it makes clear the potential for highly misleading inference about transmission that could, for much less well understood diseases, have serious implications for public health policy.

BASTA is the result of a lot of hard work by Nicola De Maio, who is a James Martin Fellow at the Oxford Martin School Institute for Emerging Infections, with help from Jessie Wu and Kathleen O'Reilly. You can read the paper here and download BASTA here.

Coalescent inference for infectious disease

Today my student Bethany Dearlove has her first paper published, called Coalescent inference for infectious disease: meta-analysis of hepatitis C. In this paper, published in Philosophical Transactions of the Royal Society B, we have developed coalescent-based population genetics methods for popular, deterministic, epidemiological models known as SI (susceptible-infectious), SIS (susceptible-infectious-susceptible) and SIR (susceptible-infectious-recovered). By implementing these methods in BEAST, we were able to re-analyse previously published hepatitis C virus datasets and directly estimate epidemiological parameters. Our results show that, in the absence of co-infection, the widely-used exponential growth and logistic growth models of changing population size correspond directly to SI and SIS dynamics. We were also able to examine the limitations to genetic approaches to reconstructing epidemiological dynamics.
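As a textbook illustration of why logistic growth and SIS dynamics line up (this is not the coalescent derivation from the paper), the deterministic SIS equation dI/dt = beta*I*(N - I)/N - gamma*I can be rewritten as logistic growth with rate r = beta - gamma and carrying capacity K = N*(1 - gamma/beta). The sketch below checks this numerically. The function names and parameter values are assumptions chosen purely for illustration.

```python
import math

def simulate_sis(beta, gamma, n_hosts, i0, t_end, dt=0.001):
    """Euler-integrate dI/dt = beta*I*(N - I)/N - gamma*I up to t_end."""
    i = float(i0)
    steps = int(round(t_end / dt))
    for _ in range(steps):
        i += dt * (beta * i * (n_hosts - i) / n_hosts - gamma * i)
    return i

def logistic(r, k, i0, t):
    """Closed-form logistic curve with growth rate r and carrying capacity k."""
    return k / (1.0 + (k / i0 - 1.0) * math.exp(-r * t))

if __name__ == "__main__":
    # Illustrative values only: not estimates from any dataset.
    beta, gamma, n_hosts, i0 = 0.5, 0.2, 1000.0, 1.0
    r = beta - gamma                     # net growth rate
    k = n_hosts * (1.0 - gamma / beta)   # endemic equilibrium
    for t in (10.0, 30.0, 60.0):
        numeric = simulate_sis(beta, gamma, n_hosts, i0, t)
        exact = logistic(r, k, i0, t)
        print(t, numeric, exact)
```

The two columns should agree closely at every time point, up to the small discretisation error of the Euler scheme.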

This paper appears as part of an issue on Next-generation molecular and evolutionary epidemiology of infectious disease, which accompanies a Royal Society discussion meeting organized by Oli Pybus, Christophe Fraser and Andrew Rambaut. The Royal Society has made audio recordings of the talks at this meeting, and the accompanying satellite meeting, available online, including my talk on Bethany's paper.

Please don’t use Clustal for tree construction!

There are reams of books, articles, and websites about the correct way to build a phylogenetic tree. My post is not to argue about what is the best method, but rather to point out that most people do not consider Clustal (e.g. ClustalX or ClustalW) to be an optimal solution in almost any circumstance. Countless times I have asked people how they built their particular tree and they have given me the vague "Clustal" answer. Of course this answer is fine if this is the first tree you have ever constructed, but beware: you will be labelled a phylogenetic newbie.

Clustal is technically a multiple alignment algorithm, but it also includes methods for tree construction in the same interface. Most of these methods are not really considered "good" tree building methods. If you do use Clustal, at least specify what tree building method you used (i.e. "Clustal with neighbor joining"). Most people don't use Clustal even for multiple alignment anymore, because MUSCLE has been shown to be at least as accurate as Clustal and is much faster.

For tree construction, most people would agree that a Maximum Likelihood or Bayesian method would almost always be a better solution; PhyML and MrBayes seem to be the most popular implementations of these methods. Advanced users might also want to look into using BEAST.

I usually interact with most of these programs through a command line interface, so I don't have an expansive knowledge of the best graphical tools. However, I did come across "Robust Phylogenetic Analysis For The Non-Specialist", which does a good job of allowing easy interaction between various methods for multiple sequence alignment, tree construction, and tree viewing.

Whatever you use to build trees, just make sure it isn't Clustal!