New paper: Machine learning to predict the source of campylobacteriosis using whole genome data

This study, published in October in PLOS Genetics, brings together machine learning, large bacterial isolate collections and whole genome sequencing to address the general problem of how to trace the source of human infections.

Specifically, we investigated campylobacteriosis, a common infection of animal origin causing ~1.5 million cases of gastroenteritis and 10,000 hospitalizations every year in the United States alone. We show that our combined machine learning/genomics analyses:

  • Improve the accuracy with which infections can be traced back to farm reservoirs.
  • Identify evolutionary shifts in bacterial affinity for livestock host species.
  • Detect changes in human infection capability within related strains.

These results will improve understanding not only of Campylobacter but of bacterial pathogens more generally, as these technologies can readily be applied to other important species.
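To give a flavour of what source attribution from genomic data involves, here is a minimal sketch. It is not the method used in the paper: it swaps in a simple naive Bayes classifier over presence/absence features, and the isolate data, feature names (`geneA` etc.) and function names are all hypothetical, chosen purely for illustration.

```python
import math
from collections import defaultdict


def train(isolates):
    """Fit a naive Bayes model.

    isolates: list of (source_label, set_of_features_present) pairs,
    e.g. isolates sampled from known animal reservoirs (toy data here).
    """
    features = set().union(*(feats for _, feats in isolates))
    n_per_source = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for source, feats in isolates:
        n_per_source[source] += 1
        for f in feats:
            present[source][f] += 1

    total = len(isolates)
    model = {}
    for source, n in n_per_source.items():
        # Laplace smoothing avoids zero probabilities for unseen features.
        probs = {f: (present[source][f] + 1) / (n + 2) for f in features}
        model[source] = (math.log(n / total), probs)
    return model


def attribute(model, feats):
    """Return the posterior probability of each source for one isolate."""
    log_scores = {}
    for source, (log_prior, probs) in model.items():
        score = log_prior
        for f, p in probs.items():
            score += math.log(p if f in feats else 1.0 - p)
        log_scores[source] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    z = sum(math.exp(s - top) for s in log_scores.values())
    return {src: math.exp(s - top) / z for src, s in log_scores.items()}


# Hypothetical training isolates from two reservoirs.
training = [
    ("chicken", {"geneA", "geneB"}),
    ("chicken", {"geneA"}),
    ("cattle", {"geneC"}),
    ("cattle", {"geneB", "geneC"}),
]
model = train(training)

# A hypothetical human clinical isolate carrying only geneA.
posterior = attribute(model, {"geneA"})
best_source = max(posterior, key=posterior.get)
```

In this toy example the human isolate is attributed mainly to the chicken reservoir. Real analyses involve thousands of isolates and far richer genomic features, so this should be read only as an outline of the idea.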

This paper builds on previous work published by the group, including our well-cited Tracing the source of campylobacteriosis (Wilson et al. 2008, PLOS Genetics 4:e1000203). The use of these methods for tracing infection has influenced public health policy and contributed to reducing disease burden.

This work demonstrates the potential for modern genomics and artificial intelligence approaches to address common and serious problems that affect our everyday lives. Awareness of the importance of infection to society has rarely been higher than in 2021, and while the current pandemic poses an acute global problem, other infections continue to present long-term threats to health and productivity.

This work was led by Nicolas Arning, in collaboration with David Clifton and Sam Sheppard.

New paper: SCOTTI: Efficient reconstruction of transmission within outbreaks with the structured coalescent

New paper published today in PLoS Computational Biology: Understanding how infectious disease spreads and where it originates is essential for devising policies to prevent and limit outbreaks. Whole genome sequencing of pathogens has proved an extremely promising tool for identifying transmission, particularly when combined with classical epidemiological data. Several statistical and computational approaches are available for exploiting genomics for epidemiological investigation. These methods have seen applications to dozens of outbreak studies. However, they have a number of serious drawbacks.

In this new paper Nicola De Maio, Jessie Wu and I introduce SCOTTI, a method for quickly and accurately inferring who-infected-whom from genomic and epidemiological data. SCOTTI addresses widespread but generally neglected problems in joint epidemiological and genomic inference, notably the presence of non-sampled and undetected intermediate cases and within-host pathogen variation caused by microevolution. Using real examples and simulations, we show that these problems can strongly mislead existing popular inference methods. SCOTTI is based on BASTA, our recent breakthrough method for phylogeographic inference, and offers new standards of accuracy, calibration, and computational efficiency. SCOTTI is distributed as an open source package within BEAST2.

Prize PhD Studentships available

I am offering two PhD projects as part of the annual Nuffield Department of Medicine Prize Studentship competition:
These are fully-funded, four-year awards open to outstanding students of any nationality. Applicants nominate three projects, in order of preference, from the available pool. For how to apply, click here. Only applications submitted through the online system will be considered, but interested applicants are welcome to contact me informally. The deadline for applications is noon, 6th January 2017.

In addition to my projects, the Modernising Medical Microbiology project has announced the following PhD projects as part of the competition:

    New paper: Rapid host switching in Campylobacter

    Our new open access paper Rapid host switching in generalist Campylobacter strains erodes the signal for tracing human infections was published last week in the ISME Journal.

    Figure from paper 
    With Bethany Dearlove, Sam Sheppard and colleagues, we investigated common strains of campylobacter, the most frequent cause of bacterial gastroenteritis worldwide. Campylobacter infection is associated with food poisoning, particularly contaminated chicken. But in previous work, we found that certain strains (the ST-21, ST-45 and ST-828 complexes) are often found contaminating a range of meat and poultry, making it difficult to trace the source of human infection.

    That previous work was based on partial genome sequencing known as MLST. In MLST, less than 1% of the information in the genome is captured. Now that whole genome sequencing is available, the expectation was that we should be able to distinguish easily between ST-21, ST-45 and ST-828 strains contaminating poultry versus beef versus lamb, and so on.

    What we found was surprising. Instead of these strains harbouring previously unobserved sub-structure that allowed them to be associated with different animal sources, we found rapidly mixing populations undergoing extremely fast transmission between animal species, with campylobacter strains ricocheting among animal species on a timescale of just a few years. This is faster than they can accumulate enough mutations to differentiate populations colonizing different animal species.

    Our results present an unforeseen roadblock to tracing transmission with whole genome sequencing, and suggest that these strains are adapted to a generalist lifestyle, shedding new light on the ecology of this pathogen. These findings push back against the tide of opinion that whole genome sequencing is necessarily a panacea for detecting transmission, and demonstrate that, going forwards, a detailed understanding of the biology of zoonotic bacteria (those transmitting between multiple species) and intensive sampling of potential sources are essential for effectively tracing the source of human infection.

    Detecting mixed strain infections with whole genome sequencing

    Whole genome sequencing in near-to-real time is set to become a routine tool for outbreak detection by hospital and public health microbiology labs, following successful pilot studies in the UK last year. Typically, the bacteria are cultured from a clinical sample, and a single colony is picked for sequencing. Since a bacterial colony grows from a single cell, this procedure ensures that all the cells picked for sequencing are genetically identical, and this in turn helps piece the genome back together again following sequencing.

    But it exposes the system to a flaw. What would happen if a patient sick with two strains transmitted one, but not the other, to a second patient? Characterizing the genome of just one of the strains in the first patient risks missing the transmission event entirely, because the "wrong" strain might have been sequenced.

    One safeguard would be to sequence multiple bacterial colonies per sample, three for example. But this would increase the cost of routine surveillance three-fold.

    In a new paper published this month in PLoS Computational Biology, with David Eyre, Madeleine Cule, Sarah Walker and others, we have investigated an alternative solution, whereby a large number of colonies are sequenced all together. The cost is the same as that of sequencing a single colony. But the downstream bioinformatics analysis is complicated considerably by the presence of multiple strains. To cope with this, we developed a new computational method that reconstructs the identities of the multiple strains, using a panel of reference genomes to help where possible.

    By applying the approach to 26 clinical samples of Clostridium difficile hospital infections with known epidemiological relationships, we detected four mixed strain infections, one of which revealed a previously undetected transmission event within the hospital. For full details, read the open access paper.
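    As a rough illustration of why pooled sequencing can reveal a mixed infection, here is a minimal sketch. It is a much simpler heuristic than the method in the paper, which reconstructs the strains themselves: it only counts genome positions where the pooled reads support two different bases. The function names, thresholds and read counts are hypothetical.

```python
def minor_allele_fraction(base_counts):
    """Fraction of reads at one site supporting the second most common base."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    ordered = sorted(base_counts.values(), reverse=True)
    second = ordered[1] if len(ordered) > 1 else 0
    return second / total


def looks_mixed(sites, min_fraction=0.1, min_depth=20, min_sites=3):
    """Flag a pooled sample as possibly mixed.

    sites: list of dicts mapping base -> read count at each genome position.
    A site counts as evidence if it is well covered and a second base is
    present at an appreciable frequency. All thresholds are illustrative
    assumptions, not values taken from the paper.
    """
    supporting = sum(
        1
        for counts in sites
        if sum(counts.values()) >= min_depth
        and minor_allele_fraction(counts) >= min_fraction
    )
    return supporting >= min_sites


# Hypothetical single-strain sample: rare discordant reads from sequencing error.
single_strain = [{"A": 50, "C": 1}, {"G": 48}, {"T": 52, "A": 1}, {"C": 47}]

# Hypothetical mixed sample: a second base at roughly 30% at several sites.
mixed_strain = [{"A": 35, "G": 15}, {"C": 33, "T": 14}, {"G": 36, "A": 16}, {"T": 50}]

single_flag = looks_mixed(single_strain)
mixed_flag = looks_mixed(mixed_strain)
```

    Here the first sample is not flagged and the second is. A real analysis also has to separate genuine mixtures from sequencing and mapping errors and work out which variants belong to which strain, which is where the reference panel described above comes in.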