Announcing ProbGen22 in Oxford 28-30 March

The organizing committee is pleased to announce the 7th Probabilistic Modeling in Genomics Conference (ProbGen22), to be held at the Blavatnik School of Government and Somerville College, Oxford, from 28th to 30th March 2022.

The meeting will be a hybrid in-person and online event. Talk sessions will feature live speakers, both in-person and online, and will take place during the afternoons (making live attendance feasible for US timezones). Talks will be recorded and made available to registrants for a period of one month. Poster sessions will be held online during the evenings.

The conference will cover probabilistic models, algorithms, and statistical methods across a broad range of applications in genetics and genomics. We invite abstract submissions on a range of topics including population genetics, natural selection, quantitative genetics, methods for GWAS, applications to cancer and other diseases, causal inference in genetic studies, functional genomics, assembly and variant identification, phylogenetics, single cell 'omics, deep learning in genomics, and pathogen genomics.

The registration deadline is 28th February 2022.

For more details visit the conference website. 

New paper: GenomegaMap for dN/dS in over 10,000 genomes

Published this week in Molecular Biology and Evolution is a new paper, joint with the CRyPTIC Consortium: "GenomegaMap: within-species genome-wide dN/dS estimation from over 10,000 genomes".

The dN/dS ratio is a popular statistic in evolutionary genetics that quantifies the relative rates of protein-altering and non-protein-altering mutations. The ratio is scaled so that under neutral evolution - i.e. when the survival and reproductive advantage of all variants is the same - it equals 1. Typically, dN/dS is observed to be less than 1, meaning that new protein-altering mutations tend to be disfavoured, implying they are harmful to survival or reproduction. Occasionally, dN/dS is observed to be greater than 1, meaning that new protein-altering mutations are favoured, implying they provide some survival or reproductive advantage. The aim of estimating dN/dS is usually to identify mutations that provide an advantage.
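To make the scaling concrete, here is a minimal sketch of the simplest counting definition of dN/dS: changes per available site of each kind, then the ratio of the two. This is not the method implemented in genomegaMap, which is model-based. The function name and the example counts are hypothetical, and the sketch ignores corrections for multiple changes at the same site.

```python
def dn_ds(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Crude counting estimate of dN/dS (illustrative only).

    dN = protein-altering changes per protein-altering site
    dS = non-protein-altering changes per non-protein-altering site
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_changes == 0:
        raise ValueError("dN/dS is undefined with no synonymous changes")
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    return dn / ds


# Hypothetical gene with 700 protein-altering sites and 300 silent sites.
# Changes in proportion to sites (14 vs 6) give a ratio near 1 (neutral);
# a deficit of protein-altering changes (7 vs 12) gives a ratio below 1.
neutral_like = dn_ds(14, 700, 6, 300)
constrained = dn_ds(7, 700, 12, 300)
print(neutral_like, constrained)
```

Dividing by the number of sites of each kind is what makes the neutral expectation equal to 1, since there are usually more ways to alter a protein than to leave it unchanged.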

Theoreticians are often critical of dN/dS because it is more of a descriptive statistic than a process-driven model of evolution. This overlooks the problem that currently available models make simplifying assumptions such as minimal interference between adjacent mutations within genes. These assumptions are not obviously appropriate in many species, including infectious micro-organisms, that exchange genetic material infrequently.

There are many methods for measuring dN/dS. This new paper overcomes two common problems:
  • It is fast no matter how many genomes are analysed together.
  • It is robust whether there is frequent genetic exchange (which causes phylogenetic methods to report spurious signals of advantageous mutation) or infrequent genetic exchange.
The paper includes detailed simulations that establish the validity of the approach, and it goes on to demonstrate how genomegaMap can detect advantageous mutations in 10,209 genomes of Mycobacterium tuberculosis, the bacterium that causes tuberculosis. The method reproduces known signals of advantageous mutations that make the bacteria resistant to antibiotics, and it discovers a new signal of advantageous mutations in a cold-shock protein called deaD or csdA.

Software that implements genomegaMap is available on Docker Hub, and the source code and documentation are available on GitHub.

As the number of genome sequences steadily rises, analysing the data becomes increasingly challenging even with modern computers. It is hoped that this new method provides a useful way to exploit the opportunities in such large datasets to gain new insights into evolution.

Right Answer, Wrong Question

Author’s Note: This post was written with access only to the abstract, not the full text, of the journal article in question. Note that the argument is not with the methods or results of the research, but with how the research question has been presented.

University of Chicago Medicine & Biological Sciences tweeted the following tweet on Twitter today highlighting the work of post-doc Laure Ségurel on genetic risks for Type 2 Diabetes:
[Image: screenshot of the tweet]
The work itself is interesting in its own right. Investigating the population genetic history of genetic markers associated with Type 2 Diabetes risk could have multiple applications, beyond the high level of intellectual interest.

The question used to frame the research, however, troubles me, because it plays to general misconceptions about the evolutionary dominance and efficiency of natural selection in humans:

Why is this deleterious disease so common, while the associated genetic variants should be removed by natural selection? -Ségurel et al (Eur J Hum Genet. 2013 Jan 23. doi: 10.1038/ejhg.2012.295)

Selection is not the only force that can drive evolution. Other options include drift, mutation, and migration. These forces are distinguished from selection by their blindness to their effects on the fitness of organisms. I use “the adaptionist paradigm” to refer to the assumption held by many, including scientists, that selection is almost always the primary driving force of evolution and that it is efficient.

Human population structure is not well suited to efficient selection, owing to features such as a relatively small effective population size (a genetics concept that can differ greatly from actual population size) and relatively long generation times. Under these conditions, it is hardly surprising that natural selection has failed to scrub all manner of deleterious genetic differences from humans, and there is no need to postulate that what are now deleterious genetic differences were at one point “good for us”.
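One way to see why effective population size matters is Kimura's classic diffusion approximation for the chance that a single new mutation eventually spreads to the whole population. The sketch below is not taken from the paper under discussion. It assumes a diploid population, additive selection, and a census size equal to the effective size. The selection coefficient and the two population sizes are illustrative assumptions; 10,000 is a commonly quoted order of magnitude for humans.

```python
import math


def fixation_probability(s, n_e):
    """Kimura's approximation for a new mutation starting as a single copy.

    s   : selection coefficient (negative means deleterious), additive
    n_e : effective population size (assumed equal to census size here)
    """
    if n_e <= 0:
        raise ValueError("n_e must be positive")
    if abs(s) < 1e-12:
        return 1.0 / (2 * n_e)  # neutral case: fixation by drift alone
    exponent = -4.0 * n_e * s
    if exponent > 700:
        return 0.0  # avoids overflow; the probability is vanishingly small
    # (1 - exp(-2s)) / (1 - exp(-4 n_e s)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


s = -1e-4  # assumed: a mildly deleterious variant
for n_e in (10_000, 1_000_000):
    relative = fixation_probability(s, n_e) / fixation_probability(0.0, n_e)
    print(f"Ne={n_e}: fixation chance relative to a neutral variant = {relative:.3g}")
```

Under these assumptions, the same mildly deleterious variant keeps a non-trivial chance of fixing in the smaller population but effectively none in the larger one, which is the sense in which selection is "inefficient" when effective population size is small.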

At the same time, I hardly blame the authors for taking this approach to present their work [UPDATE 9 Mar 2013: Underlined text added to clarify that "approach" does not refer to how the research was conducted-see comment thread]. Their research directly addresses the “thrifty” genotype hypothesis (insulin resistance was beneficial for hunter-gatherers) for Type 2 Diabetes genetic risk factors, which still has a lot of traction. It is an adaptionist hypothesis addressing a problem emerging from the adaptionist paradigm. That this problem is not well supported by evolutionary theory does not change the fact that it is relevant to the field of research.

There is also an issue of the genetic differences and this one requires paying close attention to the jargon. In particular, we need to distinguish between the words associated and causal:

The ‘thrifty genotype’ hypothesis proposed that the causal genetic variants were advantageous and selected for during the majority of human evolution. It remains, however, unclear whether genetic data support this scenario. In this study, we characterized patterns of selection at 10 variants associated with type 2 diabetes… -Ségurel et al (Eur J Hum Genet. 2013 Jan 23. doi: 10.1038/ejhg.2012.295) [Emphasis mine]

Causal genetic differences are the ones that actually cause the increased risk. These are the locations where having one base pair instead of another means you have a different likelihood to get the disease. Associated genetic differences are relatively common genetic variations in the human population that can be linked statistically to risk. Associated differences are markers for causal differences, because they are usually, but not always inherited together. It’s like the golden arches sign for McDonald’s. The sign is not the McDonald’s, but it is almost always right next to the McDonald’s; and it strongly suggests the nearby building will not be a Burger King.

We tend to use associated differences (hence genome-wide association studies) because the large populations necessary for these studies need common variants, and working with a known set of differences is much easier technologically (i.e., it is much easier to genotype people).

This means that the researchers are, at some level, using a proxy for the evolutionary history of the causal differences. This can introduce confounding issues. For example, if the causal mutation entered the population after the associated difference, the associated difference is a poorer-quality marker. In such a case, natural selection could act to remove the causal difference but leave the associated marker, though then the marker would no longer be associated with risk [UPDATE 19:53 - a version issue caused this sentence to be truncated when posted].
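How good a proxy a marker is can be put in numbers with the standard r-squared measure of linkage disequilibrium between two sites. The sketch below is not from the study. The function name and all frequencies are hypothetical. It compares a marker that always travels with the causal variant against one where the causal variant arose later, on only some copies of the marker.

```python
def ld_r_squared(p_marker, p_causal, p_both):
    """r^2 between a marker allele and a causal allele.

    p_marker : frequency of the marker allele
    p_causal : frequency of the causal allele
    p_both   : frequency of chromosomes carrying both alleles
    """
    d = p_both - p_marker * p_causal  # disequilibrium coefficient D
    denom = p_marker * (1 - p_marker) * p_causal * (1 - p_causal)
    if denom <= 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Marker and causal variant always inherited together: a perfect proxy.
perfect = ld_r_squared(p_marker=0.30, p_causal=0.30, p_both=0.30)

# Causal variant arose later, on a subset of marker-carrying chromosomes:
# every causal copy carries the marker, but most marker copies are not causal.
younger_causal = ld_r_squared(p_marker=0.30, p_causal=0.05, p_both=0.05)

print(perfect, younger_causal)
```

In the second case the marker still points toward the causal variant, but weakly, so the evolutionary history read off the marker is mostly the history of chromosomes that do not carry the causal difference at all.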

Most genetic markers contribute only marginally, though statistically significantly, to a person’s individual risk. Their effect on individual fitness may not be particularly large. In addition, Type 2 Diabetes frequently manifests at an age when people have already reproduced. Pile on top of that the fact that human population structure (this pretty much goes for all large animals) does not have the characteristics necessary for the efficient operation of natural selection.

For my money, the better motivating question and the question the research really asks is:

Is selection acting on genetic differences associated with Type 2 Diabetes & is it favoring the risky or protective differences?

What they found was that selection is acting on genetic differences associated with Type 2 Diabetes risk and is favoring the protective variants. They also found that these results did not appear to depend on the lifestyle of the populations studied, refuting the “thrifty” genotype hypothesis. In doing so, the study answers my questions, which I happen to believe are the “right” questions, but its framing also perpetuates misconceptions about evolution.

I do understand why the researchers took this approach to present their work [UPDATE 9 Mar 2013: Underlined text added to clarify that "approach" does not refer to how the research was conducted-see comment thread]. The study is framed in a way that is relevant to the field, if not necessarily the science, which is kind of a shame, when you think about it.

*This post is also a reminder to me to check what the researchers said about the topic, instead of jumping to the conclusion that the University press officer got carried away trying to explain why a study is “interesting”. In this case, the public article reflects what the researchers wrote in their peer-reviewed, subscription-access journal article.


gammaMap available for download

The software gammaMap - which implements the analyses developed in Wilson, Hernandez, Andolfatto and Przeworski (2011) PLoS Genetics 7: e1002395 - is available for download. It is provided as part of a flexible program called GCAT (general computational analysis tool), which is designed to make it quick to develop novel variations on the standard analyses. GCAT has its own Google Code page, http://code.google.com/p/gcat-project. GCAT resembles BEAST and BUGS in that a statistical model is specified (using XML) and parameters are then estimated using MCMC or maximum likelihood. Future extensions to GCAT are planned that implement new fast approximations to gammaMap and omegaMap, and parallel processing, allowing the analyses to be scaled more readily to whole genomes.

New method inferring natural selection published today

I am pleased to report that my new paper "A population genetics-phylogenetics approach to inferring natural selection" is published today in PLoS Genetics. This is the culmination of two years' work at the University of Chicago with Molly Przeworski, plus a good deal of follow-up since I moved to Oxford. In the paper we introduce a new way of combining population genetics and phylogenetics models of natural selection, and a statistical method (gammaMap) for estimating parameters under the model. From a collection of sequences within one or more species - in the paper, we use 100 X-linked coding sequences that Peter Andolfatto produced in Drosophila melanogaster and D. simulans - the method allows you to estimate the distribution of fitness effects within each lineage, and localize the signal of selection using a Bayesian sliding window approach. Using Ryan Hernandez's simulator SFSCODE we tested the method for robustness to demographic change and linkage disequilibrium, and we investigated the effect that common assumptions concerning spatial variation in selection coefficients (sitewise, genewise and sliding window approaches) have on inference of selection. During the winter break I will work on compiling the program for different platforms and writing the documentation, with a view to releasing the software early in the New Year. Subscribe to this blog for updates or - if you are too impatient to wait - send me an email.

What are the conditions for multiple foci of adaptation?

Selection on standing variation, soft sweeps, parallel adaptation: these alternatives to the population genetics paradigm of the S-shaped selective sweep have in common the idea that the response of a species to a change in selection pressure may frequently involve multiple mutations, which may arise in multiple locales, and which may appear at different sites in the genome. Consequently, the footprint of selection in the genome is different to that expected under a single selective sweep and therefore likely to be missed by scans of the genome looking for selection.

Many examples of parallel adaptation have been put forward, for instance multiple drug resistance in the malaria parasite Plasmodium vivax. But how plausible is parallel adaptation as an evolutionary mechanism, and what are the conditions that make it likely? These questions were addressed by Graham Coop presenting joint work with his postdoc Peter Ralph in one of the stand-out talks of the SMBE conference in Lyon.

Their key finding is that the multifarious parameters that go into building a spatial model of adaptation (strength of selection, the mutation rate, population density, average dispersal distance of offspring) can be distilled down to a single key quantity: the characteristic length given by the equation
When the geographical extent of the species range exceeds this characteristic length, the conditions are right for parallel adaptation. Graham's talk made accessible the complex mathematics behind this result. He has kindly made the slides available (click here) and the paper is now available at the Genetics website (click here).

Discovering the distribution of fitness effects

At this year's Society for Molecular Biology and Evolution meeting in Lyon I presented ongoing work estimating the distribution of fitness effects, which is a collaborative venture with Molly Przeworski and Peter Andolfatto. Earlier versions of this research appeared in talks I presented at Chicago in December (Ecology and Evolution Departmental seminar) and Liverpool in January (UK Population Genetics Group meeting), and it follows on from last year's SMBE presentation in which I discussed methods to tease out sub-genic variation in selection pressure.

There is intrinsic interest in the fitness effects of novel mutations in coding regions of the genome, especially the relative frequency of occurrence of neutral, beneficial and deleterious variants. Yet estimating the distribution of fitness effects (the DFE) is also of practical use when localizing the signal of adaptive evolution. The reason is that in Bayesian analyses, the assumed DFE can influence the strength of evidence for or against adaptation at a particular site. Consequently, it is preferable to estimate the DFE at the same time as detecting adaptation at individual sites, to avoid prior assumptions unduly influencing the results.

Once estimated, the DFE is also useful for quantifying the relative contribution of adaptation versus drift to genome evolution. The figure, taken from my talk in Lyon (slides here), illustrates the idea when a normal distribution is used to estimate the DFE; the relative area of the green to the yellow shaded regions represents the respective contribution of adaptation versus drift in amino acid substitutions accrued along the Drosophila melanogaster lineage.

omegaMap at BioHPC

All evolutionary biologists wishing to make use of omegaMap now have access to a high performance parallel computing cluster via the internet courtesy of Cornell's CBSU and Microsoft. The software, which allows the detection of selection and recombination in DNA or RNA sequences, can be run via the web interface at cbsuapps.tc.cornell.edu/omegamap.aspx, or downloaded as part of the BioHPC suite.

The web interface consists of a simple form where users can upload their configuration file and sequences in FASTA format. Users are notified by e-mail when their jobs complete. To learn more about the project visit the CBSU home page.

Meanwhile, I am working on several major updates to omegaMap, the most interesting of which will probably be the development of a new model that allows for the joint analysis of natural selection acting on sequences from different populations or species. The aim is to integrate population genetic and phylogenetic models of selection in order to exploit the signal of selection contained both in polymorphism within populations (or species) and divergence between them. I will be presenting progress on this work, in the context of hominid evolution, at the 2009 SMBE meeting in Iowa City this June.

Human Evolution in New York City

Rounding off a hectic end to 2008 was a trip to visit Molly, currently on sabbatical in New York City. Joanna and I flew out to spend the final weekend before Christmas discussing projects and frequenting the local coffee shops, restaurants and bars. I took the opportunity to visit the American Museum of Natural History adjacent to Central Park after reading about its dinosaur collections in The Catcher in the Rye; pictured is an Allosaurus skeleton, which stands in the main entrance hall. Of particular interest was the Spitzer Hall of Human Origins, which features a wealth of fossil remains and artefacts including a cast of the Laetoli footprints and a diorama of an Australopithecus afarensis nuclear family. Fittingly, the very focus of the New York trip was to discuss the on-going project to characterize natural selection between hominid species.