Calculating time of divergence using genome sequences and mutation rates (humans vs other apes)

There are several ways to report a mutation rate. You can state it as the number of mutations per base pair per year, in which case a typical mutation rate for humans is about 5 × 10⁻¹⁰. Or you can express it as the number of mutations per base pair per generation (~1.5 × 10⁻⁸).

You can use the number of mutations per generation or per year if you are only discussing one species. In humans, for example, you can describe the mutation rate as 100 mutations per generation and just assume that everyone knows the number of base pairs (6.4 × 10⁹).

The intrinsic mutation rate depends on the error rate of DNA replication. We don't know the exact value of this error rate but it's pretty close to 10⁻¹⁰ per base pair when you take repair into account [Estimating the Human Mutation Rate: Biochemical Method]. For single-cell species you simply multiply this number by the number of base pairs in the genome to get a good estimate of the mutation rate per generation.

The calculation for multicellular species is much more complicated because you have to know the number of cell divisions between zygote and mature germ cells. In some cases it's impossible to know this number (e.g. flowering plants, yeast). In other cases we have a pretty good estimate: for example, in humans there are about 400 cell divisions between zygote and mature sperm and about 30 cell divisions between zygote and mature egg cells. The number of cell divisions depends on the age of the parent, especially in males [Parental age and the human mutation rate]. This effect is significant: older parents pass on twice as many mutations as the youngest parents.

The parental age effect is comparable to the extremes in estimations of the human mutation rate based on different ways of measuring it [Human mutation rates - what's the right number?] [Human mutation rates]. Those values range from about 60 mutations per generation to about 160 mutations per generation.

Thus, in the case of humans, we're dealing with estimates that differ by a factor of two depending on method and parental age.



Let's assume that each child is born with 100 new mutations. This seems like a reasonable number. It's on the high end of direct counts by sequencing parents and siblings but there are reasons to believe these counts are underestimated (Scally, 2016). On the other hand, this value (100 mutations) is on the low end of the estimates using the biochemical method and the phylogenetic method.

Most of these mutations occur in the father but some were contributed by the mother. Since the child is diploid, we calculate the mutation rate per bp as: 100 ÷ (6.4 × 10⁹) = 1.56 × 10⁻⁸ per base pair per generation. Assuming an average generation time of 30 years, this gives 1.56 × 10⁻⁸ ÷ 30 = 5.2 × 10⁻¹⁰ mutations per bp per year. That's the value given above (rounded to 5 × 10⁻¹⁰). Scally (2016) uses this same value except he assumes a generation time of 29 years.
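
The conversion above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text; the function names are mine, and the inputs (100 mutations per child, a diploid genome of 6.4 × 10⁹ bp, a 30-year generation) are the assumptions already stated.

```python
DIPLOID_GENOME_BP = 6.4e9  # diploid human genome size used in the text


def rate_per_bp_per_generation(mutations_per_generation, genome_bp=DIPLOID_GENOME_BP):
    """New mutations per base pair per generation."""
    return mutations_per_generation / genome_bp


def rate_per_bp_per_year(mutations_per_generation, generation_years,
                         genome_bp=DIPLOID_GENOME_BP):
    """New mutations per base pair per year."""
    return rate_per_bp_per_generation(mutations_per_generation, genome_bp) / generation_years


per_generation = rate_per_bp_per_generation(100)
per_year = rate_per_bp_per_year(100, 30)
print(f"{per_generation:.2e} per bp per generation")  # 1.56e-08
print(f"{per_year:.1e} per bp per year")              # 5.2e-10
```

Changing the generation time to 29 years, as Scally does, only shifts the per-year figure slightly.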

There are many who think this value is considerably lower than previous estimates and that this casts doubt on the traditional times of divergence of chimps and humans and the other great apes. For example, Scally (2016) says that prior to the availability of direct sequencing data the "consensus value" was 10 × 10⁻¹⁰ per bp per year.¹ That's twice the value he prefers today. It works out to 186 mutations per generation!
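
Going the other way, from a per-year rate back to mutations per generation, is the same arithmetic in reverse. The sketch below assumes the diploid genome size of 6.4 × 10⁹ bp and Scally's 29-year generation time; the function name is only illustrative.

```python
def mutations_per_generation(rate_per_bp_per_year, generation_years, genome_bp=6.4e9):
    """Convert a per-bp, per-year rate into new mutations per diploid genome per generation."""
    return rate_per_bp_per_year * genome_bp * generation_years


old_consensus = mutations_per_generation(10e-10, 29)
current_value = mutations_per_generation(5e-10, 29)
print(round(old_consensus))  # 186
print(round(current_value))  # 93
```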

I think it's been a long time since workers in the field assumed such a high mutation rate but let's assume he is correct and current estimates are considerably lower than those from twenty years ago.

You can calculate a time of divergence (t) between any two species if you know the genetic distance (d) between them measured in base pairs and the mutation rate (μ) in mutations per year.² The genetic distance can be estimated by comparing genome sequences and counting the differences. It represents the number of mutations that have become fixed in the two lineages since they shared a common ancestor. Haploid reference genome sequences are sufficient for this estimate.

The mutation rate (μ) is 100 mutations per generation divided by 30 years = 3.3 mutations per year.

The time of divergence is then calculated by dividing half that distance (in nucleotides) by the mutation rate (t = d/2 ÷ μ). (There are all kinds of "corrections" that can be applied to these values but let's ignore them for now and see what the crude data says.)

Human and chimp genomes differ by about 1.4%, which corresponds to 44.8 million nucleotide differences and d/2 = 22.4 million. Using 100 mutations per generation as the mutation rate means 5 × 10⁻¹⁰ per bp per year. From t = d/2 ÷ μ we get t = 6.8 million years.

This is a reasonable number. It's consistent with the known fossil record and it's in line with the current views of a divergence time for chimps and humans.

However, there are reasons to believe that some of the assumptions in this calculation are wrong. For example, the average generation time is probably not 30 years in both lineages over the last few million years. It's probably shorter, at least in the chimp lineage where the current generation time is 25 years. Using a generation time of 25 years gives a divergence time of 5.6 million years.
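
Here is a minimal sketch of the crude calculation t = d/2 ÷ μ as used above, with no corrections applied. The haploid genome size of 3.2 × 10⁹ bp is the value implied by 1.4% = 44.8 million differences; the function and variable names are my own.

```python
HAPLOID_GENOME_BP = 3.2e9  # implied by 1.4% = 44.8 million differences


def divergence_time_years(fraction_different, mutations_per_generation,
                          generation_years, genome_bp=HAPLOID_GENOME_BP):
    """Crude estimate t = (d / 2) / mu, following the text."""
    d = fraction_different * genome_bp                 # total nucleotide differences
    mu = mutations_per_generation / generation_years   # mutations per year
    return (d / 2) / mu


t30 = divergence_time_years(0.014, 100, 30)
t25 = divergence_time_years(0.014, 100, 25)
print(f"30-year generations: {t30 / 1e6:.2f} million years")
print(f"25-year generations: {t25 / 1e6:.2f} million years")
```

With the unrounded rate of 100/30 mutations per year the first value comes out near 6.7 million years; rounding μ to 3.3, as the text does, gives 6.8. The 25-year case gives 5.6 million years.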

In addition, the overall differences between the human and chimp genomes may be only 1.2% instead of 1.4% (see Moorjani et al., 2016). If you combine this value with the shorter generation time, you get 4.25 million years for the time of divergence.

Given the imprecision of the mutation rate, the question of real generation time, and problems in estimating the overall difference between humans and chimps, we can't know for certain what time of divergence is predicted by a molecular clock. On the other hand, the range of values (e.g. 4.25 - 6.8 million years) isn't cause for great concern.

So, what's the problem? The problem is that applying the human mutation rate (100 mutations per generation) to more distantly related species gives strange results. For example, Scally (2016) uses this mutation rate and a difference of 2.6% to estimate the time of divergence of humans and orangutans. The calculation yields a value of 26 million years. This is far too old according to the fossil record.

Several recent papers have addressed this issue (Scally, 2016; Moorjani et al., 2016a; Moorjani et al., 2016b). Most of the problem is solved by assuming a much higher mutation rate in the past. The biggest effect is the generation time in years. It may have been as low as 15 years for much of the past ten million years. Many of the problems go away when you adjust for this effect.

What puzzles me is the approach taken by Moorjani et al. in their two recent papers. They say that the "new" mutation rate is 5 × 10⁻¹⁰ per bp per year. That's exactly the value I use above. It's roughly 100 new mutations per child (per generation). Moorjani et al. (2016a) think this value is surprisingly low because it leads to a surprising result. They explain it in a section titled "The Puzzle."

They assume that the human and chimp genomes differ by 1.2%. That works out to 38 million mutations over the entire genome. This is 19 million fixed mutated alleles in each lineage if the mutation rate in both lineages is equal and constant.

If the mutation rate is 5 × 10⁻¹⁰ per bp per year then for a haploid genome this is 1.6 mutations per year. Dividing 19 million by 1.6 gives 11.9 million years (rounded to 12 million) for the time of divergence. This is the value quoted by the authors.
Taken at face value, this mutation rate suggests that African and non-African populations split over 100,000 years and a human-chimpanzee divergence time of 12 million years ago (Mya) (for a human–chimpanzee average nucleotide divergence of 1.2% at putatively neutral sites). These estimates are older than previously believed, but not necessarily at odds with the existing—and very limited—paleontological evidence for Homininae. More clearly problematic are the divergence times that are obtained for humans and orangutans or humans and OWMs [Old World Monkeys]. As an illustration, using whole genome divergence estimates for putatively neutral sites suggests a human–orangutan divergence time of 31 Mya and human–OWM divergence time of 62 Mya. These estimates are implausibly old, implying a human-orangutan divergence well into the Oligocene and OWM-hominoid divergence well into or beyond the Eocene. Thus, the yearly mutation rates obtained from pedigrees seem to suggest dates that are too ancient to be readily reconciled with the current understanding of the fossil record.
Here's the problem. If the mutation rate is 100 mutations per generation then this applies to DIPLOID genomes. Some of the mutations are contributed by the mother and some (more) by the father. If you apply this rate to a DIPLOID genome then the number of mutations per year is 3.3 (100/30 years). Or,

         5 × 10⁻¹⁰ per bp per year × 6.4 × 10⁹ bp (diploid) = 3.2 mutations per year

Dividing 19 million mutations by 3.2 gives a time of divergence of 5.9 million years. This is a reasonable number but it's half the value calculated by Moorjani et al. (2016a).
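
The two results being compared differ only in the genome size used to turn the per-bp rate into mutations per year. The sketch below reproduces both pieces of arithmetic side by side; it does not settle which genome size is the appropriate one, and the function name is just for illustration.

```python
RATE_PER_BP_PER_YEAR = 5e-10
FIXED_PER_LINEAGE = 19e6  # half of the ~38 million differences at 1.2%


def years_to_accumulate(fixed_mutations, rate_per_bp_per_year, genome_bp):
    """Years needed to accumulate fixed_mutations at the given per-bp rate."""
    mutations_per_year = rate_per_bp_per_year * genome_bp
    return fixed_mutations / mutations_per_year


haploid = years_to_accumulate(FIXED_PER_LINEAGE, RATE_PER_BP_PER_YEAR, 3.2e9)
diploid = years_to_accumulate(FIXED_PER_LINEAGE, RATE_PER_BP_PER_YEAR, 6.4e9)
print(f"haploid genome size: {haploid / 1e6:.1f} million years")  # 11.9
print(f"diploid genome size: {diploid / 1e6:.1f} million years")  # 5.9
```

Doubling the genome size doubles the number of mutations per year and therefore halves the estimated time, which is the whole of the factor-of-two discrepancy.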

They also calculate a value of 12.1 million years for the human-chimp divergence in their second paper (and 15.1 million years for the divergence of humans and gorillas) (Moorjani et al., 2016b).

I think their calculations are wrong because they used the haploid genome size rather than the diploid genome where the mutations are accumulating. Both these papers appear in good journals and both were peer-reviewed. Furthermore, the senior author, Molly Przeworski, is a Professor at Columbia University (New York, NY, USA) and she's an expert in this field.

What am I doing wrong? Is it true that a mutation rate of ~100 mutations per generation means that humans and chimpanzees must have been separated for 12 million years as Moorjani et al. say? Or is the real value 5.9 million years as I've calculated above?

Image Credit: The chromosome image is from Wikipedia: Creative Commons Attribution 2.0 Generic license. The chimp photo is also from Wikipedia.

1. Scally takes this value from Nachman and Crowell (2000) who claim that the mutation rate is ~2.5 × 10⁻⁸ mutations per bp in humans. This works out to 160 mutations per generation and an overall mutation rate of 8 × 10⁻¹⁰ based on a generation time of 30 years, not 10 × 10⁻¹⁰ as Scally states.

2. This assumes that all mutations are neutral. The rate of fixation of neutral alleles over time is equal to the mutation rate. Since 8% of the genome is under selection, it's not true that all mutations are neutral but to a first approximation it's not far off.

Moorjani, P., Gao, Z., and Przeworski, M. (2016a) Human germline mutation and the erratic evolutionary clock. PLoS Biology, 14:e2000744. [doi: 10.1371/journal.pbio.2000744]

Moorjani, P., Amorim, C.E.G., Arndt, P.F., and Przeworski, M. (2016b) Variation in the molecular clock of primates. Proc. Nat. Acad. Sci. (USA) 113:10607-10612. [doi: 10.1073/pnas.1600374113]

Nachman, M.W., and Crowell, S.L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics, 156:297-304. [PDF]

Scally, A. (2016) The mutation rate in human evolution and demographic inference. Current opinion in genetics & development, 41:36-43. [doi: 10.1016/j.gde.2016.07.008]

How much mitochondrial DNA in your genome?

Most mitochondrial genes have been transferred from the ancestral mitochondrial genome to the nuclear genome over the course of 1-2 billion years of evolution. They are no longer present in mitochondria but they are easily recognized because they resemble α-proteobacterial sequences more than the other nuclear genes [see Endosymbiotic Theory].

This process of incorporating mitochondrial DNA into the nuclear genome continues to this day. The latest human reference genome has about 600 examples of nuclear sequences of mitochondrial origin (= numts). Some of them are quite recent while others date back almost 70 million years—the limit of resolution for junk DNA [see Mitochondria are invading your genome!].

Estimating the number of numts isn't as easy as you might imagine. There are two main problems according to Hazkani-Covo and Martin (2017).
  1. Simple BLAST searches using mitochondrial sequences against the nuclear genome may overestimate the number of insertion events. That's because the hits need to be concatenated to see the extent of the insertion. You also need to take into account subsequent events, such as the insertion of a transposon into the mitochondrial fragment, that makes a single insertion event look like two independent events in genomic analyses.
  2. The number of numts may be underestimated because mitochondrial sequences are usually thought to be contaminants and they are removed from the genome sequence. There are several documented cases.
The authors examined 36 genomes for the presence of mitochondrial DNA. They looked at each potential event separately to verify that it was a genuine numt. They also looked for nupts—plastid DNA—in 24 genomes.
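
To see why raw hit counts overestimate insertion events, here is a minimal sketch of the concatenation idea in point 1: hits that lie close together on the same chromosome are merged into one candidate event. This is not the authors' pipeline. The coordinates are made up and the `max_gap` tolerance is an arbitrary, hypothetical parameter.

```python
from collections import defaultdict


def merge_hits(hits, max_gap=2000):
    """Group (chrom, start, end) hits into candidate insertion events.

    Hits on the same chromosome that overlap or lie within max_gap bp
    of each other are treated as one event. max_gap is an arbitrary,
    illustrative value.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((start, end))

    events = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start - current_end <= max_gap:
                # Close enough to the running event: extend it.
                current_end = max(current_end, end)
            else:
                events.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        events.append((chrom, current_start, current_end))
    return events


# Made-up coordinates, purely for illustration.
hits = [
    ("chr1", 1000, 1400),
    ("chr1", 1500, 2100),    # 100 bp from the previous hit: same event
    ("chr1", 90000, 90500),  # far away: separate event
    ("chr2", 500, 900),
]
events = merge_hits(hits)
print(len(hits), "hits ->", len(events), "candidate events")  # 4 hits -> 3 candidate events
```

A gap tolerance like this would also bridge a fragment that was later split by a short insertion, but a long transposon would need a larger tolerance or a separate check.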

The results vary from a low of 7 numts to a high of 6550 numts depending on the size of the genome. The best estimate for humans is 592, which is pretty much in line with earlier results. The number of nupts in plants and algae is about the same.

Image Credit: Moran, L.A., Horton, H.R., Scrimgeour, K.G., and Perry, M.D. (2012) Principles of Biochemistry 5th ed., Pearson Education Inc. page 175 [Pearson: Principles of Biochemistry 5/E]

Hazkani-Covo, E., and Martin, W.F. (2017) Quantifying the number of independent organelle DNA insertions in genome evolution and human health. Genome Biology and Evolution, evx078. [doi: 10.1093/gbe/evx078]

Lateral gene transfer in eukaryotes – where’s the evidence?

Lateral gene transfer (LGT), or horizontal gene transfer (HGT), is widespread in bacteria. It leads to the creation of pangenomes for many bacterial species where different subpopulations contain different subsets of genes that have been incorporated from other species. It also leads to confusing phylogenetic trees such that the history of bacterial evolution looks more like a web of life than a tree [The Web of Life].

Bacterial-like genes are also found in eukaryotes. Many of them are related to genes found in the ancestors of modern mitochondria and chloroplasts and their presence is easily explained by transfer from the organelle to the nucleus. Eukaryotic genomes also contain examples of transposons that have been acquired from bacteria. That's also easy to understand because we know how transposons jump between species.

The literature on eukaryotic genomes is full of additional claims of LGT from bacteria (and other eukaryotes) but many of those have subsequently been attributed to contamination of genomic DNA [see Contaminated genome sequences]. Nevertheless, it's commonly accepted that lateral gene transfer from bacteria to eukaryotes is real and each new eukaryotic genome has several hundred genes acquired from bacteria. It usually accounts for about 1% of the genome. For example, even after extensive analysis of tardigrade genome sequences, there's still somewhere between 1% and 2% HGT/LGT (Yoshida et al., 2017).

An extensive analysis of the finished human genome sequence still suggested that there were 145 genes derived from LGT (Crisp et al., 2015). Those same authors claim to have detected a low level of LGT/HGT in dozens of other eukaryotic species. Here's what they say in their abstract ...
We have taken advantage of the recent availability of a sufficient number of high-quality genomes and associated transcriptomes to carry out a detailed examination of HGT in 26 animal species (10 primates, 12 flies and four nematodes) and a simplified analysis in a further 14 vertebrates. Genome-wide comparative and phylogenetic analyses show that HGT in animals typically gives rise to tens or hundreds of active ‘foreign’ genes, largely concerned with metabolism. Our analyses suggest that while fruit flies and nematodes have continued to acquire foreign genes throughout their evolution, humans and other primates have gained relatively few since their common ancestor. We also resolve the controversy surrounding previous evidence of HGT in humans and provide at least 33 new examples of horizontally acquired genes.
That result was challenged by Salzberg (2017) who presented convincing evidence that many of the LGT claims were due to contamination, or they are mitochondrial genes, or they did not meet the minimal standards for LGT claims. He says,
In this study, I re-examined the claims of Crisp et al. [1] focusing on the human genes. Instead of using a large-scale, automated analysis, which by its very nature could enrich the results for artifactual findings, I looked at each human gene individually to determine whether the evidence is sufficient to support the conclusion that HGT occurred. An important principal here is that extraordinary claims require extraordinary evidence: there is no doubt that the vast majority of human genes owe their presence in the human genome to the normal process of inheritance by vertical descent. Thus, if other, more mundane processes can explain the alignments of a human gene sequence, these explanations are far more likely than HGT.
Bill Martin is also skeptical. He argues that even the low level of LGT claimed for eukaryotes is too much. He says there's no solid evidence to support those claims and that they persist because researchers are not thinking critically about their results and the consequences (Martin, 2017). He says,
Claims for LGT among eukaryotes essentially did not exist before we had genomes because, in contrast to prokaryotes, there are no characters known among eukaryotes that require LGT in order to explain their distribution, except perhaps the spread of plastids via secondary symbiosis. Today, claims for eukaryote LGT are common in the literature, so common that students or nonspecialists might get the impression that there is no difference between prokaryotic and eukaryotic genetics. The time has come where we need to ask whether the many claims for eukaryote LGT – prokaryote to eukaryote LGT and eukaryote to eukaryote LGT – are true.
There are several problems with these claims according to Bill Martin. First, the pattern of LGT doesn't conform to what we see in bacteria where entire clades have inherited genes transferred from bacteria. Most of the claims of LGT are confined to a single species. Second, there's no reasonable mechanism for LGT as there is in bacteria.
The reality checks are simple. If the claims are true, then we need to see evidence in eukaryotic genomes for the cumulative effects of LGT over time, as we see with pangenomes in prokaryotes, and as we see with sequence divergence. That is, the number of genes acquired by LGT needs to increase in eukaryotic lineages as a function of time. We also need to see evidence for genetic mechanisms that could spread genes across eukaryote species (and order, and phylum) boundaries, as we see in prokaryotes. If we do not see the cumulative effects, and if there are no tangible genetic mechanisms, then we have to openly ask why, and entertain the possibility that the claims might not be true. Could it be that eukaryote LGT does not really exist to any significant extent in nature, but is an artefact produced by genome analysis pipelines?
This is not a popular view. That's not surprising coming from Bill Martin because he often challenges the current dogmas. He raises an issue that's more important than the presence of LGT in eukaryotes and that's the tendency of today's scientists to adopt a consensus view without thinking critically.
Why should I care about eukaryote LGT anyway? Is not the practical solution to just believe what everyone else does and “get with the programme” as a prominent eukaryote LGT proponent recently recommended that I do (Dan Graur is my witness). At eukaryote genome meetings, where folks pride themselves on the amounts and kinds of LGT they are finding in a particular eukaryote genome (not in all genomes), I feel like Winston Smith in Orwell's novel 1984, listening to an invented truth recited by members of the Inner Party. My mentors taught me that students of the natural sciences are not obliged to get with anyone's program, instead we are supposed to think independently and always to critically inspect, and re-inspect, current premises. Doing "get with the program" science in herds can produce curious effects. For example, the well-managed ENCODE project that ascribed a function to 80% of the human genome was a textbook case of everyone "getting with the program," and everyone, however, also missing the point, obvious to evolutionary biologists, that the headline result of 80% function cannot be true.

Image Credit: Scientific American, Doolittle, W. (2000) Uprooting the Tree of Life. Scientific American, February 2000.

Crisp, A., Boschetti, C., Perry, M., Tunnacliffe, A., and Micklem, G. (2015) Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes. Genome biology, 16:50. [doi: 10.1186/s13059-015-0607-3]

Martin, W.F. (2017) Too Much Eukaryote LGT. BioEssays, 1700115. [doi: 10.1002/bies.201700115]

Salzberg, S.L. (2017) Horizontal gene transfer is not a hallmark of the human genome. Genome biology, 18:85. [doi: 10.1186/s13059-017-1214-2]

Yoshida, Y., Koutsovoulos, G., Laetsch, D.R., Stevens, L., Kumar, S., Horikawa, D.D., Ishino, K., Komine, S., Kunieda, T., and Tomita, M. (2017) Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. PLoS Biology, 15:e2002266. [doi: 10.1371/journal.pbio.2002266]

Contaminated genome sequences

The authors of the original draft of the human genome sequence claimed that hundreds of genes had been acquired from bacteria by lateral gene transfer (LGT) (Lander et al., 2001). This claim was abandoned when the "finished" sequence was published a few years later (International Human Genome Consortium, 2004) because others had shown that the data was easily explained by differential gene loss in other lineages or by bacterial contamination in the draft sequence (see Salzberg, 2017).

Subsequent papers on eukaryotic genome sequences frequently reported the presence of several hundred bacterial genes due to LGT. The most extraordinary claim was that 17% of a tardigrade genome was due to LGT (Boothby et al., 2015). This claim led to the creation of a giant tardigrade that controlled the displacement-activated spore hub drive on Star Trek: Discovery. It was able to interface with the spore network by incorporating mycelium DNA using lateral gene transfer ['Star Trek: Discovery' Mudd-ies Up Tardigrade Science].

Unfortunately, the creators of the new Star Trek series didn't read the paper that came out a few months later showing that most of the bacterial DNA was due to contamination and not LGT (Koutsovoulos et al., 2016).

Bacterial DNA isn't the only contaminant. Longo et al. (2011) documented many cases of genome sequences that were contaminated with human DNA (Alu sequences). They found 492 genome sequences (out of 2,749) that contained detectable amounts of human DNA. Here's what they say in the abstract of their paper ...
Using a primate specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and have found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio) with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise as well as present data that gives insight as to how this may be occurring.
The take-home lesson is that draft sequences are often unreliable. Additional analysis (curation/annotation) often reveals numerous examples of contamination from unrelated sequences (e.g. Yoshida et al., 2017).

Boothby, T.C., Tenlen, J.R., Smith, F.W., Wang, J.R., Patanella, K.A., Nishimura, E.O., Tintori, S.C., Li, Q., Jones, C.D., and Yandell, M. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proc. Natl. Acad. Sci. (USA), 112:15976-15981. [doi: 10.1073/pnas.1510461112]

International Human Genome Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931-945. [doi: 10.1038/nature03001]

Koutsovoulos, G., Kumar, S., Laetsch, D.R., Stevens, L., Daub, J., Conlon, C., Maroon, H., Thomas, F., Aboobaker, A.A., and Blaxter, M. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc. Natl. Acad. Sci. (USA), 113:5053-5058. [doi: 10.1073/pnas.1600338113]

Lander, E. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409:860-921. [doi: 10.1038/35057062]

Longo, M.S., O'Neill, M.J., and O'Neill, R.J. (2011) Abundant human DNA contamination identified in non-primate genome databases. PloS One, 6:e16410. [doi: 10.1371/journal.pone.0016410]

Salzberg, S. L. (2017) Horizontal gene transfer is not a hallmark of the human genome. Genome Biology, 18:85. [doi: 10.1186/s13059-017-1214-2]

Yoshida, Y., Koutsovoulos, G., Laetsch, D.R., Stevens, L., Kumar, S., Horikawa, D. D., Ishino, K., Komine, S., Kunieda, T., and Tomita, M. (2017) Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. PLoS Biology, 15:e2002266. [doi: 10.1371/journal.pbio.2002266]

Parental age and the human mutation rate




Mutations are mostly due to errors in DNA replication. We have a pretty good idea of the accuracy of DNA replication: the overall error rate is about 10⁻¹⁰ per bp. There are about 30 cell divisions in females between zygote and formation of all egg cells. In males, there are about 400 mitotic cell divisions between zygote and formation of sperm cells. Using these average values, we can calculate the number of mutations per generation. It works out to about 130 mutations per generation [Estimating the Human Mutation Rate: Biochemical Method].
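
The biochemical estimate can be sketched as a simple product of error rate, genome size, and number of cell divisions. The haploid genome size of 3.2 × 10⁹ bp is my assumption (each gamete carries one haploid genome), and the function name is only illustrative.

```python
ERROR_RATE_PER_BP = 1e-10  # errors per bp per replication, from the text
HAPLOID_GENOME_BP = 3.2e9  # assumed haploid genome size


def germline_mutations(cell_divisions, error_rate=ERROR_RATE_PER_BP,
                       genome_bp=HAPLOID_GENOME_BP):
    """Expected new mutations in one gamete after the given number of divisions."""
    return cell_divisions * error_rate * genome_bp


maternal = germline_mutations(30)   # egg lineage
paternal = germline_mutations(400)  # sperm lineage
total = maternal + paternal
print(f"{maternal:.1f} + {paternal:.1f} = {total:.1f} mutations per generation")
```

With these round inputs the product comes to roughly 138, which is close to the figure of about 130 quoted above given how rough the inputs are.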

This value is similar to the estimate from comparing the sequences of different species (e.g. human and chimpanzee) based on the number of differences and the estimated time of divergence. This assumes that most of the genome is evolving at the rate expected for fixation of neutral alleles. This phylogenetic method gives a value of about 112 mutations per generation [Estimating the Human Mutation Rate: Phylogenetic Method].

The third way of measuring the mutation rate is to directly compare the genome sequence of a child and both parents (trios). After making corrections for false positives and false negatives, this method yields values of 60-100 mutations per generation depending on how the data is manipulated [Estimating the Human Mutation Rate: Direct Method]. The lower values from the direct method call into question the dates of the split between the various great ape lineages. This controversy has not been resolved [Human mutation rates] [Human mutation rates - what's the right number?].

It's clear that males contribute more to evolution than females. There's about a ten-fold difference in the number of cell divisions in the male line compared to the female line; therefore, we expect there to be about ten times more mutations inherited from fathers. This difference should depend on the age of the father since the older the father the more cell divisions required to produce sperm.

This effect has been demonstrated in many publications. A maternal age effect has also been postulated but that's been more difficult to prove. The latest study of Icelandic trios helps to nail down the exact effect (Jónsson et al., 2017).

The authors examined 1,548 trios consisting of parents and at least one offspring. They analyzed 2,682 Mb of genome sequence (84% of the total genome) and discovered an average of 70 mutation events per child.¹ This gives an overall mutation rate of 83 mutations per generation with an average generation time of 30 years. This is consistent with previous results.
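
The step from 70 observed mutations to 83 per generation looks like a simple scaling by the fraction of the genome that was analyzed. The sketch below assumes exactly that, with mutations spread evenly across covered and uncovered regions; the function name is mine.

```python
def genome_wide_mutations(observed, fraction_covered):
    """Scale an observed count up to the whole genome, assuming uniform density."""
    if not 0 < fraction_covered <= 1:
        raise ValueError("fraction_covered must be in (0, 1]")
    return observed / fraction_covered


estimate = genome_wide_mutations(70, 0.84)
print(round(estimate))  # 83
```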

Jónsson et al. looked at 225 cases of three-generation data in order to make sure that the mutations were germline mutations and not somatic cell mutations. They plotted the numbers of mutations against the age of the father and mother to produce the following graph from Figure 1 of their paper.

Look at parents who are 30 years old. At this age, females contribute about 10 mutations and males contribute about 50. This is only a five-fold difference, much less than we expect from the number of cell divisions. This suggests that the initial estimates of 400 cell divisions in males might be too high.
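
The comparison can be put side by side in a couple of lines. The inputs are the approximate values quoted in this post (400 and 30 cell divisions; about 50 and 10 mutations at age 30), and the helper name is only for illustration.

```python
def fold_difference(paternal, maternal):
    """Ratio of a paternal quantity to the matching maternal one."""
    return paternal / maternal


expected = fold_difference(400, 30)  # from cell divisions
observed = fold_difference(50, 10)   # from mutation counts at age 30
print(f"expected about {expected:.0f}-fold, observed about {observed:.0f}-fold")
```

The ratio of cell divisions is a bit over 13, which is the "about ten-fold" expectation mentioned earlier, against an observed ratio of 5.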

An age effect on mutations from the father is quite apparent and expected. A maternal age effect has previously been hypothesized but this is the first solid data that shows such an effect. The authors speculate that oocytes accumulate mutations with age, particularly mutations due to strand breakage.

1. Of these, 93% were single nucleotide changes and 7% were small deletions or insertions.

Jónsson, H., Sulem, P., Kehr, B., Kristmundsdottir, S., Zink, F., Hjartarson, E., Hardarson, M.T., Hjorleifsson, K.E., Eggertsson, H.P., and Gudjonsson, S.A. (2017) Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature, 549:519-522. [doi: 10.1038/nature24018]

The history of DNA sequencing

This year marks the 40th anniversary of DNA sequencing technology (Gilbert and Maxam, 1977; Sanger et al., 1977).¹ The Sanger technique soon took over and by the 1990s it was the only technique used to sequence DNA. The development of reliable sequencing machines meant the end of those large polyacrylamide gels that we all hated.

Pyrosequencing was developed in the mid-1990s and by the year 2000 massively parallel sequencing using this technique was becoming quite common. This "NextGen" sequencing technique was behind the massive explosion in sequences in the early part of the 21st century.²

Even newer techniques are available today and there's a debate about whether they should be called Third Generation Sequencing (Heather and Chain, 2015).

Nature has published a nice review of the history of DNA sequencing (Shendure et al., 2017). I recommend it to anyone who's interested in the subject. The figure above is taken from that article.

1. Many labs were using the technology in 1976 before the papers were published.

2. New software and enhanced computer power played an important, and underappreciated, role.

Heather, J.M., and Chain, B. (2015) The sequence of sequencers: The history of sequencing DNA. Genomics, 107:1-8. [doi: 10.1016/j.ygeno.2015.11.003]

Maxam, A.M., and Gilbert, W. (1980) Sequencing end-labeled DNA with base-specific chemical cleavages. Methods in enzymology, 65:499-560. [doi: 10.1016/S0076-6879(80)65059-9]

Sanger, F., Nicklen, S., and Coulson, A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74:5463-5467. [PDF]

Shendure, J., Balasubramanian, S., Church, G.M., Gilbert, W., Rogers, J., Schloss, J.A., and Waterston, R.H. (2017) DNA sequencing at 40: past, present and future. Nature, 550:345-353. [doi: 10.1038/nature24286]

Escape from X chromosome inactivation

Mammals have two sex chromosomes: X and Y. Males have one X chromosome and one Y chromosome and females have two X chromosomes. Since females have two copies of each X chromosome gene, you might expect them to make twice as much gene product as males of the same species. In fact, males and females often make about the same amount of gene product because one of the female X chromosomes is inactivated by a mechanism that causes extensive chromatin condensation.

The mechanism is known as X chromosome inactivation. The phenomenon was originally discovered by Mary Lyon (1925-2014) [see Calico Cats].

It's been known for a long time that some genes are fully repressed by X chromosome inactivation while others are only partially repressed and still others are fully expressed. The three different patterns are illustrated in the figure on the right taken from Figure 1 of a paper by Tukiainen et al. (2017).

The colored bars are genes on the X chromosome of females. Upward pointing arrows indicate that the gene is expressed and upward pointing T's indicate repression. The first pattern is the one that indicates standard X chromosome inactivation where genes on one of the chromosomes are transcribed and genes on the inactivated chromosome are repressed. The second pattern is "escape" where homologous genes on both chromosomes are expressed. The third pattern is "variable" and there are a number of subcategories. Sometimes the gene on the "inactivated" chromosome is expressed in only one or two different tissues but repressed in all others. Sometimes the expression of genes on both chromosomes is variable with different levels of transcription in different tissues.

Up until now, it hasn't been possible to fully explore the escape from X chromosome inactivation (XCI) in human tissues because techniques for analyzing all of the human X chromosome genes in multiple tissues weren't readily available. However, the Genotype-Tissue Expression Consortium (GTEx) changed all that. This project looked at RNAs produced in 44 different tissues from 449 recently deceased individuals.

One of the studies examined the expression of 681 X-chromosome genes in males and females (Tukiainen et al., 2017). If the level of RNA produced in female tissues was greater than the level in male tissues, that indicated escape of some sort. Another study looked at expression in subjects who were heterozygous for particular genes. The two alleles could be distinguished by RNA sequencing and thus expression of each allele on different X chromosomes could be monitored. There were 186 genes with allelic differences.
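Here's a minimal sketch of the logic behind the first approach, comparing RNA levels in females and males. The gene names, the expression values, the pseudocount, and the cutoff are all made up for illustration; they are not the thresholds or statistics used by Tukiainen et al.

```python
from math import log2


def log2_female_to_male(female_level, male_level, pseudocount=1.0):
    """Return log2(female/male) expression; the pseudocount avoids dividing by zero."""
    return log2((female_level + pseudocount) / (male_level + pseudocount))


def looks_like_escape(female_level, male_level, min_log2_ratio=0.3):
    """Flag a gene as a candidate escapee if females make clearly more RNA than males.

    min_log2_ratio is a hypothetical cutoff chosen only for this example.
    """
    return log2_female_to_male(female_level, male_level) >= min_log2_ratio


# Hypothetical (female, male) expression levels for two imaginary genes.
genes = {"GENE_A": (200.0, 100.0), "GENE_B": (105.0, 100.0)}
calls = {name: looks_like_escape(f, m) for name, (f, m) in genes.items()}
print(calls)
```

A gene that's fully silenced on the inactive X should give a ratio near zero on the log scale, while a gene that escapes should be shifted toward higher expression in females. A real analysis has to deal with tissue differences and statistical noise, which this sketch ignores.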

The overall results can be summarized in the figure below based on overall RNA levels. It shows that 15% of the X chromosome genes escape XCI completely and 16% exhibit variable escape. X chromosome inactivation works for 69% of the genes.

Tukiainen, T., Villani, A.-C., Yen, A., Rivas, M.A., Marshall, J.L., Satija, R., Aguirre, M., Gauthier, L., Fleharty, M., Kirby, A., Cummings, B.B., Castel, S.E., Karczewski, K.J., Aguet, F., Byrnes, A., Lappalainen, T., Regev, A., Ardlie, K. G., Hacohen, N., and MacArthur, D.G. (2017) Landscape of X chromosome inactivation across human tissues. Nature, 550:244-248. [doi: 10.1038/nature24265]

Creationists questioning pseudogenes: the GULO pseudogene

This is the second post discussing creationist1 papers on pseudogenes. The first post addressed a paper by Jeffrey Tomkins on the β-globin pseudogene [Creationists questioning pseudogenes: the beta-globin pseudogene]. This post covers another paper by Tomkins claiming that the GULO pseudogenes in various primate species are not derived from a common ancestor but instead have been deactivated independently in each lineage.

The Tomkins article was published in 2014 in Answers Research Journal, a publication that describes itself like this:
ARJ is a professional, peer-reviewed technical journal for the publication of interdisciplinary scientific and other relevant research from the perspective of the recent Creation and the global Flood within a biblical framework.
Tomkins explains two fundamental axioms of Young Earth Creationism.
  1. "An emerging theme from the continuing progression of genomics research across the spectrum of eukaryotic life is the widespread decay of pathways for vitamin-synthesis (Helliwell, Wheeler, and Smith 2013). This paradigm is of great importance to the creationist model of genetic entropy which postulates that genomes are in a continual state of degradation over time, not forward progressing evolution (Sanford 2010)."
  2. "Another important component of the creationist model of origins is the idea of molecular discontinuity between unrelated taxon (Tomkins and Bergman 2013). As will be demonstrated in this report, the enigma of the GULO pseudogene analyzed in the light of new genomic evidence most closely aligns with a creationist model incorporating both of these paradigms."
The idea here is that the loss of a gene for synthesizing vitamin C (GULO gene)2 is consistent with the YEC view of increasing loss and degradation of the genome. Such degradation must occur within species since the YEC model doesn't allow for shared ancestry. The main question Tomkins addresses is whether the pattern of GULO pseudogenes in various species is consistent with gene loss in an ancestral species and subsequent inheritance of a pseudogene in different lineages or whether the pattern is consistent with separate and independent loss in related species.

As you might have guessed, Tomkins argues that the pattern is inconsistent with common ancestry and lends support to Young Earth Creationism. Here's the article ...
The Human GULO Pseudogene—Evidence for Evolutionary Discontinuity and Genetic Entropy
Jeffrey P. Tomkins, Institute for Creation research, Dallas, TX, USA
Answers in Genesis

Abstract: Modern genomics provides the ability to screen the DNA of a wide variety of organisms to scrutinize broken metabolic pathways. This wealth of data has revealed wide-spread genetic entropy in human and other genomes. Loss of the vitamin C pathway due to deletions in the GULO (L-gulonolactone oxidase) gene has been detected in humans, apes, guinea pigs, bats, mice, rats, pigs, and passerine birds. Contrary to the popularized claims of some evolutionists and neo-creationists, patterns of GULO degradation are taxonomically restricted and fail to support macroevolution. Current research and data reported here show that multiple GULO exon losses in human, chimpanzee, and gorilla occurred independently in each taxon and are associated with regions containing a wide variety of transposable element fragments. Thus, they are another example of sequence deletions occurring via unequal recombination associated with transposable element repeats. The 28,800 base human GULO region is only 84% and 87% identical compared to chimpanzee and gorilla, respectively. The 13,000 bases preceding the human GULO gene, which corresponds to the putative area of loss for at least two major exons, is only 68% and 73% identical to chimpanzee and gorilla, respectively. These DNA similarities are inconsistent with predictions of the common ancestry paradigm. Further, gorilla is considerably more similar to human in this region than chimpanzee—negating the inferred order of phylogeny. Taxonomically restricted gene degradation events are emerging as a common theme associated with genetic entropy and systematic discontinuity, not macroevolution.
The GULO gene encodes the enzyme L-gulono-γ-lactone oxidase, the terminal enzyme in the synthesis of ascorbic acid. Ascorbic acid is required in the synthesis of collagen and a few other processes in mammals. Mutations in the GULO gene can lead to loss of function but this is not lethal in many species because they get enough ascorbic acid in their diet.

The human gene is nonfunctional giving rise to a unitary pseudogene located on chromosome 8 at p21. As a result, ascorbic acid is now an essential component of the human diet. Because it has become essential, it is now called a vitamin (vitamin C) (see Helliwell et al., 2013) [Human GULOP Pseudogene].

The standard explanation for the origin of this pseudogene—and all other unitary pseudogenes—is that the original gene became inactivated by mutation at some time in the past. That null allele then became fixed in the population by random genetic drift. All descendants of that population inherited the pseudogene.
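To see how a null allele can become fixed by chance alone, here's a minimal sketch of neutral drift using a simple Wright-Fisher style resampling model. The population size, the starting number of copies, and the function names are assumptions chosen for illustration; nothing here is fitted to real primate populations.

```python
import random


def neutral_allele_fixes(copies, population_copies, rng):
    """Follow one neutral allele until it is lost (False) or fixed (True)."""
    while 0 < copies < population_copies:
        freq = copies / population_copies
        # Each gene copy in the next generation is drawn at random
        # from the current pool, with no selection for or against the allele.
        copies = sum(rng.random() < freq for _ in range(population_copies))
    return copies == population_copies


def fixation_fraction(copies, population_copies, trials, seed=0):
    """Fraction of independent replicate populations in which the allele fixes."""
    rng = random.Random(seed)
    fixed = sum(
        neutral_allele_fixes(copies, population_copies, rng) for _ in range(trials)
    )
    return fixed / trials


# Hypothetical example: 20 of 40 gene copies carry the null allele.
print(fixation_fraction(20, 40, trials=400))
```

For a neutral allele the chance of eventual fixation equals its current frequency, so the simulated fraction should come out close to one half in this example. A brand new mutation starts at a much lower frequency and is usually lost, but some of them do get fixed.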

Tomkins takes a scattergun approach to the problem by bringing up all kinds of objections to the standard explanation. I don't have time to discuss all of his objections and I don't have enough knowledge of some of the issues to respond to his points. For example, I don't know enough about bird evolution to say whether the pattern of GULO gene loss is compatible with common ancestry or not.

Let's just look at the pseudogenes in primates to see which explanation is more reasonable. Lachapelle and Drouin (2011) looked at the pattern of neutral substitutions in the primate lineages. All Haplorrhini3 primates (e.g. humans, chimpanzees, macaque, gibbon etc) have a pseudogene with certain shared characteristics, including a number of identical substitutions. This suggests that the ancestor of all Haplorrhini primates contained the pseudogene, which must have arisen shortly after the split between Haplorrhini and Strepsirrhini (lemurs, galagos, etc.). According to the fossil record, the split occurred about 63 million years ago.

Lachapelle and Drouin calculated that the pseudogene must have arisen about 61 Mya based on the neutral substitution rate. The fact that these values are so close lends support to the idea that all Haplorrhini species are derived from a common ancestor that lost the GULO gene.

The authors also looked at specific deletions to see if the results are consistent with common ancestry. All of the primate pseudogenes are missing exons 1 and 2 of the intact, functional, gene.4 They compared the sequences of the human, chimpanzee, and macaque genes to that of the galago gene. The result is shown in Figure 4 of their paper (see below).

Note that the large deletion of the two exons ("deletion") occurs at the same position in the human, chimpanzee, and macaque genomes. All three genomes also have two identical seven base pair indels in the upstream region preceding exon 1. This is evidence of common ancestry.

Lachapelle and Drouin were testing the hypothesis that large deletions in the GULO pseudogene were due to aberrant recombination between flanking transposable elements (TE). They mapped all surrounding transposons in the primate genes and concluded that TE's did not play a role in the deletion.

Tomkins discusses this paper in his creationist journal article. He ignores the evidence of common descent and focuses instead on the transposable elements. He points out that Lachapelle and Drouin failed to find evidence that TE's were responsible for the deletions. Here's what he wants his readers to conclude from a paper that strongly supports common descent ....
Despite the fact that TEs are apparently one of the main genomic drivers of deletion events in the genome, the researchers (Lachapelle and Drouin 2011) concluded that the lineage specific TE insertion patterns, which defied the standard inferred evolutionary model for primates, did not contribute to the loss of exons in the GULO gene. Thus, their evolutionary presuppositions caused the rejection of otherwise strong genomic data that implicated TE related unequal recombination at the GULO locus (resulting in exon deletion) that occurred in taxonomically restricted events.
I think it's disingenuous of Tomkins to focus on that aspect of the study while ignoring all the evidence for common descent.

The GULO pseudogene locus on human chromosome 8 is in a gene-rich region. Orthologous genes are present at the same site in all vertebrate species although the order of the surrounding genes has been repeatedly shuffled by microrearrangements (Yang, 2013).

The presence and order of the exons within the GULO gene/pseudogene in diverse vertebrates is consistent with several independent inactivations and descent from a common ancestor (Yang, 2013). One of them occurred in the primate lineage. All of the primate pseudogenes are missing exons 1 & 2 as well as exons 5, 7, and 10 as shown in the figure below.

The data is consistent with an ancestral pseudogene that was missing exons 1, 2, 5, 7, and 10. Exons 3 & 4 were subsequently lost in a separate event in the gibbon lineage. The orangutan and human pseudogenes are similar with respect to exon loss and the chimpanzee pseudogene is probably the same. (The 5′ region of the GULO pseudogene was not present in the chimp genome sequence.)

Tomkins doesn't discuss the evidence for the common ancestry of the primate pseudogenes and he doesn't try to explain the pattern according to a Young Earth Creationist worldview. Instead, he draws attention to another part of Yang's paper—the part where he documents the rearrangements of the genes surrounding the GULO locus. There's nothing unusual about such rearrangements. They are common between closely related species and even within a species. Over time, blocks of genes are shuffled and re-ordered so that distantly related vertebrates show very little synteny.

Tomkins thinks this is a serious problem for evolution ...
The GULO gene lies within a gene-dense region in all vertebrate genomes studied thus far (Yang 2013). Related to this fact is the evolutionary anomaly that the gene neighborhood surrounding the GULO locus is rearranged across the vertebrate spectrum of life, and the patterns cannot be readily resolved into the standard inferred evolutionary lineages (Yang 2013).
Once again, Tomkins is cherry-picking the data to focus on minor anomalies that don't fit with his strawman version of evolution. Once again, he ignores the much more important data in the same paper that supports an ancient origin of an ancestral GULO pseudogene.

Let me close by mentioning one other "anomaly" that Tomkins raises. He questions whether the functional rat gene is an appropriate standard of comparison. You might be amused by his logic ...
Traditionally, the human GULO pseudogene has been compared to the functional rat GULO gene (Nishikimi, Kawai, and Yagi 1992; Nishikimi et al. 1994; Ohta and Nishikimi 1999). According to the UCSC genome browser ( and the Rat Genome Database (, the rat GULO gene (chr15, region p12) is oriented and transcribed on the minus strand. Interestingly, the human and ape GULO pseudogenes are oriented in the plus strand configuration (chr8, region p21.2 in human). While the rat GULO gene may serve as a general guide to exon presence and absence in degraded GULO genes in other mammals, the rat GULO is clearly in a different chromosomal configuration (compared to humans and apes) and represents a unique design pattern specific to rodents (mouse GULO is on chr15, minus strand).

1. In this case, Young Earth Creationist.

2. For more information on the GULO pseudogene see ...
How do Intelligent Design Creationists deal with pseudogenes and false claims?
Junk & Jonathan: Part 8—Chapter 5
Human GULOP Pseudogene

3. Also spelled Haplorhini.

4. Lachapelle and Drouin include a short 5′ exon (#1) that isn't present in most species. It's likely an artifact. I renumbered the exons according to Yang (2013).

Helliwell, K.E., Wheeler, G.L., and Smith, A.G. (2013) Widespread decay of vitamin-related pathways: coincidence or consequence? TRENDS in Genetics, 29:469-478. [doi: 10.1016/j.tig.2013.03.003]

Lachapelle, M.Y., and Drouin, G. (2011) Inactivation dates of the human and guinea pig vitamin C genes. Genetica, 139:199-207. [doi: 10.1007/s10709-010-9537-x]

Yang, H. (2013) Conserved or lost: molecular evolution of the key gene GULO in vertebrate vitamin C biosynthesis. Biochemical genetics, 51:413-425. [doi: 10.1007/s10528-013-9574-0]

Creationists questioning pseudogenes: the beta-globin pseudogene

Jonathan Kane recently (Oct. 6, 2017) posted an article on The Panda's Thumb where he claimed that Young Earth Creationists often don't get enough credit for raising serious issues about evolution [Five principles for arguing against creationism].

He mentioned some articles about pseudogenes as prime examples. I asked him for references and he responded with two articles by Jeffrey Tomkins that were published on the Answers in Genesis website. The first was on the β-globin pseudogene and the second was on the GULO pseudogene. Both articles claim that these DNA sequences aren't really pseudogenes because they have functions.

I'll deal with the β-globin pseudogene in this post and the GULO pseudogene in a subsequent post.

Here's the article ....
The Human Beta-Globin Pseudogene is Non-Variable and Functional.
Jeffrey P. Tomkins, Institute for Creation research, Dallas, TX, USA
Answers Research Journal

Abstract: One of the iconic (yet enigmatic) arguments for human-ape common ancestry has been the β-globin pseudogene (HBBP1). Evolutionists originally speculated that apparent mutations in HBBP1 were shared mutational mistakes derived from a human-chimpanzee common ancestor. However, others noted that if the gene was indeed non-functional, then it should have mutated markedly in the past 3 to 6 million years of human evolution due to a lack of selective constraint on the region. Recent research confirms that the HBBP1 region of the 6-gene β-globulin cluster is highly non-variable compared to the other β-globin genes based on large-scale DNA diversity assessment within both humans and chimpanzees. Highlighting the lack of HBBP1 sequence variability is genetic data from three different reports that link point mutations in the HBBP1 gene with β-thalassemia disease pathologies. Biochemical evidence for functionality is indicated by multiple categories of functional genomics data showing that the HBBP1 gene is transcriptionally active and a key interactive component of the β-globin gene network. In brief, the HBBP1 gene encodes two consensus regulatory RNAs that are alternatively transcribed and/or post-transcriptionally spliced. This functional complexity produces at least 16 different exon variant transcripts and 42 different intron variant transcripts. Two major regulatory regions in the HBBP1 locus contain active transcription factor binding sites that overlap multiple categorical regions of epigenetic data for functionally active chromatin. The HBBP1 gene also has the most regulatory associations with active and open chromatin within the entire β-globin cluster and its transcripts are expressed in at least 251 different human cell and/or tissue types. Instead of being a useless genomic fossil according to evolutionary predictions, the HBBP1 gene appears to be a highly functional and cleverly integrated feature of the human genome that is intolerant of mutation.
Before addressing the specific criticisms in this article it's important to not lose sight of the bigger issue. Creationists tend to focus on particular examples while ignoring the big picture. In this case, there is abundant evidence of gene duplications in all species and there's abundant evidence that the fate of one duplicated copy of a gene is often to become inactivated rendering it a pseudogene. This has given rise to a robust explanation of multigene families referred to as Birth-and-Death Evolution [The Evolution of Gene Families] [On the evolution of duplicated genes: subfunctionalization vs neofunctionalization]. In order for Young Earth Creationists to mount a serious challenge to evolution they need to provide a better explanation for all this data and they need to provide solid evidence that the Earth is less than 10,000 years old.

There are about 15,000 pseudogenes of various kinds in the human genome. You can't challenge the big picture of pseudogenes and junk DNA by picking out one example and trying to prove it has a function. This will not refute evolution even if it turns out to be true that one particular stretch of DNA looks like a pseudogene but actually has a function. And it certainly won't be evidence of a Young Earth.

Now let's deal with the Tomkins article. Here's a diagram showing the pseudogene in the β-globin gene cluster in humans and chimps.

There's a pseudogene at this locus in most of the great apes, an observation that's consistent with a duplication event tens of millions of years ago followed by the loss of function of one of the copies. The pseudogene became fixed in the ancestral population and was passed down to all modern species. The rate at which most of the pseudogene sequence has accumulated base substitutions is consistent with the rate at which neutral mutations are fixed by random genetic drift. This indicates that most of the sequence is not under negative selection. As far as I know, creationists, especially Young Earth Creationists, haven't offered a reasonable explanation of this observation.

Tomkins' main point is that this stretch of DNA has a function so, presumably, the creator(s) copied this useful part of the genome and plugged it into one of the chromosomes as they were building each of the species. They didn't really care very much about the surrounding DNA so they didn't worry about copying it exactly. As it turns out, they introduced differences in the surrounding DNA so that the sequences of chimps and humans differ by about 2% and chimps and gorillas differ by about 4%. Humans and gorillas also differ by about 4%. The important point is that there are far fewer differences in the exons of the functional genes so they look "conserved" if you adopt an evolutionary perspective.

There's a stretch of DNA near the human β-globin pseudogene that has far fewer changes if you examine the chimp and human genomes. In evolutionary terms, it is "conserved." (It is reusable design if you are a Young Earth Creationist.) Tomkins quotes a paper by Moleirinho et al. (2013) documenting this conservation. The explanation is that the region between the γ-globin genes and the pseudogene is involved in regulating expression of the β-globin genes, probably because it contains a scaffold attachment site and associated sequences that regulate chromatin conformation. This role appears to have arisen shortly before the divergence of chimps and humans.

Here's what the sequence similarities look like on the UCSC Genome Browser. The degree of sequence similarity between the human genome and the genomes of chimps, gorillas, orangutans, and monkeys is shown as a histogram where the height of the bar indicates significant similarity. As you can see, the exons of the functional genes are conserved but the pseudogene sequence is not conserved. This is exactly what you expect if the pseudogene sequence is gradually drifting away from the ancestral gene that was functional right after the gene duplication event.

Much of the sequence surrounding the γ-globin genes is under selection, including a stretch that extends toward the pseudogene. This is the regulatory region that controls expression of the entire locus.

Thus, the evolutionary explanation is that a gene duplication occurred and one of the copies became a pseudogene. Subsequently, a region in the vicinity of the pseudogene acquired a new function involved in chromatin looping and regulation. That's why a large stretch of DNA near the γ-globin gene is conserved. I don't know how Tomkins explains the data other than just saying that the presence of function casts doubt on evolution.

Tomkins' other evidence for function relies on the ENCODE data. He notes that the pseudogene region is transcribed as part of the pervasive transcription noted by ENCODE. It also contains numerous transcription factor binding sites, DNase I sensitive regions, and histone markers. Some of this might be remnants of the original gene but most are just spurious events that occur throughout the genome in junk DNA. Sandwalk readers will be familiar with the idea that ENCODE data does not prove function.


John Harshman sent me his comparison of the β-globin region on the UCSC Genome Browser.

As you can see, the pseudogene region seems to be only slightly less conserved than the functional genes in this analysis. This isn't unexpected. The functional genes will drift apart over 100 million years by accumulating neutral mutations in the coding regions. The pseudogene arose about 65 million years ago in primate ancestors so it will have accumulated mutations at a faster rate since that time but not before. The difference in the primate lineage should amount to about 20% in that time.

When you compare the "conservation" of the various loci using an outgroup to the primate lineage, the pseudogene will only be about 20% less conserved than the functional genes. That's pretty much what you see in the figure.

When you do a binary comparison (e.g. chimp vs human), I'm assuming the algorithm subtracts the neutral mutation rate in order to calculate whether a sequence is conserved or not. Thus, in my figure, the pseudogene region only shows up as a small blip. This may be statistical error or a small bit of conserved sequence within the second exon.

That's how I interpret the results. Any help will be appreciated. If you know how to get % sequence similarity comparisons on this browser then please post that information in the comments or email me.

Moleirinho, A., Seixas, S., Lopes, A.M., Bento, C., Prata, M.J., and Amorim, A. (2013) Evolutionary constraints in the β-globin cluster: the signature of purifying selection at the δ-globin (HBD) locus and its role in developmental gene regulation. Genome Biology and Evolution 5:559-571. [doi: 10.1093/gbe/evt029]

Historical evolution is determined by chance events

Modern evolutionary theory is based on the idea that alleles become fixed in a population over time. They can be fixed by natural selection if they confer selective advantage or they can be fixed by random genetic drift if they are nearly neutral or slightly deleterious [Learning about modern evolutionary theory: the drift-barrier hypothesis]. Alleles arise by mutation and the path that a population follows over time depends on the timing of mutations [Mutation-Driven Evolution]. That's largely a chance event.

As a result, the history of evolution is much more unpredictable than most people realize, especially when coupled with environmental effects. I call this "Evolution by Accident." It's similar to Stephen Jay Gould's idea of contingency.

The idea has been around for a very long time but recently it has become possible to test the idea at the molecular level by looking at actual mutations occurring in evolving populations [Strolling around slopes and valleys in the adaptive landscape]. It's also possible to reverse engineer an ancient gene and then test to see which of the historical mutations were important. This is what Joseph Thornton's group did with vertebrate glucocorticoid receptor (GR) genes. They showed that historical contingency and chance events dominated the evolutionary pathway leading to a cortisol-specific version of these receptor genes (Harms and Thornton, 2014). [see Historical contingency and the evolution of the glucocorticoid receptor].

Now Thornton's group has provided further evidence of historical contingency by looking at the evolution of steroid hormone receptor genes (Starr et al., 2017). Steroid hormone receptor proteins normally don't bind specifically to DNA but in the presence of hormone they form a hormone-protein complex that binds to specific sequences near the promoters of some genes. This promotes transcription of those genes. The receptor proteins are transcription activators in the presence of hormone.

There are two related steroid hormone receptor genes in vertebrates. One of them responds specifically to corticosteroids, androgens, and progesterones by binding to the steroid response element (SRE) with the sequence AGAACA. The other responds to estrogen by binding the estrogen response element (ERE) with the sequence AGGTCA. The genes apparently arose by gene duplication from an ancestral gene. Thornton's group reconstructed the ancestral gene (AncGR1) and showed that it binds to ERE.

Following an ancient gene duplication, one of the duplicated genes shifted function to become responsive to corticosteroids by binding to a different sequence (SRE). The shift in binding specificity is due to three substitutions in the DNA-binding site, or recognition helix (RH). However when these three mutations are added to the hypothetical ancestral protein, they are not sufficient to convert the receptor into a fully functional receptor that recognizes corticosteroids and binds tightly to SRE. Eleven different amino acid substitutions were also required during the evolution of the new receptor protein. These eleven substitutions were "permissive" in the sense they prepared the way for the shift in hormone recognition and DNA binding.

Thus, the evolution of the new receptor gene involved 11 permissive mutations (11P) followed by 3 RH mutations. We want to know how many different pathways could have produced the same result. Is the gene we see today the only possible outcome of millions of years of evolution or is it only one of many possibilities in sequence space?

Starr et al. (2017) began by constructing an ancestral gene containing the eleven permissive mutations (AncGR1 + 11P). They then asked how many pathways could lead to a change in sequence specificity. They answered the question by making mutations in four codons of the recognition helix: the three that were actually observed and one other that was bound to be important. They substituted all 20 amino acids at each of the four sites creating 160,000 combinations. They found 828 new variants that were just as good or better than the current mammalian gene. There were another 500 variants that were functional but not as efficient as the current gene.
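The arithmetic is easy to check. Here's a short sketch that enumerates every combination of 20 amino acids at four sites and works out what fraction of that sequence space the functional variants occupy. The counts of 828 and 500 are the ones quoted above; the variable names are mine.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
SITES = 4  # number of recognition helix positions that were varied

# Count every possible four-residue combination by enumerating them.
variant_count = sum(1 for _ in product(AMINO_ACIDS, repeat=SITES))

# 828 variants as good as or better than the current gene, plus about 500 weaker ones.
functional_fraction = (828 + 500) / variant_count

print(variant_count)
print(f"{functional_fraction:.2%}")
```

Fewer than one percent of the combinations are functional, but that still leaves more than a thousand different solutions.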

What this means is that there are more than one thousand different ways of evolving a new receptor that recognizes the sequence AGAACA instead of AGGTCA. Almost all of the functional variants are accessible by gradual step-wise mutation of the three or four codons without going through a nonfunctional intermediate. The authors conclude that the historical outcome is not unique; it's only one of many possibilities. Some of these possibilities involved shorter paths than the historical outcome.
Taken together, these data indicate that the historical trajectory was not the only path, or even the shortest, from the ancestral RH to a derived protein that is SRE-specific.
This is not surprising. There's tons of data pointing to the same conclusion. In addition, evolutionary theory has always assumed that chance and contingency play an important role in the history of life. What's important about this paper is that the authors have quantified functional sequence space by testing all possible outcomes.
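The idea of step-wise accessibility can be sketched as a search through sequence space. In this toy example a variant counts as accessible if it can be reached from the starting sequence by changing one site at a time while every intermediate stays in the functional set. The four-letter strings are arbitrary placeholders, not real recognition helix sequences, and treating any single amino acid change as one step is a simplification that may not match how the authors defined a mutational step.

```python
from collections import deque


def differ_at_one_site(a, b):
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def reachable_variants(start, functional):
    """Breadth-first search over functional variants linked by single-site changes."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for candidate in functional:
            if candidate not in seen and differ_at_one_site(current, candidate):
                seen.add(candidate)
                queue.append(candidate)
    return seen


# Hypothetical functional set: three variants form a chain from the start,
# while "WWWW" has no functional single-step neighbor and stays isolated.
functional = {"AAAA", "AAAC", "AACC", "ACCC", "WWWW"}
reached = reachable_variants("AAAA", functional)
print(sorted(reached))
```

In this toy set the isolated variant is functional but can't be reached without passing through a nonfunctional intermediate. The result described above is that very few of the real functional variants are isolated like that.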

The pathway to SRE binding is enhanced by the eleven permissive mutations that preceded the change in binding. There are some pathways to SRE binding that don't require those permissive mutations but most do. The 11P mutations are mostly neutral and they presumably arose by chance during the evolution of these receptor genes. That means there are two different roles for chance and contingency in the evolution of corticosteroid-responsive receptors. Here's how the authors express it ...
Our results shed light on the roles of determinism and chance in protein evolution. The primary deterministic force is natural selection, which drives the evolution of forms that optimize fitness. Chance appears in two non-exclusive ways: as historical contingency, when the accessibility of some outcome depends on prior events that cannot be driven by selection for that outcome; and as stochasticity, when there are paths to numerous possible genotypes of similar function, and which one is realized is random.
Keep in mind that we are dealing with the evolution of a corticosteroid-responsive receptor. There's no particular reason why this particular receptor evolved as opposed to one that responded to other chemicals in the body and there's no particular reason why the new receptor had to bind to AGAACA as opposed to some other sequence variant. Therefore, the possible pathways to evolution of a new functional gene are many times greater than this result indicates.

Harms, M.J., and Thornton, J.W. (2014) Historical contingency and its biophysical basis in glucocorticoid receptor evolution. Nature, 512:203. [doi: 10.1038/nature13410]

Starr, T.N., Picton, L.K., and Thornton, J.W. (2017) Alternative evolutionary histories in the sequence space of an ancient protein. Nature, 549:409-413. [doi: 10.1038/nature23902]