Are splice variants functional or noise?

This is a post about alternative splicing. I've avoided using that term in the title because it's very misleading. Alternative splicing produces a number of different products (RNA or protein) from a single intron-containing gene. The phenomenon has been known for 35 years and there are quite a few very well-studied examples, including several where all of the splice regulatory factors have been characterized.

The number of known examples is quite small in any given species. In contrast, the number of different splice variants is enormous. Most human genes, for example, are associated with a dozen or so different variants that have been detected over the years. Almost all of these splice variants have been rejected by genome annotators because they are very rare, never leave the nucleus, and are never present in sufficient quantities to be functional. They are undoubtedly junk RNA produced by the sloppy spliceosome. This kind of noise should not be called alternative splicing because that term should be restricted to real examples that produce functional variants by some sort of regulatory mechanism.

This seems like common sense to me but, unfortunately, most scientists disagree. They continue to refer to any example of splice variants as alternative splicing even though they might be just splicing errors. In fact, most of these scientists don't even consider the possibility of splicing errors. See the following posts for a more thorough discussion of this problem.

Debating alternative splicing (part I)
Debating alternative splicing (part II)
Debating alternative splicing (Part III)
Debating alternative splicing (Part IV)

A recent paper by John Mattick and his collaborators highlights the problem (Deveson et al., 2017). Recall that Mattick is a prominent opponent of junk DNA. He thinks that most of the genome is devoted to producing regulatory RNAs. His "proof" is pervasive transcription. He claims there are thousands and thousands of long noncoding RNAs that have a function [John Mattick still claims that most lncRNAs are functional].

His most recent paper employs the latest technology for detecting RNAs in a cell. The authors highlight the fact that they can detect very low abundance RNAs. They apply the technique to map all the RNAs complementary to the DNA on human chromosome 21. They chose three tissues: testis, brain, and kidney. Two of these tissues are well-known examples of noisy transcription.

The results are not unexpected. They detected an enormous number of different transcripts covering most of the non-repetitive DNA in chromosome 21. Each protein-coding gene matched to dozens of different splice variants in addition to the standard mRNA. Although the authors make passing reference to the controversy over splicing, it's clear that they treat all of these mRNA variants as examples of true alternative splicing. But that's not the main point of their paper. The main point is that the rest of the chromosome specifies a large number of noncoding RNAs and those RNAs exhibit an enormous diversity of splice variants. The result is nicely captured in their summary image (right).

The old RNA-Seq view is shown in the upper-right part of the image. A typical protein-coding gene produces a number of splice variants that I assume are examples of splicing errors. Mattick and his colleagues assume they are due to alternative splicing. The noncoding part of the genome is complementary to another set of transcripts with a limited set of splice variants. Mattick assumes these regions are genes and the RNAs are functional, although he has no proof of that. I assume that most of these RNAs are spurious transcripts of junk DNA. This should be the default assumption.

The new view is derived from their more exhaustive analysis of very rare transcripts. There are more splice variants from protein-coding genes but the increase is not enormous. In contrast, there are many more variant RNAs from the rest of the genome and this includes an enormous diversity of different exons. The title of the paper says it all: Universal alternative splicing of noncoding exons. Here are the main conclusions of the paper ...
We propose that noncoding exons are functionally modular, with alternative splicing generating an enormous repertoire of potentially regulatory RNAs and a rich transcriptional reservoir for gene evolution. (abstract)

One can envision a scenario where individual noncoding exons interact independently with other biomolecules (proteins, RNAs and/or DNA-motifs), organizing these around the scaffold of a noncoding transcript. In this way, alternative isoforms could assemble different collections of binding partners to dynamically regulate cellular processes. (discussion)
Yes, it's true that one could envisage such a scenario. One can imagine many things, but the real question is not how potent your imagination is but whether it's realistic.

Scenarios should be based on facts and not on wishful thinking. In this case there's a lot of evidence that most of our genome is junk. If you are going to propose that most of it contains genes for regulatory RNAs then you have an obligation to refute or discredit the evidence for junk. This paper doesn't do that.

Similarly, there are many good reasons to suspect that splice variants are mistakes in splicing. The variants are not conserved, most are present at less than one copy per cell, splicing errors are known to occur at relatively high frequency, and very few have been shown to have a function. The default assumption must be that they are junk RNA unless proven otherwise.

Mattick and his colleagues dismiss some of these objections using arguments that make no sense. The problem with this paper is that it is promoting an extraordinary claim without any serious evidence of function, let alone extraordinary evidence. I don't understand how it passed peer review. The data may be fine but the interpretation and the conclusions are not.

I think the tide is turning against Mattick and his supporters but perhaps that's just wishful thinking on my part. Take a look at the RNA variants in the lower right-hand corner of the figure. How many of you believe they represent exquisite fine-tuning of a regulatory RNA? How many of you think they are mostly transcriptional and splicing errors?

Deveson, I.W., Brunck, M.E., Blackburn, J., Tseng, E., Hon, T., Clark, T.A., Clark, M.B., Crawford, J., Dinger, M.E., Nielsen, L.K., Mattick, J.S., and Mercer, T.R. (2017) Universal alternative splicing of noncoding exons. Cell Systems, 6:(1-11). [doi: 10.1016/j.cels.2017.12.005]

The Salzburg sixty discuss a new paradigm in genetic variation

Sixty evolutionary biologists are going to meet next July in Salzburg (Austria) to discuss "a new paradigmatic understanding of genetic novelty" [Evolution – Genetic Novelty/Genomic Variations by RNA Networks and Viruses]. You probably didn't know that a new paradigm is necessary. That's because you didn't know that the old paradigm of random mutations can't explain genetic diversity. (Not!) Here's how the symposium organizers explain it on their website ...

For more than half a century it has been accepted that new genetic information is mostly derived from random 'error-based' events. Now it is recognized that errors cannot explain genetic novelty and complexity.

Empirical evidence establishes the crucial role of non-random genetic content editors such as viruses and RNA-networks to create genetic novelty, complex regulatory control, inheritance vectors, genetic identity, immunity, new sequence space, evolution of complex organisms and evolutionary transitions....

This new empirically based perspective on the evolution of genetic novelty will have more explanatory power in the future than the "error-replication" narrative of the last century.
Wow! Who knew?

The lead organizer is Günther Witzany, a philosopher of science and a prominent member of The Third Way [The Third Way: Günther Witzany]. We've encountered him before on Sandwalk: Here's why you can ignore Günther Witzany. Just about anyone can misunderstand molecular biology but it takes a philosopher of science to really screw it up. Witzany says ...
The older concepts we have now for a half century cannot sufficiently explain the complex tendency of the genetic code. They can't explain the functions of mobile genetic elements and the endogenous retroviruses and non-coding RNAs. Also, the central dogma of molecular biology has been falsified -- that is, the way is always from DNA to RNA to proteins to anything else, or the other "dogmas," e.g., replication errors drive evolutionary genetic variation, that one gene codes for one protein and that non-coding DNA is junk. All these concepts that dominated science for half a century are falsified now. ...
Here's a summary of where my views differ ...
  1. The fundamental concepts in evolution and molecular biology were worked out in the middle of the last century and they have been steadily improved and modified since then. They are fully capable of explaining mobile genetic elements, endogenous retroviruses, and non-coding RNAs. Read any textbook.
  2. The Central Dogma of Molecular Biology says that once information is transferred to protein it can't go back to nucleic acids [Central Dogma of Molecular Biology]. It's blatantly obvious that Günther Witzany doesn't understand the Central Dogma. It's obvious that he hasn't read Crick's papers.
  3. The idea that replication errors create genetic variation has not been falsified. It is by far the most important source of mutation.
  4. The idea that one gene codes for one protein is a false strawman version of our current understanding of a gene [What Is a Gene?]. No knowledgeable scientist ever thought that all genes produced proteins and no knowledgeable scientist since 1980 was unaware of genes encoding multiple proteins. Read a textbook.
  5. No knowledgeable scientist ever said that all non-coding DNA is junk. They do, however, say that most of the DNA in the human genome is junk. That's a concept that was formed in the middle of the last century and has become more and more true as evidence accumulates in the 21st century. It has not been falsified as Günther Witzany claims. He has not been keeping up with the scientific literature.
  6. Nobody is questioning the fact that transposons and viruses can cause mutations and genomic rearrangements. The only serious debate is over the frequency of such events. By looking at the amount of variation in individual humans, scientists have determined that 99.9% of all variation is in the form of single nucleotide changes (SNPs) or small (1-10 bp) insertions/deletions. Larger differences due to transposons and viral insertions/deletions account for only 0.1% of the total [Genetic variation in human populations]. All sorts of mutations will contribute to evolution in the long run but it's absurd to think there's any "paradigm shift" in the making. Instead, this is a classic example of a paradigm shaft (a term coined by Diogenes on an earlier Sandwalk post).

How many lncRNAs are functional?

There's solid evidence that 90% of your genome is junk. Most of it is transcribed at some time but the transcripts are transient and usually confined to the nucleus. They are junk RNA [Functional RNAs?]. This is the view held by many experts but you wouldn't know that from reading the scientific literature and the popular press. The opposition to junk DNA gets much more attention in both venues.

There are prominent voices expressing the view that most of the genome is devoted to producing functional RNAs required for regulating gene expression [John Mattick still claims that most lncRNAs are functional]. Most of these RNAs are long noncoding RNAs known as lncRNAs. Although most of them fail all reasonable criteria for function there are still those who maintain that tens of thousands of them are functional [How many lncRNAs are functional: can sequence comparisons tell us the answer?].

There are very few serious reviews that address the controversy over function (but see Palazzo and Lee, 2015 ... the figure is from their paper). That's why I want to highlight a review that's just been published in Cell. It's a review that recognizes the controversy over function and points to the possibility that most putative lncRNAs may be junk (Kopp and Mendell, 2018). I'm going to quote directly from the introduction and the conclusion to show you how scientific reviews are supposed to be written.
There is a broad range of estimates for the number of lncRNA genes in mammals, ranging from less than 20,000 to over 100,000 in humans. Nevertheless, the function and biological relevance of the vast majority of lncRNAs remain enigmatic. Given that transcriptional regulatory elements, such as enhancers and promoters, are now known to initiate transcription bi-directionally, it is likely that many lncRNAs—if not the majority—actually represent RNAs that initiate at enhancers or promoters but do not perform sequence-specific functions. This conclusion is further suggested by the fact that many lncRNAs are localized to the nucleus with low expression levels and little primary sequence conservation. Recent reports of local gene regulation by lncRNA loci reinforce this notion and suggest that in many cases, the act of transcription or DNA elements within the lncRNA locus are more likely to be the source of regulatory activity than the actual lncRNA itself. Given these observations, it is clear that the mere existence or production of an RNA does not automatically imply its functionality. Indeed, we must assume until proven otherwise that of the tens of thousands of annotated lncRNAs, those that function independently of the DNA sequence from which they are transcribed represent a small minority. Nevertheless, even if a small percentage of lncRNAs are functional, they would still constitute a major gene class with hundreds or possibly thousands of members.
The best available data shows that less than 500 putative lncRNAs have a well-defined function. When I'm calculating the amount of functional DNA in the human genome I usually assume 5,000 genes for noncoding RNAs—most of them are not lncRNAs. I still think that's a good estimate.

The act of transcription around promoter regions may play a role in regulation. In such cases, the sequence of the transcript may be irrelevant but the transcribed region of the genome has a function. There aren't very many proven examples of this type of function. In most cases it looks like the transcripts are just due to sloppy initiation. Kopp and Mendell make an important point in the introduction when they say that the mere existence of a transcript does not mean it has a function. This point is usually ignored in the scientific literature.

The authors reinforce this view in their conclusions. They emphasize a point that most scientists find awkward; namely, that the default assumption must be lack of function (junk RNA) and the burden of proof is on those who propose that most lncRNAs have a function. When we detect a transcript, the most we can say for certain is that there's a transcription initiation site nearby. It may or may not be important.
Over the last decade, the study of lncRNAs has stimulated vigorous debate over the question of whether noncoding RNAs represent “transcriptional noise” or truly functional biomolecules. Clearly, there is no unifying answer—meaningful understanding of lncRNA function (or lack thereof) can only be achieved from detailed study on a case-by-case basis. Importantly, our evolving understanding of the prevalence of genomic elements that produce noncoding transcripts, such as enhancers, has mandated that we approach the experimental evaluation of a lncRNA locus with an agnostic view regarding whether the produced RNA is functional. As Occam’s razor dictates, the simplest hypothesis, in this case that the production of a lncRNA most likely marks the presence of a regulatory DNA element, is often the correct one.
I'm pleased to see that more and more scientists are recognizing the very real controversies over junk DNA and the role of pervasive transcription. Unfortunately, it still takes a bit of courage to stand up to the dominant (but incorrect) paradigm promoted by the ENCODE publicity campaign over the past decade.

Kopp, F., and Mendell, J.T. (2018) Functional Classification and Experimental Dissection of Long Noncoding RNAs. Cell, 172:393-407. [doi: 10.1016/j.cell.2018.01.011]

Palazzo, A.F., and Lee, E.S. (2015) Non-coding RNA: what is functional and what is junk? Frontiers in Genetics, 6. [doi: 10.3389/fgene.2015.00002]

ENCODE’s false claims about the number of regulatory sites per gene

Some beating of dead horses may be ethical, where here and there they display unexpected twitches that look like life.

Zuckerkandl and Pauling (1965)

I realize that most of you are tired of seeing criticisms of ENCODE but it's important to realize that most scientists fell hook-line-and-sinker for the ENCODE publicity campaign and they still don't know that most of the claims were ridiculous.

I was reminded of this when I re-read Brendan Maher's summary of the ENCODE results that were published in Nature on Sept. 6, 2012 (Maher, 2012). Maher's article appeared in the front section of the ENCODE issue.1 With respect to regulatory sequences he said ...
The consortium has assigned some sort of function to roughly 80% of the genome, including more than 70,000 ‘promoter’ regions — the sites, just upstream of genes, where proteins bind to control gene expression — and nearly 400,000 ‘enhancer’ regions that regulate expression of distant genes ... But the job is far from done, says [Ewan] Birney, a computational biologist at the European Molecular Biology Laboratory’s European Bioinformatics Institute in Hinxton, UK, who coordinated the data analysis for ENCODE. He says that some of the mapping efforts are about halfway to completion, and that deeper characterization of everything the genome is doing is probably only 10% finished.
We knew back in 2012 that there were only about 25,000 genes so why are there 70,000 promoters? And if this is only 10% of the total then how can there be 700,000 promoters?

Similarly, if there really are 400,000 enhancers (whatever they are) then that's 16 per gene. Throw in the unknown 90% that have yet to be discovered and you have 160 per gene. Really?
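The per-gene arithmetic behind those questions is easy to check. Here's a minimal sketch: the counts and the gene number are the ones quoted above, reading Birney's "10% finished" as a simple tenfold multiplier is the same extrapolation I made in the text, and the function and variable names are just mine for illustration.

```python
# Counts quoted from Maher (2012) and the gene number used in the post.
GENES = 25_000
PROMOTERS = 70_000
ENHANCERS = 400_000
FRACTION_FINISHED = 0.10  # "only 10% finished", treated as a plain multiplier


def per_gene(count, genes=GENES):
    """Average number of elements per gene."""
    return count / genes


def extrapolate(count, fraction_finished=FRACTION_FINISHED):
    """Projected total if `count` is only `fraction_finished` of the whole."""
    return count / fraction_finished


promoters_per_gene = per_gene(PROMOTERS)
enhancers_per_gene = per_gene(ENHANCERS)
projected_promoters = extrapolate(PROMOTERS)
projected_enhancers_per_gene = per_gene(extrapolate(ENHANCERS))

print(f"promoters per gene now:       {promoters_per_gene:.1f}")
print(f"enhancers per gene now:       {enhancers_per_gene:.0f}")
print(f"projected promoters:          {projected_promoters:,.0f}")
print(f"projected enhancers per gene: {projected_enhancers_per_gene:.0f}")
```

Changing GENES or FRACTION_FINISHED shows how sensitive the per-gene numbers are to those two assumptions.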
The main ENCODE claim is that a substantial percentage of the genome is devoted to regulation ...
… even using the most conservative estimates, the fraction of bases likely to be involved in direct regulation, even though incomplete, is significantly higher than that ascribed to protein-coding exons (1.2%), raising the possibility that more information in the human genome may be important for gene regulation than for biochemical function. (ENCODE, 2012 p. 71)
Their value for coding region is too high but let's parse what they mean based on the idea that regulatory sequences account for more than 1.2% of the genome. That works out to 38 Mb of DNA. If we take a generous estimate of 10 bp per regulatory site then there must be 3.8 million sites or 152 sites per gene. That makes no sense. It makes even less sense if Birney is right and this is only 10% of all functional sites.
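The same back-of-the-envelope calculation can be written out as a rough sketch. The haploid genome size of 3.2 Gb is an assumption on my part (it's the value that makes 1.2% come out at about 38 Mb); the 10 bp per site and 25,000 genes are from the text. Without rounding to 38 Mb the answer is about 154 sites per gene rather than 152, which doesn't change the argument.

```python
GENOME_BP = 3.2e9            # assumed haploid genome size (1.2% of this is ~38 Mb)
REGULATORY_FRACTION = 0.012  # "more than 1.2%" taken as a lower bound
BP_PER_SITE = 10             # generous size for one regulatory site
GENES = 25_000

# Total DNA implied to be regulatory, then the number of sites it could hold.
regulatory_bp = GENOME_BP * REGULATORY_FRACTION
sites = regulatory_bp / BP_PER_SITE
sites_per_gene = sites / GENES

# If this is only 10% of all functional sites, multiply by ten.
sites_per_gene_if_10_percent = sites_per_gene * 10

print(f"regulatory DNA:  {regulatory_bp / 1e6:.1f} Mb")
print(f"implied sites:   {sites / 1e6:.2f} million")
print(f"sites per gene:  {sites_per_gene:.0f}")
print(f"... if only 10%: {sites_per_gene_if_10_percent:.0f}")
```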

ENCODE never seriously considered the possibility that most of their sites have no function. We now know this was a serious error that tainted their conclusions. It's very common for papers to be retracted when the authors make mistakes that invalidate their conclusions. I'm sure we aren't going to see any retractions but it would be really nice if Nature (and Science) would at least publish an article admitting that they were duped by Ewan Birney and the other ENCODE researchers.

1. Brendan Maher published an online news article on the Nature website on Sept. 6, 2012. He acknowledges that many of us were highly critical of the ENCODE hype but he still defends the idea that much of the genome is functional (Fighting about ENCODE and junk). In that post, he claims that at least 20% of the genome could be devoted to regulation.

ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57-74. [doi: 10.1038/nature11247]

Maher, B. (2012) The Human Encyclopaedia. Nature, 489:46-48. [PDF]

Zuckerkandl, E. and Pauling, L. (1965) in EVOLVING GENES AND PROTEINS, V. Bryson and H.J. Vogel eds. Academic Press, New York NY USA

What’s in Your Genome?: Chapter 5: Regulation and Control of Gene Expression

I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! The first chapter is an introduction to genomes and DNA [What's in Your Genome? Chapter 1: Introducing Genomes ]. Chapter 2 is an overview of the human genome. It's a summary of known functional sequences and known junk DNA [What's in Your Genome? Chapter 2: The Big Picture]. Chapter 3 defines "genes" and describes protein-coding genes and alternative splicing [What's in Your Genome? Chapter 3: What Is a Gene?]. Chapter 4 is all about pervasive transcription and genes for functional noncoding RNAs [What's in Your Genome? Chapter 4: Pervasive Transcription].

Chapter 5 is Regulation and Control of Gene Expression.
Chapter 5: Regulation and Control of Gene Expression

What do we know about regulatory sequences?
The fundamental principles of regulation were worked out in the 1960s and 1970s by studying bacteria and bacteriophage. The initiation of transcription is controlled by activators and repressors that bind to DNA near the 5′ end of a gene. These transcription factors recognize relatively short sequences of DNA (6-10 bp) and their interactions have been well-characterized. Transcriptional regulation in eukaryotes is more complicated for two reasons. First, there are usually more transcription factors and more binding sites per gene. Second, access to binding sites depends on the state of chromatin. Nucleosomes forming higher-order structures create a "closed" domain where DNA binding sites are not accessible. In "open" domains the DNA is more accessible and transcription factors can bind. The transition between open and closed domains is an important addition to regulating gene expression in eukaryotes.
The limitations of genomics
By their very nature, genomics studies look at the big picture. Such studies can tell us a lot about how many transcription factors bind to DNA and how much of the genome is transcribed. They cannot tell you whether the data actually reflects function. For that, you have to take a more reductionist approach and dissect the roles of individual factors on individual genes. But working on single genes can be misleading ... you may miss the forest for the trees. Genomic studies have the opposite problem: they may see a forest where there are no trees.
Regulation and evolution
Much of what we see in evolution, especially when it comes to phenotypic differences between species, is due to differences in the regulation of shared genes. The idea dates back to the 1930s and the mechanisms were worked out mostly in the 1980s. It's the reason why all complex animals should have roughly the same number of genes—a prediction that was confirmed by sequencing the human genome. This is the field known as evo-devo or evolutionary developmental biology.
           Box 5-1: Can complex regulation evolve by accident?
Slightly harmful mutations can become fixed in a small population. This may cause a gene to be transcribed less frequently. Subsequent mutations that restore transcription may involve the binding of an additional factor to enhance transcription initiation. The result is more complex regulation that wasn't directly selected.
Open and closed chromatin domains
Gene expression in eukaryotes is regulated, in part, by changing the structure of chromatin. Genes in domains where nucleosomes are densely packed into compact structures are essentially invisible. Genes in more open domains are easily transcribed. In some species, the shift between open and closed domains is associated with methylation of DNA and modifications of histones but it's not clear whether these associations cause the shift or are merely a consequence of the shift.
           Box 5-2: X-chromosome inactivation
In females, one of the X-chromosomes is preferentially converted to a heterochromatic state where most of the genes are in closed domains. Consequently, many of the genes on the X chromosome are only expressed from one copy as is the case in males. The partial inactivation of an X-chromosome is mediated by a small regulatory RNA molecule and this inactivated state is passed on to all subsequent descendants of the original cell.
           Box 5-3: Regulating gene expression by
           rearranging the genome

In several cases, the regulation of gene expression is controlled by rearranging the genome to bring a gene under the control of a new promoter region. Such rearrangements also explain some developmental anomalies such as the growth of legs instead of antennae on the heads of fruit flies. They also account for many cancers.
ENCODE does it again
Genomic studies carried out by the ENCODE Consortium reported that a large percentage of the human genome is devoted to regulation. What the studies actually showed is that there are a large number of binding sites for transcription factors. ENCODE did not present good evidence that these sites were functional.
Does regulation explain junk?
The presence of huge numbers of spurious DNA binding sites is perfectly consistent with the view that 90% of our genome is junk. The idea that a large percentage of our genome is devoted to transcriptional regulation is inconsistent with everything we know from the studies of individual genes.
           Box 5-4: A thought experiment
Ford Doolittle asks us to imagine the following thought experiment. Take the fugu genome, which is very much smaller than the human genome, and the lungfish genome, which is very much larger, and subject them to the same ENCODE analysis that was performed on the human genome. All three genomes have approximately the same number of genes and most of those genes are homologous. Will the number of transcription factor binding sites be similar in all three species or will the number correlate with the size of the genomes and the amount of junk DNA?
Small RNAs—a revolutionary discovery?
Does the human genome contain hundreds of thousands of genes for small non-coding RNAs that are required for the complex regulation of the protein-coding genes?
A “theory” that just won’t die
"... we have refuted the specific claims that most of the observed transcription across the human genome is random and put forward the case over many years that the appearance of a vast layer of RNA-based epigenetic regulation was a necessary prerequisite to the emergence of developmentally and cognitively advanced organisms." (Mattick and Dinger, 2013)
What the heck is epigenetics?
Epigenetics is a confusing term. It refers loosely to the regulation of gene expression by factors other than differences in the DNA. It's generally assumed to cover things like methylation of DNA and modification of histones. Both of these effects can be passed on from one cell to the next following mitosis. That fact has been known for decades. It is not controversial. The controversy is about whether the heritability of epigenetic features plays a significant role in evolution.
           Box 5-5: The Weismann barrier
The Weismann barrier refers to the separation between somatic cells and the germ line in complex multicellular organisms. The "barrier" is the idea that changes (e.g. methylation, histone modification) that occur in somatic cells can be passed on to other somatic cells but in order to affect evolution those changes have to be transferred to the germ line. That's unlikely. It means that Lamarckian evolution is highly improbable in such species.
How should science journalists cover this story?
The question is whether a large part of the human genome is devoted to regulation, thus accounting for an unexpectedly large genome. It's an explanation that attempts to refute the evidence for junk DNA. The issue is complex and very few science journalists are sufficiently informed to do it justice. They should, however, be making more of an effort to inform themselves about the controversial nature of the claims made by some scientists and they should be telling their readers that the issue has not yet been resolved.

Calculating time of divergence using genome sequences and mutation rates (humans vs other apes)

There are several ways to report a mutation rate. You can state it as the number of mutations per base pair per year, in which case a typical mutation rate for humans is about 5 × 10⁻¹⁰. Or you can express it as the number of mutations per base pair per generation (~1.5 × 10⁻⁸).

You can use the number of mutations per generation or per year if you are only discussing one species. In humans, for example, you can describe the mutation rate as 100 mutations per generation and just assume that everyone knows the number of base pairs (6.4 × 10⁹).

The intrinsic mutation rate depends on the error rate of DNA replication. We don't know the exact value of this error rate but it's pretty close to 10⁻¹⁰ per base pair when you take repair into account [Estimating the Human Mutation Rate: Biochemical Method]. For single-cell species you simply multiply this number by the number of base pairs in the genome to get a good estimate of the mutation rate per generation.

The calculation for multicellular species is much more complicated because you have to know the number of cell divisions between zygote and mature germ cells. In some cases it's impossible to know this number (e.g., flowering plants, yeast). In other cases we have a pretty good estimate: for example, in humans there are about 400 cell divisions between zygote and mature sperm and about 30 cell divisions between zygote and mature egg cells. The number of cell divisions depends on the age of the parent, especially in males [Parental age and the human mutation rate]. This effect is significant: older parents pass on twice as many mutations as the youngest parents.
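Here's one way to turn those numbers into a crude per-generation estimate. It's only a sketch under loud assumptions: every cell division is given the same error rate of 10⁻¹⁰ per base pair, each gamete is assumed to carry one haploid genome of 3.2 × 10⁹ bp (my figure, half the diploid value), and the division counts are the rough averages quoted above, which ignores the parental age effect entirely.

```python
ERROR_RATE = 1e-10         # replication errors per bp per division, after repair
HAPLOID_GENOME_BP = 3.2e9  # assumption: one haploid genome per gamete
DIVISIONS_SPERM = 400      # zygote -> mature sperm (rough average)
DIVISIONS_EGG = 30         # zygote -> mature egg


def mutations_per_division(error_rate=ERROR_RATE, genome_bp=HAPLOID_GENOME_BP):
    """Expected new mutations each time one haploid genome is copied."""
    return error_rate * genome_bp


def germline_mutations(divisions):
    """Expected mutations accumulated over a chain of cell divisions."""
    return divisions * mutations_per_division()


from_father = germline_mutations(DIVISIONS_SPERM)
from_mother = germline_mutations(DIVISIONS_EGG)
total_per_generation = from_father + from_mother

print(f"per division: {mutations_per_division():.2f}")
print(f"from father:  {from_father:.1f}")
print(f"from mother:  {from_mother:.1f}")
print(f"per child:    {total_per_generation:.1f}")
```

For a single-cell species the estimate is just mutations_per_division() with that species' genome size.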

The parental age effect is comparable to the extremes in estimations of the human mutation rate based on different ways of measuring it [Human mutation rates - what's the right number?] [Human mutation rates ]. Those values range from about 60 mutations per generation to about 160 mutations per generation.

Thus, in the case of humans, we're dealing with estimates that differ by a factor of two depending on method and parental age.



Let's assume that each child is born with 100 new mutations. This seems like a reasonable number. It's on the high end of direct counts by sequencing parents and siblings but there are reasons to believe these counts are underestimated (Scally, 2016). On the other hand, this value (100 mutations) is on the low end of the estimates using the biochemical method and the phylogenetic method.

Most of these mutations occur in the father but some were contributed by the mother. Since the child is diploid, we calculate the mutation rate per bp as: 100 ÷ (6.4 × 10⁹) = 1.56 × 10⁻⁸ per base pair per generation. Assuming an average generation time of 30 years, this gives 1.56 × 10⁻⁸ ÷ 30 = 5.2 × 10⁻¹⁰ mutations per bp per year. That's the value given above (rounded to 5 × 10⁻¹⁰). Scally (2016) uses this same value except he assumes a generation time of 29 years.
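The conversion between the different ways of stating the rate is simple enough to sketch in a few lines. The inputs (100 mutations per generation, a diploid genome of 6.4 × 10⁹ bp, generation times of 30 and 29 years) are the ones used in the text; the function names are just illustrative.

```python
DIPLOID_GENOME_BP = 6.4e9  # base pairs in a diploid human genome


def rate_per_bp_per_generation(mutations_per_generation,
                               genome_bp=DIPLOID_GENOME_BP):
    """Mutations per base pair per generation."""
    return mutations_per_generation / genome_bp


def rate_per_bp_per_year(mutations_per_generation, generation_time_years,
                         genome_bp=DIPLOID_GENOME_BP):
    """Mutations per base pair per year for a given generation time."""
    per_generation = rate_per_bp_per_generation(mutations_per_generation,
                                                genome_bp)
    return per_generation / generation_time_years


per_generation = rate_per_bp_per_generation(100)
per_year_30 = rate_per_bp_per_year(100, 30)
per_year_29 = rate_per_bp_per_year(100, 29)

print(f"per bp per generation:   {per_generation:.3g}")
print(f"per bp per year (30 yr): {per_year_30:.3g}")
print(f"per bp per year (29 yr): {per_year_29:.3g}")
```

The one-year difference in generation time moves the per-year rate by only a few percent, far less than the twofold spread between methods.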

There are many who think this value is considerably lower than previous estimates and that this casts doubt on the traditional times of divergence of chimps and humans and the other great apes. For example, Scally (2016) says that prior to the availability of direct sequencing data the "consensus value" was 10 × 10⁻¹⁰ per bp per year.1 That's twice the value he prefers today. It works out to 186 mutations per generation!
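Going the other way, from a per-bp-per-year rate back to mutations per generation, is the same arithmetic in reverse. In this sketch the diploid genome size of 6.4 × 10⁹ bp is the one used earlier; the 29-year generation time is my assumption, taken from Scally's preference, and it's the value that reproduces 186 (a 30-year generation would give 192).

```python
DIPLOID_GENOME_BP = 6.4e9  # base pairs in a diploid human genome


def mutations_per_generation(rate_per_bp_per_year, generation_time_years,
                             genome_bp=DIPLOID_GENOME_BP):
    """Convert a per-bp-per-year rate into new mutations per generation."""
    return rate_per_bp_per_year * genome_bp * generation_time_years


old_consensus_29 = mutations_per_generation(10e-10, 29)
old_consensus_30 = mutations_per_generation(10e-10, 30)

print(f"old consensus, 29-year generations: {old_consensus_29:.0f}")
print(f"old consensus, 30-year generations: {old_consensus_30:.0f}")
```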

I think it's been a long time since workers in the field assumed such a high mutation rate but let's assume he is correct and current estimates are considerably lower than those from twenty years ago.

You can calculate a time of divergence (t) between any two species if you know the genetic distance (d) between them measured in base pairs and the mutation rate (μ) in mutations per year.² The genetic distance can be estimated by comparing genome sequences and counting the differences. It represents the number of mutations that have become fixed in the two lineages since they shared a common ancestor. Haploid reference genome sequences are sufficient for this estimate.

The mutation rate (μ) is 100 mutations per generation divided by 30 years = 3.3 mutations per year.

The time of divergence is then calculated by dividing half that distance (in nucleotides) by the mutation rate (t = d/2 ÷ μ). (There are all kinds of "corrections" that can be applied to these values but let's ignore them for now and see what the crude data says.)

Human and chimp genomes differ by about 1.4%, which corresponds to 44.8 million nucleotide differences and d/2 = 22.4 million. Using 100 mutations per generation as the mutation rate means 5 × 10⁻¹⁰ per bp per year. From t = d/2 ÷ μ we get t = 6.8 million years.
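Here is a minimal Python sketch of that crude calculation, with no corrections applied. The function name and the 3.2 × 10⁹ bp haploid genome size are my assumptions (the genome size is the one implied by 1.4% = 44.8 million differences), and μ is rounded to 3.3 mutations per year as in the text.

```python
HAPLOID_GENOME_BP = 3.2e9  # assumed haploid reference genome size


def divergence_time_years(fraction_different, mutations_per_year,
                          genome_bp=HAPLOID_GENOME_BP):
    """Crude divergence time t = (d / 2) / mu."""
    d = fraction_different * genome_bp   # total nucleotide differences
    return (d / 2) / mutations_per_year


# 1.4% difference, mu = 3.3 mutations per year (100 per generation / 30 years)
t_30 = divergence_time_years(0.014, 3.3)

# Same distance with a 25-year generation time: mu = 100 / 25 = 4 per year
t_25 = divergence_time_years(0.014, 100 / 25)

print(f"{t_30 / 1e6:.1f} million years")  # 6.8
print(f"{t_25 / 1e6:.1f} million years")  # 5.6
```

The second call simply swaps in a shorter generation time, which is the single biggest lever on the result in this simple model.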

This is a reasonable number. It's consistent with the known fossil record and it's in line with the current views of a divergence time for chimps and humans.

However, there are reasons to believe that some of the assumptions in this calculation are wrong. For example, the average generation time is probably not 30 years in both lineages over the last few million years. It's probably shorter, at least in the chimp lineage where the current generation time is 25 years. Using a generation time of 25 years gives a divergence time of 5.6 million years.

In addition, the overall differences between the human and chimp genomes may be only 1.2% instead of 1.4% (see Moorjani et al., 2016). If you combine this value with the shorter generation time, you get 4.25 million years for the time of divergence.

Given the imprecision of the mutation rate, the question of real generation time, and problems in estimating the overall difference between humans and chimps, we can't know for certain what time of divergence is predicted by a molecular clock. On the other hand, the range of values (e.g. 4.25 - 6.8 million years) isn't cause for great concern.

So, what's the problem? The problem is that applying the human mutation rate (100 mutations per generation) to more distantly related species gives strange results. For example, Scally (2016) uses this mutation rate and a difference of 2.6% to estimate the time of divergence of humans and orangutans. The calculation yields a value of 26 million years. This is far too old according to the fossil record.

Several recent papers have addressed this issue (Scally, 2016; Moorjani et al., 2016a; Moorjani et al., 2016b). Most of the problem is solved by assuming a much higher mutation rate in the past. The biggest effect is the generation time in years. It may have been as low as 15 years for much of the past ten million years. Many of the problems go away when you adjust for this effect.

What puzzles me is the approach taken by Moorjani et al. in their two recent papers. They say that the "new" mutation rate is 5 × 10⁻¹⁰ per bp per year. That's exactly the value I use above. It's roughly 100 new mutations per child (per generation). Moorjani et al. (2016a) think this value is surprisingly low because it leads to a surprising result. They explain it in a section titled "The Puzzle."

They assume that the human and chimp genomes differ by 1.2%. That works out to 38 million mutations over the entire genome. This is 19 million fixed mutated alleles in each lineage if the mutation rate in both lineages is equal and constant.

If the mutation rate is 5 × 10⁻¹⁰ per bp per year then for a haploid genome this is 1.6 mutations per year. Dividing 19 million by 1.6 gives 11.9 million years (rounded to 12 million) for the time of divergence. This is the value quoted by the authors.
Taken at face value, this mutation rate suggests that African and non-African populations split over 100,000 years and a human-chimpanzee divergence time of 12 million years ago (Mya) (for a human–chimpanzee average nucleotide divergence of 1.2% at putatively neutral sites). These estimates are older than previously believed, but not necessarily at odds with the existing—and very limited—paleontological evidence for Homininae. More clearly problematic are the divergence times that are obtained for humans and orangutans or humans and OWMs [Old World Monkeys]. As an illustration, using whole genome divergence estimates for putatively neutral sites suggests a human–orangutan divergence time of 31 Mya and human–OWM divergence time of 62 Mya. These estimates are implausibly old, implying a human-oraguntan divergence well into the Oligocene and OWM-hominoid divergence well into or beyond the Eocene. Thus, the yearly mutation rates obtained from pedigrees seem to suggest dates that are too ancient to be readily reconciled with the current understanding of the fossil record.
Here's the problem. If the mutation rate is 100 mutations per generation then this applies to DIPLOID genomes. Some of the mutations are contributed by the mother and some (more) by the father. If you apply this rate to a DIPLOID genome then the number of mutations per year is 3.3 (100 mutations ÷ 30 years). Or,

         5 × 10⁻¹⁰ per bp per year × 6.4 × 10⁹ bp (diploid) = 3.2 mutations per year

Dividing 19 million mutations by 3.2 gives a time of divergence of 5.9 million years. This is a reasonable number but it's half the value calculated by Moorjani et al. (2016a).
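The two calculations differ only in the genome size that the per-bp rate is multiplied by. The sketch below reproduces both sets of numbers side by side; it doesn't say which genome size is the appropriate one, it only shows where the factor of two comes from. The names are mine and the inputs are the values quoted above.

```python
MU_PER_BP_PER_YEAR = 5e-10    # yearly rate quoted in the text
HAPLOID_BP = 3.2e9            # haploid genome size
DIPLOID_BP = 6.4e9            # diploid genome size
FIXED_PER_LINEAGE = 19e6      # half of ~38 million differences (1.2% divergence)


def years_to_accumulate(fixed_mutations, rate_per_bp_per_year, genome_bp):
    """Years needed to accumulate the fixed mutations at a constant rate."""
    mutations_per_year = rate_per_bp_per_year * genome_bp
    return fixed_mutations / mutations_per_year


t_haploid = years_to_accumulate(FIXED_PER_LINEAGE, MU_PER_BP_PER_YEAR, HAPLOID_BP)
t_diploid = years_to_accumulate(FIXED_PER_LINEAGE, MU_PER_BP_PER_YEAR, DIPLOID_BP)

print(f"haploid genome size: {t_haploid / 1e6:.1f} million years")  # 11.9
print(f"diploid genome size: {t_diploid / 1e6:.1f} million years")  # 5.9
```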

They also calculate a value of 12.1 million years for the human-chimp divergence in their second paper (and 15.1 million years for the divergence of humans and gorillas) (Moorjani et al., 2016b).

I think their calculations are wrong because they used the haploid genome size rather than the diploid genome where the mutations are accumulating. Both these papers appear in good journals and both were peer-reviewed. Furthermore, the senior author, Molly Przeworski, is a Professor at Columbia University (New York, NY, USA) and she's an expert in this field.

What am I doing wrong? Is it true that a mutation rate of ~100 mutations per generation means that humans and chimpanzees must have been separated for 12 million years as Moorjani et al. say? Or is the real value 5.9 million years as I've calculated above?

Image Credit: The chromosome image is from Wikipedia: Creative Commons Attribution 2.0 Generic license. The chimp photo is also from Wikipedia.

1. Scally takes this value from Nachman and Crowell (2000) who claim that the mutation rate is ~2.5 × 10⁻⁸ mutations per bp per generation in humans. This works out to 160 mutations per generation and an overall mutation rate of 8 × 10⁻¹⁰ per bp per year based on a generation time of 30 years, not 10 × 10⁻¹⁰ as Scally states.

2. This assumes that all mutations are neutral. The rate of fixation of neutral alleles over time is equal to the mutation rate. Since 8% of the genome is under selection, it's not true that all mutations are neutral but to a first approximation it's not far off.

Moorjani, P., Gao, Z., and Przeworski, M. (2016a) Human germline mutation and the erratic evolutionary clock. PLoS Biology, 14:e2000744. [doi: 10.1371/journal.pbio.2000744]

Moorjani, P., Amorim, C.E.G., Arndt, P.F., and Przeworski, M. (2016b) Variation in the molecular clock of primates. Proc. Natl. Acad. Sci. (USA), 113:10607-10612. [doi: 10.1073/pnas.1600374113]

Nachman, M.W., and Crowell, S.L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics, 156:297-304. [PDF]

Scally, A. (2016) The mutation rate in human evolution and demographic inference. Current Opinion in Genetics & Development, 41:36-43. [doi: 10.1016/j.gde.2016.07.008]

How much mitochondrial DNA in your genome?

Most mitochondrial genes have been transferred from the ancestral mitochondrial genome to the nuclear genome over the course of 1-2 billion years of evolution. They are no longer present in mitochondria but they are easily recognized because they resemble α-proteobacterial sequences more than the other nuclear genes [see Endosymbiotic Theory].

This process of incorporating mitochondrial DNA into the nuclear genome continues to this day. The latest human reference genome has about 600 examples of nuclear sequences of mitochondrial origin (= numts). Some of them are quite recent while others date back almost 70 million years—the limit of resolution for junk DNA [see Mitochondria are invading your genome!].

Estimating the number of numts isn't as easy as you might imagine. There are two main problems according to Hazkani-Covo and Martin (2017).
  1. Simple BLAST searches using mitochondrial sequences against the nuclear genome may overestimate the number of insertion events. That's because the hits need to be concatenated to see the extent of the insertion. You also need to take into account subsequent events, such as the insertion of a transposon into the mitochondrial fragment, that make a single insertion event look like two independent events in genomic analyses.
  2. The number of numts may be underestimated because mitochondrial sequences are usually thought to be contaminants and they are removed from the genome sequence. There are several documented cases.
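To illustrate the first problem, here is a toy Python sketch of how nearby hits might be concatenated into candidate insertion events. It is not the procedure used by Hazkani-Covo and Martin: the coordinates are invented, the `max_gap` threshold is an arbitrary assumption, and a real analysis would also check the mitochondrial coordinates, the orientation of the hits, and what lies in the gap (a transposon, for example).

```python
from typing import List, Tuple

# (chromosome, start, end) of one hit in the nuclear genome
Hit = Tuple[str, int, int]


def merge_hits(hits: List[Hit], max_gap: int = 2000) -> List[Hit]:
    """Merge hits on the same chromosome separated by at most max_gap bp."""
    merged: List[Hit] = []
    for chrom, start, end in sorted(hits):
        same_event = (
            merged
            and merged[-1][0] == chrom
            and start - merged[-1][2] <= max_gap
        )
        if same_event:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Invented example coordinates, for illustration only.
hits = [
    ("chr1", 1000, 1500),
    ("chr1", 1800, 2400),    # 300 bp after the previous hit: same event here
    ("chr1", 90000, 90500),  # far away: counted separately
    ("chr2", 500, 900),
]
events = merge_hits(hits)
print(len(hits), "hits ->", len(events), "candidate insertion events")
```

Counting raw hits would give four events in this example; merging gives three, which is the kind of overestimate the authors are warning about.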
The authors examined 36 genomes for the presence of mitochondrial DNA. They looked at each potential event separately to verify that it was a genuine numt. They also looked for nupts—plastid DNA—in 24 genomes.

The results vary from a low of 7 numts to a high of 6550 numts depending on the size of the genome. The best estimate for humans is 592, which is pretty much in line with earlier results. The number of nupts in plants and algae is about the same.

Image Credit: Moran, L.A., Horton, H.R., Scrimgeour, K.G., and Perry, M.D. (2012) Principles of Biochemistry 5th ed., Pearson Education Inc. page 175 [Pearson: Principles of Biochemistry 5/E]

Hazkani-Covo, E., and Martin, W.F. (2017) Quantifying the number of independent organelle DNA insertions in genome evolution and human health. Genome Biology and Evolution, evx078. [doi: 10.1093/gbe/evx078]

Lateral gene transfer in eukaryotes – where’s the evidence?

Lateral gene transfer (LGT), or horizontal gene transfer (HGT), is widespread in bacteria. It leads to the creation of pangenomes for many bacterial species where different subpopulations contain different subsets of genes that have been incorporated from other species. It also leads to confusing phylogenetic trees such that the history of bacterial evolution looks more like a web of life than a tree [The Web of Life].

Bacterial-like genes are also found in eukaryotes. Many of them are related to genes found in the ancestors of modern mitochondria and chloroplasts and their presence is easily explained by transfer from the organelle to the nucleus. Eukaryotic genomes also contain examples of transposons that have been acquired from bacteria. That's also easy to understand because we know how transposons jump between species.

The literature on eukaryotic genomes is full of additional claims of LGT from bacteria (and other eukaryotes) but many of those have subsequently been attributed to contamination of genomic DNA [see Contaminated genome sequences]. Nevertheless, it's commonly accepted that lateral gene transfer from bacteria to eukaryotes is real and each new eukaryotic genome has several hundred genes acquired from bacteria. It usually accounts for about 1% of the genome. For example, even after extensive analysis of tardigrade genome sequences, there's still somewhere between 1% and 2% HGT/LGT (Yoshida et al., 2017).

An extensive analysis of the finished human genome sequence still suggested that there were 145 genes derived from LGT (Crisp et al., 2015). Those same authors claim to have detected a low level of LGT/HGT in dozens of other eukaryotic species. Here's what they say in their abstract ...
We have taken advantage of the recent availability of a sufficient number of high-quality genomes and associated transcriptomes to carry out a detailed examination of HGT in 26 animal species (10 primates, 12 flies and four nematodes) and a simplified analysis in a further 14 vertebrates. Genome-wide comparative and phylogenetic analyses show that HGT in animals typically gives rise to tens or hundreds of active ‘foreign’ genes, largely concerned with metabolism. Our analyses suggest that while fruit flies and nematodes have continued to acquire foreign genes throughout their evolution, humans and other primates have gained relatively few since their common ancestor. We also resolve the controversy surrounding previous evidence of HGT in humans and provide at least 33 new examples of horizontally acquired genes.
That result was challenged by Salzberg (2017) who presented convincing evidence that many of the LGT claims were due to contamination, or they are mitochondrial genes, or they did not meet the minimal standards for LGT claims. He says,
In this study, I re-examined the claims of Crisp et al. [1] focusing on the human genes. Instead of using a large-scale, automated analysis, which by its very nature could enrich the results for artifactual findings, I looked at each human gene individually to determine whether the evidence is sufficient to support the conclusion that HGT occurred. An important principal here is that extraordinary claims require extraordinary evidence: there is no doubt that the vast majority of human genes owe their presence in the human genome to the normal process of inheritance by vertical descent. Thus, if other, more mundane processes can explain the alignments of a human gene sequence, these explanations are far more likely than HGT.
Bill Martin is also skeptical. He argues that even a low level of LGT in eukaryotes is too much. He claims there's no solid evidence to support those claims and they persist because researchers are not thinking critically about their results and the consequences (Martin, 2017). He says,
Claims for LGT among eukaryotes essentially did not exist before we had genomes because, in contrast to prokaryotes, there are no characters known among eukaryotes that require LGT in order to explain their distribution, except perhaps the spread of plastids via secondary symbiosis. Today, claims for eukaryote LGT are common in the literature, so common that students or nonspecialists might get the impression that there is no difference between prokaryotic and eukaryotic genetics. The time has come where we need to ask whether the many claims for eukaryote LGT – prokaryote to eukaryote LGT and eukaryote to eukaryote LGT – are true.
There are several problems with these claims according to Bill Martin. First, the pattern of LGT doesn't conform to what we see in bacteria where entire clades have inherited genes transferred from bacteria. Most of the claims of LGT are confined to a single species. Second, there's no reasonable mechanism for LGT as there is in bacteria.
The reality checks are simple. If the claims are true, then we need to see evidence in eukaryotic genomes for the cumulative effects of LGT over time, as we see with pangenomes in prokaryotes, and as we see with sequence divergence. That is, the number of genes acquired by LGT needs to increase in eukaryotic lineages as a function of time. We also need to see evidence for genetic mechanisms that could spread genes across eukaryote species (and order, and phylum) boundaries, as we see in prokaryotes. If we do not see the cumulative effects, and if there are no tangible genetic mechanisms, then we have to openly ask why, and entertain the possibility that the claims might not be true. Could it be that eukaryote LGT does not really exist to any significant extent in nature, but is an artefact produced by genome analysis pipelines?
This is not a popular view. That's not surprising coming from Bill Martin because he often challenges the current dogmas. He raises an issue that's more important than the presence of LGT in eukaryotes and that's the tendency of today's scientists to adopt a consensus view without thinking critically.
Why should I care about eukaryote LGT anyway? Is not the practical solution to just believe what everyone else does and “get with the programme” as a prominent eukaryote LGT proponent recently recommended that I do (Dan Graur is my witness). At eukaryote genome meetings, where folks pride themselves on the amounts and kinds of LGT they are finding in a particular eukaryote genome (not in all genomes), I feel like Winston Smith in Orwell's novel 1984, listening to an invented truth recited by members of the Inner Party. My mentors taught me that students of the natural sciences are not obliged to get with anyone's program, instead we are supposed to think independently and always to critically inspect, and re-inspect, current premises. Doing "get with the program" science in herds can produce curious effects. For example, the well-managed ENCODE project that ascribed a function to 80% of the human genome was a textbook case of everyone "getting with the program," and everyone, however, also missing the point, obvious to evolutionary biologists, that the headline result of 80% function cannot be true.

Image Credit: Scientific American, Doolittle, W. (2000) Uprooting the Tree of Life. Scientific American, February 2000.

Crisp, A., Boschetti, C., Perry, M., Tunnacliffe, A., and Micklem, G. (2015) Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes. Genome Biology, 16:50. [doi: 10.1186/s13059-015-0607-3]

Martin, W.F. (2017) Too Much Eukaryote LGT. BioEssays, 1700115. [doi: 10.1002/bies.201700115]

Salzberg, S.L. (2017) Horizontal gene transfer is not a hallmark of the human genome. Genome Biology, 18:85. [doi: 10.1186/s13059-017-1214-2]

Yoshida, Y., Koutsovoulos, G., Laetsch, D.R., Stevens, L., Kumar, S., Horikawa, D.D., Ishino, K., Komine, S., Kunieda, T., and Tomita, M. (2017) Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. PLoS Biology, 15:e2002266. [doi: 10.1371/journal.pbio.2002266]

Contaminated genome sequences

The authors of the original draft of the human genome sequence claimed that hundreds of genes had been acquired from bacteria by lateral gene transfer (LGT) (Lander et al., 2001). This claim was abandoned when the "finished" sequence was published a few years later (International Human Genome Consortium, 2004) because others had shown that the data was easily explained by differential gene loss in other lineages or by bacterial contamination in the draft sequence (see Salzberg, 2017).

Subsequent papers on eukaryotic genome sequences frequently reported the presence of several hundred bacterial genes due to LGT. The most extraordinary claim was that 17% of a tardigrade genome was due to LGT (Boothby et al., 2015). This claim led to the creation of a giant tardigrade that controlled the displacement-activated spore hub drive on Star Trek: Discovery. It was able to interface with the spore network by incorporating mycelium DNA using lateral gene transfer ['Star Trek: Discovery' Mudd-ies Up Tardigrade Science].

Unfortunately, the creators of the new Star Trek series didn't read the paper that came out a few months later showing that most of the bacterial DNA was due to contamination and not LGT (Koutsovoulos et al., 2016).

Bacterial DNA isn't the only contaminant. Longo et al. (2011) documented many cases of genome sequences that were contaminated with human DNA (Alu sequences). They found 492 genome sequences (out of 2,749) that contained detectable amounts of human DNA. Here's what they say in the abstract of their paper ...
Using a primate specific SINE, AluY, we screened 2,749 non-primate public databases from NCBI, Ensembl, JGI, and UCSC and have found 492 to be contaminated with human sequence. These represent species ranging from bacteria (B. cereus) to plants (Z. mays) to fish (D. rerio) with examples found from most phyla. The identification of such extensive contamination of human sequence across databases and sequence types warrants caution among the sequencing community in future sequencing efforts, such as human re-sequencing. We discuss issues this may raise as well as present data that gives insight as to how this may be occurring.
The take-home lesson is that draft sequences are often unreliable. Additional analysis (curation/annotation) often reveals numerous examples of contamination from unrelated sequences (e.g. Yoshida et al., 2017).

Boothby, T.C., Tenlen, J.R., Smith, F.W., Wang, J.R., Patanella, K.A., Nishimura, E.O., Tintori, S.C., Li, Q., Jones, C.D., and Yandell, M. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proc. Natl. Acad. Sci. (USA), 112:15976-15981. [doi: 10.1073/pnas.1510461112]

International Human Genome Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431(7011), 931-945. [doi: 10.1038/nature03001]

Koutsovoulos, G., Kumar, S., Laetsch, D.R., Stevens, L., Daub, J., Conlon, C., Maroon, H., Thomas, F., Aboobaker, A.A., and Blaxter, M. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc. Natl. Acad. Sci. (USA), 113:5053-5058. [doi: 10.1073/pnas.1600338113]

Lander, E. et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409:860-921. [doi: 10.1038/35057062]

Longo, M.S., O'Neill, M.J., and O'Neill, R.J. (2011) Abundant human DNA contamination identified in non-primate genome databases. PloS One, 6:e16410. [doi: 10.1371/journal.pone.0016410]

Salzberg, S. L. (2017) Horizontal gene transfer is not a hallmark of the human genome. Genome Biology, 18:85. [doi: 10.1186/s13059-017-1214-2]

Yoshida, Y., Koutsovoulos, G., Laetsch, D.R., Stevens, L., Kumar, S., Horikawa, D. D., Ishino, K., Komine, S., Kunieda, T., and Tomita, M. (2017) Comparative genomics of the tardigrades Hypsibius dujardini and Ramazzottius varieornatus. PLoS Biology, 15:e2002266. [doi: 10.1371/journal.pbio.2002266]

Parental age and the human mutation rate




Mutations are mostly due to errors in DNA replication. We have a pretty good idea of the accuracy of DNA replication: the overall error rate is about 10⁻¹⁰ per bp. There are about 30 cell divisions in females between zygote and formation of all egg cells. In males, there are about 400 mitotic cell divisions between zygote and formation of sperm cells. Using these average values, we can calculate the number of mutations per generation. It works out to about 130 mutations per generation [Estimating the Human Mutation Rate: Biochemical Method].
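A rough sketch of that biochemical estimate in Python looks like this. The haploid genome size of 3.2 × 10⁹ bp is my assumption, and the names are mine; the error rate and division counts are the round numbers quoted above. With these inputs the total lands a little under 140, which is the same ballpark as the figure of about 130; the exact number depends on the genome size and the division counts you assume.

```python
ERROR_RATE_PER_BP = 1e-10   # replication errors per bp per cell division
HAPLOID_GENOME_BP = 3.2e9   # assumed haploid genome size in base pairs
DIVISIONS_FEMALE = 30       # zygote to egg
DIVISIONS_MALE = 400        # zygote to sperm


def mutations_from_lineage(divisions,
                           error_rate=ERROR_RATE_PER_BP,
                           genome_bp=HAPLOID_GENOME_BP):
    """Expected replication errors accumulated along one germline lineage."""
    return divisions * error_rate * genome_bp


maternal = mutations_from_lineage(DIVISIONS_FEMALE)
paternal = mutations_from_lineage(DIVISIONS_MALE)
total = maternal + paternal

print(f"maternal: {maternal:.1f}, paternal: {paternal:.1f}, total: {total:.1f}")
```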

This value is similar to the estimate from comparing the sequences of different species (e.g. human and chimpanzee) based on the number of differences and the estimated time of divergence. This assumes that most of the genome is evolving at the rate expected for fixation of neutral alleles. This phylogenetic method gives a value of about 112 mutations per generation [Estimating the Human Mutation Rate: Phylogenetic Method].

The third way of measuring the mutation rate is to directly compare the genome sequence of a child and both parents (trios). After making corrections for false positives and false negatives, this method yields values of 60-100 mutations per generation depending on how the data is manipulated [Estimating the Human Mutation Rate: Direct Method]. The lower values from the direct method call into question the dates of the split between the various great ape lineages. This controversy has not been resolved [Human mutation rates] [Human mutation rates - what's the right number?].

It's clear that males contribute more to evolution than females. There's about a ten-fold difference in the number of cell divisions in the male line compared to the female line; therefore, we expect there to be about ten times more mutations inherited from fathers. This difference should depend on the age of the father since the older the father the more cell divisions required to produce sperm.

This effect has been demonstrated in many publications. A maternal age effect has also been postulated but that's been more difficult to prove. The latest study of Icelandic trios helps to nail down the exact effect (Jónsson et al., 2017).

The authors examined 1,548 trios consisting of parents and at least one offspring. They analyzed 2,682 Mb of genome sequence (84% of the total genome) and discovered an average of 70 mutation events per child.¹ This gives an overall mutation rate of 83 mutations per generation with an average generation time of 30 years. This is consistent with previous results.
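The step from 70 observed mutations to 83 per generation is presumably just a scaling from the surveyed 84% up to the whole genome. A one-function Python sketch, assuming the unsurveyed 16% mutates at the same rate as the rest (the function name is mine):

```python
def scale_to_whole_genome(observed_mutations, fraction_covered):
    """Scale a count seen in part of the genome up to the whole genome."""
    return observed_mutations / fraction_covered


per_generation = scale_to_whole_genome(70, 0.84)
print(round(per_generation))  # 83
```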

Jónsson et al. looked at 225 cases of three generation data in order to make sure that the mutations were germline mutations and not somatic cell mutations. They plotted the numbers of mutations against the age of the father and mother to produce the following graph from Figure 1 of their paper.

Look at parents who are 30 years old. At this age, females contribute about 10 mutations and males contribute about 50. This is only a five-fold difference, much less than we expect from the number of cell divisions. This suggests that the initial estimates of 400 cell divisions in males might be too high.

An age effect on mutations from the father is quite apparent and expected. A maternal age effect has previously been hypothesized but this is the first solid data that shows such an effect. The authors speculate that oocytes accumulate mutations with age, particularly mutations due to strand breakage.

1. Of these, 93% were single nucleotide changes and 7% were small deletions or insertions.

Jónsson, H., Sulem, P., Kehr, B., Kristmundsdottir, S., Zink, F., Hjartarson, E., Hardarson, M.T., Hjorleifsson, K.E., Eggertsson, H.P., and Gudjonsson, S.A. (2017) Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature, 549:519-522. [doi: 10.1038/nature24018]