Illuminating dark matter in human DNA?

A few months ago, the press office of the University of California at San Diego issued a press release with a provocative title ...

Illuminating Dark Matter in Human DNA - Unprecedented Atlas of the "Book of Life"

The press release was posted on several prominent science websites and Facebook groups. According to the press release, much of the human genome remains mysterious (dark matter) even 20 years after it was sequenced. According to the senior author of the paper, Bing Ren, we still don't understand how genes are expressed and how they might go awry in genetic diseases. He says,

A major reason is that the majority of the human DNA sequence, more than 98 percent, is non-protein-coding, and we do not yet have a genetic code book to unlock the information embedded in these sequences.

We've heard that story before and it's getting very boring. We know that 90% of our genome is junk, about 1% encodes proteins, and another 9% contains lots of functional DNA sequences, including regulatory elements. We've known about regulatory elements for more than 50 years so there's nothing mysterious about that component of noncoding DNA.

Ren and his colleagues are trying to identify regulatory elements by looking at transcription factor binding sites and their associated chromatin alterations. If this sounds familiar, it's because scientists have been mapping these sites for at least 25 years.

Efforts to fill in the blanks are broadly captured in an ongoing international effort called the Encyclopedia of DNA Elements (ENCODE), and include the work of Ren and colleagues. In particular, they have investigated the role and function of chromatin, a complex of DNA and proteins that forms chromosomes within the nuclei of eukaryotic cells.

Back in 2007 and 2012, ENCODE started mapping all the spurious transcription factor binding sites in the human genome. They also mapped all the spurious transcripts that are due to nonfunctional transcription and all the open chromatin domains associated with those sites. There are millions of these sites and most of them have nothing to do with normal gene expression and biological function.

The paper being promoted by the press release (Zhang et al., 2021) continues this work by mapping more spurious transcription factor binding sites. A tiny percentage of these candidate cis-regulatory elements (cCREs) might be genuine regulatory elements.

They applied assays to more than 600,000 human cells sampled from 30 adult human tissue types from multiple donors, then integrated that information with similar data from 15 fetal tissue types to reveal the status of chromatin at approximately 1.2 million candidate cis-regulatory elements in 222 distinct cell types.

The paper reports a total of 1,154,661 "distinct cCREs" spanning 14.8% of the genome. In typical fashion, there's no mention of spurious binding sites and very little attempt to identify real regulatory sites. As usual, all the sites are called cis-regulatory elements even though the authors must surely know that most of them are NOT regulatory elements. Most of that 14.8% of the genome is junk DNA but the authors forget to mention that possibility. I'm sure it just slipped their mind because they must know about spurious sites after all the controversy over earlier ENCODE results dating back to 2007.
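For a sense of scale, the paper's own numbers imply an average element size of about 400 bp, roughly what you'd expect for a typical open chromatin peak. A back-of-envelope sketch (the ~3.1 Gb genome size is my assumption, not a figure from the paper):

```python
# Back-of-envelope check on the reported cCRE numbers.
# The 3.1 Gb genome size is an assumption; the other values are from the paper.
GENOME_SIZE_BP = 3.1e9      # approximate haploid human genome (assumption)
N_CCRES = 1_154_661         # "distinct cCREs" reported by Zhang et al. (2021)
FRACTION_COVERED = 0.148    # 14.8% of the genome

avg_ccre_bp = GENOME_SIZE_BP * FRACTION_COVERED / N_CCRES
print(f"average cCRE size: {avg_ccre_bp:.0f} bp")
```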

Much of the emphasis in the paper is on the possible association of these sites with various diseases. Here's what they say in the introduction.

Genome-wide association studies (GWAS) have identified hundreds of thousands of genetic variants associated with a broad spectrum of human traits and diseases. The large majority of these variants are noncoding.

That's an interesting piece of information. If true, it means that there are at least 200,000 "genetic variants" or about eight per gene. Most of them will be in junk DNA and will have nothing to do with any nearby genes. They just happen to be linked to those genes. The authors identified 527 transcription factor binding sites that might possibly be associated with genetic variants that could be influencing nearby genes. Two of them look promising; one is associated with ulcerative colitis and one is associated with osteoarthritis. That's two out of 1.2 million.
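The arithmetic behind "about eight per gene" is simple (my sketch; the gene count is taken from the figure used later in this post):

```python
# "Hundreds of thousands" of GWAS variants -> at least 200,000,
# divided over ~25,000 genes (the count used later in this post).
n_variants = 200_000
n_genes = 25_000
print(n_variants / n_genes)
```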

I think this is what they mean when they say they're illuminating the dark matter of the genome.

Nobody questions the usefulness of GWAS in mapping genes to phenotypes, including diseases. Nobody doubts that some of these genetic association studies will actually reveal the genetic cause of the phenotypic change, as opposed to fortuitous associations with linked polymorphisms. Nobody is surprised that most of these genetic markers lie outside of protein-coding regions. What surprises me is that the benefits of these mapping experiments aren't presented in a context that recognizes the limitations and the difficulties in interpreting the data. In this case, why not discuss the fact that the vast majority of these so-called cCREs are likely to be spurious transcription factor binding sites that have nothing to do with gene regulation or disease? This means that the problem of identifying a few functional promoters amid a sea of irrelevant features is very difficult. Why not say that?

I struggle to come up with an explanation for this behavior. Is it because the authors know about spurious binding sites but don't want to complicate their paper by mentioning them? Or is it because the authors don't really understand junk DNA and spurious binding sites? And how do you explain the reviewers who approve these papers for publication?


Zhang, K., Hocker, J.D., Miller, M., Wang, A., Preisel, S. and Ren, B. (2021) A single-cell atlas of chromatin accessibility in the human genome. Cell 184:5985-6001.e19 [doi: 10.1016/j.cell.2021.10.024]


MIT Professor Rick Young doesn’t understand junk DNA

Richard ("Rick") Young is a Professor of Biology at the Massachusetts Institute of Technology and a member of the Whitehead Institute. His area of expertise is the regulation of gene expression in eukaryotes.

He was interviewed by Jorge Conde and Hanne Winarsky on a recent podcast (Feb. 1, 2021) where the main topic was "From Junk DNA to an RNA Revolution." They get just about everything wrong when they talk about junk DNA including the Central Dogma, historical estimates of the number of genes, confusing noncoding DNA with junk, alternative splicing, the number of functional RNAs, the amount of regulatory DNA, and assuming that scientists in the 1970s were idiots.

In this episode, a16z General Partner Jorge Conde and Bio Eats World host Hanne Winarsky talk to Professor Rick Young, Professor of Biology and head of the Young Lab at MIT—all about “junk” DNA, or non-coding DNA.

Which, it turns out—spoiler alert—isn’t junk at all. Much of this so-called junk DNA actually encodes RNA—which we now know has all sorts of incredibly important roles in the cell, many of which were previously thought of as only the domain of proteins. This conversation is all about what we know about what that non-coding genome actually does: how RNA works to regulate all kinds of different gene expression, cell types, and functions; how this has dramatically changed our understanding of how disease arises; and most importantly, what this means we can now do—programming cells, tuning functions up or down, or on or off. What we once thought of as “junk” is now giving us a powerful new tool in intervening in and treating disease—bringing in a whole new category of therapies.

Here's what I don't understand. How could a prominent scientist at one of the best universities in the world be so ignorant of a topic he chooses to discuss on a podcast? Perhaps you could excuse a busy scientist who doesn't have the time to research the topic but what excuse can you offer to explain why the entire culture at MIT and the Whitehead must also be ignorant? Does nobody there ever question their own ideas? Do they only read the papers that support their views and ignore all those that challenge those views?

This is a very serious question. It's the most difficult question I discuss in my book. Why has the false narrative about junk DNA, and many other things, dominated the scientific literature and become accepted dogma among leading scientists? Something is seriously wrong with science.


ENCODE 3: A lesson in obfuscation and opaqueness

The Encyclopedia of DNA Elements (ENCODE) is a large-scale, and very expensive, attempt to map all of the functional elements in the human genome.

The preliminary study (ENCODE 1) was published in 2007 and the main publicity campaign surrounding that study focused on the fact that much of the human genome was transcribed. The implication was that most of the genome is functional. [see: The ENCODE publicity campaign of 2007].

The ENCODE 2 results were published in 2012 and the publicity campaign emphasized that up to 80% of our genome is functional. Many stories in the popular press touted the death of junk DNA. [see: What did the ENCODE Consortium say in 2012]

Both of these publicity campaigns, and the published conclusions, were heavily criticized for not understanding the distinction between fortuitous transcription and real genes and for not understanding the difference between fortuitous binding sites and functional binding sites. Hundreds of knowledgeable scientists pointed out that it was ridiculous for ENCODE researchers to claim that most of the human genome is functional based on their data. They also pointed out that ENCODE researchers ignored most of the evidence supporting junk DNA.

ENCODE 3 has just been published and the hype has been toned down considerably. Take a look at the main publicity article just published by Nature (ENCODE 3). The Nature article mentions ENCODE 1 and ENCODE 2 but it conveniently ignores the fact that Nature heavily promoted the demise of junk DNA back in 2007 and 2012. The emphasis now is not on how much of the genome is functional—the main goal of ENCODE—but on how much data has been generated and how many papers have been published. You can read the entire article and not see any mention of previous ENCODE/Nature claims. In fact, they don't even tell you how many genes ENCODE found or how many functional regulatory sites were detected.

The News and Views article isn't any better (Expanded ENCODE delivers invaluable genomic encyclopedia). Here's the opening paragraph of that article ...
Less than 2% of the human genome encodes proteins. A grand challenge for genomic sciences has been mapping the functional elements — the regions that determine the extent to which genes are expressed — in the remaining 98% of our DNA. The Encyclopedia of DNA Elements (ENCODE) project, among other large collaborative efforts, was established in 2003 to create a catalogue of these functional elements and to outline their roles in regulating gene expression. In nine papers in Nature, the ENCODE consortium delivers the third phase of its valuable project.1
You'd think with such an introduction that you would be about to learn how much of the genome is functional according to ENCODE 3 but you will be disappointed. There's nothing in that article about the number of genes, the number of regulatory sites, or the number of other functional elements in the human genome. It's almost as if Nature wants to tell you about all of the work involved in "mapping the functional elements" without ever describing the results and conclusions. This is in marked contrast to the Nature publicity campaigns of 2007 and 2012 where they were more than willing to promote the (incorrect) conclusions.

In 2020 Nature seems to be more interested in obfuscation and opaqueness. One other thing is certain, the Nature editors and writers aren't the least bit interested in discussing their previous claims about 80% of the genome being functional!

I guess we'll have to rely on the ENCODE Consortium itself to give us a summary of their most recent findings. The summary paper has an intriguing title (Perspectives on ENCODE) that almost makes you think they will revisit the exaggerated claims of 2007 and 2012. No such luck. However, we do learn a little bit about the human genome.
  • 20,225 protein-coding genes [almost 1000 more than the best published estimates - LAM]
  • 37,595 noncoding genes [I strongly doubt they have evidence for that many functional genes]
  • 2,157,387 open chromatin regions [what does this mean?]
  • 1,224,154 transcription factor binding sites [how many are functional?]
That's it. The ENCODE Consortium seems to have learned only two things since 2012. They learned that it's better to avoid mentioning how much of the genome is functional in order to avoid controversy and criticism and they learned that it's best to ignore any of their previous claims for the same reason. This is not how science is supposed to work but the ENCODE Consortium has never been good at showing us how science is supposed to work.

Note: I've looked at some of the papers to try and find out if ENCODE stands by its previous claim that most of the genome is functional but they all seem to be written in a way that avoids committing to such a percentage or addressing the criticisms from 2007 and 2012. The only exception is a paper stating that cis-regulatory elements occupy 7.9% of the human genome (Expanded encyclopaedias of DNA elements in the human and mouse genomes). Please let me know if you come across anything interesting in those papers.


1. Isn't it about time to stop dwelling on the fact that 2% (actually less than 1%) of our genome encodes protein? We've known for decades that there are all kinds of other functional regions of the genome. No knowledgeable scientist thinks that the remaining 98% (99%) has no function.

The coronavirus life cycle

The coronavirus life cycle is depicted in a figure from Fung and Liu (2019). See below for a brief description.
The virus particle attaches to receptors on the cell surface (mostly ACE2 in the case of SARS-CoV-2). It is taken into the cell by endocytosis and then the viral membrane fuses with the host membrane releasing the viral RNA. The viral RNA is translated to produce the 1a and 1ab polyproteins, which are cleaved to produce 16 nonstructural proteins (nsps). Most of the nsps assemble to form the replication-transcription complex (RTC). [see Structure and expression of the SARS-CoV-2 (coronavirus) genome]

RTC transcribes the original (+) strand creating (-) strands that are subsequently copied to make more viral (+) strands. RTC also produces a cluster of nine (-) strand subgenomic RNAs (sgRNAs) that are transcribed to make (+) sgRNAs that serve as mRNAs for the production of the structural proteins. N protein (nucleocapsid) binds to the viral (+) strand RNAs to help form new viral particles. The other structural proteins are synthesized in the endoplasmic reticulum (ER) where they assemble to form the protein-membrane virus particle that engulfs the viral RNA.

New virus particles are released when the vesicles fuse with the plasma membrane.

The entire life cycle takes about 10-16 hours and about 100 new virus particles are released before the cell commits suicide by apoptosis.


Fung, T.S. and Liu, D.X. (2019) Human coronavirus: host-pathogen interaction. Annual review of microbiology 73:529-557. [doi: 10.1146/annurev-micro-020518-115759]


Structure and expression of the SARS-CoV-2 (coronavirus) genome


Coronaviruses are RNA viruses, which means that their genome is RNA, not DNA. All of the coronaviruses have similar genomes but I'm sure you are mostly interested in SARS-CoV-2, the virus that causes COVID-19. The first genome sequence of this virus was determined by Chinese scientists in early January and it was immediately posted on a public server [GenBank MN908947]. The viral RNA came from a patient in intensive care at the Wuhan Jinyintan Hospital (China). The paper was accepted on Jan. 20th and it appeared in the Feb. 3rd issue of Nature (Zhou et al. 2020).

By the time the paper came out, several universities and pharmaceutical companies had already constructed potential therapeutics and several others had already cloned the genes and were preparing to publish the structures of the proteins.1

By now there are dozens and dozens of sequences of SARS-CoV-2 genomes from isolates in every part of the world. They are all very similar because the mutation rate in these RNA viruses is not high (about 10⁻⁶ per nucleotide per replication). The original isolate has a total length of 29,891 nt not counting the poly(A) tail. Note that these RNA viruses are about four times larger than a typical retrovirus; they are the largest known RNA viruses.
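That per-nucleotide rate, multiplied across the genome, explains why the isolates are so similar. A quick hedged calculation:

```python
# Expected new mutations per genome copy, using the rate quoted above.
MUTATION_RATE = 1e-6     # per nucleotide per replication (from the post)
GENOME_NT = 29_891       # length of the original isolate, excluding poly(A)

expected = MUTATION_RATE * GENOME_NT
print(f"~{expected:.2f} mutations per replication")
```

In other words, the vast majority of replication events produce a letter-perfect copy of the genome.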

The RNA genome that's inside the virus particle looks very much like a typical eukaryotic mRNA molecule. It has a 5′ cap and a 3′ poly(A) tail of about 40-50 nucleotides. This RNA is translated by the host protein synthesis components as soon as it is injected into the cell.

The genome contains a number of genes where the word "gene" is used to define the open reading frame of the proteins produced by the virus. The initial translation products are two large polyproteins that are subsequently cleaved by proteases to produce smaller proteins. Most of the time the viral RNA is translated to give the 1a polyprotein (~460 kDa) that is subsequently cleaved to produce 11 distinct non-structural proteins (nsps). Sometimes the ribosomes stall near the stop codon when they encounter a frameshift element (FSE) containing a "slippery site" that causes the ribosomes to slip back one nucleotide (a -1 frameshift). This avoids the stop codon and allows translation to continue into the 1b gene. The large 1ab polyprotein (~780 kDa) produces another five proteins after cleavage.
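To make the slippery-site logic concrete, here's a toy simulation of a -1 frameshift. The sequence and slip position are invented for illustration; they have nothing to do with the real SARS-CoV-2 frameshift element.

```python
# Toy model of a -1 programmed ribosomal frameshift (invented sequence,
# not the real FSE). Reading in frame 0 hits a stop codon; slipping back
# one nucleotide at the "slippery site" avoids it and extends translation.
STOP_CODONS = {"UAA", "UAG", "UGA"}

def read_codons(rna, slip_at=None):
    """Read codons from rna; at position slip_at, slip back 1 nt (-1 frameshift)."""
    codons, i = [], 0
    while i + 3 <= len(rna):
        if i == slip_at:
            i -= 1                      # the ribosome slips back one nucleotide
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
        i += 3
    return codons

rna = "AUGUUUAAAUAAGGCGGC"              # frame 0 stops at the UAA at position 9
print(read_codons(rna))                 # no slip: translation stops early
print(read_codons(rna, slip_at=9))      # -1 slip: reads through in the new frame
```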


The functions of many (but not all) of these proteins have been discovered. Nsp12 is an RNA-dependent RNA polymerase (RdRp). This is the enzyme that will copy the viral RNA to produce more infectious RNAs but it also produces a number of other transcripts (see below). RdRp is part of a large replication-transcription complex (RTC) that includes a number of accessory proteins (nsp2, nsp4, nsp6, nsp7+nsp8, nsp9, and nsp10). The exact functions of all these accessory proteins haven't been worked out in detail.

Nsp3 is a papain-like protease (PLpro) and nsp5 is a 3C-like cysteine protease (3CLpro). They are responsible for cleaving polyproteins 1a and 1ab.

Nsp13 is a 5′→3′ helicase (Hel) that's required for transcription. Nsp14 is a 3′→5′ exonuclease involved in proofreading. Nsp15 appears to be a uridine-specific endonuclease and nsp16 is an S-adenosylmethionine methyltransferase.

The open reading frames at the 3′ end of the viral RNA cannot be translated because of the stop codon at the end of the 1ab "gene." Production of these proteins (e.g. S, M, E etc.) has to wait until later in the life cycle of the virus after the assembly of the RTC complex. As we shall see shortly, the synthesis of these late proteins involves a complicated process that requires production of many different transcripts.

The injected virus RNA is a (+) strand so production of new viral RNA requires two rounds of transcription. First, the RTC complex binds to the 3′ end of the (+) strand and copies it all the way to the 5′ end producing a (-) strand. This strand is then copied to produce new (+) strands that can be incorporated into new virus particles. The new (+) strands also act as messenger RNA to produce more 1a and 1ab polyproteins.2


Transcription from the 3′ end of the (+) strand also produces a group of subgenomic RNAs (sgRNAs). The 3′ end contains a number of transcription-regulating sequences (TRS-B) consisting of a 10 nucleotide AU-rich stretch of RNA. There is another TRS (TRS-L) at the 5′ end next to a leader sequence (L). When the RTC encounters a TRS it will pause and this may cause it to switch and continue transcription at TRS-L. This produces an sgRNA consisting of a stretch from the 3′ end (body) joined to the leader sequence at the 5′ end (leader).

The example shown below shows template switching between a TRS-B located at the 5′ end of the S gene and TRS-L to produce an S sgRNA. This sgRNA is then transcribed to produce an mRNA that can be translated to produce S protein.
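The leader-to-body fusion can be sketched in a few lines of code. Everything here (the mini "genome", the placement and number of TRS copies, the gene labels) is invented for illustration, with a conserved core motif used only as a search marker for the switch points.

```python
# Toy model of discontinuous transcription. Each TRS after the first marks a
# body site (TRS-B); the RTC pauses there and resumes at the leader TRS
# (TRS-L), fusing the 5' leader to a 3'-coterminal body. Sequences invented.
TRS = "ACGAAC"   # core TRS motif (my assumption), used here only as a marker

def subgenomic_rnas(genome):
    """Return one leader+body fusion for every body TRS in the genome."""
    leader_end = genome.find(TRS) + len(TRS)   # leader runs through TRS-L
    leader = genome[:leader_end]
    sgrnas, pos = [], genome.find(TRS, leader_end)
    while pos != -1:
        sgrnas.append(leader + genome[pos:])   # body extends to the 3' end
        pos = genome.find(TRS, pos + 1)
    return sgrnas

genome = "GGUUU" + TRS + "AAACCCAAA" + TRS + "SPIKE" + TRS + "NPROT"
for sg in subgenomic_rnas(genome):
    print(sg)
```

Note that each sgRNA runs from its TRS-B to the 3′ end of the genome, so the sgRNAs form a nested, 3′-coterminal set, as described above.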


Each of the genes at the 3′ end of the virus genome is associated with a TRS-B sequence so transcription from the 3′ end produces 9 different sgRNAs corresponding to the nine functional genes. (Open reading frame 10 is not a functional gene.) The figure on the right is from Kim et al. (2020).

Some of these "late" genes are required for assembly of new virus particles. S is the gene for the trimeric spike protein that mediates attachment of the virus to the ACE2 receptor on the surface of the host cell. M is a membrane glycoprotein— it is the most abundant structural protein. E is the envelope protein. N is the nucleocapsid protein that binds RNA and helps package it into the virus particle.

Reading frame 3 seems to produce two proteins, 3a and 3b. It's likely that 3a is an ion channel protein on the virus surface. Proteins 7 and 8 are additional viral assembly proteins. I don't know the function of protein 6 and I'm not sure if anyone else knows. Many coronaviruses don't make protein 6.

DISCLAIMER: I am not an expert on coronaviruses. Everything in this post is stuff I have learned in the past few days from reading published papers. Feel free to correct all the mistakes I have made.


1. The behavior of these Chinese scientists doesn't match with the conspiracy theory that China engineered this virus—perhaps they weren't in on the conspiracy? :-)

2. I don't know how the transcription complex manages to copy right to the ends of the viral RNA. It seems to involve some complicated RNA secondary structures but I didn't bother reading the relevant papers.

References and Bibliography

Bar-On, Y.M., Flamholz, A., Phillips, R. and Milo, R. (2020) Science Forum: SARS-CoV-2 (COVID-19) by the numbers. Elife 9: e57309. [doi: 10.7554/eLife.57309]

Kim, D., Lee, J-Y., Yang, J-S., Kim, J.W., Kim, V.N., and Chang, H. (2020) The Architecture of SARS-CoV-2 Transcriptome. Cell 181:914-921 [doi: 10.1016/j.cell.2020.04.011]

Fung, T.S. and Liu, D.X. (2019) Human coronavirus: host-pathogen interaction. Annual review of microbiology 73: 529-557. [doi: 10.1146/annurev-micro-020518-115759]

Sawicki, S.G., Sawicki, D.L. and Siddell, S.G. (2007) A Contemporary View of Coronavirus Transcription. Journal of virology 81(1):20-29. [doi: 10.1128/JVI.01358-06]

Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L., Zhang, W., Si, H.-R., Zhu, Y., Li, B. and Huang, C.-L. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798): 270-273. [doi: 10.1038/s41586-020-2012-7]


The frequency of splicing errors reflects the balance between selection and drift

Splice variants are very common in eukaryotes. We know that it's possible to detect dozens of different splice variants for each gene with multiple introns. In the past, these variants were thought to be examples of differential regulation by alternative splicing but we now know that most of them are due to splicing errors. Most of the variants have been removed from the sequence databases but many remain and they are annotated as examples of alternative splicing, which implies that they have a biological function.

I have blogged about splice variants many times, noting that alternative splicing is a very real phenomenon but it's probably restricted to just a small percentage of genes. Most of the splice variants that remain in the databases are probably due to splicing errors. They are junk RNA [The persistent myth of alternative splicing].

The ongoing controversy over the origin of splice variants is beginning to attract attention in the scientific literature although it's fair to say that most scientists are still unaware of the controversy. They continue to believe that abundant alternative splicing is a real phenomenon and they don't realize that the data is more compatible with abundant splicing errors.

Some molecular evolution labs have become interested in the controversy and have devised tests of the two possibilities. I draw your attention to a paper that was published 18 months ago.
Saudemont, B., Popa, A., Parmley, J. L., Rocher, V., Blugeon, C., Necsulea, A., Meyer, E., and Duret, L. (2017) The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome biology, 18:208. [doi: 10.1186/s13059-017-1344-6]

Background
Most eukaryotic genes are subject to alternative splicing (AS), which may contribute to the production of protein variants or to the regulation of gene expression via nonsense-mediated messenger RNA (mRNA) decay (NMD). However, a fraction of splice variants might correspond to spurious transcripts and the question of the relative proportion of splicing errors to functional splice variants remains highly debated.

Results
We propose a test to quantify the fraction of AS events corresponding to errors. This test is based on the fact that the fitness cost of splicing errors increases with the number of introns in a gene and with expression level. We analyzed the transcriptome of the intron-rich eukaryote Paramecium tetraurelia. We show that in both normal and in NMD-deficient cells, AS rates strongly decrease with increasing expression level and with increasing number of introns. This relationship is observed for AS events that are detectable by NMD as well as for those that are not, which invalidates the hypothesis of a link with the regulation of gene expression. Our results show that in genes with a median expression level, 92–98% of observed splice variants correspond to errors. We observed the same patterns in human transcriptomes and we further show that AS rates correlate with the fitness cost of splicing errors.

Conclusions
These observations indicate that genes under weaker selective pressure accumulate more maladaptive substitutions and are more prone to splicing errors. Thus, to a large extent, patterns of gene expression variants simply reflect the balance between selection, mutation, and drift.
This is another example of a well-written paper that explains the controversy and the two competing explanations; namely, functional alternative splicing and splicing errors. The authors suggest a test that might help distinguish between these two possibilities.
We propose here a test to quantify the fraction of splice variants corresponding to errors, i.e. having a negative impact on the fitness of organisms. The basis of this test is that the strength of splice signals is expected to reflect a balance between selection (which favors alleles that are optimal for splicing efficiency) and mutation and random genetic drift (which can lead to the fixation of non-optimal alleles). This selection-mutation-drift equilibrium therefore predicts a higher splicing accuracy at introns where errors are more deleterious for the fitness of organisms. Hence, if [splice variants] predominantly correspond to splicing errors, one should expect a negative correlation between the rate of [splice variant] events and their cost in terms of resource allocation (metabolic cost, mobilization of cellular machineries). The noisy splicing model therefore makes several specific predictions regarding the [splice variant] rate according to whether splice variants are detectable by NMD and according to the expression level, length, and number of introns of genes.1
They carry out their main test using genes in Paramecium tetraurelia because this organism has short introns (20-35 bp) that can be covered in single RNA-seq reads. Then they apply the same test to human genes and conclude ...
For a given error rate, errors are expected to be more costly (in terms of metabolic resources and mobilization of cellular machineries) in highly expressed genes. Hence the fitness cost of mis-splicing is expected to increase with increasing expression level. Indeed, this is precisely what we observed in humans: the strength of selection against deleterious mutations at splice sites is strongly correlated to gene expression level (Fig. 6b). Since the risk of producing erroneous transcripts increases with the number of introns, this implies that all else being equal, there should be a stronger selective pressure against mis-splicing in intron-rich genes. The mutation-selection-drift theory therefore predicts that introns from weakly expressed/intron-poor genes should accumulate more non-optimal substitutions in their splice signals and therefore should show a higher splicing error rate. The relationships that we observe between [splice variant] rate, expression level, and intron number are perfectly consistent with these predictions, both in human (Fig. 5) and in paramecia (Fig. 3).
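The qualitative prediction is easy to reproduce with a toy mutation-selection-drift model. Everything below is my own construction, not the authors' analysis: I assume the selective cost of a non-optimal splice signal scales with expression level, and I use the standard equilibrium result that, under symmetric mutation, the probability of the non-optimal state falls off roughly as 1/(1 + exp(4·Ne·s)).

```python
import math

# Toy mutation-selection-drift model (my sketch, not from the paper).
# With symmetric mutation between an optimal and a non-optimal splice signal,
# the equilibrium probability of the non-optimal state is ~1/(1 + exp(4*Ne*s)).
# Assume the selection coefficient s scales with expression level.
def p_nonoptimal(expression, Ne=1e4, cost_per_unit_expression=1e-7):
    s = cost_per_unit_expression * expression   # hypothetical scaling
    return 1.0 / (1.0 + math.exp(4 * Ne * s))

for expr in (1, 10, 100, 1000):
    print(expr, round(p_nonoptimal(expr), 3))
```

Weakly expressed genes sit near the neutral limit (non-optimal signals drift to appreciable frequencies) while highly expressed genes keep their splice signals optimal, which is the pattern the paper reports.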
I'm not going to argue that this is a definitive answer to the problem but I'm pleased that more and more groups are promoting the idea that splicing errors are a viable explanation of the data. I'm also pleased that more attention is being paid to the fact that slightly deleterious events can persist in the population because they are effectively invisible to selection. This counters the prevailing narrative that everything we observe must be adaptive and functional.

Note: Saudemont et al. (2017) review the literature on the rate of splicing errors and note that it can be as high as 3%. My own review of the literature suggests that an error rate of this magnitude is rare but splicing is still error-prone. I estimate that a typical splice site is only 99.9% effective and, in addition, inappropriate splice sites are activated about 0.1% of the time in a typical human gene. Saudemont et al. alerted me to a paper by Stepankiw et al. (2015) that I hadn't read before. Those authors presented evidence that 1% of all transcripts are incorrectly spliced due to errors in the spliceosome reaction.
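My numbers imply an overall error rate in the same ballpark as the published estimates. A hedged back-of-envelope calculation, treating each intron as an independent opportunity for error:

```python
# Back-of-envelope: chance that a transcript carries at least one splicing
# error, using my estimates above (99.9% per-intron splice-site accuracy plus
# ~0.1% cryptic-site activation). Assumes introns fail independently.
def fraction_mis_spliced(n_introns, site_error=0.001, cryptic_rate=0.001):
    p_intron_ok = (1 - site_error) * (1 - cryptic_rate)
    return 1 - p_intron_ok ** n_introns

print(f"{fraction_mis_spliced(8):.1%}")   # a typical human gene with ~8 introns
```

For a typical multi-intron gene this lands between 1% and 2%, consistent with the Stepankiw et al. estimate.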


1. The authors refer to all splice variants as examples of alternative splicing (AS). I think this is confusing since the term "alternative splicing" has been used for decades to refer to real examples of differential splicing with a biological function. I think we should reserve that term for biologically meaningful examples of splice variants as opposed to variants due to splicing errors.

Stepankiw, N., Raghavan, M., Fogarty, E. A., Grimson, A., and Pleiss, J.A. (2015) Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic acids research, 43:8488-8501. [doi: 10.1093/nar/gkv763]

Are multiple transcription start sites functional or mistakes?

If you look in the various databases you'll see that most human genes have multiple transcription start sites. The evidence for the existence of these variants is solid—they exist—but it's not clear whether the minor start sites are truly functional or whether they are just due to mistakes in transcription initiation. They are included in the databases because annotators are unable to distinguish between these possibilities.

Let's look at the entry for the human triosephosphate isomerase gene (TPI1; Gene ID 7167).


The correct mRNA is NM_000365, third from the top. (Trust me on this!) The three other variants have different transcription start sites: two of them are upstream and one is downstream of the major site. Are these variants functional or are they simply transcription initiation errors? This is the same problem that we dealt with when we looked at splice variants. In that case I concluded that most splice variants are due to splicing errors and true alternative splicing is rare.

This is not a difficult question to answer when you are looking at specific well-characterized genes such as TPI1. The three variants are present at very low concentrations, they are not conserved in other species, and they encode variant proteins that have never been detected. It seems reasonable to go with the null hypothesis; namely, that they are non-functional transcripts due to errors in transcription initiation.

However, this approach is not practical for every one of the 25,000 genes in the human genome so several groups have looked for a genomics experiment that will address the question. I'd like to recommend a recent paper in PLoS Biology that tries to do this in a very clever way. It's also a paper that does an excellent job of explaining the controversy in a way that all scientific papers should copy.1
Xu, C., Park, J.-K., and Zhang, J. (2019) Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biology, 17(3), e3000197. [doi: 10.1371/journal.pbio.3000197]

Abstract

Alternative transcriptional initiation (ATI) refers to the frequent observation that one gene has multiple transcription start sites (TSSs). Although this phenomenon is thought to be adaptive, the specific advantage is rarely known. Here, we propose that each gene has one optimal TSS and that ATI arises primarily from imprecise transcriptional initiation that could be deleterious. This error hypothesis predicts that (i) the TSS diversity of a gene reduces with its expression level; (ii) the fractional use of the major TSS increases, but that of each minor TSS decreases, with the gene expression level; and (iii) cis-elements for major TSSs are selectively constrained, while those for minor TSSs are not. By contrast, the adaptive hypothesis does not make these predictions a priori. Our analysis of human and mouse transcriptomes confirms each of the three predictions. These and other findings strongly suggest that ATI predominantly results from molecular errors, requiring a major revision of our understanding of the precision and regulation of transcription. [my emphasis - LAM]

Author summary

Multiple surveys of transcriptional initiation showed that mammalian genes typically have multiple transcription start sites such that transcription is initiated from any one of these sites. Many researchers believe that this phenomenon is adaptive because it allows production of multiple transcripts, from the same gene, that potentially vary in function or post-transcriptional regulation. Nevertheless, it is also possible that each gene has only one optimal transcription start site and that alternative transcriptional initiation arises primarily from molecular errors that are slightly deleterious. This error hypothesis makes a series of predictions about the amount of transcription start site diversity per gene, relative uses of the various start sites of a gene, among-tissue and across-species differences in start site usage, and the evolutionary conservation of cis-regulatory elements of various start sites, all of which are verified in our analyses of genome-wide transcription start site data from the human and mouse. These findings strongly suggest that alternative transcriptional initiation largely reflects molecular errors instead of molecular adaptations and require a rethink of the precision and regulation of transcription.
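Testing predictions like these comes down to simple summary statistics over TSS read counts. Here's a minimal sketch of the two obvious metrics — the fractional use of the major TSS and the Shannon entropy of TSS usage. The gene names and counts below are invented toy data, not from the paper; Xu et al. worked with genome-wide CAGE data.

```python
import math

# Toy CAGE-style data: per-gene read counts at each observed TSS.
# Gene names and counts are invented for illustration only.
tss_counts = {
    "geneA_high_expr": [9500, 30, 20, 10],  # highly expressed gene
    "geneB_low_expr":  [60, 25, 10, 5],     # weakly expressed gene
}

def major_tss_fraction(counts):
    """Fraction of initiation events at the most-used TSS."""
    return max(counts) / sum(counts)

def tss_entropy(counts):
    """Shannon entropy (bits) of TSS usage; higher = more diverse."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

for gene, counts in tss_counts.items():
    print(gene, f"major={major_tss_fraction(counts):.2f}",
          f"entropy={tss_entropy(counts):.2f} bits")
```

Under the error hypothesis, highly expressed genes should show a higher major-TSS fraction and lower entropy (selection polices errors more effectively when they are costlier), which is the pattern the toy numbers above illustrate and the pattern the authors report in real data.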
I'm not going to describe the experimental results; if you're interested you can read the paper yourself. Instead, I want to focus on the way the authors present the problem and how it could be resolved.

One of the important issues in these kinds of problems is not whether there are well-established cases where the phenomenon is responsible for functional alternatives but whether the phenomenon is widespread. In this case, we know of specific examples of genes with multiple transcription start sites (TSS) that have a well-established function. The authors include a brief summary of these examples and conclude with an important caveat.
Nevertheless, alternative TSSs with verified benefits account for only a tiny fraction of all known TSSs, while the vast majority of TSSs have unknown functions. More than 90,000 TSSs are annotated for approximately 20,000 human protein-coding genes in ENSEMBL genome reference consortium human build 37 (GRCh37). Recent surveys using high-throughput sequencing methods such as deep cap analysis gene expression (deepCAGE) showed that human TSSs are much more abundant than what has been annotated. Are most TSSs of a gene functionally distinct, and is ATI generally adaptive? While this possibility exists, here we propose and test an alternative, nonadaptive hypothesis that is at least as reasonable as the adaptive hypothesis. Specifically, we propose that there is only one optimal TSS per gene and that other TSSs arise from errors in transcriptional initiation that are mostly slightly deleterious. This hypothesis is based on the consideration that transcriptional initiation has a limited fidelity, and harmful ATI may not be fully suppressed by natural selection if the harm is sufficiently small or if the cost of fully suppressing harmful ATI is even larger than the benefit from suppressing it.
This is how scientific papers should be written but too often we see scientists who assume that because some variants are functional it must mean that all variants are functional. They don't bother to mention the possibility that some could be functional but most are not.

Why is it important to decide whether multiple transcription start sites are functional? The simple answer is that it's always better to know the truth, but there's more to it than that. Because these variants are included in the sequence databases, they are usually assumed to be functional. Let's say someone wants to look at 5' UTR sequences in order to see if there are specific signals that control RNA stability. In the case of the TPI1 gene (see above) they will get four different results because there are four different transcription start sites, and the programs that scan the databases aren't able to recognize that three of these might be artifacts. That's a problem.

It also affects the definition of a gene and the amount of DNA devoted to genes. If the longest transcript is taken as the true size of the gene, as it often is, then this misrepresents the true nature of the gene. There's no easy way to fix this problem unless we pay annotators to closely examine each individual gene to figure out which transcripts are functional and which ones are not. They've done this for many splice variants, which is why many splice variants have been removed from the sequence databases, but it's a labor-intensive and expensive task.

Up until now, most scientists have not been aware that there's a problem. As is the case with alternative splicing and other phenomena, the average scientist just assumes that the variants in the databases represent true functional alternatives that contribute to gene expression. The authors of this paper (Xu et al., 2019) believe that their results with transcription start sites raise a much more general concern, and they want to alert everyone to that possibility. That's why they say,
Our results on ATI echo recent findings about a number of phenomena that increase transcriptome diversity, including alternative polyadenylation, alternative splicing, and several forms of RNA editing. They have all been shown to be largely the results of molecular errors instead of adaptive regulatory mechanisms. Together, these findings reveal the astonishing imprecision of key molecular processes in the cell, contrasting the common view of an exquisitely perfected cellular life.
Read that last sentence very carefully because it addresses what I think is the main problem. It's a question of contradictory worldviews that color one's interpretation of the data. If you think that life is exquisitely designed (by natural selection) then you tend to look at all variants as part of an extremely complex system that fine-tunes gene expression. On the other hand, if you think that the drift-barrier hypothesis is valid then you tend to discount the power of natural selection to weed out all transcription and splicing errors and you see biochemistry as an inherently messy process.


1. I've been highly critical of papers about junk DNA and alternative splicing because they often ignore the fact that there is a controversy. They do not mention that there is solid evidence for junk DNA and solid evidence that alternative splicing is uncommon.

The persistent myth of alternative splicing

I'm convinced that widespread alternative splicing does not occur in humans or in any other species. It's true that the phenomenon exists but it's restricted to a small number of genes—probably fewer than 1000 genes in humans. Most of the unusual transcripts detected by modern technology are rare and unstable, which is consistent with the idea that they are due to splicing errors. Genome annotators have rejected almost all of those transcripts.

You can see links to my numerous posts on this topic at: Alternative splicing and the gene concept and Are splice variants functional or noise?.

The figure shows an idealized version of alternative splicing producing three different proteins with different combinations of exon coding regions.1 The idea is that cells can increase diversity by employing alternative splicing and this is often used as a rationale to explain how humans can get away with the same number of genes as other animals. According to this view, humans use alternative splicing to make several different proteins from a single gene. It's widely believed that >90% of human genes use alternative splicing to make several hundred thousand different proteins from only 20,000 protein-coding genes.

The other day I was reading a very interesting paper on bacterial group II introns in which the authors demonstrated that group II intron RNAs in Lactococcus lactis could insert themselves into bacterial mRNAs by reverse splicing (LaRoche-Johnston et al., 2018). Recall that spliceosomal introns in eukaryotes probably evolved from group II introns; this paper fills in one of the crucial steps in that process. But that is not the main take-home lesson from the paper. You can tell what the authors think is important from the title: "Bacterial group II introns generate genetic diversity by circularization and trans-splicing from a population of intron-invaded mRNAs."

They have chosen to emphasize the idea that bacteria can generate diversity through alternative splicing just like eukaryotes. Here's how they view alternative splicing in eukaryotes.
A hallmark of eukaryotic cells is the ability of their numerous introns, sequences that interrupt genes, to generate genetic diversity through the expression of several different protein variants from a single gene.
The authors are repeating an idea—I think it's a myth—that alternative splicing is widespread in eukaryotes and its purpose is to generate diversity. Like most authors these days, they think this is a proven fact that does not need to be critically examined.

I'm interested in tracing the origin of this idea to see what evidence has been advanced to support it, so I was intrigued when the authors included a reference to a paper I hadn't seen before. It was a 2017 review by Bush et al. in Philosophical Transactions B with an interesting title: "Alternative splicing and the evolution of phenotypic novelty" (Bush et al., 2017). The authors begin by explaining the problem they are trying to solve ...
Soon after the publication of the human genome sequence—which revealed a lower than expected number of protein coding genes—alternative splicing was proposed as a candidate to explain the diversification in the number of cell types observed in some eukaryotic lineages (a higher number of cell types in a given species is assumed to reflect increased organism complexity). As gene duplication has long been associated with functional innovation, it was initially assumed that overall gene number should correlate with the number of cell types. Gene duplication rates, however, failed to reflect the diversification of cell types observed in several eukaryotic lineages. There is a significant correlation between the two but weaker than expected and only moderately better if the analysis is restricted to metazoans. Whole-genome duplication events at the base of the vertebrate lineage, which have the highest number of cell types among eukaryotes, have comparable numbers of genes to most invertebrates (that have not undergone whole genome duplication). The poor relationship between a species' cell-type diversity and its total gene number has become known as ‘the G-value paradox’.
That's a bad beginning because, in my opinion, there's no such thing as the "G-value paradox;" thus, the authors are attempting to solve a problem that doesn't exist [Deflated egos and the G-value paradox]. As you might have guessed, the "solution" is alternative splicing.
Although other genomic features have been shown to correlate with cell-type number and may be important contributors to the evolution of complexity, alternative splicing—as a mechanism allowing transcript diversification in the absence of increases in gene number—is a prime candidate to explain the G-value paradox. Comparative studies have reported marked differences in the prevalence of alternative splicing across eukaryotic lineages as well as a significant correlation between alternative splicing and the number of cell types per species. These results are in principle consistent with an adaptive role of alternative splicing in determining a genome's functional information capacity and facilitating transcript diversification in species with greater numbers of cell types.
Statements like that are no longer surprising because there seems to be an overwhelming consensus within the field that widespread alternative splicing explains the low number of genes in humans. I argue that there's no need to "explain" the low number of genes and, furthermore, alternative splicing is not common.

In most cases where there is a genuine scientific controversy there will be facts and evidence to support both sides, and the argument is over how to interpret those facts. The case that most unusual transcripts are just noise due to splicing errors rests on the following lines of evidence.
  • Splicing is associated with a known error rate that's consistent with the production of frequent spurious transcripts.
  • The unusual transcripts are usually present at less than one copy per cell.
  • The unusual transcripts are rapidly degraded and usually don't leave the nucleus.
  • The transcripts are not conserved.
  • The predicted protein products of these transcripts have never been detected.
  • The number of different unusual transcripts produced from each gene makes it extremely unlikely that they could all be biologically relevant.
  • The number of detectable transcripts correlates with the length of the gene and the number of introns, which is consistent with splicing errors.
  • Gene annotators who have looked closely at the data have determined that >90% of them are spurious junk RNA or noise.
If this were a genuine scientific controversy then there should be an equal amount of evidence in support of the competing hypothesis so let's see what evidence this review quotes to support the following claim in the second paragraph of their paper.
Alternative splicing is common in many eukaryote lineages, including metazoans, fungi and plants, with deep transcriptome sequencing of the human genome showing over 95% of multi-exon genes produce at least one alternatively spliced isoform [10,11].
Let's check out references 10 and 11 to see if there's strong evidence of alternative splicing.

Reference 10 is to a 2008 Nature Genetics paper by my colleagues from the University of Toronto (Toronto, Ontario, Canada) (Pan et al., 2008). I'm very familiar with this paper. The authors used the new technique of mRNA-Seq to assay six different tissues for mRNAs containing a splice junction sequence. These represent events where canonical splicing did not occur. By combining their data with earlier data from other experiments they estimate that ~95% of all human genes produce unusual transcripts. They estimate that a typical multi-exon gene produces an average of seven different transcript variants.
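A detail worth keeping in mind when reading these deep-sequencing surveys: even a modest error rate guarantees that nearly every multi-exon gene will yield at least one variant junction read once sequencing is deep enough. A rough sketch, assuming the ~1% mis-splicing rate from Stepankiw et al. and independent reads (the depths are arbitrary illustrative values):

```python
def p_detect_variant(error_rate, depth):
    """Probability of sequencing at least one aberrant junction read,
    assuming each read is independently mis-spliced at error_rate."""
    return 1.0 - (1.0 - error_rate) ** depth

ERROR_RATE = 0.01  # ~1% of transcripts mis-spliced (Stepankiw et al., 2015)

for depth in (10, 100, 1000):
    print(f"{depth:5d} reads -> P(gene scored 'alternatively spliced') = "
          f"{p_detect_variant(ERROR_RATE, depth):.3f}")
```

At a depth of 1000 reads per gene the detection probability is essentially 1, so "alternative splicing in >90% of multi-exon genes" is exactly what pure splicing noise would produce; the mere abundance of variants distinguishes nothing.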

Note that I'm careful to use the term "transcript variants" because without further evidence we have no way of knowing whether these transcripts are noise or real examples of alternative splicing. Unfortunately, my colleagues didn't make this distinction—they refer to all events as alternative splicing. They don't even mention the possibility that they could be looking at splicing artifacts.

So, the authors of the more recent paper (Bush et al., 2017) are correct to refer to the 2008 paper as support for their claim because that's what the 2008 paper by my colleagues actually said. However, my colleagues didn't present any evidence that their abundant transcripts were functional and that alternative splicing actually makes a significant contribution to genetic diversity. It looks like a false claim of abundant alternative splicing is being accepted without critical evaluation—perhaps because it was published in a good journal and everyone assumes it underwent rigorous peer review.

What about the second reference, reference 11? That's a 2008 Nature paper by Wang et al. Those authors are much more strident in their claims; for example, they begin their paper by saying,
The mRNA and protein isoforms produced by alternative processing of primary RNA transcripts may differ in structure, function, localization, or other properties. Alternative splicing in particular is known to affect more than half of all human genes, and has been proposed as the primary driver of the evolution of phenotypic complexity in mammals.
If you've been following my discussion you will know that it's simply not true that half of all human genes exhibit alternative splicing. What IS true is that multiple transcript variants can be detected in these genes but whether they are noise or not remains to be determined.

The Wang et al. paper sets out to extend the data using mRNA-Seq in the same way as the Pan et al. paper except they assayed more tissues and collected more transcript sequences. They determined that "alternative splicing is nearly universal." Even when they restrict their analysis to more abundant transcripts, they calculate that 92% of multi-exon genes undergo alternative splicing. The authors do not discuss splicing errors and they present no evidence that these transcripts represent biologically relevant alternative splicing.

So what we have is a couple of frequently cited papers from 2008 that made unsubstantiated claims about the abundance of alternative splicing. Those claims have been widely accepted in spite of the fact that there's plenty of evidence that they are wrong. Most scientists seem to be completely unaware of the fact that the data can be best explained by splicing errors so they publish papers assuming that abundant alternative splicing is a well-documented fact.

How did this happen? The points I'm making have all been published in the scientific literature so they should be known to anyone who researches the topic. If you are thinking about working in this field, I recommend a recent paper by Bhuiyan et al. (2018). Here's what they say in the abstract.
Although most genes in mammalian genomes have multiple isoforms, an ongoing debate is whether these isoforms are all functional as well as the extent to which they increase the functional repertoire of the genome. To ground this debate in data, it would be helpful to have a corpus of experimentally-verified cases of genes which have functionally distinct splice isoforms (FDSIs).
This is how a scientific paper should be written. You explain the hypothesis and outline a test to see if it is correct. Here's what they conclude after looking at the data,
Recent studies have challenged whether most genes can produce multiple functional splice isoforms and our results can offer something to both sides of the debate. We acknowledge that other researchers may have different definitions of a functional splice isoform, but we view the debate within our operational definition – a functional splice isoform is one that is necessary for the gene’s overall function.

One side of the debate claims that most genes have multiple functionally distinct isoforms. Viewing our findings optimistically, we provide what is to our knowledge the only substantial list of human and mouse genes for which this is actually documented to be true. The low number of genes with such evidence can be interpreted as a vast opportunity for experimentalists to identify the functions of the isoforms for > 80% of genes. The other side of the debate approaches alternative splicing with a less Panglossian view, with the null hypothesis being that most isoforms do not have a specific distinct function. Multiple studies taking a genomic or evolutionary perspective have concluded that it is unlikely that most genes have multiple functional splice isoforms. Viewed pessimistically, our data is consistent with this body of work. If the literature lacks supporting evidence for widespread FDSIs, the null hypothesis should be maintained and claims that every observed isoform has a function to be discovered should be viewed skeptically.

To our knowledge, this report represents the first effort to curate the literature in order to determine the genes where splicing increases the genome’s functional potential. Such individual reports have been generally ignored in the debate about the function of alternative splicing, which has instead focused on databases and high-throughput data sets. Our estimate that only 4% of human and 9% of mouse genes have evidence for functionally distinct isoforms serves as both a sobering reminder of the limited evidence, and a motivation for increased experimental efforts to settle the debate.
Let's hope that more and more scientists wake up to the fact that there's limited evidence for widespread alternative splicing and that in the absence of evidence the null hypothesis is junk RNA.

I'm not holding my breath.


1. Real alternative splicing also occurs with noncoding RNA genes but I'll restrict my discussion to protein-coding genes.

Bhuiyan, S.A., Ly, S., Phan, M., Huntington, B., Hogan, E., Liu, C.C., Liu, J., and Pavlidis, P. (2018) Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics, 19:637. [doi: 10.1186/s12864-018-5013-2]

Pan, Q., Shai, O., Lee, L.J., Frey, B.J., and Blencowe, B. J. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40:1413-1415. [doi: 10.1038/ng.259]

Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., and Burge, C.B. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456:470-476. [doi: 10.1038/nature07509]