Sequencing human diploid genomes

Most eukaryotes, including humans, are diploid: they have two copies of each autosome. Thousands of human genomes have been sequenced, but in almost all cases the resulting genome sequence is a mixture of sequences from the two homologous chromosomes. If a site is heterozygous (different alleles on each chromosome), the alleles are recorded as variants.

It would be much better to have complete sequences of each individual chromosome (a true diploid sequence) in order to better understand genetic heterogeneity in the human population. Until recently, there were only two examples in the databases: the first was Craig Venter's genome (Levy et al., 2007) and the second was an Asian male (YH) (Cao et al., 2015).

Diploid sequences are much more expensive and time-consuming to produce than standard reference-based sequences. That's because you can't just map sequence reads to the human reference genome in order to obtain alignment and position information. Instead, you pretty much have to construct de novo assemblies of each chromosome. Using modern technology, it's relatively easy to generate millions of short sequence reads and then match them up to the reference genome to get a genome sequence that combines information from both chromosomes. That's why it's now possible to sequence a genome for less than $1000 (US). De novo assemblies require much more data and more computing power.

A group at 10x Genomics, a private company in Pleasanton, California (USA), has developed new software to assemble diploid genome sequences. They used the technology to add seven new diploid sequences to the databases (Weisenfeld et al., 2017). The resulting assemblies are just draft genomes with plenty of gaps, but this is still a significant achievement.

Here's the abstract:
Weisenfeld, N.I., Kumar, V., Shah, P., Church, D.M., and Jaffe, D.B. (2017) Direct determination of diploid genome sequences. Genome Research, 27:757-767. [doi: 10.1101/gr.214874.116]

Determining the genome sequence of an organism is challenging, yet fundamental to understanding its biology. Over the past decade, thousands of human genomes have been sequenced, contributing deeply to biomedical research. In the vast majority of cases, these have been analyzed by aligning sequence reads to a single reference genome, biasing the resulting analyses, and in general, failing to capture sequences novel to a given genome. Some de novo assemblies have been constructed free of reference bias, but nearly all were constructed by merging homologous loci into single “consensus” sequences, generally absent from nature. These assemblies do not correctly represent the diploid biology of an individual. In exactly two cases, true diploid de novo assemblies have been made, at great expense. One was generated using Sanger sequencing, and one using thousands of clone pools. Here, we demonstrate a straightforward and low-cost method for creating true diploid de novo assemblies. We make a single library from ∼1 ng of high molecular weight DNA, using the 10x Genomics microfluidic platform to partition the genome. We applied this technique to seven human samples, generating low-cost HiSeq X data, then assembled these using a new “pushbutton” algorithm, Supernova. Each computation took 2 d on a single server. Each yielded contigs longer than 100 kb, phase blocks longer than 2.5 Mb, and scaffolds longer than 15 Mb. Our method provides a scalable capability for determining the actual diploid genome sequence in a sample, opening the door to new approaches in genomic biology and medicine.


Cao, H., Wu, H., Luo, R., Huang, S., Sun, Y., Tong, X., Xie, Y., Liu, B., Yang, H., and Zheng, H. (2015) De novo assembly of a haplotype-resolved human genome. Nature Biotechnology, 33:617-622. [doi: 10.1038/nbt.3200]

Levy, S., Sutton, G., Ng, P.C., Feuk, L., Halpern, A.L., Walenz, B.P., Axelrod, N., Huang, J., Kirkness, E.F., Denisov, G., Lin, Y., MacDonald, J.R., Pang, A.W.C., Shago, M., Stockwell, T.B., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S.A., Busam, D.A., Beeson, K.Y., McIntosh, T.C., Remington, K.A., Abril, J.F., Gill, J., Borman, J., Rogers, Y.-H., Frazier, M.E., Scherer, S.W., Strausberg, R.L., and Venter, J.C. (2007) The diploid genome sequence of an individual human. PLoS Biology, 5:e254. [doi: 10.1371/journal.pbio.0050254]

What’s in Your Genome?: Chapter 4: Pervasive Transcription (revised)

I'm working (slowly) on a book called What's in Your Genome?: 90% of your genome is junk! The first chapter is an introduction to genomes and DNA [What's in Your Genome? Chapter 1: Introducing Genomes]. Chapter 2 is an overview of the human genome. It's a summary of known functional sequences and known junk DNA [What's in Your Genome? Chapter 2: The Big Picture]. Chapter 3 defines "genes" and describes protein-coding genes and alternative splicing [What's in Your Genome? Chapter 3: What Is a Gene?].

Chapter 4 is all about pervasive transcription and genes for functional noncoding RNAs. I've finally got a respectable draft of this chapter. This is an updated summary—the first version is at: What's in Your Genome? Chapter 4: Pervasive Transcription.
Chapter 4: Pervasive Transcription

How much of the genome is transcribed?
The latest data indicates that about 90% of the human genome is transcribed if you combine all the data from all the cell types that have been analyzed. This is about the same percentage that was reported by ENCODE in their preliminary study back in 2007 and about the same percentage they reported in the 2012 papers. Most of the transcripts are present in less than one copy per cell. Most of them are only found in one or two cell types. Most of them are not conserved in other species.
How do we know about pervasive transcription?
There are several technologies capable of detecting all the transcripts in a cell. The most powerful is RNA-Seq, a technique that copies RNAs into cDNA and then performs massively parallel ("next gen") sequencing on all the cDNAs. The sequences are then matched back to the reference genome to see which parts of the genome were transcribed. The technique is capable of detecting transcripts present at less than one copy per cell.
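To make the idea concrete, here's a minimal sketch (not any real RNA-Seq pipeline) of the final bookkeeping step: given reads that have already been aligned and reduced to (chromosome, start, end) intervals, compute what fraction of the genome is covered by at least one read. The input format and function name are my own; real analyses start from BAM files and use tools like bedtools.

    from collections import defaultdict

    def transcribed_fraction(aligned_reads, genome_size):
        """Fraction of the genome covered by at least one aligned read.

        aligned_reads: iterable of (chrom, start, end) tuples, 0-based, end-exclusive.
        genome_size: total genome length in bp.
        """
        by_chrom = defaultdict(list)
        for chrom, start, end in aligned_reads:
            by_chrom[chrom].append((start, end))

        covered = 0
        for intervals in by_chrom.values():
            intervals.sort()
            cur_start, cur_end = intervals[0]
            for start, end in intervals[1:]:
                if start <= cur_end:          # read overlaps or abuts the current block
                    cur_end = max(cur_end, end)
                else:                         # gap: close the block and start a new one
                    covered += cur_end - cur_start
                    cur_start, cur_end = start, end
            covered += cur_end - cur_start
        return covered / genome_size

    # Toy example: three reads on a 1,000 bp "genome"; two of them overlap.
    reads = [("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 600, 700)]
    print(transcribed_fraction(reads, 1_000))   # 0.25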
Different kinds of noncoding RNAs
There are ribosomal RNAs, tRNAs, and a variety of unique RNAs such as those that are part of RNase P, the signal recognition particle, etc. In addition, there are six main classes of other noncoding RNAs in humans: small nuclear RNAs (snRNAs); small nucleolar RNAs (snoRNAs); microRNAs (miRNAs); short interfering RNAs (siRNAs); PIWI-interacting RNAs (piRNAs); and long noncoding RNAs (lncRNAs). There are many proven examples of functional RNAs in each of the main classes, but there are also large numbers of putative members that may or may not be true functional noncoding RNAs.
        Box 4-1: Long noncoding RNAs (lncRNAs)
There are more than 100,000 transcripts identified as lncRNAs. Nobody knows how many of these are actually real functional lncRNAs and how many are just spurious transcripts. The best analyses suggest that fewer than 20,000 meet the minimum criteria for function, and probably only a fraction of those are actually functional.
Understanding transcription
It's important to understand that transcription is an inherently messy process. Regulatory proteins and RNA polymerase initiation complexes will bind to thousands of sites in the human genome that have nothing to do with transcription of nearby genes.
        Box 4-2: Revisiting the Central Dogma
Many scientists and journalists believe that the discovery of massive numbers of noncoding RNAs overthrows the Central Dogma of Molecular Biology. They are wrong.
        Box 4-3: John Mattick proves his hypothesis?
John Mattick claims that the human genome produces tens of thousands of regulatory RNAs that are responsible for fine-tuning the expression of the protein-coding genes. He was given the 2012 Chen Award by the Human Genome Organization for "proving his hypothesis over the course of 18 years." He has not proven his hypothesis.
Antisense transcription
Some transcripts are complementary to the coding strand of protein-coding genes. This is consistent with spurious transcription yielding junk RNA, but many workers have suggested functional roles for most of these antisense RNAs.
What the scientific papers don't tell you
There are hundreds of scientific papers devoted to proving that most newly-discovered noncoding RNAs have a biological function. What they don't tell you is that most of these transcripts are present in concentrations that are inconsistent with function (<1 molecule per cell). They also don't tell you that conservation is the best measure of function and these transcripts are (mostly) not conserved. More importantly, the majority of these papers don't even mention the possibility that these transcripts could be junk RNA produced by spurious transcription. That's a serious omission—it means that science writers who report on this work are unaware of the controversy.
On the origin of new genes
Some scientists are willing to concede that most transcripts are just noise, but they claim this is an adaptation for future evolution. The idea here is that the presence of these transcripts makes it easier to evolve new protein-coding genes. While it's true that such genes could evolve more readily in a genome full of noise and junk, evolution cannot select for future utility, so this cannot be the reason for such a sloppy genome.
How do you determine function?
The best way to determine function is to take a single transcript and show that it has a demonstrable function. If you take a genomics approach, then the best way to narrow down the list is to concentrate on those transcripts that are present in sufficient concentrations and are conserved in related species. In the absence of evidence, the null hypothesis is junk.
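Expressed as code, the genomics approach is just a filter. The sketch below uses made-up field names (copies_per_cell, conserved, cell_types) and made-up thresholds; real studies would pull these values from expression and conservation data and argue about the cutoffs.

    from dataclasses import dataclass

    @dataclass
    class Transcript:
        name: str
        copies_per_cell: float   # average abundance where the transcript is detected
        conserved: bool          # detectable sequence conservation in related species
        cell_types: int          # number of cell types in which it has been detected

    def candidate_functional(transcripts, min_copies=1.0, min_cell_types=2):
        """Keep transcripts that pass minimal abundance and conservation filters.

        Everything that fails stays in the default category: presumed junk RNA
        until someone demonstrates a function.
        """
        return [t for t in transcripts
                if t.copies_per_cell >= min_copies
                and t.conserved
                and t.cell_types >= min_cell_types]

    # Toy example: only the second transcript survives the filter.
    transcripts = [
        Transcript("TCONS_0001", 0.05, False, 1),
        Transcript("TCONS_0002", 12.0, True, 8),
    ]
    print([t.name for t in candidate_functional(transcripts)])   # ['TCONS_0002']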
Biochemistry is messy
We're used to the idea that errors in DNA replication give rise to mutations and mutations drive evolution. We're less used to the idea that all other biochemical processes have much higher error rates. This is true of highly specific enzymes and it's even more true of complex processes like transcription, RNA processing (splicing), and translation. The idea that transcription errors could give rise to spurious transcripts in large genomes is perfectly consistent with everything we know about such processes. In fact, it's inevitable that spurious transcripts will be common in such genomes.
        Box 4-4: The random genome project
Sean Eddy has proposed an experiment to establish a baseline level of spurious transcripts and to demonstrate that the null hypothesis is the best explanation for the majority of transcripts. He suggests that scientists construct a synthetic chromosome of random DNA sequences and insert it into a human cell line. The next step is to perform an ENCODE project on this DNA. He predicts that the methods will detect hundreds of transcription factor binding sites and transcripts.
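The sequence-generation step of Eddy's proposal is trivial to simulate. Here's a sketch that builds a random "chromosome" with roughly human-like GC content (the length, GC value, and function name are my own choices); the interesting parts of the experiment, synthesizing the DNA, inserting it into a cell line, and running the ENCODE assays, are obviously not things code can do.

    import random

    def random_chromosome(length, gc_content=0.41, seed=42):
        """Random DNA sequence with a specified GC content (the human genome is ~41% GC)."""
        rng = random.Random(seed)
        weights = [(1 - gc_content) / 2, gc_content / 2,
                   gc_content / 2, (1 - gc_content) / 2]   # A, C, G, T
        return "".join(rng.choices("ACGT", weights=weights, k=length))

    # A 1 Mb random "chromosome": the null expectation for ENCODE-style assays.
    chrom = random_chromosome(1_000_000)
    print(chrom[:60])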
Change your worldview
There are two ways of looking at biochemical processes within cells. The first imagines that everything has a function and cells are as fine-tuned and functional as a Swiss watch. The second imagines that biochemical processes are just good enough to do the job and there's lots of mistakes and sloppiness. The first worldview is inconsistent with the evidence. The second worldview is consistent with the evidence. If you are one of those people who think that cells and genomes are the products of adaptive excellence then it's time to change your worldview.


Cold Spring Harbor tells us about the “dark matter” of the genome (Part I)


This is a podcast from Cold Spring Harbor [Dark Matter of the Genome, Pt. 1 (Base Pairs Episode 8)]. The authors try to convince us that most of the genome is mysterious "dark matter," not junk. The main theme is that the genome contains transposons that could play an important role in evolution and disease.

Here are a few facts.
  • A gene is a DNA sequence that's transcribed. There are about 20,000 protein-coding genes and they cover about 25% of the genome (including introns). It's false to say that genes only occupy 2% of the genome. In addition to protein-coding genes, there are about 5,000 noncoding genes that take up about 5% of the genome. Most of them have been known for decades.
  • It has been known for many decades that the human genome has no more than 30,000 genes. This fact was known by knowledgeable scientists long before the human genome sequence was published.
  • It has been known for decades that about 50% of our genome is composed of defective bits and pieces of once-active transposons. Thus, most of our genome looks like junk and behaves like junk. It is not some mysterious "dark matter." (The podcast actually says that 50% of our genome is defective transposons, but the hosts claim this is a recent discovery and that it's not junk.)
  • The evidence for junk DNA comes from many different sources. It's not a mystery. It's really junk DNA. The term "junk DNA" was not created to disguise our ignorance of what's in your genome.
  • In addition to genes, there are lots of other functional regions of the genome. No knowledgeable scientists ever thought that the only functional parts of the genome were the exons of protein-coding genes.
There's much value in research on ALS but does it have to be coupled with an incorrect view of our genome? How many errors can you recognize in this podcast? Keep in mind that this is sponsored by one of the leading labs in the world.
Most of the genome is not genes, but another form of genetic information that has come to be known as the genome’s “dark matter.” In this episode, we explore how studying this unfamiliar territory could help scientists understand diseases such as ALS.


Experts meet to discuss non-coding RNAs – fail to answer the important question

The human genome is pervasively transcribed. More than 80% of the genome is complementary to transcripts that have been detected in some tissue or cell type. The important question is whether most of these transcripts have a biological function. How many genes are there that produce functional non-coding RNA?

There's a reason why this question is important. It's because we have every reason to believe that spurious transcription is common in large genomes like ours. Spurious, or accidental, transcription occurs when the transcription initiation complex binds nonspecifically to sites in the genome that are not real promoters. Spurious transcription also occurs when the initiation complex (RNA polymerase plus factors) fires in the wrong direction from real promoters. Binding and inappropriate transcription are aided by the binding of transcription factors to nonpromoter regions of the genome, a well-known feature of all DNA-binding proteins [see Are most transcription factor binding sites functional?].

The controversy over the role of these transcripts has been around for many decades but it has become more important in recent years as many labs have focused on identifying transcripts. After devoting much time and effort to the task, these groups are not inclined to admit they have been looking at junk RNA. Instead, they tend to focus on trying to prove that most of the transcripts are functional.

Keep in mind that the correct default explanation is that a transcript is just spurious junk unless someone has demonstrated that it has a function. This is especially true of transcripts that are present at less than one copy per cell, are not conserved in other species, and have been detected in only a few types of cells. That's the majority of transcripts.

Nobody knows how many different transcripts have been detected since there's no comprehensive database that combines all of the data. I suspect there are several hundred thousand different transcripts. Human genome annotators have struggled to represent this data accurately. They have rejected or ignored most of the transcripts and focused on those that are most likely to have a biological function. Unfortunately, their criteria for functionality are weak, and this leads them to include a great many putative genes in their annotated genome. For example, the latest annotation by Ensembl lists 22,521 genes for noncoding RNAs. This is slightly more than the total number of protein-coding genes (20,338) [Human assembly and gene annotation].

It's important to note two things about the work of these annotators. First, they have correctly rejected most of the transcripts. Second, they cannot provide solid evidence that most of those 22,521 transcripts are actually functional. What they really should be saying is that these are the best candidates for real genes.

The experts held a meeting recently in Heraklion, Greece (June 9-14, 2017). You would think that a major emphasis in that meeting would have been on identifying how many of these transcripts are biologically functional but that doesn't seem to have been a major theme according to the brief report published in Genome Biology [Canonical mRNA is the exception, rather than the rule].

Let's look at what the authors have to say about the important question.
Investigations into gene regulation and disease pathogenesis have been protein-centric for decades. However, in recent years there has been a profound expansion in our knowledge of the variety and complexity of eukaryotic RNA species, particularly the non-coding RNA families. Vast amounts of RNA sequencing data generated from various library preparation methods have revealed these non-coding RNA species to be unequivocally more abundant than canonical mRNA species.
This is very misleading. It's certainly true that there are far more than 20,000 transcripts but that's not controversial. What's controversial is how many of those transcripts are functional and how many genes are devoted to producing those functional transcripts.

The report on the meeting doesn't offer an opinion on that matter unless the authors are referring only to functional RNA species. I get the impression that most of the people who attend these meetings are reluctant to state unequivocally whether there's convincing evidence of function for more than 5,000 RNAs. I don't think that evidence exists. Until it does, the default scientific position is that there are far fewer genes for functional noncoding RNAs than for proteins.


How much of the human genome is devoted to regulation?

All available evidence suggests that about 90% of our genome is junk DNA. Many scientists are reluctant to accept this evidence—some of them are even unaware of the evidence [Five Things You Should Know if You Want to Participate in the Junk DNA Debate]. Many opponents of junk DNA suffer from what I call The Deflated Ego Problem. They are reluctant to concede that humans have about the same number of genes as all other mammals and only a few more than insects.

One of the common rationalizations is to speculate that while humans may have "only" 25,000 genes, those genes are regulated and controlled in a much more sophisticated manner than the genes of other species. It's this extra level of control that makes humans special. Such speculations have been around for almost fifty years, but they have gained in popularity since publication of the human genome sequence.

In some cases, the extra level of regulation is thought to be due to abundant regulatory RNAs. This means there must be tens of thousands of extra genes expressing these regulatory RNAs. John Mattick is the most vocal proponent of this idea and he won an award from the Human Genome Organization for "proving" that his speculation is correct! [John Mattick Wins Chen Award for Distinguished Academic Achievement in Human Genetic and Genomic Research]. Knowledgeable scientists know that Mattick is probably wrong. They believe that most of those transcripts are junk RNAs produced by accidental transcription at very low levels from non-conserved sequences.

I agree with those scientists but for the sake of completeness here's what John Mattick believes about regulation.
Discoveries over the past decade portend a paradigm shift in molecular biology. Evidence suggests that RNA is not only functional as a messenger between DNA and protein but also involved in the regulation of genome organization and gene expression, which is increasingly elaborate in complex organisms. Regulatory RNA seems to operate at many levels; in particular, it plays an important part in the epigenetic processes that control differentiation and development. These discoveries suggest a central role for RNA in human evolution and ontogeny. Here, we review the emergence of the previously unsuspected world of regulatory RNA from a historical perspective.

... The emerging evidence suggests that there are more genes encoding regulatory RNAs than those encoding proteins in the human genome, and that the amount and type of gene regulation in complex organisms have been substantially misunderstood for most of the past 50 years. (Morris and Mattick, 2014)
The evidence does not support the claim that there are more than 20,000 genes for regulatory RNAs. It's more consistent with the idea that most transcripts are non-functional.

There's another speculation related to regulation. This one was promoted by ENCODE in their original 2007 preliminary study and later in the now-famous 2012 papers. The ENCODE researchers identified thousands of putative regulatory sites in the genome and concluded ...
... even using the most conservative estimates, the fraction of bases likely to be involved in direct gene regulation, even though incomplete, is significantly higher than that ascribed to protein-coding exons (1.2%), raising the possibility that more information in the human genome may be important for gene regulation than for biochemical function.
They go on to speculate that 8.5% of the genome may be involved in regulation. Think about that for a minute. If we assume that each site covers 100 bp, then the ENCODE researchers are speculating that there might be more than 2 million regulatory sites in the human genome! That's about 100 regulatory sites for every gene!
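The arithmetic is easy to check. Here's a quick sketch using round numbers of my own choosing (a 3.2 Gb genome, 100 bp per site, 25,000 genes):

    genome_size = 3.2e9          # bp, approximate haploid human genome
    regulatory_fraction = 0.085  # ENCODE's speculated regulatory fraction
    site_size = 100              # bp per regulatory site (assumed round number)
    genes = 25_000               # round number of genes

    regulatory_bp = regulatory_fraction * genome_size   # ~2.7e8 bp
    sites = regulatory_bp / site_size                    # ~2.7 million sites
    print(f"{sites:,.0f} sites, ~{sites / genes:.0f} per gene")
    # -> 2,720,000 sites, ~109 per gene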

This is absurd. There must be something wrong with the data.

It's not difficult to see the problem. The assays used by ENCODE are designed to detect transcription factor binding sites, places where histones have been modified, and sites that are sensitive to DNase I. These are all indicators of functional regulatory sites but they are also likely to be associated with non-functional sites. For example, transcription factors will bind to thousands of sites in the genome that have nothing to do with regulation [Are most transcription factor binding sites functional?].

It's very likely that spurious transcription factor binding will lead to histone modification and DNase I sensitivity due to the loosening of chromatin. What this means is that these assays don't actually detect regulatory sites or enhancers as ENCODE claims. Instead, they detect putative regulatory sites that have to be confirmed by additional experiments.

The scientific community is gradually becoming more and more skeptical of these over-interpreted genomic experiments.

The latest genomics paper on regulatory sites has just been posted on bioRxiv (Benton et al., 2017). This is a pre-publication archive site. The paper has not been peer-reviewed or accepted by a scientific journal, but it's still making a splash on Twitter and the rest of the internet.

Here's the abstract ...
Non-coding gene regulatory loci are essential to transcription in mammalian cells. As a result, a large variety of experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. However, in practice, most studies consider enhancer candidates identified by a single method alone. Here we assess the robustness of conclusions based on such a paradigm by comparing enhancer sets identified by different strategies. Because the field currently lacks a comprehensive gold standard, our goal was not to identify the best identification strategy, but rather to quantify the consistency of enhancer sets identified by ten representative identification strategies and to assess the robustness of conclusions based on one approach alone. We found significant dissimilarity between enhancer sets in terms of genomic characteristics, evolutionary conservation, and association with functional loci. This substantial disagreement between enhancer sets within the same biological context is sufficient to influence downstream biological interpretations, and to lead to disparate scientific conclusions about enhancer biology and disease mechanisms. Specifically, we find that different enhancer sets in the same context vary significantly in their overlap with GWAS SNPs and eQTL, and that the majority of GWAS SNPs and eQTL overlap enhancers identified by only a single identification strategy. Furthermore, we find limited evidence that enhancer candidates identified by multiple strategies are more likely to have regulatory function than enhancer candidates identified by a single method. The difficulty of consistently identifying and categorizing enhancers presents a major challenge to mapping the genetic architecture of complex disease, and to interpreting variants found in patient genomes. To facilitate evaluation of the effects of different annotation approaches on studies' conclusions, we developed a database of enhancer annotations in common biological contexts, creDB, which is designed to integrate into bioinformatics workflows. Our results highlight the inherent complexity of enhancer biology and argue that current approaches have yet to adequately account for enhancer diversity.
The authors looked at several ENCODE databases identifying sites of histone modification and DNase I sensitivity as well as sites that are transcribed. They specifically looked at databases predicting functional enhancers based on these data. What they found was very little correlation between the various databases and predictions of functionality. When they looked at independent assays using the same cell lines they found considerable variation and a surprising lack of correlation.
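As a rough illustration of the kind of comparison Benton et al. describe, here's a sketch that computes a base-pair Jaccard similarity between two sets of putative-enhancer intervals from different assays. It's a toy stand-in for their genome-wide, ten-method analysis: the interval sets, names, and the position-set representation (fine at toy scale, not for real chromosomes) are all my own.

    def covered_positions(intervals):
        """Expand (start, end) intervals into the set of covered positions (toy scale only)."""
        positions = set()
        for start, end in intervals:
            positions.update(range(start, end))
        return positions

    def jaccard(calls_a, calls_b):
        """Base-pair Jaccard similarity between two putative-enhancer call sets."""
        a, b = covered_positions(calls_a), covered_positions(calls_b)
        return len(a & b) / len(a | b)

    # Toy example: "enhancer" calls from two different methods on the same region.
    histone_calls = [(1_000, 1_600), (5_000, 5_400)]
    dnase_calls = [(1_200, 1_800), (9_000, 9_300)]
    print(f"Jaccard similarity: {jaccard(histone_calls, dnase_calls):.2f}")   # 0.27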

While this lack of correlation does not prove that the sites are non-functional, it does indicate that you shouldn't just assume that these sites identify real functional enhancers (regulatory sites). In other words, skepticism should be the appropriate stance.

But that's NOT what the authors conclude. Instead, they assume, without evidence, that every assay identifies real enhancers and what the data shows is that there's an incredible diversity of functional enhancers.
... we believe that ignoring enhancer diversity impedes research progress and replication, since, "what we talk about when we talk about enhancers" include diverse sequence elements across an incompletely understood spectrum, all of which are important for proper gene expression. [my emphasis - LAM]
I find it astonishing that the authors don't even discuss the possibility that they may be looking at spurious sites that have nothing to do with biologically functional regulation. Scientists can find all kinds of ways of rationalizing the data when they are convinced they are observing function (confirmation bias). In this case, the data tells them that many of the sites do not have all of the characteristics of actual regulatory sites. The obvious conclusion, in my opinion, is that the sites are non-functional, just as we suspect from our knowledge of basic biochemistry.

True believers, on the other hand, arrive at a different conclusion. They think this data shows increased complexity and mysterious functional roles that are "incompletely understood."

I hope reviewers of this paper will force the authors to consider spurious binding and non-functional sites. I hope they will force the authors to use "putative enhancers" throughout their paper instead of just "enhancers."


Benton, M.L., Talipineni, S.C., Kostka, D., and Capra, J.A. (2017) Genome-wide Enhancer Maps Differ Significantly in Genomic Distribution, Evolution, and Function. bioRxiv. [doi: 10.1101/176610]

Morris, K.V., and Mattick, J.S. (2014) The rise of regulatory RNA. Nature Reviews Genetics, 15:423-437. [doi: 10.1038/nrg3722]

Revisiting the genetic load argument with Dan Graur

The genetic load argument is one of the oldest arguments for junk DNA and it's one of the most powerful arguments that most of our genome must be junk. The concept dates back to J.B.S. Haldane in the late 1930s but the modern argument traditionally begins with Hermann Muller's classic paper from 1950. It has been extended and refined by him and many others since then (Muller, 1950; Muller, 1966).

Several prominent scientists have used the genetic load data to argue that most of our genome must be junk (King and Jukes, 1969; Ohta and Kimura, 1971; Ohno, 1972). Ohno concluded in 1972 that ...
... all in all, it appears that calculations made by Muller, Kimura and others are not far off the mark in that at least 90% of our genome is 'junk' or 'garbage' of various sorts.
It's important to keep in mind that the genetic load argument is one of the Five Things You Should Know if You Want to Participate in the Junk DNA Debate. It's also very important to understand that this is positive evidence for junk DNA based on fundamental population genetics. It refutes the popular view that the idea of junk DNA is just based on not knowing all the functions of our genome. There's delicious irony in being accused of argumentum ad ignorantiam by those who are ignorant.

I've discussed genetic load several times on this blog (e.g. Genetic Load, Neutral Theory, and Junk DNA) but a recent paper by Dan Graur provides a good opportunity to explain it once more. The basic idea of genetic load is that a population can only tolerate a finite number of deleterious mutations before going extinct. The theory is sound but many of the variables are not known with precision.

Let's see how Dan handles them in his paper (Graur, 2017). In order to calculate the genetic load (or mutation load), we need to know the size of the genome, the mutation rate, and the percentage of mutations that are deleterious. Dan Graur assumes that the diploid genome size is 6.114 × 10^9 bp based on accurate cytology measurements from 2010. I think the DNA sequence data is more accurate so I would use 6.4 Gb. The difference isn't important.

There's a huge literature on mutation rates in humans. We don't know the exact value because there's a fair bit of controversy in the scientific literature. The values range from about 70 new mutations per generation to about 150 [see: Human mutation rates - what's the right number?]. Graur uses a range of mutation rates covering these values. He expresses them as mutations per site per generation, which translates to values from 1.0 × 10^-8 to 2.5 × 10^-8. As we shall see, he calculates the genetic load for a range of mutation rates in order to get an upper limit on the amount of functional DNA in our genome.

The most difficult part of these calculations is estimating the percentage of mutations that are beneficial, neutral, and deleterious. Population geneticists have rightly assumed that the number of beneficial (selected) mutations is insignificant so they concentrate on the number of deleterious mutations. The estimates range from about 4% of the total mutations to about 40% of the total based on the analysis of mutations in coding regions.

Most scientists assume that the correct value is about 10% of the total. What this means is that if there are 100 new mutations in every newborn there will be about 10 deleterious mutations if the entire genome is functional. If only 10% is functional then there will be only 1 deleterious mutation per generation. A mutation load of about one deleterious mutation per generation is the limit that a population can tolerate. Graur assumes 0.99. Others have proposed that the mutation load could be higher (Lynch, 2010; Agrawal and Whitlock, 2012) but it's unlikely to be more than 1.5. The difference isn't important.

Graur calculates a range of deleterious mutation rates (μdel) based on multiplying the percentage of deleterious mutations times the total number of mutations.

The other variable is the replacement level fertility of humans (F). Think of it this way: if every child has a significant number of deleterious mutations then the population can still survive if every couple has a huge number of children. Statistically, some of them will have fewer deleterious mutations and those ones will survive. If F = 50 then in order to get one survivor each person needs to have 50 children (or each couple needs to have 100 children).

Historical data suggests that the range of values goes from 1.05 to 1.75 per person (2.1 to 3.5 children per couple). Graur makes the reasonable assumption that the maximum sustainable replacement level fertility rate is 1.8 per person in human populations over the past million years or so.

The important part of the Graur paper is the table he constructs. The deleterious mutation rate, obtained by combining the mutation rate and the percentage of deleterious mutations, runs along the y-axis, and the fraction of the genome that may be functional runs along the x-axis. At each intersection he calculates the minimum replacement-level fertility required to sustain the population.


Let's look at the first line in this table. The deleterious mutation rate is calculated using the lowest possible mutation rate and the smallest percentage of deleterious mutations (4%). Under these conditions, the human population could survive with a fertility value of 1.8 as long as less than 25% of the genome is functional (i.e. 75% junk) (red circle). That's the UPPER LIMIT on the functional fraction of the human genome.
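Here's a sketch of the arithmetic behind that first line. It assumes the standard Haldane/Muller relation in which a deleterious mutation rate of U per generation reduces mean fitness to e^-U, so replacement requires a fertility of roughly e^U; Graur's published calculation may differ in its details, but the logic is the same.

    import math

    def required_fertility(mutations_per_generation, fraction_deleterious, functional_fraction):
        """Minimum replacement-level fertility under a simple Haldane/Muller-style load model.

        U counts only the new mutations that land in the functional fraction of the
        genome and are deleterious; mean fitness is then e**-U, so each person needs
        about e**U offspring just to break even.
        """
        U = mutations_per_generation * fraction_deleterious * functional_fraction
        return math.exp(U)

    # First line of the table: lowest mutation rate (~1.0e-8 per site, or ~61 new
    # mutations in a 6.114 Gb diploid genome) and only 4% of mutations deleterious.
    mutations = 1.0e-8 * 6.114e9
    print(required_fertility(mutations, 0.04, 0.25))   # ~1.8 -> 25% functional is the limit
    print(required_fertility(mutations, 0.04, 0.50))   # ~3.4 -> unsustainable at F = 1.8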

But that limit is quite unreasonable. It's more reasonable to assume about 100 new mutations per generation with about 10% deleterious. Using these assumptions, only 10% of the genome could be functional with a fertility value of 1.8 (green circle).

Whatever the exact percentage of junk DNA it's clear that the available data and population genetics point to a genome that's mostly junk DNA. If you want to argue for more functionality then you have to refute this data.

Note: Strictly speaking, the genetic load argument only applies to sequence-specific DNA where mutations have a direct effect on function. Some DNA serves as necessary spacers between functional sequences and this DNA will only be affected by deletion mutations. This is a small percentage of the genome. However, there are bulk DNA hypotheses that attribute non-sequence specific function to most of the genome and if they are correct the genetic load argument carries no weight. So far, there is no good evidence that these bulk DNA hypotheses are valid and most objections to junk DNA are based on sequence-specific functions.


Agrawal, A. F., and Whitlock, M. C. (2012) Mutation load: the fitness of individuals in populations where deleterious alleles are abundant. Annual Review of Ecology, Evolution, and Systematics, 43:115-135. [doi: 10.1146/annurev-ecolsys-110411-160257]

Graur, D. (2017) An upper limit on the functional fraction of the human genome. Genome Biol Evol evx121 [doi: 10.1093/gbe/evx121]

King, J.L., and Jukes, T.H. (1969) Non-darwinian evolution. Science, 164:788-798. [PDF]

Lynch, M. (2010) Rate, molecular spectrum, and consequences of human mutation. Proceedings of the National Academy of Sciences, 107:961-968. [doi: 10.1073/pnas.0912629107]

Muller, H.J. (1950) Our load of mutations. American Journal of Human Genetics, 2:111-175. [PDF]

Muller, H.J. (1966) The gene material as the initiator and the organizing basis of life. American Naturalist, 100:493-517. [PDF]

Ohno, S. (1972) An argument for the genetic simplicity of man and other mammals. Journal of Human Evolution, 1:651-662. [doi: 10.1016/0047-2484(72)90011-5]

Ohta, T., and Kimura, M. (1971) Functional organization of genetic material as a product of molecular evolution. Nature, 233:118-119. [PDF]

Confusion about the number of genes

My last post was about confusion over the sizes of the human and mouse genomes based on a recent paper by Breschi et al. (2017). Their statements about the number of genes in those species are also confusing. Here's what they say about the human genome.
[According to Ensembl86] the human genome encodes 58,037 genes, of which approximately one-third are protein-coding (19,950), and yields 198,093 transcripts. By comparison, the mouse genome encodes 48,709 genes, of which half are protein-coding (22,018 genes), and yields 118,925 transcripts overall.
The very latest Ensembl estimates (April 2017) for Homo sapiens and Mus musculus are similar. The difference in gene numbers between mouse and human is not significant according to the authors ...
The discrepancy in total number of annotated genes between the two species is unlikely to reflect differences in underlying biology, and can be attributed to the less advanced state of the mouse annotation.
This is correct but it doesn't explain the other numbers. There's general agreement on the number of protein-coding genes in mammals: they all have about 20,000 genes. There is no agreement on the number of genes for functional noncoding RNAs. In its latest build, Ensembl lists 14,727 lncRNA genes, 5,362 genes for small noncoding RNAs, and 2,222 other genes for noncoding RNAs. The total number of non-protein-coding genes is 22,311.

There is no solid evidence to support this claim. It's true there are many transcripts resembling functional noncoding RNAs but claiming these identify true genes requires evidence that they have a biological function. It would be okay to call them "potential" genes or "possible" genes but the annotators are going beyond the data when they decide that these are actually genes.

Breschi et al. mention the number of transcripts. I don't know what method Ensembl uses to identify a functional transcript. Are these splice variants of protein-coding genes?

The rest of the review discusses the similarities between human and mouse genes. They point out, correctly, that about 16,000 protein-coding genes are orthologous. With respect to lncRNAs they discuss all the problems in comparing human and mouse lncRNA and conclude that "... the current catalogues of orthologous lncRNAs are still highly incomplete and inaccurate." There are several studies suggesting that only 1,000-2,000 lncRNAs are orthologous. Unfortunately, there's very little overlap between the two most comprehensive studies (189 lncRNAs in common).

There are two obvious possibilities. First, it's possible that these RNAs are just due to transcriptional noise and that's why the ones in the mouse and human genomes are different. Second, all these RNAs are functional but the genes have arisen separately in the two lineages. This means that about 10,000 genes for biologically functional lncRNAs have arisen in each of the genomes over the past 100 million years.

Breschi et al. don't discuss the first possibility.


Breschi, A., Gingeras, T.R., and Guigó, R. (2017) Comparative transcriptomics in human and mouse. Nature Reviews Genetics [doi: 10.1038/nrg.2017.19]

Genome size confusion

The July 2017 issue of Nature Reviews Genetics contains an interesting review of a topic that greatly interests me.
Breschi, A., Gingeras, T. R., and Guigó, R. (2017). Comparative transcriptomics in human and mouse. Nature Reviews Genetics [doi: 10.1038/nrg.2017.19]

Cross-species comparisons of genomes, transcriptomes and gene regulation are now feasible at unprecedented resolution and throughput, enabling the comparison of human and mouse biology at the molecular level. Insights have been gained into the degree of conservation between human and mouse at the level of not only gene expression but also epigenetics and inter-individual variation. However, a number of limitations exist, including incomplete transcriptome characterization and difficulties in identifying orthologous phenotypes and cell types, which are beginning to be addressed by emerging technologies. Ultimately, these comparisons will help to identify the conditions under which the mouse is a suitable model of human physiology and disease, and optimize the use of animal models.
I was confused by the comments made by the authors when they started comparing the human and mouse genomes. They said,
The most recent genome assemblies (GRC38) include 3.1 Gb and 2.7 Gb for human and mouse respectively, with the mouse genome being 12% smaller than the human one.
I think this statement is misleading. The size of the human genome isn't known with precision but the best estimate is 3.2 Gb [How Big Is the Human Genome?]. The current "golden path length" according to Ensembl is 3,096,649,726 bp. [Human assembly and gene annotation]. It's not at all clear what this means and I've found it almost impossible to find out; however, I think it approximates the total amount of sequenced DNA in the latest assembly plus an estimate of the size of some of the gaps.

The golden path length for the mouse genome is 2,730,871,774 bp. [Mouse assembly and gene annotation]. As is the case with the human genome, this is NOT the genome size. Not as much mouse DNA sequence has been assembled into a contiguous and accurate assembly as is the case with humans. The total mouse sequence is at about the same stage the human genome assembly was a few years ago.

If you look at the mouse genome assembly data you see that 2,807,715,301 bp have been sequenced and there are 79,356,856 bp in gaps. That's almost 2.9 Gb, which doesn't match the golden path length and doesn't match past estimates of the mouse genome size.
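The mismatch is easy to see from the numbers quoted above (a quick check, nothing more):

    sequenced_bp = 2_807_715_301   # assembled mouse sequence
    gap_bp = 79_356_856            # estimated bases in gaps
    golden_path = 2_730_871_774    # Ensembl "golden path length" for mouse

    total = sequenced_bp + gap_bp
    print(f"sequence + gaps = {total:,} bp ({total / 1e9:.2f} Gb)")              # ~2.89 Gb
    print(f"golden path     = {golden_path:,} bp ({golden_path / 1e9:.2f} Gb)")  # ~2.73 Gb
    print(f"difference      = {total - golden_path:,} bp")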

We don't know the exact size of the mouse genome. It's likely to be similar to that of the human genome but it could be a bit larger or a bit smaller. The point is that it's confusing to say that the mouse genome is 12% smaller than the human one. What the authors could have said is that less of the mouse genome has been sequenced and assembled into accurate contigs.

If you go to the NCBI site for Homo sapiens you'll see that the size of the genome is given as 3.24 Gb. The comparable figure for Mus musculus is 2.81 Gb, which makes the human value about 15% larger. How accurate is that?

There's a problem here: even with all this sequence information, and all kinds of other data, it's still not possible to get an accurate scientific estimate of the total genome sizes.



Toward designer babies and creating (human?) genomes

Posted May 19, 2017 by Tabitha M. Powledge.

NO SEX, NO PAIN: TOWARD LAB-GROWN DESIGNER BABIES. It gives new ...

NCBI researchers and collaborators discover novel group of giant viruses

Nearly complete set of translation-related genes lends support to hypothesis that giant viruses evolved from smaller viruses.

An international team of researchers, including NCBI's Eugene Koonin and Natalya Yutin, has discovered a novel group of giant viruses (dubbed "Klosneuviruses") with ...