Easily download large amounts of genomic data with NCBI Datasets

Do you need to download a lot of genomic data? Maybe you need all primate reference genomes, or maybe you need just a few really big genomes? Prior to the advent of NCBI Datasets, downloading such a large amount of data could be a frustrating and time-consuming experience involving failed downloads and writing custom scripts. NCBI Datasets …

Alan McHughen defends his views on junk DNA

Alan McHughen is the author of a recently published book titled DNA Demystified. I took issue with his stance on junk DNA [More misconceptions about junk DNA - what are we doing wrong?] and he has kindly replied to my email message. Here's what he said ...
I wrote DNA Demystified with the knowledge and intent for spurring debate and discussion on a number of issues.

My position on 'junk DNA' hasn't changed much since I first learned about it in the early to mid-1970s. My primary concern now is that the term 'junk' is inappropriate, as it conveys an immediate negative image and engenders an emotional response. There may well be DNA sequences that serve no useful purpose and are only wastage, carried along through the generations as burdensome baggage (i.e. the 'ordinary' definition of 'junk'). Initially, as I'm sure you remember, all non-coding DNA was considered (by some) as "Junk DNA". I was never among them, expecting that eventually scientists would find some adaptive value to at least some of the non-coding sequences. I am happy to accept the data that this has come to pass-- that the now well-documented regulatory functions alone, for example, justify trashing the 'junk' label.

If there are tracts of truly useless DNA, it would be interesting to see how the organism responds when such sequences are deleted from the genome. That would be a true test of whether or not the excised DNA sequences were 'junk'.

You are free to disagree, of course, but I wanted to clarify my position.
Alan McHughen appears to dislike the term "junk" DNA because of its negative image and also because he thinks that the original definition has been disproved.

I don't want to discuss the first point because it's a red herring. As far as I can tell, the only people who dislike the word "junk" are doing so because they don't believe it's an accurate description of a substantial part of our genome. So let's just discuss the second point.

If I understand him correctly, his second point is that the term "junk" DNA was originally synonymous with "noncoding" DNA and, as he explains in his book, the scientists who used the word "junk" did so because they thought that all noncoding DNA was useless. His position now is that some noncoding DNA has been shown to be functional thereby refuting the original definition.

Let me remind readers of what he wrote in his book.
When it was first discovered, the nongenic DNA was sometimes called—somewhat derisively by people who didn't know better—"junk DNA" because it had no obvious utility, and they foolishly assumed that if it wasn't carrying coding information it must be useless trash.
My position is that there was never a time when knowledgeable scientists ever said that all noncoding DNA was junk. They never assumed that the only functional sequences in our genome were protein-coding sequences. Junk DNA was always defined as excess DNA that had no function and that definition is still valid.

Alan McHughen and I are from the same era but we clearly hung out with different crowds. My mentors were members of the 'phage group who were actively working on genes and their regulation and actively investigating other functional elements. I attended summer meetings at Cold Spring Harbor for five years (1969-73) and I can assure you that anyone who stood up in front of that group and said that all noncoding DNA was junk would have been laughed out of the room.

Here's what I knew in the early 1970s.
  • Some genes did not encode proteins. Ribosomal RNA genes and tRNA genes were discussed in the first edition of Watson's textbook in 1965. We all knew about these functional noncoding sequences.
  • Regulatory sequences such as promoters and operators controlled the expression of genes. The noncoding regulatory sequences of the lac operon and of the major operons of bacteriophage lambda were well known. Nobody ever thought that these noncoding regions were junk.
  • We knew about centromeres—noncoding functional DNA.
  • We knew about origins of replication—noncoding functional DNA. (I was working on DNA replication.)
Now, I'm not denying that there might have been scientists who didn't know these things and I'm not denying that some of them might have foolishly thought that all noncoding DNA was junk. These scientists may have been part of the group that Alan McHughen knew in the 1970s but that group did not define junk DNA. They were not the experts.

Let's look at the 1972 paper by Susumu Ohno because that's the paper that made the term "junk DNA" popular. Ohno was an evolutionary biologist and a molecular geneticist and he was familiar with the thinking of the scientists in the 'phage group. He begins his paper by referring to the C-value paradox because that's an important part of the early thinking about junk DNA. Why do some species have a lot more DNA than others? ... it's because the excess DNA is junk. That's still the only reasonable explanation of the so-called C-value Paradox.

Ohno then discusses the genetic load argument by pointing out that we can only have about 30,000 genes or our species would go extinct. He estimates that only about 6% of our genome could be functional and references Kimura and Ohta's seminal paper on mutation rates and effective population sizes (Kimura and Ohta, 1971). He then says ....
Aside from conventional structural genes and regulatory genes, this 6% should include the promoter and operator region which are situated adjacent to each structural gene, for these regions can definitely sustain deleterious mutations. [His emphasis.]
Ohno did NOT think that all noncoding DNA was junk and neither did anyone else who knew what they were talking about. Ohno, and many others, knew perfectly well that regulatory sequences exist and that they are not junk. These experts did not foolishly assume "that if it wasn't carrying coding information then it must be useless trash."

So, I do not agree with Alan McHughen that the original definition of junk DNA equated it with noncoding DNA and I do not agree with him that the discovery of regulatory sequences "justify trashing the 'junk' label." I still think the genetic load argument has to be dealt with by opponents of junk DNA.

I'm still not exactly sure where the revisionist history comes from. Perhaps someone can help me out by coming up with a reference from the 1970s where some knowledgeable scientist makes the point that all noncoding DNA must be junk.

Now let's move on to 2020. There are a large number of experts who think that most of our genome is junk. I'd like to ask Alan how he deals with the evidence for junk DNA and what evidence he can offer to support the claim that most of our genome is functional.

Here's a paper from my friends Alex Palazzo and Ryan Gregory (Palazzo and Gregory, 2014) and another one from Ford Doolittle and Tyler Brunet (Doolittle and Brunet, 2017). They are good starting points for further discussion.
Palazzo, A.F. and Gregory, T.R. (2014) The Case for Junk DNA. PLOS Genetics 10:e1004351. [doi: 10.1371/journal.pgen.1004351]

With the advent of deep sequencing technologies and the ability to analyze whole genome sequences and transcriptomes, there has been a growing interest in exploring putative functions of the very large fraction of the genome that is commonly referred to as “junk DNA.” Whereas this is an issue of considerable importance in genome biology, there is an unfortunate tendency for researchers and science writers to proclaim the demise of junk DNA on a regular basis without properly addressing some of the fundamental issues that first led to the rise of the concept. In this review, we provide an overview of the major arguments that have been presented in support of the notion that a large portion of most eukaryotic genomes lacks an organism-level function. Some of these are based on observations or basic genetic principles that are decades old, whereas others stem from new knowledge regarding molecular processes such as transcription and gene regulation.

Doolittle, W.F. and Brunet, T.D. (2017) On causal roles and selected effects: our genome is mostly junk. BMC Biology 15:116. [doi: 10.1186/s12915-017-0460-9]

The idea that much of our genome is irrelevant to fitness—is not the product of positive natural selection at the organismal level—remains viable. Claims to the contrary, and specifically that the notion of “junk DNA” should be abandoned, are based on conflating meanings of the word “function”. Recent estimates suggest that perhaps 90% of our DNA, though biochemically active, does not contribute to fitness in any sequence-dependent way, and possibly in no way at all. Comparisons to vertebrates with much larger and smaller genomes (the lungfish and the pufferfish) strongly align with such a conclusion, as they have done for the last half-century.

Kimura, M. and Ohta, T. (1971) Protein polymorphism as a phase of molecular evolution. Nature 229:467-469. [doi: 10.1038/229467a0]

More misconceptions about junk DNA – what are we doing wrong?

I'm actively following the views of most science writers on junk DNA to see if they are keeping up on the latest results. The latest book is DNA Demystified by Alan McHughen, a molecular geneticist at the University of California, Riverside. It's published by Oxford University Press, the same publisher that published John Parrington's book The Deeper Genome. Parrington's book was full of misleading and incorrect statements about the human genome so I was anxious to see if Oxford had upped its game.1, 2

You would think that any book with a title like DNA Demystified would contain the latest interpretations of DNA and genomes, especially with a subtitle like "Unraveling the double Helix." Unfortunately, the book falls far short of its objectives. I don't have time to discuss all of its shortcomings so let's just skip right to the few paragraphs that discuss junk DNA (p.46). I want to emphasize that this is not the main focus of the book. I'm selecting it because it's what I'm interested in and because I want to get a feel for how correct and accurate scientific information is, or is not, being accepted by practicing scientists. Are we falling for fake news?
When it was first discovered, the nongenic DNA was sometimes called—somewhat derisively by people who didn't know better—"junk DNA" because it had no obvious utility, and they foolishly assumed that if it wasn't carrying coding information it must be useless trash.

In evolutionary terms, a DNA sequence with no function is simply dead weight that gets carried along, at some cost to the organism, to be jettisoned at the first opportunity. If the sequences were not adaptively important, evolution would have kicked them out as expendable excess baggage. The fact that nonrecipe DNA continues to be part of the human and other eukaryotic genomes over millions of years indicates that there is some adaptive value to carrying the "junk baggage" along, even if that value remains unclear to us today.

In addition to various putative regulatory and structural functions, recent evidence indicates that mutations in the intergenic noncoding DNA leads to an increase in susceptibility to various diseases. If confirmed, it would show a clear adaptive value to "junk" DNA.

Today, we appreciate that this is not useless junk and now call it noncoding DNA. About 80% of the DNA is known to have some activity, even if the exact activity hasn't been determined. We now usually call it the more benign dark DNA.
OMG! It looks like the proponents of junk DNA have failed to make an impression on this author.

There's a lot of misinformation in those few paragraphs, as most Sandwalk readers know. Here are a few highlights ...
  • Nongenic DNA was never called junk DNA by any knowledgeable scientist back in the 1970s or at any time since. That's misleading information that has been debunked so many times that it astounds me how any modern molecular geneticist could possibly believe it [Stop Using the Term "Noncoding DNA:" It Doesn't Mean What You Think It Means]. Some of the "foolish" scientists who taught us about junk DNA include Susumu Ohno, Sydney Brenner, Motoo Kimura, Francis Crick, Thomas Jukes, and Ford Doolittle. Apparently they didn't know any better. The list of modern "fools" is even longer and it contains the names of dozens of highly respected scientists who are experts on genomes and molecular evolution. I do not understand how any modern scientist could dismiss the work of experts in the field of genomics by imagining that they were/are stupid enough to dismiss all noncoding DNA as junk. How did that silly myth ever take hold?
  • The author has a 1950s adaptationist view of evolution. He is unaware of the fact that modern evolutionary theory can easily accommodate genomes that are 90% junk. He is also unaware of the data since there's good experimental evidence to support the idea that less than 10% of the sequences in our genome are under selective constraint [Five Things You Should Know if You Want to Participate in the Junk DNA Debate].
  • We have known for decades that regulatory sequences are abundant and essential. They were NEVER dismissed as junk DNA. The fact that mutations in regulatory regions might cause diseases in humans is perfectly consistent with everything we know about functional DNA and it has nothing to do with junk. (I should also note that mutations in real junk DNA can cause diseases but that doesn't mean the DNA isn't junk.)
  • Once again, there was never a time when knowledgeable scientists were confused about the difference between junk DNA and noncoding DNA. A substantial percentage of noncoding DNA is junk but most of the functions residing in noncoding DNA have been known for fifty years [What's In Your Genome? - The Pie Chart]. The idea that 80% of our genome has some sort of functional biological activity has been thoroughly debunked and discredited. Knowledgeable scientists do NOT refer to most of our genome as "dark" DNA because they are fully aware of all the positive evidence showing that most of it (~90%) is junk. It's disappointing that the majority of scientists are unaware of this evidence [Required reading for the junk DNA debate].
Let's be clear about one thing. The propagation of this misleading information about junk DNA is not entirely Alan McHughen's fault. He is merely repeating what he thinks is the consensus view of the experts—it is not his field. I might criticize him for not doing his homework before publishing this in his book but I assume that he had no idea that what he was writing was controversial.

The fault lies with those of us who are proponents of junk DNA and with science culture. Somehow, the idea that 90% of our genome is junk has failed to make an impression on our fellow scientists. Somehow, the idea that evolution includes Neutral Theory, random genetic drift, and a thorough understanding of the principles of population genetics has failed to reach the average biologist. What are we doing wrong? How can we fix it?

1. Full disclosure, Oxford declined to publish my book after I sent them a proposal.

2. Read my posts on Parrington's book at: John Parrington discusses genome sequence conservation.

ENCODE 3: A lesson in obfuscation and opaqueness

The Encyclopedia of DNA Elements (ENCODE) is a large-scale, and very expensive, attempt to map all of the functional elements in the human genome.

The preliminary study (ENCODE 1) was published in 2007 and the main publicity campaign surrounding that study focused on the fact that much of the human genome was transcribed. The implication was that most of the genome is functional. [see: The ENCODE publicity campaign of 2007].

The ENCODE 2 results were published in 2012 and the publicity campaign emphasized that up to 80% of our genome is functional. Many stories in the popular press touted the death of junk DNA. [see: What did the ENCODE Consortium say in 2012]

Both of these publicity campaigns, and the published conclusions, were heavily criticized for not understanding the distinction between fortuitous transcription and real genes and for not understanding the difference between fortuitous binding sites and functional binding sites. Hundreds of knowledgeable scientists pointed out that it was ridiculous for ENCODE researchers to claim that most of the human genome is functional based on their data. They also pointed out that ENCODE researchers ignored most of the evidence supporting junk DNA.

ENCODE 3 has just been published and the hype has been toned down considerably. Take a look at the main publicity article just published by Nature (ENCODE 3). The Nature article mentions ENCODE 1 and ENCODE 2 but it conveniently ignores the fact that Nature heavily promoted the demise of junk DNA back in 2007 and 2012. The emphasis now is not on how much of the genome is functional—the main goal of ENCODE—but on how much data has been generated and how many papers have been published. You can read the entire article and not see any mention of previous ENCODE/Nature claims. In fact, they don't even tell you how many genes ENCODE found or how many functional regulatory sites were detected.

The News and Views article isn't any better (Expanded ENCODE delivers invaluable genomic encyclopedia). Here's the opening paragraph of that article ...
Less than 2% of the human genome encodes proteins. A grand challenge for genomic sciences has been mapping the functional elements — the regions that determine the extent to which genes are expressed — in the remaining 98% of our DNA. The Encyclopedia of DNA Elements (ENCODE) project, among other large collaborative efforts, was established in 2003 to create a catalogue of these functional elements and to outline their roles in regulating gene expression. In nine papers in Nature, the ENCODE consortium delivers the third phase of its valuable project.1
You'd think with such an introduction that you would be about to learn how much of the genome is functional according to ENCODE 3 but you will be disappointed. There's nothing in that article about the number of genes, the number of regulatory sites, or the number of other functional elements in the human genome. It's almost as if Nature wants to tell you about all of the work involved in "mapping the functional elements" without ever describing the results and conclusions. This is in marked contrast to the Nature publicity campaigns of 2007 and 2012 where they were more than willing to promote the (incorrect) conclusions.

In 2020 Nature seems to be more interested in obfuscation and opaqueness. One other thing is certain: the Nature editors and writers aren't the least bit interested in discussing their previous claims about 80% of the genome being functional!

I guess we'll have to rely on the ENCODE Consortium itself to give us a summary of their most recent findings. The summary paper has an intriguing title (Perspectives on ENCODE) that almost makes you think they will revisit the exaggerated claims of 2007 and 2012. No such luck. However, we do learn a little bit about the human genome.
  • 20,225 protein-coding genes [almost 1000 more than the best published estimates - LAM]
  • 37,595 noncoding genes [I strongly doubt they have evidence for that many functional genes]
  • 2,157,387 open chromatin regions [what does this mean?]
  • 1,224,154 transcription factor binding sites [how many are functional?]
That's it. The ENCODE Consortium seems to have learned only two things in 2012. They learned that it's better to avoid mentioning how much of the genome is functional in order to avoid controversy and criticism and they learned that it's best to ignore any of their previous claims for the same reason. This is not how science is supposed to work but the ENCODE Consortium has never been good at showing us how science is supposed to work.

Note: I've looked at some of the papers to try and find out if ENCODE stands by its previous claim that most of the genome is functional but they all seem to be written in a way that avoids committing to such a percentage or addressing the criticisms from 2007 and 2012. The only exception is a paper stating that cis-regulatory elements occupy 7.9% of the human genome (Expanded encyclopaedias of DNA elements in the human and mouse genomes). Please let me know if you come across anything interesting in those papers.

1. Isn't it about time to stop dwelling on the fact that 2% (actually less than 1%) of our genome encodes protein? We've known for decades that there are all kinds of other functional regions of the genome. No knowledgeable scientist thinks that the remaining 98% (99%) has no function.

Structure and expression of the SARS-CoV-2 (coronavirus) genome

Coronaviruses are RNA viruses, which means that their genome is RNA, not DNA. All of the coronaviruses have similar genomes but I'm sure you are mostly interested in SARS-CoV-2, the virus that causes COVID-19. The first genome sequence of this virus was determined by Chinese scientists in early January and it was immediately posted on a public server [GenBank MN908947]. The viral RNA came from a patient in intensive care at the Wuhan Yin-Tan Hospital (China). The paper was accepted on Jan. 20th and it appeared in the Feb. 3rd issue of Nature (Zhou et al. 2020).

By the time the paper came out, several universities and pharmaceutical companies had already constructed potential therapeutics and several others had already cloned the genes and were preparing to publish the structures of the proteins.1

By now there are dozens and dozens of sequences of SARS-CoV-2 genomes from isolates in every part of the world. They are all very similar because the mutation rate in these RNA viruses is not high (about 10^-6 per nucleotide per replication). The original isolate has a total length of 29,891 nt not counting the poly(A) tail. Note that these RNA viruses are about four times larger than a typical retrovirus; they are the largest known RNA viruses.
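To put that mutation rate in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the two numbers quoted above (the rate per nucleotide per replication and the genome length). The assumption that mutations are independent and Poisson-distributed is mine, not something taken from the cited papers, and the function names are made up for illustration.

```python
import math

# Figures quoted in the text above.
MUTATION_RATE = 1e-6      # mutations per nucleotide per replication
GENOME_LENGTH = 29_891    # nucleotides, not counting the poly(A) tail


def expected_mutations_per_copy(rate, length):
    """Expected number of new mutations each time the genome is copied."""
    return rate * length


def prob_error_free_copy(rate, length):
    """Probability that a copy carries no new mutation.

    ASSUMPTION: mutations are independent and rare, so the count per
    copy is treated as Poisson with the mean computed above.
    """
    return math.exp(-expected_mutations_per_copy(rate, length))


if __name__ == "__main__":
    mean = expected_mutations_per_copy(MUTATION_RATE, GENOME_LENGTH)
    clean = prob_error_free_copy(MUTATION_RATE, GENOME_LENGTH)
    print(f"expected mutations per copy: {mean:.4f}")
    print(f"fraction of copies with no new mutation: {clean:.3f}")
```

Under these assumptions the expected number of new mutations is roughly 0.03 per copy, so the large majority of copies are identical to their template. That is consistent with the isolates being very similar.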

The RNA genome that's inside the virus particle looks very much like a typical eukaryotic mRNA molecule. It has a 5′ cap and a 3′ poly(A) tail of about 40-50 nucleotides. This RNA is translated by the host protein synthesis components as soon as it is injected into the cell.

The genome contains a number of genes, where the word "gene" is used to define the open reading frame of the proteins produced by the virus. The initial translation products are two large polyproteins that are subsequently cleaved by proteases to produce smaller proteins. Most of the time the viral RNA is translated to give the 1a polyprotein (~460 kDa) that is subsequently cleaved to produce 11 distinct non-structural proteins (nsps). Sometimes the ribosomes stall near the stop codon when they encounter a frameshift element (FSE) containing a "slippery site" that causes the ribosomes to slip back one nucleotide (a -1 frameshift). This avoids the stop codon and allows translation to continue into the 1b gene. The large 1ab polyprotein (~780 kDa) produces another five proteins after cleavage.
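Here's a toy illustration of how a one-nucleotide shift in reading frame lets the ribosome read past a stop codon. The RNA string is invented for the example and is NOT the real SARS-CoV-2 sequence, although it does contain the UUUAAAC motif usually described as the slippery site. The frameshift is modelled as the reader stepping back one nucleotide (a -1 shift), and the function name and the `slip_at` parameter are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def read_codons(rna, slip_at=None):
    """Read triplets from position 0 until a stop codon or the end.

    If slip_at is given, the reader steps back one nucleotide the first
    time it reaches that index (a -1 frameshift), so one base is read twice.
    """
    codons = []
    i = 0
    slipped = False
    while i + 3 <= len(rna):
        if slip_at is not None and i == slip_at and not slipped:
            i -= 1          # slip back one nucleotide, once
            slipped = True
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
        i += 3
    return codons


# Made-up RNA: AUG GCU UUA AAC GGG UAA ... with a stop codon in frame 0
# shortly after the slippery site (U UUA AAC, ending at index 11).
TOY_RNA = "AUGGCUUUAAACGGGUAAGCUGCUAA"

if __name__ == "__main__":
    print("no frameshift:", read_codons(TOY_RNA))
    print("-1 frameshift:", read_codons(TOY_RNA, slip_at=12))
```

Without the slip, reading stops at the first in-frame UAA (the short "1a" product). With the slip, that UAA is no longer in frame and reading continues until the next stop codon in the new frame (the longer "1ab" product).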

The functions of many (but not all) of these proteins have been discovered. Nsp12 is an RNA-dependent RNA polymerase (RdRp). This is the enzyme that will copy the viral RNA to produce more infectious RNAs but it also produces a number of other transcripts (see below). RdRp is part of a large replication-transcription complex (RTC) that includes a number of accessory proteins (nsp2, nsp4, nsp6, nsp7+nsp8, nsp9, and nsp10). The exact functions of all these accessory proteins haven't been worked out in detail.

Nsp3 is a papain-like protease (PLpro) and nsp5 is a 3C-like cysteine protease (3CLpro). They are responsible for cleaving polyproteins 1a and 1ab.

Nsp13 is a 5′→3′ helicase (Hel) that's required for transcription. Nsp14 is a 3′→5′ exonuclease involved in proofreading. Nsp15 appears to be a uridine-specific endonuclease and nsp16 is an S-adenosylmethionine methyltransferase.

The open reading frames at the 3′ end of the viral RNA cannot be translated because of the stop codon at the end of the 1ab "gene." Production of these proteins (e.g. S, M, E etc.) has to wait until later in the life cycle of the virus after the assembly of the RTC complex. As we shall see shortly, the synthesis of these late proteins involves a complicated process that requires production of many different transcripts.

The injected virus RNA is a (+) strand so production of new viral RNA requires two rounds of transcription. First, the RTC complex binds to the 3′ end of the (+) strand and copies it all the way to the 5′ end producing a (-) strand. This strand is then copied to produce new (+) strands that can be incorporated into new virus particles. The new (+) strands also act as messenger RNA to produce more 1a and 1ab polyproteins.2

Transcription from the 3′ end of the (+) strand also produces a group of subgenomic RNAs (sgRNAs). The 3′ end contains a number of transcription-regulating sequences (TRS-B) consisting of a 10 nucleotide AU-rich stretch of RNA. There is another TRS (TRS-L) at the 5′ end next to a leader sequence (L). When the RTC encounters a TRS it will pause and this may cause it to switch and continue transcription at TRS-L. This produces an sgRNA consisting of a stretch from the 3′ end (body) joined to the leader sequence at the 5′ end (leader).
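The leader-to-body joining described above can be sketched in a few lines of Python. This is only a cartoon: it works on the final (+) sense product and ignores which strand is being copied when the switch happens. The TRS motif and the toy genome are placeholders I invented (the real sequences are not given here), and the function names are hypothetical.

```python
def find_all(seq, motif):
    """Return start positions of all non-overlapping occurrences of motif."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + len(motif))
    return positions


def subgenomic_rnas(genome, trs):
    """Join the 5' leader (up to and including TRS-L) to the body
    downstream of each TRS-B, giving one sgRNA per TRS-B.

    ASSUMPTION: the first TRS in the genome is TRS-L and every later
    occurrence is a TRS-B.
    """
    sites = find_all(genome, trs)
    if len(sites) < 2:
        return []
    leader = genome[:sites[0] + len(trs)]
    return [leader + genome[pos + len(trs):] for pos in sites[1:]]


# Placeholder 10-nt AU-rich motif and a made-up genome layout:
# leader, TRS-L, "1ab" region, TRS-B, "S" region, TRS-B, "N" region.
TRS = "UAAUAUUAAU"
TOY_GENOME = "GGCCGG" + TRS + "CCCCCCCC" + TRS + "GGGG" + TRS + "CGCG"

if __name__ == "__main__":
    for sg in subgenomic_rnas(TOY_GENOME, TRS):
        print(sg)
```

Every sgRNA produced this way starts with the same leader and ends at the same 3′ end; they differ only in how much of the body they include, which is why they form a nested set.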

The example shown below shows template switching between a TRS-B located at the 5′ end of the S gene and TRS-L to produce an S sgRNA. This sgRNA is then transcribed to produce an mRNA that can be translated to produce S protein.

Each of the genes at the 3′ end of the virus genome is associated with a TRS-B sequence so transcription from the 3′ end produces 9 different sgRNAs corresponding to the nine functional genes. (Open reading frame 10 is not a functional gene.) The figure on the right is from Kim et al. (2020).

Some of these "late" genes are required for assembly of new virus particles. S is the gene for the trimeric spike protein that mediates attachment of the virus to the ACE2 receptor on the surface of the host cell. M is a membrane glycoprotein; it is the most abundant structural protein. E is the envelope protein. N is the nucleocapsid protein that binds RNA and helps package it into the virus particle.

Reading frame 3 seems to produce two proteins, 3a and 3b. It's likely that 3a is an ion channel protein on the virus surface. Proteins 7 and 8 are additional viral assembly proteins. I don't know the function of protein 6 and I'm not sure if anyone else knows. Many coronaviruses don't make protein 6.

DISCLAIMER: I am not an expert on coronaviruses. Everything in this post is stuff I have learned in the past few days from reading published papers. Feel free to correct all the mistakes I have made.

1. The behavior of these Chinese scientists doesn't match with the conspiracy theory that China engineered this virus—perhaps they weren't in on the conspiracy? :-)

2. I don't know how the transcription complex manages to copy right to the ends of the viral RNA. It seems to involve some complicated RNA secondary structures but I didn't bother reading the relevant papers.

References and Bibliography

Bar-On, Y.M., Flamholz, A., Phillips, R. and Milo, R. (2020) Science Forum: SARS-CoV-2 (COVID-19) by the numbers. Elife 9: e57309. [doi: 10.7554/eLife.57309]

Kim, D., Lee, J-Y., Yang, J-S., Kim, J.W., Kim, V.N., and Chang, H. (2020) The Architecture of SARS-CoV-2 Transcriptome. Cell 181:914-921. [doi: 10.1016/j.cell.2020.04.011]

Fung, T.S. and Liu, D.X. (2019) Human coronavirus: host-pathogen interaction. Annual review of microbiology 73: 529-557. [doi: 10.1146/annurev-micro-020518-115759]

Sawicki, S.G., Sawicki, D.L. and Siddell, S.G. (2007) A contemporary View of Coronavirus Transcription. Journal of virology 81(1):20-29. [doi: 10.1128/JVI.01358-06]

Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L., Zhang, W., Si, H.-R., Zhu, Y., Li, B. and Huang, C.-L. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579(7798): 270-273. [doi: 10.1038/s41586-020-2012-7]

Where did your chicken come from?

Scientists have sequenced the genomes of modern domesticated chickens and compared them to the genomes of various wild pheasants in southern Asia. It has been known for some time that chickens resemble a species of pheasant called red jungle fowl and this led Charles Darwin to speculate that chickens were domesticated in India. Others have suggested Southeast Asia or China as the site of domestication.

The latest results show that modern chickens probably descend from a subspecies of red jungle fowl that inhabits the region around Myanmar (Wang et al., 2020). The subspecies is Gallus gallus spadiceus and the domesticated chicken subspecies is Gallus gallus domesticus. As you might expect, the two subspecies can interbreed.

The authors looked at a total of 863 genomes of domestic chickens, four species of jungle fowl, and all five subspecies of red jungle fowl. They identified a total of 33.4 million SNPs, which were enough to genetically distinguish between the various species AND the subspecies of red jungle fowl. (Contrary to popular belief, it is quite possible to assign a given genome to a subspecies (race) based entirely on genetic differences.)

The sequence data suggest that chickens were domesticated from wild G. g. spadiceus about 10,000 years ago in the northern part of Southeast Asia. The data also suggest that modern domesticated chickens (G. g. domesticus) from India, Pakistan, and Bangladesh interbred with another subspecies of red jungle fowl (G. g. murghi) after the original domestication. These chickens from South Asia contain substantial contributions from G. g. murghi ranging from 8% to 22%.

Next time you serve chicken, if someone asks you where it came from you won't be lying if you say it came from Myanmar.

Image credits: BBQ chicken, Creative Commons License [Chicken BBQ]
Red Jungle Fowl, Creative Commons License [Red_Junglefowl_-Thailand]
Map: Lawler, A. (2020) Dawn of the chicken revealed in Southeast Asia. Science 368:1411.

Wang, M., Thakur, M., Peng, M. et al. (2020) 863 genomes reveal the origin and domestication of chicken. Cell Res. [doi: 10.1038/s41422-020-0349-y]

Enhanced prokaryote type strain report now with details on needed type strain data

The Prokaryote type strain report provides information on type-strains for over 18,000 species. We revised and expanded the report to make it easier to identify cases where sequencing or establishing type material would have the biggest impact on improving prokaryote taxonomy …

Expanded average nucleotide identity analysis now available for prokaryotic genome assemblies

As we described in an earlier post, GenBank uses average nucleotide identity (ANI) analysis to find and correct misidentified prokaryotic genome assemblies. You can now access ANI data for the more than 600,000 GenBank bacterial and archaeal genome assemblies through …

Three scientists discuss junk DNA

I just found this video that was posted to YouTube in May 2019. It's produced by the University of California and it features three researchers discussing the question, "Is Most of Your DNA Junk?" The three scientists are:
  • Rusty Gage, a neuroscientist at the Salk Institute
  • Alysson Muotri, who studies brain development at the University of California, San Diego
  • Miles Wilkinson, who studies neuronal and germ cell development at the University of San Diego
None of them appears to be an expert on genomes or junk DNA, although one of them (Wilkinson) seems to have some knowledge of the evidence for junk DNA, even if many of his explanations are garbled. What's interesting is that they emphasize the fact that some transposon-related sequences are expressed in some cells and they rely on this fact to remain skeptical of junk DNA. They also propose that excess DNA might be present in order to ensure diversity and prepare for future evolution. All three seem to be comfortable with the idea that excess DNA may be protecting the rest of the functional genome.

This is a good example of what we are up against when we try to convince scientists that most of our genome is junk.

Important changes coming to prokaryotic Reference and Representative genome assemblies

We are making changes to the set of bacterial and archaeal RefSeq Reference and Representative assemblies in February 2020. We will reduce the number of Reference assemblies to 15 that have annotation provided by outside experts (Table 1) and re-annotate …