NHC Key Laboratory of Biosafety: So, where did SARS-CoV-2 Covid-19 come from?

Illustration by @zaiddagamseh | instagram

This post has been reviewed by one of our subject matter experts, Dr. Brett Finlay of UBC.

The paper we’ll be demystifying can be found here, if you’d like to follow along!

(Before we launch into our summary, it’s important to note that 2019-nCoV and SARS-CoV-2 are both names for the virus that causes COVID-19. The paper itself refers to the virus as 2019-nCoV, but we’ve referred to it as SARS-CoV-2 in this article and in most of our articles across the site.)


Coronaviruses are a group of viruses that share similar properties. In particular, coronavirus genomes are all positive-sense, single-stranded RNA and range between 26 and 32 kilobases in length. Think of RNA as an instruction manual: it tells the cell how to build a protein. Being positive-sense means the RNA is in a language that’s directly understandable by the cell; if it were negative-sense, the instruction manual would first have to be copied into its complementary, readable form before our cells could use it. The length of the RNA indicates how long this instruction manual is.
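To make the instruction manual idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how a cell or a lab actually works: the codon table is a tiny subset of the standard genetic code, the 12-base sequence is invented for illustration, and the function names are hypothetical.

```python
# Toy subset of the standard genetic code (the real table has 64 codons).
CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start codon
    "UUU": "F",  # phenylalanine
    "GUU": "V",  # valine
    "UAA": "*",  # stop codon
}

# RNA base-pairing rules: A pairs with U, C pairs with G.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def translate(positive_sense_rna):
    """Read the RNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(positive_sense_rna) - 2, 3):
        amino_acid = CODON_TABLE[positive_sense_rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def to_complementary_strand(rna):
    """Build the complementary strand, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


# Invented positive-sense snippet: it can be read directly.
positive = "AUGUUUGUUUAA"
protein_from_positive = translate(positive)

# The matching negative-sense strand must be copied into its
# complement before the same instructions can be read.
negative = to_complementary_strand(positive)
protein_from_negative = translate(to_complementary_strand(negative))
```

Both routes end with the same short protein; the negative-sense strand simply needs one extra copying step first.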

In late December 2019, several patients with viral pneumonia (infection in the lung due to a virus) started showing up, all linked to a seafood market in Wuhan, in the Hubei province of China, where non-aquatic animals such as birds and rabbits were also available for sale. The paper aimed to report the epidemiological data from 9 patients who were diagnosed with viral pneumonia of an unidentified cause. It also describes the genetic material of this novel virus to provide insight into its potential origins and cell receptor binding.


A total of 9 patients were involved in this study: 8 had visited the Huanan seafood market in question and 1 had not visited the market but had stayed at a hotel close by. Samples from these patients were taken and tested for common respiratory pathogens, but none of them tested positive for a known pathogen. (Ah-ha! This may be a novel virus we’re dealing with!)

To isolate the virus, special pathogen-free human airway epithelial cells were infected in a lab with fluids or throat swab samples from the patients. Once the cells were incubated, they were processed to collect the viral RNA. The viral RNA was then amplified (copied many times) so that the researchers would have a large enough sample of high-quality genetic material, which was then sequenced. Sequencing involves determining the order of the nucleotide bases (A, U, C, G) in the genetic material, which gives each virus its unique “fingerprint”.

This isolated genetic material was then put through BLASTn, which is as cool as it sounds. BLAST (Basic Local Alignment Search Tool) is a search tool that compares the sequence you have (aka the genetic material you collected from the new virus) against gigantic databases of sequences published by other researchers and assesses the similarities between them. Essentially, the researchers were looking for other viruses that had similar genetic material to SARS-CoV-2. The genome (what we just sequenced) was also put through software called RAxML, which allows researchers to conduct phylogenetic analysis by generating a “family tree” of the viruses from this genome and other viral genomes (i.e. determining where this virus fits into the big family of coronaviruses, almost like Ancestry.com). In other words, it shows how this virus fits into the grand scheme of classifying viruses.


From the 9 patients’ samples, researchers obtained 8 complete genome sequences and 2 partial genome sequences of SARS-CoV-2, i.e. 8 complete and 2 partial instruction manuals. Based on these genomes, the researchers then double-checked: did these patients indeed have this virus? A technique called reverse-transcriptase quantitative PCR was used to confirm that the instruction manuals they had assembled were indeed also found in the patients’ original samples. This confirmed that they had the right target and that these patients had been infected with SARS-CoV-2.

When compared to each other, the 8 complete genomes were 99.98% similar, indicating very recent emergence in humans: the virus hadn’t had enough time to mutate much. When SARS-CoV-2 was compared with other viruses, the BLASTn search found that the samples were most closely related to 2 known coronaviruses: bat-SL-CoVZC45 (similarity: 87.99%) and bat-SL-CoVZXC21 (another SARS-like betacoronavirus of bat origin). In five regions (think chapters of the instruction manual), the similarity was greater than 90%. The region with the lowest similarity was the gene encoding the spike glycoprotein (S).
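For a feel of what a similarity percentage means, here is a minimal Python sketch of percent identity between two sequences that are already aligned. It is not what BLASTn does internally (BLASTn also finds and scores the alignment itself); the function name and the toy sequences are invented for illustration, and the genome length of roughly 29,900 bases is an assumption that sits within the 26-32 kilobase range mentioned earlier.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions where both sequences have the same base.

    Assumes the inputs are already aligned: equal length, '-' marks a gap.
    Positions where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("nothing to compare")
    return 100.0 * matches / compared


# Invented toy sequences that differ at a single position out of 10.
toy_identity = percent_identity("AUGUUUGUUU", "AUGUUCGUUU")

# What 99.98% similarity means for a whole genome,
# assuming a genome of roughly 29,900 bases (an assumption, see above).
ASSUMED_GENOME_LENGTH = 29_900
max_differences = ASSUMED_GENOME_LENGTH * (100 - 99.98) / 100
```

Under that length assumption, 99.98% similarity leaves room for only about 6 differing bases between any two of the patients’ genomes, which is why the authors read it as a sign of very recent emergence.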

The S gene (chapter) describes how to build the spike glycoproteins that are characteristic of coronaviruses and give them their appearance of wearing a crown (hence the name “corona”). The functions of the spike are to bind specific types of host cells and to help the virus get into the host cell.

When comparing the S chapters of the SARS-CoV-2 genome to those of these bat coronaviruses, the researchers found only around 75% similarity. SARS-CoV and MERS-CoV are two coronaviruses that also caused major outbreaks, but this new virus showed only about 79% similarity to SARS-CoV and about 50% to MERS-CoV. The authors state that the biggest difference between these known pathogens and this novel coronavirus was that the spike protein of SARS-CoV-2 is longer: it is 1273 amino acids long, while the spike protein of SARS-CoV is 1255 amino acids and that of MERS-CoV is 1270 amino acids.

Sarbecovirus is a subgenus of the genus Betacoronavirus, which in turn belongs to the coronavirus family. Think of subgenus, genus and family as classification rankings, similar to how lions are part of the group cats, which are part of the bigger category mammals. Sarbecovirus can be further split into 3 groupings, or clades, based on genome similarity. The first clade includes two SARS-CoV-related strains originating from Bulgaria and Kenya. The second clade includes the two bat-derived SARS-like strains from Zhoushan in eastern China. The third clade includes SARS-CoV from humans and many similar SARS-like coronaviruses from bats collected from southwestern China. The 10 SARS-CoV-2 genome sequences (8 complete and 2 partial) were found to belong in clade 2.

The researchers took it one step further by comparing not only the entire genome but also, specifically, the major encoding regions (think of these as the most important parts of the instruction manual for the furniture you’re building, not the small accessories that just make the final product look pretty). When it came to the spike glycoprotein gene and the 1a region (both are chapters of the instruction manual), the results were consistent with the genome phylogeny: SARS-CoV-2, bat-SL-CoVZC45, and bat-SL-CoVZXC21 clustered together.

The spike glycoprotein is made up of 2 subunits: S1 (recognizes and binds specific receptors on the host cell) and S2 (helps the viral and cell membranes fuse so that the virus can get into the cell). The SARS-CoV-2 S2 subunit showed around 93% similarity with bat-SL-CoVZC45 and bat-SL-CoVZXC21, but S1 was more similar to that found in SARS-CoV. Therefore, although SARS-CoV-2 is closer to bat-SL-CoVZC45 and bat-SL-CoVZXC21 in terms of the entire genome, the receptor binding domain of SARS-CoV-2 was closer to that of SARS-CoV.
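The idea that the closest relative can change depending on which chapter you compare can be sketched in a few lines of Python. Everything here is invented for illustration: the 12-base toy sequences, the names, and the split into an “S1-like” first half and an “S2-like” second half. The paper itself relied on phylogenetic trees, not on this simple position-by-position count.

```python
def identity(a, b):
    """Fraction of positions that match between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest_relative(query, references, start, end):
    """Name of the reference most similar to the query within [start, end)."""
    return max(
        references,
        key=lambda name: identity(query[start:end], references[name][start:end]),
    )


# Invented toy sequences: bases 0-5 stand in for "S1", bases 6-11 for "S2".
query = "AAAAAACCCCCC"
references = {
    "toy_bat_virus": "AAGGGACCCCCC",   # differs mostly in the first half
    "toy_sars_virus": "AAAAAGCGGGGC",  # differs mostly in the second half
}

whole = closest_relative(query, references, 0, 12)
first_half = closest_relative(query, references, 0, 6)
second_half = closest_relative(query, references, 6, 12)
```

In this toy setup the bat-like sequence wins over the full length and the second half, while the SARS-like sequence wins in the first half, mirroring the pattern described above for the whole genome versus the receptor binding domain.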


It’s important to note that the genetic sequences found in all 9 patients were almost identical. Given that mutations can occur in genomes with every replication cycle, this means that the virus likely originated from one source and was detected rather rapidly in these patients. However, as the virus infects more and more people, surveillance is needed to track possible mutations of this virus. 

Bats are a reservoir for coronaviruses, including most probably SARS-CoV-2. However, it is possible that they are not the source of infection for humans; rather, an intermediate host may exist between bats and humans. Why?

  1. The outbreak was first reported in late December 2019; most bat species in Wuhan are hibernating during this time of the year.
  2. Bats were not sold or found at the Huanan seafood market in question.
  3. The sequence identity between SARS-CoV-2 and its close relatives, bat-SL-CoVZC45 and bat-SL-CoVZXC21, was less than 90%. Therefore, bat-SL-CoVZC45 and bat-SL-CoVZXC21 are not direct ancestors of SARS-CoV-2.
  4. When considering history, in both SARS-CoV and MERS-CoV, bats acted as the natural reservoir. However, humans were infected via an intermediate host between bats and humans.

Conclusions and Limitations 

Although the findings of this study are significant, it is important to note that it does have some limitations. First, this study only included 9 patients, which is a rather small sample size to represent the entire situation (i.e. it’s difficult to look at a pixel and understand the whole picture). Second, this field of research is moving so quickly! This study was published back in February; since then, the virus has infected millions more people all over the world, and it is not yet known how much the virus has evolved. All in all, this study starts to give us a rough idea of where this virus may have come from and how it fits into the large family of coronaviruses. The genetic sequence provides crucial information that has been used in the current search for possible antiviral drug treatments and vaccines to combat this pandemic. Indeed, these researchers showed us just how important the instruction manual is to understanding how this new piece of furniture works.
