Sequenced genomes per year

As part of building the case for creating our proposed CMOD resource, we wanted to know just how quickly the number of sequenced genomes was increasing. The thinking is that the more genomes are sequenced, the more genomes there are with virtually no community bioinformatics support.

There are two sources of release dates for sequenced genomes that are at least reasonably convenient to parse out — GOLD and NCBI Genome. (In theory, Genome Pages at the EBI should be a good resource too, but I’ve yet to figure out how to parse out a complete set of dates.) (Hat tips to @iddux and @ppgardne for suggestions…)

Without further ado, the chart showing the number of sequenced genomes per year looks like this:

In 2012, GOLD and NCBI added 3736 and 4585 sequenced genomes, respectively. Exponential growth has kept steady over the past 15 years, and if current trends hold, we’ll have sequenced a total of 100,000 genomes sometime between 2017 and 2021. Pretty amazing…
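For anyone who wants to redo that back-of-the-envelope projection, here is a minimal sketch of the idea. It is not the code used for the plots in this post. It assumes the per-year counts are already in a plain dict mapping year to number of genomes, fits a straight line to log(count) versus year, and keeps adding the fitted counts until the running total passes a target. The function names and the `max_year` cutoff are made up for illustration.

```python
import math


def fit_exponential(counts_by_year):
    """Least-squares fit of log(count) = a + b * (year - x0).

    Returns (a, b, x0), where x0 is the first year in the data.
    Every count must be > 0, since we take its logarithm.
    """
    years = sorted(counts_by_year)
    x0 = years[0]
    xs = [float(y - x0) for y in years]
    ys = [math.log(counts_by_year[y]) for y in years]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # growth rate per year on the log scale
    a = mean_y - b * mean_x    # log(count) at year x0
    return a, b, x0


def year_total_reaches(counts_by_year, target, max_year=2100):
    """Extrapolate the fitted curve and return the first year in which the
    cumulative total is >= target, or None if that is not reached by max_year.
    """
    a, b, x0 = fit_exponential(counts_by_year)
    total = float(sum(counts_by_year.values()))  # observed total so far
    year = max(counts_by_year)
    while total < target and year < max_year:
        year += 1
        total += math.exp(a + b * (year - x0))   # projected count for that year
    return year if total >= target else None
```

Calling `year_total_reaches(counts, 100000)` on a dict of real per-year counts gives a single point estimate. Refitting on different year ranges, or on the GOLD and NCBI counts separately, is one rough way to see how wide the plausible window is.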

The code to generate the data in the plots is shown below…

EDIT 20130611: Fixed a typo which changed the NCBI data and the corresponding figure — see diff. Also, all data posted to FigShare.

ADDED 20130617: In fact, let’s take a look at the cumulative growth (numbers via NCBI only, aggregated at the species level):

ADDED 20130618: Prompted by a comment from @OmicsOmicsBlog, the analysis above deserves a few caveats (all of which are “described”/obfuscated in the code below). The charts above are based on NCBI genomes data, and I made two perhaps non-obvious filtering choices. First, there are four classes of “Status”: “SRA or Traces”, “No data”, “Scaffolds or contigs”, and “Chromosomes”. For the purposes of the analyses described above, I used all genomes at “Scaffolds or contigs” and “Chromosomes” status, since those are the stages that are amenable to gene and genome annotations. Second, NCBI records a “release date” and a “modify date” for each genome. I chose to use the earlier “release date”, but on second look that corresponds to when data were first publicly released, not the date that the “Scaffolds or contigs” or “Chromosomes” status was achieved. Both of these effects mean there is probably a slight left-shift in the curves relative to how people might be interpreting them.
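To make those two filtering choices concrete, here is a rough sketch of the counting step. It is an illustration rather than the original analysis code. It assumes the NCBI table has already been read into a list of dicts (for example with `csv.DictReader`), that the columns are named "Status" and "Release Date", and that dates begin with a four-digit year. All three are assumptions, so check them against the actual file.

```python
from collections import Counter
from itertools import accumulate

# The two Status classes treated as amenable to annotation in this post.
ANNOTATABLE = {"Scaffolds or contigs", "Chromosomes"}


def genomes_per_year(records, status_key="Status", date_key="Release Date"):
    """Count genomes per release year, keeping only annotatable statuses.

    `records` is an iterable of dicts; the key names are assumed, not
    taken from the real NCBI file. Rows without a usable year are skipped.
    """
    counts = Counter()
    for rec in records:
        if rec.get(status_key) not in ANNOTATABLE:
            continue
        date = (rec.get(date_key) or "").strip()
        if len(date) < 4 or not date[:4].isdigit():
            continue  # missing or malformed date
        counts[int(date[:4])] += 1
    return dict(sorted(counts.items()))


def cumulative(counts_by_year):
    """Turn per-year counts into a running total, ordered by year."""
    years = sorted(counts_by_year)
    return dict(zip(years, accumulate(counts_by_year[y] for y in years)))
```

Passing `date_key="Modify Date"` (again, an assumed column name) would count by the later date instead, which is one quick way to get a feel for the size of the left-shift.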

To try to get a plot that doesn’t have these non-obvious caveats, I went back to the GOLD dataset. Although we can no longer track viruses, GOLD does maintain a “Completed Date”, which implies that the genome is fully sequenced and gives us a firm date for when that sequence was complete. Breaking these dates out by kingdom, we get this updated plot:

Okay, many different ways to skin this cat, and reassuringly they all give the same basic message…


1 Comment

  1. I like the IMG stats, but I don’t know if you can parse those over time. http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=ImgStatsOverview#tabview=tab0
