Model organism databases are great. They span a spectrum of model organisms as diverse as mouse, rat, fly, worm, zebrafish, yeast, and E. coli. And they fulfill key roles for their respective communities, from warehousing key genomic data, to providing query and visualization tools, to performing biocuration. The NIH (and other funding agencies) pay good money to fund these community resources, and they are well worth their expense.

But what happens when your favorite organism isn’t covered by one of these existing databases? The number of sequenced genomes is increasing rapidly[1], through both targeted sequencing of organisms of interest and metagenomic sequencing of complex samples. It seems impractical to fund a model organism database for each of them. The Generic Model Organism Database (GMOD) project offers open source software for this task, but creating a GMOD instance still requires a significant and ongoing investment in bioinformatics.

So when no model organism database exists, genomics data more often than not seem to end up buried in journal publications and inaccessible to computational analyses. As a case in point, the transcriptional regulation of the Listeria monocytogenes genome was extensively characterized in a prominent Nature paper, revealing (among other findings) the existence of 517 operons and 103 small regulatory RNAs. But if you want the data on those operons or regulatory RNAs, you won’t find them in the GFF file at NCBI. You won’t find them in EnsemblBacteria either. Nor in UCSC’s Microbial Genome Browser, nor at the Broad Institute’s site on Listeria. Nope, if you want those data, you’ll have to pull them out of the supplementary info of the Nature paper (in PDF format, no less).[2]

Now, this phenomenon is undoubtedly replicated across a diverse array of organisms and annotation efforts. Organizing and structuring these data would be incredibly valuable, both for researchers who study each individual organism and, in aggregate, for bioinformatics analyses that span multiple organisms. How can we begin to solve this problem?

Since crowdsourcing is one of our lab’s hammers, we start seeing nails. Therefore, our lab is planning to create a Centralized Model Organism Database (CMOD), a single online resource designed to support all genomes and organisms. Importantly, the entire research community is empowered to edit and maintain these data. We’re planning on building CMOD within Freebase, which is owned by Google and describes itself as “a free, open knowledge graph of 36 million people, places, and things”. In short, Freebase is a community-curated database of knowledge. For those who like oversimplified analogies, Wikipedia is to free-text what Freebase is to structured data.[3]

Clearly we cannot capture the full breadth of data and features found in dedicated model organism databases. Instead, we will focus on two core data types that are common to all genomics efforts – genome annotations (like operons and regulatory RNAs), and Gene Ontology annotations. Our game plan can be broken down into these three steps:

  1. Create data importers for structured data from NCBI, Ensembl, and other genomic databases.
  2. Create an alternate back-end for both JBrowse and WebApollo that is based on Freebase, resulting in CMOD-backed interfaces for genome browsing and editing.
  3. Create a community submission interface for Gene Ontology annotations.
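To make step 1 a bit more concrete, here is a minimal sketch of the first half of such an importer: reading GFF3 text (the nine-column, tab-separated format used for genome annotations) into plain records. It assumes the source data arrive as GFF3; the names `Feature`, `parse_gff3_line`, and `parse_gff3` are hypothetical, and the example records use made-up IDs and coordinates. Mapping the parsed records into Freebase is not shown.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class Feature:
    """One GFF3 feature line (hypothetical record type for illustration)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    strand: str
    attributes: dict


def parse_gff3_line(line):
    """Parse one GFF3 line; return None for blank lines and '#' lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attr_text = cols
    # Column 9 holds semicolon-separated key=value pairs, percent-encoded.
    attributes = {}
    for pair in attr_text.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[unquote(key)] = unquote(value)
    return Feature(seqid, source, ftype, int(start), int(end), strand, attributes)


def parse_gff3(text):
    """Parse a whole GFF3 document into a list of Feature records."""
    features = []
    for line in text.splitlines():
        feature = parse_gff3_line(line)
        if feature is not None:
            features.append(feature)
    return features


# Made-up example: an operon with one child gene (not real Listeria data).
EXAMPLE = (
    "##gff-version 3\n"
    "chr1\texample\toperon\t100\t900\t.\t+\t.\tID=operon001;Name=hypothetical operon\n"
    "chr1\texample\tgene\t100\t400\t.\t+\t.\tID=gene001;Parent=operon001\n"
)

features = parse_gff3(EXAMPLE)
for f in features:
    print(f.type, f.start, f.end, f.attributes.get("ID"))
```

The `Parent` attribute is what would let an operon and its member genes be linked once the records are loaded into a graph-style store.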

Think this is a great idea? The dumbest idea ever? Not clear? If you were a grant reviewer, what weaknesses would you highlight in your review? We’d love to hear your feedback, particularly from those in the microbiology and metagenomics communities. Leave a comment below, or engage on Twitter.

EDIT: See also my recent talk on this subject:

 


  1. I would love to get some real statistics here. Anyone have any pointers or suggestions? Hoping for something like these figures on sequencing costs or the growth of GenBank. EDIT: See results in this post.
  2. I’m looking for more examples where lots of gene or gene annotation data have been published, but where those data are not available in an accessible and structured form. Pointers?
  3. Note that Wikidata is a similar project to Freebase that we love as well. We’re using Wikidata for another project, but for reasons we won’t get into here, Freebase is currently the better solution for CMOD, and we’re strongly evaluating it here too.