Creating a Centralized Model Organism Database (CMOD)

Model organism databases are great. They span a spectrum of model organisms as diverse as mouse, rat, fly, worm, zebrafish, yeast, and E. coli. And they fulfill key roles for their respective communities, from warehousing key genomic data, to providing query and visualization tools, to performing biocuration. The NIH (and other funding agencies) pay good money to fund these community resources, and they are well worth their expense.

But what happens when your favorite organism isn’t covered by one of these existing databases? The number of sequenced genomes is increasing rapidly [1], via both targeted sequencing of organisms of interest and metagenomic sequencing of complex samples. It seems impractical to fund a model organism database for each of them. The Generic Model Organism Database (GMOD) project offers open source software for this task, but creating a GMOD instance still requires a significant and ongoing investment in bioinformatics.

So if no model organism database exists, more often than not genomics data end up buried in journal publications and inaccessible to computational analyses. As a case in point, the transcriptional regulation of the Listeria monocytogenes genome was extensively characterized in a prominent Nature paper, revealing (among other findings) the existence of 517 operons and 103 small regulatory RNAs. But if you want the data on those operons or regulatory RNAs, you won’t find them in the GFF file at NCBI. You won’t find them in EnsemblBacteria either. Nor in UCSC’s Microbial Genome Browser, nor at the Broad Institute’s site on Listeria. Nope, if you want those data, you’ll have to pull them out of the supplementary info of the Nature paper (in PDF format, no less). [2]

Now, this phenomenon is undoubtedly replicated across a diverse array of organisms and annotation efforts. Organizing and structuring these data would be incredibly valuable, both for researchers who study each individual organism, as well as in aggregate for bioinformatics analyses that span multiple organisms. How can we begin to address and solve this problem?

Since crowdsourcing is one of our lab’s hammers, we start seeing nails. Therefore, our lab is planning to create a Centralized Model Organism Database (CMOD), one single online resource that is designed to support all genomes and organisms. Importantly, the entire research community is empowered to edit and maintain these data. We’re planning on building CMOD within Freebase, which is owned by Google and self-described as “a free, open knowledge graph of 36 million people, places, and things”. In short, Freebase is a community-curated database of knowledge. For those who like oversimplified analogies, Wikipedia is to free-text what Freebase is to structured data. [3]

Clearly we cannot capture the full breadth of data and features found in dedicated model organism databases. Instead, we will focus on two core data types that are common to all genomics efforts – genome annotations (like operons and regulatory RNAs), and Gene Ontology annotations. Our game plan can be broken down into these three steps:

  1. Create data importers for structured data from NCBI, Ensembl, and other genomic databases.
  2. Create an alternate back-end for both JBrowse and WebApollo that is based on Freebase, resulting in CMOD-backed interfaces for genome browsing and editing.
  3. Create a community submission interface for Gene Ontology annotations.
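
To make step 1 a little more concrete, here is a minimal sketch of what such an importer might look like, assuming the input is a GFF3 file (the format NCBI and Ensembl commonly distribute annotations in). It is an illustration only, not the actual CMOD importer: the names `Feature`, `parse_gff3_line`, and `feature_to_triples`, the predicate labels, and the example operon line are all hypothetical, and nothing here talks to Freebase itself. The idea is simply to read each annotation line into a record and flatten it into subject/predicate/object statements that a community-curated graph could store.

```python
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass
class Feature:
    """One GFF3 feature line (hypothetical record type for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int   # GFF3 coordinates are 1-based and inclusive
    end: int
    strand: str
    attributes: dict


def parse_gff3_line(line):
    """Return a Feature for a GFF3 data line, or None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in column 9
            attributes[unquote(key.strip())] = unquote(value.strip())
    return Feature(seqid, source, ftype, int(start), int(end), strand, attributes)


def feature_to_triples(feature):
    """Flatten a feature into (subject, predicate, object) triples.

    The predicate names are made up for illustration, not a Freebase schema.
    """
    fallback_id = f"{feature.seqid}:{feature.start}-{feature.end}"
    subject = feature.attributes.get("ID", fallback_id)
    triples = [
        (subject, "type", feature.type),
        (subject, "located_on", feature.seqid),
        (subject, "start", feature.start),
        (subject, "end", feature.end),
        (subject, "strand", feature.strand),
    ]
    for key, value in feature.attributes.items():
        if key != "ID":
            triples.append((subject, f"attribute:{key}", value))
    return triples


if __name__ == "__main__":
    # Invented example line; not taken from any real annotation file.
    example = "chr1\texample\toperon\t100\t2500\t.\t+\t.\tID=operon_001;Name=example%20operon"
    for triple in feature_to_triples(parse_gff3_line(example)):
        print(triple)
```

Keeping the parsing step separate from the triple-generation step means the same records could be pushed to a different storage back-end without touching the importer.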

Think this is a great idea? The dumbest idea ever? Not clear? If you were a grant reviewer, what weaknesses would you highlight in your review? We’d love to hear your feedback, particularly from the microbiology and metagenomics communities. Leave a comment below, or engage on Twitter.

EDIT: See also my recent talk on this subject:

 


  1. I would love to get some real statistics here. Anyone have any pointers or suggestions? Hoping for something like these figures on sequencing costs or the growth of GenBank. EDIT: See results in this post.
  2. I’m looking for more examples where lots of gene or gene annotation data have been published, but where those data are not available in an accessible and structured form. Pointers?
  3. Note that Wikidata is a similar project to Freebase that we love as well. We’re using Wikidata for another project, but for reasons we won’t get into here, Freebase is currently the better solution for CMOD, though we’re strongly evaluating Wikidata here too.

5 Comments

  1. Paul Gardner

    I think this is a great idea! At present genome annotations are spread around the internets in disparate formats. For example, Rfam annotations are fairly invisible where they are.

    In theory DAS was meant to solve some of the annotation-sharing issues. But I don’t think it has. This does seem like the sort of thing that could be in ENSEMBL’s scope, yet isn’t there yet.

  2. I completely agree with the need for this. As more sequences come along–many of them for user communities that just don’t have that kind of support–this could be a huge benefit.

    Recently I did a workshop and one of the attendees was a woman who worked outside of the academic arena. She had a bunch of really interesting marine organisms that she was getting sequence for. But she had really no support downstream of the sequences for annotations and visualizations and such. It would be a shame if these things weren’t available for more people to see and explore.

    Might avoid the name ‘model’ though. I think that’s part of the thing we want to get away from.

  3. Sam Payne

    Andy, one of the most problematic instances of results that get buried is papers about genome improvements. So the field of proteogenomics publishes improvements in genome annotations (new genes, etc.), but this data is nearly always buried in supplementary data and never gets integrated. The second example of this is functional characterizations that trickle in one by one. See any issue of JBC and someone is biochemically characterizing an operon of hypothetical proteins. I’m not sure that these results ever get fully integrated.

  4. Chris Mungall

    WikiData and FreeBase are interesting ideas for hosting, but both present challenges.

    The CMOD infrastructure should be abstracted away from any particular storage infrastructure. Both WikiData and FreeBase have triple-like models (with slight differences to the W3C RDF model, e.g. no bNodes). In fact GMOD-CHADO uses a triple-like model of sorts.

    A lot of the work in mapping a genome annotation model into triples has been done for you – see https://github.com/JervenBolleman/FALDO (which doesn’t use bNodes and would easily translate to a FreeBase information model).

    If the infrastructure for CMOD uses a bNode-free subset of RDF then it should be transferable to different hosting solutions, including any kind of triplestore in the cloud, which I think would be a good thing.

  5. Andrew

    @Paul, Great, thanks! Looking forward to working with you on those RFAM annotations…

    @Mary, Marine organisms would be a great community to interact with, and I hope that would lead to useful downstream tools. (Agreed on the use of “model”, but it fits too well with the well established “GMOD” label!)

    @Sam Payne, exactly! You’ve hit on both the use cases that ended up in the grant proposal — genome annotations (e.g., gene boundaries) and functional gene annotations (e.g., Gene Ontology annotations).

    @Chris, as usual, you’re touching on an important data modeling issue, and it’s a key issue in terms of how CMOD will interact with the rest of the universe of bioSemWeb resources. Thanks for the thoughts!

Trackbacks/Pingbacks

  1. Sequenced genomes per year | The Su Lab - [...] part of building the case for creating our proposed CMOD resource, we wanted to know just how quickly the…
  2. Aim 3: Centralized Model Organism Database | The Su Lab - [...] that we discussed most openly as we were drafting our proposal. So you should feel free to read our…