As we develop BioGPS, many people have asked: why develop another gene portal? With so many existing websites that display gene annotation, where does BioGPS fit into this crowded landscape?

I think this question calls for a closer look at why there are so many gene portals in existence to begin with. Suppose you are interested in learning all that is known about the classic cell-cycle gene CDK2. You might start with NCBI’s Entrez Gene and EBI’s Ensembl. Then you’d want to look at the model organism databases – MGI and RGD. You’d probably want to check an annotation aggregator like GeneCards, and then data providers like SymAtlas and the Allen Brain Atlas. Definitely can’t forget the Gene Wiki either. And on and on, down to tiny and (relatively) unknown resources run out of small academic labs.

All of these databases have basically the same goal – to display gene annotation information. And while each displays some unique information that you wouldn’t want to miss, it’s also plain that lots of annotation and functionality are shared between these sites. My estimate? 80% duplication for 20% innovation…

How did we find ourselves in this fragmented landscape of gene portals? I believe in most cases (at least for SymAtlas and before that), these resources follow the same basic chronology:

  • researchers generate a new annotation source (microarray data set, computational predictions, etc.)
  • researchers investigate ways to add these new data to existing resources to share with the world, but find no easy/flexible way to do it
  • researchers build a simple web site to display the new data
  • web site evolves to handle classic “gene portal” tasks, including searching, synonym resolution, and annotation display

Wouldn’t it be great if we in the bioinformatics developer community could just focus on that 20% innovation? More on how BioGPS enables this in a future post…