Blog



BioGPS funding – we need your help!

Posted by on May 10, 2012 in BioGPS | 0 comments

So the last two posts in this series have recapped four years of progress and outlined our path forward for BioGPS. Now, we are asking for your help.

Having reviewed tool development proposals for NIH study sections before, I know that evaluating proposals for software tools is quite difficult. If the paper proposal is all you have, then you make your best guess as to the value of the work and the likelihood of success. But it’s extraordinarily more convincing when the proposed tool already demonstrates widespread use in the community.

For the initial BioGPS proposal, we went a little bit overboard with this idea. We posted an “open call” for Letters of Support on SymAtlas (the predecessor to BioGPS) for two weeks. And during that time, we received 60 letters from our loyal users — everyone from endowed professors to postdocs to graduate students; scientists from academia, industry and government; anything ranging from 3-4 sentences to a page and a half. And we submitted all forty-five pages of them with our proposal.

Needless to say, this strategy worked. Reviewers noted both the quantity and the enthusiasm of our users in their written comments. And I am confident that those letters were a key factor in eventually getting our proposal funded.

Now, we’re prepping for our renewal and we need your help again. Please tell us a little bit about yourself, share with us why you use BioGPS, what impact it’s had on your research or what you’re particularly excited about in our plan moving forward. Email your letters to me directly at asuatscrippsdotedu, and feel free to contact me with any comments or questions. Your letters of support are essential for us to continue providing innovative tools for biomedical research. I look forward to hearing from you.

UPDATE: Many of you are probably reaching this page through the big, bold, and red link we put on the front page of BioGPS. Rest assured, we don’t plan on leaving that link there forever. In fact, it will only be there until the end of the month. So please take a moment now to send us your letter of support. Thank you.

Mapping the future of BioGPS

Posted by on May 9, 2012 in BioGPS | 0 comments

After four productive years of funding from NIGMS, we have a bit over a year left on our current grant for BioGPS. In addition to releasing some new features for data set visualization and management (our current Specific Aim #1), we are planning to spend significant time developing the aims and preliminary data for our renewal application.

So what are we planning next for BioGPS? We want to build on the strengths, so let’s first summarize what we’ve accomplished. We think BioGPS has three outstanding features, the combination of which makes it completely unique in the gene annotation space:

  • A dedicated user community: Whether it’s because of the data we provide to users or the flexibility provided by our user-customizable gene report layout, BioGPS gets a fair bit of web traffic. We get around two million pageviews per year from thousands of registered and anonymous users.
  • Mechanisms to harness and share users’ contributions: Most obviously, the BioGPS plugin library is the result of explicit contributions from over a hundred BioGPS users who registered at least one plugin. BioGPS also uses implicit contributions from users in the form of plugin usage statistics, which then allows us to use popularity as a proxy for utility.
  • The largest collection of gene-centric deep links available: Unlike a simple resource index, each BioGPS plugin is defined by a “URL Template” that gives us the exact URL for each gene-specific web page. That means for every gene indexed in BioGPS, we have the address of several hundred web pages that talk specifically about that gene. Powerful.
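To make the "URL Template" idea concrete, here is a minimal sketch of how a template could be expanded into a gene-specific address. The `{{Key}}` placeholder syntax, the `example.org` address, and the function name `render_url` are assumptions for illustration, not a description of the actual BioGPS implementation.

```python
import re
from urllib.parse import quote

# Hypothetical plugin record: the {{Key}} placeholder syntax and the
# example.org address are assumptions made for this illustration.
PLUGIN = {
    "name": "Example gene resource",
    "url_template": "http://example.org/gene?id={{EntrezGene}}&name={{Symbol}}",
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_url(template, gene):
    """Fill each {{Key}} placeholder with the matching identifier of one gene."""
    def substitute(match):
        key = match.group(1)
        if key not in gene:
            raise KeyError(f"gene record has no identifier named {key!r}")
        # Percent-encode the value so it is safe inside a URL.
        return quote(str(gene[key]), safe="")
    return PLACEHOLDER.sub(substitute, template)


# One gene record keyed by identifier type.
cdk2 = {"EntrezGene": 1017, "Symbol": "CDK2"}
print(render_url(PLUGIN["url_template"], cdk2))
```

Because every plugin is just a template plus a name, the same gene record can be pushed through hundreds of templates to get hundreds of deep links.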

With these resources as our foundation, we envision many possible directions to explore in our renewal proposal. We’ve tried to organize them into these Specific Aims:

Aim #1: Continued growth of the user community. With community-based initiatives like BioGPS, you’re never done growing your community. We will continue adding features that focus on growing our user base, as well as each user’s network of connected entities (genes, plugins, layouts, and other users). While these features don’t inherently have a biological use case at their core, the payoff in terms of BioGPS utility is invaluable.

Aim #2: Gene list sharing and management. We want to extend the gene-centric BioGPS model to gene lists. What will that mean? You currently have the ability to save gene lists for easy retrieval, and over 3500 lists have already been saved by our users. We will build in the same sharing and popularity features that we currently offer for plugins. We will also build “gene list plugins” (think heat maps and networks), as well as built-in enrichment analyses comparing to other community-contributed gene lists.
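As a rough illustration of what an enrichment analysis between two gene lists could look like, the sketch below computes a hypergeometric upper-tail p-value for their overlap. The choice of the hypergeometric test, the function name, and the toy gene labels are all assumptions; the plan above does not commit to a particular statistic.

```python
from math import comb


def enrichment_p_value(query, reference, background_size):
    """Return (overlap, p) where p is the chance of seeing at least this
    many shared genes if the query list were drawn at random from a
    background of `background_size` genes (hypergeometric upper tail)."""
    query, reference = set(query), set(reference)
    overlap = len(query & reference)
    n, K, N = len(query), len(reference), background_size
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(overlap, min(n, K) + 1)
    )
    return overlap, tail / total


# Toy lists with placeholder gene labels (not real data).
my_list = ["A", "B", "C"]
community_list = ["A", "B", "D"]
print(enrichment_p_value(my_list, community_list, background_size=10))
```

A real implementation would also need a sensible background set and a multiple-testing correction when one list is compared against many community lists.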

Aim #3: Structuring unstructured plugin data. While the first two aims are aimed at the nuts and bolts of building a community-based website, the last two aims focus on the scientific innovation. Right now, BioGPS receives and displays plugin content in completely unstructured form. Aside from knowing a web page is about a gene, BioGPS doesn’t really know what the plugin is saying about that gene. If we were able to provide structure to all of that unstructured content, it would open a whole new arena of integrative querying and data mining. We will propose to build a crowdsourcing solution to this challenge.

Aim #4: Structured data dissemination. Once we can effectively data mine within all the resources in the BioGPS plugin library, we will propose several applications that take advantage of those new capabilities. For example, we can develop a gene notification service to alert users when gene annotations are added or changed. We can mine plugins in near real-time for novel gene annotations (including, but not limited to, Gene Ontology). And we can also bring the wealth of knowledge in these gene-centric resources to the Linked Data community.

We’re still very early in the process of grant writing, so we certainly welcome feedback and comments!

BioGPS retrospective – the four year anniversary

Posted by on May 8, 2012 in BioGPS, usage stats | 0 comments

 Through the generous support from NIGMS, we have enjoyed stable funding for BioGPS development since 2008. During that time, we think we’ve made great progress building our community-extensible and user-customizable gene portal.

It’s worth recapping how the last four years have gone relative to what we originally proposed. For completeness, I’m posting the full text of the proposal, but let me summarize the four aims and what we’ve accomplished:

Specific Aim #1: Incorporate community-generated data by allowing users to upload custom numeric data sets for analysis and visualization.

  • What we proposed: Many users love BioGPS for the simple bar chart expression viewer coupled with a simple gene search interface. (We like simplicity.) Many users have emailed over the years asking to view other data sets through BioGPS, including their published and unpublished data.
  • What we did: This aim has taken the longest to get just right, but I’m happy to announce that we’re very close to shipping a significant new release here. In short, we’re going to be increasing the number of data sets available through BioGPS by an order of magnitude or two, plus we’re setting the stage for more increases after that. More details soon.

Specific Aim #2: Incorporate community-generated gene annotation by seeding a “gene wiki” with structured gene portal content.

  • What we proposed: We envisioned a “Gene Wiki” within Wikipedia whose goal was to produce a collaboratively-written, community-reviewed, and continuously-updated review article for every human gene.
  • What we did: We tackled this work early in the grant period, and it has blossomed more than we could have imagined at the outset. The Gene Wiki is now the subject of its own grant, and it has been the subject of many recent publications.

Specific Aim #3: Incorporate community-generated plugins by creating simple programmatic interfaces for external developers to extend BioGPS functionality.

  • What we proposed: To embrace the principle of community-extensibility, we designed a plugin architecture that allows any user to add new content to BioGPS without any involvement or delay from our developer staff.
  • What we did: This plugin interface in many ways is the core of the BioGPS architecture. By generalizing online gene-centric resources by a URL template, we have created the most comprehensive index of web pages on gene properties that currently spans over 250 unique resources.

Specific Aim #4: Enable users to share and customize the usage and layout of BioGPS plugins through optional user accounts.

  • What we proposed: We designed a powerful and customizable layout system that would allow users to mix and match gene annotation resources according to their individual use cases.
  • What we did: We provided several default layouts that cover common use cases, from literature searching, to accessing orthologs in model organism databases, to consulting pathway resources. In addition, BioGPS users have created over 2000 custom layouts according to their individual needs.

And I realize it’s been a while since we’ve disclosed our BioGPS usage statistics. I’m posting two tables, one from last year’s progress report comparing 2011 to 2010, and one that will be included in this year’s progress report:

Well, it looks like we’ve plateaued a bit on the user base. I chalk that up to our lab’s transition to Scripps in the last year (and the inherent difficulties keeping up double-digit growth). But we’re redoubling our efforts to keep growing BioGPS, and I think our upcoming data set features will go a long way to expanding our user base even further.

Recapping gene-disease mining (via Twitter)

Posted by on May 7, 2012 in disease ontology, genewiki+, SPARQL | 0 comments

Gene Wiki Plus extensions

Storified by Andrew Su · Mon, May 07 2012 14:03:56

The uber-hacker Pierre posted a great blog post about getting gene-disease mappings from the Human Disease Ontology project. I replied that our GeneWikiPlus effort was also generally focused on the same problem:
New blog post: Using the Disease ontology (DO) to map the genes involved in a category of disease. My notebook: http://bit.ly/INGQ57 (Pierre Lindenbaum)
@yokofakun Nice… FYI, we also created http://genewikiplus.org to map gene-disease (and -SNP) links based on Gene Wiki + SNPedia #inpress (Andrew Su)
Chris Evelo subsequently followed up with an idea to adapt GeneWikiPlus to the Micronutrient Genomics Project:
@andrewsu any chance you can create "group of genes" -> genes -> SNP relationships for micronutrient relations on genewikiplus? (Chris Evelo)
@Chris_Evelo I think so? more details on what you’re thinking esp what you mean by "micronutrient relations"? perhaps easier by email… (Andrew Su)
@andrewsu Might be interesting to tweet as well. Is is about the Micronutrient Genomics Project (MGP) see: http://bit.ly/IWkAZJ (Chris Evelo)
MGP a.o. collects lists of genes and pathways about micronutrient metabolism and physiological role. Pathways at: http://bit.ly/IWlsNX (Chris Evelo)
An example MGP genelist (iron) can be found here: http://bit.ly/IWlHsd but actually many genes are in micronutrient Gene Ontology classes (Chris Evelo)
Would be great to automatically get those lists add SNPs and SNP function. (Chris Evelo)
Unfortunately, I promptly dropped the ball. I’m picking that ball up now and resuming the discussion via Twitter.

Incidentally, Pierre followed up his gene-disease mining effort with another implementation using the GeneWikiPlus SPARQL endpoint that the UniProt team put together:

Gene Wiki SPARQL endpoint http://ff.im/V9UEf (Benjamin Good)
New blog post: Mapping the genes involved in a category of disease: the GeneWikiPlus + SPARQL way.: http://bit.ly/IzxPyf (Pierre Lindenbaum)
As usual, Pierre amazes with his hackery. 

I get a bit defensive that GeneWikiPlus isn’t yet “complete”, thinking we need to do more importing.  Ben rightly points out that it would be even better to get everyone to join the Linked Data world.

.@yokofakun works his magic again http://bit.ly/IzxPyf; GW+ clearly needs to import DO annotations for one-stop shopping (Andrew Su)
@andrewsu of course if the DO gene Rif data was also shared as LInked Data, @yokofakun could produce that shop w/ one more import statement (Benjamin Good)

Integrating Knowledge Presentation at SMWCon

Posted by on Apr 30, 2012 in Gene Wiki, genewiki+, mashup, semantic web, semantic wikipedia, SPARQL, wiki, wikipedia | 0 comments

Our presentation at the Semantic MediaWiki Conference was a smashing hit! I discussed the software we developed to create GeneWiki+, which we’ve christened mwsync.

Mwsync is a cool little Java framework that makes it easy to maintain a live mirror of any MediaWiki site that exposes the standard MediaWiki API. The framework copies over any changes made to the source site on a repeating basis so you always have the most recent copy of the site content. It also integrates well with the Semantic MediaWiki extension, since it can transform the page content during transfer, for instance to give links within the page a particular “type”.
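To give a feel for the link-typing transformation, here is a small Python sketch (mwsync itself is written in Java, and this is not its code). It rewrites plain wiki links into Semantic MediaWiki's `[[property::target]]` form. The function name, the lookup table, and the sample sentence are assumptions for illustration.

```python
import re

# Matches [[Target]] or [[Target|label]]. Targets containing ':' are
# skipped, which leaves namespaced links (File:, Category:) and links
# that are already typed untouched.
WIKI_LINK = re.compile(r"\[\[([^\[\]|:]+)(\|[^\[\]]*)?\]\]")


def add_link_types(wikitext, link_types):
    """Rewrite links whose target appears in `link_types` as typed links."""
    def retype(match):
        target, label = match.group(1), match.group(2) or ""
        prop = link_types.get(target.strip())
        if prop is None:
            return match.group(0)  # unknown target: leave the link as it is
        return f"[[{prop}::{target}{label}]]"
    return WIKI_LINK.sub(retype, wikitext)


# Hypothetical lookup table: page title -> property to attach.
types = {"Breast cancer": "Is_associated_with_disease"}
text = "Variants in [[BRCA1]] raise the risk of [[Breast cancer|breast cancer]]."
print(add_link_types(text, types))
```

Doing the rewrite during transfer means the source wiki never has to change; only the mirror carries the extra semantics.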

We used this software to merge the information on the Gene Wiki on Wikipedia with information on human SNPs over at SNPedia. During integration, the software creates links between each SNP and the gene it resides on, as well as annotating each page with any diseases mentioned in the text of the article. This creates a SNP-gene-disease network that can then be explored using Semantic MediaWiki’s query engine, or exported as RDF for use in a SPARQL endpoint.

Of course, this integration technique can be used for more than just biology. As an example in the presentation, I explore the idea of merging two recipe wikis into one Semantic Recipes Wiki, and all the benefits that would bring. We think that the possibilities are huge and would be excited for others to use our software, give us some feedback/critiques, and show off what can be done.

The slides of my presentation can be found here: http://www.slideshare.net/erikclarke/integrating-knowledge-with-semantic-mediawiki-and-mwsync

The repository for mwsync is here: http://bitbucket.org/sulab/mwsync

Meet our GSoC scholars!

Posted by on Apr 30, 2012 in BioGPS, games, Gene Wiki, GSoC | 0 comments

 We’re proud to introduce Crowdsourcing Biology’s first class of GSoC Scholars! In no particular order:

Max Ludvigsson: I am an undergraduate student from Sweden. I am currently studying engineering physics, and as I am interested in both biology and programming as well, I think this will be a great experience. My project is about creating a web interface for annotating the different plugins in BioGPS with appropriate metadata. Therefore, the project will mostly be in JavaScript. I plan to set up a blog on mludv.github.com to enable others to give input and share.

Kevin Wu: I’m currently a student at UCSD studying biology. I’ll be working on creating a fast and scalable gene list storage and enrichment calculation system. I currently keep a blog at blog.kevinformatics.com and will be publishing entries there about GSoC.

Shivansh Srivastava: I am a 3rd-year undergraduate from BITS Pilani, Goa Campus. I will be working on Idea 7, a jQuery-based BioGPS gene-report layout canvas, over the summer. Looking forward to working with Crowdsourcing Biology.

Clarence Leung: I’m a joint Computer Science and Biology student at McGill University, where I also work at the McGill Centre for Bioinformatics. There, I’ve helped develop a biological game called Phylo: A Human Computing Framework for Comparative Genomics, which crowdsources players into helping to improve DNA sequence alignments for different animal species. This summer, I’ll be working on another biological game with the Su Lab called Dizeez, which helps to link gene names to diseases. I’ll be adding a multiplayer aspect to this game, which will make the game more exciting, and help us gather even more links between different genes and diseases. Since this is Google Summer of Code, after all, I think I’ll just use my Google+ stream as my blog.

Karthik Gangavarapu: I’m an undergraduate student at BITS Pilani, India. I am working on Idea #2 – Data Visualization interface. It’s really exciting to be working with Crowdsourcing Biology.

We’re excited to have these five students join us this summer. For others who would like to keep track of their progress and participate in discussions, check out our Google Group on Crowdsourcing Biology.

Local talks on games and Gene Wiki

Posted by on Apr 25, 2012 in games, Gene Wiki, genewiki+, presentation, san diego, sulab | 0 comments

Last Friday I had the honor of speaking at the Salk Institute 'Systems to Synthesis' Symposium.  I introduced the idea of games with a biological purpose, showed off our early results with Dizeez and plugged some of the prototypes appearing at genegames.org.  The slides for the presentation are up on slideshare.

Tomorrow, Erik Clarke from our group will be speaking about his work on GeneWiki+ at the Semantic MediaWiki Conference up in Carlsbad.

Gene Wiki SPARQL endpoint

Posted by on Apr 23, 2012 in Gene Wiki, genewiki+, semantic web, SPARQL, sulab | 0 comments

Thanks to Leyla and Alex Garcia-Castro from UniProt and Florida State University respectively, we now have access to a SPARQL endpoint for the data in the Gene Wiki.  Access it live here:
http://virtuoso.idiginfo.org/sparql
(Update on 4-28-12: that endpoint is down; a live one is currently available at http://199.102.237.69:8890/sparql)

Here is one query that you might like to try that finds gene-disease links that we have mined from the text:
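# rdf: and rdfs: are declared explicitly so the query also runs on
# endpoints that do not predefine them.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wiki: <http://genewikiplus.org/wiki/Special:URIResolver/>
PREFIX property: <http://genewikiplus.org/wiki/Special:URIResolver/Property-3A>
SELECT ?gene ?disease ?gene_name ?disease_name ?doid
WHERE {
 # the gene and one of its SNPs are both linked to the same disease
 ?gene property:Is_associated_with_disease ?disease .
 ?gene property:HasSNP ?snp .
 ?snp property:Is_associated_with_disease ?disease .
 # human-readable names
 ?gene rdfs:label ?gene_name .
 ?disease rdfs:label ?disease_name .
 # Disease Ontology identifier of the disease category
 ?disease rdf:type ?disease_cat .
 ?disease_cat property:HasDOID ?doid .
 # restrict to human proteins
 ?gene rdf:type wiki:Category-3AHuman_proteins .
}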
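
If you would rather run a query like this from a script than from the web form, the following Python sketch shows one way to do it with only the standard library. It assumes the endpoint accepts a standard SPARQL protocol GET request and can return JSON results; the endpoint address is the one given in the update above and may no longer be reachable. The shortened query and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Address taken from the update above; it may no longer be live.
ENDPOINT = "http://199.102.237.69:8890/sparql"

# A shortened variant of the gene-disease query, for illustration.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX property: <http://genewikiplus.org/wiki/Special:URIResolver/Property-3A>
SELECT ?gene_name ?disease_name
WHERE {
  ?gene property:Is_associated_with_disease ?disease .
  ?gene rdfs:label ?gene_name .
  ?disease rdfs:label ?disease_name .
}
LIMIT 10
"""


def build_request(endpoint, query):
    """Encode the query as a GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def rows_from_results(results):
    """Flatten SPARQL JSON results into dicts of variable name -> value."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query over the network and return the flattened rows."""
    with urlopen(build_request(endpoint, query), timeout=30) as response:
        return rows_from_results(json.load(response))


if __name__ == "__main__":
    for row in run_query(ENDPOINT, QUERY):
        print(row["gene_name"], "->", row["disease_name"])
```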

How it works in brief

  1. Articles from the Gene Wiki and from SNPedia are transferred to genewikiplus.org
  2. As they go in, they are converted into a semi-structured form that enables queries in Semantic MediaWiki.
  3. We dump the entire thing out as one giant RDF file.
  4. Leyla loads the RDF into their Virtuoso server (and performs some enhancements such as linking directly to UniProt RDF).
  5. And voilà!
(More details about the generation of the genewiki+ are available in this soon-to-be-published paper about the SNPedia mashup and this paper about Semantic Wiki Links in Wikipedia.)

Cool next steps

The RDF has owl:sameAs links between all the Gene Wiki entries and their RDF equivalents in DBpedia and in UniProt's RDF representation. It should be possible to explore connections that span these three (four including SNPedia) resources using Linked Data technologies like Virtuoso's Sponger.

Go forth! Play with our data!




Human Guided Forests – HGF

Posted by on Apr 6, 2012 in class prediction, classification, crowdsourcing, games, gwaps, hgf, machine learning, random forest, sulab | 0 comments

Yesterday I posted some slides about an idea I had recently, which I call Human Guided Forests, or HGF for short. This is an attempt to marry crowdsourcing with machine learning to produce better class predictors for datasets with very large feature spaces. Specifically, the idea is to replace the 'random' in the Random Forest algorithm with 'human'.


Random Forests basically work like this:
Given a labeled dataset with M input variables and N samples,
  • Choose m as the number of input variables allowed per tree in the forest
  • For X iterations:
    1. Choose a subset of n samples from the training set
    2. Select m random input variables
    3. Build a decision tree using the randomly selected variables, all n samples (the 'bootstrap' or 'in bag' sample), and standard induction techniques (e.g. C4.5)
    4. Measure the error rate for that tree on the samples not used to train it (the 'out of bag' or 'oob' samples)
    5. Save the tree
After the forest of decision trees has been constructed, classify new samples by running them through all the trees and choosing the class that is predicted most frequently. (This is a very successful kind of 'ensemble classifier' that is similar to one that I, because of my ignorance, reinvented as one of my first projects in bioinformatics.)
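
The loop above can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not a reference implementation: a one-split "stump" stands in for a real tree inducer such as C4.5, the bootstrap uses as many draws as there are samples, and features are chosen once per tree as in the outline above (Breiman's Random Forest draws a fresh random subset at every split). All names and the toy data are made up.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_stump(rows, labels, features):
    """Stand-in for a tree inducer: the single best threshold split."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            l_label, r_label = majority(left), majority(right)
            errors = (sum(y != l_label for y in left)
                      + sum(y != r_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_label, r_label)
    if best is None:  # no usable split: always predict the majority class
        label = majority(labels)
        return lambda row: label
    _, f, threshold, l_label, r_label = best
    return lambda row: l_label if row[f] <= threshold else r_label


def train_forest(rows, labels, m, n_trees, rng):
    forest = []
    indices = range(len(rows))
    for _ in range(n_trees):
        in_bag = [rng.choice(indices) for _ in indices]    # step 1: bootstrap
        features = rng.sample(range(len(rows[0])), m)      # step 2: m random variables
        tree = build_stump([rows[i] for i in in_bag],
                           [labels[i] for i in in_bag], features)  # step 3
        bagged = set(in_bag)
        oob = [i for i in indices if i not in bagged]      # step 4: out-of-bag error
        oob_error = (sum(tree(rows[i]) != labels[i] for i in oob) / len(oob)
                     if oob else None)
        forest.append((tree, oob_error))                   # step 5: save the tree
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    return majority([tree(row) for tree, _ in forest])


# Toy data, made up for illustration: two variables, two classes.
rows = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
labels = ["a", "a", "a", "b", "b", "b"]
forest = train_forest(rows, labels, m=1, n_trees=25, rng=random.Random(0))
print(predict(forest, [0, 0]), predict(forest, [10, 10]))
```

Step 2 is the only line that would change if a different strategy, random or otherwise, were used to pick the variables for each tree.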

This algorithm has been shown to be very effective at extracting good classifiers from datasets automatically.  However, as the random forest authors say:
 "But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem."  Leo Breiman and Adele Cutler
So, the question I'm posing is this: by inserting humans into the learning process, can we improve it using their reasoning and background knowledge?

For HGF we replace the random input variable selection at step 2 above with expert-guided variable selection. (We may also let people guide the inference of the decision tree.) The hypothesis is that the experts will choose feature sets that are better than the randomly selected ones: sets that generalize to novel datasets with less error and produce more easily understood classifiers.

As with standard RF, we need the HGF to produce many trees for each dataset, with each tree having high classification performance and low overlap with the rest of the forest. Ideally we would have each bootstrap of the training data converted into a tree by a different expert, where the experts were drawn from a pool with highly diverse expertise. This would require a significant investment of work from a large collection of expensive people.

Now, the next question is how on earth are we going to get a very large pool of skilled professionals to contribute their expertise to this project? The answer we have been gravitating towards is games. We hope to translate the feature selection problem into a game that knowledgeable biologists and interested lay people will play for fun.

The formulation of the game(s) that will be used to drive an HGF implementation is very much a work in progress.  At the moment, the basic structure of our candidate games is that of a card game.  One way or another, players compose 'hands' of cards that correspond to features in a particular dataset.  For example, cards might correspond to genes from a gene expression dataset.  Hands are scored by testing the predictive performance of classifier trees inferred using the features in the hand and the training data (like one cycle of a random forest run).
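
To illustrate the scoring idea, here is a minimal Python sketch that scores a hand by how well its features predict the class labels. It is a stand-in under stated assumptions: a leave-one-out nearest-centroid classifier replaces the tree induction described above, and the gene names, sample names, and expression values are made up.

```python
def score_hand(hand, expression, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier that is
    only allowed to look at the genes in `hand`."""
    samples = list(labels)
    correct = 0
    for held_out in samples:
        # Class centroids computed without the held-out sample.
        centroids = {}
        for cls in set(labels.values()):
            members = [s for s in samples if s != held_out and labels[s] == cls]
            if members:
                centroids[cls] = [
                    sum(expression[g][s] for s in members) / len(members)
                    for g in hand
                ]
        point = [expression[g][held_out] for g in hand]
        guess = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += guess == labels[held_out]
    return correct / len(samples)


# Made-up expression values: GENE_A separates the classes, GENE_B does not.
expression = {
    "GENE_A": {"s1": 9, "s2": 8, "s3": 1, "s4": 2},
    "GENE_B": {"s1": 5, "s2": 1, "s3": 5, "s4": 1},
}
labels = {"s1": "tumor", "s2": "tumor", "s3": "normal", "s4": "normal"}

print(score_hand(["GENE_A"], expression, labels))  # 1.0
print(score_hand(["GENE_B"], expression, labels))  # 0.0
```

Whatever classifier ends up behind the score, the key property for a game is the same: the player gets fast, numeric feedback on how informative the chosen cards are.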

Relation to Network Guided Forests


This idea is highly related to the concept of 'Network Guided Forests' (NGF) described by Dutkowski and Ideker in a PLoS paper last fall.  In that approach, the features used to build decision trees are constrained to related nodes within protein-protein interaction networks.  Features are selected for a given tree by picking one at random and then walking out along the network to bring in others in close proximity in the network.  The algorithm did not improve on classification performance in comparison to standard random forest as measured in cross-validation, but it did result in much more stable and coherent feature selection across several datasets.  It tended towards choosing genes that were known to relate to the phenotype of interest (e.g. breast cancer prognosis) much more often than random methods.  In comparison, NGF has the huge advantage that it can be used immediately based on data in databases without any dependence on human intelligence.  HGF has the theoretical advantage of tapping into a much broader collection of knowledge that is not limited to interaction data.  
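
The network-constrained selection step can be sketched as follows. This is an assumption-laden illustration rather than the published algorithm: a breadth-first walk from a random seed stands in for however the paper actually expands the neighborhood, and the node names are placeholders.

```python
import random
from collections import deque


def network_guided_features(network, m, rng):
    """Pick a random seed node, then walk outward through the network,
    collecting up to m features that are close to the seed."""
    seed = rng.choice(sorted(network))
    chosen, seen, queue = [seed], {seed}, deque([seed])
    while queue and len(chosen) < m:
        node = queue.popleft()
        for neighbor in sorted(network.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                chosen.append(neighbor)
                queue.append(neighbor)
                if len(chosen) == m:
                    break
    return chosen


# Placeholder interaction network with two disconnected components.
network = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["A"],
    "X": ["Y"],
    "Y": ["X"],
}
print(network_guided_features(network, m=3, rng=random.Random(1)))
```

Every feature set produced this way is connected in the network, which is what makes the selected genes more coherent than a purely random draw.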

Call for comments


At this point, this is just a nascent idea. I have no evidence beyond intuition that it will succeed, and there is quite a bit of difficult work ahead to find out. Any thoughts on it at this early point in time are most welcome!


BioGPS iPhone app Version 2.0

Posted by on Mar 29, 2012 in BioGPS, outsourcing | 1 comment

We are very proud to announce the arrival of the brand new iPhone app for BioGPS!

As humbly as we can possibly say it, this app is awesome. If you like BioGPS through your web browser, then you’ll love the app. As one colleague says,

I use it to dream up questions during seminars — it makes me look smart!

The app gives you full access to virtually all the functionality in BioGPS, including access to the 250+ plugins in the plugin library and all of your custom layouts.

One important note. If you are a user of the old BioGPS app, then you must download this app again. Apple does not allow transferring of apps between organizational accounts, so the old version under our GNF account has now been retired. That means you will not see the option to directly update, so click the link above to get the new version. This also means that we lost our ratings history, so please rate our new app!

We described previously how this was our experiment in outsourced development. For what it’s worth, our experience with Optra Systems was about as good as we could have hoped for. When the right situation arises, we’ll be going back to these guys…