How do I now test to see which, if any, of these signatures are showing up in my sample?
I have my input (e.g. the Affy CEL file from my experiment). How do I get the output that indicates that my sample shows an active wound response, suggests poor outcomes in breast cancer patients, looks like lung-specific metastasis, etc.?
This should be relatively easy, no? I've got data about human gene expression, these people have made useful predictive models that take human gene expression as input. Where is the website?
Some people have directed me to useful resources like GeneSigDB that provide curated repositories of "gene signatures". However, these "signatures" are just sets of genes; they are not predictive models. If all that we needed were gene sets, no one would ever need to train a random forest classifier or a support vector machine on the data associated with those gene sets. Sets of phenotypically related genes are great, but I need the full predictive model.
The only system that I know of that seems to have the capacity to answer my question (had the model builders used it) is the Synapse platform. For example, if you are good at R, you should be able to use Synapse to execute any of the models submitted to the recent breast cancer prognosis challenge. This is a great step forward for the community (though it recapitulates pretty much everything from the more generic world of scientific workflow systems like Taverna).
But still: a) comparatively few published predictive models are in Synapse, and b) should I really have to know R to answer that question?
How shall we find the concord of this discord?
—William Shakespeare, A Midsummer Night’s Dream
Big news coming out of the Su Lab today! As you may know, we’ve been doing a lot of work recently on the presentation of our datasets stored in BioGPS. After we completed the ability to easily browse and view datasets, we wanted to take it a step further and make it possible to organize, simplify, explore and alter the presentation of the data. Our current data chart plugin was updated over the summer, and that got us thinking about how to improve it and aid insightful discoveries. We wanted to be able to easily load up a dataset to view, and just explore. If a dataset has factors, how do they look aggregated? How about aggregated on multiple factors, while still allowing us to order them any way we please? Modifying the visualization should let you decide how you want to display the data for your own use.
To make this change, we started from the ground up. Today we’re releasing an entirely rebuilt Data Chart plugin in beta mode. This new plugin has all of the same features of the previous Data Chart plugin in addition to a wealth of new features. The aim of this beta release is to introduce the new chart to everyone so we can start to get feedback on the user experience, behavior, and features.
To best introduce the new plugin, let’s walk through a simple example: If we load up the new plugin with dataset 1839 we’ll see the expression levels of CDK2 across 82 different samples. From the start, the expression levels for all the different samples should look something like this:
While this chart contains a lot of information, it’s extremely difficult to infer any details about what the data may be waiting to tell us. There is simply too much being displayed to extract anything useful by visual inspection. Let’s first get rid of most of these sample bars and aggregate them together based on the disease state. You can reposition the order of factors by dragging the respective factor box left or right. We then aggregate on disease by toggling the lower checkbox on the factor bar for disease. I also chose to hide the other factors by toggling their upper checkbox. The result looks like:
Okay, great! We can now see that there is a significant difference in CDK2 expression between the colorectal cancer and control samples. If you’d like to move the control samples up, factor options can be reordered by clicking the list icon on the factor box. The error bars that appear after aggregation show the standard deviation of the sample values that were aggregated. Let’s look further: is there an expression difference between genders? Aggregating on disease and sex looks like:
It now appears that there is no significant difference in expression between genders. Let’s investigate whether the metastatic state affects the expression of CDK2 in our samples. We do this by toggling the lower checkbox on the metastasis box. You’ll also notice I moved sex out of the way and ordered by disease first. This results in a chart that looks like:
We can now easily see that the metastatic state is not correlated with CDK2 expression in our diseased samples. The hierarchical organization also shows us visually that a metastatic state is not available for our control samples, for obvious reasons. From start to finish, it’s easy to explore the data contained within our datasets in a completely visual manner, without any technical literacy. This plugin is just the first step in our efforts to expand the visual beauty of intuitively exploring biological data.
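For readers who want to see what aggregating on a factor amounts to, here is a minimal Python sketch. It is not the plugin’s actual code: the `aggregate` function, the record layout, and the expression values are all made up for illustration. It groups samples by one or more factors and reports the mean, standard deviation, and standard error of the mean for each group.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def aggregate(samples, factors):
    """Group samples by the given factor names and summarize each group.

    `samples` is a list of dicts with a numeric "value" and a "factors" dict.
    This layout is an assumption for the sketch, not the BioGPS data format.
    """
    groups = defaultdict(list)
    for sample in samples:
        # Samples that share the same options for every chosen factor
        # end up in the same group (one bar in the aggregated chart).
        key = tuple(sample["factors"].get(name) for name in factors)
        groups[key].append(sample["value"])

    summary = {}
    for key, values in groups.items():
        # The sample standard deviation needs at least two values.
        sd = stdev(values) if len(values) > 1 else 0.0
        summary[key] = {
            "n": len(values),
            "mean": mean(values),
            "sd": sd,
            "sem": sd / sqrt(len(values)),
        }
    return summary


# Made-up values for illustration only; these are not from dataset 1839.
samples = [
    {"value": 10.0, "factors": {"disease": "control", "sex": "F"}},
    {"value": 12.0, "factors": {"disease": "control", "sex": "M"}},
    {"value": 20.0, "factors": {"disease": "colorectal cancer", "sex": "F"}},
    {"value": 24.0, "factors": {"disease": "colorectal cancer", "sex": "M"}},
]

by_disease = aggregate(samples, ["disease"])
by_disease_and_sex = aggregate(samples, ["disease", "sex"])
```

Passing a longer factor list, as in `by_disease_and_sex`, mirrors toggling the aggregation checkbox on more than one factor box.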
An abbreviated list of some of the new features available in the plugin is as follows:
- Sorting: Sometimes we want to rearrange all the different options for a given factor. Alphabetical, sequential, or to our own liking. The plugin lets you drag to rearrange the order of each factor option.
- Aggregating: Ever want to aggregate all the samples on a specific factor? The new plugin allows you to select one or more factors to aggregate on. This groups all of the samples that share the same factor options. It then updates the standard error of the mean across the aggregated samples and recalculates the mean value for each of the newly aggregated sample bars.
- Search: Just as before, the new Data Chart plugin allows you to search across samples. As you change the display text you want to show, the new chart updates the search to act on the currently displayed text. When samples move around, the search updates to stay smart about where your highlighted samples may have moved to.
- Saving: Often we’d like to revisit the plugin while retaining the dataset sorting or aggregating arrangement from when we were previously browsing. The new plugin keeps track of your display settings for a particular dataset. It will also keep track of your last viewed history so you will always be returned to where you left off.
- Export as SVG: We love the current plugin’s capacity to save as an image. However, this returns a PNG image file which gives us no access to the different components of the chart. We wanted an easy way to rearrange labels, change colors, remove tick marks, and other small use cases which would make the web interface too complex. To get around this, we created a way to save out the currently viewed chart as a scalable vector graphic (SVG). SVG is a graphics standard which allows you to scale the chart up to as big as you’d like without any loss in resolution. To save out the file simply toggle the view type in the bottom right corner and then either right click and “Save As” or just drag into a destination folder on your computer.
For those with common SVG editing programs like Adobe Illustrator, you can load up the saved file and will have access to all the different regions in the graphic. This allows you to delete regions, change colors, add a watermark, or make any other modification you want without altering the quantitative scales of each sample bar or potentially adding pixelation. This feature gives you full customization for post-processing the presentation of your dataset.
In addition to the listed features, the new plugin better handles data requests to BioGPS, has more informative tooltips, allows different sample label types, has a new dataset browser and many more features.
There are many datasets on BioGPS that don’t have factors or other metadata and thus can’t use most of these improvements. However, we still think this update will be a big improvement, as you’ll be able to treat a plain dataset just as you did before. This rewrite of the previous plugin more easily allows us to improve its performance, add features and fix any issues you might find in the future. Please please please send any requests, improvements, comments, or bug reports. You can submit them directly onto the Bitbucket issue tracker here, or contact us through the normal channels. We are releasing this as a beta exactly so we can get your feedback! This plugin is entirely open source and freely available on Bitbucket to remix, improve, and release back into the public!
In this first post of the new year, we are happy to report that the update paper on BioGPS and MyGene.info is now published in the Nucleic Acids Research Database issue:
This paper highlights the exciting updates on BioGPS since our first paper was published in 2009. These improvements result from contributions by both our users and our team of BioGPS developers. As some evidence of this success, we have seen steady growth of BioGPS in the past two years:
BioGPS usage is measured using Google Analytics and our internal logging systems. Each month, we currently average more than 155 000 page views from ~13 500 unique users. For comparison, those numbers grew steadily from 100 000 monthly page views and 7000 unique users in 2009. More than 5000 users have registered for a BioGPS user account (up from 900 in 2009).
In this new year, we are working hard to bring you even more exciting new features into BioGPS. So stay tuned!
Maximilian Ludvigsson took the first steps in the creation of Semantic BioGPS. BioGPS is a user-extensible Web portal that provides easy access to information about genes from hundreds of different websites. Maximilian produced a tool that allows BioGPS users to annotate regions of gene-centric Web pages to state, computationally, what different areas of the page ‘mean’. These semantic annotations enable scripts to extract structured content about genes from these Web pages, paving the way for a new version of BioGPS that provides integrated views across multiple data sources.
Karthik G developed an interactive network visualization for the data linking genes to diseases in the GeneWiki+. The GeneWiki+ is a Semantic MediaWiki (SMW) installation that dynamically integrates data about human genes from Wikipedia and from SNPedia. While SMW queries provide a great way for programmers and advanced wiki users to interact with data, the graphical network that Karthik created gives ordinary biologists a new, intuitive, and sometimes beautiful way to explore connections between genes and disease.
Clarence Leung began the development of a new version of the crowdsourcing game Dizeez. In this new two-player game, players are challenged to get their partner to guess a particular disease by prompting them with related genes. This game follows in the tradition of ‘games with a purpose’ such as Foldit and the ESP game by producing novel, validated gene-disease associations as a result of game play.
Shivansh Srivastava worked on migrating BioGPS’s gene report layout windowing system from ExtJS to both a jQuery windowing environment and a Yahoo User Interface-based approach. This view in BioGPS provides biologists with a customizable environment for accessing gene-centric data from a diverse collection of sources. Shivansh’s efforts provided BioGPS developers with insight into the technical limitations of each solution, as compared to the current BioGPS ExtJS codebase.
Kevin Wu developed a scalable and efficient system for storing and analyzing biologically meaningful sets of genes. Accessible via a RESTful HTTP interface, the system uses MongoDB for storage and custom code for distributed computing that executes statistical comparisons across thousands of gene sets in parallel. For any particular gene set, Kevin’s code makes it possible to rapidly identify similar gene sets and to calculate the ‘enrichment’ (a statistical measure of overlap) of that gene set with respect to any other. This work will soon be integrated into BioGPS to allow users to save their own gene sets and to query for similar gene sets from others.
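To make the idea of gene set similarity and ‘enrichment’ concrete, here is a small Python sketch. It does not reproduce Kevin’s code, and the post does not say which statistics his system uses. As assumptions, this sketch uses the Jaccard index for similarity and a one-sided hypergeometric test for the overlap, which are common choices for this kind of comparison. The gene identifiers and the universe size are placeholders.

```python
from math import comb


def jaccard(set_a, set_b):
    """Similarity of two gene sets: shared genes divided by all genes."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_p_value(set_a, set_b, universe_size):
    """One-sided hypergeometric p-value for the overlap of two gene sets.

    Returns the probability of seeing an overlap at least this large if
    set_b were drawn at random from a universe of `universe_size` genes.
    """
    a, b = set(set_a), set(set_b)
    if universe_size < len(a | b):
        raise ValueError("universe must contain every gene in both sets")
    k = len(a & b)            # observed overlap
    K, n = len(a), len(b)     # sizes of the two sets
    total = comb(universe_size, n)
    tail = sum(
        comb(K, i) * comb(universe_size - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / total


# Placeholder identifiers, not real gene sets.
set_a = {"g1", "g2", "g3"}
set_b = {"g1", "g2", "g4"}
similarity = jaccard(set_a, set_b)
p_value = overlap_p_value(set_a, set_b, universe_size=10)
```

Comparing one set against thousands of others is then a matter of repeating these two calls, which is the part a distributed system would run in parallel.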
Thanks to all of our excellent students for their great contributions and to Google for sponsoring this unique program. We are looking forward to participating in the GSoC for many years to come!
We’re very excited to announce the addition of the Dataset Library to BioGPS! As I mentioned in my last blog post, BioGPS now has thousands of datasets available for browsing. Providing this many datasets comes with some challenges, including making them easy to search and navigate for our BioGPS users. Version 2.0 of the data chart plugin was the first step towards these goals, and the Dataset Library is one giant leap in the same direction.
Dataset Library categories
Just like the Plugin Library, the Dataset Library supports multiple ways for you to find the data you’re most interested in. The “Most Popular” and “Newest Additions” tabs help you get a quick look at datasets that the community frequently uses as well as those that have just been added to BioGPS. The Dataset Library also has categories of datasets (“Cancer”, “Arthritis”, etc.) that allow you to quickly find relevant datasets.
Dataset Library advanced searching
The Dataset Library also supports searching across datasets by tags and species, just as the Plugin Library does. If you’re looking for a specific term, sample, etc. in datasets, you can do this in the Search bar. Simply prefix your search with “in:dataset” and add your keywords. For example, to search for all datasets that were profiled in the A549 lung cancer cell line, your search would be “in:dataset A549”. Hit the Search button and all datasets that contain “A549” in any of their metadata fields will be returned in your results!
GEO identifier support in URLs
Previously, BioGPS datasets were only identified with an internal integer (dataset “100”, etc.). We knew we could do better (and make our URLs easier to read) by adding GEO identifiers as another way to look up datasets. Now if you want to see a specific dataset using its GEO ID (e.g. “GSE1133”), it is as simple as going to http://biogps.org/dataset/GSE1133. Keep in mind that we’ve only loaded a subset of GSE datasets to date (mostly human). If there are public datasets that you’d like to see loaded, let us know!
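If you are generating these links or searches from a script, a tiny Python helper like the following may be handy. This is only a sketch: the function names are hypothetical, and it assumes nothing beyond the URL pattern and the “in:dataset” prefix described in this post.

```python
import re

# URL pattern taken from the example in this post.
BASE_URL = "http://biogps.org/dataset/"
# GEO series identifiers look like "GSE" followed by digits.
GSE_PATTERN = re.compile(r"GSE\d+")


def dataset_url(geo_id):
    """Build the BioGPS dataset URL for a GEO series ID such as GSE1133."""
    geo_id = geo_id.strip().upper()
    if not GSE_PATTERN.fullmatch(geo_id):
        raise ValueError(f"not a GEO series identifier: {geo_id!r}")
    return BASE_URL + geo_id


def dataset_search_query(keywords):
    """Build the search-bar text that restricts a search to datasets."""
    return "in:dataset " + keywords.strip()


url = dataset_url("GSE1133")
query = dataset_search_query("A549")
```

Remember that only a subset of GSE datasets has been loaded, so a well-formed URL is not guaranteed to resolve to a dataset.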
His primary contribution so far is a nascent game for collecting gene-disease connections called Mobianga!. He is planning to submit the results of an experiment using this game to the Intel Science Talent Search, a prestigious national science fair. But he needs help if he is going to succeed! If you know anything about genes and their relationship to disease, or are capable of using resources like OMIM, PubMed, and Google to find such information, he needs you to play a few games! Even better, invite several of your friends to play a few games.
Help a 17-year-old computational biologist reach his dreams: play Mobianga! today!
Mobianga contest at the American Society for Human Genetics annual Meeting
Technical details
Mobianga! makes use of the human disease ontology to provide the opportunity to easily annotate genes at varying levels of granularity. For each gene challenge, you start at the top of the hierarchy (e.g. choose between 'disease of cellular proliferation' and 'disease of mental health') and you work your way down to specific diseases. At each step you earn points based on an algorithm that assesses the precision of the annotation and degree of consensus among prior players in a manner similar to the Herdit game for music tagging recently published in PNAS.
The game is implemented as a Python-powered Web Application that runs in the Google App Engine. The code is open source and he would welcome collaborators. The game is intended to eventually run smoothly on phone-sized browsers (the name Mobianga came from 'the mobile annotation game'), but this optimization has not yet been achieved. Anyone that wants to help, please get in touch.
Building intelligent systems for biology
As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’. The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples. We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.
In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes. The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.
Below is a game in progress. I’m on the bottom and my opponent, Barney, is on the top. We alternate turns selecting a card (a gene) from the board and adding it to our hand. When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time. The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules. The better the gene set performs in generating predictive decision trees, the higher the score. When the player defeats their opponent, they move on to play another board. (Multiple players play each board.)
A game of The Cure. Barney (the bad guy) is winning, I am looking at the CPB1 gene and, using the search feature, I have highlighted in pink all genes that have the word cancer in any of their metadata.
Promotion, players and play
Games played at The Cure since launch
Predicting breast cancer prognosis
- Filter out games from players that indicated no knowledge of cancer biology.
- Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.
- Select the top 20 genes according to this ranking.
- Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge. To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.
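The first three steps of this protocol can be sketched in a few lines of Python. This is an illustration rather than the code we ran: the record layout is hypothetical, the gene names are placeholders, and counting each selecting player once per gene is just one reading of “selected by different players”.

```python
from collections import defaultdict


def rank_genes(games, top_n=20):
    """Rank genes by (distinct selecting players) / (board appearances).

    Each game is a dict with "player", "knows_cancer_biology", "board"
    (the genes shown) and "selected" (the genes the player picked).
    This layout is an assumption made for the sketch.
    """
    selected_by = defaultdict(set)
    appearances = defaultdict(int)

    for game in games:
        # Step 1: drop games from players reporting no cancer biology knowledge.
        if not game["knows_cancer_biology"]:
            continue
        for gene in game["board"]:
            appearances[gene] += 1
        for gene in game["selected"]:
            selected_by[gene].add(game["player"])

    # Step 2: selection ratio per gene; ties are broken alphabetically.
    ratios = {gene: len(selected_by[gene]) / n for gene, n in appearances.items()}
    ranked = sorted(ratios, key=lambda gene: (-ratios[gene], gene))

    # Step 3: keep the top-ranked genes as the signature.
    return ranked[:top_n]


# Placeholder games and gene names for illustration only.
games = [
    {"player": "p1", "knows_cancer_biology": True,
     "board": ["G1", "G2", "G3"], "selected": ["G1", "G2"]},
    {"player": "p2", "knows_cancer_biology": True,
     "board": ["G1", "G2", "G3"], "selected": ["G1"]},
    {"player": "p3", "knows_cancer_biology": False,
     "board": ["G1", "G2", "G3"], "selected": ["G3"]},
]
signature = rank_genes(games, top_n=2)
```

The resulting list is what would be handed to the final step in place of the genes chosen by the Attractor team.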
The predictor generated with this protocol scored 69% on the survival concordance index on the SAGE challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard - search for team HIVE - and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)
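If the concordance index is unfamiliar, the following Python sketch shows the standard pairwise definition (often called Harrell’s C). It is an assumption that this matches the challenge’s scoring code in every detail, such as the handling of ties. The idea is that, among pairs of patients whose survival order is known, we count how often the patient predicted to be at higher risk is the one who died first.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    times:       observed survival or follow-up times
    events:      1 (or True) if death was observed, 0 if the patient was censored
    risk_scores: predicted risk, where higher means shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only when the patient with the shorter
            # time actually had the event (otherwise the order is unknown).
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up values for illustration: a perfect and a reversed ranking.
perfect = concordance_index([1, 2, 3], [1, 1, 1], [3, 2, 1])
reversed_ranking = concordance_index([1, 2, 3], [1, 1, 1], [1, 2, 3])
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering, which is the scale on which the scores in this post should be read.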
In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information). This was substantially better than classifiers produced using randomly selected genes (55%). Using an exhaustive search through the top 10 genes, we found 10 different unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.
Final Results from METABRIC round of BCC challenge
!! Update: the model submitted using The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of 171 submitted models !!
Conclusions
These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play. We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.
The opponent in The Cure came from a Wikimedia Commons image from the game "You have to Burn the Rope". Thanks for sharing!
Will Barney defeat us? Only you can stop him!
The SAGE challenge finishes in two weeks and we need you to play to show what The Cure can do!
I'll be writing more about the game as results start to come in. For now, I just want to heartily thank the team members who helped get this together. Because of their help, this is definitely the best product of the genegames.org initiative so far.
- Max Nanis: Restyled the entire site - bringing it up from my 1999ish blind hacking to something that looks good in 2012...
- Sal Loguercio: Hacked the R code needed to get the DREAM7 data processed and into the game.
- Chunlei Wu: Helped Sal with R and with the initial gene filtering algorithms.
- Ian Macleod: Made Barney dance!
- Andrew Su: Helped with design, concept, and algorithms throughout (and provides us all with a home with big windows..)
This guest post was written by Adriel Carolino, a summer intern who has been spearheading this project to create a Centralized Model Organism Database (CMOD).
The recent explosion of metagenomic sequencing has resulted in an immense expanse of microbial genetic information. Current metagenomic analyses typically revolve around a taxonomic survey — describing what microbial species are present in a given sample. However, the field is gradually moving toward more functional analyses of the specific genes and gene products expressed by those microbes.
Unfortunately, there currently isn’t a single public resource that systematically catalogues all microbial gene annotations. For example, there is no database (that we are aware of) that you can query for “all species in the Firmicutes phylum that have a gene involved in potassium ion transport and in a nutrient sensing pathway”. There are two challenges with creating such a system.
First, we need to create a database that can handle the large scale and heterogeneity of microbial gene annotation data. Recently, we have begun exploring Freebase as a platform on which to build such a system. Freebase is a graph database with a structure characterized by sets of interlinked related nodes. Freebase allows for the storage of heterogeneous data in a structured manner; a node can have limitless connectivity while still upholding the intricacy of each connection. A Freebase-backed microbial database would allow us to amass and integrate information from a variety of different annotation projects.
Second, we need to populate that database with annotations from many different sources. Freebase also excels in this regard because it enables the Long Tail of scientists to contribute to the metagenomic annotation process. Like Wikipedia, Freebase welcomes contributions from the community, anything from individual facts from stand-alone labs to large data imports from large-scale efforts. The many varying groups within the community would all be able to input their information, and these contributions could be queried both as a whole and separately. The different microbes could be compared and grouped according to a variety of properties with modifiable specificity.
We first constructed a comprehensive data model that has been filled in with info from the well-studied organisms Escherichia coli K-12 substr. MG1655 and Pseudomonas aeruginosa PAO1. With information obtained from NCBI, the Gene Ontology, the UCSC Microbial Genome Browser, EcoliWiki, EcoCyc, and the Pseudomonas Genome Database, we have created 70 gene topics (35 from each genome) so far and are in the process of inputting more. They can all be viewed at the Microbial Gene Base within Freebase.
Right now, this basic infrastructure enables simple queries to, for example, retrieve all E. coli K-12 MG1655 genes that are involved in “zinc ion binding”. The Gene Ontology term is linked to a group of genes via sourced evidence, and each individual gene can point to properties it holds such as its genome, NCBI Entrez Gene ID, and locus (along with its specific location in a build).
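To give a feel for the shape of such a query, here is a toy Python sketch that models gene topics in memory and filters them by genome and Gene Ontology term. It does not use Freebase or its query language, and every name and value in it (the `GeneTopic` class, the gene symbols, the IDs, the loci) is a placeholder made up for illustration rather than real annotation data.

```python
from dataclasses import dataclass, field


@dataclass
class GeneTopic:
    """A simplified stand-in for one gene topic and its linked properties."""
    symbol: str
    genome: str
    entrez_gene_id: str
    locus: str
    go_terms: set = field(default_factory=set)


def genes_with_term(genes, genome, go_term):
    """Return the symbols of genes in `genome` annotated with `go_term`."""
    return [
        gene.symbol
        for gene in genes
        if gene.genome == genome and go_term in gene.go_terms
    ]


# Placeholder records only; these are not real genes, IDs or annotations.
ECOLI = "Escherichia coli K-12 substr. MG1655"
PAO1 = "Pseudomonas aeruginosa PAO1"
genes = [
    GeneTopic("geneA", ECOLI, "ID-0001", "locus-0001", {"zinc ion binding"}),
    GeneTopic("geneB", ECOLI, "ID-0002", "locus-0002", {"potassium ion transport"}),
    GeneTopic("geneC", PAO1, "ID-0003", "locus-0003", {"zinc ion binding"}),
]

zinc_binders = genes_with_term(genes, ECOLI, "zinc ion binding")
```

In a graph database the same question is answered by following links from the Gene Ontology term node to its genes, rather than by scanning a list as this sketch does.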
We have obtained data from KEGG and EcoCyc regarding biological pathways and will upload that information soon. We hope to include other relevant data and other microbial species into the Freebase graph as well. We also believe, success pending, that an awesome and useful future aspect would be the development of a generic cross-species genome browser that could use and display the information for any species in the Microbial Gene Base.
Why do we think this system is important, especially since there is already a Generic Model Organism Database (GMOD) project? Because we think it’s unlikely that every microbial species will have a large enough community to justify its own GMOD instance (especially given the developer and curator requirements). For the Long Tail of microbial species, we think this Centralized Model Organism Database (CMOD) will be an effective, efficient, and scalable method to collaboratively organize knowledge on microbial genes.