|Building intelligent systems for biology|
Our research group has been exploring the concept of serious games for several months now. Aside from providing nerdy entertainment, our games collect (and distribute) biological knowledge from broad audiences of players. The hypothesis underlying this work is that, by capturing knowledge in forms suitable for computation, these games make it possible to build more intelligent programs.
As one step in testing this general hypothesis, on Sept. 7, 2012, we released a game called ‘The Cure’. The objective of this game is to build a better (more intelligent) predictor of breast cancer survival time based on gene expression and copy number variation information from tumor samples. We selected this particular objective to align with the SAGE Breast Cancer Prognosis challenge.
In this game, available at http://genegames.org/cure/, the player competes with a computer opponent to select the highest scoring set of five genes from a board containing 25 different genes. The boards are assembled in advance to include genes judged statistically ‘interesting’ using the METABRIC dataset provided for the SAGE Challenge.
Below is a game in progress. I’m on the bottom and my opponent, Barney, is on the top. We alternate turns selecting a card (a gene) from the board and adding it to our hand. When we each complete a 5 card hand, the round finishes and whoever has the most points wins. Scores are determined by using training data to automatically infer and test decision tree classifiers that predict survival time. The trees can use both RNA expression and CNV data for the selected genes to infer predictive rules. The better the gene set performs in generating predictive decision trees, the higher the score. When the player defeats their opponent, they move on to play another board. (Multiple players play each board.)
|A game of the The Cure. Barney (the bad guy) is winning, I am looking at the CPB1 gene and, using the search feature, I have highlighted all genes that have the word cancer in any of their metadata in pink.|
As you can see to the right of the board, information from the Gene Ontology, RefSeq, and PubMed is provided through the game interface to aid players in selecting their genes. Players are also encouraged to make use of external knowledge sources (in addition to their own brains).
Promotion, players and play
|Games played at The Cure since launch|
As of last Friday, Oct. 26, 2012 we have had 214 people register and have recorded 3,954 total games (including training games). The player demographics have remained stable with about 40% PhDs, nearly 50% declaring knowledge of cancer biology, and about 50% stating that they are biologists.
Predicting breast cancer prognosis
The data collected from game play includes information about the players (education, knowledge of cancer, profession) and the complete history of the genes that each player selects for each board that they play. While we are still considering methods for making use of this data (such as the Human Guided Forest), we used the following protocol to build a predictor to submit to the SAGE challenge.
- Filter out games from players that indicated no knowledge of cancer biology.
- Rank each gene according to the ratio of the number of times that it was selected by different players to the number of times that it appeared in any played game.
- Select the top 20 genes according to this ranking.
- Insert this 20 gene ‘signature’ into the ‘Attractor Metagene’ algorithm that has dominated the SAGE challenge. To do this, we kept all of the code related to the use of clinical variables unchanged, but replaced the genes selected by the Attractor team with the genes selected by our game players.
The predictor generated with this protocol scored 69% correct on survival concordance index on the Sage challenge test dataset, just 3% behind the best submitted predictor and significantly above the median of hundreds of submitted models. (You can see the ranked results on the challenge leaderboard – search for team HIVE – and, with a free registration, you can inspect the model directly within the Synapse system operated by SAGE.)
In experiments conducted within the training dataset, we were able to consistently generate decision tree predictors of 10-year survival with an accuracy of 65% in 10-fold cross-validation using only genomic data (no clinical information). This was substantially better than classifiers produced using randomly selected genes (55%). Using an exhaustive search through the top 10 genes, we found 10 different unique gene combinations that, when aggregated, produced statistically significant (FDR < 0.05) indicators of survival within: (1) the training dataset used in the game, (2) a validation cohort from the same study, and (3) an independent validation set from a completely different study.
|Final Results from METABRIC round of BCC challenge|
!! Update, the mode submitted using the The Cure data (Team HIVE) scored 0.70 on the official test dataset for the METABRIC round of this competition, putting it at #43 of of 171 submitted models !!
These early results from The Cure show clearly that biologists with knowledge that is relevant to cancer biology will play scientific games, and that combined with even basic analytical techniques, meaningful knowledge for inferring predictors of disease progression can be captured from their play. We suggest that this might open the door to a new form of ‘crowdsourcing’ that operates with much smaller, more specific crowds than are typically considered.
The data collected from the game so far is available as an SQL dump in our repository. This is the entire database used to drive and track the game with the exception of personal information such as email and IP addresses.
Thanks to Max Nanis, Salvatore Loguercio, Chunlei Wu, Ian Macleod and Andrew Su for all of your help making The Cure. Thanks in particular to Max who authored 99% of everything you see when you play the game.