Twenty questions for genes

Part 1: Introduction to the concept (this post)

Part 2: The prototype game

Part 3: Evaluation framework

If I asked you to think of your favorite gene, do you think I could guess the identity of that gene by first asking you twenty yes / no questions?

I personally might not be a very good guesser in this game, but I’m betting that we can write a computer program that would be pretty accurate. This program would ask whether your gene was related to specific biological processes, molecular functions, and/or cellular localizations (the main types of Gene Ontology annotations), and use your answers to quickly narrow down to exactly the gene you’re thinking of.

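[Figure: screenshot of an example game. The surviving text includes the answer “Yes” and the gene names “Growth Hormone 1” and “Reelin”.]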

Even if we could build such a system, why would we? Because this game could help organize the community’s knowledge of gene function. It is well appreciated that Gene Ontology (GO) annotations are incomplete, which is not surprising given that PubMed grows by almost one million articles per year. Yet statistical analyses based on GO annotations are a cornerstone of data analysis methods, so improvements are sorely needed.

How can our twenty questions game advance biology? In the course of guessing your gene, we will likely identify discrepancies between what you think about your gene and what is represented in the GO annotation database. In the example above, I answered “Yes” to the question on protein binding, since reelin is clearly involved in binding its receptors VLDLR and ApoER2. Yet the GO annotation for protein binding (GO:0005515) does not appear in the official GO record for reelin, so this is clearly a case where the GO annotation database is incomplete. Discrepancies like these are candidate novel annotations, and our confidence in these candidate annotations increases if they are also inferred from the games of other players.
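As a rough illustration of the idea above, the sketch below compares a player's “Yes” answers against the annotations on record and counts how many independent games support each candidate annotation. Everything in it is an assumption made for illustration: the function names `candidate_annotations` and `tally_candidates`, the data layout, and the toy annotation set, which is not the real GO record for reelin.

```python
from collections import Counter


def candidate_annotations(recorded, game_answers):
    """Return GO terms a player affirmed that are absent from the record.

    recorded: set of GO term IDs currently annotated to the gene.
    game_answers: dict mapping GO term ID -> True ("Yes") or False ("No").
    """
    return {
        term
        for term, said_yes in game_answers.items()
        if said_yes and term not in recorded
    }


def tally_candidates(recorded, games):
    """Count, across several games on one gene, how often each candidate recurs."""
    votes = Counter()
    for answers in games:
        votes.update(candidate_annotations(recorded, answers))
    return votes


# Toy data only: placeholder terms, NOT the real GO record for reelin.
recorded_reln = {"GO:toy_term_1", "GO:toy_term_2"}
games = [
    {"GO:0005515": True, "GO:toy_term_1": True},
    {"GO:0005515": True, "GO:toy_term_3": False},
    {"GO:toy_term_2": True, "GO:toy_term_3": True},
]

votes = tally_candidates(recorded_reln, games)
for term, n_games in votes.most_common():
    print(term, n_games)
```

A candidate supported by more games would rank higher in this tally. How many agreeing players should count as sufficient evidence is left open here.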

Quantitatively minded readers will recognize that this game boils down to an exercise in machine learning and classification. In fact, we believe that building this game would be an ideal class project in artificial intelligence. A reasonable training set is easy to define (the existing annotations linking Gene Ontology terms to genes). There is a straightforward evaluation metric (accuracy as the proportion of false answers increases). And there is a rich collection of algorithms that can be applied to this challenge, from decision trees to neural nets to Bayesian inference.
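To make the setup concrete, here is a minimal sketch of one possible baseline guesser together with the evaluation described above. It is not a proposed solution: the greedy "most even split" rule is just a simple stand-in in the spirit of a decision-tree split, and the gene names, term names, `ANNOTATIONS` table, and function names are all hypothetical placeholders rather than real GO data.

```python
import random

# Toy training set: gene -> set of annotated terms (placeholders, not real GO data).
ANNOTATIONS = {
    "gene_A": {"t1", "t2"},
    "gene_B": {"t1", "t3"},
    "gene_C": {"t2", "t3"},
    "gene_D": {"t1", "t2", "t3"},
    "gene_E": {"t4"},
    "gene_F": {"t3", "t4"},
}


def best_question(candidates, asked):
    """Pick the unasked term that splits the remaining candidates most evenly."""
    terms = set().union(*(ANNOTATIONS[g] for g in candidates)) - asked

    def yes_count(term):
        return sum(term in ANNOTATIONS[g] for g in candidates)

    # Keep only terms that actually separate the candidates into two groups.
    useful = [t for t in sorted(terms) if 0 < yes_count(t) < len(candidates)]
    if not useful:
        return None
    return min(useful, key=lambda t: abs(2 * yes_count(t) - len(candidates)))


def guess(answer, max_questions=20):
    """Ask up to max_questions yes/no questions; return the surviving candidates.

    answer: callable taking a term and returning True ("Yes") or False ("No").
    """
    candidates = set(ANNOTATIONS)
    asked = set()
    for _ in range(max_questions):
        if len(candidates) <= 1:
            break
        term = best_question(candidates, asked)
        if term is None:
            break
        asked.add(term)
        said_yes = answer(term)
        candidates = {g for g in candidates if (term in ANNOTATIONS[g]) == said_yes}
    return candidates


def accuracy(flip_prob, trials=200, seed=0):
    """Fraction of games won when each answer is flipped with probability flip_prob."""
    rng = random.Random(seed)
    genes = sorted(ANNOTATIONS)
    correct = 0
    for _ in range(trials):
        secret = rng.choice(genes)

        def answer(term):
            truth = term in ANNOTATIONS[secret]
            return (not truth) if rng.random() < flip_prob else truth

        correct += guess(answer) == {secret}
    return correct / trials


for flip_prob in (0.0, 0.1, 0.2):
    print(flip_prob, accuracy(flip_prob))
```

Because this baseline discards every gene that disagrees with an answer, a single false answer can eliminate the correct gene. A guesser that scores candidates probabilistically instead of filtering them outright would be one way to tolerate such errors.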

Therefore, we are issuing this call to arms to recruit interested partners and collaborators. Specifically, we are looking for undergraduate or graduate educators running classes in artificial intelligence, computer algorithms, bioinformatics, or any related discipline. Interested groups can come from anywhere: elsewhere in San Diego, in California, in the US, or in the world. We envision these class projects running concurrently during the fall quarter or semester. If enough groups sign up, we can run a worldwide competition to see who has the highest-quality guesser. Finally, as an added bonus, any group that participates in this challenge will be included as a coauthor on a scientific paper that we will write after the evaluation is complete.

There are obviously many additional details that will be shared with participants. Please contact us or leave your contact details below if you’re interested in taking part!
