Twenty questions for genes

Part 1: Introduction to the concept (this post)
Part 2: The prototype game
Part 3: Evaluation framework

If I asked you to think of your favorite gene, do you think I could guess its identity by asking you twenty yes/no questions?

I personally might not be a very good guesser in this game, but I’m betting that we can write a computer program that would be pretty accurate. This program would ask whether your gene is associated with specific biological processes, molecular functions, and cellular components (the three branches of the Gene Ontology), and use your answers to quickly narrow the candidates down to the gene you’re thinking of.

    Q1: Is your gene found in the cytoplasm? No
    Q2: Is your gene found in an intracellular organelle? No
    Q3: Is your gene involved in a cellular metabolic process? Yes
    Q4: Is your gene found in the membrane? No
    Q5: Is your gene involved in a protein modification process? Yes
    Q6: Is your gene involved in a cellular nitrogen compound metabolic process? No
    Q7: Is your gene involved in intracellular signal transduction? Yes
    Q8: Is your gene involved in the MAPKKK cascade? No
    Q9: Is your gene involved in positive regulation of catalytic activity? Yes
    Q10: Is your gene found in the extracellular space? Yes
    Q11: Is your gene involved in inflammatory response to infection? No
    Q12: Is your gene involved in protein binding? Yes
    Q13: Is your gene found in the soluble fraction? No
    Q14: Is your gene Growth Hormone 1? No
    Q15: Is your gene Reelin? Yes
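
One way to picture the narrowing-down step is a greedy search that always asks about the term splitting the remaining candidate genes most evenly. The code below is only a minimal sketch of that idea, not the actual game: the gene names, the term sets, and the function names (`best_question`, `play`) are all made up for illustration, and real GO data would be far larger and noisier.

```python
# Hypothetical toy annotations: gene -> set of GO-like terms (not real GO data).
TOY_ANNOTATIONS = {
    "geneA": {"cytoplasm", "metabolic process"},
    "geneB": {"cytoplasm", "signal transduction"},
    "geneC": {"membrane", "signal transduction"},
    "geneD": {"membrane", "metabolic process", "protein binding"},
}


def best_question(candidates, annotations, asked):
    """Return the unasked term that splits the candidates most evenly, or None."""
    terms = set().union(*(annotations[g] for g in candidates)) - asked

    def yes_count(term):
        return sum(term in annotations[g] for g in candidates)

    # Keep only terms that actually separate the candidates (some yes, some no).
    useful = [t for t in sorted(terms) if 0 < yes_count(t) < len(candidates)]
    if not useful:
        return None
    # An even split has yes_count close to half of the candidates.
    return min(useful, key=lambda t: abs(2 * yes_count(t) - len(candidates)))


def play(annotations, answer, max_questions=20):
    """Ask up to max_questions yes/no questions; return (remaining genes, terms asked)."""
    candidates = set(annotations)
    asked = set()
    for _ in range(max_questions):
        if len(candidates) <= 1:
            break
        term = best_question(candidates, annotations, asked)
        if term is None:
            break
        asked.add(term)
        said_yes = answer(term)
        # Keep only genes whose annotation agrees with the player's answer.
        candidates = {g for g in candidates if (term in annotations[g]) == said_yes}
    return candidates, asked


# A simulated player who is thinking of geneC and answers truthfully.
remaining, asked = play(TOY_ANNOTATIONS, lambda term: term in TOY_ANNOTATIONS["geneC"])
```

With truthful answers and distinct annotation sets, `remaining` ends up holding just the player's gene. This sketch discards a gene after a single disagreeing answer, so it would not survive the incomplete annotations discussed below; a real guesser would need to tolerate mismatches.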

Even if we could build such a system, why would we? Because this game could help organize the community’s knowledge of gene function. It is well appreciated that Gene Ontology (GO) annotations are incomplete, which is not surprising given that PubMed grows by almost one million articles per year. Yet statistical analyses based on GO annotations are a cornerstone of many data analysis methods, so improvements are sorely needed.

How can our twenty questions game advance biology? In the course of guessing your gene, we will likely identify discrepancies between what you think about your gene and what is represented in the GO annotation database. In the example above, I answered “Yes” to the question on protein binding, since reelin is clearly involved in binding its receptors VLDLR and ApoER2. Yet the GO annotation for protein binding (GO:0005515) does not appear in the official GO record for reelin, so this is clearly a case where the GO annotation database is incomplete. Discrepancies like these are candidate novel annotations, and our confidence in these candidate annotations increases if they are also inferred from the games of other players.
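
To make the idea of pooling discrepancies concrete, here is a small sketch that counts how many players answered “Yes” to a term that is missing from a gene's annotations. It is an illustration under stated assumptions rather than a description of a real pipeline: the function name `candidate_annotations`, the three simulated games, and the stand-in annotation set for reelin (`RELN`) are hypothetical, and only the reelin / protein binding (GO:0005515) example comes from the discussion above.

```python
from collections import Counter


def candidate_annotations(games, annotations):
    """Count 'Yes' answers that have no matching annotation.

    games: iterable of (gene, {term: bool}) pairs, one per finished game.
    annotations: gene -> set of terms currently recorded for that gene.
    Returns a Counter mapping (gene, term) to the number of supporting games.
    """
    votes = Counter()
    for gene, answers in games:
        known = annotations.get(gene, set())
        for term, said_yes in answers.items():
            if said_yes and term not in known:
                votes[(gene, term)] += 1
    return votes


# Stand-in annotation set (illustrative only, not the real GO record).
known_annotations = {"RELN": {"extracellular space"}}

# Three hypothetical games in which the player was thinking of reelin.
games = [
    ("RELN", {"extracellular space": True, "protein binding": True}),
    ("RELN", {"protein binding": True, "membrane": False}),
    ("RELN", {"extracellular space": True, "soluble fraction": False}),
]

votes = candidate_annotations(games, known_annotations)
# Candidates supported by the most games come first.
ranked = votes.most_common()
```

Here the pair `("RELN", "protein binding")` is supported by two independent games, which is the kind of agreement that should raise our confidence in a candidate annotation.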

Quantitatively minded readers will recognize that this game boils down to an exercise in machine learning and classification. In fact, we believe that building this game would be an ideal class project in artificial intelligence. A reasonable training set can be easily defined from the existing annotations that link genes to Gene Ontology terms. There is a very straightforward evaluation metric (accuracy as the proportion of false answers increases). And there is a rich collection of algorithms that can be applied to this challenge, from decision trees to neural nets to Bayesian methods.
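
The evaluation metric can be sketched as a simulation: pick a target gene, answer questions from its annotations while flipping each answer with some probability, and record how often the guesser still recovers the target. The code below is a rough sketch under several assumptions. The toy annotations and the names `guess` and `accuracy` are hypothetical, and for simplicity the guesser asks about every term and picks the best-matching gene instead of choosing twenty questions adaptively.

```python
import random

# Hypothetical toy annotations: gene -> set of GO-like terms (not real GO data).
TOY_ANNOTATIONS = {
    "geneA": {"cytoplasm", "metabolic process"},
    "geneB": {"cytoplasm", "signal transduction"},
    "geneC": {"membrane", "signal transduction"},
    "geneD": {"membrane", "metabolic process", "protein binding"},
}


def guess(annotations, answers):
    """Return the gene whose annotations agree with the largest number of answers."""
    def agreement(gene):
        return sum((term in annotations[gene]) == said_yes
                   for term, said_yes in answers.items())
    # Sorting makes tie-breaking deterministic.
    return max(sorted(annotations), key=agreement)


def accuracy(annotations, noise, trials=200, seed=0):
    """Fraction of simulated games guessed correctly when each answer
    is flipped independently with probability `noise`."""
    rng = random.Random(seed)
    genes = sorted(annotations)
    terms = sorted(set().union(*annotations.values()))
    correct = 0
    for _ in range(trials):
        target = rng.choice(genes)
        # True answer XOR "this answer was flipped".
        answers = {t: (t in annotations[target]) != (rng.random() < noise)
                   for t in terms}
        correct += guess(annotations, answers) == target
    return correct / trials


# Accuracy at increasing levels of false answers.
curve = {noise: accuracy(TOY_ANNOTATIONS, noise) for noise in (0.0, 0.1, 0.2, 0.3)}
```

Plotting `curve` for competing guessers would give one simple way to compare how gracefully each degrades as answers become less reliable.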

Therefore, we are issuing this call to arms to recruit interested partners and collaborators. Specifically, we are looking for undergraduate or graduate educators running classes in artificial intelligence, computer algorithms, bioinformatics, or any related discipline. Interested groups can come from anywhere: elsewhere in San Diego, in California, in the US, or in the world. We envision that these class projects can run concurrently during the fall quarter or semester. If we get enough groups to sign up, we can run a worldwide competition to see who has the highest-quality guesser. Finally, as an added bonus, any group that participates in this challenge will be included as a coauthor on a scientific paper that we will write after the evaluation is complete.

There are obviously many additional details that will be shared with participants. Please contact us or leave your contact details below if you’re interested in taking part!

