Mining the Gene Wiki

Our article about mining ontology-based gene annotations from the text of the Gene Wiki just came out in BMC Genomics. Yay!

In the article, we discuss the results of what I think might be the simplest text-mining strategy that could possibly work. Based on the premise that each Gene Wiki article is fundamentally about one particular gene, we make the simplifying assumption that every concept detectable in the article is a descriptor of what that gene does. With those assumptions in place, we use the NCBO Annotator to detect concepts from the Gene Ontology (GO) and the Human Disease Ontology (DO) in the text of articles about genes. Each detected occurrence thus produces a candidate annotation for the gene. From the article:

For example, we identified the GO term ‘embryonic development’ (GO:0009790) in the text of the article on the DAX1 gene: “DAX1 controls the activity of certain genes in the cells that form these tissues during embryonic development”. From this occurrence, our system proposed the structured annotation ‘DAX1 participates in the biological process of embryonic development’. Following the same pattern, we found a potential annotation to the DO term ‘Congenital Adrenal Hypoplasia’ (DOID:10492) in the sentence: “Mutations in this gene result in both X-linked congenital adrenal hypoplasia and hypogonadotropic hypogonadism”.

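The pattern above can be sketched as a toy dictionary matcher: every ontology term label found in a gene's article text becomes a candidate annotation for that gene. This is only an illustrative stand-in for the NCBO Annotator, with a two-term vocabulary and a snippet of article text made up from the examples in the paper.

```python
# Toy sketch of the strategy: any ontology term label found in a gene's
# article text becomes a candidate annotation for that gene.
# TERMS is a tiny illustrative stand-in for the full GO/DO vocabularies.
TERMS = {
    "GO:0009790": "embryonic development",
    "DOID:10492": "congenital adrenal hypoplasia",
}

def candidate_annotations(gene, article_text):
    """Return (gene, term_id, label) triples for every label found in the text."""
    text = article_text.lower()
    return [(gene, term_id, label)
            for term_id, label in TERMS.items()
            if label in text]

article = ("DAX1 controls the activity of certain genes in the cells that form "
           "these tissues during embryonic development. Mutations in this gene "
           "result in X-linked congenital adrenal hypoplasia.")

print(candidate_annotations("DAX1", article))  # finds both the GO and the DO term
```

A real pipeline also has to handle synonyms, word-form variation, and overlapping term labels, which is exactly what a concept recognizer like the NCBO Annotator provides.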
We found that, in terms of precision, this simple approach worked quite well for detecting gene-disease annotations (90-93%) but not nearly as well for gene-function (GO) annotations (48-64%). As you might expect, recall ran in the opposite direction, with many more potential GO annotations discovered (11,022) than DO annotations (2,983). Though there was some overlap, the majority of the predicted annotations had no match in existing annotation databases, showing that the Gene Wiki contains knowledge that centralized resources like the Gene Ontology Annotation database do not yet represent, and that basic text mining provides a way to access that knowledge computationally.

But, you say, that precision for the GO is really low; what use is this? For applications that require 100% accuracy, like a curated database, you would need to curate the predicted results, but that might still be much faster than searching through PubMed to find them all from scratch. As it turns out, there are also other kinds of applications that can take advantage of noisy data like this. As long as there is a strong signal within the noise, probabilistic techniques, like enrichment analysis, can still work. This is possible because, although many of the individual annotations might turn out to be incorrect, as a group they are far from random.
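To make the enrichment-analysis point concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test, the standard statistic behind gene-set enrichment. The gene counts in the example are made up for illustration; a real analysis would use the full annotation sets and a multiple-testing correction.

```python
from math import comb

def enrichment_p(k, K, n, N):
    """One-sided hypergeometric P(X >= k): the chance that a random n-gene
    list drawn from N total genes contains at least k of the K genes
    carrying a given annotation. A small p suggests enrichment."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Made-up numbers: 20,000 genes total, 40 annotated with some GO term,
# of which 5 appear in a 100-gene list of interest.
p = enrichment_p(5, 40, 100, 20000)
print(f"p = {p:.2e}")
```

Because the test operates on counts over whole groups of genes, a moderate error rate in the individual mined annotations shifts those counts only slightly, which is why noisy annotations can still support this kind of analysis.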

For more details, read the paper ;).