With the recent boom in next-generation sequencing technology, genomic variant data are increasingly accessible to researchers. However, annotating these variants remains a big challenge, particularly when it comes to understanding their functional impact. Recent efforts have produced many useful resources to tackle this challenge, both by developing new algorithms for functional prediction and by assembling integrated databases of functional annotations. From a researcher’s point of view, once they have identified a list of variants in their own data, they face the problem of organizing all available data on those variants. In practice, this means either simply choosing a single “best” resource or going through all of these resources one by one, which quickly becomes impractical.
To tackle this challenge, we propose to build a centralized hub system that hosts all available variant annotations, so that end users have a one-stop solution for their annotation tasks. In our view, building such a hub successfully comes down to two key points:
- It needs a scalable, high-performance query engine for variant annotations. This is easy to appreciate given the scale of the data we are facing (e.g. known human SNPs alone number ~30M).
- It must have a framework for external users to contribute, and it must provide enough motivation to build a stable community. This is important because a single development team is never going to be enough to keep up with the fast-evolving variant annotations, and community-backed development is pretty much the only way to ensure long-term success.
We previously built a similar system for gene annotations, called MyGene.info, which provides high-performance query services for accessing up-to-date gene annotation data integrated from various data resources. In MyGene.info, each data resource has a data-loading adapter (“data plugin”) that converts source data into gene annotation documents (a.k.a. dictionaries in Python) with the gene ID as the document key. We then merge the gene documents that share the same gene ID into a single combined document. MyGene.info can serve as a base on which to build our proposed query engine for variant annotations.
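To make the merge step concrete, here is a minimal sketch of combining per-source gene documents by gene ID. It is not MyGene.info's actual code: the function name `merge_gene_docs`, the use of `_id` as the key field, and the rule that later sources overwrite earlier ones on a field clash are all assumptions made for illustration, and the field values are placeholders.

```python
def merge_gene_docs(plugin_outputs):
    """Merge gene documents from several data plugins by gene ID.

    plugin_outputs: an iterable with one entry per data plugin, each entry
    being an iterable of dicts that carry the gene ID under "_id"
    (the "_id" field name is an assumption of this sketch).
    """
    merged = {}
    for docs in plugin_outputs:
        for doc in docs:
            gene_id = doc["_id"]
            # Create the combined document on first sight of this gene ID,
            # then fold the new fields in. On a field clash the later
            # source wins, which is a simplification.
            merged.setdefault(gene_id, {"_id": gene_id}).update(doc)
    return merged


# Two hypothetical plugins; the second adds a field to gene 1017.
outputs = [
    [{"_id": "1017", "symbol": "CDK2"}],
    [{"_id": "1017", "pathway": ["placeholder"]}, {"_id": "1018", "symbol": "CDK3"}],
]
combined = merge_gene_docs(outputs)
print(combined["1017"])
```

A real implementation would also have to decide how conflicting fields from different sources are reconciled, which this sketch deliberately leaves out.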
When it comes down to the specific tasks for this Hackathon, here is what I have in mind for now (which obviously might change completely during the Hackathon itself):
- Enumerate the available variant annotation resources (e.g. ClinVar, COSMIC, etc.)
- Define variant annotation document structure
- Create a MyGene.info-like query engine to index variant annotation documents, along with the downstream query APIs
- Create so-called “data plugins” to import data from external resources. Each data plugin should follow a base requirement, so that the loaded data can be merged into the main indices.
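As a starting point for discussion, the last two tasks could be sketched roughly as follows. Everything here is hypothetical: the `VariantDataPlugin` base class, the `source_name` attribute, the `build_index` helper, the HGVS-like `_id` format, and the choice to store each source's fields under its own key are assumptions of this sketch, not a settled design, and the sample rows are made-up values.

```python
from abc import ABC, abstractmethod


class VariantDataPlugin(ABC):
    """Hypothetical base requirement that every data plugin would follow."""

    # Key under which this source's fields land in the merged document.
    source_name = ""

    @abstractmethod
    def load(self):
        """Yield one dict per variant; each dict must carry an '_id' key."""

    def validated(self):
        """Yield documents from load(), rejecting any that lack an '_id'."""
        for doc in self.load():
            if "_id" not in doc:
                raise ValueError(f"{self.source_name}: document without '_id'")
            yield doc


class ToyClinVarPlugin(VariantDataPlugin):
    """Toy plugin fed with in-memory rows instead of a real ClinVar file."""

    source_name = "clinvar"

    def __init__(self, rows):
        self.rows = rows

    def load(self):
        for chrom, pos, ref, alt, significance in self.rows:
            # HGVS-like genomic ID; the exact ID scheme is an assumption.
            yield {
                "_id": f"chr{chrom}:g.{pos}{ref}>{alt}",
                "clinical_significance": significance,
            }


def build_index(plugins):
    """Merge the output of all plugins into one document per variant ID."""
    index = {}
    for plugin in plugins:
        for doc in plugin.validated():
            merged = index.setdefault(doc["_id"], {"_id": doc["_id"]})
            # Namespace each source's fields so that sources cannot clash.
            merged[plugin.source_name] = {
                key: value for key, value in doc.items() if key != "_id"
            }
    return index


# Made-up row, for illustration only.
index = build_index([ToyClinVarPlugin([("1", 1000, "A", "G", "made-up")])])
print(index)
```

Keeping each source under its own key means a contributor can add a new plugin without knowing the field names used by any other source, which seems to fit the community-contribution goal. Whether that is the right document structure is exactly the kind of thing to settle at the Hackathon.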
Thoughts or feedback? Feel free to start the discussion now, or save it for the Hackathon.