Another new data release was just rolled out. ClinVar and CADD data were updated to their latest versions:
| last release | new release | # of variants in new release | # of variants in last release |
ClinVar and CADD annotations are available under the “clinvar” and “cadd” subfields, respectively, for each annotated variant. MyVariant.info aggregates annotations from ClinVar, CADD and 12 other sources for each variant, so you can access them all in one request.
The total number of unique variants is now over 334M (334,443,525), compared to 316M previously. More details about the variant data we provide from MyVariant.info are always available in our documentation. The same information is available programmatically from our metadata endpoint.
Since ClinVar data was first included in MyVariant.info, we have received a lot of feedback from our users, which led to an overhaul of the ClinVar data structure in this new release. The changes are detailed below:
Data Structure Change:
This change reflects the fact that a single variant may correspond to one or more RCV records. In this release, fields specific to an RCV record (e.g. accession number, clinical significance, number of submitters, review status, last evaluated date, preferred name, origin and conditions) have moved under the “clinvar.rcv” field (see an example here). If a variant has multiple RCV records, each record is represented as an element in a list under the “clinvar.rcv” field (see examples here).
This change should resolve the missing-data issue in the previous release. The current release includes 127,745 unique variants (with ~150K RCV records) annotated by ClinVar. Roughly 9K RCV records were left out because their corresponding variants cannot be properly mapped to the human reference genome.
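Because “clinvar.rcv” may hold either a single record or a list of records, client code usually wants to treat both cases the same way. The following is a minimal sketch of one way to do that; the helper name `rcv_records` and the sample documents (including their accession values) are made up for illustration and are not real ClinVar data.

```python
def rcv_records(variant):
    """Return the RCV records of a variant document as a list.

    Handles the three shapes a document may have: no "clinvar.rcv"
    field, a single RCV record (a dict), or several records (a list).
    """
    rcv = variant.get("clinvar", {}).get("rcv")
    if rcv is None:
        return []
    if isinstance(rcv, dict):
        return [rcv]
    return list(rcv)


# Hypothetical documents, shaped like the structure described above.
single = {"clinvar": {"rcv": {"accession": "RCV_EXAMPLE_1"}}}
multiple = {"clinvar": {"rcv": [{"accession": "RCV_EXAMPLE_2"},
                                {"accession": "RCV_EXAMPLE_3"}]}}

for doc in (single, multiple):
    print([record["accession"] for record in rcv_records(doc)])
```

With a helper like this, downstream code can always loop over the records without checking the type first.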
New fields added:
ClinVar Variant ID
The ClinVar Variant ID is now included as the “clinvar.variant_id” field. The definition of ClinVar Variant ID can be found here.
When available, the genomic position on the hg38 assembly is now included as the “clinvar.hg38” field, alongside the existing “clinvar.hg19” field, which holds the hg19 genomic position for each variant.
More xref IDs
To facilitate cross-referencing with other variant databases, the available IDs mapping each variant to OMIM, COSMIC, UniProt and dbVar are now included as separate fields under “clinvar”. The previous data release included only the rsid from dbSNP.
The “clinvar.xref” field has been removed. All available xref IDs were either moved out as separate fields (e.g. “clinvar.uniprot”, “clinvar.omim”) or included under the “clinvar.rcv.conditions.identifiers” field.
New data structure examples:
A few examples of the new ClinVar data structure can be found at this gist.
In this demo, we will show you how to use our Python client myvariant.py to query for ClinVar data with only a few lines of code. The complete tutorial is provided as a Jupyter notebook here (the raw ipynb file is here).
- Installing myvariant.py is easy with pip:
pip install myvariant
- Now you just need to import it and instantiate the MyVariantInfo class:
import myvariant
mv = myvariant.MyVariantInfo()
- To get the available ClinVar annotations for a variant, pass an HGVS id and “clinvar” as the fields parameter to the getvariant method:
- You can also query for a single RCV accession number:
- Again, you can find more examples in the complete tutorial, provided as this Jupyter notebook (the raw ipynb file is here).
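The two lookups described in the steps above can be sketched roughly as follows. This is an illustration rather than the code from the tutorial: the helper names are hypothetical, the HGVS id and RCV accession in `demo()` are placeholders that may or may not have ClinVar annotations, and `demo()` assumes the myvariant package is installed and the service is reachable.

```python
def clinvar_by_hgvs(client, hgvs_id):
    """Fetch only the "clinvar" annotations for one variant by HGVS id."""
    return client.getvariant(hgvs_id, fields="clinvar")


def clinvar_by_rcv(client, accession):
    """Search for variants whose RCV records carry the given accession."""
    return client.query("clinvar.rcv.accession:" + accession, fields="clinvar")


def demo():
    # Imported here so the helpers above stay usable without the package.
    import myvariant

    mv = myvariant.MyVariantInfo()
    # Placeholder identifiers; substitute a variant and accession of interest.
    print(clinvar_by_hgvs(mv, "chr16:g.28883241A>G"))
    print(clinvar_by_rcv(mv, "RCV000083620"))
```

Calling `demo()` runs both requests against the live service. Passing the client in as an argument keeps the helpers easy to exercise with a stand-in object when no network is available.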
Finally, we would like to thank all users who provided feedback and suggestions. As always, feel free to reach us at help@myvariant.info or @myvariantinfo if you have any questions or feedback.