Blog



Gene Wiki Data, BioThings, Mark2Cure, & more! A review of 2017 in the Su Lab

Posted by on Jan 1, 2018 in annual summary, BD2K, BioGPS, BioThings, mark2cure, Su lab, sulab, tsri | 0 comments

Rather than summarize the 2017 progress on each project in separate, project-specific posts, I’m putting it all in one post this year for one important reason: recruitment! As with any academic research group, we expect to see a number of our talented team members move on to bigger and better things. 2017 did not disappoint (well, it was disappointing to lose such talent, but we’re also happy that these awesome people are finding amazing new opportunities).

In saying farewell to 2017, we also bid farewell and best wishes to team members who have really pushed the boundaries of science during their time here including:

  • Ramya Gamini–who takes her bioinformatics expertise to Pfizer where she will continue as a postdoctoral associate.
  • Tim Putman–who takes his bioinformatics chops to Oregon Health & Science University (OHSU) as a Bioinformatics Software Developer. There, he will continue to push for open data and data reuse as he develops the Monarch web application, aggregates and structures microbial model organism data in The Monarch Initiative infrastructure, and develops tools and aggregates data in support of the NCATS Translator Project.
  • Benjamin Good–the talented Assistant Professor who drove the Gene Wiki / Wiki Data projects in the Su Lab as well as a number of citizen science projects such as Science Game Lab, theCure, Mark2Cure, and more!
  • Max Nanis–research programmer, developer, artist. His twitter feed is like a tribute to chaos. His success anywhere is a certainty.
  • Sebastian Burgstaller-Muehlbacher–his profile on the Su Lab site will forever remain a fictional heavy metal tribute to science–Rock on!!!

Having lost so much talent in 2017, we’re fortunate to have recruited some new R|D rock stars to the lab and are happy to welcome:

  • Byung Ryul Jeon–a visiting scientist, accomplished doctor, and scholar interested in broadening his knowledge and honing his research skills
  • Laura Hughes–data scientist and web developer with a knack for purposeful and deliberate data visualization. Laura came to us after applying her skills to make the world a better place at her previous job with USAID.
  • Alejo Covian–a full stack python developer shrouded in mystery.

Team members like Associate Professor Chunlei Wu not only provide friendly expertise and helpful guidance to new recruits; Chunlei is also a leader when it comes to holiday fashion.

If you have some computational skills and would like to use them to push research forward, consider joining our lab! We have a lot of great projects which could use your help! Speaking of great projects, let’s start with the ones led by Chunlei (pictured left).

Chunlei has been the driving force behind BioGPS, MyGene.info, MyVariant.info, BioThings.io, and more. As recently announced, he will be the TSRI site principal investigator for the National Center for Data to Health (CD2H).

The newly created center will be led by researchers from OHSU (Tim has moved on from the Su Lab, but he hasn’t escaped our reach!!!!), Northwestern University, University of Washington, Johns Hopkins University School of Medicine, and Sage Bionetworks, together with TSRI, Washington University in St. Louis, the University of Iowa and The Jackson Laboratory.

For his part, Chunlei and his team will do what they do best–building high-performance and scalable data access infrastructure and defining community best practice for data processing and software implementations.
 
 

As of 2017, BioGPS has seen the launch of a new sheep atlas portal along with a corresponding research paper, and BioGPS made it to the top of the weekly list on Labworm. After demonstrating that the MyGene.info framework could be retooled for variants (MyVariant.info), the framework has been abstracted into the BioThings SDK. 2017 was a busy year for these related projects:

  • Kevin presented about MyVariant.info at Heart BD2K site visit (2017.04.20)
  • Chunlei presented to Global Alliance for Genomics and Health (GA4GH), Variant Interpretation for Cancer Consortium (VICC) working group (2017.05.09)
  • Chunlei presented a poster on the BioThings SDK at ISMB/BOSC (2017.07.21 – 2017.07.25)
  • Kevin presented a poster on the BioThings Explorer at ISMB/BOSC
  • And the BioThings Explorer manuscript titled, “Cross-linking BioThings APIs through JSON-LD to facilitate knowledge exploration”, has been submitted

2017 has also been busy for the Gene Wiki / Wiki Data team here in the Su Lab as:

  • the Gene Wiki / Wiki Data Team shared their work at the Biocuration 2017 conference (2017.03.26 – 2017.03.29). Both Sebastian & Tim gave talks, while Greg and Nuria had poster presentations.
  • Greg also presented at Heart BD2K site visit (2017.04.20)
  • Andrew presented at Bioinformatics Open Source Conference (2017.07.22 – 2017.07.23)
  • and Andrew was also featured on the Wikimedia Research Showcase Webcast (2017.08.23)

In case you missed it, the renewal grant application for this project was also shared in a blog post this year which gives the big picture idea of the project moving forward.

While we’re on the subject of crowdsourced science, Mark2Cure had a busy year as well.

  • Mark2Cure hosted the #Mark4Rare event for Rare Disease Day (during the week prior to Rare Disease Day)
  • Mark2Cure organized/hosted the Citizen Science Expo at La Jolla Library (2017.03.11)
  • Ginger presented about Mark2Cure at Heart BD2K site visit (2017.04.20)
  • Mark2Cure was featured on SciStarter (2017.04.27)
  • Mark2Cure had a joint webinar with Cochrane Crowd (2017.05.08) followed by a joint event (#MedLitBlitz) which included a Mark2Curathon (2017.05.11)
  • Ginger presented Mark2Cure at the 2017 Citizen Science Association Conference during the citizen science for biomedical research panel (2017.05.18)
  • Ginger presented a poster on Mark2Cure at the CSA 2017 conference (2017.05.18)
  • Max delivered a Project Slam about Mark2Cure at the CSA 2017 conference (2017.05.18)
  • Mark2Cure joined CitSciBio at a table in the St. Paul Science Museum for the Citizen Science Festival (2017.05.20)
  • Mark2Cure paneled for #CitSciChat (2017.07.19)
  • Mark2Cure joined #Dazzle4rare (2017.08.13 – 2017.08.19)

In addition to these projects, members of the Su Lab have been busy advancing research in protein folding/microscopy analysis methods, osteoarthritis, and data-driven drug re-purposing methodologies–all while managing to have an epic time.

May the FAIR-principles-of-open-data be with you

New default behavior for ‘species’ parameter

Posted by on Nov 24, 2017 in mygene.info, species | 0 comments

The MyGene.info API supports both "/gene" and "/query" endpoints. On its /query endpoint, an optional species parameter lets users pass one or more species (as common species names or taxonomy ids) to filter the query results.

Previously, the default species were set to "human,mouse,rat". This meant that, unless you explicitly specified other values for the species parameter, your query results (e.g. "q=cdk2") might look like this:

http://mygene.info/v3/query?q=cdk2

{
  "max_score": 457.24393,
  "took": 11,
  "total": 32,
  "hits": [
    {
      "_id": "1017",
      "_score": 457.24393,
      "entrezgene": 1017,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9606
    },
    {
      "_id": "12566",
      "_score": 329.98914,
      "entrezgene": 12566,
      "name": "cyclin-dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10090
    },
    {
      "_id": "362817",
      "_score": 279.2216,
      "entrezgene": 362817,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10116
    },
    {
      "_id": "143384",
      "_score": 22.91444,
      "entrezgene": 143384,
      "name": "CDK2 associated cullin domain 1",
      "symbol": "CACUL1",
      "taxid": 9606
    },
    {
      "_id": "52004",
      "_score": 20.558783,
      "entrezgene": 52004,
      "name": "CDK2-associated protein 2",
      "symbol": "Cdk2ap2",
      "taxid": 10090
    },
    {
      "_id": "78832",
      "_score": 17.98903,
      "entrezgene": 78832,
      "name": "CDK2 associated, cullin domain 1",
      "symbol": "Cacul1",
      "taxid": 10090
    },
    {
      "_id": "365493",
      "_score": 14.489841,
      "entrezgene": 365493,
      "name": "CDK2-associated, cullin domain 1",
      "symbol": "Cacul1",
      "taxid": 10116
    },
    {
      "_id": "13445",
      "_score": 13.166027,
      "entrezgene": 13445,
      "name": "CDK2 (cyclin-dependent kinase 2)-associated protein 1",
      "symbol": "Cdk2ap1",
      "taxid": 10090
    },
    {
      "_id": "690181",
      "_score": 8.355364,
      "entrezgene": 690181,
      "name": "similar to S-phase kinase-associated protein 1A (Cyclin A/CDK2-associated protein p19) (p19A) (p19skp1)",
      "symbol": "LOC690181",
      "taxid": 10116
    },
    {
      "_id": "690646",
      "_score": 7.2449207,
      "entrezgene": 690646,
      "name": "similar to S-phase kinase-associated protein 2 (F-box protein Skp2) (Cyclin A/CDK2-associated protein p45) (F-box/WD-40 protein 1) (FWD1)",
      "symbol": "LOC690646",
      "taxid": 10116
    }
  ]
}

With no species parameter specified in the query, 32 hits were returned (the first 10 are listed above): all genes from human, mouse, or rat with a match to cdk2 in some field (such as the symbol or name field). You could return the matched genes from all species by specifying species=all in the query.

While "human,mouse,rat" was a useful default for users who just need to query genes in these common species, it could cause confusion for query terms relevant only to other species. For example, a query like q=F1RW06 previously returned no hits instead of the matching pig CDK3 gene, unless you added "species=pig" or "species=all".
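
As a minimal sketch of how the species parameter shapes a request, the helper below only builds URLs for the /query endpoint shown above; it does not send anything. The function name build_query_url is our own illustration and is not part of any MyGene.info client.

```python
from urllib.parse import urlencode

# Base URL of the /query endpoint, as used in the examples in this post.
QUERY_ENDPOINT = "http://mygene.info/v3/query"


def build_query_url(q, species=None):
    """Build a /query URL (hypothetical helper, for illustration only).

    `species` may be a single value or a list of common names or
    taxonomy ids; when omitted, the server-side default applies.
    """
    params = {"q": q}
    if species is not None:
        if isinstance(species, (list, tuple)):
            species = ",".join(str(s) for s in species)
        params["species"] = str(species)
    # safe="," keeps the comma-separated species list readable in the URL.
    return QUERY_ENDPOINT + "?" + urlencode(params, safe=",")


print(build_query_url("cdk2"))
print(build_query_url("F1RW06", species="pig"))
print(build_query_url("cdk2", species=["human", "mouse", "rat"]))
```

Passing species explicitly, rather than relying on the default, keeps a script's results stable even if the server-side default changes.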

Now, based on feedback from many users, the default "species" behavior has been set to "all". The same "q=cdk2" query will now return matched genes from all species:

http://mygene.info/v3/query?q=cdk2

{
  "max_score": 393.0346,
  "took": 115,
  "total": 611,
  "hits": [
    {
      "_id": "1017",
      "_score": 393.0346,
      "entrezgene": 1017,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9606
    },
    {
      "_id": "12566",
      "_score": 327.42117,
      "entrezgene": 12566,
      "name": "cyclin-dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10090
    },
    {
      "_id": "362817",
      "_score": 270.2593,
      "entrezgene": 362817,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10116
    },
    {
      "_id": "100925631",
      "_score": 268.31903,
      "entrezgene": 100925631,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9305
    },
    {
      "_id": "100981695",
      "_score": 268.31903,
      "entrezgene": 100981695,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9597
    },
    {
      "_id": "105864946",
      "_score": 268.31903,
      "entrezgene": 105864946,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 30608
    },
    {
      "_id": "ENSMEUG00000005552",
      "_score": 268.31903,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9315
    },
    {
      "_id": "103465316",
      "_score": 268.31903,
      "entrezgene": 103465316,
      "name": "cyclin dependent kinase 2",
      "symbol": "cdk2",
      "taxid": 8081
    },
    {
      "_id": "100117828",
      "_score": 268.31903,
      "entrezgene": 100117828,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 7425
    },
    {
      "_id": "101544122",
      "_score": 268.31903,
      "entrezgene": 101544122,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 42254
    }
  ]
}

We think this changed default behavior for the "species" parameter will give more intuitive results for most users. And you can easily mimic the old behavior by explicitly specifying species=human,mouse,rat in the query. It's also worth mentioning that, as before, our customized weighting function makes sure that the human, mouse, and rat genes with the same matches (e.g. the same symbol match of "cdk2") always appear first, ahead of those from other species.
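
To see that ordering in the sample response, one could check where the human, mouse, and rat hits fall in the list. The sketch below works on a hand-copied subset of the hits shown above (only `_id` and `taxid` are kept); the helper name and the PRIORITY_TAXIDS constant are our own and purely illustrative.

```python
# Taxonomy ids for human, mouse, and rat, as they appear in the sample hits.
PRIORITY_TAXIDS = {9606, 10090, 10116}

# A hand-copied subset of the hits from the all-species response above.
sample_hits = [
    {"_id": "1017", "taxid": 9606},
    {"_id": "12566", "taxid": 10090},
    {"_id": "362817", "taxid": 10116},
    {"_id": "100925631", "taxid": 9305},
    {"_id": "100981695", "taxid": 9597},
]


def priority_hits_come_first(hits):
    """Return True if no human/mouse/rat hit appears after another species."""
    seen_other = False
    for hit in hits:
        if hit["taxid"] in PRIORITY_TAXIDS:
            if seen_other:
                return False
        else:
            seen_other = True
    return True


print(priority_hits_come_first(sample_hits))
```

Note that this check is stricter than the behavior described: the weighting only applies among genes with the same kind of match, so a weaker human match can legitimately rank below a strong match from another species.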

As always, let us know if you have any comments or concerns via help@mygene.info or @mygene.info.

BioGPS Spotlight on the Sheep Gene Expression Atlas

Posted by on Nov 6, 2017 in BioGPS, data release, spotlight | 0 comments

In BioGPS, there are a number of InterMine-based model organism plugins (and to a lesser extent model organism data sets) which allow users to explore gene expression in organisms typically studied in biomedical research. Model organisms such as mice, rats, flies, worms, zebrafish, etc. have well-annotated genomes and a lot of well-established tools for further exploring and contributing to the knowledgebase around those animals. In contrast, valuable agricultural animals do not have this degree of data, tools, and resource development. This may change as the biomedical and agricultural research domains blur thanks to the movement of medication and infectious disease from farm animals into humans. In this spotlight, we’re happy to introduce a new data set that’s been added to BioGPS–the Sheep Gene Expression Atlas. Emily Clark, a researcher and Chancellor’s Fellow from The Roslin Institute, University of Edinburgh, kindly answered our questions.

  1. In one tweet or less, introduce us to the Sheep Gene Expression Atlas:
    A high resolution atlas of gene expression across tissues and cell types in sheep.

  2. Who is your target audience? How big is the community studying sheep genetics?
    Our target audience is the livestock research community, particularly those working on small ruminants. There is a large research community studying sheep genetics with research groups across the globe and an International Sheep Genomics Consortium (ISGC). The project is also a valuable resource for the Functional Annotation of Animal Genomes Consortium (FAANG) and represents the largest RNA-Seq FAANG dataset to date. Sheep are also an important non-human model and we hope the data will be useful for the mammalian genomics community more generally.

  3. It looks like the academic article on the Sheep Gene Expression Atlas was published a little more than a month ago in PLOS Genetics. How long has the team been working on the atlas before reaching this point?
    The sheep gene expression atlas was initiated in 2013, so we have been working on it for approximately 4 years. The first year involved tissue collection then the following years, library preparation and data analysis.

  4. In your paper, you illustrate the value of the Sheep Gene Expression Atlas by looking at Innate Immunity genes and the advantages of crossbreeding. What other types of research could this atlas contribute to? Antibody development for immunological assays? Prion disease research? Antibiotic use in animal husbandry?
    We hope that the atlas will now be used by researchers working in livestock genetics and genomics to link genotype to phenotype. It has potential uses in identifying targets for novel therapeutics, some of the dataset from the sheep expression atlas project has been used to identify genes relevant to resistance to mastitis, for example (Banos et al. 2017 The Genomic Architecture of Mastitis Resistance, BMC Genomics). Researchers at the Roslin Institute, interested in prion disease, are also looking at the expression of the gene PRNP (prion protein) across tissues using the sheep atlas dataset. The scale and scope of the dataset is such that it should contribute and provide information for multiple research projects and different fields in sheep but also other ruminants and livestock.

  5. Who is the team behind the Sheep Gene Expression Atlas?
    The sheep atlas project was led by David Hume and Alan Archibald who initiated the work. It was coordinated by Emily Clark, with bioinformatic support from Stephen Bush. The project involved a large team of people for sample collection at The Roslin Institute including farm technicians who also managed the animals for the project. We are also very grateful to Chunlei Wu and Cyrus Afrasiabi for their help making the sheep atlas dataset visualisable on the BioGPS platform.

  6. What is in store for the Sheep Gene Expression Atlas?
    Next we hope to use the data set for a global analysis of allele specific expression across tissues and cell types in sheep and we also have a comparative analysis of gene expression from a smaller subset of tissues in goat which we hope to release soon.

Thanks to Emily Clark, and the rest of the Sheep Gene Expression Atlas team, for sharing their high resolution Sheep Gene Expression Atlas with BioGPS. If you use the Sheep Gene Expression Atlas data set in your research, be sure to cite their publication:

Clark EL, Bush SJ, McCulloch MEB, Farquhar IL, Young R, Lefevre L, et al. (2017) A high resolution atlas of gene expression in the domestic sheep (Ovis aries). PLoS Genet 13(9): e1006997. https://doi.org/10.1371/journal.pgen.1006997

To search for your favorite genes in the Sheep Gene Expression Atlas, visit the sheep-specific portal at: http://biogps.org/sheepatlas/#goto=welcome

Prion protein expression in the Sheep Gene Expression Atlas in BioGPS

Happy Birthday Wikidata!

Posted by on Oct 26, 2017 in Wikidata | 0 comments

On Wikidata’s fifth birthday, we (the Gene Wiki team) offer our hearty congratulations!! It is amazing what has been achieved in such a short timespan. Wikidata has basically given us – and the larger research community – the gift of not having to maintain a core knowledge infrastructure. It has been taken care of (i.e. millions of SPARQL queries daily), so the research community can now focus on its core task, doing research.

Our project – the Gene Wiki project – started in 2008 with the objective of seeding Wikipedia with high quality basic biomedical facts, with the goal of crowdsourcing a gene-specific review article for every human gene. Shortly after the birth of Wikidata in 2012, we shifted our focus from Wikipedia to Wikidata. On Oct 6, 2014, we had our first milestone: all human genes had entities in Wikidata.

Since then, we have continued enriching Wikidata, not only with gene annotations from other species but also by extending the coverage to related concepts such as diseases, drugs, and chemical compounds. We have developed a Python library (Wikidata Integrator), which started as a biomedical library but is now applied in other domain areas.

We view the current landscape of biomedical data in Wikidata as basically consisting of three layers. The first layer is those resources which our team has directly loaded. We have focused on resources that are the most commonly used by researchers to form a solid foundation of biomedical knowledge. The second layer is formed by partner organizations with whom we’ve collaborated to help bring their resources into Wikidata. These partners bring key new data types, including information on genetic variants (from CIViC) and on biological pathways (from WikiPathways and Reactome). And finally, we are perhaps most excited when we discover efforts that are completely independent in origin but highly synergistic with our mission. This group includes James Hare’s effort to load environmental exposures from the CDC, and the amazing WikiCite team for loading bibliographic data from the scientific literature.

The sum total of all this work is a richly interconnected network of open biomedical knowledge. And this network enables us to ask and answer an impressively diverse set of biomedical questions (a growing list is documented at https://www.wikidata.org/wiki/User:ProteinBoxBot/SPARQL_Examples).
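
As a rough illustration of the kind of question this network can answer, the sketch below builds (but does not send) a SPARQL query that looks up the Wikidata item for a gene by its Entrez Gene ID. It assumes the public query service at https://query.wikidata.org/sparql and the Wikidata property P351 ("Entrez Gene ID"); the function names are our own and are not taken from the examples page linked above.

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint (assumed here; check the Wikidata docs).
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def entrez_gene_query(entrez_id):
    """Return a SPARQL query finding the Wikidata item for an Entrez Gene ID.

    P351 is the Wikidata property "Entrez Gene ID".
    """
    return (
        "SELECT ?gene ?geneLabel WHERE {\n"
        f'  ?gene wdt:P351 "{entrez_id}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def sparql_url(query):
    """Build a GET URL asking the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = entrez_gene_query("1017")  # 1017 is the Entrez id of human CDK2
print(query)
print(sparql_url(query))
```

The same query text can also be pasted into the query service's web interface, which is usually the easier way to explore before scripting anything.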

The Gene Wiki landscape with its three layers.

Looking ahead and as a birthday present, we can lift a corner of the veil on our imminent developments.

To improve robustness, we are developing stronger feedback loops to the experts curating primary sources. These feedback loops are based on validation reports such as the already existing constraint violations, but we are also looking into more complex constraint patterns where multiple statements are validated together using Shape Expressions. Currently, our bots run on a continuous integration platform called Jenkins, and we are working towards more automation of our efforts, such as driving the feedback loops and quality control.

We are excited to continue our work to make Wikidata the most comprehensive hub for open and linked biomedical data!

New MyVariant.info data release log and new data updates

Posted by on Sep 25, 2017 in clinvar, data release, dbnsfp, myvariant.info, snpeff, uniprot | 0 comments

Don't want to look through our blog posts to find previous information about data updates on MyVariant.info? Now you don't have to! Metadata about our data updates is now being logged in our docs at http://docs.myvariant.info/en/latest/doc/release_changes.html. From here on out, you can find the most up-to-date metadata on our data releases in our docs. These updates will be in the same easy-to-compare tables that you've seen in our blog posts. If you'd like the most recent metadata in JSON, you can get it from our metadata endpoint. Furthermore, you can obtain the most recent, assembly-specific metadata by specifying assembly=hg38 or assembly=hg19, as in this example for hg38.
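
A minimal sketch of building those metadata requests is shown below. The endpoint URL is an assumption on our part (the link in the paragraph above is not reproduced here), so check it against the MyVariant.info docs before relying on it; the helper name metadata_url is also just illustrative.

```python
from urllib.parse import urlencode

# Assumed location of the metadata endpoint; verify against the docs.
METADATA_ENDPOINT = "http://myvariant.info/v1/metadata"


def metadata_url(assembly=None):
    """Build a metadata URL, optionally for one assembly (hg19 or hg38)."""
    if assembly is None:
        return METADATA_ENDPOINT
    if assembly not in ("hg19", "hg38"):
        raise ValueError("assembly must be 'hg19' or 'hg38'")
    return METADATA_ENDPOINT + "?" + urlencode({"assembly": assembly})


print(metadata_url())
print(metadata_url("hg38"))
```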

Data updates as of September 7, 2017

While we're on the topic of new data releases, here are the most recent updates for GRCh37/hg19 variants:

source    last release  new release  # of variants in last release  # of variants in new release
ClinVar   2017-08       2017-09      310,349                        316,940
dbnsfp    3.4a          3.5a         82,366,524                     82,366,524
grasp     2.0.0.0       2.0.0.0      2,473,750                      2,651,542
snpeff    4.3k          4.3k         424,568,367                    581,983,125

And here are the updates for GRCh38/hg38 variants:

source    last release  new release  # of variants in last release  # of variants in new release
ClinVar   2017-08       2017-09      310,539                        317,142
dbnsfp    3.4a          3.5a         82,443,748                     82,443,748
snpeff    4.3k          4.3k         413,236,533                    413,237,509
uniprot   2017-03       2017-07      527,607                        527,607

As you can see, we've updated data from ClinVar, dbnsfp, grasp, snpeff, and uniprot. Visit and bookmark the MyVariant.info data release log and stay current on the newest MyVariant.info data releases.
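
For readers who want the change per source rather than the raw counts, the arithmetic is easy to script. The sketch below uses the GRCh37/hg19 numbers copied by hand from the table above; the helper name release_deltas is our own.

```python
# Variant counts copied from the GRCh37/hg19 table above:
# source -> (count in last release, count in new release)
hg19_counts = {
    "ClinVar": (310_349, 316_940),
    "dbnsfp": (82_366_524, 82_366_524),
    "grasp": (2_473_750, 2_651_542),
    "snpeff": (424_568_367, 581_983_125),
}


def release_deltas(counts):
    """Return {source: new_count - last_count} for each data source."""
    return {source: new - last for source, (last, new) in counts.items()}


deltas = release_deltas(hg19_counts)
for source, delta in deltas.items():
    print(f"{source}: {delta:+,}")
```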

The Sammies award and why it matters to Mark2Cure

Posted by on Sep 8, 2017 in citizen science, GenBank, mark2cure | 0 comments

In case you haven't heard, David Lipman and the GenBank team are in the running for the People's Choice Award of the Samuel J. Heyman Service to America Medals (#Sammies2017). Although Lipman and the GenBank team weren't featured in Medium.com or other news sources, they still made it to the final four.

At this point, many of you may be wondering why we're even talking about Lipman and the GenBank team on a discussion venue meant for Mark2Cure. Mark2Cure is a citizen science project that deals in biomedical literature, and doesn't involve BLAST or Lipman or GenBank, right?

But, when you think about how much of scientific progress is incremental, you begin to appreciate the impressive volume of preceding work. This is especially true if you work on a project like Mark2Cure.

Mark2Cure aims to enable citizen scientists to help mine information from the biomedical literature, which means that Mark2Cure would NOT exist if there weren't a massive volume of preceding and ongoing work in biomedical research. We've been able to build Mark2Cure because key information infrastructure was already in place--PubMed. Lipman launched PubMed in 1997 followed by PubMed Central in 2000. Without PubMed and the subsequent tools built for utilizing PubMed, identifying abstracts and pulling them into Mark2Cure would be more difficult.

As expected, PubMed now has over 27 million articles, up from over 26 million earlier this year.

Interestingly enough, the nomination of Lipman and the GenBank team for the 2017 Sammies only cursorily mentions PubMed Central in favor of focusing on GenBank and his contributions to infectious disease surveillance. Perhaps describing their work this way made it more accessible to anyone not in biomedical research. Unfortunately, their profile description doesn't adequately convey how important the infrastructure they've built is to modern biomedical research in the US, open science, and Mark2Cure.

Because the Mark2Cure community consists of people who've been impacted by Lipman and the GenBank team's work, I'll spell it out here:

For members of our community who like science and like being able to read scientific articles: PubMed Central (PMC) has been a central repository for research articles that ANYONE can access and read. Thanks to NIH leadership, publications resulting from research supported by the NIH must be deposited to PMC.

For members of our community who are afflicted or know someone who is afflicted by a rare genetic disorder: GenBank has been a central repository for DNA sequences and BLAST has been an important means of searching those sequences. Without a central repository for DNA sequences, it would be a lot more difficult for researchers to map and annotate functionality associated with those sequences, to draw comparisons on protein function across the different model organisms, and most importantly, to build on each other's work. Much of what we know (or will know) about rare disease genes or proteins comes from (or will come from) expanding on the work of researchers studying worms (or flies, mice, frogs, fish, and more) thanks to the knowledge sharing enabled by PubMed and GenBank.

For the members of our community who just like to help: Mark2Cure exists because of the sheer volume of incremental progress that is represented by the publication of biomedical research articles. Incremental progress isn't as exciting or fun to talk about as scientific 'breakthroughs', but in science a lot of incremental progress had to happen in order for these 'breakthroughs' to follow.

There is so much to sift through, and every contribution from our citizen scientists unlocks a bit more information buried in the text. The Mark2Cure dream is that in unlocking information from the text, you will be able to help with 'breakthroughs' in disease research.

Although I've been rambling about the importance of Lipman and the GenBank team's work to modern biomedical research, Mark2Cure would be nothing without the community of citizen scientists that contribute to it. In no way should this discussion of Lipman and team detract from this fact.

UMLS identifiers now available

Posted by on Sep 7, 2017 in mygene.info, UMLS | 0 comments

The Unified Medical Language System (UMLS) consolidates and standardizes health and biomedical vocabularies from several important resources to enable interoperability between computer systems. Now you can use the MyGene.info service to obtain UMLS Concept Unique Identifiers (CUIs).

Here are a few quick examples:

{
  "_id": "1017",
  "_score": 149.0478,
  "name": "cyclin dependent kinase 2",
  "symbol": "CDK2",
  "umls": {
    "cui": "C1332733"
  }
}
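
Once a hit like the one above has been retrieved, pulling out the CUI is a small dictionary lookup. The sketch below parses the example hit from a string rather than calling the service; the helper name extract_cui is our own, and it assumes the "umls" field has the single-object shape shown above.

```python
import json

# The example hit from above, as a JSON string.
hit_json = """
{
  "_id": "1017",
  "_score": 149.0478,
  "name": "cyclin dependent kinase 2",
  "symbol": "CDK2",
  "umls": {
    "cui": "C1332733"
  }
}
"""


def extract_cui(hit):
    """Return the UMLS CUI of a hit, or None if the hit has no "umls" field."""
    return hit.get("umls", {}).get("cui")


hit = json.loads(hit_json)
print(hit["symbol"], extract_cui(hit))
```

Returning None for hits without a "umls" field keeps the helper safe to use across a whole result list, where not every gene is guaranteed to carry a CUI.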

What are you waiting for? Try it out yourself.

Information extraction and the missing Mark2Cure module

Posted by on Aug 25, 2017 in biocuration, citizen science, information extraction, mark2cure, national park services | 0 comments

In our previous post, we asked readers, 'What is your preferred moniker?'. Here is the response:

Mark2Curator: 36%
Citizen Scientist: 36%
Contributor: 18%
"Anything BUT volunpeer": 10%

Although it may seem a little strange that researchers have been struggling to find an answer to the "What's in a name?" issue for discussing citizen science, this struggle is deeply representative of some of the important work biocurators do. "What's in a name? A citizen scientist by any other name still makes important contributions"

Researchers need a common vocabulary to be able to coherently exchange information, but settling on that vocabulary, and on how it is structured, is difficult. Without a common vocabulary, it is easy for scientists to miss research that is valuable to their field of study. Although it has yet to be seen how the citizen science research community will settle this issue, in biomedical research, biocurators help with that sort of determination. Biocurators help standardize terms and define the rules governing how terms are classified and organized. In doing so, they facilitate information quality control and exchange. Biocurators do all this and more.

Given that biocurators do very important, very tedious, and often very difficult work, one question we get quite a bit is:

"How is it possible to train citizen scientists to replace such important, skilled researchers?"

But this question is built on a fundamentally incorrect assumption about the goals of Mark2Cure. We KNOW biocurators do very important work, and that one of the most tedious and time-consuming things that they do is information extraction.

Information extraction can generally be broken down into three tasks:
1. Named Entity Recognition (identifying and classifying words/phrases in text)
2. Normalization (linking that text to an ontology)
3. Relationship Extraction (identifying the relationship between different entities).
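
To make the three tasks concrete, here is a deliberately tiny sketch that runs them in order on one invented sentence. It is a toy under loud assumptions, not how Mark2Cure or any biocuration pipeline works: the lexicon is hand-written, the disease identifier is a made-up placeholder, and "relationship extraction" is reduced to simple co-occurrence.

```python
# Toy lexicon: surface text -> (entity type, identifier).
# The CDK2 identifier is the UMLS CUI quoted elsewhere on this blog; the
# disease identifier is a made-up placeholder, not a real ontology id.
LEXICON = {
    "cdk2": ("gene", "C1332733"),
    "melanoma": ("disease", "PLACEHOLDER:0001"),
}


def recognize_entities(text):
    """Task 1, NER: find known words and classify them by type."""
    found = []
    for word in text.lower().replace(".", " ").replace(",", " ").split():
        if word in LEXICON:
            found.append({"text": word, "type": LEXICON[word][0]})
    return found


def normalize(entities):
    """Task 2, normalization: attach an identifier to each mention."""
    return [dict(e, id=LEXICON[e["text"]][1]) for e in entities]


def extract_relations(entities):
    """Task 3, relationship extraction: naively pair each gene with each
    disease mentioned in the same sentence (co-occurrence only)."""
    genes = [e for e in entities if e["type"] == "gene"]
    diseases = [e for e in entities if e["type"] == "disease"]
    return [(g["id"], "co-occurs with", d["id"]) for g in genes for d in diseases]


# An invented sentence, used only to exercise the three steps.
sentence = "CDK2 was measured in melanoma samples."
entities = normalize(recognize_entities(sentence))
print(entities)
print(extract_relations(entities))
```

Real text is far messier than a dictionary lookup can handle (synonyms, abbreviations, ambiguous terms), which is exactly why human readers are so valuable for these tasks.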

We want to train citizen scientists to help with this task, so that biocurators can apply their unique training towards solving problems in biomedical research analogous to the ones we're seeing in the citizen science field.

Since Mark2Cure is a citizen science project, the "What's in a name?" issue applies to us as well. Although our informal poll was only for fun, I was personally very happy with the results for two reasons:

1. I am a fan of wordplay, and I love that many users liked the term Mark2Curator--a term which blends Mark2Cure and biocurator. I love science puns.

2. Even if I'm reading too much into it, I like to think that our users picked 'citizen scientist' or 'contributor' because they feel that the help they provide to Mark2Cure is important--because it is.

If you've gotten this far, you are probably one of our many astute readers and may have noticed that information extraction was divided into THREE tasks, when Mark2Cure only has TWO. Where is the third task? Why is the missing task the step in between the first and the last?

The missing task, 'Normalization', is the task in between NER and Relationship Extraction. We started with NER because NER has been well-investigated so there was a solid foundation for us to build upon. We followed with the relationship extraction task because this would allow us to unlock some of the most difficult to access and valuable information in the text.

As for the Normalization task...it's currently being built by volunteers. Mark2Curators have been helping us investigate NER mappings to different ontologies, and a very talented programmer and machine learning expert has been busy building the Normalization module. But we could use more help. We need feedback on potential interfaces for how parts of the module might work. If you'd like to help with that, answer the poll in our newsletter.

Of note for our U.S.-based Mark2Curators over 65 years of age.

Did you know? The US National Park Service has a lifetime pass for seniors that will allow you to enter or park at US national parks for free or at a discounted rate. These passes only cost $10 now through August 27th. Starting August 28th, the price will go up to $80.

If you enjoy hiking, nature, or plan to visit any of our beautiful national parks, you may want to get your pass while it's still $10. In San Diego, the closest national park where you can purchase one in person is Cabrillo. To find the national park closest to you, visit the NPS's site. If you don't live near a park, but plan on visiting some in the future, you can purchase a pass by mail or online.

New Data Release for MyVariant.info 201708

Posted on Aug 10, 2017 in data release, myvariant.info

Another fresh data release for MyVariant.info is out! In this data release, we have updated the data from ClinVar and UniProtKB to their latest versions, and also added variant annotations from CIViC and Cancer Genome Interpreter. Here are more details.

Data Sources Updated

ClinVar was updated to its latest version (the same version for both the hg19 and hg38 assemblies). The variant annotations from UniProtKB were also updated to the latest version (hg38 only):

Some numbers for GRCh37/hg19 variants:

| Source | Last release | New release | # of variants in last release | # of variants in new release |
| --- | --- | --- | --- | --- |
| ClinVar | 2017-06 | 2017-08 | 307,101 | 310,280 |

Similarly, some numbers for GRCh38/hg38 variants:

| Source | Last release | New release | # of variants in last release | # of variants in new release |
| --- | --- | --- | --- | --- |
| UniProtKB | 2017-03 | 2017-07 | 477,711 | 527,607 |
| ClinVar | 2017-06 | 2017-08 | 307,286 | 310,577 |

ClinVar annotations are available under "clinvar" subfields for each annotated variant. UniProtKB annotations are available under "uniprot" subfields for each annotated variant. MyVariant.info aggregates annotations from ClinVar, dbSNP, dbNSFP and 17 other sources for each variant, so you can access them all in one request.

The total number of unique variants is now over 424M (424,524,227), slightly higher than the 424,519,520 in our previous release in June 2017. More details about the variant data we provide from MyVariant.info are always available in our documentation. Programmatic access to this information is available from our metadata endpoint (and hg38 metadata).
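For a quick sense of scale, the release-over-release changes can be worked out from the counts quoted in this post. The snippet below is only a small arithmetic sketch; the dictionary layout and the `growth` helper are illustrative names, not part of MyVariant.info.

```python
# Counts copied from this post: (variants in last release, variants in new release).
counts = {
    "ClinVar (hg19)": (307_101, 310_280),
    "UniProtKB (hg38)": (477_711, 527_607),
    "ClinVar (hg38)": (307_286, 310_577),
    "Total unique variants": (424_519_520, 424_524_227),
}


def growth(old, new):
    """Return the absolute and percentage change between two releases."""
    added = new - old
    return added, 100.0 * added / old


for label, (old, new) in counts.items():
    added, pct = growth(old, new)
    print(f"{label}: +{added:,} variants ({pct:.3f}%)")
```

Most of the change in this release comes from UniProtKB, which gains close to 50,000 annotated variants, while the total number of unique variants grows by only a few thousand.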


New Data Sources Added

In this data release, we added variant annotations from CIViC and Cancer Genome Interpreter (CGI), through our collaborations with the GA4GH VICC working group. Both provide extensive annotations of cancer-associated genetic variants. And more specifically:

CIViC is an open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer. The goal of CIViC is to enable precision medicine by providing an educational forum for the dissemination of knowledge and active discussion of the clinical significance of cancer genome alterations.

Cancer Genome Interpreter is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable. CGI relies on existing knowledge collected from several resources and on computational methods that annotate the alterations in a tumor according to distinct levels of evidence.

You can access the data from CIViC under the "civic" field. Note that the "civic" field is only available for hg19 variants. Here are a few query examples:

curl 'http://myvariant.info/v1/variant/chr11:g.534285C%3ET?fields=civic'  
curl 'http://myvariant.info/v1/variant/chr1:g.11187094G%3ET?fields=civic'  
curl 'http://myvariant.info/v1/variant/chr17:g.7578455C%3EA?fields=civic'  

You can access the data from Cancer Genome Interpreter under the "cgi" field. Note that the "cgi" field is only available for hg19 variants. Here are a few query examples:

curl 'http://myvariant.info/v1/variant/chr3:g.178936091G%3ET?fields=cgi'  
curl 'http://myvariant.info/v1/variant/chr3:g.41266109C%3ET?fields=cgi'  
curl 'http://myvariant.info/v1/variant/chr3:g.41266113C%3EG?fields=cgi'
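If you would rather build these URLs in a script than type them by hand, here is a minimal Python sketch that uses only the standard library. It reproduces the URL shape of the curl examples above, including the percent-encoding of ">" as %3E. The `variant_url` helper is a hypothetical name chosen for illustration and is not part of any MyVariant.info client.

```python
from urllib.parse import quote, urlencode

# Base URL as it appears in the curl examples in this post.
BASE_URL = "http://myvariant.info/v1"


def variant_url(hgvs_id, fields):
    """Build the annotation URL for one variant, limited to the given fields."""
    # Percent-encode the HGVS ID; ":" is kept as-is so the result matches
    # the form used in the curl examples (">" becomes %3E).
    encoded_id = quote(hgvs_id, safe=":")
    # urlencode percent-encodes the comma between field names as %2C.
    query = urlencode({"fields": ",".join(fields)})
    return f"{BASE_URL}/variant/{encoded_id}?{query}"


print(variant_url("chr11:g.534285C>T", ["civic"]))
print(variant_url("chr3:g.178936091G>T", ["cgi"]))
```

The printed URLs can be passed to curl or to any HTTP client.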

You can also do some combined queries just like other data sources we have:

curl 'http://myvariant.info/v1/variant/chr2:g.29443600G%3ET?fields=civic%2Ccgi'  
curl 'http://myvariant.info/v1/query?q=_exists_:civic%20AND%20_exists_:cgi&fields=civic%2Ccgi'  
curl 'http://myvariant.info/v1/query?q=civic.evidence_items.drugs.name:crizotinib&fields=civic'  
curl 'http://myvariant.info/v1/query?q=cgi.gene:ALK%20AND%20cgi.association:resistant&fields=cgi'  
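The combined queries follow a simple pattern: search terms joined with AND, plus a comma-separated list of fields to return. As a rough sketch, the URLs above can be assembled in Python as shown below. The `query_url` helper is a hypothetical name used for illustration, and it only builds the URL string without contacting the service.

```python
from urllib.parse import quote

# Base URL as it appears in the curl examples in this post.
BASE_URL = "http://myvariant.info/v1"


def query_url(terms, fields):
    """Build a /query URL that ANDs the search terms and limits returned fields."""
    # Join terms such as "cgi.gene:ALK" with AND, then percent-encode;
    # ":" is kept so the query reads the same as in the curl examples.
    q = quote(" AND ".join(terms), safe=":")
    # The comma between field names is encoded as %2C.
    wanted = quote(",".join(fields), safe="")
    return f"{BASE_URL}/query?q={q}&fields={wanted}"


print(query_url(["_exists_:civic", "_exists_:cgi"], ["civic", "cgi"]))
print(query_url(["cgi.gene:ALK", "cgi.association:resistant"], ["cgi"]))
```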

That's all! And as always, feel free to reach us at help@myvariant.info or @myvariantinfo if you have any questions or feedback.

Join Mark2Cure and Dazzle4Rare

Posted on Aug 4, 2017 in citizen science, Dazzle4Rare, mark2cure, polls

From August 13th to August 20th, Mark2Cure will be participating in the #Dazzle4Rare campaign to raise awareness for rare diseases. Did you know? About 10% of the population lives with a rare disease, and roughly 50% of rare diseases don’t have any sort of disease-specific foundation to support or research those diseases. See more interesting statistics about rare disease at Global Genes.
If you have a rare disease story you would like us to highlight for the campaign, please get in touch!

What's new in Mark2Cure?
The EDEM1 Entity Recognition mission is over 95% complete. Please help us finish it so we can launch the next one. If it seems like we’ve been quiet lately, it’s because we’ve been preparing for some major updates. If you’re curious about what’s in the pipeline or would like to preview/provide feedback for potential future interface designs, we’d LOVE to hear from you! Your feedback is how we improve! If not for our many marvelous Mark2Curators providing constructive criticism, Mark2Cure would be a clunkier and more difficult-to-use platform.

Speaking of our volunteers, citizen scientists, participants, contributors, volunpeers, and Mark2Curators…there was an interesting discussion earlier today within the citizen science community on the best way to address the amazing people who help make science happen. In fact, a bunch of researchers even wrote an interesting paper about the pros and cons of different terminology.

Which takes us to our current poll.

Lastly, there is an ongoing effort to increase discussion, collaboration, and cooperation within the citizen science (or whatever you wish to call it) community. This has led our friend Alice to introduce #CitSciStories. You may think that your contributions to science in your spare time are no big deal, but from the perspective of the researchers who rely on these contributions...you are amazing! Inspiring! Awesome beyond words! We love what you do and we love learning from you and getting to know you. If you'd like to share your story and inspire others to help science, please get in touch with Alice (@PenguinGalaxy). You can learn more about the #CitSciStories effort here.