Blog



CitSciMed Blitz has started!

Posted by on Feb 26, 2018 in citizen science, Cochrane Crowd, events, EyesOnAlz, mark2cure, Rare Disease Day | 0 comments

It's on! The CitSciMedblitz week of challenges has started!

If you missed the webinar detailing the three biomedical/health citizen science research projects, it is available for viewing on YouTube.
CitSciMedblitz webinar

You are welcome to participate in as many or as few of the challenges as you'd like, but a trophy will be awarded to the highest ranking participant across all THREE challenges. Read more about CitSciMedblitz from this post at citscibio.org

As for the challenges, up first (and going on now!) is the EyesOnAlz 24hr Catchathon. EyesOnAlz is an Alzheimer's disease-focused citizen science project investigating stalled blood in brain images. It has a lot of cool images/videos in need of review by citizen scientists and a lot of fun features. The challenge has only just started and will run to 7am PST (3pm GMT) tomorrow (Feb. 28th), so get in on it ASAP!

The Mark2Cure challenge will start at 7am PST (3pm GMT) on Wednesday, February 28th. It is a doubly-special day because the 28th is Rare Disease Day and we have had an incredibly inspirational weekend at the Sanford Burnham Prebys Rare Disease Day Symposium. We look forward to sharing rare disease stories from Mark2Curators and bringing awareness about these diseases as we tackle the literature around NGLY1 during this 24hr challenge.

Speaking of literature, our old friends at Cochrane Crowd are back with a lot of new features which you can explore during the Cochrane Screening Challenge. This challenge starts at 7am PST (3pm GMT) on Friday, March 2nd and runs for 24hrs.

CitSciMed Blitz, Rare Disease Day, and more

Posted by on Feb 2, 2018 in citizen science, Cochrane Crowd, events, EyesOnAlz, mark2cure, Rare diseases | 0 comments

It's finally February, which means it's time to prepare for Rare Disease Day 2018 and CitSciMedBlitz! This year's theme for Rare Disease Day continues last year's theme--research. According to RareDiseaseDay.org, patients are not only subjects but also proactive actors in research--and we couldn't agree more! Mark2Cure would not be where it is now without the inspiration, contributions, and drive from our partners and contributors in the rare disease community. Mark2Curators have inspired us with their generosity, perseverance, curiosity, and overall intellectual voraciousness--and for us, Rare Disease Day is an opportunity to talk about the diseases that the Mark2Cure community cares about--and not just NGLY1-deficiency. If there is a disease that you care about that you'd like us to highlight for Rare Disease Day, please get in touch.

Patients are not only subjects but also proactive actors in research.
Patients kick start research
Patients drive research
Patients organize research
Patients proactively provide data

The increasing role of patients in research is not limited to Rare Disease
As citizen science becomes increasingly popular in biomedical research, patients and care providers are becoming increasingly important partners for disease research in general. And, as many of you have pointed out--we will all be patients at some point in our lives so it's nice to be able to actively contribute to disease research.

In addition to helping to organize the knowledge surrounding NGLY1-deficiency, patients and citizen scientists have been making important contributions to Alzheimer's disease research and contributing to health evidence--all of which brings us back to CitSciMed Blitz!

CitSciMed Blitz is coming

Similar to last year's MedLitBlitz, there will be prizes for the top contributors to all THREE platforms. Only participation during the 24hr challenges will count towards the prize. However, you are welcome to register and complete the training for the other platforms prior to the event if you'd like. Learn more about the event and the other platforms here.

Gene Wiki Data, BioThings, Mark2Cure, & more! A review of 2017 in the Su Lab

Posted by on Jan 1, 2018 in annual summary, BD2K, BioGPS, BioThings, mark2cure, Su lab, sulab, tsri | 0 comments

Rather than summarize the 2017 progress on each project in separate, project-specific posts, I’m putting it all in one post this year for one important reason–recruitment! As with any academic research group, we expect to see a number of our talented team members move on to bigger and better things. 2017 did not disappoint (well, it was disappointing to lose such talent, but we’re also happy that these awesome people are finding amazing new opportunities).

In saying farewell to 2017, we also bid farewell and best wishes to team members who have really pushed the boundaries of science during their time here including:

  • Ramya Gamini–who takes her bioinformatics expertise to Pfizer where she will continue as a postdoctoral associate.
  • Tim Putman–who takes his bioinformatics chops to Oregon Health & Science University (OHSU) as a Bioinformatics Software Developer. There, he will continue to push for open data and data reuse as he develops the Monarch web application, aggregates and structures microbial model organism data in The Monarch Initiative infrastructure, and develops tools and aggregates data in support of the NCATS Translator Project.
  • Benjamin Good–the talented Assistant Professor who drove the Gene Wiki / Wiki Data projects in the Su Lab as well as a number of citizen science projects such as Science Game Lab, theCure, Mark2Cure, and more!
  • Max Nanis–research programmer, developer, artist. His twitter feed is like a tribute to chaos. His success anywhere is a certainty.
  • Sebastian Burgstaller-Muehlbacher–his profile on the Su Lab site will forever remain a fictional heavy metal tribute to science–Rock on!!!

After losing so much talent in 2017, we were fortunate enough to recruit some new R&D rock stars to the lab and are happy to welcome:

  • Byung Ryul Jeon–a visiting scientist, accomplished doctor, and scholar interested in broadening his knowledge and honing his research skills
  • Laura Hughes–data scientist and web developer with a knack for purposeful and deliberate data visualization. Laura came to us after applying her skills to make the world a better place at her previous job with USAID.
  • Alejo Covian–a full stack python developer shrouded in mystery.

Associate Professor Chunlei Wu not only provides friendly expertise and helpful guidance to new recruits–he’s also a leader when it comes to holiday fashion.

If you have some computational skills and would like to use them to push research forward, consider joining our lab! We have a lot of great projects which could use your help! Speaking of great projects, let’s start with the ones led by Chunlei (pictured left).

Chunlei has been the driving force behind BioGPS, MyGene.info, MyVariant.info, BioThings.io, and more. As recently announced, he will be the TSRI site principal investigator for the National Center for Data to Health (CD2H).

The newly created center will be led by researchers from OHSU (Tim has moved on from the Su Lab, but he hasn’t escaped our reach!!!!), Northwestern University, University of Washington, Johns Hopkins University School of Medicine, and Sage Bionetworks, together with TSRI, Washington University in St. Louis, the University of Iowa and The Jackson Laboratory.

For their part, Chunlei and his team will do what they do best–building high-performance, scalable data access infrastructure and defining community best practices for data processing and software implementations.
 
 

In 2017, BioGPS saw the launch of a new sheep atlas portal along with a corresponding research paper, and BioGPS made it to the top of the weekly list on Labworm. After we demonstrated that the MyGene.info framework could be retooled for variants (MyVariant.info), the framework was abstracted into the BioThings SDK. 2017 was a busy year for these related projects:

  • Kevin presented about MyVariant.info at Heart BD2K site visit (2017.04.20)
  • Chunlei presented to Global Alliance for Genomics and Health (GA4GH), Variant Interpretation for Cancer Consortium (VICC) working group (2017.05.09)
  • Chunlei presented a poster on the BioThings SDK at ISMB/BOSC (2017.07.21 – 2017.07.25)
  • Kevin presented a poster on the BioThings Explorer at ISMB/BOSC
  • And the BioThings Explorer manuscript titled, “Cross-linking BioThings APIs through JSON-LD to facilitate knowledge exploration”, has been submitted

2017 has also been busy for the Gene Wiki / Wiki Data team here in the Su Lab as:

  • the Gene Wiki / Wiki Data Team shared their work at the Biocuration 2017 conference (2017.03.26 – 2017.03.29). Both Sebastian & Tim gave talks, while Greg and Nuria had poster presentations.
  • Greg also presented at Heart BD2K site visit (2017.04.20)
  • Andrew presented at Bioinformatics Open Source Conference (2017.07.22 – 2017.07.23)
  • and Andrew was also featured on the Wikimedia Research Showcase Webcast (2017.08.23)

In case you missed it, the renewal grant application for this project was also shared in a blog post this year which gives the big picture idea of the project moving forward.

While we’re on the subject of crowdsourced science, Mark2Cure had a busy year as well.

  • Mark2Cure hosted the #Mark4Rare event for Rare Disease Day (during the week prior to Rare Disease Day)
  • Mark2Cure organized/hosted the Citizen Science Expo at La Jolla Library (2017.03.11)
  • Ginger presented about Mark2Cure at Heart BD2K site visit (2017.04.20)
  • Mark2Cure was featured on SciStarter (2017.04.27)
  • Mark2Cure had a joint webinar with Cochrane Crowd (2017.05.08) followed by a joint event (#MedLitBlitz) which included a Mark2Curathon (2017.05.11)
  • Ginger presented Mark2Cure at the 2017 Citizen Science Association Conference during the citizen science for biomedical research panel (2017.05.18)
  • Ginger presented a poster on Mark2Cure at the CSA 2017 conference (2017.05.18)
  • Max delivered a Project Slam about Mark2Cure at the CSA 2017 conference (2017.05.18)
  • Mark2Cure joined CitSciBio at a table in the St. Paul Science Museum for the Citizen Science Festival (2017.05.20)
  • Mark2Cure paneled for #CitSciChat (2017.07.19)
  • Mark2Cure joined #Dazzle4rare (2017.08.13 – 2017.08.19)

In addition to these projects, members of the Su Lab have been busy advancing research in protein folding/microscopy analysis methods, osteoarthritis, and data-driven drug re-purposing methodologies–all while managing to have an epic time.

May the FAIR-principles-of-open-data be with you

New default behavior for ‘species’ parameter

Posted by on Nov 24, 2017 in mygene.info, species | 0 comments

The MyGene.info API supports both "/gene" and "/query" endpoints. On its /query endpoint, an optional species parameter allows users to pass one or multiple species (as common species names or taxonomy ids) to filter down the query results.
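As a minimal sketch of how such a request might be assembled, the snippet below builds a /query URL with an optional species filter using only the Python standard library. The helper name build_query_url is ours for illustration and is not part of any MyGene.info client. The snippet only constructs the URL; it does not send the request.

```python
from urllib.parse import urlencode

# Base URL taken from the example queries in this post.
QUERY_ENDPOINT = "http://mygene.info/v3/query"


def build_query_url(q, species=None):
    """Return a /query URL for the term `q`, optionally filtered by species.

    `species` may be a single common name or taxonomy id, or a list of
    them; multiple values are joined with commas. (Hypothetical helper.)
    """
    params = {"q": q}
    if species is not None:
        if isinstance(species, (list, tuple)):
            species = ",".join(str(s) for s in species)
        params["species"] = str(species)
    # safe="," keeps the comma-separated species list readable in the URL.
    return QUERY_ENDPOINT + "?" + urlencode(params, safe=",")


print(build_query_url("cdk2"))
print(build_query_url("cdk2", species=["human", "mouse", "rat"]))
print(build_query_url("cdk2", species=9606))
```

Leaving species as None sends no species parameter at all, so the service's default applies.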

Previously, the default species was set to "human,mouse,rat". This meant that, unless you explicitly specified other values for the species parameter, your query results (e.g. "q=cdk2") might look like this:

http://mygene.info/v3/query?q=cdk2

{
  "max_score": 457.24393,
  "took": 11,
  "total": 32,
  "hits": [
    {
      "_id": "1017",
      "_score": 457.24393,
      "entrezgene": 1017,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9606
    },
    {
      "_id": "12566",
      "_score": 329.98914,
      "entrezgene": 12566,
      "name": "cyclin-dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10090
    },
    {
      "_id": "362817",
      "_score": 279.2216,
      "entrezgene": 362817,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10116
    },
    {
      "_id": "143384",
      "_score": 22.91444,
      "entrezgene": 143384,
      "name": "CDK2 associated cullin domain 1",
      "symbol": "CACUL1",
      "taxid": 9606
    },
    {
      "_id": "52004",
      "_score": 20.558783,
      "entrezgene": 52004,
      "name": "CDK2-associated protein 2",
      "symbol": "Cdk2ap2",
      "taxid": 10090
    },
    {
      "_id": "78832",
      "_score": 17.98903,
      "entrezgene": 78832,
      "name": "CDK2 associated, cullin domain 1",
      "symbol": "Cacul1",
      "taxid": 10090
    },
    {
      "_id": "365493",
      "_score": 14.489841,
      "entrezgene": 365493,
      "name": "CDK2-associated, cullin domain 1",
      "symbol": "Cacul1",
      "taxid": 10116
    },
    {
      "_id": "13445",
      "_score": 13.166027,
      "entrezgene": 13445,
      "name": "CDK2 (cyclin-dependent kinase 2)-associated protein 1",
      "symbol": "Cdk2ap1",
      "taxid": 10090
    },
    {
      "_id": "690181",
      "_score": 8.355364,
      "entrezgene": 690181,
      "name": "similar to S-phase kinase-associated protein 1A (Cyclin A/CDK2-associated protein p19) (p19A) (p19skp1)",
      "symbol": "LOC690181",
      "taxid": 10116
    },
    {
      "_id": "690646",
      "_score": 7.2449207,
      "entrezgene": 690646,
      "name": "similar to S-phase kinase-associated protein 2 (F-box protein Skp2) (Cyclin A/CDK2-associated protein p45) (F-box/WD-40 protein 1) (FWD1)",
      "symbol": "LOC690646",
      "taxid": 10116
    }
  ]
}

With no species parameter specified, the query matched 32 genes (the response above lists the first 10), all from human, mouse, or rat, with a match to cdk2 in some field (such as symbol or name). You could return the matched genes from all species by specifying species=all in the query.
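To check which species the hits come from, you could tally the taxid field of each hit. The sketch below does this for an abbreviated copy of the response above (only the first four hits, with fewer fields). The function name is illustrative, not part of the API.

```python
from collections import Counter

# Abbreviated copy of the response shown above: first four hits only.
response = {
    "total": 32,
    "hits": [
        {"_id": "1017", "symbol": "CDK2", "taxid": 9606},
        {"_id": "12566", "symbol": "Cdk2", "taxid": 10090},
        {"_id": "362817", "symbol": "Cdk2", "taxid": 10116},
        {"_id": "143384", "symbol": "CACUL1", "taxid": 9606},
    ],
}


def hits_per_taxid(resp):
    """Count how many of the returned hits belong to each taxonomy id."""
    return Counter(hit["taxid"] for hit in resp.get("hits", []))


counts = hits_per_taxid(response)
print(dict(counts))
```

Here 9606, 10090, and 10116 are the taxonomy ids for human, mouse, and rat. Note that the tally covers only the hits in the response, not all of the genes counted in "total".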

While "human,mouse,rat" was a useful default for users who only need to query genes in these common species, it could cause confusion for query terms relevant only to other species. For example, a query like q=F1RW06 previously returned no hits instead of the matching pig CDK3 gene, unless you added "species=pig" or "species=all".

Now, based on feedback from many users, the default "species" behavior has been set to "all". The same "q=cdk2" query will now return matched genes from all species:

http://mygene.info/v3/query?q=cdk2

{
  "max_score": 393.0346,
  "took": 115,
  "total": 611,
  "hits": [
    {
      "_id": "1017",
      "_score": 393.0346,
      "entrezgene": 1017,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9606
    },
    {
      "_id": "12566",
      "_score": 327.42117,
      "entrezgene": 12566,
      "name": "cyclin-dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10090
    },
    {
      "_id": "362817",
      "_score": 270.2593,
      "entrezgene": 362817,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 10116
    },
    {
      "_id": "100925631",
      "_score": 268.31903,
      "entrezgene": 100925631,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9305
    },
    {
      "_id": "100981695",
      "_score": 268.31903,
      "entrezgene": 100981695,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9597
    },
    {
      "_id": "105864946",
      "_score": 268.31903,
      "entrezgene": 105864946,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 30608
    },
    {
      "_id": "ENSMEUG00000005552",
      "_score": 268.31903,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 9315
    },
    {
      "_id": "103465316",
      "_score": 268.31903,
      "entrezgene": 103465316,
      "name": "cyclin dependent kinase 2",
      "symbol": "cdk2",
      "taxid": 8081
    },
    {
      "_id": "100117828",
      "_score": 268.31903,
      "entrezgene": 100117828,
      "name": "cyclin dependent kinase 2",
      "symbol": "Cdk2",
      "taxid": 7425
    },
    {
      "_id": "101544122",
      "_score": 268.31903,
      "entrezgene": 101544122,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "taxid": 42254
    }
  ]
}

We think this changed default behavior for the "species" parameter will give more intuitive results for most users. You can easily mimic the old behavior by explicitly specifying species=human,mouse,rat in the query. It's also worth mentioning that, as before, our customized weighting function makes sure that the human, mouse, and rat genes with the same matches (e.g. the same symbol match of "cdk2") always appear before those from other species.
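If you have existing scripts that relied on the old default, one option is to pin the species explicitly. The sketch below is a hypothetical helper (not something MyGene.info provides) that adds species=human,mouse,rat to a query URL only when no species parameter is present.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The previous default described in this post.
LEGACY_SPECIES = "human,mouse,rat"


def with_legacy_species(url):
    """Return `url` with the old default species added if none is set."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "species" for key, _ in params):
        params.append(("species", LEGACY_SPECIES))
    # safe="," keeps the species list as plain commas in the URL.
    return urlunsplit(parts._replace(query=urlencode(params, safe=",")))


print(with_legacy_species("http://mygene.info/v3/query?q=cdk2"))
print(with_legacy_species("http://mygene.info/v3/query?q=F1RW06&species=pig"))
```

A URL that already names a species is returned unchanged, so explicit choices are never overridden.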

As always, let us know if you have any comments or concerns via help@mygene.info or @mygene.info.

BioGPS Spotlight on the Sheep Gene Expression Atlas

Posted by on Nov 6, 2017 in BioGPS, data release, spotlight | 0 comments

In BioGPS, there are a number of InterMine-based model organism plugins (and to a lesser extent model organism data sets) which allow users to explore gene expression in organisms typically studied in biomedical research. Model organisms such as mice, rats, flies, worms, zebrafish, etc. have well-annotated genomes and a lot of well-established tools for further exploring and contributing to the knowledgebase around those animals. In contrast, valuable agricultural animals do not have this degree of data, tools, and resource development. This may change as the biomedical and agricultural research domains blur thanks to the movement of medication and infectious disease from farm animals into humans. In this spotlight, we’re happy to introduce a new data set that’s been added to BioGPS–the Sheep Gene Expression Atlas. Emily Clark, a researcher and Chancellor’s Fellow from The Roslin Institute, University of Edinburgh, kindly answered our questions.

  1. In one tweet or less, introduce us to the Sheep Gene Expression Atlas:
    A high resolution atlas of gene expression across tissues and cell types in sheep.

  2. Who is your target audience? How big is the community studying sheep genetics?
    Our target audience is the livestock research community, particularly those working on small ruminants. There is a large research community studying sheep genetics with research groups across the globe and an International Sheep Genomics Consortium (ISGC). The project is also a valuable resource for the Functional Annotation of Animal Genomes Consortium (FAANG) and represents the largest RNA-Seq FAANG dataset to date. Sheep are also an important non-human model and we hope the data will be useful for the mammalian genomics community more generally.

  3. It looks like the academic article on the Sheep Gene Expression Atlas was published a little more than a month ago in PLOS Genetics. How long has the team been working on the atlas before reaching this point?
    The sheep gene expression atlas was initiated in 2013, so we have been working on it for approximately 4 years. The first year involved tissue collection then the following years, library preparation and data analysis.

  4. In your paper, you illustrate the value of the Sheep Gene Expression Atlas by looking at Innate Immunity genes and the advantages of crossbreeding. What other types of research could this atlas contribute to? Antibody development for immunological assays? Prion disease research? Antibiotic use in animal husbandry?
    We hope that the atlas will now be used by researchers working in livestock genetics and genomics to link genotype to phenotype. It has potential uses in identifying targets for novel therapeutics, some of the dataset from the sheep expression atlas project has been used to identify genes relevant to resistance to mastitis, for example (Banos et al. 2017 The Genomic Architecture of Mastitis Resistance, BMC Genomics). Researchers at the Roslin Institute, interested in prion disease, are also looking at the expression of the gene PRNP (prion protein) across tissues using the sheep atlas dataset. The scale and scope of the dataset is such that it should contribute and provide information for multiple research projects and different fields in sheep but also other ruminants and livestock.

  5. Who is the team behind the Sheep Gene Expression Atlas?
    The sheep atlas project was led by David Hume and Alan Archibald who initiated the work. It was coordinated by Emily Clark, with bioinformatic support from Stephen Bush. The project involved a large team of people for sample collection at The Roslin Institute including farm technicians who also managed the animals for the project. We are also very grateful to Chunlei Wu and Cyrus Afrasiabi for their help making the sheep atlas dataset visualisable on the BioGPS platform.

  6. What is in store for the Sheep Gene Expression Atlas?
    Next we hope to use the data set for a global analysis of allele specific expression across tissues and cell types in sheep and we also have a comparative analysis of gene expression from a smaller subset of tissues in goat which we hope to release soon.

Thanks to Emily Clark, and the rest of the Sheep Gene Expression Atlas team, for sharing their high resolution Sheep Gene Expression Atlas with BioGPS. If you use the Sheep Gene Expression Atlas data set in your research, be sure to cite their publication:

Clark EL, Bush SJ, McCulloch MEB, Farquhar IL, Young R, Lefevre L, et al. (2017) A high resolution atlas of gene expression in the domestic sheep (Ovis aries). PLoS Genet 13(9): e1006997. https://doi.org/10.1371/journal.pgen.1006997

To search for your favorite genes in the Sheep Gene Expression Atlas, visit the sheep-specific portal at: http://biogps.org/sheepatlas/#goto=welcome

Prion protein expression in the Sheep Gene Expression Atlas in BioGPS

Happy Birthday Wikidata!

Posted by on Oct 26, 2017 in Wikidata | 0 comments

On Wikidata’s fifth birthday, we (the Gene Wiki team) offer our hearty congratulations!! It is amazing what has been achieved in such a short timespan. Wikidata has basically given us – and the larger research community – the gift of not having to maintain a core knowledge infrastructure. It has been taken care of (i.e. millions of SPARQL queries daily), so the research community can now focus on its core task, doing research.

Our project – the Gene Wiki project – started in 2008 with the objective of seeding Wikipedia with high quality basic biomedical facts, with the goal of crowdsourcing a gene-specific review article for every human gene. Shortly after the birth of Wikidata in 2012, we shifted our focus from Wikipedia to Wikidata. On Oct 6, 2014, we reached our first milestone: all human genes had entities in Wikidata.

Since then, we have continued enriching Wikidata, not only with gene annotations from other species but also with related concepts such as diseases, drugs, and chemical compounds. We have developed a Python library (Wikidata Integrator), which started as a biomedical library but is now applied in other domain areas.

We view the current landscape of biomedical data in Wikidata as basically consisting of three layers. The first layer is those resources which our team has directly loaded. We have focused on resources that are the most commonly used by researchers to form a solid foundation of biomedical knowledge. The second layer is formed by partner organizations with whom we’ve collaborated to help bring their resources into Wikidata. These partners bring key new data types, including information on genetic variants (from CIViC) and on biological pathways (from WikiPathways and Reactome). And finally, we are perhaps most excited when we discover efforts that are completely independent in origin but highly synergistic with our mission. This group includes James Hare’s effort to load environmental exposures from the CDC, and the amazing WikiCite team for loading bibliographic data from the scientific literature.

The sum total of all this work is a richly interconnected network of open biomedical knowledge. And this network enables us to ask and answer an impressively diverse set of biomedical questions (a growing list is documented at https://www.wikidata.org/wiki/User:ProteinBoxBot/SPARQL_Examples).

The genewiki landscape with its three layers.

Looking ahead and as a birthday present, we can lift a corner of the veil on our imminent developments.

To improve robustness, we are developing stronger feedback loops to experts curating primary sources. These feedback loops are based on validation reports such as the already existing constraint violations, but we are also looking into more complex constraint patterns where multiple statements are validated together using Shape Expressions. Currently, our bots run on a continuous integration platform called Jenkins, and we are working towards more automation of our efforts, such as driving the feedback loops and quality control.

We are excited to continue our work to make Wikidata the most comprehensive hub for open and linked biomedical data!

New MyVariant.info data release log and new data updates

Posted by on Sep 25, 2017 in clinvar, data release, dbnsfp, myvariant.info, snpeff, uniprot | 0 comments

Don't want to look through our blog posts to find previous information about data updates on MyVariant.info? Now you don't have to! Metadata about our data updates is now logged in our docs at http://docs.myvariant.info/en/latest/doc/release_changes.html. From here on out, you can find the most up-to-date metadata on our data releases there. These updates will be in the same easy-to-compare tables that you've seen in our blog posts. If you'd like the most recent metadata in JSON, you can get it from our metadata endpoint. Furthermore, you can obtain the most recent, assembly-specific metadata by specifying assembly=hg38 or assembly=hg19, as in this example for hg38.
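For scripted access, a request URL for the metadata endpoint could be put together as in the sketch below. The endpoint address used here (http://myvariant.info/v1/metadata) is our assumption, since this post does not spell it out, and the helper name is made up for illustration. The snippet only builds the URL and does not contact the service.

```python
from urllib.parse import urlencode

# Assumed location of the metadata endpoint; not stated in this post.
METADATA_ENDPOINT = "http://myvariant.info/v1/metadata"

# The two assembly values mentioned in this post.
SUPPORTED_ASSEMBLIES = ("hg19", "hg38")


def metadata_url(assembly=None):
    """Return the metadata URL, optionally restricted to one assembly."""
    if assembly is None:
        return METADATA_ENDPOINT
    if assembly not in SUPPORTED_ASSEMBLIES:
        raise ValueError(
            f"assembly must be one of {SUPPORTED_ASSEMBLIES}, got {assembly!r}"
        )
    return METADATA_ENDPOINT + "?" + urlencode({"assembly": assembly})


print(metadata_url())
print(metadata_url("hg38"))
```

Rejecting unknown assembly names up front avoids sending a request that the service may silently interpret in some other way.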

Data updates as of September 7, 2017

While we're on the topic of new data releases, here are the most recent updates for GRCh37/hg19 variants:

Source    Last release   New release   # of variants in last release   # of variants in new release
ClinVar   2017-08        2017-09       310,349                         316,940
dbnsfp    3.4a           3.5a          82,366,524                      82,366,524
grasp     2.0.0.0        2.0.0.0       2,473,750                       2,651,542
snpeff    4.3k           4.3k          424,568,367                     581,983,125

And here are the updates for GRCh38/hg38 variants:

Source    Last release   New release   # of variants in last release   # of variants in new release
ClinVar   2017-08        2017-09       310,539                         317,142
dbnsfp    3.4a           3.5a          82,443,748                      82,443,748
snpeff    4.3k           4.3k          413,236,533                     413,237,509
uniprot   2017-03        2017-07       527,607                         527,607

As you can see, we've updated data from ClinVar, dbnsfp, grasp, snpeff, and uniprot. Visit and bookmark the MyVariant.info data release log and stay current on the newest MyVariant.info data releases.
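If you want to see at a glance how much each source changed, the counts can be compared with a few lines of Python. The sketch below uses the GRCh37/hg19 numbers copied from the first table above. The variable and function names are ours.

```python
# Variant counts for GRCh37/hg19, copied from the first table above:
# source -> (count in last release, count in new release).
hg19_counts = {
    "ClinVar": ("310,349", "316,940"),
    "dbnsfp": ("82,366,524", "82,366,524"),
    "grasp": ("2,473,750", "2,651,542"),
    "snpeff": ("424,568,367", "581,983,125"),
}


def to_int(text):
    """Convert a count written with thousands separators to an integer."""
    return int(text.replace(",", ""))


def count_changes(counts):
    """Return new-minus-old variant counts for each source."""
    return {
        source: to_int(new) - to_int(old)
        for source, (old, new) in counts.items()
    }


changes = count_changes(hg19_counts)
for source, delta in changes.items():
    print(f"{source}: {delta:+,}")
```

A change of zero, as for dbnsfp here, means the number of variants stayed the same even though the release version changed.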

The Sammies award and why it matters to Mark2Cure

Posted by on Sep 8, 2017 in citizen science, GenBank, mark2cure | 0 comments

In case you haven't heard, David Lipman and the GenBank team are in the running for the People's Choice Award of the Samuel J. Heyman Service to America Medals (#Sammies2017). Although Lipman and the GenBank team weren't featured in Medium.com or other news sources, they still made it to the final four.

At this point, many of you may be wondering why we're even talking about Lipman and the GenBank team on a discussion venue meant for Mark2Cure. Mark2Cure is a citizen science project that deals in biomedical literature, and doesn't involve BLAST or Lipman or GenBank, right?

But, when you think about how much of scientific progress is incremental, you begin to appreciate the impressive volume of preceding work. This is especially true if you work on a project like Mark2Cure.

Mark2Cure aims to enable citizen scientists to help mine information from the biomedical literature, which means that Mark2Cure would NOT exist if there wasn't a massive volume of preceding and ongoing work in biomedical research. We've been able to build Mark2Cure because key information infrastructure was already in place--PubMed. Lipman launched PubMed in 1997 followed by PubMed Central in 2000. Without PubMed and the subsequent tools built for utilizing PubMed, identifying abstracts and pulling them into Mark2Cure would be more difficult.
As expected, PubMed now has over 27 million articles, up from over 26 million earlier this year. Interestingly enough, the 2017 Sammies nomination for Lipman and the GenBank team only cursorily mentions PubMed Central in favor of focusing on GenBank and his contributions to infectious disease surveillance. Perhaps describing their work this way made it more accessible to anyone not in biomedical research. Unfortunately, their profile description doesn't adequately convey how important the infrastructure they've built is to modern biomedical research in the US, open science, and Mark2Cure.

Because the Mark2Cure community consists of people who've been impacted by Lipman and the GenBank team's work, I'll spell it out here:

For members of our community who like science and like being able to read scientific articles: PubMed Central (PMC) has been a central repository for research articles that ANYONE can access and read. Thanks to NIH leadership, publications resulting from research supported by the NIH must be deposited to PMC.

For members of our community who are afflicted or know someone who is afflicted by a rare genetic disorder: GenBank has been a central repository for DNA sequences and BLAST has been an important means of searching those sequences. Without a central repository for DNA sequences, it would be a lot more difficult for researchers to map and annotate functionality associated with those sequences, to draw comparisons on protein function across the different model organisms, and most importantly, to build on each other's work. Much of what we know (or will know) about rare disease genes or proteins comes from (or will come from) expanding on the work of researchers studying worms (or flies, mice, frogs, fish, and more) thanks to the knowledge sharing enabled by PubMed and GenBank.

For the members of our community who just like to help: Mark2Cure exists because of the sheer volume of incremental progress that is represented by the publication of biomedical research articles. Incremental progress isn't as exciting or fun to talk about as scientific 'breakthroughs', but in science a lot of incremental progress had to happen in order for these 'breakthroughs' to follow.

There is so much to sift through, and every contribution from our citizen scientists unlocks a bit more information buried in the text. The Mark2Cure dream is that in unlocking information from the text, you will be able to help with 'breakthroughs' in disease research.

Although I've been rambling about the importance of Lipman and the GenBank team's work to modern biomedical research, Mark2Cure would be nothing without the community of citizen scientists that contribute to it. In no way should this discussion of Lipman and team detract from this fact.

UMLS identifiers now available

Posted by on Sep 7, 2017 in mygene.info, UMLS | 0 comments

The Unified Medical Language System (UMLS) consolidates and standardizes health and biomedical vocabularies from several important resources to enable interoperability between computer systems. Now you can use the MyGene.info service to obtain UMLS Concept Unique Identifiers (CUIs).

Here are a few quick examples:

{
      "_id": "1017",
      "_score": 149.0478,
      "name": "cyclin dependent kinase 2",
      "symbol": "CDK2",
      "umls": {
        "cui": "C1332733"
      }
}
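If you are working with a hit like the one above in Python, a small helper can pull out the CUI. This is only a sketch: the helper name is made up, and it assumes the umls field is either absent or a single object shaped like the example.

```python
# The example hit shown above, as a Python dictionary.
hit = {
    "_id": "1017",
    "_score": 149.0478,
    "name": "cyclin dependent kinase 2",
    "symbol": "CDK2",
    "umls": {"cui": "C1332733"},
}


def extract_cui(gene_hit):
    """Return the UMLS CUI from a hit, or None if the hit has none.

    Assumes `umls`, when present, is a single object with a `cui` key.
    """
    umls = gene_hit.get("umls")
    if not isinstance(umls, dict):
        return None
    return umls.get("cui")


print(extract_cui(hit))
print(extract_cui({"_id": "no-umls-example"}))
```

Returning None for hits without a umls field lets calling code skip genes that have no CUI instead of failing on them.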

What are you waiting for? Try it out yourself.

Information extraction and the missing Mark2Cure module

Posted by on Aug 25, 2017 in biocuration, citizen science, information extraction, mark2cure, national park services | 0 comments

In our previous post, we asked readers, 'What is your preferred moniker?'. Here is the response:

Mark2Curator: 36%
Citizen Scientist: 36%
Contributor: 18%
"Anything BUT volunpeer": 10%

Although it may seem a little strange that researchers have been struggling to find an answer to the "What's in a name?" issue for discussing citizen science, this struggle is deeply representative of some of the important work biocurators do. "What's in a name? A citizen scientist by any other name still makes important contributions"

Researchers need a common vocabulary to be able to coherently exchange information, but settling on that vocabulary, and on how it is structured, is difficult. Without a common vocabulary, it is easy for scientists to miss research that is valuable to their field of study. Although it has yet to be seen how the citizen science research community will settle this issue, in biomedical research, biocurators help with that sort of determination. Biocurators help standardize terms and define the rules governing how terms are classified and organized. In doing so, they facilitate information quality control and exchange. Biocurators do all this and more.

Given that biocurators do very important, very tedious, and often very difficult work, one question we get quite a bit is:

"How is it possible to train citizen scientists to replace such important, skilled researchers?"

But this question is built on a fundamentally incorrect assumption about the goals of Mark2Cure. We KNOW biocurators do very important work, and that one of the most tedious and time-consuming things they do is information extraction.

Information extraction can generally be broken down into three tasks:
1. Named Entity Recognition (identifying and classifying words/phrases in text)
2. Normalization (linking that text to an ontology)
3. Relationship Extraction (identifying the relationship between different entities).
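To make the three tasks concrete, here is a toy Python sketch. It is not how Mark2Cure or any biocuration pipeline actually works: the tiny lexicon, the "EXAMPLE:" identifiers, the cue-word list, and the example sentence are all made up for illustration, and real systems rely on far richer ontologies and methods.

```python
import re

# Hypothetical mini-lexicon: lowercase term -> entity type and a made-up identifier.
LEXICON = {
    "ngly1": {"type": "gene", "id": "EXAMPLE:G1"},
    "alacrima": {"type": "disease", "id": "EXAMPLE:D1"},
}

# Hypothetical cue words that signal a relationship between two entities.
RELATION_CUES = ("causes", "treats")


def recognize_entities(text):
    """Task 1 (NER): find lexicon terms in the text and label their type."""
    entities = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        entry = LEXICON.get(match.group().lower())
        if entry:
            entities.append(
                {"text": match.group(), "start": match.start(), "type": entry["type"]}
            )
    return entities


def normalize(entity):
    """Task 2 (Normalization): link a mention to an identifier in the lexicon."""
    return {**entity, "id": LEXICON[entity["text"].lower()]["id"]}


def extract_relations(text, entities):
    """Task 3 (Relationship Extraction): link neighboring entities joined by a cue word."""
    relations = []
    for left, right in zip(entities, entities[1:]):
        # Words sitting between the end of the left mention and the start of the right one.
        between = text[left["start"] + len(left["text"]):right["start"]].lower().split()
        for cue in RELATION_CUES:
            if cue in between:
                relations.append((left["id"], cue, right["id"]))
    return relations


sentence = "Loss of NGLY1 function causes alacrima."  # illustrative sentence only
mentions = recognize_entities(sentence)
linked = [normalize(m) for m in mentions]
relations = extract_relations(sentence, linked)
print(relations)
```

Notice that the relationship step works on identifiers rather than raw text, which is why normalization sits between the other two tasks.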

We want to train citizen scientists to help with these tasks, so that biocurators can apply their unique training towards solving problems in biomedical research analogous to the ones we're seeing in the citizen science field.

Since Mark2Cure is a citizen science project, the "What's in a name?" issue applies to us as well. Although our informal poll was only for fun, I was personally very happy with the results for two reasons:

1. I am a fan of wordplay, and I love that many users liked the term Mark2Curator, a term which blends Mark2Cure and biocurator. I love science puns.

2. Even if I'm reading too much into it, I like to think that our users picked 'citizen scientist' or 'contributor' because they feel that the help they provide to Mark2Cure is important, because it is.

If you've gotten this far, you are probably one of our many astute readers and may have noticed that information extraction was divided into THREE tasks, when Mark2Cure only has TWO. Where is the third task? Why is the missing task the step in between the first and the last?

The missing task, 'Normalization', is the task in between NER and Relationship Extraction. We started with NER because NER has been well-investigated, so there was a solid foundation for us to build upon. We followed with the relationship extraction task because this would allow us to unlock some of the most valuable and difficult-to-access information in the text.

As for the Normalization task...it's currently being built by volunteers. Mark2Curators have been helping us investigate NER mappings to different ontologies, and a very talented programmer and machine learning expert has been busy building the Normalization module. But we could use more help. We need feedback on potential interfaces for how parts of the module might work. If you'd like to help with that, answer the poll in our newsletter.

Of note for our U.S.-based Mark2Curators over 65 years of age.

Did you know? The US National Park Service has a lifetime pass for seniors that will allow you to enter or park at US national parks for free or at a discounted rate. These passes cost only $10 now through August 27th. Starting August 28th, the price will go up to $80.

If you enjoy hiking, nature, or plan to visit any of our beautiful national parks, you may want to get your pass while it's still $10. In San Diego, the closest national park where you can purchase one in person is Cabrillo. To find the national park closest to you, visit the NPS's site. If you don't live near a park, but plan on visiting some in the future, you can purchase a pass by mail or online.