No trick, this new plugin is quite the treat

Posted by on Oct 31, 2018 in BioGPS | 0 comments

Check out the new PanDrugs plugin from @pandrugs_cnio: the clean design neatly and intuitively provides information on drugs related to your gene of interest.
Thanks to the good folks at Centro Nacional de Investigaciones Oncologicas, this awesome new resource is available on BioGPS.

Speaking of BioGPS (just so we’re not confused with the other ‘BioGPS’ tools that came after us), BioGPS stands for:

Happy Halloween

Posted by on Oct 31, 2018 in crowdsourcing, mark2cure, Webinar | 0 comments

M2C Halloween movie parody poster

If you love science and enjoy learning, you’re in for a treat! Andrew (the brains behind Mark2Cure) will be holding a webinar using two case studies (the Gene Wiki Project and Mark2Cure) to illustrate the use of crowdsourcing as it applies to knowledge management for translational research. Registration is free and open to anyone (we checked first). There may be some required questions on the registration form which may not necessarily be applicable to you, but they’re only meant to inform, not exclude. So, if you’re interested in the webinar but don’t have an affiliation with an institute or department, feel free to select ‘other’ and type in ‘Mark2Cure’.

Event Details

(View the original announcement)

Speaker: Andrew Su, PhD, Professor, Department of Integrative, Structural and Computational Biology, The Scripps Research Institute

Overview: Crowdsourcing involves the engagement of large communities of individuals to collaboratively accomplish tasks at massive scale. These tasks could be online or offline, paid or for free. But how can crowdsourcing science help your research? This webinar will describe two crowdsourcing projects for translational research, both of which aim to better organize biomedical information so that it can be more easily accessed, integrated, and queried:

First, the goal of the Gene Wiki project is to create a community-maintained knowledge base of all relationships between biological entities, including genes, diseases, drugs, pathways, and variants. This project draws on the collective efforts of informatics researchers from a wide range of disciplines, including bioinformatics, cheminformatics, and medical informatics.

Second, the Mark2Cure project partners with the citizen scientist community to extract structured content from biomedical abstracts with an emphasis on rare disease. Although citizen scientists do not have any specialized expertise, after receiving proper training, Mark2Cure has shown that in aggregate they perform bio-curation at an accuracy comparable to professional scientists.

The Digital Scholar Webinar Series introduces health researchers at USC, CHLA and beyond to digital approaches and tools relevant to their research. The series showcases the potential and limitations of digital approaches health researchers need to be aware of. All webinars will be accessible afterward on the Digital Scholar Program page.

Register for the webinar

Wikidata SPARQL Query Log Item and Property Co-occurrence Analysis

Posted by on Aug 31, 2018 in Uncategorized | 0 comments

This is part 2 of a blog post describing Wikidata SPARQL query logs. Part 1 is here. In this part, we’ll look at the most used items, focusing on diseases, drugs, genes, and proteins, and then look at co-occurrence of properties within queries.

Below is a plot of the most used items by total query count and by unique query count.


Next, I’ll look specifically at diseases. These are the most used disease items by total query count and by unique query count:

I then retrieved all diseases, drugs, genes, and proteins from Wikidata and summed the counts for each type. The table below shows some summary information for each. The columns unique and total display the number of unique and total queries containing any item of that type, respectively. The columns organic and robotic contain the number of unique queries classified as organic or robotic, and num_items contains the total number of each type in Wikidata.


type      unique    total    organic   robotic   num_items
disease    82388   181241       1742     80465       11240
gene       11795    30899        163     11618      757218
protein      950     3798         37       913      533161
drug       42692   209740        544     42136      145961
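A per-type aggregation like the table above can be sketched in a few lines, assuming you already have per-item query counts and an item-to-type mapping (the IDs and numbers below are toy values, not the real log data):

```python
from collections import Counter

def sum_counts_by_type(item_counts, item_types):
    """Sum per-item query counts into per-type totals.

    item_counts: dict mapping item ID -> query count
    item_types: dict mapping item ID -> type label (e.g. 'disease', 'gene')
    Items without a known type are ignored.
    """
    totals = Counter()
    for item, count in item_counts.items():
        if item in item_types:
            totals[item_types[item]] += count
    return totals

# Toy example: made-up IDs and counts, not the real data
counts = {"Q1": 10, "Q2": 5, "Q3": 2, "Q4": 7}
types = {"Q1": "disease", "Q2": "disease", "Q3": "gene"}
totals = sum_counts_by_type(counts, types)  # Q4 has no known type and is skipped
```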


Property Co-occurrence

If two properties are used together within the same query, it may indicate something special.
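The pair counting can be sketched as follows, assuming properties are identified by extracting P-IDs from each query string (a simplification of the actual processing, with toy queries):

```python
import re
from collections import Counter
from itertools import combinations

def property_cooccurrence(queries):
    """Count pairs of Wikidata property IDs appearing in the same query.

    Each unique pair is counted once per query, no matter how many times
    each property occurs within that query.
    """
    pair_counts = Counter()
    for query in queries:
        props = sorted(set(re.findall(r"\bP\d+\b", query)))
        for pair in combinations(props, 2):
            pair_counts[pair] += 1
    return pair_counts

# Toy queries: only the first contains a co-occurring pair
queries = [
    'SELECT ?d WHERE { ?d wdt:P699 ?doid . ?d wdt:P279 ?parent . }',
    'SELECT ?d WHERE { ?d wdt:P699 ?doid . }',
]
pairs = property_cooccurrence(queries)  # {("P279", "P699"): 1}
```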

First we’ll look at all properties. Below is a plot of the top pairs of properties used together in a query, by total and unique count,

and by organic vs robotic count:


Next, looking specifically at biomedical-related properties:

And lastly, I wanted to see the most common user agents for each of these pairs of properties.


Taken together, these plots indicate that thousands of unique, organic, integrative queries involving biomedical items of different classes were performed during this time.

Wikidata SPARQL Query Log Analysis

Posted by on Aug 31, 2018 in bioinformatics, Biomedical Informatics, biomedical research, Gene Wiki, SPARQL, Uncategorized, Wikidata | 0 comments

We recently gained access to the anonymised logs of several hundred million SPARQL queries from the Wikidata SPARQL endpoint. This blog post contains some discussion about the main takeaway points, while the full analysis and code can be found here.

What are we looking for?

We are interested in biomedical properties and items in Wikidata, specifically looking at how these entities are used within SPARQL queries, and how often they are searched for. While it would be great to look at biomedical entities in the results of SPARQL queries, or being used as intermediates within the knowledge graph, that would be very difficult to do at this time, so we are only looking at how items or properties are explicitly used within the SPARQL queries themselves.

Why do we care about this?

Part of the work we do in the Su Lab involves working with data curators and data providers to make their data more findable, accessible, interoperable, and reusable. Using these query logs, we can get a sense of whether we are succeeding. In addition, data providers often have to gather usage metrics to justify additional funding or to make decisions on future improvements to their data. Looking at the query logs can help us get a sense of how people are using certain data.

What is in the data?

Please see here for a full description. Briefly, each line in the dump files contains (1) the anonymized query (reformatted and processed to reduce identifiability), (2) the timestamp, (3) the source category (robotic or organic, explained here), and (4) the user agent: a simplified/anonymized version of the user agent string used with the request.

How was the data processed?

See here for the main script. Briefly, all 3 files were concatenated together and then the query strings were sorted, grouped, and counted. In this way, we obtained the set of unique queries and the number of times each query was executed. We then counted the number of items or properties explicitly mentioned within each query, recording the number of unique queries and total queries each ID was found in. We also looked at co-occurrence of properties within SPARQL queries, for example counting the number of times Disease Ontology ID (P699) and subclass of (P279) are used together within the same query.
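The grouping-and-counting step can be sketched roughly like this (a simplified stand-in for the actual script, with toy queries):

```python
import re
from collections import Counter

def count_unique_queries(lines):
    """Group identical query strings and count how often each was executed."""
    return Counter(line.strip() for line in lines)

def id_query_counts(unique_counts):
    """For each item/property ID (Q- or P-prefixed), count the number of
    unique queries and total executed queries that mention it."""
    unique, total = Counter(), Counter()
    for query, n_executions in unique_counts.items():
        for wid in set(re.findall(r"\b[PQ]\d+\b", query)):
            unique[wid] += 1
            total[wid] += n_executions
    return unique, total

# Toy log: three executions, two distinct query strings
log = [
    'SELECT * WHERE { ?x wdt:P699 "1234" . }',
    'SELECT * WHERE { ?x wdt:P699 "1234" . }',
    'SELECT * WHERE { ?x wdt:P279 ?y . }',
]
unique_counts = count_unique_queries(log)
unique_ids, total_ids = id_query_counts(unique_counts)
```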

What did you find?

In this analysis, I’m mostly going to show the unique counts (i.e. if the exact same query is executed many times, it only counts once). The total counts are available with the full analysis.

Below are the counts for the most used properties by total query count and by unique query count


Below, the unique query counts are broken down into organic and robotic query counts. Takeaways for me: the scale is significantly different between organic and robotic queries. As a benchmark, out of 208 million total queries, fewer than 1 million were classified as organic. The most common unique robotic queries involved PubMed IDs, which makes sense given the scale of WikiCite, while the most common organic queries use instance of and subclass of, which makes sense given how widespread and useful those properties are.

I decided to take a look at the PubMed ID queries, to get an idea of what they typically look like. Here are a couple of them:

SELECT *	WHERE {	  ?var1 ( <> / <> ) "10000006".	}
SELECT *	WHERE {	  ?var1 ( <> / <> ) "10000007".	}
SELECT *	WHERE {	  ?var1 ( <> / <> ) "10000008".	}
SELECT *	WHERE {	  ?var1 ( <> / <> ) "10000009".	}

As you can see, many of the queries are the same query with different IDs.

Next, I looked at property usage stratified by useragent, for both unique and total query counts.

For fun, I was also curious what the most frequently executed query was. It is the query below, which looks for MusicBrainz IDs. It was executed 9,721,509 times! As the strings were replaced by “string1”, “string2”, etc., it is probably the case that this query was executed with different strings each time, but those have been anonymized. The rest of the top 10 can be seen on GitHub.

SELECT ?var1 ?var2 WHERE { VALUES ( ?var2 ) { ( “string1” ) ( “string2” ) ( “string3” ) ( “string4” ) ( “string5” ) ( “string6” ) ( “string7” ) ( “string8” ) ( “string9” ) ( “string10” ) ( “string11” ) ( “string12” ) ( “string13” ) ( “string14” ) ( “string15” ) ( “string16” ) ( “string17” ) ( “string18” ) ( “string19” ) ( “string20” ) ( “string21” ) ( “string22” ) ( “string23” ) ( “string24” ) ( “string25” ) ( “string26” ) ( “string27” ) ( “string28” ) ( “string29” ) ( “string30” ) } ?var1 <> ?var2 . }

Let’s look at biomedical properties!

As you may know, at the Su Lab we have a project where we enrich Wikidata with biomedical data. See here and here for more info. Part of this involves working with data curators and data providers to make their data more usable by the Wikidata community. Using these query logs, we can get a sense of whether we are succeeding in making these biomedical data findable, accessible, and interoperable through Wikidata.

We’ll start with biomedical properties, showing the unique and total query counts.

And broken down into organic and robotic query counts.


And lastly, property usage stratified by user agent and property. Takeaways for me: PBB_core (aka WikidataIntegrator) and Magnus’s tools are the largest queriers for reference items. Taxons, drugs, diseases, and genes are heavily used through MediaWiki and through browser-based queries.

In the next blog post I’ll talk about item usage in queries, and co-occurrence of properties within the same query.

BioThings: Other BioThings in the works

Posted by on May 21, 2018 in BioThings | 0 comments

By now, you've probably seen the announcement that our renewal grant application has been approved. In addition to funding the improvement of MyGene.info and MyVariant.info, and the extension of lessons learned from them, the renewal grant will fund the development of a BioThings Software Development Kit (SDK). To our knowledge, the BioThings SDK will be the first open-source bioinformatics SDK for building high-performance web services. We WANT people to be able to build high-performance web services for accessing important biological data so that everyone can get the most out of existing data.

Building the BioThings SDK

The development of the BioThings SDK will have three phases:

Phase 1, Abstraction: in this phase, we will extract the common codebase from MyGene and MyVariant to form the three core components (“databuild”, “web-API” and “cloud-deployment”) of the BioThings SDK.

Phase 2, Customization Tool Creation: in this phase, we will build customization mechanisms into the SDK. These mechanisms will include a project-specific configuration system, a scheduler (for harvesting data from data-source-specific parsers), and some generic interfaces for adding customization into each of the three core components (“databuild”, “web-API” and “cloud-deployment”).

Phase 3, Test and Improve: in this phase, we will test the BioThings SDK by converting the existing MyGene and MyVariant codebases to use it. In converting the codebases, we will identify issues with the SDK and areas for improvement. Once the SDK is in place, we may iteratively improve it, either by converting the codebases of other BioThings APIs (if they're created prior to the completion of the SDK) or by creating new BioThings APIs for chemicals or diseases.
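The plugin-style customization described in Phase 2 might look something like the sketch below. This is purely illustrative: the class and attribute names are hypothetical and do not reflect the actual BioThings SDK interface.

```python
# Purely illustrative plugin sketch; names are hypothetical, not the
# actual BioThings SDK interface.
class DataSource:
    name = None
    schedule = None  # e.g. a cron-like string consumed by the harvest scheduler

    def parse(self, raw_record):
        """Convert one raw source record into a JSON-serializable document."""
        raise NotImplementedError

class ExampleGeneSource(DataSource):
    name = "example_source"
    schedule = "0 2 * * 6"  # illustrative weekly harvest

    def parse(self, raw_record):
        # Map source-specific field names onto a common document shape
        return {"_id": raw_record["gene_id"], "symbol": raw_record["sym"]}

doc = ExampleGeneSource().parse({"gene_id": "1017", "sym": "CDK2"})
```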

Improving Data Provenance of BioThings APIs

One of the advantages of using BioThings APIs like MyGene or MyVariant is that they're always up to date. The BioThings team takes care of the data parsing and munging issues that can arise when individual resources are updated, so smoothly that users can sometimes be caught off guard when a data update changes their results.

To ensure that BioThings APIs handle data provenance well, we will build methods for data discrepancy checking and quality control, as well as data-update logging and reporting, directly into the BioThings SDK. Data provenance is important for reproducibility, especially when working with continuously updated data resources.

Improving Utility of BioThings APIs

By building the BioThings SDK, we invite others to create APIs like MyGene and MyVariant, which make it easier for developers to build tools that use the annotation data available via these APIs. This could mean a growing number of such APIs, which creates a new and interesting quandary: how do you know when a data-resource update affects something of interest to you? To address this quandary, the BioThings team proposes a tool, tentatively called BioReel, which will be discussed in our next post.

Sneak Peek at Changes Coming to the BioThings Website

In addition to overhauling the MyGene and MyVariant websites, the BioThings website will be getting a cosmetic upgrade as well. Here's a little taste of what's to come!
image of new biothings logo

MyVariant: Lessons learned and what’s in store

Posted by on May 14, 2018 in BioThings, JSON-LD | 0 comments

In case you missed it, a sneak peek at what's in store for MyGene.info was posted last week, so it's only FAIR to share our plans for MyVariant. Although the development of MyVariant.info naturally followed that of MyGene.info, the scale of variant annotation data presented a difficult challenge that required additional architectural and performance considerations. At the time MyVariant.info was first being developed, there were about 18 million genes in the 6+ data resources of interest to MyGene.info, compared to 340 million variants in the 12+ data resources of interest to MyVariant.info: a ~20x scale-up in the number of items to index.

For this reason, the development of MyVariant.info required a bit of tailoring to make it work. In overcoming the architectural and performance challenges, the team learned a lot of valuable lessons about abstracting and standardizing the creation of APIs like MyGene.info and MyVariant.info for different types (and scales) of biological entity data. To learn more about the general architecture behind the MyGene and MyVariant services, check out the 2016 paper in Genome Biology.

The lessons learned from wrangling data at the scale handled by MyVariant.info will be valuable as the BioThings team looks toward incorporating data from Ensembl Genomes, which will drastically scale up the amount of data offered by MyGene.info. In abstracting the process of building MyGene and MyVariant, the BioThings team has laid the foundation for building additional APIs centered around biological entities like chemicals and diseases!

Furthermore, the BioThings team will take the lessons learned and incorporate them into their efforts to create a generic Software Development Kit (SDK) for generating APIs around biological entities like genes and variants. More on the BioThings SDK later.

Variant annotation data is much more valuable in the context of genes; hence, the MyVariant team has been exploring ways to increase interoperability of the MyGene and MyVariant services using JSON-LD.

Linking data with JSON-LD

Both MyGene.info and MyVariant.info store annotation data from different resources in JSON documents; however, differences in keys for the same data across the two services can make it challenging to obtain results for chained queries. JSON-LD provides a standard way to add semantic context to the existing JSON data structure, enhancing the interpretability, and therefore the interoperability, of the JSON data.

Basically, each API (like MyGene and MyVariant) specifies a JSON-LD context (i.e., a JSON document that provides a Uniform Resource Identifier (URI) mapping for each key in the output JSON document). The use of URIs provides consistency when specifying subjects and objects, allowing the results for a multistep chained query to be obtained through a much simpler query.
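As an illustration of the idea, here is a hypothetical JSON-LD context fragment built in Python. The key name and identifiers.org-style URI are assumptions for illustration, not the actual context files used by MyGene or MyVariant.

```python
import json

# Hypothetical JSON-LD context fragment: maps a local JSON key to a URI so
# the same concept is identified consistently across APIs. The key and URI
# below are illustrative assumptions.
context = {
    "@context": {
        "entrezgene": {
            "@id": "http://identifiers.org/ncbigene/",
            "@type": "@id",
        }
    }
}
document = {"entrezgene": "1017", "symbol": "CDK2"}

# With the context attached, the local key "entrezgene" maps to a shared URI,
# so a value returned by one API can be fed directly into a query on another.
linked = {**context, **document}
serialized = json.dumps(linked)
```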

Learn about how it works here and imagine the possibilities.

The anniversary of Mark2Cure’s official launch is coming soon

Posted by on May 11, 2018 in Cochrane Crowd, mark2cure, publications | 0 comments

Mark2Cure's 3rd anniversary is coming up, and we are extremely grateful for the opportunity to have interacted with and learned from you over the last few years! You make this project interesting and exciting. You make this project educational and humbling. You make this project useful and valuable. Although our research team has shrunk to half of what it was when we first started, we have been able to continue to move forward only because of you! We cannot thank you enough!

As a citizen science effort, Mark2Cure is primarily driven by volunteers--and volunteers like you have brought us to where we are today. As of today, we have over 1.3 million annotations!!! We are currently busy with the analysis, so please accept my apologies for being a bit slower to respond to your inquiries. Fortunately, you and your fellow volunteers continue to help us move forward. In fact, we are excited to share a new preprint on aligning citizen science opportunities with the needs of students fulfilling community service and service learning requirements. The research on these requirements was primarily performed by a volunteer with a marketing/business background and was inspired by a few high school Mark2Curators who have been kind enough to share their experience and needs as students and volunteers.

You can find the preprint on bioRxiv here; it has been submitted for peer review to the journal Citizen Science: Theory and Practice.

A designer has also wrapped up her work on making Mark2Cure more intuitive and user-friendly. A huge thanks to those of you who took the time to provide feedback on individual parts of her designs; although your feedback may not necessarily be incorporated here, we will definitely take your detailed and valuable suggestions into consideration.

Since this is citizen science and your voices are important, I'd like to share the designs with all of you. You can find the wireframes here. Note that the actual wording/content is subject to change (especially since we've received detailed content recommendations from some of you), and that the focus is more on the layout of the content. Please feel free to share your opinions with us!

For those of you who joined us in last year's #MedLitBlitz or this year's #CitSciMedBlitz, you may be familiar with our friends at Cochrane Crowd. Like Mark2Cure, Cochrane Crowd is a citizen science project where volunteers help inspect biomedical abstracts. Cochrane Crowd also launched in May and is celebrating its anniversary with the #showyourscreen 2 Million annotations challenge. Learn more about the challenge here.

Sneak peek: what’s in store for MyGene.info

Posted by on May 7, 2018 in mygene | 0 comments

As of the most recent data update on April 24th to build version 20180422, the MyGene.info database grew to contain 22,132,511 documents. As a valuable service that has seen over 20 million requests in the last 30 days, MyGene.info was fortunate enough to receive renewed support for improving its offerings.

The landing page will be overhauled with a sleeker, more attractive, intuitive, responsive, cohesive, and user-friendly design. The updated landing site will reflect the ongoing improvements made to the website's architecture over the last few months, as well as the changes expected in store for MyGene.info.

What's in store for MyGene.info? The service will expand to include highly requested annotation sources such as the species and annotations available from Ensembl Genomes. Currently, Ensembl is already one of MyGene.info's ~7+ data resources, contributing annotations for 1.6 million genes in >80 species. The inclusion of Ensembl Genomes can potentially add annotations for over 145 million genes from thousands of bacterial, fungal, plant, metazoan, and protist species!

In addition to this large and widely requested resource, MyGene.info will also import gene annotations from several smaller, more specialized data resources, with the goal of making data from all of these resources more Findable, Accessible, Interoperable, and Reusable!

A FAIRly important note

Both the Su Lab and Wu Lab (to whom the renewal grant was awarded) are strong proponents of data reuse and have a strong interest in data FAIRness. How can MyGene.info, MyVariant.info, and both labs' related efforts make existing biological data more FAIR? MyGene.info makes gene annotation data more Findable by providing a centralized resource that enables simple community contribution. All data sources included in MyGene.info are heavily indexed using existing identifiers, allowing that data to be accessed via our simple search API. Since the data sources are pre-integrated into gene-specific JSON objects in MyGene.info, data from the included resources are standardized in structure and Accessible via our REST-based JSON API.

As part of the BioThings APIs, gene annotation data included in MyGene.info will be more Interoperable thanks to compatibility with Linked Open Data resources via JSON-LD and standard vocabularies. Read more about the value of interoperable data in our recent paper on cross-linking BioThings APIs via JSON-LD. By allowing community editing of the JSON-LD context files, we'll empower the community to iteratively improve the interoperability of the data.

Lastly, MyGene.info (and its sister BioThings APIs) will continue to help make data more Reusable by providing a high-performance, continuously updated API with no authentication, registration, or usage limits. By providing R and Python client libraries and encouraging the development of 3rd-party clients, MyGene.info increases the accessibility and utility of the data.
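As a sketch of that no-registration access, the snippet below builds a MyGene.info v3 query URL with the standard q/fields/species parameters (endpoint shape per the public MyGene.info API; the example field names are illustrative):

```python
from urllib.parse import urlencode

def mygene_query_url(q, fields="symbol,name", species="human"):
    """Build a MyGene.info v3 query URL; no API key or registration needed."""
    params = urlencode({"q": q, "fields": fields, "species": species})
    return "https://mygene.info/v3/query?" + params

# e.g. pass the result to urllib.request.urlopen() and parse the JSON response
url = mygene_query_url("symbol:CDK2")
```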

Excited about what's in store for MyGene.info? If not, be sure to check out the BioThings paper for a taste of the possibilities that are coming!

Happy Citizen Science Day Hero Badge Hunting!

Posted by on Apr 18, 2018 in challenges, citizen science | 0 comments

Citizen Science Day was April 14th this year, and Mark2Cure partnered with the San Diego Public Library to host a local Citizen Science Expo. Of course, many of our wonderful contributors are not in San Diego and could not attend the event. For those of you who wish to get in on the Citizen Science Day excitement, we've joined the EyesOnAlz Citizen Science Day Hero challenge.

The challenge will run until 9am ET, April 21st, and anyone interested in the challenge will have the opportunity to earn digital badges for just trying out (i.e., registering for or logging into) different citizen science projects. As Mark2Curators, you only need to log into your Mark2Cure account to earn a badge.

Learn more about this fun challenge at

Citizen Science Day 2018 is just around the corner!

Posted by on Mar 30, 2018 in citizen science, events | 0 comments

San Diego Citizen Science Day Expo

Citizen Science Day is on April 14th this year, and many citizen science organizations (including yours truly) are hosting citizen science events. Here in San Diego, we've teamed up with the San Diego Public Library and the Wet Lab group to put on the 3rd annual San Diego Citizen Science Day Expo. There are a lot of exciting new entrants into the San Diego citizen science scene, and we hope you will join us in learning about them at the expo. If you're in San Diego, please join us! The details are as follows:

Who: Anyone who wants to do science
What: San Diego Citizen Science Day Expo
When: Saturday, April 14, 2018. 1:00 PM – 5:00 PM
Where: North University Community Library (8820 Judicial Dr, San Diego, CA 92122)

Please note that the location has changed from previous years due to the limited availability of parking at the La Jolla Library. The North University Community Library has plentiful free parking, so please come if you're in the area! For the most up-to-date information about this event, visit

If you're not in San Diego, there is probably an exciting Citizen Science Day event happening near you! To find a Citizen Science Day Event near you, visit

San Diego March for Science

The March for Science is also happening on April 14th in San Diego. It starts at Waterfront Park at 10:00am and ends at 1:00pm (right before our event!). If you want to show your love for science, consider joining the march! If you want to DO science, be sure to join a Citizen Science Day event near you (or contribute to Mark2Cure, of course!).

Current Status of Mark2Cure

Development status and workarounds

Unfortunately, Mark2Cure no longer has a full-time developer working on the project, so many of the issues and bugs that have been reported probably will not be fixed for a long, long time. We are very sorry for the frustration our system has caused our users and extremely grateful for the patience, graciousness, and encouragement our users have returned to us. Mark2Cure is really made up of a wonderful bunch of individuals, and we are thankful that this project has introduced us to you. Fortunately, many of you really put the science in the term citizen science and have systematically found ways to contribute productively in spite of all the issues in our system. You are all too amazing!

NER module issues: The most frustrating one has been the inability to highlight certain words, and the random highlighting/un-highlighting of words when users try to mark something. This has been reported by many users (many, many thanks to those of you who took the time to report this issue). Fortunately, one of your fellow volunpeers has found a workaround that appears to be quite robust. To get around many of these highlighting issues, AJ_Eckhart highlights the entire paragraph to remove the preannotations. These preannotations seem to be an important factor in this problem, and he has tested this workaround against the 'cannot-highlight-a-specific-term', 'highlighting-a-term-un-highlights-something-else', and 'highlighting-a-term-highlights-something-else' bugs.

RE module issues: A number of you have kindly taken the time to report issues with the RE module; the most common is the seemingly random inability to throw out an annotation. For this issue, two workarounds have been reported by our users. LadySteph has found that returning to the dashboard and then returning to the task will enable you to submit the response you wish (e.g., throwing out an annotation), and TAdams has reported that many of you have gravitated toward submitting 'Cannot be determined' in lieu of throwing out an annotation. We will take both workarounds into consideration when we analyze the data, so thank you all very much for contributing in spite of these issues!

Data analysis and research status

Speaking of analyzing the data: we might not yet have enough annotated abstracts to generate ground-breaking new hypotheses on NGLY1 deficiency, but we have enough for some initial analyses on the application of citizen science to information extraction. We are working towards more scientific publications and look forward to sharing the results of your work and crediting you for your help. Note that many journal submission systems are not built to accommodate group names or a huge volume of names in the authorship; hence, we will continue to have our Mark2Cure contributors listed on a dedicated page, which will be linked in the paper. As with our first paper, this will be an opt-in process because we respect your right to privacy. More details on opting in will be sent via our mailing list.