New Data Release for 201708

Posted by on Aug 10, 2017 in data release, | 0 comments

Another fresh data release for MyVariant.info is out! In this data release, we have updated the data from ClinVar and UniProtKB to their latest versions, and also added variant annotations from CIViC and the Cancer Genome Interpreter. Here are more details.

Data Sources Updated

ClinVar was updated to its latest version (the same version for both the hg19 and hg38 assemblies), and the variant annotations from UniProtKB were also updated to the latest version (hg38 only):

Some numbers for GRCh37/hg19 variants:

Source     Last release   New release   # variants (last)   # variants (new)
ClinVar    2017-06        2017-08       307,101             310,280

Similarly, some numbers for GRCh38/hg38 variants:

Source      Last release   New release   # variants (last)   # variants (new)
UniProtKB   2017-03        2017-07       477,711             527,607
ClinVar     2017-06        2017-08       307,286             310,577

ClinVar annotations are available under the "clinvar" subfields for each annotated variant, and UniProtKB annotations under the "uniprot" subfields. MyVariant.info aggregates annotations from ClinVar, dbSNP, dbNSFP and 17 other sources for each variant, so you can access them all in one request.

The total number of unique variants is now over 424M (424,524,227), slightly higher than our previous release in June 2017, which had 424,519,520. More details about the variant data we provide are always available in our documentation. Programmatic access to this information is available from our metadata endpoint (and hg38 metadata).

New Data Sources Added

In this data release, we added variant annotations from CIViC and the Cancer Genome Interpreter (CGI), through our collaboration with the GA4GH VICC working group. Both provide extensive annotations of cancer-associated genetic variants. More specifically:

CIViC is an open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer. The goal of CIViC is to enable precision medicine by providing an educational forum for the dissemination of knowledge and active discussion of the clinical significance of cancer genome alterations.

Cancer Genome Interpreter is designed to support the identification of tumor alterations that drive the disease and detect those that may be therapeutically actionable. CGI relies on existing knowledge collected from several resources and on computational methods that annotate the alterations in a tumor according to distinct levels of evidence.

You can access the data from CIViC under the "civic" field. Note that the "civic" field is only available for hg19 variants. Here are a few query examples:

curl ''  
curl ''  
curl ''  
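The query URLs above did not survive formatting. As a rough sketch, assuming the standard MyVariant.info v1 endpoints (`/query` with a Lucene-style `q` parameter, and `/variant/<hgvs_id>`), the requests likely looked something like the URLs built below; the variant ID is just an illustrative hg19 HGVS ID:

```python
from urllib.parse import quote, urlencode

BASE = "http://myvariant.info/v1"  # assumed API root

def query_url(q, **params):
    """Build a /query URL, e.g. to find all variants carrying a 'civic' field."""
    return f"{BASE}/query?" + urlencode({"q": q, **params})

def variant_url(hgvs_id, fields=None):
    """Build a /variant URL for one variant, optionally restricting returned fields."""
    url = f"{BASE}/variant/{quote(hgvs_id, safe=':.>')}"
    return url + (f"?fields={fields}" if fields else "")

# all variants with CIViC annotations (hg19 only, per the release notes)
print(query_url("_exists_:civic", fields="civic", size=5))
# CIViC annotation for a single (illustrative) variant
print(variant_url("chr7:g.140453136A>T", fields="civic"))
```

These URLs can then be fetched with curl exactly as in the examples above.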

You can access the data from Cancer Genome Interpreter under the "cgi" field. Note that the "cgi" field is only available for hg19 variants. Here are a few query examples:

curl ''  
curl ''  
curl ''

You can also run combined queries across data sources, just as with the other fields we provide:

curl ''  
curl ''  
curl ''  
curl ''  
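Since the combined-query URLs above were also lost, here is a hedged sketch of their likely shape: the same /query endpoint, with fielded clauses joined by boolean operators. The `_exists_:` syntax follows the conventions above; the gene filter is purely illustrative:

```python
from urllib.parse import urlencode

BASE = "http://myvariant.info/v1"  # assumed API root

def combined_query(*clauses, **params):
    """Join fielded query clauses with AND on the /query endpoint."""
    return f"{BASE}/query?" + urlencode({"q": " AND ".join(clauses), **params})

# variants annotated by both CIViC and CGI
print(combined_query("_exists_:civic", "_exists_:cgi"))
# hypothetical combination: CGI-annotated variants in a given gene
print(combined_query("_exists_:cgi", "dbnsfp.genename:BRAF", fields="cgi,civic"))
```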

That's all! And as always, feel free to reach us at or @myvariantinfo if you have any questions or feedback.

Join Mark2Cure and Dazzle4Rare

Posted by on Aug 4, 2017 in citizen science, Dazzle4Rare, mark2cure, polls | 0 comments

From August 13th to August 20th, Mark2Cure will be participating in the #Dazzle4Rare campaign to raise awareness for rare diseases. Did you know? About 10% of the population lives with a rare disease, and roughly 50% of rare diseases don’t have any disease-specific foundation supporting or researching them. See more interesting statistics about rare diseases at Global Genes.
If you have a rare disease story you would like us to highlight for the campaign, please get in touch!

What's new in Mark2Cure?
The EDEM1 Entity Recognition mission is over 95% complete; please help us finish it so we can launch the next one. If it seems like we’ve been quiet lately, it’s because we’ve been preparing for some major updates. If you’re curious about what’s in the pipeline or would like to preview/provide feedback for potential future interface designs, we’d LOVE to hear from you! Your feedback is how we improve! If not for our many marvelous Mark2Curators providing constructive criticism, Mark2Cure would be a clunkier and more difficult-to-use platform.

Speaking of our volunteers, citizen scientists, participants, contributors, volunpeers, and Mark2Curators… there was an interesting discussion earlier today within the citizen science community on the best way to address the amazing people who help make science happen. In fact, a bunch of researchers even wrote an interesting paper about the pros and cons of different terminology.

Which takes us to our current poll.

Lastly, there is an ongoing effort to increase discussion, collaboration, and cooperation within the citizen science (or whatever you wish to call it) community. This has led our friend Alice to introduce #CitSciStories. You may think that your contributions to science in your spare time are no big deal, but from the perspective of the researchers who rely on these contributions, they are amazing! Inspiring! Awesome beyond words! We love what you do and we love learning from you and getting to know you. If you'd like to share your story and inspire others to help science, please get in touch with Alice (@PenguinGalaxy). You can learn more about the #CitSciStories effort here.

Upcoming #CitSciChat on Biomedical Citizen Science

Posted by on Jul 14, 2017 in #citscichat, biomedical research, citizen science, mark2cure, presentation | 0 comments

New Mark2Cure Video added to our youtube playlist!

The Citizen Science Conference in May was very productive, and the last of Mark2Cure's recorded talks is now available on our youtube channel. As previously mentioned, Max delivered the project slam for Mark2Cure and was selected as one of the top three to deliver an abbreviated version during the 'Night in the Clouds' event.

View the two-minute talk here:

Biomedical CitSciChat on Wed. July 19th, at 11:00am PT

Speaking of the conference, we were able to connect in person with a lot of lovely people in the citizen science arena, especially the amazing people from @EyesOnAlz, @CitSciBio, and @CochraneCrowd. Because we're all passionate about bringing citizen science to biomedical research, we organized a panel for a biomedical #citscichat. Caren Cooper (@CoopSciScoop) kindly agreed to moderate the chat as usual, and Pietro (@pmichelu, @EyesOnAlz) was able to convince @foldit's Seth Cooper to join the panel.

What: Hour long chat on biomedical citizen science (#CitSciChat)

Where: online via twitter

When: Wed July 19 2:00pm ET (11:00am PT)

Why: Because citizen science is used in biomedical research too

Who: Everyone interested in citizen science is welcome to join this chat which will be moderated by citizen science expert and author, Caren Cooper. The panel so far includes:

  • Mark2Cure of course! Mark2Cure is a citizen science project for addressing the big data issue of biomedical literature. Citizen scientists help look for clues about NGLY1-deficiency in curated literature. (@Mark2Cure/@gtsueng, @x0xmaximus, @AndrewSu)
  • Cochrane Crowd is a citizen science project from the Cochrane Collaborative, and also looks to make biomedical literature more useful. Citizen Scientists help identify randomized controlled trials so that Cochrane Reviewers can use them to answer important medical questions. (@Cochrane_Crowd, @annanoelstorr)
  • EyesOnAlz/Stall Catchers is a citizen science project from the Human Computation Institute to identify blood blockages in short videos of the brain. Their game is super fun, helps with Alzheimer's research AND they have a major event (Catchathon) coming up. If you would like to host a local catchathon, check out this post. (@EyesOnAlz, @seplute, @Clair_csg, @pmichelu)
  • CitSciBio is NIH's new biomedical citizen science hub. It is sponsored by the Division of Cancer Biology at the National Cancer Institute. There are tools for collaborating, creating projects, and now you can login via your scistarter account. (@citscibio)
  • Foldit is a long-standing and very successful citizen science game which empowers gamers and volunteers to help determine the structure of proteins important to biomedical research. Seth Cooper from Northeastern University has agreed to join the panel to share about this wildly successful project. (@UWGameScience)
Beat the heat and help science!

Need an excuse to stay indoors, avoid chores, and avoid the summer heat? Look no further! One of our current missions is over 80% complete. Help us finish it!

    Integrating Wikidata and other linked data sources – Federated SPARQL queries

    Posted by on Jul 13, 2017 in semantic web, SPARQL, Wikidata | 0 comments

    This blog post is about running federated SPARQL queries on Wikidata. A federated query is a special type of SPARQL query that runs on more than one SPARQL endpoint, allowing access to multiple linked data resources in a single query. Below is a template of a federated query.

    [Figure] Structure of a federated query: it contains query patterns for both the local endpoint (green box) and a remote endpoint (blue box). The address of the remote SPARQL endpoint is given with the SERVICE keyword.
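Since the figure itself did not survive, here is a generic version of that template, held in a Python string for illustration. The remote endpoint address and the triple patterns are placeholders, not the original figure's contents:

```python
# Generic federated-query template. The remote endpoint address and the
# triple patterns are placeholders; the SERVICE block is the key feature.
FEDERATED_TEMPLATE = """
SELECT ?item ?remoteValue WHERE {
  ?item wdt:P31 wd:Q8054 .                 # local pattern (e.g. items that are proteins)
  SERVICE <https://example.org/sparql> {   # address of the remote endpoint
    ?item ?remoteProperty ?remoteValue .   # patterns evaluated remotely
  }
}
"""
print(FEDERATED_TEMPLATE)
```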

    The Wikidata Query Service (WDQS) now supports federated SPARQL queries against a limited number of endpoints. Remote SPARQL endpoints can be added to the list of supported endpoints by nomination.

    Wikidata aims to provide the sum of all knowledge to the world at large. To fulfil this, it needs to be a hub for the total knowledge space, since technical, legal, and social limitations make it impossible to include everything in a single repository. Through federation, or distributed querying, local and remote data can be combined. Here we explore three ways to apply this type of querying to Wikidata content.


    From Wikidata to an external SPARQL endpoint (Wikipathways)

    The following query applies federation to integrate a pathway from Wikipathways with Wikidata. Wikidata contains items on human pathways from Wikipathways, but metabolic interactions are not yet captured in Wikidata; through federation, these metabolic interactions can be obtained. In the reverse direction, it is possible to obtain properties of pathway elements from Wikidata. Take, for example, the “Sudden Infant Death Syndrome (SIDS) Susceptibility Pathways (Homo sapiens)” pathway. It contains various biological interactions. Using federated queries, we can get properties such as the mass of a given pathway element.

    [Figure] The “Sudden Infant Death Syndrome (SIDS) Susceptibility Pathways (Homo sapiens)” pathway.

    Using this query as input, a federated query can enrich this pathway with properties not captured in Wikipathways. One example would be the following query, which takes interactions from the above pathway and combines them with the mass of the individual pathway parts.

    You can run this query here, or watch it run on YouTube.
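The linked query did not survive formatting, but its shape was roughly as follows. This is a hedged reconstruction held as a Python string: the WikiPathways endpoint address, the wp: vocabulary terms, and P2067 as Wikidata's "mass" property are assumptions, not the original query:

```python
# Hedged sketch: metabolites from a WikiPathways pathway, enriched with
# their mass from Wikidata. Vocabulary terms are assumptions for illustration.
WP_TO_WIKIDATA = """
PREFIX wp:  <http://vocabularies.wikipathways.org/wp#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?metabolite ?wdItem ?mass WHERE {
  SERVICE <http://sparql.wikipathways.org/sparql> {
    ?metabolite a wp:Metabolite ;            # pathway elements
                wp:bdbWikidata ?wdItem .     # cross-reference to a Wikidata item
  }
  ?wdItem wdt:P2067 ?mass .                  # P2067: mass, taken from Wikidata
}
"""
print(WP_TO_WIKIDATA)
```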

    From a remote SPARQL endpoint to Wikidata

    If a remote SPARQL endpoint is not (yet) eligible for use in the WDQS, it is possible to run the query from the external endpoint instead, provided that the external endpoint accepts federated SPARQL queries. The SPARQL endpoint of UniProt is a nice example. UniProt includes many more properties for proteins than are currently captured in Wikidata, properties that cannot be included in Wikidata due to the more restrictive nature of the applicable license. The following federated SPARQL query runs on the SPARQL endpoint of UniProt. It selects all human UniProt entries that have a sequence variant leading to a loss of function and that also physically interact with a drug used as an enzyme inhibitor.

    Integrating Wikidata content with data from UniProt using federated query submitted at

    Try it…
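As a rough sketch of that query's shape, held as a Python string: the UniProt core vocabulary terms and the Wikidata property IDs here are assumptions for illustration, and the join between the two endpoints is elided:

```python
# Hedged sketch of a query submitted AT the UniProt endpoint that federates
# out to the Wikidata Query Service. Vocabulary and property IDs are
# assumptions; the cross-endpoint join is deliberately elided.
UNIPROT_TO_WIKIDATA = """
PREFIX up:    <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX wdt:   <http://www.wikidata.org/prop/direct/>
SELECT ?protein ?drug WHERE {
  ?protein a up:Protein ;
           up:organism taxon:9606 ;          # human entries
           up:annotation ?ann .
  ?ann a up:Natural_Variant_Annotation .     # sequence-variant annotation
  SERVICE <https://query.wikidata.org/sparql> {
    ?wdProtein wdt:P352 ?uniprotId .         # P352: UniProt protein ID
    ?drug wdt:P129 ?wdProtein .              # P129: physically interacts with
  }
}
"""
print(UNIPROT_TO_WIKIDATA)
```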


    From a local SPARQL endpoint to Wikidata (Data from a database that will never go into WD)

    Another interesting use case where federation can be quite handy is in the context of local data. Epidemiological data on, for example, Zika outbreaks can contain large sets of measurements spread over multiple time frames. Loading those measurements into Wikidata can be difficult, especially if the outbreak is ongoing and new data arrive at rapid intervals. One solution that enables integration of such data with other resources like Wikidata is running distributed queries from a local SPARQL endpoint. The local SPARQL endpoint has two roles: first, it collects the measurements from the different Zika studies; second, federated queries can be executed to enrich these measurements with knowledge from Wikidata. We have created an example script that takes data on Zika outbreaks, converts it to linked data as RDF, and loads it into a local SPARQL endpoint. This prototype is available on github.

    This approach also works when one would like to integrate sensitive data (e.g. clinical patient data) with external Wikidata knowledge, if the local endpoint is maintained within a secure infrastructure that allows fetching data from outside the infrastructure but prevents exports.


    Indeed SPARQL has a steep learning curve.

    Although writing SPARQL queries can be perceived as quite intimidating, the ability to run federated queries on Wikidata content is very valuable, and it is needed to make Wikidata the central hub of research data in the life sciences. The effort to learn SPARQL is worth it. Fortunately, Wikidata provides a large set of example queries, useful both for inspiration and as learning material.
    There are also quite a few ongoing developments aimed at making query writing easier.

    There is also an R package that integrates SPARQL with R scripts, and the example queries from Wikidata can be scraped, which means that one can use the advantages SPARQL offers without writing a SPARQL query, simply by building on what others have already made.

    Finally, there is always the help of the Twitter community; many are quite eager to share SPARQL knowledge.

    The Gene Wiki project: Looking to the future v.2017

    Posted by on Jul 7, 2017 in Gene Wiki, GeneWikiRenewal, proposal, Wikidata | 3 comments

    The Gene Wiki project has been generously funded by the National Institute of General Medical Sciences (NIGMS) since 2009. As the second funding period wraps up early next year, it was time to once again look forward and think about our vision for the next 4-5 years. Posted below is what we came up with, submitted earlier this week to the NIH as a competing renewal proposal with Lynn Schriml and Kristian Andersen as co-investigators. Fingers crossed!

    Also a fine time to recognize that this proposal resulted from the direct and indirect contributions of so many people — postdocs, grad students, staff, past and current collaborators, Wikidata and Wikipedia communities, etc etc — far too many to name individually here. For a mostly comprehensive list, please see our grant-related publications.

    New Data Release for 201706

    Posted by on Jun 20, 2017 in data release, | 0 comments

    Another fresh data release for MyVariant.info is out! In this data release, we have updated the data from ClinVar to its latest version, and also added two new fields under ClinVar and ExAC to handle specific cases: genotype sets and multi-allelic variants. Here are more details.

    Data Sources Updated

    ClinVar was updated to its latest version (the same version for both the hg19 and hg38 assemblies):

    Some numbers for GRCh37/hg19 variants:

    Source     Last release   New release   # variants (last)   # variants (new)
    ClinVar    2017-04        2017-06       282,772             307,101

    Similarly, some numbers for GRCh38/hg38 variants:

    Source     Last release   New release   # variants (last)   # variants (new)
    ClinVar    2017-04        2017-06       282,956             307,286

    ClinVar annotations are available under the "clinvar" subfields for each annotated variant. MyVariant.info aggregates annotations from ClinVar, dbSNP, dbNSFP and 12 other sources for each variant, so you can access them all in one request.

    The total number of unique variants is now over 424M (424,519,520), slightly higher than our previous release in April 2017, which had 424,515,266. More details about the variant data we provide are always available in our documentation. Programmatic access to this information is available from our metadata endpoint (and hg38 metadata).

    New Field for genotype set under ClinVar

    There are a few submissions in ClinVar that represent assertions about simple or complex genotypes. To include this information in MyVariant.info, we have added a new genotypeset field under clinvar. It has two subfields, genotype and type. The "genotype" field lists, as HGVS IDs, all variants sharing the same genotype with the target variant, and the "type" field specifies the kind of genotype these variants share, e.g. "CompoundHeterozygote".

    • Query for variants having "genotypeset" information:
    curl ''  

    • Or query for the genotypeset information for a specific variant:

    curl '>A?fields=clinvar.genotypeset'
    {
      "_id": "chr5:g.151208511G>A",
      "_version": 2,
      "clinvar": {
        "_license": "",
        "genotypeset": {
          "genotype": [...],
          "type": "CompoundHeterozygote"
        }
      }
    }
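Once the JSON response is retrieved, reading out the co-occurring variants is straightforward. A minimal Python sketch against the response shape above; the second HGVS ID is hypothetical, for illustration only:

```python
# Illustrative response following the clinvar.genotypeset shape shown above;
# the second HGVS ID is hypothetical, not taken from ClinVar.
response = {
    "_id": "chr5:g.151208511G>A",
    "clinvar": {
        "genotypeset": {
            "genotype": ["chr5:g.151208511G>A", "chr5:g.151209000C>T"],
            "type": "CompoundHeterozygote",
        }
    },
}

gset = response.get("clinvar", {}).get("genotypeset", {})
# variants sharing the genotype with the queried variant, minus itself
partners = [v for v in gset.get("genotype", []) if v != response["_id"]]
print(gset.get("type"), partners)  # prints: CompoundHeterozygote ['chr5:g.151209000C>T']
```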

    New Field for multi-allelic under ExAC and ExAC-nonTCGA subset.

    A multi-allelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference allele as one and therefore allowing for two or more variant alleles. The VCF source file from ExAC provides information about multi-allelic variants by organizing all multi-allelic variants at the same locus into one record. Hence, we decided to include a 'multi-allelic' field under 'exac'. The field lists, as HGVS IDs, all multi-allelic variants related to a specific variant.

    Thus, users can query for all multi-allelic variants for a target variant, e.g. chr10:g.103234255C>G, using:

    curl '>G?fields=exac.multi-allelic'
    {
      "_id": "chr12:g.103234255C>G",
      "_version": 3,
      "exac": {
        "_license": "",
        "multi-allelic": [...]
      }
    }

    Or query for all multi-allelic variants in ExAC using:

    curl ''  

    Please note that these two fields do not introduce any incompatible changes to the data structure, so your existing code should continue to work.

    That's all! And as always, feel free to reach us at or @myvariantinfo if you have any questions or feedback.

    Happy Fathers Day!

    Posted by on Jun 18, 2017 in biocuration, citizen science, conference, mark2cure, presentation | 0 comments

    A HUGE thanks to all the dads (and EVERYONE) who has been contributing to make a difference for the NGLY1 families.

    Shipping delays Apologies to international prize and drawing winners who were waiting for their prizes. Most of the international packages that we shipped out in May/June have been returned to us due to customs issues (fortunately, this happened at some point prior to shipping so the postage on these is still good, unfortunately, it took a long time for these to get back to us so we can address the issue). We’ll be trying again to get these out ASAP.

    Max’s original project slam now online As mentioned in our previous newsletter, Max delivered the project slam for Mark2Cure at the Citizen Science Conference in Minnesota. The project slam talks were supposed to have been recorded and still may be released by the Citizen Science Association someday, but we couldn’t wait. Here’s our recording of Max’s project slam. He finished within his allotted four minutes, and was engaging enough to win one of three invitations to deliver an even shorter version of the slam at an event the following day.
    You can check it out here:

    You be the scientist! One thing we’ve heard (and quite agree with) at the Citizen Science Conference is that trained volunteers are capable of doing more than simple tasks. Mark2Curators have very much fed into the tutorial process, and played an important role in testing and improving the design of the interface. The entities our users have identified from the text have already yielded interesting clues which we’ve used to expand the set of documents to investigate, and by now, there are users who have read a lot of abstracts—A LOT! If you’ve read something that sticks out in your mind as being potentially related to NGLY1-deficiency, share it with us! We’d love to hear YOUR hypothesis on what might be an interesting term to explore and why.

    Science Game Lab: tool for the unification of biomedical games with a purpose

    Posted by on Jun 16, 2017 in games, genegames, gwaps, Science Game Lab, SGL, sulab | 0 comments

    Scripps team: Benjamin M. Good, Ginger Tsueng, Andrew I Su
    Playmatics Team: Sarah Santini, Margaret Wallace, Nicholas Fortugno, John Szeder, Patrick Mooney
    With helpful ideas from: Jerome Waldispuhl, Melanie Stegman
    Games with a purpose and other kinds of citizen science initiatives demonstrate great potential for advancing biomedical science and improving STEM education.  Articles documenting the success of projects such as Foldit and Eyewire in high-impact journals have raised wide interest in new applications of the distributed human intelligence that these systems have tapped into.  However, the path from a good idea to a successful citizen science game remains highly challenging.  Apart from the scientific difficulties of identifying suitable problems and appropriate human-powered solutions, the games still need to be created, need to be fun, and need to reach a large audience that remains engaged for the long term.  Here, we describe Science Game Lab (SGL), a platform for bootstrapping the production, facilitating the publication, and boosting both the fun and the value of the user experience for scientific games with a purpose.
    Ever since the Foldit project famously demonstrated that teams of human game players could often outperform supercomputers at the challenging problem of 3D protein structure prediction, so-called ‘games with a purpose’ have seen increasing attention from the biomedical research community.  A few other games in this genre include: Phylo for multiple sequence alignment, EteRNA for RNA structure design, Eyewire for mapping neural connectivity, The Cure for breast cancer prognosis prediction, Dizeez for gene annotation, and MalariaSpot for image analysis.  Apart from tapping into human intelligence at scale, these efforts have also produced valuable educational opportunities.  Many of these games are now used to introduce their underlying concepts in classroom settings, where games in all forms are increasingly working their way into curricula.  Concomitant with the rise of these ‘serious games’, citizen science efforts such as the Zooniverse and Mark2Cure have sought similar aims but have packaged their work as volunteer tasks, analogous to unpaid crowdsourcing tasks, rather than as elements of games.
    Many of these initiatives have succeeded in independently addressing challenging technical problems through human computation, improving science education, and generally raising scientific awareness.  However, with so much interest from the scientific community and a booming ecosystem of game developers, there are actually relatively few of these games in operation now.  Recognizing the opportunity, various groups have attempted to push the area forward through new funding opportunities and through various ‘game jams’ such as the one that produced the game ‘genes in space’ for use in analyzing microarray data in cancer.  Here, we take a different approach towards expanding the ecosystem of games with a scientific purpose.  Rather than attempting to seed the genesis of specific new game-changing games, we hope to lower the barrier to entry for new games and related citizen science tasks to generally promote the development of the entire field.  With this high-level aim in mind, we developed Science Game Lab (SGL) to make it easier for developers to create successful scientific games or game-like learning and volunteer experiences.  Specifically, SGL is intended to address the challenges of recruiting players and volunteers, keeping them engaged for the long term, and reducing the development costs associated with creating a scientific gaming experience.
    The Science Game Lab Web application
    SGL is a unique, open-source portal supporting the integration of games and volunteer experiences meant to advance science and science education.  Unlike other related sites that act more like volunteer management and/or project directory services, such as SciStarter and Science Game Center, SGL is not simply a listing of related websites.  Rather, it is an attempt to create a user experience that takes place directly within the SGL context yet still incorporates content from third parties.  The system is largely inspired by game industry portals such as Kongregate that enable developers to incorporate their games directly into a unified metagame experience.
    Players can use the portal to find and play games with their achievements within the games tracked on site-wide high score lists and achievement boards (Figure 1).  Players can earn the SGL points that drive these leaderboards for actions taken in different games.  In this way, SGL provides developers with access to a metagame that can be used to encourage players in addition to the incentives offered within individual games (Figure 2).  This metagame can also be used by the system administrators to help direct the player community’s attention to particular games or particular tasks within games.  For example, actions taken on new games might earn more points than actions taken on more established games as a way to ‘spread the wealth’ generated by successful games.    

    Figure 1.  SGL home page demonstrating site-wide high score list, game listing, and links to achievements, help, and user profile information.
    Figure 2.  Badges displayed on user’s profile page.  Available badges not yet achieved are greyed out.
     Developers interact with SGL by incorporating a small javascript library into their application and using the SGL ‘developer dashboard’ to pair up events in their game with points, badges and quests managed by the SGL server.  At this time, SGL only supports games that operate online as Web applications.  The games are hosted by the developers and rendered in the SGL context within an iframe.  The SGL iframe provides a ‘heads up display’ that provides real time feedback to game players with respect to events sent back to the SGL server such as earning points, gathering badges, or progressing through the stages of a quest (Figure 3).  This display provides developers with the ability to add game mechanics to sites that are not overtly games.  For example, Wikipathways incorporated a pathway editing tutorial into SGL, using the heads up display to reward users with SGL points and badges for completing various stages of the tutorial.   The tutorial also took advantage of the SGL quest-building tool (Figure 4).  Games are submitted by developers for approval by SGL administrators.  Once approved, the games appear in the public view and can be accessed by any player.  

    Figure 3.  The heads up display provided by the SGL iframe.  Shows events captured by the API and provides users with immediate feedback.   

    Figure 4.  Tasks in SGL can be grouped into quests.  The figure shows a particular user’s progress through various quests available within the system.

    If a critical initial mass of effective games can be integrated, SGL could strongly benefit new developers by providing immediate access to a large player population.  Site-level status, identity and community features can help with the even greater challenge of long-term player engagement, a noted problem in the field.  Within the context of science-related gaming, such status icons might eventually be used as practically useful, real-world marks of achievement in line with the notion of ‘Open Badges’.  As demonstrated by the Wikipathways tutorial application, SGL can be used to replace the need for developers to host their own login systems, user tracking databases, and reward systems – all of which can be accomplished using the SGL developer tools. Citizen scientists are not homogenous in their motivations. Designing to be inclusive of gamers and non-gamers can be challenging. By offering an alternative means of experiencing a web-based citizen science application, SGL allows developers to cater to both their gaming and non-gaming contributor audience. Together, these features unite to raise the overall potential for growth within the world of citizen science and scientific gaming.
    Future directions
    SGL is currently functional, but so far has attracted only a small number of developers willing to integrate their content into the portal.  Future work would need to address the challenge of raising the perceived value of integration with the site while lowering the perceived difficulty.  Looking forward, key challenges for the future of SGL include better support for:
    • games meant for mobile devices
    • development of quests that span multiple games
    • teachers to build SGL-focused lesson plans and track student progress
    • creating new ‘SGL-native’ games
    • integration with external authentication systems
    None of these are insurmountable challenges, but they all require significant continued investment in software development.  As an open source project, we encourage contributions from anyone that shares in our vision of spreading and doing science through the grand unifying principle of fun.

    Building communities of knowledge with Wikidata

    Posted by on Jun 16, 2017 in crowdsourcing, Gene Wiki, semantic wikipedia, sulab, wiki, Wikidata, wikipedia | 0 comments

    As the Wikimedia Movement works to define its strategy for the next fifteen years, it is worthwhile to consider how its recent product Wikidata may fit into that strategy.  As its homepage states,
    “Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.”
    Wikidata is a particular kind of database designed to capture statements about items in the world with references that support those statements.  Because Wikidata is a database, its contents are meant to be viewed in the context of software that retrieves the data through queries and then renders the data to meet the needs of a user in a certain context.  The same data can thus be viewed on Wikidata-specific pages such as and in the infoboxes of Wikipedia articles such as.  Importantly, Wikidata content can also be used in applications outside of the Wikimedia family such as.
    Examples of Wikidata use now include:
    The molecular biology community (and in particular the Gene Wiki group) has embraced Wikidata as a global platform for knowledge integration and distribution.  To help envision how Wikidata may fit into the strategic vision of the WMF movement, it is worth taking a look at how and why this particular community is using Wikidata.  
    History of the Gene Wiki initiative
    The sequencing of the human genome at the beginning of this century and the consequent rush of data and new technology for producing even more data fundamentally changed how research in biology is conducted.  Before the year 2000, research typically proceeded with a single-gene focus.  A typical PhD thesis would entail the analysis of the genetics or function of one gene or protein at a time.  A few years after the first genome, however, it became possible to measure the activity of tens of thousands of genes at once, resulting in an omnipresent problem of generating interpretations of experimental results containing hundreds of genes.  While a scientist may come to grasp the literature surrounding a single gene quite well, it is not possible to know everything there is to know about all 20,000+ genes in the genome – particularly when this knowledge is expanding on a minute-by-minute basis.  As a consequence, there arose a need to produce summaries of what was known about each gene so that researchers could quickly grasp its nature and easily find links to more detailed references as needed.  By 2008, many different research groups published wikis attempting to allow the scientific community to generate the required articles, e.g. WikiProteins, WikiGenes, and the Gene Wiki.  The Gene Wiki project was unique among this group as it anchored itself directly to Wikipedia and, likely as a result of that decision, has enjoyed long-term success.  This initiative works within the English Wikipedia community to encourage and support the collection of articles about human genes.  Its main contributions are the infobox seen on the right-hand side of these articles and software for generating new article stubs using that template.
    Wikidata and the Gene Wiki project
    For the past several years, the Gene Wiki core team (funded by an NIH grant) has focused primarily on seeding Wikidata with biomedical knowledge.  Compared to the previous approach of managing data via direct inclusion and parsing of infobox templates, this makes the data much easier to maintain automatically and, importantly, opens it up for use by applications beyond Wikipedia.  One of the first products of that process was a new module (Infobox_gene) that draws all the data needed to render the gene infobox dynamically from Wikidata, greatly reducing the technical challenge of keeping the data presented there in sync with primary sources.  
    In addition to the relatively simple collection of gene identifiers and links to key public databases presented in the infoboxes, Wikidata now has an extensive and growing network of knowledge linking genes to proteins, proteins to drugs, drugs to diseases, diseases to pathogens, pathogens to places, places to events, events to people, and so on.  This unique, open, referenced knowledge graph may eventually become the closest thing to ‘the sum of all human knowledge’.  Capturing knowledge in this structured form makes it possible to use it in all kinds of applications, each with its own community-specific user experience.  As a case in point, the Gene Wiki group created Wikigenomes based primarily on data loaded into Wikidata.  This was followed quickly by Chlambase, an application specifically focused on distributing and collecting knowledge about different Chlamydia genomes.  These applications provide domain-specific user interface components, such as genome browsers, that are needed to present the relevant information effectively and thereby attract the attention of specialist users.  These users, in turn, have the opportunity to contribute their knowledge back to the broader community through contributions to Wikidata that can be mediated by the same software.  
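    To make the idea of applications drawing on this knowledge graph concrete, here is a minimal Python sketch of how one might ask Wikidata's public SPARQL endpoint for the protein(s) a human gene encodes. The property IDs used (P351 for Entrez Gene ID, P688 for "encodes") reflect Wikidata's data model, but the helper function and the TP53 example are illustrative, not part of the Gene Wiki software itself.

```python
# Sketch: building a SPARQL query for Wikidata's gene -> protein links.
# Assumes Wikidata properties P351 (Entrez Gene ID) and P688 (encodes).

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def build_gene_protein_query(entrez_gene_id: str, limit: int = 10) -> str:
    """Return a SPARQL query that finds the protein(s) encoded by the
    gene carrying the given Entrez Gene ID."""
    return f"""
    SELECT ?gene ?geneLabel ?protein ?proteinLabel WHERE {{
      ?gene wdt:P351 "{entrez_gene_id}" .   # gene with this Entrez ID
      ?gene wdt:P688 ?protein .             # gene 'encodes' protein
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    LIMIT {limit}
    """

# Actually running the query requires network access, e.g.:
# import json, urllib.parse, urllib.request
# url = (WIKIDATA_SPARQL_ENDPOINT + "?format=json&query="
#        + urllib.parse.quote(build_gene_protein_query("7157")))  # 7157 = TP53
# results = json.load(urllib.request.urlopen(url))

query = build_gene_protein_query("7157")
print(query)
```

    The same endpoint and query pattern serve any downstream application, which is precisely why keeping the data in Wikidata, rather than inside Wikipedia templates, makes it reusable.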
    Wikidata and the world
    The molecular biology research community, as represented by the Gene Wiki project, is an early adopter of Wikidata as a community platform for the collaborative curation and distribution of structured knowledge, but it is not alone.  The same fundamental patterns are already being applied by other communities, e.g. those interested in digital preservation and open bibliography.  In each case, we see communities working to transition from the current dominant paradigm of private knowledge management towards the knowledge-commons approach made possible by Wikidata.  This is not unlike the transition from the world of the Encyclopedia Britannica to the world of Wikipedia.  The only important difference is that the knowledge in question is structured in a way that makes it easier to reuse in different ways and in different applications.  

    Wikidata provides a mechanism for massively increasing the global good generated by the Wikimedia Foundation’s work by capturing knowledge in a form that can be flexibly reused to empower all manner of software with the sum of human knowledge.  

    Happy Memorial Day weekend!

    Posted by on May 26, 2017 in citizen science, CitSci2017, Cochrane Crowd, conference, mark2cure, MedLitBlitz, poster, presentation | 0 comments

    The last few weeks have been a bit hectic, so we've got plenty of news and info to share with you.

    First of all, if you haven't seen it yet, Cochrane Crowd has posted about our joint webinar and the #MedLitBlitz. If you missed the webinar or had technical difficulties/time zone issues with it, it's available on YouTube. The prize packages for the top three participants of #MedLitBlitz are packed and will be shipped either today or early next week (depending on whether today's shipments have already been picked up).

    Secondly, Mark2Cure was at the Citizen Science Association conference from 2017.05.15 to 2017.05.20, and was fortunate enough to share YOUR work with an audience of scientists who LOVE citizen science! More than a few researchers stopped to introduce themselves to me and spoke highly of our community! Although it's always weird to hear a recording of your own voice, I recorded my presentation because it wouldn't be fair to talk about the amazing work you've done without sharing it with you! You can find my presentation for the biomedical session on our YouTube channel. On a side note, I know the audio quality isn't the best, which is why I've transcribed it using YouTube's captioning software. If you have trouble hearing the presentation (because of the poor audio quality), please turn on the closed captions.

    Max also delivered two lightning talks for the event, which I hope to upload soon.
    Not available yet, but will be soon
    In addition to the talks, we had a poster for Mark2Cure and a table at two public events.
    Max spreading the love for Mark2Cure

    We were especially pleased to be so close to our buddy at Cochrane Crowd for this event
    Cochrane Crowd looking good

    Lastly, it looks like one of the missions was completed just as I was settling back in after the conference. A HUGE thanks to everyone who helped complete the 'carpingly' mission. A new mission has been launched in its place, so check it out if you have some free time.