ID mapping is a very common, and often not much fun, task for every bioinformatician. Suppose you have a list of gene symbols or reporter IDs from an upstream analysis, and your next analysis requires gene IDs (e.g. Entrez gene IDs or Ensembl gene IDs). Converting from one identifier type to another is conceptually simple but often tedious.

Here we want to show you how to use the mygene module in Python to do ID mapping quickly and easily. mygene is a convenient Python module for accessing the MyGene.info gene query web services.


Installing mygene

Installing mygene is easy, as pip is your friend:

   pip install mygene

Now you just need to import it and instantiate the MyGeneInfo class:

   import mygene
   mg = mygene.MyGeneInfo()

Mapping gene symbols to Entrez gene IDs

Suppose xli is a list of gene symbols you want to convert to Entrez gene IDs:

   xli = ['DDX26B', 'CCDC83', 'MAST3', 'FLOT1', 'RPL11', 'ZDHHC20',
          'LUC7L3', 'SNORD49A', 'CTSH', 'ACOT8']

You can then call the querymany method, telling it your input is “symbol”, and you want “entrezgene” (Entrez gene IDs) back:

   out = mg.querymany(xli, scopes='symbol', fields='entrezgene', species='human')

In short, scopes defines the type of the input identifier, fields defines the variable(s) to be returned, and species limits the species to search. The returned “out” looks like this:

[{u'_id': u'203522', u'entrezgene': 203522, u'query': u'DDX26B'},
 {u'_id': u'220047', u'entrezgene': 220047, u'query': u'CCDC83'},
 {u'_id': u'23031', u'entrezgene': 23031, u'query': u'MAST3'},
 {u'_id': u'10211', u'entrezgene': 10211, u'query': u'FLOT1'},
 {u'_id': u'6135', u'entrezgene': 6135, u'query': u'RPL11'},
 {u'_id': u'253832', u'entrezgene': 253832, u'query': u'ZDHHC20'},
 {u'_id': u'51747', u'entrezgene': 51747, u'query': u'LUC7L3'},
 {u'_id': u'26800', u'entrezgene': 26800, u'query': u'SNORD49A'},
 {u'_id': u'1512', u'entrezgene': 1512, u'query': u'CTSH'},
 {u'_id': u'10005', u'entrezgene': 10005, u'query': u'ACOT8'}]

The mapping result above is returned as a list of dictionaries. Each dictionary contains the fields you asked for, in this case the “entrezgene” field. Each dictionary also contains the matching query term, “query”, and an internal ID, “_id”, which is the same as “entrezgene” most of the time (it will be an Ensembl gene ID if a gene is available from Ensembl only).
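
Since the result is a plain list of dictionaries, it is easy to collapse it into a lookup table. The sketch below is one possible way to do that; collapse_hits is a hypothetical helper name (not part of mygene), the sample rows are copied from the output above, and the last row assumes that an unmatched query comes back with a “notfound” flag, which you should verify against your installed version.

```python
def collapse_hits(results, field="entrezgene"):
    """Turn querymany-style results into ({query: [values]}, [unmapped queries])."""
    mapping = {}
    for hit in results:
        # Skip entries flagged as misses or lacking the requested field.
        if hit.get("notfound") or field not in hit:
            continue
        # A query term may match more than one gene, so keep a list per query.
        mapping.setdefault(hit["query"], []).append(hit[field])

    # Report each query that produced no usable value, in input order.
    unmapped = []
    for hit in results:
        query = hit["query"]
        if query not in mapping and query not in unmapped:
            unmapped.append(query)
    return mapping, unmapped


# Two rows taken from the output shown above, plus one hypothetical miss.
sample = [
    {"_id": "203522", "entrezgene": 203522, "query": "DDX26B"},
    {"_id": "1512", "entrezgene": 1512, "query": "CTSH"},
    {"query": "NOT_A_GENE", "notfound": True},  # assumed shape of a miss
]

mapping, unmapped = collapse_hits(sample)
print(mapping)
print(unmapped)
```

With a live client you would pass the list returned by querymany instead of sample. Keeping a list per query is a deliberate choice here, so that ambiguous symbols are not silently reduced to a single ID.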

Although the simple example above uses human gene symbols, mygene actually supports over 30 common identifier types (see the list here) and almost all species indexed by NCBI. The annotation data are updated on a weekly basis.
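
To illustrate a different input type, here is a minimal sketch that starts from Ensembl gene IDs and asks for two fields at once. The map_ids helper and the OfflineClient stand-in are hypothetical and exist only so the example runs without network access; the input ID is a placeholder, not a real Ensembl ID. With the real library you would pass mygene.MyGeneInfo() as the client.

```python
def map_ids(client, ids, scopes, fields, species="human"):
    """Hypothetical helper: forward a list of IDs to client.querymany.

    `client` is assumed to expose querymany with the scopes, fields and
    species keyword arguments, as mygene.MyGeneInfo() does.
    """
    return client.querymany(ids, scopes=scopes, fields=fields, species=species)


class OfflineClient:
    """Stand-in for MyGeneInfo, for illustration only.

    It records the keyword arguments it receives and reports every
    query as not found, so no web request is made.
    """

    def __init__(self):
        self.calls = []

    def querymany(self, ids, **kwargs):
        self.calls.append((list(ids), kwargs))
        return [{"query": i, "notfound": True} for i in ids]


client = OfflineClient()
out = map_ids(
    client,
    ["ENSG_PLACEHOLDER"],        # placeholder, substitute real Ensembl gene IDs
    scopes="ensembl.gene",       # input IDs are Ensembl gene IDs
    fields="entrezgene,symbol",  # return both Entrez gene ID and symbol
)
print(client.calls[0][1])
```

The only things that change compared with the symbol example are the scopes and fields values; the call itself stays the same.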

Get the idea of how it works? Continue to read the full tutorial here, which covers slightly more advanced examples and edge cases…