Transcript: EUtils

A data retrieval workflow using
NCBI E-Utils + Python
John Pinney
Tech talk Tue 12th Nov
Task
Produce a data set given particular
constraints.
Allow easy revision/updates as needed.
Output some kind of report for a biologist.
(One possible) solution
A number of DBs/tools now accept queries via RESTful*
interfaces, in principle allowing:
- up-to-date data set retrieval
- fully online analysis workflows
*REST = Representational State Transfer.
A client/server architecture that ensures stateless
communication, usually implemented via HTTP
requests.
Bioinformatics REST services
NCBI E-utils: PubMed, other DBs, BLAST
EBI web services: various
UniProt: protein sequences
KEGG: metabolic network data
OMIM: human genetic disorders
+ many others (see e.g. biocatalogue.org for a registry)
E-Utils services
ESummary
EFetch
ESearch
ELink
all available through
http://eutils.ncbi.nlm.nih.gov/
Basic URL API
e.g. retrieve IDs of all human genes:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?retmode=xml&db=gene&term=9606[TAXID]

esearch            (which EUtil)
retmode=xml        (output format)
db=gene            (which DB)
term=9606[TAXID]   (query term)
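As a rough sketch (not from the talk), the same URL can be assembled in
Python with urllib.urlencode; the base URL and parameter names are taken
from the breakdown above:

import urllib

base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
params = {'retmode': 'xml',       # output format
          'db': 'gene',           # which DB
          'term': '9606[TAXID]'}  # query term
url = base + 'esearch.fcgi?' + urllib.urlencode(params)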
My tasks
1. Produce a list of human genes that are associated with
at least one resolved structure in PDB
AND
at least one genetic disorder in OMIM
2. Make an online table to display them
Easy:
Python requests using PyCogent
PyCogent is a Python bioinformatics module that includes
convenience methods for interaction with a number of online
resources.
from cogent.db.ncbi import *

# Fetch protein record 23491729 from NCBI in FASTA format.
ef = EFetch(id='23491729', rettype='fasta')
protein = ef.read()
Bit more typing but still easy:
Python requests using urllib2
For services that are not available through PyCogent, you can
construct your own URLs using urllib2.
import urllib2
url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/
esummary.fcgi?retmode=xml&db=gene&id=7157"
result = urllib2.urlopen(url).read()
(TIP: use urllib.quote_plus to escape spaces and other special
characters when preparing your URL query).
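For instance (an illustrative query term, not one used in the talk):

import urllib
term = urllib.quote_plus('tumor protein p53 AND "Homo sapiens"[Organism]')
# -> 'tumor+protein+p53+AND+%22Homo+sapiens%22%5BOrganism%5D'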
Making your life much easier:
XML handling using BeautifulSoup
Using retmode=xml ensures consistency in output format, but it
can be very difficult to extract the data without a proper XML
parser.
The simplest and most powerful XML handling in Python I have
found is via the BeautifulSoup object model.
Making your life much easier:
XML handling using BeautifulSoup
Example: extract all structure IDs linked to gene 7153.
e = ELink(db='structure', dbfrom='gene', id=7153)
result = e.read()
from bs4 import BeautifulSoup

# Parse the ELink XML and collect the ID of every linked structure record.
soup = BeautifulSoup(result, 'xml')
linkset = soup.eLinkResult.LinkSet
s = [x.Id.text for x in linkset.LinkSetDb.findAll('Link')]
Using WebEnv to chain requests
If you specify usehistory='y', NCBI can remember your output result
(e.g. a list of gene IDs) and use it as a batch input for another
EUtil request.
This is extremely useful for minimising the number of queries for
workflows involving large sets of IDs.
You keep track of this “environment” using the WebEnv and
query_key fields.
Using WebEnv to chain requests
def webenv_search(**kwargs):
    # Run an ESearch with the history server enabled and return the
    # handles needed to reuse its result in later requests.
    e = ESearch(usehistory='y', **kwargs)
    result = e.read()
    soup = BeautifulSoup(result, 'xml')
    return {'WebEnv': soup.WebEnv.text,
            'query_key': soup.QueryKey.text}
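For example (an illustrative call, not shown in the talk), the search for
human structures can be kept on the NCBI history server like this:

env = webenv_search(db='structure', term='9606[TAXID]')
# env['WebEnv'] and env['query_key'] identify the stored result and can
# be passed to a later ELink/ESummary/EFetch request.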
Workflow for gene list
(in the diagram, a grey background = a step using WebEnv)

Branch 1 (structures):
ESearch  db=structure, term=9606[TAXID]      ->  structure IDs
ELink    db=protein, dbfrom=structure        ->  protein IDs
ELink    db=gene, dbfrom=protein             ->  gene IDs

Branch 2 (disorders):
ESearch  db=omim, term="omim medgen"[Filter] ->  OMIM IDs
ELink    db=gene, dbfrom=omim                ->  gene IDs

The final gene list is the intersection (&) of the two sets of gene IDs.
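As a sketch of how one branch of this workflow can be chained (assuming
the standard WebEnv/query_key URL parameters and ELink's
cmd=neighbor_history option, and using urllib2 + BeautifulSoup rather
than the PyCogent wrappers):

import urllib, urllib2
from bs4 import BeautifulSoup

BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def eutil(tool, **params):
    # Run one E-utility request and return the parsed XML.
    url = BASE + tool + '.fcgi?' + urllib.urlencode(params)
    return BeautifulSoup(urllib2.urlopen(url).read(), 'xml')

# ESearch: all human structures, kept on the history server.
s = eutil('esearch', db='structure', term='9606[TAXID]',
          usehistory='y', retmode='xml')
env, key = s.WebEnv.text, s.QueryKey.text

# ELink: structure -> protein, reusing the stored IDs as input and
# storing the linked IDs as well (cmd=neighbor_history).
link = eutil('elink', dbfrom='structure', db='protein', retmode='xml',
             WebEnv=env, query_key=key, cmd='neighbor_history')
env, key = link.WebEnv.text, link.find('QueryKey').text

# ELink: protein -> gene, giving the first set of gene IDs.
link = eutil('elink', dbfrom='protein', db='gene', retmode='xml',
             WebEnv=env, query_key=key, cmd='neighbor_history')

The OMIM branch follows the same pattern, and the two resulting gene ID
sets can then be intersected in Python.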
My tasks
✓1. Produce a list of human genes that are associated with
at least one resolved structure in PDB
AND
at least one genetic disorder in OMIM
2. Make an online table to display them
(next time!)
Summary
Using NCBI EUtils to produce a data set under given constraints
was relatively straightforward.
Resulting code is highly re-usable for future workflows
(especially if written as generic functions).
Python modules used
PyCogent
Simple request handling for the main EUtils.
pycogent.org
urllib2
General HTTP request handler.
docs.python.org/2/library/urllib2.html
BeautifulSoup
Amazingly easy to use object model for XML/HTML.
www.crummy.com/software/BeautifulSoup/bs4/doc/