Literature Mining and
Systems Biology
Lars Juhl Jensen
EMBL
Why?
Overview
• Information retrieval: finding the papers
• Entity recognition: identifying the substance(s)
• Information extraction: formalizing the facts
• Text mining: finding nuggets in the literature
• Integration: combining text and biological data
Status
• IR, ER, and simple IE
methods are fairly well
established
• Advanced NLP-based IE
systems are rapidly being
improved
• Methods for text mining
and text/data integration
are still in their infancy
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Information retrieval
• Ad hoc information retrieval
The user enters a query/a set of keywords
The system attempts to retrieve the relevant texts from
a large text corpus (typically Medline)
• Text categorization
A training set of texts is created in which texts are
manually assigned to classes (often only yes/no)
A machine learning method is trained to classify texts
This method can subsequently be used to classify a
much larger text corpus
Ad hoc IR
• These systems are very useful since the user can
provide any query
The query is typically Boolean (yeast AND cell cycle)
A few systems instead allow the relative weight of each
search term to be specified by the user
• The art is to find the relevant papers even if they
do not actually match the query
Ideally, our example sentence should be retrieved by the
query “yeast cell cycle” even though none of these words
appear in it
Automatic query expansion
• In a typical query, the user will not have provided
all relevant words and variants thereof
• By automatically expanding queries with
additional search terms, recall can be improved
Stemming removes common endings (yeast / yeasts)
Thesauri can be used to expand queries with synonyms
and/or abbreviations (yeast / S. cerevisiae)
The next logical step is to use ontologies to make
complex inferences (yeast cell cycle / Cdc28 )
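The first two expansion steps can be sketched as follows. This is a minimal illustration, not a production stemmer: the THESAURUS contents, the crude plural-stripping rule, and the substring matching are all assumptions made for the example.

```python
# Hypothetical thesaurus; a real system would use curated synonym lists.
THESAURUS = {
    "yeast": {"s. cerevisiae", "saccharomyces cerevisiae"},
}

def stem(word):
    """Crude stemming: strip a plural "s" (not a full stemming algorithm)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def expand_query(terms):
    """Map each query term to the set of variants that should be searched."""
    expanded = {}
    for term in terms:
        base = stem(term.lower())
        expanded[term] = {base} | THESAURUS.get(base, set())
    return expanded

def matches(document, expanded):
    """Boolean AND: every term must match through at least one variant."""
    text = document.lower()
    return all(any(variant in text for variant in variants)
               for variants in expanded.values())

query = expand_query(["yeasts"])
print(matches("Cell cycle control in S. cerevisiae", query))
```

With the expansion, a text that only mentions S. cerevisiae is still found by the query "yeasts". Ontology-based inference (yeast cell cycle / Cdc28) would need a further lookup step that is not shown here.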
Document similarity
• The similarity of two documents can be defined
based on their word content
Each document can be represented by a word vector
Words should be weighted based on their frequency and
background frequency
The most commonly used scheme is tf*idf weighting
• Document similarity can be used in ad hoc IR
Rather than matching the query against each document
only, the N most similar documents are also considered
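A small sketch of word vectors with tf*idf weighting and cosine similarity. Whitespace tokenization and the plain log(N/df) form of idf are simplifying assumptions; real systems differ in both.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Represent each document as a dict of word -> tf*idf weight."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({word: count * math.log(n_docs / doc_freq[word])
                        for word, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Words that occur in every document get weight zero, so they do not contribute to the similarity.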
Document clustering
• Unsupervised clustering algorithms can be applied
to a document similarity matrix
All pairwise document similarities are calculated
Clusters of “similar documents” can be constructed
using one of numerous standard clustering methods
• Practical uses of document clustering
The “related documents” function in PubMed
Logical organization of the documents found by IR
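As one example of a standard clustering method, the sketch below groups documents by single linkage with a fixed similarity cutoff. The choice of method and the threshold parameter are assumptions for illustration; the slides do not prescribe a particular algorithm.

```python
def cluster_by_threshold(similarity, threshold):
    """Single-linkage clustering of a symmetric similarity matrix.

    Two documents end up in the same cluster if they are connected by a
    chain of pairs whose similarity is at least `threshold`.
    """
    n = len(similarity)
    cluster_of = list(range(n))  # each document starts in its own cluster
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                old, new = cluster_of[j], cluster_of[i]
                # Merge the two clusters by relabeling one of them.
                cluster_of = [new if c == old else c for c in cluster_of]
    clusters = {}
    for doc, label in enumerate(cluster_of):
        clusters.setdefault(label, []).append(doc)
    return sorted(clusters.values())
```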
Text categorization
• These systems are a lot less flexible than ad hoc
systems but can attain better accuracy
Works on a pre-defined set of document classes
Each class is defined by manually assigning a number
of documents to it
• Method
Rules may be manually crafted based on a very small
set of manually classified documents
Statistical machine learning methods can be trained on
a large number of classified documents
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Hints in the text
Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”)
Weaker: mitotic cyclin, Clb2, and Cdk1 (“cell cycle”)
Machine learning
• Input features
Word content or bi-/tri-grams
Part-of-speech tags
Filtering (stop words, part-of-speech)
Singular value decomposition
• Training
Support vector machines are best suited
Choice of kernel function
Separate training and evaluation sets, cross validation
Entity recognition
• An important but boring problem
The genes/proteins/drugs mentioned within a given text
must be identified
• Recognition vs. identification
Recognition: find the words that are names of entities
Identification: figure out which entities they refer to
Recognition without identification is of limited use
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Entities identified
S. cerevisiae proteins: Clb2 (YPR119W), Cdc28
(YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
Recognition
• Features
Morphological: mixes letters and digits or ends in -ase
Context: followed by “protein” or “gene”
Grammar: should occur as a noun
• Methodologies
Manually crafted rule-based systems
Machine learning (SVMs)
• But what can it be used for?
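The morphological and context features listed above could be computed per token roughly as follows. The function name and feature names are hypothetical; the resulting dictionary could feed either hand-written rules or a machine learning method. The grammar feature is omitted because it requires a part-of-speech tagger.

```python
import re

def name_features(tokens, index):
    """Compute simple name-recognition features for tokens[index]."""
    token = tokens[index]
    next_token = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    return {
        # Morphological: contains at least one letter and one digit
        "mixes_letters_and_digits": bool(re.search(r"[A-Za-z]", token)
                                         and re.search(r"\d", token)),
        # Morphological: typical enzyme suffix
        "ends_in_ase": token.lower().endswith("ase"),
        # Context: immediately followed by a cue word
        "followed_by_cue": next_token in {"protein", "gene"},
    }

tokens = "The GAL4 gene encodes a kinase".split()
print(name_features(tokens, 1))
```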
Identification
• A good synonyms list is the key
Combine many sources
Curate to eliminate stop words
• Flexible matching to handle orthographic variation
Case variation: CDC28, Cdc28, and cdc28
Prefixes: myc and c-myc
Postfixes: Cdc28 and Cdc28p
Spaces and hyphens: cdc28 and cdc-28
Latin vs. Greek letters: TNF-alpha and TNFA
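A rough sketch of flexible matching against a synonyms list. The normalization rules, the partial GREEK mapping, and the placeholder identifiers for TNFA and myc are assumptions for illustration; only YBR160W for Cdc28 comes from the example in these slides.

```python
import re

GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}  # partial, illustrative

def normalize(name):
    """Collapse case, spaces, hyphens, and spelled-out Greek letters."""
    key = re.sub(r"[\s-]+", "", name.lower())
    for word, letter in GREEK.items():
        key = key.replace(word, letter)  # crude: also hits words like "betaine"
    return key

def variants(mention):
    """Normalized forms of a mention, with optional prefix/postfix removed."""
    key = normalize(mention)
    forms = {key}
    if key.endswith("p") and len(key) > 3:
        forms.add(key[:-1])            # Cdc28p -> cdc28
    if mention.lower().startswith("c-"):
        forms.add(key[1:])             # c-myc -> myc
    return forms

def build_index(synonyms):
    """synonyms: dict of identifier -> list of names."""
    index = {}
    for identifier, names in synonyms.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(identifier)
    return index

def identify(mention, index):
    """Return the set of identifiers a mention may refer to."""
    found = set()
    for form in variants(mention):
        found |= index.get(form, set())
    return found

index = build_index({
    "YBR160W": ["Cdc28"],
    "placeholder_tnf_id": ["TNFA"],   # hypothetical identifier
    "placeholder_myc_id": ["myc"],    # hypothetical identifier
})
print(identify("CDC-28", index))
```

Returning a set rather than a single identifier keeps ambiguous names visible, so that a later disambiguation step can deal with them.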
Disambiguation
• The same word may mean many different things
Entity names may also be common English words
(hairy) or technical terms (SDS)
Protein names may refer to related or unrelated proteins
in other species (cdc2)
• The meaning can be resolved from the context
ER can distinguish between names and common words
Disambiguating non-unique names is a hard problem
Ambiguity between orthologs can safely be ignored
Co-occurrence extraction
• Relations are extracted for co-occurring entities
Relations are always symmetric
The type of relation is not given
• Scoring the relations
More co-occurrences → more significant
Ubiquitous entities → less significant
Same sentence vs. same paragraph
• Simple, good recall, poor precision
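One possible scoring scheme, shown as a sketch: the co-occurrence count is divided by the geometric mean of the two entities' individual counts, so that more co-occurrences raise the score and ubiquitous entities lower it. This particular formula is an assumption chosen for illustration, and the entity labels in the usage example are toy data.

```python
import math
from collections import Counter
from itertools import combinations

def score_cooccurrences(sentences):
    """sentences: iterable of sets of entity identifiers, one set per sentence.

    Returns a dict mapping each (a, b) pair, with a < b, to its score.
    """
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    return {(a, b): count / math.sqrt(single[a] * single[b])
            for (a, b), count in pair.items()}

# Toy data: entity "A" is ubiquitous, "C" and "D" are rare.
scores = score_cooccurrences([{"A", "B"}, {"A", "B"}, {"A", "C"},
                              {"A", "D"}, {"C", "D"}])
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Sentence-level and paragraph-level co-occurrences could be weighted differently by calling the function once per level and combining the scores.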
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Relations
Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and
Cdc5–Swe1
Wrong: Clb2–Cdc5 and Cdc28–Cdc5
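Applied to the example sentence, co-occurrence extraction amounts to listing every pair of recognized names. The sketch below assumes a simple name set and regular-expression tokenization; it yields all six pairs, including the two wrong ones, which illustrates the poor precision.

```python
import re
from itertools import combinations

SENTENCE = ("Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
            "phosphorylated Swe1 and this modification served as a priming "
            "step to promote subsequent Cdc5-dependent Swe1 "
            "hyperphosphorylation and degradation")

# The four S. cerevisiae proteins identified in the example
KNOWN_NAMES = {"Clb2", "Cdc28", "Swe1", "Cdc5"}

def cooccurring_pairs(sentence, known_names):
    """Return all unordered pairs of known names found in the sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    found = sorted({token for token in tokens if token in known_names})
    return list(combinations(found, 2))

for pair in cooccurring_pairs(SENTENCE, KNOWN_NAMES):
    print(pair)
```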
Categorization of relations
• Extracting specific types of relations
Text categorization methods can be used to identify
sentences that mention a certain type of relation
Filtering can be done before or after relation extraction
• Well suited for database curation
Text categorization can be reused
High recall is most important
Curators can compensate for the lack of precision
Relation extraction by NLP
• Information is extracted based on parsing and
interpreting phrases or full sentences
Good at extracting specific types of relations
Handles directed relations
• Complex, good precision, poor recall
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Relations:
Complex: Clb2–Cdc28
Phosphorylation: Clb2→Swe1, Cdc28→Swe1, and
Cdc5→Swe1
An NLP architecture
• Tokenization
Entity recognition with synonyms list
Word boundaries (multi words)
Sentence boundaries (abbreviations)
• Part-of-speech tagging
TreeTagger trained on GENIA
• Semantic labeling
Dictionary of regular expressions
• Entity and relation chunking
Rule-based system implemented in CASS
Semantic labeling
Gene and protein names
Cue words for entity recognition
Cue words for relation extraction
Named entity chunking
A CASS grammar recognizes
noun chunks related to gene
expression:
[nxgene The GAL4 gene]
Relation chunking
Our CASS grammar also extracts
relations between entities:
[nxexpr The expression of
[nxgene the cytochrome genes
[nxpg CYC1 and CYC7]]]
is controlled by
[nxpg HAP1]
[phosphorylation_active
Lyn, [negation but not Jak2]
phosphorylated
CrkL]
[expression_repression_active
Btk
regulates
the IL-2 gene]
[phosphorylation_active
Lyn
also participates in
[phosphorylation the tyrosine phosphorylation
and activation of syk]]
[expression_repression_active
IL-10
also decreased
[expression mRNA expression of
IL-2 and IL18 cytokine receptors]]
[phosphorylation_nominal
the phosphorylation of
the adapter protein SHC
by the Src-related kinase Lyn]
[phosphorylation_nominal
phosphorylation of Shc by
the hematopoietic cell-specific
tyrosine kinase Syk]
[dephosphorylation_nominal
Dephosphorylation of
Syk and Btk
mediated by
SHP-1]
[expression_activation_passive
[expression IL-13 expression]
induced by
IL-2 + IL-18]
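To give a feel for how directed relations fall out of such patterns, here is a much simplified sketch that uses plain regular expressions instead of a CASS grammar. The NAMES list is a hypothetical mini-dictionary, and only two patterns (nominal and active phosphorylation) are covered; a real grammar handles far more variation.

```python
import re

NAMES = ["Shc", "Syk", "Lyn", "Cdc28", "Swe1"]  # hypothetical mini-dictionary
NAME = "(" + "|".join(re.escape(name) for name in NAMES) + ")"
FILLER = r"(?:[\w-]+ )*?"  # lazily skips intervening words

# "phosphorylation of TARGET by ... AGENT"
NOMINAL = re.compile(
    rf"\bphosphorylation of {FILLER}{NAME} by {FILLER}{NAME}\b", re.IGNORECASE)
# "AGENT ... phosphorylated ... TARGET"
ACTIVE = re.compile(
    rf"\b{NAME} {FILLER}phosphorylated {FILLER}{NAME}\b", re.IGNORECASE)

def extract_phosphorylations(sentence):
    """Return a list of directed (agent, target) pairs."""
    relations = []
    for match in NOMINAL.finditer(sentence):
        relations.append((match.group(2), match.group(1)))
    for match in ACTIVE.finditer(sentence):
        relations.append((match.group(1), match.group(2)))
    return relations

print(extract_phosphorylations(
    "phosphorylation of Shc by the hematopoietic cell-specific "
    "tyrosine kinase Syk"))
```

The word boundary before "phosphorylation" keeps the nominal pattern from firing on "Dephosphorylation", which would otherwise be extracted with the wrong relation type.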
Mining text for nuggets
• New relations can be inferred from published ones
This can lead to actual discoveries if no person knows
all the facts required for making the inference
Combining facts from disconnected literatures
• Swanson’s pioneering work
Fish oil and Raynaud's disease
Magnesium and migraine
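The inference pattern behind this kind of discovery can be sketched as follows: if A co-occurs with B and B co-occurs with C, but A and C never co-occur, then A and C are a candidate hidden link. The function name and the co-occurrence table are toy assumptions for illustration, not extracted data.

```python
def indirect_links(cooccurs, start):
    """Find terms linked to `start` only through intermediate terms.

    cooccurs: dict of term -> set of terms it co-occurs with.
    Returns a dict of candidate term -> set of linking intermediate terms.
    """
    direct = cooccurs.get(start, set())
    candidates = {}
    for intermediate in direct:
        for term in cooccurs.get(intermediate, set()):
            if term != start and term not in direct:
                candidates.setdefault(term, set()).add(intermediate)
    return candidates

# Toy co-occurrence table (illustrative only)
toy = {
    "fish oil": {"blood viscosity", "platelet aggregation"},
    "blood viscosity": {"fish oil", "Raynaud's disease"},
    "platelet aggregation": {"fish oil", "Raynaud's disease"},
    "Raynaud's disease": {"blood viscosity", "platelet aggregation"},
}
print(indirect_links(toy, "fish oil"))
```

Candidates supported by several independent intermediate terms are natural ones to inspect first.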
Trends
• Most similar to existing data mining approaches
Although all the detailed data is in the text, people may
have missed the big picture
• Temporal trends
Historical summaries
Forecasting
• Correlations
“Customers who bought this item also bought …”
Buzzwords
[Figure; axis label: Time]
Correlations
• “Customers who bought
this item also bought …”
• Protein networks
“Proteins that regulate
expression …”
“Proteins that control
phosphorylation …”
“Proteins that are
phosphorylated …”
• Co-author networks
Transcriptional networks
[Figure: overlap between the protein sets “Regulates” and “Regulated”; counts shown: 3592, 79, 32, and 83]
P < 9 × 10⁻⁹
Signaling pathways
[Figure: overlap between the protein sets “Phosphorylates” and “Phosphorylated”; counts shown: 3704, 27, 11, and 44]
P < 2 × 10⁻⁷
Integration
• Automatic annotation of high-throughput data
Loads of fairly trivial methods
• Protein interaction networks
Can unify many types of interactions
Powerful as exploratory visualization tools
• More creative strategies
Identification of candidate genes for genetic diseases
Linking genes to traits based on species distributions
RCCs
Disease candidate genes
• Rank the genes within a chromosomal region to
which a disease has been mapped
• Methods
G2D
• Gene → Function → Chemical → Phenotype → Disease
• Uses MEDLINE but not the text
BITOLA
• Gene → Words → Disease (similar to ARROWSMITH)
Hide and co-workers
• Gene → Tissue → Disease
G2D
Genotype–phenotype
• Genes can be linked to traits by comparing the
species distributions of both
Mainly works for prokaryotes
Traits are represented by keywords
• Finding the species profiles
Gene profiles are found by sequence similarity
Keyword profiles are based on co-occurrence with the
species name in MEDLINE
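Comparing species distributions can be sketched as a set comparison. The Jaccard index used here is an assumption, since the slides do not name a similarity measure, and the gene and species labels are placeholders.

```python
def jaccard(a, b):
    """Jaccard index of two sets of species."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_genes(keyword_species, gene_species):
    """Rank genes by how well their species profile matches the keyword's.

    keyword_species: set of species co-occurring with the trait keyword.
    gene_species: dict of gene -> set of species with a homolog of the gene.
    """
    scores = {gene: jaccard(keyword_species, species)
              for gene, species in gene_species.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Placeholder profiles (not real data)
trait = {"sp1", "sp2", "sp3"}
genes = {
    "geneA": {"sp1", "sp2", "sp3"},
    "geneB": {"sp1", "sp4", "sp5"},
    "geneC": {"sp1", "sp2", "sp3", "sp4", "sp5"},
}
print(rank_genes(trait, genes))
```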
Annotation
• Many experiments result in groups of related genes
ER is used to find the associated abstracts
The frequency of each word is counted in the abstracts
Background frequencies of all words are pre-calculated
A statistical test is used to rank the words
• The same strategy can be applied to find MeSH
terms associated with a gene cluster
• Most people prefer using GO annotation instead
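The ranking step could use a hypergeometric test, sketched below. This is an assumed choice, since the slides only say "a statistical test". For simplicity, the sketch counts the number of abstracts that mention each word rather than total word occurrences, and the counts in the usage example are invented.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing a word in at least k of n abstracts,

    when K of the N background abstracts contain the word.
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return favorable / comb(N, n)

def rank_words(cluster_counts, background_counts, n_cluster, n_background):
    """Rank words by over-representation in the gene cluster's abstracts."""
    scored = [(word, hypergeom_pvalue(k, n_cluster,
                                      background_counts[word], n_background))
              for word, k in cluster_counts.items()]
    return sorted(scored, key=lambda item: item[1])

# Invented counts: 10 cluster abstracts out of 1000 background abstracts
ranked = rank_words({"kinase": 8, "the": 10},
                    {"kinase": 50, "the": 1000}, 10, 1000)
print(ranked[0][0])
```

The same function applies unchanged to MeSH terms if the counts are taken over term assignments instead of words.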
Outlook
• Literature mining will not be made obsolete by
<insert your favorite new technology here>
Repositories are always made too late
There will always be new types of relations
Semantically tagged XML may replace ER (hopefully!)
Semantically tagged XML will never tag everything
• Specific IE problems will become obsolete
Protein function
Physical protein interactions
Permission denied
• Open access
Literature mining methods cannot retrieve, extract, or
correlate information from text unless it is accessible
Restricted access is already the primary problem
• Standard formats
Getting the text out of a PDF file is not trivial
Many journals now store papers in XML format
• Where do I get all the patent text?!
Innovation
• The basic tools are now in
place for IR, ER, and IE
Development was driven by
computational linguists
• Text- and data-mining
Biologists are needed
Collaboration with linguists
• Lack of innovation
Very few new ideas
Text should be combined
with other data
Acknowledgments
• EML Research
Jasmin Saric
Isabel Rojas
• EMBL Heidelberg
Peer Bork
Miguel Andrade
Michael Kuhn
Rossitza Ouzounova
Jan Korbel
Tobias Doerks
Exercises
Lars Juhl Jensen
EMBL
Entity recognition
• iHOP
http://www.pdg.cnb.uam.es/UniPub/iHOP/
• Ideas
Compare iHOP vs. PubMed for finding papers related
to a particular gene
Use iHOP to construct a small literature-based network
Information extraction
• Relation extraction
iProLINK (http://pir.georgetown.edu/iprolink/)
PreBIND (http://prebind.bind.ca)
PubGene (http://www.pubgene.org)
• Ideas
Check how complex a sentence iProLINK can handle
Check how well PreBIND can discriminate between
physical and other interactions (other interactions can
be found with PubGene, ProLinks, or STRING)
Text mining
• ARROWSMITH
http://arrowsmith.psych.uic.edu
• Ideas
Fish oil and Raynaud's disease
Magnesium and migraine
Arginine and somatomedin C
Estrogen and Alzheimer's disease
Integration 1
• Protein networks
STRING (http://string.embl.de)
ProLinks (http://dip.doe-mbi.ucla.edu/pronav/)
• Ideas
Use both tools to find functions for proteins of known
and unknown function
Use STRING to construct a network for a set of proteins
Try to reproduce the Ssn3–Msn2–Hsp104 link
Integration 2
• Finding candidate disease genes
G2D (http://www.ogic.ca/projects/g2d_2/)
BITOLA (http://www.mf.uni-lj.si/bitola/)
• Ideas
Take a look at the G2D results for some diseases where
you know which types of genes would be sensible to
suggest
Compare the results with BITOLA (if you have the
patience to figure out their interface!)