Literature Mining and
Systems Biology
Lars Juhl Jensen
EMBL
Why?
Overview
• Information retrieval: finding the papers
• Entity recognition: identifying the substance(s)
• Information extraction: formalizing the facts
• Text mining: finding nuggets in the literature
• Integration: combining text and biological data
Status
• IR, ER, and simple IE
methods are fairly well
established
• Advanced NLP-based IE
systems are rapidly being
improved
• Methods for text mining
and text/data integration
are still in their infancy
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Information retrieval
• Ad hoc information retrieval
 The user enters a query/a set of keywords
 The system attempts to retrieve the relevant texts from
a large text corpus (typically Medline)
• Text categorization
 A training set of texts is created in which texts are
manually assigned to classes (often only yes/no)
 A machine learning method is trained to classify texts
 This method can subsequently be used to classify a
much larger text corpus
Ad hoc IR
• These systems are very useful since the user can
provide any query
 The query is typically Boolean (yeast AND cell cycle)
 A few systems instead allow the relative weight of each
search term to be specified by the user
• The art is to find the relevant papers even if they
do not actually match the query
 Ideally our example sentence should be extracted by the
query yeast cell cycle although none of these words are
mentioned
Automatic query expansion
• In a typical query, the user will not have provided
all relevant words and variants thereof
• By automatically expanding queries with
additional search terms, recall can be improved
 Stemming removes common endings (yeast / yeasts)
 Thesauri can be used to expand queries with synonyms
and/or abbreviations (yeast / S. cerevisiae)
 The next logical step is to use ontologies to make
complex inferences (yeast cell cycle / Cdc28)
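The first two expansion steps can be sketched as follows. This is a minimal illustration only: the suffix-stripping rule is a crude stand-in for a real stemmer, and `THESAURUS` is a hypothetical toy dictionary rather than a curated resource.

```python
# Hypothetical toy thesaurus; a real system would use curated synonym lists.
THESAURUS = {
    "yeast": ["s. cerevisiae", "saccharomyces cerevisiae"],
}

def stem(word):
    """Strip a plural 's' (a crude stand-in for a real stemming algorithm)."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def expand_query(terms):
    """Return the set of search terms after stemming and synonym expansion."""
    expanded = set()
    for term in terms:
        base = stem(term.lower())
        expanded.add(base)
        expanded.update(THESAURUS.get(base, []))
    return expanded

print(sorted(expand_query(["yeasts"])))
```

Ontology-based inference (yeast cell cycle / Cdc28) would need a structured resource and is not covered by this sketch.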
Document similarity
• The similarity of two documents can be defined
based on their word content
 Each document can be represented by a word vector
 Words should be weighted based on their frequency and
background frequency
 The most commonly used scheme is tf*idf weighting
• Document similarity can be used in ad hoc IR
 Rather than matching the query against each document
only, the N most similar documents are also considered
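A word-vector comparison of this kind might look like the sketch below. It assumes one common tf*idf variant (relative term frequency times the log of inverse document frequency) and cosine similarity as the comparison measure; the three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {word: tf*idf weight} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "cdc28 phosphorylates swe1 in yeast".split(),
    "swe1 is phosphorylated by cdc28".split(),
    "fish oil lowers blood viscosity".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two documents share the rare words cdc28 and swe1 and therefore score higher than the unrelated pair, which shares no words at all.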
Document clustering
• Unsupervised clustering algorithms can be applied
to a document similarity matrix
 All pairwise document similarities are calculated
 Clusters of “similar documents” can be constructed
using one of numerous standard clustering methods
• Practical uses of document clustering
 The “related documents” function in PubMed
 Logical organization of the documents found by IR
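As one example of a standard method, the sketch below applies single-linkage clustering with a fixed similarity threshold to a precomputed similarity matrix. The choice of single linkage, the threshold value, and the toy matrix are all assumptions made for illustration.

```python
def single_link_clusters(sim, threshold):
    """Group documents connected by chains of similarities >= threshold."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Toy symmetric similarity matrix for four documents.
sim = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
print(single_link_clusters(sim, threshold=0.5))
```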
Text categorization
• These systems are a lot less flexible than ad hoc
systems but can attain better accuracy
 Works on a pre-defined set of document classes
 Each class is defined by manually assigning a number
of documents to it
• Method
 Rules may be manually crafted based on a very small
set of manually classified documents
 Statistical machine learning methods can be trained on
a large number of classified documents
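The statistical route can be illustrated with a small classifier trained on manually labeled texts. The sketch uses multinomial naive Bayes with Laplace smoothing purely because it fits in a few lines; it is a swapped-in technique, not the method these slides recommend, and the four training texts are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the fitted model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the most probable class label for a tokenized text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in tokens:
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace-smoothed word likelihood
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set with two classes.
training = [
    ("cdc28 cyclin mitosis yeast".split(), "cell cycle"),
    ("swe1 cdc5 mitosis checkpoint".split(), "cell cycle"),
    ("fish oil blood viscosity".split(), "other"),
    ("magnesium migraine patients trial".split(), "other"),
]
model = train_naive_bayes(training)
print(classify(model, "clb2 cyclin cdc28".split()))
```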
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Hints in the text
 Strong: Cdc28 and Swe1 (“cell cycle” and “yeast”)
 Weaker: mitotic cyclin, Clb2, and Cdk1 (“cell cycle”)
Machine learning
• Input features
 Word content or bi-/tri-grams
 Part-of-speech tags
 Filtering (stop words, part-of-speech)
 Singular value decomposition
• Training
 Support vector machines are best suited
 Choice of kernel function
 Separate training and evaluation sets, cross validation
Entity recognition
• An important but boring problem
 The genes/proteins/drugs mentioned within a given text
must be identified
• Recognition vs. identification
 Recognition: find the words that are names of entities
 Identification: figure out which entities they refer to
 Recognition without identification is of limited use
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Entities identified
 S. cerevisiae proteins: Clb2 (YPR119W), Cdc28
(YBR160W), Swe1 (YJL187C), and Cdc5 (YMR001C)
Recognition
• Features
 Morphological: mixes letters and digits or ends in -ase
 Context: followed by “protein” or “gene”
 Grammar: should occur as a noun
• Methodologies
 Manually crafted rule-based systems
 Machine learning (SVMs)
• But what can it be used for?
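The morphological and context features listed above could be computed per token roughly as follows. The feature names are made up for this sketch, and the grammar feature is left out because it would require a part-of-speech tagger.

```python
import re

def token_features(tokens, index):
    """Return simple recognition features for tokens[index]."""
    token = tokens[index]
    next_token = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    return {
        # Morphological: contains both a letter and a digit
        "mixes_letters_and_digits": bool(
            re.search(r"[A-Za-z]", token) and re.search(r"\d", token)
        ),
        # Morphological: enzyme-like ending
        "ends_in_ase": token.lower().endswith("ase"),
        # Context: the next word is a cue word
        "followed_by_cue": next_token in {"protein", "gene"},
    }

tokens = "The Cdc28 protein is a kinase".split()
print(token_features(tokens, 1))
print(token_features(tokens, 5))
```

Such feature dictionaries could feed either hand-written rules or a machine learning method.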
Identification
• A good synonyms list is the key
 Combine many sources
 Curate to eliminate stop words
• Flexible matching to handle orthographic variation
 Case variation: CDC28, Cdc28, and cdc28
 Prefixes: myc and c-myc
 Postfixes: Cdc28 and Cdc28p
 Spaces and hyphens: cdc28 and cdc-28
 Latin vs. Greek letters: TNF-alpha and TNFA
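One way to handle these variations is to reduce every name to a normalized lookup key and to try a few variant keys for prefixes and postfixes. The sketch below does this under simplifying assumptions: the Greek-letter table is incomplete, the substring replacement is crude, and the identifiers ending in `_PLACEHOLDER_ID` are invented (only YBR160W for Cdc28 comes from the example slide).

```python
import re

# Spelled-out Greek letters mapped to the Latin letters used in gene symbols.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}

def normalize(name):
    """Collapse case, spaces, hyphens, and spelled-out Greek letters."""
    key = name.lower()
    key = re.sub(r"[\s\-]+", "", key)
    for word, letter in GREEK.items():
        key = key.replace(word, letter)
    return key

def candidate_keys(mention):
    """Yield lookup keys for a mention, most specific first."""
    lowered = mention.lower()
    keys = [normalize(lowered)]
    if lowered.startswith("c-"):
        keys.append(normalize(lowered[2:]))      # prefix: c-myc -> myc
    if keys[0].endswith("p") and len(keys[0]) > 3:
        keys.append(keys[0][:-1])                # postfix: cdc28p -> cdc28
    return keys

def build_index(synonyms):
    """synonyms: {identifier: [names]} -> {normalized name: {identifiers}}."""
    index = {}
    for identifier, names in synonyms.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(identifier)
    return index

def identify(index, mention):
    """Return the identifiers matching a mention, or an empty set."""
    for key in candidate_keys(mention):
        if key in index:
            return index[key]
    return set()

synonyms = {
    "YBR160W": ["Cdc28"],
    "TNF_PLACEHOLDER_ID": ["TNFA"],   # invented identifier
    "MYC_PLACEHOLDER_ID": ["myc"],    # invented identifier
}
index = build_index(synonyms)
for mention in ["CDC28", "cdc-28", "Cdc28p", "c-myc", "TNF-alpha"]:
    print(mention, identify(index, mention))
```

A name that maps to more than one identifier comes back as a set with several members, which is where disambiguation would have to take over.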
Disambiguation
• The same word may mean many different things
 Entity names may also be common English words
(hairy) or technical terms (SDS)
 Protein names may refer to related or unrelated proteins
in other species (cdc2)
• The meaning can be resolved from the context
 ER can distinguish between names and common words
 Disambiguating non-unique names is a hard problem
 Ambiguity between orthologs can safely be ignored
Co-occurrence extraction
• Relations are extracted for co-occurring entities
 Relations are always symmetric
 The type of relation is not given
• Scoring the relations
 More co-occurrences → more significant
 Ubiquitous entities → less significant
 Same sentence vs. same paragraph
• Simple, good recall, poor precision
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Relations
 Correct: Clb2–Cdc28, Clb2–Swe1, Cdc28–Swe1, and
Cdc5–Swe1
 Wrong: Clb2–Cdc5 and Cdc28–Cdc5
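A sentence-level version of co-occurrence extraction can be sketched as below, using the four entities identified in the example sentence. The scoring formula, which divides the pair count by the product of the individual entity counts, is only one assumed way to down-weight ubiquitous entities.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """sentences: one list of identified entity names per sentence."""
    pair_counts = Counter()
    entity_counts = Counter()
    for entities in sentences:
        unique = sorted(set(entities))  # count each entity once per sentence
        entity_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # unordered pairs
    return pair_counts, entity_counts

def score(pair, pair_counts, entity_counts):
    """Pair count, down-weighted for entities that occur everywhere."""
    a, b = pair
    return pair_counts[pair] / (entity_counts[a] * entity_counts[b])

# Entities identified in the example sentence (Swe1 is mentioned twice).
sentences = [["Clb2", "Cdc28", "Swe1", "Cdc5", "Swe1"]]
pairs, entities = cooccurrence_counts(sentences)
print(sorted(pairs))
```

All six unordered pairs are returned, including the two listed as wrong above, which illustrates the good recall and poor precision of the approach.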
Categorization of relations
• Extracting specific types of relations
 Text categorization methods can be used to identify
sentences that mention a certain type of relation
 Filtering can be done before or after relation extraction
• Well suited for database curation
 Text categorization can be reused
 High recall is most important
 Curators can compensate for the lack of precision
Relation extraction by NLP
• Information is extracted based on parsing and
interpreting phrases or full sentences
 Good at extracting specific types of relations
 Handles directed relations
• Complex, good precision, poor recall
Example
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1 and this
modification served as a priming step to promote
subsequent Cdc5-dependent Swe1
hyperphosphorylation and degradation
Relations:
 Complex: Clb2–Cdc28
 Phosphorylation: Clb2→Swe1, Cdc28→Swe1, and
Cdc5→Swe1
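The precision/recall trade-off can be seen even with a single hand-written pattern. The sketch below is a plain regular expression, not a parser and not the CASS grammar described in these slides; the name pattern and the parenthesis stripping are assumptions chosen to fit the example sentence.

```python
import re

# Assumed shape of a protein name: capital letter, lowercase letters, digits.
NAME = r"[A-Z][a-z]+\d+"
PATTERN = re.compile(rf"({NAME})\s+(?:directly\s+)?phosphorylated\s+({NAME})")

def extract_phosphorylations(sentence):
    """Return (kinase, substrate) pairs matched by one active-voice pattern."""
    flattened = re.sub(r"\s*\([^)]*\)", "", sentence)  # drop parentheticals
    return PATTERN.findall(flattened)

sentence = (
    "Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly "
    "phosphorylated Swe1 and this modification served as a priming step "
    "to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and "
    "degradation"
)
print(extract_phosphorylations(sentence))
```

The pattern finds the directed Cdc28→Swe1 relation but misses the Clb2 and Cdc5 relations, which only a deeper analysis of the sentence could recover.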
An NLP architecture
• Tokenization
 Entity recognition with synonyms list
 Word boundaries (multi words)
 Sentence boundaries (abbreviations)
• Part-of-speech tagging
 TreeTagger trained on GENIA
• Semantic labeling
 Dictionary of regular expressions
• Entity and relation chunking
 Rule-based system implemented in CASS
Semantic labeling
 Gene and protein names
 Cue words for entity recognition
 Cue words for relation extraction
Named entity chunking
 A CASS grammar recognizes
noun chunks related to gene
expression:
[nxgene The GAL4 gene]
Relation chunking
 Our CASS grammar also extracts
relations between entities:
[nxexpr The expression of
[nxgene the cytochrome genes
[nxpg CYC1 and CYC7]]]
is controlled by
[nxpg HAP1]
[phosphorylation_active
Lyn, [negation but not Jak2]
phosphorylated
CrkL]
[expression_repression_active
Btk
regulates
the IL-2 gene]
[phosphorylation_active
Lyn
also participates in
[phosphorylation the tyrosine phosphorylation
and activation of syk]]
[expression_repression_active
IL-10
also decreased
[expression mRNA expression of
IL-2 and IL18 cytokine receptors]]
[phosphorylation_nominal
the phosphorylation of
the adapter protein SHC
by the Src-related kinase Lyn]
[phosphorylation_nominal
phosphorylation of Shc by
the hematopoietic cell-specific
tyrosine kinase Syk]
[dephosphorylation_nominal
Dephosphorylation of
Syk and Btk
mediated by
SHP-1]
[expression_activation_passive
[expression IL-13 expression]
induced by
IL-2 + IL-18]
Mining text for nuggets
• New relations can be inferred from published ones
 This can lead to actual discoveries if no person knows
all the facts required for making the inference
 Combining facts from disconnected literatures
• Swanson’s pioneering work
 Fish oil and Raynaud's disease
 Magnesium and migraine
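The inference pattern behind this kind of discovery is often described as A–B plus B–C implies a candidate A–C link. The sketch below assumes relations have already been extracted from two separate literatures; the function name and the intermediate terms are illustrative toy data, not extracted results.

```python
def propose_links(a_literature, c_literature, known_links=frozenset()):
    """Propose A-C links supported by shared intermediate B terms.

    a_literature, c_literature: {term: set of related B terms}
    known_links: (a, c) pairs already stated directly in the literature
    """
    proposals = {}
    for a_term, a_neighbors in a_literature.items():
        for c_term, c_neighbors in c_literature.items():
            shared = a_neighbors & c_neighbors
            if shared and (a_term, c_term) not in known_links:
                proposals[(a_term, c_term)] = sorted(shared)
    return proposals

# Illustrative toy relations from two literatures that do not cite each other.
nutrition = {"fish oil": {"blood viscosity", "platelet aggregation"}}
clinical = {
    "Raynaud's disease": {"blood viscosity", "platelet aggregation"},
    "migraine": {"spreading depression"},
}
print(propose_links(nutrition, clinical))
```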
Trends
• Most similar to existing data mining approaches
 Although all the detailed data is in the text, people may
have missed the big picture
• Temporal trends
 Historical summaries
 Forecasting
• Correlations
 “Customers who bought this item also bought …”
Buzzwords
[Figure: plot with a “Time” axis]
Correlations
• “Customers who bought
this item also bought …”
• Protein networks
 “Proteins that regulate
expression …”
 “Proteins that control
phosphorylation …”
 “Proteins that are
phosphorylated …”
• Co-author networks
Transcriptional networks
[Figure: diagram with the labels “Regulates” and “Regulated” and the counts 3592, 79, 32, and 83]
P < 9 × 10⁻⁹
Signaling pathways
[Figure: diagram with the labels “Phosphorylates” and “Phosphorylated” and the counts 3704, 27, 11, and 44]
P < 2 × 10⁻⁷
Integration
• Automatic annotation of high-throughput data
 Loads of fairly trivial methods
• Protein interaction networks
 Can unify many types of interactions
 Powerful as exploratory visualization tools
• More creative strategies
 Identification of candidate genes for genetic diseases
 Linking genes to traits based on species distributions
RCCs
Disease candidate genes
• Rank the genes within a chromosomal region to
which a disease has been mapped
• Methods
 G2D
• Gene → Function → Chemical → Phenotype → Disease
• Uses MEDLINE but not the text
 BITOLA
• Gene → Words → Disease (similar to ARROWSMITH)
 Hide and co-workers
• Gene → Tissue → Disease
G2D
Genotype–phenotype
• Genes can be linked to traits by comparing the
species distributions of both
 Mainly works for prokaryotes
 Traits are represented by keywords
• Finding the species profiles
 Gene profiles are found by sequence similarity
 Keyword profiles are based on co-occurrence with the
species name in MEDLINE
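Comparing the two kinds of species profiles could be as simple as the sketch below. The Jaccard index is an assumed similarity measure chosen for brevity, and the species labels and keywords are invented placeholders.

```python
def jaccard(a, b):
    """Overlap between two species sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_keywords(gene_profile, keyword_profiles):
    """Rank keywords by how well their species profile matches the gene's."""
    scored = [
        (jaccard(gene_profile, species), keyword)
        for keyword, species in keyword_profiles.items()
    ]
    return sorted(scored, reverse=True)

# Placeholder species in which homologs of the gene were found.
gene_profile = {"species_A", "species_B", "species_C"}
# Placeholder species that co-occur with each keyword in abstracts.
keyword_profiles = {
    "motility": {"species_A", "species_B", "species_C", "species_D"},
    "thermophilic": {"species_D", "species_E"},
}
print(rank_keywords(gene_profile, keyword_profiles))
```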
Annotation
• Many experiments result in groups of related genes
 ER is used to find the associated abstracts
 The frequency of each word is counted in the abstracts
 Background frequencies of all words are pre-calculated
 A statistical test is used to rank the words
• The same strategy can be applied to find MeSH
terms associated with a gene cluster
• Most people prefer using GO annotation instead
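The word-ranking steps can be sketched with a hypergeometric test. Two assumptions are made here: words are counted once per abstract (document frequency), and the abstracts for the gene group are treated as a sample drawn from the background collection. The background counts and the five toy abstracts are invented.

```python
import math
from collections import Counter

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n of N abstracts are sampled and K of N contain the word."""
    total = math.comb(N, n)
    upper = min(n, K)
    return sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1)
    ) / total

def rank_words(group_abstracts, background_df, n_background):
    """Rank words by over-representation in the abstracts of a gene group."""
    n = len(group_abstracts)
    group_df = Counter(w for abstract in group_abstracts for w in set(abstract))
    ranked = [
        (hypergeom_tail(k, n, background_df[w], n_background), w)
        for w, k in group_df.items()
    ]
    return sorted(ranked)  # smallest p-value first

# Invented background: number of abstracts (out of 1000) containing each word.
background_df = {"kinase": 50, "cell": 400, "the": 990}
group_abstracts = [
    ["the", "kinase", "cell"],
    ["the", "kinase", "cell"],
    ["the", "kinase"],
    ["the", "kinase"],
    ["the"],
]
for p_value, word in rank_words(group_abstracts, background_df, 1000):
    print(word, p_value)
```

The rare word that appears in most of the group's abstracts ranks first, while a word found in nearly every abstract ranks last.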
Outlook
• Literature mining will not be made obsolete by
<insert your favorite new technology here>
 Repositories are always made too late
 There will always be new types of relations
 Semantically tagged XML may replace ER (hopefully!)
 Semantically tagged XML will never tag everything
• Specific IE problems will become obsolete
 Protein function
 Physical protein interactions
Permission denied
• Open access
 Literature mining methods cannot retrieve, extract, or
correlate information from text unless it is accessible
 Restricted access is already the primary problem
• Standard formats
 Getting the text out of a PDF file is not trivial
 Many journals now store papers in XML format
• Where do I get all the patent text?!
Innovation
• The basic tools are now in
place for IR, ER, and IE
 Development was driven by
computational linguists
• Text- and data-mining
 Biologists are needed
 Collaboration with linguists
• Lack of innovation
 Very few new ideas
 Text should be combined
with other data
Acknowledgments
• EML Research
 Jasmin Saric
 Isabel Rojas
• EMBL Heidelberg
 Peer Bork
 Miguel Andrade
 Michael Kuhn
 Rossitza Ouzounova
 Jan Korbel
 Tobias Doerks
Exercises
Lars Juhl Jensen
EMBL
Entity recognition
• iHOP
 http://www.pdg.cnb.uam.es/UniPub/iHOP/
• Ideas
 Compare iHOP vs. PubMed for finding papers related
to a particular gene
 Use iHOP to construct a small literature-based network
Information extraction
• Relation extraction
 iProLINK (http://pir.georgetown.edu/iprolink/)
 PreBIND (http://prebind.bind.ca)
 PubGene (http://www.pubgene.org)
• Ideas
 Check how complex a sentence iProLINK can handle
 Check how well PreBIND can discriminate between
physical and other interactions (other interactions can
be found with PubGene, ProLinks, or STRING)
Text mining
• ARROWSMITH
 http://arrowsmith.psych.uic.edu
• Ideas
 Fish oil and Raynaud's disease
 Magnesium and migraine
 Arginine and somatomedin C
 Estrogen and Alzheimer's disease
Integration 1
• Protein networks
 STRING (http://string.embl.de)
 ProLinks (http://dip.doe-mbi.ucla.edu/pronav/)
• Ideas
 Use both tools to find functions for proteins of known
and unknown function
 Use STRING to construct a network for a set of proteins
 Try to reproduce the Ssn3–Msn2–Hsp104 link
Integration 2
• Finding candidate disease genes
 G2D (http://www.ogic.ca/projects/g2d_2/)
 BITOLA (http://www.mf.uni-lj.si/bitola/)
• Ideas
 Take a look at the G2D results for some diseases where
you know which types of genes would be sensible to
suggest
 Compare the results with BITOLA (if you have the
patience to figure out their interface!)