Introduction to Text Mining

Introduction to Text Mining and
Natural Language Processing
BIF-30806
January 2010
Judith Risse
Outline


Literature and Databases
Natural Language Processing






Information Retrieval
Question Answering
Information Extraction
Indexing
Document Classification
Exercises
2
Definitions

Natural Language Processing (NLP)


the study of automated generation and understanding
of natural human languages (Wikipedia)
Text Mining

extract high quality (previously unknown) information
from large amounts of unstructured text
3
Biomedical Literature




communication of scientific discoveries
peer-reviewed and community reviewed
provides additional information on experimental
results
basis for annotation of biological databases
4
Literature Databases



NCBI Bookshelf
PubMed Central
PubMed







currently 19,476,540 citations (Jan 27, 2010)
5414 journals in Medline
unique identifier PMID
entries contain author, journal and title info
more than 50% also include abstracts
links to full-text articles
Medical Subject Headings (MeSH)
5
PubMed growth
[Figure: PubMed growth, 1950 to 2007. Y-axis: number of publications in millions (0 to 21). Series: entries per year; total number of entries.]
6
PubMed (3)
© NLM 2008
7
A scientific article

journal-specific format: sections, print style
type of article: review, letter
document format: html, pdf
8
Article content

Full-text







title
authors
abstract
body
Tables
Figures
References
9
Biomedical Language

domain specific terminology


cytosolic, erythroid precursor
polysemic words

e.g. Drosophila gene names: coitus interruptus, lost in
space

acronyms



APC (activated protein C), mdh (malate dehydrogenase)
low frequency words
anaphora (references)

Overexpression of FumRs and Frds1 resulted in the best
citrate-producing strain in the presence of trace
manganese concentrations. This strain gave a maximum
yield of ….
10
Biomedical Language (2)


synonyms/creating new terms
typographical variants










malic dehydrogenase
L-malate dehydrogenase
NAD-L-malate dehydrogenase
malic acid dehydrogenase
NAD-dependent malic dehydrogenase
NAD-malate dehydrogenase
NAD-malic dehydrogenase
malate (NAD) dehydrogenase
MDH
L-malate-NAD+ oxidoreductase
11
Natural Language Processing


create computational models of language
multi-disciplinary


statistical properties of language


information technology, linguistics, artificial
intelligence, statistics ….
machine learning, rule-based, regular expressions
grammatical, morphological, syntactic and
semantic features
12
Grammatical Features

Grammar: rules governing a language; syntax and morphology
Part of speech (POS): noun, verb, adjective, adverb, preposition; depends on context in sentence
Brill tagger (Eric Brill, PhD thesis, 1993)
http://www.cst.dk/online/pos_tagger/uk/index.html
http://en.wikipedia.org/wiki/Brill_Tagger
13
Morphological Features


structure of words
inflection: enzyme and enzymes (plural form); catalyse, catalyses, catalysing (verb inflection)
word-formation: earth, earthworm (compounding); dependent, independent (derivation)
stemming and lemmatisation: reduction of words to common base form
am, are, is → be
catalyse, catalyses, catalysing → catalys
Porter Stemmer (tartarus.org/martin/PorterStemmer)
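The idea of stemming can be illustrated with a deliberately crude suffix-stripping sketch. This is not the Porter algorithm: the suffix list and the minimum stem length of 3 are assumptions chosen only so that the slide's example words collapse to one stem.

```python
# Toy suffix stripper: NOT the Porter stemmer, only an illustration.
SUFFIXES = ("ing", "es", "e", "s")  # assumed list, checked in this order

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["catalyse", "catalyses", "catalysing", "enzymes"]:
    print(w, "->", crude_stem(w))
```

Irregular forms such as am, are, is → be cannot be reached by stripping suffixes; they need a dictionary lookup, which is what lemmatisation adds.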
14
Syntactic Features

relationships between words in a sentence


noun-phrase, verb-phrase
subject – object relationships
15
POS Tagged Sentence

(NNP Pain) (VBD vanished) (IN for) (IN at) (JJS least) (CD
three) (NNS months) (IN in) (NNS rats) (WP who) (VBD were)
(VBN injected) (IN in) (DT the) (NN spine) (IN with) (DT a)
(NN gene) (IN that) (NNS triggers) (VBZ endorphins) (. .)
Pain - Proper singular noun
vanished - Verb, past tense
for - Preposition
at - Preposition
least - Superlative adjective
three - Cardinal number
months - Plural noun
in - Preposition
rats - Plural noun
who - wh-pronoun
were - Verb, past tense
injected - Verb, past participle
in - Preposition
the - Determiner
spine - Singular noun
with - Preposition
a - Determiner
gene - Singular noun
that - Preposition
triggers - Plural noun
endorphins - Verb, 3rd ps. sing. present
. - Final punctuation
16
Semantic Features


meaning of words given the context
dictionaries, thesauri

Gene Ontology
17
Contextual Analysis

Guilt by association


Co-occurrence analysis
Word frequency


bag of words
statistical analysis of word frequency
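A minimal sketch of the two ideas on this slide, assuming whitespace tokenization and using made-up sentences: a bag of words counts terms while ignoring their order, and co-occurrence counting records which terms appear together in the same sentence. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def bag_of_words(text: str) -> Counter:
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())

def cooccurrence(sentences: list[str]) -> Counter:
    """Count how often two different words share a sentence."""
    pairs = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        pairs.update(combinations(words, 2))  # each unordered pair once per sentence
    return pairs

# Made-up sentences, for illustration only.
sentences = ["mdh binds nad", "mdh oxidises malate", "nad binds mdh"]
print(bag_of_words(" ".join(sentences)).most_common(2))
print(cooccurrence(sentences)[("mdh", "nad")])
```

Pairs that co-occur more often than their individual frequencies would suggest are the candidates for "guilt by association".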
18
Exercise 1


take a gene/protein name of your interest
query PubMed and retrieve 1 abstract
take a look at what the Porter stemmer does
using the abstract
describe what problems might occur from
stemming
Porter Stemmer: http://maya.cs.depaul.edu/~classes/ds575/porter.html
19
Coffee Break
Tasks of NLP



Information Extraction (IE)
Question Answering (QA)
Information Retrieval (IR)




machine translation
text proofing
speech recognition
optical character recognition (OCR)
21
Information Retrieval

Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large
collections (usually stored on computers).
(Introduction to IR, Cambridge University Press, 2008)

Indexing






Tokenization
Case Folding (TNFalpha, Tnfalpha → tnfalpha)
Stemming
Stop-word removal (e.g. at, be, from, this …)
Boolean Queries
Vector Space Model queries
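The indexing steps listed above can be sketched as a small preprocessing function. This is only an illustration: the regular expression used for tokenization is an assumption, the stop-word set contains just the slide's four examples, and stemming is left out to keep the sketch short.

```python
import re

# The slide's example stop words; real stop-word lists are much longer.
STOP_WORDS = {"at", "be", "from", "this"}

def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[A-Za-z0-9]+", text)          # tokenization
    tokens = [t.lower() for t in tokens]                # case folding
    return [t for t in tokens if t not in STOP_WORDS]   # stop-word removal

print(preprocess("TNFalpha levels differ from Tnfalpha at this stage"))
```

After case folding, TNFalpha and Tnfalpha become the same index term, which is the point of the slide's example.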
22
Zipf’s Law
• A small number of words occur very often
• Those high frequency words are often function
words (e.g. prepositions)
• Most words have low frequency
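Zipf's law predicts that a word's frequency is roughly inversely proportional to its frequency rank. A small sketch, assuming whitespace tokenization, that produces the rank and frequency table one would inspect to see this on a real text (the function name is hypothetical):

```python
from collections import Counter

def rank_frequency(text: str) -> list[tuple[int, str, int]]:
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

for row in rank_frequency("the gene of the cell of the"):
    print(row)
```

On a large corpus the top ranks are dominated by function words, and the long tail consists of words that occur only once or twice.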
23
Boolean Queries

Combination of query terms with boolean
operators






AND
OR
NOT
Google, PubMed
high recall, low precision
unranked result
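Boolean retrieval maps directly onto set operations. The sketch below assumes a tiny, made-up collection in which every document is already reduced to a set of terms; the document contents and ids are invented for illustration.

```python
# Toy collection: document id -> set of terms (made-up data).
DOCS = {
    1: {"prostaglandin", "inhibits", "platelet", "aggregation"},
    2: {"aspirin", "inhibits", "prostaglandin", "synthesis"},
    3: {"platelet", "aggregation", "assay"},
}

def docs_with(term: str) -> set[int]:
    """Ids of all documents containing the term."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}

all_docs = set(DOCS)
and_result = docs_with("prostaglandin") & docs_with("inhibits")  # AND: intersection
or_result = docs_with("aspirin") | docs_with("assay")            # OR: union
not_result = all_docs - docs_with("platelet")                    # NOT: complement

print(and_result, or_result, not_result)
```

The results are plain sets, which is why a Boolean query returns an unranked list: a document either matches or it does not.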
24
The vector space model
term weight = (1 + log TF) × log(N / DF)
TF: term frequency
log(N / DF): inverse document frequency (IDF)
N: corpus size
DF: number of documents containing the term
the vector points in ‘word space’
each dimension corresponds to a word or phrase
© Nat Rev Gen(2002):3 pp 601-610
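The term weight on this slide, (1 + log TF) × log(N / DF), can be computed as follows. The slide does not state the base of the logarithm, so the natural logarithm is assumed here, and returning 0 for an absent term is a convention chosen for the sketch.

```python
import math

def term_weight(tf: int, df: int, n_docs: int) -> float:
    """(1 + log TF) * log(N / DF); 0.0 if the term does not occur."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)

# A term that appears in every document carries no weight (log(N/N) = 0).
print(term_weight(tf=5, df=100, n_docs=100))
# A term that is rare in the corpus is weighted up.
print(term_weight(tf=5, df=2, n_docs=100))
```

A document vector is then the list of these weights, one per dimension (word or phrase) of the 'word space'.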
25
IR Evaluation

A document is relevant if it addresses the stated
information need, not because it just happens to
contain all the words in the query.
(Introduction to IR, Cambridge University Press, 2008)



document collection
test cases of information need, as queries
measure of relevance
26
Evaluation (2)

Precision: What fraction of the returned results are relevant to
the information need?
Recall: What fraction of the relevant documents in the
collection were returned by the system?
F-score: harmonic mean of precision and recall
F = (2×p×r)/(p+r)
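A small worked example of the three measures, assuming the returned and relevant documents are given as sets of ids. The ids are made up, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def evaluate(returned: set, relevant: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) for one query."""
    hits = len(returned & relevant)                 # relevant documents that were returned
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = (2 * p * r) / (p + r) if (p + r) else 0.0   # harmonic mean of p and r
    return p, r, f

# 4 documents returned, 3 relevant in the collection, 2 in common.
p, r, f = evaluate({1, 2, 3, 4}, {2, 4, 5})
print(p, r, f)
```

Here precision is 2/4, recall is 2/3, and the F-score is 4/7, which lies closer to the lower of the two values, as a harmonic mean does.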
27
Exercise 2

Compare the retrieval of abstracts between PubMed and
Phasar (www.bioinformatics.nl/biometa/applet.html or
twoquid.cs.ru.nl/applet.html) given the question:





What does prostaglandin inhibit?
How many results do you get?
Give examples of answers to the question.
Give 5 pmids of papers you would read given the results
in each search.
Which of the systems was more helpful and why?
28
Coffee Break
Question Answering




question posed in human language
answer extracted from unstructured text
more developed in generic domain
difficult in biomedical domain
30
Information Extraction & Text Mining



extract structured information from unstructured
text
Named Entity Recognition
identify relationships

e.g. protein-protein interactions
31
Information Extraction


extract meaning
from a text
combines:



pos-tagging
ontologies
regular expressions
© Nat Rev Gen(2002):3 pp 601-610
32
Named Entity Recognition



tagging of biological entities
high performance in generic NLP (0.9 F-score)
difficult in biology


complex terms, synonyms, disambiguation
gene symbols



typographical variations
no use of official symbols
gene/protein names
33
Challenges of NLP

Abbreviation




punctuation can be confused with end of sentence
Wash. (Washington) with wash.
Decimal points
apostrophes: To split or not to split?
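The abbreviation problem can be made concrete with a naive sentence splitter. This is a sketch under stated assumptions: tokens are split on whitespace, and the abbreviation list is a tiny hypothetical one.

```python
# Assumed, deliberately tiny abbreviation list.
ABBREVIATIONS = {"wash.", "e.g.", "mr."}

def split_sentences(text: str) -> list[str]:
    """Split after '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "?", "!"))
        if ends_with_stop and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He moved to Wash. in May. The pH was 7.4 then."))
print(split_sentences("Please wash. Then dry."))
```

The decimal point in 7.4 is harmless because the token does not end in a full stop. The second call shows the slide's ambiguity: after case folding, the verb wash. is mistaken for the abbreviation Wash., so the two sentences are not separated.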
34
Challenges (2)

hyphens: single or multiple words?
data-base vs. data base vs. database
carry-over?
simple stemming:
operate operating operates operation operative
operatives operational → oper
case folding: brown car vs Mr. Brown
35
Anaphora


co-references
one expression referring to another
strictly only local antecedent statements
Sortal anaphora


The monkey took the banana and ate it.
this gene, the virus
resolution required for increased recall
36
Exercise 3

compare NER programmes


retrieve one PubMed abstract
http://biocreative.sourceforge.net/bionlp_tools_links.html





NLProt
TerMine
Whatizit (http://www.ebi.ac.uk/webservices/whatizit/info.jsf)
What are the differences in recognized entities?
Do they miss any obvious entities?
37
Indexing

Inverted Index (Inverted File)





size of index is proportional to size of corpus
remove stopwords, use stemming for more
efficient index
classic version is a boolean index


for each word in the collection (dictionary)
list occurrence and frequency
can also contain positional information
sparse matrix
38
Example

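Each entry lists the term, the number of docs containing the term and the document ids,
followed by the total # of occurrences and the term positions in counted words.

deterministic
number of docs containing the term: 20
document ids: 73 89 90 106 173 194 233 243 251 252 255 257 258 267 276 281 304 312 315 326
total # of occurrences: 27
term positions: 36822 44643 45285 53003 53061 86740 86743 97082 116618 121984 125750 125952 125968 126039 127633 128882 128978 129048 133781 133789 138493 140946 140947 152011 156191 157881 163490

deterrence
number of docs containing the term: 1
document ids: 60
total # of occurrences: 4
term positions: 30309 30345 30444 30452

detonation
number of docs containing the term: 2
document ids: 263 264
total # of occurrences: 4
term positions: 131781 131956 131995 132303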
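A positional inverted index of this kind can be built in a few lines. The sketch assumes whitespace tokenization and two made-up documents. Unlike the example above, which counts positions across the whole corpus, it stores word positions per document.

```python
from collections import defaultdict

def build_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [word positions within that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

# Made-up documents, for illustration only.
docs = {
    1: "deterministic model of detonation",
    2: "a deterministic deterministic test",
}
index = build_index(docs)
postings = index["deterministic"]
print(len(postings), sorted(postings))           # number of docs, document ids
print(sum(len(p) for p in postings.values()))    # total number of occurrences
```

A classic Boolean index would keep only the document ids; the position lists are the optional positional information mentioned on the previous slide.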
39
Suffix Array

A suffix array is an array that contains all the
pointers to the text suffixes listed in
lexicographical order.




Text is seen as one long string
A text suffix is a substring from given position till end
of string
position refers to beginning of word
return all occurrences of string W in large text A
40
Example:
the word: abracadabra
1. create all suffixes
2. sort suffixes on alphabet
3. resulting suffix array

Finding every occurrence of the substring is
equivalent to finding every suffix that begins with
the substring
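The three steps can be sketched for abracadabra as follows. This is a naive construction that sorts the suffixes directly, with positions counted from 0 and at character level rather than at word beginnings; the function names are hypothetical. Lookup uses binary search, because all suffixes that begin with the pattern are adjacent in the sorted array.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, in lexicographical order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text: str, sa: list[int], pattern: str) -> list[int]:
    """Positions of every occurrence of pattern, found by binary search."""
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix that is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # first suffix that no longer starts with pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:].startswith(pattern):
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abracadabra"
sa = suffix_array(text)
print(sa)                               # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_all(text, sa, "abra"))       # [0, 7]
```

Sorting whole suffixes is fine for a single word but too slow for a large text, where dedicated construction algorithms are used instead.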
41
Document Classification

assign a document to a class given its content




manual (ad hoc)
rule-based
decision tree
machine learning approaches
42
Statistical Text Classification




training documents for each class
supervised learning
test data or new data
training data and test data have to be similar
43
Naïve Bayes


Naïve: all words in text are considered
independent
Bayes: uses Bayes' theorem
P(A | B) = P(B | A) P(A) / P(B)
posterior probability: P(A | B)
prior probability: P(A)
44
Basic Probability Theory




Given A represents an event
the probability of A occurring is 0 ≤ P(A) ≤ 1
Joint probability P(A,B) = P(A∩B)
Conditional probability P(A | B)
Chain rule P(A,B) = P(A | B)P(B) = P(B | A)P(A)
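Bayes' theorem follows from the chain rule in one step: both factorisations describe the same joint probability, so they can be equated and divided by P(B), assuming P(B) > 0.

```latex
P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}, \qquad P(B) > 0
```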
45
Application to Document Classification
probability of a word belonging to category C
probability of a document belonging to category C given its words
wikipedia.org
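A minimal naïve Bayes text classifier, as a sketch under stated assumptions: whitespace tokenization, log probabilities to avoid numerical underflow, and add-one (Laplace) smoothing, which the slides do not mention. The training sentences are made up and merely stand in for files such as rugby.txt and tennis.txt.

```python
import math
from collections import Counter

def train(labelled_docs: list[tuple[str, str]]):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter()
    word_counts: dict[str, Counter] = {}
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text: str, model) -> str:
    """Return the class C that maximises log P(C) + sum of log P(word | C)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)     # log prior
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue                                       # skip unseen words
            count = word_counts[label][word]
            # add-one smoothing (an assumption of this sketch)
            score += math.log((count + 1) / (n_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training sentences.
training = [
    ("scrum try tackle", "rugby"),
    ("try conversion scrum", "rugby"),
    ("serve volley ace", "tennis"),
    ("ace serve set", "tennis"),
]
model = train(training)
print(classify("serve and volley", model))
```

The sum over words is where the naïve independence assumption enters: each word contributes its own factor P(word | C), regardless of the other words in the document.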
46
Coffee Break
Exercise 4

Try to apply naïve Bayes to a selection of
sentences using



http://search.cpan.org/~kwilliams/AlgorithmNaiveBayes/
rugby.txt and tennis.txt as training and test data.
If you have it implemented try using this in
combination with the Porter Stemmer
(http://bionlp.stanford.edu/bionlp.pl)
48
Added Challenge

From sequence to abstract to NER





MSTESMIRDVELAEEALPQKMGGFQNSRRCLCLSLFSFLLVAGATTLFCLLNFGVIGPQR
DEKFPNGLPLISSMAQTLTLRSSSQNSSDKPVAHVVANHQVEEQLEWLSQRANALLANGM
DLKDNQLVVPADGLYLVYSQVLFKGQGCPDYVLLTHTVSRFAISYQEKVNLLSAVKSPCPKDTPEGAE
LKPWYEPIYLGGVFQLEKGDQLSAEVNLPKYLDFAESGQVYFGVIAL
retrieve UniprotID via BLAST (take best hit)
retrieve gene name using getz (GeneName field)
retrieve relevant abstracts from PubMed in Medline
format using eSearch and eFetch with the gene name
extract all protein/gene names from these abstracts



http://bionlp.stanford.edu/webservices.html
how do they relate to the original protein?
compare to the output of ebiMed using the gene
name (http://www.ebi.ac.uk/Rebholzsrv/ebimed/index.jsp)
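For the eSearch and eFetch step, the request URLs can be assembled as sketched below. The base address and the parameters db, term, retmax, id, rettype and retmode belong to the NCBI E-utilities; the helper names, the default of 20 results and the placeholder gene name are assumptions of this sketch. Only the URLs are built here; sending the requests and parsing the returned PMIDs is left out.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(gene_name: str, retmax: int = 20) -> str:
    """URL that searches PubMed for the gene name and returns matching PMIDs."""
    params = {"db": "pubmed", "term": gene_name, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def efetch_url(pmids: list[str]) -> str:
    """URL that fetches the given PMIDs as plain text in Medline format."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "medline",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

# "YOUR_GENE" is a placeholder for the gene name found in the previous step.
print(esearch_url("YOUR_GENE"))
```

The PMIDs returned by the first request are passed to the second, and the fetched Medline records are then the input for the named entity recognition step.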
49
Helpful resources




http://www-nlp.stanford.edu/links/statnlp.html
http://nlp.stanford.edu/IRbook/html/htmledition/mybook.html
www.biocreative.org
Drosophila gene names:

http://www.curioustaxonomy.net/gene/fly.html
50
Further Reading

Introduction to Information Retrieval



Cambridge University Press
ISBN 978-0-521-86571-5
The Text Mining Handbook


Cambridge University Press
ISBN-13 978-0-521-83657-9
51
