
Information Retrieval
CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005
Finding Out About
• There are many large corpora of information that people
use. The web is the obvious example. Others include:
– scientific journals
– patent databases
– Medline
– Usenet groups
• People interact with all that information because they
want to KNOW something; there is a question they are
trying to answer or a piece of information they want.
• Information Retrieval, or IR, is the process of answering
that information need.
• Simplest approach:
– Knowledge is organized into chunks (pages or documents)
– Goal is to return appropriate chunks
Information Retrieval Systems
• Goal of an information retrieval system is to
return appropriate chunks
• Steps involved include:
– asking a question
– finding answers
– evaluating answers
– presenting answers
• Value of an IR tool depends on how well it
does on all of these.
• Web search engines are the IR tools most
familiar to most people.
Asking a Question
• Queries reflect some information need
• Query Syntax needs to allow information need to
be expressed
– Keywords
– Combining terms
• Simple: “required” (+) and NOT (-) markers
• Boolean expressions with and/or/not and nested parentheses
• Variations: strings, NEAR, capitalization.
– Simplest syntax that works
– Typically more acceptable if predictable
• Another set of problems when information isn’t
text: graphics, music
Finding the Information
• Goal is to retrieve all relevant chunks. Too time-consuming to do in real time, so IR systems index pages.
• Two basic approaches
– Index and classify by hand
– Automate
• For BOTH approaches deciding what to index on
(e.g., what is a keyword) is a significant issue.
• Many IR tools like search engines provide both
IR Basics
• A retriever collects a page or chunk. This may
involve spidering web pages, extracting documents
from a DB, etc.
• A parser processes each chunk and extracts
individual words.
• An indexer creates/updates a hash table which
connects words with documents
• A searcher uses the hash table to retrieve documents
based on words
• A ranking system decides the order in which to
present the documents: their relevance
How Good Is The IR?
• Information Retrieval systems are
evaluated with two basic metrics:
– Precision: what percent of documents returned are actually relevant to the information need
– Recall: what percent of documents relevant to the information need are returned
• Can’t typically measure these exactly;
usually based on test sets.
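
A minimal sketch of the two metrics, assuming a test set in which the relevant documents are known (the document IDs here are illustrative):

def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 returned docs are relevant; 2 of the 3 relevant docs returned.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"]))
# (0.5, 0.6666666666666666)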
Selecting Relevant Documents
• Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– Individual documents have been separated into
individual files
• Remaining components must parse, index,
find, and rank documents.
• Traditional approach is based on the words in
the documents (predates the web)
Extracting Lexical Features
• Process a string of characters
– assemble characters into tokens (tokenizer)
– choose tokens to index
• Standard lexical analysis problem
• Lexical Analyser Generator, such as lex
Lexical Analyser
• Basic idea is a finite state machine
• Triples of input state, transition token,
output state
[Figure: a three-state tokenizer FSM with states 0, 1, and 2; transitions on A-Z characters, blank, and blank/EOF.]
• Must be very efficient; gets used a LOT
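
A minimal sketch of such a tokenizer FSM in Python (the state numbering follows the figure; the exact transitions are assumed):

def tokenize(text):
    # States: 0 = between tokens, 1 = inside a token, 2 = done (EOF).
    tokens, current, state = [], [], 0
    for ch in text + "\0":                  # "\0" stands in for EOF
        if state == 0 and ch.isalpha():     # letter: start a token
            current, state = [ch], 1
        elif state == 1:
            if ch.isalpha():                # letter: stay in the token
                current.append(ch)
            else:                           # blank or EOF: emit the token
                tokens.append("".join(current))
                state = 2 if ch == "\0" else 0
    return tokens

print(tokenize("finite state machine"))  # ['finite', 'state', 'machine']

A production tokenizer would be table-driven rather than branching on each character, precisely because it gets used so much.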
Design Issues for Lexical Analyser
• Punctuation
– treat as whitespace?
– treat as characters?
– treat specially?
• Case
– fold?
• Digits
– assemble into numbers?
– treat as characters?
– treat as punctuation?
Lexical Analyser
• Output of lexical analyser is a string of
tokens
• Remaining operations are all on these
tokens
• We have already thrown away some information; this makes search more efficient, but somewhat limits its power
Stemming
• Additional processing at the token level
– We covered this earlier in the semester
• Turn words into a canonical form:
– “cars” into “car”
– “children” into “child”
– “walked” into “walk”
• Decreases the total number of different
tokens to be processed
• Decreases the precision of a search, but
increases its recall
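
A toy illustration of this canonicalization (not the Porter algorithm; a real IR system would use a full stemmer):

IRREGULAR = {"children": "child"}      # lookup for irregular forms

def stem(word):
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):  # crude suffix stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([stem(w) for w in ["cars", "children", "walked"]])
# ['car', 'child', 'walk']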
Noise Words (Stop Words)
• Function words that contribute little or
nothing to meaning
• Very frequent words
– If a word occurs in every document, it is not
useful in choosing among documents
– However, need to be careful, because this is
corpus-dependent
• Often implemented as a discrete list
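
A sketch of list-based removal (the stop list here is a tiny illustrative sample; real lists are larger and corpus-dependent):

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "sat", "on", "a", "mat"]))
# ['cat', 'sat', 'mat']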
Example Corpora
• We are assuming a fixed corpus. Some sample
corpora:
– Medline abstracts
– Email (anyone’s email)
– Reuters corpus
– Brown corpus
• Will contain textual fields, maybe structured
attributes
– Textual: free, unformatted, no meta-information. NLP
mostly needed here
– Structured: additional information beyond the content
Structured Attributes for Medline
• PubMed ID
• Author
• Year
• Keywords
• Journal
Textual Fields for Medline
• Abstract
– Reasonably complete standard academic
English
– Capturing the basic meaning of document
• Title
– Short, formalized
– Captures most critical part of meaning
– Proxy for abstract
Structured Fields for Email
• To, From, Cc, Bcc
• Dates
• Content type
• Status
• Content length
• Subject (partially)
Textual Fields for Email
• Subject
– Format is structured, content is arbitrary.
– Captures most critical part of content.
– Proxy for content -- but may be inaccurate.
• Body of email
– Highly irregular, informal English.
– Entire document, not summary.
– Spelling and grammar irregularities.
– Structure and length vary.
Indexing
• We have a tokenized, stemmed
sequence of words
• Next step is to parse document,
extracting index terms
– Assume that each token is a word and we
don’t want to recognize any more complex
structures than single words.
• When all documents are processed,
create index
Basic Indexing Algorithm
• For each document in the corpus
– Get the next token
– Create or update an entry in a list
• doc ID, frequency.
• For each token found in the corpus
– calculate #docs, total frequency
– sort by frequency
– Often called a “reverse index”, because it reverses the
“words in a document” index to be a “documents
containing words” index.
– May be built on the fly or created after indexing.
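
A minimal sketch of this algorithm (document IDs and tokens are toy values):

from collections import defaultdict, Counter

def build_index(corpus):              # corpus: {doc ID: token list}
    index = defaultdict(Counter)      # token -> {doc ID: frequency}
    for doc_id, tokens in corpus.items():
        for token in tokens:
            index[token][doc_id] += 1
    return index

corpus = {"d1": ["ir", "index", "ir"], "d2": ["index", "query"]}
index = build_index(corpus)
print(index["ir"])                    # Counter({'d1': 2})
print(len(index["index"]))            # 2 -- #docs containing "index"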
Fine Points
• Dynamic corpora (e.g., the web) require incremental algorithms
• Higher-resolution data (e.g., character position).
– Supports highlighting
– Supports phrase searching
– Useful in relevance ranking
• Giving extra weight to proxy text (typically by
doubling or tripling frequency count)
• Document-type-specific processing
– In HTML, want to ignore tags
– In email, maybe want to ignore quoted material
Choosing Keywords
• Don’t necessarily want to index on every
word
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
• How do we choose keywords?
– Manually
– Statistically
• Exhaustivity vs specificity
Manually Choosing Keywords
• Unconstrained vocabulary: allow creator of
document to choose whatever he/she wants
– “best” match
– captures new terms easily
– easiest for person choosing keywords
• Constrained vocabulary: hand-crafted
ontologies
– can include hierarchical and other relations
– more consistent
– easier for searching; possible “magic bullet”
search
Examples of Constrained Vocabularies
• ACM headings (www.acm.org/class/1998)
• H: Information Systems
– H3: Information Storage and Retrieval
– H3.3: Information Search and Retrieval
» Clustering
» Query formulation
» Relevance feedback
» Search process etc.
• Medline Headings (www.nlm.nih.gov/mesh/meshhome.html)
• L: Information Science
– L01: Information Science
– L01.700: Medical Informatics
– L01.700.508: Medical Informatics Applications
– L01.700.508.280: Information Storage and Retrieval
» Grateful Med [L01.700.508.280.400]
Automated Vocabulary Selection
• Frequency: Zipf’s Law.
– P_n = 1/n^a, where P_n is the frequency of occurrence of the nth-ranked item and a is close to 1
– Within one corpus, words with middle
frequencies are typically “best”
• Document-oriented representation bias:
lots of keywords/document
• Query-oriented representation bias:
only the “most typical” words. Assumes
that we are comparing across
documents.
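
A quick illustration of Zipf’s law with toy counts and a = 1: rank times frequency should come out roughly constant:

counts = {"the": 6000, "of": 2900, "and": 2000, "to": 1500}  # toy data
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(word, rank * freq)          # roughly constant (~6000)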
Choosing Keywords
• “Best” depends on actual use; if a word occurs in only one document, it may be very good for retrieving that document, but not very effective overall.
• Words which have no resolving power
within a corpus may be best choices
across corpora
• Not very important for web searching;
more relevant for some text mining.
Keyword Choice for WWW
• We don’t have a fixed corpus of documents
• New terms appear fairly regularly, and are
likely to be common search terms
• Queries that people want to make are
wide-ranging and unpredictable
• Therefore: can’t limit keywords, except
possibly to eliminate stop words.
• Even stop words are language-dependent.
So determine language first.
Comparing and Ranking Documents
• Once our IR system has retrieved a set of
documents, we may want to
• Rank them by relevance
– Which are the best fit to my query?
– This involves determining what the query is
about and how well the document answers it
• Compare them
– Show me more like this.
– This involves determining what the document
is about.
Determining Relevance by Keyword
• The typical document retrieval query consists
entirely of keywords.
• Retrieval can be binary: present or absent
• More sophisticated is to look for degree of
relatedness: how much does this document
reflect what the query is about?
• Simple strategies:
– How many times does the word occur in the document?
– How close to the head of the document?
– If multiple keywords, how close together?
Keywords for Relevance Ranking
• Count: repetition is an indication of emphasis
– Very fast (usually in the index)
– Reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
• Position: Lead paragraphs summarize content
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length
Keywords for Relevance Ranking
• Proximity for multiple keywords
– Requires even more computation
– Obviously relevant only if there are multiple keywords
– Effectiveness of heuristic varies with information
need; typically either excellent or not very helpful
at all
• All keyword methods
– Are computationally simple and adequately fast
– Are effective heuristics
– Typically perform as well as in-depth natural language methods for standard IR
Comparing Documents
• "Find me more like this one" really means that
we are using the document as a query.
• This requires that we have some conception of
what a document is about overall.
• Depends on context of query. We need to
– Characterize the entire content of this document
– Discriminate between this document and others in
the corpus
Characterizing a Document:
Term Frequency
• A document can be treated as a sequence of
words.
• Each word characterizes that document to some
extent.
• When we have eliminated stop words, the most
frequent words tend to be what the document is
about
• Therefore: f_k,d (# of occurrences of word k in document d) will be an important measure.
• Also called the term frequency
Characterizing a Document:
Document Frequency
• What makes this document distinct from others
in the corpus?
• The terms which discriminate best are not those
which occur with high frequency!
• Therefore: D_k (# of documents in which word k occurs) will also be an important measure.
• Also called the document frequency
TF*IDF
• This can all be summarized as:
– Words are best discriminators when they
• occur often in this document (term frequency)
• don’t occur in a lot of documents (document frequency)
• One very common measure of the importance of
a word to a document is TF*IDF: term frequency
* inverse document frequency
• There are multiple formulas for actually computing
this. The underlying concept is the same in all of
them.
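
One common variant, sketched in Python (raw term frequency times log inverse document frequency; as noted above, other formulas exist):

import math

def tf_idf(term, doc, corpus):        # doc, corpus entries: token lists
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["ir", "index", "ir"], ["index", "query"], ["parse"]]
print(tf_idf("ir", corpus[0], corpus))     # 2 * ln(3/1), about 2.20
print(tf_idf("index", corpus[0], corpus))  # 1 * ln(3/2), about 0.41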
Describing an Entire Document
• So what is a document about?
• TF*IDF: can simply list keywords in
order of their TF*IDF values
• Document is about all of them to some
degree: it is at some point in some
vector space of meaning
Vector Space
• Any corpus has defined set of terms (index)
• These terms define a knowledge space
• Every document is somewhere in that
knowledge space -- it is or is not about each of
those terms.
• Consider each term as a vector. Then
– We have an n-dimensional vector space
– Where n is the number of terms (very large!)
– Each document is a point in that vector space
• The document position in this vector space can
be treated as what the document is about.
Similarity Between Documents
• How similar are two documents?
– Measures of association
• How much do the feature sets overlap?
• Modified for length: Dice coefficient
– Dice(x,y) = 2 f(x,y) / ( f(x) + f(y) )
– compares size of the intersection to total # of terms
• Simple Matching coefficient: takes exclusions into account
– Cosine similarity
• similarity of angle of the two document vectors
• not sensitive to vector length
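
A sketch of both measures (Dice on term sets, cosine on term-frequency vectors; the two documents are toy examples):

import math
from collections import Counter

def dice(x, y):                       # x, y: sets of terms
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(a, b):                     # a, b: term-frequency Counters
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = Counter(["man", "bites", "dog"])
d2 = Counter(["dog", "bites", "man"])
print(dice(set(d1), set(d2)), cosine(d1, d2))   # 1.0 1.0

Both measures score these two documents as identical, which leads directly to the bag-of-words point on the next slide.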
Bag of Words
• All of these techniques are what is known
as bag of words approaches.
• Keywords treated in isolation
• Difference between "man bites dog" and "dog bites man" is non-existent
• If better discrimination is needed, IR
systems can add semantic tools
– Use POS
– Parse into basic NP VP structure
– Requires that the query be more complex.
Improvements
• The two big problems with short queries
are:
– Synonymy: Poor recall results from missing
documents that contain synonyms of search
terms, but not the terms themselves
– Polysemy/Homonymy: Poor precision results from search terms that have multiple meanings, leading to the retrieval of non-relevant documents.
Martin: www.cs.colorado.edu/~martin/csci5832.html
Query Expansion
• Find a way to expand a user’s query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall
– Use a dictionary/thesaurus
– Use relevance feedback
Martin: www.cs.colorado.edu/~martin/csci5832.html
Dictionary/Thesaurus Example
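
A minimal sketch of dictionary-based expansion (the thesaurus here is a made-up toy, not a real resource such as WordNet):

THESAURUS = {"car": ["automobile", "auto"], "doctor": ["physician"]}

def expand_query(terms):
    expanded = list(terms)
    for term in terms:
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand_query(["car", "accident"]))
# ['car', 'accident', 'automobile', 'auto']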
Relevance Feedback
• Ask user to identify a few documents
which appear to be related to their
information need
• Extract terms from those documents and
add them to the original query.
• Run the new query and present those
results to the user.
• Typically converges quickly
Based on Martin: www.cs.colorado.edu/~martin/csci5832.html
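
A minimal sketch of one such expansion step (selecting added terms by raw frequency is an assumption; real systems weight terms, e.g. with Rocchio-style updates):

from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def expand_from_feedback(query, relevant_docs, k=3):
    # Add the k most frequent new non-stop terms from the documents
    # the user identified as relevant.
    counts = Counter(t for doc in relevant_docs for t in doc
                     if t not in STOP_WORDS and t not in query)
    return list(query) + [term for term, _ in counts.most_common(k)]

docs = [["ir", "precision", "recall"], ["ir", "recall", "ranking"]]
print(expand_from_feedback(["ir"], docs, k=2))
# ['ir', 'recall', 'precision']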
Blind Feedback
• Assume that the first few documents returned are the most relevant, rather than having users identify them
• Proceed as for relevance feedback
• Tends to improve recall at the expense of
precision
Based on Martin: www.cs.colorado.edu/~martin/csci5832.html
Post-Hoc Analyses
• When a set of documents has been
returned, they can be analyzed to improve
usefulness in addressing information need
– Grouped by meaning for polysemic queries
(using N-Gram-type approaches)
– Grouped by extracted information (Named
entities, for instance)
– Grouped into an existing hierarchy if structured fields are available
– Filtering (e.g., eliminate spam)
Additional IR Issues
• In addition to improving relevance, overall information retrieval can be improved with some other factors:
– Eliminate duplicate documents
– Provide good context
– Use ontologies to provide synonym lists
• For the web:
– Eliminate multiple documents from one site
– Clearly identify paid links
Summary
• Information Retrieval is the process of returning
documents to meet a user’s information need
based on a query
• Typical methods are BOW (bag of words), which rely on keyword indexing with little semantic processing
• NLP techniques used include tokenizing, stemming, and some parsing
• Results can be improved by adding semantic
information (such as thesauri) and by filtering
and other post-hoc analyses.