Information Retrieval Models
1
Retrieval Models
• A retrieval model specifies the details
of:
– Document representation
– Query representation
– Retrieval function
• Determines a notion of relevance.
• Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
2
Classes of Retrieval Models
• Boolean models (set theoretic)
– Extended Boolean
• Vector space models
(statistical/algebraic)
– Generalized VS
– Latent Semantic Indexing
• Probabilistic models
3
Other Model Dimensions
• User Task
– Retrieval
– Browsing
• Logical View of Documents
– Index terms
– Full text
– Full text + Structure (e.g. hypertext)
4
Retrieval and browsing
• The User Task
[Figure: the user reaches the database through either retrieval or browsing.]
• Retrieval
– information or data;
– purposeful.
• Browsing
– glancing around;
– e.g. F1; cars, Le Mans, France, tourism.
5
Logical View of documents
• Logical view of the documents
[Figure: text operations applied to the docs, moving from the full text to index terms: structure recognition; accents and spacing; stopwords; noun groups; stemming; manual indexing.]
• Document representation viewed as a
continuum: logical view of docs might shift.
6
Typical IR task
[Figure: docs are represented by index terms (doc); the information need is expressed as a query; query and doc are matched, and the matches are ranked.]
7
IR keyword match
• Matching at the index term level is quite imprecise;
• No surprise that users are frequently
dissatisfied;
• Since most users have no training in query
formation, the problem is even worse;
• Frequent dissatisfaction of Web users;
• Issue of deciding relevance is critical for IR
systems: ranking.
8
Ranking
• A ranking is an ordering of the documents
retrieved that (hopefully) reflects the relevance
of the documents to the user query;
• A ranking is based on fundamental premises
regarding the notion of relevance, such as:
– common sets of index terms;
– sharing of weighted terms;
– likelihood of relevance.
• Each set of premises leads to a distinct IR
model.
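As a rough illustration of the first two premises, the sketch below ranks three toy documents against a query in two ways: by the number of shared index terms, and by a sum of products of term weights. The documents, the weights, and the function names are all made up for illustration; the slides do not prescribe these particular scoring functions.

```python
def overlap_score(query_terms, doc_terms):
    """Premise 1: count the index terms shared by query and document."""
    return len(set(query_terms) & set(doc_terms))


def weighted_score(query_weights, doc_weights):
    """Premise 2: sum the products of the weights of shared terms."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)


def rank(scores):
    """Order document ids by decreasing score (ties broken by id)."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical documents: term -> weight
docs = {
    "d1": {"car": 0.9, "race": 0.4},
    "d2": {"car": 0.2, "tourism": 0.8},
    "d3": {"france": 0.7, "tourism": 0.5},
}
query = {"car": 1.0, "tourism": 0.5}

by_overlap = rank({d: overlap_score(query, w) for d, w in docs.items()})
by_weight = rank({d: weighted_score(query, w) for d, w in docs.items()})
print(by_overlap)  # d2 first: it shares two terms with the query
print(by_weight)   # d1 first: its single shared term carries a high weight
```

The two premises give different orderings for the same data, which is the sense in which each set of premises leads to a distinct model.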
9
IR Models
[Figure: taxonomy of IR models, organized by user task.]
• User task: Retrieval (ad hoc, filtering)
– Classic models: Boolean, vector, probabilistic
– Set theoretic: fuzzy, extended Boolean
– Algebraic: generalized vector, latent semantic indexing, neural networks
– Probabilistic: inference network, belief network
– Structured models: non-overlapping lists, proximal nodes
• User task: Browsing
– Flat
– Structure guided
– Hypertext
10
IR Models
• The IR model, the logical view of the docs, and the
retrieval task are distinct aspects of the system.
Rows: USER TASK. Columns: LOGICAL VIEW OF DOCUMENTS.

USER TASK | Index Terms | Full Text | Full Text + Structure
Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing | Flat | Flat, Hypertext | Structure Guided, Hypertext
11
Classic IR Models
• Traditional IR systems employ a set of index
terms to represent the documents;
• The key idea is that the document semantics can
be represented by the index terms;
• Usual formal models:
– Boolean;
– Vector-space;
– Probabilistic.
12
Classic IR Models
• Each document is represented by a set of
representative index terms;
• An index term is a document word useful for
remembering the document's main themes;
• Usually, index terms are nouns because nouns
have meaning by themselves;
• However, search engines assume that all
words are index terms (full text
representation).
13
Classic IR Models
• Not all terms are equally useful for representing the
document contents: less frequent terms allow
identifying a narrower set of documents;
• The importance of the index terms is represented by
weights associated with them. Let:
– ki be an index term
– dj be a document
– wij be a weight associated with (ki,dj)
• The weight wij quantifies the importance of the index
term for describing the document contents.
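The point about less frequent terms can be checked on a toy collection. The sketch below counts, for each term, how many documents contain it, and then derives a weight from that count. Using inverse document frequency, log(N / df), is an assumption made here: it is one common weighting choice, and the slides do not say how wij is computed. The collection itself is invented.

```python
import math

# Hypothetical collection: document id -> list of index terms
docs = {
    "d1": ["retrieval", "model", "boolean"],
    "d2": ["retrieval", "model", "vector"],
    "d3": ["retrieval", "ranking"],
}


def matching_docs(term, docs):
    """Return the ids of the documents that contain the term."""
    return sorted(d for d, terms in docs.items() if term in terms)


N = len(docs)
doc_freq = {t: len(matching_docs(t, docs)) for terms in docs.values() for t in terms}

# Assumed weighting: rarer terms get larger weights.
idf = {t: math.log(N / df) for t, df in doc_freq.items()}

print(matching_docs("retrieval", docs))  # every document: no discrimination
print(matching_docs("boolean", docs))    # a single document: narrow set
```

A term that occurs in every document selects the whole collection and gets weight 0 under this scheme, while a term found in one document narrows the result to that document.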
14
Classic IR Models
– ki is an index term;
– dj is a document;
– t is the total number of terms;
– N is the total number of docs;
– K = (k1, k2, …, kt) is the set of all index terms;
– wij >= 0 is a weight associated with (ki,dj);
– wij = 0 indicates that the term does not belong to the doc;
– vec(dj) = (w1j, w2j, …, wtj) is a weighted vector
associated with the document dj;
– gi(vec(dj)) = wij is a function which returns the weight
associated with the pair (ki,dj).
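The notation above maps directly onto a small data structure. The following is a minimal sketch that assumes binary weights (1 if the term occurs in the document, 0 otherwise); the vocabulary and the document are made up, and the names `vec` and `g` simply mirror the symbols in the slides.

```python
# K: all index terms k1..kt, in a fixed order (hypothetical vocabulary)
K = ["boolean", "model", "retrieval", "vector"]
t = len(K)


def vec(doc_terms, vocabulary=K):
    """Return (w1j, ..., wtj) using binary weights (an assumption)."""
    present = set(doc_terms)
    return tuple(1 if k in present else 0 for k in vocabulary)


def g(i, doc_vector):
    """g_i(vec(dj)) = wij, with i counted from 1 as in the slides."""
    return doc_vector[i - 1]


d1 = vec(["retrieval", "model", "boolean"])
print(d1)        # one weight per index term
print(g(4, d1))  # weight of k4 = "vector", which d1 does not contain
```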
15
Retrieval Tasks
• Ad hoc retrieval: Fixed document corpus,
varied queries.
• Filtering: Fixed query, continuous
document stream.
– User Profile: a model of relatively static
preferences.
– Binary decision of relevant/not-relevant.
• Routing: Same as filtering but continuously
supply ranked lists rather than binary
filtering.
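The difference between filtering and routing can be sketched with a fixed profile applied to incoming documents. Everything below is illustrative: the profile is modeled as term weights, the score is a plain sum of matching weights, and the threshold value is arbitrary. A real routing system would update its ranked list as the stream arrives, whereas this sketch ranks one small batch.

```python
def score(profile, doc_terms):
    """Sum the profile weights of the terms that occur in the document."""
    return sum(weight for term, weight in profile.items() if term in doc_terms)


def filter_stream(profile, stream, threshold):
    """Filtering: a binary relevant / not-relevant decision per document."""
    return [doc_id for doc_id, terms in stream if score(profile, terms) >= threshold]


def route_stream(profile, stream):
    """Routing: rank the documents by score instead of applying a binary cut."""
    scored = [(score(profile, terms), doc_id) for doc_id, terms in stream]
    return [doc_id for s, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical user profile and document stream
profile = {"racing": 1.0, "cars": 0.5}
stream = [
    ("n1", {"cars", "racing", "france"}),
    ("n2", {"tourism", "france"}),
    ("n3", {"cars", "tourism"}),
]

kept = filter_stream(profile, stream, threshold=1.0)
ranked = route_stream(profile, stream)
print(kept)    # only documents at or above the threshold
print(ranked)  # every document, best match first
```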
16
Retrieval: Ad Hoc vs. Filtering
• Ad hoc retrieval:
[Figure: queries Q1 to Q5 are issued against a collection of “fixed size”.]
17
Retrieval: Ad Hoc vs. Filtering
• Filtering:
[Figure: a documents stream is matched against the User 1 and User 2 profiles, producing the docs for User 1 and the docs filtered for User 2.]
18
Common Preprocessing Steps
• Strip unwanted characters/markup (e.g. HTML
tags, punctuation, numbers, etc.).
• Break into tokens (keywords) on whitespace.
• Stem tokens to “root” words
– computational → comput
• Remove common stopwords (e.g. a, the, it, etc.).
• Detect common phrases (possibly using a
domain specific dictionary).
• Build inverted index (keyword → list of docs
containing it).
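The steps above can be sketched end to end as follows. This is a toy pipeline under several assumptions: the stopword list is a tiny sample, `toy_stem` is a crude suffix stripper standing in for a real stemmer, and phrase detection is left out. The documents are invented.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it", "is", "of", "and"}  # tiny illustrative list


def strip_markup(text):
    """Drop HTML-like tags, then anything that is not a letter or whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^A-Za-z\s]", " ", text)


def toy_stem(token):
    """Crude suffix stripping; keeps at least three characters of the word."""
    for suffix in ("ational", "ation", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Strip markup, tokenize on whitespace, stem, then drop stopwords."""
    tokens = strip_markup(text).lower().split()
    stems = [toy_stem(tok) for tok in tokens]
    return [s for s in stems if s not in STOPWORDS]


def build_inverted_index(docs):
    """Map each keyword to the sorted list of ids of the docs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    "d1": "<p>The computational model.</p>",
    "d2": "Indexing models, 2 of them!",
}
index = build_inverted_index(docs)
print(preprocess(docs["d1"]))
print(index["model"])
```

Because both documents reduce to the stem "model", a query on that keyword reaches both through a single index lookup.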
19