Information Storage and Retrieval

Introduction
Ohm Sornil
Department of Computer Science
The National Institute of Development Administration
1
Introduction


IR is often regarded as synonymous with document retrieval and text retrieval
Task statement:
To retrieve documents whose contents the user is likely to find relevant to his query
2
Definitions

Information need


Psychological state existing in a user’s
mind
Query

A user’s attempt to put his information
need into words
3
Sample Query Page
A query
4
Results Set
5
Data and Information

Data
  Raw facts
Information
  Data that have been shaped into a form that is meaningful and useful to human beings
  Subjective and time-dependent
Regarding IR
  Data is a document (a set of terms)
  Information is what the document conveys (i.e., its meaning)
  An IR system tries to retrieve documents whose meaning is relevant to the query
6
Information Retrieval Process
[Figure: a user with an information need submits a query to the IR system, which searches the document collection and returns a results set of documents]
7
IR System Structure
[Figure: a user with an information need sends a query through the user interface to the searcher, which uses an index built over the document collection]
IR subsumes two related activities
  Indexing
    The way documents are represented for retrieval purposes (i.e., index construction)
  Searching
    The way document representations are examined against the query representation (and items are taken as related to the search query)
8
A Simple Inverted Index
Document 1 = Information retrieval is searching and indexing
Document 2 = Indexing is building an index
Document 3 = An inverted file is an index
Document 4 = Building an inverted file is indexing
Vocabulary     Inverted List (document)
an             2, 3, 4
and            1
building       2, 4
file           3, 4
index          2, 3
indexing       1, 2, 4
information    1
inverted       3, 4
is             1, 2, 3, 4
retrieval      1
searching      1
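The index for the four sample documents can be rebuilt with a few lines of code. The following is a minimal sketch, assuming lowercased, letters-only tokenization; the names `build_inverted_index`, `DOCS`, and `INDEX` are illustrative and not part of the slides.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Tokenization is an assumption: runs of letters, case-folded.
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    # Sort the vocabulary and each inverted list for stable output.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


DOCS = {
    1: "Information retrieval is searching and indexing",
    2: "Indexing is building an index",
    3: "An inverted file is an index",
    4: "Building an inverted file is indexing",
}

INDEX = build_inverted_index(DOCS)
for term, postings in INDEX.items():
    print(f"{term:<12} {postings}")
```

Looking up `INDEX["indexing"]` returns the inverted list for that term, so a searcher never has to scan the documents themselves.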
9
Some History








The term “Information Retrieval” was coined in a paper by Mooers in 1952
The International Conference on Scientific Information in 1958 marked the start of IR
Probabilistic model at Rand (Maron & Kuhns) (1960)
Boolean system development at Lockheed (1960s)
Vector Space model (Salton) (1965)
Statistical weighting methods and theoretical advances (1970s)
Refinements and advances in application (1980s)
User interface, large-scale testing and application (1990s)
10
Why Is IR A Difficult Problem?

Consider two major aspects
Effectiveness (accuracy of the results)
  Imprecise characteristics of natural language
  Subjective nature of concepts
Efficiency
  Number of documents in the collection
  Number of requests per unit of time (workload)
  Dynamicity of documents and the collection
11
Effectiveness of IR Systems

Natural language characteristics
  Synonymy
    Multiple words with the same meaning
    e.g., dog & canine
  Polysemy
    Multiple meanings of one word
    e.g., "vaccine shot" vs. "I shot a police officer"
  Presentation level
    For novices, for experts, etc.
  Style of writing
    Informative, sarcastic, humorous, etc.
12
Effectiveness of IR Systems (II)

Subjective nature of concepts



Prior knowledge and experience of users
Personal ability to understand / misunderstand the concepts
These may change during the course of the search
13
Consequences

Query and information need
  Any query is merely one possible way of expressing an information need
  Different users may form different queries based on their understanding of the information need and their ability to form the “right” queries (questions given to an IR system)
Document content
  Different users may have different levels of understanding of a document
Relevance judgment
  Given the same document and query, different users may judge the document differently
Though we want answers to an information need (pertinent answers), an IR system can only do its best to provide answers to the query (relevant answers); it can only guess what the information need is
14
Retrieval Models


Algorithms (models) for ranking documents with regard to
a user query
A retrieval model specifies the details of:



Document (content) representation
Query representation
Ranking function
15
Retrieval Models (cont.)

Classical retrieval models




Boolean model
Probabilistic model
Vector Space model
Modern retrieval models





Latent Semantic Indexing model
Belief Network-based models
Neural Network model
Fuzzy Set model
…
16
Another Classification of Models

Exact match



Text pattern matching (KMP, BM, etc.)
Boolean model
Inexact match (best match)



Probabilistic
Vector Space
Many other models
17
The Boolean Model

Document representation
  A set of (key)words
Query representation
  A Boolean expression of keywords, connected by logical operations
Ranking function
  1 if the keywords within a document satisfy the Boolean query
  0 otherwise
Example
  Query: goat AND (ink OR zebra)
  Document 1: Ant bird cat. Dog elephant fish goat. Horse ink.
  Ranking function: 1
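The 0/1 ranking function can be sketched directly. This is a minimal illustration, assuming the query is already parsed into nested tuples such as `("AND", "goat", ("OR", "ink", "zebra"))`; the tuple encoding and the names `keywords` and `boolean_rank` are assumptions made for the example, not part of the slides.

```python
import re


def keywords(text):
    """Represent a document as a set of lowercased keywords."""
    return set(re.findall(r"[a-z]+", text.lower()))


def boolean_rank(query, doc_terms):
    """Return 1 if the document's keywords satisfy the Boolean query, else 0."""
    if isinstance(query, str):
        # A bare keyword is satisfied when the document contains it.
        return int(query in doc_terms)
    op, *args = query
    values = [boolean_rank(arg, doc_terms) for arg in args]
    if op == "AND":
        return int(all(values))
    if op == "OR":
        return int(any(values))
    if op == "NOT":
        return 1 - values[0]
    raise ValueError(f"unknown operator: {op}")


QUERY = ("AND", "goat", ("OR", "ink", "zebra"))
DOC1 = keywords("Ant bird cat. Dog elephant fish goat. Horse ink.")
print(boolean_rank(QUERY, DOC1))
```

Document 1 contains both goat and ink, so the function returns 1; a document without goat would get 0 no matter how many of the other keywords it contains, which is what makes this an exact-match model.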
18
The Vector Space Model

Basis
  Given t distinct terms in the collection, each called an index term (collectively called the vocabulary)
Document representation
  A t-dimensional vector
  $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Query representation
  A t-dimensional vector
  $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
Ranking function
  The correlation between $\vec{d}_j$ and $\vec{q}$
Example
  Vocabulary (8-dimensional vector space): 1 ant, 2 bird, 3 cat, 4 dog, 5 fish, 6 ink, 7 king, 8 zebra
  d1: Ant bird cat. Dog fish.
  d2: Bird zebra. Ink king. Cat.
  Query: ant bird
  $\vec{d}_1 = (w_{1,1}, w_{2,1}, w_{3,1}, w_{4,1}, w_{5,1}, 0, 0, 0)$
  $\vec{d}_2 = (0, w_{2,2}, w_{3,2}, 0, 0, w_{6,2}, w_{7,2}, w_{8,2})$
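The slide leaves the term weights and the correlation measure unspecified. The sketch below assumes raw term counts as weights and the cosine of the angle between vectors as the correlation, which is one common choice; `VOCAB`, `to_vector`, and `cosine` are illustrative names.

```python
import math
import re
from collections import Counter

VOCAB = ["ant", "bird", "cat", "dog", "fish", "ink", "king", "zebra"]


def to_vector(text):
    """Build a t-dimensional vector of term counts over the fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in VOCAB]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


d1 = to_vector("Ant bird cat. Dog fish.")
d2 = to_vector("Bird zebra. Ink king. Cat.")
q = to_vector("ant bird")

# Rank documents by decreasing similarity to the query.
ranking = sorted([("d1", cosine(d1, q)), ("d2", cosine(d2, q))],
                 key=lambda pair: -pair[1])
print(ranking)
```

Under these assumptions d1 ranks above d2, because d1 shares both query terms while d2 shares only bird. Unlike the Boolean model, both documents receive a graded score.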
19
The Probabilistic Model


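Document representation
  $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}), \quad w_{i,j} \in \{0, 1\}$
Query representation
  $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}), \quad w_{i,q} \in \{0, 1\}$
Ranking function
  $sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}$
  $sim(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})} \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}$
  Assume independence of index terms
  $sim(d_j, q) \sim \frac{\left( \prod_{g_i(\vec{d}_j) = 1} P(k_i \mid R) \right) \times \left( \prod_{g_i(\vec{d}_j) = 0} P(\bar{k}_i \mid R) \right)}{\left( \prod_{g_i(\vec{d}_j) = 1} P(k_i \mid \bar{R}) \right) \times \left( \prod_{g_i(\vec{d}_j) = 0} P(\bar{k}_i \mid \bar{R}) \right)}$
  Here $\bar{R}$ is the set of non-relevant documents, $g_i(\vec{d}_j)$ is the weight of index term $k_i$ in $d_j$, and $\bar{k}_i$ means that $k_i$ is absent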
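Taking the logarithm of the ratio under the independence assumption turns the products into sums, which is easier to compute. The sketch below assumes the per-term probabilities P(k_i | R) and P(k_i | non-R) are already known; the values in `P_REL` and `P_NONREL` are made up for illustration, and the function name `bir_log_odds` is hypothetical.

```python
import math


def bir_log_odds(doc_terms, p_rel, p_nonrel):
    """Log of the relevant/non-relevant likelihood ratio under term independence."""
    score = 0.0
    for term, p in p_rel.items():
        u = p_nonrel[term]
        if term in doc_terms:
            # Term present: contributes P(k_i | R) / P(k_i | non-R).
            score += math.log(p / u)
        else:
            # Term absent: contributes (1 - P(k_i | R)) / (1 - P(k_i | non-R)).
            score += math.log((1 - p) / (1 - u))
    return score


# Hypothetical estimates for two index terms (not from the slides).
P_REL = {"ant": 0.8, "bird": 0.6}
P_NONREL = {"ant": 0.2, "bird": 0.3}

score_a = bir_log_odds({"ant", "bird", "cat"}, P_REL, P_NONREL)
score_b = bir_log_odds({"bird", "zebra"}, P_REL, P_NONREL)
print(score_a, score_b)
```

Because the logarithm is monotonic, ranking by this score gives the same order as ranking by the ratio itself. In practice the probabilities have to be estimated, for example from relevance feedback.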
20
Neural Network Model
[Figure: three-layer network with weighted links (w): query term layer (q1, q2), document term layer (t1, t2, t3, t4), and document layer (d1, d2, d3)]
Rank_Score(q, d_j) = Activation level of d_j
21
Enhancing Probabilistic Models by
Logistic Regression


Probability of relevance is based on logistic regression from a sample
set of documents to determine values of the coefficients
At search time, the probability estimate is obtained by:
$sim(d_j, q) = c_0 + \sum_{i=1}^{6} c_i X_i$
for the 6 attribute measures $X_i$ shown above
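The linear score can be evaluated as below. This is only a sketch: the coefficient and attribute values are invented, and treating the score as log-odds that a logistic function maps to a probability is the usual logistic-regression reading, assumed here rather than stated on the slide.

```python
import math


def linear_score(intercept, coefficients, attributes):
    """Compute c_0 + sum_i c_i * X_i for one query-document pair."""
    if len(coefficients) != len(attributes):
        raise ValueError("need one coefficient per attribute measure")
    return intercept + sum(c * x for c, x in zip(coefficients, attributes))


def logistic(score):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical fitted coefficients c_0..c_6 and attribute values X_1..X_6.
C0 = -3.5
COEFFS = [1.2, 0.4, -0.1, 0.9, 0.3, 0.05]
X = [1.0, 2.0, 5.0, 0.5, 1.5, 10.0]

score = linear_score(C0, COEFFS, X)
print(score, logistic(score))
```

The coefficients are fitted once from the sample set of judged documents; at search time only this cheap weighted sum is computed per document.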
22
Other Techniques

Other extensions to existing models
  Extended Boolean model
  Latent Semantic Indexing
  Thesauri
  Clustering
  Term correlation information (e.g., P(a|b), P(c|a,b))
  …
User interface
  Support search dialogue
  Visualization
Relevance feedback
  Query expansion
  Term reweighting
23
Envision
24
ThemeScapes
25
Measuring Retrieval Effectiveness

Recall-Precision
  Precision
    Measures accuracy of the results
    The percentage of the documents retrieved in a search that are relevant
  Recall
    Measures coverage
    The percentage of the total relevant documents that are retrieved in a search
There are many other metrics suitable for different uses
26
Recall and Precision
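[Figure: within the collection, the set of relevant documents (R) overlaps the answer set (A); the overlap is the relevant documents in A (Ra)]
$\mathrm{Recall} = \frac{|R_a|}{|R|}$
$\mathrm{Precision} = \frac{|R_a|}{|A|}$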
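Both measures follow from a set intersection. The sketch below uses made-up document ids for R and A; `recall_precision` is an illustrative name.

```python
def recall_precision(relevant, answer):
    """Return (recall, precision) for a relevant set R and an answer set A."""
    relevant, answer = set(relevant), set(answer)
    hits = relevant & answer  # Ra: relevant documents in the answer set
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(answer) if answer else 0.0
    return recall, precision


# Hypothetical example: 6 relevant documents, 5 retrieved, 3 in common.
R = [4, 9, 12, 14, 170, 200]
A = [4, 9, 33, 200, 512]
print(recall_precision(R, A))
```

Here |Ra| = 3, so recall is 3/6 and precision is 3/5. Retrieving more documents can only keep or raise recall, but usually lowers precision, which is why the two are reported together.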
27
Standard Test Collections


CACM, CISI, TIME, MEDLINE, NPL, …
Text REtrieval Conference (TREC) collections



A collection of various (sub)collections
Huge
Components of each test collection



Test collections
Information need
Relevance judgment
28
Test Document
<doc>
<docno> WSJ880406-0090 </docno>
<hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan
</hl>
<author> Janet Guyon (WSJ Staff) </author>
<dateline> New York </dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new
generation of phone services with broad …
</text>
</doc>
29
Sample TREC Topic
(Information Need)
<TOP>
<NUM>
271
</NUM>
<TITLE>
Solar Power
</TITLE>
<DESC>
To what extent is solar power used as an alternative to
fossil fuels in various countries worldwide?
</DESC>
<NARR>
Although the development of solar power as a major energy
source has progressed slowly, in some parts of the world
it is used extensively. Where and for what purposes?
</NARR>
</TOP>
30
Relevance Judgment
Query    Relevant Documents
1        4, 9, 12, 14, 170, 200
2        30, 120, 480, 500
3        1, 2, 3, 10
…
31
Web Searching


With the advent of the WWW, a new generation of search engines has been developed for searching the Web
Unique properties of the Web
  Sheer scale and exponential growth
  Extreme variety of materials
  Includes hyperlinks
  A wide range of quality
  Distributed data
  Heterogeneous data
Additional component
  Web crawlers
Exploit link information to improve effectiveness
32
PageRank

[Figure: link graph over pages A, B, T1, T2, and T3; a link from T1 to A means "T1 thinks A is important"]
PageRank is Google’s method of measuring a page’s “importance”
Web page importance means “pointed to by lots of important pages”
PageRank can be used to adjust results so that sites that are more “important” move up in a user's results page
PageRank for Page A
  PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
  Ti is a page pointing to page A
  Page Ti has C(Ti) outgoing links
  Damping parameter d ≈ 0.85 (so that 1-d ≈ 0.15)
33
PageRank (cont.)
Computing PageRank
1 Crawl the web to generate NxN link matrix A
A[i,j] = 1 iff page Pi contains link to page Pj
2 A simple iterative algorithm to compute PR(A) for each
webpage, A
Ranking in Google
1 Rank pages according to “page factors” such as keywords
2 Calculate PageRank score for every page
3 Adjust the results from (1) by the PageRank scores
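The simple iterative algorithm can be sketched as repeated application of the PR formula until the values settle. The link graph below is hypothetical, the damping value d = 0.85 is the commonly cited setting and is an assumption here, and a fixed iteration count stands in for a proper convergence test.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) over all pages.

    `links` maps each page to the list of distinct pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # initial guess
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(Ti)/C(Ti) over every page Ti that links to this page.
            incoming = sum(pr[src] / len(targets)
                           for src, targets in links.items()
                           if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical link graph (not taken from the slide's figure).
LINKS = {
    "T1": ["A"],
    "T2": ["A", "B"],
    "T3": ["A"],
    "A": ["T1"],
    "B": ["A"],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page A ends up with the highest score because every other page links to it, while T2 and T3, which nothing links to, stay at the minimum value 1-d. A real crawl would use the sparse link matrix rather than rescanning all links for each page, and would need a policy for pages with no outgoing links.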
34
Efficiency Aspects

Collection scale and Workload





Parallel and distributed IR
Data compression
Replication
…
Dynamicity of documents and the collection



Re-index every night
Adaptive data structures
…
35
Parallel and Distributed IR

Motivations
  Collection size
    Searching and indexing costs grow with the size of the underlying document collection
    As more documents are added, performance (e.g., response time, throughput) may deteriorate to the point where the system is no longer usable
  Workload
    Good search engines should: (a) provide a high query processing rate and, at the same time, (b) have a low response time for an average query
Application of parallelism can greatly enhance the ability to scale traditional IR algorithms and support larger document collections
36
Parallel IR

Depending on the architecture of the system


SIMD (e.g., CM-2), MIMD (e.g., IBM SP2)
Parallel IR Approaches

Develop new retrieval strategies that exploit parallelism


e.g., Neural networks, genetic algorithm, etc.
Parallelize existing serial algorithms
37
Partitioned Parallel Processing
[Figure: a user query goes to a broker, which sends subcomputations to several client processes, collects their intermediate results, and returns the result to the user]
Each of the client processes performs a portion of the computation and transmits an intermediate result back to the broker
38
Parallel IR (II)


IR computation is typically characterized by a small amount of
computation applied to a large amount of data (index)
How to partition the computation boils down to a question of
how to partition the index
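One way to partition is by document: each worker holds the documents (and hence the index entries) for its own slice of the collection, and a broker merges the partial results. The sketch below simulates this with threads in one process; the scoring by query-term counts, the partition layout, and the names `search_partition` and `broker` are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, query_terms):
    """One worker: score only the documents in its own partition."""
    results = []
    for doc_id, text in partition.items():
        terms = text.lower().split()
        score = sum(terms.count(t) for t in query_terms)
        if score:
            results.append((score, doc_id))
    return results


def broker(partitions, query, k=3):
    """Send the query to every partition in parallel and merge the top k hits."""
    query_terms = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda p: search_partition(p, query_terms), partitions)
        merged = [hit for partial in partials for hit in partial]
    # Highest score first; ties broken by smaller document id.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))[:k]


# Two document partitions, one per worker.
PARTITIONS = [
    {1: "Information retrieval is searching and indexing",
     2: "Indexing is building an index"},
    {3: "An inverted file is an index",
     4: "Building an inverted file is indexing"},
]

print(broker(PARTITIONS, "inverted index"))
```

The alternative is to partition by term, so that each worker holds the complete inverted lists for part of the vocabulary; the broker then has to combine partial scores for the same document instead of merging disjoint result lists.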
39
Distributed IR


The main difference between techniques in Parallel IR and those in Distributed IR is the amount of interprocess communication
Techniques that are well suited to a distributed environment should require little communication
  Partition the collection into parts, each handled entirely by one node
  Each node is responsible for searching the documents within the portion of the collection assigned to it
40
Distributed IR (cont.)

Typical query processing in distributed IR systems
  1 Select collections to search
  2 Distribute query to the selected collections
  3 Evaluate query at each collection in parallel
  4 Combine results from distributed collections into the final results
Issues
  Collection partitioning
    Same owner, geographical areas, similar semantics, etc.
  Collection selection
    Representation of a collection
  Collection replication
  Interoperability protocol (e.g., Z39.50, STARTS, etc.)
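The four processing steps can be sketched end to end. Everything concrete here is an assumption: each collection is represented by per-term document frequencies, collections are selected by summed document frequency of the query terms, local scores are raw term counts, and results are merged by comparing those raw scores. The evaluation runs sequentially for brevity, where a real system would run step 3 in parallel on separate nodes.

```python
from collections import Counter


def represent(collection):
    """Collection representation: document frequency of each term."""
    df = Counter()
    for text in collection.values():
        df.update(set(text.lower().split()))
    return df


def select_collections(representations, query_terms, limit=2):
    """Step 1: keep the collections whose representation best matches the query."""
    scored = [(sum(df[t] for t in query_terms), name)
              for name, df in representations.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]


def evaluate(collection, query_terms):
    """Step 3: local evaluation inside one collection."""
    hits = []
    for doc_id, text in collection.items():
        terms = text.lower().split()
        score = sum(terms.count(t) for t in query_terms)
        if score:
            hits.append((score, doc_id))
    return hits


def distributed_search(collections, query, limit=2):
    query_terms = query.lower().split()
    representations = {name: represent(c) for name, c in collections.items()}
    selected = select_collections(representations, query_terms, limit)  # step 1
    partial = {name: evaluate(collections[name], query_terms)           # steps 2-3
               for name in selected}
    merged = [(score, name, doc_id)                                     # step 4
              for name, hits in partial.items() for score, doc_id in hits]
    return sorted(merged, key=lambda hit: (-hit[0], hit[1], hit[2]))


# Hypothetical collections keyed by name, each with its own local document ids.
COLLECTIONS = {
    "news": {1: "solar power plant opens", 2: "stock market falls"},
    "science": {1: "solar cells convert light to power", 2: "solar wind"},
    "sports": {1: "football final tonight"},
}

print(distributed_search(COLLECTIONS, "solar power"))
```

The sports collection is never contacted because its representation contains neither query term. Merging raw local scores is the weakest part of this sketch, since scores computed against different collections are not generally comparable.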
41
IR Related Problems






Document clustering
Document categorization
Summarization
Information Filtering
Federated search
…
42
Extensions to IR: Digital Libraries


We have looked at two aspects of information
systems (effectiveness and efficiency)
To have a complete information system, additional
aspects to be considered include:






Interoperability
Multimedia, multilingual, structured document support
Usability
Property rights and access control
Legal, social, and economic implications
Considering all of these issues together leads to the area known as Digital Libraries (DLs)
43
A Framework of Digital Libraries
44
45
Research Trends in IR

Improving effectiveness
  New retrieval strategies
Improving efficiency
  Faster indexing, lower amount of space
  Smaller query response time, higher throughput
Better understanding of user behavior and expectation
Evaluation techniques
Multilingual retrieval
More media types
New types of applications
  Summarization
  Word association, document classification
  …
46
Major Resources for IR and DL









ACM Special Interest Group on IR (SIGIR)
ACM International Conferences on Digital Libraries
ACM Special Interest Group on Hypertext, Hypermedia and the
Web (SIGWEB)
IEEE Advances in Digital Libraries Conferences
Journal of Documentation
Digital Library Magazine
Text REtrieval Conference (TREC)
www.searchenginewatch.com
Countless number of useful resources on the Web
47