Web Search - Electures


Web Search – Summer Term 2006
I. General Introduction
(c) Wolfgang Hürst, Albert-Ludwigs-University
Introduction: Search
What is “search” (by machine)?
Databases: relational databases, SQL, …
Search in structured data
Information Retrieval
Search in un- (or semi-)structured data
Example: email archive
‘All emails with sender [email protected] from April 1st–3rd, 2006’
Search in exactly specified (meta) data
‘All emails that are somehow related to project x’
Search in an unspecified and unstructured body
Information Retrieval (IR)
Information Retrieval (IR) deals with the
representation, storage, organization of, and access
to information items.
(Page 1, Baeza-Yates & Ribeiro-Neto [1])
Information Retrieval (IR) = Part of computer science
which studies the retrieval of information (not data)
from a collection of written documents. The retrieved
documents aim at satisfying a user information need
usually expressed in natural language.
(Glossary, page 444, Baeza-Yates & Ribeiro-Neto [1])
Note: Many other definitions exist
Generally, all share this common view:
[Figure: A user with an information need formulates a query. On the other side, data/documents are indexed by the information retrieval system into an index. Query processing, searching & ranking match the query against the index and return a result.]
Information Retrieval (IR)
Main problem: Unstructured, imprecisely
and imperfectly defined data
But also: The whole search process can be
characterized as uncertain and vague
[Figure: user, information need, query, and data/documents in the search process]
Hence: Information is often returned in the form
of a sorted list (documents ranked by relevance).
‘Data Retrieval’ vs. ‘IR’
                      DATA RETRIEVAL       INFORMATION RETRIEVAL
Matching              exact match          partial / best match
Inference             deduction            induction
Model                 deterministic        probabilistic
Classification        monothetic           polythetic
Query language        artificial           natural
Query specification   complete             incomplete
Items wanted          matching             relevant
Error response        sensitive            insensitive
Source: C. J. van Rijsbergen: ‘Information Retrieval’
(http://www.dcs.gla.ac.uk/Keith/Chapter.1/Ch.1.html)
Summary of the most important terms
Query = The expression of the user information need
in the input language provided by the information
system. The most common type of input language
simply allows the specification of keywords and of a
few Boolean connectives.
(Glossary, page 449, Baeza-Yates & Ribeiro-Neto [1])
Index = A data structure built on the text to speed up
searching.
(Glossary, page 443, Baeza-Yates & Ribeiro-Neto [1])
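To make the two definitions above concrete, here is a minimal sketch of an inverted index over a toy collection, queried with keywords combined by Boolean AND. The documents and all names are invented for illustration; real systems store much more (term positions, frequencies) and normalize terms first.

```python
# Toy document collection (invented for illustration).
docs = {
    1: "information retrieval deals with information items",
    2: "data retrieval uses exact matching",
    3: "web search is an information retrieval task",
}

# Indexing: a data structure built on the text to speed up searching.
# Here: each term maps to the list of documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, []).append(doc_id)

# Query: keywords joined by Boolean AND --
# return only the documents containing ALL query terms.
def boolean_and(*terms):
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

print(boolean_and("information", "retrieval"))  # → [1, 3]
```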
The concept of relevance = A measure to quantify the
relevance of a particular document for a particular
user in a particular situation.
IR Process: Tasks Involved
[Figure: From the documents, data is selected for indexing; parsing & term processing yields the logical view of the documents (the index). Via the user interface, the information need is entered as a query; query processing (parsing & term processing) yields the logical view of the information need. Searching against the index and ranking produce the results, shown via the result representation. Performance evaluation covers the entire process.]
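A minimal end-to-end sketch of these tasks, with an assumed stop-word list and invented documents: parsing & term processing yields the logical views, searching matches query terms against the index, and ranking here simply sorts by the number of matched terms.

```python
# Assumed stop-word list (illustrative only).
STOPWORDS = {"the", "a", "of", "is", "an", "and"}

def terms(text):
    """Parsing & term processing: tokenize, lowercase,
    strip punctuation, drop stop words."""
    return [w.strip(".,:;!?") for w in text.lower().split()
            if w not in STOPWORDS and w.strip(".,:;!?")]

docs = {
    1: "The History of Paris",
    2: "A Tourist Guide to Paris",
    3: "Patents: An Overview",
}

# Logical view of the documents: term -> set of documents (the index).
index = {}
for doc_id, text in docs.items():
    for t in terms(text):
        index.setdefault(t, set()).add(doc_id)

def search(query):
    """Searching: find docs matching any query term;
    ranking: sort by number of matched terms (descending)."""
    scores = {}
    for t in terms(query):           # logical view of the information need
        for doc_id in index.get(t, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=lambda d: -scores[d])

print(search("tourist information about Paris"))  # → [2, 1]
```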
Evaluation of IR Systems
Standard approaches for algorithm and
computer system evaluation
Speed / processing time
Storage requirements
Correctness of used algorithms
But most importantly
Performance, effectiveness
Questions:
What is a good / better search engine?
How to measure search engine quality?
Etc.
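The standard effectiveness measures behind these questions are precision and recall (they reappear in the evaluation part of the outline); a tiny worked example with invented document sets:

```python
# precision = fraction of retrieved documents that are relevant
# recall    = fraction of relevant documents that were retrieved
retrieved = {1, 2, 3, 4, 5}     # what the system returned (invented)
relevant  = {2, 4, 6, 7}        # what the user actually needed (invented)

hits = retrieved & relevant     # relevant documents that were retrieved

precision = len(hits) / len(retrieved)   # 2 / 5 = 0.4
recall    = len(hits) / len(relevant)    # 2 / 4 = 0.5
print(precision, recall)                 # → 0.4 0.5
```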
Evaluation of IR Systems
Another important issue:
Usability, users’ perception
Example:
User 1 & system 1:
‘It took me 10 min to find the information.
Those were the worst 10 minutes of my life.
I really hate this system!’
User 2 & system 2:
‘It took me 14 min to find the information.
I never had so much fun using any search
engine before!’
Some Historical Remarks
1950s: Basic idea of searching text with a
computer
1960s: Key developments, e.g.
The SMART system (G. Salton, Harvard/Cornell)
The Cranfield evaluations
1970s and 1980s: Advancements of basic ideas
But: mainly with small test collections
1990s: Establishment of the TREC (Text REtrieval
Conference) series (held annually since 1992)
Large text collections, expansion to other fields and
areas, e.g. spoken document retrieval, non-English or
multi-lingual retrieval, information filtering, user
interactions, WWW, video retrieval, etc.
SOURCE: AMIT SINGHAL ‘MODERN INFORMATION RETRIEVAL: A BRIEF
OVERVIEW’, IEEE DATA ENGINEERING BULLETIN, 2001
Information Retrieval & Web Search
Historically, IR was mainly motivated by text
search (libraries, etc.)
Today: Various other areas and data, e.g. multi
media (images, video, etc.), WWW, etc.
Web search: perfect example for an IR system
Goal: Find best possible results (web pages)
based on
a) Unstructured, heterogeneous, semi-structured data
b) Imprecise, ambiguous, short queries
(Note: ‘Best possible results‘ is also a very vague
specification of the ultimate goal)
But: Very different from traditional IR tasks!
Characteristics of the Web
Size: The web is big! And there are lots of users!
Documents: Extreme variety regarding
formats, structure, quality, etc.
Users: Very different skills & intentions, e.g.
Find all information about related patents
Find some good tourist information about Paris
Find the phone number of the tourist office
Location: The web is a distributed system
Spam: Expect manipulation instead of
cooperation from the document providers
Dynamic: The web keeps growing & changing
Web Search
Web search is an active research area with high
economic impact
Many open questions & challenges for research:
Improving existing systems,
adapting to new scenarios (more data, spam, …),
new challenges (diff. data formats, multimedia, …),
new tasks (desktop search, personalization, …),
etc.
Many other approaches & techniques exist, e.g.
Clustering,
specialized search engines,
meta search engines,
etc.
We will cover some of this here, i.e. …
Web Search Course: Rough Outline
Traditional (text) retrieval:
Index generation (data structures),
text processing,
ranking (TF*IDF, …),
models (Boolean, Vector Space, Probabilistic),
evaluation (precision & recall, TREC, …)
Only most important concepts as required for main
part of the course, i.e.:
Web search (special case of IR):
Special characteristics of the web,
ranking (PageRank, HITS, …),
crawling (Spiders, Robots),
indexing,
and some selected topics
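As a preview of the TF*IDF ranking listed in the outline, here is a minimal sketch of one common variant (raw term frequency times logarithmic inverse document frequency); the toy documents and query are invented, and real systems use normalized weights:

```python
import math

# Toy collection (invented for illustration).
docs = {
    1: "paris tourist office phone",
    2: "paris paris travel guide",
    3: "patent search guide",
}
N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}

def tf(term, doc_id):
    """Term frequency: how often the term occurs in the document."""
    return tokenized[doc_id].count(term)

def idf(term):
    """Inverse document frequency: rare terms weigh more."""
    df = sum(1 for toks in tokenized.values() if term in toks)
    return math.log(N / df) if df else 0.0

def score(query, doc_id):
    """TF*IDF score: sum over query terms of tf * idf."""
    return sum(tf(t, doc_id) * idf(t) for t in query.split())

# Rank all documents for the query, best first.
ranked = sorted(docs, key=lambda d: -score("paris guide", d))
print(ranked)  # → [2, 1, 3]
```

Document 2 wins because it contains "paris" twice and "guide" once; documents 1 and 3 each match a single term and tie.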
Text books about (text) IR
[1] RICARDO BAEZA-YATES, BERTHIER RIBEIRO-NETO: ‘MODERN
INFORMATION RETRIEVAL’, ADDISON WESLEY, 1999
[2] WILLIAM B. FRAKES, RICARDO BAEZA-YATES (EDS.): ‘INFORMATION
RETRIEVAL – DATA STRUCTURES AND ALGORITHMS’, P T R PRENTICE
HALL, 1992
[3] C. J. VAN RIJSBERGEN: ‘INFORMATION RETRIEVAL’, 1979,
AVAILABLE ONLINE AT http://www.dcs.gla.ac.uk/Keith/Preface.html
[4] I. WITTEN, A. MOFFAT, T. BELL: ‘MANAGING GIGABYTES’, MORGAN
KAUFMANN PUBLISHING, 1999
EXCERPTS FROM A NEW BOOK ‘INTRODUCTION TO INFORMATION
RETRIEVAL’ BY C. MANNING, P. RAGHAVAN, H. SCHÜTZE (TO APPEAR
2007) ARE AVAILABLE ONLINE AT
http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
Only certain topics will be covered in this course.
No books on web search, but selected articles will be
recommended in the lecture