Intelligent Information Retrieval and Web Search


Intelligent Information Retrieval
(and Web Search)
Professor Celso A A Kaestner, PhD.
Brazil
1
Site:
www.dainf.ct.utfpr.edu.br/~kaestner/Konstanz/iir.htm
2
Introduction
3
Introduction: Information Retrieval
• IR: representation, storage, organization of,
and access to information items;
• Focus is on the user information need;
• User information need:
– Find all docs containing information on college football
teams which: (1) are maintained by a university and
(2) participate in the national tournament.
• Emphasis is on the retrieval of information
(not data).
4
Data Retrieval vs. Information Retrieval
• Data Retrieval:
– which docs. contain a set of keywords?
– well defined semantics;
– a single erroneous object implies failure!
• Information Retrieval (IR):
– information about a subject or topic;
– semantics is frequently loose;
– small errors are tolerated.
• IR system:
– interpret contents of information items;
– generate a ranking which reflects relevance;
– notion of relevance is most important.
5
Information Retrieval (IR)
• The indexing and retrieval of textual
documents.
• Searching for pages on the World Wide
Web is the most recent “killer app.”
• Concerned firstly with retrieving documents
relevant to a query.
• Concerned secondly with retrieving from
large sets of documents efficiently.
6
Typical IR Task
• Given:
– A corpus of textual natural-language
documents.
– A user query in the form of a textual
string.
• Find:
– A ranked set of documents that are
relevant to the query.
7
IR System
[Diagram: a document corpus and a query string are input to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
8
Relevance
• Relevance is a subjective judgment
and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted
source).
– Satisfying the goals of the user and
his/her intended use of the information
(information need).
9
Keyword Search
• Simplest notion of relevance is that
the query string appears verbatim in
the document.
• Slightly less strict notion is that the
words in the query appear frequently
in the document, in any order (bag of
words).
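The two notions of relevance above can be contrasted in a minimal Python sketch. The function names and the sample document are hypothetical, chosen only for illustration; the scoring rule (sum of query-word counts) is one simple reading of "appear frequently", not a method given in the slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total count of the query words in the document, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

# Hypothetical sample document.
doc = "Football teams from each college play; the college football final is national."

print(verbatim_match("college football", doc))      # query appears as a phrase
print(verbatim_match("football college", doc))      # same words, different order
print(bag_of_words_score("football college", doc))  # word order is ignored
```

Reordering the query words defeats the verbatim test but leaves the bag-of-words score unchanged, which is the difference between the two notions.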
10
Problems with Keywords
• May not retrieve relevant documents
that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that
include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)
11
Beyond Keywords
• We will cover the basics of keyword-based
IR, but…
• We will focus on extensions and recent
developments that go beyond keywords.
• We will cover the basics of building an
efficient IR system, but…
• We will focus on basic capabilities and
algorithms rather than systems issues that
allow scaling to industrial-size databases.
12
Intelligent IR
• Taking into account the meaning of
the words used.
• Taking into account the order of words
in the query.
• Adapting to the user based on direct
or indirect feedback.
• Taking into account the authority of
the source.
13
IR System Architecture
[Diagram: components User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), and Text Database; labeled flows: user need, user feedback, text, logical view, query, retrieved docs, ranked docs.]
14
IR System Components
• Text Operations form index words (tokens).
– Standardization (caps …)
– Stopword removal
– Stemming
• Indexing constructs an inverted index of
word to document pointers.
• Searching retrieves documents that contain
a given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
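The four components above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions: the stopword list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer), the term-frequency relevance metric, and the three sample documents are all hypothetical and are not taken from the slides.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are"}

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Standardize case, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(corpus):
    """Indexing: map each index term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term in text_operations(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Searching and ranking: score each document that contains a query
    term by its summed term frequency, highest score first."""
    scores = defaultdict(int)
    for term in text_operations(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical three-document corpus.
corpus = {
    "Doc1": "College football teams play football in the national tournament",
    "Doc2": "The university maintains a football team",
    "Doc3": "Indexing and searching text databases",
}
index = build_index(corpus)
results = search(index, "football teams")
print(results)
```

Because the query and the documents pass through the same text operations, the query token "teams" matches "team" in Doc2. A production system would replace the raw term-frequency score with a better relevance metric.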
15
IR System Components (continued)
• User Interface manages interaction
with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query
to improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance
feedback.
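Query expansion using a thesaurus can be illustrated with a small Python sketch. The `THESAURUS` dictionary and the `expand_query` function are hypothetical; a real system would draw synonyms from a curated or automatically built thesaurus.

```python
# Hypothetical toy thesaurus mapping a term to its synonyms.
THESAURUS = {
    "restaurant": {"cafe", "diner"},
    "china": {"prc"},
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms followed by any thesaurus synonyms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        # Sort the synonyms so the output order is deterministic.
        for synonym in sorted(thesaurus.get(term, ())):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

print(expand_query("Restaurant in China"))
```

The expanded term list can then be passed to the search step, so that documents using a synonym of a query word are also retrieved.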
16
IR and the Web
• IR at the center of the stage:
– Advent of the Web changed this
perception once and for all:
• universal repository of knowledge;
• free (low cost) universal access;
• no central editorial board;
• many problems though: IR seen as key to
finding the solutions!
17
IR and the Web
• And more:
• Most human tasks involve handling
information in textual and/or
graphic form (Lyman, 2003);
• How Much Information project (Berkeley):
www.sims.berkeley.edu/how-much-info-2003.
• Each person generates 800 Mbytes / year.
18
IR and the Web
In 2002: 5 Exabytes of new information;
• Magnetic media (hard disks): 92%;
• Films: 7%;
• Print material: 0.01%;
• Optical media: 0.002%.
5 Exabytes = 5 million Terabytes =
5,000,000,000,000,000,000 bytes;
2 times the amount of 1999, given a
growth rate of 30% / year.
19
IR and the Web
Information flow - radio, TV, Internet:
• 18 Exabytes of new information in 2002;
• 3.5 times the amount stored;
• Telephone lines (and cell phones): 98%;
• 320 million hours of radio and TV
transmissions, with 70 million new hours,
with 81 Gigabytes of texts.
20
IR and the Web
Email:
• 31 billion e-mails / year = 400,000 Tbytes
of new information;
The Internet (Web):
• 170 Tbytes of information = 17 times the
printed content of the US Library of
Congress.
21
IR and the Web
Search sites:
• “Yahoo”, “Google”, etc. = the users’ first
point of access;
• A typical Internet user: 11 h 20 m / month;
• Access to the desired information = 1 / 3 of
the period;
• The user must verify whether the received
information is the desired one, and it is often
impossible to find the information needed.
22
IR and the Web
• Information glut, or information overload, is
the main challenge to be overcome by
automatic text processing systems.
23
Web Search
• Application of IR to HTML documents
on the World Wide Web.
• Differences:
– Must assemble document corpus by
spidering the web.
– Can exploit the structural layout
information in HTML (XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
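One well-known way to exploit link structure is a PageRank-style power iteration. The slides do not name a specific algorithm at this point, so the sketch below is an assumption chosen for illustration; the damping factor, iteration count, and four-page link graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """PageRank-style power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {
    "Page1": ["Page2", "Page3"],
    "Page2": ["Page3"],
    "Page3": ["Page1"],
    "Page4": ["Page3"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph Page3 has the most incoming links and ends up with the highest score, while Page4, which no page links to, ends up with the lowest.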
24
Web Search System
[Diagram: a spider crawls the Web to build the document corpus; the corpus and a query string are input to the IR system, which outputs ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]
25
Other IR-Related Tasks
• Automated document categorization
• Information filtering (spam filtering)
• Information routing
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering
26
History of IR
• 1960-70’s:
– Initial exploration of text retrieval systems
for “small” corpora of scientific abstracts,
and law and business documents.
– Development of the basic Boolean and
vector-space models of retrieval.
– Prof. Salton and his students at Cornell
University are the leading researchers in
the area.
27
IR History Continued
• 1980’s:
– Large document database systems, many
run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
28
IR History Continued
• 1990’s:
– Searching FTPable documents on the
Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
29
IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization &
Clustering
30
Recent IR History
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
31
Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
32
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
33
Database Management
• Focused on structured data stored in
relational tables rather than free-form text.
• Focused on efficient processing of well-defined
queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data
(XML) brings it closer to IR.
34
Library and Information Science
• Focused on the human user aspects of
information retrieval (human-computer
interaction, user interface, visualization).
• Concerned with effective categorization of
human knowledge.
• Concerned with citation analysis and
bibliometrics (structure of information).
• Recent work on digital libraries brings it
closer to CS & IR.
35
Artificial Intelligence
• Focused on the representation of knowledge,
reasoning, and intelligent action.
• Formalisms for representing knowledge and
queries:
– First-order Predicate Logic
– Bayesian Networks
– Others …
• Recent work on web ontologies and intelligent
information agents brings it closer to IR.
36
Natural Language Processing
• Focused on the syntactic, semantic,
and pragmatic analysis of natural
language text and discourse.
• Ability to analyze syntax (phrase
structure) and semantics could allow
retrieval based on meaning rather
than keywords.
37
Natural Language Processing:
IR Directions
• Methods for determining the sense of
an ambiguous word based on context
(word sense disambiguation).
• Methods for identifying specific pieces
of information in a document
(information extraction).
• Methods for answering specific NL
questions from document corpora.
38
Machine Learning
• Focused on the development of
computational systems that improve their
performance with experience.
• Automated classification of examples based
on learning concepts from labeled training
examples (supervised learning).
• Automated methods for clustering unlabeled
examples into meaningful groups
(unsupervised learning).
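Supervised learning from labeled examples can be illustrated with a small text classifier. The sketch below uses multinomial Naive Bayes with add-one smoothing, which is one common choice and not a method named in the slides; the four training messages and the labels "spam" and "ham" are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Collect per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training examples.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training)
print(classify(model, "cheap money"))
print(classify(model, "monday meeting"))
```

The classifier has learned which words are typical of each label from the training examples alone, with no hand-written rules.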
39
Machine Learning:
IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
• Text Summarization
40