CSM06 Information Retrieval

Lecture 1a – Introduction
Dr Andrew Salway
What is information retrieval?
Why?, Who?, What?
• Why do we need information retrieval?
• Who are the users of information retrieval systems?
• What kinds of information do they want to retrieve?
• Why study information retrieval? For example, why is it important to understand how search engines work?
Applications of Information Retrieval
• For the World Wide Web
• For organisations’ intranets
• For our personal media collections
• [Screenshot: Google]
• [Screenshot: AltaVista]
• [Screenshot: Yahoo, query]
• [Screenshot: Autonomy]
• [Screenshot: IBM WebFountain]
• [Screenshot: my email]
• [Screenshot: my photos]
A very brief history…
• Libraries: for thousands of years
• 1950s: computer-based IR
• Early 1990s: web search
• Late 1990s: multimedia search
Some traditional ways of organizing information
• Table of Contents of a book
• Index of a book
• Library classification schemes:
• Hierarchies (e.g. Dewey Decimal)
• Controlled vocabularies
• Collections of abstracts
From the dictionary…
Library. 1 A large organised collection of books for reading or reference. b A mass of learning or knowledge; a source providing knowledge and learning. c A collection of films, gramophone records, etc. when organised or sorted for some specific purpose…
The New Shorter Oxford English Dictionary, 1993
Information Retrieval
“the representation, storage, organisation of, and access to information items”
(Baeza-Yates and Ribeiro-Neto 1999, page 1)
How is computer-based IR different to traditional libraries?
• Remote, multiple access
• May have multiple indexes
• Interactivity
• Scale
• Automatic indexing and ranking
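To make "automatic indexing and ranking" concrete, here is a minimal sketch, not taken from the lecture. It assumes a toy three-document collection, splits text on whitespace, and ranks by raw term counts; the names build_index and search are hypothetical, and real systems use far more refined weighting.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: count of term in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Return doc ids ranked by the total count of query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken alphabetically by doc id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy collection (an assumption for illustration only).
docs = {
    "d1": "library catalogue of books",
    "d2": "web search engines index web pages",
    "d3": "search the library index",
}
index = build_index(docs)
print(search(index, "web search"))  # ['d2', 'd3']
```

The index is built once, automatically, from the documents themselves; each query then only consults the index rather than rereading every document.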
What are the particular challenges for IR on the Web?
• Volume of text data – Google claims to index more than 8,000,000,000 webpages, and that’s not everything
• Multimedia information – traditional IR focussed on texts
• More and more multilingual information
• Cannot access original text when processing a query
• Distributed data – different platforms, bandwidths
• Large amount of volatile data and redundant data
• Diverse users (hence diverse information needs) and many inexperienced users
• Some good news though! The links between webpages can be useful for web search engines (more on this in Lecture 4)
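One simple way links could be useful is sketched below, as an illustration only. It assumes a tiny hypothetical link graph and scores each page by the number of distinct other pages that link to it; the function name inlink_counts is made up, and this is not claimed to be the method covered in Lecture 4 or the one any particular search engine uses.

```python
from collections import Counter


def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # ignore repeated links from one page
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "c.html"],
}
counts = inlink_counts(links)
ranked = sorted(counts, key=lambda p: (-counts[p], p))
print(ranked)  # ['c.html', 'a.html', 'b.html', 'd.html']
```

Here c.html comes first because three different pages point to it, while d.html, which nobody links to, comes last.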
Who are they?
“Analysts estimate that Google is worth between $15 billion and $20 billion”
The Times, 29/01/2004