Overview of Information Retrieval (CS598-CXZ Advanced Topics in IR Presentation) ChengXiang Zhai


Overview of Information Retrieval
(CS598-CXZ Advanced Topics in IR Presentation)
Jan. 18, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
What is Information Retrieval (IR)?
• Narrow-sense:
– IR= Search Engine Technologies (IR=Google, library info
system)
– IR= Text matching/classification
• Broad-sense: IR = Text Information Management:
– General problem: how to manage text information?
– How to find useful information? (info. retrieval) (e.g., Google)
– How to organize information? (text classification) (e.g., automatically
assign email to different folders)
– How to discover knowledge from text? (text mining) (e.g., discover
correlation of events)
Why is IR Important?
•More and more online information in
general (Information Overload)
•Many tasks rely on effective management
and exploitation of information
•Textual information plays an important
role in our lives
•Effective text management directly
improves productivity
Elements of Text Info Management Technologies
[Figure: layered diagram. Text is processed by Natural Language Content Analysis, which supports Information Access (Search, Filtering), Information Organization (Categorization, Clustering), and Knowledge Acquisition (Extraction, Mining). On top sit Retrieval Applications and Mining Applications, including Summarization and Visualization.]
A Quick Tour of the State of
the Art….
Component Technology 1:
Natural Language
Processing
What is NLP?
Arabic text: يجب على الإنسان أن يكون أميناً وصادقاً مع نفسه ومع أهله وجيرانه وأن يبذل كل جهد في إعلاء شأن الوطن وأن يعمل على ما …
How can a computer make sense out of this string?
Morphology - What are the basic units of meaning (words)?
- What is the meaning of each word?
Syntax - How are words related with each other?
Semantics - What is the “combined meaning” of words?
Pragmatics - What is the “meta-meaning”? (speech act)
Discourse - Handling a large chunk of text
Inference - Making sense of everything
An Example of NLP
A dog is chasing a boy on the playground
• Lexical analysis (part-of-speech tagging)
– A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
• Syntactic analysis (parsing)
– Noun Phrase: “A dog”, “a boy”, “the playground”
– Complex Verb: “is chasing”
– Prep Phrase: “on the playground”
– Verb Phrase: “is chasing a boy”, then “is chasing a boy on the playground”
– Sentence: the whole string
• Semantic analysis
– Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
• Inference
– Facts + rule “Scared(x) if Chasing(_,x,_).” give Scared(b1)
• Pragmatic analysis (speech act)
– A person saying this may be reminding another person to get the dog back…
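The inference step in the example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: facts are stored as (predicate, arguments) tuples, and the helper name `apply_scared_rule` is hypothetical, not part of any NLP toolkit.

```python
# Facts produced by the semantic analysis, as (predicate, arguments) tuples.
facts = {
    ("Dog", ("d1",)),
    ("Boy", ("b1",)),
    ("Playground", ("p1",)),
    ("Chasing", ("d1", "b1", "p1")),
}


def apply_scared_rule(known):
    """Apply the rule `Scared(x) if Chasing(_, x, _)` once.

    Returns a new set holding the original facts plus any derived ones.
    """
    derived = set(known)
    for predicate, args in known:
        if predicate == "Chasing" and len(args) == 3:
            # The second argument of Chasing is the one being chased.
            derived.add(("Scared", (args[1],)))
    return derived


all_facts = apply_scared_rule(facts)
print(("Scared", ("b1",)) in all_facts)  # prints True
```

A single hard-coded rule is enough for this sentence; general inference needs a rule language and a large knowledge base, which is where the approach stops scaling.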
What we can do in NLP
A dog is chasing a boy on the playground
• POS tagging: 97%
• Parsing: partial >90%(?)
• Semantics: some aspects
– Entity/relation extraction
– Word sense disambiguation
– Anaphora resolution
• Inference: ???
• Speech act analysis: ???
What We Can’t Do in NLP
•100% POS tagging
– “He turned off the highway.” vs “He turned off the fan.”
•General complete parsing
– “A man saw a boy with a telescope.”
•Deep semantic analysis
– Will we ever be able to precisely define the meaning of “own”
in “John owns a restaurant.”?
Robust & general NLP tends to be “shallow” …
“Deep” understanding doesn’t scale up …
Component Technology 2:
Search (ad hoc retrieval)
What is Search (Ad hoc IR)?
[Figure: a user sends the query “robotics applications” to a retrieval system. The system searches a database/collection of text docs and separates the relevant docs (robotics) from the non-relevant docs (others).]
What we can do in Search
•Search in a pure text collection is well
studied
– Many different methods
– Equally effective when optimized
•Basic search techniques (e.g., vector space,
prob. models) are good enough for
commercialization
– All implementing TF-IDF style heuristics
– Some new models have more potential for
further optimization
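A TF-IDF style heuristic of the kind mentioned above can be sketched as follows. This is a minimal illustration, not any particular engine's formula: the weighting (log-scaled term frequency times a smoothed IDF), the whitespace tokenizer, and the names `tokenize` and `rank` are all assumptions, and document length normalization, which practical systems usually include, is left out.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def rank(query, docs):
    """Return document indices ordered from highest to lowest TF-IDF score."""
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] > 0:
                idf = math.log((n + 1) / df[term])         # rarer terms weigh more
                score += (1 + math.log(tf[term])) * idf    # dampened term frequency
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])


docs = [
    "robotics applications in manufacturing",
    "cooking recipes for pasta",
    "applications of robotics and robotics research",
]
ranking = rank("robotics applications", docs)
print(ranking)  # document indices, best match first
```

The document with no query terms always ends up last. The order of the two matching documents depends on how term frequency is dampened, which is exactly the kind of parameter choice the slide refers to as optimization.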
What we can’t do in Search
• Basic retrieval models
– No single model is the best on all test collections
– Automatic parameter optimization
• Lack of interactive search support
• Lack of personalization
• Search context modeling
• Retrieval with more than pure text
– With structures
– Multi-media
Component Technology 3:
Information Filtering
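What is Information Filtering?
•Stable & long term interest, dynamic info
source
•System must make a delivery decision
immediately as a document “arrives”
[Figure: documents arrive one by one at a filtering system, which compares each one against a user's stated interest and decides whether to deliver it.]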
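The immediate delivery decision can be sketched as a threshold test applied to each document in arrival order. This is a simplified, non-adaptive illustration: the overlap score, the threshold value of 0.5, and the names `overlap_score` and `filter_stream` are assumptions made for the example.

```python
def overlap_score(profile, document):
    """Fraction of the profile terms that occur in the document."""
    words = set(document.lower().split())
    return sum(1 for term in profile if term in words) / len(profile)


def filter_stream(profile, stream, threshold=0.5):
    """Yield (document, delivered) pairs one at a time, in arrival order.

    Each decision uses only the current document; the system never waits
    to see later documents before deciding.
    """
    for document in stream:
        yield document, overlap_score(profile, document) >= threshold


profile = {"robotics", "applications"}   # the user's long-term interest
stream = [
    "new robotics applications announced",
    "stock markets fall",
    "robotics conference dates",
]
decisions = [delivered for _, delivered in filter_stream(profile, stream)]
print(decisions)  # prints [True, False, True]
```

An adaptive filter would also update the profile and the threshold from user feedback on delivered documents; that part is omitted here.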
State of the Art: Filtering
•Content-based adaptive filtering
– Basic techniques, though not perfect, are there
– We haven’t seen many (any?) filtering
applications
•Collaborative filtering (recommender
systems)
– Simple methods can be (are being)
commercialized
– Real applications exist
– More applications are possible
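One of the simple collaborative filtering methods can be sketched as a similarity-weighted average over other users' ratings. The ratings table, the user and item names, and the choice of cosine similarity are all made up for illustration; this is a sketch of the user-based idea, not a description of any deployed recommender.

```python
import math

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 5},
    "carol": {"A": 1, "C": 5, "D": 1},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Predict a rating as the similarity-weighted average of others' ratings."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


prediction = predict("alice", "D")
print(round(prediction, 2))
```

Because alice's ratings resemble bob's far more than carol's, the prediction for item D lands much closer to bob's 5 than to carol's 1.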
Component Technology 4:
Text Categorization
What is Text Categorization?
•Pre-given categories and labeled document
examples (Categories may form hierarchy)
•Classify new documents
•A standard supervised learning problem
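[Figure: labeled training documents for the categories Sports, Business, Education, … feed a categorization system, which assigns each new document to one of the categories Sports, Business, Education, …, Science.]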
State of the Art: Categorization
• Many supervised learning methods have been
developed
– SVM is often the best in performance
– Other methods are also competitive
– Commercial applications exist, but not at a large scale
– More applications can be developed
• Feature selection/extraction is often more
important than the choice of the learning
algorithm
• Applications have been developed
• Relatively well explored
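The supervised setting (labeled examples in, a classifier for new documents out) can be sketched with a small multinomial Naive Bayes classifier. Naive Bayes is used here only because it fits in a few lines of plain Python; it is not the SVM the slide singles out, and the training sentences and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)   # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:                                 # skip unseen words
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("the team won the match", "Sports"),
    ("the player scored a goal", "Sports"),
    ("the company reported profit", "Business"),
    ("stocks and markets fell", "Business"),
    ("students attend the school", "Education"),
    ("the teacher graded exams", "Education"),
]
model = train(training)
print(classify("the team scored", model))  # prints Sports
```

The features here are raw words. As the slide notes, the choice of features often matters more than the choice of learner.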
Component Technology 5:
Clustering
The Clustering Problem
•Discover “natural structure”
•Group similar objects together
•Objects can be documents, terms, or passages
•Example
State of the Art: Clustering
•Many methods have been developed,
applicable in different situations
•Difficult to predict which method is the
best
•When patterns are clear, most methods
work well
•In difficult situations
– Special clustering bias must be incorporated
– Properties of clustering methods need to be
considered
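As one concrete instance of grouping similar objects together, here is a minimal k-means sketch. The 2-D points stand in for document vectors, and the starting centers are picked by hand so the run is reproducible; both are assumptions of the example, and k-means is just one of the many methods the slide alludes to.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then
    move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (6.0, 6.0)])
print([len(c) for c in clusters])  # prints [3, 3]
```

The two groups here are well separated, so nearly any method would recover them. With overlapping groups, the result depends on the starting centers and the distance measure, which is the clustering bias mentioned above.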
End of State of the Art
Tour…
Where is IR Going?
•IR and related areas
•Current trends
•How would this course fit into the picture?
Related Areas
[Figure: Information Retrieval at the center of its related areas, laid out along four directions: Applications, Models, Algorithms, and Systems. Areas shown: Web, Bioinformatics…; Library & Info Science; Statistics; Optimization; Machine Learning; Pattern Recognition; Data Mining; Natural Language Processing; Databases; Software engineering; Computer systems.]
Current Trends
[Figure: the Related Areas diagram annotated with current trends: More Principled Models/Algorithms; More Powerful Content Analysis; Structured + Unstructured Data; applications in Web/Bioinformatics/… and Literature/Digital Library; Human-Computer Interactions; High-Performance Computing.]
Publications/Societies
• Learning/Mining: ICML, NIPS, UAI, AAAI, ACM SIGKDD
• NLP: ACL, COLING, EMNLP, ANLP, HLT
• Applications: WWW, ISMB, RECOMB, PSB
• Info Retrieval: ACM SIGIR, ACM CIKM, TREC
• Info. Science: JCDL, ASIS
• Databases: ACM SIGMOD, VLDB, PODS, ICDE
• Statistics: ??
• Software/systems: ??
Let Users Lead the Way…
• The underlying driving force has always been real
world applications
• The ultimate impact of research in IR is to benefit
people in accessing and using information in the
real world
• Research on many component technologies is
reaching a stage of “diminishing returns”; the
challenge is how to make use of such imperfect
techniques
• Think more about complete solutions (as opposed
to component technologies) as well as new
applications
How would this Course Fit into the Picture?
•Identify novel application problems
•Identify new research topics
•Examine existing research work in these
directions
•Design and carry out new projects in some
of the directions
•We will broadly look at 3 application
domains: Web, Email, and Literature