Transcript Slide 1
Alexander Gelbukh
Moscow, Russia
1
Mexico
2
Computing Research Center (CIC),
Mexico
3
Chung-Ang University, Korea
Electronic Commerce and
Internet Application Lab
4
Special Topics in Computer Science
The Art of
Information Retrieval
Alexander Gelbukh
www.Gelbukh.com
5
Information Retrieval
In a huge amount
of poorly structured information
find the information that you need
when you don’t know exactly what you need
or can’t explain it
The Web
User information need
Ranking
6
Importance
Knowledge: the main treasure of man
Web: Repository? Cemetery of information!
Natural language and multimedia information
o Poorly structured, badly written
Corporate and organizational document bases
o Senate speeches: Mexico
o Medical data collections
o Corporate memory. Microsoft knowledge base
Future: data explosion, increasing importance
10
Perspectives
Corporations: corporate databases
Organizations: document bases
Government
o European Union multilingual problem
o The same in Asia
Academy
Lots of open research topics
Web topics
Computational Linguistics topics
Intelligent technologies, AI
11
Textbook
http://sunsite.dcc.uchile.cl/irbook/
12
Contents
1. Introduction
2. Modeling
3. Retrieval Evaluation
4. Query Languages
5. Query Operations
6. Text and Multimedia Languages and Properties
7. Text Operations
8. Indexing and Searching
9. Parallel and Distributed IR
10. User Interfaces and Visualization
11. Multimedia IR: Models and Languages
12. Multimedia IR: Indexing and Searching
13. Searching the Web
14. Libraries and Bibliographical Systems
15. Digital Libraries
13
Calendar
1. September 18: Chapter 1, Introduction
2. September 25: Chapter 2, Modeling
3. October 2: Chapter 3, Retrieval Evaluation
4. October 9: Chapter 4, Query Languages
5. October 16: Chapter 5, Query Operations
October 23: midterm exam
6. October 30: Chapter 6, Text and Multimedia Languages...
7. November 6: Chapter 7, Text Operations
8. November 13: Chapter 8, Indexing and Searching
9. November 20: Chapter 10, User Interfaces and Visualization
10. November 27: Chapter 13, Searching the Web
11. December 4: Chapter 14, Libraries and Bibliographical Systems
12. December 11: Chapter 15, Digital Libraries
December: final exam
14
Class structure
Main course: Information Retrieval
Discussion of previous chapter. Questions
I briefly present a new chapter
Research seminar: Natural Language Processing
Discussion of previous paper. Questions.
o Identification of possible research topics
Presentation of a new paper or current work
Discussion and questions
Goal: publications!
15
Natural Language Processing
Research Seminar
16
What CL is about
Computers to process natural language text
“Understand”
Generate
Search
Organize
Translate
…
Useful in IR
17
Methods
No: text as a stream of letters
o Brute force statistics
o Simplified heuristics (ex.: Porter)
Yes: attention to language rules
o Linguistically motivated approaches
o Knowledge-based approaches
o Corpus-based approaches
18
What IR is about
Classical IR: find words? Concepts!
Question answering
Summarization
Clustering
…
Take language seriously
19
Text representations for IR
Represent the retrieval features
o Strings → stems (lexemes), synsets, phrases.
o Women → woman, lady, female
o Old men and women → old woman
Structured representation of text
o Network of related events and entities
o Enables logical inference
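A minimal sketch of the "strings → lexemes, synsets" idea, assuming two toy resources invented for illustration (LEMMAS and SYNSETS); a real system would use a morphological dictionary and a thesaurus such as WordNet.

```python
# Hypothetical toy resources, not taken from any real dictionary.
LEMMAS = {"women": "woman", "men": "man"}
SYNSETS = [{"woman", "lady", "female"}]


def lemma(word):
    """Map a word form to its lemma (unknown words are only lowercased)."""
    w = word.lower()
    return LEMMAS.get(w, w)


def expand(word):
    """Return the lemma plus all members of every synset that contains it."""
    base = lemma(word)
    result = {base}
    for synset in SYNSETS:
        if base in synset:
            result |= synset
    return result


print(sorted(expand("Women")))
```

Indexing or querying on the expanded set lets "Women" match a text that says "lady" or "female".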
20
CL tasks useful in IR
Morphology (stemming)
POS / Word sense disambiguation
Word relatedness
Anaphora resolution
Parsing and semantics (phrase search)
Synonymic rephrasing
Translation
etc…
Each one a whole science in itself
21
Morphology
Q: pig T: piggish
Simple: stemming
o piggish → pig-
Lexeme: set of word forms
o same stem can give different words
o pigment → not pig; piny → pine, not pin
Dictionary/corpus-based methods
o Learning; dictionary management
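A minimal sketch of why bare stemming is not enough, assuming a toy suffix list and a toy lexicon (both invented here; this is not Porter's rule set). The naive stemmer conflates pigment with pig and piny with pin; a dictionary lookup corrects both.

```python
# Assumed toy suffix list, checked in this order.
SUFFIXES = ("ment", "ish", "y")

# Hypothetical lexicon mapping word forms to their lexemes.
LEXICON = {"piggish": "pig", "pigment": "pigment", "piny": "pine"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "pigg" -> "pig".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def lemma(word):
    """Prefer the dictionary; fall back to the naive stemmer."""
    return LEXICON.get(word, naive_stem(word))


for w in ("piggish", "pigment", "piny"):
    print(w, naive_stem(w), lemma(w))
```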
22
Part of Speech Disambiguation
Q: oil well T: He did very well
Q: what is an are? T: They are nice
Important for English, Chinese. Less important for other types
Perhaps not so helpful directly, but is necessary for most other tasks
Usually statistical / heuristic methods
23
Word Sense Disambiguation
Q: bank account T: on the beautiful banks of Han river ...
bill: document, banknote, law, ax, peak, Gates...
Very frequent, almost any word in text
Statistical & dictionary methods
International competitions
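A minimal sketch of one dictionary method, a simplified Lesk-style overlap count: choose the sense whose gloss shares the most words with the context. The glosses and the stopword list are invented for illustration, and the target word is assumed to be already reduced to its lemma.

```python
# Hypothetical glosses; a real system would read them from a dictionary.
GLOSSES = {
    "bank": {
        "financial": "institution that keeps money in an account for customers",
        "river": "sloping land along the side of a river or lake",
    }
}
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "that", "or", "to"}


def tokens(text):
    """Lowercased content words of a text, as a set."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense of `word` whose gloss overlaps the context most."""
    ctx = tokens(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in GLOSSES[word].items():
        overlap = len(ctx & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "bank account"))
print(simplified_lesk("bank", "on the beautiful banks of Han river"))
```

With these toy glosses the query is resolved to the financial sense and the text to the river sense, so the text would not be returned for the query.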
24
Word relatedness
Q: female T: woman (women)
o Synonyms. Subtypes/super-types
o Dictionaries. WordNet. Similarity. Lesk.
Q: Korea T: Seoul
o Other linguistic relationships (e.g., part)
o Real-world relationships (facts)
Q: Clinton T: Lewinsky
o Statistical co-occurrence (MI)
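A minimal sketch of statistical co-occurrence scoring with pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), with probabilities estimated as fractions of documents. The four-document collection is invented for illustration.

```python
import math


def pmi(docs, x, y):
    """Pointwise mutual information of two words over a document list."""
    n = len(docs)
    sets = [set(d.lower().split()) for d in docs]
    n_x = sum(x in s for s in sets)
    n_y = sum(y in s for s in sets)
    n_xy = sum(x in s and y in s for s in sets)
    if n_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log2(n_xy * n / (n_x * n_y))


# Hypothetical toy collection.
docs = [
    "seoul is the capital of korea",
    "korea and seoul host the summit",
    "the river bank in seoul",
    "bank account fees",
]

print(pmi(docs, "seoul", "korea"))
print(pmi(docs, "seoul", "bank"))
```

A higher score suggests a stronger association; here the seoul/korea pair scores above the seoul/bank pair.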
25
Anaphora resolution
Q: Awards of Prof. Han T: Prof. Han said... He did... IBM awarded him...
o Frequency
o Phrases, co-occurrence, summarization,
inference, translation
Heuristic (Mitkov) and knowledge-based methods
Other types of co-reference
26
Parsing, semantics
Q: Awards of Prof. Han
T1: Prof. Han among many other prizes has several IBM awards
T2: Mr. Kang has an award Prof. Han does not know of
Understanding of text
o Rich structured representation
Better phrase search; question answering,
summarization, ...
27
Synonymic rephrasing, reasoning
Q: experienced computer scientists
T: Prof. Han has been programming for many years and was awarded an IBM award
Requires good syntactic and semantic
analysis
Knowledge-based methods
28
Multilingual access
Q: 요구르트 T: We sell excellent yoghurt.
Продаем йогурт. Se vende rico
yogur.
o Search multilingual collections
Europe: dozens of official languages of EU
o If you don’t know how to say it in English
Dictionaries, bilingual corpora, ...
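A minimal sketch of dictionary-based cross-language search: the query term is expanded with its translations before matching. The bilingual dictionary is a toy one invented for illustration, and substring matching stands in for proper tokenization and morphology.

```python
# Hypothetical multilingual dictionary entry.
BILINGUAL = {
    "요구르트": {"en": ["yoghurt", "yogurt"], "ru": ["йогурт"], "es": ["yogur"]},
}


def translate_query(term):
    """Return the term together with all its known translations."""
    variants = {term}
    for words in BILINGUAL.get(term, {}).values():
        variants.update(words)
    return variants


def search(term, docs):
    """Return documents containing the term or any of its translations."""
    variants = translate_query(term)
    return [d for d in docs if any(v in d.lower() for v in variants)]


docs = [
    "We sell excellent yoghurt.",
    "Продаем йогурт.",
    "Se vende rico yogur.",
    "We sell bread.",
]

for hit in search("요구르트", docs):
    print(hit)
```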
29
Tasks are entangled
Many CL tasks require other tasks
o Morphology → syntax → semantics
Many CL tasks form circles
o parsing ← WSD ← parsing
o I see a wild cat with a telescope (tripod?)
Can be done quick-and-dirty (?)
o Fighting for last %s
o Zipf's law: 20% of men drink 80% of beer
30
Tools and infrastructure
Analysis tools
o Tasks, methods
Dictionaries and grammars
o Types, structure
o Automatic acquisition
Corpora
o Corpora analysis tools and methods
31
Possible tasks
WSD to help IR
Clustering + summarization in IR results
Anaphora and coreference resolution to help
IR
Multilingual IR
Applications to Korean
... a lot of others
32
Reading
Textbooks
o Manning & Schütze, Allen, Jurafsky, Hausser, ...
CICLing proceedings
Computational Linguistics
Google, ResearchIndex
33
Questions
Who expects to publish?
Who will make a presentation at the next
seminar?
34
Thank you!
Till September 18
35