Alexander Gelbukh
Moscow, Russia
1
Mexico
2
Computing Research Center (CIC),
Mexico
3
Chung-Ang University, Korea
Electronic Commerce and
Internet Application Lab
4
Special Topics in Computer Science
The Art of
Information Retrieval
Alexander Gelbukh
www.Gelbukh.com
5
Information Retrieval
In a huge amount of poorly structured information,
find the information that you need
when you don’t know exactly what you need or can’t explain it
 The Web
 User information need
 Ranking
6
Importance
 Knowledge: the main treasure of man
 Web: Repository? Cemetery of information!
 Natural language and multimedia information
o Poorly structured, badly written
 Corporate and organizational document bases
o Senate speeches: Mexico
o Medical data collections
o Corporate memory. Microsoft knowledge base
 Future: data explosion → increasing importance
10
Perspectives
 Corporations: corporate databases
 Organizations: document bases
 Government
o European Union multilingual problem
o The same in Asia
 Academia
o Lots of open research topics
o Web topics
o Computational Linguistics topics
o Intelligent technologies, AI
11
Textbook
http://sunsite.dcc.uchile.cl/irbook/
12
Contents
1. Introduction
2. Modeling
3. Retrieval Evaluation
4. Query Languages
5. Query Operations
6. Text and Multimedia Languages and Properties
7. Text Operations
8. Indexing and Searching
9. Parallel and Distributed IR
10. User Interfaces and Visualization
11. Multimedia IR: Models and Languages
12. Multimedia IR: Indexing and Searching
13. Searching the Web
14. Libraries and Bibliographical Systems
15. Digital Libraries
13
Calendar
1. September 18: Chapter 1, Introduction
2. September 25: Chapter 2, Modeling
3. October 2: Chapter 3, Retrieval Evaluation
4. October 9: Chapter 4, Query Languages
5. October 16: Chapter 5, Query Operations
October 23 – midterm exam
6. October 30: Chapter 6, Text and Multimedia Languages...
7. November 6: Chapter 7, Text Operations
8. November 13: Chapter 8, Indexing and Searching
9. November 20: Chapter 10, User Interfaces and Visualization
10. November 27: Chapter 13, Searching the Web
11. December 4: Chapter 14, Libraries and Bibliographical Systems
12. December 11: Chapter 15, Digital Libraries
December – final exam
14
Class structure
Main course: Information Retrieval
 Discussion of previous chapter. Questions
 I briefly present a new chapter
Research seminar: Natural Language Processing
 Discussion of previous paper. Questions.
o Identification of possible research topics
 Presentation of a new paper or current work
 Discussion and questions
 Goal: publications!
15
Natural Language Processing
Research Seminar
16
What CL is about
Computers to process natural language text
 “Understand”
 Generate
 Search
 Organize
 Translate
 …
Useful in IR
17
Methods
 No: text as a stream of letters
o Brute force statistics
o Simplified heuristics (e.g., the Porter stemmer)
 Yes: attention to language rules
o Linguistically motivated approaches
o Knowledge-based approaches
o Corpus-based approaches
18
What IR is about
 Classical IR: find words? Concepts!
 Question answering
 Summarization
 Clustering
 …
Take language seriously
19
Text representations for IR
 Represent the retrieval features
o Strings → stems (lexemes), synsets, phrases.
o Women → woman, lady, female
o Old men and women → old woman
 Structured representation of text
o Network of related events and entities
o Enables logical inference
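The "strings → stems, synsets" idea can be sketched as follows. This is a minimal illustration, assuming a tiny hand-made lemma table and a single synset; `LEMMAS`, `SYNSETS`, `normalize`, `expand`, and `matches` are hypothetical names, not part of any real lexical resource.

```python
# Toy tables (assumptions for illustration only, not real dictionary data).
LEMMAS = {"women": "woman", "men": "man", "ladies": "lady"}
SYNSETS = [{"woman", "lady", "female"}]

def normalize(token):
    """Map a word form to its lexeme, falling back to the lowercased form."""
    token = token.lower()
    return LEMMAS.get(token, token)

def expand(term):
    """Return the synset containing the term's lexeme, or just the lexeme."""
    lemma = normalize(term)
    for synset in SYNSETS:
        if lemma in synset:
            return set(synset)
    return {lemma}

def matches(query, text):
    """True if every query term, after expansion, shares a lexeme with the text."""
    text_lemmas = {normalize(t) for t in text.split()}
    return all(expand(q) & text_lemmas for q in query.split())

print(matches("women", "An old lady spoke"))  # query and text share a synset
print(matches("women", "Old men spoke"))      # no shared lexeme
```

Phrase-level cases such as "Old men and women → old woman" are beyond this sketch, since they need syntactic analysis rather than word-by-word lookup.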
20
CL tasks useful in IR
 Morphology (stemming)
 POS / Word sense disambiguation
 Word relatedness
 Anaphora resolution
 Parsing and semantics (phrase search)
 Synonymic rephrasing
 Translation
etc…
Each one a whole science in itself
21
Morphology
 Q: pig T: piggish
 Simple: stemming
o piggish → pig-
 Lexeme: set of word forms
o same stem can give different words
o pigment → not pig; piny → pine, not pin
 Dictionary/corpus-based methods
o Learning; dictionary management
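The contrast between stemming and dictionary-based lexeme lookup can be sketched as below. The suffix list and the small lexicon are assumptions made up for this example; this is not the Porter algorithm or any real dictionary.

```python
# Hypothetical suffix list and lexicon, for illustration only.
SUFFIXES = ("ish", "ment", "y")
LEXEMES = {
    "pig": "pig", "pigs": "pig", "piggish": "pig",
    "pigment": "pigment",
    "pine": "pine", "piny": "pine",
    "pin": "pin",
}

def naive_stem(word):
    """Strip the first matching suffix, then undouble a final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

def lemma(word):
    """Prefer the dictionary; fall back to the naive stemmer for unknown words."""
    return LEXEMES.get(word, naive_stem(word))

for w in ("piggish", "pigment", "piny"):
    print(w, naive_stem(w), lemma(w))
```

The naive stemmer conflates pigment with pig and piny with pin, while the dictionary lookup keeps them apart. The price is that the dictionary has to be built and maintained.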
22
Part of Speech Disambiguation
 Q: oil well T: He did very well
 Q: what is an are? T: They are nice
 Important for English and Chinese; less important for other language types
 Perhaps not so helpful directly, but necessary for most other tasks
 Usually statistical / heuristic methods
23
Word Sense Disambiguation
 Q: bank account T: on the beautiful banks of the Han River ...
 bill: document, banknote, law, ax, peak,
Gates...
 Ambiguity is very frequent: it affects almost any word in a text
 Statistical & dictionary methods
 International competitions
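One dictionary method can be sketched as a simplified Lesk-style overlap count: pick the sense whose gloss shares the most content words with the context. The two glosses, the stopword list, and the sense labels below are invented for this example and do not come from a real dictionary.

```python
# Hypothetical glosses and stopwords, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "or", "that", "and"}
GLOSSES = {
    "bank/finance": "an institution that holds money in an account",
    "bank/river": "the sloping land on the side of a river or lake",
}

def content_words(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, glosses=GLOSSES):
    """Return the sense whose gloss overlaps most with the context words.

    On a tie, max() keeps the first sense in dictionary order.
    """
    ctx = content_words(context)
    return max(glosses, key=lambda sense: len(ctx & content_words(glosses[sense])))

print(simplified_lesk("bank account"))
print(simplified_lesk("on the beautiful banks of Han river"))
```

With such short glosses the overlap is often zero for every sense. That is one reason practical systems combine dictionary evidence with statistics.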
24
Word relatedness
 Q: female T: woman (women)
o Synonyms. Subtypes/super-types
o Dictionaries. WordNet. Similarity. Lesk.
 Q: Korea T: Seoul
o Other linguistic relationships (e.g., part)
o Real-world relationships (facts)
 Q: Clinton T: Lewinsky
o Statistical co-occurrence (MI)
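Statistical co-occurrence measured by mutual information (MI) can be sketched with document-level pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The four "documents" below are made-up word sets, and `pmi` is a hypothetical helper name.

```python
import math

# Toy collection: each document is a set of words (invented for illustration).
DOCS = [
    {"clinton", "lewinsky", "scandal"},
    {"clinton", "lewinsky", "senate"},
    {"clinton", "senate", "budget"},
    {"korea", "seoul", "budget"},
]

def pmi(x, y, docs=DOCS):
    """Pointwise mutual information of two words, estimated per document."""
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return float("-inf")  # the words never co-occur in this collection
    return math.log2(p_xy / (p_x * p_y))

print(pmi("clinton", "lewinsky"))  # positive: co-occur more often than chance
print(pmi("clinton", "budget"))    # negative: co-occur less often than chance
print(pmi("clinton", "seoul"))     # -inf: never co-occur
```

Unlike a dictionary relation such as synonymy, this score depends entirely on the collection, so it can capture real-world associations that no dictionary lists.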
25
Anaphora resolution
 Q: Awards of Prof. Han T: Prof. Han
said... He did... IBM awarded him...
o Frequency
o Phrases, co-occurrence, summarization,
inference, translation
 Heuristic (Mitkov) and knowledge-based methods
 Other types of co-reference
26
Parsing, semantics
 Q: Awards of Prof. Han T1: Prof. Han among
many other prizes has several IBM awards
T2: Mr. Kang has an award Prof. Han does
not know of
 Understanding of text
o Rich structured representation
 Better phrase search; question answering,
summarization, ...
27
Synonymic rephrasing, reasoning
 Q: experienced computer scientists T: Prof.
Han has been programming for many years
and has received an IBM award
 Requires good syntactic and semantic
analysis
 Knowledge-based methods
28
Multilingual access
 Q: 요구르트 T: We sell excellent yoghurt.
Продаем йогурт. Se vende rico
yogur.
o Search multilingual collections
 Europe: dozens of official languages of EU
o If you don’t know how to say it in English
 Dictionaries, bilingual corpora, ...
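Dictionary-based access to a multilingual collection can be sketched as query translation: expand the query with its translations, then match any of them. The translation table is a hand-made assumption, and the fourth document ("We sell tea.") is added here only as a non-matching case.

```python
import re

# Hypothetical one-entry translation dictionary, for illustration only.
TRANSLATIONS = {"요구르트": {"yoghurt", "yogurt", "йогурт", "yogur"}}

DOCS = [
    "We sell excellent yoghurt.",
    "Продаем йогурт.",
    "Se vende rico yogur.",
    "We sell tea.",  # extra document, not from the slide
]

def tokens(text):
    """Lowercased word tokens; \\w is Unicode-aware for str patterns in Python 3."""
    return set(re.findall(r"\w+", text.lower()))

def search(query, docs=DOCS):
    """Return the documents containing the query or any of its translations."""
    terms = {query.lower()} | TRANSLATIONS.get(query, set())
    return [d for d in docs if terms & tokens(d)]

for doc in search("요구르트"):
    print(doc)
```

A real system would also need morphological normalization in each language and a way to choose among ambiguous translations.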
29
Tasks are entangled
 Many CL tasks require other tasks
o Morphology → syntax → semantics
 Many CL tasks form circles
o parsing ← WSD ← parsing
o I see a wild cat with a telescope (tripod?)
 Can be done quick-and-dirty (?)
o Fighting for the last few percent
o Zipf's law: 20% of men drink 80% of beer
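The 80/20 remark can be checked on any token list by measuring what share of all tokens the most frequent word types cover. This is a rough sketch; `coverage_of_top_types` is a hypothetical helper and the sample token list is artificial.

```python
from collections import Counter

def coverage_of_top_types(tokens, fraction=0.2):
    """Share of all tokens covered by the top `fraction` of word types."""
    if not tokens:
        return 0.0
    ranked = sorted(Counter(tokens).values(), reverse=True)
    top_k = max(1, round(len(ranked) * fraction))  # use at least one type
    return sum(ranked[:top_k]) / len(tokens)

# Artificial sample: one very frequent type and four rare ones.
sample = ["the"] * 16 + ["pig", "pine", "bank", "bill"]
print(coverage_of_top_types(sample))
```

On real text the distribution is similarly skewed. A quick-and-dirty method that handles the frequent cases therefore covers most of the text, while the long tail of rare cases is where the last few percent are fought for.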
30
Tools and infrastructure
 Analysis tools
o Tasks, methods
 Dictionaries and grammars
o Types, structure
o Automatic acquisition
 Corpora
o Corpora analysis tools and methods
31
Possible tasks
 WSD to help IR
 Clustering + summarization in IR results
 Anaphora and coreference resolution to help
IR
 Multilingual IR
 Applications to Korean
 ... a lot of others
32
Reading
 Textbooks
o Manning & Schütze, Allen, Jurafsky, Hausser, ...
 CICLing proceedings
 Computational Linguistics
 Google, ResearchIndex
33
Questions
 Who expects to publish?
 Who will make a presentation at the next
seminar?
34
Thank you!
Till September 18
35