Transcript Slide 1
Information Retrieval Tutorial 2004. 2. 13
Information & Communications University IR & NLP Lab http://ir.icu.ac.kr
맹 성 현 Copyright © 2004 Sung Hyon Myaeng
Outline
What is Information Retrieval (IR)?
Overview of Core IR Technology
Overall Directions
IR Expanded
- CLIR/MLIR
- Classification
- Topic Detection & Tracking
- Recommender Systems
- Summarization
- Question Answering
- Information Extraction
What is IR?
Traditional IR: Willow System
What is IR?
Web Search Engine
What is IR?
Ask Jeeves
IR & the Rest of the World
Related fields: Information Retrieval, Natural Language Processing, Human Computer Interaction, DB, AI, Statistics, Linguistics, Cognitive Science
Evaluation of IR Systems
Effectiveness: "relevance"

             Rel    NOT Rel
  Ret         A        C
  NOT Ret     B        D

  precision: A / (A + C)
  recall: A / (A + B)

Efficiency
Interactive systems?
Others?
[Figure: curves for System A, System B, and System C plotted against recall (x-axis, 0 to 1); y-axis from 0 to 0.9]
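The precision and recall definitions above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `precision_recall` and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # A: retrieved and relevant
    c = len(retrieved - relevant)  # C: retrieved but not relevant
    b = len(relevant - retrieved)  # B: relevant but not retrieved
    precision = a / (a + c) if retrieved else 0.0
    recall = a / (a + b) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved.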
Overview of Text Retrieval
[Diagram: Text Processing (raw text → Text Analysis → Index); User/System Interaction (Info Needs → Analysis of Info Needs → Query); Search Engine (Matching/Inferencing of Query against Index → Retrieval Result); Knowledge Resources & Tools support all stages]
Text Processing (1)
- Indexing
Extraction of index terms and computation of their weights
- Index terms: represent document content & separate documents
  - "economy" vs "computer" in a news article of Financial Times
- Morphological analysis (stemming in English)
  - "벨기에는" ("Belgium" + topic marker: "벨기" + "에는"?), "문서내의" ("within the document": "문서" + "내의")
  - "information", "informed", "informs", "informative"
  - Rule-based vs dictionary-based
- n-gram
  - "정보검색시스템" ("information retrieval system") => "_정", "정보", "보검", "검색", … (bi-gram)
  - "부정사" ("infinitive") vs "부정한 정사" ("illicit affair") (similar enough in bi-gram!)
  - Surprisingly effective in some languages
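The bi-gram indexing idea above can be sketched as follows. This is a minimal illustration assuming a single leading pad character "_" (as in the slide's "_정") and whitespace removed before comparison; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=2, pad="_"):
    """Split text into overlapping character n-grams, with one leading pad."""
    padded = pad + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("정보검색시스템"))

# The two unrelated phrases from the slide share every bi-gram of the shorter one
# (assumption: the space in the longer phrase is dropped before indexing).
short = set(char_ngrams("부정사"))
long_ = set(char_ngrams("부정한정사"))
print(short & long_)
```

The overlap shows why n-gram matching can produce false matches such as the slide's "similar enough" example, even though it needs no dictionary or morphological analyzer.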
Text Processing (2)
– Storing indexing results
[Figure: documents 1, 2, 3, 4, …, n, each listing its index terms (A to G), and the resulting inverted index: a term-by-document table with a mark (v) wherever a term occurs in a document]
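A minimal sketch of building an inverted index from indexed documents. The document contents below are hypothetical stand-ins for the figure, and `build_inverted_index` is an illustrative name rather than anything from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each index term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, each already reduced to its index terms.
docs = {
    1: ["A", "B", "E"],
    2: ["A", "C", "F"],
    3: ["C", "F"],
    4: ["A", "D", "G"],
}
index = build_inverted_index(docs)
print(index["A"])
print(index["C"])
```

Looking up a term now costs one dictionary access instead of a scan over every document, which is the point of storing the indexing results this way.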
Text Processing (3)
- Indexing
Use of various linguistic resources
- Dictionaries (noun, Josa, Eomi, bilingual, proper noun, foreign words, …)
  - For extraction and weighting of index terms
- Thesaurus (e.g. WordNet)
  - Controlled vocabulary indexing
  - Matching similar and related words
- Tagged corpus
Most NLP technology is used for term extraction
- "Bag of words" approach
- Sense disambiguation?
- Word order?
Overview of Text Retrieval
[Diagram: Text Processing (raw text → Text Analysis → Index); User/System Interaction (Info Needs → Analysis of Info Needs → Query); Search Engine (Matching/Inferencing of Query against Index → Retrieval Result); Knowledge Resources & Tools support all stages]
User/System Interaction
– Query Models
Boolean
- AND, OR, NOT operators
  - E.g.: ((semi-conductor OR chip) AND stock NOT chocolate)
- Adjacency, phrase operators
  - E.g.: "stock exchange", "그리고 아무 말도 하지 않았다" ("And said nothing")
- Difficult for naïve users
  - Visual query interface
Word list
- Vector space model system
  - E.g.: (semi-conductor chip stock)
- Often interpreted as a Boolean query in search engines
  - E.g.: (semi-conductor OR chip OR stock)
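The Boolean query from the slide can be evaluated with set operations over posting lists. This is a sketch under assumed data: the posting sets and the helper `docs_with` are hypothetical, and NOT is treated as set difference (AND NOT).

```python
# Hypothetical posting lists: term -> set of document IDs containing the term.
postings = {
    "semi-conductor": {1, 2},
    "chip": {2, 3, 5},
    "stock": {1, 3, 4, 5},
    "chocolate": {5},
}


def docs_with(term):
    """Return the posting set for a term (empty if the term is not indexed)."""
    return postings.get(term, set())


# (semi-conductor OR chip) AND stock NOT chocolate
boolean_result = (
    (docs_with("semi-conductor") | docs_with("chip")) & docs_with("stock")
) - docs_with("chocolate")

# A word list interpreted as an OR query: semi-conductor OR chip OR stock
word_list_result = docs_with("semi-conductor") | docs_with("chip") | docs_with("stock")

print(sorted(boolean_result))
print(sorted(word_list_result))
```

The OR interpretation of a word list returns a much larger, unranked set, which is one reason ranked retrieval models are used for such queries.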
User/System Interaction
– Query Models
"Natural Language" query
- E.g.: "I want to get information about ski resorts in Kangwon-do or in the Chungcheong area."
- Limitations in NLP
- Various tricks
Query expansion
- To resolve mismatches between query terms and index terms for documents
- A variety of linguistic resources are used (e.g. synonyms, foreign word equivalence classes, bilingual dictionaries)
Guide users to follow step-by-step instructions for detailed queries
- "Canned queries" (query templates)
Ask Jeeves
Screenshot
User/System Interaction
– Query Models
Relevance feedback
- "Similar Pages" in Web search engines
- From a simple query to better queries progressively
  - Limited recall capability of human beings
  - Recognition of a relevant document is much easier.
- Intended to ease the difficulty of grasping the statistical properties of the entire collection
- An indirect way of capturing the user needs
User profile
- To reflect the user's interest and orientation in interpreting user queries
- Need to gather & analyze user log data and learn user models
User/System Interaction
– Result Presentation
Information overload problem: too many retrieved documents
A simple ranked list: title, author, URL, date, …
Method 1: Organizing the retrieved documents
- Result clustering (e.g. Vivisimo)
- "Zoom-in" operation (e.g. Scatter & Gather)
Method 2: Visualizing the retrieved documents
- Overview of a large amount of information
- Visual expression of document properties
  - E.g. TileBar
Scatter/Gather
Tile Bar
Result Clustering
Text Retrieval Overview
[Diagram: Text Processing (raw text → Text Analysis → Index); User/System Interaction (Info Needs → Analysis of Info Needs → Query); Search Engine (Matching/Inferencing of Query against Index → Retrieval Result); Knowledge Resources & Tools support all stages]
Matching & Ranking (1)
[Figure: inverted file. Query terms are looked up in a directory that lists index terms (e.g. 가구 "furniture", 가야 "Gaya", 신라 "Silla", 호랑이 "tiger") with their weights (0.6 to 0.9) and pointers into a posting file; the posting file entries point to the documents (Doc #1, Doc #2, Doc #5, …)]
Matching & Ranking (2)
Ranking
- Retrieval model
  - Boolean (exact) => Fuzzy Set (inexact)
  - Vector Space
  - Probabilistic
  - Inference Net
  - …
- Weighting schemes
  - Index terms, query terms
  - Parameters in formulas
  - Document characteristics
  - …
IR Model Example: Vector Space Model
[Figure: documents containing the terms "dog" and "mouse", and the vectors Q and D1 drawn in a term space with axes cat, dog, and mouse; Q = <cat, mouse, 0>]
$D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$
$Q = (q_1, q_2, \ldots, q_n)$
$\mathrm{Similarity}(D_i, Q) = \dfrac{D_i \cdot Q}{|D_i| \, |Q|}$
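The similarity formula above is the cosine of the angle between the document and query vectors. A minimal sketch, assuming binary term weights on the axes (cat, dog, mouse); the vectors and the function name `cosine` are illustrative only.

```python
import math


def cosine(d, q):
    """Cosine similarity: (D . Q) / (|D| * |Q|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Axes: (cat, dog, mouse). Hypothetical binary weights.
q = (1, 0, 1)   # query mentions cat and mouse
d1 = (0, 1, 1)  # document mentions dog and mouse
d2 = (0, 1, 0)  # document mentions only dog

print(cosine(d1, q))
print(cosine(d2, q))
```

Documents are then ranked by this score, so d1 (which shares "mouse" with the query) ranks above d2 (which shares nothing).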
Matching & Ranking (3)
Techniques for efficiency
- New storage structure, esp. for new document types
- Use of accumulators for efficient generation of ranked output
- Compression/decompression of indexes
Techniques for Web search engines
- Use of hyperlinks
  - Inlinks & outlinks
  - Authority vs hub pages
- In conjunction with Directory Services (e.g. Yahoo)
- Softbot: storing terabytes of data and efficient crawling
- ...
Web document retrieval – using hyperlinks
[Figure: an initial retrieval set for TERM, linked documents A, B, and C, and candidates for additional retrieval]
A: Hub document
B: Authority document
Increase the weight of A, B
To be ranked again using the link information
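The slide does not name a specific algorithm for scoring hubs and authorities. As an assumption, the sketch below uses the well-known mutually reinforcing hub/authority iteration (HITS-style) on a tiny hypothetical link graph; `hits` and the graph are illustrative only.

```python
import math


def hits(links, iterations=20):
    """Iteratively score hubs and authorities on a link graph.

    links maps a page to the list of pages it links to (its outlinks).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in (inlinks).
        auth = {n: sum(hub[s] for s in links if n in links[s]) for n in nodes}
        # Hub score: sum of authority scores of pages linked to (outlinks).
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit length to keep values bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: A links to B and C, C links to B.
hub, auth = hits({"A": ["B", "C"], "C": ["B"]})
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph A behaves as the hub (it points to the most authoritative pages) and B as the authority (it receives the most hub weight), mirroring the A/B roles on the slide.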
Characteristics of IR - summary
Information Retrieval / Data Retrieval Spectrum (Unstructured vs Structured)

                      Information Retrieval      Data Retrieval
Retrieval Models      Probabilistic              Deterministic
Indexing              Derived from contents      Complete Items
Matching/Retrieval    Partial or "Best" Match    Exact Match
Query Types           Natural Language           Structured
Results Criteria      Relevance                  Any Match
Results Ordering      Ranked                     Arbitrary
Overall Directions (1)
Efforts to improve retrieval effectiveness (as always!)
- Retrieval model, text analysis and representation, user interactions, ...
Specialized search: domain-specific
Context awareness (personalization, task-centered)
- Profile, session logs, task models, etc.
Multi-something
- Multimedia, multi-style, multilingual, …
Distributed environment with a large quantity
- Web search, meta-search, distributed retrieval (DB segmentation), meta-data retrieval, semantic Web
New functionality
- Filtering, TDT, classification, summarization, QA, information extraction, ...
Cross-Language & Multilingual IR
Cross-language IR
- Using a language (mother tongue) to retrieve documents in another language
- To overcome the language differences
- [terminology] cross-lingual, translingual (DARPA)
Multilingual IR
- "retrieving relevant document in any of the languages contained in a multilingual document collection"
(CL + | ML) Document Retrieval
- E.g.: Using Korean queries to search a DB consisting of Korean, English and Japanese documents
CLIR
The number of documents in languages other than one's own is rapidly increasing, and so is the need for retrieval.
- The rate of annual increase for documents on the Web: English: 50%; all other languages: 90%
- Multilingual countries, organizations, enterprises, & users
The limitation of machine translation technology
- More economical to translate necessary documents after retrieval
Not easy to construct a query in a foreign language even with the ability to comprehend written materials in the language
- Reading vs. writing
CLIR is fundamental to other multilingual information access technology
CLIR Problem (example)
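Query: 현대자동차 주식 동향 ("Hyundai Motor stock trends")
- Segmentation: 현대 자동차 ("Hyundai" + "Motor") vs 현 대자 동차 ??
- Candidate translations of 주식: principal food, stocks, food and drink
- Candidate translations of 동향: same village, eastern exposure, trend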
Retrieval of Structured Documents
XML documents, hypertext, metadata, semantic Web
Queries for structure and content
- FIND a document that INCLUDES a chapter whose title CONTAINS the term "hypertext" AND whose section CONTAINS the term "browsing".
Queries for content and link
- FIND all documents about "information retrieval" that are referred to by a paper written by "Myaeng".
Retrieval with ontology vs retrieval from ontology (e.g. RQL)
Classification - Motivation
Is this spam?
Sender: [email protected]
Subject: Your Business Listing - Global Trade Index
Date: 2003-09-03 (Wed) 6:40 AM
Size: 6 KB
Dear Site Owner, You are invited to list your site at the most important
Trade Directory
on the Internet. This directory system has attributes no other directory on the Internet has had do date, check us out!
Manufacturers - Wholesalers - Distributors - Resellers
and all businesses that are associated are welcome on our directory.
Your business will prosper from its association with our global resources!
There are NO CHARGES for a listing. Just click here enter your business details as you like. Thank you for your kind assistance in this matter, and The Team at EconomicGrowthNetwork.com
Toll Free - 866 516-8412
Classification - Motivation
How about these subject lines?
Re: 요청하신 자료입니다. ("Re: Here is the material you requested.")
그렇게 가면 안되지. ("You shouldn't go like that.")
Get V^iagram in the convenience of your home
Generi.c Cia.lis – Lasts 2 times longer then Via.gra!
Re,no-va”te”*”you:r ‘d*ow_nst>;airs;^
Classification Problem
E-mail classification
- Given an e-mail message containing a question and/or complaint, where should it be sent in ERMS?
- Categories: AS, subscription/unsubscription, passwords, upgrades, usage, about-products, other questions, other complaints
A not-so-easy example (informal spelling and slang):
- Subject (제목): 아뒤랑 비번를 잊어 먹었슴다. ("I forgot my ID and password.")
- Body (본문): 안녕하세여 … 오랫만에 제가 그만 홈에 들렀더니 … 아뒤랑 pass를 잊어버렸네요 음 … ("Hello … I dropped by the home page after a long time and … I have forgotten my ID and password, hmm …")
Problem Statement
Given:
- A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
  - Issue: how to represent text documents?
- A fixed set of categories: $C = \{c_1, c_2, \ldots, c_n\}$
Determine:
- The category of $x$: $c(x) \in C$, where $c(x)$ is a categorization function whose domain is $X$ and whose range is $C$.
  - Issue: how to build categorization functions ("classifiers").
A Typical Example for Document Classification
Testing data: "planning language proof intelligence"

Classes and training data:
(AI)
- ML: learning, intelligence, algorithm, reinforcement, network, ...
- Planning: planning, temporal, reasoning, plan, language, ...
(Programming)
- Semantics: programming, semantics, language, proof, ...
- Garb.Coll.: garbage, collection, memory, optimization, region, ...
(HCI)
- Multimedia: ...
- GUI: ...
Classification Methods
Rule-based methods
- E.g.: assign a category if the document contains a given Boolean combination of words
- Accuracy is often very high if a profile has been carefully refined over time by a subject expert.
- Building and maintaining these profiles is expensive.
Inductive learning models
- Naïve Bayesian model
- Decision tree model
- SVM (Support Vector Machine)
Similarity-based models
- K-Nearest Neighbor
- Rocchio's model
These require hand-classified training data, but can be built (and refined) easily.
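As one concrete instance of the inductive learning models listed above, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing. The training snippets are loosely adapted from the earlier example slide and are assumptions, as are the function names `train_nb` and `classify_nb`.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect per-class document counts, term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-classified training data.
training = [
    ("ML", "learning intelligence algorithm reinforcement network".split()),
    ("Planning", "planning temporal reasoning plan language".split()),
    ("Semantics", "programming semantics language proof".split()),
    ("Garb.Coll.", "garbage collection memory optimization region".split()),
]
model = train_nb(training)
print(classify_nb(model, "garbage memory region".split()))
```

With only one short training document per class the model is easily swayed by single words, which illustrates why these methods depend on the quantity and quality of hand-classified data.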
Topic Detection & Tracking (TDT) (1)
Event: A reported occurrence at a specific time and place, and the unavoidable consequences.
- Specific elections, accidents, crimes, natural disasters.
- "TWA-800 airplane crash" vs. "airplane accidents"
Activity: A connected set of actions that have a common focus or purpose
- Campaigns, investigations, disaster relief efforts
Topic: A seminal event or activity, along with all directly related events and activities
Story: A topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic
TDT (2) – First Story Detection
Automatically identify the first story on a new event from a stream of text
To detect the first story that discusses a topic, for all topics.
[Figure: stories on Topic 1 and Topic 2 laid out along a time axis, with the first story of each topic marked "First Stories" and the others "Not First Stories"]
TDT (3) – More about FSD
First story detection is an unsupervised learning task.
- There is no supervised training.
On-line vs. retrospective
- On-line: Flag onset of new events from live news feeds as stories come in
- Retrospective: Detection consists of identifying the first story looking back over a longer period
Lack of advance knowledge of new events, but have access to unlabeled historical data as a contrast set
Applications
- Intelligence services
- Finance: Be the first to trade a stock
TDT (4) – Other Tasks
Topic tracking
- Standard text classification task
- Once a topic has been detected, identify subsequent stories about it
- However, very small training set (initially: 1!)
Topic detection
- Grouping stories from an accumulated collection
- Event-based classification with multiple topics (events)
- Retrospective
TDT (5)
Characteristics of events
- News articles on an event are temporally close to each other.
  - Lexical and temporal similarities
- "Similar" news articles over an extended time period => different events
  - Use a time window to determine the scope of an event
- Changes in the used terms and their frequencies => new event
Use of clustering techniques (example)
- Retrospective (TD): bottom-up clustering with a time window
- Online (tracking): single-pass, incremental clustering
  - Incremental IDF
  - Use of a decaying function for old documents in computing the similarity
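The online case above can be sketched as single-pass clustering with a time-decayed similarity. This is a simplified illustration under stated assumptions: raw term counts instead of incremental IDF, an exponential half-life decay, and hypothetical `threshold` and `half_life` values.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def single_pass_cluster(stories, threshold=0.3, half_life=3.0):
    """Assign each (day, tokens) story, in time order, to an event cluster.

    A story that starts a new cluster is a candidate "first story".
    """
    clusters = []  # each cluster: {"centroid": Counter, "last_day": number}
    labels = []
    for day, tokens in stories:
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for i, c in enumerate(clusters):
            # Older clusters count for less: similarity halves every half_life days.
            decay = 0.5 ** ((day - c["last_day"]) / half_life)
            sim = cosine(vec, c["centroid"]) * decay
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            clusters[best]["centroid"].update(vec)
            clusters[best]["last_day"] = day
            labels.append(best)
        else:
            clusters.append({"centroid": vec, "last_day": day})
            labels.append(len(clusters) - 1)
    return labels


# Hypothetical news stream: (day, tokens).
stories = [
    (1, "plane crash atlantic".split()),
    (2, "plane crash investigation".split()),
    (3, "election vote result".split()),
    (30, "plane crash pacific".split()),
]
labels = single_pass_cluster(stories)
print(labels)
```

The last story is lexically similar to the first two but arrives four weeks later, so the decay pushes it below the threshold and it starts a new event, matching the slide's point that similar articles far apart in time describe different events.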
Recommender Systems (1)
Recommender systems are a technological proxy for a social process
- We rely on recommendations from other people.
- An information discovery model where people try to find other people with similar tastes and then ask them to suggest new things
In a typical recommender system
- People provide recommendations (input)
- The system aggregates and directs them to appropriate recipients.
Recommender Systems (2)
Motivations
- Can we automatically aggregate quotes like:
  - "I like this book; you might be interested in it"
  - "I saw this movie, you'll like it"
  - "Don't go see that movie!"
- Finding new books, music, or movies, previously unknown to users
Applications
- Corporate intranets
  - Recommendations, finding domain experts, …
- E-commerce
  - Product recommendations: amazon, CDNOW, …
- Medical applications
  - Matching patients to doctors, clinical trials, …
- Customer relationship management
  - Matching customer problems to internal experts in a support organization
Recommender Systems (3) - Types
Collaborative/social-filtering system: aggregation of consumers' preferences and recommendations to other users based on similarity in behavioral patterns
Content-based system: supervised machine learning used to induce a classifier to discriminate between interesting and uninteresting items for the user
Knowledge-based system: knowledge about users and products used to reason about what meets the user's requirements, using discrimination trees, decision support tools, case-based reasoning (CBR)
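A minimal sketch of the collaborative-filtering type: recommend items that similar users rated highly. The user-based, cosine-weighted scheme, the ratings, and the names `user_similarity` and `recommend` are assumptions chosen for illustration, not the method of any particular system.

```python
import math


def user_similarity(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = user_similarity(ratings[user], their)
        if sim <= 0:
            continue
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"book_a": 5, "book_b": 4},
    "bob": {"book_a": 5, "book_b": 5, "book_c": 5},
    "carol": {"book_a": 1, "book_d": 2},
}
print(recommend(ratings, "alice"))
```

No item content is inspected here; the recommendation comes entirely from aggregated preferences, which is what separates this type from the content-based and knowledge-based systems.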
Recommender Systems (4) - Example
Automatic Summarization (1)
Functionality
- Indicative: to determine if the document would be of any interest
- Informative: to reflect the original content as faithfully as possible under the compression rate
- Evaluative: evaluation of the original document
Fluency
- Fragmented vs connected text
Users
- Generic vs user (query)-focused
Target documents
- Single vs multiple documents
Automatic Summarization (2)
[Diagram: Source(s) → Intermediary Representation → Summary]
Analysis: Word Frequencies, Clue Phrases, Layout, Syntax, Semantics, Discourse, Pragmatics
Selection, Condensation, Presentation: Word Count, Clue Phrases, Statistical, Structural, Abstraction, Aggregation, Planning, Realization, Layout
Automatic Summarization (3) - Approaches
Natural language understanding / generation
- Build knowledge representation of text
- Generate sentences summarizing content
- Hard to do well
Keyword summaries
- Display most significant keywords
- Easy to do
- Hard to read, poor representation of content
Sentence extraction
- Extract key sentences
- Medium hard
- Summaries often don't read well
- Good representation of content
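The sentence-extraction approach can be sketched with a simple word-frequency score. The scoring rule (average collection frequency of a sentence's non-stopword terms), the tiny stopword list, and the function name `extract_summary` are assumptions made for illustration.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "and", "is", "in", "to", "by", "was"})


def content_words(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_summary(text, num_sentences=1):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]


text = (
    "Information retrieval finds relevant documents. "
    "Retrieval systems rank documents by relevance. "
    "The weather was nice."
)
print(extract_summary(text, 2))
```

Because whole sentences are copied out of context, the result represents the content reasonably well but may not read smoothly, as the slide notes.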
Question Answering (Q/A) (1)
To provide an answer to a query, as opposed to a document
- Query: for factoid or exact answer
- Result: Ranked list of
Who invented the paper clip?
Where is Rider College located?
Name a film in which Jude Law acted.
Question Answering (2)
System flow (example)
[Diagram: User Query → Query Analyzer (with Rule Set, Query Categories, Thesaurus) → Query Set → Retrieval Engine (over Document Set) → Retrieved Docs → Doc Analyzer → Candidates → Answer Extractor → Answer]
Question Answering (3)
Assignment of query categories
- Query: What is the fare cost for the round trip between New York and London on Concorde?
- Rule applied: What [be] [ADJ] [NOUN] for
- Main phrase extracted: fare cost
- Categorization of the phrase: Financial loss
- Assign a query category
Information Extraction
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
Information Extraction (2)
Goal: being able to answer semantic queries (a.k.a. "database queries") using "unstructured" natural language sources or semi-structured textual documents.
Transform this unstructured information into structured relations in a database/ontology.
Source text (fragments, translated from Korean): ... terror in Tokyo, Japan in April ... afternoon of the 11th ... Minister of Commerce ... passersby ... Tokyo ... 5 seriously injured ... damage ... homemade bomb ... 2 cars ... in the case of Germany ...
Extracted template:
  Event: Terror
  Date/time: 4.12, afternoon
  Place: Tokyo
  Target: Minister of Commerce
  Casualties: 5 seriously injured
  Property damage: 2 cars
Information Extraction (3) – flow (example)
document
- local text analysis
  - lexical analysis
  - name recognition
  - partial syntactic analysis
  - scenario pattern matching
- discourse analysis
  - coreference analysis
  - inference
- template generation
extracted templates
[Grishman, 1997]
Information Extraction: MUC (State of the Art – 1997)
- NE – named entity recognition
- CO – coreference resolution
- TE – template element construction
- TR – template relation construction
- ST – scenario template production
Knowledge Extraction Vision
Multi-dimensional meta-data extraction
- Topic Discovery, Concept Indexing, Thread Creation, Term Translation, Document Translation, Story Segmentation, Entity Extraction, Fact Extraction
[Figure: meta-data over a timeline of months (J F M A M J J A) for the topic "India Bombing" across the sources NY Times, Andhra Bhoomi, Dinamani, and Dainik Jagran]
EMPLOYEE / EMPLOYER relationships:
- Jan Clesius works for Clesius Enterprises
- Bill Young works for InterMedia Inc.
COMPANY / LOCATION relationships:
- Clesius Enterprises is in New York, NY
- InterMedia Inc. is in Boston, MA
References
Belkin, N.J. & Croft, W.B. (1987). "Retrieval Techniques." In: Williams, M.E. (ed.), Annual Review of Information Science and Technology 22, 109-145, New York: Elsevier & ASIS.
E. Glover, S. Lawrence, M. Gordon, W. Birmingham, C. Lee Giles (2001). "Web Search – Your Way." Comm. of the ACM, 44 (12), 97-102.
Gregory Grefenstette (1998). "The Problem of Cross-Language Information Retrieval." In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
Doug Oard et al. (1999). "Multilingual Information Discovery and AccesS (MIDAS)." D-Lib Magazine, 5 (10), Oct.
Sung Hyon Myaeng et al. (1998). "A Flexible Model for Retrieval of SGML Documents." Proc. of the 21st ACM SIGIR Conference, Australia.
References
James Allan (2002). "Introduction to Topic Detection and Tracking." In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
Paul Resnick & Hal Varian (1997). "Recommender Systems." CACM 40 (3), March, pp 56-58.
Badrul Sarwar et al. (2001). "Item-based Collaborative Filtering Recommendation Algorithms." http://citeseer.nj.nec.com/sarwar01itembased.html
Karen Sparck Jones (1999). "Automatic summarizing: factors and directions." In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
Ellen Voorhees (2000). "Overview of TREC-9 Question Answering Track."
Ralph Grishman (1997). "Information Extraction: Techniques and Challenges." In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag, 1997. (See http://nlp.cs.nyu.edu/publication/index.shtml)
Andrew McCallum and Kamal Nigam (1998). "A Comparison of Event Models for Naive Bayes Text Classification." In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.