Information Retrieval Tutorial 2004. 2. 13

Information & Communications University IR & NLP Lab http://ir.icu.ac.kr

Sung Hyon Myaeng (맹성현) Copyright © 2004 Sung Hyon Myaeng

Outline

- What is Information Retrieval (IR)?
- Overview of Core IR Technology
- Overall Directions
- IR Expanded
    - CLIR/MLIR
    - Classification
    - Topic Detection & Tracking
    - Recommender Systems
    - Summarization
    - Question Answering
    - Information Extraction

What is IR?

Traditional IR: Willow System

What is IR?

Google

Web Search Engine

What is IR?

Ask Jeeves

IR & the Rest of the World

[Figure: Information Retrieval and related fields: Natural Language Processing, Human Computer Interaction, DB, AI, Statistics, Linguistics, Cognitive Science]

Evaluation of IR Systems

- Effectiveness: based on "relevance"

               Ret    NOT Ret
    Rel         A        B
    NOT Rel     C        D

    - precision: A / (A + C)
    - recall: A / (A + B)
- Efficiency
- Interactive systems?
- Others?

[Figure: curves for System A, System B, and System C; x-axis: recall (0 to 1); y-axis (presumably precision): 0 to 0.9]
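The precision and recall definitions above can be sketched in a few lines. This is a minimal illustration, assuming each document is identified by an ID and that relevance judgments are available as a set; the function name and the sample IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    A = retrieved and relevant, C = retrieved but not relevant,
    B = relevant but not retrieved (cells of the contingency table).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)
    c = len(retrieved - relevant)
    b = len(relevant - retrieved)
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were retrieved (recall 2/3).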

Overview of Text Retrieval

[Figure: block diagram of text retrieval]
- Text Processing: raw text → text analysis → index
- User/System Interaction: info needs → analysis of info needs → query
- Search Engine: matching (inferencing) → retrieval result
- Knowledge Resources & Tools

Text Processing (1)

- Indexing

- Extraction of index terms and computation of their weights
- Index terms: represent document content & separate documents
    - "economy" vs "computer" in a news article of Financial Times
- Morphological analysis (stemming in English)
    - "벨기에는" ("벨기" + "에는"?) [Belgium + particle], "문서내의" ("문서" + "내의") ["within the document"]
    - "information", "informed", "informs", "informative"
    - Rule-based vs dictionary-based
- n-gram
    - "정보검색시스템" ["information retrieval system"] => "_정", "정보", "보검", "검색", … (bi-gram)
    - "부정사" ["infinitive"] vs "부정한 정사" ["illicit affair"] (similar enough in bi-gram!)
    - Surprisingly effective in some languages
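A minimal sketch of the bi-gram indexing idea above. It assumes a leading pad symbol "_" (as in the slide's "_정") and drops spaces before splitting; the Dice coefficient used to compare two strings is an illustrative choice, not something the slide specifies.

```python
def char_bigrams(text, pad="_"):
    """Return overlapping character bigrams, with a leading pad symbol."""
    s = pad + text.replace(" ", "")  # assumption: spaces are ignored
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Dice coefficient between the bigram sets of two strings."""
    x, y = set(char_bigrams(a)), set(char_bigrams(b))
    return 2 * len(x & y) / (len(x) + len(y))


print(char_bigrams("정보검색시스템"))
print(dice("부정사", "부정한 정사"))
```

The second call shows why the two unrelated expressions look alike at the bi-gram level: every bigram of the shorter string also occurs in the longer one.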

Text Processing (2)

– Storing indexing results

[Figure: documents 1, 2, 3, 4, …, n, each containing some of the terms A to G, and the corresponding term-by-document table (inverted index) with a check mark where a term occurs in a document]
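A minimal sketch of building an inverted index from indexing results. The document contents below are hypothetical stand-ins for the figure's documents, and the function name is an assumption.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents: doc ID -> index terms.
docs = {1: ["A", "B", "E"], 2: ["A", "C", "F"], 3: ["C", "F"], 4: ["A", "D", "G"]}
index = build_inverted_index(docs)
print(index["A"], index["C"])
```

Looking up a term now costs one dictionary access instead of a scan over all documents.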

Text Processing (3)

- Indexing

- Use of various linguistic resources
    - Dictionaries (noun, Josa [particles], Eomi [endings], bilingual, proper noun, foreign words, …)
        - For extraction and weighting of index terms
    - Thesaurus (e.g. WordNet)
        - Controlled vocabulary indexing
        - Matching similar and related words
    - Tagged corpus
- Most NLP technology is used for term extraction
    - "Bag of words" approach
    - Sense disambiguation?
    - Word order?

Overview of Text Retrieval

[Figure repeated: block diagram of text retrieval (Text Processing: raw text → text analysis → index; User/System Interaction; Search Engine; Knowledge Resources & Tools)]

User/System Interaction

– Query Models

- Boolean
    - AND, OR, NOT operators
        - E.g.: (semi-conductor OR chip) AND stock NOT chocolate
    - Adjacency, phrase operators
        - E.g.: "stock exchange", "그리고 아무 말도 하지 않았다" ("and said nothing")
    - Difficult for naïve users → visual query interface
- Word list → vector space model system
    - E.g.: (semi-conductor chip stock)
    - Often interpreted as a Boolean query in search engines
        - E.g.: (semi-conductor OR chip OR stock)
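A minimal sketch of how the Boolean example could be evaluated with set operations over posting lists. The posting lists are hypothetical; a real engine would read them from its index.

```python
# Hypothetical posting lists: term -> set of document IDs.
postings = {
    "semi-conductor": {1, 4},
    "chip": {2, 4, 5},
    "stock": {1, 2, 3, 5},
    "chocolate": {2},
}


def docs_with(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return postings.get(term, set())


# (semi-conductor OR chip) AND stock NOT chocolate
boolean_hits = (
    (docs_with("semi-conductor") | docs_with("chip")) & docs_with("stock")
) - docs_with("chocolate")

# Word list (semi-conductor chip stock) read as an OR query
word_list_hits = docs_with("semi-conductor") | docs_with("chip") | docs_with("stock")

print(sorted(boolean_hits), sorted(word_list_hits))
```

OR maps to set union, AND to intersection, and NOT to set difference; the OR reading of a word list returns a much larger, unranked set, which is why ranking matters for word-list queries.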

User/System Interaction

– Query Models

- "Natural language" query
    - E.g.: "I want to get information about ski resorts in Kangwon-do or in the Chungcheong area."
    - Limitations in NLP → various tricks
- Query expansion
    - To resolve mismatches between query terms and index terms for documents
    - A variety of linguistic resources are used (e.g. synonyms, foreign word equivalence classes, bilingual dictionaries)
- Guide users to follow step-by-step instructions for detailed queries
    - "Canned queries" (e.g. "Ask Jeeves") → query templates

Ask Jeeves Screen

User/System Interaction

– Query Models

- Relevance feedback
    - "Similar Pages" in Web search engines
    - From a simple query to better queries progressively
    - Limited recall capability of human beings → recognition of a relevant document is much easier
    - Intended to ease the difficulty of grasping the statistical properties of the entire collection
    - An indirect way of capturing the user needs
- User profile
    - To reflect the user's interest and orientation in interpreting user queries
    - Need to gather & analyze user log data and learn user models

User/System Interaction

– Result Presentation

- Information overload problem: too many documents retrieved
- A simple ranked list: title, author, URL, date, …
- Method 1: Organizing the retrieved documents
    - Result clustering (e.g. Vivisimo)
    - "Zoom-in" operation (e.g. Scatter & Gather)
- Method 2: Visualizing the retrieved documents
    - Overview of a large amount of information
    - Visual expression of document properties
    - E.g. TileBar

Scatter/Gather


Tile Bar


Result Clustering


Text Retrieval Overview

[Figure repeated: block diagram of text retrieval (Text Processing: raw text → text analysis → index; User/System Interaction; Search Engine; Knowledge Resources & Tools)]

Matching & Ranking (1)

- Inverted file, …

[Figure: inverted file structure. Query terms are looked up in a Directory whose entries hold a term, a weight (Wt), and pointers, e.g. 가구 ("furniture") 0.7, 가야 ("Gaya") 0.9, 신라 ("Silla") 0.9, 호랑이 ("tiger") 0.6; a Posting file lists document numbers; the documents shown are Doc #1, Doc #2, Doc #5.]

Matching & Ranking (2)

- Ranking
    - Retrieval model
        - Boolean (exact) => Fuzzy Set (inexact)
        - Vector Space
        - Probabilistic
        - Inference Net
        - …
    - Weighting schemes
        - Index terms, query terms
        - Parameters in formulas
        - Document characteristics
        - …

IR Model Example: Vector Space Model

[Figure: sample document text containing the terms cat, dog, and mouse; the query Q and document D1 are drawn as vectors in a term space with axes cat, dog, and mouse]

Q = <cat, mouse, 0>

$D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$
$Q = (q_1, q_2, \ldots, q_n)$
$\mathrm{Similarity}(D_i, Q) = \dfrac{D_i \cdot Q}{|D_i| \, |Q|}$
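A minimal sketch of the similarity formula above (the cosine of the angle between the document and query vectors). The axis order (cat, dog, mouse) and all weights are hypothetical, chosen only to show the computation.

```python
import math


def cosine(d, q):
    """Similarity = (D . Q) / (|D| * |Q|); returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Hypothetical term weights on the axes (cat, dog, mouse).
q = [1, 0, 1]
d1 = [1, 1, 0]
d2 = [0, 1, 0]
print(cosine(d1, q), cosine(d2, q))
```

Documents are then ranked by decreasing similarity; d2 shares no term with the query, so its score is zero.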

Matching & Ranking (3)

- Techniques for efficiency
    - New storage structure, esp. for new document types
    - Use of accumulators for efficient generation of ranked output
    - Compression/decompression of indexes
- Techniques for Web search engines
    - Use of hyperlinks
        - Inlinks & outlinks
        - Authority vs hub pages
    - In conjunction with directory services (e.g. Yahoo)
    - Softbot: storing terabytes of data and efficient crawling
    - ...

Web document retrieval – using hyperlinks

[Figure: initial retrieval set (documents containing TERM) and linked documents A, B, C; candidates for additional retrieval]

- A: Hub document
- B: Authority document
- Increase the weight of A, B
- To be ranked again using the link information
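The slide does not name an algorithm for computing hub and authority weights. One well-known option is a HITS-style iteration, sketched below under that assumption; the link graph and page names are hypothetical.

```python
def hits(links, iterations=20):
    """links: dict mapping a page to the list of pages it points to.

    Returns (hub, authority) score dicts, L2-normalized each round.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of hub scores of pages linking to it.
        auth = {p: sum(hub[s] for s in pages if p in links.get(s, ())) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page's hub score is the sum of authority scores of pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: A and D point to B; A also points to C.
links = {"A": ["B", "C"], "D": ["B"], "B": []}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the page with the most outlinks to well-cited pages ends up as the strongest hub, and the most-cited page as the strongest authority; such scores could then be combined with the text-based ranking.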

Characteristics of IR - summary

Unstructured vs Structured: Information Retrieval/Data Retrieval Spectrum

                      Information Retrieval      Data Retrieval
Retrieval Models      Probabilistic              Deterministic
Indexing              Derived from contents      Complete Items
Matching/Retrieval    Partial or "Best" Match    Exact Match
Query Types           Natural Language           Structured
Results Criteria      Relevance                  Any Match
Results Ordering      Ranked                     Arbitrary

Overall Directions (1)

- Efforts to improve retrieval effectiveness (as always!)
    - Retrieval model, text analysis and representation, user interactions, ...
- Specialized search: domain-specific
- Context awareness (personalization, task-centered)
    - Profile, session logs, task models, etc.
- Multi-something
    - Multimedia, multi-style, multilingual, …
- Distributed environment with a large quantity
    - Web search, meta-search, distributed retrieval (DB segmentation), meta-data retrieval, semantic Web
- New functionality
    - Filtering, TDT, classification, summarization, QA, information extraction, ...

Cross-Language & Multilingual IR

- Cross-language IR
    - Using a language (mother tongue) to retrieve documents in another language
    - To overcome the language differences
    - [terminology] cross-lingual, translingual (DARPA)
- Multilingual IR
    - "retrieving relevant document in any of the languages contained in a multilingual document collection"
- (CL + | ML) Document Retrieval
    - E.g.: using Korean queries to search a DB consisting of Korean, English and Japanese documents

CLIR

- The number of documents in languages other than one's own is rapidly increasing, and so is the need for retrieval.
    - The rate of annual increase for documents on the Web
        - English: 50%; all other languages: 90%
- Multilingual countries, organizations, enterprises, & users
- The limitation of machine translation technology
    - More economical to translate the necessary documents after retrieval
- Not easy to construct a query in a foreign language even with the ability to comprehend written materials in the language
    - Reading vs. writing
- CLIR is fundamental to other multilingual information access technology

CLIR Problem (example)

Query: 현대자동차 주식 동향 ("Hyundai Motor stock trend")

- Segmentation ambiguity: 현대 자동차 vs. 현 대자 동차 (??)
- Translation ambiguity
    - 주식: principal food, stocks, food and drink
    - 동향: same village, eastern exposure, trend

Retrieval of Structured Documents

- XML documents, hypertext, metadata, semantic Web
- Queries for structure and content
    - FIND a document that INCLUDES a chapter whose title CONTAINS the term "hypertext" AND whose section CONTAINS the term "browsing".
- Queries for content and link
    - FIND all documents about "information retrieval" that are referred to by a paper written by "Myaeng".
- Retrieval with ontology vs retrieval from ontology (e.g. RQL)

Classification - Motivation

Is this spam?

Sender: [email protected]
Subject: Your Business Listing - Global Trade Index
Date: 2003-09-03 (Wed) 6:40 AM
Size: 6 KB

Dear Site Owner,

You are invited to list your site at the most important Trade Directory on the Internet. This directory system has attributes no other directory on the Internet has had do date, check us out!

Manufacturers - Wholesalers - Distributors - Resellers and all businesses that are associated are welcome on our directory.

Your business will prosper from its association with our global resources!

There are NO CHARGES for a listing. Just click here enter your business details as you like.

Thank you for your kind assistance in this matter, and

The Team at EconomicGrowthNetwork.com
[email protected]
Toll Free - 866 516-8412

Classification - Motivation

How about these subject lines?

- Re: 요청하신 자료입니다. ("Re: Here is the material you requested.")
- 그렇게 가면 안되지. ("You shouldn't leave like that.")
- Get V^iagram in the convenience of your home
- Generi.c Cia.lis – Lasts 2 times longer then Via.gra!
- Re,no-va”te”*”you:r ‘d*ow_nst>;airs;^

Classification Problem

- E-mail classification
    - Given an e-mail message containing a question and/or complaint, where should it be sent in ERMS?
    - Categories: AS, subscription/unsubscription, passwords, upgrades, usage, about-products, other questions, other complaints
- A not-so-easy example (informal Korean with nonstandard spellings):
    - 제목 (Subject): 아뒤랑 비번를 잊어 먹었슴다. ("I forgot my ID and password.")
    - 본문 (Body): 안녕하세여… 오랫만에 제가 그만 홈에 들렀더니… 아뒤랑 pass를 잊어버렸네요 음… ("Hello… I stopped by the site after a long while and… I've forgotten my ID and pass, hmm…")

Problem Statement

- Given:
    - A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
        - How to represent text documents?
    - A fixed set of categories: $C = \{c_1, c_2, \ldots, c_n\}$
- Determine:
    - The category of $x$: $c(x) \in C$, where $c(x)$ is a categorization function whose domain is $X$ and whose range is $C$.
        - How to build categorization functions ("classifiers")?

A Typical Example for Document Classification

Testing data: "planning language proof intelligence"

Classes and training data:

Area           Class        Training data
(AI)           ML           learning intelligence algorithm reinforcement network...
(AI)           Planning     planning temporal reasoning plan language...
(Programming)  Semantics    programming semantics language proof...
(Programming)  Garb.Coll.   garbage collection memory optimization region...
(HCI)          Multimedia   ...
(HCI)          GUI          ...

Classification Methods

- Rule-based methods
    - E.g.: assign a category if the document contains a given Boolean combination of words
    - Accuracy is often very high if a profile has been carefully refined over time by a subject expert.
    - Building and maintaining these profiles is expensive.
- Inductive learning models
    - Naïve Bayesian model
    - Decision tree model
    - SVM (Support Vector Machine)
- Similarity-based models
    - K-nearest neighbor
    - Rocchio's model
- These require hand-classified training data, but can be built (and refined) easily.
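As an illustration of one of the inductive learning models listed above, here is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. The smoothing choice, the function names, and the use of one short training "document" per class (taken from the document classification example slide) are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect class document counts, per-class term counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        words = text.split()
        class_docs[label] += 1
        term_counts[label].update(words)
        vocab.update(words)
    return class_docs, term_counts, vocab


def classify_nb(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for w in text.split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((term_counts[label][w] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("ML", "learning intelligence algorithm reinforcement network"),
    ("Planning", "planning temporal reasoning plan language"),
    ("Semantics", "programming semantics language proof"),
    ("Garb.Coll.", "garbage collection memory optimization region"),
]
model = train_nb(training)
print(classify_nb(model, "garbage memory"))
print(classify_nb(model, "planning language proof intelligence"))
```

The second query mixes vocabulary from several classes, so the decision rests on small count differences; with realistic amounts of hand-classified training data the estimates become far more stable.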

Topic Detection & Tracking (TDT) (1)

- Event: a reported occurrence at a specific time and place, and the unavoidable consequences
    - Specific elections, accidents, crimes, natural disasters
    - "TWA-800 airplane crash" vs. "airplane accidents"
- Activity: a connected set of actions that have a common focus or purpose
    - Campaigns, investigations, disaster relief efforts
- Topic: a seminal event or activity, along with all directly related events and activities
- Story: a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic

TDT (2) – First Story Detection

- Automatically identify the first story on a new event from a stream of text
- To detect the first story that discusses a topic, for all topics

[Figure: stories on a time axis, marked by topic (Topic 1, Topic 2) and divided into first stories and not-first stories]

TDT (3) – More about FSD

- First story detection is an unsupervised learning task.
    - There is no supervised training.
- On-line vs. retrospective
    - On-line: flag the onset of new events from live news feeds as stories come in
    - Retrospective: detection consists of identifying the first story looking back over a longer period
- Lack of advance knowledge of new events, but access to unlabeled historical data as a contrast set
- Applications
    - Intelligence services
    - Finance: be the first to trade a stock

TDT (4) – Other Tasks

- Topic tracking
    - Once a topic has been detected, identify subsequent stories about it
    - Standard text classification task
    - However, very small training set (initially: 1!)
- Topic detection
    - Grouping stories from an accumulated collection
    - Event-based classification with multiple topics (events)
    - Retrospective

TDT (5)

- Characteristics of events
    - News articles on an event are temporally close to each other.
        - Lexical and temporal similarities
    - "Similar" news articles over an extended time period → different events
        - Use a time window to determine the scope of an event
    - Changes in the used terms and their frequencies → new event
- Use of clustering techniques (example)
    - Retrospective (TD): bottom-up clustering with a time window
    - Online (tracking)
        - Single-pass, incremental clustering
        - Incremental IDF
        - Use of a decaying function for old documents in computing the similarity
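A minimal sketch of single-pass, incremental clustering with a time-decayed similarity. Several details are assumptions made for brevity: Jaccard overlap of word sets stands in for a weighted similarity, the decay is an exponential with a hypothetical half-life in days, and incremental IDF is omitted. The stories are invented.

```python
def jaccard(a, b):
    """Overlap of two term sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0


def single_pass(stories, threshold=0.3, half_life=7.0):
    """stories: (day, text) pairs in time order.

    Returns clusters as lists of story indexes; the first index in each
    cluster is that event's first story.
    """
    clusters = []  # each: {"terms": set, "last_day": day, "members": [indexes]}
    for idx, (day, text) in enumerate(stories):
        terms = set(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            # Older clusters count for less: similarity halves every half_life days.
            decay = 0.5 ** ((day - c["last_day"]) / half_life)
            sim = jaccard(terms, c["terms"]) * decay
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["terms"] |= terms
            best["last_day"] = day
            best["members"].append(idx)
        else:
            clusters.append({"terms": terms, "last_day": day, "members": [idx]})
    return [c["members"] for c in clusters]


stories = [
    (0, "plane crash near new york"),
    (1, "investigators search plane crash site near new york"),
    (2, "election results announced"),
    (60, "plane crash near new york"),
]
print(single_pass(stories))
```

The last story is lexically identical to the first but arrives 60 days later, so the decayed similarity falls below the threshold and it starts a new event, matching the point that similar articles far apart in time usually describe different events.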

Recommender Systems (1)

- Recommender systems are a technological proxy for a social process.
    - We rely on recommendations from other people.
    - An information discovery model where people try to find other people with similar tastes and then ask them to suggest new things
- In a typical recommender system
    - People provide recommendations (input).
    - The system aggregates them and directs them to appropriate recipients.

Recommender Systems (2)

- Motivations
    - Can we automatically aggregate quotes like:
        - "I like this book; you might be interested in it"
        - "I saw this movie, you'll like it"
        - "Don't go see that movie!"
    - Finding new books, music, or movies, previously unknown to users
- Applications
    - Corporate intranets
        - Recommendations, finding domain experts, …
    - E-commerce
        - Product recommendations: amazon, CDNOW, …
    - Medical applications
        - Matching patients to doctors, clinical trials, …
    - Customer relationship management
        - Matching customer problems to internal experts in a support organization

Recommender Systems (3) - Types

- Collaborative/social-filtering system: aggregation of consumers' preferences and recommendations to other users based on similarity in behavioral patterns
- Content-based system: supervised machine learning used to induce a classifier to discriminate between interesting and uninteresting items for the user
- Knowledge-based system: knowledge about users and products used to reason about what meets the user's requirements, using discrimination trees, decision support tools, and case-based reasoning (CBR)
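A minimal sketch of the collaborative-filtering type: unseen items are scored by the ratings of other users, weighted by how similar those users are to the target user. Cosine similarity over rating vectors is one common choice and is an assumption here, as are the user names, items, and ratings.

```python
import math


def cosine_sim(r1, r2):
    """Cosine similarity of two rating dicts (item -> rating)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)


def recommend(ratings, user, k=1):
    """Return the top-k items the user has not rated, by similarity-weighted score."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine_sim(ratings[user], theirs)
        for item, rating in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"book1": 5, "book2": 4},
    "bob": {"book1": 5, "book2": 5, "book3": 4},
    "cat": {"book4": 5, "book2": 1},
}
print(recommend(ratings, "ann"))
```

Because "bob" rates the shared books much like "ann" does, his other book outranks the one liked by the dissimilar user "cat".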

Recommender Systems (4) - Example


Automatic Summarization (1)

- Functionality
    - Indicative: to determine if the document would be of any interest
    - Informative: to reflect the original content as faithfully as possible under the compression rate
    - Evaluative: evaluation of the original document
- Fluency
    - Fragmented
    - Connected text
- Users
    - Generic
    - User (query)-focused
- Target documents
    - Single vs multiple documents

Automatic Summarization (2)

[Figure: Source(s) → Intermediary Representation → Summary, through the stages Analysis, Selection/Condensation, and Presentation]

- Analysis: word frequencies, clue phrases, layout, syntax, semantics, discourse, pragmatics
- Selection / Condensation / Presentation: word count, clue phrases, statistical, structural, abstraction, aggregation, planning, realization, layout

Automatic Summarization (3) - Approaches

- Natural language understanding / generation
    - Build knowledge representation of text
    - Generate sentences summarizing content
    - Hard to do well
- Keyword summaries
    - Display most significant keywords
    - Easy to do
    - Hard to read, poor representation of content
- Sentence extraction
    - Extract key sentences
    - Medium hard
    - Summaries often don't read well
    - Good representation of content
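A minimal sketch of the sentence extraction approach: each sentence is scored by the average collection frequency of its words, and the top-scoring sentences are returned in their original order. The scoring rule, the tiny stopword list, and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "is", "and", "to", "in"})


def content_words(sentence):
    """Lowercased alphabetic tokens, minus a small hypothetical stopword list."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def extract_summary(text, n=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


text = (
    "Information retrieval finds documents. "
    "Retrieval systems rank documents by relevance. "
    "The weather was nice."
)
print(extract_summary(text))
```

Extracted sentences are copied verbatim, which is why such summaries represent the content well but often do not read smoothly.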

Question Answering (Q/A) (1)

- To provide an answer to a query, as opposed to a document
    - Query: for a factoid or exact answer
    - Result: ranked list of (answer string, document) pairs
        - Answer string: 50-250 bytes
        - Documents: to supplement the answer
- Examples from TREC-9
    - How much folic acid should an expectant mother get daily?
    - Who invented the paper clip?
    - Where is Rider College located?
    - Name a film in which Jude Law acted.

Question Answering (2)

- System flow (example)

[Figure: system flow with the components User Query, Query Analyzer, Rule Set, Query Categories, Thesaurus, Query Set, Retrieval Engine, Document Set, Retrieved Docs, Doc Analyzer, Candidates, Answer Extractor, Answer]

Question Answering (3)

Assignment of query categories:

- Query: What is the fare cost for the round trip between New York and London on Concorde?
- Rule applied: What [be] [ADJ] [NOUN] for
- Main phrase extracted: fare cost
- Categorization of the phrase: Financial loss
- Assign a query category

Information Extraction

foodscience.com-Job2
    JobTitle: Ice Cream Guru
    Employer: foodscience.com
    JobCategory: Travel/Hospitality
    JobFunction: Food Services
    JobLocation: Upper Midwest
    Contact Phone: 800-488-2611
    DateExtracted: January 8, 2001
    Source: www.foodscience.com/jobs_midwest.html
    OtherCompanyJobs: foodscience.com-Job1

Information Extraction (2)

- Goal: being able to answer semantic queries (a.k.a. "database queries") using "unstructured" natural language sources
    - Semi-structured textual documents
    - Transform this unstructured information into structured relations in a database/ontology.

Source text (fragments, translated from Korean): "... April, terror in Tokyo, Japan ... afternoon of the 11th ... Minister of Commerce ... passers-by ... Tokyo ... 5 ... damage ... homemade bomb ... seriously injured ... 2 cars ... in the case of Germany ..."

Extracted template:

Event    Date/time        Place    Target                  Casualties            Property damage
Terror   4.12 afternoon   Tokyo    Minister of Commerce    5 seriously injured   2 cars

Information Extraction (3) – flow (example)

[Figure: information extraction flow, from Grishman, 1997]

- document
- local text analysis
    - lexical analysis
    - name recognition
    - partial syntactic analysis
    - scenario pattern matching
- discourse analysis
    - coreference analysis
    - inference
- template generation
- extracted templates

Information Extraction: MUC (State of the Art 1997)

- NE: named entity recognition
- CO: coreference resolution
- TE: template element construction
- TR: template relation construction
- ST: scenario template production

Knowledge Extraction Vision

[Figure: multi-dimensional meta-data extraction along a timeline (months J F M A M J J A)]

- Functions: Topic Discovery, Concept Indexing, Thread Creation, Term Translation, Document Translation, Story Segmentation, Entity Extraction, Fact Extraction
- Meta-data
    - EMPLOYEE / EMPLOYER relationships: Jan Clesius works for Clesius Enterprises; Bill Young works for InterMedia Inc.
    - COMPANY / LOCATION relationships: Clesius Enterprises is in New York, NY; InterMedia Inc. is in Boston, MA
- Example topic and sources: India Bombing; NY Times, Andhra Bhoomi, Dinamani, Dainik Jagran

References

- Belkin, N.J. & Croft, W.B. (1987). Retrieval Techniques. In: Williams, M.E. (ed.), Annual Review of Information Science and Technology, 22, 109-145, New York: Elsevier & ASIS.
- E. Glover, S. Lawrence, M. Gordon, W. Birmingham, C. Lee Giles (2001). Web Search – Your Way. Comm. of the ACM, 44 (12), 97-102.
- Gregory Grefenstette (1998). "The Problem of Cross-Language Information Retrieval." In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
- Doug Oard et al. (1999). "Multilingual Information Discovery and AccesS (MIDAS)." D-Lib Magazine, 5 (10), Oct.
- Sung Hyon Myaeng et al. (1998). "A Flexible Model for Retrieval of SGML Documents." Proc. of the 21st ACM SIGIR Conference, Australia.

References

- James Allan (2002). "Introduction to Topic Detection and Tracking." In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
- Paul Resnick & Hal Varian (1997). "Recommender Systems." CACM 40 (3), March, pp 56-58.
- Badrul Sarwar et al. (2001). "Item-based Collaborative Recommendation Algorithms", http://citeseer.nj.nec.com/sarwar01itembased.html
- Karen Sparck Jones (1999). "Automatic summarizing: factors and directions." In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
- Ellen Voorhees (2000). "Overview of TREC-9 Question Answering Track."
- Ralph Grishman (1997). "Information Extraction: Techniques and Challenges." In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag. (See http://nlp.cs.nyu.edu/publication/index.shtml)
- Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.