Event Clustering as a Basis for First Story Detection


Topic Detection and Tracking:
Event Clustering as a Basis
for First Story Detection
AI-Lab
Jung Sung Won
Abstract

Topic Detection and Tracking (TDT)

The organization of information by event
rather than by subject

In this paper

Overview of the TDT research program
Discuss our approach to two of the TDT
problems
Event clustering (detection)
First story detection

Introduction

Information Retrieval Research


Texts are usually indexed, retrieved and organized
on the basis of their subjects.
This research





Focusing on the events that are described by the
text rather than the broader subject it covers.
What is the major event discussed within this story?
Do these texts discuss the same event?
Not all texts can be reduced to a set of events
This work will necessarily apply only to texts that
have an event focus: announcements, news
Topic Detection and Tracking

Purpose of the research

To organize broadcast news stories by event
Sources of stories: television, radio, cable
Automatic speech recognition (ASR) is required
TDT research was carried out in three phases
TDT-1: mid-1996 to 1997
TDT-2: 1998
TDT-3: 1999

TDT-1, The Pilot Study(1/2)

An effort at proof of concept

First task: definition of the problem

An event: some unique thing that happens at some point in
time (Allan et al., 1998a)
Ex) “computer virus detected at British Telecom,
March 3, 1993” ↔ “computer virus outbreaks”
Definition of three research problems


Segmentation
Detection



Event clustering
First story detection
Tracking
TDT-1, The Pilot Study(2/2)

created a small evaluation corpus



15,683 news stories (CNN, Reuters)
generated a set of 25 news topics
employed a two-pronged method for
assigning relevance judgments between
topics and stories

Group 1: read every story in the corpus
Group 2: used a search engine to look for
stories on each of the 25 topics
TDT-2, A Full Evaluation

The primary goal

Create a full-scale evaluation of the TDT tasks
begun in the pilot study.
Changes
The two detection tasks were “merged” to create an
on-line version of the event clustering task.
Unlike TDT-1, relevance judgments were made by a small group
A revised evaluation method was used (Detection Error
Tradeoff graphs)
A much larger corpus was used

TDT-3, Multi-Lingual TDT

Differences from TDT-2

The scope of the events that make up each task
differs
Introduction of multi-lingual sources
New evaluation corpus
On-Line Clustering Algorithms

Previous clustering work

Retrospective environment

In this study

On-line solution to first story detection

Steps of clustering
Converting stories to a vector
Comparing stories and clusters
Applying a threshold to determine if clusters are
sufficiently similar
Classifier: the combination of vector and threshold
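The four clustering steps above can be sketched as a single-pass loop. This is a minimal illustration, not the paper's implementation: the vectorizer is a raw term-frequency count rather than the INQUERY weights described later, and the 0.5 threshold is an arbitrary assumption.

```python
from collections import Counter
from math import sqrt

def to_vector(story):
    """Step 1: convert a story to a term-frequency vector (simplified)."""
    return Counter(story.lower().split())

def cosine(v1, v2):
    """Step 2: compare two vectors with cosine similarity."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def single_pass_cluster(stories, threshold=0.5):
    """Steps 3-4: assign each arriving story to the most similar
    cluster, or start a new cluster when no similarity reaches
    the threshold (single-pass, on-line)."""
    clusters = []     # each cluster: list of member vectors
    assignments = []  # cluster index chosen for each story, in order
    for story in stories:
        v = to_vector(story)
        best, best_sim = None, threshold
        for idx, cluster in enumerate(clusters):
            # single-link: similarity to the closest cluster member
            sim = max(cosine(v, m) for m in cluster)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            clusters.append([v])
            assignments.append(len(clusters) - 1)
        else:
            clusters[best].append(v)
            assignments.append(best)
    return assignments

print(single_pass_cluster([
    "virus detected at british telecom",
    "british telecom virus outbreak spreads",
    "earthquake strikes northern region",
]))  # → [0, 0, 1]
```

The first two stories share enough vocabulary to merge; the third matches nothing and opens a new cluster.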

Related Clustering Work

Existing clustering techniques
Agglomerative hierarchical clustering
Probabilistic approaches
These techniques assume the documents to be clustered
are available in advance

Limitations in an on-line environment
The documents to cluster are not available in advance
Some algorithms require the number of clusters to be
fixed, but on-line the number of clusters cannot be
known in advance

An algorithm capable of single-pass clustering is
needed → several such algorithms already exist
Creating Story Vectors

Weight vectors are built using INQUERY

tf = t / (t + 0.5 + 1.5 * (dl / avg_dl))
idf = log((C + 0.5) / df) / log(C + 1)
d_j,k = 0.4 + 0.6 * tf_k * idf_k

t : number of occurrences of a lexical feature in the story
dl : the story's length in words
avg_dl : average story length in words
C : number of stories in an auxiliary corpus
df : number of stories in which the term appears (if df = 0, set df = 1)
k : index of a feature appearing in both the classifier and the story
d_j,k : weight of feature k in the story arriving at time j
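The three formulas above transcribe directly into a helper (the function name `inquery_weight` is ours, not the paper's):

```python
from math import log

def inquery_weight(t, dl, avg_dl, C, df):
    """INQUERY feature weight d_j,k = 0.4 + 0.6 * tf * idf.
    t: occurrences of the feature in the story
    dl: story length in words; avg_dl: average story length
    C: stories in the auxiliary corpus
    df: stories containing the feature (df = 1 when df = 0)."""
    df = max(df, 1)
    tf = t / (t + 0.5 + 1.5 * (dl / avg_dl))
    idf = log((C + 0.5) / df) / log(C + 1)
    return 0.4 + 0.6 * tf * idf
```

A feature absent from the story (t = 0) gets the floor weight 0.4; more occurrences of a rarer term push the weight toward 1.0.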
Comparing Clusters

Comparing a story to a cluster or the contents of two
clusters

Single-link, complete-link, group-average

sim(q_i, d_j) = ( Σ_{k=1..N} q_i,k · d_j,k ) / ( Σ_{k=1..N} q_i,k )
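The comparison formula above can be transcribed as follows; here `q` and `d` are sparse dicts mapping feature index to weight (the representation is an illustrative assumption):

```python
def classifier_similarity(q, d):
    """sim(q_i, d_j): dot product of classifier and story weights,
    normalized by the sum of the classifier's weights, as in the
    formula above."""
    num = sum(w * d.get(k, 0.0) for k, w in q.items())
    den = sum(q.values())
    return num / den if den else 0.0
```

Only features present in the classifier contribute, so the sum over k runs over the classifier's vocabulary.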
Thresholds for Merging

Why a threshold is used

The decision of whether to generate a new
cluster
Clustering for first story detection

Ex) with a threshold of 0.5
Time-based thresholds

Reflect the temporal characteristics of real news
The threshold for a story arriving at time j, evaluated
against a classifier built at time i, is
threshold(q_i, d_j) = 0.4 + α · (sim(q_i, d_j) - 0.4) + β · (date_j - date_i)

Decision Scores
decision(q_i, d_j) = sim(q_i, d_j) - threshold(q_i, d_j)
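The time-based threshold and decision score can be sketched as below. The values of `alpha` and `beta` are illustrative placeholders, not the tuned parameters from the experiments:

```python
def time_threshold(sim, days_apart, alpha=0.2, beta=0.01):
    """threshold(q_i, d_j) = 0.4 + alpha*(sim - 0.4) + beta*(date_j - date_i).
    Older classifiers demand a higher similarity before merging."""
    return 0.4 + alpha * (sim - 0.4) + beta * days_apart

def decision_score(sim, days_apart):
    """decision(q_i, d_j) = sim - threshold; positive means the story
    merges with the cluster, negative means it does not."""
    return sim - time_threshold(sim, days_apart)
```

With these placeholder parameters, a similarity of 0.6 merges against a same-day classifier but fails against one built a month earlier, which is the intended time effect.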
Experimental Setting – Data
Evaluation Measures

Measures of text classification effectiveness
Recall and precision
Misses: the system does not detect a new event
False alarms: the system indicates a story contains
a new event when in truth it does not
F1-Measure (Lewis and Gale, 1994): 2PR / (P + R)
TDT cost function

Cost = cost_fa · P(fa) · (1 - P(topic)) + cost_m · P(m) · P(topic)

P(fa) : the system false alarm rate
P(m) : miss probability
P(topic) : the prior probability that a story is relevant
to a topic
cost_fa = cost_m = 1.0
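Both evaluation measures transcribe directly; the default costs match the cost_fa = cost_m = 1.0 setting above:

```python
def f1(precision, recall):
    """F1-Measure (Lewis and Gale, 1994): 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

def tdt_cost(p_fa, p_miss, p_topic, cost_fa=1.0, cost_miss=1.0):
    """TDT cost function:
    Cost = cost_fa * P(fa) * (1 - P(topic)) + cost_m * P(m) * P(topic)."""
    return cost_fa * p_fa * (1 - p_topic) + cost_miss * p_miss * p_topic
```

Note that lower cost is better, while higher F1 is better, so the two measures rank systems in opposite directions.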
Event Clustering
First Story Detection
New topic: a topic whose event has not been previously
reported
Motivation
The property of time as a distinguishing feature of this
domain
The names of people, places, dates, and things: who,
what, when, where
Method
Use the event clustering method
If no classifier comparison results in a positive
classification decision for the current story, then the
current story has content not previously encountered,
and thus it contains discussion of a new topic
Difference: finding the start of each topic
On-line single link + time strategy
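The first-story rule above (a story starts a new topic exactly when no existing classifier yields a positive decision) can be sketched as follows; `alpha` and `beta` are illustrative values, not tuned parameters:

```python
def is_first_story(similarities):
    """Return True if the story starts a new topic.
    `similarities` holds one (sim, days_since_classifier) pair per
    existing classifier; decision = sim - threshold as defined earlier."""
    def decision(sim, days, alpha=0.2, beta=0.01):
        return sim - (0.4 + alpha * (sim - 0.4) + beta * days)
    # New topic iff no comparison produces a positive decision.
    return all(decision(s, d) <= 0 for s, d in similarities)
```

The very first story trivially has no classifiers to compare against, so it is always a first story; the time penalty means even a fairly similar classifier stops suppressing new-topic decisions once it is old enough.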

First Story Detection Experiment
Discussion of First Story Detection

The Good News


Low false alarm rates
The Bad News
Empirically, only incremental improvement can be expected
The limitation of the word co-occurrence model
Association with topics that are heavily covered in the
news
