Event Clustering as a Basis for First Story Detection
Topic Detection and Tracking :
Event Clustering as a Basis
for First Story Detection
AI-Lab
Jung Sung Won
Abstract
Topic Detection and Tracking (TDT)
The organization of information by event
rather than by subject
In this paper
Overview of the TDT research program
Discuss our approach to two of the TDT
problems
Event clustering (Detection)
First story detection
Introduction
Information Retrieval Research
Texts are usually indexed, retrieved and organized
on the basis of their subjects.
This research
Focusing on the events that are described by the
text rather than the broader subject it covers.
What is the major event discussed within this story?
Do these texts discuss the same event?
Not all texts can be reduced to a set of events
This work will necessarily apply only to texts that
have an event focus : announcements, news
Topic Detection and Tracking
Purpose of the research
To organize broadcast news stories by event
Story sources : television, radio, cable broadcasting
Automatic speech recognition (ASR) is required
TDT research was carried out in three phases
TDT-1 : mid-1996 to 1997
TDT-2 : 1998
TDT-3 : 1999
TDT-1, The Pilot Study (1/2)
A proof-of-concept effort
1st Project : definition of the problem
Event : some unique thing that happens at some point in
time (Allan et al., 1998a)
Ex) “computer virus detected at British Telecom,
March 3, 1993” ↔ “computer virus outbreaks”
Definition of three research problems
Segmentation
Detection
Event clustering
First story detection
Tracking
TDT-1, The Pilot Study (2/2)
Created a small evaluation corpus
15,683 news stories (CNN, Reuters)
Generated a set of 25 news topics
Employed a two-pronged method for
assigning relevance judgments between
topics and stories
Group 1 : read every story in the corpus
Group 2 : used a search engine to look for
stories on each of the 25 topics
TDT-2, A Full Evaluation
The Primary goal
Create a full-scale evaluation of the TDT tasks
begun in the pilot study.
Change
The two detection tasks were “merged” to create an
on-line version of the event clustering task.
Unlike TDT-1, processing is done in small groups
A revised evaluation method is used (Detection Error Tradeoff
graphs)
A larger corpus is used
TDT-3, Multi-Lingual TDT
Differences from TDT-2
The scope of the events is defined differently, taking the
tasks into account.
Introduction of multilingual sources
New evaluation corpus.
On-Line Clustering Algorithms
Previous clustering work
Retrospective environment
In this study
On-line solution to first story detection.
Steps of Clustering
Converting stories to a vector
Comparing stories and clusters
Applying a threshold to determine if clusters are
sufficiently similar
Classifier : the combination of vector and threshold
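The three steps above can be sketched as a single-pass loop. This is a minimal illustration, not the system described in the slides: it assumes raw term counts as the story vector, cosine similarity as the comparison, and a fixed threshold of 0.5. The names `to_vector`, `cosine`, and `single_pass_cluster` are hypothetical.

```python
import math
from collections import Counter


def to_vector(story: str) -> Counter:
    """Step 1: convert a story to a vector (raw term counts, an assumption)."""
    return Counter(story.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Step 2: compare a story vector with a cluster vector."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(stories, threshold=0.5):
    """Assign each story, in arrival order, to a cluster id."""
    clusters = []      # one summed term-count vector per cluster
    assignments = []
    for story in stories:
        vec = to_vector(story)
        scores = [cosine(vec, c) for c in clusters]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        # Step 3: apply the threshold to decide between merging and
        # starting a new cluster.
        if best is not None and scores[best] >= threshold:
            clusters[best].update(vec)
            assignments.append(best)
        else:
            clusters.append(Counter(vec))
            assignments.append(len(clusters) - 1)
    return assignments
```

Each story is seen once and the number of clusters is never fixed in advance, which is what the on-line setting requires.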
Related Clustering Work
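Existing clustering techniques
Agglomerative hierarchical clustering
Probabilistic approaches
These techniques assume that the items to be clustered are available in advance
Limitations in an on-line environment
The items to be clustered are not available in advance
Some algorithms require the number of clusters to be fixed, but
on-line the number of clusters is unknown
An algorithm capable of single-pass clustering is
needed → several such algorithms already exist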
Creating Story Vectors
Weight vectors are built using INQUERY
$tf = \dfrac{t}{t + 0.5 + 1.5 \cdot \dfrac{dl}{avg\_dl}}$
$idf = \dfrac{\log\left(\dfrac{C + 0.5}{df}\right)}{\log(C + 1)}$
$d_{j,k} = 0.4 + 0.6 \cdot tf_k \cdot idf_k$
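t : number of times a given lexical feature occurs in the story
dl : the story's length in words
avg_dl : average number of terms in a story
C : number of stories in the auxiliary corpus
df : number of stories in which the term appears (if df = 0, df is set to 1)
k : index of a word that appears in both the classifier and the story
d_{j,k} : weight of feature k in the story arriving at time j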
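The weighting formulas can be written out directly. The sketch below follows the tf, idf, and d_{j,k} definitions on this slide; the function names are hypothetical, and the natural logarithm is an assumption (the base cancels in the idf ratio).

```python
import math


def tf_component(t: float, dl: float, avg_dl: float) -> float:
    """tf = t / (t + 0.5 + 1.5 * dl / avg_dl)."""
    return t / (t + 0.5 + 1.5 * dl / avg_dl)


def idf_component(df: int, C: int) -> float:
    """idf = log((C + 0.5) / df) / log(C + 1), with df = 0 treated as df = 1."""
    df = max(df, 1)
    return math.log((C + 0.5) / df) / math.log(C + 1)


def feature_weight(t: float, dl: float, avg_dl: float, df: int, C: int) -> float:
    """d_{j,k} = 0.4 + 0.6 * tf_k * idf_k."""
    return 0.4 + 0.6 * tf_component(t, dl, avg_dl) * idf_component(df, C)
```

For a story of average length (dl = avg_dl), a feature occurring twice gets tf = 2 / (2 + 0.5 + 1.5) = 0.5. A feature that does not occur at all (t = 0) gets the baseline weight 0.4.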
Comparing Clusters
Comparing a story to a cluster or the contents of two
clusters
Single-link, complete-link, group-average
$sim(q_i, d_j) = \dfrac{\sum_{k=1}^{N} q_{i,k} \cdot d_{j,k}}{\sum_{k=1}^{N} q_{i,k}}$
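A minimal sketch of the comparison step follows. It assumes the sums run over the classifier's features and that a feature missing from the story takes the weight 0.4 (the value of d_{j,k} when tf = 0); both are assumptions, as are the names `sim` and `compare_to_cluster`. The three linkage strategies use the usual definitions in terms of similarity: maximum, minimum, and mean over the cluster's members.

```python
def sim(q: dict, d: dict, missing: float = 0.4) -> float:
    """Weighted sum of story weights d[k], normalised by the classifier weights q[k]."""
    total = sum(q.values())
    if total == 0:
        return 0.0
    return sum(w * d.get(k, missing) for k, w in q.items()) / total


def compare_to_cluster(story, members, sim_fn=sim, strategy="single-link"):
    """Compare a story with a cluster given as a list of member classifiers."""
    scores = [sim_fn(m, story) for m in members]
    if strategy == "single-link":       # closest member decides
        return max(scores)
    if strategy == "complete-link":     # farthest member decides
        return min(scores)
    if strategy == "group-average":     # all members count equally
        return sum(scores) / len(scores)
    raise ValueError(f"unknown strategy: {strategy}")
```

Single-link lets one similar earlier story pull a new story into a cluster, whereas complete-link requires the new story to resemble every member.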
Thresholds for Merging
Why a threshold is used
The decision for generating a new
cluster.
Clustering for first story detection
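Ex) the case of a threshold of 0.5
Time-based thresholds
Take the temporal characteristics of real news into account
For a classifier formulated at time i, the threshold for a story arriving at time j
is
$threshold(q_i, d_j) = 0.4 + \theta \cdot (sim(q_i, d_j) - 0.4) + \beta \cdot (date_j - date_i)$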
Decision Scores
$decision(q_i, d_j) = sim(q_i, d_j) - threshold(q_i, d_j)$
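The time-based threshold and the decision score can be sketched as follows. The structure, a 0.4 baseline plus a scaled similarity term plus a term that grows with the time gap, follows the slide. The parameter names `theta` and `beta`, their default values, and the use of integer day numbers for dates are assumptions made only for illustration.

```python
def threshold(sim_value: float, date_i: int, date_j: int,
              theta: float = 0.2, beta: float = 0.001) -> float:
    """0.4 + theta * (sim - 0.4) + beta * (date_j - date_i)."""
    return 0.4 + theta * (sim_value - 0.4) + beta * (date_j - date_i)


def decision_score(sim_value: float, date_i: int, date_j: int,
                   theta: float = 0.2, beta: float = 0.001) -> float:
    """decision = sim - threshold; a positive score is a positive classification."""
    return sim_value - threshold(sim_value, date_i, date_j, theta, beta)
```

Because the time term is added to the threshold, the same similarity value is less likely to yield a positive decision as the gap between the classifier and the story grows.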
Experimental Setting – Data
Evaluation Measures
Measures of Text classification effectiveness
Recall and precision
Misses : the system does not detect a new event
False alarms : the system indicates a story contains
a new event when in truth it does not
F1-Measure (Lewis and Gale, 1994) : 2PR/(P+R)
TDT Cost function
$Cost = cost_{fa} \cdot P(fa) \cdot (1 - P(topic)) + cost_{m} \cdot P(m) \cdot P(topic)$
P(fa) : the system false alarm rate
P(m) : miss probability
P(topic) : the prior probability that a story is relevant
to a topic
$cost_{fa} = cost_{m} = 1.0$
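The measures above are simple to compute from a contingency table. The sketch below is illustrative: the function names and the count arguments (`tp`, `fp`, `fn`, `tn`) are hypothetical, and the rates use the standard definitions (miss rate over on-topic stories, false alarm rate over off-topic stories).

```python
def rates(tp: int, fp: int, fn: int, tn: int):
    """Return (recall, precision, P(m), P(fa)) from contingency counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    p_miss = fn / (tp + fn)
    p_fa = fp / (fp + tn)
    return recall, precision, p_miss, p_fa


def f1_measure(precision: float, recall: float) -> float:
    """F1 = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def tdt_cost(p_fa: float, p_miss: float, p_topic: float,
             cost_fa: float = 1.0, cost_m: float = 1.0) -> float:
    """Cost = cost_fa * P(fa) * (1 - P(topic)) + cost_m * P(m) * P(topic)."""
    return cost_fa * p_fa * (1 - p_topic) + cost_m * p_miss * p_topic
```

With both costs set to 1.0, the cost is the miss and false alarm rates weighted by how likely a story is to be on or off topic, so a rare topic makes false alarms dominate.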
Event Clustering
First Story Detection
New topic : a topic whose event has not been previously
reported
Motivation
The property of time as a distinguishing feature of this
domain
The names of people, places, dates, and things : who,
what, when, where
Method
Use the event clustering method
If no classifier comparison results in a positive
classification decision for the current story, then the
current story has content not previously encountered,
and thus it contains discussion of a new topic
Difference : finding the start of each topic
On-line single link + time strategy
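The method can be sketched as an on-line loop in which every earlier story acts as a classifier and a story is flagged as a first story when no comparison yields a positive decision. This is an illustration under stated assumptions, not the experimental system: Jaccard overlap of word sets stands in for the INQUERY similarity, and `base`, `time_penalty`, and integer day numbers are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Stand-in similarity: shared words over all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def first_story_flags(stories, base=0.3, time_penalty=0.01):
    """stories: (day, text) pairs in arrival order; returns one bool per story."""
    seen = []   # (day, word set) of every earlier story
    flags = []
    for day, text in stories:
        terms = set(text.lower().split())
        # Single link: one positive decision against any earlier story is enough.
        # Time strategy: the threshold rises with the gap in days.
        positive = any(
            jaccard(terms, old_terms) - (base + time_penalty * (day - old_day)) > 0
            for old_day, old_terms in seen
        )
        flags.append(not positive)
        seen.append((day, terms))
    return flags
```

With these settings, a story that repeats an earlier one word for word is still flagged as new if it arrives 200 days later, which shows how strongly the time term can separate events that share vocabulary.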
First Story Detection Experiment
Discussion of First Story Detection
The Good News
Low false alarm rates
The Bad News
Empirically, only incremental improvements can be expected.
The limitation of the word-co-occurrence model
Association with topics that are heavily covered in the
news