Presentation - SNOW Workshop

SNOW 2014 Data Challenge
Two-level message clustering for topic detection in Twitter
Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH)
WWW 2014
Seoul, April 8th
• Applied approach:
– Topic detection
– Title extraction
– Keyword extraction
– Representative tweets selection
– Relevant image extraction
• Evaluation
• Conclusions
• Duplicate tweet aggregation:
– Performed via simple hashing (very fast but does not capture near
duplicates such as some retweets)
– Counts kept for subsequent processing
• Language-based filtering:
– Only content in English is kept
– Public Java implementation used
• Significant computational benefit for subsequent steps, e.g.,
for the first timeslot:
– Originally: 15,090 tweets
– After duplicate aggregation: 7,546 unique tweets
– After language filtering: 6,359 unique tweets written in English
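A minimal sketch of exact-duplicate aggregation by hashing, assuming tweets arrive as plain strings. The function name and the choice of SHA-256 are illustrative and are not taken from the slides. As noted above, a retweet with a different prefix hashes differently and is therefore not merged.

```python
import hashlib
from collections import Counter

def aggregate_duplicates(texts):
    """Collapse exact duplicates and keep a count per unique text."""
    counts = Counter()
    first_seen = {}  # hash -> first text observed with that hash
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        counts[key] += 1
        first_seen.setdefault(key, text)
    # Dicts preserve insertion order, so output follows first appearance.
    return [(first_seen[k], counts[k]) for k in first_seen]

tweets = [
    "Goal for Arsenal!",
    "RT @bob: Goal for Arsenal!",  # near duplicate: not merged
    "Goal for Arsenal!",
]
unique = aggregate_duplicates(tweets)
```

The counts can then be reused later as a popularity signal.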
Topic detection (1/2)
• Different types of topic detection algorithms:
– Feature-pivot
– Document-pivot
– Probabilistic
• We opt for a document-pivot approach as we recognize that:
– A reply tweet typically refers to the same topic as the tweet to which
it is a reply.
– Tweets that include the same URL refer to the same topic
• Such information is readily available and cannot be easily
taken into account by other types of topic detection algorithms
• We generate first-level clusters by grouping together tweets
based on the above relationships (using a Union-Find structure)
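The grouping by reply and shared-URL relationships can be sketched with a union-find (disjoint-set) structure. The tweet fields `id`, `reply_to` and `urls` are assumed names for illustration, and only replies to tweets present in the same batch are linked, which is also an assumption.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def first_level_clusters(tweets):
    """Group tweets linked by a reply relation or by a shared URL."""
    uf = UnionFind()
    known_ids = {t["id"] for t in tweets}
    url_owner = {}  # URL -> id of the first tweet seen with it
    for t in tweets:
        uf.find(t["id"])  # register the tweet
        if t.get("reply_to") in known_ids:
            uf.union(t["reply_to"], t["id"])
        for url in t.get("urls", []):
            if url in url_owner:
                uf.union(url_owner[url], t["id"])
            else:
                url_owner[url] = t["id"]
    groups = {}
    for t in tweets:
        groups.setdefault(uf.find(t["id"]), []).append(t["id"])
    # Singletons are not first-level clusters.
    return [ids for ids in groups.values() if len(ids) > 1]

tweets = [
    {"id": 1, "urls": ["http://a.example/x"]},
    {"id": 2, "reply_to": 1},
    {"id": 3, "urls": ["http://a.example/x"]},
    {"id": 4},
    {"id": 5, "reply_to": 99},  # parent not in this batch
]
clusters = first_level_clusters(tweets)
```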
Topic detection (2/2)
• Not all tweets will belong to a first-level cluster; thus, we
perform a second-level clustering.
• We perform an incremental, threshold-based clustering
procedure that utilizes locality-sensitive hashing (LSH):
– For each tweet, find the best-matching item that has already been examined. If
its similarity to that item (using tf-idf and cosine similarity) is above some threshold,
assign it to the same cluster; otherwise, create a new cluster.
– If the examined tweet belongs to a first-level cluster, assign the other tweets
from the first-level cluster to the same second-level cluster (either existing or
new) and do not further consider these tweets
• Additionally, in order to reduce fragmentation:
– We use the lemmatized version of terms (Stanford), instead of their raw form
– We boost entities and hashtags by a constant factor (=1.5)
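A rough sketch of the incremental procedure, under several simplifying assumptions that are not in the slides: the best match is found by brute-force comparison rather than LSH, idf is computed once over the whole batch, lemmatization is skipped, and only hashtags (not named entities) receive the 1.5 boost. The `first_level` argument is a hypothetical mapping from a tweet index to the indices in its first-level cluster.

```python
import math
from collections import Counter

BOOST = 1.5  # constant boost factor quoted on the slide

def tfidf_vectors(docs):
    """docs: list of token lists. Returns L2-normalised sparse vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        vec = {}
        for term, tf in Counter(doc).items():
            weight = tf * (math.log((1 + n) / (1 + df[term])) + 1.0)
            if term.startswith("#"):
                weight *= BOOST
            vec[term] = weight
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors

def cosine(a, b):
    # Vectors are already normalised, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def incremental_clustering(vectors, threshold, first_level=None):
    first_level = first_level or {}
    assignment = {}  # tweet index -> second-level cluster id
    examined = []    # indices that later tweets are compared against
    next_cluster = 0
    for i, vec in enumerate(vectors):
        if i in assignment:
            continue  # placed earlier through its first-level cluster
        best, best_sim = None, -1.0
        for j in examined:
            sim = cosine(vec, vectors[j])
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None and best_sim > threshold:
            cluster = assignment[best]
        else:
            cluster = next_cluster
            next_cluster += 1
        assignment[i] = cluster
        examined.append(i)
        # Pull in the rest of the first-level cluster without examining it.
        for j in first_level.get(i, []):
            assignment.setdefault(j, cluster)
    return assignment

docs = [
    "earthquake hits chile #chile".split(),
    "strong earthquake hits chile #chile".split(),
    "new phone released today".split(),
    "so sad about this".split(),  # a reply to tweet 0
]
first_level = {0: [0, 3], 3: [0, 3]}
assignment = incremental_clustering(tfidf_vectors(docs), 0.5, first_level)
```

In this toy batch the reply shares no terms with the tweet it answers, yet it lands in the same cluster through the first-level link, which is the kind of fragmentation the two-level scheme is meant to avoid.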
• Each second-level cluster is treated as a (fine-grained) topic.
• A very large number of topics per timeslot is produced (e.g. 2,669 for the
first timeslot) but we need to return only 10 per timeslot
• We recognize that the granularity and hierarchy of topics are important for
ranking: fine-grain subtopics of popular coarse-grain topics should be
ranked higher than other fine-grain topics that are not subtopics of a
popular coarse-grain topic
• To cater for this we:
– Detect coarse-grain topics by running the document-pivot procedure again (i.e. a third
clustering process), this time boosting entities and hashtags further (not by a
constant factor, but by a factor linear in their frequency)
– Map each fine-grain topic to a coarse-grain topic to obtain a two-level hierarchy
– Rank the coarse-grain topics by the number of tweets they contain
– Rank the fine-grain topics within each coarse-grain topic again by the number of tweets
they contain
• Apply a simple heuristic procedure to select the first few fine-grain topics
from the first few coarse-grain topics
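The two-level ranking can be illustrated as follows. The slides do not spell out the selection heuristic, so the cap of `per_coarse` fine-grain topics per coarse-grain topic is purely an assumption, as is treating the size of a coarse-grain topic as the sum of its mapped fine-grain topics.

```python
from collections import defaultdict

def rank_and_select(fine_topics, total=10, per_coarse=2):
    """fine_topics: list of (fine_id, coarse_id, tweet_count)."""
    by_coarse = defaultdict(list)
    coarse_size = defaultdict(int)
    for fine_id, coarse_id, count in fine_topics:
        by_coarse[coarse_id].append((fine_id, count))
        coarse_size[coarse_id] += count
    # Rank coarse-grain topics by the number of tweets they contain.
    coarse_ranked = sorted(coarse_size, key=lambda c: coarse_size[c], reverse=True)
    selected = []
    for coarse_id in coarse_ranked:
        # Rank fine-grain topics inside the coarse-grain topic the same way.
        fine_ranked = sorted(by_coarse[coarse_id], key=lambda fc: fc[1], reverse=True)
        for fine_id, _ in fine_ranked[:per_coarse]:
            selected.append(fine_id)
            if len(selected) == total:
                return selected
    return selected

topics = [
    ("f1", "A", 50), ("f2", "A", 30), ("f3", "A", 5),
    ("f4", "B", 40),
    ("f5", "C", 10), ("f6", "C", 8),
]
top = rank_and_select(topics, total=4, per_coarse=2)
```

Here `f3` is skipped even though it belongs to the most popular coarse-grain topic, because that topic has already contributed its two allowed subtopics.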
Title extraction
• For each topic, we obtain a set of candidate titles by splitting
assigned tweets into sentences (Stanford NLP library used)
• Each candidate title gets a score depending on its frequency
and the average likelihood of appearance of the words in it in
an independent corpus
• Rank candidate titles and return the one with the highest score
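A hedged sketch of title scoring. The slides state only that the score depends on frequency and on the average word likelihood in an independent corpus; multiplying the two is an assumed combination for illustration. The regex sentence splitter stands in for the Stanford NLP library, and `corpus_prob` is a hypothetical table of unigram probabilities.

```python
import re
from collections import Counter

def candidate_titles(tweets):
    """Split tweets into sentences and count identical sentences."""
    counts = Counter()
    for text in tweets:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            sentence = sentence.strip()
            if sentence:
                counts[sentence] += 1
    return counts

def score(sentence, frequency, corpus_prob, floor=1e-6):
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        return 0.0
    # Unseen words get a small floor probability (an assumption).
    avg = sum(corpus_prob.get(w, floor) for w in words) / len(words)
    return frequency * avg  # assumed combination, not from the slides

def best_title(tweets, corpus_prob):
    counts = candidate_titles(tweets)
    if not counts:
        return None
    return max(counts, key=lambda s: score(s, counts[s], corpus_prob))

tweets = [
    "Earthquake hits Chile. Stay safe!",
    "Earthquake hits Chile.",
    "Earthquake hits Chile.",
    "omg",
]
corpus_prob = {"earthquake": 0.001, "hits": 0.002, "chile": 0.001,
               "stay": 0.003, "safe": 0.002}
title = best_title(tweets, corpus_prob)
```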
Keyword extraction
• We opt for phrases rather than unigrams, because phrases are more
descriptive and less ambiguous than unigrams
• For each topic, we obtain a set of candidate keywords by detecting the
noun phrases and verb phrases in the assigned tweets
• As in the case of titles, each candidate keyword gets a score depending on
its frequency and the average likelihood of appearance of the words in it
in an independent corpus
• Rank candidate keywords
• Find the position in the ranked list with the largest score gap and select
the keywords until that point
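The largest-gap cut-off can be sketched as below, assuming each candidate keyword already has a score. The function name is illustrative.

```python
def select_by_largest_gap(scored):
    """scored: list of (keyword, score). Keep everything ranked above
    the largest drop between consecutive scores."""
    ranked = sorted(scored, key=lambda ks: ks[1], reverse=True)
    if len(ranked) < 2:
        return [k for k, _ in ranked]
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))  # index of the last keyword to keep
    return [k for k, _ in ranked[:cut + 1]]

candidates = [
    ("hits", 0.30),
    ("earthquake in chile", 0.90),
    ("tsunami warning", 0.80),
    ("say", 0.25),
]
keywords = select_by_largest_gap(candidates)
```

The number of keywords therefore adapts to each topic instead of being fixed in advance.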
Representative tweets selection
• Related tweets for each topic are readily available since we
apply a document-pivot approach
• Satisfactory diversity is achieved by not considering duplicates
(pre-processing) and by considering replies (as part of the
core topic detection procedure)
• Selection: first the most popular tweet, then all replies, and then
the remaining tweets by popularity (until gathering 10 tweets).
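One possible reading of the selection order, as a sketch. The fields `popularity` and `is_reply` are assumed names, and ordering the replies by popularity is an assumption not stated on the slide.

```python
def select_representatives(tweets, limit=10):
    """Most popular tweet first, then replies, then the rest by popularity."""
    by_popularity = sorted(tweets, key=lambda t: t["popularity"], reverse=True)
    if not by_popularity:
        return []
    top = by_popularity[0]
    replies = [t for t in by_popularity if t["is_reply"] and t is not top]
    rest = [t for t in by_popularity if not t["is_reply"] and t is not top]
    selected = [top]
    for t in replies + rest:
        if len(selected) == limit:
            break
        selected.append(t)
    return [t["id"] for t in selected]

tweets = [
    {"id": 1, "popularity": 5, "is_reply": False},
    {"id": 2, "popularity": 9, "is_reply": False},
    {"id": 3, "popularity": 1, "is_reply": True},
    {"id": 4, "popularity": 7, "is_reply": False},
]
chosen = select_representatives(tweets, limit=3)
```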
Relevant image extraction
Three cases:
• If there are images in the tweets assigned to the topic, return
the most frequent image
• If not, query the Google search API with the title and return
the first image returned
• If a result is not fetched (possibly because the title is too
limiting), query the Google search API again, this time with
the most popular keyword
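The three cases form a simple fallback chain, sketched below. `search_images` is a hypothetical callable standing in for the image search call; it is not a real client API, and `keywords` is assumed to be ordered with the most popular keyword first.

```python
from collections import Counter

def pick_image(topic_image_urls, title, keywords, search_images):
    """search_images: hypothetical callable(query) -> list of image URLs."""
    # Case 1: images attached to the topic's tweets; take the most frequent.
    if topic_image_urls:
        return Counter(topic_image_urls).most_common(1)[0][0]
    # Case 2, then case 3: search by title, then by the top keyword.
    for query in [title] + keywords[:1]:
        results = search_images(query)
        if results:
            return results[0]
    return None

def fake_search(query):
    # Stand-in for a real search call, used only for this example.
    if query == "earthquake":
        return ["http://img.example/k.jpg"]
    return []

image = pick_image([], "A very long and specific title",
                   ["earthquake", "chile"], fake_search)
```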
Evaluation (1/2)
• Significant computational benefit from pre-processing steps
• Typically, a few hundred first-level clusters
Evaluation (2/2)