BlogVox: Separating Blog Wheat from
Blog Chaff
Akshay Java, Pranam Kolari, Tim Finin, Anupam
Joshi, Justin Martineau (UMBC)
James Mayfield (JHU/APL)
Motivation: Cleaning the Harvest
• BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.
• Spam blogs (splogs) and extraneous content water down the quality of the index.
• Narrowing down to the content of the post is essential when opinion sentences are not clearly demarcated (as they are on Epinions, IMDB, Amazon, etc.).
• Noisy and unstructured text on the Blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).
BlogVox Opinion Extraction System
BlogVox
• TREC 06: Finding opinionated posts, either positive or negative, about a query
• 2006 TREC Blog corpus:
  • 80K blogs
  • 300K posts
  • 50 test queries
• BlogVox opinion extraction system
  • Document and sentence level scorers
  • Combined scores using an SVM meta-learner
  • Data cleaning: splogs and post identification
BlogVox challenges
• Data cleaning and splog removal
• Slang
• Semantic orientation of words
• Contradictions, sarcasm, ungrammatical text
Separating Blog Wheat from Blog Chaff
Data cleaning for
• Splog removal
• Post content identification
Pre Indexing Steps
[Diagram: collection parsing (step 1) and non-English blog removal (step 2), followed by splog detection and title and content extraction (steps 3 and 4).]
Spam in the Blogosphere
• Types: comment spam, ping spam, splogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 56% of blogs are spam (ebiquity 2005)
• 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads
• “Spings, or ping spam, are pings that are sent from spam blogs”
Motivation: host ads
Motivation: index affiliates, promote PageRank
Data Cleaning: Splogs
• Host ads
• Index affiliates, promote PageRank
• Plagiarized content
Splog Detection Performance
• Splog detection using SVM
• 700 blogs, 700 splogs used for
training
• Model based on blog homepage
and local blog features
Nature of Splogs in TREC 2006
• Around 83K identifiable blog home-pages in the
collection, with 3.2M permalinks
• 81K blogs could be processed
• We use splog detection models developed on blog
home-pages; 87% accuracy
• We identified 13,542 splogs
• Blacklisted 543K permalinks from these splogs
• ~16% of the entire collection
• ~17% splog posts injected into TREC dataset [1]
[1] C. Macdonald, I. Ounis. The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection.
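The percentages on this slide can be sanity-checked from the counts it reports. The short Python sketch below recomputes the splog share from the rounded figures above; the variable names are illustrative, and the slide does not say which ratio its "~16%" refers to, so all three are shown.

```python
# Rounded counts as reported on the slide.
blogs_found = 83_000              # identifiable blog home pages
blogs_processed = 81_000          # blogs that could be processed
permalinks_total = 3_200_000      # permalinks in the collection
splogs_found = 13_542             # blogs flagged as splogs
permalinks_blacklisted = 543_000  # permalinks belonging to flagged splogs


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


splog_share_processed = share(splogs_found, blogs_processed)
splog_share_found = share(splogs_found, blogs_found)
permalink_share = share(permalinks_blacklisted, permalinks_total)

print(f"splogs / processed blogs:      {splog_share_processed:.1f}%")
print(f"splogs / identified blogs:     {splog_share_found:.1f}%")
print(f"blacklisted / all permalinks:  {permalink_share:.1f}%")
```

With these rounded inputs the three ratios fall between roughly 16% and 17%, which is consistent with the approximate figures quoted on the slide.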
Impact of Splogs in TREC Queries
[Figure: Distribution of splogs that appear in TREC queries. x-axis: top 100 search results ranked using TFIDF scoring; y-axis: number of splogs (0 to 120). Labeled curves: Cholesterol, Hybrid Cars, American Idol. Legend entries: queries numbered 851 to 898.]
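The splog distribution plot for TREC queries can be approximated in a few lines: rank posts for a query by TF-IDF, then count how many known splogs appear within the top k results. The Python sketch below is a minimal illustration under stated assumptions, not the BlogVox implementation. It uses one common TF-IDF variant (length-normalized term frequency times log inverse document frequency), assumes the plot is a cumulative count, and runs on a toy corpus whose document ids and texts are invented for the example.

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Rank document ids by the summed TF-IDF score of the query terms."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                score += (tf[term] / len(tokens)) * math.log(n_docs / df[term])
        scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def cumulative_splogs(ranking, splog_ids):
    """For each rank k, count how many of the top-k results are known splogs."""
    counts, seen = [], 0
    for doc_id in ranking:
        seen += doc_id in splog_ids
        counts.append(seen)
    return counts


# Toy corpus (hypothetical): one keyword-stuffed splog among ordinary posts.
docs = {
    "post-a": "hybrid cars save fuel and hybrid cars are quiet",
    "post-b": "my weekend trip to the lake",
    "post-c": "new cars reviewed this week",
    "splog-1": "hybrid cars hybrid cars cheap hybrid cars deals",
}
ranking = tfidf_rank("hybrid cars", docs)
counts = cumulative_splogs(ranking, {"splog-1"})
print(ranking)
print(counts)
```

In this toy example the keyword-stuffed document ranks first, which mirrors the concern on the slide: plain TF-IDF scoring rewards term repetition, so splogs can surface near the top of the result list.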
Higher in Spam Prone Contexts
[Figure: Splog distribution for 'spam terms'. x-axis: search result rank (0 to 100); y-axis: number of splogs (0 to 120). Labeled curves: Card, Interest, Mortgage. Legend entries: 1 to 28.]
Spam query terms based on analysis by Macdonald et al. 2006.
Data Cleaning: Content Identification
[Figure: a blog page annotated with its regions: navigation, ads, post content, recent posts.]
Data cleaning: Baseline heuristic
[Figure: an example DOM tree with regions labeled navigational links, post content, sidebar, ads.]
Eliminate link a if there exists a link b such that:
• b is within distance θ of a
• There are no title tags between the two links
• The average length of the text-bearing nodes between them is less than a threshold
• b is the nearest link to a
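The link-elimination rule on this slide can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not in the slides: the DOM is flattened into a document-order list of nodes, "distance" is the difference between list positions, the text-bearing nodes considered are those between the two links, and the values of `theta` and `max_avg_len` are placeholders. The `Node` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Node:
    kind: str       # "link", "text", or "title" (hypothetical flattened DOM)
    text: str = ""


def links_to_eliminate(nodes: List[Node], theta: int = 3,
                       max_avg_len: float = 20.0) -> Set[int]:
    """Return the positions of links that the baseline rule would eliminate."""
    link_pos = [i for i, n in enumerate(nodes) if n.kind == "link"]
    eliminated = set()
    for a in link_pos:
        others = [b for b in link_pos if b != a]
        if not others:
            continue
        b = min(others, key=lambda j: abs(j - a))  # b is the nearest link to a
        lo, hi = sorted((a, b))
        if hi - lo > theta:                        # b must be within theta of a
            continue
        between = nodes[lo + 1:hi]
        if any(n.kind == "title" for n in between):  # no title tag in between
            continue
        text_lens = [len(n.text) for n in between if n.kind == "text"]
        avg_len = sum(text_lens) / len(text_lens) if text_lens else 0.0
        if avg_len < max_avg_len:                  # little text between links
            eliminated.add(a)
    return eliminated


# Toy page: three navigation links, then a post whose body contains one link.
page = [
    Node("link", "Home"),
    Node("link", "About"),
    Node("link", "Archive"),
    Node("title", "My trip"),
    Node("text", "We spent the whole weekend hiking around the lake and took photos."),
    Node("link", "photo album"),
]
removed = links_to_eliminate(page)
print(sorted(removed))
```

On the toy page the three adjacent navigation links are flagged, while the link inside the post survives because a title node separates it from its nearest neighboring link.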
Data cleaning: SVM cleaner
• Random collection of 150 blog posts
• Human evaluation of 400 links, each tagged as a content link or an extraneous link
• We trained an SVM with a linear kernel for this analysis
DOM Features
• Tag features
• Position features
• Word features
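The slides name three feature groups (tag, position, word) without listing the individual features. The Python sketch below shows one plausible way to derive such features for every link on a page using only the standard library. All specifics are assumptions made for illustration: the ancestor tag path stands in for tag features, the link's relative order among start tags for position features, and the lowercased anchor text for word features. The class name, function name, and sample HTML are hypothetical, and SVM training itself is not shown.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class LinkFeatureExtractor(HTMLParser):
    """Collect illustrative tag, position, and word features for each <a>."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.links = []       # one feature dict per completed link
        self.current = None   # link currently being read, if any
        self.n_tags = 0       # start tags seen so far, in document order

    def handle_starttag(self, tag, attrs):
        self.n_tags += 1
        if tag == "a":
            self.current = {
                "ancestors": tuple(self.stack),  # tag features
                "order": self.n_tags,            # raw position feature
                "words": [],                     # word features
            }
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed tags.
            while self.stack.pop() != tag:
                pass
        if tag == "a" and self.current is not None:
            self.links.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.current["words"].extend(data.lower().split())


def link_features(html):
    """Return one feature dict per link found in the HTML string."""
    parser = LinkFeatureExtractor()
    parser.feed(html)
    parser.close()
    total = max(parser.n_tags, 1)
    for link in parser.links:
        link["relative_position"] = link["order"] / total
    return parser.links


sample_html = """
<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<div><p>We tried the <a href="http://example.com/review">new sandwich</a> today.</p></div>
</body></html>
"""
features = link_features(sample_html)
for f in features:
    print(f["ancestors"], round(f["relative_position"], 2), f["words"])
```

Each resulting dictionary would still need to be encoded as a numeric vector (for example, one-hot tags and a bag of words) before it could be passed to a linear-kernel SVM.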
Evaluation
Data Cleaning: Effect of sidebar content
Related Work
• Web Spam Detection
  • Coverage: blog analytics engines don’t look beyond the Blogosphere
  • Speed of detection is important, 150K posts/hour
  • RSS feeds present new opportunities and challenges
• Email Spam Detection
  • Nature of spamming: links, RSS feeds, web graph, metadata
  • Users targeted indirectly through search engines, e.g. “N1ST” not relevant for “NIST” query
• Template Detection
  • Repeated structural components detected via sampling
  • Customization and the use of JavaScript and AJAX are increasing
  • Simple heuristics using DOM traversal work well in general cases
• Sentiment Analysis
  • Open domain opinion extraction is complex
  • Opinions are part of a narrative
  • The subject about which an opinion is expressed is not easy to detect
Conclusions
• Noisy content on the Blogosphere presents a major challenge to the quality of blog analytics tools.
• A combination of heuristics and ML can be used to clean the data effectively.
Ongoing Work
• DOM subtree elimination
• Identifying the subject of the opinion
• Slang
• More training examples!
Thank you!
http://ebiquity.umbc.edu/
Backup Slides
Opinions in Social Media
“I went to school early so I would
have time to grab some lunch.
Which ended up consisting of a
crappy sandwich from
starbucks and a chai latte.
Lacey came into Starbucks
while I was there so we chatted
for a little bit and she thought
that I might be in her class.
After I finished eating I headed
to school and checked the
board……..” [1]
[1] http://annamay13x.livejournal.com/7061.html
Reader’s Perspective
• Narrative → expressed opinions: “Starbucks sandwiches are bad!”
• Opinions can influence buying decisions of customers
Keyword Stuffed Blog
• ‘coupon codes’, ‘casino’
Post Stitching
• Excerpts scraped from other sources
Post Weaving
• Spam Links contextually placed in post
Link-roll spam
• With fully plagiarized text
Difficulty
• We have been experimenting with multiple approaches since mid-2005
• Data: http://ebiquity.umbc.edu/resource/html/id/212
Difficulty
• Evolving spamming techniques and splog creation genres
• Most basic spam techniques
  • Generate content by stuffing key dictionary words
  • Generate links to affiliates through link dumps on blogrolls, linkrolls, or after post content
• Evolving spam techniques
  • Scrape contextually similar content to generate posts
  • RSS hijacking
  • Aggregation software, e.g. Planet X
  • Intersperse links randomly
  • Make link placement meaningful
  • Add spam comments and then ping. Repeat.
TREC Submissions (Topic Relevance)
TREC Submissions (Opinion Extraction)