Natalie Glance Senior Research Scientist Nielsen BuzzMetrics © 2006 Nielsen BuzzMetrics, A VNU business affiliate.



Natalie Glance
Senior Research Scientist
Nielsen BuzzMetrics
© 2006 Nielsen BuzzMetrics, A VNU business affiliate
Background
• Nielsen BuzzMetrics aggregates consumer opinion expressed in message boards, weblogs, Usenet and other online discussions
• Parent company behind BlogPulse, a blog search and analytics website
What drives weblog spam?
• Same goal as any other website spam: SEO
• Weblog hosts provide:
    - Free hosting for link farms to promote affiliate sites
    - Free hosting for web pages with sponsored ads
• Types of weblog spam:
    - spam blogs (which pollute ping servers)
    - spam comments on legitimate blogs
    - spam trackback pings to legitimate blogs
Collateral damage: blog search result contamination
• Search results for 'mortgage':
Collateral damage: trend graphs
• Explain the peaks: are they real or artifacts of spam?
Collateral damage: real-time monitoring
• Spikes in keyword clusters
    - 2006/07/28 10:39 a.m. {deleted myspace account}
    - 2006/07/27 10:55 a.m. {landis tested yesterday}
    - 2006/08/07 3:22 a.m. {investing debt directory}
    - 2006/08/07 6:54 a.m. {adsense cents makers}
    - 2006/08/07 1:11 p.m. {wwdc keynote}
• Breaking news or spam attack?
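The slide leaves the question open. One way to triage a spike, sketched here as an illustration and not as the method BuzzMetrics used, is to check how varied the posts behind the spike are. It assumes a spam attack tends to repeat near-identical text or come from very few blogs, while breaking news draws distinct posts from many blogs. The function names, the `blog` and `text` fields, and the 0.5 thresholds are all hypothetical.

```python
def distinct_ratio(values):
    """Return the fraction of values that are distinct (0.0 for no values)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def flag_for_review(posts, min_blog_ratio=0.5, min_text_ratio=0.5):
    """Flag a keyword spike as a possible spam attack.

    posts: list of dicts with hypothetical keys "blog" and "text".
    Thresholds are illustrative assumptions, not tuned values.
    """
    if not posts:
        return False  # nothing to judge
    blog_ratio = distinct_ratio(p["blog"] for p in posts)
    text_ratio = distinct_ratio(p["text"].strip().lower() for p in posts)
    # Few distinct blogs or heavily repeated text both look suspicious.
    return blog_ratio < min_blog_ratio or text_ratio < min_text_ratio


news = [
    {"blog": "a.example", "text": "Landis tested positive, reports say"},
    {"blog": "b.example", "text": "What the Landis test means for cycling"},
    {"blog": "c.example", "text": "Landis denies doping after test"},
]
spam = [
    {"blog": f"spam{i}.example", "text": "Investing debt directory"}
    for i in range(4)
]

print(flag_for_review(news))
print(flag_for_review(spam))
```

A flag from a check like this would only route the spike to a person; the slide deck itself pairs real-time monitoring with human oversight.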
Spam filtering challenges
• Different analytics, different trade-offs
    - weblog search: high coverage, clean results, minimal false positives
    - trend search: high precision to eliminate spurious artifacts
    - real-time monitoring: high coverage with human oversight
• Different timeframes, different approaches
    - real-time search: highly efficient classification algorithms; automated identification of spam attacks
    - historic search: offline spam identification can use a combination of approaches; sandbox for new weblogs
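The trade-offs above can be pictured as one spam classifier whose score is cut at a different threshold for each application. The sketch below is one reading of the slide, not a description of the actual system: it assumes each post already has a spam score between 0 and 1, and the application names and threshold values are hypothetical.

```python
# Hypothetical spam-score cutoffs: a post is kept when its score is below
# the cutoff for the application. Values are illustrative only.
SPAM_CUTOFFS = {
    "search": 0.9,       # drop only confident spam, so few legitimate blogs are lost
    "trend": 0.5,        # filter aggressively so spam does not create false peaks
    "monitoring": 0.98,  # keep nearly everything; a human reviews what is shown
}


def keep_post(spam_score, application):
    """Return True if a post with this spam score is kept for the application."""
    return spam_score < SPAM_CUTOFFS[application]


scores = [0.1, 0.6, 0.95]
for application in SPAM_CUTOFFS:
    kept = [s for s in scores if keep_post(s, application)]
    print(application, kept)
```

With these assumed cutoffs, the borderline post (0.6) survives in search results but is excluded from trend counts, and monitoring keeps all three posts for a person to judge.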