Cobra: Content-based Filtering and Aggregation of Blogs

Cobra: Content-based Filtering
and Aggregation of Blogs and
RSS Feeds
Ian Rose¹, Rohan Murty¹, Peter Pietzuch², Jonathan Ledlie¹, Mema Roussopoulos¹, Matt Welsh¹
[email protected]
¹ Harvard School of Engineering and Applied Sciences
² Imperial College London
Motivation
• Explosive growth of the “blogosphere”
and other forms of RSS-based web
content. Currently over 72 million
weblogs tracked (www.technorati.com).
• How can we provide an efficient,
convenient way for people to access
content of interest in near-real time?
Ian Rose – Harvard University
NSDI 2007
Source: http://www.sifry.com/alerts/archives/000493.html
Challenges
• Scalability
– How can we efficiently support large numbers of
RSS feeds and users?
• Latency
– How do we ensure rapid update detection?
• Provisioning
– Can we automatically provision our resources?
• Network Locality
– Can we exploit network locality to improve
performance?
Current Approaches
• RSS Readers (Thunderbird)
– topic-based (URL), inefficient polling model
• Topic Aggregators (Technorati)
– topic-based (pre-defined categories)
• Blog Search Sites (Google Blog Search)
– closed architectures, unknown scalability
and efficiency of resource usage
Outline
• Architecture Overview
– Services: Crawler, Filter, Reflector
• Provisioning Approach
• Locality-Aware Feed Assignment
• Evaluation
• Related & Future Work
General Architecture
Crawler Service
1. Retrieve RSS feeds via
HTTP.
2. Hash full document &
compare to last value.
3. Split document into
individual articles.
Hash each article &
compare to last value.
4. Send each new article
to downstream filters.
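The two-level hash check in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the Cobra implementation: the names `FeedState`, `split_items`, and `new_articles` are hypothetical, SHA-1 is an assumed choice of hash, and the regex splitter stands in for a real RSS parser.

```python
import hashlib
import re


def digest(text: str) -> str:
    """Hash a string; SHA-1 is an assumption, any stable hash would do."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class FeedState:
    """Hashes remembered from the previous crawl of one feed."""

    def __init__(self):
        self.doc_hash = None
        self.article_hashes = set()


def split_items(document: str) -> list:
    """Toy splitter (assumption): treat every <item>...</item> as one article."""
    return re.findall(r"<item>.*?</item>", document, flags=re.S)


def new_articles(state: FeedState, document: str, split=split_items) -> list:
    """Return the articles that were not present at the previous crawl."""
    doc_hash = digest(document)
    if doc_hash == state.doc_hash:
        return []  # step 2: whole document unchanged, nothing to do
    state.doc_hash = doc_hash

    articles = split(document)  # step 3: split into individual articles
    fresh = [a for a in articles if digest(a) not in state.article_hashes]
    state.article_hashes = {digest(a) for a in articles}
    return fresh  # step 4: only these go to downstream filters
```

Under these assumptions, an unchanged feed costs one hash comparison, and a feed that gained one article forwards only that article.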
Filter Service
1. Receive subscriptions
from reflectors and
index for fast text
matching (Fabret ’01).
2. Receive articles from
crawlers and match
each against all
subscriptions.
3. Send articles that
match at least 1 subscription
to host reflectors.
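A rough sketch of the index-then-match flow is shown below. It assumes each subscription is a conjunction of keywords and uses a plain counting inverted index; this is a simplification for illustration, not the Fabret ’01 algorithm, and `SubscriptionIndex` is a hypothetical name.

```python
from collections import defaultdict


class SubscriptionIndex:
    """Inverted index from keyword to the subscriptions that require it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of subscriptions using it
        self.sizes = {}                   # subscription id -> distinct word count

    def add(self, sub_id, words):
        terms = {w.lower() for w in words}
        if not terms:
            raise ValueError("a subscription needs at least one keyword")
        self.sizes[sub_id] = len(terms)
        for term in terms:
            self.postings[term].add(sub_id)

    def match(self, article_text):
        """Return ids of subscriptions whose keywords all occur in the article."""
        hits = defaultdict(int)
        # Whitespace tokenization is a simplification (assumption).
        for term in set(article_text.lower().split()):
            for sub_id in self.postings.get(term, ()):
                hits[sub_id] += 1
        return {s for s, n in hits.items() if n == self.sizes[s]}
```

The cost of matching one article depends on the number of distinct words it contains and the lengths of their posting lists, not on the total number of subscriptions, which is the reason for indexing in step 1.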
Reflector Service
1. Receive subscriptions
from web front-end;
create article “hit
queue” for each.
2. Receive articles from
filters and add to the hit
queues of matching
subscriptions.
3. When polled by a client,
return articles in hit
queue as an RSS feed.
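The hit-queue behavior can be sketched like this. The class and method names are hypothetical, and the bound on queue length (`max_hits`) is an assumption that the slide does not state.

```python
from collections import deque
from xml.etree import ElementTree as ET


class Reflector:
    """Keeps one bounded hit queue per subscription (bound is an assumption)."""

    def __init__(self, max_hits=50):
        self.max_hits = max_hits
        self.hit_queues = {}

    def subscribe(self, sub_id):
        # Step 1: create an empty hit queue for a new subscription.
        self.hit_queues.setdefault(sub_id, deque(maxlen=self.max_hits))

    def deliver(self, sub_ids, title, link):
        # Step 2: append the article to each matching subscription's queue.
        for sub_id in sub_ids:
            if sub_id in self.hit_queues:
                self.hit_queues[sub_id].append((title, link))

    def poll(self, sub_id):
        # Step 3: render the queued hits as a minimal RSS 2.0 document.
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = f"Results for {sub_id}"
        for title, link in self.hit_queues[sub_id]:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = title
            ET.SubElement(item, "link").text = link
        return ET.tostring(rss, encoding="unicode")
```

Because the result is ordinary RSS, a client can poll it with any existing feed reader.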
Hosting Model
• Currently, we envision hosting Cobra
services in networked data centers.
– Allows basic assumptions regarding node
resources.
– Node “churn” typically very infrequent.
• Adapting Cobra to a peer-to-peer
setting may also be possible, but this is
unexplored.
Provisioning
• We employ an iterative, greedy heuristic to
automatically determine the services required
for specific performance targets.
Provisioning
Algorithm:
1. Begin with minimal topology (3 services).
2. Identify a service violation (in-BW, out-BW, CPU, memory).
3. Eliminate the violation by “decomposing”
service into multiple replicas, distributing
load across them.
4. Continue until no violations remain.
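The loop above can be sketched under strong simplifying assumptions: each service's load is a fixed per-resource demand, replicas share that load evenly, and splitting one service does not change the load on its neighbors. The real provisioner has to model such coupling between tiers; the names `Service`, `violation`, and `provision` are hypothetical.

```python
import math
from dataclasses import dataclass

RESOURCES = ("in_bw", "out_bw", "cpu", "memory")


@dataclass
class Service:
    kind: str   # "crawler", "filter" or "reflector"
    load: dict  # resource -> demand, in the same units as the limits


def violation(service, limits):
    """Return the worst demand/limit ratio if it exceeds 1, else None."""
    worst = max(service.load[r] / limits[r] for r in RESOURCES)
    return worst if worst > 1 else None


def provision(services, limits):
    """Greedily decompose violating services until none remain."""
    services = list(services)
    while True:
        offender = next((s for s in services if violation(s, limits)), None)
        if offender is None:
            return services  # step 4: no violations remain
        # Step 3: replace the offender with k replicas sharing its load evenly.
        k = math.ceil(violation(offender, limits))
        services.remove(offender)
        share = {r: offender.load[r] / k for r in RESOURCES}
        services.extend(Service(offender.kind, dict(share)) for _ in range(k))
```

Starting from the minimal three-service topology, each iteration removes one violation, so with evenly divisible load the loop terminates once every replica fits on a node.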
Provisioning: Example
BW: 25 Mbps
Memory: 1 GB
CPU: 4x
subscriptions: 6M
feeds: 600K
Locality-Aware Feed
Assignment
• We focus on crawler-feed locality.
• Offline latency estimates between crawlers and web sources via King ’02¹.
• Cluster feeds to “nearby” crawlers.
¹ Gummadi et al., King: Estimating Latency between Arbitrary Internet End Hosts
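One simple way to turn latency estimates into an assignment is a greedy pass with a per-crawler capacity. The slide does not describe the clustering method, so this is only an illustrative stand-in; `assign_feeds`, `latency_ms`, and `capacity` are hypothetical names.

```python
def assign_feeds(latency_ms, capacity):
    """Assign each feed to its lowest-latency crawler that still has room.

    latency_ms: {feed: {crawler: estimated latency in ms}}
    capacity:   maximum number of feeds per crawler (assumed uniform)
    """
    load = {}
    assignment = {}
    # Place feeds with the best available latency first.
    ordered = sorted(latency_ms.items(), key=lambda kv: min(kv[1].values()))
    for feed, estimates in ordered:
        for crawler in sorted(estimates, key=estimates.get):
            if load.get(crawler, 0) < capacity:
                assignment[feed] = crawler
                load[crawler] = load.get(crawler, 0) + 1
                break
        else:
            raise ValueError(f"no crawler has spare capacity for {feed}")
    return assignment
```

The capacity bound keeps a single well-connected crawler from absorbing every feed, at the cost of sending some feeds to their second-nearest crawler.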
Evaluation Methodology
• Synthetic user queries: number of words per
query based on Yahoo! search query data,
actual words drawn from Brown corpus.
• List of 102,446 real feeds from syndic8.com
• Scale up using synthetic feeds, with
empirically determined distributions for
update rates and content sizes (based in part
on Liu et al., IMC ’05).
Benefit of Intelligent Crawling
One crawl of all 102,446
feeds over 15 minutes,
using 4 crawlers. BW
usage recorded for
varying filtering levels.
Overall, crawlers are
able to reduce bw usage
by 99.8% through
intelligent crawling.
Locality-Aware Feed Assignment
Scalability Evaluation: BW
Four topologies evaluated on
Emulab w/ synthetic feeds:
Subs   Feeds   Total Nodes   Crawlers   Filters   Reflectors
1M     100K    3             1          1         1
10M    1M      57            1          28        28
20M    500K    51            1          25        25
40M    250K    57            1          28        28
Bandwidth usage scales well
with feeds and users.
Intra-Network Latency
Total user latency =
crawl latency +
polling latency +
intra-network latency
Overall, intra-network
latencies are largely
dominated by
crawling and polling
latencies.
Provisioner-Predicted Scaling
Related Work
• Traditional distributed pub/sub systems,
e.g. Siena (Univ. of Colorado):
– Address decentralized event matching and
distribution.
– Typically do not (directly) address overlay
provisioning.
– Often do not interoperate well with existing
web infrastructure.
Related Work
• Corona (Cornell) is an RSS-specific
pub/sub system
– topic-based (subscribe to URLs)
– Attempts to minimize both polling load on
content servers (feeds) and update
detection delay.
– Does not specifically address scalability, in
terms of feeds or subscriptions.
Future Work
• Many open directions:
– evaluating real user subscriptions &
behavior
– more sophisticated filtering techniques
(e.g. rank by relevance, proximity of query
words in article)
– subscription clustering on reflectors
– how to discover new feeds & blogs
Thank you!
Questions?
[email protected]
extra slides
The Naïve method…
• “Back of the envelope” approximations:
– 1 user polling 50M feeds every 60 minutes
would use ~560 Mbps of bw
– 1 server serving feeds to 500M users every
60 minutes would use ~5.5 Gbps of bw
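The per-poll transfer size behind these figures is not stated, but it can be backed out from them. The sketch below does that arithmetic; the function names are hypothetical and the only inputs are the numbers on this slide.

```python
def implied_bytes_per_poll(rate_bps, polls, period_s=3600):
    """Bytes moved per poll if `polls` polls in `period_s` seconds use `rate_bps`."""
    return rate_bps * period_s / 8 / polls


def required_bps(polls, bytes_per_poll, period_s=3600):
    """Average bandwidth needed to complete `polls` polls every `period_s` seconds."""
    return polls * bytes_per_poll * 8 / period_s


# 50M feeds per hour at ~560 Mbps, and 500M polls per hour at ~5.5 Gbps
per_feed = implied_bytes_per_poll(560e6, 50e6)
per_user = implied_bytes_per_poll(5.5e9, 500e6)
```

Both estimates work out to roughly 5 KB per poll (5040 and 4950 bytes respectively), so the two approximations appear to rest on the same assumed transfer size.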
Comparison to Other Search
Engines
• Created blogs on 2 popular blogging
sites (LiveJournal and Blogger.com)
• Polled for our posts on Feedster,
Blogdigger, Google Blog Search
• After 4 months:
– Feedster & Blogdigger had no results
(perhaps posts were spam filtered?)
– Google latency varied from 83s to 6.6
hours (perhaps use of ping service?)
FeedTree
• Requires special client software.
• Relies on “good will” (donating BW) of
participants.
Reflector Memory Usage
Match-Time Performance
Source: http://www.sifry.com/alerts/archives/000443.html
