Text Retrieval in Peer to Peer Systems
David Karger
MIT
Information Retrieval
before P2P
The traditional approach
Information Retrieval
Most of our information base is text
academic journals
books and encyclopedias
news feeds
world wide web pages
email
neat
messy
How do we find what we need?
The Classic IR Model
User has information need
User formulates textual query
System processes corpus of documents
System extracts relevant documents
User refines query
Metrics:
recall: % of relevant documents retrieved
precision: % of retrieved documents that are relevant
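The two metrics above can be sketched for a single query as follows. This is a minimal illustration, not part of the original slides; the function name `precision_recall` and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    # Assumed convention: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Retrieving everything drives recall to 100% at the cost of precision; retrieving nothing avoids irrelevant results but has zero recall.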
Precision-Recall Tradeoff
[Figure: precision plotted against recall, each axis running to 100%; labeled points: Fetch Nothing, Library, CIA, Web Search, Fetch Everything]
Specific Retrieval Algorithms
Define relevance
Build a model of documents, meanings
Ignore computational cost
Implement efficiently
Preprocessing
TB corpora call for big-iron machines (or simulations)
Interaction:
after 1/2 second, user notices delay
after 10 seconds, user gives up
(historical perspective; changed by web)
Boolean Keyword Search
Q: “Do harsh winters affect steel production?”
Output:
Query: steel AND winter
“Last WINTER, overproduction of STEEL led to...”
“STEEL automobiles resist WINTER weather poorly.”
“Boston must STEEL itself for another bad WINTER”
“the Pittsburgh STEELers started WINTER training...”
Not Output:
“Cold weather caused increased metal prices as orders for radiators and automobiles picked up...”
Implementing Boolean Search
Typical: OR of ANDs
handle each OR separately, aggregate
For ANDs, inverted index:
Per term, list of documents containing that term
intersect lists for query terms
Basically a database join
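A minimal sketch of the inverted-index approach above, assuming whole-word matching on lowercased tokens. The function names and the tokenizer are illustrative assumptions, not taken from the slides; note that whole-word matching would not match "steel" against "Steelers".

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Intersect the posting sets of all terms (a join on document id)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

def boolean_or_of_ands(index, clauses):
    """Handle each AND clause separately, then aggregate with union."""
    result = set()
    for clause in clauses:
        result |= boolean_and(index, clause)
    return result
```

For example, `boolean_or_of_ands(index, [["steel", "winter"], ["metal"]])` evaluates (steel AND winter) OR metal.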
Intersection Algorithms (as in DB)
Method 1: direct list merge
Linear work in summed size of lists
Method 2: examine candidates
Start with shortest term list
For each list entry, check for other search terms
Linear in smallest list
Good if at least one rare term, but
requires forward index (list of terms in each document)
no gain if all search terms common (“flying fish”)
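The two intersection methods above might look like this in outline. This is a sketch under stated assumptions: posting lists are sorted lists of document ids, and `forward_index` maps each document id to the set of terms it contains; both names are hypothetical.

```python
def intersect_merge(a, b):
    """Method 1: merge two sorted posting lists; work is linear in len(a) + len(b)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_candidates(postings, forward_index, terms):
    """Method 2: scan the shortest posting list and check each candidate
    document against the forward index for the remaining terms."""
    rarest = min(terms, key=lambda t: len(postings.get(t, [])))
    others = [t for t in terms if t != rarest]
    return [d for d in postings.get(rarest, [])
            if all(t in forward_index[d] for t in others)]
```

Method 2 does work proportional to the shortest list only, which is why it pays off when at least one query term is rare and gains nothing when every term is common.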
Problems with Boolean Approach
Synonymy
several words for same thing
if author used different one, query won’t match
Polysemy
one word can mean many things (“bank”)
query matches wrong meanings
Harsh cutoffs (1 wrong keyword kills)
user can’t type descriptive paragraph...
Terms have uniform influence
repeated occurrence same as single occurrence
common terms treated same as rare ones
Fixing Problems
Synonymy
thesaurus can add equivalent terms to query
increases recall, but lowers precision
expensive to construct (semantics: manual)
Polysemy
use more query terms to disambiguate
user might not know more terms
increases precision, but lowers recall
Harsh cutoffs
quorum system (maximize # matching terms)
Uniformity?
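The quorum idea above (rank by number of matching query terms instead of requiring all of them) could be sketched like this. The function name and the tie-breaking by document id are assumptions made for illustration; `index` maps each term to the set of document ids containing it.

```python
from collections import Counter

def quorum_rank(index, query_terms):
    """Rank documents by how many distinct query terms they contain."""
    counts = Counter()
    for term in set(query_terms):
        counts.update(index.get(term, ()))  # one vote per matching term
    # Most matching terms first; ties broken by document id (an assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

A document missing one query term is ranked lower rather than dropped, which softens the harsh cutoff. Every term still counts equally, so the uniformity problem remains.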
Vector Space Model
Document is a vector with a coordinate/term
0-1 for presence/absence of term (quorum)
real valued to represent term “importance”
term frequency in document increases value
term frequency in corpus decreases value
Dot product with query measures similarity
Best known implementation: inverted index
for each query term, list documents containing it
accumulate dot products
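A small sketch of the vector space scoring described above. The slides only say that the weight rises with term frequency in the document and falls with frequency in the corpus; the specific weight used here, tf * log(N / df), is one common choice and is an assumption, as are the function names.

```python
import math
from collections import Counter, defaultdict

def build_weighted_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: {doc_id: weight}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # number of documents containing each term
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            # Assumed weighting: term frequency times log inverse document frequency.
            index[term][doc_id] = tf * math.log(n_docs / doc_freq[term])
    return index

def rank(index, query_terms):
    """Accumulate dot products term by term using the inverted index."""
    scores = defaultdict(float)
    for term, q_weight in Counter(query_terms).items():
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Only documents sharing at least one term with the query are ever touched, which is why the inverted index is the natural implementation.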
Vector Space Advantages
Smoother than Boolean search
Provides ranking rather than sharp cut-off
Tends to allow/encourage queries with many nonzero terms
Easy to “expand query” with synonyms
Hopefully polysemes will “interfere constructively”
May even add relevant documents to query (100s or 1000s of terms)
P2PIR
Simulating big iron
Web Search Info From Google
Web queries
Almost all queries 2 terms only
“Boolean vector space” model (tiny recall OK)
Zipf distribution, so caching queries helps some
Corpus
3B pages, 10K average size, 30TB total
Inverted index: roughly the same size
Fits in a “moderate” P2P system of 30K nodes
But must be partitioned. How?
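The sizing above can be checked with a short calculation, assuming "10K" means 10 KB per page, decimal units, and an index of about the same size as the corpus:

```latex
\begin{align*}
\text{corpus} &\approx 3\times10^{9}\ \text{pages} \times 10\ \text{KB/page}
  = 3\times10^{13}\ \text{B} = 30\ \text{TB} \\
\text{index per node} &\approx \frac{30\ \text{TB}}{3\times10^{4}\ \text{nodes}}
  = 1\ \text{GB}
\end{align*}
```

Roughly 1 GB of index per node is what makes a 30K-node system "moderate" for this corpus.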
Obvious: Partition Documents
Node builds full inverted index for its subset
Query quite tractable per node
Merge results sent back from each node
Used by Google (in data center) and Gnutella
Drawback: query broadcast to all nodes
OK for Google data center; bad for P2P
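Document partitioning can be sketched as below. Round-robin placement is a hypothetical assignment rule, and a scan over per-document term sets stands in for each node's local inverted index to keep the example short; the point illustrated is that every query costs one message per node.

```python
def partition_documents(docs, n_nodes):
    """Hypothetical round-robin assignment of documents to nodes."""
    shards = [dict() for _ in range(n_nodes)]
    for i, (doc_id, terms) in enumerate(sorted(docs.items())):
        shards[i % n_nodes][doc_id] = set(terms)
    return shards

def local_search(shard, query_terms):
    """AND query over one node's documents (stand-in for its local index)."""
    wanted = set(query_terms)
    return {doc_id for doc_id, terms in shard.items() if wanted <= terms}

def broadcast_query(shards, query_terms):
    """Send the query to every node and merge the results.
    Returns (matching doc ids, number of nodes contacted)."""
    results = set()
    messages = 0
    for shard in shards:
        messages += 1
        results |= local_search(shard, query_terms)
    return results, messages
```

The message count equals the number of nodes regardless of the query, which is tolerable inside a data center and costly in a wide-area P2P system.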
Alternative: Partition Terms
One node owns a few terms of inverted index
Term is “key” for distributed hash table
Talk only to nodes that own query terms
They return desired inverted-index lists
Results intersected at query issuer
Drawback: transfer huge inverted index lists
Alternative: send first term-list to second
Ships 1 (perhaps small) list instead of 2
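A toy sketch of term partitioning with the "ship one list" optimization. Placement by hash modulo node count is a simplification standing in for a real distributed hash table lookup, and choosing the shorter list assumes the list sizes are known in advance; `Node`, `owner`, `publish`, and `two_term_query` are hypothetical names.

```python
import hashlib

def owner(term, n_nodes):
    """Hypothetical placement rule: hash the term onto one of n_nodes."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

class Node:
    def __init__(self):
        self.postings = {}  # term -> set of document ids

def publish(nodes, term, doc_id):
    """Store one (term, document) posting on the node that owns the term."""
    node = nodes[owner(term, len(nodes))]
    node.postings.setdefault(term, set()).add(doc_id)

def two_term_query(nodes, t1, t2):
    """Ship the shorter posting list to the node owning the other term and
    intersect there. Returns (matching doc ids, number of ids shipped)."""
    l1 = nodes[owner(t1, len(nodes))].postings.get(t1, set())
    l2 = nodes[owner(t2, len(nodes))].postings.get(t2, set())
    shipped, local = (l1, l2) if len(l1) <= len(l2) else (l2, l1)
    return shipped & local, len(shipped)
```

Only the nodes owning the query terms are contacted, but the shipped list can still be very large when both terms are common.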
Avoiding Communication
(Om Gnawali et al. @MIT)
Build inverted index on term pairs
Pre-answering all queries
Partition pairs among nodes
Search contacts one node
Problem: pre-computation cost
Size-n document generates n^2 pairs
Each pair must be communicated
Each pair must be stored
Good Cases
Music search
“document” is song title + author
n small, so n^2 factor unimportant
Document windows
Usually, good docs have query terms “nearby”
Scan window of length 5, take pairs in window
10 pairs/window, so 10n per document
So linear in corpus size as before
Bundle pairs to ship over sparse overlay
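The windowed pair index can be sketched as follows. The function names are illustrative assumptions, and pairs are stored unordered (sorted tuples) so that a query can give its two terms in either order.

```python
from collections import defaultdict
from itertools import combinations

def window_pairs(terms, window=5):
    """Collect unordered term pairs that co-occur within a sliding window."""
    pairs = set()
    # max(1, ...) ensures documents shorter than the window still get one pass.
    for start in range(max(1, len(terms) - window + 1)):
        chunk = sorted(set(terms[start:start + window]))
        pairs.update(combinations(chunk, 2))
    return pairs

def build_pair_index(docs, window=5):
    """docs: {doc_id: list of terms}. Returns {(term_a, term_b): set of doc ids}."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for pair in window_pairs(terms, window):
            index[pair].add(doc_id)
    return index

def lookup(index, t1, t2):
    """Answer a two-term query with a single index lookup."""
    return index.get(tuple(sorted((t1, t2))), set())
```

A window of 5 terms holds at most 10 pairs, so the number of pairs per document grows linearly with its length instead of quadratically. The price is that a document whose query terms are far apart is not found.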
What About Vector Space?
Weighting terms is easy
But cannot limit search to pair list
However, need only highest-scored documents on individual terms
So, pre-compute and store small “winner list”
Vector space encourages many-term queries
Find pairs with small intersection
Index triples, quadruples, etc
Apply branch and bound techniques
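One possible reading of the "winner list" idea, sketched under loud assumptions: each term keeps only its k highest-weighted documents, and a query sums weights over those short lists. This is a heuristic approximation, since a document outside a term's winner list receives no credit for that term; the names `winner_lists` and `approximate_rank` are hypothetical.

```python
import heapq
from collections import defaultdict

def winner_lists(weighted_index, k):
    """weighted_index: {term: {doc_id: weight}}. Keep the top-k documents per term."""
    return {
        term: dict(heapq.nlargest(k, docs.items(), key=lambda kv: kv[1]))
        for term, docs in weighted_index.items()
    }

def approximate_rank(winners, query_terms):
    """Score documents using only the precomputed winner lists."""
    scores = defaultdict(float)
    for term in set(query_terms):
        for doc_id, weight in winners.get(term, {}).items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Storage per term is bounded by k, so the lists stay small no matter how common the term is.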
Google Pushback
No need for P2P
More precisely: “keep peers in our data center”
Exploit high local communication bandwidth
Economics support large server farm
More load? Buy more servers
Main bottleneck: content provider bandwidth
Limits rate of crawl
Google index often weeks out of date
Distributed crawler won’t help
Google Pushback Pushback
P2P might help
Potential applications
Let each node build own index
Ship changes to Google
real-time index
new-relevant-content notification
Problem: SPAM
Content providers will lie about index changes
Use P2P system to spot-check?
Person-to-Person IR
New modalities
P2P: Systems Perspective
Distributed system has more resources
Can exploit, if successfully hide
Computation/Storage
Reliability
Latency
Bandwidth
Goal: simulate reliable big iron
Solve traditional problems that need resources
File storage, factoring, database queries, IR
P2P: Social Perspective
Applications based on person-to-person interactions
Messaging
Linking/community building (the web)
Reputation management (Mojo Nation)
File-sharing collaborations (just now)
Need not run on top of P2P network
The “Pathetic Fallacy” of P2P
Assumption that network layer should mirror social layer
E.g. “peers should be nodes with similar interests”
Many social applications work fine on one (big, reliable) machine
Placement on P2P system is “coincidental”
On other side of “one big machine” abstraction
Breaching abstraction has bad consequences
Peering to “friends” unlikely to optimize efficiency, reliability
P2P Opportunity: Leverage Involvement of People
Each individual manipulates information
In much more powerful, semantic ways than machines can achieve
Record that manipulation
Exploit to help others do better retrieval
Link-based Retrieval
People find “good” web pages, link to them
So, a page with large in-degree is good
Refine: target of many good nodes is good
Mathematically, random walk model
Page rank = stationary probability of random walk
Simultaneous work:
Kleinberg at IBM
Brin/Page at Stanford/Google
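The random-walk view of page rank can be sketched with power iteration. The damping factor of 0.85, the fixed iteration count, and the uniform handling of pages with no outgoing links are common conventions assumed here, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: list of pages it links to}. Returns {page: rank}.
    Approximates the stationary distribution of a random surfer who follows
    a link with probability `damping` and otherwise jumps to a random page."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumed convention: a page with no links jumps uniformly.
                share = damping * rank[v] / n
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank
```

A page linked from many pages, or from a few highly ranked ones, ends up with a large share of the walk's probability mass.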
Applications
Search
Raise relevance of high page-rank pages
If lazy, limit corpus to high page-rank
Anchor text better description than page contents
Crawl
Page rank can be computed before the page itself is seen
Prioritize high page-rank pages for crawl
People add usable info no system could find
P2P: Systems/Social Interactions
Distributed system has novel properties
Exploit them to enable novel capabilities
E.g., anonymity
Relies on partition of control/knowledge
E.g., privacy
Allow limited access to my private information
Gain (false, but important) sense of safety by keeping it on my machine
Expertise Networks
Haystack (Karger et al), Shock (Adar et al)
Route questions to appropriate expert
Use text to describe knowledge
Based on human entry, or indexing of human’s personal files
Might be unwilling to admit knowledge
P2P framework can protect anonymity
Shock achieves by Gnutella-style query broadcast
More efficient approach?
Other New Aspects
Personal information sharing
Unwilling to “publish” mail, documents to world
But might allow search, access in some cases
Keeping data, index on own machine gives (false) sense of security, privacy
Anonymity
P2P provides strong anonymity primitives
Can be exploited, e.g., for “recommending” embarrassing content
Sample Application
Social: “Secret Web”
Maintain links for use by page-rank algorithm
But, links are secret from most others
Need random walk through link path
Implement via recursive lookup
Censorproof?, spamproof?
Semantics vs. Syntax
Clearly, using word meanings would help
Some systems try to implement semantics
But this is a core AI problem, unsolved
Current attempts don’t scale to large corpora
All current large systems are syntactic only
Idea: use computational power of P2P
Idea: use humans to attach semantics
Conclusion: Two Approaches to P2P
Hide P2P (Partition to Partition)
Goal: illusion of single server
Know how to do task on single server
Devise tools to achieve same in distributed sys.
Focus on surmounting drawbacks: systems
Exploit P2P (Person to Person)
Determine new opportunities afforded by P2P
Perhaps impossible on single server
Focus on new applications: AI? HCI?