Text Retrieval in Peer to Peer Systems


Text Retrieval in Peer to Peer Systems
David Karger
MIT

Information Retrieval before P2P:
The traditional approach
Information Retrieval
- Most of our information base is text, ranging from neat to messy:
  - academic journals
  - books and encyclopedias
  - news feeds
  - world wide web pages
  - email
- How do we find what we need?
The Classic IR Model
- User has information need
- User formulates textual query
- System processes corpus of documents
- System extracts relevant documents
- User refines query
- Metrics:
  - recall: % of relevant documents retrieved
  - precision: % of retrieved documents that are relevant
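Both metrics fall out of a set comparison between what was retrieved and what was relevant; a minimal sketch (function name and document ids are illustrative, not from the talk):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved documents that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall

# 3 of the 4 retrieved docs are relevant; 3 of the 6 relevant docs were found
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
# p = 0.75, r = 0.5
```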
Precision-Recall Tradeoff
[Figure: the precision-recall tradeoff curve. Precision and recall both run to 100%; “fetch nothing” sits at the high-precision extreme and “fetch everything” at the high-recall extreme, with web search, library search, and CIA-style search falling at different points between them.]
Specific Retrieval Algorithms
- Define relevance
  - build a model of documents, meanings
  - ignore computational cost
- Implement efficiently
  - preprocessing
  - TB corpora call for big-iron machines (or simulations)
- Interaction:
  - after 1/2 second, user notices delay
  - after 10 seconds, user gives up
  - (historical perspective; changed by web)
Boolean Keyword Search
- Q: “Do harsh winters affect steel production?”
- Query: steel AND winter
- Output:
  - “Last WINTER, overproduction of STEEL led to...”
  - “STEEL automobiles resist WINTER weather poorly.”
  - “Boston must STEEL itself for another bad WINTER”
  - “the Pittsburgh STEELers started WINTER training...”
- Not output:
  - “Cold weather caused increased metal prices as orders for radiators and automobiles picked up...”
Implementing Boolean Search
- Typical query: an OR of ANDs
  - handle each OR separately, aggregate
- For ANDs, inverted index:
  - per term, a list of documents containing that term
  - intersect lists for query terms
  - basically a database join
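The inverted index and the AND intersection can be sketched in a few lines; the corpus here echoes the steel/winter example above, and the function names are illustrative:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each term to a sorted list of the document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, terms):
    """AND query: intersect the posting lists of all query terms."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

corpus = {
    1: "last winter overproduction of steel led to losses",
    2: "steel automobiles resist winter weather poorly",
    3: "cold weather caused increased metal prices",
}
index = build_inverted_index(corpus)
print(boolean_and(index, ["steel", "winter"]))  # [1, 2]
```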
Intersection Algorithms (as in DB)
- Method 1: direct list merge
  - linear work in summed size of lists
- Method 2: examine candidates
  - start with shortest term list
  - for each list entry, check for the other search terms
  - linear in smallest list
- Method 2 is good if at least one term is rare, but:
  - requires a forward index (list of terms in each document)
  - no gain if all search terms are common (“flying fish”)
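Both methods above can be sketched directly; in Method 2 a set-membership probe stands in for the forward-index check (a simplification, not the talk's data structure):

```python
def merge_intersect(a, b):
    """Method 1: linear merge of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def candidate_intersect(postings):
    """Method 2: scan the shortest list, probe the rest.

    postings maps term -> set of doc ids; the set probe stands in for
    the forward-index lookup described above.
    """
    by_size = sorted(postings.values(), key=len)
    shortest, rest = by_size[0], by_size[1:]
    return sorted(d for d in shortest if all(d in s for s in rest))
```

Method 2's cost depends only on the shortest list, which is why one rare term makes it attractive.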
Problems with Boolean Approach
- Synonymy
  - several words for the same thing
  - if the author used a different one, the query won’t match
- Polysemy
  - one word can mean many things (“bank”)
  - query matches wrong meanings
- Harsh cutoffs (1 wrong keyword kills)
  - user can’t type descriptive paragraph...
- Terms have uniform influence
  - repeated occurrence same as single occurrence
  - common terms treated same as rare ones
Fixing Problems
- Synonymy
  - thesaurus can add equivalent terms to query
  - increases recall, but lowers precision
  - expensive to construct (semantics: manual)
- Polysemy
  - use more query terms to disambiguate
  - user might not know more terms
  - increases precision, but lowers recall
- Harsh cutoffs
  - quorum system (maximize # of matching terms)
- Uniformity?
Vector Space Model
- Document is a vector with one coordinate per term
  - 0-1 for presence/absence of term (quorum)
  - real-valued to represent term “importance”
    - term frequency in document increases value
    - term frequency in corpus decreases value
- Dot product with query measures similarity
- Best known implementation: inverted index
  - for each query term, list documents containing it
  - accumulate dot products
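A minimal tf-idf instance of this model, scored through an inverted index of weights (function names are hypothetical; real systems add length normalization, stemming, and so on):

```python
import math
from collections import Counter, defaultdict

def build_index(corpus):
    """Inverted index of tf-idf weights: term -> [(doc_id, weight), ...]."""
    tf = {d: Counter(text.lower().split()) for d, text in corpus.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(len(corpus) / df[term]) for term in df}
    index = defaultdict(list)
    for d, counts in tf.items():
        for term, freq in counts.items():
            # frequency in document raises the weight; corpus frequency lowers it
            index[term].append((d, freq * idf[term]))
    return index

def score(index, query):
    """Accumulate dot products; query term weights taken as 1."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for d, w in index.get(term, ()):
            scores[d] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

idx = build_index({1: "steel steel winter", 2: "steel mill", 3: "warm summer"})
ranked = score(idx, "steel winter")  # doc 1 outranks doc 2; doc 3 unscored
```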
Vector Space Advantages
- Smoother than Boolean search
  - provides ranking rather than sharp cut-off
- Tends to allow/encourage queries with many nonzero terms
  - 100s or 1000s of terms
  - easy to “expand query” with synonyms
  - hopefully polysemes will “interfere constructively”
  - may even add relevant documents to query
P2P IR:
Simulating big iron
Web Search Info From Google
- Web queries
  - almost all queries are 2 terms only
  - “Boolean vector space” model (tiny recall OK)
  - Zipf distribution, so caching queries helps some
- Corpus
  - 3B pages, 10KB average size, 30TB total
  - inverted index: roughly the same size
  - fits in a “moderate” P2P system of 30K nodes
  - but must be partitioned. How?
Obvious: Partition Documents
- Each node builds a full inverted index for its subset
- Query quite tractable per node
- Merge results sent back from each node
- Used by Google (in data center) and Gnutella
- Drawback: query broadcast to all nodes
  - OK for Google data center; bad for P2P
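The broadcast-and-merge pattern above can be sketched as a scatter-gather loop, with a hypothetical `score_fn` standing in for whatever per-node ranking is used:

```python
import heapq

def search_partitioned(partitions, score_fn, k=10):
    """Broadcast query to each node's document subset, merge top-k lists."""
    gathered = []
    for docs in partitions:                       # one iteration per node
        hits = [(score_fn(d), d) for d in docs]   # node-local scoring
        gathered.extend(heapq.nlargest(k, hits))  # node returns its top k
    return heapq.nlargest(k, gathered)            # merge at the query issuer
```

The work per node stays small, but every node is touched by every query, which is the drawback noted above.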
Alternative: Partition Terms
- One node owns a few terms of the inverted index
  - term is the “key” for a distributed hash table
- Talk only to the nodes that own the query terms
  - they return the desired inverted-index lists
  - results intersected at the query issuer
- Drawback: transfer of huge inverted-index lists
- Alternative: send the first term-list to the second node
  - ships 1 (perhaps small) list instead of 2
Avoiding Communication
(Om Gnawali et al. @ MIT)
- Build inverted index on term pairs
  - pre-answering all queries
  - partition pairs among nodes
  - search contacts one node
- Problem: pre-computation cost
  - size-n document generates n² pairs
  - each pair must be communicated
  - each pair must be stored
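A sketch of the pair index (the DHT partitioning is elided here; each sorted term pair would serve as the lookup key routed to one node):

```python
from collections import defaultdict
from itertools import combinations

def build_pair_index(corpus):
    """Index every unordered term pair: one lookup answers a 2-term query."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        terms = sorted(set(text.lower().split()))
        for pair in combinations(terms, 2):  # n distinct terms -> ~n²/2 pairs
            index[pair].add(doc_id)
    return index

def pair_query(index, t1, t2):
    return sorted(index.get(tuple(sorted((t1, t2))), ()))

corpus = {
    1: "last winter overproduction of steel led to losses",
    2: "steel automobiles resist winter weather poorly",
    3: "cold weather caused increased metal prices",
}
pair_index = build_pair_index(corpus)
print(pair_query(pair_index, "winter", "steel"))  # [1, 2]
```

The quadratic blow-up in the `combinations` loop is exactly the pre-computation cost noted above.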
Good Cases
- Music search
  - “document” is song title + author
  - n small, so n² factor unimportant
- Document windows
  - usually, good docs have query terms “nearby”
  - scan window of length 5, take pairs in window
  - 10 pairs/window, so 10n per document
  - so linear in corpus size as before
  - bundle pairs to ship over sparse overlay
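The window trick can be sketched as follows: each length-5 window contributes C(5,2) = 10 pairs, so an n-term document yields at most about 10n distinct pairs rather than n²/2 (function name illustrative):

```python
from itertools import combinations

def window_pairs(terms, w=5):
    """Pairs of terms that co-occur within a sliding window of length w."""
    pairs = set()
    for i in range(max(1, len(terms) - w + 1)):
        for pair in combinations(terms[i:i + w], 2):  # C(5,2) = 10 per window
            pairs.add(tuple(sorted(pair)))
    return pairs

pairs = window_pairs(["a", "b", "c", "d", "e", "f", "g", "h"])
# ("a", "b") share a window; ("a", "h") are 7 positions apart, so no pair
```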
What About Vector Space?
- Weighting terms is easy
- But cannot limit search to the pair list
  - however, need only the highest-scored documents on individual terms
  - so, pre-compute and store a small “winner list” per term
- Vector space encourages many-term queries
  - find pairs with small intersection
  - index triples, quadruples, etc.
  - apply branch-and-bound techniques
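The winner-list idea can be sketched as truncated posting lists; `weights` (term → {doc: tf-idf weight}) is a hypothetical input shape, not a structure from the talk:

```python
import heapq
from collections import defaultdict

def winner_lists(weights, k=2):
    """Keep only the top-k scoring documents per term."""
    return {term: heapq.nlargest(k, docs.items(), key=lambda kv: kv[1])
            for term, docs in weights.items()}

def query_winners(winners, terms):
    """Accumulate dot products over the (truncated) winner lists."""
    scores = defaultdict(float)
    for term in terms:
        for d, w in winners.get(term, ()):
            scores[d] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])

weights = {"steel": {1: 3.0, 2: 1.0, 3: 0.5}, "winter": {1: 2.0, 3: 1.5}}
winners = winner_lists(weights, k=2)  # doc 3 is dropped from "steel"'s list
```

Only the short winner lists need to travel between nodes, at the cost of possibly missing documents that score moderately on many terms.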
Google Pushback
- No need for P2P
  - more precisely: “keep peers in our data center”
  - exploit high local communication bandwidth
- Economics support a large server farm
  - more load? buy more servers
- Main bottleneck: content-provider bandwidth
  - limits rate of crawl
  - Google index often weeks out of date
  - distributed crawler won’t help
Google Pushback Pushback
- P2P might help
  - let each node build its own index
  - ship changes to Google
- Potential applications:
  - real-time index
  - new-relevant-content notification
- Problem: spam
  - content providers will lie about index changes
  - use the P2P system to spot-check?
Person-to-Person IR
New modalities

P2P: Systems Perspective
- Distributed system has more resources:
  - computation/storage
  - reliability
  - latency
  - bandwidth
- Can exploit these, if successfully hidden
- Goal: simulate reliable big iron
  - solve traditional problems that need resources
  - file storage, factoring, database queries, IR
P2P: Social Perspective
- Applications based on person-to-person interactions:
  - messaging
  - linking/community building (the web)
  - reputation management (Mojo Nation)
  - file-sharing collaborations (just now)
- Need not run on top of a P2P network
The “Pathetic Fallacy” of P2P
- Assumption that the network layer should mirror the social layer
  - e.g., “peers should be nodes with similar interests”
- Many such applications work fine on one (big, reliable) machine
  - placement on a P2P system is “coincidental”
  - on the other side of the “one big machine” abstraction
- Breaching the abstraction has bad consequences
  - peering to “friends” unlikely to optimize efficiency, reliability
P2P Opportunity: Leverage Involvement of People
- Each individual manipulates information
  - in much more powerful, semantic ways than machines can achieve
- Record that manipulation
- Exploit it to help others do better retrieval
Link-based Retrieval
- Simultaneous work:
  - Kleinberg at IBM
  - Brin/Page at Stanford/Google
- People find “good” web pages, link to them
  - so, a page with large in-degree is good
  - refine: the target of many good nodes is good
- Mathematically, a random-walk model
  - PageRank = stationary probability of a random walk
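That stationary probability can be computed by power iteration; a minimal sketch (the 0.85 damping factor is the conventional choice, not a figure from the talk):

```python
def pagerank(links, d=0.85, iters=50):
    """Stationary distribution of a random walk that follows a link with
    probability d and teleports to a uniformly random page otherwise."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = d * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:                         # dangling page: teleport everywhere
                for u in nodes:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
# "a" is the target of the most (and best) pages, so it ranks highest
```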
Applications
- Search
  - raise relevance of high-PageRank pages
  - if lazy, limit corpus to high-PageRank pages
  - anchor text is a better description than page contents
- Crawl
  - PageRank computed before the page is seen
  - prioritize high-PageRank pages for crawl
- People add usable info no system could find
P2P: Systems/Social Interactions
- Distributed system has novel properties
- Exploit them to enable novel capabilities
- E.g., anonymity:
  - relies on partition of control/knowledge
- E.g., privacy:
  - allow limited access to my private information
  - gain a (false, but important) sense of safety by keeping it on my machine
Expertise Networks
- Haystack (Karger et al.), Shock (Adar et al.)
- Route questions to the appropriate expert
  - use text to describe knowledge
  - based on human entry, or indexing of a human’s personal files
- Might be unwilling to admit knowledge
  - P2P framework can protect anonymity
  - Shock achieves this by Gnutella-style query broadcast
  - more efficient approach?
Other New Aspects
- Personal information sharing
  - unwilling to “publish” mail, documents to the world
  - but might allow search, access in some cases
  - keeping data and index on one’s own machine gives a (false) sense of security, privacy
- Anonymity
  - P2P provides strong anonymity primitives
  - can be exploited, e.g., for “recommending” embarrassing content
Sample Application
- Social: “Secret Web”
  - maintain links for use by the PageRank algorithm
  - but links are secret from most others
- Need a random walk through the link path
  - implement via recursive lookup
- Censor-proof? Spam-proof?
Semantics vs. Syntax
- Clearly, using word meanings would help
- Some systems try to implement semantics
  - but this is a core AI problem, unsolved
  - current attempts don’t scale to large corpora
- All current large systems are syntactic only
- Idea: use the computational power of P2P
- Idea: use humans to attach semantics
Conclusion: Two Approaches to P2P
- Hide P2P (Partition to Partition)
  - goal: illusion of a single server
  - know how to do the task on a single server
  - devise tools to achieve the same in a distributed system
  - focus on surmounting drawbacks: systems
- Exploit P2P (Person to Person)
  - determine new opportunities afforded by P2P
  - perhaps impossible on a single server
  - focus on new applications: AI? HCI?