Lecture 9: Unstructured Data
• Information Retrieval
– Types of Systems, Documents, Tasks
– Evaluation: Precision, Recall
• Search Engines (Google)
– Architecture
– Web Crawling
– Query Processing
• Inverted Indexes
• PageRank (!)
Most of the IR portion of this material is taken from the course "Information Retrieval
on the Internet" by Maier and Price, taught at PSU in alternate years.
Slide 1
Learning Objectives
• LO9.1 Given a transition matrix, draw a transition
graph, and vice versa.
• LO9.2 Given a transition matrix and a residence
vector, decide if it is the PageRank for that matrix.
Slide 2
Information Retrieval (IR)
• The study of Unstructured Data is called Information Retrieval (IR)
• A Database refers to Structured Data

            DBMS                    IR
  Target    Structured data:        Unstructured data:
            rows in tables          documents, media, etc.
  Queries   SQL                     Keyword matching
            (precise)               (approximate)
  Results   Unordered list          List ordered by
            (unless specified)      matching priority
Slide 3
General types of IR systems
• Web Pages
• Full text documents
• Bibliographies
• Distributed variations
  – Metasearch
  – Virtual document collections
Slide 4
Types of Documents in IR Systems
• Hyperlinked or not
• Format
  – HTML
  – PDF
  – Word Processed
  – Scanned OCR
• Type
– Text
– Multimedia
– Semistructured, e.g., XML
• Static or Dynamic
Slide 5
Types of tasks in IR systems
• Find
  – an overview
  – a fact / an answer to a question
  – comprehensive information
  – a known item (document, page or site)
  – a site to execute a transaction (e.g., buy a book, download a file)
Slide 6
Evaluation
• How can we evaluate performance of an IR system?
– System perspective
– User perspective
• User perspective: Relevance
– (How well) does a document satisfy a user's need?
• Ideally, an IR system will retrieve exactly those items
that satisfy the user's needs, no more, no less.
• More: wastes user's time
• Less: user misses valuable information
Slide 7
Notation
In response to a user’s query:
The IR system
• reTrieves a set of documents T
The user
• knows the set of reLevant documents L
|X| denotes the number of documents in X
Ideally, T = L, no more (no junk), no less (nothing
missing)
Slide 8
The big picture
T = retrieved set, L = relevant set, T∩L = retrieved and relevant

  Retrieved, Not Relevant = Junk      Relevant, Not Retrieved = Missing

Precision = |T∩L| ÷ |T|
• = fraction of retrieved items that were relevant
• = 1 if No Junk, i.e., all retrieved items were relevant

Recall = |T∩L| ÷ |L|
• = fraction of relevant items that were retrieved
• = 1 if No Missing, i.e., all the relevant items were retrieved
Slide 9
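The two measures above can be sketched in a few lines. A toy illustration, where T (retrieved) and L (relevant) are Python sets of hypothetical document ids:

```python
def precision(T, L):
    """Fraction of retrieved items that are relevant: |T ∩ L| / |T|."""
    return len(T & L) / len(T)

def recall(T, L):
    """Fraction of relevant items that are retrieved: |T ∩ L| / |L|."""
    return len(T & L) / len(L)

T = {1, 2, 3, 4}   # retrieved: includes some junk (docs 1, 2)
L = {3, 4, 5}      # relevant: doc 5 is missing from the results

print(precision(T, L))   # 2 of 4 retrieved are relevant -> 0.5
print(recall(T, L))      # 2 of 3 relevant were retrieved -> 0.666...
```

With T = L both measures equal 1: no junk, nothing missing.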
Context
• Precision, Recall were created for IR systems that
retrieved from a small set of items.
• In that case one could calculate T and L.
• Web search engines do not fit this model well; T and
L are huge.
• Recall does not make sense in this model, but we
can apply the definition of "precision@10", measuring
the fraction of the first 10 results displayed that are
relevant.
Slide 10
Experiment
• Compute Precision@10,20 for Google, Bing and
Yahoo for this query:
– Paris Hilton Hotel
• Precision = fraction of retrieved items that are relevant
            Precision@10    Precision@20
  Google
  Bing
  Yahoo
Slide 11
Search Engine Architecture
• How often do you google?
• What happens when you google?
– http://www.google.com/corporate/tech.html
• Average time: half a second
• We need a crawler to create the indexes and docs.
– Notice that the web crawler creates the docs.
– From the docs, the indexes are created and the docs are
given ranks… cf. later slides.
• Let's study the Web Crawler Algorithm (WCA)
– Page 1143 of the handout
Slide 12
Web Crawler Algorithm
• Input: Set of popular URLs S
• Output: Repository of visited web pages R
• Method:
  1. If S is empty, end
  2. Select page p from S to crawl, delete p from S
  3. Get p* (page that p points to)
  4. If p* is in R, return to (1),
     • Else add p* to R, and add to S all outlinks from p* unless
       they are already in R or S
  5. Return to step (1)
Slide 13
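A minimal sketch of the WCA, assuming a hypothetical in-memory link table (`LINKS`) in place of real HTTP fetches; the sketch folds the fetch of a page and of its outlinks into one step. Popping from the front of S gives breadth-first order:

```python
# Hypothetical stand-in for fetching a page and extracting its links
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def fetch_outlinks(url):
    return LINKS.get(url, [])

def crawl(seeds, max_pages=100):
    S = list(seeds)   # frontier of URLs still to crawl
    R = set()         # repository of visited pages
    while S and len(R) < max_pages:   # step 1, plus a page limit
        p = S.pop(0)                  # step 2: select p, delete from S
        if p in R:                    # step 4: already visited, skip
            continue
        R.add(p)                      # add p to the repository
        for q in fetch_outlinks(p):   # add unseen outlinks to S
            if q not in R and q not in S:
                S.append(q)
    return R

print(crawl(["a"]))   # visits a, b, c
```

The `max_pages` cap is one of the termination limits discussed on the next slide.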
WCA: Terminating Search
• Limit the number of pages crawled
– Total number of pages, or
– Pages per site
• Limit the depth of the crawl
Slide 14
WCA: Managing the Repository
• Don't add duplicates to S
– Need an index on S, probably hash
• Don't add duplicates to R
– Cannot happen since we search each URL only once?
• A page can come from >1 URL; mirror sites
– So use hash table of pages in R
Slide 15
WCA: Select Next Page in S?
• Can use Random Search
• Better: Most Important First
– Can consider first set of pages to be most important
• As pages are added, make them less important
• Breadth first search
– Can do a simplified PageRank (cf. later) calculation
Slide 16
WCA: Faster, Faster
• Multiprogramming, Multiprocessing
– Must manage locks on S
• With billions of URLs, this becomes a bottleneck
• So assign each process to a host/site, not a URL
  – This can become a denial-of-service attack, so throttle down and
    take on several sites, organized by hash buckets
  – R also has bottleneck problems, which can be handled with
    locks
Slide 17
On to Query Processing
• Very different from structured data: no SQL, parser,
optimizer
• Input is boolean combination of keywords
– data [and] base
– data OR base
• Google's goal is an engine that "understands exactly
what you mean and gives you back exactly what you
want "
Slide 18
Inverted Indexes
• When the crawl is complete, the search engine
builds, for each and every word, an inverted index.
• An inverted index is a list of all documents containing
that word
– The index may be a bit vector
– It may also contain the location(s) of the word in the
document
• Word: any word in any language, plus misspelling,
plus any sequence of characters surrounded by
punctuation!
Hundreds of millions of words
Farms of PCs, e.g. near Bonneville Dam, to hold all this data
Slide 19
Mechanics of Query Processing
1. Relevant inverted indexes are found
   – Typically the indexes are in memory; otherwise this could
     take a full half second
2. If they are bit vectors, they are ANDed or ORed,
   then materialized, then lists are handled
• Result is many URLs.
• Next step is to determine their rank so the highest
ranked URLs can be delivered to the user.
Slide 20
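The index build and the boolean step can be sketched together. This toy uses Python sets of document ids as posting lists (standing in for the bit vectors above), over a hypothetical three-document collection:

```python
docs = {
    1: "data base systems",
    2: "data mining",
    3: "knowledge base",
}

# Inverted index: word -> set of ids of documents containing it
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# Boolean query processing: AND is intersection, OR is union
and_result = index["data"] & index["base"]   # "data AND base"
or_result = index["data"] | index["base"]    # "data OR base"
print(and_result)   # {1}
print(or_result)    # {1, 2, 3}
```

A real engine would also store word positions per document, to support the proximity criteria on the next slide.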
Ranking Pages
• Indexes have returned pages. Which ones are most
relevant to you?
• There are many criteria for ranking pages; here are
some no-brainers (except !)
  – Presence of all words
  – All words close together
  – Words in important locations and formats on the page
  – ! Words near anchor text of links in reference pages
• But the killer criterion is PageRank
Slide 21
PageRank Intuition
• You need to find a plumber. How do you do it?
1. Call plumbers and talk to them
2. ! Call friends and ask for plumber references
   • Then choose plumbers who have the most references
3. !! Call friends who know a lot about plumbers (important
   friends) and ask them for plumber references
   • Then choose plumbers who have the most references from
     important people.
• Technique 1 was used before Google.
• Google introduced technique 2 to search engines
• Google also introduced technique 3
• Techniques 2, and especially 3, wiped out the competition.
• The big challenge: determine which pages are important
Slide 22
What does this mean for pages?
1. Most search engines look for pages containing the
word "plumber"
2. Google searches for pages that are linked to by
pages containing "plumber".
3. Google searches for pages that are linked to by
important pages containing "plumber".
• A web page is important if many important pages
link to it.
– This is a recursive equation.
– Google solves it by imagining a web walker.
Slide 23
The Web Walker
• From page p, the walker follows a random link in p
– Note that all links in p have equal weight
• The walker walks for a very, very, long time.
• A residence vector [ y a m ] describes the percentage
of time that the walker spends on each page
– What does the vector [1/3 1/3 1/3 ] mean?
• In steady state, the residence vector will be (1st draft
of) the PageRank
• Observe: pages with many in-links are visited often
• Observe: important pages are visited most often
Slide 24
Stochastic Transition Matrix
• To describe the page walker's moves, we use a
stochastic transition matrix.
– Stochastic = each column sums to 1
• There are 3 web pages: Yahoo, Amazon and Microsoft
• This matrix means that the Yahoo page has 2 outlinks, to
Yahoo (a self-link) and to Amazon, etc.
Matrix =
             Y     A     M
      Y  [  1/2   1/2    0  ]
      A  [  1/2    0     1  ]
      M  [   0    1/2    0  ]
Slide 25
Transition Graph
• Each Transition Matrix corresponds to a Transition
Graph, e.g.
  [Transition graph: Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), M→A (1)]
Slide 26
LO9.1:Transition Graph*
• What is the Transition Graph for this Matrix?
             Y     A     M
      Y  [   0    1/2   2/3 ]
      A  [  1/3    0    1/3 ]
      M  [  2/3   1/2    0  ]
Slide 27
Solving for Page Rank
• For small dimension matrices it is simple to calculate
the PageRank using Gaussian Elimination.
• Remember [y,a,m] is the time the walker spends at
each site. Since it is a probability distribution,
y+a+m=1. Since the walker has reached steady
state,
      [ 1/2  1/2   0 ] [ y ]   [ y ]
      [ 1/2   0    1 ] [ a ] = [ a ]
      [  0   1/2   0 ] [ m ]   [ m ]
Slide 28
Solving, ctd
• Solving such small equations is easy, but in reality
the matrix dimension is the number of pages in the
web, so it is in the billions.
• There is a simpler way, called relaxation.
• Start with a distribution, typically equal values, and
transform it by the matrix.
      [ 2/6 ]   [ 1/2  1/2   0 ] [ 1/3 ]
      [ 3/6 ] = [ 1/2   0    1 ] [ 1/3 ]
      [ 1/6 ]   [  0   1/2   0 ] [ 1/3 ]
Slide 29
Solving, ctd
• If we repeat this only 5-10* times the vectors
converge to values very close to [2/5,2/5,1/5]. Check
that this is a solution:
      [ 1/2  1/2   0 ] [ 2/5 ]   [ 2/5 ]
      [ 1/2   0    1 ] [ 2/5 ] = [ 2/5 ]
      [  0   1/2   0 ] [ 1/5 ]   [ 1/5 ]
• This solution gives the PageRank of each page on
the Web.
• It is also called the eigenvector of the matrix with
eigenvalue one.
• Does this agree with our intuition about Page Rank?
*For real web values, at most 100 iterations suffice
Slide 30
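The relaxation loop can be sketched directly. A small script, assuming the three-page Y/A/M matrix above: start from the uniform distribution and repeat the multiplication; the vector converges to [2/5, 2/5, 1/5]:

```python
# Stochastic transition matrix from slide 25 (columns Y, A, M sum to 1)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

def step(M, v):
    """One relaxation step: transform the residence vector by the matrix."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

v = [1/3, 1/3, 1/3]    # initial distribution: equal values
for _ in range(100):   # the slide's bound: at most 100 iterations suffice
    v = step(M, v)

print(v)   # very close to the PageRank [2/5, 2/5, 1/5]
```

At the fixed point, v = M·v: the residence vector is the eigenvector of M with eigenvalue one, as stated above.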
LO9.2: Identify Solution
• Is [ 3/8, 1/4, 3/8 ] a solution for this transition matrix ?
      [  0    1/2   2/3 ]
      [ 1/3    0    1/3 ]
      [ 2/3   1/2    0  ]
Slide 31
A Spider Trap
• Let's look at a more realistic example called a spider
trap.
M =
      [ 1/2  1/2   0 ]
      [ 1/2   0    0 ]
      [  0   1/2   1 ]

• The Transition Graph is:
  [Transition graph: Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2), M→M (1)]
• M represents any set of web pages that does not
  have a link outside the set.
Slide 32
A Spider Trap
• The Page Rank is:
      [ 0 ]   [ 1/2  1/2   0 ] [ 0 ]
      [ 0 ] = [ 1/2   0    0 ] [ 0 ]
      [ 1 ]   [  0   1/2   1 ] [ 1 ]
• Relaxation arrives at this vector because a random
walker arrives at M and stays there in a loop.
• This Page Rank vector violates the Page Rank
principle that inlinks should determine importance.
Slide 33
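The drain into the trap can be checked numerically: the same relaxation loop as before, run on the spider-trap matrix above, sends all the walker's time to page M:

```python
# Spider-trap transition matrix from slide 32 (page M links only to itself)
T = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

v = [1/3, 1/3, 1/3]    # start from the uniform distribution
for _ in range(200):
    v = [sum(T[i][j] * v[j] for j in range(3)) for i in range(3)]

print(v)   # approaches [0, 0, 1]: the trap absorbs everything
```

This is the [0, 0, 1] Page Rank computed above, which ignores the inlinks of Y and A entirely.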
A Dead End
• A similar example, called a dead end, is
M =
      [ 1/2  1/2   0 ]
      [ 1/2   0    0 ]
      [  0   1/2   0 ]

• The Transition Graph is:
  [Transition graph: Y→Y (1/2), Y→A (1/2), A→Y (1/2), A→M (1/2); M has no out-links]
• M represents any set of web pages that does not
  have out-links.
Slide 34
A Dead End, ctd
• A dead end matrix is not stochastic, because the
column for M sums to 0, not 1.
• The only eigenvector for a dead end matrix is the
zero vector.
• Relaxation arrives at the zero vector because a
random walker arrives at M and then has nowhere to
go.
Slide 35
What to do?
• In these cases, which happen all the time on the web,
the web walker algorithm does not identify which
pages are truly important.
• But we can tweak the algorithm to do so: Every 5th
walk, or so, the walker steps to a random page on the
web.
• Then the walk (spider trap example) becomes
               [ 1/2  1/2   0 ]               [ 1/3 ]
  Pnew = 0.8 * [ 1/2   0    0 ] * Pold + 0.2 * [ 1/3 ]
               [  0   1/2   1 ]               [ 1/3 ]
Slide 36
Teleporter
• Now our tweaked random walker is a teleporter.
• With probability 80%* s/he follows a random link from
the current page, as before.
• But with probability 20% s/he teleports to a random
page with uniform probability.
– It could be anywhere on the web, even the current page
• If s/he is at a dead end, with 100% probability s/he
teleports to a random page with uniform probability.
*80-20% are tunable parameters
Slide 37
Solving the Teleporter Equation
• The equation on slide 36 describes the teleporter's
walk. It can be solved using relaxation or Gaussian
elimination.
• The solution is (7/33, 5/33, 21/33).
• It gives unreasonably high importance to M, but does
recognize that Y is more important than A.
Slide 38
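The teleporter equation of slide 36 can also be checked numerically. A sketch of relaxation on the damped walk, using the 80/20 split and the spider-trap matrix; it converges to the stated solution (7/33, 5/33, 21/33):

```python
# Spider-trap matrix from slide 32
T = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

p = [1/3, 1/3, 1/3]
for _ in range(100):
    # Pnew = 0.8 * T @ Pold + 0.2 * [1/3, 1/3, 1/3]
    p = [0.8 * sum(T[i][j] * p[j] for j in range(3)) + 0.2 / 3
         for i in range(3)]

print(p)   # close to [7/33, 5/33, 21/33]
```

The 0.2/3 teleport term keeps mass flowing back to every page, so the trap can no longer absorb everything, though M still ends up unreasonably important.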