
INF 2914

Information Retrieval and Web Search

Lecture 10: Query Processing

These slides are adapted from Stanford’s class CS276 / LING 286 Information Retrieval and Web Mining


Algorithms for Large Data Sets

Ziv Bar-Yossef, http://www.ee.technion.ac.il/courses/049011

Abstract Formulation

• Ingredients:
  - D: document collection
  - Q: query space
  - f: D x Q → R: relevance scoring function
  - For every q in Q, f induces a ranking (partial order) ≤_q on D
• Functions of an IR system:
  - Preprocess D and create an index I
  - Given q in Q, use I to produce a permutation π on D

Document Representation

• T = {t_1, …, t_k}: a "token space" (a.k.a. "feature space" or "term space")
  - Ex: all words in English
  - Ex: phrases, URLs, …
• A document: a real vector d in R^k
  - d_i: "weight" of token t_i
  - Ex: d_i = normalized # of occurrences of t_i in d
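
As a small illustration (my own sketch, not from the slides), the weight vector above can be built from raw text using normalized occurrence counts; the tokenizer and the tiny token space are assumptions made for the example.

    import re
    from collections import Counter

    def doc_vector(text, token_space):
        """Represent a document as a vector over a fixed token space;
        each weight is the normalized number of occurrences of the token."""
        tokens = re.findall(r"[a-z]+", text.lower())      # simplistic tokenizer (assumption)
        counts = Counter(t for t in tokens if t in token_space)
        total = sum(counts.values()) or 1                 # avoid division by zero for empty docs
        return [counts[t] / total for t in token_space]

    token_space = ["information", "retrieval", "web", "search"]
    print(doc_vector("Information retrieval and web search; web search is retrieval.", token_space))
    # [0.142..., 0.285..., 0.285..., 0.285...]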

Classic IR (Relevance) Models

• The Boolean model
• The Vector Space Model (VSM)

The Boolean Model

• A document: a boolean vector d in {0,1}^k
  - d_i = 1 iff t_i belongs to d
• A query: a boolean formula q over tokens
  - q: {0,1}^k → {0,1}
  - Ex: "Michael Jordan" AND (NOT basketball)
  - Ex: +"Michael Jordan" -basketball
• Relevance scoring function: f(d,q) = q(d)
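
A minimal sketch (mine, not from the slides) of the Boolean model: a document is reduced to its set of tokens and a query is a boolean predicate over that set; the documents and the phrase token are hypothetical.

    def query(doc_tokens):
        """The example query +"Michael Jordan" -basketball, written as a predicate q(d)."""
        return ("michael jordan" in doc_tokens) and ("basketball" not in doc_tokens)

    docs = {
        "d1": {"michael jordan", "machine learning", "berkeley"},
        "d2": {"michael jordan", "basketball", "bulls"},
    }
    # f(d, q) = q(d): relevance is 1 for d1 and 0 for d2
    print([d for d, tokens in docs.items() if query(tokens)])   # ['d1']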

The Boolean Model: Pros & Cons

• Advantages:
  - Simplicity for users
• Disadvantages:
  - Relevance scoring is too coarse

The Vector Space Model (VSM)

• A document: a real vector d in R^k
  - d_i = weight of t_i in d (usually its TF-IDF score)
• A query: a real vector q in R^k
  - q_i = weight of t_i in q
• Relevance scoring function: f(d,q) = sim(d,q), the "similarity" between d and q

Popular Similarity Measures

• L_1 or L_2 distance
  - d and q are first normalized to have unit norm; the distance is then the norm of d - q
• Cosine similarity
  - the cosine of the angle between d and q
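
A short sketch of the two measures listed above, assuming d and q are given as equal-length real vectors (a simplification of a real index):

    import math

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    def l2_distance(d, q):
        """L2 distance after normalizing d and q to unit norm."""
        d, q = normalize(d), normalize(q)
        return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

    def cosine_similarity(d, q):
        """cos(d, q) = (d . q) / (||d|| * ||q||)"""
        dot = sum(di * qi for di, qi in zip(d, q))
        nd, nq = math.sqrt(sum(x * x for x in d)), math.sqrt(sum(x * x for x in q))
        return dot / (nd * nq) if nd and nq else 0.0

    d, q = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
    print(l2_distance(d, q), cosine_similarity(d, q))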

TF-IDF Score: Motivation

• Motivating principle: a term t_i is relevant to a document d if:
  - t_i occurs many times in d relative to the other terms that occur in d
  - t_i occurs many times in d relative to its number of occurrences in other documents
• Examples:
  - 10 out of 100 terms in d are "java"
  - 10 out of 10,000 terms in d are "java"
  - 10 out of 100 terms in d are "the"

TF-IDF Score: Definition

• Notation:
  - n(d,t_i) = # of occurrences of t_i in d
  - N = Σ_i n(d,t_i)  (# of tokens in d)
  - D_i = # of documents containing t_i
  - D = # of documents in the collection
• TF(d,t_i): "Term Frequency"
  - Ex: TF(d,t_i) = n(d,t_i) / N
  - Ex: TF(d,t_i) = n(d,t_i) / (max_j { n(d,t_j) })
• IDF(t_i): "Inverse Document Frequency"
  - Ex: IDF(t_i) = log(D / D_i)
• TFIDF(d,t_i) = TF(d,t_i) × IDF(t_i)
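
A hedged sketch of the first TF variant and the logarithmic IDF defined above; the three-document collection is a toy assumption.

    import math
    from collections import Counter

    def tfidf(doc, term, collection):
        """TFIDF(d, t) = TF(d, t) x IDF(t), with TF = n(d, t) / N and IDF = log(D / D_t)."""
        tf = Counter(doc)[term] / len(doc)               # n(d, t) / N
        d_t = sum(1 for d in collection if term in d)    # number of documents containing t
        return tf * math.log(len(collection) / d_t) if d_t else 0.0

    collection = [
        ["java", "code", "java", "java"],
        ["the", "coffee", "java"],
        ["the", "cat", "sat", "mat"],
    ]
    print(tfidf(collection[0], "java", collection))   # ~0.30: "java" is frequent in its document
    print(tfidf(collection[2], "the", collection))    # ~0.10: lower TF here; a real corpus would also give "the" a much lower IDF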

VSM: Pros & Cons

• Advantages:
  - Better granularity in relevance scoring
  - Good performance in practice
  - Efficient implementations
• Disadvantages:
  - Assumes term independence

Retrieval Evaluation

• Notations:
  - D: document collection
  - D_q: documents in D that are "relevant" to query q
    - Ex: f(d,q) is above some threshold
  - L_q: list of results on query q

• Recall = |L_q ∩ D_q| / |D_q|
• Precision = |L_q ∩ D_q| / |L_q|

Precision & Recall: Example

List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5

Relevant docs: d123, d56, d9, d25, d3

• Recall(A) = 80%,  Precision(A) = 40%
• Recall(B) = 100%, Precision(B) = 50%
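
A quick sketch that recomputes the numbers above from the two result lists:

    def precision_recall(results, relevant):
        hits = len(set(results) & relevant)
        return hits / len(results), hits / len(relevant)     # (precision, recall)

    list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
    list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]
    relevant = {"d123", "d56", "d9", "d25", "d3"}

    print(precision_recall(list_a, relevant))   # (0.4, 0.8)
    print(precision_recall(list_b, relevant))   # (0.5, 1.0)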

Precision@k and Recall@k

• Notations:
  - D_q: documents in D that are "relevant" to q
  - L_q,k: the top k results on the list

• Recall@k = |L_q,k ∩ D_q| / |D_q|
• Precision@k = |L_q,k ∩ D_q| / k

Precision@k: Example

Lists A and B and the relevant documents are as in the previous example.

[Chart: precision@k for Lists A and B, k = 1 through 10]

Recall@k: Example

Lists A and B and the relevant documents are as in the previous example.

[Chart: recall@k for Lists A and B, k = 1 through 10]
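
The precision@k and recall@k curves shown in the charts above can be recomputed directly from the example lists; a small sketch (the helper name is mine):

    def at_k(results, relevant, k):
        hits = len(set(results[:k]) & relevant)
        return hits / k, hits / len(relevant)                # (precision@k, recall@k)

    list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
    list_b = ["d81", "d74", "d56", "d123", "d511", "d25", "d9", "d129", "d3", "d5"]
    relevant = {"d123", "d56", "d9", "d25", "d3"}

    for k in range(1, 11):
        pa, ra = at_k(list_a, relevant, k)
        pb, rb = at_k(list_b, relevant, k)
        print(f"k={k:2d}  P@k: A={pa:.2f} B={pb:.2f}   R@k: A={ra:.2f} B={rb:.2f}")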

“Interpolated” Precision

• Notations:
  - D_q: documents in D that are "relevant" to q
  - r: a recall level (e.g., 20%)
  - k(r): the first k such that recall@k >= r

• Interpolated precision at recall level r = max { precision@k : k >= k(r) }
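
A sketch of this definition, reusing list A from the earlier example (the helper name is mine):

    def interpolated_precision(results, relevant, r):
        """max{ precision@k : k >= k(r) }, where k(r) is the first k with recall@k >= r."""
        precision, recall, hits = [], [], 0
        for k, doc in enumerate(results, start=1):
            hits += doc in relevant
            precision.append(hits / k)
            recall.append(hits / len(relevant))
        reached = [k for k in range(1, len(results) + 1) if recall[k - 1] >= r]
        if not reached:
            return 0.0                    # recall level r is never reached by this list
        return max(precision[reached[0] - 1:])

    relevant = {"d123", "d56", "d9", "d25", "d3"}
    list_a = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187", "d25"]
    print(interpolated_precision(list_a, relevant, 0.20))   # 1.0: recall reaches 20% already at rank 1
    print(interpolated_precision(list_a, relevant, 0.60))   # 0.5: best precision at or after the rank where recall first hits 60%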

Precision vs. Recall: Example

Lists A and B and the relevant documents are as in the previous example.

[Chart: precision vs. recall (0% to 100%) for Lists A and B]

Top-k Query Processing

Optimal aggregation algorithms for middleware, by Ronald Fagin, Amnon Lotem, and Moni Naor. Based on the presentation of Wesley Sebrechts and Joost Voordouw; modified by Vagelis Hristidis.

Why top-k query processing

• Multimedia brings fuzzy data
  - attribute values are graded, typically in [0,1]
  - no clear boundary between "answer" / "no answer"
• Example: a query in a multimedia database means combining graded attributes
  - combine attributes with an aggregation function
  - the aggregation function gives the overall grade of an object
  - return the k objects with the highest overall grade

Top-k query processing

Top-k query processing = finding the k objects that have the highest overall grades.

• How? Which algorithms?
  - Fagin's Algorithm (FA)
  - Threshold Algorithm (TA)
  - Which is the best algorithm?

Keep in mind: the database system serves as middleware
• multimedia (objects) may be kept in different subsystems
  - e.g. photoDB, videoDB, search engine
• take the limitations of these subsystems into account

Example

• Simple database model
• Simple query
• Explaining Fagin's Algorithm (FA)
• Finding the top-k with FA
• Explaining the Threshold Algorithm (TA)
• Finding the top-k with TA

Example – Simple Database Model

A database with N objects and M attributes; each attribute has its own sorted list.

  Object ID | Attribute 1 | Attribute 2
  a         | 0.9         | 0.85
  b         | 0.8         | 0.7
  c         | 0.72        | 0.2
  ...       | ...         | ...
  d         | 0.6         | 0.9

Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), ..., (d, 0.6)
Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), ..., (c, 0.2)

Example – Simple Query

Find the top 2 (k = 2) objects for the following 'query' executed on the middleware:

  A1 & A2   (e.g.: color = red & shape = round)

A1 & A2 as a 'query' to the middleware means that the middleware combines the grades of A1 and A2 by min(A1, A2).

Aggregation function:
• a function that gives each object an overall grade based on its attribute grades
• examples: min, max
• monotonicity is required!

Example – Fagin's Algorithm, STEP 1

• Read attributes from every sorted list (sorted access, in parallel)
• Stop when k objects have been seen in common from all lists

Sorted access reads (a, 0.9), (b, 0.8), (c, 0.72) from L1 and (d, 0.9), (a, 0.85), (b, 0.7) from L2; at this point a and b have been seen in both lists (k = 2), so sorted access stops.

  ID | A1   | A2   | Min(A1,A2)
  a  | 0.9  | 0.85 |
  d  |      | 0.9  |
  b  | 0.8  | 0.7  |
  c  | 0.72 |      |

Example – Fagin's Algorithm, STEP 2

• Random access to find the missing grades

  ID | A1   | A2   | Min(A1,A2)
  a  | 0.9  | 0.85 |
  d  | 0.6  | 0.9  |
  b  | 0.8  | 0.7  |
  c  | 0.72 | 0.2  |

Example – Fagin's Algorithm, STEP 3

• Compute the grades of the seen objects.
• Return the k highest-graded objects.

  ID | A1   | A2   | Min(A1,A2)
  a  | 0.9  | 0.85 | 0.85
  d  | 0.6  | 0.9  | 0.6
  b  | 0.8  | 0.7  | 0.7
  c  | 0.72 | 0.2  | 0.2

Top-2 result: a (0.85) and b (0.7).
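
Putting the three steps together, here is a hedged Python sketch of Fagin's Algorithm (my phrasing, with min as the aggregation function and each list given as (object, grade) pairs in decreasing grade order):

    def fagins_algorithm(sorted_lists, k, agg=min):
        """FA: sorted access in parallel until k objects have been seen in all lists,
        then random access for the missing grades, then return the k best objects."""
        m = len(sorted_lists)
        seen = {}                                     # object -> {list index: grade}
        depth = 0
        while sum(1 for g in seen.values() if len(g) == m) < k:
            for i, lst in enumerate(sorted_lists):    # STEP 1: parallel sorted access
                obj, grade = lst[depth]
                seen.setdefault(obj, {})[i] = grade
            depth += 1
        lookup = [dict(lst) for lst in sorted_lists]  # STEP 2: random access for missing grades
        for obj, grades in seen.items():
            for i in range(m):
                grades.setdefault(i, lookup[i][obj])
        overall = {obj: agg(g.values()) for obj, g in seen.items()}   # STEP 3: aggregate
        return sorted(overall.items(), key=lambda x: -x[1])[:k]

    L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
    L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
    print(fagins_algorithm([L1, L2], k=2))   # [('a', 0.85), ('b', 0.7)]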

New Idea: the Threshold Algorithm (TA)

• Read all grades of an object once it has been seen under sorted access
  - no need to wait until the lists give k common objects
• Do sorted access (and the corresponding random accesses) until you have seen the top k answers
• How do we know that the grades of seen objects are higher than the grades of unseen objects?
  - predict the maximum possible grade of unseen objects: the threshold T

Example: after three sorted accesses per list, the seen prefixes are L1: a: 0.9, b: 0.8, c: 0.72 and L2: d: 0.9, a: 0.85, b: 0.7; entries further down (e.g. d: 0.6 in L1, f: 0.6 in L2) are possibly unseen, and no unseen object can have an overall grade above

T = min(0.72, 0.7) = 0.7

Example – Threshold Algorithm, Step 1

• Do sorted access to each list in parallel
• For each object seen:
  - get all of its grades by random access
  - determine Min(A1, A2)
  - if it is among the 2 highest seen, keep it in the buffer

First round of sorted access: a from L1 (0.9) and d from L2 (0.9).

  ID | A1  | A2   | Min(A1,A2)
  a  | 0.9 | 0.85 | 0.85
  d  | 0.6 | 0.9  | 0.6

Example – Threshold Algorithm, Step 2

• Determine the threshold value based on the objects currently seen under sorted access: T = min(L1, L2)
• Are there 2 objects with overall grade >= T? If so, stop; otherwise go to the next position in the sorted lists and repeat Step 1.

T = min(0.9, 0.9) = 0.9

  ID | A1  | A2   | Min(A1,A2)
  a  | 0.9 | 0.85 | 0.85
  d  | 0.6 | 0.9  | 0.6

Neither buffered grade reaches 0.9, so the algorithm continues.

Example – Threshold Algorithm, Step 1 (again)

• Do sorted access to each list in parallel
• For each object seen:
  - get all of its grades by random access
  - determine Min(A1, A2)
  - if it is among the 2 highest seen, keep it in the buffer

Second round of sorted access: b from L1 (0.8) and a from L2 (0.85, already seen).

  ID | A1  | A2   | Min(A1,A2)
  a  | 0.9 | 0.85 | 0.85
  d  | 0.6 | 0.9  | 0.6
  b  | 0.8 | 0.7  | 0.7

Example – Threshold Algorithm, Step 2 (again)

• Determine the threshold value based on the objects currently seen: T = min(L1, L2)
• Are there 2 objects with overall grade >= T? If so, stop; otherwise go to the next position in the sorted lists and repeat Step 1.

T = min(0.8, 0.85) = 0.8

  ID | A1  | A2   | Min(A1,A2)
  a  | 0.9 | 0.85 | 0.85
  b  | 0.8 | 0.7  | 0.7

Only a reaches 0.8, so the algorithm continues.

Example – Threshold Algorithm, situation at the stopping condition

After the third round of sorted access (c from L1 at 0.72, b from L2 at 0.7):

T = min(0.72, 0.7) = 0.7

  ID | A1  | A2   | Min(A1,A2)
  a  | 0.9 | 0.85 | 0.85
  b  | 0.8 | 0.7  | 0.7

Both buffered objects have an overall grade >= 0.7, so TA stops and returns a and b.
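
A hedged sketch of TA as walked through above: parallel sorted access, random access for each newly seen object, and a stop as soon as k buffered grades reach the threshold. For clarity this sketch keeps every seen object rather than only the k best (a real TA needs only a bounded buffer of size k).

    def threshold_algorithm(sorted_lists, k, agg=min):
        """TA: stop once k objects have overall grade >= T,
        where T = agg(last grades seen under sorted access in each list)."""
        lookup = [dict(lst) for lst in sorted_lists]       # random access by object id
        best = {}                                          # object -> overall grade
        for depth in range(len(sorted_lists[0])):
            last = []
            for lst in sorted_lists:                       # parallel sorted access
                obj, grade = lst[depth]
                last.append(grade)
                if obj not in best:                        # random access to the other lists
                    best[obj] = agg(l[obj] for l in lookup)
            threshold = agg(last)
            top = sorted(best.items(), key=lambda x: -x[1])[:k]
            if len(top) == k and all(g >= threshold for _, g in top):
                return top                                 # stopping condition reached
        return sorted(best.items(), key=lambda x: -x[1])[:k]

    L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
    L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
    print(threshold_algorithm([L1, L2], k=2))   # [('a', 0.85), ('b', 0.7)], stopping after 3 rounds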

Comparison of Fagin's and Threshold Algorithm

• TA sees fewer objects than FA
  - TA stops at least as early as FA
  - when FA has seen k objects in common, their grades are greater than or equal to the threshold in TA
• TA may perform more random accesses than FA
  - in TA, (m - 1) random accesses for each object seen
  - in FA, random accesses are done at the end, only for the missing grades
• TA requires only bounded buffer space (k); FA uses unbounded buffers
  - at the expense of more random seeks

The best algorithm

Which algorithm is the best?

• Define "best"
  - middleware cost
  - the concept of instance optimality
• Consider:
  - wild guesses
  - aggregation function characteristics: monotone, strictly monotone, strict
  - database restrictions: the distinctness property

The best algorithm: concept of optimality

• A = a class of algorithms; A ∈ A denotes an algorithm in the class
• D = the class of legal inputs (databases); D ∈ D denotes a database
• middleware cost = cost for processing data from the subsystems = sorted-access cost + random-access cost
• Cost(A, D) = middleware cost when running algorithm A over database D

Algorithm B is instance optimal over A and D if B ∈ A and Cost(B, D) = O(Cost(A, D)) for every A ∈ A and every D ∈ D.

Which means: Cost(B, D) ≤ c · Cost(A, D) + c' for every A ∈ A and every D ∈ D; the constant c is called the optimality ratio.

The best algorithm: instance optimality & wild guesses

Intuitively: B instance optimal = always the best algorithm in A = always optimal.
In reality, "always" comes with a caveat: we exclude algorithms that make wild guesses.

Wild guess = a random access to an object not previously encountered by sorted access
• in practice not possible: the database needs to know the object ID to do a random access
• if wild guesses are allowed in A, then no algorithm can be instance optimal
• wild guesses can find the top-k objects with k·m random accesses (k = # objects, m = # lists)

The best algorithm: aggregation functions

• An aggregation function t combines the object grades x_1, …, x_m into the object's overall grade t(x_1, …, x_m)
• Monotone: t(x_1, …, x_m) ≤ t(x'_1, …, x'_m) if x_i ≤ x'_i for every i
• Strictly monotone: t(x_1, …, x_m) < t(x'_1, …, x'_m) if x_i < x'_i for every i
• Strict: t(x_1, …, x_m) = 1 precisely when x_i = 1 for every i

The best algorithm: database restrictions

Distinctness property: the database has no (sorted) attribute list in which two objects have the same grade.

Fagin’s Algorithm

• database with N objects, each with m attributes
• orderings of the lists are independent
• FA finds the top k with middleware cost O(N^((m-1)/m) · k^(1/m))
• FA is optimal with high probability in the worst case for strict monotone aggregation functions

Threshold Algorithm

• TA is instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses): optimal in a much stronger sense than Fagin's Algorithm
• for a strict monotone aggregation function: optimality ratio = m + m(m-1)·c_R/c_S = best possible (m = # attributes)
  - if random access is not possible (c_R = 0): optimality ratio = m
  - if sorted access is not possible (c_S = 0): optimality ratio is infinite, so TA is not instance optimal
• TA is instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property
  - optimality ratio = c·m^2 with c = max{c_R/c_S, c_S/c_R}

Optimized Query Execution in Large Search Engines with Global Page Ordering

Xiaohui Long and Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201

The Problem:

“how to optimize query throughput in large search engines, when the ranking function is a combination of term-based ranking and a global ordering such as Pagerank”

Talk Outline:

• intro: query processing in search engines
• related work: query execution and pruning techniques
• algorithmic techniques
• experimental evaluation: single and multiple nodes
• concluding remarks

Query Processing in Parallel Search Engines

• low-cost cluster architecture (usually with additional replication), with nodes connected by a LAN

  [Diagram: cluster with global index organization; each node holds an index over its pages, and a query integrator broadcasts each query and combines the results]

• local index: every node stores and indexes a subset of the pages
• every query is broadcast to all nodes by the query integrator (QI)
• every node supplies its top 10, and the QI computes the global top 10
• note: we don't really need the top 10 from all nodes, maybe only the top 2

Related Work on top-k Queries

• IR: optimized evaluation of cosine measures (since the 1980s)
• DB: top-k queries for multimedia databases (Fagin 1996)
  - does not consider combinations of term-based and global scores
• Brin/Page 1998: fancy lists in Google

Related Work (IR)

• basic idea: "presort entries in each inverted list by contribution to the cosine"
• also process inverted lists from the shortest to the longest list
• various schemes, either reliable or probabilistic
• most closely related:
  - Persin/Zobel/Sacks-Davis 1993/96
  - Anh/Moffat 1998, Anh/deKretzer/Moffat 2001
• typical assumptions: many keywords per query, OR semantics

Related Work (DB)

• (Fagin 1996 and others)
• motivation: searching multimedia objects by several criteria
• typical assumptions: few attributes, OR semantics, random access
• FA (Fagin's algorithm), TA (Threshold algorithm), and others
• term-based ranking: presort each list by contribution to the cosine

Related Work (Google)

(Brin/Page 1998)

• "fancy lists" optimization in Google
• create an extra, shorter inverted list for "fancy matches" (matches that occur in the URL, anchor text, title, bold face, etc.)

  [Diagram: for terms such as "chair" and "table", a short fancy list followed by the rest of the list with the other matches]

• note: fancy matches can be modeled by higher weights in the term-based vector space model
• no details given or numbers published

Results of our Paper

• pruning techniques for query execution in large search engines
• focus on a combination of a term-based score and a global score (such as Pagerank)
• techniques combine previous approaches such as fancy lists and presorting of lists by term scores
• experimental evaluation on 120 million pages
• very significant savings with almost no impact on the results
• it's good to have a global ordering!

Algorithms:

• exhaustive algorithm: "no pruning, traverse the entire lists"
• first-m: "a naïve algorithm with lists sorted by Pagerank; stop after m elements of the intersection are found"
• fancy first-m: "use fancy and non-fancy lists, each sorted by Pagerank, and stop after m elements are found"
• reliable pruning: "stop when the top-k results have been found"
• fancy last-m: "stop when at most m elements are unresolved"
• single-node and parallel case, with optimizations

Experimental setup:

• 120 million pages on 16 machines (1.8 TB uncompressed)
• P-4 1.7 GHz with 2x80 GB Seagate Barracuda IDE disks
• compressed index based on Berkeley DB (using the mg compression macros)
• queries from an Excite query trace from December 1999; queries with 2 terms in the following
• local index organization with a query integrator
• first, results for one node (7.5 million pages), then for 16 nodes
• note: we do not need the top 10 from every node
  - this motivates top-1 and top-4 schemes, and precision at 1 and 4
• ranking by cosine + log(PR) with normalization

A naïve approach: first-m

• sort the inverted lists by Pagerank (docID = rank due to Pagerank)
• exhaustive: return the top 10
• first-m: return the 10 highest-scoring pages among the first 10/100/1000 pages in the intersection (a sketch follows below)
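
A rough sketch (not the authors' code) of the first-m heuristic: traverse the Pagerank-ordered postings, collect the first m documents that appear in every list, and rank only those with the full scoring function. The posting lists and the score function below are toy assumptions.

    def first_m(postings, m, k, score):
        """postings: one docID list per query term, each sorted by Pagerank rank
        (docID == Pagerank rank). Collect the first m docs in the intersection,
        then return the k best under the full ranking function score(doc)."""
        others = [set(p) for p in postings[1:]]
        candidates = []
        for doc in postings[0]:                    # scanning in Pagerank order
            if all(doc in s for s in others):
                candidates.append(doc)
                if len(candidates) == m:           # early termination after m matches
                    break
        return sorted(candidates, key=score, reverse=True)[:k]

    # toy two-term query; a lower docID means a higher Pagerank; score is a stand-in
    chair = [1, 3, 4, 7, 9, 12]
    table = [2, 3, 5, 7, 8, 12]
    print(first_m([chair, table], m=2, k=2, score=lambda doc: 1.0 / doc))   # [3, 7]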

first-m (ctd.)

• average cost per query measured in disk blocks
• loose/strict precision, relative to the "correct" cosine + log(PR) ranking
• for first-10, about 45% of the top-10 results belong in the top 10
• for first-1000, about 85% of the top-10 results belong in the top 10
• for first-100, about 80% of the queries return the correct top-1 result
• for first-1000, about 70% of the queries return all correct top-10 results

How can we do better?

(1) Use better stopping criteria?
• reliable pruning: stop when we are sure
• probabilistic pruning: stop when we are almost sure
• neither works well for a Pagerank-sorted index

(2) Reorganize the index structure?
• sort lists by term score (cosine) instead of Pagerank
  - does not do any better than sorting by Pagerank only
• sort lists by term score + 0.5·log(PR) (or some other combination of these)
  - some problems with normalization and with dependence on the number of keywords
• generalized fancy lists (see the sketch after this slide)
  - for each list, put the entries with the highest term value in a fancy list
  - sort both lists by Pagerank docID
  - note: anything that does well in 2 out of 3 scores is found soon
  - deterministic or probabilistic pruning, or first-k

  [Diagram: for "chair" and "table", a fancy list followed by the rest of the list with cosine below a cutoff (x or y)]
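
A hedged sketch of how a generalized fancy list might be built from one term's postings: keep the F entries with the highest term score as the fancy list and the remainder as the main list, both in docID (Pagerank) order. Names and data are illustrative; note that in the paper's scheme the fancy entries also stay in the other list (the "slight inefficiency" mentioned on a later slide), whereas this sketch separates them.

    def build_fancy_list(postings, fancy_size):
        """postings: list of (docID, term_score) pairs, docID order == Pagerank order.
        Returns (fancy, rest): the fancy_size entries with the highest term score and
        the remaining entries, each re-sorted by docID (i.e. by Pagerank)."""
        by_score = sorted(postings, key=lambda entry: entry[1], reverse=True)
        fancy = sorted(by_score[:fancy_size])      # best term scores, back in Pagerank order
        rest = sorted(by_score[fancy_size:])       # everything else, in Pagerank order
        return fancy, rest

    chair = [(1, 0.2), (3, 0.9), (4, 0.1), (7, 0.8), (9, 0.3)]
    print(build_fancy_list(chair, fancy_size=2))
    # ([(3, 0.9), (7, 0.8)], [(1, 0.2), (4, 0.1), (9, 0.3)])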

Results for generalized fancy lists

• loose vs. strict precision for various sizes of the fancy lists
• MUCH better precision than without fancy lists!
• for first-1000, we always get the correct top-1 result in these runs

Costs of Fancy Lists

• cost is similar to first-m without fancy lists, plus the additional cost of reading the fancy lists
• cost increases slightly with the size of the fancy list
• slight inefficiency: fancy-list items are not removed from the other list
• note: we do not consider savings due to caching

Reliable Pruning

• always gives the "correct" result
• top-4 can be computed reliably with about 20% of the original cost
• with 16 nodes, the top 4 from each node suffice with 99% probability to get the top 10

Results for 16 Nodes

first-30 returns the correct top 10 for almost 98% of all queries

Throughput and Latency for 16 Nodes

• top-10 queries on 16 machines with 120 million pages
• up to 10 queries/sec with reliable pruning
• up to 20 queries/sec with the first-30 scheme
• note: reliable pruning was not implemented in a purely incremental manner

Current and Future Work

• results for 3+ terms and an incremental query integrator
• need to do a precision/recall study
• need to engineer the ranking function and reevaluate
• how to include term distance within documents
• impact of caching at the lower level
• working on a publicly available engine prototype
• tons of loose ends and open questions