Transcript Slide 1

og/le

optimal guesswork/luck-based engine

Circa 1300 BC:

Ten Commandments (1956)

Circa 1971:

Courtesy of “An Atlas of Cyberspaces” (http://www.cybergeography.org/atlas/historical.html)

Circa 1999:

So, why do we need search engines?

o The web is too big.

o There is too much irrelevant information.

o Search engines bring order to this chaos-filled land.

What does a search engine require?

o Know the Data
o Store the Data
o Retrieve the Results
o Order the Results

Anatomy of Streaker

[Diagram: the components of Streaker: To Do, Page Retriever, URL Stream List, Quarantine, Link Cache, Word Cache, Main, Parser, The Database. To Do is shown holding http://www.carleton.edu]

Playing nicely with the network

o Is the server responding?

o Is the server overloaded?

o How much info are we requesting?

o How fast are we sending our requests?

Throttling Streaker

Formulas:
o Pause Time: DELAY * 2
o Ave. Delay: DELAY + (lastFetchTime - DELAY) * .5

[Diagram: Streaker and WEBSERVER exchanging requests; timing labels 1 second, 2 seconds, 3 seconds, 4 seconds, 4 seconds; DELAY = 1, then DELAY = 2, then DELAY = 3]
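The two throttling formulas can be sketched in a few lines of Python. This is only an illustration, not Streaker's actual code: the function names are made up here, and the fetch times of 3 and 4 seconds are an assumption read off the diagram labels.

```python
def pause_time(delay):
    """Pause Time formula from the slide: DELAY * 2."""
    return delay * 2


def average_delay(delay, last_fetch_time):
    """Ave. Delay formula from the slide: move DELAY halfway toward the last fetch time."""
    return delay + (last_fetch_time - delay) * 0.5


# Hypothetical run: start with DELAY = 1, then observe fetches of 3 and 4 seconds.
delay = 1.0
history = [delay]
for last_fetch_time in (3.0, 4.0):
    delay = average_delay(delay, last_fetch_time)
    history.append(delay)

print(history)              # successive DELAY values
print(pause_time(delay))    # how long to pause before the next request
```

Under these assumed fetch times, DELAY climbs as the server slows down, so the crawler backs off instead of piling on more requests.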

Before

og/le
optimal guesswork/luck-based engine: carleton search


About     Instructions for Testers     Statistics



ogling 25,49 pages


Powered by Streaker

© 2004 Josh Allen, Andrew Drummer, Brendan Foote, Aaron Miller, Mike Ottum

After

Page object
o Page Text
o Page Header
o Page URL
o Etc…

Word Object(s)
o The Word
o Word Position
o Other info

Link Object(s)
o Link URL
o Link Position
o Page URL
o Link text
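As a rough sketch, the three objects could be modeled as Python dataclasses. The field names follow the slide; the class layout, the types, and the idea that a Page holds lists of its Words and Links are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    word: str       # The Word
    position: int   # Word Position


@dataclass
class Link:
    url: str        # Link URL (the target)
    position: int   # Link Position
    page_url: str   # Page URL (the page the link was found on)
    text: str       # Link text


@dataclass
class Page:
    url: str        # Page URL
    header: str     # Page Header
    text: str       # Page Text
    words: List[Word] = field(default_factory=list)   # assumed container
    links: List[Link] = field(default_factory=list)   # assumed container


# Hypothetical example page.
page = Page(url="http://www.carleton.edu", header="", text="carleton college")
for position, token in enumerate(page.text.split(), start=1):
    page.words.append(Word(word=token, position=position))
```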

Brief HTML Introduction

Which elements of a page are important?

o Text
  o Individual Words
  o Position
  o Tag Information
o Links
  o Link target
  o Link text

Parsing Challenges

o Identical pages with different URLs
  o Especially common with dynamically-generated pages
  o Solution: Compute a checksum as we parse and then compare it to previously seen pages
  o CRC-32 Checksum Algorithm
o HTML is not a strict language
  o The Parser must be flexible enough to allow for many different types of coding, especially in tags.
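A minimal sketch of the checksum idea, using Python's standard `zlib.crc32`. The function names and the way the page arrives in chunks are assumptions for illustration; the slides do not show how Streaker's parser feeds the checksum.

```python
import zlib


def page_checksum(chunks):
    """Compute a running CRC-32 over the page as it is parsed chunk by chunk."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk.encode("utf-8"), crc)
    return crc


def is_duplicate(chunks, seen_checksums):
    """Return True if a page with the same checksum was already seen."""
    crc = page_checksum(chunks)
    if crc in seen_checksums:
        return True
    seen_checksums.add(crc)
    return False


seen = set()
first = is_duplicate(["<html>", "hello", "</html>"], seen)   # new page
second = is_duplicate(["<html>hello</html>"], seen)          # same content, other URL
third = is_duplicate(["<html>goodbye</html>"], seen)         # different content
```

Because CRC-32 is only 32 bits wide, two different pages can occasionally share a checksum, so a match is strong evidence of a duplicate rather than proof.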

Pages Indexed: 54,752
Fetch Errors: 43,862

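Unique is Good

[Diagram: the Word "philanderer" is looked up to find its Word ID; the table shown lists Word IDs 251, 252, 253 alongside the words philanthropist and philanderer]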

MySQL queries take a long time!

We have MANY queries to make.
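One common way to cut down on repeated queries is to keep word-to-ID lookups in memory. The sketch below is an assumption about how a word cache like the one named in the Streaker diagram might work, not the project's actual code: the `WordCache` class and table layout are hypothetical, and Python's built-in `sqlite3` stands in for MySQL.

```python
import sqlite3


class WordCache:
    """Hypothetical in-memory cache of word -> wid, backed by a Word table."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}
        self.misses = 0  # lookups that had to go to the database
        conn.execute(
            "CREATE TABLE IF NOT EXISTS Word ("
            "wid INTEGER PRIMARY KEY, word TEXT UNIQUE)"
        )

    def word_id(self, word):
        # Fast path: the word was seen before, so no query is needed.
        if word in self.ids:
            return self.ids[word]
        self.misses += 1
        row = self.conn.execute(
            "SELECT wid FROM Word WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            cursor = self.conn.execute(
                "INSERT INTO Word (word) VALUES (?)", (word,)
            )
            wid = cursor.lastrowid
        else:
            wid = row[0]
        self.ids[word] = wid
        return wid


conn = sqlite3.connect(":memory:")
cache = WordCache(conn)
first_id = cache.word_id("philanderer")
second_id = cache.word_id("philanderer")   # served from memory
```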

Our current database contains 206,493 unique words and 54,752 unique URLs.

Google stores the complete text of 6 billion web pages in memory.

THEN: 459 pages/hour
NOW: 3,422 pages/hour

MySQL

Brief Databases Introduction

o Why use databases?
  o Data to store is too big for main memory
  o Optimize disk accesses through intelligent organization of data
o Relational Database Model
  o Data is stored in tables according to relationships
  o Data is retrieved using Structured Query Language (SQL)

Relational Example

o Relate Words to Pages
o Information that we care about:
  o word (string)
  o url (string)
  o position (integer)
  o HTML tag attributes (set)

The Non-Relational Way

word     url                             pos  tags
college  http://www.carleton.edu         1    <>
a        http://www.mathcs.carleton.edu  3
college  http://www.carleton.edu         4

Why is this method bad?

o Wasted space
  o The word “college” and the URL “http://www.carleton.edu” appear twice in this example
  o In our actual crawl, the word “carleton” appears 85,496 times
o String comparisons are slow

Our Database Tables - Word

wid  word
1    carleton
2    college
3    is
4    a
5    great
6    place
7    fhqwhgads

Our Database Tables - URL

urlid  url
1      http://www.carleton.edu
2      http://www.mathcs.carleton.edu
3      http://www.carleton.edu/student/
4      http://violet.mathcs.carleton.edu/ogle/search.php

WordToUrl Table Captures a Relation
o Relates Word entries to URL entries

wid  urlid  pos  tags
2    1      1    <>
4    2      3
2    1      4

Executing a join Operation
o Combine the information from multiple tables to produce something meaningful

[Diagram: the Word Table, URL Table, and WordToURL Table (pos, tags) combine into the Desired Output]

Join Result

word     url               pos  tags
college  www.carleton.edu  1    <>
college  www.carleton.edu  4
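The join can be tried out with the example rows from the Word, URL, and WordToUrl tables. This is a sketch under stated assumptions: Python's built-in `sqlite3` stands in for MySQL, the column types are guesses, the tags column is left out because its contents did not survive in the slides, and searching for "college" is simply the query that reproduces the Join Result rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Word (wid INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE URL (urlid INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE WordToUrl (wid INTEGER, urlid INTEGER, pos INTEGER);
""")
conn.executemany("INSERT INTO Word VALUES (?, ?)", [
    (1, "carleton"), (2, "college"), (3, "is"), (4, "a"),
    (5, "great"), (6, "place"), (7, "fhqwhgads"),
])
conn.executemany("INSERT INTO URL VALUES (?, ?)", [
    (1, "http://www.carleton.edu"),
    (2, "http://www.mathcs.carleton.edu"),
    (3, "http://www.carleton.edu/student/"),
    (4, "http://violet.mathcs.carleton.edu/ogle/search.php"),
])
conn.executemany("INSERT INTO WordToUrl VALUES (?, ?, ?)", [
    (2, 1, 1), (4, 2, 3), (2, 1, 4),
])

# Follow the integer IDs from WordToUrl back to the word and URL strings.
rows = conn.execute("""
    SELECT Word.word, URL.url, WordToUrl.pos
    FROM WordToUrl
    JOIN Word ON Word.wid = WordToUrl.wid
    JOIN URL ON URL.urlid = WordToUrl.urlid
    WHERE Word.word = ?
    ORDER BY WordToUrl.pos
""", ("college",)).fetchall()

for row in rows:
    print(row)
```

The strings "college" and "http://www.carleton.edu" are each stored once; the relation table holds only small integers, which addresses both the wasted space and the slow string comparisons.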

Heuristics

o Tools by which we return search results
o Must be accurate
o Must be fast

Problems: In general, the more complex a heuristic is, the slower it performs.

How heuristics work

o Obtain search query from user
o Use query to “pull out” relevant data
o Use data to retrieve all relevant pages
o Use specific heuristic to order pages
o Output ordered pages to user

Basic Heuristics

o Word Occurrence
  o Pages ordered by the number of times the words in the query appear on the page
o Frequency
  o Pages ordered by the number of times words in the query appear over the total number of words on the page
o Proximity
  o Pages ordered by the number of times words in the query appear in the same order on the page
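The three basic heuristics can be sketched as scoring functions over a page's word list. The function names are hypothetical, and the proximity score assumes that "in the same order" means the query words appearing back to back as a phrase; og/le may have defined it differently.

```python
def word_occurrence(page_words, query_words):
    """Number of times any query word appears on the page."""
    return sum(page_words.count(word) for word in query_words)


def frequency(page_words, query_words):
    """Word occurrence divided by the total number of words on the page."""
    if not page_words:
        return 0.0
    return word_occurrence(page_words, query_words) / len(page_words)


def proximity(page_words, query_words):
    """Number of places where the query words appear consecutively, in order."""
    n = len(query_words)
    if n == 0:
        return 0
    return sum(
        1
        for start in range(len(page_words) - n + 1)
        if page_words[start:start + n] == query_words
    )


# Hypothetical page and query.
page = "carleton college is a great place carleton college".split()
query = ["carleton", "college"]

scores = (
    word_occurrence(page, query),
    frequency(page, query),
    proximity(page, query),
)
print(scores)
```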

Meta Heuristics

o Tags
  o Words on a page are weighted depending on their HTML tags
  o Pages are ordered by the sum of the weighted words that appear on the page
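A small sketch of tag weighting follows. The weights, the tag names, and the rule of taking the largest weight when a word sits inside several tags are all assumptions chosen for illustration; the slides do not give og/le's actual values.

```python
# Hypothetical weights: words in a title count more than words in plain text.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 2.0}
DEFAULT_WEIGHT = 1.0


def tag_score(occurrences, query_words):
    """Sum the weights of every query word occurrence on one page.

    occurrences is a list of (word, tags) pairs, where tags is the set of
    HTML tags enclosing that occurrence of the word.
    """
    score = 0.0
    for word, tags in occurrences:
        if word in query_words:
            weights = (TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in tags)
            score += max(weights, default=DEFAULT_WEIGHT)
    return score


# Hypothetical page: "carleton" in the title, "college" in plain text.
occurrences = [("carleton", {"title"}), ("college", set()), ("is", {"b"})]
score = tag_score(occurrences, {"carleton", "college"})
print(score)
```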

Ultimate Heuristic

A combination of data and context
o frequency
o proximity
o tag heuristics
o Rank of pages factored into heuristic

Vector Space Models

A table with relationships between terms and documents: doc1 doc2 doc3 Term1 Term2 Term3 1 0 7 17 0 0 20 5 2

Now consider the table to be a matrix. Then
o The columns can be seen as document vectors
o The terms serve as a basis for the vector space
o We can compare documents using vector functions

Comparing Vectors

Recall: $\cos\theta = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$

If we set a threshold on $\cos\theta$, we find the set of vectors that are within a cone around a.
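The cone test can be sketched directly from the cosine formula. The document vectors, the query vector, and the threshold of 0.7 below are made-up values for illustration, not data from the og/le crawl.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def within_cone(query, documents, threshold):
    """Names of the documents whose vectors lie inside the cone around query."""
    return [
        name
        for name, vector in documents.items()
        if cosine(query, vector) >= threshold
    ]


# Hypothetical document vectors over three terms.
documents = {"doc1": [1, 0, 0], "doc2": [1, 1, 0], "doc3": [0, 0, 1]}
query = [1, 0, 0]

matches = within_cone(query, documents, threshold=0.7)
print(matches)
```

Raising the threshold narrows the cone and returns fewer, closer documents; lowering it widens the cone.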

Normalizing the Data

Since the length of the document vectors and the values both affect this calculation, we can do some pre-processing to help the heuristic.

Local Term Weighting Schemes:
o Binary: $\chi(f_{ij})$
o Term frequency: $f_{ij}$
o Logarithmic: $\log(1 + f_{ij})$
o Augmented Normalized: $\left(\chi(f_{ij}) + f_{ij} / \max_k f_{kj}\right) / 2$

Global Term Weighting Schemes:
o Normal: $1 / \sqrt{\sum_j f_{ij}^2}$

Document Normalization Schemes:
o Cosine: $\left(\sum_i (g_i l_{ij})^2\right)^{-1/2}$
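The weighting schemes translate into short functions. In this sketch, `f` stands for the frequency of term i in document j, chi is taken to be the indicator function (1 when the frequency is positive, 0 otherwise), and the logarithm is assumed to be the natural log; the function names are hypothetical.

```python
import math


def chi(f):
    """Indicator: 1 if the term occurs at all, otherwise 0."""
    return 1 if f > 0 else 0


def binary(f):
    return chi(f)


def term_frequency(f):
    return f


def logarithmic(f):
    return math.log(1 + f)


def augmented_normalized(f, max_f_in_document):
    """Needs the largest term frequency in the same document."""
    if max_f_in_document == 0:
        return 0.0
    return (chi(f) + f / max_f_in_document) / 2


def normal_global(term_row):
    """Global weight for one term, from its frequencies across all documents."""
    return 1 / math.sqrt(sum(f * f for f in term_row))


def cosine_normalization(weighted_column):
    """Scale factor for one document, from its global-times-local weights."""
    return 1 / math.sqrt(sum(w * w for w in weighted_column))


# Hypothetical frequencies: a term seen 2 times in a document whose most
# common term appears 4 times.
example = augmented_normalized(2, 4)
print(example)
```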

Latent Semantic Indexing

Matrix Decomposition If the matrix A has rank k, we can represent the matrix using k column vectors.

This has the effect of smooshing together like documents, creating relationships between terms that do not appear on the same page.

Example: if a user searches for “Samuel Clemens”, the terms appear on the same page as “Mark Twain” often enough that documents only containing “Mark Twain” will match.

o Heuristics concerning text
o Heuristics concerning the context of the text
o Heuristics concerning the context of the pages

Page Rank

What makes Page Rank different?

o Link-based
o Independent of search terms
o Fewer database queries during search
o Patented and trademarked

Example Network

[Diagram: a network of five linked pages labeled A, B, C, D, and E]

Ranking a page

What do you need to rank a page?

o Pages that link to your page
o The ranks of those pages
o The links on those pages
o Rank = 0.15 + 0.85 * Σ(R_i / L_i)
o Fifty iterations
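The ranking formula can be sketched as a short loop, where R_i is the rank of a page that links to yours and L_i is the number of links on that page. The three-page graph below is hypothetical (the links of the example network are not recoverable from the slides), and this version updates every rank at once per iteration, which may differ from how og/le applied the formula.

```python
def page_rank(links, iterations=50, damping=0.85):
    """Iterate Rank = 0.15 + 0.85 * sum(R_i / L_i) over every page.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    ranks = {page: 1.0 for page in pages}   # every page starts at 1.0

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - damping) + damping * incoming
        ranks = new_ranks
    return ranks


# Hypothetical network: A links to B and C, B links to C, C links to A.
ranks = page_rank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 2))
```

Because the rank depends only on the link structure, it can be computed once after the crawl and stored, which is why it needs fewer database queries at search time.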

Before Ranking
A 1.0, B 1.0, C 1.0, D 1.0, E 1.0 (Total = 5.00)

After First Iteration
A 1.85, B 0.54, C 0.97, D 0.77, E 1.28 (Total = 5.41)

After Second Iteration
A 1.66, B 0.50, C 1.05, D 0.72, E 1.26 (Total = 5.19)

After Fifth Iteration
A 1.62, B 0.49, C 0.49, D 0.70, E 1.23 (Total = 5.06)

And in Conclusion . . .

How did we do?

og/le makes your laundry whiter than any other leading brand!

Competitors:
o Google (the big boys)
o ht://Dig (Carleton’s current search engine)

Query          Engine    Search time
Dave Musicant  Google    .31 seconds
Dave Musicant  ht://Dig  .5 seconds
Dave Musicant  og/le     1.54 seconds
Aaron Miller   Google    .40 seconds
Aaron Miller   ht://Dig  .5 seconds
Aaron Miller   og/le     .06 seconds

Future Goals

o Support Stemming
o Link Referrals
o Better Hardware
o T-shirts, coffee mugs

How to og/le

Visit us at: http://violet.mathcs.carleton.edu/ogle/

Bibliography

Berry, Michael, and Murray Browne. Understanding Search Engines. Philadelphia: SIAM, 1999.

Craven, Phil. "Google's PageRank Explained and how to make the most of it." Web Workshop.net.