Transcript Slide 1
og/le
optimal guesswork/luck-based engine
Circa 1300 BC:
Ten Commandments (1956)
Circa 1971:
Courtesy of “An Atlas of Cyberspaces” (http://www.cybergeography.org/atlas/historical.html)
Circa 1999:
So, why do we need search engines?
o The web is too big.
o There is too much irrelevant information.
o Search engines bring order to this chaos-filled land.
What does a search engine require?
o Know the Data
o Store the Data
o Retrieve the Results
o Order the Results
Anatomy of Streaker
[Diagram: Main coordinates the To Do list, Page Retriever, URL Stream List, Parser, Quarantine, Link Cache, Word Cache, and The Database]
[Diagram: the To Do list is seeded with http://www.carleton.edu]
Playing nicely with the network
o Is the server responding?
o Is the server overloaded?
o How much info are we requesting?
o How fast are we sending our requests?
Throttling Streaker
Formulas:
o Pause Time: DELAY * 2
o Ave. Delay: DELAY + (lastFetchTime – DELAY) * 0.5
[Diagram: Streaker and a webserver; a 3-second fetch raises DELAY from 1 to 2, then a 4-second fetch raises DELAY from 2 to 3]
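The two formulas above can be sketched as a small throttle object. The class and attribute names here are our own; only the Pause Time and Ave. Delay formulas come from the slides.

```python
import time


class Throttle:
    """Per-server politeness throttle (sketch; names are assumptions)."""

    def __init__(self):
        self.delay = 1.0  # running average fetch time in seconds (DELAY)

    def record_fetch(self, last_fetch_time):
        # Ave. Delay: DELAY + (lastFetchTime - DELAY) * 0.5
        self.delay = self.delay + (last_fetch_time - self.delay) * 0.5

    def pause_time(self):
        # Pause Time: DELAY * 2
        return self.delay * 2

    def wait(self):
        time.sleep(self.pause_time())
```

With DELAY starting at 1, a 3-second fetch moves it to 2 and a 4-second fetch moves it to 3, matching the diagram.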
Before
[Screenshot: the og/le search page — “og/le: optimal guesswork/luck-based engine | carleton search”, “ogling 25,49 pages”, “Powered by Streaker”, © 2004 Josh Allen, Andrew Drummer, Brendan Foote, Aaron Miller, Mike Ottum]
After
Page object
o Page Text
o Page Header
o Page URL
o Etc…

Word Object(s)
o The Word
o Word Position
o Other info

Link Object(s)
o Link URL
o Link Position
o Page URL
o Link text
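A minimal sketch of these three objects as Python dataclasses; the field names follow the slides, while the types and defaults are our assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    word: str       # The Word
    position: int   # Word Position
    tags: str = "<>"  # "Other info"; tag attributes are one example


@dataclass
class Link:
    url: str        # Link URL (target)
    position: int   # Link Position
    page_url: str   # Page URL the link was found on
    text: str       # Link text


@dataclass
class Page:
    url: str        # Page URL
    header: str     # Page Header
    text: str       # Page Text
    words: list = field(default_factory=list)  # Word objects
    links: list = field(default_factory=list)  # Link objects
```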
Brief HTML Introduction
Which elements of a page are important?
o Text
  o Individual Words
  o Position
  o Tag Information
o Links
  o Link target
  o Link text
Parsing Challenges
o Identical pages with different URLs
  o Especially common with dynamically-generated pages
  o Solution: Compute a checksum as we parse and then compare it to previously seen pages
  o CRC-32 Checksum Algorithm
o HTML is not a strict language
  o The Parser must be flexible enough to allow for many different types of coding, especially in tags.
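The checksum idea can be sketched with Python's built-in CRC-32. The slides name the algorithm but not an implementation; `seen_checksums` and `is_duplicate` are our own names.

```python
import zlib

# Checksums of every page parsed so far
seen_checksums = set()


def is_duplicate(page_text):
    """Return True if an identical page body has been seen before."""
    checksum = zlib.crc32(page_text.encode("utf-8"))
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False
```

Two URLs serving byte-identical content hash to the same CRC-32 value, so the second one is skipped.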
Pages Indexed: 54,752 Fetch Errors: 43,862
Unique is Good

Word            Word ID
…               251
philanthropist  252
philanderer     253
MySQL queries take a long time!
We have MANY queries to make.
Our current database contains:
o 206,493 unique words
o 54,752 unique urls
Google stores the complete text of 6 Billion web pages in memory
THEN: 459 pages/hour
NOW: 3422 pages/hour
MySQL
Brief Databases Introduction
o Why use databases?
  o Data to store is too big for main memory
  o Optimize disk accesses through intelligent organization of data
o Relational Database Model
  o Data is stored in tables according to relationships
  o Data is retrieved using Structured Query Language (SQL)
Relational Example
o Relate Words to Pages
o Information that we care about:
  o word (string)
  o url (string)
  o position (integer)
  o HTML tag attributes (set)
The Non-Relational Way

word     url                             pos  tags
college  http://www.carleton.edu         1    <>
a        http://www.mathcs.carleton.edu  3    <>
college  http://www.carleton.edu         4    <>
Why is this method bad?
o Wasted space
  o The word “college” and the URL “http://www.carleton.edu” appear twice in this example
  o In our actual crawl, the word “carleton” appears 85,496 times
o String comparisons are slow
Our Database Tables - Word

wid  word
1    carleton
2    college
3    is
4    a
5    great
6    place
7    fhqwhgads
Our Database Tables - URL

urlid  url
1      http://www.carleton.edu
2      http://www.mathcs.carleton.edu
3      http://www.carleton.edu/student/
4      http://violet.mathcs.carleton.edu/ogle/search.php
WordToUrl Table Captures a Relation
o Relates Word entries to URL entries

wid  urlid  pos  tags
2    1      1    <>
4    2      3    <>
2    1      4    <>
Executing a Join Operation
o Combine the information from multiple tables to produce something meaningful
[Diagram: the Word Table, URL Table, and WordToURL Table combine into the desired output of word, url, pos, tags]
Join Result

word     url               pos  tags
college  www.carleton.edu  1    <>
college  www.carleton.edu  4    <>
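The whole join example can be reproduced end to end. This sketch uses Python's built-in SQLite in place of MySQL (the SQL is identical for this query); the table and column spellings are our own.

```python
import sqlite3

# Rebuild the three tables from the slides in an in-memory database
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE word (wid INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE url  (urlid INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE word_to_url (wid INTEGER, urlid INTEGER, pos INTEGER, tags TEXT);
""")
db.executemany("INSERT INTO word VALUES (?, ?)",
               enumerate(["carleton", "college", "is", "a",
                          "great", "place", "fhqwhgads"], start=1))
db.executemany("INSERT INTO url VALUES (?, ?)", [
    (1, "http://www.carleton.edu"),
    (2, "http://www.mathcs.carleton.edu"),
    (3, "http://www.carleton.edu/student/"),
    (4, "http://violet.mathcs.carleton.edu/ogle/search.php"),
])
db.executemany("INSERT INTO word_to_url VALUES (?, ?, ?, ?)", [
    (2, 1, 1, "<>"), (4, 2, 3, "<>"), (2, 1, 4, "<>"),
])

# Join the tables to find every occurrence of "college"
rows = db.execute("""
SELECT word.word, url.url, w2u.pos, w2u.tags
FROM word
JOIN word_to_url AS w2u ON word.wid = w2u.wid
JOIN url ON url.urlid = w2u.urlid
WHERE word.word = 'college'
ORDER BY w2u.pos
""").fetchall()
# rows: [('college', 'http://www.carleton.edu', 1, '<>'),
#        ('college', 'http://www.carleton.edu', 4, '<>')]
```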
Heuristics
o Tools by which we return search results
o Must be accurate
o Must be fast
Problem: In general, the more complex a heuristic is, the slower it performs.
How heuristics work
o Obtain search query from user
o Use query to “pull out” relevant data
o Use data to retrieve all relevant pages
o Use specific heuristic to order pages
o Output ordered pages to user
Basic Heuristics
o Word Occurrence
  o Pages ordered by the number of times the words in the query appear on the page
o Frequency
  o Pages ordered by the number of times words in the query appear over the total number of words on the page
o Proximity
  o Pages ordered by the number of times words in the query appear in the same order on the page
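The Word Occurrence and Frequency heuristics might be sketched like this; the function names and the (url, word-list) page representation are our own.

```python
def occurrence_score(query_words, page_words):
    # Word Occurrence: times any query word appears on the page
    return sum(page_words.count(q) for q in query_words)


def frequency_score(query_words, page_words):
    # Frequency: occurrences over the total number of words on the page
    return occurrence_score(query_words, page_words) / len(page_words)


def rank_pages(query, pages):
    # pages is a list of (url, list-of-words-on-page) pairs
    q = query.lower().split()
    return sorted(pages, key=lambda p: occurrence_score(q, p[1]), reverse=True)
```

Frequency rewards short, focused pages over long ones that merely mention the query words more often.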
Meta Heuristics
o Tags
  o Words on a page are weighted depending on their HTML tags
  o Pages are ordered by the sum of the weighted words that appear on the page
Ultimate Heuristic
A combination of data and context:
o frequency
o proximity
o tag heuristics
o Rank of pages factored into heuristic
Vector Space Models
A table with relationships between terms and documents:

        doc1  doc2  doc3
Term1   1     0     7
Term2   17    0     0
Term3   20    5     2
Now consider the table to be a matrix. Then o The columns can be seen as document vectors o The terms serve as a basis for the vector space o We can compare documents using vector functions
Comparing Vectors
Recall: cos θ = (a · b) / (|a| |b|)

If we set a threshold on cos θ, we find the set of vectors that are within a cone around a.
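A sketch of the comparison, using the document vectors from the term-document table above:

```python
import math


def cosine(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Columns of the term-document matrix as document vectors
doc1 = [1, 17, 20]
doc2 = [0, 0, 5]
doc3 = [7, 0, 2]
```

Documents whose cosine with the query vector exceeds the threshold fall inside the cone and are returned.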
Normalizing the Data
Since the length of the document vectors and the values both affect this calculation, we can do some pre-processing to help the heuristic.
Local Term Weighting Schemes:
o Binary: χ(f_ij)  (1 if f_ij > 0, else 0)
o Term frequency: f_ij
o Logarithmic: log(1 + f_ij)
o Augmented Normalized: (χ(f_ij) + f_ij / max_k f_kj) / 2

Global Term Weighting Schemes:
o Normal: 1 / sqrt(Σ_j f_ij²)

Document Normalization Schemes:
o Cosine: 1 / sqrt(Σ_i (g_i l_ij)²)
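One common combination of the schemes above — logarithmic local weighting followed by cosine document normalization — can be sketched as:

```python
import math


def log_weight(f):
    # Logarithmic local weight: log(1 + f_ij)
    return math.log(1 + f)


def cosine_normalize(doc_weights):
    # Cosine normalization: divide by sqrt of the sum of squared weights,
    # giving the document vector unit length
    norm = math.sqrt(sum(w * w for w in doc_weights))
    return [w / norm for w in doc_weights]


# Term frequencies of doc1 from the earlier table, weighted and normalized
weights = cosine_normalize([log_weight(f) for f in [1, 17, 20]])
```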
Latent Semantic Indexing
Matrix Decomposition
If the matrix A has rank k, we can represent the matrix using k column vectors.
This has the effect of smooshing together like documents, creating relationships between terms that do not appear on the same page.
Example: if a user searches for “Samuel Clemens”, the terms appear on the same page as “Mark Twain” often enough that documents only containing “Mark Twain” will match.
o Heuristics concerning text
o Heuristics concerning the context of the text
o Heuristics concerning the context of the pages
Page Rank
What makes Page Rank different?
o Link-based
o Independent of search terms
o Fewer database queries during search
o Copyrighted
Example Network
[Diagram: five pages A, B, C, D, E with links between them]
Ranking a page
What do you need to rank a page?
o Pages that link to your page
o The ranks of those pages
o The links on those pages
o Rank = 0.15 + 0.85 * Σ(R_i / L_i)
o Fifty iterations
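The iteration Rank = 0.15 + 0.85 * Σ(R_i / L_i) can be sketched in a few lines. The A–E link structure below is made up, since the actual edges of the example network are not recoverable from the transcript.

```python
# page -> pages it links to (hypothetical edges)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "E"],
    "E": ["A"],
}


def pagerank(links, iterations=50):
    rank = {page: 1.0 for page in links}  # every page starts at 1.0
    for _ in range(iterations):
        new_rank = {}
        for page in links:
            # sum R_i / L_i over every page i that links to `page`,
            # where L_i is the number of links on page i
            incoming = sum(rank[src] / len(outs)
                           for src, outs in links.items() if page in outs)
            new_rank[page] = 0.15 + 0.85 * incoming
        rank = new_rank
    return rank
```

A page with no incoming links settles at the minimum rank of 0.15, while heavily linked pages accumulate rank from their referrers.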
Before Ranking
[Diagram: A = 1.0, B = 1.0, C = 1.0, D = 1.0, E = 1.0; Total = 5.00]
After First Iteration
[Diagram: A = 1.85, B = 0.54, C = 0.97, D = 0.77, E = 1.28; Total = 5.41]
After Second Iteration
[Diagram: A = 1.66, B = 0.50, C = 1.05, D = 0.72, E = 1.26; Total = 5.19]
After Fifth Iteration
[Diagram: A = 1.62, B = 0.49, C = 0.49, D = 0.70, E = 1.23; Total = 5.06]
And in Conclusion . . .
How did we do?
og/le makes your laundry whiter than any other leading brand!
Competitors:
o Google (the big boys)
o ht://Dig (Carleton’s current search engine)
Searching for Dave Musicant Google Search time: .31 seconds
Searching for Dave Musicant ht://Dig Search time: .5 seconds
Searching for Dave Musicant og/le Search time: 1.54 seconds
Searching for Aaron Miller Google Search time: .40 seconds
Searching for Aaron Miller ht://Dig Search time: .5 seconds
Searching for Aaron Miller og/le Search time: .06 seconds
Future Goals
o Support Stemming
o Link Referrals
o Better Hardware
o T-shirts, coffee mugs
How to og/le
Visit us at: http://violet.mathcs.carleton.edu/ogle/
Bibliography
Berry, Michael, and Murray Browne. Understanding Search Engines. Philadephia: SIAM, 1999.
Craven, Phil. "Google's PageRank Explained and how to make the most of it." Web Workshop.net.