Transcript slides

Web Programming
Week 13
Old Dominion University
Department of Computer Science
CS 418/518 Fall 2010
Martin Klein<[email protected]>
11/23/10
Relational Data Model is a
Special Case…
SELECT pi.fname, m.aces, m.unforced_errors, m.winners
FROM player_info pi, matches m
WHERE pi.fname = “Andre”
AND m.opponent_name = “Sampras”
AND m.year = “2002”;
Unstructured Data
is More Common…
Precision and Recall
how much extra stuff did you get?
source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html
how much did you miss?
Precision and Recall
10 documents in the index are relevant
search returns 20 documents
5 of which are relevant
1 out of 4 retrieved documents are relevant
source: http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html
half of the relevant documents were retrieved
Precision and Recall
1
Precision
0
figure 1.2
in FBY
Recall
1
Why Isn’t Recall Always 100%?
Virginia Agricultural and Mechanical College?
Virginia Agricultural and Mechanical College
and Polytechnic Institute?
Virginia Polytechnic Institute?
Virginia Polytechnic Institute
and State University?
Virginia Tech?
Precision and Recall - Literature
CREATE Table
mysql> CREATE TABLE ODUtennis(
->
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
->
title VARCHAR(200),
->
body TEXT,
-> );
Query OK, 0 rows affected (0.00 sec)
INSERT
mysql> INSERT INTO ODUtennis (title, body) VALUES
-> ('Monarchs Eliminated from 2010 CAA Tournament in Qu
'The fourth-seeded Monarchs ...'),
-> ('Monarchs Close Out Season With 4-3 Win Over South
'ODU closes out the 2010 tennis schedule...'),
-> ('ODU Edged By DePaul in Mens Tennis Action, 4-3',
'Junior Tobias Fanselow was the other Monarch…'),
-> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action
'The Old Dominion mens tennis team…doubles…Monarchs...'),
-> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3',
'The two teams split the first two doubles matches…Monarchs
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
LIKE & REGEXP
• We can search rows with the “LIKE” (or
“REGEXP”) operator
– http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
– for tables of any size, this will be s-l-o-w
Example 1
mysql> SELECT id, title FROM ODUtennis WHERE title
LIKE 'Monarchs Eliminated from 2010 CAA Tournament in Quart
mysql> SELECT id, title FROM ODUtennis WHERE title
REGEXP 'Monarchs Eliminated from 2010 CAA Tournament in Qua
+----+-----------------------------------------------------------| id | title
+----+-----------------------------------------------------------| 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfina
+----+-----------------------------------------------------------1 row in set (0.00 sec)
Example 2
mysql> SELECT id, title FROM ODUtennis WHERE title REGEXP
'Monarchs';
+----+--------------------------------------------------------------------------------------+
| id | title
|
+----+--------------------------------------------------------------------------------------+
| 1 | Monarchs Eliminated from 2010 CAA Tournament in
Quarterfinals |
| 2 | Monarchs Close Out Season With 4-3 Win Over South
Alabama
|
mysql> SELECT
id, title FROM ODUtennis WHERE title
+----+---------------------------------------------------LIKE 'Monarchs';
------------------------------------+
Empty set (0.00 sec)
2 rows in set (0.01 sec)
BUT
Full-Text Search – The Better Way
•
•
•
•
•
•
MATCH()…AGAINST()
Performs a natural language search over index
Index = set of one ore more columns of the same table
Index as argument to MATCH()
Search string as argument to AGAINST()
If used in WHERE clause result returned in order of
relevance score
– Relevance: similarity between search string and index row
CREATE Table
mysql> CREATE TABLE ODUtennis (
->
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
->
title VARCHAR(200),
->
body TEXT,
->
FULLTEXT (title,body)
-> );
Query OK, 0 rows affected (0.00 sec)
• can only create FULLTEXT on CHAR, VARCHAR or TEXT columns
• “title” and “body” still available as regular columns
• if you want to search only on “title”, you need to create a separate index
INSERT
mysql> INSERT INTO ODUtennis (title, body) VALUES
-> ('Monarchs Eliminated from 2010 CAA Tournament in Qu
'The fourth-seeded Monarchs ...'),
-> ('Monarchs Close Out Season With 4-3 Win Over South
'ODU closes out the 2010 tennis schedule...'),
-> ('ODU Edged By DePaul in Mens Tennis Action, 4-3',
'Junior Tobias Fanselow was the other Monarch…'),
-> ('ODU Mens Tennis Drops Delaware, 5-2, in CAA Action
'The Old Dominion mens tennis team…doubles…Monarchs...'),
-> ('Mens Tennis Nipped by #64 UNC Wilmington, 4-3',
'The two teams split the first two doubles matches…Monarchs
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0
MATCH .. AGAINST
mysql>
SELECT * FROM ODUtennis
WHERE MATCH(title,body) AGAINST('fanselow');
+----+------------------------------------------------+-------------------------| id | title
+----+------------------------------------------------+-------------------------| 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3 | Junior Tobias Fanselow wa
+----+------------------------------------------------+-------------------------1 row in set (0.00 sec)
SELECT * FROM ODUtennis
WHERE MATCH(title,body) AGAINST('Monarchs');
mysql>
Empty set (0.00 sec)
why?!
Ranking
• If the word appears in > 50% of the rows
then the word is considered a “stop word”
and is not matched (unless you are in
Boolean mode)
– this makes sense for large collections (the word
is not a good discriminator of records), but can
lead to unexpected results for small collections
Stopwords
• Stopwords exist in stoplists or negative dictionaries
• Idea: remove low semantic content
– index should only have “important stuff”
• What not to index is domain dependent, but often includes:
– “small” words: a, and, the, but, of, an, very, etc.
– NASA ADS example:
• http://adsabs.harvard.edu/abs_doc/stopwords.html
– MySQL full-text index:
• http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html
Stopwords
• Punctuation, numbers often stripped or treated as
stopwords
– precision suffers on searches for:
•
•
•
•
•
NASA TM-3389
F-15
X.500
.NET
Tree::Suffix
• MySQL also treats words < 4 characters as stopwords
– too bad for: “Liu”, “ORF”, “DEA”, etc.
Getting the Rank
mysql>
SELECT id, MATCH(title,body) AGAINST('doubles') from
+----+---------------------------------------------------+
| id | MATCH(title,body) AGAINST('doubles') |
+----+---------------------------------------------------+
| 1 |
| 2 |
| 3 |
| 4 |
0.28169098496437 |
| 5 |
0.61670464277267 |
+----+---------------------------------------------------+
5 rows in set (0.00 sec)
0 |
0 |
0 |
Getting the Rank in Order
SELECT id, title, MATCH(title,body) AGAINST('doubles
FROM ODUtennis WHERE MATCH(title,body) AGAINST('doubles')
mysql>
+----+-----------------------------------------------------------------| id
| title
+----+-----------------------------------------------------------------| 5
| Mens Tennis Nipped by #64 UNC Wilmington, 4-3
| 0.616
| 4
| ODU Mens Tennis Drops Delaware, 5-2, in CAA Action
| 0.28169
+----+-----------------------------------------------------------------2 rows in set (0.00 sec)
Boolean Mode
SELECT id, title FROM ODUtennis
WHERE MATCH(title,body) AGAINST('+Monarchs' IN BOOLEAN M
mysql>
+----+-------------------------------------------------------------------------| id | title
+----+-------------------------------------------------------------------------| 1 | Monarchs Eliminated from 2010 CAA Tournament in Quarterfinals |
| 2 | Monarchs Close Out Season With 4-3 Win Over South Alabama
|
| 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action
|
| 5 | Mens Tennis Nipped by #64 UNC Wilmington, 4-3
+----+-------------------------------------------------------------------------4 rows in set (0.00 sec)
• Does not use the 50% threshold
• Does use stopwords, length limitation
• Operator list:
– http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
Blind Query Expansion
(AKA Automatic Relevance Feedback)
• General assumption: user query is insufficient
– Too short
– Too generic
– Too many results
• How does one keep up with Virginia Tech’s multiple names /
nicknames?
– Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT
•
1.
2.
3.
Idea:
run the query with the requested terms
then take the results and
re-run the query with the most relevant terms from the initial
results
Blind Query Expansion
(AKA Automatic Relevance Feedback)
SELECT * FROM ODUtennis
WHERE MATCH(title,body)
AGAINST('fanselow' WITH QUERY EXPANSION);
mysql>
+----+-------------------------------------------------------------------------+---| id | title
+----+-------------------------------------------------------------------------+---| 3 | ODU Edged By DePaul in Mens Tennis Action, 4-3
| Junior Tobias Fanselo
|
| to win, earning the win at No. 1 si
|
| Alasdair Graetz, 6-2, 7-6
-----------------------------------------------------------------------------------| 4 | ODU Mens Tennis Drops Delaware, 5-2, in CAA Action | The Old Dominion mens t
|
| to 14-8 overall and 2-3 in the CAA
|
| victory over the Delaware Fighting
|
| on Wednesday. After dropping all th
|
| matches, the Monarchs went on to wi
|
| the six singles matches for the vic
+----+----------------------------------------------------+------------------------2 rows in set (0.00 sec)
For More Information…
• MySQL documentation:
– http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
• Chapter 12/13 “Building a Content Management
System”
• CS 751/851 “Introduction to Digital Libraries”
– http://www.cs.odu.edu/~mln/teaching/
– esp. “Information Retrieval Concepts” lecture
• Is MySQL the right tool for your job?
– http://lucene.apache.org/
MySQL examples in this lecture based on those found at dev.mysql.com
content snippets taken from www.odusports.com