March 24, 2000
Database Application Design
Handout #11
(C) 2000, The University of Michigan
Course information
• Instructor: Dragomir R. Radev
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: Thursdays 3-4 and Fridays 1-2
• Course page: http://www.si.umich.edu/~radev/654w00
• Class meets on Fridays, 2:30 - 5:30 PM, 311 WH
Web-based databases
Types of databases
• Textual databases
• Semi-structured databases
Indexing textual data
• Inverted files
• Boolean queries
• Signature files
• Signature S1 matches signature S2 if S2&S1=S2
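The matching rule above can be sketched with superimposed coding: each word sets a few bits in a fixed-width signature, a text signature is the OR of its word signatures, and a query signature S2 is tested against a stored signature S1 with S2 & S1 == S2. The width, the number of bits per word, and the use of MD5 as the hash are illustrative assumptions, not taken from the slides.

```python
import hashlib

WIDTH = 64          # signature width in bits (assumed for illustration)
BITS_PER_WORD = 3   # up to this many bits are set per word (assumed)


def word_signature(word: str) -> int:
    """Hash one word to a signature with up to BITS_PER_WORD bits set."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    sig = 0
    for i in range(BITS_PER_WORD):
        sig |= 1 << (digest[i] % WIDTH)
    return sig


def text_signature(text: str) -> int:
    """Superimpose (bitwise OR) the signatures of all words in the text."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig


def matches(s1: int, s2: int) -> bool:
    """S1 matches S2 if every bit set in S2 is also set in S1."""
    return s2 & s1 == s2


doc = text_signature("inverted files and signature files")
query = text_signature("signature files")
print(matches(doc, query))
```

A match only marks a candidate: different words can set the same bits, so the underlying text still has to be checked to rule out false positives.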
XML-QL
XML-QL
Two slides from Johannes Gehrke, Cornell University
<IMG SRC="xysq.gif" ALT="(x+y)^2">
<apply> <power/>
<apply> <plus/> <ci>x</ci> <ci>y</ci> </apply>
<cn>2</cn>
</apply>
WHERE
<BOOK>
<NAME><LAST>$1</LAST></NAME>
</BOOK> IN "www.booklist.com/books.xml"
CONSTRUCT <RESULT> $1 </RESULT>
XML-QL (continued)
WHERE <BOOK> $b </BOOK> IN "www.booklist.com/books.xml",
<AUTHOR> $n </AUTHOR>
<PUBLISHED> $p </PUBLISHED> IN $b
CONSTRUCT
<RESULT>
<PUBLISHED> $p </PUBLISHED>
WHERE <LAST> $l </LAST> IN $n
CONSTRUCT <LAST> $l </LAST>
</RESULT>
XML-QL (continued)
<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA>
<!ELEMENT article (author+, title, year?,
(shortversion|longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
XML-QL (continued)
WHERE <book>
<publisher><name>Addison-Wesley</name></publisher>
<title> $t</title>
<author> $a</author>
</book> IN "www.a.b.c/bib.xml"
CONSTRUCT $a
XML-QL (continued)
WHERE <book>
<publisher><name>Addison-Wesley</></>
<title> $t</>
<author> $a</>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT $a
XML-QL (continued)
WHERE <book>
<publisher><name>Addison-Wesley</></>
<title> $t</>
<author> $a</>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT <result>
<author> $a</>
<title> $t</>
</>
XML-QL (continued)
<bib>
<book year="1995">
<!-- A good introductory text -->
<title> An Introduction to Database Systems </title>
<author> <lastname> Date </lastname> </author>
<publisher> <name> Addison-Wesley </name > </publisher>
</book>
<book year="1998">
<title> Foundation for Object/Relational Databases: The Third Manifesto </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
<publisher> <name> Addison-Wesley </name > </publisher>
</book>
</bib>
XML-QL (continued)
<result>
<author> <lastname> Date </lastname> </author>
<title> An Introduction to Database Systems </title>
</result>
<result>
<author> <lastname> Date </lastname> </author>
<title> Foundation for Object/Relational Databases: The Third Manifesto </title>
</result>
<result>
<author> <lastname> Darwen </lastname> </author>
<title> Foundation for Object/Relational Databases: The Third Manifesto </title>
</result>
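One way to read the result above: the WHERE pattern is matched once per combination of variable bindings, so a book with two authors produces two <result> elements. The Python sketch below emulates that behavior with the standard xml.etree.ElementTree module. It is not an XML-QL implementation; the inline BIB string stands in for www.a.b.c/bib.xml, and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for "www.a.b.c/bib.xml" (same content as the sample data).
BIB = """
<bib>
  <book year="1995">
    <title> An Introduction to Database Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1998">
    <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>
"""


def author_title_pairs(xml_text, publisher="Addison-Wesley"):
    """Return one (lastname, title) pair per binding of $a and $t."""
    pairs = []
    for book in ET.fromstring(xml_text).iter("book"):
        # The constant in the pattern acts as a selection on publisher/name.
        name = book.findtext("publisher/name", default="").strip()
        if name != publisher:
            continue
        # Every (title, author) combination inside the book is one binding.
        for title in book.findall("title"):
            for author in book.findall("author"):
                lastname = author.findtext("lastname", default="").strip()
                pairs.append((lastname, (title.text or "").strip()))
    return pairs


for lastname, title in author_title_pairs(BIB):
    print(lastname, "|", title)
```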
XML-QL (continued)
WHERE <book > $p</> IN "www.a.b.c/bib.xml",
<title > $t</>,
<publisher><name>Addison-Wesley</></> IN $p
CONSTRUCT <result>
<title> $t </>
WHERE <author> $a </> IN $p
CONSTRUCT <author> $a</>
</>
XML-QL (continued)
<result>
<title> An Introduction to Database Systems </title>
<author> <lastname> Date </lastname> </author>
</result>
<result>
<title> Foundation for Object/Relational Databases: The Third Manifesto </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
</result>
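The grouped result above comes from nesting: the outer query produces one <result> per book, and the inner query adds every author of that book to it. A rough Python emulation, again with xml.etree.ElementTree, is sketched below. The wrapping <results> element and the function name are assumptions added so the output forms a single tree.

```python
import xml.etree.ElementTree as ET

# Compact stand-in for "www.a.b.c/bib.xml" (same books as the sample data).
BIB = (
    '<bib>'
    '<book year="1995">'
    '<title>An Introduction to Database Systems</title>'
    '<author><lastname>Date</lastname></author>'
    '<publisher><name>Addison-Wesley</name></publisher>'
    '</book>'
    '<book year="1998">'
    '<title>Foundation for Object/Relational Databases: The Third Manifesto</title>'
    '<author><lastname>Date</lastname></author>'
    '<author><lastname>Darwen</lastname></author>'
    '<publisher><name>Addison-Wesley</name></publisher>'
    '</book>'
    '</bib>'
)


def group_authors_by_title(xml_text, publisher="Addison-Wesley"):
    """Build one <result> per matching book, holding its title and all authors."""
    root = ET.fromstring(xml_text)
    out = ET.Element("results")  # hypothetical wrapper element
    for book in root.iter("book"):  # outer WHERE: binds $p to the book content
        if book.findtext("publisher/name", default="").strip() != publisher:
            continue
        result = ET.SubElement(out, "result")
        ET.SubElement(result, "title").text = book.findtext("title")
        for author in book.findall("author"):  # inner WHERE ... IN $p
            result.append(author)
    return out


print(ET.tostring(group_authors_by_title(BIB), encoding="unicode"))
```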
XML-QL (continued)
WHERE <article>
<author>
<firstname> $f </> // firstname $f
<lastname> $l </> // lastname $l
</>
</> CONTENT_AS $a IN "www.a.b.c/bib.xml",
<book year=$y>
<author>
<firstname> $f </> // join on same firstname $f
<lastname> $l </> // join on same lastname $l
</>
</> IN "www.a.b.c/bib.xml",
$y > 1995
CONSTRUCT <article> $a </>
XML-QL (continued)
<!ATTLIST person ID ID #REQUIRED>
<!ATTLIST article author IDREFS #IMPLIED>
XML-QL (continued)
<person ID="o123">
<firstname>John</firstname>
<lastname>Smith</lastname>
</person>
<person ID="o234">
...
</person>
<article author="o123 o234">
<title> ... </title>
<year> 1995 </year>
</article>
XML-QL (continued)
WHERE <article><author><lastname> $n</></></> IN "abc.xml"
WHERE <article author=$i>
<title> </> ELEMENT_AS $t
</>,
<person ID=$i>
<lastname> </> ELEMENT_AS $l
</>
CONSTRUCT <result> $t $l</>
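The second query joins each article to its authors through the shared variable $i, which ranges over the individual IDs in the IDREFS attribute. A minimal Python sketch of that join follows. The <db> wrapper, the name of person o234, and the article title are hypothetical placeholders for parts the slides leave elided; only person o123 comes from the sample data.

```python
import xml.etree.ElementTree as ET

# Person o123 is from the slides; the <db> wrapper, person o234's name and
# the article title are hypothetical placeholders.
DOC = """
<db>
  <person ID="o123"><firstname>John</firstname><lastname>Smith</lastname></person>
  <person ID="o234"><firstname>Jane</firstname><lastname>Doe</lastname></person>
  <article author="o123 o234"><title>Placeholder title</title><year>1995</year></article>
</db>
"""


def titles_with_lastnames(xml_text):
    """Return one (title, lastname) pair per article/author reference."""
    root = ET.fromstring(xml_text)
    # Index persons by their ID attribute (the ID side of the join).
    persons = {p.get("ID"): p for p in root.iter("person")}
    pairs = []
    for article in root.iter("article"):
        title = article.findtext("title", default="").strip()
        # An IDREFS value is a whitespace-separated list of IDs.
        for ref in article.get("author", "").split():
            person = persons.get(ref)
            if person is not None:
                lastname = person.findtext("lastname", default="").strip()
                pairs.append((title, lastname))
    return pairs


print(titles_with_lastnames(DOC))
```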
Scalar values
<title>A Trip to <titlepart> the Moon </titlepart></title>
NOT!
<title><CDATA> A Trip to </CDATA><titlepart><CDATA> the Moon</CDATA></titlepart></title>
YES
Tag variables
WHERE <$p>
<title> $t </title>
<year>1995</>
<$e> Smith </>
</> IN "www.a.b.c/bib.xml",
$e IN {author, editor}
CONSTRUCT <$p>
<title> $t </title>
<$e> Smith </>
</>
Transforming data
<!ELEMENT book (author+, title, publisher)>
<!ATTLIST book year CDATA>
<!ELEMENT article (author+, title, year?, (shortversion|longversion))>
<!ATTLIST article type CDATA>
<!ELEMENT publisher (name, address)>
<!ELEMENT author (firstname?, lastname)>
<!ELEMENT person (lastname, firstname, address?, phone?, publicationtitle*)>
Transforming data (cont’d)
WHERE <$> <author> <firstname> $fn </>
<lastname> $ln </>
</>
<title> $t </>
</> IN "www.a.b.c/bib.xml"
CONSTRUCT <person ID=PersonID($fn, $ln)>
<firstname> $fn </>
<lastname> $ln </>
<publicationtitle> $t </>
</>
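PersonID($fn, $ln) in the query above is a Skolem function: the same pair of arguments always yields the same ID, so all publications by one author end up under a single <person>. The Python sketch below illustrates the idea with a dictionary keyed by the argument tuple. The class name, the generated ID format ("p1", "p2", ...), and the first names in the usage example are assumptions for illustration; the sample data in the slides carries last names only.

```python
from itertools import count


class Skolem:
    """Map each distinct argument tuple to one stable generated ID."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._ids = {}
        self._next = count(1)

    def __call__(self, *args):
        # A new ID is minted only the first time an argument tuple is seen.
        if args not in self._ids:
            self._ids[args] = f"{self._prefix}{next(self._next)}"
        return self._ids[args]


def build_persons(bindings):
    """Group (firstname, lastname, title) bindings into one record per person."""
    person_id = Skolem("p")  # plays the role of PersonID($fn, $ln)
    persons = {}
    for fn, ln, title in bindings:
        pid = person_id(fn, ln)
        record = persons.setdefault(
            pid, {"firstname": fn, "lastname": ln, "publicationtitle": []}
        )
        record["publicationtitle"].append(title)
    return persons


# First names are added here for illustration only.
bindings = [
    ("C. J.", "Date", "An Introduction to Database Systems"),
    ("C. J.", "Date", "Foundation for Object/Relational Databases: The Third Manifesto"),
    ("Hugh", "Darwen", "Foundation for Object/Relational Databases: The Third Manifesto"),
]
for pid, record in build_persons(bindings).items():
    print(pid, record["lastname"], len(record["publicationtitle"]))
```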
Integrating data from different sources
WHERE <person>
<name></> ELEMENT_AS $n
<ssn> $ssn</>
</> IN "www.a.b.c/data.xml",
<taxpayer>
<ssn> $ssn</>
<income></> ELEMENT_AS $i
</> IN "www.irs.gov/taxpayers.xml"
CONSTRUCT <result> $n $i </>
Query blocks
WHERE <$e> <title> $t </>
<year> 1995 </> </> CONTENT_AS $p
IN "www.a.b.c/bib.xml"
CONSTRUCT <result ID=ResultID($p)> <title> $t </> </>
{ WHERE $e = "journal-paper",
<month> $m </> IN $p
CONSTRUCT <result ID=ResultID($p)> <month> $m </> </>
}
{ WHERE $e = "book",
<publisher>$q </> IN $p
CONSTRUCT <result ID=ResultID($p)> <publisher>$q </> </>
}
WSQ
Web-supported queries
SIGMOD 2000 (Goldman and Widom)
WebPages(SearchExp, T1, T2, …, Tn, URL, Rank, Date)
SELECT NAME, COUNT
FROM STATES, WEBCOUNT
WHERE NAME = T1
ORDER BY COUNT DESC
XHTML
Simple example
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
</body>
</html>
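Because XHTML is XML, the example above can be read by any XML parser. The Python sketch below parses it with the standard xml.etree.ElementTree module and extracts the link targets. The function name is made up for illustration. The parser checks well-formedness only; it does not fetch the DTD or validate against it.

```python
import xml.etree.ElementTree as ET

# The document is passed as bytes so the encoding declaration is honored.
XHTML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
  </body>
</html>
"""

# Elements live in the XHTML namespace, so lookups need a prefix mapping.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def links(xhtml_bytes):
    """Return the href of every <a> element; raises ET.ParseError if the
    input is not well-formed XML."""
    root = ET.fromstring(xhtml_bytes)
    return [a.get("href") for a in root.findall(".//x:a", NS)]


print(links(XHTML))
```

Markup that an HTML browser would tolerate, such as an unclosed <p>, makes the parser raise ET.ParseError here.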
SI 760: Language and information (Fall 2000)
SI 760 (1)
Classes 1-3: Introduction to the course and linguistic background
The study of language. Computational Linguistics and Psycholinguistics.
Classes 4-5: Elementary probability and statistics
Describing data. Measures of central tendency. The z score. Hypothesis testing.
Classes 6-8: Information theory
Entropy, joint entropy, conditional entropy. Relative entropy and mutual information. Chain rules.
Classes 9-10: Data compression and coding
Entropy rate. Language modeling. Examples of codes. Optimal codes. Huffman codes. Arithmetic coding. The entropy of English.
SI 760 (2)
Classes 11-12: Clustering
Cluster analysis. Clustering of terms according to semantic similarity. Distributional clustering.
Classes 13-14: Concordancing and collocations
Concordances. Collocations. Syntactic criteria for collocability.
Classes 15-16: Literary detective work
The statistical analysis of writing style. Decipherment and translation.
Classes 17-18: Information extraction
Message understanding. Trainable methods.
SI 760 (3)
Classes 19-20: Word sense disambiguation and lexical acquisition
Supervised disambiguation. Unsupervised disambiguation. Attachment ambiguity. Computational lexicography.
Classes 21-22: Part-of-speech tagging
Statistical taggers. Transformation-based learning of tags. Maximum entropy models. Weighted finite-state transducers.
Classes 23-24: Question answering
Semantic representation. Predictive annotation.
SI 760 (4)
Classes 25-26: Text summarization
Single-document summarization. Multi-document summarization. Language models. Maximal Marginal Relevance. Cross-document structure theory. Trainable methods. Text categorization.
Classes 27-28 (30): Other topics
Text alignment. Word alignment. Statistical machine translation. Discourse segmentation. Text categorization. Maximum entropy modeling.
SI 760 (5)
Manning and Schuetze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
Jurafsky and Martin. Speech and Language Processing. Prentice-Hall, 2000.
Cover and Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
Baeza-Yates and Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, 1998.
Course URL
http://www.si.umich.edu/~radev/760f00
Readings for next time
• Web-based readings
– Asilomar report:
• http://www.acm.org/sigmod/record/issues/9812/asilomar.html
– White paper on XML:
• http://www-db.stanford.edu/~widom/xml-whitepaper.html