Transcript of PowerPoint slides

CS 430: Information Discovery
Lecture 8
Collection-Level Metadata
Vector Methods
1
Course Administration
2
Collection-level metadata
Several of the fields that are most difficult to extract automatically are
the same for every page in a web site.
Therefore, create a collection-level record manually and combine it with
automatically extracted fields at the item level.
For the CS 430 home page, collection-level metadata:
<meta name="DC.Publisher" content="Cornell University">
<meta name="DC.Creator" content="William Y. Arms">
<meta name="DC.Rights" content="William Y. Arms, 2001">
See: Jenkins and Inman
3
Collection-level metadata
Compare:
(a) Metadata extracted automatically by DC-dot
(b) Collection-level record
(c) Combined item-level record (DC-dot plus collection-level)
(d) Manual record
4
5
Metadata extracted automatically by DC-dot

D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
subject                     (not included in this slide)
publisher                   Corporation for National Research Initiatives
date          W3CDTF        2000-05-11
type          DCMIType      Text
format                      text/html
format                      27718 bytes
identifier                  http://www.dlib.org/dlib/january00/01levy.html
6
Collection-level record

D.C. Field    Qualifier     Content
publisher                   Corporation for National Research Initiatives
type                        article
type          resource      work
relation      rel-type      InSerial
relation      serial-name   D-Lib Magazine
relation      issn          1082-9873
language                    English
rights                      Permission is hereby given for the material in D-Lib Magazine to be used for ...
7
Combined item-level record (DC-dot plus collection-level)

D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
publisher                   (*) Corporation for National Research Initiatives
date          W3CDTF        2000-05-11
type                        (*) article
type          resource      (*) work
type          DCMIType      Text
format                      text/html
format                      27718 bytes

(*) indicates collection-level metadata
continued on next slide
8
Combined item-level record (DC-dot plus collection-level)

D.C. Field    Qualifier     Content
relation      rel-type      (*) InSerial
relation      serial-name   (*) D-Lib Magazine
relation      issn          (*) 1082-9873
language                    (*) English
rights                      (*) Permission is hereby given for the material in D-Lib Magazine to be used for ...
identifier                  http://www.dlib.org/dlib/january00/01levy.html

(*) indicates collection-level metadata
9
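A minimal sketch of how such a combined record might be produced (illustrative Python; the dictionaries and field names are simplified and are not the actual DC-dot output format): the manually created collection-level record supplies the fields that automatic extraction cannot fill, and is merged with each automatically extracted item-level record.

# Sketch: merge automatically extracted item-level metadata with a manually
# created collection-level record. Field names and structures are illustrative.

item_level = {                      # e.g. extracted by a tool such as DC-dot
    "title": "Digital Libraries and the Problem of Purpose",
    "date": "2000-05-11",
    "format": ["text/html", "27718 bytes"],
    "identifier": "http://www.dlib.org/dlib/january00/01levy.html",
}

collection_level = {                # created manually, shared by every item
    "publisher": "Corporation for National Research Initiatives",
    "type": ["article", "work"],
    "relation": {"rel-type": "InSerial",
                 "serial-name": "D-Lib Magazine",
                 "issn": "1082-9873"},
    "language": "English",
    "rights": "Permission is hereby given for the material in D-Lib Magazine ...",
}

# Collection-level values fill in whatever the item-level record lacks;
# item-level values win when both records contain the same field.
combined = {**collection_level, **item_level}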
Manually created record

D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
creator                     (+) David M. Levy
publisher                   Corporation for National Research Initiatives
date          publication   January 2000
type                        article
type          resource      work

(+) entry that is not in the automatically generated records
continued on next slide
10
Manually created record

D.C. Field    Qualifier     Content
relation      rel-type      InSerial
relation      serial-name   D-Lib Magazine
relation      issn          1082-9873
relation      volume        (+) 6
relation      issue         (+) 1
identifier    DOI           (+) 10.1045/january2000-levy
identifier    URL           http://www.dlib.org/dlib/january00/01levy.html
language                    English
rights                      (+) Copyright (c) David M. Levy

(+) entry that is not in the automatically generated records
11
SMART System
An experimental system for automatic information retrieval
• automatic indexing to assign terms to documents and queries
• collect related documents into common subject classes
• identify documents to be retrieved by calculating similarities between documents and queries
• procedures for producing an improved search query based on information obtained from earlier searches

Gerard Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988
12
Vector Space Methods
Problem: Given two text documents, how similar are they?
(One document may be a query.)
Vector space methods that measure similarity do not assume
exact matches.
Benefits of similarity measures rather than exact matches
• Encourage long queries, which are rich in information. An
abstract should be very similar to its source document.
• Accept probabilistic aspects of writing and searching.
Different words will be used if an author writes the same
document twice.
13
Vector space revision
x = (x_1, x_2, x_3, ..., x_n) is a vector in an n-dimensional vector space

Length of x is given by (extension of Pythagoras's theorem):
|x|^2 = x_1^2 + x_2^2 + x_3^2 + ... + x_n^2

If x_1 and x_2 are vectors, their inner product (or dot product) is given by:
x_1 · x_2 = x_{11} x_{21} + x_{12} x_{22} + x_{13} x_{23} + ... + x_{1n} x_{2n}

Cosine of the angle θ between the vectors x_1 and x_2:
cos(θ) = (x_1 · x_2) / (|x_1| |x_2|)
14
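As a concrete check of these formulas, here is a short sketch in plain Python (no external libraries; the helper names are mine) computing the length, inner product, and cosine for two small vectors:

import math

def dot(x, y):
    # inner product: x . y = x1*y1 + x2*y2 + ... + xn*yn
    return sum(xi * yi for xi, yi in zip(x, y))

def length(x):
    # |x| = sqrt(x1^2 + x2^2 + ... + xn^2)
    return math.sqrt(dot(x, x))

def cosine(x, y):
    # cos(theta) = x . y / (|x| |y|)
    return dot(x, y) / (length(x) * length(y))

# Example: two vectors pointing in fairly similar directions.
print(cosine([2, 1, 0], [1, 1, 0]))   # about 0.95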
Vector Space Methods: Concept
n-dimensional space, where n is the total number of different
terms used to index a set of documents.
Each document is represented by a vector, with magnitude in
each dimension equal to the (weighted) number of times that
the corresponding term appears in the document.
Similarity between two documents is the angle between their
vectors.
15
Three terms represented in 3 dimensions

[Figure: two document vectors d1 and d2 plotted in a three-dimensional space with axes t1, t2, t3; the angle between them measures their similarity]
16
Example 1: Incidence array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

        ant  bee  cat  dog  eel  fox  gnu  hog    length
d1       1    1                                    √2
d2       1    1         1                    1     √4
d3                 1    1    1    1    1           √5

Weights: t_ij = 1 if document i contains term j and zero otherwise
17
Example 1 (continued)
Similarity of documents in example:
        d1     d2     d3
d1      1      0.71   0
d2      0.71   1      0.22
d3      0      0.22   1
• Similarity is based only on the occurrence of terms; it ignores all
other characteristics of the documents.
18
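A small sketch in plain Python (the helper names are mine, not from SMART) that builds the incidence array above and reproduces this similarity matrix; the weighting parameter also anticipates the frequency weights of Example 2 below.

import math
from collections import Counter

docs = {
    "d1": "ant ant bee".split(),
    "d2": "bee hog ant dog".split(),
    "d3": "cat gnu dog eel fox".split(),
}
terms = sorted({t for words in docs.values() for t in words})

def vector(words, weight="binary"):
    counts = Counter(words)
    if weight == "binary":                      # incidence array (Example 1)
        return [1 if counts[t] else 0 for t in terms]
    return [counts[t] for t in terms]           # term frequencies (Example 2)

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

vecs = {d: vector(w, "binary") for d, w in docs.items()}
for a in docs:
    print(a, [round(cosine(vecs[a], vecs[b]), 2) for b in docs])
# d1 [1.0, 0.71, 0.0]
# d2 [0.71, 1.0, 0.22]
# d3 [0.0, 0.22, 1.0]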
Example 2: frequency array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox
        ant  bee  cat  dog  eel  fox  gnu  hog    length
d1       2    1                                    √5
d2       1    1         1                    1     √4
d3                 1    1    1    1    1           √5

Weights: t_ij = frequency with which term j occurs in document i
19
Example 2 (continued)
Similarity of documents in example:
        d1     d2     d3
d1      1      0.67   0
d2      0.67   1      0.22
d3      0      0.22   1
• Similarity depends upon the weights given to the terms.
20
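Reusing the sketch shown after Example 1 (same docs, vector, and cosine definitions), only the weighting changes to reproduce this matrix:

vecs = {d: vector(w, weight="tf") for d, w in docs.items()}   # raw term frequencies
print(round(cosine(vecs["d1"], vecs["d2"]), 2))               # 0.67
print(round(cosine(vecs["d2"], vecs["d3"]), 2))               # 0.22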
Vector similarity computation
Documents in a collection are assigned terms from a set of n terms
The term assignment array T is defined as
if term j does not occur in document i, t_ij = 0
if term j occurs in document i, t_ij is greater than zero
(the value of t_ij is called the weight of term j in document i)

Similarity between d_i and d_j is defined as:

cos(d_i, d_j) = ( Σ_{k=1}^{n} t_ik t_jk ) / ( |d_i| |d_j| )
21
Simple use of vector similarity
Threshold
For query q, retrieve all documents with similarity more
than 0.50
Ranking
For query q, return the n most similar documents ranked
in order of similarity
22
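A sketch of the two retrieval modes, assuming the cosine routine from the earlier example, a dictionary doc_vectors mapping document names to vectors, and a hypothetical query vector q (all of these names are mine):

def retrieve_by_threshold(q, doc_vectors, threshold=0.50):
    # return every document whose similarity to the query exceeds the threshold
    return [d for d, v in doc_vectors.items() if cosine(q, v) > threshold]

def retrieve_by_rank(q, doc_vectors, n=10):
    # return the n most similar documents, ranked by decreasing similarity
    ranked = sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]), reverse=True)
    return ranked[:n]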
Contrast with Boolean searching
With Boolean retrieval, a document either matches a query
exactly or not at all
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0
to 1 for all documents
• Encourages long queries to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need
match the document
23
Document vectors as points on a surface

• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
24
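A small sketch of the normalization step (helper names are mine): once each vector is scaled to length 1, the cosine similarity is simply the dot product of the two normalized vectors.

import math

def normalize(x):
    # scale the vector to unit length so its end lies on the unit sphere
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / norm for xi in x]

d1 = normalize([2, 1, 0, 0])
d2 = normalize([1, 1, 0, 1])
# For unit vectors, cosine similarity reduces to the plain dot product.
print(sum(a * b for a, b in zip(d1, d2)))   # about 0.77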
Relevance feedback (concept)
[Figure: hits from the original search plotted as points in the document space; separate markers show the original query and the reformulated query, with the reformulated query lying closer to the documents identified as relevant]

x documents identified as non-relevant
o documents identified as relevant
25
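The figure only sketches the idea; one common way to carry it out in vector terms is a Rocchio-style update (this particular formula and its weights are illustrative, not taken from this lecture): keep the original query vector, add a weighted centroid of the documents judged relevant, and subtract a weighted centroid of those judged non-relevant.

def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    # Rocchio-style sketch: retain the original query, move toward the centroid
    # of the relevant documents, and away from the centroid of the non-relevant.
    # alpha, beta, gamma are illustrative weights, not values from the lecture.
    def centroid(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)] if n else [0.0] * len(query)
    rel = centroid(relevant)
    non = centroid(non_relevant)
    return [alpha * q + beta * r - gamma * s for q, r, s in zip(query, rel, non)]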
Document clustering (concept)
[Figure: documents plotted as points (x) in the document space, falling into several distinct clusters]
Document clusters are a form of
automatic classification.
A document may be in several
clusters.
26