CS 430: Information Discovery
Lecture 8
Collection-Level Metadata
Vector Methods 1

Course Administration

Collection-level metadata
Several of the most difficult fields to extract automatically are the same across all pages in a web site.
Therefore create a collection record manually and combine it with automatic extraction of other fields at item level.
For the CS 430 home page, collection-level metadata:
    <meta name="DC.Publisher" content="Cornell University">
    <meta name="DC.Creator" content="William Y. Arms">
    <meta name="DC.Rights" content="William Y. Arms, 2001">
See: Jenkins and Inman

Collection-level metadata
Compare:
(a) Metadata extracted automatically by DC-dot
(b) Collection-level record
(c) Combined item-level record (DC-dot plus collection-level)
(d) Manual record

Metadata extracted automatically by DC-dot
D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
subject                     not included in this slide
publisher                   Corporation for National Research Initiatives
date          W3CDTF        2000-05-11
type          DCMIType      Text
format                      text/html
format                      27718 bytes
identifier                  http://www.dlib.org/dlib/january00/01levy.html

Collection-level record
D.C. Field    Qualifier     Content
publisher                   Corporation for National Research Initiatives
type                        article
type          resource      work
relation      rel-type      InSerial
relation      serial-name   D-Lib Magazine
relation      issn          1082-9873
language                    English
rights                      Permission is hereby given for the material in D-Lib Magazine to be used for ...

Combined item-level record (DC-dot plus collection-level)
D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
publisher                   (*) Corporation for National Research Initiatives
date          W3CDTF        2000-05-11
type                        (*) article
type          resource      (*) work
type          DCMIType      Text
format                      text/html
format                      27718 bytes
relation      rel-type      (*) InSerial
relation      serial-name   (*) D-Lib Magazine
relation      issn          (*) 1082-9873
language                    (*) English
rights                      (*) Permission is hereby given for the material in D-Lib Magazine to be used for ...
identifier                  http://www.dlib.org/dlib/january00/01levy.html
(*) indicates collection-level metadata

Manually created record
D.C. Field    Qualifier     Content
title                       Digital Libraries and the Problem of Purpose
creator                     (+) David M. Levy
publisher                   Corporation for National Research Initiatives
date          publication   January 2000
type                        article
type          resource      work
relation      rel-type      InSerial
relation      serial-name   D-Lib Magazine
relation      issn          1082-9873
relation      volume        (+) 6
relation      issue         (+) 1
identifier    DOI           (+) 10.1045/january2000-levy
identifier    URL           http://www.dlib.org/dlib/january00/01levy.html
language                    English
rights                      (+) Copyright (c) David M. Levy
(+) entry that is not in the automatically generated records

SMART System
An experimental system for automatic information retrieval
• automatic indexing to assign terms to documents and queries
• collect related documents into common subject classes
• identify documents to be retrieved by calculating similarities between documents and queries
• procedures for producing an improved search query based on information obtained from earlier searches
Gerald Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988

Vector Space Methods
Problem: Given two text documents, how similar are they? (One document may be a query.)
Vector space methods that measure similarity do not assume exact matches.
Benefits of similarity measures rather than exact matches
• Encourage long queries, which are rich in information. An abstract should be very similar to its source document.
• Accept probabilistic aspects of writing and searching.
  Different words will be used if an author writes the same document twice.

Vector space revision
$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.
Length of x is given by (extension of Pythagoras's theorem):
    $|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$
If $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors:
Inner product (or dot product) is given by
    $\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23} + \cdots + x_{1n}x_{2n}$
Cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$:
    $\cos(\theta) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1| \, |\mathbf{x}_2|}$

Vector Space Methods: Concept
n-dimensional space, where n is the total number of different terms used to index a set of documents.
Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document.
Similarity between two documents is measured by the angle between their vectors: the smaller the angle (the larger its cosine), the more similar the documents.

Three terms represented in 3 dimensions
[Figure: document vectors d1 and d2 drawn in a space with axes t1, t2, t3]

Example 1: Incidence array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms   ant  bee  cat  dog  eel  fox  gnu  hog   length
d1       1    1                                  √2
d2       1    1         1                   1    √4
d3                 1    1    1    1    1         √5

Weights: $t_{ij} = 1$ if document i contains term j, and zero otherwise

Example 1 (continued)
Similarity of documents in example:
      d1     d2     d3
d1    1      0.71   0
d2    0.71   1      0.22
d3    0      0.22   1
• Similarity measures the occurrences of terms, but no other characteristics of the documents.

Example 2: frequency array
terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms   ant  bee  cat  dog  eel  fox  gnu  hog   length
d1       2    1                                  √5
d2       1    1         1                   1    √4
d3                 1    1    1    1    1         √5

Weights: $t_{ij}$ = number of times that term j occurs in document i

Example 2 (continued)
Similarity of documents in example:
      d1     d2     d3
d1    1      0.67   0
d2    0.67   1      0.22
d3    0      0.22   1
• Similarity depends upon the weights given to the terms.
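
The similarity values in Examples 1 and 2 can be checked with a short script. The following is a minimal sketch in Python, not part of the original lecture; the function names (`incidence_vector`, `frequency_vector`, `cosine`, `similarity_table`) are illustrative assumptions, and documents are represented as sparse dictionaries from term to weight.

```python
import math
from collections import Counter

# The three example documents from the slides, as lists of terms.
docs = {
    "d1": "ant ant bee".split(),
    "d2": "bee hog ant dog".split(),
    "d3": "cat gnu dog eel fox".split(),
}


def incidence_vector(terms):
    """Example 1 weighting: 1 if the term occurs in the document."""
    return {term: 1 for term in set(terms)}


def frequency_vector(terms):
    """Example 2 weighting: number of times the term occurs."""
    return dict(Counter(terms))


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    length_u = math.sqrt(sum(w * w for w in u.values()))
    length_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (length_u * length_v)


def similarity_table(weighting):
    """Pairwise similarities, rounded to two places as on the slides."""
    vectors = {name: weighting(terms) for name, terms in docs.items()}
    return {
        (a, b): round(cosine(vectors[a], vectors[b]), 2)
        for a in vectors
        for b in vectors
    }


incidence = similarity_table(incidence_vector)
frequency = similarity_table(frequency_vector)
print(incidence[("d1", "d2")], frequency[("d1", "d2")])  # 0.71 0.67
```

Only the weighting function differs between the two runs, which is why the d1/d2 similarity changes from 0.71 to 0.67 while the pairs involving d3 are unaffected.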

Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as:
    if term j does not occur in document i, $t_{ij} = 0$
    if term j occurs in document i, $t_{ij}$ is greater than zero
    (the value of $t_{ij}$ is called the weight of term j in document i)
Similarity between $d_i$ and $d_j$ is defined as
    $\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik} \, t_{jk}}{|d_i| \, |d_j|}$

Simple use of vector similarity
Threshold
    For query q, retrieve all documents with similarity more than 0.50
Ranking
    For query q, return the n most similar documents ranked in order of similarity

Contrast with Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents
• Encourages long queries to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document

Document vectors as points on a surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface

Relevance feedback (concept)
[Figure: hits from the original search, marked x (documents identified as non-relevant) and o (documents identified as relevant), together with the original query and the reformulated query]

Document clustering (concept)
[Figure: documents plotted as points (x)]
Document clusters are a form of automatic classification.
A document may be in several clusters.
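
The two simple uses of vector similarity described earlier (threshold and ranking) can be illustrated with a small sketch. This is an assumption-laden example rather than the lecture's own code: it reuses the three example documents from Examples 1 and 2 with term-frequency weights, the query "ant dog" is hypothetical, and the function names `retrieve_by_threshold` and `retrieve_top_n` are made up for illustration.

```python
import math
from collections import Counter

# Example documents from the slides; the query below is hypothetical.
DOCS = {
    "d1": "ant ant bee".split(),
    "d2": "bee hog ant dog".split(),
    "d3": "cat gnu dog eel fox".split(),
}


def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    length_u = math.sqrt(sum(w * w for w in u.values()))
    length_v = math.sqrt(sum(w * w for w in v.values()))
    if length_u == 0 or length_v == 0:
        return 0.0
    return dot / (length_u * length_v)


def score(query_terms, docs):
    """Similarity of the query to every document."""
    q = Counter(query_terms)
    return {name: cosine(q, Counter(terms)) for name, terms in docs.items()}


def retrieve_by_threshold(query_terms, docs, threshold=0.50):
    """Threshold: all documents with similarity above the threshold, best first."""
    hits = [(name, s) for name, s in score(query_terms, docs).items() if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def retrieve_top_n(query_terms, docs, n):
    """Ranking: the n most similar documents, in order of similarity."""
    ranked = sorted(score(query_terms, docs).items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


query = "ant dog".split()
above = [name for name, _ in retrieve_by_threshold(query, DOCS)]
top3 = [name for name, _ in retrieve_top_n(query, DOCS, 3)]
print(above)  # ['d2', 'd1']
print(top3)   # ['d2', 'd1', 'd3']
```

With this query, d3 shares only one term with the query and falls below the 0.50 threshold, but it still appears at the bottom of the ranked list, which shows the practical difference between the two retrieval modes.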