Transcript of PowerPoint slides
Slide 1: CS 430: Information Discovery
Lecture 2: Introduction to Text-Based Information Retrieval

Slide 2: Course Administration
• Please send all questions about the course to: [email protected]
The message will be sent to:
[email protected] (Bill Arms)
[email protected] (Manpreet Singh)
[email protected] (Sid Anand)
[email protected] (Martin Guerrero)

Slide 3: Course Administration: Programming in Perl
Assignments 2, 3, and 4 require programs to be written in Perl. An introduction to programming in Perl will be given at 7:30 p.m. on Wednesdays, September 19 and October 3. These classes are optional; there will not be regular discussion classes on these dates. Materials about Perl and further information about these classes will be posted on the course web site.

Slide 4: Course Administration: Discussion Class, Wednesday, September 4
Read and be prepared to discuss:
Harman, D., Fox, E., and Baeza-Yates, R. A., Inverted files. (Frakes and Baeza-Yates, Chapter 3)
Phillips Hall 101, 7:30 to 8:30 p.m.

Slide 5: Classical Information Retrieval
[Diagram: media types (text; image, video, audio, etc.) and approaches to information discovery: searching (statistical, natural language processing), browsing (catalogs, indexes and other metadata, user-in-loop), and linking, with references to CS 502 and CS 474.]

Slide 6: Recall and Precision
If information retrieval were perfect, every hit would be relevant to the original query, and every relevant item in the body of information would be found.
Precision: the percentage of the hits that are relevant; the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.
Recall: the percentage of the relevant items that are found by the query; the extent to which the query found all the items that satisfy the requirement.

Slide 7: Recall and Precision: Example
• Collection of 10,000 documents, 50 on a specific topic
• An ideal search finds these 50 documents and rejects all others
• An actual search identifies 25 documents; 20 are relevant but 5 are on other topics
• Precision: 20/25 = 0.8
• Recall: 20/50 = 0.4

Slide 8: Measuring Precision and Recall
Precision is easy to measure:
• A knowledgeable person looks at each document that is identified and decides whether it is relevant.
• In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure:
• To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide whether it fits the criteria.
• In the example, all 10,000 documents must be examined.

Slide 9: Relevance and Ranking
Precision and recall assume that a document is either relevant to a query or not relevant. Often a user will consider a document to be partially relevant.
Ranking methods: measure the degree of similarity between a query and a document.
[Diagram: requests and documents, with the question "How similar is a document to a request?"]

Slide 10: Documents
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. [Methods of markup, e.g., XML, are covered in CS 502.]
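To make the idea of tokens concrete, here is a minimal sketch in Perl (the course's assignment language) that breaks a fragment of free text into tokens. The specific rule used, lowercasing and splitting on runs of non-alphanumeric characters, is an illustrative assumption, not a prescribed tokenizer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Break a fragment of free text into tokens (terms).
# The rule here (lowercase everything, split on runs of characters
# that are not letters or digits) is only an illustrative assumption.
my $text = "A textual document is a sequence of words and other symbols, e.g., punctuation.";

my @tokens = grep { length } split /[^a-z0-9]+/, lc $text;

print scalar(@tokens), " tokens:\n";
print "$_\n" for @tokens;
```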
Slide 11: Word Frequency
Observation: some words are more common than others.
Statistics: most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures used to index documents
• are relied on by many retrieval models
The following example is taken from:
Jamie Callan, Characteristics of Text, 1997
http://hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Slide 12: Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
• f(w) is the frequency with which w appears
• r(w) is the rank of w in order of frequency; e.g., the most commonly occurring word has rank 1
[Graph: frequency f plotted against rank r; a word w has rank r and frequency f.]

Slide 13: [Table: the most frequent words in the example collection, with frequency f]
the     1130021     from      96900     or       54958
of       547311     he        94585     about    53713
to       516635     million   93515     market   52110
a        464736     year      90104     they     51359
in       390819     its       86774     this     50933
and      387703     be        85588     would    50828
that     204351     was       83398     you      49281
for      199340     company   83070     which    48273
is       152483     an        76974     bank     47940
said     148302     has       74405     stock    47401
it       134323     are       74097     trade    47310
on       121173     have      73132     his      47116
by       118863     but       71887     more     46244
as       109135     will      71494     who      42142
at       101779     say       66807     one      41635
mr       101679     new       64456     their    40910
with     101210     share     63925

Slide 14: Zipf's Law
If the words w in a collection are ranked r(w) by their frequency f(w), they roughly fit the relation:
r(w) * f(w) = c
Different collections have different constants c. In English text, c tends to be about n/10, where n is the number of distinct words in the collection.
For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see:
Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949.

Slide 15: [Table: 1000*r*f/n for the same words, showing that the product r*f is roughly constant]
the       59     from      92     or       101
of        58     he        95     about    102
to        82     million   98     market   101
a         98     year     100     they     103
in       103     its      100     this     105
and      122     be       104     would    107
that      75     was      105     you      106
for       84     company  109     which    107
is        72     an       105     bank     109
said      78     has      106     stock    110
it        78     are      109     trade    112
on        77     have     112     his      114
by        81     but      114     more     114
as        80     will     117     who      106
at        80     say      113     one      107
mr        86     new      112     their    108
with      91     share    114

Slide 16: Luhn's Proposal
"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."
Luhn, H. P., The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165 (1958).

Slide 17: Methods that Build on Zipf's Law
Term weighting: give differing weights to terms based on their frequency, with the most frequent words weighted less.
Stop lists: ignore the most frequent words (upper cut-off).
Significant words: ignore both the most frequent and the least frequent words (upper and lower cut-off).

Slide 18: Cut-off Levels for Significant Words
[Graph of frequency f against rank r, showing an upper cut-off, a lower cut-off, the significant words between them, and the resolving power of significant words. From: Van Rijsbergen, Ch. 2.]

Slide 19: Approaches to Weighting
Boolean information retrieval. Weight of term i in document j:
w(i, j) = 1 if term i occurs in document j
w(i, j) = 0 otherwise
Vector space methods. Weight of term i in document j:
0 < w(i, j) <= 1 if term i occurs in document j
w(i, j) = 0 otherwise
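The two weighting approaches can be contrasted with a short Perl sketch. The vector-space weight used here, term frequency divided by the largest term frequency in the document, is only one common choice that satisfies 0 < w(i, j) <= 1; it is an assumption for illustration, not a formula prescribed by the course.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Contrast Boolean and vector-space weights on one short document.
# The scaled term frequency is just one choice that keeps 0 < w <= 1.
my @doc = qw(the cat sat on the mat and the cat slept);

my %tf;                                  # term frequencies in this document
$tf{$_}++ for @doc;
my ($max_tf) = sort { $b <=> $a } values %tf;

for my $term (sort keys %tf) {
    my $boolean = 1;                     # term occurs in the document, so w = 1
    my $vector  = $tf{$term} / $max_tf;  # scaled frequency, 0 < w <= 1
    printf "%-6s boolean = %d   vector = %.2f\n", $term, $boolean, $vector;
}
```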
Slide 20: Functional View of Information Retrieval
Similar: mechanism for determining the similarity of the request representation to the information item representation.
[Diagram: documents and requests, connected through the index database.]

Slide 21: Major Subsystems
Indexing subsystem: receives incoming documents, converts them to the form required for the index, and adds them to the index database.
Search subsystem: receives incoming requests, converts them to the form required for searching the index, and searches the database for matching documents.
The index database is the central hub of the system.

Slide 22: Example: Indexing Subsystem
[Flow diagram, from Frakes, page 7: documents -> assign document IDs -> text, document numbers and *field numbers -> break into words -> words -> stoplist -> non-stoplist words -> stemming* -> stemmed words -> term weighting* -> terms with weights -> index database. *Indicates optional operation.]

Slide 23: Example: Search Subsystem
[Flow diagram: query -> parse query -> query terms -> stoplist -> non-stoplist words -> stemming* -> stemmed words -> Boolean operations against the index database -> retrieved document set -> ranking* -> ranked document set -> relevance judgments* -> relevant document set. *Indicates optional operation.]
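As a rough illustration of how the two subsystems fit together, the following toy Perl sketch indexes three short documents and then answers a Boolean AND query against the index. The stop list, the omission of stemming and term weighting, and the data layout are simplifying assumptions, not the design shown in Frakes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy end-to-end sketch: index a few documents, then answer a Boolean AND query.
my %stoplist = map { $_ => 1 } qw(the a an of to and in);

my %docs = (
    1 => "the automatic creation of literature abstracts",
    2 => "human behaviour and the principle of least effort",
    3 => "automatic indexing of the literature",
);

# Indexing subsystem: break each document into words, drop stop words,
# and record which documents contain each remaining term.
my %index;
while ( my ($id, $text) = each %docs ) {
    for my $term ( grep { length && !$stoplist{$_} } split /[^a-z0-9]+/, lc $text ) {
        $index{$term}{$id} = 1;
    }
}

# Search subsystem: apply the same word breaking and stop list to the query,
# then intersect the document sets for the query terms (Boolean AND).
my $query = "automatic literature";
my @terms = grep { length && !$stoplist{$_} } split /[^a-z0-9]+/, lc $query;

my %seen;
$seen{$_}++ for map { keys %{ $index{$_} || {} } } @terms;
my @hits = sort grep { $seen{$_} == @terms } keys %seen;

print "Documents matching '$query': @hits\n";    # prints: 1 3
```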