Transcript of PowerPoint Slides
Slide 1
Discussion Class 3: Inverse Document Frequency

Slide 2
Discussion Classes

Format: Questions. Ask a member of the class to answer. Provide an opportunity for others to comment.

When answering:
- Stand up.
- Give your name. Make sure that the TA hears it.
- Speak clearly so that all the class can hear.

Suggestions:
- Do not be shy about presenting partial answers.
- Differing viewpoints are welcome.

Slide 3
Question 1: Reading a Research Paper

(a) Who are the authors of this paper and the letter? What is their background? Why did they write this paper?
(b) This paper was written at the time that the technique now known as tf.idf was under development.
   i. What was the state of the art before this paper?
   ii. What was the contribution of this paper?
   iii. What step(s) were left in developing tf.idf?

Slide 4
Question 2: Summary of the First Three Pages of the Paper

Explain the statement, "We are concerned with obtaining an effective vocabulary for a collection of documents of some broadly known subject matter and size."

The paper begins, "We are familiar with the notions of exhaustivity and specificity." Explain these two terms.

Explain the statement, "We should think of specificity as a function of term use. It should be interpreted as a statistical rather than semantic property of index terms."

Slide 5
Question 3: Statistical Specificity

Slide 6
Question 4: Weighting

Sparck Jones suggests a weighting: let f(n) = m such that 2^(m-1) < n <= 2^m. Then, where there are N documents in the collection, the weight of a term that occurs in n documents is:

    f(N) - f(n) + 1

Robertson suggests:

    log2(N/n) + 1

What are the differences between these two expressions?

Slide 7
Question 5: Zipf and Probability

Robertson notes that n/N is the probability p that an item chosen at random will contain a given term. He says: suppose that an item contains the terms a, b, c in common with the question; let the values of p for these terms be p_a, p_b, p_c respectively.
Then the weight (‘level’) assigned to the document is:

    log(1/p_a) + log(1/p_b) + log(1/p_c) = log(1/(p_a * p_b * p_c))

How does this result relate to similarity measures on the term vector space? What probability assumption does this expression depend on? Is it reasonable?

Slide 8
Question 6: Explain this graph.
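The two weighting schemes in Question 4, and the additive log weights in Question 5, can be checked numerically. The sketch below is illustrative only: the collection size N = 1000 and the document frequencies are invented example values, not data from the paper. It compares Sparck Jones's stepwise weight f(N) - f(n) + 1 against Robertson's log2(N/n) + 1, and verifies that the sum of per-term log weights equals the log of the reciprocal product, which is why the matching score can be computed term by term, as in an inner product over the term vector space.

```python
import math

def sparck_jones_weight(N: int, n: int) -> int:
    """Stepwise weight f(N) - f(n) + 1, where f(x) = m such that 2^(m-1) < x <= 2^m."""
    def f(x: int) -> int:
        # ceil(log2(x)) is the smallest m with x <= 2^m (and 2^(m-1) < x).
        return math.ceil(math.log2(x))
    return f(N) - f(n) + 1

def robertson_weight(N: int, n: int) -> float:
    """Continuous weight log2(N/n) + 1."""
    return math.log2(N / n) + 1

N = 1000  # illustrative collection size (an assumption for this example)
for n in (1, 2, 10, 100, 1000):
    print(n, sparck_jones_weight(N, n), round(robertson_weight(N, n), 2))

# Question 5: the sum of log(1/p) weights equals log of the reciprocal product.
p_a, p_b, p_c = 10 / N, 50 / N, 200 / N  # example per-term probabilities n/N
lhs = math.log(1 / p_a) + math.log(1 / p_b) + math.log(1 / p_c)
rhs = math.log(1 / (p_a * p_b * p_c))
assert math.isclose(lhs, rhs)
```

Running the loop shows the two schemes track each other closely: the Robertson weight varies smoothly with n, while the Sparck Jones weight moves in unit steps each time n crosses a power of two, which is one way to frame the "differences between these two expressions" asked about in Question 4.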