Transcript Document
Text Information Retrieval: Text Operations
Multimedia Computing Institute of NJU (南京大学多媒体研究所)
Wu Gangshan (武港山), Tel: 83594243, Office: Meng Minwei Building 608B, Email: [email protected]

Architecture of an IR System (CLIR, QA, Web)
[Figure: system architecture. User side: user interface, user need, query language and query processing, feedback, specific applications. System side: document processing, text processing, logical view, indexing, database management, inverted file. Retrieval loop: query, search over the index and text database, ranking, retrieved documents returned in ranked order.]

Outline
- Document preprocessing
- Document classification
- Document clustering
- Document summarization
- Document compression

Document Preprocessing
- Tokenization
- Stopword removal
- Lemmatization
- Stemming
- Metadata and markup languages

Simple Tokenization
- Analyze text into a sequence of discrete tokens (words).
- Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token; frequently they are not.
- Simplest approach: ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens.
- More careful approach: separate ? ! ; : " ' [ ] ( ) < >; take care with "." and "-" (why? when?); take care with "...".

Punctuation
- Contractions such as ne'er: use language-specific mappings to normalize (ne'er -> never).
- Hyphenated sequences such as state-of-the-art: break them up.
- U.S.A. vs. USA
- a.out

Numbers
- 3/12/91, Mar. 12, 1991, 55 B.C., B-52, 100.2.86.144
- Generally, do not index numbers as text.

Case Folding
- Reduce all letters to lower case.
- Exception: upper case in mid-sentence, e.g. General Motors; Fed vs. fed; SAIL vs. sail.

Tokenizing HTML
- Should text in HTML commands not typically seen by the user be included as tokens? Words appearing in URLs? Words appearing in the "meta text" of images?
- Simplest approach: exclude all HTML tag information (between "<" and ">") from tokenization.

Stopwords
- It is typical to exclude high-frequency words (e.g. function words: "a", "the", "in", "to"; pronouns: "I", "he", "she", "it").
- Stopwords are language dependent.
- For efficiency, store the stopword strings in a hashtable so they can be recognized in constant time (a simple Perl hash suffices for Perl-based implementations).
- How to determine a list of stopwords? For English, existing lists can be used, e.g. SMART's common-word list (about 400 words) or the WordNet stopword list. For Spanish? Bulgarian?

Lemmatization
- Reduce inflectional/variant forms to the base form; this has a direct impact on vocabulary size.
- E.g.: am, are, is -> be; car, cars, car's, cars' -> car; "the boy's cars are different colors" -> "the boy car be different color".
- How to do this? A list of grammatical rules plus a list of irregular forms is needed: children -> child, spoken -> speak, ...
- Practical implementation: use WordNet's morphstr function (see the sketch below).
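A minimal sketch of this preprocessing pipeline, assuming Python with NLTK; the tiny stopword list, the sample sentence, and the use of NLTK's wordnet.morphy in place of the morphstr function mentioned above are illustrative assumptions, not part of the original slides.

    import re
    from nltk.corpus import wordnet  # assumes the WordNet data is installed: nltk.download("wordnet")

    # Tiny illustrative stopword list; a real system would load a list such as SMART's (~400 words).
    STOPWORDS = {"a", "the", "in", "to", "of", "is", "am", "are", "i", "he", "she", "it"}

    def preprocess(text):
        # Simplest tokenization: case-fold and keep unbroken alphabetic strings only.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Constant-time stopword lookup via a hash-based set.
        tokens = [t for t in tokens if t not in STOPWORDS]
        # Lemmatize with WordNet's morphy; fall back to the surface form when WordNet has no entry.
        return [wordnet.morphy(t) or t for t in tokens]

    print(preprocess("The boy's cars are different colors"))
    # -> something like ['boy', 's', 'car', 'different', 'color']
    # (the stray 's' left by the possessive is exactly why apostrophes need care during tokenization)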
Stemming
- Reduce tokens to a "root" form so that morphological variants are recognized as the same term: "computer", "computational", "computation" are all reduced to the same token "compute".
- Correct morphological analysis is language specific and can be complex.
- Stemming "blindly" strips off known affixes (prefixes and suffixes) in an iterative fashion.
- For example, "compressed" and "compression" are both accepted as equivalent to "compress"; after stemming, the sentence itself reads "for exampl compres and compres are both accept as equival to compres".

Porter Stemmer
- A simple procedure for removing known affixes in English without using a dictionary.
- Can produce unusual stems that are not English words: "computer", "computational", "computation" are all reduced to the same token "comput".
- May conflate (reduce to the same token) words that are actually distinct.
- Does not recognize all morphological derivations.

Typical Rules in Porter
- sses -> ss
- ies -> i
- ational -> ate
- tional -> tion
- See the class website for a link to the "official" Porter stemmer site, which provides ready-to-use Perl and C implementations (a toy version of these rules appears in the sketch below).

Porter Stemmer Errors
- Errors of commission: organization / organ -> organ; police / policy -> polic; arm / army -> arm.
- Errors of omission: cylinder vs. cylindrical; create vs. creation; Europe vs. European.

Other Stemmers
- Other stemmers exist, e.g. the Lovins stemmer (http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm): single-pass, longest-suffix removal (about 250 rules), motivated by linguistics as well as IR.
- Full morphological analysis brings only modest benefits for retrieval.
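A toy sketch of the iterative suffix-stripping idea, using only the four example rules listed above; the real Porter algorithm applies ordered rule groups in several steps with conditions on the remaining stem, and NLTK's PorterStemmer is one ready-made implementation alongside the official Perl and C versions.

    # The four example rules from the slide, ordered so the longer "ational" is tried before "tional".
    RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"), ("ies", "i")]

    def toy_stem(word):
        # Strip the first matching suffix and attach its replacement; otherwise leave the word alone.
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                return word[: -len(suffix)] + replacement
        return word

    for w in ["caresses", "ponies", "relational", "conditional"]:
        print(w, "->", toy_stem(w))
    # caresses -> caress, ponies -> poni, relational -> relate, conditional -> condition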
Metadata
- Metadata is often included in Web pages: hidden from the browser, but useful for indexing.
- Information about a document that may not be part of the document itself (data about data).
- Descriptive metadata is external to the meaning of the document: author, title, source (book, magazine, newspaper, journal), date, ISBN, publisher, length.

Text Properties
- The distribution of words in language; why is it important?
- Zipf's Law
- Heaps' Law

Statistical Properties of Text
- How is the frequency of different words distributed?
- How fast does vocabulary size grow with the size of a corpus?
- Such factors affect the performance of information retrieval and can be used to select appropriate term weights and other aspects of an IR system.

Word Frequency
- A few words are very common; most words are very rare.
- The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
- Half the words in a corpus appear only once; these are called hapax legomena (Greek for "said only once").
- This is a "heavy-tailed" distribution, since most of the probability mass is in the "tail".

Sample Word Frequency Data
[Table of word frequencies omitted; from B. Croft, UMass.]

Zipf's Law
- Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
- Zipf (1949) "discovered" that $f \cdot r \approx k$ for some constant $k$.
- If the probability of the word of rank r is $p_r$ and N is the total number of word occurrences, then $p_r = f/N \approx A/r$ for a corpus-independent constant $A \approx 0.1$.

Zipf and Term Weighting
- Luhn (1958) suggested that both extremely common and extremely uncommon words are not very useful for indexing.

Predicting Occurrence Frequencies
- By Zipf, a word appearing n times has rank $r_n = AN/n$.
- Several words may occur n times; assume rank $r_n$ applies to the last of these. Therefore $r_n$ words occur n or more times and $r_{n+1}$ words occur n+1 or more times.
- So the number of words appearing exactly n times is $I_n = r_n - r_{n+1} = \frac{AN}{n} - \frac{AN}{n+1} = \frac{AN}{n(n+1)}$.

Predicting Word Frequencies (cont'd)
- Assume the term of highest rank (lowest frequency) occurs exactly once; its rank, and hence the number of distinct words, is $D = AN/1 = AN$.
- The fraction of words with frequency n is $\frac{I_n}{D} = \frac{1}{n(n+1)}$.
- The fraction of words appearing only once is therefore 1/2.

Occurrence Frequency Data
[Table of occurrence frequencies omitted; from B. Croft, UMass.]

Does Real Data Fit Zipf's Law?
- A law of the form $y = kx^c$ is called a power law; Zipf's law is a power law with $c = -1$.
- On a log-log plot, power laws give a straight line with slope c: $\log y = \log(kx^c) = \log k + c \log x$.
- Zipf's law is quite accurate except at very high and very low rank.

Fit to Zipf for the Brown Corpus
[Log-log plot omitted; fitted constant k = 100,000.]

Mandelbrot (1954) Correction
- The more general form $f = P(r + \rho)^{-B}$, for constants $P$, $B$, $\rho$, gives a somewhat better fit.

Mandelbrot Fit
[Plot omitted; fitted parameters $P = 10^{5.4}$, $B = 1.15$, $\rho = 100$.]

Explanations for Zipf's Law
- Zipf's own explanation was his "principle of least effort": a balance between the speaker's desire for a small vocabulary and the hearer's desire for a large one.
- Li (1992) shows that random typing of letters, including a space character, already generates "words" with a Zipfian distribution (http://linkage.rockefeller.edu/wli/zipf/).

Zipf's Law: Impact on IR
- Good news: stopwords account for a large fraction of the text, so eliminating them greatly reduces inverted-index storage costs.
- Bad news: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
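A small sketch, assuming Python and a list of raw tokens (stopwords kept, since they carry most of the frequency mass; the corpus file name is hypothetical), that builds the rank-frequency table behind these slides and checks two Zipf predictions: $r \cdot f / N$ staying roughly constant near $A \approx 0.1$, and about half of the vocabulary occurring only once.

    from collections import Counter

    def zipf_check(tokens, top=10):
        # Rank words by decreasing frequency; Zipf predicts r*f/N to stay roughly constant (about A = 0.1).
        counts = Counter(tokens)
        total = sum(counts.values())
        for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
            print(f"{rank:4d}  {word:<15} f={freq:<8d} r*f/N={rank * freq / total:.3f}")
        # Zipf also predicts that roughly half of all distinct words are hapax legomena.
        hapax = sum(1 for f in counts.values() if f == 1)
        print("fraction of hapax legomena:", hapax / len(counts))

    # tokens = re.findall(r"[a-z]+", open("corpus.txt").read().lower())  # hypothetical corpus file
    # zipf_check(tokens)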
Vocabulary Growth
- How does the size of the overall vocabulary (the number of unique words) grow with the size of the corpus?
- This determines how the size of the inverted index will scale with the size of the corpus.
- The vocabulary is not really upper-bounded, because of proper names, typos, etc.

Heaps' Law
- If V is the size of the vocabulary and n is the length of the corpus in words, then $V = Kn^{\beta}$ for constants $K$ and $0 < \beta < 1$.
- Typical constants: $K \approx 10$ to $100$ and $\beta \approx 0.4$ to $0.6$ (approximately a square root).

Heaps' Law Data
[Plot of vocabulary size versus corpus size omitted.]

Explanation for Heaps' Law
- Heaps' law can be derived from Zipf's law by assuming documents are generated by randomly sampling words from a Zipfian distribution.
- Heaps' law also holds for the distribution of other data; our own experiments on the types of questions asked by users show a similar behavior.
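A minimal sketch, assuming the same token list as above, of estimating the Heaps' law constants by a log-log least-squares fit of vocabulary size against prefix length; the choice of 20 measurement points is an arbitrary illustration, not part of the slides.

    import math

    def fit_heaps(tokens, n_points=20):
        # Measure vocabulary size V at several prefix lengths n of the token stream,
        # then fit log V = log K + beta * log n by least squares.
        step = max(1, len(tokens) // n_points)
        xs, ys, seen = [], [], set()
        for i, tok in enumerate(tokens, start=1):
            seen.add(tok)
            if i % step == 0:
                xs.append(math.log(i))
                ys.append(math.log(len(seen)))
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
               sum((x - mean_x) ** 2 for x in xs)
        K = math.exp(mean_y - beta * mean_x)
        return K, beta

    # K, beta = fit_heaps(tokens)   # expect K roughly 10-100 and beta roughly 0.4-0.6 on English text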