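Search Engine Technology
Yan Hongfei, [email protected]
Network Laboratory, Department of Computer Science, Peking University
December 24, 2004 @ CERNET2004
Outline
• How search engines work
• Information retrieval research and institutions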
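Web Search Engines
• Definition: a system that lets users submit queries, retrieves the web pages relevant to each query, and outputs them as a ranked result list.
• Index construction methods
– Manual indexing
– Automatic indexing
• System architecture
– Centralized architecture
– Distributed architecture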
[Figure: two service extremes (browsing services vs. search engine services) and two semantics extremes (Web pages vs. bag of words); the space between each pair of extremes is marked "???".]
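The Three-Stage Workflow of a Search Engine
Crawl → Organize → Serve
• Crawling
– Batch crawling and incremental crawling; crawl targets and crawl strategies
• Preprocessing
– Keyword extraction; duplicate page elimination; link analysis; indexing
• Serving
– Query modes and matching; result ranking; document summaries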
Search Engine System Flow
Tianwang Search Engine System Flow
Architecture of the Distributed Web Crawling System
[Figure: several nodes, each with a crawling process and a coordinating process, together with a scheduling module.]
Tianwang Storage Format
version: 1.0                            // version number
url: http://www.pku.edu.cn/             // URL
origin: http://www.somewhere.cn/        // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT     // time of harvest
ip: 162.105.129.12                      // IP address
unzip-length: 30233                     // if included, the data must be compressed
length: 18133                           // data length
                                        // a blank line
XXXXXXXX                                // the following is the data part
XXXXXXXX
….
XXXXXXXX                                // data end
                                        // insert a new line
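A record in this layout can be read with a small parser. The sketch below is only an illustration, not the Tianwang implementation: the function name `parse_record` is hypothetical, and it assumes that header lines are ASCII `key: value` pairs, that `length` counts the stored data bytes, and that compressed data uses zlib (the slide does not name the compression scheme).

```python
import zlib


def parse_record(buf: bytes, offset: int = 0):
    """Parse one record starting at `offset`.

    Returns (header dict, data bytes, offset of the next record).
    """
    header = {}
    pos = offset
    while True:
        end = buf.index(b"\n", pos)
        line = buf[pos:end].decode("ascii").strip()
        pos = end + 1
        if not line:  # the blank line closes the header
            break
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()

    length = int(header["length"])  # number of stored data bytes
    data = buf[pos:pos + length]
    pos += length
    if buf[pos:pos + 1] == b"\n":  # newline inserted after the data part
        pos += 1

    # Assumption: a present unzip-length means the data is zlib-compressed.
    if "unzip-length" in header:
        data = zlib.decompress(data)
    return header, data, pos
```

Calling `parse_record` repeatedly with the returned offset walks through a file that holds many records back to back.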
File Organizations (Indexes)
• Choices for accessing data during query evaluation
• Scan the entire collection
– Typical in early (batch) retrieval systems
– Computational and I/O costs are O(characters in collection)
– Practical for only “small” text collections
– Large memory systems make scanning feasible
• Use indexes for direct access
– Evaluation time O(query term occurrences in collection)
– Practical for “large” collections
– Many opportunities for optimization
• Hybrids: Use small index, then scan a subset of the
collection
Indexes
• What should the index contain?
• Database systems index primary and secondary keys
– This is the hybrid approach
– Index provides fast access to a subset of database records
– Scan subset to find solution set
• IR problem: cannot predict the keys that people will use in queries
– Every word in a document is a potential search term
• IR solution: index by all keys (words), i.e. full-text indexes
Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
– Boolean
– Statistical (tf, df, ctf, doclen, maxtf)
– Often about 10% the size of the raw data, compressed
• Positional
– Feature location within document
– Granularities include word, sentence, paragraph, etc.
– Coarse granularities are less precise, but take less space
– Word-level granularity is about 20-30% the size of the raw data, compressed
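The statistical features named above (tf, df, ctf, doclen, maxtf) can be computed in one pass over a collection. The following is a minimal sketch under stated assumptions: the function name `index_statistics`, the lowercase-letters tokenizer, and the two-document sample collection are illustrative choices, not part of the slides.

```python
import re
from collections import Counter


def index_statistics(docs):
    """Compute per-document and collection-wide term statistics.

    tf[d][t]  : occurrences of term t in document d
    df[t]     : number of documents containing t
    ctf[t]    : occurrences of t in the whole collection
    doclen[d] : number of tokens in document d
    maxtf[d]  : largest tf of any term in document d
    """
    tf = {d: Counter(re.findall(r"[a-z]+", text.lower()))
          for d, text in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)
    ctf = Counter()
    for counts in tf.values():
        ctf.update(counts)
    doclen = {d: sum(c.values()) for d, c in tf.items()}
    maxtf = {d: max(c.values(), default=0) for d, c in tf.items()}
    return tf, df, ctf, doclen, maxtf


# Hypothetical two-document collection, for illustration only.
SAMPLE = {1: "to be or not to be", 2: "to do is to be"}
tf, df, ctf, doclen, maxtf = index_statistics(SAMPLE)
print(tf[1]["to"], df["to"], ctf["to"], doclen[1], maxtf[2])
```

A Boolean index needs only presence or absence (whether `tf[d][t]` is nonzero); statistical retrieval models use the counts themselves.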
Indexes: Implementation
• Common implementations of indexes
– Bitmaps
– Signature files
– Inverted files
No positional data indexed
• Common index components
– Dictionary (lexicon)
– Postings
• document ids
• word positions
Inverted Files
Word-Level Inverted File
Inverted Search Algorithm
1. Find query elements (terms) in the
lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the
retrieval model
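The three steps above can be sketched with a word-level inverted file held in memory. This is an illustration rather than the system shown on the slides: the helper names are hypothetical, and the six-line nursery-rhyme collection is an assumption, since the slide's figure is not reproduced in this transcript.

```python
import re
from collections import defaultdict

# Assumed sample collection; document ids are 1..6.
DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}


def build_index(docs):
    """Lexicon term -> postings {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Steps 1-2: look up each term; step 3: intersect the document ids."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def phrase(index, terms):
    """Phrase query: the terms must occur at consecutive word positions."""
    hits = []
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits


index = build_index(DOCS)
print(boolean_and(index, ["porridge", "pot"]))  # [2]
print(phrase(index, ["porridge", "pot"]))       # []
```

Storing word positions in the postings is what makes the phrase query possible; a document-level index could answer only the Boolean AND.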
Word-Level Inverted File
[Figure: a lexicon whose entries point to postings, with three example queries and their answers.]
Query:
1. porridge & pot (BOOL)
2. “porridge pot” (BOOL)
3. porridge pot (VSM)
Outline
• How search engines work
• Information retrieval research and institutions
A Brief History of Modern Information Retrieval
• In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
• In the 1960s, Gerard Salton and his students developed the SMART system.
• Cyril Cleverdon carried out the Cranfield evaluations.
• The 1970s and 1980s saw many developments built on the advances of the 1960s.
• In 1992, the Text REtrieval Conference (TREC) was launched.
• From 1996, the algorithms developed in IR were employed for searching the Web.
Clustering of SIGIR papers by topic vs. year
[Table: number of SIGIR papers per topic cluster (rows) and year (columns 71, 78 through 02, and Total). The cell values did not survive extraction.]
• Clusters (rows): Databases, NL Interfaces; General; Models; Question answering; Syntactic phrases & SDR; Conceptual IR, KB IR; Compression; Clustering; Relevance feedback; Inverted files & Implementations; Term weighting; Message understanding & TDT; Filtering; Hypertext IR, Multiple evidence; Image retrieval; Probabilistic & Language models; Boolean & extended Boolean; Japanese & Chinese IR; DBMS & IR; Users & Search; Visualisation; Signature files; Distributed IR; Evaluation; Topic distillation & Linkage retrieval; Latent semantic indexing; Text categorisation; Document summarisation; Cross lingual
• Clusters highlighted in turn on the following slides: Question answering; Clustering; Inverted files & Implementations; Message understanding & TDT; Filtering; Hypertext IR, Multiple evidence; Probabilistic & Language models; Distributed IR; Evaluation; Topic distillation & Linkage retrieval; Text categorisation; Document summarisation; Cross lingual
Information Retrieval Research and Institutions
• CIIR, University of Massachusetts
• LTI, Carnegie Mellon University
• The Stanford University DB Group
• Microsoft Research Asia
• TREC
• Tianwang Group, Network Laboratory, Peking University
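Introduction to Lemur
• http://www-2.cs.cmu.edu/~lemur/
Lemur Toolkit
• Goal: a research system that facilitates language modeling (LM) and IR research
– ad hoc retrieval, distributed retrieval, cross-language IR, summarization, filtering, and classification
• Features:
– Indexing of large-scale document collections
– Building simple language models
– Implementing systems based on language models and several other retrieval models
• Implementation:
– C and C++
– Unix / Windows
– Current version: 3.1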
MRA: Towards Next Generation Web Search
• From Pages to Blocks
– Analyze the Web at finer granularity
• From Surface Web to Deep Web
– Unleash the huge assets of high-value information
• From Unstructured to Structured
– Provide well organized results
• From relevance to intelligence
– Contribute knowledge discovery with search
• From Desktop Search to Mobile Search
– Bridge physical world search to digital world search
The Stanford Univ. DB Group
• WebBase
– Crawling, storage, indexing, and querying of
large collections of Web pages.
• Digital Libraries
– Infrastructure and services for creating,
disseminating, sharing and managing
information
TREC Conference
• Established in 1992 to evaluate large-scale IR
– Retrieving documents from a gigabyte collection
• Has run continuously since then
– TREC 2004(13th) meeting is in November
• Run by NIST’s Information Access Division
• Probably the best-known IR evaluation setting
– Started with 25 participating organizations in the 1992 evaluation
– In 2003, there were 93 groups from 22 different countries
• Proceedings available online (http://trec.nist.gov)
– Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf
TREC General Format
• TREC consists of IR research tracks
– Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, …
• Each track works on roughly the same model
– November: track approved by the TREC community
– Winter: track members finalize the format for the track
– Spring: researchers train systems based on the specification
– Summer: researchers carry out the formal evaluation
• Usually a “blind” evaluation: researchers do not know the answers
– Fall: NIST carries out the evaluation
– November: group meeting (TREC) to find out:
• How well your site did
• How others tackled the problem
– Many tracks are run by volunteers outside of NIST (e.g. Web)
• “Coopetition” model of evaluation
– Successful approaches are generally adopted in the next cycle
TREC Tracks
Summary of VLC/Web Track evaluation 1996 - 2003
Tianwang Group @PKU
[Figure: timeline of the group's work, drawn as alternating cycles of experience and requirement.]
• Years marked: 1996, 1999, 2000, 2002, 2004
• Key ideas (labels as extracted from the figure): Web pages, FTP files; grow exponentially; preserve easily vanishing web resources and pages; mass system; World MEMEX
• Milestones: Tianwang 1.0, Bingle 1.0, Tianwang 2.0, CDAL 1.0, Web InfoMall 1.0, Web InfoMall 2.0
http://www.infomall.cn/
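CWT100g Construction Timeline
[Figure: timeline for 2004; four √ marks appear next to the stages.]
• Dates shown: 2004.2.1, 6.16, 10.8-20, 11.3, 11.10
• Stages shown: CWT100g idea, document, query, pooling, judgment
One small step for me, one giant leap for mankind!
......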
Data sharing via PKU Yanqiong (燕穹) as of 2004-12-20
2.5/8.8 = 28.4%
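Teams That Submitted Results
Columns: TEAM, NAME, TDRUNS, NPHPRUNS
• Shanghai Jiao Tong University APEX Lab: APEX, TDRUNS 5, NPHPRUNS 5
• Institute of Computer Science and Technology, Peking University: ANS, TDRUNS 3, NPHPRUNS 2
• TRS Corporation: TRS
• South China University of Technology, Mumian Team 1: MUMIAN1
(The run counts extracted for TRS and MUMIAN1 are 5, 3, 2, 1; their assignment to rows and columns cannot be recovered from the transcript.)
• South China University of Technology, Mumian Team 2: MUMIAN2, TDRUNS 2, NPHPRUNS 1
• Database Application Lab, School of Computer Science, South China University of Technology: SCUTDB, TDRUNS 5, NPHPRUNS 5
• Affiliated High School of Fujian Normal University: WLL, 1 run (column cannot be recovered)
Note: the pool also includes the retrieval results of five search engines: google, yisou, baidu, sogou, and zhongsou.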
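Evaluation Results
[Figures: results for topic distillation and for navigational search.]
TIANWANG_RUN is shown for reference only.
Summary
• How search engines work
• Information retrieval research and institutions
Thank you!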
Vector Space Model
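• Document d and query q are represented as two m-dimensional vectors in the vector space. The weight of each dimension is TF·IDF, and the similarity is measured by the cosine of the angle between the two vectors (using the original tf and idf formulas):
\[
\cos(Q, D_d)
= \frac{1}{W_q W_d} \sum_{t \in Q} w_{d,t} \cdot w_{q,t}
= \frac{1}{W_q W_d} \sum_{t \in Q} \left(f_{d,t} \cdot idf_t\right) \left(f_{q,t} \cdot idf_t\right)
= \frac{1}{W_q W_d} \sum_{t \in Q} f_{d,t} \cdot f_{q,t} \cdot \log^2\!\left(\frac{N}{df_t}\right)
\]
Here \(W_q\) and \(W_d\) are the lengths of the query and document weight vectors, \(f_{d,t}\) and \(f_{q,t}\) are the frequencies of term t in the document and in the query, N is the number of documents, and \(df_t\) is the number of documents containing t. Because \(W_q\) is the same for every document, ranking only needs
\[
\cos(Q, D_d) \propto \frac{1}{W_d} \sum_{t \in Q} f_{d,t} \cdot f_{q,t} \cdot \log^2\!\left(\frac{N}{df_t}\right)
\]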
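The cosine measure with raw tf and idf = log(N/df) can be tried out on a small collection. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the six-line nursery-rhyme collection is assumed, since the slides' example figure is not reproduced in this transcript.

```python
import math
import re
from collections import Counter

# Assumed sample collection; document ids are 1..6.
DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rank(docs, query):
    """Return ids of matching documents, best cosine score first."""
    n = len(docs)
    tfs = {d: Counter(tokenize(text)) for d, text in docs.items()}
    df = Counter(t for tf in tfs.values() for t in tf)
    idf = {t: math.log(n / df[t]) for t in df}

    q_tf = Counter(tokenize(query))
    # Length of the query weight vector (terms unseen in the collection get 0).
    w_q = math.sqrt(sum((f * idf.get(t, 0.0)) ** 2 for t, f in q_tf.items()))

    scores = {}
    for d, tf in tfs.items():
        w_d = math.sqrt(sum((f * idf[t]) ** 2 for t, f in tf.items()))
        dot = sum(tf[t] * q_tf[t] * idf[t] ** 2 for t in q_tf if t in tf)
        if dot > 0:
            scores[d] = dot / (w_q * w_d)
    return sorted(scores, key=scores.get, reverse=True)


print(rank(DOCS, "porridge pot"))  # [2, 1, 5]
```

Document 1 contains "porridge" twice but is penalized by its longer weight vector, so document 2, which contains both query terms, ranks first.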
Query Answer
• 1.porridge & pot (BOOL)
– d2
• 2.“porridge pot” (BOOL)
– null
• 3. porridge pot (VSM)
– d2 > d1 > d5
CIIR-Center for Intelligent Information Retrieval @UMASS
• One of the leading research groups in IR
– improving the probabilistic models,
– first description of a retrieval system based on statistical language
models.
– introduced and improved a number of techniques for text and query
representation
– automatically representing databases and combining local searches for
DIR
– first high capacity probabilistic filtering architecture
– define and evaluate the first versions of event detection and tracking
software
– earliest research on ranking and representation techniques for Asian
languages
– first approaches to information extraction that emphasized learning
– novel techniques for indexing images and video
CIIR cont.
• Research
– more than 500 journal and refereed conference
papers over the past 12 years (52 submissions in
2003).
• industrial and government collaboration
– INQUERY
– licensed our software to nearly 300 sites
• Education
– 20 Ph.D.s , 29 M.S.
– 123/145, 34/4 graduate/undergraduate
CIIR cont.
• Personnel
– Faculty: 4
– Technical personnel: 10
– Graduate students: 34/10
(W. BRUCE CROFT)
• Groups
– IESL:Information Extraction and Synthesis Laboratory
– IR :Information Retrieval Laboratory
– MIR :Multimedia Indexing and Retrieval Laboratory
• The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval
– text representation,
– query acquisition,
– retrieval models
LTI: Language Technologies Institute @CMU
• Machine Translation, Natural Language Processing, Speech, and Information Retrieval
• IR Projects (Jamie Callan and Yiming Yang)
– Adaptive Information Filtering
– Distributed Information Retrieval / Federated Search
– Email Classification and Prioritization
– Minerva: Web Mining for Question Answering
– MuchMore: Translingual Information Retrieval
– JAVELIN: Open-Domain Question Answering