Knowledge Management with Documents
Qiang Yang, HKUST (Thanks: Professor Dik Lee, HKUST)
Keyword Extraction
Goal: given N documents, each consisting of words, extract the most significant subset of words.

Example: [All the students are taking exams] --> [student, take, exam]

Keyword Extraction Process
- remove stop words
- stem remaining terms
- collapse terms using a thesaurus
- build inverted index
- extract keywords - build keyword index
- extract key phrases - build key phrase index
Stop Words and Stemming
From a given stop word list [a, about, again, are, the, to, of, …], remove the listed words from the documents.

Or, determine the stop words yourself: given a large enough corpus of common English,
- sort the list of words in decreasing order of their occurrence frequency in the corpus
- Zipf's law: frequency * rank ≈ constant
- the most frequent words tend to be short
- the most frequent 20% of words account for 60% of usage
Zipf's Law -- An Illustration

  Rank (R)  Term   Frequency (F)   R*F (10**6)
  1         the    69,971          0.070
  2         of     36,411          0.073
  3         and    28,852          0.086
  4         to     26,149          0.104
  5         a      23,237          0.116
  6         in     21,341          0.128
  7         that   10,595          0.074
  8         is     10,009          0.081
  9         was     9,816          0.088
  10        he      9,543          0.095
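Zipf's law can be checked directly on the table above: a quick sketch multiplying each rank by its frequency shows the products staying within roughly a factor of two of each other, which is what "frequency * rank ≈ constant" means in practice.

```python
# Rank-frequency data copied from the slide's table.
zipf = [
    (1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
    (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
    (7, "that", 10_595), (8, "is", 10_009), (9, "was", 9_816),
    (10, "he", 9_543),
]

# Zipf's law predicts rank * frequency is roughly constant.
products = [rank * freq for rank, _, freq in zipf]
print([round(p / 1e6, 3) for p in products])  # in units of 10**6, as in the table
```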
Resolving Power of Word
[Figure: words plotted in decreasing frequency order; non-significant high-frequency terms and non-significant low-frequency terms sit at the extremes, with the presumed resolving power of significant words peaking in between]
Stemming
The next task is stemming: transforming words to their root form.
- Computing, Computer, Computation --> comput
- Suffix-based methods: remove "ability" from "computability"; remove suffixes such as "…"+ness, "…"+ive
- Suffix list + context rules
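A minimal suffix-stripping stemmer in the spirit of the suffix-list approach above; the suffix list here is illustrative, not the full set of rules a real stemmer (e.g., Porter's) would use, and the length-3 root check stands in for context rules.

```python
# Illustrative suffix list; order matters (longest suffixes first).
SUFFIXES = ["ability", "ation", "ness", "ive", "ing", "er", "s"]

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a root of >= 3 letters."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= 3:
            return w[: -len(suf)]
    return w

print([stem(w) for w in ["computing", "computer", "computability"]])
# ['comput', 'comput', 'comput']
```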
Thesaurus Rules
A thesaurus classifies the words of a language: for a word, it gives related terms that are broader than, narrower than, the same as (synonyms), and opposed to (antonyms) the given word (other kinds of relationships may exist, e.g., composed-of).
- Static thesaurus tables: [anneal, strain], [antenna, receiver], …
- Roget's Thesaurus
- WordNet at Princeton
Thesaurus Rules can also be Learned
From a search engine query log (after typing queries, users browse…):
- If query1 and query2 lead to the same document, then Similar(query1, query2)
- If query1 leads to a document with title keyword K, then Similar(query1, K)
- Then, transitivity…
Applied in Microsoft Research China's WWW10 work (Wen et al.) on Encarta online.
The Vector-Space Model
T distinct terms are available; call them index terms or the vocabulary. The index terms represent the important terms for an application. A vector is used to represent each document.
The Vector-Space Model
Assumptions: words are uncorrelated.

Given:
1. N documents and a query
2. The query is considered a document too
3. Each is represented by t terms
4. Each term j in document i has weight d_ij
5. We will deal with how to compute the weights later

          T1    T2    …    Tt
   D1    d11   d12   …    d1t
   D2    d21   d22   …    d2t
   :      :     :           :
   Dn    dn1   dn2   …    dnt
   Q     q1    q2    …    qt
Graphic Representation
Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q plotted as vectors on axes T1, T2, T3]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Similarity Measure - Inner Product
Similarity between document D_i and query Q can be computed as the inner (vector) product:

  sim(D_i, Q) = Σ_{j=1}^{t} d_ij · q_j

- Binary: weight = 1 if the word is present, 0 otherwise
- Non-binary: weight represents degree of similarity (example: TF-IDF, explained later)
Inner Product -- Examples
Binary:
  D = (1, 1, 1, 0, 1, 1, 0)   (size of vector = size of vocabulary = 7)
  Q = (1, 0, 1, 0, 0, 1, 1)
  sim(D, Q) = 3

Weighted:
  D1 = 2T1 + 3T2 + 5T3
  Q  = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
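The two examples above can be reproduced with a few lines; the vectors are taken straight from the slide.

```python
def inner_product(d, q):
    """Inner-product similarity: sum of products of matching term weights."""
    return sum(dw * qw for dw, qw in zip(d, q))

# Binary example (vocabulary of 7 terms).
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))                   # 3

# Weighted example: D1 = 2T1 + 3T2 + 5T3, Q = 0T1 + 0T2 + 2T3.
print(inner_product([2, 3, 5], [0, 0, 2]))   # 10
```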
Properties of Inner Product
- The inner product similarity is unbounded
- It favors long documents: a long document contains a large number of unique terms, each of which may occur many times
- It measures how many terms matched, but not how many terms did not match
Cosine Similarity Measures
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.

  CosSim(D_i, Q) = Σ_{k=1}^{t} (d_ik · q_k) / ( sqrt(Σ_{k=1}^{t} d_ik²) · sqrt(Σ_{k=1}^{t} q_k²) )

[Figure: D1, D2, and Q as vectors on axes t1, t2, t3]
Cosine Similarity: an Example
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

  CosSim(D1, Q) = 5 / √38 = 0.81
  CosSim(D2, Q) = 1 / √59 = 0.13

D1 is 6 times better than D2 using cosine similarity, but only 5 times better using inner product.
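A short sketch reproducing the cosine values above from the same three vectors:

```python
import math

def cos_sim(d, q):
    """Inner product normalized by the lengths of the two vectors."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(x * x for x in q))
    return dot / (nd * nq)

D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```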
Document and Term Weights
Document term weights are calculated using frequencies in documents (tf) and in the collection (idf):

  tf_ij  = frequency of term j in document i
  df_j   = document frequency of term j = number of documents containing term j
  idf_j  = inverse document frequency of term j = log2(N / df_j)   (N: number of documents in the collection)

Inverse document frequency is an indication of a term's value as a document discriminator.
Term Weight Calculations
Weight of the jth term in ith document:
d
ij
= tf
ij
idf
j
= tf
ij log 2 ( N/ df j ) TF Term Frequency A term occurs frequently in the document but rarely in the remaining of the collection has a high weight Let max l { tf lj } be the term frequency of the most frequent term in document j Normalization: term frequency = tf ij /max l { tf lj } 18
An example of TF
Document = (A Computer Science Student Uses Computers)
Vector model based on keywords (Computer, Engineering, Student):
  tf(Computer) = 2, tf(Engineering) = 0, tf(Student) = 1; max(tf) = 2
TF weights: Computer = 2/2 = 1; Engineering = 0/2 = 0; Student = 1/2 = 0.5
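The normalization on this slide, computed directly from the example counts:

```python
# TF weights for Document = "A Computer Science Student Uses Computers",
# with keyword vocabulary (Computer, Engineering, Student).
tf = {"computer": 2, "engineering": 0, "student": 1}
max_tf = max(tf.values())

# Normalize each count by the most frequent term's count.
tf_weight = {term: count / max_tf for term, count in tf.items()}
print(tf_weight)  # {'computer': 1.0, 'engineering': 0.0, 'student': 0.5}
```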
Inverse Document Frequency
df_j gives the number of documents among N in which term j appears. IDF is inversely related to DF (1/DF in the simplest form); typically idf_j = log2(N / df_j) is used.

Example: given 1000 documents, if "computer" appeared in 200 of them, then IDF = log2(1000/200) = log2(5)
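The example works out to log2(5) ≈ 2.32:

```python
import math

def idf(n_docs: int, df: int) -> float:
    """Inverse document frequency: idf_j = log2(N / df_j)."""
    return math.log2(n_docs / df)

# Slide example: 1000 documents, "computer" appears in 200 of them.
print(round(idf(1000, 200), 2))  # 2.32
```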
TF IDF
  d_ij = (tf_ij / max_l{tf_il}) · idf_j = (tf_ij / max_l{tf_il}) · log2(N / df_j)

- Can use this to obtain non-binary weights
- Used to tremendous success in the SMART Information Retrieval System (1983) by the late Gerard Salton and M.J. McGill, Cornell University
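Putting the two factors together; the numbers below reuse the earlier TF and IDF examples (term count 2, max count 2, df = 200 out of 1000 documents), so they are illustrative rather than from a real collection.

```python
import math

def tfidf(tf: int, max_tf: int, n_docs: int, df: int) -> float:
    """d_ij = (tf_ij / max_l tf_il) * log2(N / df_j)."""
    return (tf / max_tf) * math.log2(n_docs / df)

# Normalized TF of 1.0 times IDF of log2(5).
print(round(tfidf(tf=2, max_tf=2, n_docs=1000, df=200), 2))  # 2.32
```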
Implementation based on Inverted Files
In practice, document vectors are not stored directly; an inverted organization provides much better access speed. The index file can be implemented as a hash file, a sorted list, or a B-tree.
  Index term   df   postings (D_j, tf_j)
  computer     3    (D7, 4), …
  database     2    (D1, 3), …
  science      4    (D2, 4), …
  system       1    (D5, 2)
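A minimal inverted index in the shape of the table above: each term maps to its document frequency and a postings list of (doc_id, term frequency) pairs. A dict plays the role of the hash-file organization; the three tiny documents are made up for illustration.

```python
from collections import defaultdict

docs = {
    "D1": "database systems",
    "D2": "computer science",
    "D3": "computer database",
}

postings = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    counts = {}
    for term in text.split():
        counts[term] = counts.get(term, 0) + 1
    for term, tf in counts.items():
        postings[term].append((doc_id, tf))

# term -> (document frequency, postings list)
index = {term: (len(plist), plist) for term, plist in postings.items()}
print(index["computer"])  # (2, [('D2', 1), ('D3', 1)])
```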
A Simple Search Engine
Now we have enough tools to build a simple search engine (documents == web pages):
1. Starting from well-known web sites, crawl to obtain N web pages (for very large N)
2. Apply stop-word removal, stemming, and thesaurus rules to select K keywords
3. Build an inverted index for the K keywords
4. For any incoming user query Q:
   1. For each document D, compute the cosine similarity score between Q and D
   2. Select all documents whose score is over a certain threshold T
   3. Let this result set of documents be M
   4. Return M to the user
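Step 4 above can be sketched as a scoring loop; the document vectors and threshold are made-up illustrative values (the two vectors are the earlier D1 and D2 examples).

```python
import math

def cosine(d, q):
    """Cosine similarity with a guard against zero-length vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(x * x for x in q))
    return dot / (nd * nq) if nd and nq else 0.0

def search(doc_vectors, q, threshold):
    """Return the set M of documents scoring above the threshold T."""
    return [doc for doc, vec in doc_vectors.items()
            if cosine(vec, q) > threshold]

docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}
print(search(docs, [0, 0, 2], threshold=0.5))  # ['D1']
```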
Remaining Questions
How to crawl?
How to evaluate the results? Given 3 search engines, which one is better?
Is there a quantitative measure?
Measurement
Let M documents be returned out of a total of N documents.
  N = N1 + N2: N1 documents are relevant to the query, N2 are not
  M = M1 + M2: M1 returned documents are relevant to the query, M2 are not
Precision = M1 / M
Recall = M1 / N1
Retrieval Effectiveness - Precision and Recall
[Figure: the entire document collection divided into four regions — retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, not retrieved & irrelevant — by the overlap of the relevant and retrieved document sets]

  recall    = (number of relevant documents retrieved) / (total number of relevant documents)
  precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Precision and Recall
Precision: evaluates the correlation of the query to the database; an indirect measure of the completeness of the indexing algorithm.
Recall: the ability of the search to find all of the relevant items in the database.
Among the three numbers involved, only two are always available:
- total number of items retrieved
- number of relevant items retrieved
- the total number of relevant items available is usually not known
Relationship between Recall and Precision

[Figure: precision (y) vs. recall (x), both from 0 to 1; the ideal is the top-right corner. At high precision / low recall: relevant documents are returned but many useful ones are missed. At high recall / low precision: most relevant documents are returned but many junk documents are included too.]
Computation of Recall and Precision
  n    doc #   relevant   recall   precision
  1    588     x          0.2      1.00
  2    589     x          0.4      1.00
  3    576                0.4      0.67
  4    590     x          0.6      0.75
  5    986                0.6      0.60
  6    592     x          0.8      0.67
  7    984                0.8      0.57
  8    988                0.8      0.50
  9    578                0.8      0.44
  10   985                0.8      0.40
  11   103                0.8      0.36
  12   591                0.8      0.33
  13   772     x          1.0      0.38
  14   990                1.0      0.36

Suppose the total number of relevant docs = 5:
  R = 1/5 = 0.2, P = 1/1 = 1
  R = 2/5 = 0.4, P = 2/2 = 1
  R = 2/5 = 0.4, P = 2/3 = 0.67
  …
  R = 5/5 = 1,   P = 5/13 = 0.38
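The whole table can be recomputed from just the ranks at which relevant documents appear (1, 2, 4, 6, 13) and the total of 5 relevant documents:

```python
# Ranks at which the 5 relevant documents were retrieved.
relevant_ranks = {1, 2, 4, 6, 13}
total_relevant = 5

rows = []
hits = 0
for n in range(1, 15):
    if n in relevant_ranks:
        hits += 1
    # (rank, recall, precision) at cutoff n.
    rows.append((n, round(hits / total_relevant, 2), round(hits / n, 2)))

print(rows[0])   # (1, 0.2, 1.0)
print(rows[3])   # (4, 0.6, 0.75)
print(rows[12])  # (13, 1.0, 0.38)
```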
Computation of Recall and Precision
[Figure: the precision-recall curve for the table above — precision (y) plotted against recall (x); precision starts at 1.00 and falls as more documents are retrieved, rising briefly at each newly found relevant document (e.g., at n = 13)]
Compare Two or More Systems
- Compute recall and precision values for two or more systems (F1 score: see http://en.wikipedia.org/wiki/F1_score)
- Superimpose the results in the same graph
- The curve closest to the upper right-hand corner of the graph indicates the best performance

[Figure: precision-recall curves for "Stem" and "Thesaurus" runs superimposed on one graph, recall from 0.1 to 1]
The TREC Benchmark
TREC: Text REtrieval Conference
- Originated from the TIPSTER program sponsored by the Defense Advanced Research Projects Agency (DARPA)
- Became an annual conference in 1992, co-sponsored by the National Institute of Standards and Technology (NIST) and DARPA
- Participants are given parts of a standard set of documents and queries in different stages for testing and training
- Participants submit the P/R values on the final document and query set and present their results at the conference
- http://trec.nist.gov/
Link Based Search Engines
Qiang Yang, HKUST
Search Engine Topics
Text-based search engines:
- Document-based ranking: TF-IDF, vector space model
- No relationship between pages is modeled
- Cannot tell which page is important without a query

Link-based search engines: Google, Hubs and Authorities techniques
- Can pick out important pages
The PageRank Algorithm
Fundamental question to ask: what is the importance level I(P) of a page P?
- Information retrieval (cosine + TF-IDF) does not use hyperlinks and does not give the importance of a page
- Link-based view: important pages (nodes) have many other pages link to them, and important pages also point to other important pages
The Google Crawler Algorithm
"Efficient Crawling Through URL Ordering", Junghoo Cho, Hector Garcia-Molina, Lawrence Page, Stanford. WWW8 (http://www.www8.org); http://www-db.stanford.edu/~cho/crawler-paper/
"Modern Information Retrieval", BY-RN
Lawrence Page, Sergey Brin. The Anatomy of a Search Engine. The Seventh International WWW Conference (WWW 98), Brisbane, Australia, April 14-18, 1998. http://www.www7.org
Back Link Metric
IB(P) = total number of backlinks of P
[Figure: a web page P with three incoming links, IB(P) = 3]
IB(P) is impossible to know in full, so use IB'(P): the number of backlinks the crawler has seen so far.
Page Rank Metric
Let d be the damping factor (e.g., d = 0.9); 1 - d is the probability that the user randomly jumps to page P.
Let T_1, …, T_N be the pages that link to P, and let C_i be the number of out-links from each T_i.

  IR(P) = (1 - d) + d · Σ_{i=1}^{N} IR(T_i) / C_i

[Figure: pages T_1 … T_N pointing to web page P; the example shows C = 2 out-links]
Matrix Formulation
Consider a random walk on the web (denote IR(P) by r(P)).
Let B_ij be the probability of going directly from i to j.
Let r_i be the limiting probability (page rank) of being at page i.

  [ b11  b21  …  bn1 ] [ r1 ]   [ r1 ]
  [ b12  b22  …  bn2 ] [ r2 ] = [ r2 ]        i.e.,  B^T r = r
  [  :    :       :  ] [  : ]   [  : ]
  [ b1n  b2n  …  bnn ] [ rn ]   [ rn ]

Thus, the final page rank r is a principal eigenvector of B^T.
How to compute page rank?
For a given network of web pages:
- Initialize the page rank of all pages (to one)
- Set the parameter (e.g., d = 0.90)
- Iterate through the network L times
Example: iteration K=1

[Figure: three-node graph with pages A, B, C (the numbers imply edges A → C, B → C, C → A)]
IR(P) = 1/3 for all nodes, d = 0.9

  node   IR
  A      1/3
  B      1/3
  C      1/3
Example: k=2

  IR(P) = 0.1 + 0.9 · Σ_{i=1}^{l} IR(T_i) / C_i     (l is the in-degree of P)

  node   IR
  A      0.4
  B      0.1
  C      0.55

Note: A, B, C's IR values are updated in order — A, then B, then C. Use the new value of A when calculating B, etc.
Example: k=2 (normalize)

  node   IR
  A      0.38
  B      0.095
  C      0.52
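The iteration above can be sketched directly. The three-node graph (A → C, B → C, C → A) is the one the slide's numbers imply; updates are done in place in the order A, B, C, and the final ranks are normalized to sum to 1, matching the slide's values to the precision shown.

```python
d = 0.9
in_links = {"A": ["C"], "B": [], "C": ["A", "B"]}
out_degree = {"A": 1, "B": 1, "C": 1}   # each node has one out-link

ir = {node: 1 / 3 for node in in_links}              # k = 1: uniform start
for node in ["A", "B", "C"]:                          # k = 2: in-place updates
    ir[node] = (1 - d) + d * sum(ir[t] / out_degree[t] for t in in_links[node])

total = sum(ir.values())                              # 0.4 + 0.1 + 0.55 = 1.05
ir = {node: round(v / total, 3) for node, v in ir.items()}
print(ir)  # {'A': 0.381, 'B': 0.095, 'C': 0.524}
```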
Crawler Control
- All crawlers maintain several queues of URLs to pursue next; Google initially maintained 500 queues, each corresponding to a web site being pursued
- Important considerations: limited buffer space, limited time, avoiding overloading target sites, avoiding overloading network traffic
Crawler Control
Thus, it is important to visit important pages first.
Let G be a lower bound threshold on I(P).
Crawl and Stop: select only pages with I(P) > G to crawl; stop after K pages have been crawled.
Test Result: 179,000 pages
[Figure: percentage of the Stanford Web crawled vs. P_ST, the percentage of hot pages visited so far]
Google Algorithm
(Very simplified:)
- First, compute the page rank of each page on the WWW (query independent)
- Then, in response to a query q, return pages that contain q and have the highest page ranks

A problem/feature of Google: it favors big commercial sites.
How powerful is Google?
A PageRank for 26 million web pages can be computed in a few hours on a medium-size workstation. Google currently has indexed a total of 1.3 billion pages.
Hubs and Authorities 1998
Kleinberg, Cornell University: http://www.cs.cornell.edu/home/kleinber/
Main idea: type "java" into a text-based search engine and get 200 or so pages. Which ones are authoritative?
- http://java.sun.com
What about others?
- www.yahoo.com/Computer/ProgramLanguages
Hubs and Authorities
- An authority is a page pointed to by many strong hubs
- A hub is a page that points to many strong authorities
[Figure: hubs on one side pointing to authorities on the other]
H&A Search Engine Algorithm
- First, submit query Q to a text search engine
- Second, among the results returned, select ~200 pages, find their neighbors, and compute hubs and authorities
- Third, return the authorities found as the final result
- Important issue: how to find hubs and authorities?
Link Analysis: weights
Let B_ij = 1 if i links to j, 0 otherwise.
  h_i = hub weight of page i
  a_i = authority weight of page i

Weight normalization:

  Σ_{i=1}^{N} (h_i)² = 1,   Σ_{i=1}^{N} (a_i)² = 1      (3)

But, for simplicity, we will use

  Σ_{i=1}^{N} h_i = 1,   Σ_{i=1}^{N} a_i = 1            (3')
Link Analysis: update a-weight
[Figure: hubs h1, h2 pointing to a page with authority weight a]

  a_i = Σ_{j : B_ji ≠ 0} h_j,   i.e.,   a = B^T h        (1)
Link Analysis: update h-weight
[Figure: a page with hub weight h pointing to authorities a1, a2]

  h_i = Σ_{j : B_ij ≠ 0} a_j,   i.e.,   h = B a          (2)
H&A: algorithm
1. Set a value for K, the number of iterations
2. Initialize all a and h weights to 1
3. For l = 1 to K, do:
   a. Apply equation (1) to obtain new a_i weights
   b. Apply equation (2) to obtain new h_i weights, using the a_i weights obtained in the last step
   c. Normalize the a_i and h_i weights using equation (3)
Does it converge?
Yes — the Kleinberg paper includes a proof, which requires linear algebra and eigenvector analysis. We will skip the proof and only use the result: the a and h weight values converge after a sufficiently large number of iterations K.
Example: K=1

[Figure: three-node graph with pages A, B, C (the numbers imply edges A → C, B → C, C → A)]
h = 1 and a = 1 for all nodes:

  node   a   h
  A      1   1
  B      1   1
  C      1   1
Example: k=1 (update a)

  node   a   h
  A      1   1
  B      0   1
  C      2   1
Example: k=1 (update h)

  node   a   h
  A      1   2
  B      0   2
  C      2   1
Example: k=1 (normalize)

Use equation (3'):

  node   a     h
  A      1/3   2/5
  B      0     2/5
  C      2/3   1/5
Example: k=2 (update a, h, normalize)

Use equations (1), (2), and (3'):

  node   a     h
  A      1/5   4/9
  B      0     4/9
  C      4/5   1/9

If we choose a threshold of 1/2, then C is an authority, and there are no hubs.
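The two iterations above can be reproduced in a few lines. The edge set (A → C, B → C, C → A) is the graph the example's numbers imply, and the sum-to-1 normalization is the simplified rule (3').

```python
edges = [("A", "C"), ("B", "C"), ("C", "A")]
nodes = ["A", "B", "C"]

a = {n: 1.0 for n in nodes}   # authority weights
h = {n: 1.0 for n in nodes}   # hub weights
for _ in range(2):            # K = 2 iterations
    a = {n: sum(h[i] for i, j in edges if j == n) for n in nodes}   # eq. (1)
    h = {n: sum(a[j] for i, j in edges if i == n) for n in nodes}   # eq. (2)
    sa, sh = sum(a.values()), sum(h.values())
    a = {n: v / sa for n, v in a.items()}                           # eq. (3')
    h = {n: v / sh for n, v in h.items()}

print({n: round(a[n], 3) for n in nodes})  # a: A=1/5, B=0, C=4/5
print({n: round(h[n], 3) for n in nodes})  # h: A=4/9, B=4/9, C=1/9
```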
Search Engine Using H&A
For each query q:
- Enter q into a text-based search engine
- Find the top 200 pages
- Find the neighbors of the 200 pages by one link; let the set be S
- Find hubs and authorities in S
- Return the authorities as the final result
Conclusions
- Link-based analysis is very powerful in finding important pages
- It models the web as a graph, based on in-degree and out-degree
- Google: crawl only important pages
- H&A: post-analysis of the search result