Slide 1

Generalized Vector Space Model
• Definition: Let $\vec{k}_i$ be a vector associated with
the index term $k_i$. Independence of index terms
in the vector model implies that the set of
vectors $\{\vec{k}_1, \vec{k}_2, \ldots, \vec{k}_t\}$ is linearly independent and
forms a basis for the subspace of interest. The
dimension of this space is the number $t$ of
index terms in the collection.
An example of independent vectors
• V1=(1, 0, 0), V2=(0, 1, 0), V3=(0, 0, 1).
• V1 · V2 = 0+0+0 = 0.
• Vi · Vj = 0 for all i ≠ j.
• Each element represents a keyword.
• Different keywords are treated as totally different
items. This is unrealistic, since keywords are
sometimes related.
• Definition: Given the set $\{k_1, k_2, \ldots, k_t\}$ of index
terms in a collection, as before, let $w_{i,j}$ be the
weight associated with the term-document pair
$[k_i, d_j]$. If the $w_{i,j}$ weights are all binary, then all
possible patterns of term co-occurrence (inside
documents) can be represented by a set of $2^t$
minterms given by $min_1 = (0,0,\ldots,0)$,
$min_2 = (1,0,\ldots,0)$, ..., $min_{2^t} = (1,1,\ldots,1)$.
• Let $g_i(min_j)$ return the weight (0 or 1) of the index
term $k_i$ in the minterm $min_j$. ($g_i(d_j)$ is defined
similarly.)
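The minterm definition can be illustrated with a minimal sketch. The function names `minterms` and `g` are hypothetical, and the enumeration order produced by `itertools.product` is an arbitrary choice that differs from the numbering on the slide; only the set of patterns matters.

```python
from itertools import product

def minterms(t):
    """Return all 2**t binary co-occurrence patterns for t index terms."""
    # Ordering is an implementation detail (assumption): lexicographic,
    # starting with (0, ..., 0) and ending with (1, ..., 1).
    return list(product((0, 1), repeat=t))

def g(i, pattern):
    """Weight (0 or 1) of index term k_i (1-based) in a minterm or binary document."""
    return pattern[i - 1]

patterns = minterms(3)
print(len(patterns))               # number of minterms for t = 3
print(patterns[0], patterns[-1])   # the all-zero and all-one patterns
print(g(2, (1, 0, 1)))             # weight of k2 in the pattern (1, 0, 1)
```

For realistic vocabularies 2^t is far too large to enumerate, which is why only the patterns that actually occur in documents are used in practice.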
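• Definition: Let us define the following set of
vectors (containing $2^t$ elements):
$\vec{m}_1 = (1, 0, \ldots, 0, 0)$
$\vec{m}_2 = (0, 1, \ldots, 0, 0)$
...
$\vec{m}_{2^t} = (0, 0, \ldots, 0, 1)$
where each vector $\vec{m}_i$ is associated with the respective
minterm $min_i$.
These vectors are pairwise orthogonal: $\vec{m}_i \cdot \vec{m}_j = 0$ for all $i \neq j$.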
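The new vector $\vec{k}_i$ is defined as:
$$\vec{k}_i = \frac{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}\,\vec{m}_r}{\sqrt{\sum_{\forall r,\; g_i(m_r)=1} c_{i,r}^2}} \qquad (1.1)$$
$$c_{i,r} = \sum_{d_j \,\mid\, g_l(d_j) = g_l(m_r)\ \text{for all } l} w_{i,j} \qquad (1.2)$$
The dot product of two term vectors is:
$$\vec{k}_i \cdot \vec{k}_j = \sum_{\forall r \,\mid\, g_i(m_r)=1 \,\wedge\, g_j(m_r)=1} c_{i,r} \times c_{j,r}$$
Documents and queries are expressed in terms of these vectors:
$$\vec{d}_j = \sum_i w_{i,j}\,\vec{k}_i$$
$$\vec{q} = \sum_i w_{i,q}\,\vec{k}_i$$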
An example of the Generalized Vector Space
Model
• Suppose that the system has 12 documents and 4
keywords.
• D1=(2, 1, 0, 0), D2=(5, 1, 0, 0), D3=(1, 1, 1, 1),
• D4=(0, 0, 2, 2), D5=(0, 1, 1, 2), D6=(0, 0, 1, 1),
• D7=(0, 0, 1, 0), D8=(1, 1, 0, 0), D9=(2, 1, 1, 1),
• D10=(0, 2, 2, 2), D11=(1, 0, 2, 0), D12=(0, 0, 2, 1).
• Minterms: the 6 minterms that occur in the documents are used as independent vectors to form a basis.
• min1=(1, 1, 0, 0), min2=(1, 1, 1, 1), min3=(0, 0, 1, 1),
min4=(0, 1, 1, 1), min5=(0, 0,1, 0), min6=(1, 0, 1, 0).
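The six minterms can be derived mechanically from the documents: replace each nonzero weight by 1 and collect the distinct patterns. The following is a small sketch under that reading; the names `DOCS`, `pattern`, and `groups` are chosen here for illustration.

```python
# Term weights of D1..D12 over the four keywords, copied from the example.
DOCS = {
    "D1": (2, 1, 0, 0), "D2": (5, 1, 0, 0), "D3": (1, 1, 1, 1),
    "D4": (0, 0, 2, 2), "D5": (0, 1, 1, 2), "D6": (0, 0, 1, 1),
    "D7": (0, 0, 1, 0), "D8": (1, 1, 0, 0), "D9": (2, 1, 1, 1),
    "D10": (0, 2, 2, 2), "D11": (1, 0, 2, 0), "D12": (0, 0, 2, 1),
}

def pattern(doc):
    """Binary co-occurrence pattern of a document (1 where the weight is nonzero)."""
    return tuple(1 if w > 0 else 0 for w in doc)

# Group document names by pattern, keeping patterns in order of first appearance.
groups = {}
for name, doc in DOCS.items():
    groups.setdefault(pattern(doc), []).append(name)

for r, (p, names) in enumerate(groups.items(), start=1):
    print(f"min{r} = {p}: {', '.join(names)}")
```

Because Python dictionaries preserve insertion order, the patterns come out in the same order as min1 to min6 on the slide.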
Generalized Vector Space Model
• Independent vectors:
v1= (1, 0, 0, 0, 0, 0), v2=(0, 1, 0, 0, 0, 0),
v3=(0, 0, 1, 0, 0, 0), v4=(0, 0, 0, 1, 0, 0),
v5=(0, 0, 0, 0, 1, 0), v6=(0, 0, 0, 0, 0, 1).
• Vi represents minterm mini.
• Each pair of Vi and Vj is orthogonal. (dot
product=0)
• The four keywords k1, k2, k3, and k4 are
represented by a combination of the independent
vectors.
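$\vec{k}_1 = (c_{1,1}V_1 + c_{1,2}V_2 + c_{1,3}V_3 + c_{1,4}V_4 + c_{1,5}V_5 + c_{1,6}V_6)/C$
where
• $c_{1,1} = w_{1,1} + w_{1,2} + w_{1,8} = 2+5+1 = 8$ (D1, D2, and D8 have minterm min1),
• $c_{1,2} = w_{1,3} + w_{1,9} = 1+2 = 3$ (D3 and D9 have minterm min2),
• $c_{1,3} = w_{1,4} + w_{1,6} + w_{1,12} = 0+0+0 = 0$ (D4, D6, and D12 have minterm min3),
• $c_{1,4} = w_{1,5} + w_{1,10} = 0+0 = 0$,
• $c_{1,5} = w_{1,7} = 0$,
• $c_{1,6} = w_{1,11} = 1$,
and $C = (c_{1,1}^2 + c_{1,2}^2 + c_{1,3}^2 + c_{1,4}^2 + c_{1,5}^2 + c_{1,6}^2)^{0.5} = \sqrt{74}$.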
Generalized Vector Space Model
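$\vec{k}_2 = (c_{2,1}V_1 + c_{2,2}V_2 + c_{2,3}V_3 + c_{2,4}V_4 + c_{2,5}V_5 + c_{2,6}V_6)/C$
where
• $c_{2,1} = w_{2,1} + w_{2,2} + w_{2,8} = 1+1+1 = 3$ (D1, D2, and D8 have minterm min1),
• $c_{2,2} = w_{2,3} + w_{2,9} = 1+1 = 2$ (D3 and D9 have minterm min2),
• $c_{2,3} = w_{2,4} + w_{2,6} + w_{2,12} = 0+0+0 = 0$ (D4, D6, and D12 have minterm min3),
• $c_{2,4} = w_{2,5} + w_{2,10} = 1+2 = 3$,
• $c_{2,5} = w_{2,7} = 0$,
• $c_{2,6} = w_{2,11} = 0$,
and $C = (c_{2,1}^2 + c_{2,2}^2 + c_{2,3}^2 + c_{2,4}^2 + c_{2,5}^2 + c_{2,6}^2)^{0.5} = \sqrt{22}$.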
Generalized Vector Space Model
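$\vec{k}_3 = (c_{3,1}V_1 + c_{3,2}V_2 + c_{3,3}V_3 + c_{3,4}V_4 + c_{3,5}V_5 + c_{3,6}V_6)/C$
where
• $c_{3,1} = w_{3,1} + w_{3,2} + w_{3,8} = 0$ (D1, D2, and D8 have minterm min1),
• $c_{3,2} = w_{3,3} + w_{3,9} = 1+1 = 2$ (D3 and D9 have minterm min2),
• $c_{3,3} = w_{3,4} + w_{3,6} + w_{3,12} = 2+1+2 = 5$ (D4, D6, and D12 have minterm min3),
• $c_{3,4} = w_{3,5} + w_{3,10} = 1+2 = 3$,
• $c_{3,5} = w_{3,7} = 1$,
• $c_{3,6} = w_{3,11} = 2$,
and $C = (c_{3,1}^2 + c_{3,2}^2 + c_{3,3}^2 + c_{3,4}^2 + c_{3,5}^2 + c_{3,6}^2)^{0.5} = \sqrt{43}$.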
Generalized Vector Space Model
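$\vec{k}_4 = (c_{4,1}V_1 + c_{4,2}V_2 + c_{4,3}V_3 + c_{4,4}V_4 + c_{4,5}V_5 + c_{4,6}V_6)/C$
where
• $c_{4,1} = w_{4,1} + w_{4,2} + w_{4,8} = 0$ (D1, D2, and D8 have minterm min1),
• $c_{4,2} = w_{4,3} + w_{4,9} = 1+1 = 2$ (D3 and D9 have minterm min2),
• $c_{4,3} = w_{4,4} + w_{4,6} + w_{4,12} = 2+1+1 = 4$ (D4, D6, and D12 have minterm min3),
• $c_{4,4} = w_{4,5} + w_{4,10} = 2+2 = 4$,
• $c_{4,5} = w_{4,7} = 0$,
• $c_{4,6} = w_{4,11} = 0$,
and $C = (c_{4,1}^2 + c_{4,2}^2 + c_{4,3}^2 + c_{4,4}^2 + c_{4,5}^2 + c_{4,6}^2)^{0.5} = \sqrt{36} = 6$.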
Each keyword vector $\vec{k}_i$ is thus converted from a vector of
length 4 into a vector of length 6.
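The worked example can be reproduced in a few lines. This is a sketch under the assumption that $c_{i,r}$ sums the weight of keyword $k_i$ over all documents whose binary pattern equals minterm $min_r$, as in equation 1.2; the helper names (`correlation_factors`, `term_vector`, `doc_vector`, `dot`) are hypothetical.

```python
import math

# Term weights of D1..D12 over keywords k1..k4, copied from the example.
DOCS = [
    (2, 1, 0, 0), (5, 1, 0, 0), (1, 1, 1, 1), (0, 0, 2, 2),
    (0, 1, 1, 2), (0, 0, 1, 1), (0, 0, 1, 0), (1, 1, 0, 0),
    (2, 1, 1, 1), (0, 2, 2, 2), (1, 0, 2, 0), (0, 0, 2, 1),
]
# min1..min6 from the example; list position r corresponds to basis vector V_(r+1).
MINTERMS = [
    (1, 1, 0, 0), (1, 1, 1, 1), (0, 0, 1, 1),
    (0, 1, 1, 1), (0, 0, 1, 0), (1, 0, 1, 0),
]

def pattern(doc):
    """Binary co-occurrence pattern of a document."""
    return tuple(1 if w > 0 else 0 for w in doc)

def correlation_factors(i):
    """Return [c_{i,1}, ..., c_{i,6}] for keyword index i (0-based)."""
    c = [0] * len(MINTERMS)
    for doc in DOCS:
        c[MINTERMS.index(pattern(doc))] += doc[i]
    return c

def term_vector(i):
    """Normalized keyword vector k_i in the 6-dimensional minterm space."""
    c = correlation_factors(i)
    norm = math.sqrt(sum(x * x for x in c))
    return [x / norm for x in c]

def doc_vector(doc):
    """d_j = sum_i w_{i,j} * k_i, expressed in the minterm space."""
    vec = [0.0] * len(MINTERMS)
    for i, w in enumerate(doc):
        for r, comp in enumerate(term_vector(i)):
            vec[r] += w * comp
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

for i in range(4):
    print(f"c_{i + 1} =", correlation_factors(i))
# Unlike in the classic vector model, keyword vectors are no longer orthogonal.
print("k1 . k2 =", dot(term_vector(0), term_vector(1)))
print("k1 . k4 =", dot(term_vector(0), term_vector(3)))
```

The nonzero dot products between keyword vectors are the point of the model: keywords that co-occur in documents end up correlated rather than orthogonal.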
Extended Boolean Model:
• Disadvantages of the Boolean model:
• No term weights are used.
Counterexample: for the query q = kx AND ky, a
document containing just one of the terms, e.g. kx, is
considered as irrelevant as a document
containing neither term.
• The size of the output might be too large or
too small.
Extended Boolean Model:
• The Extended Boolean model was introduced
in 1983 by Salton, Fox, and Wu [703].
• The idea is to make use of term weights, as in the
vector space model.
• Strategy: combine Boolean queries with the vector
space model.
• Why not just use the vector space model?
• Advantage: Boolean queries are easy for the user to formulate.
Extended Boolean Model:
• Each document is represented by a vector
(similar to the vector space model), with weights
$$w_{x,j} = f_{x,j} \times \frac{idf_x}{\max_i idf_i}$$
• Remember the formula.
• The query is expressed as a Boolean formula.
• How to rank the documents?
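The weight formula on this slide can be sketched as follows. The slide does not define $f_{x,j}$ or $idf_x$, so this sketch assumes the usual choices: $f_{x,j}$ is the raw count divided by the largest count in the document, and $idf_x = \log(N / n_x)$ for $N$ documents of which $n_x$ contain the term. The function names and the sample numbers are made up for illustration.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, assumed here to be log(N / n_x)."""
    return math.log(n_docs / doc_freq)

def weights(counts, doc_freqs, n_docs):
    """Weights w_{x,j} in [0, 1] for one document.

    counts:    raw count of each term in the document
    doc_freqs: number of documents containing each term
    Assumes at least one nonzero count and at least one positive idf.
    """
    max_count = max(counts)
    idfs = [idf(n_docs, n) for n in doc_freqs]
    max_idf = max(idfs)
    return [(c / max_count) * (v / max_idf) for c, v in zip(counts, idfs)]

# Hypothetical document with three terms in a collection of 100 documents.
w = weights(counts=[4, 2, 0], doc_freqs=[10, 50, 100], n_docs=100)
print(w)
```

Dividing by the maximum idf keeps every weight between 0 and 1, which the ranking formulas of the extended Boolean model rely on.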
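Fig. Extended Boolean logic considering the space
composed of two terms kx and ky only.
[Figure: two panels, "kx or ky" and "kx and ky". Each plots the documents dj and dj+1 in the unit square with corners (0,0), (1,0), (0,1), and (1,1); the horizontal axis is kx and the vertical axis is ky.]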
Extended Boolean Model:
• For the query q = kx or ky, (0,0) is the point we try
to avoid. Thus, we can use
$$sim(q_{or}, d) = \sqrt{\frac{x^2 + y^2}{2}}$$
to rank the documents, where x and y are the weights of kx and ky in document d.
• The bigger the better.
Extended Boolean Model:
• For the query q = kx and ky, (1,1) is the most
desirable point.
• We use
$$sim(q_{and}, d) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$$
to rank the documents.
• The bigger the better.
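The two-term ranking formulas are normalized Euclidean distances: the OR score measures the distance from (0,0), and the AND score is one minus the distance from (1,1). A minimal sketch, with hypothetical function names `sim_or_2d` and `sim_and_2d`:

```python
import math

def sim_or_2d(x, y):
    """Normalized distance of the point (x, y) from the undesirable corner (0, 0)."""
    return math.sqrt((x * x + y * y) / 2)

def sim_and_2d(x, y):
    """One minus the normalized distance of (x, y) from the desirable corner (1, 1)."""
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

# A document containing only kx is no longer treated like one containing neither term.
for point in [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]:
    print(point, round(sim_or_2d(*point), 4), round(sim_and_2d(*point), 4))
```

For the point (1, 0) the AND score is strictly between 0 and 1, which addresses the counterexample raised against the plain Boolean model.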
Extend the idea to m terms
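• $q_{or} = k_1 \vee^p k_2 \vee^p \cdots \vee^p k_m$
$$sim(q_{or}, d_j) = \left(\frac{x_1^p + x_2^p + \cdots + x_m^p}{m}\right)^{1/p}$$
• $q_{and} = k_1 \wedge^p k_2 \wedge^p \cdots \wedge^p k_m$
$$sim(q_{and}, d_j) = 1 - \left(\frac{(1-x_1)^p + (1-x_2)^p + \cdots + (1-x_m)^p}{m}\right)^{1/p}$$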
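The m-term formulas translate directly into code. This is a sketch assuming every weight lies in [0, 1] and p ≥ 1 is finite; the names `sim_or` and `sim_and` are chosen here and the sample weights are made up.

```python
def sim_or(xs, p):
    """p-norm OR score for a list of term weights xs, each in [0, 1]."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def sim_and(xs, p):
    """p-norm AND score for a list of term weights xs, each in [0, 1]."""
    return 1.0 - (sum((1.0 - x) ** p for x in xs) / len(xs)) ** (1.0 / p)

xs = [0.9, 0.2, 0.5]   # hypothetical weights of k1, k2, k3 in one document
for p in (1, 2, 5, 50):
    print(p, round(sim_or(xs, p), 4), round(sim_and(xs, p), 4))
```

Running the loop shows the OR score rising toward the largest weight and the AND score falling toward the smallest weight as p grows.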
Properties:
• The p-norm as defined above enjoys a couple of
interesting properties. First, when p = 1
it can be verified that
$$sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \cdots + x_m}{m}$$
• Second, when p = ∞ it can be verified that
• $sim(q_{or}, d_j) = \max(x_i)$
• $sim(q_{and}, d_j) = \min(x_i)$
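Both properties follow from the m-term definitions. The derivation below is a sketch, assuming all weights satisfy $0 \le x_i \le 1$ and writing $x_{max} = \max_i x_i$ (a symbol introduced here for convenience).

```latex
% Case p = 1: the AND score reduces to the arithmetic mean, which is also the OR score.
sim(q_{and}, d_j) = 1 - \frac{(1-x_1) + \cdots + (1-x_m)}{m}
                  = 1 - \frac{m - (x_1 + \cdots + x_m)}{m}
                  = \frac{x_1 + \cdots + x_m}{m}
                  = sim(q_{or}, d_j)

% Case p \to \infty: bound the mean of p-th powers by its largest term.
\frac{x_{max}^p}{m} \;\le\; \frac{x_1^p + \cdots + x_m^p}{m} \;\le\; x_{max}^p
\quad\Longrightarrow\quad
m^{-1/p}\, x_{max} \;\le\; sim(q_{or}, d_j) \;\le\; x_{max}

% Since m^{-1/p} \to 1 as p \to \infty, sim(q_{or}, d_j) \to \max_i x_i.
% Applying the same bound to the values 1 - x_i gives
sim(q_{and}, d_j) \;\to\; 1 - \max_i (1 - x_i) = \min_i x_i
```

So p = 1 behaves like an unweighted vector-style average, while p = ∞ recovers fuzzy-logic style max and min operators.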
Example:
• For instance, consider the query $q = (k_1 \wedge^p k_2) \vee^p k_3$.
The similarity sim(q,dj) between a document dj
and this query is then computed as
$$sim(q, d_j) = \left(\frac{\left(1 - \left(\frac{(1-x_1)^p + (1-x_2)^p}{2}\right)^{1/p}\right)^p + x_3^p}{2}\right)^{1/p}$$
• Any Boolean formula can be expressed as a numerical
formula in this way.
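Nested queries such as the one in this example can be evaluated recursively, applying the p-norm formula at each operator. The sketch below represents a query as nested tuples, which is a representation chosen here for illustration. Handling NOT as `1 - score` is an assumption, since the slides do not state how negation is scored; the sample weights are made up.

```python
def sim(query, weights, p=2.0):
    """Evaluate a nested Boolean query against a dict of term weights in [0, 1].

    A query is either a term name (str) or a tuple (op, arg1, arg2, ...),
    where op is "and", "or", or "not".
    """
    if isinstance(query, str):
        # A single index term: its score is its weight in the document.
        return weights.get(query, 0.0)
    op, *args = query
    if op == "not":
        # Assumption: negation is scored as the complement of the operand.
        return 1.0 - sim(args[0], weights, p)
    xs = [sim(arg, weights, p) for arg in args]
    m = len(xs)
    if op == "or":
        return (sum(x ** p for x in xs) / m) ** (1.0 / p)
    if op == "and":
        return 1.0 - (sum((1.0 - x) ** p for x in xs) / m) ** (1.0 / p)
    raise ValueError(f"unknown operator: {op!r}")

# q = (k1 and k2) or k3, evaluated for a hypothetical document.
q = ("or", ("and", "k1", "k2"), "k3")
doc = {"k1": 1.0, "k2": 1.0, "k3": 0.0}
print(sim(q, doc, p=2.0))
print(sim(q, doc, p=1.0))
```

The recursion mirrors the structure of the formula: the inner AND score is computed first and then treated as one of the two operands of the outer OR.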
Exercise:
1. Give the numerical formula for the extended
Boolean model of the query
q=(k1 or k2 or k3) and (not k4 or k5). (Assume that
there are 5 terms in total.)
2. Assume that the document is represented by
the vector (0.8, 0.1, 0.0, 0.0, 1.0).
What is sim(q, d) for the extended Boolean model?
Also try further exercises with other Boolean
formulas.