Introduction

Project4 - will be updated
Project
Use previous results to compute
TFIDF(ti, tj; dj) = tf(ti, tj; dj) * log(|Tr| / |Tr(ti, tj)|)
where ti and tj are distinct and nearby
(they are at most 10 tokens apart).
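The pair formula above can be sketched in Python as follows. This is a minimal illustration, not the course's reference solution. It assumes documents are already tokenized into lists of strings, that the logarithm is the natural log, that tf takes the 1 + log N form given on the later Project slide, and that the pair frequency counts position pairs at most 10 tokens apart. The names pair_count and pair_tfidf are hypothetical.

```python
import math

WINDOW = 10  # maximum distance in tokens between ti and tj (from the slide)


def pair_count(tokens, ti, tj, window=WINDOW):
    """Count position pairs (p, q) with tokens[p] == ti, tokens[q] == tj
    and 0 < |p - q| <= window. This counting rule is an assumption."""
    pos_i = [p for p, tok in enumerate(tokens) if tok == ti]
    pos_j = [q for q, tok in enumerate(tokens) if tok == tj]
    return sum(1 for p in pos_i for q in pos_j if 0 < abs(p - q) <= window)


def pair_tfidf(docs, ti, tj, j, window=WINDOW):
    """TFIDF(ti, tj; dj) for document index j in the collection docs (Tr)."""
    counts = [pair_count(d, ti, tj, window) for d in docs]
    n = counts[j]
    if n == 0:
        return 0.0  # tf is 0 when the pair does not occur in dj
    doc_freq = sum(1 for c in counts if c > 0)  # |Tr(ti, tj)|, at least 1 here
    tf = 1.0 + math.log(n)
    return tf * math.log(len(docs) / doc_freq)


docs = [["a", "b", "a"], ["a", "x"], ["c"]]
print(pair_count(docs[0], "a", "b"))
print(pair_tfidf(docs, "a", "b", 0))
```

Counting position pairs is only one possible reading of "frequency of the pair"; counting windows or non-overlapping matches would give different values.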
High Dimension TFIDF
Definition 2. The notion of LSI can be extended to
q-terms:
TFIDF(ti1, ..., tiq; dj)
= tf(ti1, ..., tiq; dj) * log(|Tr| / |Tr(ti1, ..., tiq)|)
where ti1, ..., tiq is a set of keywords that are at most 10
tokens apart; a keyword is a token with a
high TFIDF value.
High Dimension LSI
For a set of documents, we consider:
1. Keywords (1-associations)
2. Co-occurring sets of q keywords
(q-associations)
Project
1. |Tr(ti1, ..., tiq)| = the number of documents in Tr in which (ti1, ..., tiq)
occurs at least once.
2. tf(ti1, ..., tiq; dj)
 = 1 + log(N(ti1, ..., tiq; dj)) if N(ti1, ..., tiq; dj) > 0
 = 0 otherwise
3. N(ti1, ..., tiq; dj) = the frequency of (ti1, ..., tiq) in dj.
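The three definitions above might be combined for a general q-term set roughly as shown below. This is a sketch under stated assumptions: documents are lists of tokens, the logarithm is natural, and N counts the occurrences of the first term that have every other term within 10 tokens of it. The slides do not fix how the frequency of a q-term set is counted, so that rule is a guess. All function names are hypothetical.

```python
import math


def cooccurrence_count(tokens, terms, window=10):
    """N(ti1, ..., tiq; dj): occurrences of terms[0] that have every other
    term within `window` tokens. Assumed counting rule; terms are distinct."""
    positions = {t: [p for p, tok in enumerate(tokens) if tok == t] for t in terms}
    anchor, others = terms[0], terms[1:]
    return sum(
        1
        for p in positions[anchor]
        if all(any(abs(p - q) <= window for q in positions[t]) for t in others)
    )


def tf(tokens, terms, window=10):
    """1 + log N if N > 0, else 0 (item 2 on the slide)."""
    n = cooccurrence_count(tokens, terms, window)
    return 1.0 + math.log(n) if n > 0 else 0.0


def doc_freq(docs, terms, window=10):
    """Number of documents in which the term set occurs at least once (item 1)."""
    return sum(1 for d in docs if cooccurrence_count(d, terms, window) > 0)


def tfidf(docs, terms, j, window=10):
    """TFIDF(ti1, ..., tiq; dj) = tf * log(|Tr| / document frequency)."""
    weight = tf(docs[j], terms, window)
    if weight == 0.0:
        return 0.0  # also avoids dividing by a zero document frequency
    return weight * math.log(len(docs) / doc_freq(docs, terms, window))


docs = [
    ["x", "y", "z", "x"],
    ["x"] + ["pad"] * 20 + ["y", "z"],  # x is too far from y and z
    ["y", "z"],
]
print(tfidf(docs, ("x", "y", "z"), 0))
```

With a single term (q = 1) the count reduces to the ordinary term frequency, so the same functions cover both 1-associations and q-associations.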