Text Analysis


Indexing

Matrix Representations

Term Extraction and Analysis

Term Association

Lexical Measures of Term Significance

Document Similarity

Problems of using an uncontrolled vocabulary

1. Indexing

Indexing
  the act of assigning index terms to a document,
  manually or automatically

Indexing language (vocabulary)
  controlled or uncontrolled
  - controlled: limited to a predefined set of index terms
  - uncontrolled: allows any term that fits some broad criteria

1. Indexing (continued)

purpose
  - to permit easy location of documents by topic
  - to define topic areas, and hence relate one document to another
  - to predict relevance of a given document to a specified information need
characteristics
  - exhaustivity: the breadth of coverage of the index terms
  - specificity: the depth of coverage

Manual indexing

generally, manual indexing is uncontrolled indexing

Problems
  lack of consistency
  - exhaustivity and specificity differ from indexer to indexer
  - using a controlled vocabulary raises a different problem:
    it may be hard to represent the content of a document accurately
  indexer-user mismatch
  - the same concept is expressed with different terms
  - hard to solve even with a controlled vocabulary

Manual indexing (continued)

Characterizing the occurrence of terms
  link
  - terms occur together or have a semantic relationship
    ex) digital and computer
  - using conjunctions
  role
  - indicating a term's function or usage:
    the name of a flower may appear in a botanical definition,
    or in a sentence describing its use in a garden
  - using prepositional phrases
Cross-referencing
  - enhances the usability of an indexing language
  - See, See also (RT), Broader term (BT), Narrower term (NT)

Automatic indexing

an algorithm determines the index terms
almost always based on the frequency of occurrence
guiding principles
  - words can be divided into two subsets:
    grammatical/relational and content-bearing
  - among content-bearing words, those that occur more often are
    more important
  - a word whose occurrence in a document differs significantly from its
    average occurrence in the document collection can be used to
    distinguish that document

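The guiding principles above can be sketched in a few lines of Python. This is only a minimal illustration, not a method taken from the slides: the stop list `STOP_WORDS` (standing in for the grammatical/relational subset), the tokenizer, and the function name `index_terms` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop list standing in for the grammatical/relational words.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "by"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def index_terms(text, top_n=3):
    """Return the top_n most frequent content-bearing words with their counts."""
    content_words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(content_words).most_common(top_n)

doc = "The index of a document is a list of index terms for the document."
top_terms = index_terms(doc, top_n=2)
print(top_terms)
```

Here the most frequent content-bearing words are proposed as index terms. The third principle (comparison with the collection average) is not covered by this sketch.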
Automatic indexing (continued)

Does not settle the issue of a controlled
vocabulary vs. an uncontrolled one

Recent trends
  use of linguistic knowledge
  - syntactic structure
  - semantics and concepts
  - ex) DR-LINK (both): distinguishes proper nouns, common nouns, etc.
  inferencing techniques

A major use of the index
  inverted file: lists the documents containing each term
  - matching terms to documents is done only once
    (the result is shared by all queries)

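An inverted file as described above might look like the following sketch. The function names `build_inverted_file` and `lookup`, the whitespace tokenization, and the sample documents are assumptions for illustration only.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def lookup(inverted, query_terms):
    """Return ids of documents containing every query term (boolean AND)."""
    id_sets = [set(inverted.get(t, [])) for t in query_terms]
    return sorted(set.intersection(*id_sets)) if id_sets else []

# Made-up sample collection: document id -> text.
docs = {
    1: "digital computer design",
    2: "digital music archive",
    3: "computer music",
}

# Built once; every later query reuses the same structure.
inverted = build_inverted_file(docs)
print(inverted["digital"])
print(lookup(inverted, ["computer", "music"]))
```

The term-to-document matching happens once, in `build_inverted_file`. Each query then only reads the stored lists.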
2. Matrix Representations

many-to-many relationship between terms and
documents
three matrices are used to make the relationships explicit
  - term-document matrix
  - term-term matrix
  - document-document matrix

2. Matrix Representations (continued)

term-document matrix, A
  rows: vocabulary terms
  columns: documents
  0: the term does not occur; 1 or N (an occurrence count): the term occurs

term-term matrix, T
  rows, columns: vocabulary terms
  nonzero (1 or N)
  - the ith and jth terms occur together in some document
  - or have some other relationship

2. Matrix Representations (continued)

document-document matrix, D
  rows, columns: documents
  nonzero
  - the documents have some terms in common
  - or have some other relationship
    ex) author in common

these matrices are sparse: storing the empty cells should be avoided
  ex) use a list of terms instead of the term-document matrix
  - each term has a list of documents attached to it
  - when frequency matters, store
    (frequency, document identifier) pairs

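The three matrices and the sparse term list can be illustrated with a small pure-Python sketch. Computing T and D as the products A·Aᵀ and Aᵀ·A is one common choice and an assumption here. The slides only require that nonzero cells mark co-occurrence or shared terms. All function names and the sample data are hypothetical.

```python
def term_document_matrix(docs, vocabulary):
    """A[i][j] = number of times vocabulary[i] occurs in docs[j]."""
    return [[doc.split().count(term) for doc in docs] for term in vocabulary]

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    """Plain-Python matrix product of two lists of lists."""
    y_t = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_t] for row in x]

def sparse_term_list(a, vocabulary):
    """Store only nonzero cells: term -> [(frequency, document index), ...]."""
    return {term: [(f, j) for j, f in enumerate(row) if f]
            for term, row in zip(vocabulary, a)}

# Made-up sample data.
docs = ["digital computer", "digital music digital", "computer music"]
vocabulary = ["computer", "digital", "music"]

A = term_document_matrix(docs, vocabulary)   # terms x documents
T = matmul(A, transpose(A))                  # term-term (assumed A * A^T)
D = matmul(transpose(A), A)                  # document-document (assumed A^T * A)
term_list = sparse_term_list(A, vocabulary)  # sparse replacement for A
```

With this choice, an off-diagonal entry of T is nonzero exactly when two terms share a document, and an off-diagonal entry of D is nonzero exactly when two documents share a term.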
3. Term Extraction and Analysis

Frequency variation
  one basis for selecting automatic indexing terms

Zipf's law
  rank × frequency ≈ constant
  if the words are ranked in order of decreasing frequency
  implies that frequency falls off rapidly among the most frequent words

Frequently occurring words
  grammatical necessity: the, of, and, and a
  half of any given text is made up of approximately
  250 distinct words

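A quick way to inspect the rank × frequency product on a text is sketched below. The token list is contrived toy data chosen so the product is easy to read. It is not a measurement from any real corpus, and `rank_frequency` is a hypothetical helper name.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

# Contrived toy data: 6 x "the", 3 x "of", 2 x "and", 1 x "index".
tokens = ("the " * 6 + "of " * 3 + "and " * 2 + "index ").split()
rows = rank_frequency(tokens)
for row in rows:
    print(row)
```

On real text the product is only roughly constant, so the last column is best read as a diagnostic rather than a law.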
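3. Term Extraction and Analysis (continued)

Why frequent words are unsuitable as index terms
  - almost every document contains them
  - they are unrelated to the main ideas of the document

Why rare words are unsuitable as index terms
  - they may be related to the ideas of the document, but a search
    on such a word returns too few documents
  - inability to retrieve many documents

Two thresholds for defining index terms
  upper: excludes high-frequency terms
  lower: excludes rare words
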
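The two-threshold idea can be sketched as a filter on how many documents contain each term. Using document frequency (rather than raw occurrence counts) and the particular cutoff values are assumptions for this example. `select_index_terms` is a hypothetical name.

```python
from collections import Counter

def select_index_terms(docs, lower, upper):
    """Keep terms whose document frequency lies within [lower, upper]."""
    doc_freq = Counter()
    for text in docs:
        # set() so that each document counts a term at most once
        doc_freq.update(set(text.lower().split()))
    return sorted(t for t, df in doc_freq.items() if lower <= df <= upper)

# Made-up sample collection.
docs = [
    "the digital computer",
    "the digital archive",
    "the music archive",
    "the rare zither",
]
selected = select_index_terms(docs, lower=2, upper=3)
print(selected)
```

In this toy collection "the" falls above the upper threshold, and words that appear in a single document fall below the lower one.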
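3. Term Extraction and Analysis (continued)

Zipf's law is only a general guideline
  if 100 words each occur exactly once, the formula does not
  hold: each of them gets a different rank

It contradicts the claim that "the most frequent 20% of the text words
account for 70% of term usage."
  $f = k r^{-1}$
  - the total number of word occurrences is the area under this curve,
    which can be obtained by integration. But this total area is infinite
  - therefore no finite portion is 70% of the total area
  the same holds for $f = k r^{-(1-\epsilon)}$ ($\epsilon > 0$) and
  $f = k r^{-(1+\epsilon)}$ ($\epsilon > 0$)
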
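The divergence argument can be written out for the case $f = k r^{-1}$. As an assumption not stated in the slide, the area is taken from rank 1 up to a cutoff rank $R$, and $R_0$ denotes any fixed finite cutoff:

```latex
\int_{1}^{R} k\,r^{-1}\,dr = k \ln R \;\longrightarrow\; \infty \quad (R \to \infty),
\qquad
\frac{\int_{1}^{R_0} k\,r^{-1}\,dr}{\int_{1}^{R} k\,r^{-1}\,dr}
  = \frac{\ln R_0}{\ln R} \;\longrightarrow\; 0 \quad (R \to \infty).
```

Under this reading, the share of the total covered by the first $R_0$ ranks shrinks toward zero as the vocabulary grows, so it cannot settle at 70%.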
4. Term Association

word pairs or phrases that occur frequently enough should be
included in the indexing vocabulary
word proximity
  depends on
  - a given number of intervening words,
  - the words appearing in the same sentence, etc.
  word order, punctuation

Different kinds of document collections must be considered
  "digital computer" is significant in collections on medicine or
  music, but it is too frequent to be significant in computing
  and too rare to be significant in philosophy

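Word proximity defined by a number of intervening words could be counted as in the sketch below. The parameter `max_gap`, the function name `cooccurring_pairs`, and the sample text are assumptions. Sentence boundaries and punctuation are ignored here.

```python
from collections import Counter

def cooccurring_pairs(tokens, max_gap=1):
    """Count ordered word pairs separated by at most max_gap intervening words."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # The slice covers the next (max_gap + 1) tokens after position i.
        for right in tokens[i + 1 : i + 2 + max_gap]:
            pairs[(left, right)] += 1
    return pairs

# Made-up sample text.
tokens = "digital computer and digital signal computer".split()
pairs = cooccurring_pairs(tokens, max_gap=1)
adjacent_only = cooccurring_pairs(tokens, max_gap=0)
print(pairs[("digital", "computer")], adjacent_only[("digital", "computer")])
```

Pairs are kept ordered, so ("digital", "computer") and ("computer", "digital") are counted separately, which respects word order.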
5. Lexical Measures of Term Significance

development of an indexing language
  begins with analysis of the words and phrases occurring
  frequency per document -> frequency over the whole collection

A term list is more practical than a term-document matrix
  sparseness

Frequency of a word phrase
  cannot be computed directly from the frequencies of its component
  words, but it can be bounded
  $f(AB) \le \min(f(A), f(B))$

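The bound on phrase frequency can be checked on a toy token stream. The helper `phrase_frequency` and the sample text are made up for illustration, and a phrase is taken to mean two adjacent words.

```python
def phrase_frequency(tokens, first, second):
    """Count occurrences of `first` immediately followed by `second`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (first, second))

# Made-up sample text.
tokens = "digital computer digital music computer digital computer".split()
f_a = tokens.count("digital")
f_b = tokens.count("computer")
f_ab = phrase_frequency(tokens, "digital", "computer")
print(f_a, f_b, f_ab)
```

The phrase count has to be taken from the text itself. The component counts only give the upper limit.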
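5. Lexical Measures of Term Significance (continued)

absolute term frequency
  can be very misleading
  documents and document collections vary in size

relative term frequency
  a value adjusted for sizes and characteristics
  - frequency within the document / length of the document
    (number of words)
  frequency relative to the whole document collection
  - total frequency of the word / sum of the frequencies of all
    words in the collection
  - number of documents containing the word / total number of documents
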
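The three relative measures listed above can be computed as in this sketch. The function name `relative_frequencies`, whitespace tokenization, and the sample documents are assumptions. Documents are assumed to be non-empty.

```python
def relative_frequencies(docs, term):
    """Return (per-document, collection-wide, document-fraction) frequencies."""
    tokenized = [text.lower().split() for text in docs]
    # frequency within the document / length of the document
    within_doc = [tokens.count(term) / len(tokens) for tokens in tokenized]
    # total frequency of the word / total number of word occurrences
    total_words = sum(len(tokens) for tokens in tokenized)
    collection = sum(tokens.count(term) for tokens in tokenized) / total_words
    # number of documents containing the word / number of documents
    doc_fraction = sum(term in tokens for tokens in tokenized) / len(tokenized)
    return within_doc, collection, doc_fraction

# Made-up sample collection.
docs = ["digital computer", "digital music digital archive", "computer music"]
within_doc, collection, doc_fraction = relative_frequencies(docs, "digital")
print(within_doc, collection, doc_fraction)
```

Note that the first two documents get the same within-document value even though their absolute counts differ, which is the point of normalizing by length.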
5. Lexical Measures of Term Significance (continued)

Inverse document frequency weight

Signal-to-noise ratio

Term discrimination value

Inverse Document Frequency Weight

The frequency of occurrence of a term is
weighted by the number of documents that
contain the term
  a term that appears in many documents gets a low weight

inverse document frequency (idf)
  $\log_2(N/d_k) + 1 = \log_2 N - \log_2 d_k + 1$
  - $d_k$: the number of documents containing the term k
  - $N$: the number of documents in the collection
  minimum value = 1 (when the term occurs in every document, $d_k = N$)

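The idf formula above translates directly into code. The function name `idf` and its parameter names are chosen for this sketch. It assumes `doc_freq` is at least 1.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency: log2(N / d_k) + 1.

    n_docs   -- N, the number of documents in the collection
    doc_freq -- d_k, the number of documents containing term k (>= 1)
    """
    return math.log2(n_docs / doc_freq) + 1

print(idf(8, 8))   # term in every document: the minimum value
print(idf(8, 1))   # term in a single document
```

A term found in every document gets the minimum weight, and the weight grows as the term becomes rarer.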
Inverse Document Frequency Weight (continued)

inverse document frequency weight (tf.idf)
  $w_{ik} = f_{ik}\,[\log_2 N - \log_2 d_k + 1]$
  - $f_{ik}$: the frequency of term k in document i
  increases with the frequency of the term in the
  document
  decreases with the number of documents
  containing the term
  logarithm: insensitive to growth in the size of the collection
  - if the collection size doubles (with $d_k$ fixed), the idf value
    increases by 1

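The tf.idf weight can be sketched for a small collection as follows. The function name `tf_idf_weights`, whitespace tokenization, and the sample documents are assumptions for illustration. The weight itself follows the formula given on the slide.

```python
import math

def tf_idf_weights(docs):
    """Return one dict per document mapping term k to w_ik = f_ik * idf_k."""
    tokenized = [text.lower().split() for text in docs]
    n_docs = len(tokenized)
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    # d_k: number of documents containing term k
    doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocabulary}
    return [
        {t: tokens.count(t) * (math.log2(n_docs) - math.log2(doc_freq[t]) + 1)
         for t in set(tokens)}
        for tokens in tokenized
    ]

# Made-up sample collection.
docs = ["digital digital computer", "digital music", "music archive", "music"]
weights = tf_idf_weights(docs)
print(weights[0])
```

In this toy collection "music" appears in three of four documents, so it is weighted lower than "digital" within the second document even though both occur once there.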
