Intelligent Integration Technology for Internet Knowledge Information Based on Korean


Introduction
Artificial Intelligence Laboratory
정성원
The beginning
• Four parts of linguistic science
– The cognitive side of how humans acquire, produce, and understand language
– Understanding the relationship between linguistic utterances and the world
– Understanding linguistic structures
– The rules used to structure linguistic expressions
• Edward Sapir: "All grammars leak"
– It is impossible to characterize a language exactly and completely.
Rationalist and Empiricist Approaches to Language (1)
• Rationalist (1960–1985)
– Believed that a significant part of knowledge is innate, stored in the mind in advance by genetic inheritance
– Noam Chomsky
– The human brain can be replicated
– Poverty of the stimulus
• Empiricist (1920–1960, 1985–present)
– The brain starts with a few cognitive abilities; these are seen not as absolute, but as one stage in the development of knowledge
– General operations such as pattern recognition, association, and generalization are innate
Rationalist and Empiricist Approaches to Language (2)
• Corpus
– A surrogate for situating language in the real world
– Advocated by Empiricists
– Used for discovering a language's structure
– Used for finding a good grammatical description
• Difference between the Rationalist and Empiricist approaches
– Rationalist: describe the language module of the human mind (the I-language)
– Empiricist: describe texts (the E-language) as they actually occur
Scientific Content
Questions that linguistics should answer
• Two basic questions
– The first covers all aspects of the structure of language
– The second deals with semantics, pragmatics, and discourse
• Traditional approach
– Competence grammar
– Grammaticality as a categorical, binary choice
– Forcing categorical judgments onto usage raises problems of conventionality
– Work within a fixed framework
– Categorical perception
Non-categorical phenomena in language
• Over time, the words and syntax of a language change.
• Words change their meaning and their part of speech.
• Blending of parts of speech: near
– The same word is used as several different parts of speech.
• Language change: kind of and sort of
– We are kind of hungry.
– He sort of understood what was going on.
Language and cognition as probabilistic phenomena
• We live in a world filled with uncertainty and incomplete information.
• Cognitive processes are best formalized as probabilistic processes.
• A major part of statistical NLP is deriving good probability estimates for unseen events, as in the sketch below.
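To make "probability estimates for unseen events" concrete, here is a minimal sketch of one standard estimator, Laplace (add-one) smoothing over a toy unigram model; the counts and the vocabulary size are invented for illustration and are not from the source.

from collections import Counter

def laplace_prob(word, counts, vocab_size):
    # Add-one smoothing: every event, seen or unseen, gets count + 1.
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

# Toy counts and an assumed vocabulary of 10,000 word types.
toy_counts = Counter({"the": 120, "of": 60, "language": 5})
V = 10_000

print(laplace_prob("language", toy_counts, V))  # a seen word
print(laplace_prob("zipf", toy_counts, V))      # an unseen word still gets mass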
The Ambiguity of Language: Why NLP Is Difficult (1)
• An NLP system needs to determine something of the structure of text, normally at least enough to answer "Who did what to whom?"
• Example: three syntactic analyses (see the sketch below)
– Our company is training workers.
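The three analyses can be reproduced with a toy grammar. The following minimal sketch is my own construction, not a grammar from the source: under it, NLTK's chart parser returns exactly three parses, corresponding to the progressive-verb, gerund-object, and noun-modifier readings.

import nltk

# A toy grammar under which "our company is training workers" is
# three-ways ambiguous: "is training" as a progressive verb,
# "training workers" as a gerund phrase, and "training" as a noun
# modifier of "workers".
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | N N | Ger N | N
VP -> V NP | Aux VPing
VPing -> Ving NP
Det -> 'our'
N -> 'company' | 'workers' | 'training'
Ger -> 'training'
Ving -> 'training'
V -> 'is'
Aux -> 'is'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("our company is training workers".split()):
    print(tree)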
The Ambiguity of Language: Why NLP Is Difficult (2)
• Making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope
• Traditional approach
– Selectional restrictions; disallow metaphorical extensions
– Disambiguation strategies that rely on manual rule creation and hand-tuning produce a knowledge acquisition bottleneck
• Statistical NLP approach
– Solve these problems by automatically learning lexical and structural preferences from corpora (see the sketch below)
– We recognize that there is a lot of information in the relationships between words.
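As one illustration of learning a lexical preference from a corpus, the sketch below counts how often a word is attested under each part-of-speech tag in NLTK's Brown corpus and picks the most frequent category. The choice of the Brown corpus and of the word "training" are my own, and the nltk data packages "brown" and "universal_tagset" are assumed to be downloaded.

from collections import Counter
from nltk.corpus import brown

def category_preferences(word):
    # Count the part-of-speech tags a word receives in the corpus.
    return Counter(
        tag for w, tag in brown.tagged_words(tagset="universal")
        if w.lower() == word
    )

prefs = category_preferences("training")
print(prefs)                 # e.g. NOUN vs. VERB counts
print(prefs.most_common(1))  # the corpus-preferred category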
Dirty Hands - Word counts (1)
• Text is represented as a list of words.
• Some questions
– What are the most common words?
– How many words are there in the text?
• How many word tokens are there? (71,370)
• How many word types appear in the text? (8,018)
– Why statistical NLP is difficult: it is hard to predict much about the behavior of words that occur rarely or never. (A counting sketch follows below.)
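The token and type counts above can be reproduced with a few lines of Python. This is a minimal sketch: the filename and the crude regular-expression tokenizer are illustrative assumptions, not the tokenization the original study used.

import re
from collections import Counter

# Read a plain-text corpus and split it into lowercase word tokens.
with open("tom_sawyer.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(tokens)
print("word tokens:", len(tokens))  # e.g. 71,370 in the slide's example
print("word types: ", len(counts))  # e.g. 8,018
print(counts.most_common(10))       # the most common words (cf. Table 1.1)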
Dirty Hands - Word counts (2)
Table 1.1 Common words
Table 1.2 Frequency of frequencies of word types
Zipf’s laws (1)
• The Principle of Least Effort
– People will act so as to minimize their probable average rate of work.
• Zipf's law
– A roughly accurate characterization of certain empirical facts
– f : the frequency of a word
– r : its position in the frequency-ordered list (its rank)

f ∝ 1/r,  i.e.  f · r = k  (for some constant k)
Zipf’s laws (2)
• Empirical evaluation of Zipf's law
Zipf’s laws (3)
• Mandelbrot
– Zipf's law is very bad at reflecting the details: it predicts a straight line with slope −1 on a log-log plot of frequency against rank.
– Mandelbrot proposed a more general relationship between rank and frequency, giving a straight line descending with slope −B:

f = P · (r + ρ)^(−B)
log f = log P − B · log(r + ρ)

– With B = 1 and ρ = 0, this reduces to Zipf's law.
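The parameters can be estimated with a straight-line fit in log space, as in the minimal sketch below; it holds ρ fixed at an arbitrary illustrative value (a full fit would search over ρ too) and reuses `ranked` from the Zipf sketch.

import numpy as np

# Fit log f = log P - B log(r + rho) by linear regression.
rho = 2.7   # assumed fixed value, for illustration only
ranks = np.arange(1, len(ranked) + 1)
freqs = np.array([f for _, f in ranked], dtype=float)

slope, intercept = np.polyfit(np.log(ranks + rho), np.log(freqs), 1)
B, P = -slope, np.exp(intercept)
print(f"B = {B:.2f}, P = {P:.1f}")  # B = 1 and rho = 0 would recover Zipf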
Zipf’s laws (4)
• Zipf's law and Mandelbrot's formula
Other laws
• The number of meanings m of a word is correlated with its frequency f:

m ∝ √f   and, since f ∝ 1/r,   m ∝ 1/√r
• One can measure the number of lines or pages between successive occurrences of a word in a text (estimated in the sketch below)
– F : the frequency of intervals
– I : the interval size
– p : between about 1 and 1.3
– Most of the time, content words occur near another occurrence of the same word.

F ∝ I^(−p)
Collocation (1)
• A collocation is any turn of phrase of accepted usage where somehow the whole is perceived to have an existence beyond the sum of its parts.
• Includes
– Compounds, phrasal verbs, and other stock phrases
• Any expression that people repeat because they have heard others using it is a candidate for a collocation.
• Example (see the sketch below)
– The first idea: bigrams
– Next step: filter
• Continued in Chapter 5
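A minimal sketch of the bigram-then-filter idea, in the spirit of Tables 1.4 and 1.5: raw bigram counts are dominated by function-word pairs, and a part-of-speech filter that keeps adjective-noun and noun-noun pairs surfaces likelier collocation candidates. It reuses `tokens` from the word-count sketch and assumes NLTK's tagger model is downloaded.

import nltk
from collections import Counter

# Step 1: raw bigram counts (mostly function-word pairs).
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(5))

# Step 2: keep only adjective-noun and noun-noun bigrams.
tagged = nltk.pos_tag(tokens)
keep = {("JJ", "NN"), ("NN", "NN")}
filtered = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if (t1, t2) in keep
)
print(filtered.most_common(5))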
Collocation (2)
Table 1.4 Commonest bigram collocations
Table 1.5 Frequent bigrams after filtering
Concordances
• Show the syntactic frames in which verbs appear (see the sketch below)
• Useful for guiding statistical parsers
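A minimal sketch of a KWIC (key word in context) concordance printer, the kind of display used to study verb frames; the window width and formatting are illustrative choices, and it reuses `tokens` from the word-count sketch.

def kwic(tokens, target, width=5):
    # Print each occurrence of the target word with its left and
    # right context aligned around it.
    for i, w in enumerate(tokens):
        if w == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>35}  {w}  {right}")

kwic(tokens, "showed")  # e.g. inspect the frames of the verb 'showed'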