HanNanum Korean Morphological Analyzer

KKAP: KAIST Korean Analysis Platform
Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 12, 2011
Research Goal
• The goal of the research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis.
• KKAP will be flexible and easy to use so that it can be widely adopted in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, etc.
Contents
• 1. Introduction to Korean Morphological Analysis
• 2. HanNanum Korean Morphological Analyzer & POS Tagger
• 3. Extension to KKAP (KAIST Korean Analysis Platform)

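Features of Korean morphological analysis
• Ambiguity of part-of-speech
• Ambiguity of morpheme segmentation
Example: candidate analyses of 가시는
– 가시/noun + 는/josa (thorn, prickle)
– 가시/verb + 는/eomi (leave, disappear)
– 가/verb + 시/eomi + 는/eomi (go)
– 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
Example Sentences:
– 그 선인장의 가시는 참 따가웠다. (The thorns of that cactus really stung.)
– 물을 마셨더니 갈증이 가시는 기분이다. (After drinking water, I feel my thirst going away.)
– 할머니께서는 집에 가시는 길이었다. (Grandmother was on her way home.)
– 아저씨의 칼을 가시는 모습은 인상적이다. (The sight of the man sharpening his knife is impressive.)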
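The four readings of 가시는 can be reproduced with a very small enumerator. The following is a minimal sketch, not HanNanum's algorithm: the lexicon, the coarse tags, and the three connection rules are toy assumptions chosen only to cover this one example, and the entry that maps surface 가 to the stem 갈 stands in for real phoneme restoration.

```python
from typing import Dict, List, Optional, Tuple

Morph = Tuple[str, str]  # (restored morpheme, coarse tag)

# Toy lexicon (assumption): surface string -> possible (morpheme, tag) readings.
LEXICON: Dict[str, List[Morph]] = {
    "가시": [("가시", "noun"), ("가시", "verb")],
    "가": [("가", "verb"), ("갈", "verb")],  # stem 갈 surfaces as 가 before 시
    "시": [("시", "eomi")],
    "는": [("는", "josa"), ("는", "eomi")],
}

# Toy connection rules (assumption): which tag may directly follow which.
ALLOWED = {("noun", "josa"), ("verb", "eomi"), ("eomi", "eomi")}


def analyses(word: str, prev_tag: Optional[str] = None) -> List[List[Morph]]:
    """Enumerate every segmentation of `word` that the toy rules accept."""
    if not word:
        return [[]]
    results: List[List[Morph]] = []
    for end in range(1, len(word) + 1):
        for morph, tag in LEXICON.get(word[:end], []):
            if prev_tag is not None and (prev_tag, tag) not in ALLOWED:
                continue  # rejected by the connection check
            for rest in analyses(word[end:], tag):
                results.append([(morph, tag)] + rest)
    return results


if __name__ == "__main__":
    for candidate in analyses("가시는"):
        print(" + ".join(f"{m}/{t}" for m, t in candidate))
```

With these toy tables the enumerator yields exactly the four candidates listed on the slide. Choosing among them needs sentence context, which is the job of the POS tagger.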
HanNanum Korean Morphological Analyzer
• HanNanum has been under development since the 1990s.
• Written in the C programming language
• Module-based architecture
• Based on the KAIST morphologically analyzed corpus
• HMM-based and Maximum Entropy-based POS taggers
HanNanum Architecture
[Figure: HanNanum architecture diagram, from INPUT to OUTPUT. Recoverable labels:]
– Processing components: Sentence Divisor; Morphological Analyzer (Segment Position, Inverse Segment Position, Analyzer, Dictionary Search, Connection Check, Phoneme Restoration, Code Conversion, Morpheme Chart, Chart in lattice form); Tagger; Tag Mapper; Computation
– Resources: Tag Set, Tag Set Table, Connection Info. Table, Bigram Info., Frequency Dictionary, System Dictionary (Trie), User Dictionary (Trie), Number Dictionary
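The diagram labels the system and user dictionaries as tries. As a rough illustration of why a trie suits dictionary search during segmentation, here is a minimal sketch in Python. It is a generic character-level trie, not HanNanum's actual data layout (which the slides do not describe), and the class and method names are hypothetical. The two entries reuse tags that appear in the workflow examples of this deck.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.tags: List[str] = []  # non-empty when a dictionary entry ends here


class TrieDictionary:
    """Hypothetical dictionary that stores (word, tag) entries in a trie."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, tag: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.tags.append(tag)

    def prefixes(self, text: str, start: int = 0) -> List[Tuple[str, List[str]]]:
        """Return every entry that matches `text` beginning at index `start`."""
        node = self.root
        found: List[Tuple[str, List[str]]] = []
        for i in range(start, len(text)):
            child = node.children.get(text[i])
            if child is None:
                break
            node = child
            if node.tags:
                found.append((text[start:i + 1], list(node.tags)))
        return found


if __name__ == "__main__":
    system_dict = TrieDictionary()
    system_dict.add("서울", "nq")
    system_dict.add("코엑스", "ncn")
    print(system_dict.prefixes("서울코엑스3층"))      # entries starting at index 0
    print(system_dict.prefixes("서울코엑스3층", 2))   # entries starting at index 2
```

A single left-to-right walk finds all dictionary entries that start at a given position, which is the lookup a chart-style analyzer needs when it fills its lattice.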
HMM-based POS Tagger
• Shin Jung-ho, Han Young-seok, Park Young-chan, Choi Key-Sun, “An HMM Part-of-Speech Tagger for Korean Based on Wordphrase”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 389-394, 1994.
• Transition probability between word-phrase tags
• Transition probability between morpheme tags within a word phrase
• Probability of occurrence of a morpheme and its POS
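One way to see how the three probabilities listed above could combine is the sketch below. It is an illustration under stated assumptions, not the model from the cited paper: the word-phrase tags "N" and "V", every probability value, the start symbol "<s>", and the smoothing floor are all invented for the example. Each word phrase has several candidate analyses, and a Viterbi search picks the best sequence.

```python
import math
from typing import Dict, List, Tuple

Morph = Tuple[str, str]               # (morpheme, POS tag)
Candidate = Tuple[str, List[Morph]]   # (word-phrase tag, morphemes)
Probs = Dict[Tuple[str, str], float]
FLOOR = 1e-6                          # assumed probability for unseen events


def inner_logp(morphs: List[Morph], tag_trans: Probs, emit: Probs) -> float:
    """Morpheme-tag transitions inside one word phrase plus P(morpheme, tag) terms."""
    logp, prev = 0.0, "<s>"
    for morph, tag in morphs:
        logp += math.log(tag_trans.get((prev, tag), FLOOR))
        logp += math.log(emit.get((morph, tag), FLOOR))
        prev = tag
    return logp


def best_path(sentence: List[List[Candidate]], phrase_trans: Probs,
              tag_trans: Probs, emit: Probs) -> List[Candidate]:
    """Viterbi search: one candidate list per word phrase, best sequence out."""
    # best holds, per candidate of the current phrase, the best (score, path).
    best = [(0.0, [("<s>", [])])]
    for cands in sentence:
        new_best = []
        for cand in cands:
            local = inner_logp(cand[1], tag_trans, emit)
            score, path = max(
                ((s + math.log(phrase_trans.get((p[-1][0], cand[0]), FLOOR)), p)
                 for s, p in best),
                key=lambda item: item[0],
            )
            new_best.append((score + local, path + [cand]))
        best = new_best
    return max(best, key=lambda item: item[0])[1][1:]  # drop the start marker


# Toy model (all numbers are made up) for the phrase pair 갈증이 가시는.
PHRASE_TRANS = {("<s>", "N"): 0.6, ("N", "N"): 0.3, ("N", "V"): 0.7}
TAG_TRANS = {("<s>", "noun"): 0.5, ("<s>", "verb"): 0.5,
             ("noun", "josa"): 0.8, ("verb", "eomi"): 0.9}
EMIT = {("갈증", "noun"): 0.01, ("이", "josa"): 0.2, ("가시", "noun"): 0.01,
        ("는", "josa"): 0.2, ("가시", "verb"): 0.01, ("는", "eomi"): 0.1}
SENTENCE = [
    [("N", [("갈증", "noun"), ("이", "josa")])],
    [("N", [("가시", "noun"), ("는", "josa")]),
     ("V", [("가시", "verb"), ("는", "eomi")])],
]

if __name__ == "__main__":
    for phrase_tag, morphs in best_path(SENTENCE, PHRASE_TRANS, TAG_TRANS, EMIT):
        print(phrase_tag, "+".join(f"{m}/{t}" for m, t in morphs))
```

With these toy numbers the verbal reading of 가시는 wins after 갈증이, because the assumed "N" to "V" word-phrase transition outweighs the slightly better noun-internal score.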
Analysis Example
[Figure: analysis example. Recoverable annotations:]
- HMM-based Tagger: find the most suitable result among the candidates
- POS-tagged Dictionary
- Check Connection rule
- Phoneme Restoration
Plug-In Component-based System
– Each function of Korean morphological analysis is implemented as a plug-in.
– Users can set up a workflow from existing plug-ins to suit their own goals.
Plug-In Pool
[Figure: plug-in pool feeding a phased workflow. Recoverable labels:]
– Plug-ins: Chart-based Morph Analyzer, Corpus-based Morph Analyzer, HMM POS Tagger, CRF POS Tagger, Unknown Noun Proc., Tag Mapper (Tag Mapping), Noun Extractor (Noun Extracting), Auto Spacing, Input Filter, Sentence Splitter, Transliteration, …
– Workflow slots: Phase1 Supplement Plugin; Phase2 Morphological Analyzer; Phase2 Supplement Plugin; Phase3 POS Tagger; Phase3 Supplement Plugin
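The phased slots above can be sketched as a small pipeline object. This is a minimal Python illustration of the idea only. It is not the jhannanum API, and the class name, attribute names, and the three toy plug-ins are hypothetical. The two dictionary entries reuse tags from the news-indexing example in this deck.

```python
from typing import Any, Callable, List, Optional, Tuple

Plugin = Callable[[Any], Any]  # a plug-in maps one intermediate result to the next


class Workflow:
    """Hypothetical workflow with the slot layout shown on the slide."""

    def __init__(self) -> None:
        self.phase1_supplements: List[Plugin] = []   # text preprocessing
        self.phase2_major: Optional[Plugin] = None   # morphological analyzer
        self.phase2_supplements: List[Plugin] = []
        self.phase3_major: Optional[Plugin] = None   # POS tagger
        self.phase3_supplements: List[Plugin] = []

    def run(self, text: str) -> Any:
        data: Any = text
        for plugin in self.phase1_supplements:
            data = plugin(data)
        # A workflow may stop early: each later phase needs its major plug-in.
        for major, supplements in ((self.phase2_major, self.phase2_supplements),
                                   (self.phase3_major, self.phase3_supplements)):
            if major is None:
                return data
            data = major(data)
            for plugin in supplements:
                data = plugin(data)
        return data


# Toy plug-ins (assumptions, for illustration only).
TOY_DICT = {"거제도": "ncn", "축제": "ncn"}


def sentence_splitter(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_morph_analyzer(sentences: List[str]) -> List[List[Tuple[str, str]]]:
    return [[(w, TOY_DICT.get(w, "unk")) for w in s.split()] for s in sentences]


def noun_extractor(analyzed: List[List[Tuple[str, str]]]) -> List[List[Tuple[str, str]]]:
    return [[(w, t) for w, t in sent if t.startswith("n")] for sent in analyzed]


if __name__ == "__main__":
    indexing = Workflow()
    indexing.phase1_supplements.append(sentence_splitter)
    indexing.phase2_major = toy_morph_analyzer
    indexing.phase2_supplements.append(noun_extractor)
    print(indexing.run("거제도 축제. 열린 축제"))
```

Because every plug-in shares the same call shape, swapping the analyzer or adding a supplement plug-in changes only the workflow setup, not the plug-ins themselves.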
Flexible Workflow
- Analysis of Announcement on Web
Input: $$$$$장소$$$$$ 서울코엑스3층 ("장소" = venue; "서울코엑스3층" = Seoul COEX, 3rd floor)
Workflow:
– Plain Text Processors: Informal Input Filter, Auto Spacing, Sentence Splitter
– Morphological Analyzer: Chart-based Morphological Analyzer
– Morpheme Processor: Unknown Processor
– POS Tagger: HMM-based POS Tagger
Output:
– $$$$$: $/su+$/su+$/su+$/su+$/su
– 장소: 장소/ncn
– $$$$$: $/su+$/su+$/su+$/su+$/su
– 서울: 서울/nq
– 코엑스: 코엑스/ncn
– 3층: 3/nnc+층/nbu
- Indexing of News Articles
Input: 지난 9월 거제도에서 열린 축제 … ("The festival held on Geoje Island last September …")
Workflow:
– Plain Text Processor: Sentence Splitter
– Morphological Analyzer: Chart-based Morphological Analyzer
– Morpheme Processor: Noun Extractor
Output: 9월/n, 거제도/ncn, 축제/ncn
HanNanum Korean Morphological Analyzer
Workflow for Morphological Analysis
[Figure: three-phase workflow fed from a plug-in pool. Recoverable content:]
– Phase 1. Text Preprocessing: Supplement Plugins
– Phase 2. Morphological Analysis: Major Plugin, Supplement Plugins
– Phase 3. POS Tagging: Major Plugin, Supplement Plugins
– Plugin Pool (Phase 1, Phase 2, and Phase 3 plugins): Sentence Segmentation, Auto Spacing, Input Filter, Chart-based Morph Analyzer, Unknown Term Processing, Tag Mapper, HMM-based POS Tagging, CRF-based POS Tagging, Noun Extraction
Korean Document Analysis: Extract the Part Of Speech Information from Korean Text
Input:
7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
(The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. The AP reported that, among Swedish Nobel watchers, the Korean poet Ko Un was mentioned most often, together with the Syrian poet Adonis, as a likely winner of this year's prize. …)
Output:
7/nnc+일/nbu
저녁/ncn
발표예정/ncpa+이/jp+ㄴ/etm
노벨문학상/nq+의/jcm
유력/ncps
수상자/ncn+로/jca
고은/nq
시인/ncn+이/jcc
거론/ncpa+되/xsv+고/ecc
있/paa+다/ef
./sf
통신은 통/ncn+신/ncn+은/jxc
스웨덴/nq+의/jcm
노벨상/ncn
관측통/ncn+들/xsn
사이/ncn+에/jca
….
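The tagged output uses a compact notation: morphemes within a word phrase are joined by "+", and each morpheme is followed by "/" and its tag. A minimal sketch for reading that notation back into pairs is shown below. The function name is hypothetical, and the sketch assumes that no morpheme itself contains "+", which real text may violate.

```python
from typing import List, Tuple


def parse_analysis(analysis: str) -> List[Tuple[str, str]]:
    """Split a string such as '3/nnc+층/nbu' into (morpheme, tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    for chunk in analysis.split("+"):
        # rpartition keeps a morpheme intact even if it contains '/' itself.
        morpheme, _, tag = chunk.rpartition("/")
        pairs.append((morpheme, tag))
    return pairs


if __name__ == "__main__":
    print(parse_analysis("발표예정/ncpa+이/jp+ㄴ/etm"))
    print(parse_analysis("./sf"))
```

Downstream steps such as noun extraction can then filter the pairs by tag prefix instead of re-parsing strings.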
Open Source Project
• http://kldp.net/projects/hannanum/
• 2011.01.10 jhannanum 0.8.2 was released
GUI Demo
[Screenshot: GUI demo. Labeled panels: Workflow, Information of a plug-in, Plug-in Pool, Workflow control, Input & Output]
KKAP: KAIST Korean Analysis Platform
Workflow for Korean Analysis
[Figure: the morphological analysis workflow extended with a parsing phase. Recoverable content:]
– Phase 1. Text Preprocessing: Supplement Plugins
– Phase 2. Morphological Analysis: Major Plugin, Supplement Plugins
– Phase 3. POS Tagging: Major Plugin, Supplement Plugins
– Phase 4. Parsing: Major Plugin, Supplement Plugins
– Plugin Pool (Phase 1 to Phase 4 plugins): Sentence Segmentation, Auto Spacing, Input Filter, Chart-based Morph Analyzer, Unknown Term Processing, Tag Mapper, HMM-based POS Tagging, Noun Extraction, Chart Parser, Noun Phrase Extractor, Verb Phrase Extractor
– Korean Document Analysis produces an Analyzed Korean Document. The sample input and tagged output are identical to those on the HanNanum workflow slide (the news passage beginning 7일 저녁 발표예정인 노벨문학상의 and the analysis beginning 7/nnc+일/nbu).
Korean Syntactic Tree Tagged Corpus
• Registered at BoRA (Bank of Resource for Language and Annotation)
– http://bora.or.kr
– Corpus 5. Manual sentence analysis corpus
– 31,091 sentences from 97 different sources
– Length: 1 ~ 33 Eojeols (average 11.35 Eojeols)
• Related document
– Kong joo Lee, Byung Gyu Chang, Gil Chang Kim, “Bracketing Guidelines for
Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department
Technical Report, CS/TR-97-112, 1997 (In Korean)
– Byung Gyu Chang, Kong joo Lee, Gil Chang Kim, “Design and Implementation
of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”,
Proceedings of the Conference on Hangul and Korean Language Information
Processing, pp.421~429, 1997 (In Korean)
Questions & Comments