Korean script searching in Korean Library OPACs Junglim Chae Yonsei University Indexing Method  N-Gram  Morphological Analysis.


Korean script searching in
Korean Library OPACs
Junglim Chae
Yonsei University
Indexing Method
• N-Gram
• Morphological Analysis
N-Gram Indexing
• N-Gram: Unigram, Bigram, Trigram, N-Gram
• E.g.) 아버지가 방에 들어가신다
– 12 index terms by bigram segmentation (0 marks a space)
– 아버, 버지, 지가, 가0, 0방, 방에, 에0, 0들, 들어, 어가, 가신, 신다
• Many index terms: many results, but a lot of noise
• High recall ratio but low precision ratio
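The bigram segmentation in the example above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular OPAC: the function name `bigram_terms` and the `space_marker` parameter are assumptions made here, and the "0" marker simply mirrors the notation on the slide.

```python
import unicodedata


def bigram_terms(text, space_marker="0"):
    """Return the overlapping two-character index terms of `text`."""
    # Normalise to NFC so that each Hangul syllable is one code point.
    text = unicodedata.normalize("NFC", text)
    # Mark spaces the way the slide does ("가0", "0방").
    marked = text.replace(" ", space_marker)
    # Slide a two-character window over the marked string.
    return [marked[i:i + 2] for i in range(len(marked) - 1)]


terms = bigram_terms("아버지가 방에 들어가신다")
print(len(terms), terms)
```

A string of n characters (spaces included) yields n - 1 bigrams, which is why the 13-character example produces 12 index terms.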
Morphological Analysis
• Requires a morphological analysis dictionary
• E.g.) 아버지가 방에 들어가신다
– 3 index terms by morphological analysis
– 아버지, 방, 들어가다
• Ability to match linguistically similar terms
• Faster performance with a smaller index
• Accurate matches that meet user expectations
• High precision ratio but low recall ratio
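To show why a dictionary is required, here is a deliberately tiny sketch of dictionary-based term extraction. It is not a real morphological analyzer: `TOY_DICTIONARY` and `toy_morph_terms` are hypothetical names, the dictionary holds only the three stems needed for the example sentence, and the longest-prefix lookup is a simplification of what production analyzers do.

```python
# Hypothetical toy dictionary: stem as it appears in text -> dictionary form.
TOY_DICTIONARY = {
    "아버지": "아버지",    # noun
    "방": "방",            # noun
    "들어가": "들어가다",  # verb stem -> dictionary form
}


def toy_morph_terms(text, dictionary=TOY_DICTIONARY):
    """Return one dictionary form per space-delimited word, if a stem is known."""
    terms = []
    for word in text.split():
        # Try the longest prefix first, then progressively shorter ones,
        # so that particles and endings such as 가, 에, 신다 are dropped.
        for end in range(len(word), 0, -1):
            lemma = dictionary.get(word[:end])
            if lemma is not None:
                terms.append(lemma)
                break
    return terms


morph_terms = toy_morph_terms("아버지가 방에 들어가신다")
print(morph_terms)
```

Words whose stem is missing from the dictionary produce no index term at all in this sketch, which is one way to picture the lower recall noted above.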
N-Gram vs. Morphological Analysis
                  N-Gram      Morphological Analysis
Recall Ratio      High        Low
Precision Ratio   Low         High
Size of Index     Big         Small
Indexing Speed    Fast        Slow
Search Speed      Slow        Fast
Application       Libraries   Web Search Engines
A Case Study
Yonsei University Library
• Library System: Maestro-Y
• Search Engine: K2 by Verity
• Indexing Method
– N-Gram (bigram) + Morphological Analysis
• Indexing Rules
– Rule 1: Divide the string at spaces
– Rule 2: Extract index terms using the bigram indexing method
– Rule 3: Add the whole string with the spaces between strings removed
– Rule 4: Add words from the Korean morphological analysis dictionary
A Case Study
Yonsei University Library
• E.g.) ‘국어문법의 이해’
– 국어문법의 / 이해 (Rule 1)
– 국어, 어문, 문법, 법의, 이해 (Rule 2)
– 국어문법의이해 (Rule 3)
– 국어문법 (Rule 4)
• Index: 국어, 어문, 문법, 법의, 이해, 국어문법, 국어문법의이해
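The four indexing rules can be combined into one small function. The sketch below is an interpretation of the slide, not the actual Maestro-Y or K2 implementation: `index_title` is a hypothetical name, the dictionary is passed in as a plain list, and Rule 4 is approximated by a substring check.

```python
def index_title(title, dictionary):
    """Build index terms for `title` following the four rules on the slide."""
    strings = title.split()                        # Rule 1: divide at spaces
    terms = []
    for s in strings:                              # Rule 2: bigrams per string
        # A one-character string yields no bigram in this sketch.
        terms.extend(s[i:i + 2] for i in range(len(s) - 1))
    terms.append("".join(strings))                 # Rule 3: whole string, no spaces
    for word in dictionary:                        # Rule 4: dictionary words
        # Assumption: a dictionary word counts if it occurs inside a string.
        if any(word in s for s in strings):
            terms.append(word)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


index = index_title("국어문법의 이해", dictionary=["국어문법"])
print(index)
```

With the one-word dictionary shown, the function returns the same seven terms as the slide's index, although in a different order.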
Search Tips
Search Tips(1)
• Keyword Search
– 키워드검색, 임의검색
– Default Search Option
– Use at most 3 keywords
• Use Boolean operators
• Omit Stop-words
Search Tips(2)
• Keyword Search
– Follow the Korean Word Division Rules
• E.g.) 동해물과 백두산이 (O), 동해물과백두산이 (X)
Search Tips(3)
• Keyword Search
– Compound Nouns
• Do not use spaces between nouns
• E.g.) 서울대학교 (O), 서울 대학교 (X)
Search Tips(4)
• Browse Search
– Begins-with or truncation search
– 전방일치검색, 우측절단검색
– Use when you already know the first word of the title, author, or publisher
• E.g.) 한글과
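A begins-with (right-truncation) search can be pictured as a prefix lookup in a sorted list of headings. The sketch below is only an illustration of that idea: `browse` is a hypothetical function, and the two titles beginning with 한글과 are made-up sample data.

```python
from bisect import bisect_left


def browse(sorted_headings, prefix):
    """Return all headings that start with `prefix` from a sorted list."""
    # Headings sharing a prefix sit next to each other in sorted order,
    # starting at the position where the prefix itself would be inserted.
    start = bisect_left(sorted_headings, prefix)
    matches = []
    for heading in sorted_headings[start:]:
        if not heading.startswith(prefix):
            break
        matches.append(heading)
    return matches


# Sample headings; the two 한글과 titles are invented for illustration.
headings = sorted(["난중일기", "국어문법의 이해", "한글과 컴퓨터", "한글과 세종"])
hits = browse(headings, "한글과")
print(hits)
```

Because the lookup depends only on the beginning of the heading, the user does not need to know how the rest of the title is spelled or spaced.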
Search Tips(5)
• Browse Search
– Korean Classics
• E.g.) 열여춘향슈절가라
Search Tips(6)
• Exact Match
– Precise Search
– 완전일치검색
– Known items
• E.g.) 난중일기
Search Tips(7)
• Exact Match
– Single character words
• E.g.) ‘산’, ‘흙’, ‘C’
Search Tips(8)
• Support Hangul/Hancha Searching
• E.g.) 中國歷史文選/중국역사문선
Search Tips(9)
• Japanese Kana
• Archaic Korean
• Russian
• Special characters
: Choose scripts from the Multi-language Input Table
E.g.) [Figure: Multi-Script Input Table]
Search Tips(10)
• Japanese Kana
– 日本の歷史/일본の역사/일본노역사
– 日本デザイン論/일본デザイン론/일본데자인론
Search Tips(11)
• Personal names
– 윤동주
– 이광수 ; 춘원
– Shakespeare ; 셰익스피어
– Murakami, Haruki ; 村上春樹 ; 촌상춘수, 무라카미 하루키
Search Tips(12)
• Space
– Treated as AND
• E.g.) 한국 역사 = 한국 AND 역사
– In some OPACs, spaces in character fields make a difference in retrieval
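One way to picture "space as AND" is a lookup that intersects the record sets of each keyword. The sketch below is a guess at the general mechanism, not a description of any specific OPAC: `search_and` and `toy_index` are hypothetical, and the record numbers are invented.

```python
def search_and(query, inverted_index):
    """Treat every space in `query` as AND and intersect the posting sets."""
    keywords = query.split()
    if not keywords:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(inverted_index.get(keywords[0], set()))
    for keyword in keywords[1:]:
        result &= inverted_index.get(keyword, set())
    return result


# Hypothetical postings: index term -> record numbers containing it.
toy_index = {
    "한국": {1, 2, 3},
    "역사": {2, 3, 4},
    "한국역사": {3},
}

with_space = search_and("한국 역사", toy_index)
without_space = search_and("한국역사", toy_index)
print(with_space, without_space)
```

In this toy index the spaced query matches every record holding both terms, while the unspaced query matches only records indexed under the joined string, so the two result sets differ in size.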
Comparative search with and without space
Input Keywords: 국어 문법 (with space), 국어문법 (without space)

Libraries                           With space   Without space
Spaces do not matter
National Assembly Library           102          102
National Digital Library            2,047        2,047
KERIS (monographs)                  3,246        3,246
Seoul National University Library   171          171
Korea University Library            224          224
Spaces matter
The National Library of Korea       614          400
Yonsei University Library           332          179
Sungkyunkwan University Library     141          70
Ewha Womans University Library      221          133
Hanyang University Library          344          181
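The grouping in the comparison can be reproduced directly from the hit counts. The snippet below only re-checks the figures reported on the slide; the variable names are chosen here for illustration.

```python
# Hit counts from the slide: library -> (국어 문법, 국어문법).
counts = {
    "National Assembly Library": (102, 102),
    "National Digital Library": (2047, 2047),
    "KERIS (monographs)": (3246, 3246),
    "Seoul National University Library": (171, 171),
    "Korea University Library": (224, 224),
    "The National Library of Korea": (614, 400),
    "Yonsei University Library": (332, 179),
    "Sungkyunkwan University Library": (141, 70),
    "Ewha Womans University Library": (221, 133),
    "Hanyang University Library": (344, 181),
}

# A library is space-sensitive when the two queries return different counts.
spaces_matter = [
    name for name, (with_space, without_space) in counts.items()
    if with_space != without_space
]
spaces_do_not_matter = [name for name in counts if name not in spaces_matter]

print("Spaces matter:", spaces_matter)
print("Spaces do not matter:", spaces_do_not_matter)
```

In every space-sensitive library the spaced query returns more records than the unspaced one.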
謝謝
Thank You
감사합니다
ありがとうございます