Korean script searching in Korean Library OPACs
Junglim Chae
Yonsei University

Indexing Method
• N-Gram
• Morphological Analysis
N-Gram Indexing
• N-Gram: Unigram, Bigram, Trigram, ..., N-Gram
• E.g.) 아버지가 방에 들어가신다 ("Father goes into the room")
• 12 index terms by bigram segmentation (0 marks a space)
  – 아버, 버지, 지가, 가0, 0방, 방에, 에0, 0들, 들어, 어가, 가신, 신다
• Many index terms: many results, but lots of noise
• High recall ratio but low precision ratio

Morphological Analysis
• Requires a morphological analysis dictionary
• E.g.) 아버지가 방에 들어가신다
• 3 index terms by morphological analysis
  – 아버지, 방, 들어가다
• Ability to match linguistically similar terms
• Faster performance with a smaller index
• Accurate matches that meet user expectations
• High precision ratio but low recall ratio

N-Gram vs. Morphological Analysis

                    N-Gram       Morphological Analysis
Recall Ratio        High         Low
Precision Ratio     Low          High
Size of Index       Big          Small
Indexing Speed      Fast         Slow
Search Speed        Slow         Fast
Application         Libraries    Web Search Engines

A Case Study: Yonsei University Library
• Library System: Maestro-Y
• Search Engine: K2 by Verity
• Indexing Method: N-Gram (bigram) + Morphological Analysis
• Indexing Rules
  – Rule 1: Divide strings by space
  – Rule 2: Extract index terms using the bigram indexing method
  – Rule 3: Add the whole string, excluding the spaces between strings
  – Rule 4: Add words from the Korean morphological analysis dictionary

A Case Study: Yonsei University Library
• E.g.) ‘국어문법의 이해’ ("Understanding Korean Grammar")
  – Rule 1: 국어문법의 / 이해
  – Rule 2: 국어, 어문, 문법, 법의, 이해
  – Rule 3: 국어문법의이해
  – Rule 4: 국어문법
• Index: 국어, 어문, 문법, 법의, 이해, 국어문법, 국어문법의이해

Search Tips

Search Tips(1)
• Keyword Search
  – 키워드검색, 임의검색 (keyword search, free search)
  – Default search option
  – Use at most 3 keywords
    • Use Boolean operators
    • Omit stop-words

Search Tips(2)
• Keyword Search
  – Follow the Korean word division rules
    • E.g.) 동해물과 백두산이 (O), 동해물과백두산이 (X)

Search Tips(3)
• Keyword Search
  – Compound nouns
    • Do not use spaces between nouns
    • E.g.) 서울대학교 (O), 서울 대학교 (X)

Search Tips(4)
• Browse Search
  – "Begin with" or truncation
  – 전방일치검색, 우측절단검색 (prefix-match search, right-truncation search)
  – Use when you already know the first word of the title, author, or publisher
    • E.g.) 한글과

Search Tips(5)
• Browse Search
  – Korean classics
    • E.g.) 열여춘향슈절가라

Search Tips(6)
• Exact Match
  – Precise search
  – 완전일치검색 (exact-match search)
  – Known items
    • E.g.) 난중일기

Search Tips(7)
• Exact Match
  – Single-character words
    • E.g.) ‘산’, ‘흙’, ‘C’

Search Tips(8)
• Hangul/Hancha searching is supported
  • E.g.) 中國歷史文選 / 중국역사문선

Search Tips(9)
• Japanese Kana
• Archaic Korean
• Russian
• Special characters
  : Choose scripts from the Multi-language Input Table
E.g.) Multi-Script Input Table

Search Tips(10)
• Japanese Kana
  – 日本の歷史 / 일본の역사 / 일본노역사
  – 日本デザイン論 / 일본デザイン론 / 일본데자인론

Search Tips(11)
• Personal names
  – 윤동주
  – 이광수 ; 춘원
  – Shakespeare ; 셰익스피어
  – Murakami, Haruki ; 村上春樹 ; 촌상춘수, 무라카미 하루키

Search Tips(12)
• Space
  – Considered as AND
    • E.g.) 한국 역사 = 한국 AND 역사
  – In some OPACs, spaces in the character fields do make a difference in retrieval

Comparative search with and without space
(number of results for each input keyword)

Libraries                               국어 문법    국어문법
Spaces do not matter
  National Assembly Library             102          102
  National Digital Library              2,047        2,047
  KERIS (monographs)                    3,246        3,246
  Seoul National University Library     171          171
  Korea University Library              224          224
Spaces matter
  The National Library of Korea         614          400
  Yonsei University Library             332          179
  Sungkyunkwan University Library       141          70
  Ewha Womans University Library        221          133
  Hanyang University Library            344          181

謝謝 Thank You 감사합니다 ありがとうございます
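As a rough illustration of the N-Gram Indexing slide and the four indexing rules in the Yonsei case study, the sketch below reproduces the two worked examples from the slides. It is not the K2 or Maestro-Y implementation. The function names `bigrams` and `build_index` are hypothetical, and Rule 4 is approximated by a plain substring lookup in a toy dictionary, where a real system would run a morphological analyzer. The text is normalized to NFC so that each Hangul syllable counts as one character.

```python
import unicodedata


def bigrams(text, space_marker=None):
    """Return the overlapping two-character slices of text, in order.

    If space_marker is given, spaces are replaced by it first, which
    mirrors the "0" notation used on the N-Gram Indexing slide.
    """
    text = unicodedata.normalize("NFC", text)
    if space_marker is not None:
        text = text.replace(" ", space_marker)
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(title, dictionary):
    """Toy version of the four indexing rules (hypothetical helper)."""
    title = unicodedata.normalize("NFC", title)
    tokens = title.split()                    # Rule 1: divide by space
    terms = []
    for token in tokens:                      # Rule 2: bigrams per token
        terms.extend(bigrams(token))
    terms.append("".join(tokens))             # Rule 3: whole string, no spaces
    for word in dictionary:                   # Rule 4: dictionary words
        # Assumption: substring lookup stands in for morphological analysis.
        if any(word in token for token in tokens):
            terms.append(word)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


sentence_terms = bigrams("아버지가 방에 들어가신다", space_marker="0")
print(len(sentence_terms), sentence_terms)

# The one-entry dictionary is made up for this example.
print(build_index("국어문법의 이해", {"국어문법"}))
```

With these inputs the first call yields the 12 bigrams listed on the slide, and the second yields the same seven index terms as the case study example, in a different order. A single-character token such as ‘산’ produces no bigrams at all under this scheme.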