Indexing UMLS concepts with Apache Lucene

Julien Thibault
[email protected]
University of Utah
Department of Biomedical Informatics
Outline
• Goals
• Unified Medical Language System (UMLS)
• Apache Lucene
• Get to work!
Goals
• Build a dictionary lookup module for NLP pipelines
– Input: string (e.g. “diabetes”, “breast cancer”, “warfarin”)
– Output: list of concepts (e.g. “C083562”)
• Application examples:
– Unstructured clinical document coding
– (Semi)automated literature indexing
• Pre-processing necessary for free text (not covered today):
– Tokenization
– Sentence detection
– Part-of-speech tagging (e.g. to lookup only noun phrases)
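The input/output contract of the lookup module can be sketched in plain Java before any Lucene code is involved. This is only an illustration: the class name `ConceptDictionary` and its methods are hypothetical, and the exact-match map is a stand-in for the Lucene index built later in these slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the lookup contract: string in, list of CUIs out.
// An exact-match map stands in for the Lucene index.
final class ConceptDictionary {
    private final Map<String, List<String>> cuisByName = new HashMap<>();

    // Register one name (STR) for a concept (CUI).
    void add(String name, String cui) {
        String key = normalize(name);
        List<String> cuis = cuisByName.get(key);
        if (cuis == null) {
            cuis = new ArrayList<>();
            cuisByName.put(key, cuis);
        }
        if (!cuis.contains(cui)) {
            cuis.add(cui);
        }
    }

    // Return the CUIs whose names match the input, or an empty list.
    List<String> lookup(String text) {
        List<String> cuis = cuisByName.get(normalize(text));
        if (cuis == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(cuis);
    }

    // Trim and lower-case so that "AIDS" and " aids " hit the same entry.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ConceptDictionary dict = new ConceptDictionary();
        dict.add("AIDS", "C0001175");
        dict.add("SIDA", "C0001175");
        System.out.println(dict.lookup("aids"));
        System.out.println(dict.lookup("warfarin"));
    }
}
```

An exact-match map cannot handle partial matches, phrases, or ranking, which is the gap a search library is meant to fill.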
UMLS
• Unified Medical Language System (NLM)
– Millions of organized biomedical concepts
– Over 150 sources (e.g. SNOMED-CT, LOINC, NCI, MeSH)
– Good source for indexing biomedical concepts!
– UMLS Terminology Services: https://uts.nlm.nih.gov/home.html
• Content
– Concepts, synonymous names, relationships
– Semantic network (high-level classification)
• Organism, anatomical structure, biologic function, chemical, …
• Distribution
– Files with concept and relationship description data
– Loadable into a database for querying
– Files/columns: http://www.ncbi.nlm.nih.gov/books/NBK9685/
UMLS schema
• 19 files to describe:
– Concepts
– Relationships
– The files (columns and content)
• MRCONSO
– Concept names and sources
• MRSTY
– Concept semantic types
• Terminology (source) codes
– http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
Concept table (MRCONSO)
CUI       LAT  LUI       SAB       STR                                  …
C0001175  ENG  L0001175  MSH       Acquired Immunodeficiency Syndromes  …
C0001175  ENG  L0001842  SNOMEDCT  AIDS                                 …
C0001175  FRE  L0162173  SNOMEDCT  SIDA                                 …
CUI: concept unique ID; LAT: language of term; LUI: term unique ID; SAB: Source; STR: string
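Because several rows share one CUI, code that consumes MRCONSO usually groups names by concept. Below is a minimal sketch using only the three rows above. `ConceptRow`, `slideRows`, and `namesByCui` are hypothetical names, and real code would read the rows from the database rather than hard-code them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for the MRCONSO columns shown on the slide.
final class ConceptRow {
    final String cui, lat, lui, sab, str;

    ConceptRow(String cui, String lat, String lui, String sab, String str) {
        this.cui = cui;
        this.lat = lat;
        this.lui = lui;
        this.sab = sab;
        this.str = str;
    }

    // The three example rows from the slide.
    static List<ConceptRow> slideRows() {
        return Arrays.asList(
            new ConceptRow("C0001175", "ENG", "L0001175", "MSH", "Acquired Immunodeficiency Syndromes"),
            new ConceptRow("C0001175", "ENG", "L0001842", "SNOMEDCT", "AIDS"),
            new ConceptRow("C0001175", "FRE", "L0162173", "SNOMEDCT", "SIDA"));
    }

    // Group the names (STR) of one language (LAT) by concept (CUI).
    static Map<String, List<String>> namesByCui(List<ConceptRow> rows, String lat) {
        Map<String, List<String>> names = new LinkedHashMap<>();
        for (ConceptRow row : rows) {
            if (!row.lat.equals(lat)) {
                continue; // skip names in other languages
            }
            List<String> list = names.get(row.cui);
            if (list == null) {
                list = new ArrayList<>();
                names.put(row.cui, list);
            }
            list.add(row.str);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesByCui(slideRows(), "ENG"));
        System.out.println(namesByCui(slideRows(), "FRE"));
    }
}
```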
• MySQL database
– mysql -u [user] -h [host] -D [database] -p
– Replace with provided info (thanks Kristina!!)
• Query example:
select * from MRCONSO where STR like 'my favorite disease';
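The same kind of query can be issued from Java through JDBC, which is what the index-building exercise needs. The sketch below makes several assumptions: `MrconsoQuery`, `containsPattern`, and `findByName` are hypothetical names, the JDBC URL, user, and password are placeholders for the connection details you were given, and the MySQL Connector/J JAR must be on the classpath. Note that `like` without a `%` wildcard only matches the whole string, so the helper wraps the term for a partial match and escapes wildcard characters using MySQL's default backslash escape.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class MrconsoQuery {
    // Build a LIKE pattern for a "contains" match, escaping \, % and _ in the term.
    static String containsPattern(String term) {
        String escaped = term.replace("\\", "\\\\")
                             .replace("%", "\\%")
                             .replace("_", "\\_");
        return "%" + escaped + "%";
    }

    // Return {CUI, STR} pairs for names containing the term.
    // jdbcUrl, user and password are placeholders supplied by the caller.
    static List<String[]> findByName(String jdbcUrl, String user, String password, String term)
            throws SQLException {
        String sql = "SELECT CUI, STR FROM MRCONSO WHERE STR LIKE ?";
        List<String[]> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, containsPattern(term)); // bound parameter, no string concatenation
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] { rs.getString("CUI"), rs.getString("STR") });
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Only the pattern is printed here; findByName needs a live database.
        System.out.println(containsPattern("breast cancer"));
    }
}
```

A leading `%` typically prevents the database from using an ordinary index on STR, which is one reason partial-match search is slow in a relational database.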
Apache Lucene
• Relational databases are not optimized for string
search (e.g. partial matches, phrases)
• Apache Lucene
– http://lucene.apache.org/
– High-performance text search engine library
• Ranked searching (score)
• Phrase queries, wildcard queries, proximity queries…
– Java API to:
• build indexes
• perform lookups
– Integrates nicely with UIMA
Apache Lucene index
• Indexes stored on disk and loaded at runtime
• Documents
– Index entries with indexable fields
– The set of fields does not need to be the same for each document
– Searches target one field at a time and return the whole matching document
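The document model described above can be mimicked in a few lines of plain Java. This is a toy illustration of the idea, not Lucene's API or its index structure: `ToyIndex` and its methods are hypothetical, matching is a simple case-insensitive comparison, and the sample entries are taken from the example table in these slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: a document is a set of named fields, and documents may have different fields.
final class ToyIndex {
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Add a document given as field/value pairs, e.g. "CUI", "C0001175", "STR", "AIDS".
    void addDocument(String... fieldValuePairs) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldValuePairs.length; i += 2) {
            doc.put(fieldValuePairs[i], fieldValuePairs[i + 1]);
        }
        documents.add(doc);
    }

    // Search one field and return the whole matching documents.
    List<Map<String, String>> search(String field, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : documents) {
            String stored = doc.get(field); // null when this document lacks the field
            if (stored != null && stored.equalsIgnoreCase(value)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("CUI", "C0001175", "SAB", "MSH", "STR", "Acquired Immunodeficiency Syndromes");
        index.addDocument("CUI", "C0001175", "LAT", "ENG", "SAB", "SNOMEDCT", "STR", "AIDS", "EXTRA", "genial");
        index.addDocument("CUI", "C0001175", "LAT", "FRE", "SAB", "SNOMEDCT", "STR", "SIDA");
        System.out.println(index.search("STR", "sida"));
    }
}
```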
Document = one row; Field = one column ("-" = field not set):
CUI       LAT  SAB       STR                                  EXTRA
C0001175  -    MSH       Acquired Immunodeficiency Syndromes  -
C0001175  ENG  SNOMEDCT  AIDS                                 genial
C0001175  FRE  SNOMEDCT  SIDA                                 -
• Default match scoring
– Higher ranks = good overlap, non-frequent words, short fields
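The rare-word and short-field effects can be made concrete with a rough sketch. It assumes the classic TF-IDF factors (square-root term frequency, logarithmic inverse document frequency, inverse square-root length normalization) that Lucene's default similarity of that era is based on. `ToyScoring` is a hypothetical class, and query normalization, coordination, and boosts are deliberately left out, so the numbers will not match real Lucene scores.

```java
// Rough sketch of classic TF-IDF scoring factors; not Lucene's full scoring function.
final class ToyScoring {
    // More occurrences of the term in the field help, with diminishing returns.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Terms that appear in few documents weigh more than common ones.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Shorter fields get a higher weight than longer ones.
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    // Contribution of one matching query term to a document's score.
    static double termScore(int freq, int docFreq, int numDocs, int numTermsInField) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // Same term frequency, rare versus common term, in a one-term field.
        System.out.println(termScore(1, 9, 1000, 1));
        System.out.println(termScore(1, 499, 1000, 1));
        // Same rare term in a longer field.
        System.out.println(termScore(1, 9, 1000, 4));
    }
}
```

"Good overlap" corresponds to summing such contributions over every query term that the document matches.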
Apache Lucene Analyzer
• Defines the pre-processing step applied to
– Strings indexed by Lucene
– Strings that are looked up in the index
• Components
– Tokenizer : creates token stream (e.g. based on white spaces)
– Filter: applied to token stream (e.g. lower case, stop words)
• This is a good place to customize the matching algorithm, but see also:
– Language-specific analyzers (e.g. Arabic, Chinese, Catalan)
– CustomScoreQuery (to customize the scoring function)
– WildcardQuery, FuzzyQuery, RegexpQuery
– KeywordAnalyzer (no tokenization)
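The tokenizer-then-filters pipeline can be sketched without Lucene to show what happens to a string before it is indexed or looked up. Everything here is illustrative: `ToyAnalyzer` is a hypothetical class, the tokenizer splits on white space only, and the stop-word list is a tiny made-up sample rather than Lucene's English list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: whitespace tokenizer followed by lower-case and stop-word filters.
final class ToyAnalyzer {
    // Illustrative stop words only; not the list used by any Lucene analyzer.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "with"));

    // Tokenizer: create the token stream by splitting on white space.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Filter 1: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter 2: drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // The same chain must be applied to indexed strings and to query strings.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Cancer of the Breast"));
    }
}
```

Applying one chain at index time and a different one at query time is a common source of missed matches, which is why the search code reuses the analyzer type used for indexing.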
Building an index
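import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
//create reference to Lucene index to be stored on disk
Directory dir = FSDirectory.open(new File(indexPath));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); //tokenizer + filters
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter writer = new IndexWriter(dir, iwc); //get index writer
…
Document doc = new Document(); //create new entry (i.e. document)
Field myfield = new TextField("term", term, Field.Store.YES); //create analyzed, stored field
doc.add(myfield); //add field to document
…
writer.addDocument(doc); //add document to index
…
writer.close(); //commit changes and release the index
StandardAnalyzer = StandardTokenizer with StandardFilter, LowerCaseFilter and
StopFilter, using a list of English stop words.
Other analyzer examples: WhitespaceAnalyzer, KeywordAnalyzer.
Field.Store.YES = the original field value is stored in the index, so it can be
retrieved from a matching document (a TextField is always indexed)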
http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html
Creating index queries
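import java.io.File;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
//create reference to existing Lucene index stored on disk
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
//prepare search (use the same analyzer type as for indexing)
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
//create query on the "term" field
QueryParser parser = new QueryParser(Version.LUCENE_40, "term", analyzer);
Query query = parser.parse("hello*"); //search for terms that start with 'hello'
//search
TopDocs results = searcher.search(query, 5); //search for top 5 matches
//collect results
ScoreDoc[] hits = results.scoreDocs; //collect matches
int numTotalHits = results.totalHits; //count number of results
…
Document doc = searcher.doc(hits[0].doc); //retrieve first matching entry
float score = hits[0].score; //retrieve score of first matching entry (a float)
String term = doc.get("term"); //retrieve stored value of field "term"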
http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/SearchFiles.html
Let's get to work!
• Download necessary files
– Apache Lucene Core API
• http://lucene.apache.org/core/mirrors-core-latest-redir.html?
– MySQL Java connector
• http://dev.mysql.com/downloads/connector/j/
– Files for this tutorial
• Create Eclipse project
– Add necessary JAR files to build path
– Copy source files to project src folder
• Complete code to:
– Build index from MySQL query (don’t use all concepts!!)
– Create search function that returns the CUIs of matching terms
Merci!
[C2986674] Thank you (NCI)