Quality Assurance of the Content of a Large DL-based Terminology using Mixed Lexical and Semantic Criteria: Experience with SNOMED CT
Alan Rector, Luigi Iannone, Robert Stevens
“A report from the trenches”
►
SNOMED-CT - mandated terminology for electronic patient records in UK, US, & worldwide aspirations
►
The result of a merger of two other systems
•
SNOMED and Clinical Terms v3
•
Long history with much opportunity for error
►
Expressed in a Description Logic and now available in OWL
•
subset of EL++ without disjoint axioms
►
Has been resistant to independent analysis although many known problems
•
Despite several global QA attempts based on lexical criteria that have identified errors without explaining them
2
It’s very big - and classification matters
►
~400,000 Concepts/Classes; >1,000,000 axioms
►
Much of the richness is only evident in classified form
►
Most errors only present in classified form
[Figure: stated form vs. classified form]
3
…and some classification horrendously complicated (Skin of Ankle)
4
An experiment of opportunity
►
The opportunities
►
Tried to use SNOMED for Commercial Collaboration on Clinical Systems
►
Tried to use SNOMED as contribution to WHO’s revision of International Classification of Diseases (ICD-11)
►
Problems with both
►
Therefore, an experiment to see if QA & repair were possible
•
Conventional wisdom said that it was not
►
However, we had new resources
►
Core Problem List Subset from NLM (8500 most used classes)
►
Software to extract “modules”
►
SNOROCKET Classifier for EL++
►
4-8GB machines
5
Step 1: Cut it down & find a classifier
►
Find a subset
►
UMLS Core Problem List subset -
►
8500 most used disease concepts
•
Collected by US National Library of Medicine by combining sets from 6 major institutions.
►
Extract a “Module” (built into OWL API v3)
►
Use core subset as “signature”
►
Guaranteed that all inferences amongst the classes in the “signature” that hold in the whole ontology will also hold in the module
►
35,000 concepts - including most of anatomy
►
Find a classifier that can cope - at least two for checking
►
SNOROCKET (EL++) polynomial time subset of OWL (30 sec)
►
Pellet 2.1 (200 sec)
►
FaCT++ (250 sec)
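The idea of growing a module outwards from a signature can be sketched roughly as follows. This is a much-simplified reachability closure over definitions, not the locality-based extractor built into the OWL API, and it does not carry that extractor's formal guarantee. The axiom encoding, the function name `extract_module` and the sample classes are all assumptions made for illustration.

```python
def extract_module(signature, axioms):
    """Collect every axiom reachable from the signature.

    Each axiom is (defined_class, names_used_in_definition). An axiom is
    pulled in when its defined class is already needed; the names it uses
    then become needed too, until nothing changes.
    """
    needed = set(signature)
    module = []
    changed = True
    while changed:
        changed = False
        for axiom in axioms:
            defined, used = axiom
            if defined in needed and axiom not in module:
                module.append(axiom)
                needed |= set(used)
                changed = True
    return module, needed


# Hypothetical toy axioms, not taken from SNOMED.
AXIOMS = [
    ("Hypertension", ("Disorder of blood vessel", "Systemic arterial structure")),
    ("Disorder of blood vessel", ("Disorder",)),
    ("Fracture of femur", ("Femur",)),
]

module, names = extract_module({"Hypertension"}, AXIOMS)
```

Here the axiom for the unrelated class is left out of the module, which is the effect that makes a 35,000-concept module tractable where the full terminology is not.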
6
Step 2: Pick some areas of interest to clinicians: some with anomalies already spotted
►
Myocardial Infarction (Heart attack)
►
Should be a kind of Ischemic Heart Disease, but wasn’t
►
Hypertension (High blood pressure)
►
Odd to find it a kind of Soft Tissue disorder
►
Diabetes
►
Odd to find it as a Disorder of the Abdomen
►
Allergies
►
Odd to find some but not all autoimmune disorders classified as Allergies.
►
…
7
Look at classification: Most initial errors spotted looking upwards
►
Look up hierarchy (with OWLViz)
►
Let clinicians find important concepts and check them
•
Face validity and then look up the hierarchy
►
Check any anomalies against the complete SNOMED in standard browser
•
Guard against artifacts in various transformations
►
Trace anomalies to their root
►
Decide which links to add or break
►
Decide how to break them
►
Edit, classify and check
•
Hierarchies
•
Usages
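Looking up the hierarchy amounts to collecting all ancestors of a concept and letting an expert flag the surprising ones. A minimal sketch, assuming the classified hierarchy is available as a child-to-parents mapping; the fragment below is invented for illustration and is not the actual SNOMED path.

```python
def ancestors(cls, parents):
    """Return every class reachable from `cls` by following parent links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unexpected_ancestors(cls, parents, suspicious):
    """Ancestors of `cls` that appear on an expert-supplied watch list."""
    return sorted(ancestors(cls, parents) & set(suspicious))


# Hypothetical classified fragment (child -> direct parents).
PARENTS = {
    "Hypertension": ["Disorder of blood vessel"],
    "Disorder of blood vessel": [
        "Disorder of cardiovascular system",
        "Soft tissue disorder",
    ],
    "Disorder of cardiovascular system": ["Disease"],
    "Soft tissue disorder": ["Disease"],
}

flags = unexpected_ancestors(
    "Hypertension", PARENTS, ["Soft tissue disorder", "Disorder of abdomen"]
)
```

The watch list stands in for the clinician's judgement; in practice the anomaly is spotted by eye in OWLViz and then traced to its root.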
8
OWLViz Upwards for Hypertension
9
And check for the desired result
10
Check in standard browser in full SNOMED (snob.eggbird.eu/)
11
Examine definition & formulate solution
Disorder of blood vessel that
    (Finding site some Systemic arterial structure) and
    (Has definitional manifestation some Increased blood pressure)

Disorder of blood vessel that
    (Finding site some Cardiovascular system structure) and
    (Has definitional manifestation some Increased blood pressure)
12
Then check usages for unwanted results: is there anything that should relate to arteries instead of Cardiovascular system?
13
Also look down hierarchy: Combine lexical & semantic search
►
Hard to spot what is missing
►
Hypertensive disorders included some complications as well as kinds of hypertension. Did it contain them all?
►
Use OPPL combining lexical, owl semantics & queries
►
?C:CLASS=MATCH(".*[Hh]ypertensive.*")                  [lexical]
SELECT ?C SubClassOf Thing                             [open world OWL semantics]
WHERE FAIL ?C SubClassOf "Hypertensive disorder"       [closed world query]
BEGIN ADD ?C SubClassOf Candidate_hypertensive END;    [action]
►
Classify and look at odd cases …
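The same combination of a lexical match with a closed-world check can be sketched outside OPPL. This is an illustration only, assuming the classified hierarchy has been exported as a parent-to-children mapping; the class names other than "Hypertensive disorder" are hypothetical.

```python
import re


def descendants(root, children):
    """Return all classes below `root` in a parent -> children mapping."""
    seen = set()
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def lexical_candidates(pattern, root, class_names, children):
    """Names matching `pattern` that are NOT already classified under `root`."""
    regex = re.compile(pattern)
    known = descendants(root, children) | {root}
    return sorted(
        name for name in class_names
        if regex.fullmatch(name) and name not in known
    )


# Hypothetical toy data.
CHILDREN = {
    "Hypertensive disorder": ["Essential hypertension", "Hypertensive retinopathy"],
}
NAMES = [
    "Hypertensive disorder",
    "Essential hypertension",
    "Hypertensive retinopathy",
    "Hypertensive encephalopathy",
    "Diabetes mellitus",
]

candidates = lexical_candidates(
    ".*[Hh]ypertensive.*", "Hypertensive disorder", NAMES, CHILDREN
)
```

The "not already under the root" test is the closed-world part: it only makes sense over the classified hierarchy, which is why the script is run after classification.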
14
Classify and look at odd cases
15
Look for regularities
►
Of hypertensive complications
►
1 linked to Hypertensive disorder by property due to
►
1 linked to Hypertensive disorder by property associated with
►
2 are subclasses of Hypertensive disorder
►
2 not linked at all
►
No class for Hypertensive complication
►
Although there is a class for Diabetic complication
►
Regularise
►
Create classes for
•
Hypertension
•
Hypertensive complication
•
Hypertension AND/OR Hypertensive complication
►
Edit all complications to schema:
Disorder due to some Hypertension
16
Which concept should carry the old ID?
►
Look at usages of Hypertensive disorder
►
All fit Hypertension; none fit Hypertensive complication
►
Therefore, label original ID for Hypertensive disorder as
Hypertension
•
New Hierarchy:
‣
Hypertension AND/OR Hypertensive complication (new ID/concept)
    Hypertension (old ID/concept)
        … kinds of hypertension
    Hypertensive complication (new ID/concept)
        … kinds of hypertensive complication
17
Looking down hierarchy: Analysis by categorisation
►
Even short alphabetic lists are difficult to check
►
Break it up logically
18
Always trace errors to root to fix mishmash modelling
►
Simple error
►
The axiom that Skin is a kind of Soft tissue was omitted
►
Therefore Injuries to skin are not listed as kinds of
Soft tissue injuries
►
Authors have noticed some cases and tried to compensate
►
Cut of skin of foot is a kind of soft tissue injury, but Cut of the skin of lower limb was NOT a soft tissue injury
►
One axiom to fix it all: Skin subClassOf SoftTissue
•
And then a script to find the redundant axioms
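A script of this kind could look roughly like the sketch below: an asserted link is redundant if it can still be derived from the other links. The example is simplified to plain subclass edges between anatomical classes, whereas the real cases involve injuries classified via their finding site; the edge set is hypothetical.

```python
def reachable(start, goal, edges, skip):
    """True if `goal` can be reached from `start` without using edge `skip`."""
    stack = [start]
    seen = set()
    while stack:
        node = stack.pop()
        for sub, sup in edges:
            if sub != node or (sub, sup) == skip or sup in seen:
                continue
            if sup == goal:
                return True
            seen.add(sup)
            stack.append(sup)
    return False


def redundant_edges(edges):
    """Asserted (sub, sup) edges that are implied by the remaining edges."""
    return sorted(e for e in edges if reachable(e[0], e[1], edges, skip=e))


# Hypothetical asserted edges after adding the missing Skin axiom.
EDGES = {
    ("Skin", "Soft tissue"),            # the one missing axiom, now added
    ("Skin of foot", "Skin"),
    ("Skin of foot", "Soft tissue"),    # earlier manual compensation
}

to_review = redundant_edges(EDGES)
```

The output is a review list rather than an automatic deletion, since an author may have had a reason for a particular assertion.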
19
Trace errors to their roots: Incomplete modelling: Example
►
Why is Myocardial Infarction not a kind of Ischemic Heart Disease?
•
Ischemia = “lack of blood supply”
•
Myocardium = “Heart muscle”
►
Infarction
not fully defined in SNOMED. References say…
•
“Tissue death due to ischemia”
►
Ischemic heart disease
not fully defined in SNOMED. References say…
•
Heart disease due to ischemia
►
Ischemic disorder
•
does not exist in SNOMED. Natural closure: Disorder due to some Ischemia (NB: always involves Cardiovascular system)
►
Add definitions, and Myocardial infarction is classified correctly
►
Also discover a long list of Ischemic diseases that have not been classified as cardiovascular
►
Check lexically for other uses of “ischemic”
►
None found in this subset
20
Error in schema for anatomy: Conflates branches with parts
►
Example
►
Injury to artery of the ankle is located in the pelvis and in the abdomen (as well as the ankle)!
►
Extends to all nerves & blood vessels
►
Requires a generic change
►
Simplest involves about 20 axioms for arteries
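The effect of conflating branches with parts can be illustrated with a small sketch. The anatomy below is a deliberately simplified, hypothetical chain, and the function and variable names are assumptions; the point is only that following branch links as if they were part links drags in every region the parent trunks pass through.

```python
def regions(structure, part_of, branch_of, conflate_branches):
    """Everything `structure` is located in, following part-of links and,
    if `conflate_branches` is set, branch-of links as though they were part-of."""
    links = {name: list(wholes) for name, wholes in part_of.items()}
    if conflate_branches:
        for name, trunks in branch_of.items():
            links.setdefault(name, []).extend(trunks)
    seen = set()
    stack = [structure]
    while stack:
        for whole in links.get(stack.pop(), ()):
            if whole not in seen:
                seen.add(whole)
                stack.append(whole)
    return seen


# Hypothetical, simplified anatomy.
PART_OF = {
    "Artery of ankle": ["Ankle"],
    "Iliac artery": ["Pelvis"],
    "Abdominal aorta": ["Abdomen"],
}
BRANCH_OF = {
    "Artery of ankle": ["Iliac artery"],
    "Iliac artery": ["Abdominal aorta"],
}

conflated = regions("Artery of ankle", PART_OF, BRANCH_OF, True)
separated = regions("Artery of ankle", PART_OF, BRANCH_OF, False)
```

With the two relations kept apart, the artery of the ankle is located only in the ankle; with them conflated, it is also located in the pelvis and the abdomen.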
21
Overgeneralisation – explains many arguments
►
The dictionary says “Neuropathy” is a disease of nerves
►
But in practice it is a “dysfunction” of nerves
•
Doctors don’t consider tumors or injuries to nerves to be neuropathies
►
SNOMED often does not distinguish structural and functional disorders
•
Needs a consistent pattern:
23
Naming issues
►
All SNOMED terms have at least two names
►
“Fully qualified name” & “Preferred name”
►
“Fully qualified names” should be consistent but…
►
Example - conflicting names
►
“Immune hypersensitivity disorder (disorder)” = “Allergic disorder”
►
Structure nodes in SEP triples
•
“Structure of X”, “X Structure”, X
‣
Leads to “Swelling of gums” being a kind of “Swelling of face”
24
Doing everything in a separate module (insofar as possible)
►
Perform queries as “probes”
►
Keep changes in Modules
►
Compromise: System of diffs and merges
25
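A system of diffs and merges can be sketched at its simplest as set operations over axioms. This is an assumption about what such a system minimally needs, not a description of the tooling actually used; axioms are represented as plain strings and the examples are hypothetical.

```python
def diff_axioms(old, new):
    """Axioms added and removed when going from `old` to `new`."""
    return {"added": new - old, "removed": old - new}


def apply_diff(base, diff):
    """Replay a recorded diff on top of another copy of the ontology."""
    return (base - diff["removed"]) | diff["added"]


# Hypothetical axioms rendered as strings.
ORIGINAL = {
    "Hypertension SubClassOf Disorder of blood vessel",
    "Skin of foot SubClassOf Soft tissue",
}
REPAIRED = {
    "Hypertension SubClassOf Disorder of blood vessel",
    "Skin SubClassOf Soft tissue",
}

change = diff_axioms(ORIGINAL, REPAIRED)
replayed = apply_diff(ORIGINAL, change)
```

Keeping the diff as a separate artefact is what allows the repairs to live outside the released terminology and be reapplied to a later release.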
Summary: QA of a large DL-based ontology is possible!
►
Find a useful subset and use it as signature to extract a manageable module
►
Start with things that are important to your experts
►
Look upwards rather than downwards in the first instance
►
Follow up analogies and patterns
►
When looking downwards enrich categorization to reduce noise
•
Combine lexical and semantic techniques
►
Analysis by synthesis -
►
test alternative potential changes with classifier
►
as far as possible in a separate module; scripting where possible
►
Tooling gaps / weaknesses
►
Scripting tools need work
►
Combining filtering with imports
►
Diffs & change management – needed but don’t do enough
►
Log everything!
26