Quality Assurance of the Content of a Large DL-based Terminology using Mixed Lexical and Semantic Criteria: Experience with SNOMED CT

Alan Rector, Luigi Iannone, Robert Stevens

“A report from the trenches”

SNOMED CT: mandated terminology for electronic patient records in the UK and US, with worldwide aspirations

The result of a merger of two other systems

SNOMED and Clinical Terms v3

Long history with much opportunity for error

Expressed in a Description Logic and now available in OWL

subset of EL++ without disjoint axioms

Has been resistant to independent analysis, although it has many known problems

This is despite several global QA attempts based on lexical criteria, which identified errors without explaining them


It’s very big - and classification matters

~400,000 Concepts/Classes; >1,000,000 axioms

Much of the richness is only evident in classified form

Most errors only present in classified form

Figure: stated form vs. classified form

…and some classification horrendously complicated (Skin of Ankle)


An experiment of opportunity

The opportunities

Tried to use SNOMED for Commercial Collaboration on Clinical Systems

Tried to use SNOMED as a contribution to WHO’s revision of the International Classification of Diseases (ICD-11)

Problems with both

Therefore, an experiment to see whether QA & repair were possible

Conventional wisdom said that it was not

However, we had new resources

Core Problem List Subset from NLM (8500 most used classes)

Software to extract “modules”

SNOROCKET Classifier for EL++

4-8GB machines


Step 1: Cut it down & find a classifier

Find a subset

UMLS Core Problem List subset -

8500 most used disease concepts

Collected by US National Library of Medicine by combining sets from 6 major institutions.

Extract a “Module” (built into OWL API v3)

Use core subset as “signature”

Guaranteed that all inferences amongst the classes in the “signature” that hold in the whole ontology also hold in the module

35,000 concepts - including most of anatomy

Find a classifier that can cope - at least two for checking

SNOROCKET (EL++) polynomial time subset of OWL (30 sec)

Pellet 2.1 (200 sec)

FaCT++ (250 sec)
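The idea of using a subset as a “signature” can be pictured with a toy model. The sketch below is not the module extraction algorithm built into the OWL API; it is a simplified reachability closure, assuming each class definition is reduced to the set of class names it mentions. The class names, the `AXIOMS` table and the `extract_module` helper are all illustrative assumptions.

```python
from typing import Dict, Set

# Toy ontology: each class maps to the class names used in its definition
# (superclasses and fillers of restrictions). All names are illustrative.
AXIOMS: Dict[str, Set[str]] = {
    "Hypertension": {"Disorder of blood vessel", "Systemic arterial structure"},
    "Disorder of blood vessel": {"Disorder", "Blood vessel structure"},
    "Systemic arterial structure": {"Blood vessel structure"},
    "Blood vessel structure": {"Body structure"},
    "Fracture of femur": {"Disorder", "Femur structure"},
    "Femur structure": {"Body structure"},
}


def extract_module(signature: Set[str], axioms: Dict[str, Set[str]]) -> Set[str]:
    """Return every class reachable from the signature via definitions."""
    module: Set[str] = set()
    todo = list(signature)
    while todo:
        name = todo.pop()
        if name in module:
            continue
        module.add(name)
        # Pull in everything the definition of `name` refers to.
        todo.extend(axioms.get(name, ()))
    return module


module = extract_module({"Hypertension"}, AXIOMS)
print(sorted(module))
```

Classes that nothing in the signature depends on (here the femur branch) stay out of the module, which is what makes the result small enough to classify repeatedly.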


Step 2: Pick some areas of interest to clinicians: some with anomalies already spotted

Myocardial Infarction (Heart attack)

Should be a kind of Ischemic Heart Disease, but wasn’t

Hypertension (High blood pressure)

Odd to find it a kind of Soft Tissue disorder

Diabetes

Odd to find it as a Disorder of the Abdomen

Allergies

Odd to find some but not all autoimmune disorders classified as Allergies.


Look at classification: Most initial errors spotted looking upwards

Look up hierarchy (with OWLViz)

Let clinicians find important concepts and check them

Face validity and then look up the hierarchy

Check any anomalies against the complete SNOMED in standard browser

Guard against artifacts in various transformations

Trace anomalies to their root

Decide which links to add or break

Decide how to break them

Edit, classify and check

Hierarchies

Usages
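A minimal sketch of “looking upwards”, assuming the classified hierarchy has been exported as a map from each class to its direct inferred superclasses. The hierarchy shown and the `accepted` set (standing in for a clinician's review) are illustrative assumptions, not SNOMED content.

```python
from typing import Dict, List, Set

# Toy classified hierarchy: class -> direct inferred superclasses (illustrative).
PARENTS: Dict[str, List[str]] = {
    "Hypertension": ["Disorder of blood vessel"],
    "Disorder of blood vessel": ["Soft tissue disorder", "Cardiovascular disorder"],
    "Soft tissue disorder": ["Disorder"],
    "Cardiovascular disorder": ["Disorder"],
    "Disorder": [],
}


def ancestors(cls: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all direct and indirect superclasses of cls."""
    seen: Set[str] = set()
    stack = list(parents.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen


# Ancestors a clinician has accepted as plausible (assumed review outcome).
accepted = {"Disorder of blood vessel", "Cardiovascular disorder", "Disorder"}
anomalies = ancestors("Hypertension", PARENTS) - accepted
print(anomalies)
```

Anything left in `anomalies` is a candidate for tracing back to its root cause, as in the steps above.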


OWLViz: looking upwards from Hypertension


And check for the desired result


Check in standard browser in full SNOMED (snob.eggbird.eu/)


Examine definition & formulate solution

Disorder of blood vessel
  that (Finding site some Systemic arterial structure)
  and (Has definitional manifestation some Increased blood pressure)

Disorder of blood vessel
  that (Finding site some Cardiovascular system structure)
  and (Has definitional manifestation some Increased blood pressure)
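To see why changing the finding site can change the classification, here is a minimal structural-subsumption sketch. It is a drastic simplification of what an EL++ classifier does (one genus and one filler per role), and the `IS_A` table, in particular the assumption that Systemic arterial structure sits under Soft tissue structure, is hypothetical and only for illustration.

```python
from typing import Dict, NamedTuple, Optional


class Definition(NamedTuple):
    genus: str              # the named superclass
    roles: Dict[str, str]   # role -> filler of an existential restriction


# Assumed primitive hierarchy (child -> parent); not taken from SNOMED.
IS_A: Dict[str, str] = {
    "Systemic arterial structure": "Soft tissue structure",
    "Soft tissue structure": "Body structure",
    "Cardiovascular system structure": "Body structure",
    "Disorder of blood vessel": "Disorder",
}


def is_a(sub: Optional[str], sup: str) -> bool:
    """Walk up the primitive hierarchy from sub looking for sup."""
    while sub is not None:
        if sub == sup:
            return True
        sub = IS_A.get(sub)
    return False


def subsumed_by(c: Definition, d: Definition) -> bool:
    """True if c meets every condition of the fully defined class d."""
    return is_a(c.genus, d.genus) and all(
        role in c.roles and is_a(c.roles[role], filler)
        for role, filler in d.roles.items()
    )


soft_tissue_disorder = Definition("Disorder", {"Finding site": "Soft tissue structure"})
before = Definition(
    "Disorder of blood vessel",
    {
        "Finding site": "Systemic arterial structure",
        "Has definitional manifestation": "Increased blood pressure",
    },
)
after = before._replace(
    roles={**before.roles, "Finding site": "Cardiovascular system structure"}
)

print(subsumed_by(before, soft_tissue_disorder))
print(subsumed_by(after, soft_tissue_disorder))
```

Under these assumptions the first definition is inferred to be a soft tissue disorder and the second is not, which is the kind of effect to confirm with the real classifier after each edit.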


Then check usages for unwanted results: is there anything that should relate to arteries instead of the Cardiovascular system?


Also look down hierarchy: Combine lexical & semantic search

Hard to spot what is missing

Hypertensive disorders included some complications as well as kinds of hypertension. Did it contain them all?

Use OPPL combining lexical, OWL semantics & queries

?C:CLASS=MATCH(“.*[Hh]ypertensive.*”)
SELECT ?C SubClassOf Thing
WHERE FAIL ?C SubClassOf “Hypertensive disorder”
BEGIN
ADD ?C SubClassOf Candidate_hypertensive
END;

MATCH: lexical
SELECT: open world OWL semantics
WHERE FAIL: closed world query
BEGIN … END: action

Classify and look at odd cases …
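The same combination of lexical and semantic criteria can be sketched outside OPPL. The Python below is an analogue, not a translation: it assumes the inferred ancestors of each class have already been exported from a classifier into a dictionary, and the class names and groupings are toy data rather than SNOMED's actual classification.

```python
import re
from typing import Dict, List, Set

# Toy export of a classified ontology: class -> all inferred ancestors.
INFERRED_ANCESTORS: Dict[str, Set[str]] = {
    "Essential hypertension": {"Hypertensive disorder"},
    "Hypertensive heart disease": {"Hypertensive disorder", "Heart disease"},
    "Hypertensive retinopathy": {"Retinal disorder"},
    "Hypertensive encephalopathy": {"Disorder of brain"},
    "Diabetic retinopathy": {"Retinal disorder"},
}

# Lexical criterion: same regular expression as the OPPL MATCH clause.
PATTERN = re.compile(r".*[Hh]ypertensive.*")


def candidates(ancestors: Dict[str, Set[str]], target: str) -> List[str]:
    """Names that match lexically but are NOT inferred subclasses of target."""
    return sorted(
        name
        for name, supers in ancestors.items()
        if PATTERN.fullmatch(name) and target not in supers
    )


candidate_hypertensive = candidates(INFERRED_ANCESTORS, "Hypertensive disorder")
print(candidate_hypertensive)
```

The output is a worklist for human review, playing the role of the `Candidate_hypertensive` class in the script: each entry looks hypertensive by name but is not classified that way.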


Look for regularities

Of hypertensive complications

1 linked to Hypertensive disorder by property due to

1 linked to Hypertensive disorder by property associated with

2 are subclasses of Hypertensive disorder

2 not linked at all

No class for Hypertensive complication

Although there is a class for Diabetic complication

Regularise

Create classes for

Hypertension

Hypertensive complication

Hypertension AND/OR Hypertensive complication

Edit all complications to schema:

Disorder due to some Hypertension
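A small sketch of the tally-then-regularise step, assuming the six complications and their current links have been collected into a table. The names “Complication A” to “Complication F” are placeholders for the real classes; only the counts mirror the slide.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Link = Optional[Tuple[str, str]]  # (relation, target) or None if unlinked

# Placeholder names; the distribution of link types follows the slide.
LINKS: Dict[str, Link] = {
    "Complication A": ("due to", "Hypertensive disorder"),
    "Complication B": ("associated with", "Hypertensive disorder"),
    "Complication C": ("subclass of", "Hypertensive disorder"),
    "Complication D": ("subclass of", "Hypertensive disorder"),
    "Complication E": None,
    "Complication F": None,
}

# Step 1: look for regularities in how complications are currently linked.
tally = Counter(link[0] if link else "not linked" for link in LINKS.values())
print(tally)

# Step 2: rewrite every complication to the single agreed schema.
regularised = {name: ("due to", "Hypertension") for name in LINKS}
print(regularised["Complication E"])
```

Deciding that each entry really is a complication remains a clinical judgement; the script only makes the irregularity visible and applies the agreed pattern uniformly.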


Which concept should carry the old ID?

Look at usages of Hypertensive disorder

All fit Hypertension; none fit Hypertensive complication

Therefore, label original ID for Hypertensive disorder as

Hypertension

New Hierarchy:

Hypertension AND/OR Hypertensive complication (new ID/concept)
  Hypertension (old ID/concept)
    … kinds of hypertension
  Hypertensive complication (new ID/concept)
    … kinds of hypertensive complication


Looking down hierarchy: Analysis by categorisation

Even short alphabetic lists are difficult to check

Break it up logically


Always trace errors to root to fix mish mash modelling

Simple error

The axiom that Skin is a kind of Soft tissue was omitted

Therefore Injuries to skin are not listed as kinds of Soft tissue injuries

Authors have noticed some cases and tried to compensate

Cut of skin of foot is a kind of soft tissue injury, but Cut of the skin of lower limb was NOT a soft tissue injury

One axiom to fix it all: Skin subClassOf SoftTissue

And then a script to find the redundant axioms
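One way such a script could work, sketched on plain subclass edges: an asserted edge is redundant if its superclass is still reachable once that edge is set aside. This is an assumption about the approach, not the script actually used; the edge list is toy data in which “Injury of skin is a Soft tissue injury” stands for what the classifier infers after the repair, and the hierarchy is assumed to be acyclic.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (subclass, superclass)

EDGES: List[Edge] = [
    ("Injury of skin", "Soft tissue injury"),            # follows from the repair
    ("Cut of skin of foot", "Injury of skin"),
    ("Cut of skin of foot", "Soft tissue injury"),        # hand-added compensation
    ("Cut of skin of lower limb", "Injury of skin"),
]


def reachable(start: str, goal: str, edges: List[Edge]) -> bool:
    """Depth-first search upwards from start looking for goal."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(sup for sub, sup in edges if sub == node)
    return False


def redundant_edges(edges: List[Edge]) -> List[Edge]:
    """Edges whose superclass is still reachable without the edge itself."""
    result = []
    for edge in edges:
        others = [e for e in edges if e != edge]
        if reachable(edge[0], edge[1], others):
            result.append(edge)
    return result


redundant = redundant_edges(EDGES)
print(redundant)
```

Only the hand-added shortcut is reported, so it can be removed without changing the classified hierarchy.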


Trace errors to their roots: Incomplete modelling: Example

Why is Myocardial Infarction not a kind of Ischemic Heart Disease?

Ischemia = “lack of blood supply”

Myocardium = “Heart muscle”

Infarction

not fully defined in SNOMED. References say…

“Tissue death due to ischemia”

Ischemic heart disease

not fully defined in SNOMED. References say…

Heart disease due to ischemia

Ischemic disorder

does not exist in SNOMED. Natural closure…

Disorder due to some Ischemia (NB: always involves the Cardiovascular system)

Add the definitions, and Myocardial infarction is classified correctly

Also discover a long list of Ischemic diseases that have not been classified as cardiovascular

Check lexically for other uses of “ischemic”

None found in this subset


Error in schema for anatomy: Conflates branches with parts

Example

Injury to artery of the ankle is located in the pelvis and in the abdomen (as well as the ankle)!

Extends to all nerves & blood vessels

Requires a generic change

Simplest involves about 20 axioms for arteries
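The effect of conflating branches with parts can be reproduced in a few lines. The sketch below assumes a deliberately simplified chain of arteries and regions (the names are illustrative, not SNOMED concepts) and a hypothetical `sites` helper that either follows branch-of links as if they were part-of links or ignores them.

```python
from typing import Dict, Optional, Set

# Simplified, illustrative chain: each artery is a branch of the next one.
BRANCH_OF: Dict[str, str] = {
    "Artery of ankle": "Artery of lower leg",
    "Artery of lower leg": "Artery of thigh",
    "Artery of thigh": "Iliac artery",
    "Iliac artery": "Abdominal aorta",
}

# Body region in which each artery lies.
LOCATED_IN: Dict[str, str] = {
    "Artery of ankle": "Ankle",
    "Artery of lower leg": "Lower leg",
    "Artery of thigh": "Thigh",
    "Iliac artery": "Pelvis",
    "Abdominal aorta": "Abdomen",
}


def sites(structure: str, treat_branch_as_part: bool) -> Set[str]:
    """Regions an injury to `structure` is inferred to be located in."""
    found: Set[str] = set()
    current: Optional[str] = structure
    while current is not None:
        found.add(LOCATED_IN[current])
        # Conflated schema: a branch inherits the locations of its trunk.
        current = BRANCH_OF.get(current) if treat_branch_as_part else None
    return found


conflated = sites("Artery of ankle", treat_branch_as_part=True)
separated = sites("Artery of ankle", treat_branch_as_part=False)
print(sorted(conflated))
print(sorted(separated))
```

With the two relations kept apart the injury is located only in the ankle; with them conflated it also lands in the pelvis and abdomen, matching the anomaly described above.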


Overgeneralisation – explains many arguments

The dictionary says “Neuropathy” is a disease of nerves

But in practice it is a “dysfunction” of nerves

Doctors don’t consider tumors or injuries to nerves to be neuropathies

SNOMED often does not distinguish structural and functional disorders

Needs a consistent pattern:


Naming issues

All SNOMED terms have at least two names

“Fully qualified name” & “Preferred name”

“Fully qualified names” should be consistent but…

Example - conflicting names

“Immune hypersensitivity disorder (disorder)” = “Allergic disorder”

Structure nodes in SEP triples

“Structure of X”, “X Structure”, X

Leads to “Swelling of gums” being a kind of “Swelling of face”
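Inconsistent naming of structure nodes can be surfaced lexically. The following is a minimal sketch, assuming the names are available as plain strings; the `normalise` rule and the sample names are illustrative assumptions, and the script only groups variants for review, it does not decide which variant denotes which node of an SEP triple.

```python
import re
from collections import defaultdict
from typing import Dict, List


def normalise(name: str) -> str:
    """Strip the 'Structure of X' / 'X structure' wrappers and lowercase."""
    n = name.strip().lower()
    n = re.sub(r"^structure of\s+", "", n)
    n = re.sub(r"\s+structure$", "", n)
    return n


# Sample names, invented for illustration.
names = ["Structure of gum", "Gum structure", "Gum", "Face structure"]

groups: Dict[str, List[str]] = defaultdict(list)
for name in names:
    groups[normalise(name)].append(name)

# Keep only the anatomical terms that appear under more than one naming form.
variants = {key: forms for key, forms in groups.items() if len(forms) > 1}
print(variants)
```

Each group with more than one form is a place where the naming convention should be checked against the intended meaning.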


Doing everything in a separate module (insofar as possible)

Perform queries as “probes”

Keep changes in Modules

Compromise: System of diffs and merges
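A system of diffs and merges can be sketched with axioms treated as opaque strings held in sets. This is an assumption about the general shape of such a system, not the tooling used here; the axiom strings and the `diff` and `apply_diff` helpers are hypothetical.

```python
from typing import Dict, FrozenSet, Set

Axioms = FrozenSet[str]


def diff(base: Axioms, edited: Axioms) -> Dict[str, Set[str]]:
    """Record which axioms an edit added and which it removed."""
    return {"added": set(edited - base), "removed": set(base - edited)}


def apply_diff(base: Axioms, change: Dict[str, Set[str]]) -> Axioms:
    """Replay a recorded change on top of a (possibly newer) base."""
    return frozenset((base - change["removed"]) | change["added"])


base = frozenset({
    "Hypertension SubClassOf Finding site some Systemic arterial structure",
    "Hypertension SubClassOf Disorder of blood vessel",
})
edited = frozenset({
    "Hypertension SubClassOf Finding site some Cardiovascular system structure",
    "Hypertension SubClassOf Disorder of blood vessel",
})

change = diff(base, edited)
print(change)

# Replaying the change on the original base reproduces the edited version.
merged = apply_diff(base, change)
print(merged == edited)
```

Keeping each repair as a small recorded change makes it possible to log, review and re-apply it when the underlying release is updated.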

Summary: QA of a large DL-based ontology is possible!

Find a useful subset and use it as signature to extract a manageable module

Start with things that are important to your experts

Look upwards rather than downwards in the first instance

Follow up analogies and patterns

When looking downwards enrich categorization to reduce noise

Combine lexical and semantic techniques

Analysis by synthesis -

test alternative potential changes with classifier

as far as possible in a separate module; scripting where possible

Tooling gaps / weaknesses

Scripting tools need work

Combining filtering with imports

Diffs & change management: needed, but current tools don’t do enough

Log everything!
