Automatic String Matching for Reduction of Drug Name Confusion


Bonnie J. Dorr, University of Maryland
Greg Kondrak, University of Alberta
December 4, 2003

Study Method for Testing Drug Name Similarity

• Overview: phonological string matching for ranking similarity between drug names
• Validation of study method: precision and recall against a gold standard
• Optimal design of study: an interface for assessing the appropriateness of a newly proposed drug name
• Strengths and weaknesses: each algorithm retrieves and misses correct answers that the others do not.

Overview: Drugname Matching

• String matching to rank similarity between drug names
• Two classes of string matching:
  – orthographic: compare strings in terms of spelling, without reference to sound
  – phonological: compare strings on the basis of a phonetic representation
• Two methods of matching:
  – distance: how far apart are two strings?
  – similarity: how close are two strings?

Orthographic and Phonological Distance/Similarity

• Orthographic. Distance: Levenshtein (string-edit). Similarity: LCSR, DICE
• Phonological. Distance: Soundex. Similarity: ALINE
• Distance compared to similarity: dist(w1, w2) is comparable to 1 − sim(w1, w2)

Orthographic Distance: Levenshtein (string-edit)

Levenshtein (string-edit): count the number of steps it takes to transform one string into another.
• Examples:
  – Distance between zantac and contac is 2.
  – Distance between zantac and xanax is 3.
• For a "global distance", we can divide by the length of the longest string (a small Python sketch follows):
  – 2/max(6,6) = 2/6 = .33
  – 3/max(5,6) = 3/6 = .5
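A minimal Python sketch of these computations; the function names are ours, for illustration only:

```python
# Illustrative sketch of Levenshtein distance and its length-normalized
# ("global") form; function names are ours, not from the study.

def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def global_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    return levenshtein(a, b) / max(len(a), len(b))

print(global_distance("zantac", "contac"))  # 2/6 = 0.33
print(global_distance("zantac", "xanax"))   # 3/6 = 0.50
```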

Orthographic Similarity: LCSR, DICE

LCSR: double the length of the longest common subsequence and divide by the total number of characters in the two strings.
• Examples:
  – Similarity between zantac & contac: (2 ∙ 4)/12 = 8/12 = .67
  – Similarity between zantac & xanax: (2 ∙ 3)/11 = 6/11 = .55

DICE: double the number of shared bigrams and divide by the total number of bigrams in the two strings.
• Examples:
  – Similarity between {za, an, nt, ta, ac} & {co, on, nt, ta, ac}: (2 ∙ 3)/(5+5) = 6/10 = .6
  – Similarity between {za, an, nt, ta, ac} & {xa, an, na, ax}: (2 ∙ 1)/(5+4) = 2/9 = .22

(A small sketch of both measures follows.)

Phonological Distance: Soundex

Soundex: transform all but the first consonant to numeric codes, delete 0's, and truncate the resulting string to 4 characters.
• Character conversion: 0 = (a,e,h,i,o,u,w,y); 1 = (b,f,p,v); 2 = (c,g,j,k,q,s,x,z); 3 = (d,t); 4 = (l); 5 = (m,n); 6 = (r)
• Examples:
  – Match: king and khyngge (k52, k52)
  – Mismatch: knight and night (k523, n23)
  – Match: pulpit and phlebotomy (p413, p413)
  – Mismatch: zantac and contac (z532, c532)
  – Mismatch: zantac and xanax (z532, x52)
• Alternative: compare syllable count, initial/final sounds, stress locations. Misses sefotan (3 syllables) / seftin (2 syllables); gelpad / hypergel.
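A sketch of this Soundex-style code in Python; as in standard Soundex, adjacent letters with the same code are assumed to collapse to a single digit (that assumption is what makes king and khyngge both come out as k52):

```python
def soundex(word: str) -> str:
    """Soundex-style code following the slide's description: keep the
    first letter, map the rest to digit classes, collapse adjacent
    identical codes (assumed, as in standard Soundex), drop the 0s,
    and truncate to 4 characters."""
    codes = {}
    for group, digit in [("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                         ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit

    word = word.lower()
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit != prev:          # collapse adjacent identical codes
            result += digit
        prev = digit
    result = result[0] + result[1:].replace("0", "")  # delete vowel codes
    return result[:4]

for w in ["zantac", "contac", "xanax", "king", "khyngge", "knight", "night"]:
    print(w, soundex(w))   # z532, c532, x52, k52, k52, k523, n23
```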

Phonological Similarity: ALINE

• Use phonological features to compare two words by their sounds (Kondrak, 2000).
  – x# → k(s): +consonantal, +velar, +stop, −voice
  – #x → z: +consonantal, +alveolar, +fricative, +voice
• Use the entire string, vowels, and decomposable features.
• Developed originally for identifying cognates in the vocabularies of related languages (colour vs. couleur).
• Feature weights can be tuned for a specific application.
• Phonological similarity of two words: the optimal match between their phonological features.
  – Example: Zantac vs. Xanax

ALINE Example: Osmitrol and Esmolol

Alignment and segment scores:
  o ~ e : 6
  s ~ s : 10
  m ~ m : 10
  ə ~ ə : 10
  t ~ (skipped) : −5
  r ~ l : 7
  ə ~ ə : 10
  l ~ l : 10
  Total: S = 58

- Identifies identical pronunciation of different letters.

- Identifies non-identical but similar sounds.

[Figure: the vocal tract]

Places of articulation: Numerical Values

bilabial 1.0

labiodental 0.95

dental 0.9

alveolar 0.85

retroflex 0.8

palato-alveolar 0.75

palatal 0.7

velar 0.6

uvular 0.5

ALINE Features: Weights and Values

Feature (weight): possible values
  – Place of articulation (40): dental, velar, palatal, …
  – Manner of articulation (50): plosive, fricative, …
  – Voicing (10): voiced, voiceless
  – Aspiration (5): aspirated, unaspirated
  – Length (5): long, short
  – Height (5): high, mid, low
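ALINE's full algorithm aligns two phoneme strings with dynamic programming; the sketch below only illustrates the core of the segment comparison: the cost of aligning two sounds is a weighted sum of their feature differences. The place scale and the feature weights come from the slides; the manner/voicing encodings, the example segments, and the function name are our own illustrative assumptions, not Kondrak's exact figures.

```python
# Illustrative sketch of ALINE-style segment comparison (not Kondrak's
# full algorithm): the difference between two sounds is a weighted sum
# of feature differences. Place values and weights are from the slides;
# manner/voicing values below are toy numbers for illustration.

PLACE = {"bilabial": 1.0, "labiodental": 0.95, "dental": 0.9,
         "alveolar": 0.85, "retroflex": 0.8, "palato-alveolar": 0.75,
         "palatal": 0.7, "velar": 0.6, "uvular": 0.5}

WEIGHTS = {"place": 40, "manner": 50, "voice": 10}

def segment_difference(p: dict, q: dict) -> float:
    """Weighted feature difference between two segments, each given as
    {"place": <name>, "manner": <0..1>, "voice": <0 or 1>}."""
    return (WEIGHTS["place"] * abs(PLACE[p["place"]] - PLACE[q["place"]])
            + WEIGHTS["manner"] * abs(p["manner"] - q["manner"])
            + WEIGHTS["voice"] * abs(p["voice"] - q["voice"]))

# [r] vs [l]: both alveolar and voiced, so only the manner term differs;
# [t] vs [d]: identical except for voicing, giving a difference of 10.
r = {"place": "alveolar", "manner": 0.8, "voice": 1}
l = {"place": "alveolar", "manner": 0.6, "voice": 1}
t = {"place": "alveolar", "manner": 1.0, "voice": 0}
d = {"place": "alveolar", "manner": 1.0, "voice": 1}
print(segment_difference(r, l))  # 50 * 0.2 ≈ 10.0
print(segment_difference(t, d))  # 10 * 1   = 10.0
```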

Validation: Comparison of Outputs

• EDIT: 0.667 zantac/contac, 0.500 zantac/xanax, 0.333 xanax/contac
• DICE: 0.600 zantac/contac, 0.222 zantac/xanax, 0.000 xanax/contac
• LCSR: 0.667 zantac/contac, 0.545 zantac/xanax, 0.364 xanax/contac
• ALINE: 0.792 zantac/xanax, 0.639 zantac/contac, 0.486 xanax/contac

Validation: Precision and Recall

• Precision and recall against an online gold standard: USP Quality Review, March 2001.
• 582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorially induced).
• Example (using DICE; "+" marks a pair in the gold standard, "−" a pair that is not):
  + 0.889 atgam / ratgam
  + 0.875 herceptin / perceptin
  − 0.870 zolmitriptan / zolomitriptan
  + 0.857 quinidine / quinine
  − 0.857 cytosar / cytosar-u
  + 0.842 amantadine / rimantadine
  …
  − 0.800 erythrocin / erythromycin

(A small precision/recall sketch follows.)
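As a concrete illustration of how precision and recall are computed from such a ranked list at a given score threshold; the variable names and the tiny data set below are ours, drawn only from the few example pairs above:

```python
# Precision/recall at a score threshold against a gold standard of true
# confusion pairs; the data is just the handful of examples listed above.

gold = {("atgam", "ratgam"), ("herceptin", "perceptin"),
        ("quinidine", "quinine"), ("amantadine", "rimantadine")}

scored = [  # (similarity, name1, name2) from some matching algorithm
    (0.889, "atgam", "ratgam"),
    (0.875, "herceptin", "perceptin"),
    (0.870, "zolmitriptan", "zolomitriptan"),
    (0.857, "quinidine", "quinine"),
    (0.857, "cytosar", "cytosar-u"),
    (0.842, "amantadine", "rimantadine"),
]

def precision_recall(scored, gold, threshold):
    retrieved = {(a, b) for s, a, b in scored if s >= threshold}
    hits = retrieved & gold
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold)
    return precision, recall

print(precision_recall(scored, gold, 0.85))  # (0.6, 0.75)
```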

[Figure: Validation: Comparison of Precision at Different Recall Values. Precision (y-axis) vs. recall (x-axis) for each algorithm; average precision: ALINE (0.36), ALINE_O (0.35), LCSR (0.32), EDIT (0.29), DICE (0.27).]

[Figure: Validation: Precision of Techniques with Phonetic Transcription. Precision (y-axis) vs. recall (x-axis); average precision: ALINE (0.36), LCSR (0.32), LCSR_P (0.32), DICE (0.27), DICE_P (0.27).]

Optimal Design of Study

• Develop and use a web-based interface that allows applicants to enter newly proposed names.
• The interface displays a set of scores produced by each approach individually, as well as combined scores based on the union of all the approaches (see the sketch below).
• The applicant compares the score to a pre-determined threshold to assess appropriateness.
• In advance, run experiments with different algorithms and their combinations against the gold standard to:
  – determine the "appropriateness" threshold
  – fine-tune: calculate weights for drugname matching
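A rough sketch of how such a combined check might look, assuming each measure returns a similarity in [0, 1] and that a proposed name is flagged whenever any existing name exceeds the threshold under any of the measures (the union of the approaches); all names and thresholds here are illustrative, not from the study:

```python
# Illustrative sketch of the proposed check: a newly proposed name is
# flagged if any approach scores it above a threshold against any
# existing drug name. The measures are stand-ins for LCSR/DICE/ALINE etc.

from typing import Callable, Iterable

def flag_confusable(proposed: str,
                    existing: Iterable[str],
                    measures: list[Callable[[str, str], float]],
                    threshold: float) -> list[tuple[str, float]]:
    """Return the existing names (with their best score) that any
    similarity measure rates at or above the threshold."""
    flagged = []
    for name in existing:
        best = max(measure(proposed, name) for measure in measures)
        if best >= threshold:
            flagged.append((name, best))
    return sorted(flagged, key=lambda x: -x[1])

# e.g. flag_confusable("xanax", ["zantac", "contac"], [lcsr, dice], 0.5),
# using the lcsr/dice sketches above, would flag zantac (LCSR 0.55).
```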

Optimal Design of Study (continued)

• Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching.
• Parameter tuning (a generic sketch follows):
  – calculate weights for drugname matching
  – "hill climbing" search against the gold standard
• Tuned parameters for the drugname task:
  – maximum score
  – insertion/deletion penalty
  – vowel penalty
  – phonological feature values
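The slides do not give the tuning procedure in detail; below is a generic hill-climbing sketch of the kind of search described: repeatedly perturb one parameter at a time and keep the change only if average precision against the gold standard improves. The evaluate() callback and the parameter names are placeholders, not the study's actual code.

```python
# Generic hill-climbing sketch for parameter tuning; evaluate() is a
# placeholder that would score drug-name pairs with the given parameters
# and return average precision against the gold standard.

import random

def hill_climb(params: dict, evaluate, steps: int = 200, delta: float = 0.1):
    best_score = evaluate(params)
    for _ in range(steps):
        name = random.choice(list(params))      # pick one parameter
        candidate = dict(params)
        candidate[name] += random.choice([-delta, +delta])
        score = evaluate(candidate)
        if score > best_score:                  # keep only improving moves
            params, best_score = candidate, score
    return params, best_score

# e.g. hill_climb({"max_score": 100, "indel_penalty": 10,
#                  "vowel_penalty": 10}, evaluate)
```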

Strengths and Weaknesses

• ALINE matches:
  – ultram / voltaren
  – nasarel / nizoral
  – lortab / luride
• DICE matches:
  – lanoksin / lasiks
  – gelpad / hypergel
  – levodopa / methyldopa
• LCSR matches:
  – edekrin / euleksin
  – verelan / virilon
  – nefroks / nifereks

Strengths and Weaknesses (continued)

ALINE

– Highest interpolated precision; easily tuned to the task; matches similar-sounding words that differ in their initial characters (ultram/voltaren).
– Misses some words with high bigram overlap (lanoksin/lasiks); the weight-tuning process may induce overfitting to the data (bupivacaine/ropivacaine vs. brevital/revia).

DICE

– Matches parts of words (bigrams) to detect confusable names that would otherwise be dissimilar (gelpad/hypergel).

– Misses similar sounding names (ultram/voltaren) that have no shared bigrams.

LCSR

– Matches words where the number of shared bigrams/sounds is small (edekrin/euleksin).
– Misses similar-sounding names (lortab/luride) that have a low subsequence overlap.

Conclusion

• Experimentation with different algorithms and their combinations against the gold standard.
• Fine-tuning based on comparisons with the gold standard (e.g., re-weighting of phonological features).
• A strong foundation for search modules that automate the minimization of medication errors.
• Solution: a combined approach that benefits from the strengths of all algorithms (increased recall) without severe degradation in precision (false positives).