
Qing Zeng-Treitler PhD
Regular Expression
Regular expression (regex) is a
sequence of characters that forms a
search pattern, mainly for use in
pattern matching with strings.
 In the 1950s, the American mathematician
Stephen Kleene first formalized the
description of a regular language.

Regex in NLP
A long history, dating from the 1960s
 More complex regular expressions arose
in the 1980s
 Most programming languages support
regular expressions

Problems
Regexes are typically hand-crafted.
 They can be difficult to maintain and
extend.

 \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
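As an illustration (not from the talk), a short Python snippet shows how a hand-crafted pattern like the one above is used, and how easily it over-matches, which is part of why such patterns are hard to maintain:

    import re

    # The hand-crafted IP-address-style pattern from the slide.
    ip_like = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

    print(ip_like.findall("Server at 10.0.12.3 responded"))    # ['10.0.12.3']
    print(ip_like.findall("Invalid address 999.999.999.999"))  # also matches: ['999.999.999.999']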

Numerous clinical variables and a limited
amount of programming resources
Regex for clinical NLP

Two main branches of NLP
 Symbolic
 Statistical

Regex is commonly used in symbolic
NLP systems
Statistical NLP
Modern NLP algorithms are based on
machine learning, especially statistical
machine learning.
 Instead of hand-written rules, machine
learning algorithms create rules or
models from text corpora.

 Supervised
 Unsupervised
Features
Essentially all statistical NLP algorithms
expect features as input.
 Features may be words, n-grams,
semantic types, the frequencies of words,
n-grams, or semantic types, etc.
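As a minimal illustration (not part of the talk), feature extraction of this kind might look like counting words and bigrams in a snippet:

    from collections import Counter

    def ngram_features(tokens, n=2):
        """Count word and n-gram features for one snippet (illustrative only)."""
        feats = Counter(tokens)                     # unigram counts
        feats.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        return feats

    tokens = "he has been suffering from lower back pain".split()
    print(ngram_features(tokens).most_common(3))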

Problems

All are be created equal hold men selfevident these that truths to we.
 We hold these truths to be self-evident that all
men are created equal.

After ED he headache in last left morning
night ourtown VA this the transferred to
sided seen waking was weakness with.
 Transferred to the Ourtown VA after waking this
morning with left sided weakness. He was seen
in the ED last night with headache.
Problems

Height weight 67 70 kg in
 Height 67 in Weight 70 kg
 Height 70 in Weight 67 kg
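A small sketch (an assumed example, not from the slides) makes the problem concrete: both readings above produce exactly the same bag of words, so order-insensitive word-count features cannot tell them apart:

    from collections import Counter

    reading_1 = "height 67 in weight 70 kg"
    reading_2 = "height 70 in weight 67 kg"

    # Identical bags of words, so word-count features cannot distinguish the two readings.
    print(Counter(reading_1.split()) == Counter(reading_2.split()))   # True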
General Goal

Improve text classification and
information extraction
Specific Aims

Develop machine learning algorithms to
generate regular expression patterns.
Regular Expression Discovery
(RED)
REDCL for text classification
 REDEx for information extraction

REDCL
Snippets: “He has been suffering from
lower back pain for 3 years.”
 Tokens: “He”, “has”, “been”, “suffering”,
“from”, “lower”, “back”, “pain”, “for”, “3”,
“years”.
 Phrases: “He”, “has been”, “suffering”,
“from”, “lower back pain”, “for”, “3 years”.
 Key: [“has been”, “suffering”, “from”]
[“has been”, “lower back pain”, “for”]

REDCL
The input of the RED algorithm is an
annotated set of text snippets.
 Annotated datasets may have n classes
or labels. RED generates regular
expressions for each class, treating the
instances in one class as positive and
the rest of the instances as negative.

Preprocessing

The dataset was first transformed into
lowercase and then tokenized using the
Penn Treebank (PTB) tokenizer
implemented by the Stanford NLP
group.
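As an illustrative stand-in (the talk used the Stanford implementation), NLTK ships a comparable Treebank tokenizer:

    from nltk.tokenize import TreebankWordTokenizer  # stand-in for the Stanford PTB tokenizer

    snippet = "He has been suffering from lower back pain for 3 years."
    tokens = TreebankWordTokenizer().tokenize(snippet.lower())
    print(tokens)
    # ['he', 'has', 'been', 'suffering', 'from', 'lower', 'back', 'pain', 'for', '3', 'years', '.']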
Pairwise Alignment

We adopted a pairwise alignment
algorithm, the Smith-Waterman
algorithm.
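A compact sketch of Smith-Waterman local alignment over token sequences; the match, mismatch, and gap scores here are illustrative, not the settings used in RED:

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        """Return the best local-alignment score between two token sequences."""
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    s1 = "he has been suffering from back pain for 3 years".split()
    s2 = "she has been treated for back pain for 2 years".split()
    print(smith_waterman(s1, s2))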
Key Extraction
Given a set of aligned phrases, a
number of keys may be generated.
 Following the alignment example, the
aligned phrases are “He has,” “for,” and
“years.” One of the keys is [“he has”,
“for”, “years”].

Regular Expression Generation

To generate regular expressions,
(\s+\S+)* or (\s+\S+){n,m} is placed
between phrases in the keys.
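A minimal sketch of that step, assuming the key phrases are plain strings; the extra \s+ after each gap keeps adjacent phrases whitespace-separated (an assumption for readability, not the published implementation):

    import re

    def key_to_regex(key_phrases, bounded=False, n=1, m=5):
        r"""Join key phrases with a gap pattern: (\s+\S+)* or (\s+\S+){n,m}."""
        gap = r"(\s+\S+){%d,%d}" % (n, m) if bounded else r"(\s+\S+)*"
        return (gap + r"\s+").join(re.escape(p) for p in key_phrases)

    pattern = key_to_regex(["has been", "suffering", "from"])
    print(pattern)
    print(bool(re.search(pattern, "he has been suffering from lower back pain")))   # True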
Filtering Using Training Sample
Each new regular expression generated
using a key is tested against the rest of
the training set.
 The regular expressions are filtered
based on a precision threshold.
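A sketch of that filter, with an assumed threshold of 0.8 for illustration:

    import re

    def keep_regex(pattern, snippets, labels, positive_label, threshold=0.8):
        """Keep a candidate regex only if its precision on the training snippets meets the threshold."""
        matched = [lab for text, lab in zip(snippets, labels) if re.search(pattern, text)]
        if not matched:
            return False
        precision = sum(lab == positive_label for lab in matched) / len(matched)
        return precision >= threshold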

Regular Expression Classifiers

Since many regular expressions can
match a snippet of text, we rank the
expressions using a formula we refer to
as the predicting score.
Regular Expression Classifiers
RED+ALIGN
 RED+SVM

Evaluation
Two annotated clinical datasets were
used.
 SMOKE was generated for the
automated classification of smoking
status.
 PAIN was generated to identify the pain
status of patients in order to assess the
effectiveness of complementary and
alternative medicine (CAM) therapy.

SMOKE
“current smoker”, “past smoker”, “nonsmoker”, or “unknown”
 1091 text snippets
 A kappa of 92%

PAIN
“have pain”, “no pain”, or “other”
 702 text snippets
 A kappa of 94%

Classification Performance
SMOKE: the RED+SVM classifier
achieved the best accuracy of 83% (vs.
RED+ALIGN 81.5% and SVM 80.5%).
 PAIN: the RED+ALIGN classifier
provided the best accuracy of 81.2% (vs.
RED+SVM 80.9% and SVM 79.2%).

Learning Curves
Comparison with SVM
REDEX
BLS: Before labeled segment
 LS: Labeled segment
 ALS: After labeled segment

Generalization

Replace punctuation, digits, and white
space with regular expression classes
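A rough sketch of that step (the mapping choices here are assumptions, not the published REDEx rules):

    import re

    def generalize(segment):
        r"""Map digit runs to \d+, whitespace runs to \s+, punctuation to \W; keep words literal."""
        parts = []
        for run in re.finditer(r"\d+|\s+|[^\w\s]|\w+", segment):
            text = run.group()
            if text[0].isdigit():
                parts.append(r"\d+")
            elif text[0].isspace():
                parts.append(r"\s+")
            elif re.match(r"[^\w\s]", text):
                parts.append(r"\W")
            else:
                parts.append(re.escape(text))
        return "".join(parts)

    print(generalize("Wt: 70 kg"))   # Wt\W\s+\d+\s+kg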
Trimming
Remove tokens and expressions from
the beginning of BLS and the end of ALS
iteratively
 Test for false positive matches
 Stop when a false positive occurs
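A sketch of that loop under simplifying assumptions (token lists for BLS/ALS, a ready-made LS pattern, and a set of negative snippets to test against):

    import re

    def trim_context(bls_tokens, als_tokens, ls_pattern, negative_snippets):
        """Drop leading BLS tokens and trailing ALS tokens until the next cut would cause a false positive."""
        def false_positive(bls, als):
            # Tokens are assumed to be already generalized/regex-safe.
            pattern = r"\s+".join(bls + ["(" + ls_pattern + ")"] + als)
            return any(re.search(pattern, s) for s in negative_snippets)

        while bls_tokens and not false_positive(bls_tokens[1:], als_tokens):
            bls_tokens = bls_tokens[1:]
        while als_tokens and not false_positive(bls_tokens, als_tokens[:-1]):
            als_tokens = als_tokens[:-1]
        return bls_tokens, als_tokens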

Expression generation
Combine BLS, LS, and ALS
 Treat LS as a capture group
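A minimal sketch, assuming BLS, LS, and ALS have already been generalized and trimmed; the weight example below is illustrative, not a published REDEx pattern:

    import re

    def build_extractor(bls, ls, als):
        """Combine the context patterns with the labeled segment as a capture group."""
        return bls + "(" + ls + ")" + als

    pattern = build_extractor(r"weight\W\s+", r"\d+", r"\s+kg")
    match = re.search(pattern, "weight: 70 kg", flags=re.IGNORECASE)
    print(match.group(1))   # 70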

Evaluation
Bodyweight-related measures (weight,
height, BMI, abdominal circumference)
are extremely important for clinical care,
research, and quality improvement.
 These and other vital signs data are
frequently missing from structured
tables.

Datasets

For NLP training and testing, 968
snippets annotated by two reviewers
 Kappa of 99.54%

A separate test set of 3560 notes was
used to estimate the percentage of
bodyweight-related measures that were
stored exclusively as text.
Results

In ten-fold cross validation, REDEx’s
performance was: Accuracy = 98.3%,
Precision = 98.8%, Recall = 98.3%, F =
98.5%.
Results

7.7% of notes contained bodyweight-related
measures that were not available as structured
data.
 Extrapolating to the VA’s national database,
we estimated that using this method would
identify 1,309,000 individuals who would
otherwise not have a bodyweight-related
measure in the record.
 In addition, this method would add 2 additional
bodyweight-related measures per individual
per year, thus allowing for better determination
of change in weight over time.
Discussion
The new RED algorithms complement
existing NLP approaches.
 REDCL performs slightly better than
SVM with BOW features.
 Recent research to improve text
classification performance often focuses
on the number/type of features.
 REDCL retains the sequential relations of
words.

Discussion
REDEx achieved excellent accuracy.
 Compared to manually generated
regexes, the REDEx results are
reproducible.
 REDEx can be incrementally trained and
is thus easier to maintain and extend.

Other Related Work
NLP Ecosystem
 Library of NLP modules for specific
variables
 Sublanguage analysis
 Topic Modeling

Other Related Work

Self-service NLP
 Integration of information retrieval,
annotation, and machine learning
Questions?