Using Electronic Medical Records for Research


Transcript: Using Electronic Medical Records for Research

Using Electronic Medical Records for Research: Practical Issues and Implementation Hurdles
Prakash M. Nadkarni, MD
Benefits of EMRs

Most of the data that you want is often in the EMR:
- Sample-size analyses
- Cohort identification/recruitment
- Detailed data

You can implement many research-related workflows:
- Appointment scheduling enables interventions at the patient's convenience.
EMRs don't do everything
- Even Epic warns you about the need to interoperate with software designed specifically for clinical research (CRIS = Clinical Research Information System).
- Even CRISs are sub-specialized: project management/finance, grant-management workflows, federal paperwork (FDA Investigational New Drug applications), general or specialized data capture (e.g., patient diaries, adaptive questionnaires).

Challenge: No Study Calendar
- Not all patients are enrolled at the same time.
- Specific evaluations or interventions are done at specific time points ("events") relative to the start of participation in the study (or some arbitrary point, e.g., working backwards from a scheduled MRI scan).
- Each time point may have a permissible range or window (e.g., a "6-month follow-up" may occur between 5 and 7 months).
- Given a protocol/study calendar, a CRIS will *generate* a provisional patient calendar (see the sketch below).
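To make the idea concrete, here is a minimal Python sketch of generating a provisional patient calendar from an enrollment date; the event names, day offsets, and window widths are hypothetical, not taken from any particular protocol or CRIS.

```python
from datetime import date, timedelta

# Hypothetical protocol: (event name, target day offset from enrollment,
# permissible window in days on either side of the target).
PROTOCOL = [
    ("Baseline",            0,   0),
    ("3-month follow-up",  90,  14),
    ("6-month follow-up", 180,  30),   # roughly the "5-7 month" window
    ("12-month follow-up", 365, 30),
]

def provisional_calendar(enrollment: date):
    """Expand the protocol into concrete target dates and windows for one patient."""
    calendar = []
    for name, offset, window in PROTOCOL:
        target = enrollment + timedelta(days=offset)
        calendar.append({
            "event": name,
            "target": target,
            "earliest": target - timedelta(days=window),
            "latest": target + timedelta(days=window),
        })
    return calendar

for slot in provisional_calendar(date(2024, 1, 15)):
    print(f'{slot["event"]:<20} {slot["earliest"]} to {slot["latest"]}')
```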

Study Calendar (2)

- The protocol is worked out based on the information yield of each evaluation, the expected rate of change in the parameters evaluated, evaluation cost, and patient risk. An Event-CRF cross-table enforces consistency.
- CRISs use "Unscheduled" events to deal with emergency conditions.
- An entire set of reports is calendar-driven, e.g., scheduled events, missing forms, out-of-range visits (see the sketch below).
- In Epic, the closest thing to calendar functionality is the chemotherapy module (Beacon).
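Continuing the sketch above, a calendar-driven "missing forms / out-of-range visits" report can be little more than a comparison of the provisional calendar against actual visit dates; again, all names here are illustrative rather than taken from any CRIS or from Epic.

```python
from datetime import date

def visit_report(calendar, actual_visits):
    """Flag missing and out-of-window visits.

    `calendar` is the output of provisional_calendar() from the previous sketch;
    `actual_visits` maps event name -> actual visit date (absent if never done).
    """
    for slot in calendar:
        visit = actual_visits.get(slot["event"])
        if visit is None:
            status = "MISSING"
        elif slot["earliest"] <= visit <= slot["latest"]:
            status = "in window"
        else:
            status = "OUT OF RANGE"
        print(f'{slot["event"]:<20} {status}')

# Example (using provisional_calendar from the previous sketch):
# visit_report(provisional_calendar(date(2024, 1, 15)),
#              {"Baseline": date(2024, 1, 15),
#               "3-month follow-up": date(2024, 4, 10),
#               "6-month follow-up": date(2024, 8, 30)})   # late; 12-month missing
```
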
Non-adherence to Standards

- If a vendor ignores national/international controlled-terminology standards, data pooling in cross-institutional collaborations is difficult.
- For procedures, Epic does not use Current Procedural Terminology (CPT). Instead, procedures are identified by idiosyncratic abbreviations created by hurried users; these are hard to interpret except by those users and vary across institutions.
Standards Challenges (2)

- Of the 15,000 laboratory tests in our instance of Epic, only about 8% have currently been mapped to the Logical Observation Identifiers Names and Codes (LOINC) vocabulary.
- Sometimes the same procedure or lab test is defined more than once in a master table. The definitions are unhelpful, and one must look at the actual data to determine which are actually used, e.g., a histogram showing the number of tests performed over a period of time, plus the maximum and minimum values (see the sketch below).
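As an illustration of that kind of usage profiling, here is a minimal pandas sketch; the column names (test_id, test_name, result_date, numeric_value) are assumptions for the example, not actual Epic/Clarity names.

```python
import pandas as pd

# results: one row per lab result; column names and values are hypothetical.
results = pd.DataFrame({
    "test_id":       [101, 101, 102, 102, 102, 103],
    "test_name":     ["Glucose", "Glucose", "Glucose (dup)", "Glucose (dup)",
                      "Glucose (dup)", "Glucose POC"],
    "result_date":   pd.to_datetime(["2023-01-05", "2023-02-01", "2021-03-02",
                                     "2021-03-09", "2021-04-11", "2023-06-30"]),
    "numeric_value": [95, 110, 88, 240, 101, 130],
})

# Usage profile per definition: how often it is used, over what period, and what
# value range it carries. Definitions with few or stale rows are likely the
# "dead" duplicates.
profile = results.groupby(["test_id", "test_name"]).agg(
    n_results=("numeric_value", "size"),
    first_used=("result_date", "min"),
    last_used=("result_date", "max"),
    min_value=("numeric_value", "min"),
    max_value=("numeric_value", "max"),
)
print(profile)
```
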
Redundancy and heterogeneity

- The data may have been stored more than once, and in different ways, in different parts of the medical record.
  - BMI is recorded in two different places.
- "Uncontrolled" local terminologies:
  - Flowsheets where blood pressure is recorded redundantly as free text, e.g., "124/82", which must be parsed before analysis (see the sketch below). (Not at UIHC, fortunately.)
  - The procedure and lab-definition lists are also only semi-controlled.
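Where blood pressure has been stored as free text, a small amount of parsing is needed before analysis. A minimal sketch, assuming readings look like "124/82" with optional spaces and an optional mmHg suffix:

```python
import re

# Tolerated format is an assumption for illustration; real flowsheet text varies.
BP_PATTERN = re.compile(r"^\s*(\d{2,3})\s*/\s*(\d{2,3})\s*(?:mmHg)?\s*$", re.IGNORECASE)

def parse_bp(text):
    """Return (systolic, diastolic) as ints, or None if the text is not a BP reading."""
    match = BP_PATTERN.match(text)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_bp("124/82"))          # (124, 82)
print(parse_bp(" 118 / 76 mmHg"))  # (118, 76)
print(parse_bp("refused"))         # None
```
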
Duplicate Elements

Pseudo-redundancy: subtly different data elements that are given the same label in the user interface.
- A baby's birth weight is recorded both at the time of delivery and at the time of admission to a NICU. The two are not semantically the same: with interventions, the former may be significantly more (or less) than the latter.
“Wrong” structure
- Much data (discharge summaries, etc.) is stored as text, requiring human abstraction or natural language processing (NLP).
- NLP is not 100% accurate, requiring sensitivity and specificity to be traded off. It is especially hard with progress notes, which are replete with abbreviations and may have little grammatical structure.
- Much of the published NLP work relies on idiosyncrasies of a particular dataset (e.g., the use of Epic templates) to achieve higher accuracy, and is not always generalizable.

The Needle in the Haystack
- The Epic schema contains several thousand tables; many are unused or have empty fields.
- Documentation is incomplete or out of date.
- The first time, one may spend more time locating a particular data element than actually pulling it out.
- People doing data extraction need to add value by providing signposts and tips to help others who have to do the same task later.
- Even with a data warehouse, this problem will recur as long as data definitions are suboptimal.

Real-time cohort identification must be done judiciously
- "Best Practice Alerts" can be a drain on the responsiveness of systems.
- Do you really need real-time subject identification, or would a 24-hour delay be acceptable? Consider, e.g., ICU-related clinical studies or transfusion in preemies.

Transforming the Data
- The form in which data is recorded in the EMR is not necessarily the form in which it is most conveniently analyzed or reported.
- Registries often require creating derived variables (see the sketch below):
  - Converting numerical data into categories, e.g., binning children by birth weight.
  - Converting numeric values, or the existence/absence of data, into Yes/No: Is the bilirubin > 5 mg/dL? Did the neonate receive nitric oxide inhalation for pulmonary hypertension?
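A minimal pandas sketch of such derived variables; the column names, bin edges, and cutoffs are illustrative assumptions, not registry definitions:

```python
import pandas as pd

# Hypothetical neonatal extract; columns and values are illustrative only.
df = pd.DataFrame({
    "birth_weight_g":  [650, 1200, 1800, 2600, 3400],
    "bilirubin_mg_dl": [3.2, 6.1, None, 4.9, 7.4],
    "ino_start_date":  [None, "2023-02-01", None, None, "2023-05-11"],
})

# Derived variable 1: bin numeric birth weight into categories.
df["bw_category"] = pd.cut(
    df["birth_weight_g"],
    bins=[0, 1000, 1500, 2500, float("inf")],
    labels=["<1000 g", "1000-1499 g", "1500-2499 g", ">=2500 g"],
    right=False,
)

# Derived variable 2: numeric value -> Yes/No (is the bilirubin > 5 mg/dL?).
# Note: a missing bilirubin maps to "No" here; a real registry would usually
# distinguish missing from "No".
df["bili_gt_5"] = df["bilirubin_mg_dl"].gt(5).map({True: "Yes", False: "No"})

# Derived variable 3: presence/absence of data -> Yes/No (received inhaled NO?).
df["received_ino"] = df["ino_start_date"].notna().map({True: "Yes", False: "No"})

print(df)
```
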
Interfacing with statistical software
- Before the study: sample-size calculation, randomization.
- After: analysis, model fitting.
- Some CRISs (e.g., REDCap, TrialDB) will output SAS/SPSS-formatted data files, with definitions for all variables (including enumerations for all categorical variables; SAS has a PROC FORMAT command for categorical data). EMRs still lag here (see the sketch below).
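This is not how REDCap or TrialDB actually implement their exports, but a minimal sketch of the underlying idea: ship the coded data together with machine-readable definitions of the categorical enumerations (here as a SAS PROC FORMAT file). All variable names and codes are hypothetical.

```python
import csv

# Hypothetical categorical enumerations for an exported dataset.
ENUMS = {"sex": {1: "Male", 2: "Female"},
         "smoker": {0: "No", 1: "Yes"}}

rows = [{"patient_id": 1, "sex": 1, "smoker": 0},
        {"patient_id": 2, "sex": 2, "smoker": 1}]

# 1. Write the raw (coded) data.
with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["patient_id", "sex", "smoker"])
    writer.writeheader()
    writer.writerows(rows)

# 2. Write a companion SAS PROC FORMAT definition so the codes stay interpretable.
with open("formats.sas", "w") as f:
    f.write("proc format;\n")
    for var, mapping in ENUMS.items():
        f.write(f"  value {var}fmt\n")
        for code, label in mapping.items():
            f.write(f"    {code} = '{label}'\n")
        f.write("  ;\n")
    f.write("run;\n")
```
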
Data Warehouse
- A database that is optimized for fast querying, preferably by end users, without interactive updates.
- Solves some problems, but not others:
  - More homogeneous structure, i.e., a handful of tables rather than thousands.
  - However, the problem of locating variables of interest doesn't go away. With indifferent documentation of the variables, the hunt for them is transferred from the concierge/analyst to the end user, which may worsen the problem.
Special Challenges in EMR Data Interpretation / Reliability

- Data-entry errors in source data, often a consequence of "copy and paste".
- Coding of categorical variables does not accommodate nuances in the medical history or diagnostic findings.
- Depending on the source, billing data may have been up-coded (Humana).
- Outcome data may be lacking: the absence of return-visit data may simply mean that the patient failed to improve and went elsewhere.
Special Challenges (2)

- Data fragmentation, especially where healthcare is provided by separate institutions.
- Data are observational: treatments and exposures are not assigned randomly.
- Confounding bias: socioeconomic factors might lead patients to use suboptimal treatments.
- Selection/sampling bias: atypical demographic attributes of the cohort whose data you are seeing may limit the inferences you can make about the general population.
Frontiers: Genetic Data
- There are no technical barriers to incorporating limited genetic data for an individual (e.g., SNPs or specific mutations) in structured, i.e., readily analyzable, form.
- The major current issue is EMR vendors' limited understanding of genetic data and definitions.
- Whole-genome data is still a long way off: a single record would be larger than the bulk of existing non-image EMR data.

Conclusions
- None of the challenges are insurmountable, but they take a lot of effort and resources to address.
- Most of the fixes are long-term, involving:
  - Manual mapping to controlled-vocabulary terms
  - Changes in processes
  - Maintaining descriptive documentation that must continually be checked for usability and currency.
Further Reading

- Masys DR, et al. Technical desiderata for the integration of genomic data into Electronic Health Records. J Biomed Inform. 2012 Jun;45(3):419-22.
- Nadkarni, Ohno-Machado, and Chapman. Natural Language Processing: A Tutorial. Journal of the American Medical Informatics Association, 2011. PMC3168328.
- Hoffman & Podgurski. "Big, bad data." Journal of Law, Medicine and Ethics (2013) 41(1), pp. 56-60. http://www.ncvhs.hhs.gov/130430b6.pdf
Questions?