Data Management Techniques: How to Scrub Your Data


Data Management Techniques:
How to Scrub Your Data
Andrea D. Hart
Catherine Callow-Heusser
Quality Assurance:
The Overlooked
and Underused Part of Data Management
Plan carefully prior to starting data
collection.
Careful planning will reduce later data
problems and cost increases.
Plan with data analysis in mind!
Quality Assurance
Field Edits
Directly after an interview, the interviewer should
scan for missing data points, illogical skip
patterns, quantitative items that need more
complete answers, observational items, or any
item that needs clarification from a supervisor.
Field edits should be made in a different color
(e.g., green pencil) to distinguish them from
responses recorded at the original data collection time point.
Quality Assurance
Quality Control (QC)
Whether one uses paper instruments or direct
computer entry, there should be a second pair
of eyes checking the data before data entry.
This quality control person should be fully
trained on the interviewing procedures and the
data entry rules. Any edits made by the QC
person should be made in a consistent color
(e.g., red pen). Nothing should be erased after
the interview, so that a history of decisions
is documented.
Quality Control
• Communication should be clear between
researcher, supervisor, interviewer, and quality
control personnel to maintain consistent data
collection rules and interpretations.
• When the quality control person gives the interviewer
feedback about problems with the interview or items
needing clarification, this helps prevent
similar errors in the future and reduces
interviewer drift.
Double Data Entry
• Entering data twice and then comparing the two
data files for differences has been a standard
practice for researchers for many years.
• For example, datafile 1 should contain exactly the same
information as datafile 2, so var1 from datafile 1
subtracted from var1 from datafile 2 should equal
zero. Simple code can flag any difference that is
not zero, as sketched below. The scanned or hard
copy of the original interview is then checked to
resolve each difference. Often, at this point, data
entry rules are established to deal with
ambiguous data points.
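
A minimal sketch of this comparison in Python with pandas, assuming two hypothetical entry files ("entry1.csv" and "entry2.csv") keyed by a shared id column and sharing identical variable names:

    import pandas as pd

    # Read both keyings and align them on the subject id.
    df1 = pd.read_csv("entry1.csv").set_index("id").sort_index()
    df2 = pd.read_csv("entry2.csv").set_index("id").sort_index()

    # As described above: for a numeric variable such as var1,
    # matching entries subtract to zero.
    diff = df1["var1"] - df2["var1"]
    print(diff[diff != 0])    # ids where var1 disagrees (missing values also surface here)

    # pandas can also report every disagreeing cell across all variables at once.
    print(df1.compare(df2))

Each flagged value is then checked against the scanned or hard copy of the original interview.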
Double Data Entry
PROS:
• Leaves a “paper trail” for checking data errors
• Low tech, no specialized software needed
• Allows another quality control check on data
points (reading clarifying comments written in
margins of questionnaire)
CONS:
• Time consuming
• Labor intensive
Scanning Data
Software programs are available that scan interview
forms and read data (generally filled in dots)
based on a programmed template. These
programs can assign variable and value labels
and provide data files for use in SAS and SPSS.
Pros:
• Faster (depending on the design of the interview form
and the ease of making a template), thereby reducing
data entry budgets
• Facilitates archiving hard copies as digital media
(searchable by ID), reducing storage space and
tedious filing
Scanning Data
Cons:
• With few exceptions, generally requires expensive
equipment and software. Additionally, may require
higher levels of expertise from data entry personnel.
• Forms not designed with a scanning template in mind
must be redone to allow the “dots” to be filled in.
• Forms that have “dots” to fill in are not user-friendly for
a variety of populations.
• Does not allow for double data entry, and the accuracy
of entry varies across scanning programs and
across users’ ability to fill in the “dots.”
Direct Data Entry
Direct data entry allows participants or
interviewers to directly enter data into a
computer via interactive screens.
Pros:
• Circumvents error introduced by multiple data
handling steps
• Requires little or no additional data entry personnel
• Eliminates out-of-range values and skip pattern errors
Cons:
• Generally, no “paper trail” to check data errors or recreate
corrupt data
• Requires fairly sophisticated programming to create a
foolproof data entry interface
• Computer equipment is needed
Range Checks
Queries or frequencies for values that are not possible or
highly unlikely should be run. For example, a yearly
income of $50,000 is not impossible, but highly unlikely
for a teen mother who qualifies for Early Head Start.
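
A minimal sketch of a range check in Python with pandas, assuming a hypothetical time 1 mother file with an income variable named M1income:

    import pandas as pd

    df = pd.read_csv("time1_mother.csv")        # hypothetical data file

    # Frequencies and summaries make impossible or unlikely values easy to spot.
    print(df["M1income"].describe())

    # Flag values that are possible but highly unlikely for this population.
    suspect = df[(df["M1income"] < 0) | (df["M1income"] > 40000)]
    print(suspect[["id", "M1income"]])          # verify these against the original forms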
Skip Patterns
Queries should be run for values that are illogical. For
example, if a screening item for alcohol use is answered
“no, I’ve never drunk alcohol,” then there should be
no data in the section on alcohol use.
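
A minimal sketch of a skip-pattern check in Python with pandas, assuming a hypothetical gate item M1alcev (0 = “no, I’ve never drunk alcohol”) followed by hypothetical alcohol items M1alc01 through M1alc05:

    import pandas as pd

    df = pd.read_csv("time1_mother.csv")        # hypothetical data file

    # If the gate item is "no, I've never drunk alcohol" (coded 0), every
    # follow-up alcohol item should be missing.
    alcohol_items = ["M1alc01", "M1alc02", "M1alc03", "M1alc04", "M1alc05"]
    violations = df[(df["M1alcev"] == 0) & df[alcohol_items].notna().any(axis=1)]
    print(violations[["id", "M1alcev"] + alcohol_items])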
Naming Protocols
• In a longitudinal study, large amounts of data will
be collected, some of it across several time points.
It is crucial to have consistent, easy-to-decipher
file names, variable names, and variable labels.
• It’s an art to devise variable names for large data
sets that can be “decoded” and fit into only 8
characters (if using SPSS for data analysis).
Variable Naming Protocols
One strategy is to divide the variable name
into 3 parts: a prefix, a stem, and a suffix.
Prefix: 2 or 3 characters denoting the
timepoint the data was gathered and from
whom it was asked or how it was gathered.
• M1 = Time 1, asked of the mother
• F2 = Time 2, asked of the father
• C3 = Time 3, asked of the childcare provider
• V1 = Time 1, video-coded data
Variable Naming Protocols
Stem: After the prefix, the next set of
characters should denote the construct
being measured.
• V1intr = time 1, video-coded, parent
intrusiveness
• M3dep = time 3, mother data, depression
Variable Naming Protocols
Suffix: The suffix can be used for scales
with item numbers or summary scores.
• M3dep01 = time 3, mother data, construct
depression, item number 1
• M3deptot = time 3, mother data, construct
depression, total summary score
• M3depsu = time 3, mother data, construct
depression, suicide subscale score
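
A minimal sketch of this prefix/stem/suffix scheme as a small Python helper; the helper function itself is hypothetical, but the example names are the ones used above:

    # Assemble a variable name from its prefix, stem, and suffix, and enforce
    # the 8-character limit mentioned above for SPSS.
    def make_varname(prefix, stem, suffix=""):
        name = prefix + stem + suffix
        if len(name) > 8:
            raise ValueError(name + " exceeds the 8-character limit")
        return name

    print(make_varname("M3", "dep", "01"))    # M3dep01  = time 3, mother, depression item 1
    print(make_varname("M3", "dep", "tot"))   # M3deptot = time 3, mother, depression total score
    print(make_varname("V1", "intr"))         # V1intr   = time 1, video-coded intrusiveness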
Calculate Summary Scores or
Reliability Checks
• The computer should be used to calculate
summary scores and reliability scores, or to
check hand-calculated summary scores.
This allows missing data to be handled
consistently according to explicit rules.
• For example, a researcher may want to
assign a missing value to any summary
score that is missing more than 25% of
the individual items.
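
A minimal sketch of the 25% rule just described, in Python with pandas, assuming a hypothetical 20-item depression scale (M3dep01 through M3dep20) in a time 3 mother file:

    import pandas as pd

    df = pd.read_csv("time3_mother.csv")                  # hypothetical data file
    items = ["M3dep%02d" % i for i in range(1, 21)]       # M3dep01 ... M3dep20

    # Compute the total, then set it to missing when more than 25% of the
    # individual items are missing.
    n_missing = df[items].isna().sum(axis=1)
    df["M3deptot"] = df[items].sum(axis=1, min_count=1)
    df.loc[n_missing > 0.25 * len(items), "M3deptot"] = float("nan")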
Reports
• Databases like MS Access can be used to produce
reports containing any useful collection of the
data the database holds.
• These reports can be used for management of
data collection, particularly when data collection
time points during a longitudinal study overlap.
E.g., weekly reports of subjects who are within the
window of data collection for times 1 and 2.
• They can also be used to produce data
codebooks, lists of variable name prefixes,
stems, suffixes, database rules, etc.
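
The slides describe these reports in MS Access; a rough equivalent in Python with pandas of the weekly “within window” report mentioned above, under hypothetical column names for each subject’s collection windows, might look like this:

    import pandas as pd

    cols = ["t1_window_open", "t1_window_close", "t2_window_open", "t2_window_close"]
    tracking = pd.read_csv("tracking.csv", parse_dates=cols)   # hypothetical tracking table
    today = pd.Timestamp.today()

    # Subjects whose time 1 or time 2 data collection window is currently open.
    in_t1 = tracking[(tracking["t1_window_open"] <= today) & (today <= tracking["t1_window_close"])]
    in_t2 = tracking[(tracking["t2_window_open"] <= today) & (today <= tracking["t2_window_close"])]
    print(in_t1[["id"]])
    print(in_t2[["id"]])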
Have all the data been entered
into appropriate data tables?
• Matching tracking data (completion codes for the
various stages of data collection) with data entry
tables ensures that all data that were collected have been
entered. This allows cleanup of inconsistent data
OR tracking information; all data should tell the
same story.
• If the tracking information states that Alice
completed all protocols for timepoint 1 but is
missing the video portion of timepoint 2, then the
data tables should also reflect this pattern.
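
A minimal sketch of this cross-check in Python with pandas, assuming a hypothetical tracking table with a completion flag and a hypothetical entered-data table for the time 2 video protocol:

    import pandas as pd

    tracking = pd.read_csv("tracking.csv")        # completion codes, hypothetical file
    video_t2 = pd.read_csv("video_time2.csv")     # entered time 2 video data, hypothetical file

    # Subjects whom tracking marks complete for the time 2 video protocol but
    # who have no entered data, and vice versa. Both lists should be empty.
    tracked_ids = set(tracking.loc[tracking["t2_video_done"] == 1, "id"])
    entered_ids = set(video_t2["id"])
    print("tracked but not entered:", sorted(tracked_ids - entered_ids))
    print("entered but not tracked:", sorted(entered_ids - tracked_ids))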
“Business Rules”
• Business rules are database management
rules used to reduce the proliferation of
database “junk.”
• When multiple people use a database and
are designing tables and queries, it
becomes necessary to have rules to live
by so that the database is transparent to
those who use it.
“Business Rules”
These rules may include:
• Naming protocols for permanent queries, tables,
reports, or syntax.
• Articulating cleanup procedures like automatic deletion
of non-permanent queries, data tables, or syntax after
a designated amount of time.
• Designating personal databases or folders for
individual use.
• Keeping a computer file for logging data errors or
changes.
• Maintaining strict control over your computer directory
structure to help eliminate clutter and confusion.
• Automating a backup schedule.
Data Codebooks
• A database like MS Access can store ANY type
of data. Thus you can integrate codebook
creation into the data collection/entry
process.
• These types of data can include your variable
labels, value labels, questionnaires, or
appendices.
• In preparing a longitudinal dataset, it is important
to maintain consistent coding schemes across
time. MS Access can be used to maintain
definitions of coding schemes.
For example, yes/no items in EVERY dataset are coded no = 0, yes = 1.
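
The slides keep the codebook in MS Access; a minimal sketch of the same idea as a small table built in Python with pandas (variable names and labels taken from the examples above, value codings illustrative):

    import pandas as pd

    codebook = pd.DataFrame([
        {"variable": "M3dep01",  "label": "Time 3, mother, depression item 1",
         "values": "0 = no, 1 = yes"},
        {"variable": "M3deptot", "label": "Time 3, mother, depression total score",
         "values": "sum of M3dep01-M3dep20"},
    ])
    codebook.to_csv("codebook.csv", index=False)   # or store it as a table in the database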
Two Examples of Codebooks
Produced by MS Access
• Differing amounts of information on
each screen
• Ease of readability
Summary
• Your research credibility depends on sound data
management techniques.
• Proper data cleaning ALWAYS takes longer and
requires more investment than you think it will.
• Clear advance planning of data tracking, variable
naming, business rules, and writing data
codebooks will be invaluable for a dataset that is
logical, consistent, and easy to use.
• Always respect your data subjects and
remember to build in data procedures that
maintain their right to privacy.
Contact Information
Andrea D. Hart
[email protected]
University of Arkansas for Medical Sciences
Partners for Inclusive Communities
2001 Pershing Circle, Suite 300
North Little Rock, AR 72114
501-682-9918
FAX 501-682-9991

Catherine Callow-Heusser
[email protected]
NSF MSP/RETA Evaluation Capacity Building Project
2810 Old Main Hill
Utah State University
Logan, UT 84322-2810
435-797-1111
FAX 435-797-1448