Content Analysis: CATA Kimberly A. Neuendorf, Ph.D. Cleveland State University Fall 2011 CATA: Computer Aided Text Analysis Why might you want to use CATA rather than.

Download Report

Transcript Content Analysis: CATA Kimberly A. Neuendorf, Ph.D. Cleveland State University Fall 2011 CATA: Computer Aided Text Analysis Why might you want to use CATA rather than.

Content Analysis:
CATA
Kimberly A. Neuendorf, Ph.D.
Cleveland State University
Fall 2011
CATA: Computer Aided Text Analysis
Why might you want to use CATA rather
than traditional human-coding
techniques?
 CATA programs typically have been
written by researchers with a specific
need; thus, their utility is often limited.
 Online search and acquisition
opportunities have made CATA easier,
more attractive (e.g., Nexis)

Purposes of CATA

1. Descriptive—e.g., word counts

Modell project, using:


VBPro, M. Mark Miller, 1980s software
2. Coding of Open-ended Survey Responses

WordStat, SimStat adjunct program

(Provalis Research; Normand Peladeau)
Purposes of CATA
[Standard Dictionaries: Most of the following
applications use internal or “standard” dictionaries:]

3. Linguistic and Sociolinguistic Measures



General Inquirer, Philip Stone, 1966
 Harvard IV Dictionary
MCCALite, Don McTavish & Ellen Pirro
 116 “idea categories” are applied to multiple characters in
a script
CATPAC, Joseph Woelfel
 Semantic “neural” networks—no actual dictionary
Purposes of CATA

4. Psychometric Measures (or “Thematic
Content Analysis”—Smith)

General Inquirer


e.g., Lasswell Values Dictionary
5. Clinical Psychological/Psychiatric
Diagnoses

PCAD, Louis Gottschalk & Robert Bechtel

Computer version of Gottschalk’s earlier humancoded schemes devised to provide alternative
diagnostic techniques
Purposes of CATA

6. Verbal Style or Communicator Style

LIWC, Pennebaker, Booth, & Francis



e.g., positive emotions, cognitive processes
Also includes many linguistic measures and some
that might be used as psychometrics
Diction, Rod Hart

Computer application of Hart’s earlier human-coded
schemes aimed at measuring characteristics of
political speech—e.g., aggression, cooperation,
ambivalence
Purposes of CATA

7. Authorship Attribution


Most use simple counts of letters or words to
attribute authorship (e.g., the Federalist
papers; Raymond Chandler; Shakespeare)
Basic computer/word processing programming
is sufficient
Measurement in CATA
For Measurement, Three main choices:
 Custom Dictionaries
 Complicated, time-consuming
 Standard Dictionaries
 A task of matching one’s conceptualization to
someone else’s operationalization—sometimes a
scavenger hunt
 Similar to the challenge of finding an appropriate
scale for a survey
 “Emergent” Coding—outcome based on language
patterns that emerge (e.g., CATPAC)
Quantitative CATA Programs
Program
Author
Original Purpose
VBPro
M. Mark Miller
Newspaper articles
Yoshikoder
Will Lowe
Political documents
WordStat
Normand Peladeau
Part of SimStat, a statistical analysis
package
General Inquirer
Philip Stone
General mainframe computer
application (1960s)
Profiler Plus
Michael Young
Communications of world leaders
LIWC 2007
Pennebaker, Booth, & Francis
Linguistic characteristics &
psychometrics
Diction 5.0
Rod Hart
Political speech
PCAD 2000
Gottschalk & Bechtel
Psychiatric diagnoses
WORDLINK
James Danowski
Network analysis/communication
CATPAC
Joseph Woelfel
Consumer behavior/marketing
Quantitative CATA Programs
Program
Type
VBPro
Word count/researcher-created dictionaries only
Yoshikoder
Word count/researcher-created dictionaries only
WordStat
Word count/researcher-created dictionaries only
General
Inquirer
Word count with pre-set dictionaries
Profiler Plus
Word count with pre-set dictionaries
LIWC 2007
Word count with pre-set dictionaries (researchercreated dictionaries may be added)
Diction 5.0
Word count with pre-set dictionaries
PCAD 2000
Word count with pre-set dictionaries (researchercreated dictionaries may be added)
WORDLINK
Word co-occurrence
CATPAC
Word co-occurrence
Validity and CATA



Validation part of development of CATA system (e.g., Lin et
al., 2009—genres of online discussion threads)
Validation of thematic CA (psychometrics) against selfreport—rare and uncertain (e.g., McClelland et al., 1992)
A comprehensive model for assessing content, external, and
predictive validity when using CATA—Short, Broberg, Cogliser,
Brigham (2010) as applied to “entrepreneurial orientation”:
 Content validity—an inductive/deductive combo
 External validity—use multiple sampling frames
 Predictive validity—measure non-CATA variables that
should relate
Validity of Standard Dictionaries

Trusting the Standard Dictionary—an issue of face
validity




Few CATA programs reveal the full dictionary lists (e.g.,
Diction, General Inquirer)
None reveal the full algorithm (including disambiguation
(e.g., well, pot, leaves))
None account for negation
Construct and Criterion Validity


Rod Hart’s Diction—”normed” rather than validated
Gottschalk and Bechtel’s PCAD—validated against
standard psychiatric diagnoses
Quantitative CATA Programs
Program
Type
Validation
VBPro
Word count/researcher-created dictionaries only
N/A—all custom dictionaries
Yoshikoder
Word count/researcher-created dictionaries only
N/A—all custom dictionaries
WordStat
Word count/researcher-created dictionaries only
N/A—all custom dictionaries
General
Inquirer
Word count with pre-set dictionaries
No--Dictionaries adapted from Harvard IV, Lasswell
values, other standard linguistic and sociopsychological scales
Profiler Plus
Word count with pre-set dictionaries
Proprietary
LIWC 2007
Word count with pre-set dictionaries (researchercreated dictionaries may be added)
Some dimensions have been validated against
assessments by human judges
Diction 5.0
Word count with pre-set dictionaries
No—Based on R. Hart’s substantive work
PCAD 2000
Word count with pre-set dictionaries (researchercreated dictionaries may be added)
Long history of development of a human-coded
scheme; both human & CATA heavily validated
against clinical diagnoses
WORDLINK
Word co-occurrence
N/A—emergent dimensions
CATPAC
Word co-occurrence
N/A—emergent dimensions
Yoshikoder
About Yoshikoder
Created by Will Lowe at Harvard’s Department of
Government
Can be downloaded free at www.yoshikoder.org
A cross-platform, multi-lingual CATA program
Must run one case at a time
Assumes the researcher will create dictionaries
Can import external dictionaries
Exports results into Excel
Yoshikoder: KWIC and Concordance
Yoshikoder: Dictionary Report
WordStat
About WordStat
• Created by Normand Peladeau, as part of the SimStat suite for
quantitative data analysis (a counterpart to SPSS)
•Must be run as part of SimStat
•Particularly suited to analyzing open-ended responses, in that data
set typically includes both numeric and textual variables—which
can immediately be crosstabulated
•The “standard” dictionaries that are included are incomplete and
should be avoided
•Also includes KWIC
The WordStat Interface (within SimStat)
Selection of Independent & Dependent Variables—
Including Textual Variable
Standard WordStat “Dictionaries”
Breakdown of very limited WordStat “Dictionary”
WordStat Output: Word counts
WordStat Output: Dendogram
WordStat Output: Crosstab with bar graph
WordStat Output: Crosstab and 3D representation
WordStat Output: KWIC
General Inquirer
(PC/MAC version)
About General Inquirer


Created by Philip Stone in the Department of
Social Relations at Harvard in the 1960s—on
mainframe for many years
The current version combines the "Harvard IV-4"
dictionary content-analysis categories, the
"Lasswell" dictionary content-analysis categories,
and five categories based on the social cognition
work of Semin and Fiedler, making for 182
categories (dictionaries) in all
The General Inquirer (PC) Interface


Input and output files must be named
Two choices: Tags (application of dictionaries) &
Words
General Inquirer Output: Tags (data file that may
easily be exported to Excel & SPSS)
First row of each set is the ‘r’ (raw count) form of the output. This corresponds
to frequencies.
Second row of each set is the ‘s’ (scaled count) form of the output. This
corresponds to percentages (of total).
General Inquirer Output: Words
PCAD
About PCAD




Developed by Gottschalk & Bechtel, using scales developed by
Gottschalk & Gleser for human-coding in 1960s
Diagnostic—assesses one text at a time
Intended for naturally-occurring speech or writing, minimum 80
words
Measures states of neuropsychiatric interest such as:







Anxiety
Hostility
Cognitive impairment
Depression
Schizophrenia
Achievement Strivings
Hope
The PCAD Interface
PCAD Interface-2
PCAD Output: 4 Types
(Clauses, Summaries, Analyses, Diagnoses)
PCAD Output: Analyses
PCAD Output: Diagnoses
LIWC
About LIWC
•Created by Pennebaker, Booth, & Francis
•“Looks at how people write & their state of mind”
•Intended to measure both affective and cognitive constructs
•84 Output Variables (standard dictionaries):
•17 Standard linguistic dimensions (e.g., number of pronouns)
•25 Word categories (e.g., “psychological constructs – affect,
cognition”)
•10 Time categories (e.g.“space, motion”)
•19 Personal concerns (e.g., “home”)
James W. Pennebaker & Martha E. Francis
LIWC Dictionaries (dimensions) with sample words
http://www.liwc.net/descriptiontable1.php
The LIWC Interface
LIWC Output: Data Matrix (Each row is a case/text, each
column a dictionary)
Diction
•
•


About Diction
Created by Roderick P. Hart, University of Texas, originally for the
purpose of analyzing political discourse
To measure “semantic features”, uses a series of 31 standard dictionaries
and five “Master Variables” (scales constituted of combinations of the
31):
•
Activity
•
Optimism
•
Certainty
•
Realism
• Commonality
Users can create custom dictionaries in addition to standard dictionaries.
The program can accept individual or multiple passages.
The Diction Interface
Diction Output: Calculated &
Master Variables
Diction Output: Dictionary Totals with
Normative Values
Diction Output: Interactively Changing
Normative Values
Diction: Custom Dictionaries as Simple .txt Files
Diction Output: Data file may be
exported to SPSS
SPSS
Syntax
Editor
CATPAC
About CATPAC





Created by Joseph Woelfel, Communication scientist at
University of Buffalo
Part of the GALILEO suite of softwares that analyze and
display various types of networks
CATPAC uses a neural network approach, identifying the most
frequent words and determining patterns of connection based
on co-occurrence
A scanning window is used to measure the association/cooccurrence
Uses cluster analysis to present results of this co-occurrence
procedure
The CATPAC Interface

Text input will appear in CATPAC main screen
CATPAC Output:
Descending Frequency List, Alphabetically Sorted List
CATPAC Output: Dendogram
CATPAC Output: 3D Plot (using ThoughView,
another part of Galileo Suite)
end
Measurement in CATA
Three choices:


Custom Dictionaries—do it yourself
Standard Dictionaries—a scavenger hunt


A validity issue for each of the above
“Emergent” Coding—outcome based on word
patterns that emerge (e.g., CATPAC, usually)
Measurement in CATA
Custom Dictionaries—it’s more
complicated than you might think
 Standard Dictionaries—an issue of
matching your conceptualization to
someone else’s operationalization


Similar to the challenge of finding an
appropriate scale for a survey
Validity of Standard Dictionaries

Trusting the Standard Dictionary—a issue of face
validity




Few CATA programs reveal the full dictionary list (ones
that do—Diction, General Inquirer)
None reveal the full algorithm (including disambiguation
(e.g., well, pot, driver) and treatment of homonyms
(e.g., leaves))
And what about negation?
Construct and Criterion Validity


Rod Hart’s Diction—”normed” rather than validated
Louis Gottschalk’s PCAD—validated against standard
psychiatric diagnoses
Creation of Custom Dictionaries

Most CATA programs (but not all) allow for
the application of user-created dictionaries



There are various thesaurus-type programs to
help
But it’s still a huge undertaking
Disambiguation? Negation?
Practical Considerations


Collecting and preparing text files takes up the
majority of the time involved—running CATA is
quick
Cost




Free softwares: VBPro, General Inquirer (PC/MAC),
MCCALite
Relatively inexpensive softwares ($100-200): LIWC,
PCAD, Diction
Expensive softwares ($1000+): CATPAC, WordStat
(part of SimStat)
Most CATA programs are Windows-compatible
CATA Software Operating System Compatibility
CATA Software
PC-DOS
PC-Windows
MAC-OS
CATPAC

Diction 5.0

General Inquirer


LIWC


MCCALite

PCAD

VBPro
WordStat


VBPro
About VBPro





Created by M. Mark Miller at the University of
Tennessee
For use with MS-DOS (!!)
Entirely do-it-yourself. . . no standard
dictionaries
Quantitative: frequencies & coding texts in
numeric format for analysis in statistical
software
Qualitative: can provide KWIC (key word in
context)
VBPro: Preparing the text

Multiple cases within one file are prefixed with an
identification tag and saved as a .txt file (NOT .asc, the
old standard)
VBPro: Preparing Dictionaries

Each search dictionary is headed with >>#<<
The VBPro Interface
VBPro Output: Data matrix (each row is a
case/text, each column a dictionary)
VBPro Output: Alphabetization
VBPro Output: Word Frequency
MCCALite
About MCCALITE






Created by Donald G. McTavish & Ellen B. Pirro, sociologists
at the University of Minnesota, 1990
Full name: Minnesota Contextual Content Analysis
Measures the frequency of words in 116 “idea categories”
(dictionaries) and compare these frequencies to the norms of
general usage statistics for the English Language
There are standard dictionaries (categories) and KWIC,
DIMAP
Two types of dictionary scores are reported: E-Scores
(emphasis) and C-Scores (context)
Ideal content for MCCALITE are multiple-person transcripts
(plays, hearings, interviews, TV)
The MCCALite Interface & Output
MCCALite: One more example (of many possible)

pause