Catpac & WordStat Dongwoo Kim & Fran Stewart COM633, Fall 2010 Catpac: The Basics • Originally created by Joseph Woelfel to examine consumer behavior and.

Catpac & WordStat
Dongwoo Kim
& Fran Stewart
COM633, Fall 2010
Catpac: The Basics
• Originally created by Joseph Woelfel to examine
consumer behavior and marketing.
• Presented as part of the Galileo package of
software analysis tools.
• Billed as a “self-organizing artificial neural
network” optimized for examining text.
Catpac: What It Does
• Recognizes frequency of words used in text.
• Focuses on co-occurrence of words – words that
appear near each other in context.
• Uses cluster analysis to display word co-occurrence.
• Incorporates ThoughtView’s perceptual mapping
and Oresme’s interactive clustering.
Catpac: How Does It Work?
• Catpac moves a window of n words through the
text. For example, if the window size selected is 7
words, then Catpac will systematically scan
words 1-7, then words 2-8, 3-9 and so on until it
completes the document.
• Words appearing in the window then activate
the neurons representing them. Connections
among activated neurons allow Catpac to
associate words that appear close together
within the text.
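The sliding window can be sketched in a few lines of Python. This is only an illustration of the windowing idea described above, not Catpac's neural network: it simply counts how often two words share a window. The function name, the sample sentence and the window size of 4 are made up for the example.

```python
from collections import Counter
from itertools import combinations


def window_cooccurrence(words, window_size=7):
    """Count how often each unordered pair of distinct words shares a window."""
    pair_counts = Counter()
    # Slide the window one word at a time: words 1-7, then 2-8, 3-9, ...
    last_start = max(len(words) - window_size, 0)
    for start in range(last_start + 1):
        window = set(words[start:start + window_size])
        # Sort so that ("data", "depends") and ("depends", "data") are one key.
        for pair in combinations(sorted(window), 2):
            pair_counts[pair] += 1
    return pair_counts


# Invented sample text, used only to exercise the function.
text = "reliability of data depends on coders and data depends on agreement"
counts = window_cooccurrence(text.split(), window_size=4)
print(counts.most_common(3))
```

Pairs with high counts are the ones a clustering step would then place close together.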
Getting Started
• Catpac can only be used on ASCII text files so Word
documents will need to be converted to .txt files.
• The simplest analysis is a “dendogram” [sic],
according to the Galileo manual.
• A dendrogram is “a branching diagram representing
a hierarchy of categories based on degree of
similarity or number of shared characteristics
especially in biological taxonomy.” – Merriam-Webster
• Dendron is Greek for tree.
Step 1: Convert to .txt file
Step 2: Input .txt file in Catpac.
Step 3: Select file to be analyzed.
Step 4: Make a Dendrogram.
(Note the spelling error.)
This is what you will see ...
25 most frequent words
50 most frequent words
… after you exclude common words.
(This seems a bit clunky given what the program purports to do.)
Results
• Of the 738 total words in Klaus Krippendorff’s
article on “Testing the Reliability of Content
Analysis Data: What Is Involved and Why?”, the
most frequently used word in the text was data.
It appears 84 times, accounting for 11.4 percent
of all words used.
• Reliability was the next most often-used word,
accounting for more than 8 percent of the total
words used.
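The frequency figures come down to counting and dividing. A minimal Python sketch, assuming a small hand-made list of common words to exclude (Catpac's own exclusion list is not shown in these slides) and assuming percentages are taken over the words that remain after exclusion:

```python
import re
from collections import Counter

# Hypothetical exclusion list; Catpac's actual list is not reproduced here.
COMMON_WORDS = {"the", "of", "a", "an", "and", "in", "is", "to", "on"}


def word_shares(text, top_n=25):
    """Return (word, count, percent of counted words) for the top_n words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in COMMON_WORDS]
    total = len(words)
    return [(word, count, round(100 * count / total, 1))
            for word, count in Counter(words).most_common(top_n)]


# Invented sample sentence, not taken from the Krippendorff article.
sample = "The reliability of data is the reliability of the coders and the data and data"
for word, count, percent in word_shares(sample, top_n=3):
    print(word, count, percent)

# The slide's own figure uses the same arithmetic: 84 occurrences in 738 words.
print(round(100 * 84 / 738, 1))
```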
Compare that to …
His 1,368-word discussion of “Computing Krippendorff’s
Alpha-Reliability,” where data was the most frequently
used word (excluding common articles and prepositions).
The word appears 65 times, followed closely by
reliability and observers.
Dendrograms
Ward’s Method
Centroid
Examining Word Clusters
• The Oresme interactive clustering function allows
for examining concepts that are associated with each
other.
• “Cycle Input” tells which concepts are activated by a
selected concept.
• “Cycle Output” “cycles the network output window
back into itself.”
• Huh?
• “Instead of ‘thinking’ about the concepts you
originally gave it, it is thinking about the concepts
generated by the concepts you originally gave it.”
This is what it looks like …
Cycle Input
Cycle Output
The manual makes note of what some analysts call the “Buddhist monk
syndrome,” where “after sufficient contemplation, it appears that all things
are one.”
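One loose way to picture cycling is as activation spreading through a network of weighted links. The sketch below is not Catpac's algorithm. The weights, the threshold and the function names are all invented for illustration. It shows why repeated cycling tends toward the "all things are one" result: each round's output is fed back in as input, so activation keeps spreading.

```python
# Hypothetical connection weights between concepts (not Catpac output).
WEIGHTS = {
    "data": {"reliability": 0.8, "coders": 0.6},
    "reliability": {"data": 0.8, "agreement": 0.7},
    "coders": {"data": 0.6, "observers": 0.9},
    "agreement": {"reliability": 0.7},
    "observers": {"coders": 0.9},
}


def activate(inputs, threshold=0.5):
    """Return concepts whose summed incoming weight from `inputs` meets the threshold."""
    totals = {}
    for concept in inputs:
        for neighbour, weight in WEIGHTS.get(concept, {}).items():
            totals[neighbour] = totals.get(neighbour, 0.0) + weight
    return {c for c, total in totals.items() if total >= threshold}


def cycle(inputs, rounds):
    """Feed each round's output, plus what is already active, back in as input."""
    active = set(inputs)
    for _ in range(rounds):
        active |= activate(active)
    return active


print(sorted(cycle({"data"}, rounds=1)))  # concepts reached directly from "data"
print(sorted(cycle({"data"}, rounds=3)))  # after more cycling, activation has spread further
```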
To map these cluster concepts …
• First save as a crud file (.crd).
• “Select Open from the ThoughtView File menu.”
• Wait, where the heck is ThoughtView?
(CRD files extract coordinate information from the dendrograms.)
2D mapping of concept clusters
Note the tight grouping of words like reliability,
data and coders on the right.
3D mapping of concept clusters
3D mapping allows for rotation …
Now for a demonstration …
WordStat
WordStat is…
• Content analysis module of SimStat.
• Designed to analyze textual information
(open-ended responses, interview transcripts,
journal articles, news stories, websites, etc.)
• Used both for automatic categorization of text
using a dictionary and for manual coding.
WordStat has…
• Integrated text-mining analysis and
visualization tools.
• Hierarchical categorization dictionary or user-generated dictionary.
• Keyword-in-context (KWIC) and keyword
retrieval tools.
• Capability of statistical analyses (factor analysis,
word frequencies, etc.).
Getting Started
• First open SimStat because WordStat must be
run as part of the SimStat program.
• Build your own dictionary because WordStat’s
standard dictionaries are lacking.
• Run spell-check on the text to be analyzed
because misspelled words may be left uncoded.
• Select text-type file (Text, MS Word, HTML,
Excel, SPSS files)
Example Study
• Sense of humor study data
(N=288, including 52 cases with missing data)
• Open-ended responses
(Q: instances of sense of superiority in humor)
• Demographic information (gender, ethnic
background and political philosophy) and sense
of humor
How to get WordStat
• Free trial version on web site:
http://www.provalisresearch.com/wordstat/WordStatDownload.html
• Dictionary:
http://www.provalisresearch.com/wordstat/RID.html
How to use WordStat
• Create or import an existing dataset
How to create a dictionary
• Add categories and words
• Dictionary for example study
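Dictionary-based coding amounts to a lookup from words to categories. The Python sketch below illustrates the idea only and does not use WordStat itself. The category names come from the example study, but the word lists and the sample response are invented.

```python
import re
from collections import Counter

# Category names are from the example study; the word lists are invented.
DICTIONARY = {
    "Family": {"mother", "father", "brother", "sister"},
    "Politics": {"president", "election", "democrat", "republican"},
    "Religion": {"church", "priest", "god"},
}

# Invert the dictionary once so each word maps straight to its category.
WORD_TO_CATEGORY = {word: category
                    for category, words in DICTIONARY.items()
                    for word in words}


def code_response(text):
    """Count dictionary hits per category in one open-ended response."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(WORD_TO_CATEGORY[t] for t in tokens if t in WORD_TO_CATEGORY)


# "mothr" is deliberately misspelled: it matches nothing and stays uncoded.
response = "My brother joked about the president, and my mothr laughed."
print(code_response(response))
```

The misspelled word shows why the text should be spell-checked before it is coded.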
Results
• Frequencies
Results
• Frequencies - chart
Results
• Frequencies – dendrogram, concept map
Results
• Crosstab word count - gender
Results
• Crosstab word count – political tendency
Results
• Crosstab word count – ethnicity
Results
• Crosstab word count – combination
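A crosstab of word counts can be built by summing category hits within each respondent group and then expressing each cell as a share of its column. The sketch below uses invented coded responses, not the study data. It stops at counts and column percentages and does not run a significance test.

```python
from collections import defaultdict

# Invented coded responses: (respondent group, category counts). Not study data.
CODED = [
    ("M", {"Family": 1, "Politics": 2}),
    ("F", {"Family": 3}),
    ("F", {"Family": 1, "Politics": 1}),
    ("M", {"Politics": 1}),
]


def crosstab(coded):
    """Sum category counts within each group: table[category][group]."""
    table = defaultdict(lambda: defaultdict(int))
    for group, counts in coded:
        for category, n in counts.items():
            table[category][group] += n
    return table


def column_percent(table):
    """Express each cell as a percentage of its group's (column's) total."""
    totals = defaultdict(int)
    for row in table.values():
        for group, n in row.items():
            totals[group] += n
    return {category: {group: round(100 * n / totals[group], 1)
                       for group, n in row.items()}
            for category, row in table.items()}


counts = crosstab(CODED)
percents = column_percent(counts)
for category in counts:
    print(category, dict(counts[category]), percents[category])
```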
Results
• KWIC (Keyword-in-Context)
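A keyword-in-context listing shows each occurrence of a keyword with a few words on either side. A minimal Python sketch of the idea, with an invented sample text and an arbitrary context width of three words (WordStat's own KWIC options are not reproduced here):

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"[\w']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(i - width, 0):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text, used only to exercise the function.
sample = ("Reliability of data matters. Without reliable coders, "
          "the data cannot support any conclusion.")
for left, word, right in kwic(sample, "data"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>25} [{word}] {right}")
```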
Reports
• Overall
Humor>Race>Family>Politics>Religion
• Gender (M:105, F:131)
Women used more Family (p<.05), less Politics (n.s.)
<COUNT>
<COLUMN PERCENT>
Reports
• Ethnic background (W: 159, NW: 67)
▫ White people used more Humor (p<.01), less Religion (n.s.)
<COLUMN PERCENT>
• Political philosophy
(S Consv: 13, Consv: 30, Mid: 64, Libr: 63, S Libr: 38, No Comment: 28)
<COLUMN PERCENT>
Limitations
• Incomplete dictionary
▫ Overestimation: ambiguous words, overlapping
▫ Underestimation: misspellings, odd expressions
▫ Categorization: obscurations, incongruities
More?
Q&A