Catpac & WordStat
Dongwoo Kim & Fran Stewart
COM633, Fall 2010

Catpac: The Basics
• Originally created by Joseph Woelfel to examine consumer behavior and marketing.
• Presented as part of the Galileo package of software analysis tools.
• Billed as a “self-organizing artificial neural network” optimized for examining text.

Catpac: What It Does
• Recognizes frequency of words used in text.
• Focuses on co-occurrence of words – words that appear near each other in context.
• Uses cluster analysis to display word co-occurrence.
• Incorporates ThoughtView’s perceptual mapping and Oresme’s interactive clustering.

Catpac: How Does It Work?
• Catpac moves a window of n words through the text. For example, if the window size selected is 7 words, then Catpac will systematically scan words 1-7, then words 2-8, 3-9 and so on until it completes the document.
• Words appearing in the window then activate the neurons representing them. Connections among activated neurons allow Catpac to associate words that appear close together within the text.

Getting Started
• Catpac can only be used on ASCII text files, so Word documents will need to be converted to .txt files.
• The most simple analysis is a “dendogram,” according to the Galileo manual.
• A dendrogram is “a branching diagram representing a hierarchy of categories based on degree of similarity or number of shared characteristics especially in biological taxonomy.” – Merriam-Webster
• Dendron is Greek for tree.

Step 1: Convert to .txt file
Step 2: Input .txt file in Catpac.
Step 3: Select file to be analyzed.
Step 4: Make a Dendrogram. (Note the spelling error.)

This is what you will see ...
25 most frequent words
50 most frequent words
… after you exclude common words. (This seems a bit clunky given what the program purports to do.)

Results
• Of the 738 total words in Klaus Krippendorff’s article on “Testing the Reliability of Content Analysis Data: What Is Involved and Why?”, the most frequently used word in the text was data.
It appears 84 times, accounting for 11.4 percent of all words used.
• Reliability was the next most often-used word, accounting for more than 8 percent of the total words used.

Compare that to …
His 1,368-word discussion of “Computing Krippendorff’s Alpha-Reliability,” where data was the most frequently used word (excluding common articles and prepositions). The word appears 65 times, followed closely by reliability and observers.

Dendrograms
Ward’s Method
Centroid

Examining Word Clusters
• The Oresme interactive clustering function allows for examining concepts that are associated with each other.
• “Cycle Input” tells which concepts are activated by a selected concept.
• “Cycle Output” “cycles the network output window back into itself.”
• Huh?
• “Instead of ‘thinking’ about the concepts you originally gave it, it is thinking about the concepts generated by the concepts you originally gave it.”

This is what it looks like …
Cycle Input
Cycle Output

The manual makes note of what some analysts call the “Buddhist monk syndrome,” where “after sufficient contemplation, it appears that all things are one.”

To map these cluster concepts …
• First save as a crud file (.crd).
• “Select Open from the ThoughtView File menu.”
• Wait, where the heck is ThoughtView?
(CRD files extract coordinate information from the dendrograms.)

2D mapping of concept clusters
Note the tight grouping of words like reliability, data and coders on the right.

3D mapping of concept clusters
3D mapping allows for rotation …

Now for a demonstration …

WordStat

WordStat is…
• Content analysis module of SimStat.
• Designed to analyze textual information (open-ended responses, interview transcripts, journal articles, news stories, websites, etc.)
• Used both for automatic categorization of text using a dictionary and for manual coding.

WordStat has…
• Integrated text-mining analysis and visualization tools.
• Hierarchical categorization dictionary or user-generated dictionary.
• Keyword-in-context (KWIC) and keyword retrieval tools.
• Capability of statistical analyses (factor analysis, word frequencies, etc.).

Getting Started
• First open SimStat, because WordStat must be run as part of the SimStat program.
• Build your own dictionary, because WordStat’s standard dictionaries are lacking.
• Run spell-check on the text to be analyzed, because misspelled words may be left uncoded.
• Select the input file type (Text, MS Word, HTML, Excel or SPSS files).

Example Study
• Sense of humor study data (N=288, including 52 cases with missing data)
• Open-ended responses (Q: instances of sense of superiority in humor)
• Demographic information (gender, ethnic background and political philosophy) and sense of humor

How to get WordStat
• Free trial version on web site: http://www.provalisresearch.com/wordstat/WordStatDownload.html
• Dictionary: http://www.provalisresearch.com/wordstat/RID.html

How to use WordStat
• Create or import an existing dataset

How to create dictionary
• Add categories and words
• Dictionary for example study

Results
• Frequencies
• Frequencies – chart
• Frequencies – dendrogram, concept map
• Crosstab word count – gender
• Crosstab word count – political tendency
• Crosstab word count – ethnicity
• Crosstab word count – combination
• KWIC (Keyword-in-Context)

Reports
• Overall: Humor > Race > Family > Politics > Religion
• Gender (M: 105, F: 131)
  ▫ Women used more Family (p<.05), less Politics (n.s.)
  [Tables: count, column percent]

Reports
• Ethnic background (W: 159, NW: 67)
  ▫ White people used more Humor (p<.01), less Religion (n.s.)
  [Table: column percent]
• Political philosophy (N = S Consv: 13, Consv: 30, Mid: 64, Libr: 63, S Libr: 38, No Comment: 28)
  [Table: column percent]

Limitations
• Incomplete dictionary
  ▫ Overestimation: ambiguous words, overlapping
  ▫ Underestimation: misspellings, odd expressions
  ▫ Categorization: obscurations, incongruities

More?
Q&A
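
Sketch: dictionary-based coding and a crosstab
The dictionary coding and crosstab word counts described in the WordStat slides can be illustrated with a short Python sketch. This is not WordStat's implementation or file format. The dictionary entries, the sample responses and the function names (code_response, crosstab) are all hypothetical, chosen only to show how a category dictionary turns open-ended text into counts per group, and why a misspelled word ends up uncoded.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> keywords. These entries are
# illustrative assumptions, not the dictionary used in the example study.
DICTIONARY = {
    "Family": {"mother", "father", "brother", "sister"},
    "Politics": {"politician", "election", "president"},
    "Religion": {"church", "priest"},
}

# Invert to keyword -> category for direct lookup.
KEYWORD_TO_CATEGORY = {
    word: category for category, words in DICTIONARY.items() for word in words
}


def code_response(text):
    """Count dictionary categories in one response; unknown words are skipped."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(
        KEYWORD_TO_CATEGORY[t] for t in tokens if t in KEYWORD_TO_CATEGORY
    )


def crosstab(responses):
    """Sum category counts per group from (group, text) pairs."""
    table = {}
    for group, text in responses:
        table.setdefault(group, Counter()).update(code_response(text))
    return table


# Made-up responses for illustration only.
responses = [
    ("F", "My brother teased my sister about the election."),
    ("M", "A joke about a politican and a priest."),  # "politican" is misspelled
]
table = crosstab(responses)
for group in sorted(table):
    print(group, dict(table[group]))
```

In this toy data the misspelled "politican" matches no dictionary entry, so the second response contributes nothing to Politics. That is the underestimation problem listed under Limitations, and the reason for running spell-check first. An ambiguous keyword would have the opposite effect and inflate a category's count.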