Intro to SPSS


Introduction to SPSS

Data types and SPSS data entry and analysis

In this session

• What does SPSS look like?
• Types of data (revision)
• Data entry in SPSS
• Simple charts in SPSS
• Summary statistics
• Contingency tables and crosstabulations
• Scatterplots and correlations
• Tests of differences of means

SPSS/PASW

Aspects of SPSS

• Menus, especially Analyze and Graphs
• Spreadsheet view of data
  • Rows are cases (people, respondents etc.)
  • Columns are variables
• Variable view of data
  • Shows the detail of each variable's type

Questionnaire Data Coding

In SPSS

• We change ticks etc. on a questionnaire into numbers
• One number for each variable for each case
• How we do this depends on the type of variable/data

Types of data

• Nominal
• Ranked
• Scales/measures
• Mixed types
• Text answers (open-ended questions)

Nominal (categorical)

• Order is arbitrary
• e.g. sex, country of birth, personality type, yes or no

Use numeric in SPSS and give value labels.

• e.g. 1=Female, 2=Male, 99=Missing
• e.g. 1=Yes, 2=No, 99=Missing
• e.g. 1=UK, 2=Ireland, 3=Pakistan, 4=India, 5=Other, 99=Missing
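The coding scheme above can be sketched outside SPSS as a simple lookup table. This is a minimal Python illustration, not SPSS syntax; the names SEX_LABELS, MISSING, decode and valid_only are made up for the example, and the responses are invented.

```python
# Hypothetical value labels for a nominal variable, as on the slide:
# numbers are stored, labels are only used for display.
SEX_LABELS = {1: "Female", 2: "Male", 99: "Missing"}
MISSING = 99  # code reserved for missing answers


def decode(codes, labels):
    """Translate stored numeric codes back into their value labels."""
    return [labels[code] for code in codes]


def valid_only(codes, missing=MISSING):
    """Drop cases whose code marks a missing answer."""
    return [code for code in codes if code != missing]


# One number per case for this variable (made-up responses).
responses = [1, 2, 1, 99, 1]
print(decode(responses, SEX_LABELS))   # labels for every case
print(len(valid_only(responses)))      # number of valid (non-missing) cases
```

Keeping the missing code well outside the range of real answers (99 here) makes it hard to confuse with a genuine category.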

Ranks or Ordinal

• In order, 1st, 2nd, 3rd etc.
• e.g. status, social class

Use numeric in SPSS with value labels

• e.g. 1=Working class, 2=Middle class, 3=Upper class
• e.g. Class of degree, 1=First, 2=Upper second, 3=Lower second, 4=Third, 5=Ordinary, 99=Missing

Measures, scales

1. Interval: equal units, e.g. IQ
2. Ratio: equal units, zero on scale, e.g. height, income, family size, age
   • Makes sense to say one value is twice another

Use numeric (or comma, dot or scientific) in SPSS

• e.g. family size: 1, 2, 3, 4 etc.
• e.g. income per year: 25000, 14500, 18650 etc.

Mixed type

• Categorised data
• Actually ranked, but used to identify categories or groups
  • e.g. age groups = ratio data put into groups

Use numeric in SPSS and use value labels.

• e.g. Age group, 1=‘Under 18’, 2=‘18-24’, 3=‘25-34’, 4=‘35-44’, 5=‘45-54’, 6=‘55 or greater’

Text answers

• e.g. answers to open-ended questions
• Either enter the text as given (use String in SPSS)
• Or code or classify the answers into one of a small number of types (use numeric/nominal in SPSS)

Data Entry in SPSS

• Video by Andy Field

Frequency counts

• Used with categorical and ranked variables
• e.g. gender of students taking Health and Illness option

Sex of student
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female      25         73.5        73.5              73.5
        Male         9         26.5        26.5             100.0
        Total       34        100.0       100.0
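The columns of a frequency table can be reproduced with a few lines of code. The sketch below is an illustration in Python rather than SPSS output; the function name frequency_table is made up, and the data are rebuilt from the counts in the table above (25 female, 9 male).

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, count, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value,
                     counts[value],
                     round(100 * counts[value] / total, 1),
                     round(100 * running / total, 1)))
    return rows


# Rebuild the class data from the counts in the table: 25 female, 9 male.
sex = ["Female"] * 25 + ["Male"] * 9
for row in frequency_table(sex):
    print(row)
```

With no missing cases, Percent and Valid Percent are the same, which is why the sketch computes only one percentage column.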

e.g. Number of GCSEs passed by students taking Health and Illness option

Number of GCSEs
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid    0           1          2.9         2.9               2.9
         1           1          2.9         2.9               5.9
         2           4         11.8        11.8              17.6
         3           6         17.6        17.6              35.3
         4           4         11.8        11.8              47.1
         5           2          5.9         5.9              52.9
         6           6         17.6        17.6              70.6
         7           3          8.8         8.8              79.4
         8           2          5.9         5.9              85.3
         9           3          8.8         8.8              94.1
        13           1          2.9         2.9              97.1
        14           1          2.9         2.9             100.0
        Total       34        100.0       100.0

Central Tendency

Mean
• = average value
• Sum of all the values divided by the number of values

Mode
• = the most frequent value in a distribution
• (N.B. it is possible to have 2 or more modes, e.g. bimodal distribution)

Median
• = the half-way value, or the value that divides the ordered distribution in the middle
• The middle score when scores are ordered
• N.B. need to put values into order first

Dispersion and variability

• Quartiles
  • The three values that split the sorted data into four equal parts.
  • Second quartile = median.
  • Lower quartile = median of lower half of the data
  • Upper quartile = median of upper half of the data
  • Need to order the individuals first
  • One quarter of the individuals fall in each of the four parts
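The measures of central tendency and the quartiles described above can be checked on the Number of GCSEs data. The sketch below uses Python's standard statistics module (multimode needs Python 3.8 or later) and assumes the frequencies as read from the Number of GCSEs table; the quartiles function is a hypothetical helper that follows the median-of-halves rule from the slide, and SPSS may use a slightly different interpolation rule.

```python
import statistics

# Number of GCSEs for the 34 students, rebuilt from the frequency table
# as {value: count}.
gcse_counts = {0: 1, 1: 1, 2: 4, 3: 6, 4: 4, 5: 2,
               6: 6, 7: 3, 8: 2, 9: 3, 13: 1, 14: 1}
scores = sorted(v for v, n in gcse_counts.items() for _ in range(n))


def quartiles(ordered):
    """Lower quartile, median, upper quartile by the median-of-halves rule.

    Expects an already sorted list with at least two values; for an odd
    number of values the middle one is left out of both halves.
    """
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


print(round(statistics.mean(scores), 2))   # mean
print(statistics.median(scores))           # median
print(statistics.multimode(scores))        # every value tied for most frequent
print(quartiles(scores))                   # (lower, median, upper)
```

Two values (3 and 6) are tied for most frequent here, so this variable is an instance of the bimodal case mentioned under Mode.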

Used on Box Plot

[Box plot: age of Health and Illness students, with the lower quartile, median and upper quartile marked]

Statistics: Age
N        Valid      34
         Missing     0
Mean             24.03
Median           21.00

Variance

• Average squared deviation from the mean

Score   Mean   Deviation   Squared deviation
1       2.6      -1.6          2.56
2       2.6      -0.6          0.36
3       2.6       0.4          0.16
3       2.6       0.4          0.16
4       2.6       1.4          1.96
Total                          5.20

• 5.20 is the Sum of Squares
• This depends on the number of individuals, so we divide by n (5)
• This gives 1.04, which is the variance
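The variance table can be reproduced step by step. This is a minimal Python sketch using the five scores from the table; it divides by n as the slide does, and also shows the n - 1 (sample) version, which is the one statistics packages such as SPSS normally report, so the two numbers should not be expected to match. The square root at the end is the standard deviation.

```python
import math

scores = [1, 2, 3, 3, 4]          # the five scores from the table

mean = sum(scores) / len(scores)                   # 13 / 5
deviations = [x - mean for x in scores]            # score minus mean
sum_of_squares = sum(d ** 2 for d in deviations)   # total of squared deviations
variance = sum_of_squares / len(scores)            # divide by n, as on the slide
std_dev = math.sqrt(variance)                      # back to the original units

# Sample version with n - 1 in the denominator.
sample_variance = sum_of_squares / (len(scores) - 1)

print(round(mean, 2), round(sum_of_squares, 2), round(variance, 2))
print(round(std_dev, 3), round(sample_variance, 2))
```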

Standard Deviation

• The variance has one problem: it is measured in units squared.
• This isn’t a very meaningful metric, so we take the square root.
• This is the Standard Deviation

Using SPSS

• ‘Analyze > Descriptive Statistics > Explore’ menu.
• Gives mean, median, SD, variance, min, max, range, skew and kurtosis.
• Can also produce stem and leaf, and histogram.

Charts in SPSS

• Use ‘Chart Builder’ from the ‘Graphs’ menu, or the ‘Legacy Dialogs’ submenu
• And/or double click the chart to edit it.
• e.g. double click to edit bars (e.g. to change from colour to fill pattern).
• Do this in SPSS first, before cutting and pasting to Word
• Label the chart (in SPSS or in Word)

Stem and leaf plots

• e.g. age of students taking Health and Illness option
• Good at showing
  • distribution of data
  • outliers
  • range

Stem and leaf plots e.g.

Age Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     6.00        1 .  999999
    17.00        2 .  00000000001111134
     5.00        2 .  55678
     3.00        3 .  123
     1.00        3 .  5
     2.00 Extremes    (>=36)

 Stem width:  10
 Each leaf:   1 case(s)

Box Plot

[Box plot of age, shown alongside the Statistics table for Age: N valid 34, missing 0, mean 24.03, median 21.00]

Box Plot

• Fill colour changed.
• N.B. numbers refer to case numbers.

Histograms and bar charts

• Length/height of bar indicates frequency

Histogram

Fill pattern suitable for black and white printing

Changing the bin size

Bin size made smaller to show more bars

Pie chart

• Angle of segment indicates proportion of the whole

Pie Chart

Shadow and one slice moved out for emphasis

Analysing relationships

• Contingency tables or crosstabulations
  • Compare nominal/categorical variables
  • But can include ordinal variables
  • N.B. table contains counts (= frequency data)
  • One variable on the horizontal axis
  • One variable on the vertical axis
  • Row and column total counts are known as marginals

Example

• In the Health and Illness class, are women more likely to be under 21 than men?

Crosstabulations

• Use column and row percentages to look for relationships

SPSS output

Chi-square (χ²)

Crosstabulations and the Chi-square test can be used to look for a relationship between two variables:

• When the variables are categorical, so the data are nominal (or frequency).
• For example, if we wanted to look at the relationship between gender and age group.
• There are several different types of Chi-square (χ²); we will be using the 2 x 2 Chi-square.
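To make the 2 x 2 Chi-square concrete, the sketch below computes the expected counts and the Pearson χ² statistic by hand in Python. The cell counts are HYPOTHETICAL: only the marginal totals (25 women, 9 men, 16 under 21, 18 aged 21 and over) come from the slides, and the function name chi_square_2x2 is made up. No continuity correction is applied, so SPSS's "Continuity Correction" row would differ.

```python
def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts.

    Returns the statistic and the table of expected counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    return chi2, expected


# HYPOTHETICAL counts (rows: female, male; columns: under 21, 21 and over).
observed = [[13, 12],
            [3, 6]]
chi2, expected = chi_square_2x2(observed)

# Count cells whose expected value is below 5.
low_cells = sum(e < 5 for row in expected for e in row)

# With 1 degree of freedom the 5% critical value is 3.841.
print(round(chi2, 3), low_cells, chi2 > 3.841)
```

With these made-up counts two expected values fall below 5, which is the situation where combining rows or columns is worth considering.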

2x2 Chi-square results in SPSS

Another example

 The Bank employees data

Bank Employees Chi-Square tests

Chi-Square analysis on SPSS

• http://www.youtube.com/watch?v=Ahs8jS5mJKk (4m15s)
• http://www.youtube.com/watch?v=IRCzOD27NQU (from 6m30s to 9m50s)
• http://www.youtube.com/watch?v=532QXt1P M Q&feature=plcp&context=C3ba91a4UDOEgs ToPDskJ-ABupdp-Yfvuf4j4fJGzV (12m30s)

Low values in cells

• Get SPSS to output expected values
• Look where these are <5
• Consider recoding to combine columns or rows

Tabulating questionnaire responses

Categorical survey data are often “collapsed” for purposes of data analysis

Original category   Frequency   Collapsed category   Frequency
White British          284      White                   304
White Irish              7
Other White             13
Indian                  40      South Asian             105
Pakistani               32
Bangladeshi             33
Chinese                 16      Chinese                  16
Black British           30      Black                    44
Afro-Caribbean          12
African                  2

An analysis on a sample of 2 (e.g. Black African) would not have been very meaningful!
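Collapsing categories is just a recode followed by a recount. The following Python sketch illustrates the idea with the frequencies from the table above (as best they can be read from the slide); the dictionary names are made up, and in SPSS the same step would be done with a recode into a new variable.

```python
from collections import Counter

# Original categories and their frequencies.
original = {"White British": 284, "White Irish": 7, "Other White": 13,
            "Indian": 40, "Pakistani": 32, "Bangladeshi": 33,
            "Chinese": 16,
            "Black British": 30, "Afro-Caribbean": 12, "African": 2}

# Recode map: each original category points to its collapsed category.
recode = {"White British": "White", "White Irish": "White",
          "Other White": "White",
          "Indian": "South Asian", "Pakistani": "South Asian",
          "Bangladeshi": "South Asian",
          "Chinese": "Chinese",
          "Black British": "Black", "Afro-Caribbean": "Black",
          "African": "Black"}

# Add each original frequency to its collapsed category.
collapsed = Counter()
for category, count in original.items():
    collapsed[recode[category]] += count

print(dict(collapsed))
```

The grand total is unchanged by the recode, which is a quick check that no category was dropped.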

Recoding variables

• http://www.youtube.com/watch?v=uzQ_522F2SM&feature=related (6m11s; ignore the t-test for now)
• http://www.youtube.com/watch?v=FUoYZ_f6Lxc (6m; uses an old version of SPSS, there is no submenu now)

Scatterplots and correlations

• Looks for association between variables, e.g.
  • Population size and GDP
  • Crime and unemployment rates
  • Height and weight
• Both variables must be rank, interval or ratio (scale or ordinal in SPSS).
• Thus cannot use variables like gender, ethnicity, town of birth, occupation.

Scatterplots

• e.g. age (in years) versus number of GCSEs

Interpretation

• As Y increases X increases
• Called correlation
• Regression line model in red

Correlation measures association not causation

• The older the child, the better s/he is at reading
• The less your income, the greater the risk of schizophrenia
• Height correlates with weight
  • But weight does not cause height
  • Height is one of the causes of weight (also body shape, diet, fitness level etc.)
• Numbers of ice creams sold is correlated with the rate of drowning
  • Ice creams do not cause drowning (nor vice versa)
  • Third variable involved: people swim more and buy more ice creams when it’s warm

Scatterplot in SPSS

• Use the Graphs menu
• http://www.youtube.com/watch?v=74BjgPQvIEg (8m34s)
• http://www.youtube.com/watch?v=blfflA 34pQ&feature=related (4m04s)
• http://www.youtube.com/watch?v=UVylQoG4hZM (1m50s, ignore polynomial regression)

Modifying the Scatterplot

• http://www.youtube.com/watch?v=803YCYA2AoQ&feature=related (4m04s)
• http://www.youtube.com/watch?v=vPzvuMuVXk8&feature=related (3m40s)

If mixed data sets

• Change point icon and/or colour to see different subsets.
• Overall data may have no relationship but subsets might.
• e.g. show male and female respondents.
• Use Chart Builder

Correlation

• Correlation coefficient = measure of strength of relationship, e.g. Pearson’s r
• Varies in size from 0 to 1, with a plus or minus sign (i.e. from -1 to +1)

Correlations
                                        Number of GCSEs     Age
Number of GCSEs   Pearson Correlation         1            -.415*
                  Sig. (2-tailed)                           .015
                  N                          34              34
Age               Pearson Correlation      -.415*             1
                  Sig. (2-tailed)           .015
                  N                          34              34
*. Correlation is significant at the 0.05 level (2-tailed).

Positive correlation

• As x increases, y increases

r = 0.7

Negative correlation

• As x increases, y decreases

r = -0.7

Strong correlation (i.e. close to 1)

r = 0.9

Weak correlation (i.e. close to 0)

r = 0.2

Interpretation cont.

• r² is a measure of the degree of variation in one variable accounted for by variation in the other.
• e.g. if r = 0.7 then r² = 0.49, i.e. just under half the variation is accounted for (rest accounted for by other factors).
• If r = 0.3 then r² = 0.09, so 91% of the variation is explained by other things.
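Pearson's r and r² can be computed directly from their definitions. The sketch below is a plain-Python illustration; the function name pearson_r is made up, and the height and weight values are HYPOTHETICAL, chosen only to show a strong positive correlation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y divided by the
    square root of the product of their sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL heights (cm) and weights (kg) for five people.
height = [150, 160, 165, 170, 180]
weight = [52, 58, 63, 70, 77]

r = pearson_r(height, weight)
r_squared = r ** 2          # share of variation accounted for

print(round(r, 3), round(r_squared, 3))
print(round(0.7 ** 2, 2), round(0.3 ** 2, 2))   # the slide's two examples
```

The sign of r gives the direction of the relationship and r² gives the proportion of variation accounted for, so the two are read together.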

Significance of r

• SPSS reports if r is significant at α=0.05
• N.B. this is dependent on sample size to a large extent.
• Other things being equal, larger samples are more likely to give a significant r.

Usually, size of r is more important than its significance

Pearson’s r in SPSS

• http://www.youtube.com/watch?v=loFLqZmvfzU (6m57s)

Parametric and non-parametric

• Some statistics rely on the variables being investigated following a normal distribution. These are called parametric statistics.
• Others can be used if variables are not distributed normally. These are called non-parametric statistics.
• Pearson’s r is a parametric statistic
• Kendall’s tau and Spearman’s rho (rank correlation) are non-parametric.
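A rank correlation such as Spearman's rho is Pearson's r applied to the ranks of the data rather than to the raw values. The sketch below illustrates this in plain Python; the helper names are made up and the example numbers are HYPOTHETICAL. Tied values share the average of the ranks they occupy.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# HYPOTHETICAL data: y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

Because only the ordering matters, rho reaches 1 for any relationship that always increases, while Pearson's r stays below 1 unless the relationship is a straight line.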

Assessing normality

• Produce histogram and normal plot

Use statistical test

• SPSS provides two formal tests for normality: Kolmogorov-Smirnov (K-S) and Shapiro-Wilk (S-W)
• But there is debate about K-S
  • Extremely sensitive to departure from normality
  • May erroneously imply parametric test not suitable, especially in small sample
• So, always use a histogram as well.

Often can use parametric tests

• Parametric tests (e.g. Pearson’s r) are robust to departures from normality
  • Small, non-normal samples OK
• But use non-parametric if
  • Data are skewed (questionnaire data often are)
  • Data are bimodal

Spearman’s rho

• http://www.youtube.com/watch?v=r_WQe2c ISU (from 4.14 to 4.56)
• http://www.youtube.com/watch?v=POkFi5vKvI8&feature=fvwrel (6m16s)

So far…

• Looked at relationships between nominal variables
  • Gender vs age group
• Looked at relationships between scale variables
  • Height vs. weight
• Now combine the two
  • Groups vs a scale variable
  • e.g. Gender vs income

Reminder – IV vs DV

• IV = independent variable
  • What makes a difference, causes effects, is responsible for differences.
• DV = dependent variable
  • What is affected by things, what is changed by the IV.
• Gender vs income: gender = IV, income = DV
  • So we investigate the effect of gender on income

Example 1 Age group vs. no. of GCSEs

• Using the Health and Illness class data
• Age group defines 2 groups
  • Under 21
  • 21 and over
• Just two groups
  • Can use independent samples t-test
• Independent because the two groups consist of different people.
• t-test compares the means of the 2 groups.

Difference of means

• Do under 21s have more or fewer GCSEs than 21 and overs?

Group Statistics
                  Age group      N    Mean   Std. Deviation   Std. Error Mean
Number of GCSEs   Under 21       16   6.44       3.140             .785
                  21 and over    18   4.28       2.906             .685

• Means are different (6.44 & 4.28) but is that significant?

Independent Samples Test (Number of GCSEs)

Levene's Test for Equality of Variances
                              F      Sig.
Equal variances assumed      .164    .689

• No significant difference, therefore assume equal variances

t-test for Equality of Means
                                                                                          95% Confidence Interval
                                                                                             of the Difference
                               t       df      Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower    Upper
Equal variances assumed       2.082    32          .045              2.160               1.037              .047    4.272
Equal variances not assumed   2.073    30.789      .047              2.160               1.042              .034    4.285

• Means are statistically different (Sig. = .045)
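The t values in the SPSS output can be checked from the group summaries alone. The sketch below is a plain-Python illustration, not the SPSS procedure itself; the function name independent_t is made up. It uses the rounded means and standard deviations from the Group Statistics table, so the results agree with SPSS only to about two decimal places.

```python
import math


def independent_t(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance and Welch t statistics from group summaries."""
    diff = mean1 - mean2
    # Equal variances assumed: pool the two variances.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Equal variances not assumed (Welch): keep the variances separate.
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    se_welch = math.sqrt(v1 + v2)
    df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return {"t_pooled": diff / se_pooled, "df_pooled": n1 + n2 - 2,
            "se_pooled": se_pooled,
            "t_welch": diff / se_welch, "df_welch": df_welch}


# Under 21: n=16, mean 6.44, SD 3.140; 21 and over: n=18, mean 4.28, SD 2.906.
result = independent_t(16, 6.44, 3.140, 18, 4.28, 2.906)
for name, value in result.items():
    print(name, round(value, 3))
```

The significance value still needs the t distribution with the stated degrees of freedom, which is what SPSS supplies in the Sig. (2-tailed) column.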

Parametric vs non-parametric

• Just as in the case of correlations, there are both kinds of tests.
• Need to check if DV is normally distributed.
  • Do this visually
  • Also use statistical tests

Tests for normality

• Kolmogorov-Smirnov and Shapiro-Wilk
  • If n > 50 use K-S
  • If n ≤ 50 use S-W
  • Null hypothesis is ‘data are normally distributed’.
• So if p < 0.05 then the data are significantly different from a normal distribution: use non-parametric tests
• If p ≥ 0.05 then there is no significant difference: use parametric tests
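The decision rule above can be written down as two small functions. This Python sketch only encodes the rule of thumb from the slide; it does not run either normality test, the function names are made up, and the p value in the usage lines is HYPOTHETICAL.

```python
def choose_normality_test(n):
    """Slide's rule of thumb: K-S for n > 50, Shapiro-Wilk otherwise."""
    return "Kolmogorov-Smirnov" if n > 50 else "Shapiro-Wilk"


def choose_test_family(p_value, alpha=0.05):
    """Null hypothesis: the data are normally distributed.

    A p value below alpha rejects normality, so use non-parametric tests.
    """
    return "non-parametric" if p_value < alpha else "parametric"


# The Health and Illness class has 34 students.
print(choose_normality_test(34))
# HYPOTHETICAL p value read from the normality test output.
print(choose_test_family(0.21))
```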

Checking normality

• Produce histogram of DV
• Tick box to undertake statistical test
• Interpret results.

t-test

• Identify your two groups.
• Determine what values in the data indicate those two groups (e.g. 1=female, 2=male)
• Select Analyze: Compare Means: Independent samples t-test
• http://www.youtube.com/watch?v=_KHI3ScO8sc (9m40s)

Mann-Whitney U test

• Use this when comparing two groups and the DV is not normally distributed
• http://www.youtube.com/watch?v=7iTvv3m9d_g (3m45s)

Comparing 3 or more groups

• ANOVA = Analysis of Variance
• Analyze: Compare Means: One-way ANOVA
• http://www.youtube.com/watch?v=wFq1b3QjI1U (4m04s)
• Useful to get table of means (descriptives) and means plots from ANOVA options.
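The F value that a one-way ANOVA reports is the ratio of between-group to within-group variability. The sketch below computes it by hand in Python for three small HYPOTHETICAL groups; the function name one_way_anova is made up, and the significance level would still have to come from the F distribution (SPSS reports it in the Sig. column).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    means = [sum(g) / len(g) for g in groups]
    # Between groups: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within groups: how far each score is from its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# HYPOTHETICAL scores for three groups of three people each.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f, df_between, df_within = one_way_anova(groups)
print(round(f, 2), df_between, df_within)
```

A large F means the group means differ by more than the spread inside the groups would suggest, which is what the means plot shows visually.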

ANOVA Means and F value


ANOVA Means Plot
