Choosing the correct Statistical Test
An important point to consider at the outset, particularly for those
amongst you who don’t like sums: you will not be expected to calculate a
level of statistical significance. However, you will need to know when
to use a particular test and, having been given an observed value, be
able to decide its level of significance. This isn’t as complex as it
sounds. It’s simply a matter of looking up the information in a table,
though you will need to understand what the table tells you!
When choosing a test there are three things to consider. Two of these
have already been covered in this booklet; the third was covered at AS,
so here is a quick reminder.
1. NOIR: What is the level of your data?
Nominal Data:
is the simplest thing a number can do. It can tell us how many things
there are! Basically nominal data is a headcount or a tally. It
doesn’t tell us if something is bigger, brighter or bolder, just how
many. For example, get a show of hands: how many people in the class
study English? Your head count provides nominal data. If you were
replicating Piaget’s research at a primary school you might count the
number of five year olds who can successfully complete the three
mountains task and compare this to the number of seven year olds.
Again, nominal data.
Ordinal data:
allows us to put things in order. For example A might be more
attractive than B but uglier than C. We have the order C A B in terms
of attractiveness. Crucially however, we can’t be sure that the
difference between C and A is the same as the difference between A and
B. C and A might both be very attractive whereas B might be a complete
minger. We can’t tell that the intervals are the same.
Usain Bolt won the men’s 200m at Beijing, Shawn Crawford was second and
Walter Dix third. From this we can’t tell if the difference between
first and second was the same as the difference between second and
third. First, second, third provides ordinal data.
Interval and Ratio:
allow us to put things in order (ascending or descending) just as
ordinal does; however, this time we can be sure that the intervals are
the same. We know that the difference between 10cm and 11cm is the same
as the difference between 15cm and 16cm. The same applies to weight or
mass, temperature and time.
An odd one to consider is IQ. The jury is out on this one. Some
psychologists believe it yields interval/ratio data, others that it is
merely ordinal.
Generally speaking, if you need a piece of equipment to measure it, then
it’s interval or ratio.
For the purposes of statistics interval and ratio are taken as the same.
There is, however, a subtle difference. Ratio data has a true zero, so
there are no minus values, e.g. time, weight, height. Interval data can
be minus, e.g. temperature in degrees Celsius. As a result you can say
that 20cm is twice as long as 10cm, but you cannot say 20°C is twice as
hot as 10°C.
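If the "twice as hot" point seems odd, a quick bit of arithmetic may help. The sketch below is only an illustration: the function name `celsius_to_kelvin` is made up for this example, and it relies on the standard conversion of adding 273.15 to move from Celsius onto the kelvin scale, which does have a true zero.

```python
# Ratio scale: length has a true zero, so dividing one length by another makes sense.
length_ratio = 20 / 10  # 20cm really is twice as long as 10cm


def celsius_to_kelvin(celsius):
    """Convert to kelvin, a temperature scale that starts at a true zero."""
    return celsius + 273.15


# Interval scale: 0 degrees Celsius is not "no temperature at all", so we
# convert to kelvin before comparing how hot the two temperatures are.
temperature_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(length_ratio)                 # 2.0
print(round(temperature_ratio, 3))  # nowhere near 2
```

On a scale with a true zero, 20°C turns out to be only a few percent hotter than 10°C, not double.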
2. Correlation or difference?
Provided you’ve given careful consideration to your procedure and are
confident in what you’re looking for, this should be easy. Some groups
have appeared confused in the past, particularly with issues such as the
relationship between attractiveness and punishment. This could be done
either way:
You could produce an ascending scale of attractiveness and compare this
to the level of punishment given to each person. You would predict a
negative correlation; as attractiveness increases, level of punishment
given decreases.
Alternatively you could split your photographs into two groups, with the
beautiful people in one group and the mingers in the other. Then compare
the level of punishment offered to each group. You are now looking for a
difference between the two groups. The danger is that, having formulated
a hypothesis, you don’t stick to it.
Generally however, it should be obvious from your hypothesis what you’re
looking for!
3. Repeated or independent measures design
Again obvious since we’ve covered it many times. If you’re using the
same group of participants to assess both variables it’s repeated
measures. If the participants in one condition are different people
from those in the other it’s independent measures. There are times when
the decision is made for you. Sex differences, age differences,
cultural differences… there have to be different participants in each
condition.
Decision time
Having decided on the above three dimensions, use the chart below to
decide which test to use. You will be expected to know about the four
in bold: Chi squared, Wilcoxon’s sign test, Mann-Whitney ‘U’ and
Spearman’s ‘rho.’
| Type of data | Test of difference: repeated measures / matched pairs | Test of difference: independent measures / single participant | Test of correlation (relationship) |
| Nominal | Sign test | Chi squared | Chi squared |
| Ordinal | Wilcoxon sign test | Mann Whitney ‘U’ | Spearman ‘rho’ |
| Interval/ratio | Related ‘t’ test | Independent (unrelated) ‘t’ test | Pearson product moment (‘r’) |
e.g. if you have ordinal data with independent measures design and
you’re looking for a difference, you will use Mann-Whitney ‘U.’
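For those who find a lookup easier to follow than a chart, the same decision can be sketched in Python. This is just one way of writing the chart down: the names `CHART` and `choose_test` and the spellings of the labels are invented for this illustration, and matched pairs is treated as repeated measures, as the chart does.

```python
# The decision chart as a lookup table.
# Keys are (level of data, what you are looking for, design).
# Design does not matter for a correlation, so None is used there.
CHART = {
    ("nominal", "difference", "repeated"): "Sign test",
    ("nominal", "difference", "independent"): "Chi squared",
    ("nominal", "correlation", None): "Chi squared",
    ("ordinal", "difference", "repeated"): "Wilcoxon sign test",
    ("ordinal", "difference", "independent"): "Mann-Whitney U",
    ("ordinal", "correlation", None): "Spearman's rho",
    ("interval/ratio", "difference", "repeated"): "Related t test",
    ("interval/ratio", "difference", "independent"): "Independent (unrelated) t test",
    ("interval/ratio", "correlation", None): "Pearson product moment (r)",
}


def choose_test(level, looking_for, design=None):
    """Return the test named in the chart for the three decisions."""
    if looking_for == "correlation":
        design = None            # design is irrelevant for a correlation
    elif design == "matched pairs":
        design = "repeated"      # matched pairs is treated as repeated measures
    return CHART[(level, looking_for, design)]


print(choose_test("ordinal", "difference", "independent"))  # Mann-Whitney U
```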
Now a little bit of play acting or imagination. Let’s pretend you’ve done
your experiment, collected your raw data, chosen the correct test to use
and made your calculation. All your numbers will have been put into
tables or grids, you’ll have calculated means and added things up,
squared and square-rooted, subtracted one group from another and perhaps
done some dividing too. At the end of this you’ll have calculated ONE
number. This number will magically tell you whether your results are
meaningful and statistically significant, or whether they’ve more than
likely occurred by chance and are little more than a fluke.
Critical and observed values
The number you calculate is your observed value. This needs to be compared
with the critical value in the appropriate table. Each test has its own
table with various critical values depending on the level of
significance 5% (0.05), 1% (0.01), 0.5% (0.005) and so on. The critical
value also varies depending on the number of participants or degrees of
freedom.
With Spearman’s rho and chi squared tests the number you calculate needs
to be equal to or greater than the critical value for your findings to
be significant.
Aide memoire
‘Spearman’s rho’ and ‘chi squared’ both contain the letter ‘R’, as does
the word gReater.
‘Mann Whitney U’ and ‘Wilcoxon’s sign’ do not contain an ‘R’. With these
two tests the observed value needs to be equal to or smaller than the
critical value.
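The rule can be written as a tiny function. This is a sketch only: the name `is_significant` and the test labels are made up here, the numbers in the example calls are invented, and it ignores the sign of a negative correlation.

```python
# Tests where the observed value must be gReater than or equal to the critical value.
GREATER_TESTS = {"spearman", "chi squared"}
# Tests where the observed value must be smaller than or equal to the critical value.
SMALLER_TESTS = {"mann-whitney", "wilcoxon"}


def is_significant(test, observed, critical):
    """Compare an observed value with a critical value for the given test."""
    if test in GREATER_TESTS:
        return observed >= critical
    if test in SMALLER_TESTS:
        return observed <= critical
    raise ValueError("Unknown test: " + test)


# Invented numbers, purely to show the direction of each comparison.
print(is_significant("chi squared", 5.0, 4.0))   # True: observed is greater
print(is_significant("mann-whitney", 5.0, 4.0))  # False: observed is not smaller
```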
Type one and type two errors
Type 1
This is believing you have found a significant result when you haven’t.
You reject the null hypothesis when it should be retained. For example
you might set too lenient a level of significance.
Type 2
You’ve guessed it… this is believing you have found nothing of
significance when you have. This one is particularly annoying for an
undergraduate piece of research. You have accepted the null hypothesis
when it should have been rejected. This could happen if you set
yourself too strict a level of significance (for example 1% rather
than 5%).
|
Chi squared test
Use when you have nominal data with independent measures
design. Unlike the other tests, chi-squared can be used to test
for a correlation or a difference.
For example: Piaget’s three mountains test:
| | 5 year olds | 7 year olds | Totals |
| Successful | a. 4 | b. 18 | 22 |
| Not successful | c. 16 | d. 2 | 18 |
| Total | 20 | 20 | 40 |
You would put your raw data into a grid and then calculate the
expected frequencies for each cell (a, b, c, d).
You then compare the scores you obtained with what would be
expected by chance. With some appropriate and very repetitive
number crunching (especially if you have 20 cells) you calculate
your observed value.
The chi squared test uses degrees of freedom, calculated as:
(Number of rows - 1) x (Number of columns - 1)
In this case (2 - 1) x (2 - 1) = 1 x 1 = 1
You compare your observed value with the critical value in the
appropriate table for 1 degree of freedom at the 5% level.
Your number needs to be equal to or greater than the critical
value.
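You won’t be asked to do this in the exam, but for the curious here is a rough sketch of the number crunching for the three mountains grid. It assumes the usual textbook approach: each expected frequency is row total x column total / grand total, and chi squared is the sum of (observed - expected) squared / expected. No continuity correction is applied. The critical value of 3.84 for 1 degree of freedom at the 5% level comes from a standard chi squared table, not from this booklet.

```python
# Observed frequencies from the grid: columns are 5 year olds, 7 year olds.
observed = [
    [4, 18],   # successful (cells a and b)
    [16, 2],   # not successful (cells c and d)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency for each cell = row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Add up (observed - expected) squared / expected over all four cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(col_totals) - 1)

CRITICAL_VALUE = 3.84  # standard table value for df = 1 at the 5% level

print(expected)                       # [[11.0, 11.0], [9.0, 9.0]]
print(round(chi_squared, 2))          # 19.8
print(degrees_of_freedom)             # 1
print(chi_squared >= CRITICAL_VALUE)  # True
```

Here the observed value is well above the critical value, so on these figures the difference between the two age groups would be significant at the 5% level.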
|
If asked to justify a choice of test, do so in terms of whether you’re
looking for a correlation or a difference, whether you’re using an
independent or repeated measures design, and the level of data obtained.
For example: I chose to use Mann Whitney ‘U’ because I was looking for a
difference with an independent measures design and would be obtaining
data at the ordinal level.
Note: if using matched pairs design treat as repeated measures.
|
Spearman’s Rho
Use when you are looking for an association (for example a
correlation) with ordinal level of data.
For example, testing the matching hypothesis which predicts that
men and women with similar levels of attractiveness are more
likely to get married.
This time you put your raw data in a table that looks like this:
| Couple | Groom | Bride | Rank (groom) | Rank (bride) | Difference between ranks | Difference squared |
| A | 4 | 5 | | | | |
| B | 4 | 4 | | | | |
| C | 9 | 8 | | | | |
| D | 2 | 10 | | | | |
| E | 7 | 7 | | | | |
| F | 8 | 8 | | | | |
| G | 3 | 4 | | | | |
| H | 8 | 9 | | | | |
| I | 6 | 6 | | | | |
| J | 4 | 5 | | | | |
| | | | | | | |
You can complete the rest when we look at ranking a set of data.
Essentially you give each groom a rank dependent on their
attractiveness compared to the other grooms and then repeat the
process for the brides. The higher the correlation, the more
similar the two sets of ranks (i.e. the more similar their
levels of attractiveness). When you calculate the difference in
ranks, the more similar the attractiveness, the smaller the
differences. You square the values to get rid of any negative
values (remember -2 squared is 4 not -4!).
After a little more jiggery pokery you end up with an observed
value… this time always between -1 and +1.
You compare it with the critical value in the appropriate table.
This time the number of pairs is important. There is a critical
value at 5% that varies depending upon the number of pairs of
participants. Your observed value needs to be gReater than or
equal to the critical value.
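For anyone who wants to see the jiggery pokery, here is a sketch using the scores for couples A to J from the table. It assumes the usual textbook formula, rho = 1 - (6 x sum of squared differences) / (n x (n squared - 1)), which the booklet does not spell out, and it gives tied scores the average of the ranks they share. With tied ranks this formula is only an approximation. The helper name `average_ranks` is made up for the example.

```python
def average_ranks(scores):
    """Rank scores in ascending order; tied scores share the mean of their ranks."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # first position, counting from 1
        count = ordered.count(value)              # how many scores are tied
        rank_of[value] = first + (count - 1) / 2  # mean of the tied positions
    return [rank_of[score] for score in scores]


# Scores for couples A to J, taken from the table.
groom = [4, 4, 9, 2, 7, 8, 3, 8, 6, 4]
bride = [5, 4, 8, 10, 7, 8, 4, 9, 6, 5]

groom_ranks = average_ranks(groom)
bride_ranks = average_ranks(bride)

# Difference between ranks for each couple, then squared.
squared_differences = [(g - b) ** 2 for g, b in zip(groom_ranks, bride_ranks)]
sum_d_squared = sum(squared_differences)

n = len(groom)  # number of pairs
rho = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

print(groom_ranks)
print(bride_ranks)
print(sum_d_squared)   # 97.5
print(round(rho, 3))   # 0.409
```

Notice how much couple D (ranks 1 and 10) adds to the total: one badly mismatched pair drags the correlation down a long way.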
|
|
Mann Whitney ‘U’ Test
Use when you are looking for a difference with ordinal data and
an independent measures design.
For example you might want to test the hypothesis that boys and
girls take different subjects at A-level, boys preferring
spatial and mathematical, girls preferring subjects that are
more verbal.
To do this you allocate a score for each A-level subject…for
example allocating spatial and mathematical subjects a low
score: physics and maths (1), chemistry (2) etc and verbal
subjects a high score English, French, German (10), politics and
history (9) and so on…
You put your raw data in a table that looks like this:
| Boys scores | Girls scores | Rank (boys) | Rank (girls) |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | Σ | |
Unlike a correlational test (Spearman’s), the boys and girls can go
in any order… this is independent measures so there are no pairs as
such. Also, unlike Spearman’s, the number of boys’ and girls’
scores can be different. You could have 10 boys and 12 girls
for example.
This time you rank all the scores together… place all the boys’
AND girls’ scores in ascending order and calculate a rank. For
the calculation you only need to add up one set of ranks, in
this case the boys’. Then following some other number crunching
you end up with TWO values. The smaller value is called ‘U’ and
the larger value ‘U’’ (pronounced U prime).
You check U (smaller number) against the critical value for the
number of participants in each column. This time the observed
value needs to be equal to or smaller than the critical value.
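Here is a sketch of that number crunching. The scores are invented for the illustration (they are not real data), and the formula used, U = n1 x n2 + n1 x (n1 + 1) / 2 - sum of the boys’ ranks, is the usual textbook one rather than anything given in this booklet. The helper name `average_ranks` is also made up; tied scores share the average of their ranks.

```python
def average_ranks(scores):
    """Rank scores in ascending order; tied scores share the mean of their ranks."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # first position, counting from 1
        count = ordered.count(value)              # how many scores are tied
        rank_of[value] = first + (count - 1) / 2  # mean of the tied positions
    return [rank_of[score] for score in scores]


# Hypothetical subject scores: low = spatial/mathematical, high = verbal.
boys = [1, 2, 2, 5, 9]
girls = [3, 9, 10, 10, 8, 4]   # group sizes do not have to match

# Rank ALL the scores together, then pick out the boys' ranks.
all_ranks = average_ranks(boys + girls)
boys_rank_sum = sum(all_ranks[:len(boys)])

n1, n2 = len(boys), len(girls)
u_a = n1 * n2 + n1 * (n1 + 1) / 2 - boys_rank_sum
u_b = n1 * n2 - u_a            # the two values always add up to n1 x n2

u = min(u_a, u_b)              # U: the smaller value, checked against the table
u_prime = max(u_a, u_b)        # U prime: the larger value

print(boys_rank_sum)  # 20.5
print(u, u_prime)     # 5.5 24.5
```

U would then be checked against the critical value for 5 and 6 participants, and would need to be equal to or smaller than it.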
|
|
Wilcoxon’s sign test
Use when you are looking for a difference, with a repeated
measures design and ordinal data.
For example, investigating the Mozart effect. This is the idea
that listening to the music of Wolfgang Amadeus Mozart (his real
name was Johannes Chrysostomus Wolfgangus Theophilus Mozart, but
I digress) will improve all manner of cognitive functions. This
could be tested using a repeated measures design. Day 1 you get
your participants to complete a memory task whilst listening to
a popular contemporary instrumental track. Day 2 they return and
complete a similar task listening to Mozart.
Obviously a better design option here would be to use
counterbalancing, or ABBA if you prefer.
Raw data would go on a table like this:
| Participant | Mozart | Non-Mozart | Difference | Rank |
| A | | | | |
| B | | | | |
| C | | | | |
| D | | | | |
| E | | | | |
| F | | | | |
| G | | | | |
| H | | | | |
| I | | | | |
| J | | | | |
| | | | | |
Any differences of ‘0’ are ignored. The remaining differences are
ranked by size, ignoring whether they are plus or minus. The ranks
of the positive differences are added up, and then the ranks of
the negative differences. The smaller of the two totals is taken
and then it’s a very quick job to look up the value in an
appropriate table for the appropriate number of participants (in
the above case 10, less any with a difference of 0). The simplest
of all inferential tests to calculate.
Wilcoxon’s sign test contains no letter ‘R’ so this time the
observed value needs to be equal to or smaller than the critical
value found in the table.
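To see the steps in action, here is a sketch with invented memory scores for participants A to J (they are made up for the example, not real findings). Differences of zero are dropped, the remaining differences are ranked by size regardless of sign, and the smaller of the two rank totals is the observed value, called `t` here. The helper name `average_ranks` is also made up; tied values share the average of their ranks.

```python
def average_ranks(scores):
    """Rank scores in ascending order; tied scores share the mean of their ranks."""
    ordered = sorted(scores)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # first position, counting from 1
        count = ordered.count(value)              # how many scores are tied
        rank_of[value] = first + (count - 1) / 2  # mean of the tied positions
    return [rank_of[score] for score in scores]


# Hypothetical memory scores for participants A to J.
mozart = [12, 15, 9, 14, 11, 13, 10, 16, 12, 14]
non_mozart = [10, 15, 11, 10, 10, 9, 11, 11, 12, 13]

# Work out each difference, ignoring participants whose difference is zero.
differences = [m - n for m, n in zip(mozart, non_mozart) if m != n]

# Rank the differences by size, ignoring the plus or minus sign.
ranks = average_ranks([abs(d) for d in differences])

positive_sum = sum(r for d, r in zip(differences, ranks) if d > 0)
negative_sum = sum(r for d, r in zip(differences, ranks) if d < 0)

t = min(positive_sum, negative_sum)  # observed value: the smaller total
n = len(differences)                 # participants left after dropping zeros

print(differences)                   # [2, -2, 4, 1, 4, -1, 5, 1]
print(positive_sum, negative_sum)    # 29.5 6.5
print(t, n)                          # 6.5 8
```

With two participants showing no difference, the table would be read for 8 participants rather than 10, and the observed value of 6.5 would need to be equal to or smaller than the critical value found there.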
|
|