Descriptive statistics

2 opening thoughts

Research skills are double-sided. Doing your own research (including suitable assignments) helps you to evaluate the research of others. Evaluating the research of others helps you to do your own research: a “virtuous circle”.
You need a basic grasp of social science numbers not just to be a properly trained researcher but also to be a responsible and safe citizen (one who cannot be “led by donkeys”).

Properties of the normal distribution

“Most” values are near the mean (average).
The number of values increasingly far from the mean drops rapidly: about 50% of the population have an IQ of 90-110 but only about 2% have an IQ above 130 and about 2% below 70. (“Standard deviations”.)
There are as many values above the mean as below it (a symmetrical distribution).
The Galton Board gives us some idea about where the distribution comes from: Lots of (small?) but equally important “50/50” effects.
Advance warning: A lot of statistical techniques depend on normally distributed data, so make sure you check the distribution of your data.
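The IQ figures above can be checked with a short Python sketch. It assumes IQ scores are normally distributed with a mean of 100 and a standard deviation of 15; those two numbers are a common convention and are not stated in these notes.

```python
from statistics import NormalDist

# Assumption: IQ is normal with mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Proportion of the population in each band.
share_near_mean = iq.cdf(110) - iq.cdf(90)  # between 90 and 110
share_above_130 = 1 - iq.cdf(130)           # more than 2 SDs above the mean
share_below_70 = iq.cdf(70)                 # more than 2 SDs below the mean

print(f"Between 90 and 110: {share_near_mean:.1%}")
print(f"Above 130: {share_above_130:.1%}")
print(f"Below 70: {share_below_70:.1%}")
```

Because the distribution is symmetrical, the two tails come out the same size, at roughly 2% each.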

Descriptive statistics

Measures of central tendency. What are the values that in some sense “summarise” a distribution?
Mean aka “average” (works well for normal distributions): Sum of values divided by number of values: Mean of 1, 2, 3, 3, 3, 4, 15 is (1+2+3+3+3+4+15) / 7 = 31 / 7 ≈ 4.43.
Median (as many values above as below): 1, 2, 3 below and 3, 4, 15 above so median = 3. (For an even number of values the median is the mean of the “middle two”. So removing 3, 4, 15 makes the median (2+3) / 2 = 2.5.)
Mode (the value that occurs most often): 3.
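The three measures can be reproduced with Python's standard statistics module, using the same seven values as the worked example. This is only a sketch of the arithmetic, not a substitute for SPSS output.

```python
import statistics

values = [1, 2, 3, 3, 3, 4, 15]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # middle value once the data are sorted
mode = statistics.mode(values)      # value that occurs most often

print(round(mean, 2), median, mode)

# Even number of values: the median is the mean of the middle two.
even_median = statistics.median([1, 2, 3, 3])
print(even_median)
```

Note how the single large value (15) pulls the mean (31 / 7, about 4.43) above the median and the mode, which both stay at 3.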

Types of data

The type of data affects the statistical tests you can do (more later).
Nominal: No meaningful ordering, just “names”. Example: Nationality. Brazilian isn’t better or worse than Argentinian, just distinct.
Ordinal: Meaningful ordering but distances between categories need not be the same. Example: BMI categories. The weight difference between normal and overweight does not have to be the same as the difference between overweight and obese.
Interval: Meaningful ordering and constant category differences. Example: Temperature. The difference between 10 degrees and 15 degrees is the same as between 35 degrees and 40 degrees.
Ratio: Like interval but has a meaningful zero. Example: Blood Alcohol in mg/100 ml.
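One way to see why a meaningful zero matters is to change the units. The sketch below assumes the temperatures are in degrees Celsius and converts them to Fahrenheit (the notes do not name a scale). Equal differences stay equal, but the ratio between two temperatures changes with the scale, so "twice as hot" is not a meaningful statement for interval data.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Differences: 10 -> 15 and 35 -> 40 are the same size on either scale.
diff_low_f = celsius_to_fahrenheit(15) - celsius_to_fahrenheit(10)
diff_high_f = celsius_to_fahrenheit(40) - celsius_to_fahrenheit(35)
print(diff_low_f, diff_high_f)

# Ratios: 20 degrees looks like "twice" 10 degrees only in Celsius.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(ratio_c, round(ratio_f, 2))
```

For ratio data such as blood alcohol, zero really does mean "none", so a ratio like "twice as much" survives a change of units.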

Representing different data types

Nominal: All you can really give is a list.
Ordinal: Mode (the commonest category in the UK is “overweight”) or frequency (in a sample of 1000 men, 58 are obese, 205 are overweight …).
Interval and Ratio: Can use mean, median or mode, whichever gives the most honest representation.
SPSS will allow you to do things that make no sense, so you have to know enough not to do them.
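The same warning applies to any software. In the Python sketch below, all of the sample data and the numeric coding scheme are invented for illustration. A frequency count and a mode make sense for categories, whereas the mean of nominal codes computes without complaint and tells you nothing.

```python
from collections import Counter
import statistics

# Hypothetical ordinal sample (invented for illustration only).
weight_categories = ["normal", "overweight", "obese",
                     "overweight", "normal", "overweight"]
frequencies = Counter(weight_categories)
commonest = frequencies.most_common(1)[0][0]  # the mode
print(frequencies, commonest)

# Hypothetical nominal data stored as numeric codes.
nationality_codes = {"Brazilian": 1, "Argentinian": 2, "Chilean": 3}
sample = ["Brazilian", "Chilean", "Chilean", "Argentinian"]
coded = [nationality_codes[name] for name in sample]

# This runs without error, but a "mean nationality" is meaningless.
meaningless_mean = statistics.mean(coded)
print(meaningless_mean)
```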

Measures of dispersion

A complement to a measure of central tendency. Together, the two numbers (one for the centre, one for the spread) can give a pretty good feel for a whole distribution.
How spread out are the data, and what are the “extremes”?
The simplest is range: The largest value minus the smallest value.
The commonest is standard deviation.
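As a quick illustration of the range, here is a sketch in Python reusing the values 1, 2, 3, 3, 3, 4, 15 from the central tendency example. It also shows that the range depends entirely on the two most extreme values.

```python
values = [1, 2, 3, 3, 3, 4, 15]

# Range: largest value minus smallest value.
data_range = max(values) - min(values)
print(data_range)

# Dropping the single extreme value (15) shrinks the range a lot.
without_extreme = [v for v in values if v != 15]
range_without_extreme = max(without_extreme) - min(without_extreme)
print(range_without_extreme)
```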

Standard deviation

This indicates how much individual members of a group differ from the mean value for that group.
First, subtract the mean from each individual value to get the differences.
Then square the results, to get rid of negative differences and to give a bigger weighting to relative outliers in the data.
E.g. (participant ages):
- Calculate the mean of the squared results: 75.76.
- Take the square root: 8.70. Why? Because you have squared the differences.
- Roughly speaking, participant ages differ from the mean by 8.70 years on average.
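The steps can be followed through in a short Python sketch. The participant ages behind the 75.76 figure are not given in these notes, so the sketch uses the values 1, 2, 3, 3, 3, 4, 15 from the earlier example instead. It divides by the number of values, as in the steps above. Software often reports a "sample" version that divides by one less than the number of values, so results may differ slightly.

```python
import math
import statistics

# Illustrative values only (not the participant ages from the example).
values = [1, 2, 3, 3, 3, 4, 15]

mean = sum(values) / len(values)
differences = [v - mean for v in values]     # step 1: subtract the mean
squared = [d ** 2 for d in differences]      # step 2: square the differences
mean_of_squares = sum(squared) / len(squared)  # step 3: mean of the squares
sd = math.sqrt(mean_of_squares)              # step 4: square root

print(round(mean_of_squares, 2), round(sd, 2))

# The standard library gives the same "population" figure directly.
print(round(statistics.pstdev(values), 2))
# The "sample" version (dividing by n - 1) is slightly larger.
print(round(statistics.stdev(values), 2))
```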

Note

Generally we don’t show you the calculations because you can “press buttons” in SPSS to get these. It is still useful to know what goes on “behind” the software, and sometimes the calculations can give you insight into the meaning of the statistics (see shortly).

Presentation of data: good and bad

We have already done this without comment.

Data presentations should be clear and fair. Beware of apparently harmless attempts to mislead you.

Reputable graph

It is clear what the axes are: the bottom (horizontal) axis is dates and the side (vertical) axis is the number of homicides per million population, i.e. a rate. Bars represent annual data reported at just one time in the year.

More honest x axis

It doesn’t matter much here (they just change the annual census data once), but it can matter a lot.
The x axis goes across (horizontal).

“Gee Whiz” Graph

Economical or dishonest? It depends on the argument and how visible the axis values are.
Always check where the axis starts when reading graphs.
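A small Python sketch shows why the starting point of the axis matters. The two yearly values and the function name are invented for illustration. The function works out how much taller one bar is drawn than another for a given axis start.

```python
def bar_height_ratio(small, large, axis_start=0):
    """Ratio of the drawn bar heights when the value axis begins at axis_start."""
    return (large - axis_start) / (small - axis_start)

# Hypothetical values: a 5% rise from one year to the next.
year_1, year_2 = 100, 105

honest = bar_height_ratio(year_1, year_2)                   # axis starts at 0
gee_whiz = bar_height_ratio(year_1, year_2, axis_start=95)  # axis starts at 95

print(round(honest, 2), gee_whiz)
```

With the axis starting at zero the second bar is 5% taller. With the axis starting at 95 it is drawn twice as tall, although the data have not changed.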

Line graph 1

The bar chart gives a better picture of data collection because there isn’t actually any data “between” bars (see the “spike”). It is also very hard to work out actual data values.

Line graph 2

Better: You can “eyeball” values against the grey background lines.

Descriptive statistics and hypotheses

We have delivered a violence reduction treatment programme in prison and want to evaluate whether there is a difference in how successful it is for male compared to female offenders.
Our hypothesis is that the programme will be more successful for males than females.
We monitor the number of aggressive incidents perpetrated by each male and female in the 3 months before starting treatment and in the 3 months after having completed the treatment programme.
For each person, we calculate the difference in aggressive incidents perpetrated pre- and post- treatment.
We then calculate the mean for males and the mean for females, which we can compare.
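The calculation described above can be sketched in Python. Every record below is invented for illustration, and the function name is hypothetical. Each record holds a person's sex and their number of aggressive incidents in the 3 months before and after treatment.

```python
from statistics import mean

# Hypothetical records: (sex, incidents before, incidents after).
records = [
    ("male", 6, 2),
    ("male", 4, 3),
    ("male", 5, 1),
    ("female", 3, 2),
    ("female", 4, 4),
    ("female", 2, 1),
]

def mean_reduction(records, sex):
    """Mean pre-minus-post difference in incidents for one group."""
    differences = [pre - post for s, pre, post in records if s == sex]
    return mean(differences)

male_mean = mean_reduction(records, "male")
female_mean = mean_reduction(records, "female")

print(male_mean, round(female_mean, 2))
```

A larger mean reduction for males would be in line with the hypothesis, but this is only a descriptive comparison. Whether the difference is big enough to trust is a question for a statistical test.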