This is a simple, but flawed, way to compute the average distance.
Standard Deviation Recipe
Express scores as distances (deviations) from the mean.
Square distances to eliminate negative values.
Sum the squared distances.
Average the squared distances.
Take the square root to get the average distance from the scores to the mean.
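The five steps of the recipe can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the example scores are made up, and the code divides by N, which is the biased version of the formula.

```python
import math

def standard_deviation_recipe(scores):
    """Follow the five-step recipe, dividing by N (the biased version)."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]   # Step 1: distances from the mean
    squared = [d ** 2 for d in deviations]    # Step 2: square the distances
    sum_of_squares = sum(squared)             # Step 3: sum the squared distances
    variance = sum_of_squares / n             # Step 4: average the squared distances
    return math.sqrt(variance)                # Step 5: square root

# Hypothetical scores chosen so the arithmetic is easy to follow (mean = 5).
example_scores = [2, 4, 4, 4, 5, 5, 7, 9]
sd = standard_deviation_recipe(example_scores)
print(sd)  # 2.0
```

Here the sum of squares is 32, the average squared distance is 32 / 8 = 4, and the square root is 2.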
Distance (Deviation) Scores
The distance between a score and the sample mean is called a deviation score.
Distance scores can be positive (score > mean), negative (score < mean), or zero (score = mean).
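As a small illustration, the sketch below computes deviation scores and labels each one by its sign. The ages and the helper names are hypothetical.

```python
def deviation_scores(scores):
    """Return each score's distance from the sample mean."""
    mean = sum(scores) / len(scores)
    return [x - mean for x in scores]

def position(deviation):
    """Describe where a score sits relative to the mean."""
    if deviation > 0:
        return "above the mean"
    if deviation < 0:
        return "below the mean"
    return "at the mean"

ages = [20, 30, 40]  # hypothetical sample with a mean of 30
deviations = deviation_scores(ages)
labels = [position(d) for d in deviations]
for age, d, label in zip(ages, deviations, labels):
    print(age, d, label)
```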
Square Deviation Scores
Squaring distance scores eliminates negative values, so distances no longer cancel out.
Squaring achieves the same goal as taking the absolute value.
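A quick check shows why the raw distances cannot simply be summed. The scores below are hypothetical.

```python
scores = [3, 5, 10]  # hypothetical scores with a mean of 6
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]  # -3, -1, 4

raw_total = sum(deviations)                       # negatives and positives cancel
squared_total = sum(d ** 2 for d in deviations)   # 9 + 1 + 16
absolute_total = sum(abs(d) for d in deviations)  # 3 + 1 + 4

print(raw_total, squared_total, absolute_total)  # 0.0 26.0 8.0
```

The raw distances always sum to zero, whereas the squared and absolute distances both grow as the scores spread out.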
Sum the Squared Distances
Summing the squared distance scores gives a foundational measure called the sum of squares (SS).
SS = \sum(X_i - \bar{X})^2
SS expresses the total amount of variability in the data as a lump sum.
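A worked example may help. Assume three hypothetical scores, 3, 5, and 10:

```latex
\begin{aligned}
\bar{X} &= \frac{3 + 5 + 10}{3} = 6 \\
SS &= (3 - 6)^2 + (5 - 6)^2 + (10 - 6)^2 \\
   &= 9 + 1 + 16 = 26
\end{aligned}
```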
Average the Squared Distances
Averaging the squared distances between the scores and the mean gives a measure of variability called the variance.
The variance is just the arithmetic average (a sum divided by the number of observations) applied to squared scores.
This version of the formula is biased.
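The sketch below computes the biased variance (SS divided by N) by hand and compares it with `statistics.pvariance` from the Python standard library, which also divides by N. The scores and the function name are hypothetical.

```python
import statistics

def biased_variance(scores):
    """Average squared distance from the mean, dividing by N."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((x - mean) ** 2 for x in scores)
    return sum_of_squares / n

scores = [3.0, 5.0, 10.0]  # hypothetical scores; SS = 26
by_hand = biased_variance(scores)
by_library = statistics.pvariance(scores)
print(round(by_hand, 2), round(by_library, 2))  # 8.67 8.67
```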
The Variance
The average squared distance from the smoking scores to the mean is 45.6.
The variance will appear in equations later in the course, but we will not use it to describe data.
Square Root to Get Average Distance
Taking the square root undoes the squaring operation, giving the average distance from the scores to the mean.
Standard deviation = average (typical) distance.
This version of the formula is biased.
The Standard Deviation
The average (typical) distance from the smoking scores to the mean is 6.75 cigarettes.
The std. dev. is the square root of the variance.
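The two values reported for the smoking scores can be checked against each other: the square root of the variance should reproduce the standard deviation.

```python
import math

variance = 45.6  # average squared distance reported for the smoking scores
standard_deviation = math.sqrt(variance)
print(round(standard_deviation, 2))  # 6.75
```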
Rule of Thumb for Normal Data
In a normal curve, about 68% of the scores are within ±1 standard deviation of the mean, and about 95% are within ±2 standard deviations.
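The percentages in the rule of thumb can be recovered from the standard normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)
within_2 = share_within(2)
# Number of standard deviations that captures exactly 95% of the curve.
z_for_95 = standard_normal.inv_cdf(0.975)

print(round(within_1, 3), round(within_2, 3), round(z_for_95, 2))
```

The exact cutoff for 95% is about 1.96 standard deviations, which is why ±2 is only approximate.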
Populations vs. Samples Revisited
A population is the entire group of individuals that are of interest in a study.
Researchers almost exclusively work with a smaller subset called a sample.
The usual goal is to use a sample statistic as a best guess about the corresponding population parameter.
Small-Sample Bias
The previous formulas for the standard deviation and variance are biased in small samples.
The formulas underestimate true variability in the population.
An adjustment to the sample size in the denominator fixes this bias, making the sample standard deviation a more accurate estimate of the true population standard deviation.
Standard Deviation Formulas
Population standard deviation: \sigma = \sqrt{\frac{\sum(X_i - \mu)^2}{N}}
Biased sample standard deviation: s = \sqrt{\frac{\sum(X_i - \bar{X})^2}{N}}
Unbiased sample standard deviation: s = \sqrt{\frac{\sum(X_i - \bar{X})^2}{N-1}}
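The two sample formulas differ only in the denominator. The sketch below writes both out and compares them with `statistics.pstdev` (divides by N) and `statistics.stdev` (divides by N - 1) from the Python standard library. The scores and the `denominator_adjustment` parameter are hypothetical.

```python
import math
import statistics

def standard_deviation(scores, denominator_adjustment=0):
    """Square root of SS divided by (N - denominator_adjustment)."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(sum_of_squares / (n - denominator_adjustment))

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores; SS = 32

biased_sd = standard_deviation(sample)        # divides by N = 8
unbiased_sd = standard_deviation(sample, 1)   # divides by N - 1 = 7

print(round(biased_sd, 3), round(statistics.pstdev(sample), 3))
print(round(unbiased_sd, 3), round(statistics.stdev(sample), 3))
```

The N - 1 version is always the larger of the two for the same data.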
Variance Formulas
Population variance: \sigma^2 = \frac{\sum(X_i - \mu)^2}{N}
We do not know the true mean, \mu, so we must substitute the sample mean, \bar{X}, into the standard deviation equation.
This makes the numerator too small because the squared distances around the sample mean never total more than the squared distances around the true mean.
Averaging over N – 1 (the degrees of freedom) rather than N counteracts this bias, making the equation more accurate.
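A small simulation can make the bias visible. The sketch below is an illustration under stated assumptions: the population is normal with a true variance of 100, each sample has five scores, and the seed and number of replications are arbitrary choices.

```python
import random

def average_variance_estimates(sample_size, replications, seed=1):
    """Average the N and N - 1 variance estimates over many random samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(replications):
        # Assumed population: normal, mean 0, standard deviation 10.
        sample = [rng.gauss(0, 10) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        sum_of_squares = sum((x - mean) ** 2 for x in sample)
        biased_total += sum_of_squares / sample_size
        unbiased_total += sum_of_squares / (sample_size - 1)
    return biased_total / replications, unbiased_total / replications

biased_avg, unbiased_avg = average_variance_estimates(5, 20000)
print(round(biased_avg, 1), round(unbiased_avg, 1))
```

With samples of five, the N version should average roughly 80 (four fifths of the true variance), while the N - 1 version should average close to 100.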
Sample Size Adjustment
Dividing by N – 1 compensates for the fact that the distances (deviations) in the numerator are underestimated.
The adjustment increases the standard deviation, making the formula more accurate.
The adjusted sample size is called the degrees of freedom.
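The size of the adjustment depends on the sample size. The sketch below computes how much larger the N - 1 standard deviation is than the N version; the function name and the first two sample sizes are hypothetical, and 165 is the sample size of the trial described in the next section.

```python
import math

def adjustment_factor(n):
    """Ratio of the N - 1 standard deviation to the N standard deviation."""
    return math.sqrt(n / (n - 1))

for n in (5, 30, 165):
    degrees_of_freedom = n - 1
    print(n, degrees_of_freedom, round(adjustment_factor(n), 4))
```

The factor shrinks toward 1 as the sample grows, so the adjustment matters most in small samples.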
Smoking and Drinking Cessation Trial
Pharmacological treatments that can concomitantly address cigarette smoking and heavy drinking stand to improve health care delivery for these highly prevalent co-occurring conditions.
This superiority trial compared the combination of varenicline and naltrexone against varenicline alone for smoking cessation and drinking reduction among heavy-drinking smokers.
Key Variables
Breath (alveolar) carbon monoxide
A measure of carbon monoxide in the lungs.
Breath carbon monoxide is a biomarker of smoking behavior common in clinical trials.
Higher scores reflect more frequent smoking.
Medication arm
Participants were randomly assigned to receive one of two meds: varenicline plus naltrexone or varenicline plus placebo pills.
Descriptive Statistics
Mean (\bar{X}) = 5.53
Std. Dev. (s) = 5.96
N = 165
Research Question
Comparative research questions ask whether two groups differ from one another.
Does the combination of two medications result in different smoking levels compared to using a single medication?
We answer this question by comparing the means of the two treatment arms, but the variability is important too.
Statistics by Condition
Medication Arm | Mean | SD | n
Varenicline only | 5.02 | 4.88 | 82
Varenicline + Naltrexone | 6.02 | 6.86 | 83
Variability Comparison
Variability is larger in the dual medication arm because there are more individuals with very large smoking scores.
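The by-arm statistics can be cross-checked against the overall descriptive statistics. The sketch below rebuilds the overall mean and standard deviation from the two arms, assuming the reported standard deviations use the N - 1 denominator. The total sum of squares is the within-arm sum of squares plus the between-arm sum of squares.

```python
import math

# Summary statistics as reported for each medication arm.
groups = [
    {"arm": "Varenicline only", "mean": 5.02, "sd": 4.88, "n": 82},
    {"arm": "Varenicline + Naltrexone", "mean": 6.02, "sd": 6.86, "n": 83},
]

total_n = sum(g["n"] for g in groups)
grand_mean = sum(g["n"] * g["mean"] for g in groups) / total_n

# Assumption: each SD was computed with N - 1, so SS = (n - 1) * sd^2.
within_ss = sum((g["n"] - 1) * g["sd"] ** 2 for g in groups)
between_ss = sum(g["n"] * (g["mean"] - grand_mean) ** 2 for g in groups)
combined_sd = math.sqrt((within_ss + between_ss) / (total_n - 1))

# Ratio of the two variances, as a rough gauge of the difference in spread.
variance_ratio = (groups[1]["sd"] / groups[0]["sd"]) ** 2

print(total_n, round(grand_mean, 2), round(combined_sd, 2), round(variance_ratio, 2))
```

The combined standard deviation matches the reported 5.96. The rebuilt mean lands within rounding error of the reported 5.53, because the arm means are themselves rounded to two decimals. The variance in the dual medication arm is roughly twice that of the single medication arm.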
Study Questions
Make up a sample of five age values that exhibit high variability.
Make up a sample of five age values that exhibit very little variability.
Make up a sample of five age values with a standard deviation equal to zero.
What is a distance or deviation score?
Suppose that the mean of a sample of depression scores is 19. Illustrate a deviation score for an individual with a high depression score. Do the same for a low depression score.
Why do we need to square deviation scores when computing variability?
Compute the sample standard deviation for the following sample of data (use an Excel spreadsheet to set up the columns and perform the calculations if you want): 25, 0, 1, 0, 2, 14, 0, 2, 1, 5, 3.
The Beck Depression Inventory scoring manual reports that the mean and standard deviation of a mildly depressed normative sample are 19 and 6, respectively. Provide an interpretation of the standard deviation from the previous question.
The variance of depression scores is 6^2 = 36. Provide an interpretation of the variance.
Still referring to the Beck Depression Inventory example, what would the researchers have to do to obtain the standard deviation parameter?
The standard deviation of a sample of depression scores is 6. Describe the concept of sampling error in this context. (Hint: Is this estimate identical to the variability in the full population?)