Descriptive Stats

Descriptive Statistics: Summary Information

Descriptive statistics provide summary information for each variable in a dataset.
Key components include:
- Number of cases
- Central tendency
- Dispersion
Used by researchers to describe variables.
Important caution: it generally doesn’t make sense to report the raw counts for Likert-scale responses (e.g., how many people answered 1, 2, 3, 4, or 5) in descriptive summaries; frequencies/percentages are more appropriate for categorical variables.
Descriptive statistics lay the groundwork before conducting statistical tests that analyze differences and relationships between variables.

Data Layout and Purpose of Descriptive Statistics

Data layout in many studies:
- Each person’s data is on one row (case)
- Variables are in columns
Some variables are binary; others are interval or continuous.
Raw data can be hard to discern patterns from; descriptive statistics help reveal patterns and summaries.

Application of Descriptive Statistics

Reported in methods section of research reports.
For each variable, report:
- Mean, standard deviation (sd), range, and sample size (n or N).
- Frequencies: how many times a particular value occurs (usually for categorical variables).
- Percentages: describe characteristics or attributes of participants (usually for categorical variables).
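
As a rough illustration of frequencies and percentages for a categorical variable, here is a minimal Python sketch. The `class_standing` data are hypothetical and invented for this example; only the standard library is used.

```python
from collections import Counter

# Hypothetical categorical variable: one value per participant (case).
class_standing = ["junior", "senior", "junior", "sophomore", "junior", "senior"]

n = len(class_standing)                # sample size (N)
frequencies = Counter(class_standing)  # how many times each value occurs
percentages = {value: 100 * count / n for value, count in frequencies.items()}

for value, count in frequencies.items():
    print(f"{value}: {count} ({percentages[value]:.1f}%)")
```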

Number of Cases and Sample Size

Number of cases describes how many data points are reported.
Denoted by n or N (depends on design).
Typically, N is the total sample size (e.g., N = 231).
Small n is used when there are multiple sample sizes (e.g., in experiments): n1 = 50, n2 = 60.
Cases can be:
- People
- Speaking turns
- Episodes
- Any phenomenon studied

Measures of Central Tendency

Measures include:
- Mean: arithmetic average; most sensitive to extreme scores.
- Median: middle value; not sensitive to extreme scores; useful when distribution is non-normal.
- Mode: value(s) that appear most often; useful for continuous and categorical data; some distributions have more than one mode.
When distributions are non-normal, the median can be more informative than the mean.
Relationship to normal curve:
- In a perfectly normal distribution, the mean, median, and mode are equal.
- In positively skewed distributions, mean > median.
- In negatively skewed distributions, mean < median.
Example connection: median home prices in Tucson might be more representative than the mean when the distribution is skewed.
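
The sketch below shows how one extreme score pulls the mean above the median in a positively skewed distribution. The `prices` values are hypothetical (they are not real Tucson data); the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical home prices in thousands of dollars.
# The single very high value creates positive skew.
prices = [180, 195, 200, 210, 210, 225, 950]

mean_price = statistics.mean(prices)      # sensitive to the extreme score
median_price = statistics.median(prices)  # middle value of the sorted scores
modes = statistics.multimode(prices)      # list of the most frequent value(s)

print("mean:", mean_price)
print("median:", median_price)
print("mode(s):", modes)
```

Here the mean lands well above the median, which matches the mean > median pattern for positive skew.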

Measures of Dispersion

Describes the variability or spread of scores.
Should be reported with the mean.
Common dispersion measures:
- Range: difference between highest and lowest score.
- Standard deviation (sd): average distance of raw scores from the mean.
Interpretations:
- If sd = 0, all scores are the same.
- Larger sd indicates scores differ more from the mean on average.
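
To make these interpretations concrete, the following sketch compares three hypothetical sets of scores that share the same mean. The data and variable names are assumptions made for illustration.

```python
import statistics

# Three hypothetical sets of scores, all with a mean of 5.
constant = [5, 5, 5, 5, 5]  # no variability at all
tight = [4, 5, 5, 5, 6]     # scores close to the mean
wide = [1, 3, 5, 7, 9]      # scores spread far from the mean

def score_range(scores):
    """Range: difference between the highest and lowest score."""
    return max(scores) - min(scores)

for name, scores in [("constant", constant), ("tight", tight), ("wide", wide)]:
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    print(f"{name}: range = {score_range(scores)}, sd = {sd:.2f}")
```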

Crunching Numbers: Tools and Process

Tools: calculator with square root, spreadsheet software, or statistics programs (R, SAS, SPSS).
There are online tools offering fairly sophisticated analyses.
Researchers must: select the appropriate descriptive statistic and the appropriate test.
The accuracy of results depends on correct input; wrong input leads to errors.

Preliminaries: Population vs. Sample and Basic Notation

Greek symbols typically refer to population values (e.g., µ, σ).
Sample estimates (often denoted by non-Greek or accented symbols; see below) are used to estimate population values.
In most studies, we cannot measure everyone in the population; we estimate population values from a sample.
The symbol Σ (capital sigma) means “sum everything that follows.”
The standard order of operations (PEMDAS) applies to statistical formulas:
- Parentheses, Exponents, Multiplication/Division (left to right), Addition/Subtraction (left to right).

Population vs. Sample Symbols

Common symbolic mapping (as presented):
- Population: mean = $\mu$, variance = $\sigma^2$, standard deviation = $\sigma$
- Sample: mean = $\bar{x}$, variance = $s^2$, standard deviation = $s$
(The slide row lists a term-by-symbol mapping; focus on the standard convention above.)

General Variance Formula

Variance measures the average squared deviation from the mean.
Two main formulas depending on whether you describe the population or estimate it from a sample:
- Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Notation and order of operations matter: subtract the mean from each score, square each deviation, sum the squared deviations (Σ, capital sigma), and then divide by the appropriate denominator (N for a population, n - 1 for a sample in most cases).
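
The two formulas can be sketched in Python as follows. The function names and the `data` values are hypothetical; the standard library's `statistics.pvariance` and `statistics.variance` serve only as a cross-check.

```python
import statistics

def sum_of_squared_deviations(scores, center):
    """Subtract the center from each score, square, then sum (the Σ step)."""
    return sum((x - center) ** 2 for x in scores)

def population_variance(scores):
    """Divide the sum of squared deviations by N."""
    mu = sum(scores) / len(scores)
    return sum_of_squared_deviations(scores, mu) / len(scores)

def sample_variance(scores):
    """Divide the sum of squared deviations by n - 1."""
    x_bar = sum(scores) / len(scores)
    return sum_of_squared_deviations(scores, x_bar) / (len(scores) - 1)

data = [2, 4, 6, 8]  # hypothetical scores with mean 5

print("population variance:", population_variance(data))
print("sample variance:", sample_variance(data))
print("library cross-check:", statistics.pvariance(data), statistics.variance(data))
```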

Computing Variance: An Example

The dataset (livesup) provides
- sum of squared deviations = 11.6
- sum of deviations from mean before squaring = not shown (deviations from the mean always sum to 0, which is why they are squared before summing)
- mean = 3.6 (from the slide): the average score.
Denominators under discussion: 14 or 15. With 15 cases, the sample formula divides by n - 1 = 14 and the population formula divides by N = 15.
Possible variance values:
- Sample formula: 11.6 / 14 ≈ 0.83
- Population formula: 11.6 / 15 ≈ 0.77
The slide notes that 0.77 is the average of the squared deviations (dividing by N), which is the result of the population variance formula.
The slide also notes the difference between the two formulas is small for large samples.

Standard Deviation and Its Interpretation in the Example

Standard deviation is the square root of the variance:
- If variance = 0.83 (sample formula): $s = \sqrt{0.83} \approx 0.91$
- If variance = 0.77 (population formula, denominator N): $\sigma = \sqrt{0.77} \approx 0.88$
The slide suggests reporting the standard deviation with the chosen variance denominator and notes that differences are negligible for large samples.
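
The arithmetic for this example can be checked with a short sketch. The value `n = 15` is an assumption inferred from the two denominators (14 and 15) discussed in the notes; the raw livesup scores are not available here, so only the reported sum of squared deviations is used.

```python
import math

sum_of_squared_deviations = 11.6  # reported in the notes
n = 15                            # assumed from the denominators 14 and 15

variance_sample = sum_of_squared_deviations / (n - 1)  # sample formula
variance_population = sum_of_squared_deviations / n    # population formula

sd_sample = math.sqrt(variance_sample)
sd_population = math.sqrt(variance_population)

print(f"sample:     variance = {variance_sample:.2f}, sd = {sd_sample:.2f}")
print(f"population: variance = {variance_population:.2f}, sd = {sd_population:.2f}")
```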

Implications of Variance and Standard Deviation

Variance tells us how far scores typically fall from the mean, and therefore how any one score relates to the rest of the scores.
Standard deviation ranges from zero to the maximum variability possible for a given scale.
A score’s percentile relative to the mean can be inferred from the distribution (e.g., SAT percentile logic).
Knowing variance helps predict where most scores lie and aids in hypothesis testing by generalizing from sample to population (lead-in to next lecture).

The Normal Curve: Theoretical vs Empirical

Theoretical Normal Distribution (bell curve):
- Y-axis: Relative Frequency
- Area under the curve equals 1 (or 100%)
- Most data cluster around the center; symmetrical; tails extend to infinity in both directions
- Mean = Median = Mode
Empirical Normal Distribution: an approximation to the theoretical curve; becomes closer to normal with larger samples; based on sampling simulations.

The Normal Curve: Percentages Within Standard Deviation

Known percentages around the mean on a normal curve:
- 34% of scores lie between the mean and +1 SD (or between mean and -1 SD).
- An additional 14% lie between +1 SD and +2 SD (or between -1 SD and -2 SD).
- Therefore, about 96% of all scores are within ±2 SDs of the mean (34% + 34% + 14% + 14%); without rounding, the figure is closer to 95.4%.
These 34% and 14% values are often used for quick mental checks (e.g., Mean Height = 65, SD = 4 as a mnemonic example).
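
The rounded 34% and 14% figures can be compared with the unrounded areas using `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). This is only a numerical check; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # theoretical normal curve with mean 0 and SD 1

mean_to_1sd = std_normal.cdf(1) - std_normal.cdf(0)      # mean to +1 SD
from_1sd_to_2sd = std_normal.cdf(2) - std_normal.cdf(1)  # +1 SD to +2 SD
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)      # -2 SD to +2 SD

print(f"mean to +1 SD:  {100 * mean_to_1sd:.2f}%")
print(f"+1 SD to +2 SD: {100 * from_1sd_to_2sd:.2f}%")
print(f"within ±2 SD:   {100 * within_2sd:.2f}%")
```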

The Standard Normal Curve and Z-Scores

A standard score expresses a score’s position relative to the mean using the standard deviation as the unit; it standardizes different scales.
A z-score represents the number of standard deviations a value is above or below the mean:
- Population-based standardization: $z = \frac{x - \mu}{\sigma}$
- Standard normal distribution has mean 0 and standard deviation 1: $\mu_z = 0, \sigma_z = 1$
Converting to z-scores allows comparison across differently scaled variables.
Note: The z-table is applicable only if the data are (approximately) normally distributed.
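
A minimal sketch of the z-score conversion, comparing one person's scores on two differently scaled tests. The test names, means, and standard deviations are hypothetical values chosen for the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical scores for one person on two tests with different scales.
z_test_a = z_score(85, mean=75, sd=10)     # test A: 0-100 scale
z_test_b = z_score(620, mean=500, sd=100)  # test B: 200-800 scale

print("z on test A:", z_test_a)
print("z on test B:", z_test_b)
```

Although the raw scores cannot be compared directly, the z-scores show that this person is further above the mean on test B than on test A.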

The Normal Curve Table and Percentiles

The normal curve table provides the precise percentage of scores between the mean (z = 0) and any z-score.
Uses:
- Proportion above or below a given z-score
- Proportion between the mean and a given z-score
- Proportion between two z-scores
When using z-tables, remember they assume a normal distribution; non-normal distributions (rectangular, skewed, leptokurtic, bimodal) do not align with the standard normal table.

Quick Visuals and External Tools

A common demonstration site shows areas under the standard normal curve (e.g., areas between 0 and 1, or 0 and -1) and the corresponding decimal areas.
The Z-table can also be used to determine z-scores for a given proportion of scores and vice versa.

Finding Area When the Score is Known

Step-by-step approach:
1) Convert the raw score to a z-score.
2) Draw the normal curve and locate the z-score.
3) Shade the area corresponding to the desired proportion.
4) Make a rough estimate of the shaded area’s percentage.
5) Use the normal curve table to find the exact percentage.
6) Check that the exact percentage is close to the estimate.
Example (mean = 10, sd = 2):
- Find the percentage above 12: 16%
- Find the percentage below 12: 84%
- Find the percentage above 8: 84%
- Find the percentage below 8: 16%
- Find the percentage above 9: 69%
- Find the percentage below 7: 7%
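
As an alternative to a printed normal curve table, the same areas can be computed with `statistics.NormalDist`. This sketch assumes the scores are normally distributed with the example's mean of 10 and sd of 2; the helper names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=10, sigma=2)  # assumes normally distributed scores

def pct_below(x):
    """Percentage of scores below x (area to the left of x)."""
    return 100 * scores.cdf(x)

def pct_above(x):
    """Percentage of scores above x (area to the right of x)."""
    return 100 * (1 - scores.cdf(x))

for x in (12, 8, 9, 7):
    print(f"x = {x}: below = {pct_below(x):.1f}%, above = {pct_above(x):.1f}%")
```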

Finding Area Between Two Scores

Steps identical to the above, but shade the area between two z-scores.
Example (mean = 10, sd = 2):
- Between 10 and 12: 34%
- Between 8 and 10: 34%
- Between 6 and 10: 48%
- Between 6 and 8: 14%
- Between 10.5 and 11: 9% (z = 0.25 to z = 0.50: 19.15% - 9.87% ≈ 9.3%)
- Between 8.5 and 11: 46%
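
The area between two scores can be sketched by converting both scores to z-scores and subtracting the cumulative areas on the standard normal curve. The helper names `to_z` and `pct_between` are hypothetical, and normality is assumed.

```python
from statistics import NormalDist

MEAN, SD = 10, 2           # values from the example
std_normal = NormalDist()  # standard normal curve: mean 0, SD 1

def to_z(x):
    """Step 1: convert a raw score to a z-score."""
    return (x - MEAN) / SD

def pct_between(low, high):
    """Percentage of scores between two raw scores."""
    return 100 * (std_normal.cdf(to_z(high)) - std_normal.cdf(to_z(low)))

for low, high in [(10, 12), (8, 10), (6, 10), (6, 8), (10.5, 11), (8.5, 11)]:
    print(f"between {low} and {high}: {pct_between(low, high):.1f}%")
```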

Finding Scores When the Area (z-score) is Known

Steps:
1) Draw the normal curve; shade the approximate area corresponding to the desired percentage.
2) Make a rough estimate of the starting z-score where the shaded area begins.
3) Use the normal curve table to find the exact z-score.
4) Convert the z-score to a raw score if desired: $x = \bar{x} + z \cdot s$ (or, with population values, $x = \mu + z \cdot \sigma$)
Example scaffolding (mean = 10, sd = 2):
- What raw score corresponds to each of these percentile positions: 50% above, 84% below, 98% above, 62% below, 30% above?
Practical tip: for the last two examples, solving for z may require working backward; online z-tables can be helpful.
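
Working backward from an area to a score can be sketched with `NormalDist.inv_cdf`, which returns the z-score with a given proportion of the curve below it. The function names are hypothetical, and the sketch assumes normally distributed scores with mean 10 and sd 2.

```python
from statistics import NormalDist

MEAN, SD = 10, 2
std_normal = NormalDist()

def raw_score_with_pct_below(pct):
    """Raw score that has `pct` percent of scores below it."""
    z = std_normal.inv_cdf(pct / 100)  # z-score for that cumulative area
    return MEAN + z * SD               # convert z back to a raw score

def raw_score_with_pct_above(pct):
    """Raw score that has `pct` percent of scores above it."""
    return raw_score_with_pct_below(100 - pct)

print(f"50% above: {raw_score_with_pct_above(50):.2f}")
print(f"84% below: {raw_score_with_pct_below(84):.2f}")
print(f"98% above: {raw_score_with_pct_above(98):.2f}")
print(f"62% below: {raw_score_with_pct_below(62):.2f}")
print(f"30% above: {raw_score_with_pct_above(30):.2f}")
```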

Remember: Normality and the Z-Table

Key reminder: The standard normal distribution table (z-table) should be used only when the score distribution is normal.
If the distribution differs markedly from normality (non-normal), transforming to z-scores and using the z-table is inappropriate for exact probabilities.
Non-normal distributions include rectangular, skewed, leptokurtic, and bimodal shapes.

Standard Error Revisited and Confidence Intervals for the Mean

The standard error of the mean ($s / \sqrt{n}$) estimates how far sample means typically fall from the population mean.
Confidence interval for the mean (commonly 95%):
- Example: sample of 100 students, mean attitude score toward tuition increases = 4 (on a 1–10 scale), sample sd = 1.
- 95% CI for the population mean: $\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = 4 \pm 1.96 \cdot \frac{1}{\sqrt{100}} = 4 \pm 0.196$
- Resulting interval: $[3.804, 4.196]$
- Interpretation: if the study were repeated many times, about 95% of intervals built this way would contain the population mean; about 5% would miss it.
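
The interval in this example can be reproduced with a short sketch. It uses the z-based formula from the notes and the example's values (n = 100, mean = 4, sd = 1); the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 100      # sample size
x_bar = 4.0  # sample mean
s = 1.0      # sample standard deviation

standard_error = s / math.sqrt(n)
z_critical = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
margin = z_critical * standard_error

ci_lower, ci_upper = x_bar - margin, x_bar + margin
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
```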

Our Friend, 1.96: The 95% Confidence Multiplier

On a standard normal distribution, approximately 5% of the curve lies beyond z = ±1.96 (about 2.5% in each tail).
Therefore, ±1.96 standard errors captures the central 95% of the distribution around the mean in confidence interval estimation.
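
A quick numerical check of the 1.96 multiplier, using the standard normal curve from Python's `statistics` module (variable names are illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist()

central_area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)  # between -1.96 and +1.96
tail_area = 1 - central_area                                 # both tails combined

print(f"central area: {100 * central_area:.2f}%")
print(f"tail area:    {100 * tail_area:.2f}%")
```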

Degrees of Freedom (DF)

Intuition: Degrees of freedom are the number of independent pieces of information available to estimate a parameter.
Concrete example: with two observations, both are free to vary when estimating the mean. Once the mean is fixed, however, the two deviations from it must sum to zero, so knowing one deviation determines the other; only one independent piece of information remains for estimating the variance.
Conceptual view: DF reflect how many values are free to vary when estimating a parameter given fixed other parameters.
Practical note: dividing by the degrees of freedom (n - 1) rather than n keeps the sample variance from systematically underestimating the population variance.
Summary: sample statistics are estimates of population parameters; neglecting degrees of freedom can bias those estimates.
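
The bias can be demonstrated exactly, without random simulation, by listing every possible sample of size 2 (drawn with replacement) from a tiny hypothetical population and averaging the two variance estimates. The population values are invented for illustration.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [1, 2, 3, 4, 5, 6]            # hypothetical population of scores
true_variance = pvariance(population)      # divides by N: the population value

# Every ordered sample of size 2 drawn with replacement (36 samples in total).
samples = list(product(population, repeat=2))

avg_dividing_by_n = mean(pvariance(s) for s in samples)         # ignores df
avg_dividing_by_n_minus_1 = mean(variance(s) for s in samples)  # uses df = n - 1

print(f"true population variance:   {true_variance:.4f}")
print(f"average estimate, / n:      {avg_dividing_by_n:.4f}")
print(f"average estimate, / (n-1):  {avg_dividing_by_n_minus_1:.4f}")
```

Averaged over all possible samples, dividing by n - 1 recovers the population variance, while dividing by n comes out too small (here, half the true value because n = 2).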