Descriptive Stats

Descriptive Statistics: Summary Information

  • Descriptive statistics provide summary information for each variable in a dataset.

  • Key components include:

    • Number of cases

    • Central tendency

    • Dispersion

  • Used by researchers to describe variables.

  • Important caution: for Likert-scale items, it generally doesn’t make sense to report raw counts (e.g., how many people answered 1, 2, 3, 4, or 5) in descriptive summaries; frequencies and percentages are more appropriate for categorical variables.

  • Descriptive statistics lay the groundwork before conducting statistical tests that analyze differences and relationships between variables.

Data Layout and Purpose of Descriptive Statistics

  • Data layout in many studies:

    • Each person’s data is on one row (case)

    • Variables are in columns

  • Some variables are binary; others are interval or continuous.

  • Patterns are hard to discern in raw data; descriptive statistics summarize the data and help reveal those patterns.

Application of Descriptive Statistics

  • Reported in methods section of research reports.

  • For each variable, report:

    • Mean, standard deviation (sd), range, and sample size (n or N).

    • Frequencies: how many times a particular value occurs (usually for categorical variables).

    • Percentages: describe characteristics or attributes of participants (usually for categorical variables).
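
The reporting checklist above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the `scores` and `majors` values are invented for the example, and the sample (n - 1) standard deviation is assumed.

```python
import statistics
from collections import Counter

# Hypothetical data, invented purely for illustration.
scores = [2, 3, 3, 4, 4, 4, 5, 5]  # an interval-level variable
majors = ["COMM", "PSYC", "COMM", "SOC", "COMM", "PSYC", "SOC", "COMM"]  # categorical

# Number of cases, central tendency, and dispersion for the numeric variable.
n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)            # sample sd (divides by n - 1)
score_range = max(scores) - min(scores)  # highest minus lowest score

# Frequencies and percentages for the categorical variable.
frequencies = Counter(majors)
percentages = {k: 100 * v / len(majors) for k, v in frequencies.items()}

print(f"N = {n}, M = {mean:.2f}, SD = {sd:.2f}, range = {score_range}")
for category, count in frequencies.items():
    print(f"{category}: {count} ({percentages[category]:.0f}%)")
```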

Number of Cases and Sample Size

  • Number of cases describes how many data points are reported.

  • Denoted by n or N (depends on design).

  • Typically, N is the total sample size (e.g., N = 231).

  • Small n is used when there are multiple sample sizes (e.g., in experiments): n1 = 50, n2 = 60.

  • Cases can be:

    • People

    • Speaking turns

    • Episodes

    • Any phenomenon studied

Measures of Central Tendency

  • Measures include:

    • Mean: arithmetic average; most sensitive to extreme scores.

    • Median: middle value; not sensitive to extreme scores; useful when distribution is non-normal.

    • Mode: value(s) that appear most often; useful for continuous and categorical data; some distributions have more than one mode.

  • When distributions are non-normal, the median can be more informative than the mean.

  • Relationship to normal curve:

    • In a perfectly normal distribution, the mean, median, and mode are equal.

    • In positively skewed distributions, mean > median.

    • In negatively skewed distributions, mean < median.

  • Example connection: median home prices in Tucson might be more representative than the mean when the distribution is skewed.
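
To see why the median can be more representative than the mean in a skewed distribution, here is a small sketch. The home prices are made-up numbers (in thousands of dollars), not real Tucson data.

```python
import statistics

# Hypothetical home prices in thousands of dollars; one extreme value
# creates a positive (right) skew.
prices = [200, 220, 240, 250, 260, 280, 1500]

mean_price = statistics.mean(prices)      # pulled upward by the 1500 outlier
median_price = statistics.median(prices)  # middle value, unaffected by the outlier

# A distribution can have more than one mode.
modes = statistics.multimode([1, 2, 2, 3, 3])

print(f"mean = {mean_price:.1f}, median = {median_price}")
print(f"modes = {modes}")
```

Here the mean lands well above the median, matching the rule that mean > median in positively skewed distributions.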

Measures of Dispersion

  • Describes the variability or spread of scores.

  • Should be reported with the mean.

  • Common dispersion measures:

    • Range: difference between highest and lowest score.

    • Standard deviation (sd): average distance of raw scores from the mean.

  • Interpretations:

    • If sd = 0, all scores are the same.

    • Larger sd indicates scores differ more from the mean on average.
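
A short sketch of the two interpretations above, using invented scores: identical scores give sd = 0, and more spread-out scores give a larger sd.

```python
import statistics

# Hypothetical score sets, invented for illustration.
identical = [4, 4, 4, 4]   # no variability at all
spread_out = [1, 3, 5, 7]  # same mean (4), but scores differ from it

sd_identical = statistics.stdev(identical)    # sample sd (n - 1 denominator)
sd_spread = statistics.stdev(spread_out)
range_spread = max(spread_out) - min(spread_out)

print(f"identical: sd = {sd_identical}")
print(f"spread out: sd = {sd_spread:.2f}, range = {range_spread}")
```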

Crunching Numbers: Tools and Process

  • Tools: calculator with square root, spreadsheet software, or statistics programs (R, SAS, SPSS).

  • There are online tools offering fairly sophisticated analyses.

  • Researchers must: select the appropriate descriptive statistic and the appropriate test.

  • The accuracy of results depends on correct input; wrong input leads to errors.

Preliminaries: Population vs. Sample and Basic Notation

  • Greek symbols typically refer to population values (e.g., µ, σ).

  • Sample estimates (often denoted by non-Greek or accented symbols; see below) are used to estimate population values.

  • In most studies, we cannot measure everyone in the population; we estimate population values from a sample.

  • The symbol Σ (capital sigma) means “sum everything that follows.”

  • Normal mathematics (PEMDAS) applies to statistical formulas:

    • Parentheses, Exponents, Multiplication/Division (left to right), Addition/Subtraction (left to right).

Population vs. Sample Symbols

  • Common symbolic mapping (as presented):

    • Population: mean = µ, variance = σ^2, standard deviation = σ

    • Sample: mean = x̄ (x-bar), variance = s^2, standard deviation = s

  • (The slide row lists a term-by-symbol mapping; focus on the standard convention above.)

General Variance Formula

  • Variance measures the average squared deviation from the mean.

  • Two main formulas depending on whether you describe the population or estimate it from a sample:

    • Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$

    • Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

  • Notation and order of operations matter: Σ (capital sigma) sums the squared deviations, and the sum is then divided by the appropriate denominator (N for a population, n - 1 for a sample in most cases).
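
The two variance formulas can be written out step by step as follows. This is a sketch on invented data; the function names `population_variance` and `sample_variance` are illustrative, and the results are cross-checked against Python's standard `statistics` module.

```python
import statistics

def population_variance(xs):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Estimate of the population variance from a sample (divide by n - 1)."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical scores: mean = 4, sum of squared deviations = 6.
data = [2, 4, 4, 5, 5]

pop_var = population_variance(data)  # 6 / 5
samp_var = sample_variance(data)     # 6 / 4

print(f"population variance = {pop_var}, sample variance = {samp_var}")
```

Note the order of operations: subtract the mean, square each deviation, sum, and only then divide.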

Computing Variance: An Example

  • The dataset (livesup) provides

    • sum of squared deviations = 11.6

    • sum of deviations from mean before squaring = not shown (deviations from the mean always sum to zero, which is why they are squared before summing)

    • mean = 3.6 (from the slide): the average score.

  • Denominators under discussion: 14 or 15, depending on whether you use the sample or population formula.

  • Possible variance values:

    • 11.6 / 14 = 0.83

    • 11.6 / 15 = 0.77

  • The slide notes that 0.77 appears as the average of the squared deviations (dividing by N) and is the population variance formula.

  • The slide also notes the difference between the two formulas is small for large samples.

Standard Deviation and Its Interpretation in the Example

  • Standard deviation is the square root of the variance:

    • If variance = 0.83 (sample formula): $s = \sqrt{0.83} \approx 0.91$

    • If variance = 0.77 (alternative, population-based denominator): $s = \sqrt{0.77} \approx 0.88$

  • The slide suggests reporting the standard deviation with the chosen variance denominator and notes that differences are negligible for large samples.
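
The arithmetic in this example can be checked with a few lines of Python. The sketch uses only the summary values given above (sum of squared deviations = 11.6, and 15 cases, as implied by the denominators 14 and 15); the raw livesup scores are not reproduced here.

```python
import math

sum_sq_dev = 11.6  # sum of squared deviations, from the example
n = 15             # number of cases, implied by the denominators 14 and 15

sample_var = sum_sq_dev / (n - 1)  # sample formula
pop_var = sum_sq_dev / n           # population formula

sample_sd = math.sqrt(sample_var)
pop_sd = math.sqrt(pop_var)

print(f"sample:     variance = {sample_var:.2f}, sd = {sample_sd:.2f}")
print(f"population: variance = {pop_var:.2f}, sd = {pop_sd:.2f}")
```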

Implications of Variance and Standard Deviation

  • Variance tells us about the relationship of a score to the rest of the scores.

  • Standard deviation ranges from zero to the maximum variability possible for a given scale.

  • A score’s percentile relative to the mean can be inferred from the distribution (e.g., SAT percentile logic).

  • Knowing variance helps predict where most scores lie and aids in hypothesis testing by generalizing from sample to population (lead-in to next lecture).

The Normal Curve: Theoretical vs Empirical

  • Theoretical Normal Distribution (bell curve):

    • Y-axis: Relative Frequency

    • Area under the curve equals 1 (or 100%)

    • Most data cluster around the center; symmetrical; tails extend to infinity in both directions

    • Mean = Median = Mode

  • Empirical Normal Distribution: an approximation to the theoretical curve; becomes closer to normal with larger samples; based on sampling simulations.

The Normal Curve: Percentages Within Standard Deviation

  • Known percentages around the mean on a normal curve:

    • 34% of scores lie between the mean and +1 SD (or between mean and -1 SD).

    • An additional 14% lie between +1 SD and +2 SD (or between -1 SD and -2 SD).

    • Therefore, about 96% of all scores are within ±2 SDs of the mean (34% + 34% + 14% + 14%); with unrounded values the figure is closer to 95%.

  • These 34% and 14% values are often used for quick mental checks (e.g., Mean Height = 65, SD = 4 as a mnemonic example).
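
The 34% and 14% figures are rounded areas under the normal curve. As a quick check, they can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); this is a sketch for verification, not part of the original slides.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area between the mean and +1 SD, and between +1 SD and +2 SD.
mean_to_1sd = std_normal.cdf(1) - std_normal.cdf(0)
one_to_2sd = std_normal.cdf(2) - std_normal.cdf(1)

# By symmetry, the area within +/-2 SD is twice the sum of the two bands.
within_2sd = 2 * (mean_to_1sd + one_to_2sd)

print(f"mean to +1 SD:  {100 * mean_to_1sd:.2f}%")
print(f"+1 SD to +2 SD: {100 * one_to_2sd:.2f}%")
print(f"within +/-2 SD: {100 * within_2sd:.2f}%")
```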

The Standard Normal Curve and Z-Scores

  • A standard score expresses a score’s position relative to the mean using the standard deviation as the unit; it standardizes different scales.

  • A z-score represents the number of standard deviations a value is above or below the mean:

    • Population-based standardization: $z = \frac{x - \mu}{\sigma}$

    • Standard normal distribution has mean 0 and standard deviation 1: $\mu_Z = 0, \sigma_Z = 1$

  • Converting to z-scores allows comparison across differently scaled variables.

  • Note: The z-table is applicable only if the data are (approximately) normally distributed.
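
A minimal sketch of the z-score formula follows. The helper name `z_score` and the two raw scores are illustrative assumptions; the means and standard deviations are borrowed from examples elsewhere in these notes.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Two differently scaled variables become comparable once standardized.
z_height = z_score(73, mu=65, sigma=4)  # height scale: mean 65, SD 4
z_test = z_score(13, mu=10, sigma=2)    # another scale: mean 10, SD 2

print(f"height z = {z_height}, test z = {z_test}")
```

In this example the height score sits further above its mean (in SD units) than the test score does, even though the raw numbers are on different scales.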

The Normal Curve Table and Percentiles

  • The normal curve table provides the precise percentage of scores between the mean (z = 0) and any z-score.

  • Uses:

    • Proportion above or below a given z-score

    • Proportion between the mean and a given z-score

    • Proportion between two z-scores

  • When using z-tables, remember they assume a normal distribution; non-normal distributions (rectangular, skewed, leptokurtic, bimodal) do not align with the standard normal table.

Quick Visuals and External Tools

  • A common demonstration site shows areas under the standard normal curve (e.g., areas between 0 and 1, or 0 and -1) and the corresponding decimal areas.

  • The Z-table can also be used to determine z-scores for a given proportion of scores and vice versa.

Finding Area When the Score is Known

  • Step-by-step approach:
    1) Convert the raw score to a z-score.
    2) Draw the normal curve and locate the z-score.
    3) Shade the area corresponding to the desired proportion.
    4) Make a rough estimate of the shaded area’s percentage.
    5) Use the normal curve table to find the exact percentage.
    6) Check that the exact percentage is close to the estimate.

  • Example (mean = 10, sd = 2):

    • Find the percentage above 12: 16%

    • Find the percentage below 12: 84%

    • Find the percentage above 8: 84%

    • Find the percentage below 8: 16%

    • Find the percentage above 9: 69%

    • Find the percentage below 7: 7%
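
In place of a printed table, the areas in this example can be looked up with `statistics.NormalDist` (Python 3.8 or later). This is a sketch that assumes the scores are normally distributed with mean 10 and sd 2, as in the example; the helper names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=10, sigma=2)

def pct_below(x):
    """Percentage of scores falling below x."""
    return 100 * scores.cdf(x)

def pct_above(x):
    """Percentage of scores falling above x."""
    return 100 * (1 - scores.cdf(x))

for label, value in [
    ("above 12", pct_above(12)),
    ("below 12", pct_below(12)),
    ("above 8", pct_above(8)),
    ("below 8", pct_below(8)),
    ("above 9", pct_above(9)),
    ("below 7", pct_below(7)),
]:
    print(f"{label}: {value:.1f}%")
```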

Finding Area Between Two Scores

  • Steps identical to the above, but shade the area between two z-scores.

  • Example (mean = 10, sd = 2):

    • Between 10 and 12: 34%

    • Between 8 and 10: 34%

    • Between 6 and 10: 48%

    • Between 6 and 8: 14%

    • Between 10.5 and 11: 9%

    • Between 8.5 and 11: 46%
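
The area between two scores is the difference of two cumulative areas. A sketch with `statistics.NormalDist`, again assuming a normal distribution with mean 10 and sd 2; `pct_between` is an illustrative helper name.

```python
from statistics import NormalDist

scores = NormalDist(mu=10, sigma=2)

def pct_between(low, high):
    """Percentage of scores falling between low and high."""
    return 100 * (scores.cdf(high) - scores.cdf(low))

for low, high in [(10, 12), (8, 10), (6, 10), (6, 8), (10.5, 11), (8.5, 11)]:
    print(f"between {low} and {high}: {pct_between(low, high):.1f}%")
```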

Finding Scores When the Area (z-score) is Known

  • Steps:
    1) Draw the normal curve; shade the approximate area corresponding to the desired percentage.
    2) Make a rough estimate of the starting z-score where the shaded area begins.
    3) Use the normal curve table to find the exact z-score.
    4) Convert the z-score to a raw score if desired: $x = z \cdot s + \bar{x}$

  • Example scaffolding (mean = 10, sd = 2):

    • What raw score corresponds to certain percentile positions (e.g., 50% above, 84% below, 98% above, 62% below, 30% above).

  • Practical tip: for the last two examples, solving for z may require working backward; online z-tables can be helpful.
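
Working backward from an area to a score is what an inverse normal lookup does. The sketch below uses `NormalDist.inv_cdf` from the Python standard library, assuming a normal distribution with mean 10 and sd 2; the helper names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=10, sigma=2)

def score_with_pct_below(p):
    """Raw score that has proportion p of the distribution below it."""
    return scores.inv_cdf(p)

def score_with_pct_above(p):
    """Raw score that has proportion p of the distribution above it."""
    return scores.inv_cdf(1 - p)

print(f"50% above: {score_with_pct_above(0.50):.2f}")
print(f"84% below: {score_with_pct_below(0.84):.2f}")
print(f"98% above: {score_with_pct_above(0.98):.2f}")
print(f"62% below: {score_with_pct_below(0.62):.2f}")
print(f"30% above: {score_with_pct_above(0.30):.2f}")
```

Because `scores` already carries the mean and sd, `inv_cdf` returns a raw score directly; calling `NormalDist().inv_cdf(p)` instead would return the z-score.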

Remember: Normality and the Z-Table

  • Key reminder: The standard normal distribution table (z-table) should be used only when the score distribution is normal.

  • If the distribution differs markedly from normality (non-normal), transforming to z-scores and using the z-table is inappropriate for exact probabilities.

  • Non-normal distributions include rectangular, skewed, leptokurtic, and bimodal shapes.

Standard Error Revisited and Confidence Intervals for the Mean

  • The standard error of the mean (s divided by the square root of n) estimates how far a sample mean is likely to fall from the population mean.

  • Confidence interval for the mean (commonly 95%):

    • Example: sample of 100 students, mean attitude score toward tuition increases = 4 (on a 1–10 scale), sample sd = 1.

    • 95% CI for the population mean: $\bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} = 4 \pm 1.96 \cdot \frac{1}{\sqrt{100}} = 4 \pm 0.196$

    • Resulting interval: $[3.804, 4.196]$

    • Interpretation: if the study were repeated many times, about 95% of the intervals built this way would contain the population mean; about 5% would miss it.
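
The confidence interval in this example can be reproduced as follows. This is a sketch using the values stated above (n = 100, mean = 4, sd = 1) and the normal-based multiplier; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 100          # sample size
sample_mean = 4  # mean attitude score
sample_sd = 1    # sample standard deviation
confidence = 0.95

# Multiplier that leaves (1 - confidence) / 2 in each tail; about 1.96 for 95%.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = sample_sd / math.sqrt(n)
margin = z * standard_error

lower = sample_mean - margin
upper = sample_mean + margin

print(f"SE = {standard_error:.3f}, margin = {margin:.3f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```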

Our Friend, 1.96: The 95% Confidence Multiplier

  • On a standard normal distribution, approximately 5% of the curve lies outside ±1.96 standard deviations from the mean (2.5% in each tail); for a sampling distribution of the mean, those units are standard errors.

  • Therefore, ±1.96 captures the central 95% of the distribution around the mean in confidence interval estimation.
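
Where 1.96 comes from can be checked in two directions with `statistics.NormalDist`: from the area to the multiplier, and from the multiplier back to the area. A brief sketch:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# z-value with 97.5% of the curve below it (2.5% in the upper tail).
multiplier = std_normal.inv_cdf(0.975)

# Total area in both tails beyond +/-1.96.
area_outside = 2 * (1 - std_normal.cdf(1.96))

print(f"multiplier = {multiplier:.4f}")
print(f"area outside +/-1.96 = {100 * area_outside:.2f}%")
```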

Degrees of Freedom (DF)

  • Intuition: Degrees of freedom are the number of independent pieces of information available to estimate a parameter.

  • Concrete example: with two observations, you have two independent observations for the mean; however, when calculating the variance, only one independent piece of information remains because the two observations are constrained by their distance from the mean.

  • Conceptual view: DF reflect how many values are free to vary when estimating a parameter given fixed other parameters.

  • Practical note: In many cases, using DF properly prevents underestimating the population parameter when using sample estimates.

  • Summary: A sample statistic is an estimate of a population parameter; neglecting degrees of freedom can bias that estimate.
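
The underestimation point can be illustrated without any randomness by listing every possible sample of size 2 from a tiny population. The sketch below assumes a made-up population of five scores and sampling with replacement; it compares the average variance obtained by dividing by n with the average obtained by dividing by n - 1 (the degrees of freedom).

```python
import itertools
import statistics

# Hypothetical population of five scores (floats keep the arithmetic uniform).
population = [1.0, 2.0, 3.0, 4.0, 5.0]
true_variance = statistics.pvariance(population)  # population formula: divide by N

# Every possible ordered sample of size 2, drawn with replacement (25 in total).
samples = list(itertools.product(population, repeat=2))

# Average of the per-sample variances under each denominator.
avg_divide_by_n = statistics.mean([statistics.pvariance(s) for s in samples])
avg_divide_by_df = statistics.mean([statistics.variance(s) for s in samples])

print(f"true population variance:     {true_variance}")
print(f"average when dividing by n:   {avg_divide_by_n}")
print(f"average when dividing by n-1: {avg_divide_by_df}")
```

In this setup, dividing by n gives estimates that average only half the true variance, while dividing by n - 1 recovers it on average.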