GB

Stats 171-05

Data and variables

  • A dataset has rows called cases (observations) and columns for different types of variables.
  • In many datasets, one column is an ID or label (e.g., a case number). The remaining columns are the variables you study.
  • A number such as 100 could be a sample size (n) or a case label, depending on context; the focus here is on reading rows as cases and columns as variables.
  • Types of variables:
    • Categorical (qualitative) variables: group data into categories.
    • Quantitative (numerical) variables: numerical measurements.
    • In practice, classify each variable as quantitative or categorical rather than calling everything "numerical."
  • Key terms from the transcript:
    • Case = label/observation (one row)
    • Variables = columns within the dataset; two broad types: quantitative (numerical) and categorical.
    • Two general kinds of summary, graphical and numerical, apply to any type of variable.

General summaries for datasets

  • Graphical summaries (visual):
    • Categorical variables: bar graph (bar chart).
    • Quantitative variables (e.g., GPA): histogram, dot plot, and box plot.
  • Numerical summaries (single-number summaries when appropriate):
    • For a categorical variable:
      • Frequency (f), proportion (p = f/n), and percentage (100p).
    • For a quantitative variable (e.g., GPA):
      • Range
      • Quartiles (Q1, Q2, Q3) and related values (min, max, median)
      • Five-number summary (min, Q1, median, Q3, max)
      • IQR (interquartile range) = Q3 − Q1
      • Percentiles, which generalize quartiles: the quartiles are the 25th, 50th, and 75th percentiles.

Numerical summaries by variable type (detailed)

  • Categorical variable summaries:
    • Frequency: how many observations in each category.
    • Proportion: fraction of observations in each category, p = f / n.
    • Percentage: 100 × p.
  • Quantitative variable summaries:
    • Central tendency measures (one-number summaries per column): mean (arithmetic average), median (Q2).
    • Spread/variability measures: range, IQR, variance, standard deviation.
    • Quartiles: Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile).
    • Five-number summary: min, Q1, Q2, Q3, max.
    • Percentiles: the p-th percentile is the value at or below which p% of the data fall.
    • Box plots (as a graphical summary) illustrate min, Q1, median, Q3, max and potential outliers.
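
The categorical summaries above (frequency, proportion, percentage) can be sketched in a few lines of Python. This is only an illustration: the class-year data and the names `class_year` and `summary` are made up for this example and do not come from the lecture.

```python
from collections import Counter

# Hypothetical categorical column (class year for 10 students); not lecture data.
class_year = ["Fr", "So", "So", "Jr", "Fr", "Sr", "So", "Jr", "So", "Fr"]

n = len(class_year)          # sample size
freq = Counter(class_year)   # frequency f for each category

# For each category: frequency f, proportion p = f / n, percentage 100 * p.
summary = {
    category: {"f": f, "p": f / n, "pct": 100 * f / n}
    for category, f in freq.items()
}

for category, row in summary.items():
    print(f"{category}: f = {row['f']}, p = {row['p']:.2f}, percentage = {row['pct']:.0f}%")
```

The frequencies add up to n and the proportions add up to 1, which is a quick check on any frequency table.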

Five-number summary and quartiles

  • Five-number summary components:
    • Min, Q1, Q2 (median), Q3, Max.
  • Interpretations:
    • Q1 is the 25th percentile: 25% of data are at or below Q1.
    • Q2 is the 50th percentile: the median, with 50% of data at or below it.
    • Q3 is the 75th percentile: 75% of data are at or below Q3.
    • The interval [Q1, Q3] contains the middle 50% of the data; the width of this interval is the interquartile range (IQR).
    • The IQR is a measure of spread for the middle 50% of observations: IQR = Q3 − Q1.
  • Example interpretations (contextual):
    • If cholesterol data have Q1 = 236 and Q3 = 288, then 25% of patients have cholesterol ≤ 236 and 75% have cholesterol ≤ 288; the middle 50% lie between 236 and 288.
    • The median (Q2) represents the 50th percentile; for instance, if the median is 280, then 50% of observations are at or below 280.
  • Important note: quartiles are special cases of percentiles (the 25th, 50th, and 75th); any other percentile is read the same way.
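
As a rough sketch, the five-number summary and IQR can be computed by hand in Python. The function below assumes the "median of each half" convention for Q1 and Q3, which is common in introductory courses; the notes do not say which convention the lecture uses, and software such as SPSS may interpolate differently and report slightly different quartiles. The function name `five_number_summary` is hypothetical, and the data are the ten observations from the worked example later in these notes.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention.

    Assumes at least two observations.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # lower half of the sorted data
    upper = data[half + n % 2:]   # upper half (the middle value is skipped when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

x = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]
minimum, q1, q2, q3, maximum = five_number_summary(x)
iqr = q3 - q1   # width of the interval holding the middle 50% of the data

print("min:", minimum, "Q1:", q1, "median:", q2, "Q3:", q3, "max:", maximum)
print("IQR:", iqr)
```

Under this convention the ten values give Q1 = 70, median = 79, Q3 = 87, and IQR = 17.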

Measures of spread (variability)

  • Range:

    • Definition: max − min
    • Describes the overall spread between the two extreme values.
  • Interquartile Range (IQR):

    • Definition: IQR = Q3 − Q1
    • Measures the spread of the middle 50% of the data.
  • Why range and IQR are not enough:

    • The range uses only the two extreme values and the IQR only the two quartiles, so neither reflects how all the values are distributed.
  • Variance and standard deviation (measure spread around the mean):

    • Variance (sample):

    • s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2

    • Standard deviation:

    • s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}

    • Notes:

    • The deviations are d_i = x_i − \bar{x} (differences from the mean).

    • The sum of deviations equals zero: ∑(x_i − \bar{x}) = 0.

    • Squaring the deviations avoids cancellation and yields a positive measure of spread.

    • Dividing by n−1 (instead of n) provides an unbiased estimator of the population variance (degrees of freedom concept).

    • Standard deviation has the same units as the data, unlike variance which is in squared units.

  • Intuition about standard deviation:

    • It can be read as the typical distance of the data points from the mean (strictly, the square root of the average squared deviation, with n − 1 in the denominator).
    • Points far from the mean contribute more to the variance because of the squaring.
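
The claim above that the deviations always sum to zero follows directly from the definition of the mean. A short derivation, using the same symbols as the variance formula:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .
```

This is why the raw deviations cannot be averaged to measure spread, and why they are squared first.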

Worked example (data and calculations)

  • Data (example set of 10 observations):
    • x = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]
  • Sample size: n = 10
  • Compute the mean:
    • \bar{x} = \frac{65+65+70+75+78+80+83+87+91+94}{10} = \frac{788}{10} = 78.8
  • Compute deviations from the mean: d_i = x_i − \bar{x}
    • 65: d = -13.8
    • 65: d = -13.8
    • 70: d = -8.8
    • 75: d = -3.8
    • 78: d = -0.8
    • 80: d = 1.2
    • 83: d = 4.2
    • 87: d = 8.2
    • 91: d = 12.2
    • 94: d = 15.2
  • Sum of deviations should be zero (check):
    • (-13.8) + (-13.8) + (-8.8) + (-3.8) + (-0.8) + 1.2 + 4.2 + 8.2 + 12.2 + 15.2 = 0
  • Squares of deviations:
    • (-13.8)^2 = 190.44
    • (-13.8)^2 = 190.44
    • (-8.8)^2 = 77.44
    • (-3.8)^2 = 14.44
    • (-0.8)^2 = 0.64
    • (1.2)^2 = 1.44
    • (4.2)^2 = 17.64
    • (8.2)^2 = 67.24
    • (12.2)^2 = 148.84
    • (15.2)^2 = 231.04
  • Sum of squared deviations:
    • \sum (x_i - \bar{x})^2 = 939.60
  • Compute the sample variance and standard deviation:
    • s^2 = \frac{939.60}{n-1} = \frac{939.60}{9} = 104.40
    • s = \sqrt{104.40} \approx 10.22
  • Interpretation from this example:
    • The mean is 78.8; the standard deviation is about 10.22, indicating that typical observations lie about 10 units away from the mean.
    • The sum of squared deviations amplifies the impact of values far from the mean (outliers have large effects on s^2).
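
The hand calculation above can be checked with a short Python sketch. It follows the same steps (mean, deviations, squared deviations, divide by n − 1, square root) and then compares the result against the standard library's `statistics.variance` and `statistics.stdev`, which also use the n − 1 denominator. The variable names are chosen for this illustration only.

```python
import math
from statistics import mean, stdev, variance

x = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]
n = len(x)

x_bar = mean(x)                            # sample mean
deviations = [xi - x_bar for xi in x]      # d_i = x_i - x_bar
sum_sq = sum(d ** 2 for d in deviations)   # sum of squared deviations
s2 = sum_sq / (n - 1)                      # sample variance
s = math.sqrt(s2)                          # sample standard deviation

print(f"mean = {x_bar:.1f}")
print(f"|sum of deviations| = {abs(sum(deviations)):.6f}")  # zero up to floating-point rounding
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")

# Cross-check the step-by-step values against the library functions.
assert math.isclose(s2, variance(x))
assert math.isclose(s, stdev(x))
```

Because the computer works in floating point, the sum of deviations may come out as a tiny nonzero number rather than exactly 0.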

Practical notes and study tips

  • Before calculating, write the big picture or “story” of the data, then add details to connect concepts (as emphasized in the lecture).
  • When interpreting quartiles and percentiles, always tie back to the context (e.g., what portion of the data lies at or below a threshold).
  • Distinguish between measures of center (mean, median) and measures of spread (range, IQR, variance, standard deviation).
  • When comparing distributions with the same mean, look at spread and shape as well: two distributions can share a center but differ in variability or skewness.
  • In practice, software like SPSS is used to calculate variance and standard deviation; the conceptual steps above apply regardless of tool.
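
To illustrate the point about equal centers and different spreads, here is a minimal sketch with two made-up data sets. The values and the names `tight` and `wide` are hypothetical, chosen only so that the means match.

```python
from statistics import mean, stdev

# Two hypothetical data sets with the same mean but very different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("wide", wide)):
    data_range = max(data) - min(data)
    print(f"{name}: mean = {mean(data)}, range = {data_range}, s = {stdev(data):.2f}")
```

Both means equal 50, yet the second data set has a much larger range and standard deviation, so the mean alone would hide the difference.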