Stats 171-05
Data and variables
- A dataset has rows called cases (observations) and columns for different types of variables.
- In many datasets, one column is an ID or label (e.g., a case number). The remaining columns are the variables you study.
- A number such as 100 can be a sample size (n) or a case label, depending on context; the focus here is on how to read cases (rows) and variables (columns).
- Types of variables:
- Categorical (qualitative) variables: group data into categories.
- Quantitative (numerical) variables: numerical measurements.
- In practice, we distinguish between quantitative and categorical variables, rather than using "numerical" for everything.
- Key terms from the transcript:
- Case = label/observation (one row)
- Variables = columns within the dataset; two broad types: quantitative (numerical) and categorical.
- Two general kinds of summary apply to any type of variable: graphical and numerical.
General summaries for datasets
- Graphical summaries (visual):
- Categorical variables: bar graph (bar chart).
- Quantitative variables (e.g., GPA): histogram, dot plot, and box plot.
- Numerical summaries (single-number summaries when appropriate):
- For a categorical variable:
- Frequency (f), proportion (p = f/n), and percentage (100p).
- For a quantitative variable (e.g., GPA):
- Range
- Quartiles (Q1, Q2, Q3) and related concepts (min, max, median)
- Five-number summary (min, Q1, median, Q3, max)
- IQR (interquartile range) = Q3 − Q1
- Percentiles, which generalize quartiles: a percentile can be taken at any percentage, while the quartiles are specifically the 25th, 50th, and 75th percentiles.
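As a minimal sketch of the categorical summaries above (frequency f, proportion p = f/n, percentage 100p), the following uses only Python's standard library. The function name `categorical_summary` and the class-standing data are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
from collections import Counter

def categorical_summary(values):
    """Return {category: (frequency, proportion, percentage)} for a list of labels."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    counts = Counter(values)  # frequency f of each category
    return {
        category: (f, f / n, 100 * f / n)  # (f, p = f/n, 100p)
        for category, f in counts.items()
    }

# Hypothetical data: class standing for 10 students (illustrative only).
standing = ["FR", "SO", "SO", "JR", "FR", "SR", "SO", "JR", "SO", "FR"]
summary = categorical_summary(standing)

for category, (f, p, pct) in sorted(summary.items()):
    print(f"{category}: f = {f}, p = {p:.2f}, percentage = {pct:.0f}%")
```

The frequencies add up to n and the proportions add up to 1, which is a quick check on any frequency table.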
Numerical summaries by variable type (detailed)
- Categorical variable summaries:
- Frequency: how many observations in each category.
- Proportion: fraction of observations in each category, p = f / n.
- Percentage: 100 × p.
- Quantitative variable summaries:
- Central tendency measures (one-number summaries per column): mean (arithmetic average), median (Q2).
- Spread/variability measures: range, IQR, variance, standard deviation.
- Quartiles: Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile).
- Five-number summary: min, Q1, Q2, Q3, max.
- Percentiles: the p-th percentile is the value at or below which p% of the data fall.
- Box plots (as a graphical summary) illustrate min, Q1, median, Q3, max and potential outliers.
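The one-number summaries for a quantitative variable listed above can be sketched with Python's built-in `statistics` module. The GPA values are hypothetical (illustrative only, not from the lecture); `statistics.variance` and `statistics.stdev` are the sample versions, which divide by n − 1.

```python
import statistics

# Hypothetical GPA values (illustrative only, not from the lecture).
gpa = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]

# Measures of center
mean = statistics.mean(gpa)
median = statistics.median(gpa)

# Measures of spread
data_range = max(gpa) - min(gpa)
variance = statistics.variance(gpa)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(gpa)      # square root of the sample variance

print(f"mean = {mean}, median = {median}")
print(f"range = {data_range}, variance = {variance:.2f}, sd = {std_dev:.3f}")
```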
Five-number summary and quartiles
- Five-number summary components:
- Min, Q1, Q2 (median), Q3, Max.
- Interpretations:
- Q1 is the 25th percentile: 25% of data are at or below Q1.
- Q2 is the 50th percentile: the median, with 50% of data at or below it.
- Q3 is the 75th percentile: 75% of data are at or below Q3.
- The interval [Q1, Q3] contains the middle 50% of the data; the width of this interval is the interquartile range (IQR).
- The IQR is a measure of spread for the middle 50% of observations: IQR = Q3 − Q1.
- Example interpretations (contextual):
- If cholesterol data have Q1 = 236 and Q3 = 288, then 25% of patients have cholesterol ≤ 236 and 75% have cholesterol ≤ 288; the middle 50% lie between 236 and 288.
- The median (Q2) represents the 50th percentile; for instance, if the median is 280, then 50% of observations are at or below 280.
- Important note: quartiles are just particular percentiles (the 25th, 50th, and 75th); a percentile can be stated for any percentage.
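A minimal sketch of the five-number summary and IQR, assuming Python's `statistics.quantiles` with `method="inclusive"`. Textbooks and software packages use slightly different quartile conventions, so other tools may report slightly different Q1 and Q3 for the same data. The function name `five_number_summary` and the cholesterol values are hypothetical; the values were picked so that Q1 and Q3 match the 236 and 288 used in the interpretation above.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

# Hypothetical cholesterol readings (illustrative only).
cholesterol = [200, 220, 236, 250, 262, 270, 288, 300, 310]

low, q1, med, q3, high = five_number_summary(cholesterol)
iqr = q3 - q1  # width of the middle 50% of the data

print(f"min = {low}, Q1 = {q1}, median = {med}, Q3 = {q3}, max = {high}")
print(f"IQR = {iqr}")
```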
Measures of spread (variability)
Range:
- Definition: max − min
- Describes the overall spread between the two extreme values.
Interquartile Range (IQR):
- Definition: IQR = Q3 − Q1
- Measures the spread of the middle 50% of the data.
Why range and IQR are not enough:
- The range uses only the two extreme values and the IQR only the middle 50%; neither reflects how all of the values are distributed.
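To illustrate the point, here is a small sketch with two hypothetical datasets (not from the lecture) that share the same mean, range, and IQR but differ in the standard deviation covered in the next part of these notes. Quartiles are computed with `statistics.quantiles` and `method="inclusive"`, which is an assumption; other conventions may give different quartiles.

```python
import statistics

def range_and_iqr(data):
    """Return (range, IQR), with quartiles from the 'inclusive' method."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return max(data) - min(data), q3 - q1

# Two hypothetical datasets, both with mean 5.
a = [0, 2, 4, 5, 5, 5, 6, 8, 10]
b = [0, 4, 4, 4, 5, 6, 6, 6, 10]

range_a, iqr_a = range_and_iqr(a)
range_b, iqr_b = range_and_iqr(b)
sd_a = statistics.stdev(a)
sd_b = statistics.stdev(b)

print(f"a: range = {range_a}, IQR = {iqr_a}, sd = {sd_a:.3f}")
print(f"b: range = {range_b}, IQR = {iqr_b}, sd = {sd_b:.3f}")
```

Because the standard deviation uses every observation, it can tell these two datasets apart even though the range and IQR cannot.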
Variance and standard deviation (measure spread around the mean):
Variance (sample):
s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
Standard deviation:
s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
Notes:
The deviations are d_i = x_i − \bar{x} (differences from the mean).
The sum of deviations equals zero: ∑(x_i − \bar{x}) = 0.
Squaring the deviations avoids cancellation and yields a positive measure of spread.
Dividing by n−1 (instead of n) provides an unbiased estimator of the population variance (degrees of freedom concept).
Standard deviation has the same units as the data, unlike variance which is in squared units.
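The reason the deviations always sum to zero follows directly from the definition of the mean, \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i. A short derivation, written out as a sketch:

```latex
\sum_{i=1}^n (x_i - \bar{x})
  = \sum_{i=1}^n x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

This is why the raw deviations cannot be averaged to measure spread, and why they are squared first.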
Intuition about standard deviation:
- It is roughly the typical distance of the data points from the mean (strictly, the square root of the average squared deviation, with n − 1 in the denominator).
- Points far from the mean contribute more to the variance because of the squaring.
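A small sketch of how much one far-away point can dominate the sum of squared deviations. The function name `squared_deviation_shares` and the data are hypothetical (not from the lecture), and the function assumes the data contain at least two distinct values so the total is not zero.

```python
def squared_deviation_shares(data):
    """Return each observation's share of the total sum of squared deviations."""
    mean = sum(data) / len(data)
    squared = [(x - mean) ** 2 for x in data]
    total = sum(squared)  # assumed nonzero: data are not all identical
    return [sq / total for sq in squared]

# Hypothetical data: four equal values and one value far from the rest.
data = [10, 10, 10, 10, 30]
shares = squared_deviation_shares(data)

for x, share in zip(data, shares):
    print(f"x = {x}: {share:.0%} of the sum of squared deviations")
```

Here the mean is 14, so the single value 30 has a deviation of 16 while the others have a deviation of −4; after squaring, that one point accounts for 256 of the total 320.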
Worked example (data and calculations)
- Data (example set of 10 observations):
- x = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]
- Sample size: n = 10
- Compute the mean:
- \bar{x} = \frac{65+65+70+75+78+80+83+87+91+94}{10} = \frac{788}{10} = 78.8
- Compute deviations from the mean: d_i = x_i − \bar{x}
- 65: d = -13.8
- 65: d = -13.8
- 70: d = -8.8
- 75: d = -3.8
- 78: d = -0.8
- 80: d = 1.2
- 83: d = 4.2
- 87: d = 8.2
- 91: d = 12.2
- 94: d = 15.2
- Sum of deviations should be zero (check):
- (-13.8) + (-13.8) + (-8.8) + (-3.8) + (-0.8) + 1.2 + 4.2 + 8.2 + 12.2 + 15.2 = 0
- Squares of deviations:
- (-13.8)^2 = 190.44
- (-13.8)^2 = 190.44
- (-8.8)^2 = 77.44
- (-3.8)^2 = 14.44
- (-0.8)^2 = 0.64
- (1.2)^2 = 1.44
- (4.2)^2 = 17.64
- (8.2)^2 = 67.24
- (12.2)^2 = 148.84
- (15.2)^2 = 231.04
- Sum of squared deviations:
- \sum (x_i - \bar{x})^2 = 939.60
- Compute the sample variance and standard deviation:
- s^2 = \frac{939.60}{n-1} = \frac{939.60}{9} = 104.40
- s = \sqrt{104.40} \approx 10.22
- Interpretation from this example:
- The mean is 78.8; the standard deviation is about 10.22, indicating that typical observations lie about 10 units away from the mean.
- The sum of squared deviations amplifies the impact of values far from the mean (outliers have large effects on s^2).
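The hand calculation above can be checked with a short script that follows the same steps and then compares the result against Python's built-in `statistics.stdev`. The variable names are illustrative; the data are the ten observations from the worked example.

```python
import math
import statistics

x = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]
n = len(x)

# Step 1: mean
x_bar = sum(x) / n

# Step 2: deviations from the mean (these should sum to zero,
# up to floating-point rounding)
deviations = [xi - x_bar for xi in x]

# Step 3: sum of squared deviations
sum_sq = sum(d ** 2 for d in deviations)

# Step 4: sample variance (divide by n - 1) and standard deviation
s2 = sum_sq / (n - 1)
s = math.sqrt(s2)

print(f"mean = {x_bar:.1f}")
print(f"sum of deviations = {sum(deviations):.10f}")
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")

# Cross-check against the library's sample standard deviation
print(f"statistics.stdev(x) = {statistics.stdev(x):.2f}")
```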
Practical notes and study tips
- Before calculating, write the big picture or “story” of the data, then add details to connect concepts (as emphasized in the lecture).
- When interpreting quartiles and percentiles, always tie back to the context (e.g., what portion of the data lies at or below a threshold).
- Distinguish between measures of center (mean, median) and measures of spread (range, IQR, variance, standard deviation).
- When distributions have the same mean, compare their spread and shape: the same center can come with different variability or skewness.
- In practice, software like SPSS is used to calculate variance and standard deviation; the conceptual steps above apply regardless of tool.