Session Content: Descriptive Statistics and Survey Data Analysis (NHANES example)
- Context: Descriptive statistics from surveys are often the primary interest for government agencies (e.g., unemployment rate, population size, median income). Even when the goal is inference about associations, understanding descriptive statistics and data structure is foundational.
- Focus: Descriptive statistics, data manipulation, two-way associations, renaming variables, and working with complex survey samples (e.g., NHANES).
- Software context: Demonstration uses Insight with a dataset resembling NHANES. Insight runs on R under the hood; you can view some of the R code and you can reproduce analyses by coding in R locally.
- Example data: NHANES (National Health and Nutrition Examination Survey, U.S.). Multistage, stratified sampling with clustering. Each 4-year cycle samples ~28,000 people. Data include interview, clinical exam, and blood samples; hundreds of variables.
Key Concepts in NHANES-like Survey Design
- Stratification and clustering:
- Strata are formed from combinations of variables (e.g., regions, urban/rural, etc.). In NHANES, regions and county groupings are common strata.
- PSUs (primary sampling units) are the first-stage clusters (e.g., counties). Clustering can span multiple stages (counties → household segments → households → individuals).
- The top-level strata may be composed from several variables; the combination can yield more strata than the number of regions alone.
- Clustering: observations within the same PSU are correlated; this affects standard errors and inference.
- Weights:
- Two datasets within a single file may have different weights:
- int: health interview weight
- mech: medical exam weight (slightly smaller group)
- Weights are used to estimate population totals/means rather than simple unweighted counts.
- Top-coding:
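- Age and income are often top-coded (e.g., ages 80+ recorded as 80). This creates an artificial spike at the cap and pulls means downward in both unweighted and weighted summaries; weighting does not undo top-coding, although medians and other quantiles below the cap are unaffected.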
- Design effect (DEFF):
- A measure of the efficiency (or inefficiency) of a complex survey design relative to simple random sampling.
- A high DEFF indicates you need a larger sample to achieve the same precision as a simple random sample.
- Example: a DEFF of 9 implies you’d need about 9 times the sample size of a SRS to achieve the same precision for estimating a mean.
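- The DEFF is driven by cluster size and within-cluster correlation (for equal-sized clusters, roughly DEFF ≈ 1 + (m − 1)ρ, where m is the cluster size and ρ the intra-cluster correlation), and also by unequal weights. For a fixed sample size, fewer clusters means larger clusters and a higher DEFF.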
- Population vs sample statistics:
- Unweighted sample counts/means can differ substantially from population estimates when the sample is not representative.
- Weighted estimates align more closely with population values, but standard errors must reflect the complex design.
- Data structure notes:
- The public-use NHANES data may not reveal all clustering levels (only top-level strata and PSUs are often shown).
- Renamed variables help readability (readable names instead of impenetrable codes like SDMVPSU).
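The weighting idea above can be sketched with a few lines of plain Python (used here only for illustration; the course software runs R). The ages and weights are hypothetical toy values, not NHANES data, and the function name `weighted_mean` is an assumption of this sketch. It shows how oversampling one group pulls the unweighted mean away from the weighted population estimate, and how the sum of weights estimates the population size.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical toy sample: young people are oversampled, so each young
# respondent represents fewer people (smaller weight) than each older one.
ages = [10, 12, 15, 18, 45, 70]
weights = [1000, 1000, 1000, 1000, 4000, 4000]

unweighted = sum(ages) / len(ages)
weighted = weighted_mean(ages, weights)
population_size = sum(weights)  # estimated number of people represented

print(f"unweighted mean age: {unweighted:.1f}")
print(f"weighted mean age:   {weighted:.1f}")
print(f"estimated population size: {population_size:,}")
```

This gives point estimates only. Standard errors additionally need the strata and PSU information, which this sketch ignores.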
Initial Data Exploration (Age and Coding)
- Import and inspect dataset.
- Unweighted age distribution:
- Observed density: higher density at younger ages, flattening out, and an unexpected spike at age 80 due to top-coding.
- Top-coding example: ages 80+ recoded to 80 in the data file.
- Summary statistics (unweighted):
- Median age ≈
- Mean age ≈
- Design declaration (Insight procedure):
- Strata variable:
- PSU variable:
- Weight choice: two options because there are two related datasets (health interview and clinical exam).
- Estimated population size (with proper formatting): ≈ (commas placed appropriately).
- Potential design-specification issues:
- Nested PSUs: National Center for Health Statistics sometimes uses the same PSU name across different strata (e.g., PSU 1 may appear in multiple strata).
- Remedy: specify nesting (nest = TRUE) or rename PSUs to reflect unique combinations across strata.
- Output interpretation (design-adjusted):
- After applying weights and design, age estimates change: weighted population mean/median around ≈ .
- The design is reported as a stratified one-level cluster sampling design with replacement (only first-stage information is used). In practice, there are relatively few, large clusters, which reduces efficiency.
- Sample size: ≈ 9,756 observations after cleaning; standard errors and population estimates are provided in the output.
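The nested-PSU issue described above can be illustrated with a small Python sketch. The records are hypothetical, and the approach shown (identifying a cluster by its stratum and PSU label together) is the "rename PSUs to reflect unique combinations" remedy; it is meant as a conceptual picture of nesting, not as a description of how any particular software implements it.

```python
from collections import defaultdict

# Hypothetical records: (stratum, psu_label, weight). PSU labels restart
# at 1 inside every stratum, as described above.
records = [
    (1, 1, 1200.0), (1, 1, 900.0), (1, 2, 1500.0),
    (2, 1, 800.0), (2, 2, 1100.0), (2, 2, 1300.0),
]

# Naive reading: the PSU label alone identifies a cluster.
naive_psus = {psu for _, psu, _ in records}

# Nested reading: a cluster is identified by the (stratum, PSU) pair.
nested_psus = {(stratum, psu) for stratum, psu, _ in records}

# Which strata does each bare PSU label appear in?
strata_per_label = defaultdict(set)
for stratum, psu, _ in records:
    strata_per_label[psu].add(stratum)
reused = sorted(label for label, strata in strata_per_label.items()
                if len(strata) > 1)

print("clusters if labels are taken at face value:", len(naive_psus))
print("clusters when PSUs are nested in strata:  ", len(nested_psus))
print("labels reused across strata:", reused)
```

Taking labels at face value would merge distinct clusters from different strata, which is why the design declaration needs to be told that PSUs are nested.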
Describing the Population vs Sample, and Design Effects
- Unweighted vs weighted summaries:
- Unweighted age distribution showed a younger-skewed sample; weighting corrected for over- and under-sampling across age groups.
- Population estimates for age (weighted):
- Mean age ≈
- Median age ≈
- Interquartile range (IQR) ≈
- Design effects (DEFF):
- DEFF for age mean ≈ 9.0 (high efficiency loss due to large clusters, e.g., few big PSUs).
- Interpretation: You need about 9 times the sample size under the complex design to achieve the same precision as a simple random sample for the mean.
- Trade-offs in NHANES-like designs:
- Large clusters reduce travel and data collection costs but inflate variance (inefficiency).
- The design is chosen to optimize cost and feasibility across the country, with constraints like ~16 clusters over two years.
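The arithmetic behind the design effect can be sketched as follows. The sample size and DEFF are the figures quoted in these notes; the cluster size and intra-cluster correlation in the last line are hypothetical values chosen only to show how a DEFF of about 9 could arise. The Kish formula used is the standard approximation for equal-sized clusters and accounts for clustering only, not for stratification or unequal weights.

```python
def effective_sample_size(n, deff):
    """Size of a simple random sample giving the same precision."""
    if deff <= 0:
        raise ValueError("deff must be positive")
    return n / deff


def kish_deff(cluster_size, icc):
    """Kish approximation: DEFF = 1 + (m - 1) * rho for equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc


n = 9756      # sample size quoted in these notes
deff = 9.0    # design effect quoted for mean age

n_eff = effective_sample_size(n, deff)
se_inflation = deff ** 0.5  # standard errors grow by sqrt(DEFF)

print(f"effective sample size: {n_eff:.0f}")
print(f"standard error inflation factor: {se_inflation:.1f}")

# Hypothetical illustration: clusters of about 300 people with a small
# intra-cluster correlation already give a large design effect.
print(f"Kish DEFF for m=300, rho=0.027: {kish_deff(300, 0.027):.2f}")
```

Read this way, a DEFF of 9 means the complex sample carries roughly the information of a simple random sample one ninth its size for that particular estimate.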
Handling Categorical Variables and Inference
- Converting numeric codes to categorical:
- Problem: Gender coded as 1/2 is numeric; treating as numeric leads to meaningless means (e.g., 1.5).
- Solution: Convert to categorical, e.g., create a new variable gender_cat, then rename levels to meaningful labels (Male, Female).
- After converting to categorical:
- Frequency counts and estimated population proportions are available.
- Inference tab provides confidence intervals for those proportions.
- Descriptive statistics by category (example: alcohol per day by gender):
- Means, medians, and quantiles are available by gender.
- Box plots and histograms can be generated for each group.
- Two-sample inference by category:
- t-test for comparing means between two groups (e.g., men vs women for drinks per day).
- ANOVA for more than two groups (equivalently, an F-test; for two groups, F = t^2).
- Reported example: t-statistic ≈ with a very small p-value (design-based test) indicating a strong difference in mean drinks per day between men and women.
- Practical notes:
- P-values in survey contexts reflect the design; standard simple random-sample intuition may not apply directly.
- If you want to test categorical differences by age bands, chi-squared tests or global tests of independence are used; both provide population-level proportions and confidence intervals.
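The recoding step described above can be sketched in Python. The 1 = Male, 2 = Female mapping follows the labels mentioned in these notes; the records and weights are hypothetical. The sketch shows why the mean of the numeric codes is meaningless and how weighted proportions are formed once the codes are treated as categories.

```python
GENDER_LABELS = {1: "Male", 2: "Female"}  # assumed coding, as in the notes

# Hypothetical toy records: (gender_code, sampling_weight)
records = [(1, 1500.0), (2, 2500.0), (2, 1000.0), (1, 2000.0), (2, 3000.0)]

# Treating the code as a number gives a meaningless "mean gender".
mean_code = sum(code for code, _ in records) / len(records)

# Recode to labels, then accumulate weights per category.
weight_by_label = {}
for code, weight in records:
    label = GENDER_LABELS[code]
    weight_by_label[label] = weight_by_label.get(label, 0.0) + weight

total_weight = sum(weight_by_label.values())
proportions = {label: w / total_weight for label, w in weight_by_label.items()}

print(f"mean of numeric codes (not meaningful): {mean_code:.1f}")
for label, p in proportions.items():
    print(f"{label}: estimated population proportion {p:.2f}")
```

Confidence intervals for these proportions would need design-based standard errors, which are outside the scope of this sketch.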
Creating Age Bands (Form Class Intervals)
- Turn off the survey design temporarily to form age bands using numeric to categorical transformation.
- Steps:
- Manipulate variables → Numeric variables → Form class intervals.
- Choose variable: age.
- Number of intervals: e.g., 7.
- Interval type: equal width vs equal count; commonly use specified intervals with closed left, open right for clarity.
- Example: closed left, open right intervals (e.g., [0, 20), [20, 40), etc.), so an age of exactly 20 falls in the second band.
- After creating age bands:
- Re-enable the survey design to view population estimates by age band.
- Readable labels are preferred (e.g., 0–19, 20–39 for whole-year ages) rather than mathematical interval notation with brackets.
- Cross-tabulations with age bands and gender:
- Age-by-gender table: population counts and row proportions.
- Population-level percentages require combining variables to form a joint cell variable (e.g., seven age bands × two genders = 14 cells).
- The resulting long table shows the distribution of population across age-by-gender cells with corresponding standard errors.
- Renaming variables for reporting:
- Rename variables and levels for presentation (e.g., gender_cat to a readable name, and age bands to readable labels).
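Forming class intervals and joint cells, as described above, can be sketched in Python. The break points, labels, and records are hypothetical (four bands rather than seven, to keep the example short), and the top band is left open so that top-coded ages such as 80 still fall into a band. Intervals are closed on the left and open on the right.

```python
from bisect import bisect_right

# Hypothetical break points: [0, 20), [20, 40), [40, 60), and 60 or older.
BREAKS = [0, 20, 40, 60]
LABELS = ["0-19", "20-39", "40-59", "60+"]


def age_band(age):
    """Return the readable label of the closed-left, open-right band."""
    if age < BREAKS[0]:
        raise ValueError("age below the first break point")
    return LABELS[bisect_right(BREAKS, age) - 1]


records = [  # (age, gender, weight)
    (5, "Male", 1000.0), (19, "Female", 1000.0), (20, "Female", 2000.0),
    (39, "Male", 1500.0), (40, "Male", 1500.0), (80, "Female", 3000.0),
]

# Joint cell = (age band, gender); accumulate weights per cell.
cell_weight = {}
for age, gender, weight in records:
    cell = (age_band(age), gender)
    cell_weight[cell] = cell_weight.get(cell, 0.0) + weight

total = sum(cell_weight.values())
cell_percent = {cell: 100.0 * w / total for cell, w in cell_weight.items()}

for (band, gender), pct in sorted(cell_percent.items()):
    print(f"{band:>6} {gender:<6} {pct:5.1f}%")
```

Because the percentages are taken over all cells at once, they sum to 100 across the whole table rather than within each row.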
Two Continuous Variables and Associations
- Example: Systolic BP (BP_syst) vs age (Age):
- Scatter plots with weights show the population pattern.
- Correlation ρ ≈ indicates a moderate positive association (as age increases, systolic BP tends to rise).
- Linear trend (simple regression):
- Fit linear model: BP_syst ≈ a + b × Age.
- Reported example: intercept ≈ 100, slope ≈ 0.44 per year.
- Interpretation: Each additional year of age is associated with an average increase of 0.44 mmHg in systolic BP (under the linear model).
- Confidence intervals around parameters are provided; the trend is often strong, especially in aggregated population data.
- Quadratic trend (nonlinear):
- Include term Age^2: BP_syst ≈ a + b × Age + c × Age^2.
- If c > 0, the curve bends upward; if c < 0, downward.
- In the example, the quadratic coefficient is small and the p-value is not strongly significant, suggesting weak evidence of curvature.
- Visual: quadratic fit may better describe the middle ages but may overfit or misrepresent tails if curvature is minimal.
- Locally weighted smoothing (nonparametric trend):
- Locally linear or locally polynomial smoothing (e.g., LOESS/LOWESS style) is used to estimate the mean curve without imposing a rigid parametric form.
- Smoothing parameter controls the width of the local window: wider windows produce smoother curves; too narrow windows produce jagged curves and may reflect noise.
- Quantile smoothing can also be used to plot medians and quartiles as a function of x (e.g., median systolic BP at each level of diastolic BP).
- Weighting in plots:
- For survey data, visualizations should reflect sampling weights so that plots depict population patterns, not just the sample.
- Bubble plots (point size proportional to weight, for smaller samples) and hexagonal binning (for large samples) show density and weight while avoiding overplotting.
- Data quality notes:
- Some diastolic BP measurements near zero may reflect measurement artifacts (e.g., auscultatory method limitations, cuff fit, vascular stiffness). Such zeros may need filtering or careful interpretation.
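The linear trend discussed above can be sketched as a weighted least-squares fit in closed form. The data here are hypothetical and constructed to lie exactly on the line with intercept 100 and slope 0.44 quoted in these notes, so the fit simply recovers those values. The function name is an assumption of this sketch, and it produces point estimates only; design-based confidence intervals need the strata and PSU information.

```python
def weighted_linear_fit(x, y, w):
    """Return (intercept, slope) minimising sum(w * (y - a - b*x)**2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for xi, wi in zip(x, w)) / sw
    y_bar = sum(wi * yi for yi, wi in zip(y, w)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for xi, wi in zip(x, w))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for xi, yi, wi in zip(x, y, w))
    if sxx == 0:
        raise ValueError("x has no variation")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data constructed to lie exactly on BP = 100 + 0.44 * Age
ages = [20, 35, 50, 65, 80]
bp = [100 + 0.44 * a for a in ages]
weights = [3000.0, 2500.0, 2000.0, 1500.0, 1000.0]

a, b = weighted_linear_fit(ages, bp, weights)
print(f"intercept = {a:.2f}, slope = {b:.2f} mmHg per year")
print(f"predicted systolic BP at age 60: {a + b * 60:.1f}")
```

With real data the points scatter around the line, and the weights then determine how much each respondent influences the fitted slope.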
Graphical Techniques for Survey Data
- Bubble plots: size proportional to sampling weight; useful for small datasets.
- Hexagonal binning: default for large samples; reduces overplotting and reveals structure that is not visible in scatter plots.
- Smoothers:
- Linear trend: simple straight line fit.
- Quadratic trend: a second-degree polynomial fit; can capture curvature but may misrepresent if the true relationship is not well approximated by a parabola.
- Cubic trend: more flexible shapes; can capture inflection points but risk overfitting with limited data.
- Quantile smoothing: curves for different quantiles (e.g., 25th, 50th, 75th percentiles) to summarize dispersion around the central tendency.
- Example: NHANES-derived scatter plots of systolic vs diastolic BP with smoothing reveal that older individuals tend to have higher values, and the dispersion increases with BP values.
- Practical note: For large surveillance datasets, plotting all points is infeasible; smoothing and hex-bin plots provide useful summaries without revealing every data point.
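A crude stand-in for quantile smoothing is to compute weighted quartiles of y within bins of x. The Python sketch below does this for hypothetical blood-pressure records; it uses fixed bins rather than a moving window, so it is a simplification of the smoothers described above, and the weighted-quantile definition used (smallest value whose cumulative weight share reaches q) is one of several conventions.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


def band(diastolic, width=20):
    """Lower edge of the diastolic bin, e.g. 60 for values in [60, 80)."""
    return (diastolic // width) * width


# Hypothetical records: (diastolic, systolic, weight)
records = [
    (62, 108, 1000.0), (65, 112, 1000.0), (68, 118, 3000.0),
    (82, 124, 2000.0), (85, 131, 1000.0), (88, 140, 1000.0),
]

by_band = {}
for dia, sys_bp, w in records:
    values, weights = by_band.setdefault(band(dia), ([], []))
    values.append(sys_bp)
    weights.append(w)

quartiles = {
    lower: tuple(weighted_quantile(v, w, q) for q in (0.25, 0.5, 0.75))
    for lower, (v, w) in by_band.items()
}
for lower in sorted(quartiles):
    print(lower, quartiles[lower])
```

In the first bin the heavily weighted respondent moves the weighted median above the unweighted one, which is the kind of shift that weighted plots are meant to show.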
Automated Code Narratives and Reproducibility
- Insight can display an R code history window showing code snippets used to perform analyses.
- Limitations:
- The code history may not capture every step (e.g., it may show design declarations but not all manipulations).
- For full reproducibility, writing and running your own R scripts is recommended.
- Benefits:
- Viewing the underlying code helps in learning and enables running analyses outside Insight (e.g., in R on your own machine).
- Reproducibility: you can run code locally, adjust parameters, and verify results without relying on the web interface.
Summary of Core Techniques Covered
- Descriptive statistics with survey design: mean, median, quantiles, totals, proportions, and counts, all adjusted for weights and design.
- Data manipulation:
- Converting numeric codes to categorical variables.
- Renaming levels for readability.
- Creating and labeling age bands (class intervals).
- Combining variables to form joint cells for population percentages.
- Two-way associations:
- Numeric outcome by categorical predictor (e.g., alcohol per day by gender) with plots and group-specific summaries.
- Categorical outcomes by categorical predictors (e.g., cross-tabulations; chi-squared tests; global tests).
- Associations between two continuous variables:
- Correlation, scatter plots with weighted smoothers, and linear/quadratic/cubic trend fits.
- Graphical literacy for survey data:
- Bubble plots, hexagonal binning, smoothing, quantile smoothing, and interpretation with weights.
- Interpretation cautions:
- Design effects affect required sample sizes and standard errors.
- Weights are essential for population-relevant inferences: ignoring them biases estimates, and ignoring strata and clusters gives incorrect standard errors.
- Large clusters reduce efficiency; there is a cost-benefit trade-off between representativeness, cost, and precision.
- Practical workflow recommendations:
- Start with descriptive statistics to understand data structure and weights.
- Declare survey design early (strata, PSUs, weights) to obtain correct standard errors and population estimates.
- If encountering PSUs sharing IDs across strata, consider nest = TRUE or rename PSUs to reflect unique combinations.
- When presenting results, ensure variable names are readable and labels are informative.
- Use code history as a learning tool, but rely on your own R scripts for full reproducibility and advanced analyses.
What’s Next (Course Context)
- Next steps in the course typically involve:
- Inference and hypothesis testing for survey data (t-tests, ANOVA, chi-squared tests) with proper design-based adjustments.
- Graphical inference and more advanced plotting for survey data.
- Two-way and higher-order associations (e.g., three-way tables) and nuanced interpretation.
- Assignment focus: Many practical techniques demonstrated (descriptives, transforming variables, two-way associations, and some basic inferential tests) are relevant for the upcoming assignment.
Quick Reference Formulas (LaTeX)
Weighted mean (population estimate):
\[ \bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}, \text{ where } \sum_i w_i = \text{sum of weights} \]
Population mean with alternative notation:
\[ \bar{X}_w = \frac{\sum_i w_i x_i}{\sum_i w_i} \]
Design effect (DEFF):
\[ \mathrm{DEFF} = \frac{\mathrm{Var}_{\text{design}}(\bar{X}_w)}{\mathrm{Var}_{\text{SRS}}(\bar{X})} \]
Relation between t and F for two-sample tests:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE}, \quad F = t^2, \text{ with } F \text{ having } k-1,\ n-k \text{ degrees of freedom for } k \text{ groups.} \]
Linear trend (example form):
\[ \text{BP\_syst} \approx a + b \times \text{Age} \]
Quadratic trend:
\[ \text{BP\_syst} \approx a + b \times \text{Age} + c \times \text{Age}^2 \]
Smoothing concepts (local):
- Locally linear smoothing estimates a line in a moving window around each x; the width of the window controls smoothness.
- Quantile smoothing estimates conditional quantiles (e.g., median, quartiles) as functions of x.
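The locally linear idea above can be sketched in Python: at each target point, fit a weighted straight line using only nearby observations and read off its value. The tricube kernel and the way sampling weights multiply the kernel weights are assumptions of this sketch rather than a description of any particular software's smoother.

```python
def local_linear(x, y, x0, bandwidth, w=None):
    """Fit a kernel-weighted straight line near x0 and return its value at x0.

    Kernel: tricube on |x - x0| / bandwidth. Optional sampling weights w
    multiply the kernel weights.
    """
    if w is None:
        w = [1.0] * len(x)
    k = []
    for xi, wi in zip(x, w):
        u = abs(xi - x0) / bandwidth
        k.append(wi * (1 - u ** 3) ** 3 if u < 1 else 0.0)
    sk = sum(k)
    if sk == 0:
        raise ValueError("no points within the bandwidth of x0")
    x_bar = sum(ki * xi for ki, xi in zip(k, x)) / sk
    y_bar = sum(ki * yi for ki, yi in zip(k, y)) / sk
    sxx = sum(ki * (xi - x_bar) ** 2 for ki, xi in zip(k, x))
    if sxx == 0:
        return y_bar  # one distinct x in the window: use the local mean
    sxy = sum(ki * (xi - x_bar) * (yi - y_bar)
              for ki, xi, yi in zip(k, x, y))
    return y_bar + (sxy / sxx) * (x0 - x_bar)


xs = list(range(11))
line = [2 * v + 1 for v in xs]    # exactly linear data
curve = [v ** 2 for v in xs]      # curved data, true value 25 at x = 5

on_line = local_linear(xs, line, 5, bandwidth=3)
narrow = local_linear(xs, curve, 5, bandwidth=1.5)
wide = local_linear(xs, curve, 5, bandwidth=6)

print(f"linear data at x=5: {on_line:.2f}")
print(f"curved data, narrow window: {narrow:.2f}")
print(f"curved data, wide window:   {wide:.2f}")
```

On exactly linear data the smoother reproduces the line. On curved data the wider window averages over more of the curve and drifts further from the true value, which is the smoothness-versus-fidelity trade-off controlled by the window width.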
Notes for Exam Preparation
- Be able to explain why survey design (strata, PSUs, weights) matters for descriptive statistics and inference.
- Describe how to convert numeric codes into meaningful categorical variables and why this is necessary.
- Discuss why top-coding occurs and its impact on unweighted distributions and weight-based estimates.
- Explain design effects and their implications for sampling efficiency and required sample size.
- Demonstrate understanding of how to form age bands (class intervals) and interpret results by age groups.
- Compare and contrast different plotting approaches for survey data (scatter with weights, hexagonal binning, bubble plots, smoothing, quantile smoothing).
- Interpret t-tests/ANOVA and chi-squared tests in the context of survey data, including the importance of design-based standard errors.
- Be able to describe the trade-offs between linear, quadratic, and cubic trend fits for continuous outcomes and the circumstances under which each is appropriate.
- Recognize common data quality issues in survey data (e.g., zero diastolic BP due to measurement limitations) and how to address them in analysis.
- Understand how to use code history or write your own R scripts to ensure reproducibility and better control over analyses.
Questions? Break acknowledged; next topics: inference and plotting for survey data, and two-way associations (assignment-oriented).