Lecture Notes: Summarizing Data in Health Science (Vocabulary Flashcards)

Purpose of summarizing data in health science

The chapter focuses on how to summarize data and how those summaries help address health questions (e.g., patient recovery times, patterns of a spreading virus, treatment outcomes).
Goal: turn raw health data into clear, actionable insights, using numerical summaries and visualizations that reveal central tendency and spread, describe relationships, and support decision-making in health care.
Emphasizes integrating data summarization with health research questions: Is there an effective drug? What is the pattern of recovery? How do we detect biases or gaps in medical data?
Connections to real-world health work: epidemiology, clinical trials, pandemic response, vaccine allocation, and monitoring of treatment outcomes.

Case study from last week: Chronic Fatigue Syndrome (CFS) trial (1997)

Study aim: evaluate cognitive behavioral therapy (CBT) for CFS and compare with a control condition.
Recruitment and eligibility:
- About 142 patients recruited from a hospital/clinic.
- Only about 60 were eligible/ready for the trial; others did not meet criteria, had health issues, or refused participation.
Randomization and groups:
- Participants randomly assigned to CBT vs. relaxation (control) conditions.
- Approximately 30 participants were placed in the relaxation (control) arm; the remainder were allocated to CBT.
Follow-up and outcomes (6 months):
- Follow-up data show about 19–27 participants in the CBT/treatment arm remained engaged.
- Outcome rates: about 70% in the treatment group achieved the desired outcome vs about 19% in the control group.
Takeaway question: Why is random assignment valuable?
- Random assignment balances known and unknown patient characteristics across groups on average, so differences in outcomes are less likely to reflect how participants were selected.
- It supports more convincing, interpretable conclusions about treatment effectiveness.
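A minimal sketch of random assignment, assuming 60 eligible participants split evenly into two arms. The participant IDs, the `randomize` helper, and the seed are hypothetical; the notes do not describe the trial's actual allocation procedure.

```python
import random

def randomize(participant_ids, seed=None):
    """Shuffle participants and split them into two equal-sized arms."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"CBT": sorted(ids[:half]), "relaxation": sorted(ids[half:])}

# Hypothetical IDs 1..60 standing in for the roughly 60 eligible patients.
arms = randomize(range(1, 61), seed=1997)
print(len(arms["CBT"]), len(arms["relaxation"]))
```

Because the shuffle ignores every patient characteristic, no one (including the researcher) can steer healthier or sicker patients into a particular arm.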

Why summarize data in health science?

Raw health data (clinical trials, health records, global health databases) are chaotic; summarization makes patterns, trends, and meaningful summaries clearer.
Supports accountability and actionable decision-making (e.g., vaccine supply planning, resource allocation).
Provides a basis for visualizations that communicate findings to clinicians, policymakers, and the public.

Numerical data in health science: variables and data types

Numerical data fall into two main forms:
- Continuous data (e.g., blood pressure, recovery time, life expectancy)
- Discrete data (e.g., number of hospital visits, event counts)
In health data, numerical variables help reveal trends (e.g., rising obesity rates, recovery times).
Analogies to daily life: variables like steps counted by a wearable.
Historical context: even early health data (e.g., data from the 1918 influenza pandemic) used numerical summaries of death rates to identify vulnerable groups.
Plot types for numerical data: scatter plots show the relationship between two numerical variables; dot plots and histograms show the distribution of a single variable.

Visualizing numerical data: scatter plots and more

Scatter plots display the relationship between two numerical variables and help assess association, linearity, and potential trends.
Examples discussed in class:
- Life expectancy vs total fertility rate (from Gapminder data): initial association appears negative (as fertility rises, life expectancy tends to be lower historically), with changes over time as health systems improve.
- Across countries and regions (Africa, Europe, the Americas, Asia): scatter plots can illustrate regional patterns and inform policy (e.g., family planning as a policy lever to improve health outcomes).
- Practical health examples: HIV viral load over time, COVID-19 cases and vaccination status.
The scatter plot discussion emphasizes three questions: Are the variables associated or independent? Is the relationship linear or non-linear? How does the association evolve over time?
The class planned a practical SPSS session to generate plots and analyze data from health datasets.
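Outside SPSS, the strength and direction of a linear association can be sketched with Pearson's correlation coefficient. The values below are made up for illustration (they are not Gapminder data), and `pearson_r` is a helper name introduced here. Note that r only captures linear association, so a plot is still needed to spot non-linear patterns.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up country-level values: children per woman and life expectancy (years).
fertility = [1.5, 2.0, 2.8, 3.5, 4.6, 5.9, 6.7]
life_expectancy = [82, 79, 75, 70, 64, 58, 55]

r = pearson_r(fertility, life_expectancy)
print(round(r, 3))  # a value near -1 indicates a strong negative linear association
```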

Other numerical visualizations

Dot plots:
- Show distribution of a single numerical variable (e.g., GPA distribution from 2.5 to 4.0).
- The mean is often indicated with a marker (e.g., a red triangle) to show the center.
- Interpretation considerations: skewness can mislead if only the mean is viewed without the distribution shape.
Stacked dot plots:
- A stacked arrangement helps reveal density and clustering without overlap, giving a clearer sense of the distribution around the center.
Histogram:
- Bars show the density of values across bins (e.g., hours spent on activities, weekly performance).
- Choice of bin width affects how spikes and tails are represented; too wide or too narrow bins can obscure important patterns.
- Discussed how histogram shape and bin choices can mask or reveal spikes, e.g., in epidemiology (age-specific case counts) or activity measurements.
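The effect of bin width can be sketched with a small text histogram. The ages below are hypothetical case ages chosen to have two clusters, and `histogram` is a helper written for this illustration.

```python
def histogram(values, bin_width, start=0):
    """Count values per bin [left, left + bin_width), keyed by the bin's left edge."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)
        left = start + k * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical ages of cases, with clusters of children and older adults.
ages = [3, 4, 5, 5, 6, 7, 8, 31, 44, 62, 65, 66, 68, 70, 71, 73]

for width in (10, 50):
    print(f"bin width {width}:")
    for left, count in histogram(ages, width).items():
        print(f"  {left:>3}-{left + width - 1:<3} {'#' * count}")
```

With 10-year bins the two age clusters and the gap between them are visible; with 50-year bins the same data collapse into two bars of similar height and the pattern is hidden.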
Shape and modality of distributions:
- Unimodal: one peak
- Bimodal: two peaks
- Multimodal: many peaks
- Uniform: flat distribution
- Symmetric: left and right halves roughly mirror each other (e.g., bell-shaped); mean ≈ median
Skewness and outliers:
- Right-skewed: long right tail; mean tends to be greater than the median
- Left-skewed: long left tail; mean tends to be less than the median
- Symmetric distributions have mean ≈ median
- Outliers: isolated bars or gaps; can be data-entry errors or rare events; outliers impact the mean and standard deviation more than the median and IQR; need to scrutinize for data quality and potential insights.
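A small sketch of how one outlier moves each statistic. The recovery times are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method; other software (including SPSS) may use a slightly different quartile convention.

```python
import statistics

def summarize(values):
    """Return the mean, sample SD, median, and IQR of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": median,
        "iqr": q3 - q1,
    }

# Hypothetical recovery times in days; 60 is an artificial outlier.
recovery_days = [4, 5, 5, 6, 6, 7, 7, 8, 9]
base = summarize(recovery_days)
with_outlier = summarize(recovery_days + [60])

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {base[name]:.2f} -> {with_outlier[name]:.2f}")
```

The mean and standard deviation change substantially when the outlier is added, while the median and IQR shift only slightly.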
Practical examples discussed:
- Ebola outbreak: a spike in Liberia illustrating how outliers can indicate areas with unique transmission dynamics.
- COVID-19 and vaccination, and historical HIV data: distributions can reveal spikes and tails relevant to public health decisions.
Transformations to handle skewness:
- Log transformation (Y = log(X), defined for X > 0) can make highly right-skewed data more symmetric (e.g., viral loads, attendance counts, skewed health metrics).
- Pros: reduces skewness, improves modeling and interpretation for statistical analyses.
- Cons: interpretation after transformation can be harder; back-transformation and reporting original scale may be nontrivial.
- Example discussion: log-transforming basketball game attendance to illustrate normalization; viral loads often exhibit right-skewness that becomes more symmetric after log-transform.
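A rough illustration of the log transform on right-skewed data. The viral loads are invented values, and base-10 logs are an assumption made here (the notes only write log(X)). The gap between mean and median is used as a simple indicator of skew.

```python
import math
import statistics

# Hypothetical viral loads (copies/mL), strongly right-skewed.
viral_load = [200, 450, 800, 1_500, 3_000, 7_000, 20_000, 90_000, 600_000]
log_load = [math.log10(x) for x in viral_load]  # requires every x > 0

def mean_minus_median(values):
    """Positive values suggest a right skew; values near 0 suggest symmetry."""
    return statistics.mean(values) - statistics.median(values)

print("raw scale:", mean_minus_median(viral_load))
print("log scale:", mean_minus_median(log_load))

# Back-transforming the mean of the logs gives the geometric mean,
# which is not the same as the arithmetic mean on the original scale.
geo_mean = 10 ** statistics.mean(log_load)
print("geometric mean:", round(geo_mean))
```

The last lines show the interpretation issue mentioned above: a summary computed on the log scale does not back-transform to the ordinary mean, so reports should state which scale was used.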
Density and spatial patterns:
- Intensity maps/heat maps illustrate how density or change varies across geography (e.g., opioid crisis density by county).
- Colors convey density or rate of change, guiding resource allocation and policy decisions.

Describing distributions with summary statistics

Mean (average):
- Symbols: $\bar{x}$ is the sample mean; $\mu$ is the population mean.
- Sample mean formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
- In health trials, means are common but can be distorted by outliers; sometimes medians or transformed scales are preferred.
Variance and standard deviation:
- Population variance: if we have the population, the definition mirrors the sample version; for a sample: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Standard deviation: $s = \sqrt{s^2}$
- Variance measures the average squared deviation from the mean; standard deviation preserves the same units as the data, aiding interpretation.
- In health contexts, variance quantifies variability in responses (e.g., blood sugar responses, hormone levels) and helps assess consistency of treatment effects.
Median and robustness to outliers:
- Median: middle value after ordering the data; for odd n, the middle value; for even n, the average of the two middle values.
- Median is robust to outliers and extreme values (unlike the mean), making it a preferred center measure for skewed data.
Quartiles and IQR:
- Q1 = 25th percentile; Q3 = 75th percentile; IQR = Q3 − Q1.
- IQR describes the middle 50% of the data and is robust to outliers.
Box plots (five-number summary):
- Components: minimum, Q1, median, Q3, maximum; whiskers typically extend to 1.5 × IQR beyond the quartiles; observations beyond whiskers are treated as outliers.
- Visualizes center, spread, and potential outliers; good for comparing groups (side-by-side boxes).
Relationship between distribution shape and statistics:
- For skewed distributions, medians and IQRs are more informative than means and SDs.
- For symmetric distributions, means and SDs are informative.
Scenario: how replacing extreme values changes statistics:
- Replacing the largest value with a very large number tends to heavily affect the mean but not the median.
- Medians and IQRs are relatively resistant to extreme values; means and standard deviations are more sensitive to outliers.
Practical example: skewed survival times or income data favor using median and IQR for describing central tendency and spread.

Hypothesis testing, bias, and randomization in health research

Hypothesis testing framework:
- Null hypothesis (H0): there is no effect or no bias (independence).
- Alternative hypothesis (Ha): there is an effect or bias (dependence).
- The burden of proof lies with the alternative; evidence is weighed against the null using a test statistic and p-value.
- p-value threshold commonly set at $p \le 0.05$ to declare statistical significance.
Example: gender bias in promotions (classroom study):
- Sample: 48 supervisors; 24 male, 24 female; promotions observed.
- Outcome example: 35 promotions overall; the breakdown by gender showed a higher promotion rate among males (21/24 = 87.5%) than among females (14/24 = 58.3%), a difference of about 29.2 percentage points.
- Interpretation requires a statistical test to assess whether the observed gap could occur by chance under H0 (no bias).
Randomization and bias reduction:
- Random assignments (e.g., in clinical trials or experiments) help ensure comparability between groups and minimize biases from confounding factors.
- In health experiments, randomization supports fair comparisons of treatments and reduces systematic differences between groups.
Simulation and hypothesis testing in practice:
- Card-shuffling and random allocation simulations illustrate how outcomes could arise under the null hypothesis.
- The data inform whether to reject H0 or fail to reject, guiding conclusions about treatment effects or biases.
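The card-shuffling idea can be sketched as a simulation using the counts from the promotion example (35 promoted, 13 not promoted, 24 files per group). The function name, number of simulations, and seed are choices made for this sketch, not part of the class exercise.

```python
import random

def simulate_differences(n_sims=10_000, seed=0):
    """Shuffle promotion outcomes and deal them into two groups of 24, many times."""
    rng = random.Random(seed)
    # 35 "promoted" cards (1) and 13 "not promoted" cards (0).
    cards = [1] * 35 + [0] * 13
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(cards)
        male, female = cards[:24], cards[24:]
        diffs.append(sum(male) / 24 - sum(female) / 24)
    return diffs

observed = 21 / 24 - 14 / 24  # about 0.292
diffs = simulate_differences()

# One-sided p-value: share of shuffles with a gap at least as large as observed.
p_value = sum(d >= observed - 1e-12 for d in diffs) / len(diffs)
print(f"observed difference = {observed:.3f}, simulated p-value = {p_value:.3f}")
```

Each shuffle represents a world in which H0 is true and gender labels are unrelated to the decision. If only a small share of shuffles produce a gap as large as the observed one, the data are hard to explain by chance alone.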
Practical advice regarding hypothesis testing:
- Always consider whether the observed effect size is practically meaningful, not just statistically significant.
- Ensure study design minimizes confounding and bias; use randomization where possible.
- Be mindful of data quality and whether assumptions (e.g., normality, independence) hold for the chosen test.

Practical takeaways and connections to real-world health research

Data summarization is essential for transforming chaotic health data into actionable insights that can guide clinical and public health decisions.
Visualizations (scatter plots, dot plots, histograms, box plots, mosaic plots, bar charts, pie charts) communicate complex data patterns to diverse audiences.
Descriptive statistics (mean, median, IQR, variance, standard deviation) describe central tendency and variability, with careful choice depending on distribution shape and presence of outliers.
Transformations (e.g., log) can normalize skewed data and improve modeling, but require careful interpretation and potential back-transformation for reporting.
Categorical data and contingency tables enable exploration of associations between variables (e.g., treatment type and outcome; Titanic survival by age category) and underpin hypothesis testing.
Hypothesis testing and randomization are core to evaluating treatment effects and identifying biases; the burden of proof lies with the alternative hypothesis, and p-values guide decisions to reject or not reject the null.
Real-world relevance: these methods support epidemiological surveillance (e.g., post-COVID trends), clinical trials, vaccine and treatment assessments, and resource allocation in health systems.

Quick practice prompts (conceptual prompts you can try)

Sketch or imagine the expected distribution for the following variables and state whether you anticipate a unimodal or bimodal shape, and whether you expect skewness or symmetry:
- Number of cups of coffee consumed daily
- Hours spent on social media per day by students during exam week
- ICU length of stay for a viral pneumonia cohort
Given a small dataset with an obvious outlier, discuss how the mean, standard deviation, median, and IQR would likely differ with and without the outlier.
Design a simple contingency table for a hypothetical vaccine trial, showing recovery status (recovered/not recovered) by vaccine type (A/B/Placebo). Explain how you would use a chi-square test to assess association.
Describe a scenario in which a log transformation would help you analyze a health variable, and explain potential interpretation challenges after transformation.
Outline the null and alternative hypotheses for a hypothetical study on whether a new drug improves recovery time, and describe what a p-value would tell you in this context.
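As a starting point for the contingency-table prompt, here is a sketch of the chi-square statistic computed by hand. The counts are hypothetical, and `chi_square` is a helper written for this example. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-square upper-tail probability is exp(-x/2); other table sizes would need statistical software or a table.

```python
import math

def chi_square(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if recovery were independent of vaccine type.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts. Rows: recovered, not recovered. Columns: A, B, Placebo.
table = [
    [40, 35, 20],
    [10, 15, 30],
]
stat, df = chi_square(table)

# Closed form valid only when df == 2.
p_value = math.exp(-stat / 2) if df == 2 else None
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.5f}")
```

A small p-value here would suggest that recovery status and vaccine type are associated in these made-up data; it would not by itself say which group differs.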

Note on tools and upcoming work

In the next sessions, we will practice with SPSS (or equivalent software) to generate scatter plots, histograms, box plots, contingency tables, mosaic plots, and bar plots from real health datasets.
Projects will involve analyzing data from health records and clinical trials, with emphasis on summarization, visualization, and hypothesis testing.

Final recap

Summarizing health data turns raw information into meaningful, decision-relevant insights.
A toolkit of visualizations and statistics helps describe patterns, assess relationships, and determine whether observed effects are likely due to chance or to real factors.
Always consider distribution shape, outliers, and robustness of summary statistics when interpreting health data.
Use randomization and formal hypothesis testing to guard against bias and to support evidence-based conclusions in health research.