Lecture Notes: Summarizing Data in Health Science (Vocabulary Flashcards)
Purpose of summarizing data in health science
The chapter focuses on how to summarize data and how those summaries help address health questions (e.g., patient recovery times, patterns of a spreading virus, treatment outcomes).
Goal: turn raw health data into clear, actionable insights, using numerical summaries and visualizations that reveal central tendency and spread, describe relationships, and support decision-making in health care.
Emphasizes integrating data summarization with health research questions: Is there an effective drug? What is the pattern of recovery? How do we detect biases or gaps in medical data?
Connections to real-world health work: epidemiology, clinical trials, pandemic response, vaccine allocation, and monitoring of treatment outcomes.
Case study from last week: Chronic Fatigue Syndrome (CFS) trial (1997)
Study aim: evaluate cognitive behavioral therapy (CBT) for CFS and compare with a control condition.
Recruitment and eligibility:
About 142 patients recruited from a hospital/clinic.
Only about 60 were eligible/ready for the trial; others did not meet criteria, had health issues, or refused participation.
Randomization and groups:
Participants randomly assigned to CBT vs. relaxation (control) conditions.
Approximately 30 participants were placed in the relaxation (control) arm; the remainder were allocated to CBT.
Follow-up and outcomes (6 months):
Follow-up data show about 19–27 participants in the CBT/treatment arm remained engaged.
Outcome rates: about 70% in the treatment group achieved the desired outcome vs about 19% in the control group.
Takeaway question: Why is random assignment valuable?
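Random assignment reduces bias: on average it balances both known and unknown characteristics across groups, so a difference in outcomes can be attributed to the treatment rather than to how participants were placed in groups.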
It supports more convincing, interpretable conclusions about treatment effectiveness.
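The balancing effect of random assignment can be illustrated with a small simulation. This is a minimal sketch, not the trial's actual procedure: the baseline scores are synthetic, the 30/30 split only mirrors the approximate group sizes mentioned above, and the helper name `randomize` is hypothetical.

```python
import random
import statistics

def randomize(participants, n_control, rng):
    """Shuffle a copy of the participants and split into (control, treatment)."""
    shuffled = participants[:]
    rng.shuffle(shuffled)
    return shuffled[:n_control], shuffled[n_control:]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic baseline fatigue scores for 60 participants (NOT data from the trial).
baseline = [rng.gauss(50, 10) for _ in range(60)]

# Repeat the random split many times and record the gap in mean baseline score.
gaps = []
for _ in range(2000):
    control, treatment = randomize(baseline, 30, rng)
    gaps.append(statistics.mean(treatment) - statistics.mean(control))

mean_gap = statistics.mean(gaps)
print(f"average baseline gap over 2000 randomizations: {mean_gap:.3f}")
```

Any single randomization can leave the groups somewhat unequal at baseline, but the gap averages out close to zero, which is the sense in which randomization removes systematic differences.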
Why summarize data in health science?
Raw health data (clinical trials, health records, global health databases) are chaotic; summarization makes patterns and trends visible.
Supports accountable, actionable decision-making (e.g., vaccine supply planning, resource allocation).
Provides a basis for visualizations that communicate findings to clinicians, policymakers, and the public.
Numerical data in health science: variables and data types
Numerical data fall into two main forms:
Continuous data (e.g., blood pressure, recovery time, life expectancy)
Discrete data (e.g., number of hospital visits, event counts)
In health data, numerical variables help reveal trends (e.g., rising obesity rates, recovery times).
Analogies to daily life: variables like steps counted by a wearable.
Historical context: even early health data (e.g., data from the 1918 influenza pandemic) used numerical summaries of death rates to identify vulnerable groups.
Plots for numerical data show either the distribution of a single variable (e.g., dot plots, histograms) or the relationship between two variables (e.g., scatter plots).
Visualizing numerical data: scatter plots and more
Scatter plots display the relationship between two numerical variables and help assess association, linearity, and potential trends.
Examples discussed in class:
Life expectancy vs total fertility rate (from Gapminder data): initial association appears negative (as fertility rises, life expectancy tends to be lower historically), with changes over time as health systems improve.
Across countries and regions (Africa, Europe, the Americas, Asia): scatter plots can illustrate regional patterns and inform policy (e.g., family planning as a policy lever to improve health outcomes).
Practical health examples: HIV viral load over time, COVID-19 cases and vaccination status.
The scatter plot discussion emphasizes three questions: Are the variables associated or independent? Is the relationship linear or non-linear? How does the association evolve over time?
The class planned a practical SPSS session to generate plots and analyze data from health datasets.
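The direction and strength of a linear association in a scatter plot can be summarized with the Pearson correlation coefficient. The sketch below uses plain Python purely for illustration (the course itself uses SPSS), and the fertility and life expectancy values are made-up numbers, not Gapminder data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical country-level values (illustrative only).
fertility = [1.5, 2.0, 2.8, 3.5, 4.6, 5.4, 6.1]      # children per woman
life_expectancy = [81, 78, 74, 69, 63, 60, 55]       # years

r = pearson_r(fertility, life_expectancy)
print(f"r = {r:.3f}")  # a negative r indicates a negative linear association
```

A value of r near -1 or +1 indicates a strong linear association. Because r only captures linear patterns, a curved relationship can produce a small r even when the variables are clearly related, so the scatter plot should still be inspected.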
Other numerical visualizations
Dot plots:
Show distribution of a single numerical variable (e.g., GPA distribution from 2.5 to 4.0).
The mean is often indicated with a marker (e.g., a red triangle) to show the center.
Interpretation considerations: skewness can mislead if only the mean is viewed without the distribution shape.
Stacked dot plots:
A stacked arrangement helps reveal density and clustering without overlap, giving a clearer sense of the distribution around the center.
Histogram:
Bars show the density of values across bins (e.g., hours spent on activities, weekly performance).
Choice of bin width affects how spikes and tails are represented; too wide or too narrow bins can obscure important patterns.
Discussed how histogram shape and bin choices can mask or reveal spikes, e.g., in epidemiology (age-specific case counts) or activity measurements.
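The effect of bin width can be seen by binning the same values several ways. This is a minimal text-based sketch; the ages are hypothetical and `histogram_counts` is an illustrative helper, not a library function.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width, start=0):
    """Count values per bin; each bin covers [left, left + bin_width)."""
    counts = Counter(math.floor((v - start) / bin_width) for v in values)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical ages of reported cases (illustrative only).
ages = [2, 3, 4, 5, 21, 23, 24, 25, 26, 27, 28, 29, 31, 33,
        62, 65, 68, 71, 72, 74]

for width in (5, 20, 50):
    print(f"bin width = {width}")
    for left, count in histogram_counts(ages, width).items():
        print(f"  {left:>3}-{left + width:<3} {'#' * count}")
```

With a width of 20 the three age clusters are visible; with a width of 50 the two younger clusters merge into one bar, and with a width of 5 the picture becomes fragmented across many short bars.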
Shape and modality of distributions:
Unimodal: one peak
Bimodal: two peaks
Multimodal: many peaks
Uniform: flat distribution
Symmetric: left and right halves mirror each other (e.g., bell-shaped); mean ≈ median
Skewness and outliers:
Right-skewed: long right tail; mean tends to be greater than the median
Left-skewed: long left tail; mean tends to be less than the median
Symmetric distributions have mean ≈ median
Outliers: isolated bars or gaps; can be data-entry errors or rare events; outliers impact the mean and standard deviation more than the median and IQR; need to scrutinize for data quality and potential insights.
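The different sensitivity of these statistics to an outlier can be checked directly. The sketch below uses Python's standard `statistics` module on hypothetical lengths of stay; the single value of 60 days is an invented outlier.

```python
import statistics

def summarize(values):
    """Return mean, sample SD, median, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical hospital lengths of stay in days (illustrative only).
stays = [3, 4, 4, 5, 5, 6, 6, 7]
stays_with_outlier = stays + [60]

clean = summarize(stays)
contaminated = summarize(stays_with_outlier)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {clean[name]:6.2f} -> {contaminated[name]:6.2f}")
```

The mean and standard deviation shift substantially when the outlier is added, while the median stays the same and the IQR changes only slightly. Note that software packages use slightly different quartile conventions, so IQR values may differ a little between tools.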
Practical examples discussed:
Ebola outbreak: a spike in Liberia illustrating how outliers can indicate areas with unique transmission dynamics.
COVID-19 and vaccination, and historical HIV data: distributions can reveal spikes and tails relevant to public health decisions.
Transformations to handle skewness:
Log transformation (Y = log(X)) can make strongly right-skewed, strictly positive data more symmetric (e.g., viral loads, attendance counts, skewed health metrics).
Pros: reduces skewness, improves modeling and interpretation for statistical analyses.
Cons: interpretation after transformation can be harder; back-transformation and reporting original scale may be nontrivial.
Example discussion: log-transforming basketball game attendance to illustrate normalization; viral loads often exhibit right-skewness that becomes more symmetric after log-transform.
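The effect of a log transformation on skewness can be sketched numerically. The viral load values below are hypothetical, the skewness measure is the simple moment-based one (mean of cubed z-scores), and base 10 is an arbitrary choice here; a natural log changes the scale but not the shape.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores (population SD)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical viral loads in copies/mL (illustrative only, all positive).
viral_loads = [200, 500, 1_000, 2_000, 5_000,
               10_000, 20_000, 50_000, 100_000, 1_000_000]
log_loads = [math.log10(v) for v in viral_loads]

raw_skew = skewness(viral_loads)
log_skew = skewness(log_loads)
print(f"skewness before: {raw_skew:.2f}, after log10: {log_skew:.2f}")

# Back-transforming the mean of the logs gives the geometric mean,
# not the arithmetic mean, which is one source of interpretation difficulty.
geometric_mean = 10 ** statistics.fmean(log_loads)
print(f"geometric mean: {geometric_mean:.0f}, "
      f"arithmetic mean: {statistics.fmean(viral_loads):.0f}")
```

Skewness near zero after the transformation indicates a roughly symmetric distribution on the log scale. The log is only defined for positive values, so zeros need separate handling.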
Density and spatial patterns:
Intensity maps/heat maps illustrate how density or change varies across geography (e.g., opioid crisis density by county).
Colors convey density or rate of change, guiding resource allocation and policy decisions.
Describing distributions with summary statistics
Mean (average):
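Symbols and formulas:
Sample mean: $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Population mean: $$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Sample variance: $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Sample standard deviation: $$s = \sqrt{s^2}$$
Hypothesis testing: a common convention is to require $$p \le 0.05$$ to declare statistical significance.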
Example: gender bias in promotions (classroom study):
Sample: 48 supervisors; 24 male, 24 female; promotions observed.
Outcome example: 35 promotions overall; the promotion rate was higher among males (21/24 = 87.5%) than among females (14/24 ≈ 58.3%), a difference of about 29.2 percentage points.
Interpretation requires a statistical test to assess whether the observed gap could occur by chance under H0 (no bias).
Randomization and bias reduction:
Random assignments (e.g., in clinical trials or experiments) help ensure comparability between groups and minimize biases from confounding factors.
In health experiments, randomization supports fair comparisons of treatments and reduces systematic differences between groups.
Simulation and hypothesis testing in practice:
Card-shuffling and random allocation simulations illustrate how outcomes could arise under the null hypothesis.
The data inform whether to reject H0 or fail to reject, guiding conclusions about treatment effects or biases.
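The card-shuffling idea can be sketched as a permutation simulation for the promotion example. Under H0, the 35 "promoted" and 13 "not promoted" outcomes are unrelated to gender, so they are shuffled and dealt at random into two groups of 24. The function name, seed, and number of simulations below are illustrative choices, not part of the original study.

```python
import random

def simulate_null_differences(n_male=24, n_female=24, n_promoted=35,
                              n_sims=10_000, seed=1):
    """Shuffle outcomes and return simulated (male rate - female rate) values."""
    rng = random.Random(seed)
    # One card per file: 1 = promoted, 0 = not promoted.
    cards = [1] * n_promoted + [0] * (n_male + n_female - n_promoted)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(cards)
        male, female = cards[:n_male], cards[n_male:]
        diffs.append(sum(male) / n_male - sum(female) / n_female)
    return diffs

observed = 21 / 24 - 14 / 24  # about 0.292
diffs = simulate_null_differences()

# One-sided p-value: share of shuffles with a gap at least as large as observed.
# The small tolerance guards against floating-point rounding in the comparison.
p_value = sum(d >= observed - 1e-12 for d in diffs) / len(diffs)
print(f"observed difference: {observed:.3f}, simulated p-value: {p_value:.3f}")
```

If only a small share of shuffles produce a gap as large as the observed one, the observed difference would be unusual under H0, which is evidence against the "no bias" hypothesis. The simulated p-value varies slightly with the seed and the number of simulations.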
Practical advice regarding hypothesis testing:
Always consider whether the observed effect size is practically meaningful, not just statistically significant.
Ensure study design minimizes confounding and bias; use randomization where possible.
Be mindful of data quality and whether assumptions (e.g., normality, independence) hold for the chosen test.
Practical takeaways and connections to real-world health research
Data summarization is essential for transforming chaotic health data into actionable insights that can guide clinical and public health decisions.
Visualizations (scatter plots, dot plots, histograms, box plots, mosaic plots, bar charts, pie charts) communicate complex data patterns to diverse audiences.
Descriptive statistics (mean, median, IQR, variance, standard deviation) describe central tendency and variability, with careful choice depending on distribution shape and presence of outliers.
Transformations (e.g., log) can normalize skewed data and improve modeling, but require careful interpretation and potential back-transformation for reporting.
Categorical data and contingency tables enable exploration of associations between variables (e.g., treatment type and outcome; Titanic-era survivorship by age category) and underpin hypothesis testing.
Hypothesis testing and randomization are core to evaluating treatment effects and identifying biases; the burden of proof lies with the alternative hypothesis, and p-values guide decisions to reject or not reject the null.
Real-world relevance: these methods support epidemiological surveillance (e.g., post-COVID trends), clinical trials, vaccine and treatment assessments, and resource allocation in health systems.
Quick practice prompts (conceptual prompts you can try)
Sketch or imagine the expected distribution for each of the following variables and state whether you expect it to be unimodal or bimodal, and skewed or symmetric:
Number of cups of coffee consumed daily
Hours spent on social media per day by students during exam week
ICU length of stay for a viral pneumonia cohort
Given a small dataset with an obvious outlier, discuss how the mean, standard deviation, median, and IQR would likely differ with and without the outlier.
Design a simple contingency table for a hypothetical vaccine trial, showing recovery status (recovered/not recovered) by vaccine type (A/B/Placebo). Explain how you would use a chi-square test to assess association.
Describe a scenario in which a log transformation would help you analyze a health variable, and explain potential interpretation challenges after transformation.
Outline the null and alternative hypotheses for a hypothetical study on whether a new drug improves recovery time, and describe what a p-value would tell you in this context.
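As a starting point for the contingency-table prompt, the chi-square statistic can be computed by comparing observed counts with the counts expected under independence. This is a minimal sketch with invented counts; `chi_square_statistic` is an illustrative helper, and the critical value 5.991 is the standard chi-square cutoff for 2 degrees of freedom at the 0.05 level.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts (illustrative only).
# Rows: Vaccine A, Vaccine B, Placebo. Columns: recovered, not recovered.
table = [
    [40, 10],
    [35, 15],
    [20, 30],
]

stat, dof = chi_square_statistic(table)
CRITICAL_VALUE_DF2 = 5.991  # chi-square cutoff for dof = 2 at the 0.05 level
print(f"chi-square = {stat:.2f}, dof = {dof}")
print("reject H0 (no association)" if stat > CRITICAL_VALUE_DF2
      else "fail to reject H0")
```

A statistic well above the critical value suggests that recovery status and vaccine type are associated in these made-up data. The test assumes independent observations and expected counts that are not too small.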
Note on tools and upcoming work
In the next sessions, we will practice with SPSS (or equivalent software) to generate scatter plots, histograms, box plots, contingency tables, mosaic plots, and bar plots from real health datasets.
Projects will involve analyzing data from health records and clinical trials, with emphasis on summarization, visualization, and hypothesis testing.
Final recap
Summarizing health data turns raw information into meaningful, decision-relevant insights.
A toolkit of visualizations and statistics helps describe patterns, assess relationships, and determine whether observed effects are likely due to chance or to real factors.
Always consider distribution shape, outliers, and robustness of summary statistics when interpreting health data.
Use randomization and formal hypothesis testing to guard against bias and to support evidence-based conclusions in health research.