Lecture Notes: Summarizing Data in Health Science (Vocabulary Flashcards)

Purpose of summarizing data in health science

  • The chapter focuses on how to summarize data and how those summaries help address health questions (e.g., patient recovery times, patterns of a spreading virus, treatment outcomes).

  • Goal: turn raw health data into clear, actionable insights, using numerical summaries and visualizations that reveal central tendency and spread, describe relationships, and support decision-making in health care.

  • Emphasizes integrating data summarization with health research questions: Is there an effective drug? What is the pattern of recovery? How do we detect biases or gaps in medical data?

  • Connections to real-world health work: epidemiology, clinical trials, pandemic response, vaccine allocation, and monitoring of treatment outcomes.

Case study from last week: Chronic Fatigue Syndrome (CFS) trial (1997)

  • Study aim: evaluate cognitive behavioral therapy (CBT) for CFS and compare with a control condition.

  • Recruitment and eligibility:

    • About 142 patients recruited from a hospital/clinic.

    • Only about 60 were eligible/ready for the trial; others did not meet criteria, had health issues, or refused participation.

  • Randomization and groups:

    • Participants randomly assigned to CBT vs. relaxation (control) conditions.

    • Approximately 30 participants were placed in the relaxation (control) arm; the remainder were allocated to CBT.

  • Follow-up and outcomes (6 months):

    • Follow-up data show about 19–27 participants in the CBT/treatment arm remained engaged.

    • Outcome rates: about 70% in the treatment group achieved the desired outcome vs about 19% in the control group.

  • Takeaway question: Why is random assignment valuable?

    • Random assignment balances known and unknown patient characteristics across groups on average, which reduces selection bias and confounding and makes the comparison between groups fair.

    • It supports more convincing, interpretable conclusions about treatment effectiveness.
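
To make the idea of random assignment concrete, here is a minimal sketch of how about 60 eligible patients could be split into two arms of 30. The participant IDs, the seed, and the function name `randomize` are hypothetical and chosen only for illustration; the notes do not describe the trial's actual allocation procedure.

```python
import random


def randomize(participant_ids, seed=None):
    """Shuffle participants and split them into two arms of (near) equal size."""
    rng = random.Random(seed)          # a seeded generator keeps the split reproducible
    shuffled = list(participant_ids)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"cbt": shuffled[:half], "relaxation": shuffled[half:]}


# Hypothetical IDs standing in for the roughly 60 eligible patients
ids = [f"P{i:02d}" for i in range(1, 61)]
arms = randomize(ids, seed=1997)
print(len(arms["cbt"]), len(arms["relaxation"]))
```

Because the shuffle ignores every patient characteristic, no feature of a patient can systematically influence which arm they end up in.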

Why summarize data in health science?

  • Raw health data (clinical trials, health records, global health databases) are chaotic; summarization makes patterns and trends visible.

  • Supports accountability and actionable decision-making (e.g., vaccine supply planning, resource allocation).

  • Provides a basis for visualizations that communicate findings to clinicians, policymakers, and the public.

Numerical data in health science: variables and data types

  • Numerical data fall into two main forms:

    • Continuous data (e.g., blood pressure, recovery time, life expectancy)

    • Discrete data (e.g., number of hospital visits, event counts)

  • In health data, numerical variables help reveal trends (e.g., rising obesity rates, recovery times).

  • Analogies to daily life: variables like steps counted by a wearable.

  • Historical context: even early health data (e.g., data from the 1918 influenza pandemic) used numerical summaries of death rates to identify vulnerable groups.

  • Plots for numerical data show either the relationship between two variables (scatter plots) or the distribution of a single variable (dot plots, histograms).

Visualizing numerical data: scatter plots and more

  • Scatter plots display the relationship between two numerical variables and help assess association, linearity, and potential trends.

  • Examples discussed in class:

    • Life expectancy vs total fertility rate (from Gapminder data): initial association appears negative (as fertility rises, life expectancy tends to be lower historically), with changes over time as health systems improve.

    • Across countries and regions (Africa, Europe, the Americas, Asia): scatter plots can illustrate regional patterns and inform policy (e.g., family planning as a policy lever to improve health outcomes).

    • Practical health examples: HIV viral load over time, COVID-19 cases and vaccination status.

  • The scatter plot discussion emphasizes three questions: Are the variables associated or independent? Is the relationship linear or non-linear? How does the association evolve over time?

  • The class planned a practical SPSS session to generate plots and analyze data from health datasets.
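
One common numerical companion to a scatter plot is the Pearson correlation coefficient, which measures the strength and direction of a linear association only. The sketch below computes it by hand in Python rather than SPSS; the five data points are made up for illustration and are not Gapminder values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only (NOT real country data)
fertility = [1.5, 2.0, 3.0, 4.5, 6.0]   # children per woman
life_expectancy = [82, 79, 74, 66, 58]  # years
r = pearson_r(fertility, life_expectancy)
print(round(r, 3))
```

A value near -1 indicates a strong negative linear association. A value near 0 does not rule out a non-linear relationship, so the plot should still be inspected.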

Other numerical visualizations

  • Dot plots:

    • Show distribution of a single numerical variable (e.g., GPA distribution from 2.5 to 4.0).

    • The mean is often indicated with a marker (e.g., a red triangle) to show the center.

    • Interpretation considerations: skewness can mislead if only the mean is viewed without the distribution shape.

  • Stacked dot plots:

    • A stacked arrangement helps reveal density and clustering without overlap, giving a clearer sense of the distribution around the center.

  • Histogram:

    • Bar heights show how many values fall in each bin (e.g., hours spent on activities, weekly performance).

    • Choice of bin width affects how spikes and tails are represented: bins that are too wide hide detail, while bins that are too narrow produce a noisy shape that obscures the overall pattern.

    • Discussed how histogram shape and bin choices can mask or reveal spikes, e.g., in epidemiology (age-specific case counts) or activity measurements.

  • Shape and modality of distributions:

    • Unimodal: one peak

    • Bimodal: two peaks

    • Multimodal: many peaks

    • Uniform: flat distribution

    • Symmetric: left and right halves mirror each other (e.g., bell-shaped); mean ≈ median

  • Skewness and outliers:

    • Right-skewed: long right tail; mean tends to be greater than the median

    • Left-skewed: long left tail; mean tends to be less than the median

    • Symmetric distributions have mean ≈ median

    • Outliers: isolated bars or gaps; can be data-entry errors or rare events; outliers impact the mean and standard deviation more than the median and IQR; need to scrutinize for data quality and potential insights.

  • Practical examples discussed:

    • Ebola outbreak: a spike in Liberia illustrating how outliers can indicate areas with unique transmission dynamics.

    • COVID-19 and vaccination, and historical HIV data: distributions can reveal spikes and tails relevant to public health decisions.

  • Transformations to handle skewness:

    • Log transformation (Y = log(X), defined only for X > 0) can make highly right-skewed data more symmetric (e.g., viral loads, attendance counts, skewed health metrics).

    • Pros: reduces skewness, improves modeling and interpretation for statistical analyses.

    • Cons: interpretation after transformation can be harder; back-transformation and reporting original scale may be nontrivial.

    • Example discussion: log-transforming basketball game attendance to illustrate normalization; viral loads often exhibit right-skewness that becomes more symmetric after log-transform.

  • Density and spatial patterns:

    • Intensity maps/heat maps illustrate how density or change varies across geography (e.g., opioid crisis density by county).

    • Colors convey density or rate of change, guiding resource allocation and policy decisions.
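
Two of the points above, the effect of bin width and the effect of a log transformation, can be sketched in a few lines of Python. The viral load values are hypothetical and deliberately right-skewed; `bin_counts` and `center` are illustrative helper names, not functions from any library.

```python
import math
from collections import Counter
from statistics import mean, median


def bin_counts(values, width):
    """Count values per bin [k*width, (k+1)*width), keyed by the bin's left edge."""
    counts = Counter(math.floor(v / width) * width for v in values)
    return dict(sorted(counts.items()))


def center(values):
    """Mean and median side by side, to see how far skew pulls them apart."""
    return {"mean": mean(values), "median": median(values)}


# Hypothetical viral loads (copies/mL), made up to be strongly right-skewed
viral_loads = [200, 500, 800, 1_500, 3_000, 10_000, 50_000, 400_000]

# Very wide bins lump almost everything into the first bar
wide = bin_counts(viral_loads, 100_000)
narrow = bin_counts(viral_loads, 1_000)

# Base-10 log transform: requires strictly positive values
log_loads = [math.log10(v) for v in viral_loads]
raw_center = center(viral_loads)
log_center = center(log_loads)

# Back-transforming the mean of the logs gives the geometric mean,
# which is not the same as the arithmetic mean on the original scale
geometric_mean = 10 ** log_center["mean"]

print(wide)
print(narrow)
print(raw_center, log_center, round(geometric_mean))
```

On the raw scale the mean sits far above the median. After the transform the two are close, which is the sense in which the log scale is "more symmetric". The back-transformed value illustrates the interpretation issue listed under the cons: it summarizes a typical value multiplicatively rather than additively.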

Describing distributions with summary statistics

  • Mean (average):

    • Symbols: $\bar{x}$ is the sample mean; $\mu$ is the population mean.

    • Sample mean formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

    • Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

    • In health trials, means are common but can be distorted by outliers; sometimes medians or transformed scales are preferred.

  • Variance and standard deviation:

    • Population variance: if we have the population, the definition mirrors the sample version; for a sample:

    • $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

    • Standard deviation: $s = \sqrt{s^2}$

    • Variance measures the average squared deviation from the mean; standard deviation preserves the same units as the data, aiding interpretation.

    • In health contexts, variance quantifies variability in responses (e.g., blood sugar responses, hormone levels) and helps assess consistency of treatment effects.

  • Median and robustness to outliers:

    • Median: middle value after ordering the data; for odd n, the middle value; for even n, the average of the two middle values.

    • Median is robust to outliers and extreme values (unlike the mean), making it a preferred center measure for skewed data.

  • Quartiles and IQR:

    • Q1 = 25th percentile; Q3 = 75th percentile; IQR = Q3 − Q1.

    • IQR describes the middle 50% of the data and is robust to outliers.

  • Box plots (five-number summary):

    • Components: minimum, Q1, median, Q3, maximum; whiskers typically extend to 1.5 × IQR beyond the quartiles; observations beyond whiskers are treated as outliers.

    • Visualizes center, spread, and potential outliers; good for comparing groups (side-by-side boxes).

  • Relationship between distribution shape and statistics:

    • For skewed distributions, medians and IQRs are more informative than means and SDs.

    • For symmetric distributions, means and SDs are informative.

  • Scenario: how replacing extreme values changes statistics:

    • Replacing the largest value with a very large number tends to heavily affect the mean but not the median.

    • Medians and IQRs are relatively resistant to extreme values; means and standard deviations are more sensitive to outliers.

  • Practical example: skewed survival times or income data favor using median and IQR for describing central tendency and spread.

Hypothesis testing, bias, and randomization in health research

  • Hypothesis testing framework:

    • Null hypothesis (H0): there is no effect or no bias (independence).

    • Alternative hypothesis (Ha): there is an effect or bias (dependence).

    • The burden of proof lies with the alternative; evidence is weighed against the null using a test statistic and p-value.

    • p-value threshold commonly set at $p \le 0.05$ to declare statistical significance.

  • Example: gender bias in promotions (classroom study):

    • Sample: 48 supervisors; 24 male, 24 female; promotions observed.

    • Outcome example: 35 promotions overall; breakdown by gender showed a higher promotion rate among males (e.g., 21/24 = 87.5%) than among females (e.g., 14/24 = 58.3%), a difference of about 29.2 percentage points.

    • Interpretation requires a statistical test to assess whether the observed gap could occur by chance under H0 (no bias).

  • Randomization and bias reduction:

    • Random assignments (e.g., in clinical trials or experiments) help ensure comparability between groups and minimize biases from confounding factors.

    • In health experiments, randomization supports fair comparisons of treatments and reduces systematic differences between groups.

  • Simulation and hypothesis testing in practice:

    • Card-shuffling and random allocation simulations illustrate how outcomes could arise under the null hypothesis.

    • The data inform whether to reject H0 or fail to reject, guiding conclusions about treatment effects or biases.

  • Practical advice regarding hypothesis testing:

    • Always consider whether the observed effect size is practically meaningful, not just statistically significant.

    • Ensure study design minimizes confounding and bias; use randomization where possible.

    • Be mindful of data quality and whether assumptions (e.g., normality, independence) hold for the chosen test.
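
The card-shuffling idea can be sketched as a small simulation using the promotion numbers quoted above (48 files, 24 labeled male, 35 promoted, 21 of them male). Under H0 the promotion outcomes are unrelated to gender, so we shuffle the outcomes and count how often chance alone gives the male group at least 21 promotions. This is a one-sided test by assumption; the function names, seed, and number of shuffles are illustrative choices. An exact hypergeometric calculation is included as a cross-check.

```python
import random
from math import comb

N_FILES, N_MALE, N_PROMOTED = 48, 24, 35
OBSERVED_MALE_PROMOTED = 21


def simulated_p_value(n_shuffles=10_000, seed=0):
    """Share of random shuffles with at least as many male promotions as observed."""
    rng = random.Random(seed)
    # 1 = promoted, 0 = not promoted; the first N_MALE positions play the "male" files
    outcomes = [1] * N_PROMOTED + [0] * (N_FILES - N_PROMOTED)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(outcomes)
        if sum(outcomes[:N_MALE]) >= OBSERVED_MALE_PROMOTED:
            extreme += 1
    return extreme / n_shuffles


def exact_p_value():
    """Same one-sided probability from the hypergeometric distribution."""
    total = comb(N_FILES, N_MALE)
    favorable = sum(
        comb(N_PROMOTED, k) * comb(N_FILES - N_PROMOTED, N_MALE - k)
        for k in range(OBSERVED_MALE_PROMOTED, N_MALE + 1)
    )
    return favorable / total


p_sim = simulated_p_value()
p_exact = exact_p_value()
print(round(p_sim, 4), round(p_exact, 4))
```

Both values fall below 0.05, so under the usual threshold a gap this large would rarely arise by chance alone and H0 would be rejected. Whether a gap of this size matters in practice is a separate question.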

Practical takeaways and connections to real-world health research

  • Data summarization is essential for transforming chaotic health data into actionable insights that can guide clinical and public health decisions.

  • Visualizations (scatter plots, dot plots, histograms, box plots, mosaic plots, bar charts, pie charts) communicate complex data patterns to diverse audiences.

  • Descriptive statistics (mean, median, IQR, variance, standard deviation) describe central tendency and variability, with careful choice depending on distribution shape and presence of outliers.

  • Transformations (e.g., log) can normalize skewed data and improve modeling, but require careful interpretation and potential back-transformation for reporting.

  • Categorical data and contingency tables enable exploration of associations between variables (e.g., treatment type and outcome; Titanic-era survivorship by age category) and underpin hypothesis testing.

  • Hypothesis testing and randomization are core to evaluating treatment effects and identifying biases; the burden of proof lies with the alternative hypothesis, and p-values guide decisions to reject or not reject the null.

  • Real-world relevance: these methods support epidemiological surveillance (e.g., post-COVID trends), clinical trials, vaccine and treatment assessments, and resource allocation in health systems.
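
As a small illustration of the point about outliers, the sketch below replaces the largest value in a made-up set of recovery times with an extreme one and compares the summaries. The data are hypothetical. Quartiles are computed with Python's `statistics.quantiles` using its default method, and other software (including SPSS) may use slightly different quartile conventions.

```python
from statistics import mean, median, quantiles, stdev


def describe(values):
    """Mean, sample SD (n - 1 denominator), median, and IQR for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Hypothetical recovery times in days
typical = [4, 5, 5, 6, 6, 7, 7, 8, 9]
with_outlier = typical[:-1] + [90]  # largest value replaced by an extreme one

before = describe(typical)
after = describe(with_outlier)
print(before)
print(after)
```

The mean and standard deviation jump, while the median and IQR stay the same for this dataset, which is why the robust pair is preferred for skewed health data.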

Quick practice prompts (conceptual prompts you can try)

  • Sketch or imagine the expected distribution for the following variables and state whether you anticipate a unimodal, bimodal, skewed, or symmetric shape:

    • Number of cups of coffee consumed daily

    • Hours spent on social media per day by students during exam week

    • ICU length of stay for a viral pneumonia cohort

  • Given a small dataset with an obvious outlier, discuss how the mean, standard deviation, median, and IQR would likely differ with and without the outlier.

  • Design a simple contingency table for a hypothetical vaccine trial, showing recovery status (recovered/not recovered) by vaccine type (A/B/Placebo). Explain how you would use a chi-square test to assess association.

  • Describe a scenario in which a log transformation would help you analyze a health variable, and explain potential interpretation challenges after transformation.

  • Outline the null and alternative hypotheses for a hypothetical study on whether a new drug improves recovery time, and describe what a p-value would tell you in this context.
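
For the contingency-table prompt, one possible starting point is sketched below. It computes the chi-square statistic by comparing observed counts with the counts expected under independence. The counts are invented for illustration, and the p-value shortcut `exp(-stat / 2)` is valid only when the degrees of freedom equal 2, as they do for a 2 × 3 table.

```python
from math import exp


def chi_square(table):
    """Chi-square statistic and degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = recovered / not recovered, columns = A, B, Placebo
table = [
    [40, 35, 20],
    [10, 15, 30],
]
stat, dof = chi_square(table)

# Upper-tail probability of a chi-square variable; this closed form holds for dof == 2 only
p_value = exp(-stat / 2)
print(round(stat, 2), dof, p_value)
```

A small p-value suggests that recovery status and vaccine type are associated. The test is usually considered reliable only when the expected counts are not too small (a common rule of thumb is at least 5 per cell).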

Note on tools and upcoming work

  • In the next sessions, we will practice with SPSS (or equivalent software) to generate scatter plots, histograms, box plots, contingency tables, mosaic plots, and bar plots from real health datasets.

  • Projects will involve analyzing data from health records and clinical trials, with emphasis on summarization, visualization, and hypothesis testing.

Final recap

  • Summarizing health data turns raw information into meaningful, decision-relevant insights.

  • A toolkit of visualizations and statistics helps describe patterns, assess relationships, and determine whether observed effects are likely due to chance or to real factors.

  • Always consider distribution shape, outliers, and robustness of summary statistics when interpreting health data.

  • Use randomization and formal hypothesis testing to guard against bias and to support evidence-based conclusions in health research.