M Clin Path 2025 - Statistics Flashcards

Introduction to Statistics

Statistics: What and Why?

  • Statistics is a collection of tools for interpreting quantitative data in a meaningful way.
  • It helps in understanding observations and their clinical/scientific implications.

Key Statistical Terms

  • Observational Units: Individual entities on which data is collected.
  • Variable(s): Characteristics or measurements that vary among observational units.
  • Data/Observations: Values that variables take for a particular observational unit.
  • Sampled from a “population”: A subset of a larger group (population) from which data is collected.
  • Feature(s) of Interest: Specific characteristics being studied.
  • Categorical (Qualitative) Data: Characteristics classified into groups (e.g., eye color).
  • Numerical (Quantitative) Data: Measurements conveying information about amounts.

Understanding Data

  • Statistics is fundamentally about understanding data.

Statistical Tools Overview

  • Descriptive Statistics: Describing data using summary values (counts, percentages, mean, median, standard deviation, range).
  • Data Visualization: Using graphs and charts to represent and interpret data effectively.
  • Correlation: Measuring the strength and direction of a relationship between two variables.
  • Regression: Examining the relationship between two or more variables.
  • Probability: Understanding the likelihood of events.
  • Inferential Statistics: Drawing conclusions about a larger population based on a sample.
  • Hypothesis Testing: Assessing the credibility of a statement about a population based on sample data.

Data Overview

  • Datasets consist of variables that vary from one entity to another.
  • Quantitative Variables: Numerical measurements.
    • Discrete: Whole values (e.g., number of events, objects, people).
    • Continuous: Measures that can take any value within an observed range (e.g., length, weight).
  • Qualitative Variables: Categorical data.
    • Nominal: Data classified by quality rather than numerical measure (e.g., Dead/Alive).
    • Ordinal: Ordered categories (e.g., level of agreement).

Categorical Data

  • Distribution: Possible category values and the frequency of each value.
  • Frequency: The number of observations that fall into each category.
  • Relative Frequency: The proportion of observations that fall into each category (count/total number).
  • Displaying Categorical Data: Use tables, bar charts, or pie charts, ensuring each data value is represented by the same amount of area.
  • Joint distributions can highlight the relative frequency of more than one variable.
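The frequency and relative frequency definitions above can be sketched in a few lines. This is only an illustration: the eye-colour values and the helper name frequency_table are made up for the example, not taken from the course material.

```python
from collections import Counter

# Hypothetical sample of a categorical variable (eye colour of 8 subjects).
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "hazel"]


def frequency_table(values):
    """Return {category: (count, proportion)} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    # Relative frequency = count / total number of observations.
    return {category: (count, count / total) for category, count in counts.items()}


table = frequency_table(eye_colours)
for category, (count, proportion) in sorted(table.items()):
    print(f"{category:<6} count={count} relative frequency={proportion:.3f}")
```

The relative frequencies always sum to one, which is a quick sanity check on any frequency table.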

Descriptive Statistics for Categorical Data

  • Categorical data is summarized by counts, tables, pie charts, bar charts, and column charts.

Summarizing Quantitative Data

  • Measures of Central Tendency:
    • Mean: Numeric average of the data.
    • Median: Value that splits the data in half.
    • Mode: Most frequent value.
    • Mid-Range: Average of the minimum and maximum values.
  • Measures of Variability (Spread):
    • Variance: Average of squared deviations of the observations from the mean.
    • Standard Deviation: Square root of the variance.
    • Range: Difference between the maximum and minimum values.
    • Interquartile Range (IQR): IQR = Q3 – Q1. The difference between the 3rd and 1st quartiles, which spans the middle 50% of the data.
      • Q1: Value below which ¼ of the data lies.
      • Q3: Value above which ¼ of the data lies.
    • Sample properties reflect the distribution of the underlying population.
  • Spread Relative to Mean:
    • Relative Standard Deviation
    • Coefficient of Variation.
    • Descriptive statistics are data summaries used to characterize quantitative data.
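As a rough sketch, the summaries listed above can be computed with Python's standard statistics module. The data values are hypothetical. Two assumptions are worth flagging: the variance here uses the sample (n - 1) divisor, and quartiles use the module's default method, so other software may give slightly different quartile values.

```python
import statistics

# Hypothetical sample of eight measurements.
data = [4, 7, 7, 8, 9, 10, 12, 15]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
mid_range = (min(data) + max(data)) / 2

# Measures of spread. statistics.variance and statistics.stdev divide by n - 1
# (sample versions); statistics.pvariance would divide by n instead.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)
data_range = max(data) - min(data)

# Quartile conventions differ between packages; this is the module's default
# ("exclusive") method.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean} median={median} mode={mode} mid-range={mid_range}")
print(f"variance={variance:.3f} SD={std_dev:.3f} range={data_range} IQR={iqr}")
```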

Common Measures of a “Typical” Value

  • Mean: Add all values and divide by the total number of observations.
  • Median: Sort the data and take the middle value (or the average of the two middle values if there’s an even number of observations).

Coefficient of Variation (CV)

  • A standardized measure of dispersion.
  • Used in clinical laboratories to:
    • Aid in the selection of a new method for routine use.
    • Monitor the inherent variability (precision) of a method already in routine use.
  • Formula: CV = (SD / Mean) * 100
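A minimal sketch of the CV formula, applied to compare the precision of two methods. The replicate values and the method names are hypothetical, and the sample (n - 1) standard deviation is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """CV (%) = (sample SD / mean) * 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


# Hypothetical replicate results for the same control sample on two methods.
method_a = [5.0, 5.2, 4.9, 5.1, 4.8]
method_b = [5.0, 5.6, 4.4, 5.3, 4.7]

cv_a = coefficient_of_variation(method_a)
cv_b = coefficient_of_variation(method_b)
print(f"Method A CV = {cv_a:.2f}%")
print(f"Method B CV = {cv_b:.2f}%")
```

Both methods have the same mean here, so the lower CV of method A reflects tighter agreement between replicates, i.e. better precision.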

Visual Displays

  • Visual displays complement summary statistics in describing the data.
  • Example: Scatter plots to explore relationships between age and TB test results.

Appropriate Numeric Summaries

  • Data Distribution
    • Symmetric Data:
      • Data is similarly spread on either side of the mean (mean ≈ median).
      • The sample mean and standard deviation are appropriate summaries.
    • Skewed Data:
      • Right-Skewed (long right tail): Median is less than the mean.
      • Left-Skewed (long left tail): Median is greater than the mean.
      • The sample median and interquartile range are appropriate summaries.

Distributions

  • A distribution describes the pattern of values that data take when drawn from a population (e.g., normal distribution).
    • Rescaling a distribution expresses each value's difference from the mean in units of standard deviations.
  • Normal Distribution: A commonly used distribution. Areas under the distribution curve represent probabilities (the total area sums to one).
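The rescaling idea and the "area equals probability" idea can be sketched with the standard library's NormalDist. The mean of 100 and SD of 15 are hypothetical values chosen for round numbers.

```python
from statistics import NormalDist


def z_score(value, mean, sd):
    """Difference from the mean expressed in standard deviations."""
    return (value - mean) / sd


standard_normal = NormalDist(mu=0, sigma=1)


def area_between(z_low, z_high):
    """Area under the standard normal curve between two z-scores."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


# Hypothetical population: mean 100, SD 15. A value of 130 lies 2 SDs above the mean.
z = z_score(130, mean=100, sd=15)
print(f"z = {z}")
print(f"P(within 1 SD of the mean)    = {area_between(-1, 1):.4f}")
print(f"P(within 1.96 SD of the mean) = {area_between(-1.96, 1.96):.4f}")
```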

Bivariate Associations

  • Examining bivariate relationships serves several purposes:
    • Exploring the nature of the relationship.
    • Explaining how one variable can explain the variability in another.
    • Fitting a model that best represents the relationship for prediction.
  • Association: Two variables are associated if knowing the value of one tells us something about the values of the other variable.
  • Correlation and linear regression can help determine if a relationship between two numeric variables is meaningful.

Scatterplots

  • Display two quantitative variables measured on the same individuals/experimental units.
  • Useful for showing patterns, trends, relationships, and outliers.
  • Features of a displayed association:
    • Direction: Positive or negative relationship.
    • Form: Linear or non-linear.
    • Strength: How closely the values move together.

Correlation Coefficients

  • Pearson’s Sample Correlation Coefficient (r):
    • Measure of linear association between y and x.
    • Unit-free, between -1 and 1 (inclusive).
    • Depends on both slope and scatter about a line of best fit.
  • Spearman’s Rank Correlation Coefficient:
    • Measure of strength and direction of the monotonic relationship between two ranked variables
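To make the difference between the two coefficients concrete, here is a small sketch written from the textbook definitions rather than any particular package. Spearman's coefficient is computed as Pearson's r on the ranks, with tied values sharing the average rank. The data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's sample correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y = x**3 is perfectly monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the relationship is monotonic but curved, Spearman's coefficient reaches 1 while Pearson's r stays below 1.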

Simple Linear Regression

  • Describes the relationship between two numeric variables in terms of a predictor (x) and a response variable (y).
  • Models the relationship as a straight line: y = b0 + b1x
  • Different from correlation as one variable is defined as the response, and one as the predictor or explanatory variable.
  • Least Squares Line: Minimizes the sum of squared deviations.

Residuals and R-squared

  • Residual: The vertical distance between an observed point and its fitted value on the line of best fit. Residual = Observed value – Fitted value
  • R-squared: Proportion of the variation in the response variable that is explained by the variation in the predictor variable. R^2 = Explained variation/Total variation
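The least squares line, residuals and R-squared can be worked through on a small hypothetical dataset. The function names are illustrative. R-squared is computed as 1 minus (residual variation / total variation), which equals explained / total for a least squares line with an intercept.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(x, y, b0, b1):
    """Observed value minus fitted value for each point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def r_squared(y, resid):
    """Proportion of the variation in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    total = sum((yi - mean_y) ** 2 for yi in y)
    unexplained = sum(e ** 2 for e in resid)
    return 1 - unexplained / total


# Hypothetical predictor (x) and response (y) values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_line(x, y)
resid = residuals(x, y, b0, b1)
r2 = r_squared(y, resid)
print(f"y = {b0:.2f} + {b1:.2f}x, R^2 = {r2:.4f}")
print("residuals:", [round(e, 2) for e in resid])
```

The residuals from a least squares line with an intercept sum to zero, so it is their pattern, not their total, that is inspected when checking a model.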

Basic Assumptions Underpinning Least Squares Inference

  • Linearity: A straight line provides an adequate model of the relationship.
  • Constant Variance: The spread of the responses about the straight line is the same across the range of the explanatory variable.
  • Normality: The residuals follow a normal distribution.
  • Independence: The difference between any response and its fitted value is independent of all others.

Assessing Model Adequacy Using Residual Plots

  • If a model adequately describes the relationship between the two variables, we expect no obvious pattern in the residuals.
  • Check for patterns in residual plots to ensure that basic assumptions underpinning least squares inference are met, e.g. linearity, constant variance.

Simple Linear Regression - Additional Considerations

  • Model 5: Outlying observations pull the regression line away from the line that best fits the main body of the data.
  • Model 6: Increasing variation in response is observed with increasing values of the predictor variable.
  • Logging the data is often appropriate, especially for biological data. Always check the distribution of your data at the start of any analysis.

Inferential Statistics

  • Drawing conclusions about a larger population based on a sample.
  • Process: POPULATION → collect evidence (observations) on some units/subjects → SAMPLE → evaluate the sample evidence → make decisions about the population.
  • Hypothesis testing uses sample statistics to answer questions about the population.

Key Terms in Inferential Statistics

  • Sample Statistics:
    • Calculated summaries describing SAMPLE characteristics.
    • Examples: X̄ (sample mean), s (sample standard deviation).
  • Population Parameters:
    • Defining characteristics of the population, usually unknown, and typically labeled with a Greek symbol
    • Examples: μ (population mean), σ (population standard deviation).

Valid Inference

  • Requires:
    • Representative sample of the target population.
    • Application of appropriate statistical method to address the right question.
    • Correct interpretation of results.

Inference to Target Population

  • Important considerations:
    • What is the question of interest?
    • What does the data look like?
    • Is appropriate data available/collectable?
    • How can the question best be answered with the available dataset?

Sources of Variation in Laboratory Measurements

  • Analytical: Observed differences in the value of an analyte once it has been prepared for analysis.
  • Intra-individual: Variability in true values of an analyte obtained from the same individual.
  • Inter-individual: Variability due to differences in true (mean) values of an analyte between individuals.
  • Understanding inherent variability of data drawn from a population/measurement system is key to statistical understanding and inference
  • Repeatability: The variation observed when the same operator measures the same part repeatedly with the same device.
  • Reproducibility: The variation observed when different operators measure the same part using the same device.

Measurement Systems

  • Accuracy: How close are the measurements to their “true” value?
  • Precision: How close are independent measurements of the same thing to each other?
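One hedged way to put numbers on these two ideas is to treat accuracy as bias (mean of repeated measurements minus the true value) and precision as the standard deviation of the repeats. The reference value and both sets of measurements below are hypothetical.

```python
import statistics


def bias(measurements, true_value):
    """Accuracy expressed as bias: mean measurement minus the true value."""
    return statistics.mean(measurements) - true_value


def imprecision(measurements):
    """Precision expressed as the sample SD of repeated measurements."""
    return statistics.stdev(measurements)


true_value = 10.0  # hypothetical reference value
accurate_but_imprecise = [9.0, 11.0, 10.5, 9.5, 10.0]
precise_but_inaccurate = [11.9, 12.0, 12.1, 12.0, 12.0]

for label, values in [
    ("accurate but imprecise", accurate_but_imprecise),
    ("precise but inaccurate", precise_but_inaccurate),
]:
    print(f"{label}: bias = {bias(values, true_value):+.2f}, SD = {imprecision(values):.2f}")
```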

Populations and Sample

  • Study a part in order to gain information about the whole

Sampling and Data Collection

  • Usefulness: A set of observations is of limited use unless you can extrapolate to some wider population (extrapolation ≡ inference).
  • Key questions: Compared to what? How representative is the sample?

Sampling Biases

  • Include:
    • Experimental convenience.
    • Voluntary self-selection.
    • Investigator intervention.

Statistical Inference - Hypothesis Testing

  • How real is an observed relationship between height and muscle strength?
  • Is an observed difference in serum ferritin between males and females statistically significant?

What is a Test of Significance?

  • A statistical way to help answer research questions or claims about certain population parameters.
  • Utilizes observations collected from a sample – a subset of the population of interest.
  • Provides an objective way of addressing the issue: assessing the credibility of a statement about a population based on sample data.

Hypothesis Tests

  • Hypothesis tests resolve conflicts between two competing opinions (hypotheses).
  • These hypotheses are:
    • H0 , the null hypothesis, which is the presumed status quo.
    • HA , the alternative hypothesis, which is a contradiction of the null hypothesis or a “challenge” to the status quo.

Hypothesis Testing Components

  • Uses sample statistics to answer questions about the population
  • Sample estimate(s): Obtained from the data.
  • Population parameter(s): The defining characteristic(s) of the population.
  • Test statistic: Has a known distribution.
  • P-value: Probability that results at least as extreme as those observed would occur just by chance if H0 were true.
  • Conclusion: Relates the result back to the context of the question.

Questions to Consider

  • Does this histogram suggest the data come from a population with a mean greater than 12?
  • Sample summary: mean = 13.8, SD = 0.94, sample size n = 50.
  • Hypothesized mean: 12.

Sampling Distribution of the Mean

  • Properties of a sample mean:
    • Same mean as the population distribution.
    • Smaller variation than the population distribution.
  • How much smaller?
  • General rule: Suppose we have a population of N individuals with population standard deviation σ. Provided we take a simple random sample and the sample size n is small compared with the population size N, the standard deviation of all the sample means of size n that could be taken from the population is σ/√n, i.e. inversely proportional to the square root of n.
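The general rule can be checked informally by simulation. The sketch below assumes a normally distributed population with illustrative values (mean 70, SD 5), draws many samples of size n, and compares the observed SD of the sample means with σ/√n. The function name and the seed are arbitrary choices.

```python
import math
import random
import statistics


def simulate_sample_means(mu, sigma, n, repeats, seed=0):
    """Draw `repeats` samples of size n and return the mean of each sample."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(repeats)
    ]


mu, sigma, n = 70, 5, 25  # illustrative population mean, SD and sample size
sample_means = simulate_sample_means(mu, sigma, n, repeats=2000)

theoretical_se = sigma / math.sqrt(n)
observed_se = statistics.stdev(sample_means)
print(f"mean of sample means = {statistics.mean(sample_means):.2f}")
print(f"theoretical SD of sample means = {theoretical_se:.2f}")
print(f"observed SD of sample means    = {observed_se:.2f}")
```

The sample means centre on the population mean, and their spread is close to σ/√n, far smaller than the spread of individual observations.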

Sampling Variation

  • Population from which samples are drawn: mean 70, SD 5.
  • 95% of observations lie between 60.2 and 79.8.
  • Sample statistics vary about the population parameter. The larger the sample, the smaller the sampling error (i.e. the sample mean is more likely to be close to the population mean).
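The 60.2 to 79.8 range is the mean plus or minus 1.96 SDs, assuming a normal population. The same arithmetic, applied to the standard deviation of the sample mean (SD/√n), shows how the range for sample means narrows as n grows. The sample sizes below are arbitrary illustrations.

```python
import math

Z_95 = 1.96  # approximate 97.5th percentile of the standard normal distribution


def central_95_range(mean, sd):
    """Range expected to contain the central 95% of a normal distribution."""
    return mean - Z_95 * sd, mean + Z_95 * sd


def standard_error(sd, n):
    """Standard deviation of the mean of a sample of size n."""
    return sd / math.sqrt(n)


# Individual observations from the population (mean 70, SD 5).
population_range = central_95_range(70, 5)
print(f"observations: {population_range[0]:.1f} to {population_range[1]:.1f}")

# Sample means for a few illustrative sample sizes.
for n in (4, 25, 100):
    low, high = central_95_range(70, standard_error(5, n))
    print(f"sample means, n = {n:>3}: {low:.2f} to {high:.2f}")
```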

Hypotheses

  • The two hypotheses of a significance test are:
    • a null hypothesis, denoted by H0
    • an alternative hypothesis denoted by H1 (or sometimes HA)
  • Suppose we are trying to investigate a claim (or research hypothesis), then:
    • H1 – statement that formulates the claim of the research hypothesis
    • H0 – negates this claim; hypothesis of “no difference” or the status quo – set up to be discredited by the significance testing
  • These hypotheses make claims about population parameters – the claims are investigated by evaluating sample estimates (test statistics).

The P-Value

  • To evaluate our hypotheses, we use information on the sampling distribution of our sample estimator and calculate the test statistic.
  • The test statistic:
    • has a known probability distribution under the assumption that H0 is correct
    • is likely to take ‘extreme values’ when H1 is correct (i.e. H0 is false).
  • The P-value is defined to be the probability of getting a test statistic at least as extreme as the one observed, under the assumption that H0 is true, i.e. it quantifies how likely it is to have seen the data we’ve collected (or something more extreme) if the null hypothesis is true.
  • The smaller the P-value, the less likely it is to have observed the result under the null hypothesis, and therefore the stronger the evidence in support of H1.
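As a sketch of the definition, the function below converts a test statistic into a P-value under the assumption that the statistic follows a standard normal distribution when H0 is true. That is a large-sample simplification; many tests use a t distribution instead. The function name and the "alternative" labels are illustrative.

```python
from statistics import NormalDist


def p_value(z, alternative="two-sided"):
    """P-value for a test statistic assumed standard normal under H0."""
    std_normal = NormalDist()
    if alternative == "two-sided":
        # Values at least as far from 0 as z, in either direction.
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")


# More extreme statistics give smaller P-values.
for z in (0.5, 1.96, 2.58):
    print(f"z = {z:4.2f}: two-sided P = {p_value(z):.4f}")
```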

Hypothesis Testing About a Single Population Mean

  • Using sample data to answer a question about the (unknown) mean value of the population from which the sample came.
  • Null hypothesis H0: The population mean is equal to some set value μ0.
  • Alternative hypothesis HA: The population mean is not equal to μ0.
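A rough worked example, borrowing the sample summary quoted earlier in these notes (mean 13.8, SD 0.94, n = 50) and a hypothesized mean of 12. The test statistic is (sample mean - μ0) / (SD / √n). The P-value uses a normal approximation, which is an assumption made here because the standard library has no t distribution. A t distribution with n - 1 degrees of freedom would usually be used.

```python
import math
from statistics import NormalDist


def one_sample_statistic(sample_mean, sample_sd, n, mu0):
    """Test statistic: how many standard errors the sample mean lies from mu0."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


statistic = one_sample_statistic(sample_mean=13.8, sample_sd=0.94, n=50, mu0=12)

# Two-sided P-value under a normal approximation (assumption, see lead-in).
p_two_sided = 2 * (1 - NormalDist().cdf(abs(statistic)))

print(f"test statistic = {statistic:.2f}")
print(f"two-sided P-value (normal approximation) = {p_two_sided:.3g}")
```

A statistic this far from zero gives a P-value that is effectively zero, so H0 (population mean equal to 12) would be rejected.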

Statistical Inference - Hypothesis Testing

  • Asking whether 13.6 is “significantly” different from 14 is reframed as asking whether the calculated test statistic of -1.49 is “significantly” different from 0, or whether the result is likely to have been observed just by chance.
  • Conclusion: Insufficient evidence to reject H0, therefore we cannot conclude that haemoglobin levels in adult females are significantly impacted by increased body mass.

Hypothesis Testing – (Two-Sided Test) Comparing Population Means

  • Null hypothesis H0: The population mean of population 1 is equal to that of population 2.
  • Alternative hypothesis HA: The population mean of population 1 is different from that of population 2.

One-Sided Test (Alternative Hypothesis Specifies Direction)

  • Framed as an inequality, but H1 will never include the equals sign:
    H0 : µmale = µfemale
    H1 : µmale > µfemale
  • Or equivalently:
    H0 : µmale - µfemale = 0
    H1 : µmale - µfemale > 0
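The sketch below contrasts the one-sided and two-sided versions of a two-sample comparison. All summary values are invented for illustration and are not real measurements. The test statistic uses unpooled standard errors, and the P-values use a normal approximation rather than a t distribution, which is a simplification.

```python
import math
from statistics import NormalDist


def two_sample_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Difference in sample means divided by its (unpooled) standard error."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# Hypothetical summary statistics for two groups.
z = two_sample_statistic(mean1=110, sd1=40, n1=40, mean2=95, sd2=35, n2=40)

std_normal = NormalDist()
p_one_sided = 1 - std_normal.cdf(z)             # H1: mean1 - mean2 > 0
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))  # H1: mean1 - mean2 != 0

print(f"test statistic = {z:.3f}")
print(f"one-sided P = {p_one_sided:.4f}")
print(f"two-sided P = {p_two_sided:.4f}")
```

For the same data, the two-sided P-value is twice the one-sided value when the observed difference is in the hypothesized direction, so the choice of alternative should be made before looking at the data.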

Statistical Inference - Interpretation

  • Beware of influential outliers
  • Also beware of confounders

Check List for Data Analysis

  • The analysis
    • What is the question of interest?
    • Is appropriate data available?
  • The data
    • What is the population from which it is sourced?
    • Are there biases and/or confounders?
    • Is a transformation appropriate?
  • The interpretation
    • Have you answered the question of interest?
    • How strong is the evidence?
    • Are there any caveats?