Statistics is a collection of tools for interpreting quantitative data in a meaningful way.
It helps in understanding observations and their clinical/scientific implications.
Key Statistical Terms
Observational Units: Individual entities on which data is collected.
Variable(s): Characteristics or measurements that vary among observational units.
Data/Observations: Values that variables take for a particular observational unit.
Sampled from a “population”: A subset of a larger group (population) from which data is collected.
Feature(s) of Interest: Specific characteristics being studied.
Categorical (Qualitative) Data: Characteristics classified into groups (e.g., eye color).
Numerical (Quantitative) Data: Measurements conveying information about amounts.
Understanding Data
Statistics is fundamentally about understanding data.
Statistical Tools Overview
Descriptive Statistics: Describing data using summary values (counts, percentages, mean, median, standard deviation, range).
Data Visualization: Using graphs and charts to represent and interpret data effectively.
Correlation: Measuring the strength and direction of a relationship between two variables.
Regression: Examining the relationship between two or more variables.
Probability: Understanding the likelihood of events.
Inferential Statistics: Drawing conclusions about a larger population based on a sample.
Hypothesis Testing: Assessing the credibility of a statement about a population based on sample data.
Data Overview
Datasets consist of variables that vary from one entity to another.
Quantitative Variables: Numerical measurements.
Discrete: Whole values (e.g., number of events, objects, people).
Continuous: Measures that can take any value within an observed range (e.g., length, weight).
Qualitative Variables: Categorical data.
Nominal: Data classified by quality rather than numerical measure (e.g., Dead/Alive).
Ordinal: Ordered categories (e.g., level of agreement).
Categorical Data
Distribution: Possible category values and the frequency of each value.
Frequency: The number of observations that fall into each category.
Relative Frequency: The proportion of observations that fall into each category (count/total number).
Displaying Categorical Data: Use tables, bar charts, or pie charts, ensuring each data value is represented by the same amount of area.
Joint distributions can highlight the relative frequency of more than one variable.
Descriptive Statistics for Categorical Data
Categorical data is summarized by counts, tables, pie charts, bar charts, and column charts.
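As a concrete illustration, here is a minimal Python/pandas sketch (with made-up eye-colour and sex values) showing how frequencies, relative frequencies, and a joint distribution are computed; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical categorical data for 8 observational units
df = pd.DataFrame({
    "eye_colour": ["brown", "blue", "brown", "green", "blue", "brown", "brown", "green"],
    "sex": ["F", "M", "F", "F", "M", "M", "F", "M"],
})

# Frequency: number of observations in each category
freq = df["eye_colour"].value_counts()

# Relative frequency: proportion in each category (count / total number)
rel_freq = df["eye_colour"].value_counts(normalize=True)

# Joint distribution of two categorical variables as relative frequencies
joint = pd.crosstab(df["eye_colour"], df["sex"], normalize="all")

print(freq, rel_freq, joint, sep="\n\n")
```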
Summarizing Quantitative Data
Measures of Central Tendency:
Mean: Numeric average of the data.
Median: Value that splits the data in half.
Mode: Most frequent value.
Mid-Range: Average of the minimum and maximum values.
Measures of Variability (Spread):
Variance: Average of squared deviations of the observations from the mean.
Standard Deviation: Square root of the variance.
Range: Difference between the maximum and minimum values.
Interquartile Range (IQR): Difference between the 3rd and 1st quartiles (IQR = Q3 − Q1); spans the middle 50% of the data.
Q1: value below which ¼ of the data lies.
Q3: value above which ¼ of the data lies.
Sample properties reflect the distribution of the underlying population.
Spread Relative to Mean:
Relative Standard Deviation
Coefficient of Variation.
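A minimal sketch (using NumPy and made-up measurements) of how these summaries are computed; the data values are illustrative only.

```python
import numpy as np

# Hypothetical continuous measurements (e.g. weights in kg)
x = np.array([62.1, 70.4, 68.0, 75.3, 66.2, 80.1, 72.5, 69.9, 64.7, 71.8])

mean = np.mean(x)                        # numeric average
median = np.median(x)                    # middle value of the sorted data
mid_range = (np.min(x) + np.max(x)) / 2  # average of minimum and maximum
variance = np.var(x, ddof=1)             # average squared deviation from the mean (sample version)
sd = np.std(x, ddof=1)                   # square root of the variance
data_range = np.max(x) - np.min(x)       # maximum minus minimum
q1, q3 = np.percentile(x, [25, 75])      # 1st and 3rd quartiles
iqr = q3 - q1                            # middle 50% of the data

print(mean, median, mid_range, variance, sd, data_range, iqr)
```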
Descriptive statistics are data summaries used to characterize quantitative data.
Common Measures of a “Typical” Value
Mean: Add all values and divide by the total number of observations.
Median: Sort the data and take the middle value (or the average of the two middle values if there’s an even number of observations).
Coefficient of Variation (CV)
A standardized measure of dispersion.
Used in clinical laboratories to:
Aid in the selection of a new method for routine use.
Monitor the inherent variability (precision) of a method already in routine use.
Formula: CV = (SD / Mean) * 100
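For example, the CV of replicate measurements of a control sample might be computed as below; this is a sketch with hypothetical quality-control values.

```python
import numpy as np

def coefficient_of_variation(values):
    """CV (%) = (SD / mean) * 100, using the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=1) / np.mean(values) * 100

# Hypothetical replicate measurements of a control sample
replicates = [4.9, 5.1, 5.0, 5.2, 4.8, 5.1]
print(f"CV = {coefficient_of_variation(replicates):.1f}%")
```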
Visual Displays
Visual displays complement summary statistics in describing the data.
Example: Scatter plots to explore relationships between age and TB test results.
Appropriate Numeric Summaries
Data Distribution
Symmetric Data:
Data is similarly spread on either side of the mean (mean ≈ median).
The sample mean and standard deviation are appropriate summaries.
Skewed Data:
Right-Skewed (long right tail): Median is less than the mean.
Left-Skewed (long left tail): Median is greater than the mean.
The sample median and interquartile range are appropriate summaries.
Distributions
A distribution describes the pattern of values that data take when drawn from a population (e.g., normal distribution).
Rescaling (standardization) of a distribution expresses each value’s difference from the mean in terms of standard deviations: z = (x − mean) / SD.
Normal Distribution: Commonly used; areas under the distribution curve represent probabilities (the total area sums to one).
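The sketch below (hypothetical mean, SD, and observation, using scipy.stats.norm) illustrates standardization and the idea that areas under the normal curve are probabilities.

```python
from scipy.stats import norm

# Hypothetical population parameters and a single observation
mu, sigma = 70, 5
x = 78

# Standardize: distance from the mean in SD units
z = (x - mu) / sigma

# Areas under the normal curve are probabilities (total area = 1)
p_below = norm.cdf(x, loc=mu, scale=sigma)   # P(X <= 78)
p_within_2sd = norm.cdf(2) - norm.cdf(-2)    # ~0.954

print(z, p_below, p_within_2sd)
```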
Bivariate Associations
Examining bivariate relationships serves several purposes:
Exploring the nature of the relationship.
Explaining how one variable can explain the variability in another.
Fitting a model that best represents the relationship for prediction.
Association: Two variables are associated if knowing the value of one tells us something about the values of the other variable.
Correlation and linear regression can help determine if a relationship between two numeric variables is meaningful.
Scatterplots
Display two quantitative variables measured on the same individuals/experimental units.
Useful for showing patterns, trends, relationships, and outliers.
Features of a displayed association:
Direction: Positive or negative relationship.
Form: Linear or non-linear.
Strength: How closely the values move together.
Correlation Coefficients
Pearson’s Sample Correlation Coefficient (r):
Measure of linear association between y and x.
Unit-free, between -1 and 1 (inclusive).
Depends on both slope and scatter about a line of best fit.
Spearman’s Rank Correlation Coefficient:
Measure of the strength and direction of the monotonic relationship between two ranked variables.
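A minimal sketch of both coefficients using scipy.stats; the paired x and y values are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired measurements on the same experimental units
x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1])
y = np.array([2.3, 2.9, 3.8, 4.1, 5.5, 5.9, 7.4, 7.9])

r, r_pval = pearsonr(x, y)        # linear association, between -1 and 1
rho, rho_pval = spearmanr(x, y)   # monotonic association based on ranks

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```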
Simple Linear Regression
Describes the relationship between two numeric variables in terms of a predictor (x) and a response variable (y).
Models the relationship as a straight line: y = b0 + b1x
Different from correlation as one variable is defined as the response, and one as the predictor or explanatory variable.
Least Squares Line: Minimizes the sum of squared deviations.
Residuals and R-squared
Residual: The vertical distance between an observed point and its fitted value on the line of best fit. Residual = Observed value − Fitted value.
R-squared: Proportion of the variation in the response variable that is explained by the variation in the predictor variable. R^2 = Explained variation/Total variation
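The sketch below fits a least squares line with scipy.stats.linregress to made-up (x, y) data and extracts the fitted values, residuals, and R-squared; all values are illustrative.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical predictor (x) and response (y)
x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1])
y = np.array([2.3, 2.9, 3.8, 4.1, 5.5, 5.9, 7.4, 7.9])

fit = linregress(x, y)                  # least squares: minimizes sum of squared deviations
fitted = fit.intercept + fit.slope * x  # y = b0 + b1*x
residuals = y - fitted                  # observed value - fitted value
r_squared = fit.rvalue ** 2             # explained variation / total variation

print(f"b0 = {fit.intercept:.2f}, b1 = {fit.slope:.2f}, R^2 = {r_squared:.3f}")
```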
Basic Assumptions Underpinning Least Squares Inference
Linearity: A straight line provides an adequate model of the relationship.
Constant Variance: The spread of the responses about the straight line is the same across the range of the explanatory variable.
Normality: The residuals follow a normal distribution.
Independence: The difference between any response and its fitted value is independent of all others.
Assessing Model Adequacy Using Residual Plots
If a model adequately describes the relationship between the two variables, we would expect to see no obvious pattern in the residuals.
Check for patterns in residual plots to ensure that basic assumptions underpinning least squares inference are met, e.g. linearity, constant variance.
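A sketch of such a check, continuing the hypothetical fit above: plot residuals against fitted values and look for structure (matplotlib is assumed to be available).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Hypothetical data and fit (as in the earlier sketch)
x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1])
y = np.array([2.3, 2.9, 3.8, 4.1, 5.5, 5.9, 7.4, 7.9])
fit = linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

# Residuals vs fitted values: an adequate model shows no obvious pattern,
# with roughly constant spread across the range of fitted values
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```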
Simple Linear Regression - Additional Considerations
Outlying observations can pull the regression line away from the line that best fits the main body of the data.
Increasing variation in the response with increasing values of the predictor variable violates the constant variance assumption.
Logging the data is often appropriate, especially for biological data. Always check the distribution of your data at the start of any analysis.
Inferential Statistics
Drawing conclusions about a larger population based on a sample: collect evidence (observations) on some units/subjects (the sample), evaluate the sample evidence, and make decisions about the population. Hypothesis testing uses sample statistics to answer questions about the population.
Sample Statistics:
Quantities calculated from the sample data.
Examples: x̄ (sample mean), s (sample standard deviation)
Population Parameters:
Defining characteristics of the population, usually unknown, and typically labeled with a Greek symbol
Examples: μ (population mean), σ (population standard deviation).
Valid Inference
Requires:
Representative sample of the target population.
Application of appropriate statistical method to address the right question.
Correct interpretation of results.
Inference to Target Population
Important considerations:
What is the question of interest?
What does the data look like?
Is appropriate data available/collectable?
How best to answer the question with the available dataset?
Sources of Variation in Laboratory Measurements
Analytical: Observed differences in the value of an analyte once it has been prepared for analysis.
Intra-individual: Variability in true values of an analyte obtained from the same individual.
Inter-individual: Variability due to differences in true (mean) values of an analyte between individuals.
Understanding inherent variability of data drawn from a population/measurement system is key to statistical understanding and inference
Repeatability: The variation observed when the same operator measures the same part repeatedly with the same device.
Reproducibility: The variation observed when different operators measure the same part using the same device.
Measurement Systems
Accuracy: How close are the measurements to their “true” value?
Precision: How close are independent measurements of the same thing to each other?
Populations and Samples
Study a part in order to gain information about the whole
Sampling and Data Collection
Usefulness: The usefulness of a set of observations is limited unless you can extrapolate to some wider population (extrapolation ≡ inference).
Compared to what? How representative?
Sampling Biases
Include:
Experimental convenience.
Voluntary self-selection.
Investigator intervention.
Statistical Inference - Hypothesis Testing
How real is an observed relationship between height and muscle strength?
Is an observed difference in serum ferritin between males and females statistically significant?
What is a Test of Significance?
A statistical way to help answer research questions or claims about certain population parameters.
Utilizes observations collected from a sample – a subset of the population of interest.
Provides an objective way of addressing the issue: assessing the credibility of a statement about a population based on sample data.
Hypothesis Tests
Hypothesis tests resolve conflicts between two competing opinions (hypotheses).
These hypotheses are:
H0, the null hypothesis, which is the presumed status quo.
HA, the alternative hypothesis, which is a contradiction of the null hypothesis or a “challenge” to the status quo.
Hypothesis Testing Components
Uses sample statistics to answer questions about the population
Sample estimate(s) obtained from the data
Population parameter(s) (defining characteristic)
Test statistic with a known distribution
P-value: Probability your results occurred just by chance if H0 was true.
Conclusion: Relating the result back to the context of the question.
Questions to Consider
Does this histogram suggest the data comes from a population with a mean greater than 12?
(Histogram annotations: sample mean = 13.8, SD = 0.94, sample size n = 50; hypothesized mean = 12.)
Sampling Distribution of the Mean
Properties of a sample mean:
Same mean as the population distribution
Smaller variation than the population distribution
How much smaller?
General rule: Suppose we have a population of N individuals with a population standard deviation σ. Provided we take a simple random sample and the sample size n is small compared with the population size N, the standard deviation of all the sample means of size n which could be taken from the population (the standard error of the mean) is σ/√n, i.e. inversely proportional to the square root of n.
Sampling Variation
Example population from which samples are drawn: mean = 70, SD = 5; about 95% of observations lie between 60.2 and 79.8 (mean ± 1.96 × SD).
Sample statistics vary about the population parameter. The larger the sample, the smaller the sampling error (i.e. the sample mean is more likely to be close to the population mean), as the simulation sketch below illustrates.
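A minimal simulation sketch of this idea using NumPy and the illustrative population above (mean 70, SD 5); the sample size of 25 and the number of repeated samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 70, 5, 25

# Draw many samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())          # close to the population mean (70)
print(sample_means.std(ddof=1))     # close to sigma / sqrt(n) = 1.0
print(sigma / np.sqrt(n))           # theoretical standard error of the mean
```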
Hypotheses
The two hypotheses of a significance test are:
a null hypothesis, denoted by H0
an alternative hypothesis denoted by H1 (or sometimes HA)
Suppose we are trying to investigate a claim (or research hypothesis), then:
H1 – statement that formulates the claim of the research hypothesis
H0 – negates this claim; hypothesis of “no difference” or the status quo – set up to be discredited by the significance testing
These hypotheses make claims about population parameters – the claims are investigated by evaluating sample estimates (test statistics).
The P-Value
To evaluate our hypotheses, we use information on the sampling distribution of our sample estimator and calculate the test statistic.
The test statistic:
has a known probability distribution under the assumption that H0 is correct
is likely to take ‘extreme values’ when H1 is correct (i.e. H0 is false).
The P-value is defined to be the probability of getting a test statistic at least as extreme as the one observed, under the assumption that H0 is true, i.e. it quantifies how likely it is to have seen the data we’ve collected (or something more extreme) if the null hypothesis is true.
The smaller the P-value, the less likely it is to have observed the result under the null hypothesis, and therefore the stronger the evidence in support of H1.
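To make this concrete, the sketch below computes a one-sample t statistic and its P-value for the earlier histogram example (sample mean 13.8, SD 0.94, n = 50, hypothesized mean 12) with a one-sided alternative μ > 12; only scipy.stats.t is assumed.

```python
import numpy as np
from scipy.stats import t

xbar, s, n, mu0 = 13.8, 0.94, 50, 12   # summary statistics from the histogram example

se = s / np.sqrt(n)                    # standard error of the sample mean
t_stat = (xbar - mu0) / se             # test statistic with a known (t) distribution under H0
p_value = t.sf(t_stat, df=n - 1)       # one-sided P-value: P(T >= t_stat | H0 true)

print(f"t = {t_stat:.2f}, one-sided P-value = {p_value:.2g}")
```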
Hypothesis Testing About a Single Population Mean
Using sample data to answer a question about the (unknown) mean values of the population from which the sample came
Alternative hypothesis, HA: The population mean is not equal to μ0.
Null hypothesis, H0: The population mean is equal to some set value μ0.
Statistical Inference - Hypothesis Testing
Asking whether 13.6 is “significantly” different to 14 is reframed as asking whether the calculated test statistic of −1.49 is “significantly” different to 0, or whether it is likely the result was observed just by chance.
There is insufficient evidence to reject H0, therefore we cannot conclude that haemoglobin levels in adult females are significantly impacted by increased body mass.
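A sketch of how that test statistic and its two-sided P-value could be computed from summary statistics. The sample SD (1.2) and sample size (20) are assumptions chosen purely for illustration, since the notes quote only the sample mean (13.6), the hypothesized mean (14), and the resulting test statistic of about −1.49.

```python
import numpy as np
from scipy.stats import t

xbar, mu0 = 13.6, 14        # sample mean and hypothesized population mean (from the notes)
s, n = 1.2, 20              # assumed sample SD and size, for illustration only

t_stat = (xbar - mu0) / (s / np.sqrt(n))       # about -1.49
p_value = 2 * t.sf(abs(t_stat), df=n - 1)      # two-sided P-value

print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.2f}")
# A large P-value here means insufficient evidence to reject H0.
```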
Hypothesis Testing – (Two-Sided Test) Comparing Population Means
Alternative hypothesis, HA: The population mean of population 1 is different to that of population 2.
Null hypothesis, H0: The population mean of population 1 is equal to that of population 2.
One-Sided Test (Alternative Hypothesis Specifies Direction)
Framed as an inequality, but H1 will never include the equals sign
H0: μmale = μfemale
H1: μmale > μfemale
Or equivalently:
H0: μmale − μfemale = 0
H1: μmale − μfemale > 0
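A minimal sketch of this one-sided comparison using scipy.stats.ttest_ind; the serum ferritin values below are hypothetical, and the equal-variance assumption is made only for simplicity.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical serum ferritin measurements (ug/L) for males and females
males = np.array([120, 95, 150, 110, 135, 98, 142, 128])
females = np.array([60, 85, 72, 95, 55, 78, 88, 66])

# One-sided test of H0: mu_male = mu_female against H1: mu_male > mu_female
t_stat, p_value = ttest_ind(males, females, equal_var=True, alternative="greater")

print(f"t = {t_stat:.2f}, one-sided P-value = {p_value:.3g}")
```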