Statistics is a collection of tools for interpreting quantitative data in a meaningful way.
It helps in understanding observations and their clinical/scientific implications.
Key Statistical Terms
Observational Units: Individual entities on which data is collected.
Variable(s): Characteristics or measurements that vary among observational units.
Data/Observations: Values that variables take for a particular observational unit.
Sampled from a “population”: A subset of a larger group (population) from which data is collected.
Feature(s) of Interest: Specific characteristics being studied.
Categorical (Qualitative) Data: Characteristics classified into groups (e.g., eye color).
Numerical (Quantitative) Data: Measurements conveying information about amounts.
Understanding Data
Statistics is fundamentally about understanding data.
Statistical Tools Overview
Descriptive Statistics: Describing data using summary values (counts, percentages, mean, median, standard deviation, range).
Data Visualization: Using graphs and charts to represent and interpret data effectively.
Correlation: Measuring the strength and direction of a relationship between two variables.
Regression: Examining the relationship between two or more variables.
Probability: Understanding the likelihood of events.
Inferential Statistics: Drawing conclusions about a larger population based on a sample.
Hypothesis Testing: Assessing the credibility of a statement about a population based on sample data.
Data Overview
Datasets consist of variables that vary from one entity to another.
Quantitative Variables: Numerical measurements.
Discrete: Whole values (e.g., number of events, objects, people).
Continuous: Measures that can take any value within an observed range (e.g., length, weight).
Qualitative Variables: Categorical data.
Nominal: Data classified by quality rather than numerical measure (e.g., Dead/Alive).
Ordinal: Ordered categories (e.g., level of agreement).
Categorical Data
Distribution: Possible category values and the frequency of each value.
Frequency: The number of observations that fall into each category.
Relative Frequency: The proportion of observations that fall into each category (count/total number).
Displaying Categorical Data: Use tables, bar charts, or pie charts, ensuring each data value is represented by the same amount of area.
Joint distributions can highlight the relative frequency of more than one variable.
Descriptive Statistics for Categorical Data
Categorical data is summarized by counts, tables, pie charts, bar charts, and column charts.
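As a concrete illustration, here is a minimal Python/pandas sketch (with made-up eye-colour and sex values) showing how frequencies, relative frequencies, and a joint distribution are computed; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical categorical data for 8 observational units
df = pd.DataFrame({
    "eye_colour": ["brown", "blue", "brown", "green", "blue", "brown", "brown", "green"],
    "sex": ["F", "M", "F", "F", "M", "M", "F", "M"],
})

# Frequency: number of observations in each category
freq = df["eye_colour"].value_counts()

# Relative frequency: proportion in each category (count / total number)
rel_freq = df["eye_colour"].value_counts(normalize=True)

# Joint distribution of two categorical variables as relative frequencies
joint = pd.crosstab(df["eye_colour"], df["sex"], normalize="all")

print(freq, rel_freq, joint, sep="\n\n")
```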
Summarizing Quantitative Data
Measures of Central Tendency:
Mean: Numeric average of the data.
Median: Value that splits the data in half.
Mode: Most frequent value.
Mid-Range: Average of the minimum and maximum values.
Measures of Variability (Spread):
Variance: Average of squared deviations of the observations from the mean.
Standard Deviation: Square root of the variance.
Range: Difference between the maximum and minimum values.
Interquartile Range (IQR): Difference between the 3rd and 1st quartiles (IQR = Q3 − Q1); spans the middle 50% of the data.
Q1: value below which ¼ of the data lies.
Q3: value above which ¼ of the data lies.
Sample properties reflect the distribution of the underlying population.
Spread Relative to Mean:
Relative Standard Deviation
Coefficient of Variation.
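A minimal sketch (using NumPy and made-up measurements) of how these summaries are computed; the data values are illustrative only.

```python
import numpy as np

# Hypothetical continuous measurements (e.g. weights in kg)
x = np.array([62.1, 70.4, 68.0, 75.3, 66.2, 80.1, 72.5, 69.9, 64.7, 71.8])

mean = np.mean(x)                        # numeric average
median = np.median(x)                    # middle value of the sorted data
mid_range = (np.min(x) + np.max(x)) / 2  # average of minimum and maximum
variance = np.var(x, ddof=1)             # average squared deviation from the mean (sample version)
sd = np.std(x, ddof=1)                   # square root of the variance
data_range = np.max(x) - np.min(x)       # maximum minus minimum
q1, q3 = np.percentile(x, [25, 75])      # 1st and 3rd quartiles
iqr = q3 - q1                            # middle 50% of the data

print(mean, median, mid_range, variance, sd, data_range, iqr)
```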
Descriptive statistics are data summaries used to characterize quantitative data.
Common Measures of a “Typical” Value
Mean: Add all values and divide by the total number of observations.
Median: Sort the data and take the middle value (or the average of the two middle values if there’s an even number of observations).
Coefficient of Variation (CV)
A standardized measure of dispersion.
Used in clinical laboratories to:
Aid in the selection of a new method for routine use.
Monitor the inherent variability (precision) of a method already in routine use.
Formula: CV = (SD / Mean) * 100
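For example, the CV of replicate measurements of a control sample might be computed as below; this is a sketch with hypothetical quality-control values.

```python
import numpy as np

def coefficient_of_variation(values):
    """CV (%) = (SD / mean) * 100, using the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=1) / np.mean(values) * 100

# Hypothetical replicate measurements of a control sample
replicates = [4.9, 5.1, 5.0, 5.2, 4.8, 5.1]
print(f"CV = {coefficient_of_variation(replicates):.1f}%")
```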
Visual Displays
Visual displays complement summary statistics in describing the data.
Example: Scatter plots to explore relationships between age and TB test results.
Appropriate Numeric Summaries
Data Distribution
Symmetric Data:
Data is similarly spread on either side of the mean (mean ≈ median).
The sample mean and standard deviation are appropriate summaries.
Skewed Data:
Right-Skewed (long right tail): Median is less than the mean.
Left-Skewed (long left tail): Median is greater than the mean.
The sample median and interquartile range are appropriate summaries.
Distributions
A distribution describes the pattern of values that data take when drawn from a population (e.g., normal distribution).
Rescaling (standardization) of a distribution expresses each value’s difference from the mean in terms of standard deviations: z = (x − mean) / SD.
Normal Distribution: Commonly used; areas under the distribution curve represent probabilities (the total area sums to one).
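The sketch below (hypothetical mean, SD, and observation, using scipy.stats.norm) illustrates standardization and the idea that areas under the normal curve are probabilities.

```python
from scipy.stats import norm

# Hypothetical population parameters and a single observation
mu, sigma = 70, 5
x = 78

# Standardize: distance from the mean in SD units
z = (x - mu) / sigma

# Areas under the normal curve are probabilities (total area = 1)
p_below = norm.cdf(x, loc=mu, scale=sigma)   # P(X <= 78)
p_within_2sd = norm.cdf(2) - norm.cdf(-2)    # ~0.954

print(z, p_below, p_within_2sd)
```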
Bivariate Associations
Examining bivariate relationships serves several purposes:
Exploring the nature of the relationship.
Explaining how one variable can explain the variability in another.
Fitting a model that best represents the relationship for prediction.
Association: Two variables are associated if knowing the value of one tells us something about the values of the other variable.
Correlation and linear regression can help determine if a relationship between two numeric variables is meaningful.
Scatterplots
Display two quantitative variables measured on the same individuals/experimental units.
Useful for showing patterns, trends, relationships, and outliers.
Features of a displayed association:
Direction: Positive or negative relationship.
Form: Linear or non-linear.
Strength: How closely the values move together.
Correlation Coefficients
Pearson’s Sample Correlation Coefficient (r):
Measure of linear association between y and x.
Unit-free, between -1 and 1 (inclusive).
Depends on both slope and scatter about a line of best fit.
Spearman’s Rank Correlation Coefficient:
Measure of the strength and direction of the monotonic relationship between two ranked variables.
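A minimal sketch of both coefficients using scipy.stats; the paired x and y values are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired measurements on the same experimental units
x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1])
y = np.array([2.3, 2.9, 3.8, 4.1, 5.5, 5.9, 7.4, 7.9])

r, r_pval = pearsonr(x, y)        # linear association, between -1 and 1
rho, rho_pval = spearmanr(x, y)   # monotonic association based on ranks

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```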
Simple Linear Regression
Describes the relationship between two numeric variables in terms of a predictor (x) and a response variable (y).
Models the relationship as a straight line: y = b0 + b1x
Different from correlation as one variable is defined as the response, and one as the predictor or explanatory variable.
Least Squares Line: Minimizes the sum of squared deviations.
Residuals and R-squared
Residual: The vertical distance between an observed point and its fitted value on the line of best fit. Residual = Observed value − Fitted value.
R-squared: Proportion of the variation in the response variable that is explained by the variation in the predictor variable. R^2 = Explained variation/Total variation
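The sketch below fits a least squares line with scipy.stats.linregress to made-up (x, y) data and extracts the fitted values, residuals, and R-squared; all values are illustrative.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical predictor (x) and response (y)
x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1])
y = np.array([2.3, 2.9, 3.8, 4.1, 5.5, 5.9, 7.4, 7.9])

fit = linregress(x, y)                  # least squares: minimizes sum of squared deviations
fitted = fit.intercept + fit.slope * x  # y = b0 + b1*x
residuals = y - fitted                  # observed value - fitted value
r_squared = fit.rvalue ** 2             # explained variation / total variation

print(f"b0 = {fit.intercept:.2f}, b1 = {fit.slope:.2f}, R^2 = {r_squared:.3f}")
```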
Basic Assumptions Underpinning Least Squares Inference
Linearity: A straight line provides an adequate model of the relationship.
Constant Variance: The spread of the responses about the straight line is the same across the range of the explanatory variable.
Normality: The residuals follow a normal distribution.
Independence: The difference between any response and its fitted value is independent of all others.
Assessing Model Adequacy Using Residual Plots
If a model adequately describes the relationship between the two variables, we would expect to see no obvious pattern in the residuals.
Check for patterns in residual plots to ensure that basic assumptions underpinning least squares inference are met, e.g. linearity, constant variance.
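A sketch of such a check, continuing the hypothetical fit above: plot residuals against fitted values and look for structure (matplotlib is assumed to be available).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Hypothetical data and fit (as in the earlier sketch)
x = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 6.0, 7.2, 8.1])
y = np.array([2.3, 2.9, 3.8, 4.1, 5.5, 5.9, 7.4, 7.9])
fit = linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

# Residuals vs fitted values: an adequate model shows no obvious pattern,
# with roughly constant spread across the range of fitted values
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```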
Simple Linear Regression - Additional Considerations
Outlying observations can pull the regression line away from the line that best fits the main body of the data.
Increasing variation in the response with increasing values of the predictor variable violates the constant variance assumption.
Logging the data is often appropriate, especially for biological data. Always check the distribution of your data at the start of any analysis.
Inferential Statistics
Drawing conclusions about a larger population based on a sample: collect evidence (observations) on some units/subjects (the sample), evaluate the sample evidence, and make decisions about the population. Hypothesis testing uses sample statistics to answer questions about the population.
Sample Statistics:
Quantities calculated from the sample data.
Examples: x̄ (sample mean), s (sample standard deviation)
Population Parameters:
Defining characteristics of the population, usually unknown, and typically labeled with a Greek symbol
Examples: μ (population mean), σ (population standard deviation).
Valid Inference
Requires:
Representative sample of the target population.
Application of appropriate statistical method to address the right question.
Correct interpretation of results.
Inference to Target Population
Important considerations:
What is the question of interest?
What does the data look like?
Is appropriate data available/collectable?
How best to answer the question with the available dataset?
Sources of Variation in Laboratory Measurements
Analytical: Observed differences in the value of an analyte once it has been prepared for analysis.
Intra-individual: Variability in true values of an analyte obtained from the same individual.
Inter-individual: Variability due to differences in true (mean) values of an analyte between individuals.
Understanding inherent variability of data drawn from a population/measurement system is key to statistical understanding and inference
Repeatability: The variation observed when the same operator measures the same part repeatedly with the same device.
Reproducibility: The variation observed when different operators measure the same part using the same device.
Measurement Systems
Accuracy: How close are the measurements to their “true” value?
Precision: How close are independent measurements of the same thing to each other?
Populations and Samples
Study a part in order to gain information about the whole
Sampling and Data Collection
Usefulness: The usefulness of a set of observations is limited unless you can extrapolate to some wider population (extrapolation ≡ inference).
Compared to what? How representative?
Sampling Biases
Include:
Experimental convenience.
Voluntary self-selection.
Investigator intervention.
Statistical Inference - Hypothesis Testing
How real is an observed relationship between height and muscle strength?
Is an observed difference in serum ferritin between males and females statistically significant?
What is a Test of Significance?
A statistical way to help answer research questions or claims about certain population parameters.
Utilizes observations collected from a sample – a subset of the population of interest.
Provides an objective way of addressing the issue: assessing the credibility of a statement about a population based on sample data.
Hypothesis Tests
Hypothesis tests resolve conflicts between two competing opinions (hypotheses).
These hypotheses are:
H0, the null hypothesis, which is the presumed status quo.
HA, the alternative hypothesis, which is a contradiction of the null hypothesis or a “challenge” to the status quo.
Hypothesis Testing Components
Uses sample statistics to answer questions about the population
Sample estimate(s) obtained from the data
Population parameter(s) (defining characteristic)
Test statistic with a known distribution
P-value: Probability your results occurred just by chance if H0 was true.
Conclusion: Relating the result back to the context of the question.
Questions to Consider
Does this histogram suggest the data comes from a population with a mean greater than 12?
(Histogram annotations: sample mean = 13.8, SD = 0.94, sample size n = 50; hypothesized mean = 12.)
Sampling Distribution of the Mean
Properties of a sample mean:
Same mean as the population distribution
Smaller variation than the population distribution
How much smaller?
General rule: Suppose we have a population of N individuals with a population standard deviation σ. Provided we take a simple random sample and the sample size n is small compared with the population size N, the standard deviation of all the sample means of size n which could be taken from the population (the standard error of the mean) is σ/√n, i.e. inversely proportional to the square root of n.
Sampling Variation
Example population from which samples are drawn: mean = 70, SD = 5; about 95% of observations lie between 60.2 and 79.8 (mean ± 1.96 × SD).
Sample statistics vary about the population parameter. The larger the sample, the smaller the sampling error (i.e. the sample mean is more likely to be close to the population mean), as the simulation sketch below illustrates.
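A minimal simulation sketch of this idea using NumPy and the illustrative population above (mean 70, SD 5); the sample size of 25 and the number of repeated samples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 70, 5, 25

# Draw many samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())          # close to the population mean (70)
print(sample_means.std(ddof=1))     # close to sigma / sqrt(n) = 1.0
print(sigma / np.sqrt(n))           # theoretical standard error of the mean
```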
Hypotheses
The two hypotheses of a significance test are:
a null hypothesis, denoted by H0
an alternative hypothesis denoted by H1 (or sometimes HA)
Suppose we are trying to investigate a claim (or research hypothesis), then:
H1 – statement that formulates the claim of the research hypothesis
H0 – negates this claim; hypothesis of “no difference” or the status quo – set up to be discredited by the significance testing
These hypotheses make claims about population parameters – the claims are investigated by evaluating sample estimates (test statistics).
The P-Value
To evaluate our hypotheses, we use information on the sampling distribution of our sample estimator and calculate the test statistic.
The test statistic:
has a known probability distribution under the assumption that H0 is correct
is likely to take ‘extreme values’ when H1 is correct (i.e. H0 is false).
The P-value is defined to be the probability of getting a test statistic at least as extreme as the one observed, under the assumption that H0 is true, i.e. it quantifies how likely it is to have seen the data we’ve collected (or something more extreme) if the null hypothesis is true.
The smaller the P-value, the less likely it is to have observed the result under the null hypothesis, and therefore the stronger the evidence in support of H1.
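To make this concrete, the sketch below computes a one-sample t statistic and its P-value for the earlier histogram example (sample mean 13.8, SD 0.94, n = 50, hypothesized mean 12) with a one-sided alternative μ > 12; only scipy.stats.t is assumed.

```python
import numpy as np
from scipy.stats import t

xbar, s, n, mu0 = 13.8, 0.94, 50, 12   # summary statistics from the histogram example

se = s / np.sqrt(n)                    # standard error of the sample mean
t_stat = (xbar - mu0) / se             # test statistic with a known (t) distribution under H0
p_value = t.sf(t_stat, df=n - 1)       # one-sided P-value: P(T >= t_stat | H0 true)

print(f"t = {t_stat:.2f}, one-sided P-value = {p_value:.2g}")
```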
Hypothesis Testing About a Single Population Mean
Using sample data to answer a question about the (unknown) mean values of the population from which the sample came
Alternative hypothesis, HA: The population mean is not equal to μ0.
Null hypothesis, H0: The population mean is equal to some set value μ0.
Statistical Inference - Hypothesis Testing
Asking whether 13.6 is “significantly” different to 14 is reframed as asking whether the calculated test statistic of −1.49 is “significantly” different to 0, or whether it is likely the result was observed just by chance.
There is insufficient evidence to reject H0, therefore we cannot conclude that haemoglobin levels in adult females are significantly impacted by increased body mass.
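A sketch of how that test statistic and its two-sided P-value could be computed from summary statistics. The sample SD (1.2) and sample size (20) are assumptions chosen purely for illustration, since the notes quote only the sample mean (13.6), the hypothesized mean (14), and the resulting test statistic of about −1.49.

```python
import numpy as np
from scipy.stats import t

xbar, mu0 = 13.6, 14        # sample mean and hypothesized population mean (from the notes)
s, n = 1.2, 20              # assumed sample SD and size, for illustration only

t_stat = (xbar - mu0) / (s / np.sqrt(n))       # about -1.49
p_value = 2 * t.sf(abs(t_stat), df=n - 1)      # two-sided P-value

print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.2f}")
# A large P-value here means insufficient evidence to reject H0.
```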
Hypothesis Testing – (Two-Sided Test) Comparing Population Means
Alternative hypothesis, HA: The population mean of population 1 is different to that of population 2.
Null hypothesis, H0: The population mean of population 1 is equal to that of population 2.
One-Sided Test (Alternative Hypothesis Specifies Direction)
Framed as an inequality, but H1 will never include the equals sign
H0: μmale = μfemale
H1: μmale > μfemale
Or equivalently:
H0: μmale − μfemale = 0
H1: μmale − μfemale > 0
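A minimal sketch of this one-sided comparison using scipy.stats.ttest_ind; the serum ferritin values below are hypothetical, and the equal-variance assumption is made only for simplicity.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical serum ferritin measurements (ug/L) for males and females
males = np.array([120, 95, 150, 110, 135, 98, 142, 128])
females = np.array([60, 85, 72, 95, 55, 78, 88, 66])

# One-sided test of H0: mu_male = mu_female against H1: mu_male > mu_female
t_stat, p_value = ttest_ind(males, females, equal_var=True, alternative="greater")

print(f"t = {t_stat:.2f}, one-sided P-value = {p_value:.3g}")
```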