Biostats Midterm

Data Presentation

Every study or experiment yields a set of data, which may range in size from a few measurements to many thousands of observations.

A complete data set may not necessarily be easily interpretable without applying descriptive statistics.

1. Introduction to Descriptive Statistics

Descriptive statistics are methods used to organize and summarize observations, providing an overview of the general features of a data set.

Forms of descriptive statistics include:

Tables

Graphs

Numerical summary measures

2. Types of Numerical Data
2.1. Nominal Data

Nominal data consist of unordered categories or classes.

Example: In a study, males might be represented as 1 and females as 0.

The numerical values assigned serve merely as labels, and calculations such as averages are meaningless (e.g., an average blood type of 1.8 has no practical interpretation).

Example from data: 9.6% of AIDS patients had Kaposi's sarcoma, 90.4% did not.

2.2. Ordinal Data

Ordinal data involve order among categories, with examples like severity classifications of injuries.

Example: 1 = fatal injury, 2 = severe injury, 3 = moderate injury, 4 = minor injury.

The differences between ranks are not necessarily uniform.

Example from Table 2.2: Eastern Cooperative Oncology Group performance classification defines patient status from fully active to completely disabled.

2.3. Ranked Data

Ranked data assign ranks based on magnitude.

Example: Listing causes of death and their frequency.

Ranks are informative, sometimes even more so than raw counts.

2.4. Discrete Data

Discrete data involve quantitative measurements that can only take specific values (often integers or counts).

Examples: Count of hospital beds, number of births in a month, etc.

2.5. Continuous Data

Continuous data are measurable values that can take any value within a range (not restricted to discrete values).

Examples: Height, weight, cholesterol levels, etc.

Like discrete data, continuous data support meaningful arithmetic operations (an average number of births per month is interpretable, for instance), but continuous measurements are additionally not restricted to a set of specified values.

3. Tables
3.1 Frequency Distributions

A common method of summarizing data is through frequency distributions.

Example: Frequency tables show counts for various categories, such as presence or absence of Kaposi's sarcoma in AIDS patients.

3.2 Relative Frequency

Relative frequency is the proportion of total counts within a certain interval, calculated by dividing the count in an interval by the total count.

Useful for comparing sets of data with different total observations.

3.3 Cumulative Relative Frequency

Cumulative relative frequency indicates the percentage of observations falling below an upper limit in a frequency distribution.
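The frequency measures above can be sketched in a few lines of Python. The counts below are illustrative, chosen only to be consistent with the Kaposi's sarcoma proportions quoted earlier (9.6% present, 90.4% absent).

```python
from collections import Counter

# Hypothetical observations, consistent with the 9.6% / 90.4% split above
data = ["absent"] * 226 + ["present"] * 24  # 250 observations, illustrative only

counts = Counter(data)
total = sum(counts.values())

cumulative = 0.0
for category, count in sorted(counts.items()):
    relative = count / total        # relative frequency: count / total count
    cumulative += relative          # cumulative relative frequency
    print(f"{category}: count={count}, relative={relative:.3f}, cumulative={cumulative:.3f}")
```

Because relative frequencies are proportions, they allow comparison across data sets with different total numbers of observations, and the final cumulative value is always 1.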

4. Graphs
4.1 Bar Charts

Bar charts display counts or relative frequencies for nominal or ordinal categories.

4.2 Histograms

Effective for displaying discrete or continuous data frequency distributions.

Groups the data into intervals (bins) and displays the frequency of each interval.

4.3 Frequency Polygons

A frequency polygon visualizes distribution using points connected by lines.

4.4 Box Plots

Box plots summarize data using quartiles, highlighting the median and potential outliers within a dataset.

5. Numerical Summary Measures

Numerical summary measures move from description toward analysis: the mean, median, mode, variance, standard deviation, and related quantities.

5.1 Measures of Central Tendency

Mean

The mean is the sum of observations divided by the number of observations (n).

Median

The median represents the middle value once data are ordered.

Mode

The mode is the most frequently occurring observation in data.

5.2 Measures of Dispersion

Range

The difference between maximum and minimum values is the range.

Interquartile Range

The interquartile range measures variability as the difference between the 75th and 25th percentiles, focusing on the middle 50% of the data values.

Variance and Standard Deviation

Variance is the average of the squared deviations from the mean (the sample variance divides by n - 1 rather than n); the standard deviation is its square root, expressed in the same units as the data and indicating the expected spread.

5.3 Coefficient of Variation

The coefficient of variation is a dimensionless measure of relative variability: the standard deviation divided by the mean, usually expressed as a percentage, CV = \frac{s}{\bar{x}} \times 100\%. Because it has no units, it permits comparison of variability between data sets measured on different scales.
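The summary measures of section 5 can be computed with Python's standard library. The cholesterol values below are hypothetical, used only to illustrate the calculations.

```python
import statistics

# Hypothetical cholesterol measurements (mg/dL), illustrative only
values = [180, 195, 200, 205, 210, 220, 240]

mean = statistics.mean(values)           # sum of observations / n
median = statistics.median(values)       # middle value of the ordered data
sd = statistics.stdev(values)            # sample SD (n - 1 denominator)
variance = statistics.variance(values)   # sample variance
data_range = max(values) - min(values)   # maximum minus minimum

# Coefficient of variation: relative variability as a percentage
cv = 100 * sd / mean

print(f"mean={mean:.1f}, median={median}, sd={sd:.1f}, "
      f"range={data_range}, CV={cv:.1f}%")
```

Note that `statistics.stdev` and `statistics.variance` use the n - 1 denominator appropriate for a sample; the population versions are `pstdev` and `pvariance`.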

6. Probability
6.1 Definition of Probability

Probability quantifies uncertainty regarding the occurrence of events derived from random variables.

It is a numerical measure of the likelihood that an event will occur, typically expressed as a value between 0 (impossible event) and 1 (certain event).

The probability of an event A is denoted as P(A).

6.2 Operations on Events

Events can be combined or manipulated using set operations, which have corresponding probability rules:

Intersection (A \cap B or A \text{ and } B)
: Represents the event where both A and B occur.

For independent events, P(A \cap B) = P(A) \cdot P(B).

Union (A \cup B or A \text{ or } B)
: Represents the event where A occurs, B occurs, or both occur.

P(A \cup B) = P(A) + P(B) - P(A \cap B).

Complement (A^c or A')
: Represents the event that A does not occur.

P(A^c) = 1 - P(A).

Conditional Probability (P(A|B))
: The probability that event A occurs given that event B has already occurred.

P(A|B) = \frac{P(A \cap B)}{P(B)}, provided P(B) > 0.

6.3 Bayes' Theorem

Bayes' theorem allows updating the probability of a hypothesis based on new evidence or information.

It is fundamental in areas like medical diagnosis, machine learning, and risk assessment, as it shows how to revise prior probabilities in light of new data.

The theorem is often expressed as:
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
Where:

P(A|B) is the posterior probability of event A occurring given B.

P(B|A) is the likelihood of event B occurring given A.

P(A) is the prior probability of event A.

P(B) is the marginal probability of event B.
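A minimal sketch of Bayes' theorem in a diagnostic-test setting. The prevalence, sensitivity, and specificity values are assumed for illustration and do not come from the notes.

```python
# Bayes' theorem for a hypothetical diagnostic test (assumed numbers):
prevalence = 0.01    # P(A): prior probability of disease
sensitivity = 0.95   # P(B|A): probability of a positive test given disease
specificity = 0.90   # P(negative | no disease)

# Marginal probability of a positive test, via the law of total probability:
# P(B) = P(B|A) P(A) + P(B|A^c) P(A^c)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior: P(A|B) = P(B|A) P(A) / P(B)
posterior = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {posterior:.3f}")
```

Even with a sensitive test, a low prevalence keeps the posterior probability of disease small after a single positive result, which is exactly the kind of revision of prior probabilities the theorem formalizes.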

7. Probability Distributions
7.1 Discrete and Continuous Random Variables

Discrete variables
can take a finite or countably infinite number of fixed, distinct outcomes.

These are typically counts or categories.

Examples: The number of heads in 10 coin flips (0, 1, \dots, 10), the number of patients responding to a treatment (0, 1, \dots, N), the number of defects in a manufactured batch.

Continuous variables
can take any value within a given range, often involving measurements.

These values are not restricted to specific integers and can include decimals.

Examples: Height of a person, weight of an object, blood pressure, cholesterol levels, reaction time to a stimulus.

7.2 Binomial Distribution

The binomial distribution describes the number of successes in a fixed number of independent Bernoulli (binary) trials.

Conditions for a Binomial Distribution:

A fixed number of trials, denoted by n.

Each trial must be independent.

Each trial has only two possible outcomes: "success" or "failure."

The probability of success (p) remains constant for every trial.

Parameters
: n (number of trials) and p (probability of success on a single trial).

Probability Mass Function (PMF)
: The probability of getting exactly k successes in n trials is given by:

P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}

where \binom{n}{k} = \frac{n!}{k!(n-k)!}
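The PMF above can be evaluated directly with `math.comb`; the coin-flip numbers are a standard illustration, not data from the notes.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 3 heads in 10 fair coin flips
prob = binomial_pmf(3, 10, 0.5)
print(f"P(X = 3) = {prob:.4f}")   # 0.1172

# Sanity check: the PMF sums to 1 over k = 0..n
assert abs(sum(binomial_pmf(k, 10, 0.5) for k in range(11)) - 1) < 1e-12
```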

7.3 Poisson Distribution

The Poisson distribution models the number of events occurring within a fixed interval of time or space, especially when these events are rare.

It is useful for predicting events where the number of possible outcomes is very large, but the average rate of occurrence is known and constant.

Parameter
: \lambda (lambda), which represents the average rate of event occurrences within the given interval.

Examples: Number of accidents on a highway per month, number of customer calls per hour at a call center, number of mutations in a DNA strand segment.

Probability Mass Function (PMF)
: The probability of observing k events in an interval with an average rate of \lambda is:

P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}
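The Poisson PMF is equally direct to compute; the accident rate below is an assumed figure matching the highway example above.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with rate lam."""
    return exp(-lam) * lam**k / factorial(k)

# Example (assumed rate): an average of 4 accidents per month;
# probability of observing exactly 2 accidents in a given month
prob = poisson_pmf(2, 4.0)
print(f"P(X = 2) = {prob:.4f}")   # 0.1465
```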

7.4 Normal Distribution

The normal distribution, also known as the Gaussian distribution, is one of the most common and important probability distributions in statistics.

It is characterized by its symmetric, bell-shaped curve, defined by two parameters: the mean (\mu) and the standard deviation (\sigma).

Properties:

It is perfectly symmetrical around its mean.

The mean, median, and mode are all equal and located at the center of the distribution.

The spread of the distribution is determined by the standard deviation (\sigma).

The empirical rule (68-95-99.7 rule) states that approximately 68% of data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
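The 68-95-99.7 figures in the empirical rule can be verified with the standard library's `statistics.NormalDist` (the rule holds for any \mu and \sigma, so the standard normal suffices):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

for k in (1, 2, 3):
    # P(mu - k*sigma < X < mu + k*sigma)
    coverage = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD of the mean: {coverage:.4f}")
```

This prints approximately 0.6827, 0.9545, and 0.9973, matching the rule.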

Applications
: Widely used to model continuous data, such as human intelligence scores, blood pressure readings, and measurement errors.

Its significance is particularly highlighted by the central limit theorem, which states that the distribution of sample means of a sufficiently large number of independent, identically distributed random variables will be approximately normal, regardless of the population's original distribution.

8. Sampling Distribution of the Mean
8.1 Overview

The sampling distribution of the mean is a probability distribution consisting of the means of all possible random samples of a given size (n) that can be drawn from a population.

Even if the original population distribution is not normal, the distribution of sample means tends to be normal, especially as the sample size increases. This fundamental concept is crucial for inferential statistics.

Properties of the Sampling Distribution of the Mean:

Mean (\mu_{\bar{x}})
: The mean of the sampling distribution of the mean is equal to the population mean (\mu).

\mu_{\bar{x}} = \mu

This means that sample means, on average, accurately estimate the true population mean.

Standard Deviation (Standard Error of the Mean, \sigma_{\bar{x}})
: The standard deviation of the sampling distribution of the mean is called the Standard Error of the Mean (SEM).

It measures the variability or precision of the sample means around the population mean.

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

Where:

\sigma is the population standard deviation.

n is the sample size.

As the sample size (n) increases, the standard error decreases, indicating that sample means become more clustered around the population mean – i.e., the estimates become more precise.

Shape
: The shape of the sampling distribution tends toward a normal distribution, regardless of the population's original distribution, as the sample size increases. This is a direct consequence of the Central Limit Theorem.
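The properties above can be checked by simulation. The sketch below draws repeated samples from a skewed (exponential) population with \sigma = 1 and compares the observed spread of the sample means with the predicted standard error \sigma/\sqrt{n}; the population choice and replication count are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Skewed population: exponential with mean 1 (its SD is also 1)
def sample_mean(n: int) -> float:
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

for n in (4, 25, 100):
    means = [sample_mean(n) for _ in range(2000)]
    observed_se = statistics.stdev(means)   # spread of the sample means
    predicted_se = 1 / n**0.5               # sigma / sqrt(n), sigma = 1
    print(f"n={n:3d}: observed SE={observed_se:.3f}, "
          f"predicted SE={predicted_se:.3f}")
```

As n grows, the observed standard error tracks \sigma/\sqrt{n} and shrinks, while the distribution of the means becomes more symmetric despite the skewed population.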

8.2 Central Limit Theorem

The theorem is significant because it simplifies calculations and enables predictions about sample means irrespective of the population distribution.

The Central Limit Theorem (CLT) is a cornerstone of statistical theory, stating that if you take sufficiently large random samples from a population with mean \mu and standard deviation \sigma, the distribution of the sample means (\bar{x}) will be approximately normally distributed, regardless of the shape of the original population distribution.

Key aspects of the CLT:

Sample Size
: The approximation to normality improves as the sample size (n) increases. A common rule of thumb is that if n \ge 30, the sampling distribution of the mean can be considered approximately normal.

Population Distribution
: The CLT holds true even if the original population data is non-normal (e.g., skewed, uniform, exponential). If the population itself is normally distributed, the sampling distribution of the mean will be exactly normal for any sample size.

Implications
: The CLT is vital because it allows us to use normal distribution theory (e.g., z-scores, normal probability tables) to make inferences about population parameters (like \mu) based on sample means, even when we don't know the population's distribution.

It simplifies hypothesis testing and the construction of confidence intervals for the population mean.

Standardized Sample Mean (Z-score for a sample mean)
: We can calculate a z-score for a sample mean to find its position relative to the mean of the sampling distribution, allowing us to determine probabilities associated with that mean.

Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}

Where:

\bar{x} is the sample mean.

\mu is the population mean.

\sigma is the population standard deviation.

n is the sample size.

This formula transforms a sample mean into a standard normal variable (with mean 0 and standard deviation 1), which can then be used with standard normal tables or software to find probabilities.

8.3 Applications

This section surveys real-world applications, illustrating the concepts with age-at-death distributions and cholesterol-level examples.

The principles of the sampling distribution of the mean and the Central Limit Theorem have wide-ranging applications in various fields:

Quality Control
: A company manufacturing light bulbs might sample a batch to estimate the average lifespan. Even if individual bulb lifespans are not normally distributed, the average lifespan of large samples will follow a normal distribution, enabling quality control checks.

Medical Research
: Researchers might study the effect of a new drug on blood pressure. By taking multiple samples and observing their average blood pressure, they can infer the drug's effect on the general population, even if individual responses vary widely. For example, if the population mean cholesterol level is known to be \mu with a standard deviation of \sigma, and a sample of n individuals has a mean cholesterol level of \bar{x}, we can use the z-score formula to determine how unusual this sample mean is, and thus infer if the sample likely comes from a different population (e.g., one with higher cholesterol).

Public Policy and Economics
: Governments use sample data (e.g., unemployment rates, average income) to make policy decisions. The CLT ensures that these sample statistics can be reliably used to generalize to the entire population.

Environmental Studies
: Monitoring pollution levels by taking multiple samples from water or air. The average pollutant concentration from these samples can be analyzed using CLT to draw conclusions about overall environmental health.

Example with Cholesterol Levels
: Suppose the population mean cholesterol level for a certain age group is 200 \text{ mg/dL} with a standard deviation of 20 \text{ mg/dL}. A researcher takes a random sample of n = 100 individuals and finds their mean cholesterol level to be 205 \text{ mg/dL}.

We want to know the probability of observing a sample mean this high or higher, assuming it comes from the original population. First, calculate the standard error of the mean:

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{100}} = \frac{20}{10} = 2

Now, calculate the z-score for the sample mean:

Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{205 - 200}{2} = \frac{5}{2} = 2.5

Using a standard normal distribution table or software, the probability of a Z-score being 2.5 or higher (P(Z \ge 2.5)) is approximately 0.0062. This low probability suggests that a sample mean of 205 \text{ mg/dL} is quite rare if the true population mean is 200 \text{ mg/dL}, potentially indicating the sample comes from a different population or there's been an error in assumption.
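The cholesterol calculation above can be reproduced with `statistics.NormalDist` in place of a standard normal table, using the same numbers (\mu = 200, \sigma = 20, n = 100, \bar{x} = 205):

```python
from statistics import NormalDist

mu, sigma, n = 200, 20, 100   # population values from the example
x_bar = 205                   # observed sample mean

sem = sigma / n**0.5          # standard error of the mean: 20 / 10 = 2
z = (x_bar - mu) / sem        # z-score: (205 - 200) / 2 = 2.5

# Upper-tail probability P(Z >= 2.5) under the standard normal
p_value = 1 - NormalDist().cdf(z)
print(f"SEM = {sem}, Z = {z}, P(Z >= {z}) = {p_value:.4f}")   # 0.0062
```

The computed tail probability agrees with the 0.0062 quoted above, supporting the conclusion that such a sample mean would be rare under the assumed population.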