Disease Detective Stats

Basics

This is a crash course on the fundamentals of statistics. It is not a replacement for reading (and understanding) the SOINC guide on statistics in this event or, better yet, taking a class or reading a textbook on statistics.

Populations and Samples

The population is the entire set under study, for example, all dung beetles when studying dung beetle length. Because it is impossible to measure the length of every single dung beetle on planet Earth, statisticians use sampling. They take a subset of the dung beetles, called a sample, and use measurements from the sample to make inferences about the population as a whole. A population parameter is a characteristic of a population; for example, suppose 84% of Philadelphians prefer chocolate ice cream over vanilla ice cream. A sample statistic is an attribute of a sample; for example, suppose we randomly sampled 10 Philadelphians and found that 70% preferred chocolate ice cream over vanilla ice cream.

Distribution Characteristics

Distributions are characterized by center, shape, and spread.

Central Tendency

A measure of central tendency is a "typical" or "middle" value for a distribution.

Mean - Average of all of the values: the sum of the values divided by the number of values $n$. $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$ Means should not be used if the distribution is very skewed, as means are easily affected by extreme values.

Median - The middle value that separates the data into two halves. Medians are not as affected by extreme values, e.g. the mean number of arms per person in the world is less than 2, but the median is exactly 2.

Mode - The most frequently occurring value in the data set. Modes are useful for describing "peaks" in a distribution.
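As a rough illustration of the three measures, the sketch below uses Python's standard `statistics` module on a made-up data set (ten people, one of whom has one arm; the numbers are hypothetical and only mirror the arms example above).

```python
from statistics import mean, median, mode

# Hypothetical data: number of arms for ten people, one of whom has one arm.
arms = [2, 2, 2, 2, 2, 2, 2, 2, 2, 1]

arms_mean = mean(arms)      # pulled below 2 by the single low value
arms_median = median(arms)  # middle value, not affected by the low value
arms_mode = mode(arms)      # most frequently occurring value

print(arms_mean, arms_median, arms_mode)
```

The single extreme value moves the mean but leaves the median and mode at 2.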

Shape

Skewness - Distributions that have a few extreme values on the higher side are skewed to the right. Distributions that have a few extreme values on the lower side are skewed to the left.

Peaks - If a distribution has no peaks, it is uniform. If it has one peak, it is unimodal. If it has two peaks, it is bimodal.

Normal distributions - A set of data that is unimodal, symmetrical, and continues off to infinity on both tails. Also known as a Gaussian distribution. In the normal distribution, the mean, median, and mode are all the same. Technically, the normal distribution is continuous and infinite but can be approximated with discrete values.

Variability

Variability, scatter, and spread all have the same meaning: the extent to which a set of data is dispersed.

Range - The difference between the largest and smallest values in a set. It is not very useful except to get a sense of the possible spread of a distribution.

Interquartile Range (IQR) - The difference between the 75th percentile (third quartile, or $Q_3$) and the 25th percentile (first quartile, or $Q_1$) of a data set. To find $Q_1$ and $Q_3$, find the median of the data set, then divide the data set into two new sets, one with the data from the minimum up to the median and the other with the data from the median up to the maximum. The medians of the lower and upper sets are $Q_1$ and $Q_3$, respectively. The IQR is used with the median and is the most robust measure of variability, i.e. outliers do not affect the IQR as much. $\mathrm{IQR} = Q_3 - Q_1$

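Variance - Average of the squared differences from the mean (for a sample, the sum of squared differences is divided by $n - 1$ rather than $n$). The variance gives a rough sense of how far the values in a data set are from the mean, but in squared units. $\mathrm{Var}(x) = \frac{\sum (x - \bar{x})^2}{n - 1}$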

Standard Deviation (SD) - The square root of the variance. Quantifies the spread in a data set in the same units as the original data. Standard deviation is, in a sense, the average distance away from the mean. A low SD indicates that the data tends to be close to the mean and a high SD indicates that the data is spread far from the mean. SD and variance are used with the mean. Unlike IQR, SD is not resistant to outliers.

$\mathrm{SD}(x) = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

68-95-99.7 Rule - This rule states that about 68% of the values in a normally distributed data set fall within 1 SD of the mean, about 95% fall within 2 SD of the mean, and about 99.7% fall within 3 SD of the mean.
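The percentages in this rule can be checked numerically. The sketch below assumes Python 3.8 or later (for `statistics.NormalDist`); the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The three printed fractions round to the 68%, 95%, and 99.7% quoted in the rule.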

Example: Let a data set consist of integers 1 through 10, which sum to 55. The median and mean are 5.5.

To find the IQR, we can divide the data into two sets, one from 1 through 5 and the other from 6 through 10 inclusive. We find the median for each of these sets (3 and 8) and then subtract them. Thus, the IQR is 5.

To find the SD, we need to calculate the difference of each data value from the mean. Then we square the differences, add them, divide the sum by the sample size minus 1 ($n = 10$, $n - 1 = 9$), and take the square root of the result.

$\mathrm{SD}(x) = \sqrt{\frac{(1 - 5.5)^2 + (2 - 5.5)^2 + \dots + (10 - 5.5)^2}{9}} = \sqrt{\frac{82.5}{9}} \approx 3.03$

Note that this data set is uniform (each value occurs with the same frequency), so the 68-95-99.7 rule for normal distributions does not apply. If it did apply, 68% of the data would fall in the interval $(5.5 - 3.03, 5.5 + 3.03) = (2.47, 8.53)$.
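The worked example can be reproduced with a short script. This is a minimal sketch using Python's standard `statistics` module; the median-of-halves split shown here assumes the data are already sorted and contain an even number of values, as in the example.

```python
from statistics import mean, median, stdev

data = list(range(1, 11))  # the integers 1 through 10, already sorted

# Split into lower and upper halves (even number of values assumed).
lower_half = data[: len(data) // 2]        # 1 through 5
upper_half = data[(len(data) + 1) // 2 :]  # 6 through 10

q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1

sd = stdev(data)  # sample SD: divides the sum of squares by n - 1

print("mean:", mean(data), "median:", median(data))
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("SD:", round(sd, 2))
```

Note that `stdev` divides by $n - 1$, while `statistics.pstdev` divides by $n$ and gives a slightly smaller value for the same data.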

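Standard Error of the Mean (SEM) - The SEM measures the variability of the means of different samples around the population mean.

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation and $n$ is the sample size.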

Therefore, as a general rule, the SEM decreases as sample size increases.
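To see the effect of sample size, here is a small sketch. It assumes the population SD is unknown and is estimated by the sample SD, which is a common substitution but an assumption of this example; the function name and data are hypothetical.

```python
from math import sqrt
from statistics import stdev

def standard_error(sample):
    """Estimate the SEM as s / sqrt(n), using the sample SD s in place of sigma."""
    return stdev(sample) / sqrt(len(sample))

small_sample = list(range(1, 11))  # 10 hypothetical values
large_sample = small_sample * 2    # the same values repeated: 20 values

sem_small = standard_error(small_sample)
sem_large = standard_error(large_sample)

print(round(sem_small, 3), round(sem_large, 3))
```

With a similar spread but twice as many values, the estimated SEM is smaller.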

Correlation

When two variables are revealed to have a relationship using statistical measures, the variables have a correlation. This correlation can be positive, negative, or zero. Without doing an experiment or trial, it is impossible to conclude that one variable causes another variable to act in some way. There is always the possibility of a third lurking or confounding variable that the original data does not account for. In this case, wording is extremely important. Correlation ≠ causation.

The correlation coefficient $r$ is a measure of the scatter around a linear relationship. It does NOT apply when a relationship is non-linear. Because the correlation coefficient is difficult to calculate by hand, exam writers will typically give the value and ask for an interpretation of the $r$ value. The correlation coefficient always satisfies $-1 \le r \le 1$. A value of 1 indicates a perfectly positive linear relationship, and a value of $-1$ indicates a perfectly negative linear relationship. Conversely, a value of 0 indicates no linear relationship. Typically, $|r| > 0.9$ is termed strong.
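For reference, the sketch below computes the Pearson correlation coefficient from its usual definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name and the data sets are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfectly increasing line
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly decreasing line

# A perfect but non-linear (quadratic) relationship: y = x^2
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_curve)
```

The quadratic case shows why $r$ should not be used for non-linear relationships: the variables are perfectly related, yet $r$ is 0.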

Standardization

The standard score or z-score rescales a normally distributed data set so that its mean is 0 and its standard deviation is 1. Thus, we can model all normally distributed data using a single normal distribution with mean 0 and SD 1.

$z = \frac{x - \mu}{\sigma}$ or $z = \frac{x - \bar{x}}{s}$

The first formula is for a population while the second is for a sample. $\sigma$ represents the population standard deviation while $\mu$ represents the population mean; $s$ and $\bar{x}$ are the sample standard deviation and sample mean.
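A z-score is a one-line calculation. In the sketch below, the mean of 100 and SD of 15 are hypothetical numbers chosen only to make the arithmetic easy.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical distribution with mean 100 and SD 15.
z_high = z_score(130, 100, 15)  # two SDs above the mean
z_low = z_score(85, 100, 15)    # one SD below the mean

print(z_high, z_low)
```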

Infant Mortality Rate

The infant mortality rate is the ratio of infant deaths (deaths before one year of age) to live births in the same period.

Rates in epidemiology are often expressed per 1,000 or per 1 million, so if the infant mortality rate were 0.05, we could write that as 50 deaths per 1,000 births.
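The conversion to a per-1,000 rate can be sketched as follows. The function name and the counts are hypothetical; the counts are chosen to match the 0.05 example above.

```python
def rate_per(events, population, per=1000):
    """Express events / population as a rate per `per` individuals."""
    return events * per / population

# Hypothetical counts: 50 infant deaths among 1,000 live births.
imr_per_1000 = rate_per(50, 1000)
imr_per_million = rate_per(50, 1000, per=1_000_000)

print(imr_per_1000, imr_per_million)
```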

Inference

Statistical inference is the process of inferring something about a population given a sample.

Confidence Intervals

Confidence intervals are used to estimate population parameters given statistics from a sample. However, confidence intervals do not take into account confounding or biases. The confidence level determines how wide the interval is: a higher confidence level gives a wider interval. A common confidence level is 95%: "I am 95% confident that the interval captures the true population proportion/mean." This means that if the process used to obtain the interval were repeated many, many times, the intervals generated would capture the true population proportion/mean 95% of the time.

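Confidence Intervals for Proportions - Used to define a range of values within which a proportion may lie.

$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$, where $\hat{p}$ is the sample proportion and $n$ is the sample size.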

A z table contains common values for z-star.
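As a sketch of the proportion interval, the code below takes z-star from `statistics.NormalDist` (Python 3.8 or later) instead of a printed z table. The sample of 100 people with 70% preferring chocolate is hypothetical; a sample as small as 10 would generally be too small for this normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Confidence interval for a proportion: p_hat +/- z* sqrt(p_hat (1 - p_hat) / n)."""
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 70% of 100 people prefer chocolate.
low, high = proportion_ci(0.70, 100)
print(round(low, 3), round(high, 3))
```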

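Confidence Intervals of Means - Used to define a range of values within which a mean may lie.

$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size.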

A t table contains common values for t-star. Note that you need the number of degrees of freedom (df) to find t-star. Generally, $df = n - 1$.
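A minimal sketch of the interval for a mean is shown below. Python's standard library has no t distribution, so t-star is passed in as a value read from a t table; 2.262 is the tabulated value for 95% confidence with 9 degrees of freedom. The data (the integers 1 through 10) and the function name are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci(sample, t_star):
    """Confidence interval for a mean: x_bar +/- t* s / sqrt(n)."""
    n = len(sample)
    margin = t_star * stdev(sample) / sqrt(n)
    center = mean(sample)
    return center - margin, center + margin

data = list(range(1, 11))  # n = 10, so df = n - 1 = 9
T_STAR_95_DF9 = 2.262      # read from a t table (95% confidence, df = 9)

ci_low, ci_high = mean_ci(data, T_STAR_95_DF9)
print(round(ci_low, 2), round(ci_high, 2))
```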

Inference Tests

In an inference test, we use statistical inference to determine if a statement is likely or unlikely. We first create a null hypothesis ("the default"). For example, suppose that you were investigating whether drinking the punch at the party is associated with developing salmonellosis symptoms. The null hypothesis would be that drinking the punch is not associated with developing salmonellosis symptoms. The alternative hypothesis would be that drinking the punch is associated with developing salmonellosis symptoms. You would then look at your sample (people who were at the party and did/did not drink the punch and did/did not develop salmonellosis symptoms) and ask: how likely is it that a result at least this extreme would occur by chance, i.e. if the null hypothesis were true? This probability is called the p-value. Statisticians generally use a threshold (significance level) of 0.05. If the p-value is below 0.05, the result is statistically significant, and you reject the null hypothesis. Otherwise, you fail to reject the null hypothesis.

Error

A Type I error occurs if you reject $H_0$ (the null hypothesis) when $H_0$ is true. The probability of a Type I error is $\alpha$, the significance level.

A Type II error occurs if you fail to reject $H_0$ when $H_0$ is false ($H_a$ is true). The probability of a Type II error is represented by the letter $\beta$.

The power of the test is the probability that the null hypothesis is rejected if $H_0$ is false. The power of the test is equal to $1 - \beta$.

Advanced

Sensitivity and Specificity

Sensitivity and specificity describe how well a test identifies people who do or do not have a specific disease: the chance of a correct test result given that you do or do not have the disease.

	Has disease	Has no disease
People who test positive	a	b
People who test negative	c	d

Sensitivity is the chance of testing positive if you do have the disease. The equation to use for sensitivity is: $\frac{a}{a + c}$

Specificity is the chance of testing negative if you do not have the disease. The equation to use for specificity is: $\frac{d}{b + d}$
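Using the cell labels a, b, c, and d from the table above, the two measures can be sketched as below. The counts are hypothetical.

```python
def sensitivity(a, c):
    """Share of people with the disease who test positive: a / (a + c)."""
    return a / (a + c)

def specificity(b, d):
    """Share of people without the disease who test negative: d / (b + d)."""
    return d / (b + d)

# Hypothetical 2x2 table:
#                 Has disease   Has no disease
# Test positive       a = 90        b = 20
# Test negative       c = 10        d = 180
sens = sensitivity(a=90, c=10)
spec = specificity(b=20, d=180)

print(sens, spec)
```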

Chi Square

A chi-square test is a statistical test used to determine whether observed values differ from expected values. In epidemiology, it can be used to compare information from different groups (e.g. age groups) to a local or national average. $\chi^2 = \sum \frac{(O - E)^2}{E}$, where $O$ is an observed count and $E$ is the corresponding expected count.
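The chi-square statistic is a sum over categories, as in the sketch below. The observed and expected counts are hypothetical; the resulting statistic would still need to be compared against a chi-square table with the appropriate degrees of freedom.

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical case counts in three age groups, compared with the counts
# expected from a national average.
observed = [30, 50, 20]
expected = [25, 50, 25]

chi2 = chi_square(observed, expected)
print(chi2)
```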

Z-Test

Used to compare two means when the population variances are known and the sample size is greater than 30. By the Central Limit Theorem, the sampling distribution of the mean is approximately normal for large samples, and when $n > 30$ the Student's t distribution is sufficiently similar to the Z (standard normal) distribution that we are able to use the Z distribution to compute the test.

We compute the test statistic using $Z = \frac{\hat{\theta} - b}{\mathrm{SE}(\hat{\theta})}$, where $\hat{\theta}$ is our estimate, $\mathrm{SE}(\hat{\theta})$ is its standard error, and $b$ is the value we are testing our estimate against. The test statistic is then compared to a critical value from a table of Z-scores, which you can find online. The critical value is found using a predetermined "alpha" or level of significance, which is typically 5%. If the absolute value of the test statistic is larger than the critical value, then you reject the null hypothesis.
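The comparison against the critical value can be sketched as follows. This assumes a two-sided test and takes the critical value from `statistics.NormalDist` (Python 3.8 or later) instead of a printed table; the function name, the estimate, and the standard error are hypothetical.

```python
from statistics import NormalDist

def z_test(estimate, null_value, standard_error, alpha=0.05):
    """Two-sided Z test: returns (z, critical value, p-value, reject?)."""
    z = (estimate - null_value) / standard_error
    # Two-sided test: alpha is split evenly between the two tails.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical

# Hypothetical numbers: estimate 52, tested against 50, standard error 0.8.
z, critical, p_value, reject = z_test(52, 50, 0.8)
print(round(z, 2), round(critical, 2), round(p_value, 4), reject)
```

Comparing $|Z|$ to the critical value and comparing the p-value to alpha are equivalent ways of reaching the same decision.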

T-Test

Used to compare two means when the sample size is less than 30. The t test statistic is computed the same way as the Z test statistic; however, it is compared to a table for the t distribution, using the appropriate degrees of freedom.

Paired T-Test

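Used to compare the means of two related (paired) sets of measurements, such as measurements taken on the same subjects before and after an exposure or treatment. The test is performed on the differences within each pair.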
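As a rough sketch, a paired t statistic is commonly computed from the within-pair differences as the mean difference divided by its standard error, with $n - 1$ degrees of freedom. The before/after measurements below are hypothetical, and the critical value 2.776 is the t table value for a two-sided test at the 5% level with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic: mean difference / (SD of differences / sqrt(n))."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical measurements on the same five subjects.
before = [120, 135, 128, 140, 132]
after = [115, 130, 129, 134, 127]

t_stat = paired_t(before, after)
T_CRITICAL_DF4 = 2.776  # from a t table: two-sided, alpha = 0.05, df = 4

print(round(t_stat, 3), abs(t_stat) > T_CRITICAL_DF4)
```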

Fisher's Exact Test

Fisher's exact test searches for non-random associations between two categorical variables. It is typically used in place of a chi-square test when sample sizes are small.

McNemar's Test

McNemar's test is similar to a chi-square test, except that it uses matched-pair data.

Mantel-Haenszel Test

The Cochran-Mantel-Haenszel test aims to find the association between variables while controlling for confounding.

ANOVA

The analysis of variance test, or ANOVA, is a statistical test used to compare the means of two or more groups by analyzing the variance within and between the groups.