Descriptive Statistics
small population
can collect all data
cannot use to make conclusions beyond the data
Inferential Statistics
population is large (cannot collect all population data)
can only collect sample of population
can use sample data to make inferences about a population
focus on making predictions/generalizations about a larger dataset based on a sample
Estimation
inference about 1 group
can be point estimate or interval estimate
can estimate a proportion or mean
Comparison
inference about 2 or more groups
Correlation
relationship between 2 variables
Point estimate
single value
ex. the sample mean used as the estimate of the population mean
Interval Estimate
defined by two numbers between which a population parameter is said to lie
Confidence interval
measure of how sure one can be
expressed as a percentage (most commonly 95%)
as confidence level (percentage) increases, the confidence interval widens
as sample size decreases, the confidence interval widens
represents confidence that the population parameter lies within the confidence interval
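A minimal sketch of an interval estimate for a mean, to illustrate the widening described above. It assumes a normal approximation (a z critical value); for small samples a t-based critical value is usually preferred. The function name and the measurements are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    m = mean(sample)                            # point estimate
    se = stdev(sample) / sqrt(n)                # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 when level is 0.95
    return m - z * se, m + z * se

sample = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 5.2]  # hypothetical measurements
lo95, hi95 = mean_confidence_interval(sample, 0.95)
lo99, hi99 = mean_confidence_interval(sample, 0.99)
print(f"95% CI: {lo95:.3f} to {hi95:.3f}")
print(f"99% CI: {lo99:.3f} to {hi99:.3f}")  # wider than the 95% interval
```

Because the standard error divides by the square root of n, a smaller sample with similar spread gives a wider interval.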
Prevalence
aka prevalence proportion
proportion of a population found to have a condition
includes ALL cases (new and pre-existing)
(number of subjects with disease)/(total population)
usually expressed as fraction, percentage, or number of cases per 10,000 or 100,000 people
Incidence rate
rate of new cases of a disease occurring in a specific population over a particular period of time
limited to NEW cases only
(number of NEW cases during a specified time period)/(person years at risk during the same time period)
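The two formulas above can be sketched as plain arithmetic. The counts below are invented purely for illustration, and the function names are hypothetical.

```python
def prevalence(all_cases, total_population):
    """Proportion of the population with the condition (new and pre-existing cases)."""
    return all_cases / total_population

def incidence_rate(new_cases, person_years_at_risk):
    """New cases per person-year at risk over the same time period."""
    return new_cases / person_years_at_risk

# Hypothetical numbers: 150 existing cases among 10,000 people
p = prevalence(150, 10_000)
print(f"prevalence: {p:.3f} ({p * 100_000:.0f} per 100,000)")

# Hypothetical numbers: 12 new cases over 4,000 person-years at risk
ir = incidence_rate(12, 4_000)
print(f"incidence rate: {ir:.4f} new cases per person-year")
```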
Nominal data
categorical, unranked data
ex. gender, eye color, surgical outcome, blood type
when only 2 possible categories: dichotomous, binary, binomial
Ordinal Data
variables with an inherent order to the relationship among the different categories
implied ordering of the categories with unknown quantitative distance
distances between the levels may not be the same
meaning of different levels may not be the same for different individuals
utilizes numbers to indicate rank/order, but numerical values do not hold mathematical significance
ex. stages of cancer, education level, pain level, satisfaction level, agreement level
Unpaired samples
two groups from different populations
sample size may be different
Paired samples
Same samples undergoing same treatments
Same sample size
Can be same people measured at different times or asked about same products
Steps for group comparison
Check data type
Check dependence (paired or unpaired)
Unpaired Nominal Data
Chi-squared test
all expected cell counts are 5 or greater (large sample size)
Fisher’s exact test
any expected cell count is below 5 (small sample size)
Paired Nominal Data
McNemar’s Test
Kappa statistics
measure of agreement
Unpaired ordinal data
Mann-Whitney U Test
aka Wilcoxon rank-sum test (Wilcoxon two sample test)
Paired ordinal data
Wilcoxon signed-rank test (for paired samples)
Unpaired continuous data
unpaired t-test
Paired continuous data
paired t-test
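The two-step choice (data type, then paired or unpaired) can be sketched as a lookup table. This only restates the pairings listed in these notes; the dictionary and function names are hypothetical.

```python
# Keys are (data type, paired?) and values are the test named in these notes.
TEST_BY_DATA = {
    ("nominal", False): "Chi-squared test (Fisher's exact test if any count is below 5)",
    ("nominal", True): "McNemar's test",  # Kappa statistics are used for agreement
    ("ordinal", False): "Mann-Whitney U test",
    ("ordinal", True): "Wilcoxon signed-rank test",
    ("continuous", False): "Unpaired t-test",
    ("continuous", True): "Paired t-test",
}

def choose_test(data_type, paired):
    """Step 1: check data type. Step 2: check dependence (paired or unpaired)."""
    return TEST_BY_DATA[(data_type.lower(), paired)]

print(choose_test("ordinal", paired=False))    # Mann-Whitney U test
print(choose_test("continuous", paired=True))  # Paired t-test
```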
Contingency table
table of observed data for categorical data
Expected table
table of expected data for categorical data (if no difference between groups)
for first row, first column = (first row margin)(first column margin)/(total)
has same margin and grand total values as contingency table
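A minimal sketch of building the expected table from the margins of a contingency table. The observed counts are made up, and the chi-squared statistic (sum of (observed - expected)^2 / expected, without continuity correction) is added here for illustration; it is not spelled out in these notes.

```python
def expected_table(observed):
    """Expected count for each cell = (row margin)(column margin)/(grand total)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

observed = [[20, 30],
            [10, 40]]  # hypothetical 2x2 contingency table
expected = expected_table(observed)
print(expected)  # [[15.0, 35.0], [15.0, 35.0]]

# Chi-squared statistic: compare observed and expected counts cell by cell
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Small counts (below 5) point toward Fisher's exact test instead
small_counts = any(e < 5 for row in expected for e in row)
print(round(chi2, 3), small_counts)
```

The row and column sums of the expected table match those of the observed table, as noted above.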
Odds ratio
odds: P/(1-P)
odds ratio: odds/odds
cross-product method: ad/bc
OR = 1 means no association between outcome and exposure
OR >1 means exposure associated with increased risk for outcome
harmful effect
OR <1 means exposure is associated with reduced risk for outcome
protective effect
consider confidence interval (if it contains 1, not statistically significant)
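A sketch of the cross-product method with a confidence interval. The interval uses the common log-based (Woolf) approximation, which is an assumption added here and not taken from these notes; the counts are invented. The cell layout assumed is a = exposed with outcome, b = exposed without, c = unexposed with, d = unexposed without.

```python
from math import exp, log, sqrt

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio ad/bc with an approximate 95% confidence interval (z = 1.96)."""
    or_ = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    lower = exp(log(or_) - z * se_log)
    upper = exp(log(or_) + z * se_log)
    return or_, lower, upper

or_, lower, upper = odds_ratio(20, 30, 10, 40)  # hypothetical counts
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")

# If the interval contains 1, the association is not statistically significant
significant = not (lower <= 1 <= upper)
print("significant" if significant else "not significant")
```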
Accuracy
number of correct diagnoses (true positives + true negatives) divided by the total number of subjects tested
Sensitivity
used for paired nominal data
measures of performance of binary classification test
true positive rate
measures proportion of actual positives which are correctly identified
how good a test is at finding actual positives
complementary to false negative rate
used for diagnosis
(actual positives identified)/(actual positives)
Specificity
true negative rate
measures performance of binary classification test
proportion of negatives which are correctly identified
complementary to false positive rate
used for diagnosis
(actual negatives identified)/(actual negatives)
Positive predictive value (PPV)
(number of true positives)/(number of positive calls)
number of positive calls = number of true positives + number of false positives
the chance that a person with a positive test truly has the disease
used for patient knowledge
Negative predictive value (NPV)
probability that a subject with a negative screening test really does not have the disease
used for patient knowledge
(number of true negatives)/(number of negative calls)
number of negative calls = number of true negatives + number of false negatives
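The five measures above (accuracy, sensitivity, specificity, PPV, NPV) can all be computed from the four cells of a 2x2 table of test result against true status. A minimal sketch with made-up counts and a hypothetical function name:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Performance measures for a binary test from true/false positives/negatives."""
    return {
        "sensitivity": tp / (tp + fn),          # actual positives identified / actual positives
        "specificity": tn / (tn + fp),          # actual negatives identified / actual negatives
        "ppv": tp / (tp + fp),                  # true positives / positive calls
        "npv": tn / (tn + fn),                  # true negatives / negative calls
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical screening results for 1,000 subjects
metrics = diagnostic_metrics(tp=90, fp=30, fn=10, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the test itself, while PPV and NPV also depend on how common the disease is in the group tested.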
Kappa statistics
statistical measure of inter-rater agreement
agreement: both raters have same outcome
for paired nominal data
Kappa statistic strengths of agreement
Poor <0.2
Fair 0.21-0.4
Moderate 0.41-0.60
Good 0.61-0.8
Very Good 0.81-1
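A sketch of Cohen's kappa for two raters with a yes/no outcome, followed by the strength-of-agreement labels listed above. The counts are invented, the function names are hypothetical, and how values falling exactly on a boundary are labeled is an assumption.

```python
def cohens_kappa(both_yes, r1_yes_r2_no, r1_no_r2_yes, both_no):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = both_yes + r1_yes_r2_no + r1_no_r2_yes + both_no
    p_observed = (both_yes + both_no) / n
    r1_yes = both_yes + r1_yes_r2_no
    r2_yes = both_yes + r1_no_r2_yes
    # Agreement expected by chance, from each rater's marginal totals
    p_chance = (r1_yes * r2_yes + (n - r1_yes) * (n - r2_yes)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

def agreement_label(kappa):
    """Labels from the table above; boundary handling is an assumption."""
    if kappa < 0.2:
        return "Poor"
    if kappa <= 0.4:
        return "Fair"
    if kappa <= 0.6:
        return "Moderate"
    if kappa <= 0.8:
        return "Good"
    return "Very Good"

kappa = cohens_kappa(40, 10, 5, 45)  # hypothetical ratings of 100 subjects
print(round(kappa, 2), agreement_label(kappa))
```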
T-tests
assess whether the means of two groups are statistically different from each other
Non-parametric tests
distribution free test
does not assume anything about the underlying distribution
ex. Chi-squared test, Fisher’s exact test, McNemar’s test, Mann-Whitney U Test, and Wilcoxon signed-rank test
Parametric test
makes assumptions about a population’s parameters
usually means tests like t-test or ANOVA
assume the population data has a normal distribution
Tests that check normal distribution
QQ plot
Shapiro Wilk Test
QQ plot
quantile-quantile plot
shows distribution of the data against the expected normal distribution
for normally distributed data, observations should lie approximately on a straight line
possible outliers are points at the ends of the line
Shapiro Wilk Test
test of normality in frequentist statistics
null hypothesis: population is normally distributed
if P < 0.05, not normally distributed
nonparametric test should be used
if P > 0.05, no evidence against normality (data treated as normally distributed)
t test can be used
best power for a given significance
Unpaired t-test
two sample t test
applied to 2 independent groups (different people in 2 different groups)
sample size may be unequal in each group
Paired t-test
equivalent to a one sample t test on the within-pair differences
measures whether means from a within subjects test group vary over 2 test conditions (same people in same group)
equal sample size
takes into account the fact that pairs of subjects go together
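A paired t-test can be computed as a one-sample t-test on the within-pair differences. The sketch below computes only the t statistic and degrees of freedom with the standard library; obtaining a P value would need a t distribution (for example from a statistics package), which is left out here. The measurements and function name are made up.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(first, second):
    """t statistic and degrees of freedom for paired measurements."""
    if len(first) != len(second):
        raise ValueError("paired samples must have equal size")
    diffs = [a - b for a, b in zip(first, second)]  # one difference per subject
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical data: the same 5 people measured under 2 conditions
before = [120, 132, 118, 140, 125]
after = [115, 128, 119, 133, 121]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f}, df = {df}")
```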
One-tailed t-test
first mean expected to be larger than the second or first mean expected to be smaller than the second
expect the effect to be in a certain direction
2-tailed t-test
first mean expected to be different from the second in EITHER direction
used when looking for any difference between samples
Test of equal variance (F-Test)
used to test if the variances of 2 populations are equal
ratio of the variances (squared standard deviations) of the two groups
if variances are equal, F = 1
P>0.05
use unpaired t test
the more ratio deviates from 1, the stronger the evidence for unequal population variances
P<0.05
use Welch’s unpaired t-test
used for unpaired data
Excel: FTEST (array1,array2)
returns 2 tailed probability that the variances in array1 and array2 are not significantly different
should check normality before using
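A sketch of the F statistic as a ratio of sample variances, plus the Welch t statistic used when the variances look unequal. Only the statistics are computed with the standard library; the P values (as returned by FTEST in Excel) would need F and t distributions, which are not implemented here. The Welch-Satterthwaite degrees-of-freedom formula is added for illustration, and the data and function names are made up.

```python
from math import sqrt
from statistics import mean, variance

def f_statistic(x, y):
    """Ratio of the two sample variances; close to 1 when variances are similar."""
    return variance(x) / variance(y)

def welch_t(x, y):
    """Welch's unpaired t statistic with Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical unpaired groups with unequal sizes and visibly different spread
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]
group_b = [4.2, 6.9, 3.8, 7.4, 5.0]

f = f_statistic(group_a, group_b)
t, df = welch_t(group_a, group_b)
print(f"F = {f:.3f}")                   # far from 1 suggests unequal variances
print(f"Welch t = {t:.3f}, df = {df:.2f}")
```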