Descriptive statistics
methods of organizing and summarizing data
Inferential statistics
making generalizations from a sample to the population
population
an entire collection of individuals or objects
Sample
a subset of the population selected for study
Variable
any characteristic whose value changes
Data
observations on one or more variables
Categorical variables
Qualitative
values that are category labels rather than numbers
Numerical variables
Quantitative
measurements or observations of numerical data
Discrete data
listable sets (counts)
Continuous data
any value over an interval of values (measurements)
Univariate
one variable
Bivariate
two variables
multivariate
many variables
Symmetrical distribution
Data on which both sides are fairly the same shape and size
"Bell curve"
Uniform
every class has an equal frequency (number)
"Rectangle"
Skewed
One side (tail) is longer than the other side
The skewness is in the direction that the tail points
Bimodal
two classes have large frequencies, separated by a class with smaller frequency between them
"double hump camel"
How to describe numerical graphs
S: shape - overall type (symmetrical, skewed, uniform, bimodal)
O: outliers - gaps, clusters, etc.
C: center - middle of the data (mean, median, mode)
S: spread - refers to variability (range, standard deviation, IQR)
C: context - EVERYTHING must use CONTEXT
Parameter
a value of a population (typically unknown)
Statistic
a value calculated from a sample; used to estimate a population parameter
Measures of center
mean, median, mode
Median
the middle point of the data (50th percentile) when the data is in numerical order.
*if there are two middle values, average them
Mean
μ for population (parameter)
x̄ for sample (statistic)
Mode
occurs the most in the data
Can be more than one, or none if all data points occur once
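All three centers can be sketched with Python's built-in statistics module; the data below are made-up observations:

```python
from statistics import mean, median, multimode

data = [2, 3, 3, 5, 7, 10]  # hypothetical observations, already in order

print(mean(data))       # 30/6 = 5.0
print(median(data))     # even count, so average the two middle values: (3+5)/2 = 4.0
print(multimode(data))  # [3] -- multimode returns every value tied for most frequent
```

`multimode` is used instead of `mode` because, as noted above, a data set can have more than one mode.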
Variability
allows statisticians to distinguish between usual and unusual occurrences
Measures of spread
range, IQR, standard deviation
Range
a single value
max-min
IQR
interquartile range
Q3-Q1
Standard deviation
σ for population (parameter)
s for sample (statistic) - divided by n-1 (degrees of freedom)
measures the typical or average deviations from the mean
Variance
standard deviation squared
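All three spread measures can be sketched with the statistics module on made-up data (note that quartile conventions vary between textbooks; this uses `quantiles`' default method):

```python
from statistics import quantiles, stdev, variance

data = [4, 7, 8, 10, 12, 15, 21]  # hypothetical ordered observations

data_range = max(data) - min(data)  # 21 - 4 = 17
q1, q2, q3 = quantiles(data, n=4)   # quartiles (module's default method)
iqr = q3 - q1                       # 15 - 7 = 8
s = stdev(data)                     # sample sd: divides by n - 1
print(data_range, iqr, s, variance(data))  # variance is the sd squared
```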
Resistant
NOT affected by outliers
Median, IQR
Non-resistant
affected by outliers
mean, range, variance, standard deviation, correlation coefficient, LSRL, coefficient of determination
comparison of mean & median based on symmetrical graph
equal
comparison of mean & median based on skewed right graph
mean is larger
comparison of mean & median based on skewed left graph
mean is smaller
Trimmed mean
remove a set % of observations from the top and bottom of the ordered data
can possibly eliminate outliers
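A minimal sketch of the idea; `trimmed_mean` is a hypothetical helper written for this example, not a standard library function:

```python
def trimmed_mean(data, percent):
    """Hypothetical helper: drop `percent`% of observations from each end."""
    ordered = sorted(data)
    k = int(len(ordered) * percent / 100)  # how many to drop per side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

data = [1, 5, 6, 7, 8, 9, 10, 11, 12, 90]  # 1 and 90 look like outliers
print(sum(data) / len(data))   # plain mean: 15.9, pulled up by the 90
print(trimmed_mean(data, 10))  # 10% trimmed mean: 8.5, outliers removed
```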
linear transformations of mean and st dev
mean changed by addition, subtraction, multiplication, division
standard deviation changed only by multiplication & division (adding a constant shifts the data but not its spread)
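These transformation rules can be checked numerically on a made-up sample; shifting moves the mean but leaves the standard deviation alone, while scaling changes both:

```python
from statistics import mean, stdev

data = [2, 4, 6, 8]  # made-up sample

shifted = [x + 10 for x in data]  # add a constant to every value
scaled = [3 * x for x in data]    # multiply every value by a constant

print(mean(shifted), stdev(shifted))  # mean moves up by 10, sd unchanged
print(mean(scaled), stdev(scaled))    # mean and sd both triple
```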
Combination of two or more random variables
Just add or subtract means
Add variances and square root for st dev - X and Y MUST be INDEPENDENT
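A small numeric sketch with assumed values for two independent random variables X and Y:

```python
from math import sqrt

# Assumed values for two INDEPENDENT random variables X and Y.
mu_x, sigma_x = 50.0, 3.0
mu_y, sigma_y = 20.0, 4.0

mu_sum = mu_x + mu_y    # means add: 70.0
mu_diff = mu_x - mu_y   # means subtract: 30.0

# Variances add for BOTH the sum and the difference (never subtract variances):
sigma_sum = sqrt(sigma_x**2 + sigma_y**2)   # sqrt(9 + 16) = 5.0
sigma_diff = sqrt(sigma_x**2 + sigma_y**2)  # same value: 5.0
```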
Z-Score
a standardized score
tells you how many standard deviations from the mean an observation is
standardizing a normal distribution produces the standard normal curve of z-scores, with μ=0 & σ=1
Z=(x-μ)/σ
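The formula as a one-line function, applied to a hypothetical exam with μ = 80 and σ = 5:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical example: exam scores with mean 80 and sd 5.
print(z_score(90, 80, 5))    # 2.0 -> two sd above the mean
print(z_score(72.5, 80, 5))  # -1.5 -> one and a half sd below the mean
```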
Normal curve
a bell shaped and symmetrical curve
as σ increases the curve flattens and spreads out
as σ decreases the curve becomes taller and narrower
Empirical rule
normal curves
68% of population is between -1σ and 1σ
95% of population is between -2σ and 2σ
99.7% of population is between -3σ and 3σ
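The 68-95-99.7 percentages can be recovered from the standard normal CDF using `statistics.NormalDist`:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# P(-kσ < X < kσ) for k = 1, 2, 3 reproduces the empirical rule:
for k in (1, 2, 3):
    p = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sd: {p:.4f}")
```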
Boxplots
5 number summary
min, Q1, median, Q3, max
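The five-number summary as a small helper on made-up data (quartile conventions vary; this uses `quantiles`' default method):

```python
from statistics import quantiles

def five_number_summary(data):
    """min, Q1, median, Q3, max -- quartiles via the module's default method."""
    q1, q2, q3 = quantiles(data, n=4)
    return min(data), q1, q2, q3, max(data)

print(five_number_summary([1, 3, 5, 7, 9, 11, 13]))  # (1, 3.0, 7.0, 11.0, 13)
```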
Probability rules - Sample space
collection of all outcomes
probability of sample space is 1
Probability rules - Event
any subset of outcomes from the sample space
Probability rules - Complement
all outcomes not in event
P(event) + P(complement) = 1, so P(complement) = 1 - P(event)
Probability rules - Union
A or B
all outcomes in either circle (A, B, or both)
P(A) + P(B) - P(A&B)
Probability rules - Intersection
A and B
outcomes in the overlap of A and B (both occur)
P(A) x P(B) if A&B are independent
Probability rules - Mutually exclusive (disjoint)
A and B have no intersection - CANNOT happen at same time
Probability rules - Independent
if knowing one event does not change the probability of the other
Probability rules - Experimental probability
the number of successes from an experiment divided by the total number of trials
Probability rules - Law of large numbers
as an experiment is repeated the experimental probability gets closer and closer to the true probability
the difference between the two probabilities will approach 0
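The law of large numbers can be illustrated with a coin-flip simulation; the seed and trial counts below are arbitrary choices for the sketch:

```python
import random

random.seed(7)  # fixed seed so this sketch is reproducible

def experimental_probability(trials):
    """Fraction of heads observed in `trials` fair-coin flips."""
    heads = sum(random.random() < 0.5 for _ in range(trials))
    return heads / trials

# The gap |experimental - true| tends to shrink as trials grow:
for n in (100, 10_000, 1_000_000):
    print(n, abs(experimental_probability(n) - 0.5))
```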
All probability values must add up to
1
P (at least 1)
1 - P(none)
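A worked instance of the complement shortcut, using a hypothetical question about die rolls:

```python
# Hypothetical example: P(at least one six in four rolls of a fair die).
p_none = (5 / 6) ** 4        # no six on a single roll, four independent times
p_at_least_one = 1 - p_none  # complement of "no sixes at all"
print(round(p_at_least_one, 4))  # 0.5177
```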
conditional probability
takes into account a certain condition
A given B
(P(A&B)) / P(B) = P(both) / P(given)
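The formula with made-up numbers, P(B) = 0.40 and P(A and B) = 0.25:

```python
# Made-up probabilities for the conditional-probability formula.
p_b = 0.40        # P(B), the given event
p_a_and_b = 0.25  # P(A and B)
p_a_given_b = p_a_and_b / p_b  # P(A | B) = P(both) / P(given) = 0.625
print(p_a_given_b)
```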
correlation coefficient (r)
a quantitative assessment of strength and direction of a linear relationship
r is always in [-1, 1]; r = 0 means no linear correlation
0 < |r| < 0.5 = weak
0.5 ≤ |r| < 0.8 = moderate
0.8 ≤ |r| ≤ 1 = strong
Least Squares Regression Line (LSRL)
ŷ = a + bx, where a is the y-intercept and b is the slope