Statistics
The science of collecting, analyzing, and drawing conclusions from data.
Descriptive Statistics
Methods of organizing and summarizing data.
Inferential Statistics
Making generalizations from a sample to the population.
Population
An entire collection of individuals or objects.
Sample
A subset of the population selected for study.
Variable
Any characteristic whose value changes.
Categorical (Qualitative) Variable
Values that place an individual into a category or group (non-numerical).
Numerical (Quantitative) Variable
Measurements or observations of numerical data.
Discrete Variable
Listable sets (counts).
Continuous Variable
Any value over an interval of values (measurements).
Univariate
One variable.
Bivariate
Two variables.
Multivariate
Many variables.
Symmetrical Distribution
Data in which both sides are roughly the same shape and size. “Bell Curve”
Uniform Distribution
Every class has an equal frequency (count) – looks like “a rectangle”
Skewed Distribution
One side (tail) is longer than the other side. The skewness is in the direction that the tail points (left or right)
Bimodal Distribution
Two or more classes with large frequencies are separated by a class with a smaller frequency between them. “double hump camel”
S.O.C.S.
How to describe numerical graphs: Shape, Outliers, Center, Spread
Parameter
Value of a population (typically unknown).
Statistic
A value calculated from a sample, used to estimate the population parameter.
Median
The middle point of the data (50th percentile) when the data is in numerical order.
Mean
μ is the mean of a population (parameter) and x̄ is the mean of a sample (statistic).
Mode
The value that occurs most often in the data. There can be more than one mode, or no mode at all if every data point occurs only once.
Variability
Allows statisticians to distinguish between usual and unusual occurrences.
Range
A single value – (Max – Min).
IQR
Interquartile range – (Q3 – Q1).
Standard Deviation
σ for a population (parameter) and s for a sample (statistic). Measures the typical or average deviation of observations from the mean. The sample standard deviation divides by df = n − 1.
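A minimal sketch of the two divisors using Python's standard-library `statistics` module (the data values are hypothetical):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

# Population standard deviation (sigma): divides by n
sigma = statistics.pstdev(data)
# Sample standard deviation (s): divides by df = n - 1
s = statistics.stdev(data)

print(sigma)          # 2.0
print(round(s, 3))    # 2.138 -- always a bit larger than sigma
```

Dividing by n − 1 instead of n is what makes s slightly larger than σ for the same data.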
Variance
Standard deviation squared.
Resistant
Not affected by outliers.
Non-Resistant
Affected by outliers.
Z-Score
Is a standardized score that tells you how many standard deviations from the mean an observation is.
Normal Curve
Is a bell shaped and symmetrical curve.
Empirical Rule (68-95-99.7)
Measures 1σ, 2σ, and 3σ on normal curves from a center of μ.
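The 68-95-99.7 percentages can be checked directly from the normal CDF, built here from the standard library's `math.erf`:

```python
from math import erf, sqrt

def normal_cdf(z):
    # CDF of the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

# Proportion of observations within k standard deviations of the mean
for k in (1, 2, 3):
    print(k, round(normal_cdf(k) - normal_cdf(-k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```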
Boxplots
Are for medium or large numerical data. It does not contain original observations. Always use modified boxplots where the fences are 1.5 IQRs from the ends of the box (Q1 & Q3). Points outside the fence are considered outliers.
5-Number Summary
Minimum, Q1 (1st Quartile – 25th Percentile), Median, Q3 (3rd Quartile – 75th Percentile), Maximum
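A sketch of the 5-number summary and the 1.5 × IQR fences on a hypothetical data set (note that quartile conventions vary between textbooks; `method='inclusive'` is one common choice):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data with one extreme value

q1, med, q3 = statistics.quantiles(data, n=4, method='inclusive')
five_num = (min(data), q1, med, q3, max(data))

iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(five_num)   # (1, 3.0, 5.0, 7.0, 30)
print(outliers)   # [30] -- outside the fences, flagged on a modified boxplot
```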
Sample Space
Is the collection of all possible outcomes.
Event
Any subset of outcomes from the sample space.
Complement
All outcomes not in the event.
Union
A or B – all the outcomes in either circle. A ∪ B
Intersection
A and B – the outcomes in the overlap of A and B. A ∩ B
Mutually Exclusive (Disjoint)
A and B have no intersection. They cannot happen at the same time.
Independent
Knowing that one event occurred does not change the probability of the other.
Experimental Probability
Is the number of successes from an experiment divided by the total number of trials in the experiment.
Law of Large Numbers
As an experiment is repeated the experimental probability gets closer and closer to the true (theoretical) probability.
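A quick simulation of the idea with a fair coin (the seed is fixed so the run is reproducible):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible
trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(trials))

freq = heads / trials
print(freq)  # close to the theoretical probability 0.5
```

With only a handful of flips the experimental probability can be far from 0.5; with 100,000 flips it lands within about a percentage point.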
Correlation Coefficient – (r)
Is a quantitative assessment of the strength and direction of a linear relationship. Values – [-1, 1]
Least Squares Regression Line (LSRL)
Is a line of mathematical best fit that minimizes the deviations (residuals) from the line. Used with bivariate data.
Residuals (error)
Is the vertical difference of a point from the LSRL (observed − predicted). All residuals sum up to “0”.
Residual Plot
A scatterplot of (x (or ŷ) , residual). No pattern indicates a linear relationship.
Coefficient of Determination (r^2)
Gives the proportion of variation in y (response) that is explained by the relationship of (x, y).
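A minimal sketch of the LSRL formulas on hypothetical bivariate data, showing that the residuals sum to zero and how r² is read:

```python
xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

b = sxy / sxx          # slope of the LSRL
a = my - b * mx        # intercept (line passes through (x-bar, y-bar))

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(sum(residuals), 10))   # 0.0 -- residuals always sum to zero

r = sxy / (sxx * syy) ** 0.5       # correlation coefficient
print(round(r ** 2, 3))            # 0.6 -> 60% of variation in y explained
```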
Extrapolation
The LSRL cannot be used to predict values outside the range of the original data.
Influential Points
Are points that if removed significantly change the LSRL.
Outliers
Are points with large residuals.
Census
A complete count of the population.
Sampling Frame
Is a list of everyone in the population.
Sampling Design
Refers to the method used to choose a sample.
SRS (Simple Random Sample)
Chosen so that each unit has an equal chance of selection and every set of n units has an equal chance of being selected.
Stratified Sample
Divide the population into homogeneous groups called strata, then SRS each strata.
Systematic Sample
Use a systematic approach (every 50th) after choosing randomly where to begin.
Cluster Sample
Based on location. Select a random location and sample ALL at that location.
Bias
Error that favors a certain outcome. Relates to the center of a sampling distribution – if the distribution is centered over the true parameter, the method is considered unbiased.
Voluntary Response
People choose themselves to participate.
Convenience Sampling
Asking people who are easy to reach or comfortable to ask.
Undercoverage
Some group(s) are left out of the selection process.
Non-response
Someone cannot or does not want to be contacted or participate.
Experimental Unit
Single individual or object that receives a treatment.
Factor
Is the explanatory variable, what is being tested.
Level
A specific value for the factor.
Response Variable
What you are measuring with the experiment.
Treatment
Experimental condition applied to each unit.
Control Group
A group used as a comparison to judge the factor's effectiveness – does NOT have to be a placebo.
Placebo
A treatment with no active ingredients (provides control).
Blinding
A method used so that the subjects are unaware of the treatment (who gets a placebo or the real treatment).
Double Blinding
Neither the subjects nor the evaluators know which treatment is being given.
Replication
Uses many subjects to quantify the natural variation in the response.
Randomization
Uses chance to assign the subjects to the treatments.
Completely Randomized Design
All units are randomly allocated among all of the treatments.
Randomized Block Design
Units are blocked and then randomly assigned within each block – reduces variation.
Matched Pairs Design
Units are matched up by characteristics and then randomly assigned.
Confounding Variables
Are variables whose effects on the response cannot be separated from the effects of the factor being tested.
Random Variable
A numerical value that depends on the outcome of an experiment.
Discrete Random Variable
A random variable whose possible values form a listable set (counts).
Continuous Random Variable
A random variable that can take any value over an interval (measurements).
Fair Game
A fair game is one in which all pay-ins equal all pay-outs.
Binomial Distributions
Properties: two mutually exclusive outcomes, a fixed number of trials (n), each trial is independent, and the probability of success (p) is the same for all trials.
Geometric Distributions
Properties: two mutually exclusive outcomes, each trial is independent, and the probability of success (p) is the same for all trials. (NOT a fixed number of trials)
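The two probability formulas can be sketched with the standard library's `math.comb` (the parameter values are hypothetical):

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(exactly k successes in n independent trials)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    # P(first success occurs on trial k); no fixed number of trials
    return (1 - p) ** (k - 1) * p

print(round(binomial_pmf(2, 5, 0.5), 4))   # 0.3125
print(round(geometric_pmf(3, 0.5), 4))     # 0.125
```

Note the structural difference: the binomial counts successes in a fixed n trials, while the geometric waits for the first success.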
Sampling Distribution
Of a statistic: the distribution of the statistic's values over all possible samples of a given size.
Central Limit Theorem
When n is sufficiently large (n > 30) the sampling distribution is approximately normal even if the population distribution is not normal.
Confidence Intervals
Used to estimate the unknown population parameter.
Margin of Error
The maximum expected difference between the estimate and the true parameter – the smaller the margin of error, the more precise our estimate.
Hypothesis Testing
Tells us whether an observed result could plausibly have occurred by random chance.
Null Hypothesis
H0 is the statement being tested. Null hypothesis should be “no effect”, “no difference”, or “no relationship”
Alternate Hypothesis
Ha is the statement suspected of being true.
P-Value
Assuming the null is true, the probability of obtaining the observed result or more extreme.
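A sketch of a two-sided p-value for a hypothetical z test statistic of 2.1, using a normal CDF built from `math.erf`:

```python
from math import erf, sqrt

def normal_cdf(z):
    # CDF of the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 2.1  # hypothetical test statistic
# Two-sided: probability of a result at least this extreme in either tail
p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(p_value, 4))   # 0.0357
```

Since 0.0357 < 0.05, this result would lead to rejecting H0 at the α = 0.05 significance level.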
Level of Significance
α is the amount of evidence necessary before rejecting the null hypothesis.
Type I Error
Rejecting H0 when H0 is actually true. (probability is α)
Type II Error
Failing to reject H0 when H0 is actually false. (probability is β)
Power of a Test
Is the probability that the test will reject the null hypothesis when the null hypothesis is false.
Chi-squared Test
Is used to test counts of categorical data.
Goodness of Fit Test
Is for univariate categorical data from a single sample. Do the observed counts “fit” what we expect?
Test of Independence
Bivariate categorical data from one sample. Are the two variables independent or dependent?
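A goodness-of-fit sketch on hypothetical die-roll counts, computing the chi-squared statistic Σ(O − E)²/E:

```python
# Hypothetical fairness check: 120 die rolls, observed counts per face
observed = [22, 17, 18, 25, 16, 22]
expected = [sum(observed) / len(observed)] * len(observed)  # 20 per face

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))   # 3.1
```

With df = k − 1 = 5, the statistic 3.1 is well below the α = 0.05 critical value of 11.07, so these counts give no reason to reject the "fair die" hypothesis.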