Inference conditions for means
Random: random sample or randomized experiment
10% condition: Population > 10n (if sampling w/o replacement)
Normal/Large Sample (any one of the following):
1. Population distribution is Normal (given in problem)
2. Graphs show no skew or outliers and we can assume the data came from an approximately Normal population distribution.
3. n>30 (CLT says the sampling distribution of sample means will be approx Normal)
sampling distribution
the distribution of values taken by the statistic in all possible samples of the same size from the same population.
distribution of a sample
The distribution of values of the variable for the individuals included in a sample
population distribution
The distribution of values of the variable for all individuals in the population.
Categorical Data
Data that falls into groups or categories. Ex: Favorite car, hair color, etc.
Display categorical data
bar chart, pie chart
Quantitative Data
Data that takes on numerical values (makes sense to average) Ex: height, weight, time, etc.
Display Quantitative Data
Histogram, Dotplot, Boxplot, Stemplot
Marginal Distribution
in a contingency table, the distribution of ONE of the variables alone. (Always out of the total sample)
conditional distribution
describes the values of that variable among individuals who have a specific value of another variable. (Always out of a subgroup of the total sample)
SOCS
shape, outliers, center, spread AND add context!
Comparing Distributions
Address: Shape, Outliers, Center, Spread in context!
YOU MUST USE comparison phrases like "is greater than" or "is less than" for Center & Spread
Outlier Rule
Upper Bound = Q3 + 1.5(IQR)
Lower Bound = Q1 - 1.5(IQR)
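For illustration, a minimal Python sketch of the 1.5(IQR) rule (not part of the original cards; the data values are made up, and np.percentile's quartiles may differ slightly from a TI-84's):
```python
import numpy as np

data = np.array([4, 7, 8, 9, 10, 11, 12, 13, 14, 30])  # hypothetical data set

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # Lower Bound = Q1 - 1.5(IQR)
upper = q3 + 1.5 * iqr   # Upper Bound = Q3 + 1.5(IQR)

outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)   # values outside the fences are flagged as outliers
```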
Interpret Standard Deviation
Measures spread by giving the "typical" or "average" distance that the observations (context) are from their mean (context)
How does shape affect measures of center?
In general,
Skewed Left (Mean < Median)
Skewed Right (Mean > Median)
Fairly Symmetric (Mean ≈ Median)
Percentile
The percentage of data points that lie at or below the value of interest.
z-score formula
z = (x - μ)/σ
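A quick sketch of the formula with made-up numbers (a value of 85 from a distribution with mean 80 and standard deviation 4):
```python
x, mu, sigma = 85, 80, 4   # hypothetical value, mean, standard deviation
z = (x - mu) / sigma       # z = (x - mu) / sigma
print(z)                   # 1.25 -> 85 lies 1.25 standard deviations above the mean
```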
z-score interpretation
The z-score represents the number of standard deviations above/below the mean that a particular point lies within a distribution.
linear transformation
when you multiply, divide, add, or subtract a constant from each score in a distribution.
Shape - Stays the same!
Center - Impacted by all operations (+, -, x, /)
Spread - Impacted only by x & /
standard normal distribution
A normal distribution of z-scores with a mean of 0 and a standard deviation of 1.
normalcdf
Calculator command to find the area under a normal curve, given the following values:
normalcdf(lower bound, upper bound, mean, standard deviation)
invNorm
Calculator command to find the value corresponding to a given area to the left of that value, given the following values:
invNorm(area to the left, mean, standard deviation)
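A hedged sketch of the same two calculations in Python with scipy.stats (the calculator commands above are the TI equivalents; the numbers here are made up):
```python
from scipy.stats import norm

# normalcdf(lower, upper, mean, sd): area under the Normal curve between two bounds
area = norm.cdf(74, loc=68, scale=3) - norm.cdf(65, loc=68, scale=3)
print(area)      # P(65 < X < 74) for a N(68, 3) distribution, about 0.82

# invNorm(area to the left, mean, sd): value with the given area to its left
cutoff = norm.ppf(0.90, loc=68, scale=3)
print(cutoff)    # 90th percentile of a N(68, 3) distribution, about 71.8
```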
SRS (Simple Random Sample)
every individual in the population has an equal chance of being selected, and every group of n individuals has an equal chance of being selected
Sampling Techniques
1. SRS
2. Stratified
3. Cluster
4. Census
5. Convenience
6. Voluntary Response
7. Systematic
stratified random sample
a sampling design in which the population is divided into several subpopulations, and random samples are then drawn from each stratum. (SOME from ALL)
Cluster Random Sample
Divide the population into a large number of clusters. Randomly select a certain number of clusters and sample ALL subjects in each cluster. (ALL from SOME)
Census
An attempt to contact/sample all members of the population.
convenience sample
only members of the population who are easily accessible are selected
voluntary response sample
People decide whether to join a sample based on an open invitation (Ex: on-line polls, telephone calls, etc.)
Advantage of stratified random sampling
Stratified random sampling guarantees that each of the strata will be represented. It will produce less variable/more precise information than an SRS of the same size.
Bias
A sampling method is biased if it consistently produces estimates that are too small or too large.
experiment
A research method in which an investigator imposes a treatment upon the experimental units.
observational study
observes individuals and measures variables of interest but does not attempt to influence the responses or impose a treatment.
Experiment vs. Observational Study
An experiment can conclude a cause-and-effect relationship between explanatory and response variables.
An observational study can only conclude an association between explanatory and response variables.
confounding variable
two variables are confounded if it cannot be determined which variable is causing the change in the response variable.
control group
the group that does not receive the experimental treatment. An experiment DOES need comparison, but DOES NOT need a control group in order to compare.
Blinding
a technique where the subjects do not know whether they are receiving a treatment or a placebo.
If both the subject and the people interacting with the subject don't know which treatment is being received/given, then the study is double blind.
Experimental Designs
1. Completely Randomized Design.
2. Randomized Block Design.
3. Matched Pairs
completely randomized design
all experimental units have an equal chance of receiving any treatment
randomized block design
Start by forming blocks consisting of individuals that are similar in some way that is important to the response. Random assignment of treatments is then carried out separately within each block.
matched pairs design
The design of a study where experimental units are naturally paired by a common characteristic, or with themselves in a before-after type of study.
Benefit of Blocking
Blocking helps account for the variability in the response variable (context) that is caused by the blocking variable (context).
scope of inference
1. We can generalize our inference to the entire population when individuals are randomly selected.
2. Inferences about cause and effect are possible when an experiment is performed.
interpret probability
the probability of an event is the proportion of times the event would occur in a very large number of repetitions. (Probability is a long-term relative frequency.)
Law of Large Numbers
if we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches a single value.
Conducting a simulation
State: Ask a question about some chance process.
Plan: Describe how to use a random device to simulate one trial of the process and indicate what will be recorded at the end of each trial.
Do: Do many trials.
Conclude: Answer the question of interest.
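A small Python sketch following the Plan/Do/Conclude steps above, for a hypothetical question such as "What is the probability of getting 8 or more heads in 10 flips of a fair coin?":
```python
import random

random.seed(1)  # so the simulation is reproducible

# Plan: one trial = flip 10 fair coins, record whether we got 8 or more heads
def one_trial():
    heads = sum(random.random() < 0.5 for _ in range(10))
    return heads >= 8

# Do: repeat many trials
trials = 10_000
count = sum(one_trial() for _ in range(trials))

# Conclude: the proportion of "yes" trials estimates the probability
print(count / trials)   # estimated probability; the exact binomial answer is about 0.0547
```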
Two Events are Independent If...
P(A)*P(B) = P(A and B)
OR
P(B) = P(B|A)
OR
P(A) = P(A|B)
Meaning: Knowing that Event A has occurred (or not occurred) doesn't change the probability that event B occurs.
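A tiny sketch checking the multiplication condition for a made-up example (two flips of a fair coin; A = first flip is heads, B = second flip is heads):
```python
from itertools import product

outcomes = list(product("HT", repeat=2))           # all 4 equally likely outcomes
p_a = sum(o[0] == "H" for o in outcomes) / 4       # P(A) = 0.5
p_b = sum(o[1] == "H" for o in outcomes) / 4       # P(B) = 0.5
p_ab = sum(o == ("H", "H") for o in outcomes) / 4  # P(A and B) = 0.25

print(p_ab == p_a * p_b)   # True -> A and B are independent
```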
Two events are mutually exclusive if
P(A and B) = 0
Events A and B are mutually exclusive if they share no outcomes.
Interpreting Expected Value/Mean
If we were to repeat the chance process (context) many times, the average value of _____ (context) would be about _______.
Mean of a Discrete Random Variable (expected value)
multiply each possible value by its probability, then add all the products
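A short sketch using a made-up probability distribution (not from the cards):
```python
# Hypothetical discrete random variable: number of cars sold in a day
values = [0, 1, 2, 3]
probs  = [0.1, 0.4, 0.3, 0.2]   # probabilities must sum to 1

# Expected value: multiply each possible value by its probability, then add the products
mean = sum(x * p for x, p in zip(values, probs))
print(mean)   # 0*0.1 + 1*0.4 + 2*0.3 + 3*0.2 = 1.6
```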
combining random variables: finding mean
add/subtract the means for each independent distribution.
combining random variables: finding standard deviation
ADD the VARIANCES for each independent distribution (variances add whether you add or subtract the variables). Then take the square root of the total.
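A sketch of these rules for two hypothetical independent random variables X and Y:
```python
import math

# Hypothetical independent random variables
mean_x, sd_x = 50, 6
mean_y, sd_y = 30, 8

# Means add or subtract directly
mean_sum  = mean_x + mean_y    # mean of X + Y = 80
mean_diff = mean_x - mean_y    # mean of X - Y = 20

# Variances ADD for both X + Y and X - Y (when X and Y are independent)
var = sd_x**2 + sd_y**2        # 36 + 64 = 100
sd  = math.sqrt(var)           # 10, the SD of X + Y (and of X - Y)
print(mean_sum, mean_diff, sd)
```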
binomial setting
BINS:
1. Binary: everything is a success or failure.
2. Independent trials.
3. A fixed Number of observations
4. The probability of Success is the same for each observation.
geometric setting
1. Binary: everything is a success or failure.
2. Independent trials.
3. Observe trials UNTIL a success.
4. The probability of Success is the same for each observation.
Mean and Standard Deviation of a Binomial Random Variable
Mean = np
Standard deviation = √[np(1-p)]
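A sketch for a hypothetical binomial setting with n = 20 trials and success probability p = 0.3:
```python
import math

n, p = 20, 0.3                   # hypothetical binomial parameters
mean = n * p                     # np = 6
sd = math.sqrt(n * p * (1 - p))  # sqrt(np(1-p)) = sqrt(4.2), about 2.05
print(mean, sd)
```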
Parameter vs. Statistic
Parameter: a measure (mean/proportion etc.) of a POPULATION
Statistic: a measure (mean/proportion etc.) of SAMPLE
sampling distribution
the distribution of values taken by the statistic in all possible samples of the same size from the same population. Does NOT vary in repeated sampling.
distribution of a sample
The distribution of values of the variable for the individuals included in ONE sample. This distribution varies in repeated sampling.
population distribution
The distribution of values of the variable for all individuals in the population. Does NOT vary in repeated sampling.
sampling distribution of the sample mean
A probability distribution of all possible sample means of a given sample size. Shape, center and spread can be found by satisfying the following conditions:
10% condition - needed to use the formula sheet's standard deviation formula:
N > 10n
Normal/Large Sample - determines if the sampling distribution is approximately Normal:
n > 30 OR population distribution is Normal
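A quick simulation sketch of the idea above: repeated samples of size n = 40 from a skewed population still give an approximately Normal distribution of sample means, centered at μ with spread about σ/√n (the exponential population and all numbers are made up):
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population: exponential with mean 10 (its SD is also 10)
mu, sigma, n = 10, 10, 40

# Draw many samples of size n and record each sample mean
sample_means = rng.exponential(scale=10, size=(5_000, n)).mean(axis=1)

print(sample_means.mean())   # close to mu = 10 (center of the sampling distribution)
print(sample_means.std())    # close to sigma / sqrt(n) = 10 / sqrt(40), about 1.58
```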
sampling distribution of sample proportions
A probability distribution of all possible sample proportions of a given sample size. Shape, center and spread can be found by satisfying the following conditions:
10% condition - needed to use the formula sheet's standard deviation formula:
N > 10n
Large Counts - determines if sampling distribution is approximately Normal.
np > 10 AND n(1-p) > 10
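A sketch checking these conditions and finding the center and spread of the sampling distribution of p-hat for a hypothetical p = 0.35, n = 100, and population size 5,000 (mean p and standard deviation √(p(1-p)/n) are the formula-sheet values):
```python
import math

p, n, N = 0.35, 100, 5_000   # hypothetical population proportion, sample size, population size

print(N > 10 * n)                       # 10% condition
print(n * p > 10 and n * (1 - p) > 10)  # Large Counts: np > 10 and n(1-p) > 10

mean_phat = p                          # center of the sampling distribution of p-hat
sd_phat = math.sqrt(p * (1 - p) / n)   # spread, about 0.048
print(mean_phat, sd_phat)
```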
Central Limit Theorem (CLT)
Says that when n is large (n > 30), the sampling distribution of the sample mean is approximately Normal
unbiased estimator
A statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling distribution is equal to the true value of the parameter being estimated. (means and proportions are unbiased estimators).
correlation coefficient
A number that describes the strength and direction of a linear relationship. (from -1 to +1)
explanatory variable
A variable that helps explain or influences changes in a response variable.
response variable
a variable that measures an outcome or result of a study
DUFS
direction, unusual features, form, strength
influential point
An extreme value whose removal would drastically change the LSRL, correlation and/or coefficient of determination
linear regression
A method of finding the best model for a linear relationship between the explanatory and response variable.
negative association
as x increases, y decreases
positive association
as x increases, y increases
correlation coefficient = 0
no LINEAR association
Outlier
A value that "lies outside" (is much smaller or larger than) most of the other values in a set of data.
Scatterplot
a graphical depiction of the relationship between two quantitative variables
coefficient of determination
The percent of the variation in the values of y that can be explained by the least-squares regression line of y on x.
lurking variable
a variable that is not among the explanatory or response variables in a study but that may influence the response variable
extrapolation
Using a model to make a prediction outside the range of data used to create the model in the first place.
slope
the change in the response variable (y) for every one unit of change to the explanatory variable (x)
y-intercept
the value of the response variable (y) when the explanatory variable (x) is 0.
Equation of a line
y = a + bx
LSRL
a unique best-fit line that is found by making the squares of the residuals as small as possible
y hat
predicted value of y
residual
prediction error
Formula for Residual
residual = Actual Value - Predicted Value (y - ŷ)
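A short sketch fitting an LSRL and computing residuals for made-up (x, y) data with numpy:
```python
import numpy as np

# Hypothetical data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares regression line y-hat = a + b*x
b, a = np.polyfit(x, y, 1)   # polyfit returns the slope first, then the intercept
y_hat = a + b * x            # predicted values
residuals = y - y_hat        # residual = actual - predicted

print(a, b)                  # intercept and slope of the LSRL
print(residuals)             # should show no pattern if a line is a good model
```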
residual plot
a scatterplot of the regression residuals against the explanatory variable
standard deviation of residuals
This value gives the approximate size of a "typical" or "average" prediction error (residual).
Parameter
measures a characteristic of a POPULATION (mean, proportion, etc.)
Statistic
measures a characteristic of a SAMPLE (mean, proportion, etc.)
Central Limit Theorem (CLT)
Says that when n is large, the sampling distribution of the sample mean is approximately Normal
unbiased estimator
A statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
4-Step process to inference procedures (confidence intervals and significance tests)
State, Plan, Do & Conclude
Interpret Confidence Interval
we are ___% confident that the interval from ___ to ___ captures the actual value of the [population parameter in context]
Interpret Confidence Level
If we take many samples from this population, about ___% of them will result in an interval that captures the parameter (in context).
Standard Error vs Margin of Error
The standard error of a statistic estimates how far the value of the statistic typically differs from the true value of the parameter. (calculating standard deviation from sample data)
The margin of error estimates how far, at most, we expect the statistic to differ from the parameter. (the +/- on our confidence interval)
What factors affect the Margin of Error
The margin of error decreases when:
1. The sample size increases
2. The confidence level decreases
Finding the Sample Size (For a given margin of error)
Means - Use z* and assume we know the population SD
Proportions - Use given "p-hat" or if unknown, use p = 0.5
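A sketch of both calculations for a hypothetical 95% confidence level (z* ≈ 1.96), a desired margin of error of 2 units for a mean (assumed σ = 15) and 0.03 for a proportion:
```python
import math

z_star = 1.96   # critical value for 95% confidence

# Means: n = (z* * sigma / ME)^2, assuming the population SD is known
sigma, me_mean = 15, 2                                # hypothetical values
n_mean = math.ceil((z_star * sigma / me_mean) ** 2)   # always round UP

# Proportions: n = (z*/ME)^2 * p(1-p); use p = 0.5 if no estimate is available
p, me_prop = 0.5, 0.03                                # hypothetical values
n_prop = math.ceil((z_star / me_prop) ** 2 * p * (1 - p))

print(n_mean, n_prop)   # 217 and 1068
```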
inference conditions for proportions
Random: random sample or randomized experiment
10% condition: Population > 10n (if sampling w/o replacement)
Large Counts: np>10 and n(1-p)>10
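A hedged sketch putting these conditions together with a 95% one-sample z interval for a proportion, using hypothetical data (62 successes in a random sample of n = 100 from a population of 4,000); the interval formula p̂ ± z*·√(p̂(1-p̂)/n) is the standard one-proportion z interval, not stated explicitly on these cards:
```python
import math

n, successes, N = 100, 62, 4_000   # hypothetical sample and population sizes
p_hat = successes / n

# Conditions
print(N > 10 * n)                               # 10% condition
print(n * p_hat > 10 and n * (1 - p_hat) > 10)  # Large Counts (using p-hat)

# 95% one-sample z interval for p
z_star = 1.96
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of p-hat
me = z_star * se                          # margin of error
print(p_hat - me, p_hat + me)             # roughly (0.525, 0.715)
```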
Interpret P-Value
The probability, assuming the null hypothesis is true, of getting a statistic as extreme as or more extreme than the one observed, purely by chance (IN CONTEXT)
Assess a claim from a confidence interval
**ONLY works for two-sided tests
1. If the null hypothesis value is in the interval, then it is a plausible value and the null should NOT be rejected.
2. If the null hypothesis value is NOT in the interval, then it is not a plausible value and the null should be rejected.
Type I error
Rejecting the null hypothesis when it is actually true
(Finding convincing evidence for the alternative hypothesis when it is NOT true)
P(Type I Error) = Alpha (significance level)