Statistics (the science)
The science of planning studies and experiments, obtaining data, and organizing,
summarizing, analyzing, and interpreting those data and then drawing conclusions based on them
1st step of conducting a statistical analysis
Prepare: Consider the population, data types, and sampling method
2nd step of conducting a statistical analysis
Analyze: Describe the data you collected and use appropriate statistical methods to help with drawing conclusions
3rd step of conducting a statistical analysis
Conclude: Using statistical inference, make reasonable judgments and answer broad questions
Data
Collections of observations, such as measurements, counts, descriptions, or survey responses
Population
The complete collection of all data that we would like to better understand or describe
Sample
A subset of members selected from a population
Parameter
a numerical measurement describing some characteristic of a population
Statistic
a numerical measurement describing some characteristic of a sample
Quantitative Data
consists of numbers representing counts or measurements
Qualitative/Categorical Data
consists of names or labels (not numbers that represent counts or measurements)
Discrete Data
result when the data values are quantitative and the number of values is finite or “countable.”
Continuous Data
result from infinitely many possible quantitative values, where the collection of values is not countable.
Biased Samples
Samples that are more likely to produce some outcomes than others. The resulting statistics may be too high or too low
Convenience Samples
Samples that are easy to collect but often have some bias or do not represent the population in general
Volunteer Responses
a self-selected sample of people who respond to a general appeal
Simple Random Sample (SRS)
A sample of n subjects is selected in such a way that every possible sample of the
same size n has the same chance (or probability) of being chosen.
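A minimal sketch of drawing an SRS with Python's standard library; the population of 100 labeled subjects is hypothetical:

```python
# Simple random sample: every possible sample of size n is equally likely.
import random

population = list(range(1, 101))  # hypothetical population of 100 labeled subjects
random.seed(0)                    # fixed seed for reproducibility
srs = random.sample(population, k=10)  # each possible 10-subject sample is equally likely
print(srs)
```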
Stratified Sample
Subdivide the population into at least two different subgroups (or strata) so that the subjects within the same subgroup share the same characteristics. Then draw a sample from each subgroup (or stratum). The number sampled from each stratum may be done proportionally with respect to the size of the population.
Cluster Sample
Divide the population area into naturally occurring sections (or clusters) then randomly select some of those clusters and choose all the members from those selected clusters.
Systematic Sample
Select some starting point and then select every kth element in the population. This works
well when units are in some order (assembly lines, houses on a block, etc.).
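A sketch of systematic sampling under the same hypothetical population; the starting point is random, then every kth unit is taken:

```python
# Systematic sample: random starting point, then every kth element.
import random

population = list(range(1, 101))  # hypothetical ordered units, e.g. houses on a block
k = 10
random.seed(1)
start = random.randrange(k)   # random starting point among the first k units
sample = population[start::k] # every kth element after the start
print(sample)
```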
Multistage Sample
Collect data by using some combination of the basic sampling methods.
Bad Sampling Frame
When attempting to list all members of a population, some subjects are missing. It
can be difficult to obtain a full, complete list.
Undercoverage
The sampling frame is missing groups from the population or the groups have smaller
representation in the sample than in the population.
Non-response Bias
Some part of the population chooses not to respond, or subjects were selected but
are not able to be contacted.
Response Bias
Responses given to questions or surveys are not truthful. This may occur when people
are unwilling to reveal personal matters, admit to illegal activity, or otherwise tailor their responses to “please” the investigator.
Wording and Order
The way questions are worded may be leading or inflammatory to elicit a particular
response. The order in which questions are asked may influence the answers.
x-bar
the sample mean
Mu
the population mean
Mean
Uses every data value
Highly affected by outliers
Not good for skewed data sets (but is best for symmetric data!)
Median
Not affected by outliers
Can use with any data set
Mode
Not necessarily in the center
Not affected by outliers
Most useful for qualitative data or multimodal data sets
Histogram
A graph with a horizontal scale representing classes of quantitative data values and a vertical scale representing frequency.
Dotplot
shows each value in a dataset as a dot above a number line, no y-axis
Standard Deviation (sigma)
a measure of how much data values deviate from the mean
Variance
Standard Deviation squared
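A sketch using the standard `statistics` module on a small hypothetical sample, showing that the variance is the standard deviation squared:

```python
import statistics

data = [4, 8, 6, 5, 3, 7]        # hypothetical sample
mean = statistics.mean(data)
sd = statistics.stdev(data)      # sample standard deviation (divides by n - 1)
var = statistics.variance(data)  # sample variance
assert abs(var - sd**2) < 1e-9   # variance = standard deviation squared
print(mean, round(sd, 3), round(var, 3))
```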
Experiment
The process of applying some treatment and then observing its effects; has a control group and a treatment group.
Observational Study
The process of observing and measuring specific characteristics without attempting to
modify the individuals being studied
Response Variable
measures an outcome of a study
Explanatory Variable
explains or influences changes in the response variable
Reasons for variability in responses
Treatment effects
Experimental error
Confounding variables
Lurking variables
Control in Experimental Design
control the effects of lurking/confounding variables and other sources of variability on the
response by carefully planning the study
Randomization in Experimental Design
randomly assign experimental units to treatments to reduce or eliminate bias
Replication in Experimental Design
measure the effect of each treatment on many units to reduce chance variation in the
results
Completely Randomized Design
participants are randomly assigned to treatments (including control
groups)
Randomized Block Design
the experimenter divides participants into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, participants within each block are randomly assigned to treatment groups
Matched Pairs Design
used when the experiment has only two treatment groups; and participants can be grouped into pairs, based on one or more blocking variables. Then, within each pair, participants are randomly assigned to different treatments.
Bias of the Subjects
subjects may want to please the researcher or hope for a specific outcome
Hawthorne Effect
When people behave differently because they know they are being watched
Bias of the Researcher
They may assign subjects to groups or report results in a biased way, and may treat people or animals differently when holding certain expectations of their treatment
Blinding
when individuals associated with an experiment are not aware of how subjects have been assigned
Single Blind Study
those who could influence the results are blinded
Double Blind Study
those who evaluate the results are blinded, as well as those who could influence them
z-score
the number of standard deviations away from the mean a certain data value is
Positive z-score
data value is above average
Negative z-score
data value is below average
Standardizing
The process of converting a data value (x) to a z-score
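A sketch of standardizing with hypothetical values (an IQ-style scale):

```python
# z-score: the number of standard deviations a data value lies from the mean.
mu, sigma = 100, 15   # hypothetical population mean and standard deviation
x = 130
z = (x - mu) / sigma  # standardizing: z = (x - mu) / sigma
print(z)              # positive z: the value is above the mean
```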
Significantly low values
considered significant or unusual if they are (µ − 2σ) or lower
Significantly high values
considered significant or unusual if they are (µ + 2σ) or higher
Values not significant
Between (µ − 2σ) and (µ + 2σ)
Density Curve
Probability is represented by the area underneath it
Normal Distribution properties
Mean, median, and mode are equal
Normal curve is bell-shaped and symmetric about the mean
Total area under the curve is equal to 1
Normal curve approaches, but never touches, the x-axis
Standard Normal Distribution
distribution of z-scores
Percentile
To find an x-value when given a probability (percentile), solve the z-score formula for x: x = μ + zσ
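A sketch of finding an x-value from a given probability using `statistics.NormalDist`; the test-score distribution below is hypothetical:

```python
# Percentile: get z from the standard normal, then solve x = mu + z*sigma.
from statistics import NormalDist

mu, sigma = 500, 100            # hypothetical test-score distribution
z = NormalDist().inv_cdf(0.90)  # z-score for the 90th percentile
x = mu + z * sigma              # solve the z-score formula for x
print(round(z, 4), round(x, 1))
```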
Probability Distribution
describes how likely the values of the variable are to occur
Binomial Random Variable four criteria
There are a fixed number of trials/observations (n)
The trials are independent of each other
Each outcome is either a success (s), the outcome being counted, or a failure (f)
The probability of a success P(S) = p is constant for each trial
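When the four criteria hold, binomial probabilities follow P(X = k) = C(n, k)·p^k·(1−p)^(n−k). A sketch with a hypothetical fair-coin example:

```python
from math import comb

n, p = 10, 0.5  # hypothetical: 10 independent fair-coin flips, success = heads
k = 4
prob = comb(n, k) * p**k * (1 - p)**(n - k)  # probability of exactly 4 heads
print(round(prob, 4))
```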
Summarize the shapes of Binomial Distributions
For small n, the shape tends to be skewed
As n increases, we see more bell-shaped/symmetric distributions (for any p).
When p is closer to 0 or 1, the shape starts to skew
p
population proportion
p-hat
sample proportion
In order to look at the distribution of a statistic, we need to know
the possible values of the random variable and how likely they are to occur
Standard Error
The standard deviation of the sample mean; it gets smaller as the sample size increases
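A sketch (with a hypothetical population standard deviation) showing the standard error σ/√n shrinking as n grows:

```python
from math import sqrt

sigma = 12.0  # hypothetical population standard deviation
ses = [sigma / sqrt(n) for n in (25, 100, 400)]  # standard errors of the sample mean
print(ses)    # each quadrupling of n halves the standard error
```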
Point Estimate of a Parameter
the value of the sample statistic that corresponds to that parameter
Level of Confidence (C)
the probability that the interval estimate contains / captures the population parameter
Confidence Interval (CI)
a range/interval of values used to estimate the true value of a population parameter
Margin of Error (MOE)
tells us the amount of random sampling error in our results and how far we might be off
How to narrow a confidence interval
Decrease the confidence level
Increase the sample size (the standard error gets smaller as the sample size increases)
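A sketch (hypothetical values) of a z-based confidence interval for a mean, x̄ ± z*·σ/√n; lowering the confidence level or raising n narrows the interval:

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 50.0, 8.0, 64  # hypothetical sample mean, sigma, and sample size
conf = 0.95
z_star = NormalDist().inv_cdf((1 + conf) / 2)  # critical value (about 1.96 for 95%)
moe = z_star * sigma / sqrt(n)                 # margin of error
ci = (x_bar - moe, x_bar + moe)
print(ci)
```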
Null Hypothesis
H0: Only claims using =. We assume the equality value in the null hypothesis is true and conduct the test under this assumption.
Alternative Hypothesis
HA: The complement of the null. Uses only <, >, or ≠, never equality
Type I Error
if the null hypothesis is rejected when it is actually true
Type II Error
if the null hypothesis is not rejected when it is actually false
Left-Tailed Test
we are only interested in showing that the parameter is less than a particular value
Right-Tailed Test
we are only interested in showing that the parameter is more than a particular value
Two-Tailed Test
we are interested in showing that the parameter is not equal to a particular value (less than or more than)
P-value (probability value)
the probability of observing this value or something more extreme, under the assumed distribution of the null hypothesis
If the p-value is less than α
reject the null
If the p-value is greater than α
fail to reject the null
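A sketch of the p-value decision rule for a two-tailed z-test; the test statistic below is hypothetical:

```python
from statistics import NormalDist

alpha = 0.05
z = 2.31                                       # hypothetical test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed p-value
decision = "reject the null" if p_value < alpha else "fail to reject the null"
print(round(p_value, 4), decision)
```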
If 0 is not included in the confidence interval for the difference of means
Then the means are significantly different
Confidence intervals and 2-sided hypothesis tests are
equivalent
Correlation Coefficient: r
a measure of the strength and the direction of a linear relationship between two variables
Strong r values
values greater than 0.8 or smaller than -0.8
Moderate r values
values between 0.5 and 0.8 or between -0.8 and -0.5
Weak r values
values between -0.5 and 0.5 (closer to 0)
Residual
observed y (points) - predicted y (points on line)
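A sketch of computing a residual; the fitted slope and intercept are hypothetical:

```python
# Residual = observed y minus the y predicted by the regression line.
b0, b1 = 2.0, 0.5          # hypothetical intercept and slope: y_hat = b0 + b1*x
x, y_observed = 10, 8.0
y_predicted = b0 + b1 * x  # point on the line
residual = y_observed - y_predicted
print(residual)            # positive: the observed point lies above the line
```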
Regression line
Best fitting straight line of the sample data
β0
intercept of the population regression model and is the expected value (mean) of Y when x=0
β1
the slope of the population regression model, and is the expected change in Y relative to one unit change in x
The smaller the SSE (sum of squares error),
the better the line fits
Condition 1 for Linear Regression: Linear data
If the data do not have a linear association/correlation, then a linear regression model is not a good choice
Condition 2 for Linear Regression: Constant Variance
The errors/deviations around the regression line should be the same at each value of x
Coefficient of Determination: r-squared
the proportion of observed y variation that can be explained by the simple linear regression model