The most you can say is the distribution is approx. normal
(68-95-99.7 rule)
The empirical rule applies only to normal distributions
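smaller standard deviation = values are similar
larger standard deviation = values are different/varied
(z-score gives the number of standard deviations a value is from the mean)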
Do!
label axes (units of measurement)
identify values
shade area of interest
perform calculations
z-score gives you the answer!
Calculator Functions
2nd VARS
normalcdf
enter (lower, upper, mean, standard deviation) to calculate the cumulative probability for a normal distribution.
Gives you percentage in decimal form (0.0668 = 6.68%)
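Away from the calculator, the same area can be checked with a short Python sketch. The function name normalcdf is borrowed from the calculator for illustration only; it is not a built-in Python function, and the argument order (lower, upper, mean, standard deviation) is copied from the note above.

```python
import math

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under a normal curve between lower and upper."""
    def cdf(x):
        # Normal CDF written with the error function from the math module.
        z = (x - mean) / sd
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(upper) - cdf(lower)

# Area to the left of z = -1.5 on the standard normal curve.
p = normalcdf(-math.inf, -1.5)
print(round(p, 4))  # 0.0668, i.e. 6.68%

# Empirical rule check: area within 1 standard deviation of the mean.
print(round(normalcdf(-1, 1), 4))  # 0.6827
```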
Normal Curve
symmetrical about the mean (mean = median = mode)
never touches the x-axis
total area under curve = 1
the shape of the graph is influenced by
mean (horizontal shift: a larger mean shifts the curve right)
standard deviation (width of the graph)
Binomial Distributions
~ a type of discrete probability distribution ~
Bernoulli sequence: a sequence of trials that each have only 2 outcomes (coin flip)
identical trials
must be independent
If X is a random variable with a binomial distribution
P(X = x) is the probability of x successes in n trials
Order is not important
The values of n & p change the graph shape
n: as n increases, more bars make the graph closer to a bell-shaped curve
p: as p increases (n constant), the tail of the graph is stretched more to the left
if p < 0.5, the tail is on the right (skewed right); if p > 0.5, the tail is on the left (skewed left)
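A minimal Python sketch of the binomial probability, assuming n independent trials with the same success probability p. The helper name binompdf is chosen here for illustration; the example values (10 coin flips, 3 heads) are made up.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x): probability of exactly x successes in n trials."""
    # comb(n, x) counts the orderings, since order is not important.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Probability of exactly 3 heads in 10 fair coin flips.
print(round(binompdf(10, 0.5, 3), 4))  # 0.1172

# The probabilities over x = 0..n form a full distribution.
total = sum(binompdf(10, 0.5, x) for x in range(11))
```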
A response variable measures an outcome of a study (y) (explanatory predicts)
An explanatory variable may help predict or explain changes in a response variable (x)
Scatterplot: shows the relationship between 2 quantitative variables measured for the same individual
each individual appears as a point
Correlation: r gives the direction and measures the strength of the linear association between 2 quantitative variables
if asked for r and you only have r², take the square root to find r (r has the same sign as the slope)
The linear association between x and y is _____ (strength) and ______ (direction).
Regression line: models how a response variable (y) changes as an explanatory variable (x) changes
ŷ = a + bx (a = y-intercept, b = slope)
Residuals: vertical distance between actual and predicted y
residual = actual y - predicted y
The actual y-context was (residual) above/below the predicted y-context for x-context.
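A rough Python sketch of fitting the line and computing residuals by hand. The five data points are made up for illustration, and lsrl is a helper name chosen here, not a library function.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

xs = [1, 2, 3, 4, 5]   # made-up explanatory values
ys = [2, 4, 5, 4, 5]   # made-up response values

a, b = lsrl(xs, ys)
predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # actual - predicted

print(round(a, 2), round(b, 2))            # 2.2 0.6
print([round(e, 2) for e in residuals])    # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

A positive residual means the actual y was above the prediction; a negative residual means it was below.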
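influential points have the most significant impact on the regression line (removing one noticeably changes the slope, y-intercept, or r)
an outlier can also be a high-leverage point
not all outliers are high-leverage
an outlier can sit far from the line while its x-value is similar to the rest of the data
not all outliers are influential
they can be influential though
a point that is both an outlier and high-leverage is likely to be influential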
To judge an unusual point, consider:
leverage: how far its x-value is from the other x-values
residual: its vertical distance from the regression line
s (standard deviation of the residuals): measures the size of a typical residual
The actual y-context is typically about s away from the y-context predicted by the LSRL
LSRL = least-squares regression line
r²% of the variation in y-context is explained by the linear relationship with x-context (for a transformed model, use the transformed variable, e.g. log(y-context))
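A small Python sketch of s and r² computed from residuals, using made-up data. It assumes the usual formulas s = sqrt(SSE / (n - 2)) and r² = 1 - SSE / SST, where SSE is the sum of squared residuals and SST is the sum of squared deviations of y from its mean.

```python
import math
from statistics import mean

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(e**2 for e in residuals)        # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

s = math.sqrt(sse / (len(xs) - 2))        # size of a typical residual
r2 = 1 - sse / sst                        # share of variation explained
r = math.copysign(math.sqrt(r2), b)       # r takes the sign of the slope

print(round(s, 4), round(r2, 2), round(r, 4))  # 0.8944 0.6 0.7746
```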
When a scatterplot shows a curved relationship between 2 quantitative variables, transform one or both variables to create a linear association
choose a model whose residual plot has the most random scatter (no curve)
if more than one model produces a randomly scattered residual plot, choose the model with the largest coefficient of determination (r²)
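A hedged Python sketch of comparing r² before and after a log transformation. The data are made-up exponential growth, and r_squared is a helper written here. The sketch only compares r²; checking the residual plots still has to come first.

```python
import math
from statistics import mean

def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy**2 / (sxx * syy)

xs = [0, 1, 2, 3, 4, 5]
ys = [2 * 1.5**x for x in xs]            # made-up curved (exponential) data
log_ys = [math.log10(y) for y in ys]     # transform the response variable

r2_linear = r_squared(xs, ys)            # model: y vs x
r2_log = r_squared(xs, log_ys)           # model: log(y) vs x

print(r2_log > r2_linear)  # True
```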
Simple Random Sample
relies on using a selection method that provides each survey participant with an equal chance of being selected
based on probability and random selection
more likely to be representative of the total population (free of bias)
Label (give each individual a number)
randomize (use a number generator)
select (choose individuals that correspond)
Calculator
math
prob
randInt(lower, upper, n)
For a population of 30 students, choose 5 random students: 29, 28, 5, 16, 13
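The label, randomize, select steps can be sketched in Python with the standard random module. simple_random_sample is a helper name chosen here; unlike randInt, random.sample never repeats a label, so no repeats need to be skipped.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Label individuals 1..population_size, then draw without replacement."""
    rng = random.Random(seed)                 # seed only makes the run repeatable
    labels = range(1, population_size + 1)    # label
    return rng.sample(labels, sample_size)    # randomize and select

# 30 students, choose 5.
chosen = simple_random_sample(30, 5, seed=1)
print(sorted(chosen))
```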
Choose SRS with a table
label
randomize
select
more Biased
Stratified random sample: for surveys, the total population is divided into groups (strata)
individuals are grouped by a shared characteristic (take an SRS from each group)
used by researchers when trying to evaluate data
define the groups
define the sample size (ratio)
randomly select from each group
review results
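The steps above can be sketched in Python. This assumes proportional allocation (the same fraction from every stratum); the school, grade levels, and ID numbers are hypothetical, and stratified_sample is a helper written for this example.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Take an SRS of the same fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # sample size for this stratum
        sample[name] = rng.sample(members, k)       # SRS within the stratum
    return sample

# Hypothetical school: strata are grade levels, members are student ID numbers.
strata = {
    "grade 9": list(range(0, 40)),
    "grade 10": list(range(40, 70)),
    "grade 11": list(range(70, 100)),
}

picked = stratified_sample(strata, fraction=0.1, seed=2)
for grade, students in picked.items():
    print(grade, sorted(students))
```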
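Low bias: sample estimates are centered on the true population value (no consistent over- or underestimate)
Low variability: estimates from repeated samples are close to one another
If your sampling method has both, the sample most likely represents the population and generalizations to the population are most likely to be accurate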
Advantages
symmetric demographics
fair method
efficient
accurate data
Disadvantages
prior knowledge
may not be representative
more complicated
selection bias
Strata: groups of individuals in a population who share a characteristic
a stratified random sample is selected by choosing an SRS from each stratum
Sampling variability: the statistic computed from a sample will vary as the random sampling is repeated
will decrease as the sample size increases
cluster: a group of individuals that are located near each other
divides the population into groups or clusters
randomly selected clusters make up your total sample group for a study
useful when surveying a large population with natural groupings
might be biased
define the population and cluster size
generate your clusters
randomly select clusters
collect data
analyze and interpret data
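A minimal Python sketch of the cluster steps, assuming one-stage cluster sampling (every individual in a chosen cluster is surveyed). The homerooms and student labels are hypothetical, and cluster_sample is a helper written here.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters; everyone in a picked cluster is sampled."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # randomly select clusters
    return {name: clusters[name] for name in chosen}    # keep all their members

# Hypothetical population: 8 homerooms (clusters) of 5 students each.
clusters = {
    f"homeroom {i}": [f"student {i}-{j}" for j in range(1, 6)]
    for i in range(1, 9)
}

sample = cluster_sample(clusters, 2, seed=3)
for name, members in sample.items():
    print(name, members)
```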
Advantages
cost-effective
efficiency
speed
natural grouping
Disadvantages
sampling bias
complexity
Groups are not similar
sample all from some groups
Systematic random sampling: members of a large population are selected from a random starting point at a fixed periodic interval
the sampling interval is calculated by dividing the population size by the desired sample size
confirm population total
determine sample size
determine sampling interval
select a random starting point
add the sampling interval repeatedly until the desired sample size is reached
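The steps above can be sketched in Python. The population of 100 numbered individuals is made up, and systematic_sample is a helper written for this example; it assumes the population is already in a list.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start, then every interval-th individual after it."""
    rng = random.Random(seed)
    interval = len(population) // sample_size   # population size / sample size
    start = rng.randrange(interval)             # random starting point
    return [population[start + i * interval] for i in range(sample_size)]

population = list(range(1, 101))   # individuals numbered 1..100
sample = systematic_sample(population, 10, seed=4)
print(sample)
```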
Feature | Cluster sampling | Systematic sampling | Stratified sampling | Simple random sampling |
Population | The population is divided into clusters/groups | The population is divided into intervals | The population is divided into strata or subgroups | The whole population is considered |
Sampling unit | Clusters are selected randomly and every individual in each selected cluster is surveyed | Every nth unit in the population is selected for surveying | Individuals within each stratum are randomly chosen for surveying | Individuals are randomly selected from the population for surveying |
Homogeneity within the sampling unit | Individuals within each cluster are ideally varied (each cluster resembles the population) | Assumes homogeneity within selected intervals | High homogeneity within each stratum/subgroup | Assumes homogeneity across the entire population |
Complexity | Fewer stages of the sampling method involved | Simple to implement with one-stage sampling | More stages of sampling involved | Simple to implement with one-stage sampling |
Undercoverage occurs when some members of the population are less likely to be chosen or cannot be chosen in a sample
example: a survey of households excludes homeless people, prisoners, and students in dormitories
Response bias occurs when there is a systematic pattern of inaccurate answers to a survey question.
Confounding occurs when 2 variables are associated in a way that their effects on a response variable cannot be distinguished from each other
possible different variable (3rd variable)
A treatment is a specific condition applied to the individuals in an experiment
A placebo is a treatment that has no active ingredient
A factor is an explanatory variable that is manipulated and may cause a change in the response variable
the different values of a factor are called levels
A control group is used to provide a baseline for comparing the effects of other treatments
2 different groups, 1 with treatment and 1 without (control group)
The placebo effect describes the fact that some subjects will respond favorably to any treatment, even an inactive one.
Single-blind: either the subjects or the people who interact with them and measure the response variable don’t know which treatment a subject is receiving
Double-blind: neither the subjects nor the people who interact with them know which treatment a subject is receiving
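A block is a group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response
In a randomized block design, the random assignment of treatments is carried out separately within each block
Matched pairs design is a common experimental design for comparing 2 treatments that uses blocks of size 2.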
Observational study: select people and observe them (no treatment is imposed)
Experiment: random assignment, treatment, causation
comparison
random assignment
control
replication
Randomized block design
you divide your participants into subgroups
within the blocks they have similarities
Statistical significance
helps decide whether the observed results are likely due to the factor of interest rather than chance
(chance or not?) (lucky or unlucky)
Scope of inference
random selection: allows inference about the population
random assignment: allows inference about cause and effect
A random sample will enable us to generalize our conclusions to the population from which we have sampled
when can we conclude causation? when treatments were randomly assigned
when can we generalize? when individuals were randomly selected
The probability of any outcome of a random process is a number between 0 and 1 that describes the proportion of times that outcome would occur in a very long series of trials.
an outcome that never occurs has a probability of zero
an outcome that appears/happens on every trial has a probability of 1
an outcome that happens 1/2 of the time has a probability of 1/2
The law of large numbers says that if we observe more and more trials of any random process, the proportion of times that a specific outcome occurs approaches its probability.
After many, many (trials in context), the proportion of times that (context A) will occur is about P(A).
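A quick Python sketch of the law of large numbers, assuming a simulated fair coin (p = 0.5). running_proportion is a helper written here, and the checkpoints are arbitrary; the fixed seed only makes the run repeatable.

```python
import random

def running_proportion(n_trials, p=0.5, seed=0):
    """Proportion of successes recorded at a few checkpoints along the way."""
    rng = random.Random(seed)
    hits = 0
    checkpoints = {}
    for trial in range(1, n_trials + 1):
        hits += rng.random() < p                    # True counts as 1
        if trial in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[trial] = hits / trial
    return checkpoints

props = running_proportion(100_000)
for n, prop in props.items():
    print(n, round(prop, 4))   # proportions settle toward 0.5 as n grows
```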
Simulation
describe how to set up (use a random process)
identify what you're recording
perform many trials
use results to answer question
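The four simulation steps can be sketched in Python. The question (chance of at least one six in 4 rolls of a die) is made up for illustration, and the helper names are my own; the exact value is included only to compare against the estimate.

```python
import random

def at_least_one_six(rng, rolls=4):
    # One trial: roll a die `rolls` times and record whether any roll is a 6.
    return any(rng.randint(1, 6) == 6 for _ in range(rolls))

def simulate(n_trials=20_000, seed=0):
    # Perform many trials and return the proportion of successes.
    rng = random.Random(seed)
    successes = sum(at_least_one_six(rng) for _ in range(n_trials))
    return successes / n_trials

estimate = simulate()
exact = 1 - (5 / 6) ** 4     # complement of "no six in 4 rolls"
print(round(estimate, 3), round(exact, 3))
```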
A sample space is the set of all possible outcomes of a random process (experiment or simulation).
Mutually Exclusive Events
2 or more events that cannot occur at the same time
disjoint events: P(A or B) = P(A) + P(B)
RULES
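the probability of any event is between zero and one
the probabilities of all outcomes add up to one
the probability that an event does not occur is one minus the probability that it does (complement rule)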
Venn diagrams
represented by 2 circles that overlap to show a relationship
General Addition Rule
if A and B are 2 events resulting from the same random process: P(A or B) = P(A) + P(B) - P(A and B)
Intersection (∩) (and)
all outcomes that are common to both events
Union (∪) (or)
all outcomes that are in either event or in both
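A small Python check of the general addition rule, assuming one roll of a fair die with equally likely outcomes. The events A (even) and B (greater than 3) are made up for illustration; Python's set operators & and | play the roles of intersection and union.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # event A: roll is even
B = {4, 5, 6}                       # event B: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_and = prob(A & B)   # intersection: outcomes in both A and B
p_or = prob(A | B)    # union: outcomes in A, in B, or in both

print(p_and, p_or)                           # 1/3 2/3
print(p_or == prob(A) + prob(B) - p_and)     # True
```

Subtracting P(A and B) removes the outcomes 4 and 6 that would otherwise be counted twice. For disjoint events that term is 0.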
Conditional probability
the probability that one event happens given that another event is known to have happened: P(A | B) = P(A and B) / P(B)
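A minimal Python sketch of a conditional probability, assuming one roll of a fair die. The events are made up for illustration, and the formula used is P(A | B) = P(A and B) / P(B).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                       # event A: roll is even
B = {4, 5, 6}                       # event B: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# Given that B happened, only the outcomes in B are still possible.
p_a_given_b = prob(A & B) / prob(B)
print(p_a_given_b)   # 2/3
```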
A tree diagram shows the sample space of a random process including multiple stages.
and can be used to calculate probabilities (multiply along the branches)
A random variable takes numerical values that describe the outcomes of a random process.
A probability distribution gives the possible values and their probabilities
to be a valid probability model
every probability is between zero and one
sum of all probabilities = 1 (100%)
Discrete random variable
takes a countable set of possible values with gaps between them on a number line (often whole numbers only)
Histogram of probability distribution (no gaps)
x-axis: values of the random variable
y-axis: probability
one bar for each x-value
The mean (expected value) of a discrete random variable is its average value over many, many trials of the same random process
to find it, multiply each possible value by its probability, then add the products
If many, many (context) are randomly selected, the average (random variable in context) would be about _____ (units).
The median of a discrete random variable is the 50th percentile (midpoint) of its probability distribution
can be found from a cumulative probability distribution for the random variable
The standard deviation of a discrete random variable measures how much its values typically vary from the mean
If many, many (context) are randomly selected, the (random variable in context) will typically vary from the mean by about (standard deviation) (units).
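A short Python sketch of the mean, standard deviation, and median of a discrete random variable. The probability distribution is made up for illustration; the median is taken as the smallest value whose cumulative probability reaches 0.5.

```python
import math

# Made-up probability distribution: value -> probability.
dist = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# Mean: multiply each value by its probability, then add.
mu = sum(x * p for x, p in dist.items())

# Standard deviation: square root of the probability-weighted squared deviations.
sigma = math.sqrt(sum(p * (x - mu) ** 2 for x, p in dist.items()))

# Median: first value where the cumulative probability reaches 0.5.
cumulative = 0.0
median = None
for x in sorted(dist):
    cumulative += dist[x]
    if cumulative >= 0.5:
        median = x
        break

print(round(mu, 2), round(sigma, 2), median)  # 1.7 0.9 2
```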
adding/subtracting a constant a
adds (subtracts) a to measures of center (mean or median)
doesn’t change variability
doesn’t change the shape of the probability distribution
multiplying/dividing by a constant b
multiplies (divides) measures of center by b
multiplies (divides) measures of variability by |b|
doesn’t change the shape of the distribution
Effect of a linear transformation Y = a + bX on a random variable
Y has the same shape as the probability distribution of X (if b > 0)
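A rough Python check of how a linear transformation Y = a + bX changes the mean and standard deviation. The distribution of X and the constants a = 5, b = 2 are made up for illustration; the sketch assumes the rules mean of Y = a + b(mean of X) and SD of Y = |b|(SD of X).

```python
import math

def mean_sd(dist):
    """Mean and standard deviation of a discrete distribution {value: prob}."""
    mu = sum(x * p for x, p in dist.items())
    sigma = math.sqrt(sum(p * (x - mu) ** 2 for x, p in dist.items()))
    return mu, sigma

dist_x = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}   # made-up distribution of X
a, b = 5, 2                                  # made-up constants

# Transform every value of X; the probabilities stay attached to their values.
dist_y = {a + b * x: p for x, p in dist_x.items()}

mu_x, sigma_x = mean_sd(dist_x)
mu_y, sigma_y = mean_sd(dist_y)

print(round(mu_y, 2), round(a + b * mu_x, 2))        # 8.4 8.4
print(round(sigma_y, 2), round(abs(b) * sigma_x, 2)) # 1.8 1.8
```

Adding a shifts the center only; multiplying by b stretches both the center and the spread.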