What is the standard error of a statistic?
the standard deviation of a sampling distribution
How is the standard error of a statistic calculated?
sd(means), where means holds the statistic (e.g., the sample mean) computed on many repeated samples; see the sketch below
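A minimal sketch of the simulation idea, assuming a hypothetical population vector pop and a sample size of 25 (the names and numbers are illustrative):
set.seed(42)
pop <- rnorm(10000, mean = 50, sd = 10)           # hypothetical population
means <- replicate(1000, mean(sample(pop, 25)))   # 1,000 sample means, each from a sample of n = 25
sd(means)                                         # simulated standard error of the mean
sd(pop) / sqrt(25)                                # theoretical value for comparison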
What is the basic interpretation of a confidence interval?
“I am 95% confident that the true mean lies within the given interval”
How does a confidence interval relate to the precision and uncertainty of a statistical estimate?
A confidence interval is a range of values that is likely to contain the true population parameter with a certain level of confidence. The precision of a statistical estimate is related to the width of the confidence interval, with a narrower interval indicating greater precision. The uncertainty of a statistical estimate is related to the level of confidence chosen for the interval, with a higher confidence level indicating greater uncertainty as the interval will be wider.
What are the basic variable types in R?
numeric, integer, factor, and character
How is a numeric variable defined in R?
data that consists only of numbers (decimals or whole numbers, positive or negative)
How is an integer variable defined in R?
data that consists of only whole numbers
How is a factor variable defined in R?
In R, a factor variable is defined as a categorical variable that can take on a limited set of values, known as levels. It is created using the factor()
function, which takes a vector of categorical values and converts it into a factor variable. For example, the following code creates a factor variable called "gender" with two levels, "male" and "female":
gender <- factor(c("male", "female", "male", "male", "female"))
Once created, factor variables can be used in statistical analyses and visualizations to explore relationships between categorical variables.
How is a character variable defined in R?
data that consists of letters/words
What is the difference between observational and experimental data?
Observational data comes from observing certain variables and trying to determine if there is any correlation, while experimental data comes from studies in which you control certain variables and try to determine if there is any causality
How do observational and experimental data differ in terms of study design?
The key difference between observational studies and experimental designs is that a well-done observational study does not influence the responses of participants, while experiments do have some sort of treatment condition applied to at least some participants by random assignment
What are the advantages of observational data?
relatively quick, inexpensive, and easy to undertake
What are the disadvantages of observational data?
can be susceptible to other types of bias
What are the advantages of experimental data?
researchers have control over variables to obtain results, the subject does not impact the effectiveness of the research, results are specific
What are the disadvantages of experimental data?
time-consuming process, not all situations can be created in an experiment, expensive
What are the main challenges of establishing causality with observational data?
lurking variables
What is a common source of bias that can impact observational studies?
sample selection bias - anytime that selection into a sample depends on the value of the dependent variable
What are the "Big Three" criteria for establishing causality?
When X changed, Y also changed. If X changes and Y doesn't change, then we cannot assert that X causes Y.
X happened before Y. If X happens after Y then X cannot cause Y.
Nothing else besides X changed systematically.
What is the treatment in an experiment?
what is randomly assigned to each subject
What is the response in an experiment?
what is measured
How can confidence intervals be used to compare group averages in statistical analysis?
Check if 0 is contained in the confidence interval. If it is, then there is no significant difference between the group averages. If not, we can conclude which treatment produces a better response
What is the process for calculating confidence intervals for group means?
t.test(response ~ treatment, data = df)
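A minimal worked example with made-up data (the data frame df and its columns response and treatment are placeholders):
set.seed(1)
df <- data.frame(treatment = rep(c("A", "B"), each = 20),
                 response  = c(rnorm(20, mean = 10), rnorm(20, mean = 12)))
t.test(response ~ treatment, data = df)   # output includes a 95% CI for the difference in group means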
How can confidence intervals be used to compare group proportions in statistical analysis?
Check if 0 is contained in the confidence interval. If it is, then there is no significant difference between the proportions of responses. If not, we can conclude there is a difference between the treatments
What is the process for calculating confidence intervals for group proportions?
prop.test(x = c(successes1, successes2), n = c(n1, n2)), where x holds the number of "yes" responses in each group and n holds the group sizes
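For example, with hypothetical counts of 45 successes out of 200 in group 1 and 60 out of 200 in group 2:
prop.test(x = c(45, 60), n = c(200, 200))   # the 95% CI is for the difference in proportions; check whether it contains 0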
What is power?
the probability that we will not make a type 2 error (1 - β); the probability that we will detect a difference when there actually is one
When can a type 1 error occur?
if the null hypothesis is true
When can a type 2 error occur?
if the null hypothesis is false
What is the power calculation for a test of means?
power.t.test(n, delta, sd, sig.level)
How is power used in statistical analysis to determine the sample size needed for a given study?
power.t.test(power, sig.level, delta, sd, n=NULL)
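For example, leaving n unspecified makes R solve for the per-group sample size (the power, effect size, and sd values here are illustrative assumptions):
power.t.test(power = 0.80, sig.level = 0.05, delta = 2, sd = 5)   # returns the n per group needed to detect a difference of 2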
What is the power calculation for a test of proportions?
power.prop.test(n, p1, p2, sig.level)
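For example, with 100 subjects per group and hypothetical proportions of 0.10 and 0.15:
power.prop.test(n = 100, p1 = 0.10, p2 = 0.15, sig.level = 0.05)   # returns the power to detect this difference in proportions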
What is the difference between p1 and p2?
The difference we are looking to detect between the proportions
What happens to power when effect size increases?
Power increases
What happens to power as standard deviation increases?
Power decreases
What happens to power as the alpha (significance) level increases?
Power increases
How can you increase power?
increase sample size
What is the beta-binomial model?
an alternative way to assess our uncertainty about the sample proportion
How can the beta-binomial model be used to compare two proportions in statistical analysis?
Running 100,000 iterations using the number of successes and failures +1, the output indicates the % chance that the true success rate for one is greater than the other in the long run
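A minimal sketch of that simulation, assuming hypothetical counts of 20/100 successes in group A and 30/100 in group B (the +1 on the successes and failures corresponds to a uniform prior):
set.seed(1)
a <- rbeta(100000, 20 + 1, 80 + 1)   # plausible true success rates for A
b <- rbeta(100000, 30 + 1, 70 + 1)   # plausible true success rates for B
mean(b > a)                          # estimated chance that B's true success rate exceeds A's in the long run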
What are the two forms of randomness that are critical to ensuring valid statistical results from an experiment?
Take a random sample
Randomly assign treatments to units in that sample
What factors should be considered when deciding to utilize an analysis of variance (ANOVA) for the analysis of A/B/n tests in experimental studies?
If we are analyzing means
If the response variable is numeric, or continuous
What are the criteria for determining the appropriate use of a chi-square test for analyzing A/B/n tests in experimental studies?
If we are analyzing proportions
Why is the Tukey HSD (Honestly Significant Difference) test necessary for comparing all pairs of treatment groups in statistical analysis?
The test compares each treatment mean with all of the other treatment means. At the same time, it controls the type 1 error rate
What are the key considerations when applying the Tukey HSD test?
It is used when the p-value of an anova test is significant to determine which of the treatments are different from the others
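In R, the test is typically run on the fitted ANOVA object (df, response, and treatment are placeholder names):
fit <- aov(response ~ treatment, data = df)
summary(fit)     # if the overall treatment p-value is significant...
TukeyHSD(fit)    # ...compare every pair of treatment means while controlling the type 1 error rate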
What is a covariate?
a measurement on an experimental unit that is not controlled by the experimenter
Are covariates ever randomized?
No, because they are not part of the experiment
How do causality and covariates connect?
Causality cannot be applied to a covariate because the treatment is not randomized
What happens when the number of subgroup analyses performed increases?
The probability of a Type 1 error increases
If we have 15 covariates, what is the probability of making at least one type 1 error? Assume alpha=0.05
1-.95^15 = .5367
What is Simpson's Paradox?
a phenomenon where the results of an analysis of aggregated data are the reverse of the subgroup results
What must happen for Simpson’s Paradox to occur?
The subgroups need to be different sizes
There exists a lurking variable that is related to both the treatment and the response
What is a blocking variable?
a categorical variable that explains some of the variation in the response variable
What makes an RCBD (randomized complete block design)?
each treatment appears within each block; the designs are balanced in that each treatment appears the same number of times in each block, namely once
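A minimal sketch of the analysis, assuming columns named response, block, and treatment:
fit <- aov(response ~ block + treatment, data = df)   # block is a factor that soaks up block-to-block variation
summary(fit)                                          # test the treatment effect after accounting for blocks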
Describe a Latin Squares Design
each treatment appears exactly once in each row and each column; the rows and columns each represent a blocking factor; the treatments are arranged within the square
What is special about Latin Squares Designs?
They create orthogonal data, which means the estimates from the design will be unbiased
How to analyze an LSD?
run anova, check if treatment has a significant p-value
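A minimal sketch, assuming the two blocking factors are stored in columns named row and column:
fit <- aov(response ~ row + column + treatment, data = df)   # two blocking factors plus the treatment
summary(fit)                                                 # check whether the treatment p-value is significant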
How does replication impact MSE?
it produces a better estimate of MSE by providing more degrees of freedom
In an ANOVA analysis, why would I be concerned with having a good estimate of the Mean Squared Error?
It is the yardstick against which we measure all of our effects
What are the two ways to execute a Latin squares design with replication?
All experimental runs are carried out at once
One set is carried out at one time/location, etc., and then the next set is carried out at another
What is the significance of orthogonality in factorial designs?
It ensures that a change in Factor A implies a change in y and nothing else is changing that would change y; it is the best structure for isolating effects
What is the main effect of a factor?
the average change in the response due to a change in the factor
How to calculate a main effect by hand in a factorial design?
Hand written notes
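As a quick sketch of the standard calculation (not the hand-written notes; df, response, and the "+"/"-" coding of factor A are illustrative assumptions):
mean(df$response[df$A == "+"]) - mean(df$response[df$A == "-"])   # main effect of A: average response at the high level minus average at the low level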
What is an interaction?
The effect of A on the response depends on the level of factor B
When can you not perform any inference in a factorial experiment?
when you have only one observation at each treatment combination = no way to estimate the experimental error
How can you ensure you can perform inference on a factorial experiment?
replication
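For example, with replicates a two-factor model with interaction can be fit and tested (the names are placeholders):
fit <- aov(response ~ A * B, data = df)   # A * B expands to A + B + A:B
summary(fit)                              # replication leaves degrees of freedom to estimate error and test each effect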
What are the assumptions for a valid regression model?
The probability distribution of epsilon has constant variance for all settings of the predictors (i.e., across all predicted values)
The probability distribution of epsilon is normally distributed
Any two values from the probability distribution of epsilon are independent
The mean of epsilon is 0
What plots should be viewed when checking the assumptions?
Plot the residuals vs. ŷ (the predicted values).
Plot the residuals vs. run order.
Create a normal probability plot of the residuals.
Identify outliers
What is the procedure for analyzing a two-level factorial design without replicates using a half normal plot?
Hand written notes
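A minimal base-R sketch of one common approach (not the hand-written notes), assuming a single-replicate design with factors A, B, C coded as -1/+1:
fit <- lm(y ~ A * B * C, data = df)          # full factorial model, one observation per run
eff <- 2 * coef(fit)[-1]                     # with -1/+1 coding, each effect is twice its regression coefficient
abs_eff <- sort(abs(eff))
q <- qnorm(0.5 + 0.5 * (seq_along(abs_eff) - 0.5) / length(abs_eff))   # half-normal quantiles
plot(q, abs_eff)                             # effects falling well off the line are the active ones
text(q, abs_eff, labels = names(abs_eff), pos = 2)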
What does a design matrix show?
The specific treatment combinations
What is a model matrix used for?
shows all the columns that will get estimates in the model
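For example, for a two-factor model with interaction (A, B, and df are placeholders):
model.matrix(~ A * B, data = df)   # one column per term that will receive an estimate: intercept, A, B, and A:B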
What are the estimable effects that can be listed from a 2^k factorial experiment?
As the number of factors increases, the number of effects available to estimate also increases. For example, if you have three factors A, B, and C, each varied at two levels, you can estimate: A, B, and C (main effects); AB, AC, and BC (two-way interactions); and ABC (the three-way interaction)
What is the problem with a design that has no replication?
No replication = no inference because there are no df left over to estimate the error in the model
What kind of model has no d.f. left over?
A 2^2 factorial with no replication
When looking at a Half Normal plot, do we care about the points on or off the line?
Off
What should be checked after making a model following a half-normal plot (no replication)
plot(reg$fitted.values, reg$residuals) # check for a pattern (residuals vs. fitted values)
plot(reg$residuals) # check for a trend over run order
qqnorm(reg$residuals); qqline(reg$residuals) # check for normality
What should a rate response actually be analyzed with?
a logistic regression model
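A minimal sketch, assuming counts of successes and failures per run and a treatment column (the names are illustrative):
fit <- glm(cbind(successes, failures) ~ treatment, family = binomial, data = df)   # models the success rate directly
summary(fit)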
What is a confirmation run?
a test to confirm the rate you expect based on the optimal conditions
What are the fundamental principles and concepts of the Six Sigma Project Management Philosophy?
The philosophical perspective of Six Sigma views all work as processes that can be defined, measured, analyzed, improved, and controlled. Processes require inputs (x) and produce outputs (y). If you control the inputs, you will control the outputs. This is generally expressed as y = f(x).
What does DMAIC stand for in the context of the Six Sigma Project Management Philosophy?
D=Define
M=Measure
A=Analyze
I=Improve
C=Control
What is the distinction between common cause variation and assignable cause variation in the context of process variation analysis?
In many processes, regardless of how well-designed they are or carefully maintained, a certain amount of inherent or natural variability will always exist. We refer to this natural or background or random variability as Common Cause Variation.
A process that is in operation with only common cause variation present is said to be operating in statistical control or in control.
Other kinds of variability may occasionally be present in the process. A measurement could be wrong or the temperature might be off, etc.
This type of variation is typically large when compared to the background variation and typically represents an unacceptable level of process performance. We refer to these sources of variation as Assignable Cause Variation. A process that is operating in the presence of assignable cause variation is said to be out of control.
What is the fundamental framework of a control chart, and how does it contribute to the analysis and management of process control?
A typical control chart plots the quality characteristic over time.
Center line (CL) = average value of the quality characteristic
Upper Control Limit (UCL)
Lower Control Limit (LCL)
The UCL and LCL are chosen such that if the process is in control nearly all of the sample points will fall between the control limits
What are the fundamental concepts and applications of Xbar and R-charts in statistical process control?
Xbar chart - captures between subgroup variability
R-chart - captures within subgroup variability; the center line is the average range
How can you effectively assess Xbar and R-charts to identify and analyze assignable causes in statistical process control?
Both charts have to be in control for the process to be in control. If either chart is out of control then there are assignable causes present
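A minimal sketch using the qcc package (an assumption, not required by the notes), where each row of the matrix samples is one subgroup of measurements:
library(qcc)
qcc(samples, type = "xbar")   # Xbar chart: monitors between-subgroup variability
qcc(samples, type = "R")      # R chart: monitors within-subgroup variability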