data science quiz three

Last updated 6:05 AM on 4/24/26

50 Terms

hypothesis test

a statistical technique used to evaluate competing claims using data

null hypothesis (Ho)

an assumption about the population. “there is nothing going on”

alternative hypothesis (Ha)

a research question about the population. “there is something going on”

what is the motivation behind a hypothesis test

decision

what is the motivation behind a confidence interval

estimation

one sided (one tailed) alternative hypothesis

the parameter is hypothesised to be either less than or greater than the null value (< or >), in one stated direction only

two sided (two tailed) alternative hypothesis

the parameter is hypothesised to be not equal to the null value

what are two characteristics of two sided alternatives

the p-value is calculated as two times the tail area beyond the observed sample statistic; more objective and hence more widely preferred

state the hypothesis for an independent case

null hypothesis: the observed difference in proportions is simply due to chance; Ho: p(treatment) - p(control) = 0

state the hypothesis for a dependent case

alternative hypothesis: the observed difference in proportions is not due to chance; Ha: p(treatment) - p(control) ≠ 0

explain the randomisation process

randomly shuffle the rows in the data frame
split off the first 16 rows and set them aside - these represent the people in the control group
split off the final 34 rows and set them aside - these represent the people in the treatment group
calculate the proportion of people in both groups who yawned
calculate the difference in proportions of yawners (treatment - control) and plot it on the chart
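
The shuffle-and-split steps above can be sketched in base R. This is only an illustration: the data frame `yawn`, its column `result`, the count of 14 yawners, and the helper `one_shuffle()` are assumptions made for the example and do not come from the cards.

```r
set.seed(123)  # make the shuffles reproducible

# Hypothetical data: 50 people, 14 of whom yawned (illustrative counts)
yawn <- data.frame(result = c(rep("yawn", 14), rep("no yawn", 36)))

# One run of the randomisation process
one_shuffle <- function(df) {
  shuffled  <- df[sample(nrow(df)), , drop = FALSE]  # shuffle the rows
  control   <- shuffled[1:16, , drop = FALSE]        # first 16 rows
  treatment <- shuffled[17:50, , drop = FALSE]       # final 34 rows
  # difference in proportions of yawners (treatment - control)
  mean(treatment$result == "yawn") - mean(control$result == "yawn")
}

# Repeat the process many times to collect simulated differences
sim_diffs <- replicate(1000, one_shuffle(yawn))
```

Calling `hist(sim_diffs)` would then plot the simulated differences, which should pile up around zero because the shuffling breaks any link between group and yawning.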

write the code for simulation by computation

explain the steps for writing the code for simulation by computation

start with the data frame
specify the variables
state the null hypothesis
generate simulated differences via permutation
calculate the sample statistic of interest

write the code for calculating the p value

significance level

the cutoff value for whether the p-value is low enough that the data are unlikely to have come from the null model

when is Ho rejected

if p-value < alpha, reject Ho in favour of Ha - the data provide convincing evidence for the alternative hypothesis

when is Ho not rejected

if p-value > alpha, fail to reject Ho - the data do not provide convincing evidence for the alternative hypothesis
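
The decision rule from this card and the previous one can be written as a tiny base R helper. The function name `decide()` and the default `alpha = 0.05` are assumptions for illustration only.

```r
# Hypothetical helper: compare a p-value with the significance level alpha
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject Ho" else "fail to reject Ho"
}

decide(0.01)  # p-value below alpha
decide(0.20)  # p-value above alpha
```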

false positive

rejecting the null hypothesis when it is true (Type 1 error)

false negative

failing to reject the null hypothesis when it is false (Type 2 error)

assumptions of the central limit theorem

the theorem states that the sampling distribution of a sample statistic such as the mean is approximately normal, provided that:
observations in the sample are independent
the sample size is sufficiently large
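
A small base R simulation can make this concrete. It is a sketch under assumed settings: the Exponential(1) population, the sample size `n = 50`, and the number of repetitions are choices made for the example, not values from the cards.

```r
set.seed(123)
n    <- 50    # sample size (assumed to be "sufficiently large" here)
reps <- 5000  # number of independent samples to draw

# Each sample is n independent draws from a skewed Exponential(1) population
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to 1 / sqrt(n)
```

Even though the population is strongly skewed, a histogram of `sample_means` should look roughly bell-shaped.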

sketch the normal distribution curve

p-value

the probability of observing a test statistic at least as extreme as the one computed from the sample data, assuming the null hypothesis is true

why are permutation-based approaches used

they repeat simulations to estimate the distribution of the test statistic under the null hypothesis

code for generating the null distribution

set.seed(123)

what is the code for visualising a simulated p-value

visualize(null_dist) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")

what is the code for calculating a simulated p-value

null_dist |>
  get_p_value(obs_stat = d_hat, direction = "two-sided")
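
To see what a two-sided simulated p-value amounts to, here is a base R sketch of the "two times the tail area" rule. It is not the package's implementation: the helper `two_sided_p()` and the toy vector `null_stats` are hypothetical and exist only for this example.

```r
# Hypothetical helper: two-sided p-value from simulated null statistics
two_sided_p <- function(null_stats, obs_stat) {
  left  <- mean(null_stats <= obs_stat)  # tail area at or below the observed value
  right <- mean(null_stats >= obs_stat)  # tail area at or above the observed value
  min(1, 2 * min(left, right))           # double the smaller tail, capped at 1
}

# Toy null distribution with five simulated statistics
null_stats <- c(-2, -1, 0, 1, 2)
two_sided_p(null_stats, obs_stat = 2)  # 2 * (1 / 5) = 0.4
```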

what function displays linear regression

fit() fits the model; tidy() displays the estimated coefficients

statistical inference

the process of using sample data to make conclusions about the underlying population from which the sample came

estimation

uses data from samples to calculate sample statistics (mean, median, slope) which can then be used as estimates for population parameters

hypothesis testing

uses data from samples to calculate p-values, which can then be used to evaluate competing claims about the population

confidence intervals

a plausible range of values for a population parameter; need to quantify the variability of the sample statistic in order to construct one

code for sampling without replacement

sample(x = 1:10, size = 10, replace = FALSE)

code for sampling with replacement

sample(x = 1:10, size = 10, replace = TRUE)
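
The difference between the two calls is easiest to see by running both. In this sketch the variable names `draw_without` and `draw_with` are assumptions for illustration.

```r
set.seed(123)
draw_without <- sample(x = 1:10, size = 10, replace = FALSE)
draw_with    <- sample(x = 1:10, size = 10, replace = TRUE)

# Without replacement, every value appears exactly once (a reshuffle of 1:10)
anyDuplicated(draw_without) == 0

# With replacement, values may repeat, so some of 1:10 can be missing
table(draw_with)
```

Sampling without replacement is what permutation uses, while sampling with replacement is what bootstrapping uses.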

explain the bootstrapping scheme

1. take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
2. calculate the bootstrap statistic - a statistic such as the mean, median, proportion, or slope computed on the bootstrap sample
3. repeat steps 1 and 2 many times to create a bootstrap distribution
4. calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution
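
The scheme above can be sketched in base R for a mean. The sample `x` holds made-up values, and the choices of 2000 repetitions and a 95% level are assumptions for the example.

```r
set.seed(123)

# Hypothetical original sample (values are made up for illustration)
x <- c(12, 15, 9, 20, 14, 11, 17, 13, 16, 10)

# Resample with replacement at the original size, compute the mean, repeat
boot_means <- replicate(
  2000,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Bounds of the interval: the middle 95% of the bootstrap distribution
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

The bootstrap distribution is centred near the observed sample mean, so the interval brackets that value.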

code for taking a bootstrap sample

economy_boot_1 <- economy |>
  slice_sample(n = nrow(economy), replace = TRUE)

explain the difference between confidence intervals and p values

confidence interval: range of plausible values for the population parameter; distribution centred around the observed sample statistic
p value: probability of observing data at least as extreme as the sample, given the null hypothesis is true; distribution centred around the value from the null hypothesis
a 95% confidence interval in practice corresponds to a two-sided hypothesis test with alpha = 0.05

code for calculating mean

calculate(stat = "mean")

code for obtaining the confidence interval

get_ci(x = boot_df, level = 0.95)

entire code for the chipotle confidence interval problem

modeling

the use of models to explain the relationship between variables and to make predictions

linear models

classic forms used for statistical inference

nonlinear models

much more common in machine learning for prediction

correlation

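measures the strength and direction of the linear relationship between two quantitative variables; ranges between -1 and 1 and has the same sign as the regression slope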

regression model

a function that describes the relationship between the outcome and the predictor; Y = Model + Error

simple linear regression

used to model the relationship between a quantitative outcome and a single quantitative predictor

residual formula

observed - predicted

least squares line

minimises the sum of squared residuals

code for simple linear regression

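library(tidymodels)  # attaches parsnip (linear_reg, fit) and broom (tidy)

# fit a simple linear regression of audience score on critics score
movies_fit <- linear_reg() |>
  fit(audience ~ critics, data = movie_scores)

# display the estimated intercept and slope
tidy(movies_fit)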

properties of least squares regression

The regression line goes through the center of mass point (the coordinates corresponding to average x and y coordinates)
Slope has the same sign as the correlation coefficient
Sum of the residuals is zero
Residuals and predictor (x) values are uncorrelated
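
These properties can be checked numerically. The sketch below uses base R's `lm()` rather than the `linear_reg() |> fit()` workflow from the earlier card, and the data `x` and `y` are simulated purely for illustration.

```r
set.seed(123)

# Hypothetical data, simulated only for illustration
x <- runif(40, min = 0, max = 100)
y <- 20 + 0.5 * x + rnorm(40, sd = 10)

model <- lm(y ~ x)
res   <- resid(model)  # observed - predicted

# The line passes through (mean(x), mean(y)): this difference is about zero
centre_gap <- predict(model, newdata = data.frame(x = mean(x))) - mean(y)

# The slope and the correlation coefficient share a sign
same_sign <- sign(coef(model)[["x"]]) == sign(cor(x, y))

# The residuals sum to zero (up to floating-point error)
sum(res)

# The residuals are uncorrelated with the predictor
cor(res, x)
```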

in what context is the intercept meaningful

when the predictor has values near zero