This set of flashcards covers key concepts from the bootstrapping, clustering, and regression segments of the lecture on data science and R programming.
What is the purpose of bootstrapping?
To draw observations from the original sample with replacement and create a distribution of point estimates.
What function is used in R to generate a single bootstrap sample?
rep_sample_n() with replace = TRUE.
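A minimal sketch of drawing one bootstrap sample, assuming the infer package is installed. The `heights` data frame is made up for illustration and is not from the lecture.

```r
library(infer)

set.seed(1)  # make the resample reproducible

# Hypothetical data: 20 observations of one numeric variable
heights <- data.frame(height = rnorm(20, mean = 170, sd = 10))

# One bootstrap sample: same size as the original, drawn with replacement
boot_one <- rep_sample_n(heights, size = nrow(heights), replace = TRUE, reps = 1)

nrow(boot_one)   # same number of rows as the original sample
names(boot_one)  # includes a `replicate` column identifying the resample
```

Because rows are drawn with replacement, some original observations will usually appear more than once and others not at all.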
How do you create a sample for a bootstrapping distribution?
What package is used for bootstrapping in R?
The infer package.
What is the first step to get a 95% confidence interval for a given sample mean?
Take a large number of bootstrap samples using rep_sample_n().
What is the command to visualize the distribution of a bootstrap sample in R?
Use ggplot() with geom_histogram().
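As a rough illustration, the histogram might be drawn as below. The sample `x` is invented, and the bootstrap means are built with base R's `sample()` here only to keep the example short.

```r
library(ggplot2)

set.seed(1)
x <- rnorm(30, mean = 50, sd = 5)  # hypothetical original sample

# 1000 bootstrap means: resample x with replacement, take the mean each time
boot_means <- data.frame(
  mean_x = replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))
)

# Histogram of the bootstrap distribution of the sample mean
p <- ggplot(boot_means, aes(x = mean_x)) +
  geom_histogram(bins = 30)
p
```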
What does 'set.seed()' do in R?
Sets the seed for random number generation to ensure reproducibility.
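A small base R sketch of the effect: resetting the same seed reproduces the same random draws. The seed value 42 is arbitrary.

```r
set.seed(42)
first <- sample(1:100, 5)

set.seed(42)               # reset to the same seed
second <- sample(1:100, 5)

identical(first, second)   # TRUE: same seed, same draws
```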
What is the significance of using 'nstart' in K-means clustering?
It sets the number of random initializations to try; kmeans() keeps the run with the lowest total within-cluster sum of squares.
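A minimal sketch with base R's `kmeans()`, using simulated points (an assumption for illustration) to show where `nstart` goes.

```r
set.seed(1)

# Hypothetical two-dimensional data with two loose groups of 50 points each
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2)
)

# nstart = 25: run 25 random initializations and keep the best one
fit <- kmeans(pts, centers = 2, nstart = 25)

fit$tot.withinss    # total within-cluster sum of squares of the best run
table(fit$cluster)  # cluster sizes
```

With `nstart = 1`, a single unlucky starting configuration can leave the algorithm in a poor local optimum; more starts reduce that risk at the cost of extra computation.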
What does the augment function do in the context of K-means clustering?
It adds cluster assignments to the original data.
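A short sketch, assuming the card refers to `augment()` from the broom package. The data frame `dat` is simulated for illustration.

```r
library(broom)

set.seed(1)
dat <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 5)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 5))
)

fit <- kmeans(dat, centers = 2, nstart = 10)

# augment() returns the original rows plus a .cluster column
clustered <- augment(fit, dat)
names(clustered)
```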
What does a lower RMSE indicate in model performance?
It indicates better predictive accuracy of the model.
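For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A tiny worked example with made-up numbers:

```r
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

# Errors are 1, 0, -1, -2; squared errors are 1, 0, 1, 4; their mean is 1.5
rmse <- sqrt(mean((actual - predicted)^2))
rmse  # sqrt(1.5), about 1.22
```

RMSE is in the same units as the response variable, so it is only comparable between models predicting the same quantity.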
What is cross-validation used for?
To estimate how well a model will perform on unseen data and to help detect overfitting.
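One way to sketch the idea is a hand-rolled 5-fold cross-validation in base R. The data, the linear model, and the choice of 5 folds are assumptions for illustration; the lecture may have used a different workflow.

```r
set.seed(1)

# Hypothetical data: y depends linearly on x, plus noise
n <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 * dat$x + rnorm(n)

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row

# For each fold: fit on the other folds, evaluate on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))
})

mean(fold_rmse)  # cross-validated estimate of RMSE
```

Every observation is used for validation exactly once, so the averaged error is less dependent on one particular train/test split.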
What is the primary advantage of using version control?
It provides transparency and tracks changes in project documents.
Why is Git's staging area useful?
It allows selective committing of changes, associating specific messages with specific files.
What is one disadvantage of not visualizing clusters in high-dimensional data?
It is challenging to assess model performance without visualizing clusters.
What type of plot is useful for examining K-means clustering results?
A scatter plot colored by cluster assignments.
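A sketch of such a plot with ggplot2, using simulated two-dimensional data (hypothetical, not from the lecture).

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(
  x = c(rnorm(30, mean = 0), rnorm(30, mean = 5)),
  y = c(rnorm(30, mean = 0), rnorm(30, mean = 5))
)

fit <- kmeans(dat, centers = 2, nstart = 10)

# Store the assignment as a factor so ggplot uses discrete colors
dat$cluster <- factor(fit$cluster)

p <- ggplot(dat, aes(x = x, y = y, color = cluster)) +
  geom_point()
p
```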
How do you calculate the mean of bootstrap samples in R?
Group the samples by replicates and use summarize() to calculate means.
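The steps might look like the following, assuming the infer and dplyr packages. The data are simulated, and the percentile interval at the end is one common way to get a 95% confidence interval, not necessarily the one used in the lecture.

```r
library(infer)
library(dplyr)

set.seed(1)
sample_df <- data.frame(value = rnorm(40, mean = 10, sd = 2))  # hypothetical sample

# 1000 bootstrap samples, each the size of the original, drawn with replacement
boot <- rep_sample_n(sample_df, size = nrow(sample_df), replace = TRUE, reps = 1000)

# One mean per bootstrap replicate
boot_means <- boot %>%
  group_by(replicate) %>%
  summarize(mean_value = mean(value))

# Percentile-based 95% interval: middle 95% of the bootstrap means
ci <- quantile(boot_means$mean_value, probs = c(0.025, 0.975))
ci
```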
What should you consider when interpreting a classifier's accuracy?
It reflects the percentage of correct predictions out of total predictions made.
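A small base R illustration with invented labels: accuracy is the share of predictions that match the true class.

```r
truth <- factor(c("yes", "no", "yes", "yes", "no", "no",  "yes", "no"))
pred  <- factor(c("yes", "no", "no",  "yes", "no", "yes", "yes", "no"))

# 6 of the 8 predictions match the true label
accuracy <- mean(truth == pred)
accuracy  # 0.75

# The confusion matrix shows which kinds of errors were made
table(truth, pred)
```

Accuracy alone hides how the errors are distributed across classes, which is why the confusion matrix is worth checking alongside it.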
What function is used to predict values in regression models in R?
The predict() function.
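A minimal example with a linear model on the built-in `mtcars` data. The choice of model and the new weights are only for illustration.

```r
# Fit a simple linear regression: fuel economy as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)

# New observations must use the same predictor names as the model
new_cars <- data.frame(wt = c(2.5, 3.5))

preds <- predict(fit, newdata = new_cars)
preds  # one predicted mpg per row of new_cars
```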
What does geom_bar(position='fill') do in a ggplot?
Creates a stacked bar chart in which every bar has the same height, so the segments show proportions rather than counts.
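A quick sketch using the built-in `mtcars` data, chosen here only as a convenient example.

```r
library(ggplot2)

# Share of automatic (0) vs. manual (1) transmissions within each cylinder count
p <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "fill")
p
```

The y-axis runs from 0 to 1, so bars for groups of different sizes can be compared by composition.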
What is the output of the metrics function in R?
It returns a table of evaluation metrics, such as accuracy for classification models or RMSE for regression models.
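A short sketch, assuming the card refers to `metrics()` from the yardstick package. The `results` data frame is made up.

```r
library(yardstick)

# Hypothetical regression results: true values and model predictions
results <- data.frame(
  truth    = c(3, 5, 7, 9),
  estimate = c(2, 5, 8, 11)
)

# For numeric outcomes, metrics() reports regression metrics including RMSE
m <- metrics(results, truth = truth, estimate = estimate)
m
```

When `truth` and `estimate` are factors instead, the same call reports classification metrics such as accuracy.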
What does the filter function do in R?
It subsets a data frame based on certain conditions.
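For example, with dplyr's `filter()` on the built-in `mtcars` data (the conditions are arbitrary):

```r
library(dplyr)

# Keep only cars heavier than 3.5 (thousand lbs) with automatic transmission
heavy_auto <- filter(mtcars, wt > 3.5, am == 0)

nrow(heavy_auto)
```

Multiple conditions separated by commas are combined with a logical AND.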
Why might one avoid using a model with 75% accuracy for important applications?
Because 25% of predictions are incorrect, which is too high for critical tasks.
What does the 'labs()' function do in ggplot?
It adds labels to the plot such as titles and axis labels.
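An illustrative use on the built-in `mtcars` data. The label text is invented.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel economy vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )
p
```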
What is meant by 'sensitive to overfitting'?
It describes a model that tends to perform well on training data but poorly on unseen data.