The correct number of clusters in Hierarchical clustering can be determined precisely using approaches such as silhouette scores (True or False)
False
In K-means clustering, the analyst does not need to determine the number of clusters (K); these are always derived analytically by the k-means algorithm. (True or False)
False
One big difference between the unsupervised approaches in this module and the supervised approaches in prior modules: unsupervised models do not have a target variable (Y). This makes it difficult to know when they are "right" or correct. (True or False)
True
According to the documentation, a silhouette score of 1 is
The best score
According to the documentation, a silhouette score of -1 is
The worst score
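As a rough illustration of how silhouette scores might be compared across candidate values of K, here is a minimal sketch. The synthetic data from `make_blobs`, the range of K values tried, and the variable names are assumptions for illustration, not part of the course material.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette scores range from -1 (worst) to 1 (best)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("K with the highest silhouette score:", best_k)
```

The highest score is a guide for the analyst rather than a precise answer; on messier real data the scores for neighboring values of K are often close.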
Select all that apply. Imagine you have a data set with columns/inputs for customers:
Column 1 = Customer ID (a number)
Column 2 = Sales (a dollar value)
Column 3 = Frequency (a number)
Column 4 = Satisfaction (a number)
You would like to understand the impact of Frequency on customer Satisfaction. What types of approaches could you use?
Note that the type of data is given in brackets () after the column name.
Decision tree, random forest, linear regression
Select all that apply. Imagine you have a dataset with the following columns (inputs) for a set of customers.
Column 1 = Customer ID
Column 2 = Distance to Store
Column 3 = Yearly spend
Column 4 = Likelihood to return (a survey response that indicates a customer is likely to shop again)
What kind of approaches could you use to understand more about these customers? Why?
Regression - to understand the effect of one or more variables on the others
Clustering - to develop groups of customers that have similar patterns
What is the purpose of the following code?
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
rfm_std = scale.fit_transform(df)
To standardize the data
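To see what standardizing does, the following sketch applies the same three lines to a small made-up DataFrame and checks the result. The column names and values are hypothetical; only `StandardScaler` and `fit_transform` come from the card above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM-style data; the values are made up for illustration
df = pd.DataFrame({
    "recency": [10, 20, 30, 40],
    "frequency": [1, 2, 3, 4],
    "monetary": [100.0, 250.0, 400.0, 550.0],
})

scale = StandardScaler()
rfm_std = scale.fit_transform(df)  # returns a NumPy array, not a DataFrame

print(rfm_std.mean(axis=0))  # approximately 0 for every column
print(rfm_std.std(axis=0))   # approximately 1 for every column
```

Putting every column on the same scale matters for distance-based methods such as k-means, where a column measured in dollars would otherwise dominate a column measured in counts.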
The elbow method provides an exact number of clusters for a kmeans algorithm. (True or False)
False
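A minimal sketch of the elbow method, assuming synthetic data from `make_blobs` and an arbitrary range of K from 1 to 8 (both are illustrative choices, not from the course):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data; the number of true groups is an assumption for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Inertia keeps shrinking as K grows, so the analyst looks for the point where the improvement flattens out. Choosing that "elbow" is a judgment call, which is why the method does not give an exact number.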
Hierarchical clustering is more powerful than Kmeans, as it allows the researcher to determine the exact number of clusters to use in the analysis. (True or False)
False
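The sketch below uses scikit-learn's `AgglomerativeClustering` as one possible hierarchical method (the course may have used a different tool). The data and the candidate cluster counts are made up; the point is only that the analyst still supplies the number of clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

cluster_counts = {}
for k in (2, 3, 4):
    # n_clusters is chosen by the analyst, not discovered by the algorithm
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    cluster_counts[k] = len(np.unique(labels))

print(cluster_counts)
```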
In k-means, the algorithm runs multiple iterations. If we have a simple 2D problem and k = 2, it begins by assigning the first centroids to
A random initial starting point
In k-means, the algorithm runs multiple iterations. Suppose we have a simple 2D problem and k = 2. After the initial centroids are placed, the algorithm continues by _________________ from each point or record to them.
Measuring the distance
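The two cards above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions (made-up 2D data, a fixed 10 iterations, no handling of empty clusters), not the implementation scikit-learn uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two loose groups of 2D points (made-up data)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

k = 2
# Step 1: pick k random data points as the initial centroids
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):
    # Step 2: measure the distance from every point to every centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Step 3: assign each point to its nearest centroid
    labels = distances.argmin(axis=1)
    # Step 4: move each centroid to the mean of the points assigned to it
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)
```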
An example this week was done in a Jupyter-like environment called Google Colab. What was the library that was demonstrated in the videos?
(One cool thing about this is that it looks just like any other package! Installing this on your own is tricky)
TensorFlow
Neural Networks in computing are exactly the same as the neural networks from biology. (True or False)
False
Deep Neural Networks have only 1 hidden layer and multiple input layers. (True or False)
False
Each link between nodes is a connection, and each of those connections has a ________
Activation function (from our possibly overly simplistic explanation)
In the attempt to fit values from the input layer to the output layer, the hidden layer applies some weights to the input values. (True or False)
True
The example we walked through was from a fairly famous dataset for learning about machine learning. The dataset is called:
MNIST
All of the nodes prior to the output nodes essentially "guess" at the correct weights. Then the algorithm checks to see if the initial guess is correct (usually not). When it is wrong...
It tries again (runs another epoch)
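The "guess, check, try again" loop can be sketched with a single weight in plain NumPy. This is not TensorFlow and not a real neural network; the data, learning rate, and number of epochs are assumptions chosen so the idea is easy to follow.

```python
import numpy as np

# Made-up data where the true relationship is y = 3x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

weight = 0.0          # initial "guess" (hypothetical starting value)
learning_rate = 0.01  # how big each correction is (an assumption)

for epoch in range(200):
    y_hat = weight * X                  # make a prediction with the current weight
    error = y_hat - y                   # check how wrong the guess was
    gradient = 2 * np.mean(error * X)   # direction in which to adjust the weight
    weight -= learning_rate * gradient  # adjust, then try again next epoch

print(round(weight, 3))
```

Each pass through the loop plays the role of one epoch: the weight starts as a poor guess and is nudged toward a value that makes the predictions match the target.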
Neural networks are an unsupervised technique, because there is no target variable. (True or False)
False
When viewing a diagram of a neural network there are several layers. The input layer:
Are the Xs, or inputs from your data
When viewing a diagram of a neural network there are several layers. The Output layer:
Are the Ys (The target variable you are interested in)
When viewing a diagram of a neural network there are several layers. The hidden layer:
Something you don't see; here some computation takes place to transform the X into the Y
NLP stands for
natural language processing
Tokenization, as defined in the lecture, is
A computer turning letters and/or words into something it can read and understand, like numbers
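A minimal pure-Python sketch of that idea, assuming a simple whitespace split and a made-up pair of sentences. Real NLP libraries use more sophisticated tokenizers; the names `vocab` and `encoded` are illustrative only.

```python
# Made-up example sentences
sentences = ["the cat sat", "the dog sat down"]

# Build a vocabulary that maps each distinct word to an integer
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1  # start at 1; 0 is often kept for padding

# Replace every word with its number
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

print(vocab)    # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
print(encoded)  # [[1, 2, 3], [1, 4, 3, 5]]
```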
Recommenders come in many flavors. 2 of the most common, often used together and discussed in the lecture are:
User based and Item based
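One common way to illustrate the difference is with cosine similarity on a ratings matrix. The sketch below is an assumption about how this might look (the lecture may have used another similarity measure), with made-up ratings and a hypothetical helper `cosine_similarity_matrix`.

```python
import numpy as np

# Rows are users, columns are items; ratings are made up for illustration
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
], dtype=float)

def cosine_similarity_matrix(m):
    """Cosine similarity between every pair of rows in m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms
    return unit @ unit.T

user_sim = cosine_similarity_matrix(ratings)    # user based: compare rows (users)
item_sim = cosine_similarity_matrix(ratings.T)  # item based: compare columns (items)

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

User-based recommenders look for people with similar rating patterns, while item-based recommenders look for items that tend to be rated alike.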
Imagine you have a dataset with 2 columns, both filled with continuous numbers. You believe the first column is a predictor of the second column. Which of the model approaches below could work when building a model?
Random forests, regression, decision trees (maybe not the BEST solution; decision trees have some problems, like overfitting, that we discussed)
Decision trees have a few problems, you should probably review those for the final exam! The problem we talked about the most is:
Overfitting
We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form
y=ax+b
Where a is commonly known as the
Slope
We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form
y=ax+b
Where b is commonly known as the
Intercept
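A quick way to see the slope and intercept is to fit a line to points that lie exactly on a known line. The sketch uses `np.polyfit` as a convenient stand-in, and the data (y = 2x - 5) is made up for illustration.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x - 5
x = np.arange(10, dtype=float)
y = 2.0 * x - 5.0

# For a degree-1 fit, polyfit returns the coefficients as [a, b]
a, b = np.polyfit(x, y, deg=1)

print("slope a:", round(a, 6))
print("intercept b:", round(b, 6))
```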
The LinearRegression estimator is only capable of simple straight line fits. (True or false)
False
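One way to see why this is false: combining `LinearRegression` with polynomial features lets it fit a curve. The sketch below assumes made-up data on y = x² and a degree of 2; the name `poly_model` is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data: y = x squared
x = np.linspace(-3, 3, 30)
y = x ** 2
X = x[:, np.newaxis]  # scikit-learn expects a 2D features matrix

# A plain straight-line fit cannot capture the curve
straight_r2 = LinearRegression().fit(X, y).score(X, y)

# The same estimator on polynomial features can
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_r2 = poly_model.fit(X, y).score(X, y)

print(round(straight_r2, 3), round(poly_r2, 3))
```

The model is still linear in its coefficients; only the inputs have been transformed.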
In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 1?
Choosing a class of model
In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 2?
Choose hyperparameters
In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 3?
Arrange data
In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 4?
Fit the model
In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 5?
Predict
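The five steps map onto a few lines of scikit-learn. This sketch assumes a linear regression and made-up noisy data around y = 2x - 1; any other estimator would follow the same pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # Step 1: choose a class of model

model = LinearRegression(fit_intercept=True)       # Step 2: choose hyperparameters

# Step 3: arrange data into a features matrix X and a target vector y
rng = np.random.default_rng(1)
x = 10 * rng.random(50)
y = 2 * x - 1 + rng.normal(scale=0.5, size=50)
X = x[:, np.newaxis]

model.fit(X, y)                                    # Step 4: fit the model

X_new = np.array([[0.0], [5.0], [10.0]])
y_hat = model.predict(X_new)                       # Step 5: predict on new data

print(model.coef_[0], model.intercept_)
print(y_hat)
```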
What is the purpose of the below code?
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Import python packages
Your dataset consists of details about customer traits, such as "number of items in the basket at checkout" and "time of day of checkout". Your task is to group customers that are like each other together. You don't already have labeled customer types. What kind of model are you building?
Unsupervised model (like k means)
What is ONE reason the textbook lists for why a Linear regression is a good starting point in a modeling task.
They are interpretable
What is the first variable in a decision tree called (before any of the branches)?
Root
One problem with decision trees is that they are prone to
Overfitting
If you are not careful, or do not set the __________________ appropriately, decision trees will overfit
Max depth
The random forest algorithm prevents, or at least avoids to some extent, the problems with overfitting found in decision trees. (True or False)
True
Random forests can only be used on classification problems (true or false)
False
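As a small check that random forests are not limited to classification, the sketch below fits scikit-learn's `RandomForestRegressor` to a continuous target. The sine-wave data and the settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up continuous target: a sine wave
x = np.linspace(0, 10, 200)
y = np.sin(x)
X = x[:, np.newaxis]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

train_r2 = forest.score(X, y)  # R-squared on the training data
print(round(train_r2, 3))
```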
In order to interpret decision trees, it's necessary to first run a linear regression (true or false)
False
Decision trees are nice because they are fairly simple and straightforward to interpret (True or False)
True
When running our first decision tree, we took out "max_depth=". This had the unfortunate result of...
Building a very large hard to understand tree
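The effect of leaving out `max_depth` can be seen by comparing two trees. This sketch uses the iris dataset bundled with scikit-learn as a stand-in (the course dataset is not shown here), and the depth of 2 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Limited tree: stops splitting after two levels
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# No max_depth given: it defaults to None, so the tree keeps splitting
unlimited = DecisionTreeClassifier(random_state=0).fit(X, y)

print("shallow:", shallow.get_depth(), "levels,", shallow.get_n_leaves(), "leaves")
print("unlimited:", unlimited.get_depth(), "levels,", unlimited.get_n_leaves(), "leaves")
```

On a larger or noisier dataset the unlimited tree grows far bigger than this, which makes it hard to read and more likely to overfit.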
What is the terminal node as discussed in the lecture?
The last node (sometimes called a leaf); the tree doesn't split after this
Models, such as the random forest model we ran, often have a number of parameters that the analyst can choose or set. What is the best source of up-to-date information about the different parameters that can be set?
The scikit-learn documentation
Random forests are __________ interpretable than decision trees
Less
Pipelines are useful (in the analytics with Python sense) for what reasons?
They make it easy to repeat/replicate steps and run multiple models, help organize the code you used to clean and treat data, and make it easy to change small things in the model, like which variables to include.
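A minimal sketch of a scikit-learn `Pipeline`, assuming made-up data and hypothetical step names (`"scale"`, `"model"`). It shows the preprocessing and the model bundled together, and one small change (swapping the model) made without rewriting the rest.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with an exact linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0

pipe = Pipeline([
    ("scale", StandardScaler()),   # data treatment step
    ("model", LinearRegression()), # modeling step
])
linear_r2 = pipe.fit(X, y).score(X, y)

# Swap in a different model while keeping the same preprocessing
pipe.set_params(model=Ridge(alpha=1.0))
ridge_r2 = pipe.fit(X, y).score(X, y)

print(round(linear_r2, 4), round(ridge_r2, 4))
```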
Y and y-hat are a little different. Y is our target vector, and y-hat is an output in our model that is a.....
Estimate or prediction of y
The basic idea of a regression is very simple. We have some X values (we called these ___________) and some Y value (this is the variable we are trying to _________). We could have multiple Y values, but that is not something we have covered.
Features; Predict
When looking at the code in the videos, we sometimes used a variable to hold our model.
What is the significance of the word "model" in the below code?
model = LinearRegression(fit_intercept=True)
'model' is a named variable that simply holds our linear regression model. It could be renamed to anything. The word itself is not important; it is just a container.
Which of the below were discussed as being problems with the hold out method for validation?
Outliers can skew the results and the model is not trained on all of the data
Which of the following is a common use case for the random forest algorithm in machine learning?
Classifying data into categories based on input features
Which of the following is a potential benefit of using decision trees in machine learning?
Can handle both numerical and categorical data
Which of the following statements best describes an ensemble method in machine learning?
A technique that combines the results of multiple models to improve overall predictive accuracy
Which of the following best describes supervised learning?
A machine learning approach where the algorithm receives labeled data and learns to map inputs to outputs based on those labels
Which of the following statements best describes classification in machine learning?
A type of supervised learning where the goal is to assign input data points to predefined categories or classes
We want the R-squared value for our regression model to be 100% (true or false)
False
One weakness of cross-validation discussed is that information can sometimes ____ across different periods. A common situation in which this happens is when we are looking at stock data.
Leak
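One option scikit-learn offers for time-ordered data is `TimeSeriesSplit`, which always trains on earlier periods and tests on later ones. The course material does not name this tool, so treat it as an assumption; the 12 "periods" below are a made-up stand-in for stock data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

periods = np.arange(12)  # stand-in for 12 consecutive time periods

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(periods))

for train_idx, test_idx in folds:
    # Every training index comes before every test index, so the model
    # never sees information from the future
    print("train:", train_idx, "test:", test_idx)
```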
In which of these situations would you want to use a clustering algorithm?
You have a dataset containing customer data for Cheesecake Factory and you want to look at customer spending at the restaurant in order to find patterns among customers who share similar characteristics
What is a potential downside of using linear regression models in machine learning?
They are prone to overfitting the data
What type of algorithm would you use to segment customers into groups?
Assume the groups are already labeled.
Decision trees, regression, random forest, cluster regression
Which of the following is true about data validation and cross-validation in machine learning?
Data validation and cross-validation are used to evaluate a model's performance and prevent overfitting
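To make the two ideas concrete, here is a sketch comparing a single hold-out split with 5-fold cross-validation. The iris dataset, the decision tree, and the split sizes are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Hold-out: train on one part, score on a part the model never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every row is used for testing exactly once
cv_scores = cross_val_score(model, X, y, cv=5)

print("hold-out accuracy:", round(holdout_score, 3))
print("cross-validation accuracies:", cv_scores.round(3))
```

A large gap between training accuracy and these held-out scores is a typical sign of overfitting.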
What is the role of cluster centers in clustering, and how are they determined during the algorithm?
Cluster centers are the initial data points chosen randomly to begin clustering, and they are updated iteratively to minimize the within-cluster sum of squares
Which of the following machine learning models utilizes supervised learning?
Regression
What is scikit-learn?
A machine learning package in Python that has built in machine learning algorithms we can use on our dataset
Which of the following best describes the difference between a supervised and an unsupervised learning task in machine learning?
A supervised learning task requires labeled data, while an unsupervised learning task does not
Which is true about linear regression models?
They are easy to interpret