Pipelines are useful (in the analytics-with-Python sense) for which of the following reasons? (choose all that apply)
- Pipelines make it easy to repeat/replicate steps and run multiple models
- Pipelines are good for moving data into your programming environment
- Pipelines automatically update to new versions of Python
- Pipelines help organize code you used to clean and treat your data
- Pipelines make it very easy to change small things in your model, like which variable to include
- Pipelines make it easy to repeat/replicate steps and run multiple models
- Pipelines help organize code you used to clean and treat your data
- Pipelines make it very easy to change small things in your model, like which variable to include
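A minimal sketch of what such a pipeline can look like, assuming scikit-learn. The toy data and the step names ("impute", "scale", "model") are made up for illustration and are not from the course notebooks.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): y = 2*x + 1, with one missing x value.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# The cleaning steps and the model live together in one named list.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# One call repeats every step in order.
pipe.fit(X, y)
pipe_preds = pipe.predict(X)

# Changing one small thing: swap the imputation strategy, leave the rest alone.
pipe.set_params(impute__strategy="mean")
pipe.fit(X, y)
```

Because every step has a name, a single `set_params` call changes one piece without rewriting the cleaning code.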
The basic idea of a regression is very simple. We have some X values (which we call ______) and some Y value that we are trying to _____. We could have multiple Y values, but that is not something we have covered.
features; predict
Y and y-hat are a little different. Y is our target vector, and y-hat is an output of our model that is... (choose one of the following)
- estimate or prediction of y
- the actual value of y
- an axis on our 2 way graph
- a combination of XY intercept coordinates
estimate or prediction of y
When looking at the code in the videos, we sometimes used a variable to hold our model. What is the significance of the word "model" in the below code?
model = LinearRegression(fit_intercept=True)
'model' is a named variable and is just holding our linear regression model. It could be renamed anything. The word itself is not important. It is just a container.
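A short sketch to make both points concrete, assuming scikit-learn: the variable holding the model can have any name (here the made-up name `my_regressor`), and y-hat is what `predict` returns, which is close to but not the same as y. The numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: X is the feature matrix, y is the target vector.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Any variable name works; "model" is just a convention.
my_regressor = LinearRegression(fit_intercept=True)
my_regressor.fit(X, y)

# y_hat is the model's estimate of y, not the actual y.
y_hat = my_regressor.predict(X)
residuals = y - y_hat
```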
What is a good model fit value?
unknowable without knowing/understanding the context of the domain
Imagine X in the below is a missing value. If I were to run a median imputer on this set of data, what would the return value be?
50, 60, 70, 80, 100, 60, 5000, X
70
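The answer can be checked with a quick sketch, assuming scikit-learn's `SimpleImputer` (the course may have used a different imputer). The median ignores the missing value and is not pulled upward by the 5000 outlier the way a mean would be.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The values from the question, with np.nan standing in for the missing X.
values = np.array([[50], [60], [70], [80], [100], [60], [5000], [np.nan]], dtype=float)

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(values)

# Sorted known values: 50, 60, 60, 70, 80, 100, 5000 -> the middle one is 70.
imputed_value = filled[-1, 0]

# For contrast: the mean of the known values is dragged up by the outlier.
mean_value = np.nanmean(values)
```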
Which of the below were discussed as being problems with the holdout method for validation?
- Data is not available for test and control differences
- Outliers can skew the result
- The model is not trained on all of the data
- K=3 is not sufficiently large
- Validation is sometimes too challenging
- Outliers can skew the result
- The model is not trained on all of the data
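A hedged sketch of the holdout method next to k-fold cross-validation, one common alternative, assuming scikit-learn. The synthetic data and the 75/25 split are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (illustrative): a noisy straight line.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)

# Holdout: the model never sees the 25 test rows while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
holdout_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every row is used for testing exactly once.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
```

With a single holdout split, the score depends on which rows (including any outliers) happen to land in the test set; averaging over folds reduces that dependence.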
The features in a model...
- are used as proxies for y-hat divided by y
- are always functions of each other
- keep the model validation process stable
- none of these answers are correct
none of these answers are correct
What is the first variable in a decision tree called (before any of the branches)?
root
One problem with decision trees is that they are prone to _____ if you are not careful or do not set the _____ appropriately.
overfitting; max depth
True or False: The random forest algorithm prevents, or at least avoids to some extent, the problems with overfitting found in decision trees.
True
True or False: Random Forests can only be used on classification problems
False
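As a quick illustration, scikit-learn ships both a classifier and a regressor version of the random forest. This is a minimal sketch on made-up data; the targets and `n_estimators=20` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 2))

# Two made-up targets built from the same features.
y_class = (X[:, 0] + X[:, 1] > 10).astype(int)   # a category (0 or 1)
y_number = 3.0 * X[:, 0] - X[:, 1]               # a continuous value

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y_number)

class_preds = clf.predict(X[:5])     # predicted class labels
number_preds = reg.predict(X[:5])    # predicted continuous values
```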
True or False: In order to interpret Decision Trees, it is necessary to first run a linear regression
False
True or False: Decision Trees are nice because they are fairly simple and straightforward to interpret
True
When running our first decision tree, we took out "max_depth=". This had the unfortunate result of...
Building a very large hard to understand tree
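A minimal sketch of that effect, assuming scikit-learn's `DecisionTreeRegressor` and synthetic data (the lecture's dataset is not reproduced here). Without `max_depth`, the tree keeps splitting until it has memorized the training rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

# No max_depth: the tree grows until every leaf is pure.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# max_depth=3: at most 2**3 = 8 leaves, small enough to read.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

deep_depth, deep_leaves = deep_tree.get_depth(), deep_tree.get_n_leaves()
shallow_depth, shallow_leaves = shallow_tree.get_depth(), shallow_tree.get_n_leaves()
```

The deep tree scores nearly perfectly on the data it was trained on, which is the overfitting warning sign rather than a sign of a good model.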
What is the terminal node as discussed in the lecture?
The last node (sometimes called a leaf if you google the term); the tree doesn't split after this
Models, such as the random forest model we ran, often have a number of parameters that the analyst can choose or set.
What is the best source of up-to-date information about the different parameters that can be set?
The scikit-learn documentation
Random forests are _____ interpretable than decision trees.
less
True or False: The correct number of clusters in Hierarchical clustering can be determined precisely using approaches such as silhouette scores.
False
True or False: In K Means clustering, the analyst does not need to determine the number of clusters (K), these are always derived analytically using the kmeans algorithm.
False
True or False: One big difference between the unsupervised approaches in this module and the supervised approaches in prior modules: unsupervised models do not have a target variable (Y). This makes it difficult to know when they are "right" or correct
True
According to the documentation, a silhouette score of 1 is _____, and -1 is _____.
the best score; the worst score
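A hedged sketch of how silhouette scores are typically computed for several candidate values of K, assuming scikit-learn. The two synthetic blobs are made up so that K=2 is the obvious grouping; on real data the scores are a guide, not an exact answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs (illustrative only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Score each candidate K; values closer to 1 mean tighter, better-separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```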
Select all that apply. Imagine you have a data set with columns/inputs for customers:
Column 1 = Customer ID (a number)
Column 2 = Sales (a dollar value)
Column 3 = Frequency (a number)
Column 4 = Satisfaction (a number)
You would like to understand the impact of Frequency on customer Satisfaction. What types of approaches could you use?
Note that the type of data is in parentheses () after the column name. Choose the best answer(s) from the available choices below.
- Decision Tree
- K Means
- Random Forest
- Linear Regression
- Hierarchical Clustering
- Decision Tree
- Random Forest
- Linear Regression
Select all that apply. Imagine you have a data set with columns/inputs for customers:
Column 1 = Customer ID
Column 2 = Distance to Stores
Column 3 = Year spend
Column 4 = Likelihood to return
What kind of approach(es) could you use to understand more about these customers?
- Regression - to understand the effect of one or more variables on the others
- Clustering - to develop groups of customers that have similar patterns
- Regression - to understand the effect of one or more variables on the others
- Clustering - to develop groups of customers that have similar patterns
What is the purpose of the following code:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
rfm_std = scale.fit_transform(df)
to standardize the data
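A small sketch of what "standardize" means here, assuming scikit-learn. A plain NumPy array with made-up numbers stands in for the DataFrame `df`. After scaling, each column has mean 0 and standard deviation 1, so no column dominates a distance-based method such as k-means just because of its units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for df: three columns on very different scales.
data = np.array([
    [10.0, 1.0, 100.0],
    [20.0, 2.0, 200.0],
    [30.0, 3.0, 300.0],
    [40.0, 4.0, 400.0],
])

scale = StandardScaler()
rfm_std = scale.fit_transform(data)

# Each column is now centered at 0 with unit standard deviation.
column_means = rfm_std.mean(axis=0)
column_stds = rfm_std.std(axis=0)
```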
True or False: The elbow method provides an exact number of clusters for a k-means algorithm
False
True or False: Hierarchical clustering is more powerful than Kmeans, as it allows the researcher to determine the exact number of clusters to use in the analysis
False
In k-means, the algorithm runs multiple iterations. If we have a simple 2D problem and k = 2, it begins by assigning the first centroids to _____, and then _____ of each point or record to the centroid.
a random initial starting point; measuring the distance
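The iteration can be sketched by hand in NumPy. This is a simplified illustration of the idea on made-up 2D points, not scikit-learn's actual `KMeans` implementation (which uses smarter initialization).

```python
import numpy as np

# Made-up 2D points in two groups.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])
k = 2

# Step 1: start the centroids at k randomly chosen points.
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):
    # Step 2: measure the distance from every point to every centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    # Step 3: assign each point to its nearest centroid.
    labels = distances.argmin(axis=1)
    # Step 4: move each centroid to the mean of its points
    # (keep the old centroid if a cluster ends up empty).
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    if np.allclose(new_centroids, centroids):
        break  # nothing moved, so the assignments are stable
    centroids = new_centroids
```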
An example this week was done in a Jupyter-like environment called Google Colab. What was the language that was demonstrated in the videos?
TensorFlow
True or False: Neural Networks in computing are exactly the same as neural networks from biology.
False
When viewing a diagram of a neural network there are several layers. Match the layer to the description below:
Input Layer
Options:
- these are the X's, or inputs from your data
- these are the Y (the target variable you are interested in)
- something you don't see, here there is some computation to transform the X's to the Y
- the layer that translates the axons
these are the X's, or inputs from your data
When viewing a diagram of a neural network there are several layers. Match the layer to the description below:
Output Layer
Options:
- these are the X's, or inputs from your data
- these are the Y (the target variable you are interested in)
- something you don't see, here there is some computation to transform the X's to the Y
- the layer that translates the axons
these are the Y (the target variable you are interested in)
When viewing a diagram of a neural network there are several layers. Match the layer to the description below:
Hidden Layer
Options:
- these are the X's, or inputs from your data
- these are the Y (the target variable you are interested in)
- something you don't see, here there is some computation to transform the X's to the Y
- the layer that translates the axons
something you don't see, here there is some computation to transform the X's to the Y
True or False: Deep Neural Networks have only 1 hidden layer and multiple input layers
False
Each of the nodes has connections to other nodes, and each of those connections has a _____.
Activation function
True or False: In an attempt to fit values from the input layer to the output layer, the hidden layer applies some weights to the input values.
True
The example we walked through was from a fairly famous dataset for learning about machine learning. The dataset is called:
MNIST
All the nodes prior to the output nodes essentially 'guess' at the correct weights. The algorithm checks to see if the initial guess is correct (usually not). When it is wrong...
... it tries again (runs another epoch)
True or False: Neural networks are an unsupervised technique, because there is no target variable.
False
NLP stands for...
natural language processing
Tokenization, as defined in the lecture, is...
A computer turning letters and/or words into something it can read and understand, like numbers
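A hand-rolled sketch of word-level tokenization in plain Python. The lecture likely used a library tokenizer; this stand-in only shows the idea of mapping each word to a number, and the sentences are made up.

```python
# Made-up example sentences.
sentences = ["the cat sat", "the dog sat down"]

# Build a vocabulary: each new word gets the next integer ID (starting at 1).
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

# Replace every word with its ID so the text becomes lists of numbers.
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

print(vocab)
print(encoded)
```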
Recommenders come in many flavors. 2 of the most common, often used together and discussed in the lecture are: (choose the following)
- Item Based
- User Based
- Algorithm Oriented
- Stock Availability Based
- Syntax Dependent
- Item Based
- User Based
Imagine you have a dataset with 2 columns, both filled with continuous numbers. You believe the first column is a predictor of the second column. Which of the model approaches below could work when building a model?
- Regression
- Decision Tree
- Running .describe and .info on the data
- Graphing
- Random Forest
- Regression
- Decision Tree
- Random Forest
Decision trees have a few problems. The problem we talked about the most is:
overfitting
In y = ax + b
a is commonly known as _____ and b is commonly known as _____.
slope; intercept
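A quick sketch, assuming scikit-learn, showing where a fitted `LinearRegression` stores the two values: the slope in `coef_` and the intercept in `intercept_`. The data is generated from y = 2x + 1 for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Points that lie exactly on y = 2x + 1 (illustrative).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# scikit-learn expects a 2D feature matrix, hence the reshape.
model = LinearRegression(fit_intercept=True).fit(x.reshape(-1, 1), y)

slope = model.coef_[0]        # the "a" in y = ax + b
intercept = model.intercept_  # the "b" in y = ax + b
```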
True or False: The LinearRegression estimator is only capable of simple straight line fits
False
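One way to see why, sketched with scikit-learn: the model stays linear in its coefficients, but the features can be transformed first (here with `PolynomialFeatures`), so the fitted curve is no longer a straight line. The quadratic data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up curved data: y = x squared.
x = np.linspace(-3, 3, 20).reshape(-1, 1)
y = x[:, 0] ** 2

# A plain straight-line fit cannot follow the curve.
straight = LinearRegression().fit(x, y)

# Adding x**2 as a feature lets the same estimator fit the parabola.
curved = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

straight_r2 = straight.score(x, y)
curved_r2 = curved.score(x, y)
```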
In class we walked through 5 steps to building a machine learning model. The textbook goes over in some depth the 5 steps. What are they?
First Step: choosing a class of model
Second Step: choosing hyperparameters
Third Step: arrange data
Fourth Step: fit the model
Fifth Step: predict
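The five steps can be sketched in a few lines, assuming scikit-learn. The synthetic data and the choice of `LinearRegression` are assumptions for illustration; any other estimator follows the same pattern.

```python
import numpy as np

# Step 1: choose a class of model.
from sklearn.linear_model import LinearRegression

# Step 2: choose hyperparameters by instantiating the class.
model = LinearRegression(fit_intercept=True)

# Step 3: arrange data into a features matrix X and a target vector y.
rng = np.random.default_rng(42)
x = 10 * rng.random(50)
y = 2 * x - 1 + rng.normal(size=50)
X = x[:, np.newaxis]  # shape (50, 1): rows are samples, columns are features

# Step 4: fit the model to the data.
model.fit(X, y)

# Step 5: predict on new data.
x_new = np.linspace(-1, 11, 5)
y_new = model.predict(x_new[:, np.newaxis])
```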
What is the purpose of the below code?
Note that this is probably EASIER than similar questions on the final exam. But I will ask you why/purpose/what for questions on the code I have had you run. It is useful to make notes on your notebooks about why a certain chunk of code is run.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import python packages
Choosing a class of models
Your data set consists of details about customer traits, such as "number of items in the basket at checkout". Your task is to group customers that are like each other together. You don't already have labeled customer types. What kind of model are you building?
unsupervised model (such as K means)
- reminder: if you have a bunch of Xs but no Ys, the problem is unsupervised. When you are building a supervised model, you have an X and a Y. The hint here is "you don't already have labeled customer types": without these labels (the Y), you can't have any supervision
What is ONE reason the textbook lists for why a linear regression is a good starting point in a modeling task?
they are interpretable