QMB3302 Final UF

71 Terms

New cards

The correct number of clusters in Hierarchical clustering can be determined precisely using approaches such as silhouette scores (True or False)

False

New cards

In K-means clustering, the analyst does not need to determine the number of clusters (K); these are always derived analytically by the k-means algorithm. (True or False)

False

New cards

One big difference between the unsupervised approaches in this module and the supervised approaches in prior modules: unsupervised models do not have a target variable (Y). This makes it difficult to know when they are "right" or correct. (True or False)

True

New cards

According to the documentation, a silhouette score of 1 is

The best score

New cards

According to the documentation, a silhouette score of -1 is

The worst score
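
A minimal sketch of how silhouette scores might be compared across candidate values of K. The two-blob data set is synthetic and purely an assumption for illustration. The score ranges from -1 (worst) to 1 (best), and it guides the choice of K rather than proving it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Fit k-means for several candidate K values and record the silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest score is a suggestion, not a guaranteed "correct" answer.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```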

New cards

Select all that apply. Imagine you have a data set with columns/inputs for customers:

Column 1 = Customer ID (a number)

Column 2 = Sales (a dollar value)

Column 3 = Frequency (a number)

Column 4 = Satisfaction (a number)

You would like to understand the impact of Frequency on customer Satisfaction. What types of approaches could you use?

Note that the type of data is given in parentheses () after the column name.

Decision tree, random forest, linear regression

New cards

Select all that apply. Imagine you have a dataset with the following columns (inputs) for a set of customers.

Column 1 = Customer ID

Column 2 = Distance to Store

Column 3 = Yearly spend

Column 4 = Likelihood to return (a survey response that indicates a customer is likely to shop again)

What kind of approaches could you use to understand more about these customers? Why?

Regression - to understand the effect of one or more variables on the others

Clustering - to develop groups of customers that have similar patterns

New cards

What is the purpose of the following code?

from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

rfm_std = scale.fit_transform(df)

To standardize the data
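
As a rough illustration of what "standardize" means here, the sketch below applies the same `StandardScaler` call to a small made-up array standing in for the card's `df` (the values and the recency/frequency/monetary column meanings are assumptions). After scaling, each column should have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `df`: columns are recency, frequency, monetary value.
data = np.array([
    [10.0, 1.0, 20.0],
    [40.0, 12.0, 900.0],
    [25.0, 6.0, 310.0],
    [5.0, 3.0, 75.0],
])

scale = StandardScaler()
rfm_std = scale.fit_transform(data)

# Each column is now centered at 0 with unit (population) standard deviation.
col_means = rfm_std.mean(axis=0)
col_stds = rfm_std.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```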

New cards

The elbow method provides an exact number of clusters for a kmeans algorithm. (True or False)

False
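
A minimal sketch of the elbow method, assuming synthetic two-blob data. It records the within-cluster sum of squares (`inertia_`) for each K. The analyst then looks for the point where the improvement levels off, which is a judgment call rather than an exact answer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

# Inertia = within-cluster sum of squared distances, for K = 1..6.
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# The size of each drop shows where adding clusters stops helping much.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
print(list(zip(ks, inertias)))
```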

New cards

Hierarchical clustering is more powerful than Kmeans, as it allows the researcher to determine the exact number of clusters to use in the analysis. (True or False)

False

New cards

In k-means, the algorithm has multiple iterations. If we have a simple 2D problem and k = 2, it begins by assigning the first centroids to

A random initial starting point

New cards

In k-means, the algorithm has multiple iterations. Suppose we have a simple 2D problem and k = 2. After the initial centroids are placed, the algorithm continues by _________________ of each point or record to them

Measuring the distance
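
The iteration described in these two cards can be sketched in plain NumPy. This is an illustrative toy version under stated assumptions (synthetic 2D data, k = 2, initial centroids drawn from the data points), not the scikit-learn implementation.

```python
import numpy as np

# Synthetic 2D data (an assumption for illustration): two separated groups.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(4.0, 0.5, size=(30, 2)),
])
k = 2

# Step 1: random initial centroids (here, k distinct data points picked at random).
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(10):
    # Step 2: measure the distance from every point to every centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Step 3: assign each point to its nearest centroid.
    labels = distances.argmin(axis=1)
    # Step 4: move each centroid to the mean of the points assigned to it.
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    if np.allclose(new_centroids, centroids):
        break  # centroids stopped moving, so the assignments are stable
    centroids = new_centroids

print(centroids)
```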

New cards

An example this week was done in a Jupyter-like environment called Google Colab. What was the language that was demonstrated in the videos?

(One cool thing about this is that it looks just like any other package! Installing this on your own is tricky)

TensorFlow

New cards

Neural Networks in computing are exactly the same as the neural networks from biology. (True or False)

False

New cards

Deep Neural Networks have only 1 hidden layer and multiple input layers. (True or False)

False

New cards

Each of the nodes has a connection, and each of those connections has a ________

Activation function

New cards

(From our possibly overly simplistic explanation)

In the attempt to fit values from the input layer to the output layer, the hidden layer applies some weights to the input values. (True or False)

True
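
A toy forward pass may make this concrete. The sketch below assumes one record with three inputs, a single hidden layer of four nodes, random (untrained) weights, and a ReLU activation. All of these choices are illustrative assumptions, not the network from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# One record with three input values (the Xs).
x = np.array([[0.2, 0.7, 0.1]])

# Randomly initialized weights and zero biases: an untrained "first guess".
W_hidden = rng.normal(size=(3, 4))   # input layer -> hidden layer
b_hidden = np.zeros(4)
W_out = rng.normal(size=(4, 1))      # hidden layer -> output layer
b_out = np.zeros(1)

def relu(z):
    """Activation function: keep positive values, zero out the rest."""
    return np.maximum(z, 0.0)

# The hidden layer applies its weights to the inputs, then the activation.
hidden = relu(x @ W_hidden + b_hidden)
# The output layer turns the hidden values into a prediction (y-hat).
y_hat = hidden @ W_out + b_out
print(hidden, y_hat)
```

Training would then adjust `W_hidden` and `W_out` over repeated epochs so that `y_hat` moves closer to the known target.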

New cards

The example we walked through was from a fairly famous dataset for learning about machine learning. The dataset is called:

MNIST

New cards

All the nodes prior to the output nodes essentially 'guess' at the correct weights. Then the algorithm checks to see if the initial guess is correct (usually not). When it is wrong...

It tries again (runs another epoch)

New cards

Neural networks are an unsupervised technique, because there is no target variable. (True or False)

False

New cards

When viewing a diagram of a neural network there are several layers. The input layer:

Are the Xs, or inputs from your data

New cards

When viewing a diagram of a neural network there are several layers. The Output layer:

Are the Ys (The target variable you are interested in)

New cards

When viewing a diagram of a neural network there are several layers. The hidden layer:

Something you don't see; here there is some computation to transform X into Y

New cards

NLP stands for

natural language processing

New cards

Tokenization, as defined in the lecture, is

A computer turning letters and/or words into something it can read and understand, like numbers
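
A bare-bones sketch of the idea in plain Python: each distinct word gets an integer ID, and sentences become lists of those IDs. The sentences and the choice to start IDs at 1 are assumptions for illustration. Real tokenizers handle punctuation, unknown words, and sub-word pieces.

```python
sentences = ["the cat sat", "the dog sat down"]

# Build a vocabulary: each new word gets the next integer ID.
# IDs start at 1 here; 0 is often reserved for padding by convention.
vocab = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

# Replace every word with its ID so the text becomes numbers.
encoded = [[vocab[word] for word in s.lower().split()] for s in sentences]
print(vocab)
print(encoded)
```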

New cards

Recommenders come in many flavors. Two of the most common, often used together and discussed in the lecture, are:

User based and Item based

New cards

Imagine you have a dataset with 2 columns, both filled with continuous numbers. You believe the first column is a predictor of the second column. Which of the model approaches below could work when building a model?

Random forests, regression, decision trees (maybe not the BEST solution; decision trees have some problems, like overfitting, that we discussed)

New cards

Decision trees have a few problems, you should probably review those for the final exam! The problem we talked about the most is:

Overfitting

New cards

We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form

y=ax+b

Where a is commonly known as the

Slope

New cards

We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form

y=ax+b

Where b is commonly known as the

Intercept
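
To see the slope and intercept in action, the sketch below generates noise-free points from an assumed line, y = 2x - 5, and recovers a and b with NumPy's `polyfit`. The numbers are made up for illustration.

```python
import numpy as np

# Points that lie exactly on the assumed line y = 2x - 5.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x - 5.0

# A degree-1 fit returns the coefficients highest power first: [a, b].
a, b = np.polyfit(x, y, deg=1)
print(a, b)  # a is the slope, b is the intercept
```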

New cards

The LinearRegression estimator is only capable of simple straight line fits. (True or false)

False
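
One way to see why this is false: `LinearRegression` can fit curves once the inputs are expanded into polynomial features. The sketch below assumes noise-free data from y = x² and compares a plain straight-line fit with a degree-2 pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Curved, noise-free data (an assumption for illustration): y = x^2.
x = np.linspace(-3, 3, 50)
y = x ** 2
X = x[:, np.newaxis]  # scikit-learn expects a 2D features matrix

# A straight line cannot follow the curve.
line = LinearRegression().fit(X, y)
r2_line = line.score(X, y)

# The same estimator on polynomial features fits the curve.
curve = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
r2_curve = curve.score(X, y)
print(r2_line, r2_curve)
```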

New cards

In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 1?

Choosing a class of model

New cards

In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 2?

Choose hyperparameters

New cards

In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 3?

Arrange data

New cards

In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 4?

Fit the model

New cards

In class we walked through 5 steps to building a machine learning model. The textbook also goes over in some depth the 5 steps. What is step 5?

Predict
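
The five steps can be sketched end to end with scikit-learn. The data here is synthetic (an assumed line y = 2x - 5 plus noise), so the specific numbers are illustrative only.

```python
import numpy as np

# Step 1: choose a class of model.
from sklearn.linear_model import LinearRegression

# Step 2: choose hyperparameters by instantiating the model.
model = LinearRegression(fit_intercept=True)

# Step 3: arrange data into a features matrix X and a target vector y.
rng = np.random.default_rng(1)
x = 10 * rng.random(50)
y = 2 * x - 5 + rng.normal(scale=0.5, size=50)
X = x[:, np.newaxis]  # shape (n_samples, n_features)

# Step 4: fit the model to the data.
model.fit(X, y)

# Step 5: predict for new data; the result is y-hat, an estimate of y.
x_new = np.linspace(-1, 11, 5)
y_hat = model.predict(x_new[:, np.newaxis])
print(model.coef_[0], model.intercept_, y_hat)
```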

New cards

What is the purpose of the below code?

import matplotlib.pyplot as plt

import seaborn as sns

import numpy as np

Import python packages

New cards

Your dataset consists of details about customer traits, such as "number of items in the basket at checkout" and "time of day of checkout". Your task is to group customers that are like each other together. You don't already have labeled customer types. What kind of model are you building?

Unsupervised model (like k means)

New cards

What is ONE reason the textbook lists for why a Linear regression is a good starting point in a modeling task.

They are interpretable

New cards

What is the first variable in a decision tree called (before any of the branches)?

Root

New cards

One problem with decision trees is that they are prone to

Overfitting

New cards

If you are not careful, or do not set the __________________ appropriately, decision trees can overfit

Max depth

New cards

The random forest algorithm prevents, or at least avoids to some extent, the problems with overfitting found in decision trees. (True or False)

True

New cards

Random forests can only be used on classification problems (true or false)

False
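
As a quick illustration that random forests also handle regression, the sketch below fits scikit-learn's `RandomForestRegressor` to a continuous target. The sine-wave data is synthetic and an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic continuous target (an assumption for illustration): noisy sine wave.
rng = np.random.default_rng(0)
x = 10 * rng.random(200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)
X = x[:, np.newaxis]

# An ensemble of 100 trees whose predictions are averaged.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
y_hat = forest.predict(X)      # continuous predictions, not class labels
train_r2 = forest.score(X, y)  # R-squared on the training data
print(train_r2)
```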

New cards

In order to interpret decision trees, it's necessary to first run a linear regression (True or False)

False

New cards

Decision trees are nice because they are fairly simple and straightforward to interpret (True or False)

True

New cards

When running our first decision tree, we took out "max_depth=". This had the unfortunate result of...

Building a very large, hard-to-understand tree
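
A small sketch of the effect, assuming a synthetic classification data set with some label noise. Without `max_depth`, the tree keeps splitting until it fits the training data perfectly. Capping the depth keeps it small enough to read.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No max_depth: the tree grows until every training record is classified.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# max_depth=3: a much smaller tree that is easier to interpret.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("deep", deep), ("shallow", shallow)]:
    print(name, tree.get_depth(), tree.get_n_leaves(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))
```

A gap between the training and test scores of the unrestricted tree is the usual sign of overfitting.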

New cards

What is the terminal node as discussed in the lecture?

The last node (sometimes called a leaf); the tree doesn't split after this

New cards

Models, such as the random forest model we ran, often have a number of parameters that the analyst can choose or set. What is the best source of up-to-date information about the different parameters that can be set?

The scikit-learn documentation

New cards

Random forests are __________ interpretable than decision trees

Less

New cards

Pipelines are useful (in the analytics with Python sense) for what reasons?

Make it easy to repeat/replicate steps and run multiple models, help organize the code you used to clean and treat data, and make it easy to change small things in a model, like which variables to include.
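
A minimal sketch of these benefits, assuming scikit-learn's bundled diabetes data set and two arbitrary models chosen for illustration. The same scaling step is reused for each model, and swapping a model means changing one entry in the dictionary.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Candidate models; adding or swapping one is a one-line change.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

results = {}
for name, estimator in models.items():
    # The pipeline bundles the data treatment and the model into one object.
    pipe = make_pipeline(StandardScaler(), estimator)
    pipe.fit(X, y)
    results[name] = pipe.score(X, y)  # R-squared on the training data

print(results)
```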

New cards

Y and y-hat are a little different. Y is our target vector, and y-hat is an output in our model that is a.....

Estimate or prediction of y

New cards

The basic idea of a regression is very simple. We have some X values (we called these ___________) and some Y value (this is the variable we are trying to _________). We could have multiple Y values, but that is not something we have covered.

Features; Predict

New cards

When looking at the code in the videos, we sometimes used a variable to hold our model.

What is the significance of the word "model" in the below code?

model = LinearRegression(fit_intercept=True)

'model' is a named variable and is just holding our linear regression model. It could be renamed anything. The word itself is not important. It is just a container.

New cards

Which of the below were discussed as being problems with the hold out method for validation?

Outliers can skew the results and the model is not trained on all of the data

New cards

Which of the following is a common use case for the random forest algorithm in machine learning?

Classifying data into categories based on input features

New cards

Which of the following is a potential benefit of using decision trees in machine learning?

Can handle both numerical and categorical data

New cards

Which of the following statements best describes an ensemble method in machine learning?

A technique that combines the results of multiple models to improve overall predictive accuracy

New cards

Which of the following best describes supervised learning?

A machine learning approach where the algorithm receives labeled data and learns to map inputs to outputs based on those labels

New cards

Which of the following statements best describes classification in machine learning?

A type of supervised learning where the goal is to assign input data points to predefined categories or classes

New cards

We want the R-squared value for our regression model to be 100% (true or false)

False

New cards

One weakness of cross-validation discussed is that information can sometimes ____ across different periods. A common situation in which this happens is when we are looking at stock data.

Leak
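
One common way to reduce this kind of leakage with time-ordered data is to split so that training rows always come before test rows. The sketch below uses scikit-learn's `TimeSeriesSplit` on eight made-up, time-ordered observations. It is offered as an illustration and is not necessarily the approach shown in the lecture.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered observations (hypothetical, e.g. daily stock prices).
X = np.arange(8).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
splits = list(splitter.split(X))

for train_idx, test_idx in splits:
    # Every training index is earlier than every test index,
    # so the model never sees the future while training.
    print("train:", train_idx, "test:", test_idx)
```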

New cards

In which of these situations would you want to use a clustering algorithm?

You have a dataset containing customer data for Cheesecake Factory and you want to look at customer spending at the restaurant in order to find patterns among customers who share similar characteristics

New cards

What is a potential downside of using linear regression models in machine learning?

They are prone to overfitting the data

New cards

What type of algorithm would you use to segment customers into groups?

Assume the groups are already labeled.

Decision trees, regression, random forest, cluster regression

New cards

Which of the following is true about data validation and cross-validation in machine learning?

Data validation and cross-validation are used to evaluate a model's performance and prevent overfitting
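
A short sketch contrasting the hold-out method with 5-fold cross-validation. It assumes scikit-learn's bundled diabetes data set and a linear regression, both chosen only for illustration. With hold-out, part of the data is never used for training. With cross-validation, every record is used for training in some folds and for testing in exactly one.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold-out: train on 80% of the data, evaluate once on the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# Cross-validation: five train/test rounds, each with a different test fold.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(holdout_r2, cv_scores, cv_scores.mean())
```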

New cards

What is the role of cluster centers in clustering, and how are they determined during the algorithm?

Cluster centers are the initial data points chosen randomly to begin clustering, and they are updated iteratively to minimize the within-cluster sum of squares

New cards

Which of the following machine learning models utilizes supervised learning?

Regression

New cards

What is scikit-learn?

A machine learning package in Python that has built in machine learning algorithms we can use on our dataset

New cards

Which of the following best describes the difference between a supervised and an unsupervised learning task in machine learning?

A supervised learning task requires labeled data, while an unsupervised learning task does not

New cards

Which is true about linear regression models?

They are easy to interpret