QMB3302 Final Exam Study: Computer Science Terms & Definitions

50 Terms

New cards

Pipelines (in the analytics-with-Python sense) are useful for which of the following reasons? (choose all that apply)

- Pipelines make it easy to repeat/replicate steps and run multiple models

- Pipelines are good for moving data into your programming environment

- Pipelines automatically update to new versions of Python

- Pipelines help organize code you used to clean and treat your data

- Pipelines make it very easy to change small things in your model, like which variable to include

- Pipelines make it easy to repeat/replicate steps and run multiple models

- Pipelines help organize code you used to clean and treat your data

- Pipelines make it very easy to change small things in your model, like which variable to include
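As a minimal sketch of the three correct answers, assuming scikit-learn's `make_pipeline` and a made-up toy data set (the numbers and the choice of steps are illustrative, not taken from the course notebooks):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features (one value missing) and a target.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Cleaning steps and the model live in one object, so the whole
# sequence can be re-run or reused with a different model.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LinearRegression(),
)
pipe.fit(X, y)
predictions = pipe.predict(X)

# Changing one small thing (here, the imputation strategy) is a one-line edit.
pipe.set_params(simpleimputer__strategy="mean")
pipe.fit(X, y)
```

`make_pipeline` names each step after its lowercased class name, which is why the parameter is addressed as `simpleimputer__strategy`.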

New cards

The basic idea of a regression is very simple. We have some X values (which we call ______) and some Y value that we are trying to _____. We could have multiple Y values, but that is not something we have covered.

features; predict

New cards

Y and y-hat are a little different. Y is our target vector, and y-hat is an output of our model that is a... (choose one of the following)

- estimate or prediction of y

- the actual value of y

- an axis on our 2 way graph

- a combination of XY intercept coordinates

estimate or prediction of y

New cards

When looking at the code in the videos, we sometimes used a variable to hold our model. What is the significance of the word "model" in the below code?

model = LinearRegression(fit_intercept=True)

'model' is a named variable and is just holding our linear regression model. It could be renamed anything. The word itself is not important. It is just a container.
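A small sketch of both points, using made-up data: the estimator is stored under a different name (`reg`, an arbitrary choice) to show that the variable name carries no meaning, and `y_hat` holds the model's estimates of `y` rather than the actual values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise.
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # features (rows x columns)
y = np.array([1.1, 2.9, 5.2, 6.8])          # target vector (actual values)

# The variable name is arbitrary; "reg" holds the estimator just as "model" would.
reg = LinearRegression(fit_intercept=True)
reg.fit(X, y)

y_hat = reg.predict(X)  # the model's estimates (predictions) of y
residuals = y - y_hat   # gap between each actual value and its estimate
```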

New cards

What is a good model fit value?

unknowable without knowing/understanding the context of the domain

New cards

Imagine X in the below is a missing value. If I were to run a median imputer on this set of data, what would the return value be?

50, 60, 70, 80, 100, 60, 5000, X
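The card lists no answer. Sorting the seven observed values gives 50, 60, 60, 70, 80, 100, 5000, so the median (the middle value) is 70. The sketch below checks this with scikit-learn's `SimpleImputer`, representing X as `NaN`; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The values from the card, with X represented as NaN (missing).
values = np.array(
    [[50], [60], [70], [80], [100], [60], [5000], [np.nan]], dtype=float
)

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(values)

# Median of the seven observed values, used to fill the gap.
fill_value = imputer.statistics_[0]
print(fill_value)  # 70.0
```

The outlier 5000 barely affects the median, whereas a mean imputer would fill in roughly 774.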

New cards

Which of the below were discussed as being problems with the holdout method for validation?

- Data is not available for test and control differences

- Outliers can skew the result

- The model is not trained on all of the data

- K=3 is not sufficiently large enough

- Validation is sometimes too challenging

- Outliers can skew the result

- The model is not trained on all of the data
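A hedged sketch of the "not trained on all of the data" point, on synthetic data. The comparison with k-fold cross-validation is an assumption about what the lecture contrasted the holdout method with; the split sizes and `cv=5` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: one feature, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=40)

# Holdout: 25% of the rows are set aside, so the model never trains on them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every row is used for testing once
# and for training in the other four folds.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5)
```

Because the holdout score depends on which rows land in the test set, a few outliers in either split can move it noticeably.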

New cards

The features in a model...

- are used as proxies for y-hat divided by y

- are always functions of each other

- keep the model validation process stable

- none of these answers are correct

none of these answers are correct

New cards

What is the first variable in a decision tree called (before any of the branches)?

root

New cards

One problem with decision trees is that they are prone to _____ if you are not careful or do not set the _____ appropriately.

overfitting; max depth

New cards

True or False: The random forest algorithm prevents, or at least avoids to some extent, the problems with overfitting found in decision trees.

True

New cards

True or False: Random Forests can only be used on classification problems

False
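To illustrate why the answer is False, here is a minimal sketch on synthetic data showing that scikit-learn ships a random forest for continuous targets (`RandomForestRegressor`) as well as for class labels (`RandomForestClassifier`). The data and `n_estimators=50` are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic features with one continuous target and one binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y_continuous = X[:, 0] + 0.5 * X[:, 1]
y_label = (y_continuous > 0).astype(int)

# Regression: predicts a number.
reg_forest = RandomForestRegressor(n_estimators=50, random_state=0)
reg_forest.fit(X, y_continuous)

# Classification: predicts a class (0 or 1 here).
clf_forest = RandomForestClassifier(n_estimators=50, random_state=0)
clf_forest.fit(X, y_label)
```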

New cards

True or False: In order to interpret Decision Trees, it is necessary to first run a linear regression

False

New cards

True or False: Decision Trees are nice because they are fairly simple and straightforward to interpret

True

New cards

When running our first decision tree, we took out "max_depth=". This had the unfortunate result of...

Building a very large, hard-to-understand tree
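A sketch of the effect on synthetic data, assuming scikit-learn's `DecisionTreeRegressor`. The data set and `max_depth=3` are illustrative, not the values used in the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data: a sine curve plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# No max_depth: the tree keeps splitting until it memorizes the training rows.
unlimited = DecisionTreeRegressor(random_state=0).fit(X, y)

# max_depth=3: at most 2**3 = 8 leaves, small enough to read and plot.
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

print(unlimited.get_depth(), unlimited.get_n_leaves())
print(shallow.get_depth(), shallow.get_n_leaves())
```

The unlimited tree scores almost perfectly on the data it was trained on, which is the symptom of overfitting: it has fit the noise as well as the signal.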

New cards

What is the terminal node as discussed in the lecture?

The last node (sometimes called a leaf if you google the term); the tree doesn't split after this

New cards

Models, such as the random forest model we ran, often have a number of parameters that the analyst can choose or set.

What is the best source of up-to-date information about the different parameters that can be set?

The scikit-learn documentation

New cards

Random forests are _____ interpretable than decision trees.

less

New cards

True or False: The correct number of clusters in Hierarchical clustering can be determined precisely using approaches such as silhouette scores.

False

New cards

True or False: In K Means clustering, the analyst does not need to determine the number of clusters (K); it is always derived analytically by the k-means algorithm.

False

New cards

True or False: One big difference between the unsupervised approaches in this module and the supervised approaches in prior modules: unsupervised models do not have a target variable (Y). This makes it difficult to know when they are "right" or correct

True

New cards

According to the documentation, a silhouette score of 1 is _____, and -1 is _____.

the best score; the worst score
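A minimal sketch of silhouette scores in practice, assuming scikit-learn's `silhouette_score` and two made-up, well-separated blobs. The score is a guide for comparing values of k, not an exact answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two synthetic, well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(30, 2))
X = np.vstack([blob_a, blob_b])

# Score several candidate values of k; scores always fall between -1 and 1.
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On data this clean the highest score lands on k = 2; on real customer data the scores are usually much closer together, which is why the analyst still has to make a judgment call.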

New cards

Select all that apply. Imagine you have a data set with columns/inputs for customers:

Column 1 = Customer ID (a number)

Column 2 = Sales (a dollar value)

Column 3 = Frequency (a number)

Column 4 = Satisfaction (a number)

You would like to understand the impact of Frequency on customer Satisfaction. What types of approaches could you use?

Note that the type of data is in parentheses () after the column name. Choose the best answer(s) from the available choices below.

- Decision Tree

- K Means

- Random Forest

- Linear Regression

- Hierarchical Clustering

- Decision Tree

- Random Forest

- Linear Regression

New cards

Select all that apply. Imagine you have a data set with columns/inputs for customers:

Column 1 = Customer ID

Column 2 = Distance to Stores

Column 3 = Year spend

Column 4 = Likelihood to return

What kind of approach(es) could you use to understand more about these customers?

- Regression - to understand the effect of one or more variables on the others

- Clustering - to develop groups of customers that have similar patterns

- Regression - to understand the effect of one or more variables on the others

- Clustering - to develop groups of customers that have similar patterns

New cards

What is the purpose of the following code:

from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

rfm_std = scale.fit_transform(df)

to standardize the data
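To make "standardize" concrete: `StandardScaler` subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1. In this sketch a small made-up NumPy array stands in for the card's `df`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df: two columns on very different scales.
data = np.array([
    [10.0, 100.0],
    [20.0, 200.0],
    [30.0, 300.0],
    [40.0, 1000.0],
])

scale = StandardScaler()
rfm_std = scale.fit_transform(data)

col_means = rfm_std.mean(axis=0)  # approximately 0 for every column
col_stds = rfm_std.std(axis=0)    # approximately 1 for every column
```

This matters for distance-based methods such as k-means, where an unscaled column with large values would otherwise dominate the distance calculation.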

New cards

True or False: The elbow method provides an exact number of clusters for a k-means algorithm

False

New cards

True or False: Hierarchical clustering is more powerful than k-means, as it allows the researcher to determine the exact number of clusters to use in the analysis

False

New cards

In k-means, the algorithm runs multiple iterations. If we have a simple 2D problem and k = 2, it begins by assigning the first centroids to _____, and then _____ of each point or record to the centroid.

a random initial starting point; measuring the distance
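The iterations can be sketched by hand in NumPy. This is an illustrative version of the standard k-means loop on synthetic 2D data, not the scikit-learn implementation; the fixed count of 10 iterations is an arbitrary simplification.

```python
import numpy as np

# Synthetic 2D data: two groups of 20 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.4, size=(20, 2)),
    rng.normal(loc=[6.0, 6.0], scale=0.4, size=(20, 2)),
])
k = 2

# Step 1: pick k random records as the starting centroids.
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):
    # Step 2: measure the distance from every point to every centroid.
    distances = np.linalg.norm(
        points[:, None, :] - centroids[None, :, :], axis=2
    )
    # Step 3: assign each point to its nearest centroid.
    labels = distances.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points.
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
```

Steps 2 through 4 repeat until the assignments stop changing; a production implementation would check for that instead of running a fixed number of passes.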

New cards

An example this week was done in a Jupyter-like environment called Google Colab. What was the language that was demonstrated in the videos?

TensorFlow

New cards

True or False: Neural Networks in computing are exactly the same as neural networks from biology.

False

New cards

When viewing a diagram of a neural network there are several layers. Match the layer to the description below:

Input Layer

Options:

- these are the X's, or inputs from your data

- these are the Y (the target variable you are interested in)

- something you don't see, here there is some computation to transform the X's to the Y

- the layer that translates the axons

these are the X's, or inputs from your data

New cards

When viewing a diagram of a neural network there are several layers. Match the layer to the description below:

Output Layer

Options:

- these are the X's, or inputs from your data

- these are the Y (the target variable you are interested in)

- something you don't see, here there is some computation to transform the X's to the Y

- the layer that translates the axons

these are the Y (the target variable you are interested in)

New cards

When viewing a diagram of a neural network there are several layers. Match the layer to the description below:

Hidden Layer

Options:

- these are the X's, or inputs from your data

- these are the Y (the target variable you are interested in)

- something you don't see, here there is some computation to transform the X's to the Y

- the layer that translates the axons

something you don't see, here there is some computation to transform the X's to the Y

New cards

True or False: Deep Neural Networks have only 1 hidden layer and multiple input layers

False

New cards

The nodes are linked to one another by connections, and each of those connections has a _____.

Activation function

New cards

True or False: In an attempt to fit values from the input layer to the output layer, the hidden layer applies some weights to the input values.

True
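A rough NumPy sketch of a single forward pass through input, hidden, and output layers. The layer sizes, the random weights, and the ReLU activation are all assumptions made for illustration; this is not the TensorFlow code from the videos.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.0, 2.0])      # input layer: three X values

W_hidden = rng.normal(size=(3, 4))  # weights from 3 inputs to 4 hidden nodes
b_hidden = np.zeros(4)
W_out = rng.normal(size=(4, 1))     # weights from 4 hidden nodes to 1 output
b_out = np.zeros(1)


def relu(z):
    """Activation function: negative values become 0, the rest pass through."""
    return np.maximum(z, 0.0)


# Hidden layer: weight the inputs, add a bias, apply the activation.
hidden = relu(x @ W_hidden + b_hidden)

# Output layer: the network's estimate of Y.
y_hat = hidden @ W_out + b_out
```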

New cards

The example we walked through was from a fairly famous dataset for learning about machine learning. The data set is called:

MNIST

New cards

All the nodes prior to the output nodes essentially 'guess' at the correct weights. The algorithm checks to see if the initial guess is correct (usually not). When it is wrong...

... it tries again (runs another epoch)
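The guess, check, and try-again cycle can be sketched with a single weight and plain gradient descent. This is a deliberately tiny stand-in for what a neural network library does across many weights; the data, learning rate, and epoch count are made up.

```python
import numpy as np

# Made-up data where the correct weight is exactly 2.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X

w = 0.0               # initial guess at the weight (wrong on purpose)
learning_rate = 0.01
losses = []

for epoch in range(200):
    y_hat = w * X                             # predict with the current weight
    error = y_hat - y                         # check the guess against the truth
    losses.append(float(np.mean(error ** 2)))
    gradient = 2.0 * np.mean(error * X)       # direction that reduces the error
    w -= learning_rate * gradient             # adjust, then try again next epoch
```

Each pass over the data is one epoch; the loss shrinks as the weight moves toward 2.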

New cards

True or False: Neural networks are an unsupervised technique, because there is no target variable.

False

New cards

NLP stands for...

natural language processing

New cards

Tokenization, as defined in the lecture is...

A computer turning letters and/or words into something it can read and understand, like numbers
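A minimal word-level sketch of that idea in plain Python. Real tokenizers are more sophisticated; the vocabulary-building scheme and the use of 0 for unknown words are illustrative assumptions.

```python
sentences = ["the cat sat", "the dog sat down"]

# Build a vocabulary: each distinct word gets an integer id, starting at 1.
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1


def tokenize(text):
    """Turn text into a list of integer ids; unknown words map to 0."""
    return [vocab.get(word, 0) for word in text.lower().split()]


tokens = tokenize("The cat sat down")
print(tokens)  # [1, 2, 3, 5]
```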

New cards

Recommenders come in many flavors. 2 of the most common, often used together and discussed in the lecture are: (choose the following)

- Item Based

- User Based

- Algorithm Oriented

- Stock Availability Based

- Syntax Dependent

- Item Based

- User Based

New cards

Imagine you have a dataset with 2 columns, both filled with continuous numbers. You believe the first column is a predictor of the second column. Which of the model approaches below could work when building a model?

- Regression

- Decision Tree

- Running .describe and .info on the data

- Graphing

- Random Forest

- Regression

- Decision Tree

- Random Forest

New cards

Decision trees have a few problems. The problem we talked about the most is:

overfitting

New cards

In y = ax + b

a is commonly known as _____ and b is commonly known as _____.

slope; intercept
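A quick check with made-up points that lie exactly on y = 2x + 1, using NumPy's `polyfit` (which returns coefficients from the highest power down, so slope first and intercept second):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0  # points exactly on a line with slope 2 and intercept 1

# Fit a degree-1 polynomial: a is the slope, b is the intercept.
a, b = np.polyfit(x, y, deg=1)
```

The slope `a` is how much y changes for each one-unit change in x; the intercept `b` is the value of y when x is 0.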

New cards

True or False: The LinearRegression estimator is only capable of simple straight line fits

False
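One common way to see why this is False is to feed `LinearRegression` transformed features. The sketch below assumes scikit-learn's `PolynomialFeatures` and a made-up parabola; the model stays linear in its coefficients but the fitted curve is not a straight line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 25)
X = x[:, np.newaxis]
y = x ** 2  # clearly curved relationship

# A plain straight-line fit cannot capture the curve.
straight = LinearRegression().fit(X, y)

# Adding x**2 as a feature lets the same estimator fit the parabola.
curved = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(straight.score(X, y), curved.score(X, y))
```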

New cards

In class we walked through 5 steps to building a machine learning model. The textbook covers these 5 steps in some depth. What are they?

First Step: choosing a class of model

Second Step: choosing hyperparameters

Third Step: arrange data

Fourth Step: fit the model

Fifth Step: predict
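The five steps map onto a few lines of scikit-learn. This is a sketch on synthetic data; the data-generating line, the random seed, and the new x values are illustrative assumptions.

```python
import numpy as np

# Step 1: choose a class of model.
from sklearn.linear_model import LinearRegression

# Step 2: choose hyperparameters when creating the model.
model = LinearRegression(fit_intercept=True)

# Step 3: arrange data into a features matrix X and a target vector y.
rng = np.random.default_rng(42)
x = 10 * rng.random(50)
y = 2 * x - 1 + rng.normal(size=50)
X = x[:, np.newaxis]  # scikit-learn expects a 2D features matrix

# Step 4: fit the model to the data.
model.fit(X, y)

# Step 5: predict for new, unseen x values.
x_new = np.linspace(-1, 11, 5)
y_new = model.predict(x_new[:, np.newaxis])
```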

New cards

What is the purpose of the below code?

Note that this is probably EASIER than similar questions on the final exam. But I will ask you why/purpose/what-for questions about the code I have had you run. It is useful to make notes in your notebooks about why a certain chunk of code is run.

import matplotlib.pyplot as plt

import seaborn as sns

import numpy as np

import Python packages

New cards

Choosing a class of models

Your data set consists of details about customer traits, such as "number of items in the basket at checkout". Your task is to group customers that are like each other together. You don't already have labeled customer types. What kind of model are you building?

unsupervised model (such as K means)

- Reminder: if you have a bunch of Xs but no Ys, the problem is unsupervised. When you are building a supervised model, you have both an X and a Y. The hint here is "you don't already have labeled customer types": without these labels (the Y), you can't have any supervision.

New cards

What is ONE reason the textbook lists for why a linear regression is a good starting point in a modeling task?

they are interpretable