A compilation of discussion post questions from the class
True or False: Qualitative data use labels or names to identify categories of like items.
True
Multiple Choice: Heat maps are used for:
A) Visualizations within data mining for correlations and missing data
B) Creating a histogram
C) Making a Holt's model
D) Showing scatter plots for variable pairs
A) Visualizations within data mining for correlations and missing data
Which of the following is the most popular rotational method?
A. Quartimax rotations
B. Varimax rotations
C. Promax rotations
D. Equamax rotations
B. Varimax rotations
Which of the following is not a predictive measure of error?
A. MAD
B. Average Error
C. Total SSE
D. ROC Curve
D. ROC Curve
Which distribution plot uses color to show correlation amongst variables?
A) Scatterplots
B) Heat maps
C) Box plots
D) Histograms
B) Heat maps
Could you have a dummy variable hold a value of any number?
A) True
B) False
B) False
"High separation of records" means that using predictor variables attains _____
A. High Error
B. No Error
C. Low Error
C. Low Error
Measures of Variability are the tabular, graphical, and numerical methods used to summarize and present data. (True/False)
False
The goal of principal components analysis is to increase the number of numerical variables. (T/F)
F
What does jittering do?
A. Moves markers by a small random amount
B. Moves the data closer together
C. Uncrowds the data by allowing more markers to be seen
D. Both A and C
D. Both A and C
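A minimal sketch of jittering in Python. The `jitter` helper, its `amount` parameter, and the use of uniform noise are assumptions made for illustration, not something taken from the course material:

```python
import random

def jitter(values, amount=0.1, seed=None):
    """Return a copy of values, each moved by a small random amount."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Five markers that would be drawn on top of each other at x = 3
x = [3, 3, 3, 3, 3]
x_jittered = jitter(x, amount=0.1, seed=42)
print(x_jittered)  # five slightly different values near 3
```

Because the shift is small relative to the scale of the data, the overall pattern is preserved while overlapping markers become visible.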
The curse of dimensionality is the “affliction caused by adding variables to multivariate data”. Why is this a problem for data mining exercises?
A: Too many variables will only allow for one type of data mining technique which is not that accurate: regressions.
B: Can be compared to a chess board, adding a third dimension to chess boards would increase location options by 800%.
C: Too many variables make box plots insufferable to observe.
D: Too many dimensions leave too much remaining noise to perform data mining
B and D
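The chess board comparison in option B is simple arithmetic, sketched below. The variable names are illustrative only:

```python
side = 8

squares_2d = side ** 2  # locations on a flat 8 x 8 board
cells_3d = side ** 3    # locations in an 8 x 8 x 8 cube

ratio = cells_3d / squares_2d
print(squares_2d, cells_3d, ratio)
```

Going from two dimensions to three takes the 64 locations to 512, which is 8 times (800% of) the original count, so the same number of records is spread far more thinly.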
Which of the following would not be considered a measure of variability?
a) Standard Deviation
b) Percentiles
c) Interquartile Range
d) Range
b) Percentiles
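As a rough illustration, the three measures of variability listed above can be computed with Python's standard `statistics` module. The data values are made up, and the quartiles use the module's default method, which may differ slightly from a textbook's convention:

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]  # made-up sample

data_range = max(data) - min(data)            # range
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles (25th, 50th, 75th percentiles)
iqr = q3 - q1                                 # interquartile range
std_dev = statistics.stdev(data)              # sample standard deviation

print(data_range, iqr, std_dev)
```

A percentile on its own marks a position in the data; it only describes spread when two percentiles are subtracted, as in the interquartile range.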
Both sensitivity and specificity can address the question, 'how often is the test right?' (True/False)
True
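A short sketch of how sensitivity and specificity are computed from classification counts. The four counts below are hypothetical:

```python
# Hypothetical counts from a validation set
true_positives = 40   # actual class 1, classified as 1
false_negatives = 10  # actual class 1, classified as 0
true_negatives = 45   # actual class 0, classified as 0
false_positives = 5   # actual class 0, classified as 1

# How often the test is right for records that really are class 1
sensitivity = true_positives / (true_positives + false_negatives)

# How often the test is right for records that really are class 0
specificity = true_negatives / (true_negatives + false_positives)

print(sensitivity, specificity)
```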
T/F Descriptive statistics are the tabular, graphical, and numerical methods used to summarize and present data
True
T/F Qualitative data are numerical values that indicate how much or how many, and quantitative data use labels or names to identify categories of like items
False
Naive Rule classifies all records as belonging to the most prevalent class (True/False).
True
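The naive rule can be sketched in a few lines of Python. The class labels and the `naive_rule` helper are hypothetical:

```python
from collections import Counter

def naive_rule(training_labels):
    """Return the most prevalent class, ignoring all predictors."""
    most_common_class, _count = Counter(training_labels).most_common(1)[0]
    return most_common_class

# Hypothetical training labels: 7 nonowners, 3 owners
training_labels = ["nonowner"] * 7 + ["owner"] * 3
majority = naive_rule(training_labels)

# Every new record gets the same classification
new_records = ["record_a", "record_b", "record_c"]
predictions = [majority for _ in new_records]
print(predictions)
```

The rule is mainly useful as a baseline: a classifier that uses predictors should do better than this.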
Which of the following types of data uses labels or names to identify categories of like items?
a. Ordinal
b. Qualitative
c. Interval
d. Quantitative
b. Qualitative
Which option is not a step in the cutoff-for-classification process? Choose one.
a. Compare to cutoff value, and classify accordingly
b. Compute the probability of belonging to class "0"
c. Compute the probability of belonging to class "1"
b. Compute the probability of belonging to class "0"
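The two steps that do belong to the process can be sketched as follows. The probabilities are made up, the cutoff of 0.5 is a common default rather than a value given here, and treating a probability exactly equal to the cutoff as class 1 is an assumption:

```python
# Step 1: estimated probability of belonging to class "1" for each record
# (made-up values standing in for a model's output)
prob_class_1 = [0.92, 0.48, 0.51, 0.10, 0.50]

# Step 2: compare to the cutoff value and classify accordingly
cutoff = 0.5
classes = [1 if p >= cutoff else 0 for p in prob_class_1]
print(classes)
```

Raising or lowering `cutoff` changes how many records are assigned to class 1 without changing the estimated probabilities.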
What model should one use when dealing with a continuous outcome in a supervised learning problem?
a. Logistic Regression
b. Regression
c. Cluster Analysis
d. Principal Components
b. Regression
Which of the following is not a measurement of error?
A: Tracking Signal
B: Bias
C: Mean Absolute Deviation
D: Mean Squared Error
B: Bias
The goal of a Principal Component Analysis is to increase the number of numerical variables (True/False).
False
Is a histogram a basic plot or a distribution plot?
A. Basic plot
B. Distribution plot
B. Distribution plot
Which of the following is not a measure of error?
A. Mean absolute deviation (MAD)
B. Mean absolute percent error (MAPE)
C. Tracking signal
D. Tracing error
E. Mean squared error (MSE)
D. Tracing error
For Predictions, which of the following is NOT a metric for performance?
A. Average Error
B. GALE
C. MAPE
D. RMSE
B. GALE
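For reference, a minimal sketch of the prediction metrics named in the options above, plus MAD. The data are made up, and defining the error as actual minus predicted is an assumed sign convention:

```python
import math

actual = [10, 20, 30, 40]     # made-up actual values
predicted = [12, 18, 33, 36]  # made-up predictions

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

average_error = sum(errors) / n                               # signed, so errors can cancel
mad = sum(abs(e) for e in errors) / n                         # mean absolute deviation
mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n  # percent
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)             # root mean squared error

print(average_error, mad, mape, rmse)
```

Average error shows whether predictions run high or low on balance, while MAD, MAPE, and RMSE measure the size of the deviations regardless of direction.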
Bar Charts are useful for comparing multiple statistics like average, count, percentage, etc. across groups (True/False)
False
Which of the following is NOT considered a basic plot for data exploration?
a. Line Graphs
b. Scatter Plots
c. Bar Charts
d. Histograms
d. Histograms
Graphical Methods for Categorical data include dot plots, histograms, and scatter diagrams (True/False)
False
Distribution plots display “how many” of each value occur in a data set (True/False).
True
Percentage of misclassified records out of the total records in the validation data.
A. Error
B. Accuracy
C. Error rate
D. Naive
C. Error rate
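A small sketch of the error rate on a validation set, using made-up actual and predicted classes:

```python
# Made-up validation data: actual vs. predicted class for 10 records
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

misclassified = sum(1 for a, p in zip(actual, predicted) if a != p)
error_rate = misclassified / len(actual)
accuracy = 1 - error_rate

print(misclassified, error_rate, accuracy)
```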
A single categorical variable with m categories is typically transformed into m+1 dummy variables (True/False)
False
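A sketch of the usual transformation, in which a variable with m categories becomes m-1 dummy variables that each hold 0 or 1. The `make_dummies` helper, the `is_` naming, and the choice of the alphabetically first category as the reference are assumptions for illustration:

```python
def make_dummies(values):
    """Encode a categorical variable with m categories as m-1 0/1 dummies."""
    categories = sorted(set(values))
    kept = categories[1:]  # drop the first category; it becomes the reference
    return [{f"is_{c}": int(v == c) for c in kept} for v in values]

# Hypothetical variable with m = 3 categories
vehicle_type = ["sedan", "suv", "truck", "suv"]
rows = make_dummies(vehicle_type)

for value, row in zip(vehicle_type, rows):
    print(value, row)
```

The reference category ("sedan" here) is identified by all dummies being 0, which is why the extra column is not needed.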
The ROC curve was first used in what war?
a. Cold War
b. WW2
c. Vietnam War
d. WW1
b. WW2
Error is classifying all records as belonging to the most prevalent class (True/False)
False
Fill in the blank: We simplify decision trees by _____ peripheral branches to avoid overfitting.
A. Collecting
B. Pruning
C. Avoiding
D. Eliminating
B. Pruning
The MAPE is a measure of the percentage of how much predictions deviate from the actual values (True/False)
True
Multiple choice: Two important charts that visualize distribution of data are boxplots and _________
a. histograms
b. line charts
c. bar charts
d. scatter plots
a. histograms
The Naïve rule is to classify all records as belonging to the most prevalent class (True/False)
True
What is the second step in factor analysis?
A. the correlation matrix for all variables is computed
B. Factor extraction
C. Factor rotation
D. Make final decisions about the number of underlying factors
B. Factor extraction
We select the split that most increases the Gini Index (True/False)
False
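The split that is chosen is the one that most reduces impurity. The sketch below evaluates candidate splits on a single predictor using the weighted Gini index of the two resulting nodes. The data and helper names are made up, and using midpoints between consecutive values as candidates is an assumption:

```python
def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    """Impurity after a split, weighted by the size of each child node."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up records: (predictor value, class), sorted by predictor value
records = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 0)]
labels = [c for _, c in records]

# Candidate split points: midpoints between consecutive predictor values
candidates = [(records[i][0] + records[i + 1][0]) / 2 for i in range(len(records) - 1)]

results = {}
for split in candidates:
    left = [c for x, c in records if x < split]
    right = [c for x, c in records if x >= split]
    results[split] = weighted_gini(left, right)

best_split = min(results, key=results.get)
best_impurity = results[best_split]
print(best_split, best_impurity, gini(labels))
```

On this data the split at 3.5 gives the lowest weighted Gini index, which is below the impurity of the unsplit node.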
What is the process of recursive partitioning?
a. Graft two branches together
b. Repeatedly split the records into two parts
c. Simplify the tree by removing branches
d. Create a new branch
b. Repeatedly split the records into two parts
Binary Logistic Regression results in a V-shaped distribution function. (True/False)
False
Multiple Choice: What are the "Odds" in step 2 of The Logit?
a. Ratio
b. Question
c. Quantity
d. Linear
a. Ratio
The goal of trees and rules is to classify or predict an outcome based on a set of predictors. (True/False)
True
Logit can be modeled as a linear function of the ____
A. Probabilities
B. Outcomes
C. Variables
D. Predictors
D. Predictors
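A minimal sketch linking the logit, the odds, and the probability. The coefficients `beta0` and `beta1` and the predictor value `x` are hypothetical:

```python
import math

def logit_to_probability(logit):
    """Logistic response function: maps any logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients and a single predictor value
beta0, beta1 = -3.0, 0.5
x = 6.0

logit = beta0 + beta1 * x        # the logit is linear in the predictor
p = logit_to_probability(logit)  # probability of belonging to class 1
odds = p / (1 - p)               # odds: ratio of p to (1 - p)

print(logit, p, odds)
```

Taking the natural log of `odds` returns the logit, and plotting `p` against `x` gives the S-shaped logistic curve.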
Which of the below are advantages of regression trees?
A. Can work without extensive handling of missing data
B. Produce rules that are easy to interpret & implement
C. Variable selection & reduction is automatic
D. All of the above
D. All of the above
Simple linear regression is a relation between 3 continuous variables? (True/False)
False
What is the proper definition of pruning as used for decision trees?
a. The process of dividing a node into two smaller nodes.
b. The process of adding a whole section of a tree.
c. The process of cutting down the tree.
d. None of the above.
c. The process of cutting down the tree.
The logistic distribution is an S-shaped distribution function (True/False)
True
When determining the cutoff value, the popular initial choice is 0.45. (True/False)
False
Which of the following is not a term used when talking about decision trees?
a. Splitting
b. Pruning
c. Toning
d. Grafting
c. Toning
The two most popular ways to measure Impurity are the Gini Index and Entropy Measure (True/False)
True
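Both impurity measures can be written as short functions of the class proportions in a node. The function names and the two example nodes are illustrative, and the entropy uses log base 2, which is one common convention:

```python
import math

def gini_index(proportions):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """Entropy measure: -sum(p * log2(p)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

pure_node = [1.0, 0.0]   # all records in one class
mixed_node = [0.5, 0.5]  # two classes equally represented

for node in (pure_node, mixed_node):
    print(node, gini_index(node), entropy(node))
```

Both measures are 0 for a pure node and reach their two-class maximum (0.5 for Gini, 1.0 for entropy) when the classes are evenly mixed.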
Which of the following is NOT an example of a categorical class for stock acquisition?
A. Sell
B. Hold
C. Consider
D. Buy
Multiple linear regression is the relation between 2 continuous variables (True/False)
False
When referring to tree structure, split points become nodes on the tree (True/False)
True