These flashcards cover the key concepts, activities, and techniques involved in Phase 3 (Model Planning) of the analytics lifecycle based on the provided lecture notes.
What is the Model Planning Phase in the analytics lifecycle?
Phase 3, which sits between data preparation and model building, where decisions are made about what to build and how to evaluate success before writing modeling code.
What is the primary benefit of pre-specifying evaluation criteria and techniques during Phase 3?
It prevents the common modeling failure of deciding metrics after seeing results, which reduces bias by ensuring that outcomes do not influence how they are judged.
According to the transcript, what are the three specific purposes of the Model Planning Phase?
1. Select the right analytical technique for the problem type; 2. Define how the model will be evaluated before results are seen; 3. Establish the structure of the dataset used for modeling.
In Activity 1, how do the Exploratory Data Analysis (EDA) findings directly shape technique selection?
Findings such as heavily imbalanced target variables, distributions, correlations, and missing data patterns determine which algorithms are appropriate and which metrics matter.
Match the following problem types to their target characteristics: Classification, Regression, Clustering, and Association.
Target is a category → Classification
Target is a continuous number → Regression
No target, finding natural groups → Clustering
Finding co-occurrence patterns → Association
How do you apply constraint filters in Activity 3 of the Model Planning phase?
Interpretability required? Eliminates black-box methods. Regulated industries (banking, insurance) often require explainable decisions.
How much data? Very small datasets need simple models. Complex models overfit on small data.
Real-time scoring needed? Large ensemble models may be too slow for millisecond decisions.
Feature types? Mix of numeric and categorical favors tree-based methods.
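Worked sketch (not from the lecture notes): the filters above can be read as a sequence of eliminations over a candidate list. The `Candidate` class, the `apply_constraint_filters` function, and every rating in `CANDIDATES` are hypothetical values chosen for illustration; real thresholds depend on the project. The feature-type filter is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    interpretable: bool   # can individual decisions be explained?
    min_rows: int         # assumed data size below which overfitting is a concern
    fast_scoring: bool    # assumed suitable for millisecond scoring


# Illustrative ratings only; these are assumptions, not values from the notes.
CANDIDATES = [
    Candidate("logistic_regression", True, 100, True),
    Candidate("decision_tree", True, 500, True),
    Candidate("random_forest", False, 5_000, False),
    Candidate("gradient_boosting", False, 5_000, False),
]


def apply_constraint_filters(candidates, need_interpretable, n_rows, need_realtime):
    """Return the names of candidates that survive every active filter."""
    kept = []
    for c in candidates:
        if need_interpretable and not c.interpretable:
            continue  # black-box method eliminated
        if n_rows < c.min_rows:
            continue  # too little data for this model's complexity
        if need_realtime and not c.fast_scoring:
            continue  # too slow for real-time scoring
        kept.append(c.name)
    return kept


shortlist = apply_constraint_filters(
    CANDIDATES, need_interpretable=True, n_rows=2_000, need_realtime=True
)
print(shortlist)  # ['logistic_regression', 'decision_tree']
```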
What constraint often limits the use of 'black-box' methods in regulated industries like banking and insurance (Activity 3 of the Model Planning phase)?
The requirement for interpretability. Regulated industries (banking, insurance) often require explainable decisions: the models and their results must be understandable to stakeholders, because decisions based on them can significantly affect individuals and organizations.
What strategy is suggested when selecting candidate models in Activity 4 of Model Planning?
Select 2–3 candidate models and always start with a simple baseline (e.g., logistic regression or linear regression) before building more complex models.
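Worked sketch (not from the lecture notes): one way to put "baseline first" into practice with scikit-learn. The synthetic data from `make_classification`, the choice of a random forest as the complex candidate, and AUC as the comparison metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared modeling dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The simple baseline is listed first; the complex model must beat it to be worth using.
candidates = {
    "baseline_logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = roc_auc_score(y_test, scores)

for name, auc in results.items():
    print(f"{name}: AUC = {auc:.3f}")
```

If the complex model only matches the baseline, the baseline is usually the better choice because it is easier to explain and maintain.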
What types of variables are typically excluded during variable selection (Activity 5 of model planning)?
Variables that are redundant (highly correlated with each other), represent leakage (unavailable at prediction time), or add noise (no plausible relationship to the target).
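Worked sketch (not from the lecture notes): a minimal way to drop leakage and redundant variables with pandas. The column names, the 0.9 correlation threshold, and the `select_variables` helper are hypothetical. Noise variables are not handled here because judging "no plausible relationship to the target" needs domain knowledge.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
tenure = rng.normal(24, 6, n)
df = pd.DataFrame({
    "tenure_months": tenure,
    "tenure_years": tenure / 12,                        # redundant copy of tenure_months
    "monthly_spend": rng.normal(50, 10, n),
    "cancellation_reason_code": rng.integers(0, 5, n),  # only known after the outcome
})

# Leakage columns are flagged by hand: they are unavailable at prediction time.
LEAKAGE_COLUMNS = ["cancellation_reason_code"]


def select_variables(frame, leakage, corr_threshold=0.9):
    """Drop leakage columns, then drop the later column of each highly correlated pair."""
    kept = frame.drop(columns=leakage)
    corr = kept.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        if a in to_drop:
            continue
        for b in cols[i + 1:]:
            if b not in to_drop and corr.loc[a, b] > corr_threshold:
                to_drop.add(b)
    return [c for c in cols if c not in to_drop]


selected = select_variables(df, LEAKAGE_COLUMNS)
print(selected)  # ['tenure_months', 'monthly_spend']
```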
What happens when you pre-specify evaluation metrics in Activity 6 of model planning?
Decide how models will be compared before running any of them.
For classification: AUC, accuracy, precision, recall.
For regression: RMSE, MAE, R².
Setting the metric in advance prevents unconsciously picking the metric that makes your preferred model win.
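Worked sketch (not from the lecture notes): why fixing the metric in advance matters. The two sets of predictions below are made up so that RMSE and MAE disagree about which model is better; `EVALUATION_PLAN` and `pick_winner` are hypothetical names.

```python
import math

# Written down before any model is trained (hypothetical plan).
EVALUATION_PLAN = {"problem_type": "regression", "primary_metric": "rmse"}


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


METRICS = {"rmse": rmse, "mae": mae, "r2": r2}


def pick_winner(y_true, predictions_by_model, plan):
    """Score every model with the pre-specified metric and return the best one."""
    name = plan["primary_metric"]
    scores = {m: METRICS[name](y_true, p) for m, p in predictions_by_model.items()}
    # Higher is better for r2; lower is better for rmse and mae.
    best = max(scores, key=scores.get) if name == "r2" else min(scores, key=scores.get)
    return best, scores


y_true = [10.0, 12.0, 14.0, 16.0]
predictions = {
    "model_a": [11.0, 13.0, 15.0, 17.0],  # every error is 1
    "model_b": [10.0, 12.0, 14.0, 19.0],  # three exact predictions, one error of 3
}

winner, scores = pick_winner(y_true, predictions, EVALUATION_PLAN)
print(winner, scores)  # model_a {'model_a': 1.0, 'model_b': 1.5}
```

Under MAE the ranking flips (model_a scores 1.0, model_b scores 0.75), which is exactly the choice that should not be made after seeing results.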
What is 'Step 4' of the Model Planning Phase according to Linoff & Berry (also referred to as Activity 7)?
Construct the dataset that will be used for modeling:
Define the eligible population — which records are valid training examples?
Define the observation date — features measured before this date, outcomes after
Plan the train/test split structure
Plan handling of class imbalance (oversampling, class weights)
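Worked sketch (not from the lecture notes): the four construction steps above applied to a toy churn table. The table, the cutoff date, and all column names are invented for illustration; the class-weight formula is the common "inverse frequency" heuristic, used here as an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

OBSERVATION_DATE = pd.Timestamp("2024-01-01")  # hypothetical cutoff

customers = pd.DataFrame({
    "customer_id": range(1, 21),
    "signup_date": pd.to_datetime(["2023-03-01"] * 18 + ["2024-02-01"] * 2),
    # Feature measured before the observation date.
    "purchases_before_cutoff": [3, 5, 1, 0, 2, 7, 4, 1, 0, 6, 2, 3, 5, 1, 0, 2, 4, 3, 0, 0],
    # Outcome measured after the observation date.
    "churned_after_cutoff": [0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
})

# 1. Eligible population: only customers who existed on the observation date.
eligible = customers[customers["signup_date"] < OBSERVATION_DATE]

# 2. Features come from before the date, the target from after it.
X = eligible[["purchases_before_cutoff"]]
y = eligible["churned_after_cutoff"]

# 3. Train/test split, stratified so both parts keep the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=6, stratify=y, random_state=0
)

# 4. Class weights inversely proportional to class frequency in the training set.
counts = y_train.value_counts()
class_weight = {label: len(y_train) / (len(counts) * n) for label, n in counts.items()}

print(len(eligible), len(X_train), len(X_test))  # 18 12 6
print(class_weight)
```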
What is dataset structure, and what does it mean in the model planning phase?
Dataset structure refers to the arrangement and organization of data used in the analysis process. In model planning this means:
What is the grain? (one row = one customer, one transaction, one store?)
What is the observation date for each record?
Which columns are features (inputs) and which is the target (output)?
How is the data split for training vs. evaluation?
In the context of dataset structure, what does 'grain' refer to?
The level of detail of the dataset: what one row represents, such as one customer, one transaction, or one store.
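Worked sketch (not from the lecture notes): a quick check that a table really has the grain you think it has. The `grain_violations` helper and the sample rows are hypothetical.

```python
from collections import Counter


def grain_violations(rows, key_columns):
    """Return the key values that appear on more than one row."""
    keys = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in keys.items() if n > 1}


transactions = [
    {"customer_id": 1, "order_id": "A1", "amount": 20.0},
    {"customer_id": 1, "order_id": "A2", "amount": 35.0},
    {"customer_id": 2, "order_id": "B1", "amount": 12.5},
]

# Declared grain: one row = one transaction. No key repeats.
print(grain_violations(transactions, ["order_id"]))     # {}

# If the model needs one row = one customer, customer 1 appears twice,
# so the table must be aggregated before modeling.
print(grain_violations(transactions, ["customer_id"]))  # {(1,): 2}
```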
Distinguish between structured and unstructured data based on the transcript.
Structured data is organized in a specific format or schema (like rows and columns).
Unstructured data (text, images, audio, video) lacks a specific format and requires specialized processing like NLP or computer vision to extract meaningful information from it.
What are analytical techniques?
Analytical techniques are the methods and tools used to analyze and process data to achieve business objectives.
Which analytical techniques are listed for classification in model planning?
For Classification (predicting categories):
Logistic Regression — interpretable, good baseline
Decision Trees — visual, interpretable, handles mixed feature types
Random Forest — ensemble of trees, high accuracy, feature importance
Gradient Boosting (XGBoost) — often highest accuracy on tabular data
Which analytical techniques are listed for regression in model planning?
For Regression (predicting continuous values):
Linear Regression — interpretable, fast, assumes linearity
Ridge/Lasso — linear regression with regularization for many features
Random Forest Regression — handles nonlinearity
Gradient Boosting Regression — high accuracy
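Worked sketch (not from the lecture notes): what "regularization for many features" looks like in scikit-learn. The synthetic data, in which only the first of ten features drives the target, and the `alpha` values are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # ten candidate features
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 matters

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

print("ridge non-zero coefficients:", np.count_nonzero(ridge.coef_))
print("lasso non-zero coefficients:", np.count_nonzero(lasso.coef_))
```

Ridge keeps every feature with a smaller weight, while Lasso tends to zero out the irrelevant ones, so it doubles as a rough variable selector.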
Which classification technique is described as an ensemble of trees that provides high accuracy and feature importance?
Random Forest.
Which analytical techniques are listed for clustering in model planning?
For Clustering (finding natural groups):
K-Means — partitions records into k groups by minimizing within-cluster distance
Hierarchical Clustering — builds a tree of nested clusters
Which analytical techniques are listed for association in model planning?
For Association (finding co-occurrences):
Apriori Algorithm — finds frequent itemsets and association rules
What is the specific purpose of the Apriori Algorithm?
It is an Analytical Technique for Association used to find frequent itemsets and association rules.
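Worked sketch (not from the lecture notes): a small pure-Python version of the Apriori idea, written for readability rather than speed. The `apriori` function, the baskets, and the 0.6 support threshold are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

frequent = apriori(baskets, min_support=0.6)

# Association rule beer -> diapers: confidence = support(both) / support(beer).
confidence = frequent[frozenset({"beer", "diapers"})] / frequent[frozenset({"beer"})]
print(len(frequent), confidence)  # 8 1.0
```

Here four single items and four pairs are frequent; no three-item set reaches the threshold. Every basket containing beer also contains diapers, so the rule has confidence 1.0.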
Which R package is described as providing a unified interface where the same code structure works for hundreds of different algorithms?
R with caret.

Which Python library is described as the industry-standard ML library, with a consistent .fit() / .predict() interface across all algorithms and pipelines that chain preprocessing and modeling steps?
Python scikit-learn
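Worked sketch (not from the lecture notes): the consistent `.fit()` / `.predict()` interface and a pipeline that chains preprocessing with a model. The synthetic data and the `build_pipeline` helper are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)


def build_pipeline(estimator):
    """Chain a scaling step with any estimator behind one fit/predict interface."""
    return Pipeline([("scale", StandardScaler()), ("model", estimator)])


# The same two calls work regardless of which algorithm sits inside the pipeline.
for estimator in (
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
):
    pipe = build_pipeline(estimator)
    pipe.fit(X, y)
    preds = pipe.predict(X)
    print(type(estimator).__name__, preds[:5])
```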
Define the analytical technique 'K-Means'.
A clustering method that partitions records into k groups by minimizing within-cluster distance.
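Worked sketch (not from the lecture notes): K-Means on six hand-picked points that form two obvious groups. The points and the choice of `n_clusters=2` are assumptions; cluster label numbers are arbitrary, so only the grouping is meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],   # group near (8, 8)
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = model.labels_

print(labels)                  # first three points share one label, last three the other
print(model.cluster_centers_)  # one center per group
print(model.inertia_)          # within-cluster sum of squared distances being minimized
```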
What modern alternative to caret in R includes recipes (feature engineering), parsnip (model specification), rsample (data splitting), and yardstick (evaluation metrics)?
R with tidymodels.
Why are SAS and SPSS highlighted as tools for model planning?
They are enterprise tools common in regulated industries (finance, pharma) due to their strong audit trail and validation capabilities.
How does dataset size influence model selection in Phase 3?
Very small datasets require simple models because complex models tend to overfit on small data.
What is the term for potential models for clustering, classifying, or finding relationships in data, selected during planning before any model is built?
Candidate models
Data Structure
The arrangement and organization of data used in the analysis process.
Analytical techniques
Methods and tools used to analyze and process data to achieve business objectives.
Variable selection
The process of identifying essential predictors and variables to include in the model.
Structured data
Data organized in a specific format or schema, making it easier to analyze.
Unstructured data
Data that lacks a specific format or structure, often requiring additional processing before analysis.