Characteristics of high-quality predictive modeling problems
Clearly identified business issue, questions are well defined, good and useful data is available for answering the questions, predictions will drive actions and increase understanding, predictive analytics can likely improve the approach, and the model can be updated and monitored over time
Producing meaningful problem definition
Get to the root cause of the business issue, develop a hypothesis, use KPIs, consider the availability of good data and implementation issues, and make the problem specific enough to solve
What should be true for data to be considered relevant?
Data needs to be unbiased and representative of the population and time frame of interest
Sampling
The process of taking a subset of observations from the population to generate the dataset
Random sampling
Randomly draw observations from the population without replacement. Each observation is equally likely to be sampled
Stratified Sampling
Divide the population into non-overlapping strata and randomly sample a set number of observations from each stratum
Systematic sampling
This is a special case of stratified sampling. Draw observations according to a set repeatable pattern.
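The three sampling schemes above can be sketched in Python; the population and its two strata here are made-up examples:

```python
import random

# Hypothetical population: 10 observations, each labeled by a stratum ("A" or "B").
population = [("A", i) for i in range(6)] + [("B", i) for i in range(4)]

# Random sampling: draw without replacement, each observation equally likely.
random.seed(0)
simple = random.sample(population, k=4)

# Stratified sampling: randomly sample a set number from each non-overlapping stratum.
strata = {"A": [p for p in population if p[0] == "A"],
          "B": [p for p in population if p[0] == "B"]}
stratified = [obs for group in strata.values() for obs in random.sample(group, k=2)]

# Systematic sampling: draw according to a set, repeatable pattern (every 3rd record).
systematic = population[::3]

print(len(simple), len(stratified), len(systematic))  # 4 4 4
```

Note that stratified sampling guarantees representation from every stratum, while simple random sampling may miss a small stratum entirely.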
What makes data high quality?
Data values should be reasonable in the business context, follow consistent units and rules, and be sufficiently documented
Structured data
This is data that can fit into a tabular arrangement
Structured data pros and cons
Structured data is easier to manipulate but cannot represent data that can’t naturally fit in a tabular arrangement
Unstructured Data
Data that cannot fit into a tabular arrangement, such as audio, images, etc.
Unstructured data pros and cons
This data is more flexible but harder to access and include in a predictive model
Personally Identifiable Information
This is information that can be used to trace to an individual’s identity
How can personally identifiable information be handled?
De-identify the data to remove personally identifiable information, ensure the data is sufficiently encrypted, and conduct model operations in accordance with the terms of use
Sensitive variables
These are variables that may be significant but using them in a model may lead to legal, fairness, and discrimination concerns
Target leakage
This occurs when a predictor in a model leaks information about the target variable that will not be available at the time the model is used to make predictions
How can you detect target leakage?
The predictor variable is observed at the same time as or after the target variable, the target variable is included in a predictor, the target variable is used to construct principal components, or the target variable is used to form clusters
Goals of Exploratory Data Analysis
Validate data and generate insights for modeling
What does it mean to validate data
Clean the data of inappropriate values, errors, and outliers so that it is ready for analysis
What does it mean to generate insights for modeling
Understand the characteristics of variables, identify potentially useful predictors, generate new features, and decide which type of model is best
Pros and cons of summary statistics
Summary statistics are objective and comparable across variables, but can only capture certain aspects of distributions and are easily affected by outliers
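The outlier sensitivity mentioned above is easy to see with a toy sample (the values are hypothetical):

```python
# Hypothetical sample: five typical values plus one extreme outlier.
values = [10, 12, 11, 13, 12, 200]

# The mean is pulled far toward the outlier...
mean = sum(values) / len(values)

# ...while the median barely notices it.
ordered = sorted(values)
mid = len(ordered) // 2
median = (ordered[mid - 1] + ordered[mid]) / 2

print(mean, median)  # 43.0 12.0
```

This is why robust summary statistics (median, interquartile range) are often reported alongside the mean during exploratory data analysis.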
Pros and cons of graphical displays
Graphical displays provide a visual impression of a distribution and can reveal information not easily captured by summary statistics, but they are less precise than summary statistics and less comparable across variables
Three types of bivariate graphical displays for categorical variables
Stacked bar charts, dodged bar charts, and filled bar charts
When to use bar charts
To display counts of levels in a categorical variable
When to use filled bar chart
When you want to display the proportions of levels in a categorical variable. This loses the information about the count of each level
When to use a dodged bar chart
To compare counts of the levels of one categorical variable across the levels of another
Issue with correlated numeric predictors
Difficult to separate effects of individual predictors on a target variable, coefficient estimates become unstable, and for decision trees collinearity can dilute variable importance scores
Issue with skewness due to outliers in a variable distribution
Extreme values can overly influence fitted values and distort graphical displays
Solutions to skew due to outliers
Remove outliers if they are due to data entry errors or make up a small proportion of the dataset, or apply a log or square root transformation if the data is positive
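The effect of a log transformation on skewed positive data can be sketched with made-up claim amounts:

```python
import math

# Hypothetical right-skewed positive data (e.g., claim amounts) with one outlier.
claims = [120, 150, 180, 200, 250, 300, 9000]

# On the raw scale the largest value is 75x the smallest;
# on the log scale it is only about 1.9x, so the outlier's
# influence on fitted values and plots is greatly compressed.
logged = [math.log(x) for x in claims]

ratio_raw = max(claims) / min(claims)
ratio_log = max(logged) / min(logged)
print(round(ratio_raw, 1), round(ratio_log, 1))  # 75.0 1.9
```

The square root transformation works similarly but compresses less aggressively, and unlike the log it also handles zeros.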
When should a numerical variable be converted into a factor?
A numerical variable should be converted into a factor if it has a small number of distinct values, its values are merely labels with no numerical meaning, or it has a complex relationship with the target variable.
For GLMs, why is it better to convert a numerical variable into a factor when it has a complex relationship with the target variable?
Factor conversion gives GLMs more flexibility to capture the relationship
When should numerical variables not be converted into factors
The variable has a large number of distinct values, its values have a numerical order that could be useful for predicting the target variable, it has a simple relationship with the target variable, or future observations will have new variable values
Why are sparse levels bad for a categorical predictor?
Sparse levels reduce the robustness of the model and can cause overfitting
Sensitivity
This is the proportion of positive observations correctly classified as positive or TP/(TP+FN)
Specificity
This is the proportion of negative observations correctly classified as negative, or TN/(TN+FP)
Precision
This is the proportion of positive predictions that are actually positive, or TP/(TP+FP)
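The three classification metrics above follow directly from the confusion-matrix counts; the counts here are a hypothetical example:

```python
# Confusion-matrix counts from a hypothetical binary classifier.
TP, FN, TN, FP = 40, 10, 35, 15

sensitivity = TP / (TP + FN)  # positives correctly classified as positive
specificity = TN / (TN + FP)  # negatives correctly classified as negative
precision = TP / (TP + FP)    # positive predictions that are actually positive

print(sensitivity, specificity, round(precision, 3))  # 0.8 0.7 0.727
```

Sensitivity and precision share the same numerator (TP) but different denominators: sensitivity divides by the actual positives, precision by the predicted positives.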
Solutions to unbalanced data
Oversampling and Undersampling
What is Undersampling and what are its cons?
This is when all observations from the smaller class are kept, and fewer observations from the larger class are selected. This can cause the model to become less robust and the classifier to become more prone to overfitting due to the smaller amount of data
Oversampling
This is when all observations from the larger class are kept and additional observations are sampled (with replacement) from the smaller class. It must be done only on the training data, after the train/test split, so duplicated observations do not leak into the test set. It is also more computationally expensive because it enlarges the training set
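Both resampling schemes can be sketched on a made-up unbalanced training set:

```python
import random

random.seed(1)
# Hypothetical unbalanced training split: 8 negatives, 2 positives.
negatives = [("neg", i) for i in range(8)]
positives = [("pos", i) for i in range(2)]

# Undersampling: keep all of the smaller class, draw fewer from the larger class.
under = positives + random.sample(negatives, k=2)

# Oversampling: keep all of the larger class, resample with replacement from
# the smaller class. Apply only to the training split, never before splitting,
# so duplicated observations cannot leak into the test set.
over = negatives + random.choices(positives, k=8)

print(len(under), len(over))  # 4 16
```

Undersampling shrinks the data (risking a less robust model), while oversampling grows it (raising the computational cost); both produce a balanced training set.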
Feature Generation
This is the act of creating new variables from the original variables. It attempts to transform the information in the original variables into a more useful form and to make the model easier to interpret.
Dimensionality vs Granularity
Dimensionality refers to the number of model inputs a variable generates (e.g., the number of dummy variables created from a categorical variable), while granularity refers to how specifically the information is recorded
Model validation
This checks that the model has no obvious deficiencies and that its assumptions hold
Offset term
An offset term is an included variable whose coefficient is fixed at 1 and is used to adjust for exposure. This allows the expected target variable to grow proportionally with exposure
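The proportional-growth property of an offset can be verified numerically; the coefficients below are made up for illustration, assuming a Poisson-style model with a log link:

```python
import math

# Hypothetical rate model: log(mu) = b0 + b1*x + 1*log(exposure).
# The offset log(exposure) enters with its coefficient fixed at 1.
b0, b1 = -2.0, 0.5

def expected_count(x, exposure):
    # Expected target: mu = exposure * exp(b0 + b1*x).
    return math.exp(b0 + b1 * x + math.log(exposure))

# Doubling exposure doubles the expected count, holding x fixed.
mu1 = expected_count(x=1.0, exposure=100)
mu2 = expected_count(x=1.0, exposure=200)
print(round(mu2 / mu1, 6))  # 2.0
```

If the exposure coefficient were estimated rather than fixed at 1, the expected target would no longer be forced to scale proportionally with exposure.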
Weight term
A weight term is not a predictor; it tells the model how much influence each observation should have when the coefficients are estimated.
Assumptions of a GLM
The response variable follows a distribution from the exponential family, observations are independent, the mean of the response is related to the predictors through a link function, and the variance of the response is determined by its mean