CRISP-DM Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding Phase
define business requirements/objectives
translate objectives into DM problem definition
prepare initial strategy to meet objectives
Data Understanding Phase
collect data
assess data quality
perform EDA
Data Preparation Phase
clean, prepare, and transform dataset
prepare for modeling in following phases
choose cases/variables appropriate for analysis
Modeling Phase
select and apply 1+ modeling techniques
calibrate modeling settings to optimize results
additional data prep if required
Evaluation Phase
evaluate 1+ models for effectiveness
determine if objectives achieved
make decision regarding DM results before deploying
Deployment Phase
make use of models created
simple deployment: generate report
complex deployment: implement additional DM effort in another department
in business, customer often carries out deployment based on model
Description (DM Task)
describes general patterns and trends
easy to interpret/explain
transparent models
pictures and #’s
like scatterplots, descriptive stats, range
Estimation (DM Task)
uses numerical and/or categorical predictor (IV) values to estimate the value of a numerical target variable (DV)
like regression models - predicting something based off another variable
Classification (DM Task)
like estimation, but target variables (DV) are categorical
like classification of simple vs. complex tasks, fraudulent card transactions, income brackets
can accommodate a categorical variable as your target when partitioned into 2+ categories
Prediction (DM Task)
similar to estimation and classification, but with a time component
like what is the probability of Hogs winning a game with a certain combination of player profiles, or future stock behavior
Clustering (DM Task)
similar to classification, but no target variables
clustering tasks don’t aim to estimate, predict, or classify a target variable
only segmentation of data
like focused marketing campaigns
Association (DM Task)
no target variable
finding attributes of data that go together
profiling relationships between 2+ attributes
understand consequent behaviors based on previous behaviors
like supermarkets seeing what items are purchased together (affinity analysis)
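a minimal affinity-analysis sketch in Python, assuming a hypothetical list of baskets; it only counts how often item pairs appear together (their support), not a full association-rule run:
```python
# Affinity analysis sketch: count how often pairs of items are bought together.
# The transactions list is hypothetical example data.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / n)   # fraction of baskets containing both items
```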
Learning Types (DM Task)
Supervised
have a target variable
target values/categories are known in the training data (like estimation, classification)
Unsupervised
exploratory, no target variable
searching for patterns across variables
don’t know target variables or categories
like clustering
Supervised DM Tasks
estimation, classification, prediction
Unsupervised DM Tasks
Clustering, association
ways to handle missing data
user-defined constant (0.0 or “Missing”)
mode / mean / median
random values
imputation - most likely value based on other attributes for the record
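a quick pandas sketch of these strategies (hypothetical DataFrame and column names):
```python
# Sketch of the missing-data strategies above (hypothetical DataFrame / columns).
import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000, 48000],
                   "region": ["N", "S", None, "N"]})

df["income_const"] = df["income"].fillna(0.0)                  # user-defined constant
df["region_const"] = df["region"].fillna("Missing")
df["income_mean"]  = df["income"].fillna(df["income"].mean())  # mean (median/mode similar)
df["region_mode"]  = df["region"].fillna(df["region"].mode()[0])
df["income_rand"]  = df["income"].fillna(
    df["income"].dropna().sample(1).iloc[0])                   # random observed value
# imputation from other attributes would fit a model (e.g., regression/kNN) on the
# complete records and predict the missing value for each incomplete record
```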
Min-Max Normalization
determines how much greater the field value is than the minimum value for the field, scales the difference by the field’s range
values range from 0 to 1
good for KNN
X* = (X - min(X))/(max(X)-min(X))
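tiny sketch of the formula on made-up values:
```python
# Min-max normalization: X* = (X - min) / (max - min); values land in [0, 1].
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])           # hypothetical field values
x_mm = (x - x.min()) / (x.max() - x.min())
print(x_mm)                                    # [0., 0.25, 0.583, 1.] approximately
```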
Z-score Standardization
takes difference between field value and field value mean, scales difference by field’s stdev
typically range from -4 to 4
z = (X - mean(X))/stdev(X)
can be used to identify outliers if they’re beyond standard range
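same made-up values, z-score standardized:
```python
# Z-score standardization: z = (X - mean) / stdev; flag records far from 0 as outliers.
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])            # same hypothetical values as above
z = (x - x.mean()) / x.std()                   # np.std defaults to the population stdev
print(z)
print(np.abs(z) > 3)                           # crude outlier flag
```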
Skewness
right skewed → mean > median, tail to the right, positive skewness
left skewed → mean < median, tail to the left, negative skewness
cutoff: -2 to 2
can use natural log transformation to make data more normal
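a small before/after check of the log transform on synthetic right-skewed data:
```python
# Right-skewed data usually has skewness > 0; log-transforming pulls it toward 0.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # synthetic right-skewed data

print("raw skewness:", skew(x))           # noticeably positive
print("log skewness:", skew(np.log(x)))   # close to 0 (log of lognormal is normal)
```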
Kurtosis
like skewness, but for the height/peakedness and tail weight of the distribution
cutoff: -10 to 10
Normality vs. Normalization
Normality - to transform variable so its distribution is closer to normal without changing its basic information
Normalization - standardizes the mean/variance or range of every variable and the effect that each variable has on results
Confidence interval
used when the population has a normal distribution or when n is large
indicates how precise an estimate is (narrow = precise)
point estimate ± margin of error
Margin of error
range of values above and below a sample statistic
MoE = z-score * (population stdev / sqrt(n))
as sample size increases, margin of error will decrease, CI will shrink
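a worked sketch with made-up numbers (95% confidence, z ≈ 1.96):
```python
# Point estimate ± margin of error, using MoE = z * (stdev / sqrt(n)).
import math

x_bar, sigma, n = 50.0, 10.0, 100      # hypothetical sample mean, population stdev, n
z = 1.96                               # z-score for 95% confidence
moe = z * (sigma / math.sqrt(n))       # 1.96 * (10 / 10) = 1.96
print((x_bar - moe, x_bar + moe))      # (48.04, 51.96); larger n -> narrower interval
```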
Multiple regression
uses multiple independent variables to explain the dependent variable
y = B0 + B1X1 + B2X2 + … + BnXn
Ex. Rating = 59.9 - 2.46*Sugars + 1.33*Carbs
Every additional unit of sugar will decrease the rating by 2.46, and every additional unit of carbs will increase it by 1.33 (see the regression sketch below)
assumptions:
multicollinearity - flag if correlations between the IVs are above 0.8
linearity - check that the histogram of residuals is approximately normal and the P-P plot is mostly linear
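a hedged statsmodels sketch of the Rating ~ Sugars + Carbs model (the data here is made up, not the actual cereals dataset):
```python
# Multiple regression sketch: Rating ~ Sugars + Carbs (hypothetical data, OLS).
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"Sugars": [6, 8, 5, 0, 12, 3],
                   "Carbs":  [14, 12, 13, 21, 11, 17],
                   "Rating": [45.0, 36.0, 50.0, 74.0, 29.0, 61.0]})

X = sm.add_constant(df[["Sugars", "Carbs"]])   # adds the intercept term B0
model = sm.OLS(df["Rating"], X).fit()
print(model.params)                             # B0, B1 (Sugars), B2 (Carbs)
print(model.rsquared_adj)
# check multicollinearity via df[["Sugars", "Carbs"]].corr() before trusting coefficients
```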
overfitting
model fits the training data too closely (memorizing noise), so it performs well on training data but poorly on the validation set and generalizes badly to new patterns
Backward regression
starts with the most complex model (all variables) and step by step removes them until it finds the model with the most explanatory power
consumes the most resources
Forward regression
starts with the least complex model (one IV) and iterates through models that increase in complexity until it finds the one with the most explanatory power
consumes the least resources
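a rough sketch of the forward-selection loop by adjusted R², assuming a pandas DataFrame df holding the target and candidate IVs (not a library routine):
```python
# Greedy forward selection: start with no IVs, add whichever most improves adjusted R^2.
import statsmodels.api as sm

def forward_select(df, target, candidates):
    selected, best = [], float("-inf")
    candidates = list(candidates)
    while candidates:
        # score each remaining IV when added to the current model
        scored = []
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            scored.append((sm.OLS(df[target], X).fit().rsquared_adj, var))
        score, var = max(scored)
        if score <= best:            # stop when no IV improves the fit
            break
        best = score
        selected.append(var)
        candidates.remove(var)
    return selected
# backward elimination works the same way in reverse: start with every IV and drop
# variables one at a time until removing more no longer improves the fit
```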
Parametric vs. non-parametric models
Parametric
have assumptions on the distributions and characteristics of the data (like skewness/kurtosis), more structured and efficient
Non-parametric
makes no assumptions, can adapt structure based on the data
k-Nearest Neighbors
aka instance-based learning or memory-based reasoning
training set records stored, then classification is performed for new unclassified records based on records they’re most similar to
distance measured with the Euclidean distance formula (numeric variables); classifications can be weighted by distance
min-max normalization can be used to scale variables so weight is standard
for categorical variables, assign 1 if value is the same for unclassified records and 0 if different
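a short scikit-learn sketch combining min-max scaling with distance-weighted kNN (feature names and data are hypothetical):
```python
# kNN sketch: min-max scale the predictors, then classify by the k most similar records.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X_train = [[25, 40000], [47, 91000], [52, 88000], [33, 52000]]  # hypothetical [age, income]
y_train = ["no", "yes", "yes", "no"]

scaler = MinMaxScaler()                          # min-max normalization (0 to 1)
X_scaled = scaler.fit_transform(X_train)

knn = KNeighborsClassifier(n_neighbors=3, weights="distance")   # Euclidean by default
knn.fit(X_scaled, y_train)
print(knn.predict(scaler.transform([[41, 70000]])))
```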
Decision Tree
collection of nodes connected by branches, extending downward from a root node and terminating in leaf nodes
supervised learning
target variables must be categorical
CART (classification and regression trees)
computes “goodness” of candidate splits/optimality measures
produces only binary splits (two branches per node), so the tree will naturally be large
C4.5 Tree
similar to CART, but selects the optimal split using information gain (choose the split with the highest gain, i.e., the greatest entropy reduction)
not limited to binary splits, so trees tend to be more efficient and smaller
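a small scikit-learn sketch; note sklearn's DecisionTreeClassifier is CART-style (binary splits), used here with the entropy criterion only to illustrate the idea, on made-up data:
```python
# Decision tree sketch (CART-style binary splits via scikit-learn; hypothetical data).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 64], [1, 72], [0, 81], [1, 55], [0, 90], [1, 60]]   # [has_flag, score]
y = ["low", "high", "high", "low", "high", "low"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)  # entropy ~ information gain
tree.fit(X, y)
print(export_text(tree, feature_names=["has_flag", "score"]))    # show the learned splits
```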