what is business analytics
study of data through statistical and operational analysis, formation of predictive models, application of optimization techniques and communication of these results to customers
purpose of business analytics
turning big data sets into insights
why business analytics matter
helps firms move from intuition-based decisions to data-driven decisions. gives businesses a competitive advantage
DELTA model
Data, Enterprise Orientation, Leadership, Targets, Analysts
D in DELTA Model
Data
accessible, high quality data sets
E in DELTA
Enterprise Orientation
analytics should be used across departments and not siloed
L in DELTA
Leadership
executives champion data-driven decisions
T in DELTA
Targets
clear strategic goals
A in DELTA
Analysts (and technology)
Skilled analysts should be using proper tooling
requirements for successful analytics implementation
high quality data, enterprise wide buy-in, strategic alignment, analytical skills
CRISP-DM
Cross-Industry Standard Process for Data Mining
Step #1 in CRISP-DM
Business Understanding
define the business problem and objectives before touching the data. “what are we trying to improve by using this data?”
Step #2 in CRISP-DM
Data Understanding
get familiar with the data. collect initial data sources, describe the data and explore with visualizations
Step #3 in CRISP-DM
Data Preparation
make the data usable for modeling. clean missing or invalid values - this step usually takes 60-80% of the project time.
Step #4 in CRISP-DM
Modeling
build models that are actually able to describe and predict. select algorithms, set parameters and train/test using data splits.
model
simplified mathematical representation that helps you understand something
descriptive models
summarize what happened or what is happening (eg, a dashboard or summary report)
predictive models
uses historical data to forecast future outcomes
regression models
type of predictive model that predicts a numeric (continuous) value
classification models
type of predictive model that tries to predict a category (eg, fraud or no fraud, churn or no churn, etc)
prescriptive models
uses data + predictions to tell you what to do next
Step #5 in CRISP-DM
Evaluation
assess if the model actually meets the business goal that you were trying to solve.
Step #6 in CRISP-DM
Deployment
implement the insights of the model into business operations. includes monitoring and maintenance over time.
analytically impaired
decisions are made through guesswork
localized analytics
isolated within individual teams, no coordination
analytical aspirations
some leadership support
analytical companies
consistent use of analytics in multiple areas
analytical competitors
analytics are embedded in the culture of the organization
data visualization
simplifying complex data to make patterns visible and understandable
histogram
shows data distribution & skewness
box and whisker plot
detects outliers and variability
scatter plots
shows correlation between two numeric variables
line chart
displays change over time
bar chart
compares categories
pie chart
shows part-to-whole relationships
bubble chart
adds a third variable using bubble size
heat map
uses color intensity to represent values in different dimensions
stacked charts
compares multiple data series within each category (and their total)
scatter matrix
explores relationships among many variables
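a quick sketch of a few of these chart types using matplotlib (assumed available); the numbers below are invented just to show the calls:

import matplotlib.pyplot as plt
import numpy as np

values = np.random.normal(50, 10, 200)        # invented numeric variable
x = np.random.rand(50)
y = 2 * x + np.random.rand(50) * 0.3          # a second, correlated variable

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(values, bins=20)              # histogram: distribution & skewness
axes[0, 1].boxplot(values)                    # box & whisker: outliers & variability
axes[1, 0].scatter(x, y)                      # scatter plot: correlation between two numerics
axes[1, 1].bar(["A", "B", "C"], [3, 7, 5])    # bar chart: compares categories
plt.tight_layout()
plt.show()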
data-ink ratio
the ratio of ink used to display data vs. the ink used for decoration. when making data visualizations, remove unnecessary grid lines, shading, 3D effects, etc.
miller’s law
“magic number seven, plus or minus two”
humans can only hold about 5-9 pieces of information in working memory, so keep dashboards and visualizations simple
performance dashboards
used to monitor business performance in real time. displays KPIs visually, and delivers the right metrics at the right time.
supervised learning
you know the outcome (target variable). used for prediction and classification
classification model
discrete outcome; used with decision trees and logistic regression
prediction and regression model
continuous outcome, used with linear regression
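a minimal sketch of the two supervised tasks with scikit-learn (assumed available); the tiny data sets are invented for illustration:

# regression predicts a number, classification predicts a category
from sklearn.linear_model import LinearRegression, LogisticRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])        # single invented predictor

# regression: continuous outcome (e.g., a sales amount)
reg = LinearRegression().fit(X, np.array([10.0, 20.1, 29.8, 40.2, 50.1]))
print(reg.predict([[6]]))                      # ~60

# classification: discrete outcome (e.g., churn vs. no churn)
clf = LogisticRegression().fit(X, np.array([0, 0, 0, 1, 1]))
print(clf.predict([[6]]))                      # -> class 1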
unsupervised learning
no outcome variable to predict, but rather used to find patterns or groups
training data
used to build the model (fit rules / patterns); this is the data the algorithm learns from
testing data
evaluating the model (on unseen data), measures how well the model generalizes
overfitting a model
occurs when a model learns noise and random patterns in the training data rather than the true underlying relationships.
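a sketch of the train/test workflow and a simple overfitting check, assuming scikit-learn and its built-in breast cancer data set as a stand-in:

# hold out test data and compare train vs. test accuracy to spot overfitting
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)     # stand-in data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))   # often near 1.0
print("test accuracy:", model.score(X_test, y_test))      # a large gap suggests overfitting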
classification modelling
assigning a record to a predetermined category based on input variables. learns the patterns that distinguish one class from another.
examples:
approve vs. reject loan applications
predict churn vs. no churn
fraudulent vs. legitimate transactions
spam vs. legit email
decision tree model
predict a categorical outcome using a tree-like structure of if-then rules
root node of a decision tree
entire data set
internal nodes of a decision tree
tests on attributes
branches of a decision tree
outcomes of the tests
leaf nodes of a decision tree
final, predicted outcomes
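a minimal sketch of fitting a decision tree and printing its if-then rules, assuming scikit-learn and the built-in iris data set as a stand-in:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()                                           # stand-in data set
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # max_depth acts as a stopping rule
tree.fit(iris.data, iris.target)

# root = entire data set, internal nodes = attribute tests,
# branches = test outcomes, leaves = predicted classes
print(export_text(tree, feature_names=iris.feature_names))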
induction process
goal is to split records into subsets that are as homogeneous as possible
split rules
used to determine which attribute is best for maximizing purity or information gain
gini index
measures how often a randomly chosen record would be misclassified if labelled randomly by the node’s distribution
a lower gini index is more pure
formula for gini index
1 − ∑(pᵢ)²
entropy
measures disorder or uncertainty of the data
entropy formula
−∑[pᵢ(t) log₂ pᵢ(t)]
misclassification error
a simplified impurity measure: the fraction of records in a node not belonging to the majority class (1 − max pᵢ).
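a small sketch computing the three impurity measures from one node's class proportions (the 70/30 split is invented; log base 2 is assumed for entropy):

import numpy as np

p = np.array([0.7, 0.3])                      # invented class distribution at a node

gini = 1 - np.sum(p ** 2)                     # 1 - sum(p_i^2)           -> 0.42
entropy = -np.sum(p * np.log2(p))             # -sum(p_i * log2(p_i))    -> ~0.881
misclassification = 1 - np.max(p)             # 1 - max(p_i)             -> 0.30
print(gini, entropy, misclassification)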
stopping rules
rules that tell the tree when to stop splitting. without them, trees will keep splitting until every record is perfectly classified, leading to overfitting.
common stopping rules
purity threshold, minimum records per node, maximum tree depth, no improvement in impurity
confusion matrix
compares actual vs. predicted classes
true positive (TP)
predicted positive, actual positive
false positive (FP)
predicted positive, actual negative
false negative (FN)
predicted negative, actual positive
true negative (TN)
predicted negative, actual negative
model accuracy metric
(TP + TN) / (TP + FP + FN + TN)
model precision metric
TP / (TP + FP)
model sensitivity metric
TP / (TP + FN)
model F1 score metric
2 x (precision x recall) / (precision + recall)
model recall
tells how complete your positive predictions are
model F1 score
balances precision and recall; better than using accuracy alone
type I error
when the model predicts positive, when it’s actually negative
type I error formula
FP / (FP + TN)
type II error
when the model predicts “negative” when it’s actually positive
type II error formula
FN / (TP + FN)
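a sketch computing these metrics from invented confusion-matrix counts, following the formulas above:

# classification metrics from invented confusion-matrix counts
TP, FP, FN, TN = 80, 10, 20, 90

accuracy = (TP + TN) / (TP + FP + FN + TN)              # 0.85
precision = TP / (TP + FP)                              # ~0.889
recall = TP / (TP + FN)                                 # sensitivity, 0.80
f1 = 2 * (precision * recall) / (precision + recall)
type_i_rate = FP / (FP + TN)                            # false positive rate, 0.10
type_ii_rate = FN / (TP + FN)                           # false negative rate, 0.20
print(accuracy, precision, recall, f1, type_i_rate, type_ii_rate)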
minimize type I errors when…
false alarms (false positives) are expensive
minimize type II errors when…
misses (false negatives) are costly
imbalanced class problem
occurs when one class dominates a data set (eg, 97% no churn vs. 3% churn). can lead to rare but important cases being missed.
remedies
stratified sampling, oversampling the minority, down-sampling the majority, using balanced metrics, adjusting the classification threshold
stratified sampling
ensures both classes are proportionally represented in the training data
oversampling minority
randomly duplicate minority cases so model learns
down-sampling the majority
randomly remove some majority examples to reduce imbalance
use balanced metrics
use F1 & precision / recall instead of raw accuracy
root node
entire sample before splitting
child node
segments created by splitting variables
predicted class
majority class within that node (eg income > 60%)
% of records
used to see how pure each node became after the split
gain / lift chart
shows how well the model concentrates positives in the top segments
confusion matrix output
reports TP, FP, TN, FN counts for overall accuracy
decision tree algorithms
CHAID, C&R, C5.0
CHAID
uses categorical (nominal) variables; performs a chi-square test to find statistically significant relationships between predictors and the target.
when to use CHAID
categorical predictors, large sample sizes
limitations of CHAID
doesn’t handle continuous targets, less effective with small data sets
C&R tree
handles categorical or continuous targets (classification or regression). uses the Gini index for classification, or least squares for regression
when to use C&R tree
when you need a robust, general purpose tree - works well for classification & regression and handles missing values well
limitations of the C&R tree
can overfit without pruning; splits are binary only
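a minimal sketch of the CHAID-style chi-square test using scipy (assumed available); the cross-tab counts are invented:

# chi-square test between a categorical predictor and the target
from scipy.stats import chi2_contingency

# invented cross-tab: rows = predictor categories, columns = target classes
table = [[30, 10],
         [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests a statistically significant split candidate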