CRISP-DM Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding Phase
define business requirements/objectives
translate objectives into DM problem definition
prepare initial strategy to meet objectives
Data Understanding Phase
collect data
assess data quality
perform EDA
Data Preparation Phase
clean, prepare, and transform dataset
prepare for modeling in following phases
choose cases/variables appropriate for analysis
Modeling Phase
select and apply 1+ modeling techniques
calibrate modeling settings to optimize results
additional data prep if required
Evaluation Phase
evaluate 1+ models for effectiveness
determine if objectives achieved
make decision regarding DM results before deploying
Deployment Phase
make use of models created
simple deployment: generate report
complex deployment: implement additional DM effort in another department
in business, customer often carries out deployment based on model
Description (DM Task)
describes general patterns and trends
easy to interpret/explain
transparent models
pictures and #’s
like scatterplots, descriptive stats, range
Estimation (DM Task)
uses numerical and/or categorical predictor (IV) values to estimate the value of a numerical target variable (DV)
like regression models - predicting something based off another variable
Classification (DM Task)
like estimation, but target variables (DV) are categorical
like classification of simple vs. complex tasks, fraudulent card transactions, income brackets
can accommodate a categorical variable as your target when partitioned into 2+ categories
Prediction (DM Task)
similar to estimation and classification, but with a time component
like what is the probability of Hogs winning a game with a certain combination of player profiles, or future stock behavior
Clustering (DM Task)
similar to classification, but no target variables
clustering tasks don’t aim to estimate, predict, or classify a target variable
only segmentation of data
like focused marketing campaigns
Association (DM Task)
no target variable
finding attributes of data that go together
profiling relationships between 2+ attributes
understand consequent behaviors based on previous behaviors
like supermarkets seeing what items are purchased together (affinity analysis)
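a minimal affinity-analysis sketch in Python, assuming a hypothetical list of baskets; it only counts how often item pairs appear together (their support), not a full association-rule run:
```python
# Affinity analysis sketch: count how often pairs of items are bought together.
# The transactions list is hypothetical example data.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / n)   # fraction of baskets containing both items
```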
Learning Types (DM Task)
Supervised
have a target variable
target values/categories are known in the training data (like estimation, classification)
Unsupervised
exploratory, no target variable
searching for patterns across variables
don’t know target variables or categories
like clustering
Supervised DM Tasks
estimation, classification, prediction
Unsupervised DM Tasks
Clustering, association
ways to handle missing data
user-defined constant (0.0 or “Missing”)
mode / mean / median
random values
imputation - most likely value based on other attributes for the record
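a quick pandas sketch of these strategies (hypothetical DataFrame and column names):
```python
# Sketch of the missing-data strategies above (hypothetical DataFrame / columns).
import pandas as pd

df = pd.DataFrame({"income": [52000, None, 61000, 48000],
                   "region": ["N", "S", None, "N"]})

df["income_const"] = df["income"].fillna(0.0)                  # user-defined constant
df["region_const"] = df["region"].fillna("Missing")
df["income_mean"]  = df["income"].fillna(df["income"].mean())  # mean (median/mode similar)
df["region_mode"]  = df["region"].fillna(df["region"].mode()[0])
df["income_rand"]  = df["income"].fillna(
    df["income"].dropna().sample(1).iloc[0])                   # random observed value
# imputation from other attributes would fit a model (e.g., regression/kNN) on the
# complete records and predict the missing value for each incomplete record
```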
Min-Max Normalization
determines how much greater the field value is than the minimum value for the field, scales the difference by the field’s range
values range from 0 to 1
good for KNN
X* = (X - min(X))/(max(X)-min(X))
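tiny sketch of the formula on made-up values:
```python
# Min-max normalization: X* = (X - min) / (max - min); values land in [0, 1].
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])           # hypothetical field values
x_mm = (x - x.min()) / (x.max() - x.min())
print(x_mm)                                    # [0., 0.25, 0.583, 1.] approximately
```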
Z-score Standardization
takes difference between field value and field value mean, scales difference by field’s stdev
typically range from -4 to 4
z = (X - mean(X))/stdev(X)
can be used to identify outliers if they’re beyond standard range
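same made-up values, z-score standardized:
```python
# Z-score standardization: z = (X - mean) / stdev; flag records far from 0 as outliers.
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])            # same hypothetical values as above
z = (x - x.mean()) / x.std()                   # np.std defaults to the population stdev
print(z)
print(np.abs(z) > 3)                           # crude outlier flag
```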
Skewness
right skewed → mean > median, tail to the right, positive skewness
left skewed → mean < median, tail to the left, negative skewness
cutoff: -2 to 2
can use natural log transformation to make data more normal
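a small before/after check of the log transform on synthetic right-skewed data:
```python
# Right-skewed data usually has skewness > 0; log-transforming pulls it toward 0.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # synthetic right-skewed data

print("raw skewness:", skew(x))           # noticeably positive
print("log skewness:", skew(np.log(x)))   # close to 0 (log of lognormal is normal)
```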
Kurtosis
like skewness, but for the height/peakedness and tail weight of the distribution
cutoff: -10 to 10
Normality vs. Normalization
Normality - to transform variable so its distribution is closer to normal without changing its basic information
Normalization - standardizes the mean/variance or range of every variable and the effect that each variable has on results
Confidence interval
used when the population has a normal distribution or when n is large
indicates how precise an estimate is (narrow = precise)
point estimate ± margin of error
Margin of error
range of values above and below a sample statistic
MoE = z-score * (population stdev / sqrt(n))
as sample size increases, margin of error will decrease, CI will shrink
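a worked sketch with made-up numbers (95% confidence, z ≈ 1.96):
```python
# Point estimate ± margin of error, using MoE = z * (stdev / sqrt(n)).
import math

x_bar, sigma, n = 50.0, 10.0, 100      # hypothetical sample mean, population stdev, n
z = 1.96                               # z-score for 95% confidence
moe = z * (sigma / math.sqrt(n))       # 1.96 * (10 / 10) = 1.96
print((x_bar - moe, x_bar + moe))      # (48.04, 51.96); larger n -> narrower interval
```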
Multiple regression
uses multiple independent variables to explain the dependent variable
y = B0 + B1X1 + B2X2 + … + BnXn
Ex. Rating = 59.9 - 2.46*Sugars + 1.33*Carbs
Every additional unit of sugar will decrease the rating by 2.46, and every additional unit of carbs will increase it by 1.33 (see the regression sketch below)
assumptions:
multicollinearity - flag if correlations between the IVs are above 0.8
linearity - check that the histogram of residuals is approximately normal and the P-P plot is mostly linear
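a hedged statsmodels sketch of the Rating ~ Sugars + Carbs model (the data here is made up, not the actual cereals dataset):
```python
# Multiple regression sketch: Rating ~ Sugars + Carbs (hypothetical data, OLS).
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"Sugars": [6, 8, 5, 0, 12, 3],
                   "Carbs":  [14, 12, 13, 21, 11, 17],
                   "Rating": [45.0, 36.0, 50.0, 74.0, 29.0, 61.0]})

X = sm.add_constant(df[["Sugars", "Carbs"]])   # adds the intercept term B0
model = sm.OLS(df["Rating"], X).fit()
print(model.params)                             # B0, B1 (Sugars), B2 (Carbs)
print(model.rsquared_adj)
# check multicollinearity via df[["Sugars", "Carbs"]].corr() before trusting coefficients
```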
overfitting
model fits the training data too closely (memorizing noise), so it performs well on training data but poorly on the validation set and generalizes badly to new patterns
Backward regression
starts with the most complex model (all variables) and step by step removes them until it finds the model with the most explanatory power
consumes the most resources
Forward regression
starts with the least complex model (one IV) and iterates through models that increase in complexity until it finds the one with the most explanatory power
consumes the least resources
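a rough sketch of the forward-selection loop by adjusted R², assuming a pandas DataFrame df holding the target and candidate IVs (not a library routine):
```python
# Greedy forward selection: start with no IVs, add whichever most improves adjusted R^2.
import statsmodels.api as sm

def forward_select(df, target, candidates):
    selected, best = [], float("-inf")
    candidates = list(candidates)
    while candidates:
        # score each remaining IV when added to the current model
        scored = []
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            scored.append((sm.OLS(df[target], X).fit().rsquared_adj, var))
        score, var = max(scored)
        if score <= best:            # stop when no IV improves the fit
            break
        best = score
        selected.append(var)
        candidates.remove(var)
    return selected
# backward elimination works the same way in reverse: start with every IV and drop
# variables one at a time until removing more no longer improves the fit
```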
Parametric vs. non-parametric models
Parametric
have assumptions on the distributions and characteristics of the data (like skewness/kurtosis), more structured and efficient
Non-parametric
makes no assumptions, can adapt structure based on the data
k-Nearest Neighbors
aka instance-based learning or memory-based reasoning
training set records stored, then classification is performed for new unclassified records based on records they’re most similar to
distance measured with the Euclidean distance formula (numeric variables); classifications can be weighted by distance
min-max normalization can be used to scale variables so weight is standard
for categorical variables, assign 1 if value is the same for unclassified records and 0 if different
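a short scikit-learn sketch combining min-max scaling with distance-weighted kNN (feature names and data are hypothetical):
```python
# kNN sketch: min-max scale the predictors, then classify by the k most similar records.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X_train = [[25, 40000], [47, 91000], [52, 88000], [33, 52000]]  # hypothetical [age, income]
y_train = ["no", "yes", "yes", "no"]

scaler = MinMaxScaler()                          # min-max normalization (0 to 1)
X_scaled = scaler.fit_transform(X_train)

knn = KNeighborsClassifier(n_neighbors=3, weights="distance")   # Euclidean by default
knn.fit(X_scaled, y_train)
print(knn.predict(scaler.transform([[41, 70000]])))
```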
Decision Tree
collection of nodes connected by branches, extending downward from a root node and terminating in leaf nodes
supervised learning
target variables must be categorical
CART (classification and regression trees)
computes “goodness” of candidate splits/optimality measures
produces only binary splits (two branches per node), so the tree will naturally be large
C4.5 Tree
similar to CART, but selects the optimal split using information gain (choose the split with the highest gain, i.e., the greatest entropy reduction)
not limited to binary splits, so trees tend to be more efficient and smaller
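a small scikit-learn sketch; note sklearn's DecisionTreeClassifier is CART-style (binary splits), used here with the entropy criterion only to illustrate the idea, on made-up data:
```python
# Decision tree sketch (CART-style binary splits via scikit-learn; hypothetical data).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 64], [1, 72], [0, 81], [1, 55], [0, 90], [1, 60]]   # [has_flag, score]
y = ["low", "high", "high", "low", "high", "low"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)  # entropy ~ information gain
tree.fit(X, y)
print(export_text(tree, feature_names=["has_flag", "score"]))    # show the learned splits
```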