Business Intelligence Midterm

33 Terms

1

CRISP-DM Phases

  1. Business Understanding

  2. Data Understanding

  3. Data Preparation

  4. Modeling

  5. Evaluation

  6. Deployment

2

Business Understanding Phase

  • define business requirements/objectives

  • translate objectives into DM problem definition

  • prepare initial strategy to meet objectives

3

Data Understanding Phase

  • collect data

  • assess data quality

  • perform EDA

4

Data Preparation Phase

  • clean, prepare, and transform dataset

  • prepare for modeling in following phases

  • choose cases/variables appropriate for analysis

5

Modeling Phase

  • select and apply 1+ modeling techniques

  • calibrate modeling settings to optimize results

  • additional data prep if required

6

Evaluation Phase

  • evaluate 1+ models for effectiveness

  • determine if objectives achieved

  • make decision regarding DM results before deploying

7

Deployment Phase

  • make use of models created

  • simple deployment: generate report

  • complex deployment: implement additional DM effort in another department

  • in business, customer often carries out deployment based on model

8

Description (DM Task)

  • describes general patterns and trends

  • easy to interpret/explain

  • transparent models

  • pictures and #’s

  • like scatterplots, descriptive stats, range

9

Estimation (DM Task)

  • uses numerical and/or categorical predictor (IV) values to estimate the value of a numerical target variable (DV)

  • like regression models - predicting one variable based on another

10

Classification (DM Task)

  • like estimation, but target variables (DV) are categorical

  • like classification of simple vs. complex tasks, fraudulent card transactions, income brackets

  • can accommodate a categorical variable as your target when partitioned into 2+ categories

11

Prediction (DM Task)

  • similar to estimation and classification, but with a time component

  • like the probability of the Hogs winning a game given a certain combination of player profiles, or future stock behavior

12

Clustering (DM Task)

  • similar to classification, but no target variables

  • clustering tasks don’t aim to estimate, predict, or classify a target variable

  • only segmentation of data

  • like focused marketing campaigns

13

Association (DM Task)

  • no target variable

  • finding attributes of data that go together

  • profiling relationships between 2+ attributes

  • understand consequent behaviors based on previous behaviors

  • like supermarkets seeing what items are purchased together (affinity analysis)

14

Learning Types (DM Task)

Supervised

  • have a target variable

  • know categories (like estimation)

Unsupervised

  • exploratory, no target variable

  • searching for patterns across variables

  • don’t know target variables or categories

  • like clustering

15

Supervised DM Tasks

estimation, classification, prediction

16

Unsupervised DM Tasks

Clustering, association

17

ways to handle missing data

  1. user-defined constant (0.0 or “Missing”)

  2. mode / mean / median

  3. random values

  4. imputation - most likely value based on other attributes for the record
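
A minimal pandas sketch of strategies 1-3 on a hypothetical two-column frame (the column names and values are made up for illustration); model-based imputation (4) would instead predict each missing value from the record's other attributes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical field.
df = pd.DataFrame({"income": [52000, np.nan, 61000, 58000],
                   "region": ["N", "S", None, "S"]})

# 1. user-defined constant
const_filled = df["region"].fillna("Missing")

# 2. mode / mean / median of the observed values
mean_filled = df["income"].fillna(df["income"].mean())
mode_filled = df["region"].fillna(df["region"].mode()[0])

# 3. random values drawn from the observed values
obs = df["income"].dropna().to_numpy()
rand_filled = df["income"].where(df["income"].notna(),
                                 np.random.choice(obs, size=len(df)))
```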

18

Min-Max Normalization

  • determines how much greater the field value is than the minimum value for the field, then scales the difference by the field’s range

  • values range from 0 to 1

  • good for KNN

  • X* = (X - min(X))/(max(X)-min(X))
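
A one-line sketch of the formula above, using a made-up series:

```python
import pandas as pd

def min_max(x: pd.Series) -> pd.Series:
    """X* = (X - min(X)) / (max(X) - min(X)); results fall in [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

print(min_max(pd.Series([10, 20, 25, 40])))  # 0.0, 0.333..., 0.5, 1.0
```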

19

Z-score Standardization

  • takes the difference between the field value and the field mean, then scales the difference by the field’s stdev

  • values typically range from -4 to 4

  • z = (X - mean(X))/stdev(X)

  • can be used to identify outliers (values beyond the typical range)
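
The same idea as a sketch, with made-up values (pandas’ .std() uses the sample stdev by default):

```python
import pandas as pd

def z_score(x: pd.Series) -> pd.Series:
    """z = (X - mean(X)) / stdev(X)."""
    return (x - x.mean()) / x.std()

z = z_score(pd.Series([10, 12, 11, 13, 12, 100]))
# values far outside the typical -4..4 band are outlier candidates
print(z)
```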

20

Skewness

  • right skewed → mean > median, tail to the right, positive skewness

  • left skewed → mean < median, tail to the left, negative skewness

  • cutoff: -2 to 2

  • can use natural log transformation to make data more normal
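
A quick sketch of the cutoff check and the log transform, on made-up right-skewed data:

```python
import numpy as np
import pandas as pd

# Right-skewed toy data: long tail to the right, mean > median.
x = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 9, 40])
print(x.skew())          # strongly positive, outside the -2..2 cutoff
print(np.log(x).skew())  # closer to 0 after the natural log transform
```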

21

Kurtosis

  • skewness in height terms (peakedness/flatness of the distribution)

  • cutoff: -10 to 10

22

Normality vs. Normalization

Normality - transforming a variable so its distribution is closer to normal without changing its basic information

Normalization - standardizes the mean/variance or range of every variable and the effect that each variable has on results

23

Confidence interval

  • used when the population has a normal distribution or when n is large

  • indicates how precise an estimate is (narrow = precise)

  • point estimate ± margin of error

24

Margin of error

  • range of values above and below a sample statistic

  • MoE = z-score * (population stdev / sqrt(n))

  • as sample size increases, margin of error will decrease, CI will shrink
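
A small sketch of the formula with hypothetical sample numbers (scipy supplies the z-score for the chosen confidence level):

```python
from math import sqrt

from scipy import stats

n, xbar, sigma = 100, 50.0, 8.0  # hypothetical n, sample mean, population stdev
z = stats.norm.ppf(0.975)        # z-score for 95% confidence (~1.96)
moe = z * sigma / sqrt(n)        # MoE = z-score * (population stdev / sqrt(n))
print(f"95% CI: {xbar - moe:.2f} to {xbar + moe:.2f}")  # point estimate +/- MoE
```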

25

Multiple regression

  • uses multiple independent variables to explain the dependent variable

  • y = B0 + B1X1 + B2X2 + … + BnXn

  • Ex. Rating = 59.9 - 2.46*Sugars + 1.33*Carbs

    • Every additional unit of sugar will decrease rating by 2.46, and every additional unit of carbs will increase rating by 1.33

  • assumptions:

    • multicollinearity - flag predictor correlations above 0.8

    • linearity - check that the histogram of residuals is approximately normal and the P-P plot is mostly linear
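
A sketch of fitting a cereal-style model with statsmodels; the data values here are invented, only the column names mirror the example above:

```python
import pandas as pd
import statsmodels.api as sm

# Invented ratings data with the same columns as the cereal example.
df = pd.DataFrame({"Sugars": [6, 8, 5, 0, 3, 10],
                   "Carbs":  [14, 12, 13, 22, 17, 11],
                   "Rating": [45, 34, 50, 94, 59, 30]})

X = sm.add_constant(df[["Sugars", "Carbs"]])  # adds the B0 intercept column
model = sm.OLS(df["Rating"], X).fit()
print(model.params)  # B0, B1 (Sugars coefficient), B2 (Carbs coefficient)

# quick multicollinearity check: flag predictor correlations above 0.8
print(df[["Sugars", "Carbs"]].corr())
```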

26

overfitting

model fits the training data too closely, memorizing its quirks rather than general patterns; it performs well on training data but poorly on new (validation) data
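
One way to see this, as a sketch on synthetic data: an unconstrained decision tree scores near-perfectly on its training split but noticeably worse on held-out validation data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # ~1.0 on the training data
print(tree.score(X_va, y_va))  # noticeably lower on validation data
```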

27

Backward regression

  • starts with the most complex model (all variables) and step by step removes them until it finds the model with the most explanatory power

  • consumes the most resources

28

Forward regression

  • starts with the least complex model (one IV) and iterates through models that increase in complexity until it finds the one with the most explanatory power

  • consumes the least resources
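
A minimal greedy sketch of forward selection with statsmodels; the R-squared stopping rule here is an illustrative assumption (real stepwise procedures typically use p-values, AIC, or adjusted R-squared). Backward regression would run the same loop in reverse, starting from all variables and dropping one per step:

```python
import statsmodels.api as sm

def forward_select(X, y, tol=1e-4):
    """Add, one at a time, the IV that most improves R-squared."""
    chosen, remaining, best_r2 = [], list(X.columns), 0.0
    while remaining:
        r2 = {v: sm.OLS(y, sm.add_constant(X[chosen + [v]])).fit().rsquared
              for v in remaining}
        var = max(r2, key=r2.get)
        if r2[var] <= best_r2 + tol:  # stop when the improvement is negligible
            break
        chosen.append(var)
        remaining.remove(var)
        best_r2 = r2[var]
    return chosen
```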

29

Parametric vs. non-parametric models

Parametric

  • make assumptions about the distribution and characteristics of the data (like skewness/kurtosis); more structured and efficient

Non-parametric

  • makes no assumptions, can adapt structure based on the data

30

k-Nearest Neighbors

  • aka instance-based learning or memory-based reasoning

  • training set records are stored; new, unclassified records are classified based on the stored records they are most similar to

  • can weight by distance using the Euclidean distance formula (numeric variables)

    • min-max normalization can be used to scale variables so weight is standard

  • for categorical variables, assign 1 if value is the same for unclassified records and 0 if different
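
A small sklearn sketch with invented records, min-max scaling first so the larger-ranged field doesn’t dominate the Euclidean distance:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Invented training records: [age, income] -> risk class.
X_train = [[25, 40000], [30, 60000], [45, 80000], [50, 30000]]
y_train = ["low", "low", "high", "high"]

# Min-max normalize so income's scale doesn't swamp age in the distance.
scaler = MinMaxScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance by default
knn.fit(scaler.transform(X_train), y_train)
print(knn.predict(scaler.transform([[28, 50000]])))
```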

31

Decision Tree

  • collection of nodes connected by branches, extending downward from a root node and terminating in leaf nodes

  • supervised learning

  • target variables must be categorical

32

CART (classification and regression trees)

  • computes “goodness” of candidate splits/optimality measures

  • can only make binary splits (two branches per node), so the tree will naturally be large
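
For reference, sklearn’s DecisionTreeClassifier implements an optimized version of CART; every split in the printed tree below is binary:

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
cart = tree.DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
cart.fit(iris.data, iris.target)
print(tree.export_text(cart, feature_names=iris.feature_names))
```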

33

C4.5 Tree

  • similar to CART, but uses information gain / entropy reduction to select optimal splits: choose the split with the highest gain (lowest resulting entropy)

  • not limited to binary splits, so the tree is more efficient and smaller
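
A from-scratch sketch of the split criterion (note that C4.5 proper ranks splits by gain ratio, a normalized version of this information gain):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy reduction: H(parent) minus the weighted child entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# A split that separates the classes perfectly gives the maximum gain.
print(information_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]))  # 1.0
```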
