Business Intelligence Midterm


42 Terms

1
New cards

CRISP-DM Phases

  1. Business Understanding

  2. Data Understanding

  3. Data Preparation

  4. Modeling

  5. Evaluation

  6. Deployment

2
New cards

Business Understanding Phase

  • define business requirements/objectives

  • translate objectives into DM problem definition

  • prepare initial strategy to meet objectives

3
New cards

Data Understanding Phase

  • collect data

  • assess data quality

  • perform EDA

4
New cards

Data Preparation Phase

  • clean, prepare, and transform dataset

  • prepare for modeling in following phases

  • choose cases/variables appropriate for analysis

5
New cards

Modeling Phase

  • select and apply 1+ modeling techniques

  • calibrate modeling settings to optimize results

  • additional data prep if required

6
New cards

Evaluation Phase

  • evaluate 1+ models for effectiveness

  • determine if objectives achieved

  • make decision regarding DM results before deploying

7
New cards

Deployment Phase

  • make use of models created

  • simple deployment: generate report

  • complex deployment: implement additional DM effort in another department

  • in business, customer often carries out deployment based on model

8
New cards

Description (DM Task)

  • describes general patterns and trends

  • easy to interpret/explain

  • transparent models

  • pictures and #’s

  • like scatterplots, descriptive stats, range

9
New cards

Estimation (DM Task)

  • uses numerical/categorical predictor (IV) values to estimate the value of a numerical target variable (DV)

  • like regression models - predicting something based off another variable

10
New cards

Classification (DM Task)

  • like estimation, but target variables (DV) are categorical

  • like classification of simple vs. complex tasks, fraudulent card transactions, income brackets

  • target variable must be categorical, partitioned into 2+ categories

11
New cards

Prediction (DM Task)

  • similar to estimation and classification, but with a time component

  • like what is the probability of Hogs winning a game with a certain combination of player profiles, or future stock behavior

12
New cards

Clustering (DM Task)

  • similar to classification, but no target variables

  • clustering tasks don’t aim to estimate, predict, or classify a target variable

  • only segmentation of data

  • like focused marketing campaigns

13
New cards

Association (DM Task)

  • no target variable

  • finding attributes of data that go together

  • profiling relationships between 2+ attributes

  • understand consequent behaviors based on previous behaviors

  • like supermarkets seeing what items are purchased together (affinity analysis)

14
New cards

Learning Types (DM Task)

Supervised

  • have a target variable

  • know categories (like estimation)

Unsupervised

  • exploratory, no target variable

  • searching for patterns across variables

  • don’t know target variables or categories

  • like clustering

15
New cards

Supervised DM Tasks

estimation, classification, prediction

16
New cards

Unsupervised DM Tasks

Clustering, association, description

17
New cards

ways to handle missing data

  1. user-defined constant (0.0 or “Missing”)

  2. mode / mean / median

  3. random values

  4. imputation - most likely value based on other attributes for the record
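A minimal sketch of the first two strategies on a toy column (the values and helper names are made up for illustration; `None` stands for a missing entry):

```python
# Two missing-data strategies on a toy column (None = missing).

def fill_constant(col, constant):
    """Replace missing entries with a user-defined constant."""
    return [constant if v is None else v for v in col]

def fill_mean(col):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in col if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in col]

col = [2.0, None, 4.0, None, 6.0]
print(fill_constant(col, 0.0))  # [2.0, 0.0, 4.0, 0.0, 6.0]
print(fill_mean(col))           # [2.0, 4.0, 4.0, 4.0, 6.0]
```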

18
New cards

Min-Max Normalization

  • determines how much greater the field value is than the minimum value for the field, scales the difference by the field’s range

  • values range from 0 to 1

  • good for KNN

  • X* = (X - min(X))/(max(X)-min(X))
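The formula above as a quick sketch (the sample values are made up):

```python
# Min-max normalization: X* = (X - min(X)) / (max(X) - min(X)), range [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

scores = [10, 20, 30, 50]
print(min_max(scores))  # [0.0, 0.25, 0.5, 1.0]
```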

19
New cards

Z-score Standardization

  • takes difference between field value and field value mean, scales difference by field’s stdev

  • typically range from -4 to 4

  • z = (X - mean(X))/stdev(X)

  • can be used to identify outliers if they’re beyond standard range
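A sketch of the formula above (sample values made up; the population standard deviation is used here so the output is exact):

```python
import statistics

# Z-score standardization: z = (X - mean(X)) / stdev(X).
def z_scores(values):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population stdev for a clean example
    return [(x - mean) / sd for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```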

20
New cards

Skewness

  • right skewed → mean > median, tail to the right, positive skewness

  • left skewed → mean < median, tail to the left, negative skewness

  • cutoff: -2 to 2

  • can use natural log transformation to make data more normal
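A toy illustration of the mean/median rule and the log transform (the income values are made up):

```python
import math
import statistics

# Right-skewed toy data: the long right tail pulls the mean above the median.
incomes = [20, 25, 30, 35, 40, 45, 400]
print(statistics.mean(incomes) > statistics.median(incomes))  # True

# A natural-log transform compresses the tail, so mean and median move closer.
logged = [math.log(x) for x in incomes]
gap_raw = statistics.mean(incomes) - statistics.median(incomes)
gap_log = statistics.mean(logged) - statistics.median(logged)
print(gap_log < gap_raw)  # True
```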

21
New cards

Kurtosis

  • like skewness, but for the height/peakedness of the distribution relative to normal

  • cutoff: -10 to 10

22
New cards

Normality vs. Normalization

Normality - transforming a variable so its distribution is closer to normal without changing its basic information

Normalization - standardizes the mean/variance or range of every variable and the effect that each variable has on results

23
New cards

Confidence interval

  • used when the population has a normal distribution or when n is large

  • indicates how precise an estimate is (narrow = precise)

  • point estimate ± margin of error

24
New cards

Margin of error

  • range of values above and below a sample statistic

  • MoE = z-score * (population stdev / sqrt(n))

  • as sample size increases, margin of error will decrease, CI will shrink
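A sketch of the formula above (z ≈ 1.96 for a 95% CI; the stdev and sample sizes are made up):

```python
import math

# MoE = z * (population stdev / sqrt(n)).
def margin_of_error(z, sigma, n):
    return z * sigma / math.sqrt(n)

sigma, z = 15.0, 1.96
print(round(margin_of_error(z, sigma, 100), 2))  # 2.94
# Quadrupling n halves the margin of error, so the CI shrinks.
print(margin_of_error(z, sigma, 400) < margin_of_error(z, sigma, 100))  # True
```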

25
New cards

Multiple regression

  • uses multiple independent variables to explain the dependent variable

  • y = B0 + B1X1 + B2X2 + … + BnXn

  • Ex. Rating = 59.9 - 2.46*Sugars + 1.33*Carbs

    • Every additional unit of sugar will decrease rating by 2.46, and every additional unit of carbs will increase rating by 1.33

  • assumptions:

    • multicollinearity - flag predictors whose correlations exceed 0.8

    • linearity - if histogram of residuals is normally distributed and P-P plot mostly linear

    • homoscedasticity - variance of errors constant across all IV levels
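Plugging values into the fitted equation from the example above (the function name and input values are illustrative):

```python
# Rating = 59.9 - 2.46*Sugars + 1.33*Carbs
def predicted_rating(sugars, carbs):
    return 59.9 - 2.46 * sugars + 1.33 * carbs

# One extra unit of sugar, holding carbs constant, lowers the rating by 2.46.
drop = predicted_rating(5, 10) - predicted_rating(6, 10)
print(round(drop, 2))  # 2.46
```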

26
New cards

overfitting

model learns the training data too closely, capturing noise instead of general patterns; it performs well on training data but poorly on the validation set and on new data

27
New cards

Backward regression

  • starts with the most complex model (all variables) and step by step removes them until it finds the model with the most explanatory power

  • consumes the most resources

28
New cards

Forward regression

  • starts with the least complex model (one IV) and iterates through models that increase in complexity until it finds the one with the most explanatory power

  • consumes the least resources

29
New cards

Parametric vs. non-parametric models

Parametric

  • have assumptions on the distributions and characteristics of the data (like skewness/kurtosis), more structured and efficient

  • ex. regression (linear/logistic), clustering (k-Means)

Non-parametric

  • makes no assumptions, can adapt structure based on the data

  • classification (KNN, decision trees), rank-based tests

30
New cards

k-Nearest Neighbors

  • aka instance-based learning or memory-based reasoning

  • training set records stored, then classification is performed for new unclassified records based on records they’re most similar to

  • can weight by distance with euclidean distance formula (numeric variables)

    • min-max normalization can be used to scale variables so weight is standard

  • for categorical variables, assign 1 if value is the same for unclassified records and 0 if different
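A minimal kNN sketch under the ideas above: Euclidean distance plus a majority vote among the k nearest training records (the training points and labels are made up):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, labels, query, k=3):
    """Classify query by majority vote of its k nearest training records."""
    ranked = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels = ["low", "low", "high", "high"]
print(knn_classify(train, labels, (1.2, 1.4), k=3))  # low
```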

31
New cards

Decision Tree

  • collection of nodes connected by branches, extending downward from a root node and terminating in leaf nodes

  • supervised learning

  • target variables must be categorical

32
New cards

CART (classification and regression trees)

  • computes “goodness” of candidate splits/optimality measures

  • performs only binary splits (two branches per node), so the tree will naturally grow large

33
New cards

C4.5 Tree

  • similar to CART, but uses information gain (highest value) / entropy reduction (lowest value) to select optimal splits

  • not limited to two splits, so it is more efficient and smaller
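A sketch of the entropy and information-gain calculation C4.5 uses to score candidate splits (the toy labels are made up; a perfect split yields the maximum gain):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]  # a perfect split: pure children
print(entropy(parent))                  # 1.0
print(information_gain(parent, split))  # 1.0
```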

34
New cards

What is the complete form of CRISP-DM?

Cross-Industry Standard Process for Data Mining

35
New cards

How many phases are there in CRISP-DM?

6

36
New cards

two types of deployment (CRISP-DM)

Simple: report generation

Complex: implementing additional DM effort in another department

37
New cards

What is the use of standardizing variables?

To convert variables to the same scale

38
New cards

fit statistics for decision model selection

misclassification rate: smallest; average profit: largest / average loss: smallest; Kolmogorov-Smirnov statistic: largest

39
New cards

fit statistics for ranking model selection

ROC index: largest, Gini coefficient: largest

40
New cards

fit statistics for estimate model selection

average squared error: smallest, Schwarz's Bayesian criterion: smallest, log-likelihood: largest

41
New cards

confidence (trees)

reliability of a rule; like 3/5 cases are true, so confidence is 60%

42
New cards

support (trees)

how frequently the rule applies in the dataset; like 5/8 of total records fall under the rule
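A toy illustration of both measures for a hypothetical rule "sugars > 10 → rating low" (the records are made up): support counts how often the rule's condition occurs among all records; confidence counts how often the conclusion holds when the condition applies.

```python
records = [
    {"sugars": 12, "rating": "low"},
    {"sugars": 14, "rating": "low"},
    {"sugars": 11, "rating": "high"},
    {"sugars": 3,  "rating": "high"},
    {"sugars": 2,  "rating": "high"},
]

# Records where the rule's condition (sugars > 10) applies.
matching = [r for r in records if r["sugars"] > 10]

support = len(matching) / len(records)                              # 3/5
confidence = sum(r["rating"] == "low" for r in matching) / len(matching)  # 2/3

print(support)                # 0.6
print(round(confidence, 2))   # 0.67
```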