3KE3 CHAPTER 4

flashcard set

Last updated 9:47 PM on 2/2/25
29 Terms

1
Why data mining?
- More intense competition at the global scale

- Recognition of the value in data sources

- Availability of quality data on customers, vendors, transactions, Web, etc.

- Consolidation and integration of data repositories into data warehouses

- The exponential increase in data processing and storage capabilities, and the decrease in their cost

- Movement toward conversion of information resources into nonphysical form
2
What is data mining?
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.
3
What are the four types of patterns?
association, prediction, cluster (segmentation), and sequential (or time series) relationships
4
What are the three most common data mining processes?
1) CRISP-DM

2) SEMMA

3) KDD (Knowledge Discovery in Databases)
5
What are the six steps in the CRISP-DM data mining process?
1) Business understanding

2) Data understanding

3) Data preparation

4) Model building

5) Testing and evaluation

6) Deployment
6
What are the 5 steps in the SEMMA data mining process?
Sample, Explore, Modify, Model, Assess
7
What are the steps involved in KDD?
1) Data selection

2) Data cleaning

3) Data transformation

4) Data mining

5) Internalization
8
Provide examples of commercial data mining software tools
IBM SPSS Modeler, SAS Enterprise Miner, Statistica
9
Provide examples of free and/or open source software tools
KNIME, RapidMiner, R
10
What are the major characteristics and objectives of data mining?
- Data is presented in a variety of formats

- The data environment is usually a client/server architecture or a web-based IS architecture

- The miner is often the end user

- DM tools are combined with spreadsheets and other software development tools

- Parallel processing is used
11
What are associations?
commonly co-occurring groupings of things
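
As a rough illustration of "commonly co-occurring groupings", the sketch below counts how often each pair of items appears in the same basket. The baskets and item names are made up for the example; real association mining uses algorithms such as Apriori and measures such as support and confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

# Count every unordered pair of items that shows up in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs with the highest counts are candidate associations.
for pair, count in pair_counts.most_common():
    print(pair, count)
```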
12
What are predictions?
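Tell the nature of future occurrences of certain events based on what has happened in the past. Unlike simple guessing, prediction takes experience, opinions, and other relevant information into account; the term is commonly associated with forecasting.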
13
What are clusters?
Identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors
14
What are sequential relationships?
discover time-ordered events
15
What are the three main categories of Data Mining?
prediction, association, and segmentation (clustering)
16
What is classification?
The objective is to analyze the historical data stored in a database & automatically generate a model that can predict future behaviour
17
What are the three main types of prediction?
classification, regression, time-series
18
What are decision trees?
A data mining method that generates rules and classifies data sets; the resulting model is a hierarchy of if/then statements
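
To make the "hierarchy of if/then statements" concrete, here is a tiny hand-written tree. The attributes, the 50,000 threshold, and the class labels are hypothetical; a real decision tree algorithm would learn the splits from training data.

```python
def classify_applicant(has_prior_default: bool, annual_income: float) -> str:
    """Walk a hypothetical two-level decision tree and return a class label."""
    # Root node: split on the prior-default attribute.
    if has_prior_default:
        return "reject"
    # Second level: split on income (threshold chosen only for illustration).
    if annual_income >= 50_000:
        return "approve"
    return "review"


print(classify_applicant(False, 62_000))
```

Each path from the root to a leaf reads as one rule, for example "if no prior default and income is at least 50,000, then approve".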
19
What is clustering?
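Partitioning a collection of things (e.g., objects, events) into segments (natural groupings) whose members share similar characteristics; unlike classification, the class labels are not known in advance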
20
What are two commonly used derivatives in association mining?
1) Link Analysis -> the linkage among many objects of interest is discovered automatically

2) Sequence mining -> relationships are examined in terms of their order of occurrence to identify associations over time
21
What is the difference between statistics and data mining?
Statistics - collects sample data to test the hypothesis

DM & Analytics - use all existing data to discover new patterns & relationships
22
What is the difference between CRISP-DM and SEMMA?
CRISP-DM: takes a more comprehensive approach, including an understanding of the business and of the data relevant to the DM project

SEMMA: implicitly assumes that the DM project's goals, objectives, and data sources have already been identified and understood
23
Describe Knowledge Discovery in Databases (KDD)
The overall process of using DM methods to find useful information and patterns in data. In relation to data mining: DM is the step within the KDD process that applies algorithms to identify patterns in the prepared data
24
What's the difference between classification and clustering?
Classification learns the function between the characteristics of things and their membership through a supervised learning process, where both the input variables and the output (class label) variable are presented to the algorithm

Clustering learns through an unsupervised learning process, where only the input variables are presented to the algorithm
25
What are the factors considered in model assessment?
predictive accuracy, speed, robustness, scalability, interpretability
26
What is a simple split?
partitions data into two mutually exclusive subsets called a training set & a testing set
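
A minimal sketch of a simple split, assuming a shuffled two-thirds/one-third partition. The function name, the fraction, and the seed are illustrative choices, not something this card set prescribes.

```python
import random


def simple_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into a training set and a testing set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    # Every record lands in exactly one of the two subsets.
    return shuffled[:cut], shuffled[cut:]


train, test = simple_split(range(30))
print(len(train), len(test))
```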
27
What is the main criticism of the simple split?
It assumes that the data in the two subsets are of the same kind (i.e., have the same properties)
28
What is k-fold cross-validation?
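aka rotation estimation; the complete data set is randomly split into 'k' mutually exclusive subsets (folds) of approximately equal size. The model is trained and tested 'k' times: each time it is trained on all but one fold and tested on the remaining fold, and the 'k' results are averaged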
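
The fold rotation can be sketched as follows: each of the k folds serves once as the testing set while the other k - 1 folds form the training set. The helper name and seed are assumptions made for this example; libraries such as scikit-learn provide ready-made equivalents.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(sorted(test_idx))
```

Setting k equal to the number of records gives the leave-one-out method, where each testing set holds a single record.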
29
What are some additional classification assessment methodologies?
1) Leave-one-out

2) Bootstrapping

3) Jackknifing

4) Area under the ROC curve