3KE3 CHAPTER 4


1
Why data mining?
- More intense competition at the global scale
- Recognition of the value in data sources
- Availability of quality data on customers, vendors, transactions, Web, etc.
- Consolidation and integration of data repositories into data warehouses
- The exponential increase in data processing and storage capabilities, and the decrease in cost
- Movement toward conversion of information resources into nonphysical form
2
What is data mining?
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.
3
What are the four types of patterns?
association, prediction, cluster (segmentation), and sequential (or time series) relationships
4
What are the three most common data mining processes?
  1. CRISP-DM

  2. SEMMA

  3. KDD (Knowledge Discovery in Databases)

5
What are the six steps in the CRISP-DM data mining process?
  1. Business understanding

  2. Data understanding

  3. Data preparation

  4. Model building

  5. Testing and evaluation

  6. Deployment

6
What are the 5 steps in the SEMMA data mining process?
Sample, Explore, Modify, Model, Assess
7
What are the steps involved in KDD?
  1. Data selection

  2. Data cleaning

  3. Data transformation

  4. Data mining

  5. Internalization

8
Provide examples of commercial data mining software tools
IBM SPSS Modeler, SAS Enterprise Miner, Statistica
9
Provide examples of free and/or open source software tools
KNIME, RapidMiner, R
10
What are the major characteristics and objectives of data mining?
- Data is presented in a variety of formats
- The data mining environment is usually a client/server or Web-based information systems architecture
- The miner is often the end user
- DM tools are readily combined with spreadsheets and other software development tools
- Parallel processing is used
11
What are associations?
commonly co-occurring groupings of things
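As a rough illustration of "commonly co-occurring groupings", the sketch below counts how often each pair of items appears together in a set of baskets. The transactions and the `pair_support` helper are hypothetical and made up for this card; they are not from the course material.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
```

Pairs with a high fraction (for example `("beer", "diapers")` in this toy data) are candidate associations.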
12
What are predictions?
Tell the nature of future occurrences of certain events based on what has happened in the past; unlike simple guessing, prediction draws on experience, opinions, and other relevant information; commonly associated with forecasting
13
What are clusters?
Identify natural groupings of things based on known characteristics such as assigning customers in different segments based on their demographics and past purchase behaviors
14
What are sequential relationships?
discover time-ordered events
15
What are the three main categories of Data Mining?
prediction, association, and segmentation (clustering)
16
What is classification?
The objective is to analyze the historical data stored in a database & automatically generate a model that can predict future behaviour
17
What are the three main types of prediction?
classification, regression, time-series
18
What are decision trees?
A data mining method that generates rules and classifies data sets using a hierarchy of if/then statements
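To make the "hierarchy of if/then statements" concrete, here is a minimal hand-written sketch. The `classify_loan` function, its variables, and the 50,000 threshold are hypothetical; a real decision tree algorithm would learn the splits from training data.

```python
def classify_loan(income, has_default_history):
    """Hypothetical hand-written tree; splits are illustrative, not learned."""
    # Root node: split on default history.
    if has_default_history:
        return "reject"
    # Second level: split on income.
    if income >= 50_000:
        return "approve"
    return "review"
```

Each path from the root to a return value reads as one rule, e.g. "if no default history and income >= 50,000 then approve".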
19
What is clustering?
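Partitioning a collection of things (e.g., objects, events) into segments (natural groupings) whose members share similar characteristics; unlike classification, the class labels are not known in advance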
20
What are two commonly used derivatives in association mining?
  1. Link Analysis -> the linkage among many objects of interest is discovered automatically

  2. Sequence mining -> relationships are examined in terms of their order of occurrence to identify associations over time

21
What is the difference between statistics and data mining?
Statistics - collects sample data to test the hypothesis

DM & Analytics - use all existing data to discover new patterns & relationships
22
What is the difference between CRISP-DM and SEMMA?
CRISP-DM: takes a more comprehensive approach, including understanding of the business and the relevant data before modeling begins

SEMMA: implicitly assumes that the DM project's goals, objectives, and data sources have already been identified and understood
23
Describe Knowledge Discovery in Databases (KDD)
The overall process of using DM methods to find useful information and patterns in data. In relation to data mining: DM is the step within KDD that applies algorithms to identify patterns in the data the process has prepared
24
What's the difference between classification and clustering?
Classification learns the function between the characteristics of things (input variables) and their class membership (output variable) through a supervised learning process, in which both types of variables are presented to the algorithm

Clustering learns through an unsupervised learning process, in which only the input variables are presented to the algorithm
25
What are the factors considered in model assessment?
predictive accuracy, speed, robustness, scalability, interpretability
26
What is a simple split?
partitions data into two mutually exclusive subsets called a training set & a testing set
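A minimal sketch of a simple split, assuming the data fits in a Python list. The `simple_split` helper and the two-thirds training fraction are assumptions for illustration (two-thirds for training is a common choice, not a requirement).

```python
import random

def simple_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records, then cut them into training and testing sets."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = simple_split(range(30))
```

Every record lands in exactly one of the two sets, which is what makes them mutually exclusive.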
27
What is the main criticism of the simple split?
It assumes that the data in the two subsets are of the same kind (i.e., have the same properties)
28
What is k-fold cross-validation?
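Also known as rotation estimation; the complete data set is randomly split into k mutually exclusive subsets (folds) of approximately equal size. The model is trained and tested k times: each time it is trained on all but one fold and tested on the remaining fold, and the k results are averaged into an overall accuracy estimate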
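The fold rotation can be sketched as below. `k_fold_indices` is a hypothetical helper written for this card (libraries such as scikit-learn provide their own version); it only produces the index splits and leaves model training to the caller.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for each of the k rotations."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of approximately equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
```

Each record is used for testing exactly once and for training k - 1 times.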
29
What are some additional classification assessment methodologies?
  1. Leave-one-out

  2. Bootstrapping

  3. Jackknifing

  4. Area under the ROC curve
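For the last item, one way to compute the area under the ROC curve for a two-class problem is the pairwise-ranking form: the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The `roc_auc` helper and the sample labels and scores are illustrative assumptions, not course data.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair scores 1, a tie 0.5, a wrong order 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
```

An AUC of 1.0 means perfect ranking; 0.5 is no better than chance.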
