3KE3 Chapter 4

Why data mining?

- More intense competition at the global scale
- Recognition of the value in data sources
- Availability of quality data on customers, vendors, transactions, the Web, etc.
- Consolidation and integration of data repositories into data warehouses
- Exponential increase in data processing and storage capabilities, and a decrease in their cost
- Movement toward conversion of information resources into nonphysical form

What is data mining?

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.

What are the four types of patterns?

association, prediction, cluster (segmentation), and sequential (or time series) relationships

What are the three most common data mining processes?

CRISP-DM
SEMMA
KDD (Knowledge Discovery in Databases)

What are the six steps in the CRISP-DM data mining process?

Business understanding
Data understanding
Data preparation
Model building
Testing and evaluation
Deployment

What are the 5 steps in the SEMMA data mining process?

Sample, Explore, Modify, Model, Assess

What are the steps involved in KDD?

Data selection
Data cleaning
Data transformation
Data mining
Internalization

Provide examples of commercial data mining software tools

IBM SPSS Modeler, SAS Enterprise Miner, Statistica

Provide examples of free and/or open source software tools

KNIME, RapidMiner, R

What are the major characteristics and objectives of data mining?

- Data is presented in a variety of formats
- The data environment is usually a client/server or Web-based information systems architecture
- The miner is often the end user
- DM tools are readily combined with spreadsheets and other software development tools
- Parallel processing is sometimes needed because of the large amounts of data and the search effort involved

What are associations?

commonly co-occurring groupings of things
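
As a rough illustration of "commonly co-occurring", the sketch below counts how often each pair of items appears in the same basket. The transactions are made-up data, and the function name `pair_support` is hypothetical; the fraction it returns is the measure usually called support in association mining, a term the card itself does not use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

def pair_support(baskets):
    """Return the fraction of baskets in which each item pair co-occurs."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always keyed in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
for pair, share in sorted(support.items()):
    print(pair, share)
```

Pairs with a high share are candidate associations; real tools add further filtering that is not shown here.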

What are predictions?

Tell the nature of future occurrences of certain events based on what has happened in the past. Unlike simple guessing, prediction takes experience, opinions, and other relevant information into account. Commonly associated with forecasting.

What are clusters?

Identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors

What are sequential relationships?

discover time-ordered events

What are the three main categories of Data Mining?

prediction, association, and segmentation (clustering)

What is classification?

A supervised task whose objective is to analyze the historical data stored in a database and automatically generate a model that can predict future behavior, expressed as membership in a predefined class

What are the three main types of prediction?

classification, regression, time-series

What are decision trees?

A data mining methodology that generates rules and classifies data sets; a tree can be read as a hierarchy of if/then statements
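
To make the "hierarchy of if/then statements" concrete, here is a minimal hand-written sketch. In practice a decision tree algorithm induces the splits from data; the function `classify_loan`, the attributes, and every threshold below are invented for illustration only.

```python
def classify_loan(applicant):
    """Walk a tiny hand-written tree; all splits and thresholds are hypothetical."""
    if applicant["income"] >= 50_000:          # root split
        if applicant["debt_ratio"] < 0.4:      # second-level split
            return "approve"
        return "review"
    if applicant["has_collateral"]:            # split on the low-income branch
        return "review"
    return "reject"

print(classify_loan({"income": 62_000, "debt_ratio": 0.25, "has_collateral": False}))
```

Each path from the root to a leaf corresponds to one rule, for example "if income is at least 50,000 and the debt ratio is below 0.4, then approve".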

What is clustering?

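Partitioning a collection of things (e.g., objects, events) into segments, or natural groupings, whose members share similar characteristics; unlike in classification, the class labels are not known in advance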
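
One way to see how groupings can emerge from the data alone is the sketch below. It uses k-means, a common clustering algorithm that the card does not name, on made-up one-dimensional data; the function name `kmeans_1d` and the starting centroids are assumptions chosen for the example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Cluster numbers around the given starting centroids (a basic k-means loop)."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical annual spend per customer; no segment labels are supplied.
spend = [10, 12, 11, 95, 100, 105]
centers, segments = kmeans_1d(spend, centroids=[0, 50])
print(centers, segments)
```

The algorithm is given only the values, never a label, which is what makes the task unsupervised.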

What are two commonly used derivatives in association mining?

Link Analysis -> the linkage among many objects of interest is discovered automatically
Sequence mining -> relationships are examined in terms of their order of occurrence to identify associations over time

What is the difference between statistics and data mining?

Statistics - starts with a well-defined hypothesis and collects sample data to test it

DM & Analytics - start with a loosely defined discovery statement and use all existing data to discover new patterns & relationships

What is the difference between CRISP-DM and SEMMA?

CRISP-DM: takes a more comprehensive approach to DM projects, including understanding of the business and the relevant data

SEMMA: implicitly assumes that the DM project's goals, objectives, and data sources have already been identified and understood

Describe Knowledge Discovery in Databases (KDD)

The process of using DM methods to find useful information and patterns in data. Data mining itself, by contrast, involves using algorithms to identify patterns in data derived through the KDD process.

What's the difference between classification and clustering?

Classification learns the function between the characteristics of things (input variables) and their membership (output variable) through a supervised learning process in which both types of variables are presented to the algorithm

Clustering learns the membership of things through an unsupervised learning process in which only the input variables are presented to the algorithm
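
The supervised side of this contrast can be sketched as follows. The classifier here is a simple nearest-mean rule chosen for brevity, not a method taken from the card, and the data and the names `fit_nearest_mean` and `predict` are hypothetical. The point is only that training receives both the inputs and the known outputs.

```python
from collections import defaultdict

def fit_nearest_mean(inputs, labels):
    """Supervised: learn one mean per known class label."""
    by_label = defaultdict(list)
    for x, y in zip(inputs, labels):
        by_label[y].append(x)
    return {label: sum(xs) / len(xs) for label, xs in by_label.items()}

def predict(model, x):
    """Assign x to the class whose learned mean is closest."""
    return min(model, key=lambda label: abs(x - model[label]))

# Hypothetical training data: monthly purchases (input) and a known segment (output).
purchases = [1, 2, 3, 20, 22, 24]
segment = ["occasional"] * 3 + ["frequent"] * 3

model = fit_nearest_mean(purchases, segment)
print(predict(model, 5), predict(model, 18))
```

A clustering routine applied to the same problem would be handed `purchases` alone and would have to discover the groups without `segment`.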

What are the factors considered in model assessment?

predictive accuracy, speed, robustness, scalability, interpretability

What is a simple split?

partitions data into two mutually exclusive subsets called a training set & a testing set
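
A minimal sketch of such a split, using only the Python standard library. The one-third test fraction and the fixed seed are assumptions made for the example, and `simple_split` is a hypothetical helper name.

```python
import random

def simple_split(records, test_fraction=1/3, seed=0):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(12))  # stand-in for 12 data rows
train_set, test_set = simple_split(records)
print(len(train_set), len(test_set))
```

Because every row lands in exactly one of the two lists, the subsets are mutually exclusive and together cover the whole data set.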

What is the main criticism of the simple split?

that it makes the assumption that the data in the two subsets are of the same kind

What is k-fold cross-validation?

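Also known as rotation estimation. The complete data set is randomly split into k mutually exclusive subsets of approximately equal size; the model is then trained and tested k times, each time trained on all but one subset and tested on the remaining one, and the k accuracy measures are averaged.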
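
The procedure can be sketched as below, assuming the usual completion of the method: each subset (fold) serves once as the test set while the others are used for training, and the k scores are averaged. The callback `score_fold` is a hypothetical placeholder for whatever code trains a model and returns its accuracy on the held-out fold.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Randomly split row indices into k mutually exclusive folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_rows, k, score_fold):
    """Average the score obtained when each fold serves once as the test set."""
    folds = k_fold_indices(n_rows, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training rows are all rows that are not in the current test fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(score_fold(train_idx, test_idx))
    return sum(scores) / k

folds = k_fold_indices(10, 3)
print([len(fold) for fold in folds])
```

With 10 rows and k = 3 the fold sizes cannot be identical, which is why the definition says "approximately equal".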

What are some additional classification assessment methodologies?

Leave-one-out
Bootstrapping
Jackknifing
Area under the ROC curve
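
Of these, the area under the ROC curve lends itself to a short worked example. The sketch below relies on a standard equivalence: the area equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The labels, scores, and the function name `roc_auc` are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true classes (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the area is 8/9; a value of 1.0 would mean perfect ranking and 0.5 is what random scoring would be expected to give.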
