Data Mining Chapter One

52 Terms

1
New cards
Business Intelligence
Business intelligence (BI) is a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.
2
New cards
Gartner Foundational Definition of Big Data
In 2012, Gartner laid the foundation for the definition of Big Data, based on three characteristics: volume, velocity, and variety (the "3 Vs").
3
New cards
Data Analytics
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.
4
New cards
Data Mining
Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis. Data mining tools allow enterprises to predict trends.
5
New cards
Changes in the Definition of Data Mining
For example, Berry and Linoff, in their book Data Mining Techniques for Marketing, Sales and Customer Support, gave the following definition of data mining: “Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.” Three years later, the authors revisited their definition and wrote, “If there is anything we regret, it is the phrase ‘by automatic or semi-automatic means’ because we feel there has come to be too much focus on the automatic techniques and not enough on the exploration and analysis. This has misled many people into believing that data mining is a product that can be bought rather than a discipline that must be mastered.”
6
New cards
CRISP-DM
The Cross-Industry Standard Process for Data Mining. CRISP-DM provides a nonproprietary and freely available standard process for fitting data mining into the general problem-solving strategy of a business or research unit.
7
New cards
CRISP-DM Phase One
Business/Research Understanding Phase
8
New cards
Business/Research Understanding Phase
First clearly enunciate the project objectives and requirements in terms of the business as a whole. Then, translate these goals and restrictions into the formulation of a data mining problem definition. Finally, prepare a preliminary strategy for achieving these objectives.
9
New cards
CRISP-DM Phase Two
Data Understanding Phase
10
New cards
Data Understanding Phase
First, collect the data. Then, use exploratory data analysis to familiarize yourself with the data and discover initial insights. Evaluate the quality of the data. Finally, if desired, select interesting subsets that may contain actionable patterns.
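A minimal sketch of the "familiarize yourself with the data" and "evaluate the quality" steps, using only the Python standard library. The `ages` values and the `summarize` helper are hypothetical and exist only for illustration; real exploratory analysis would look at many more variables and plots.

```python
from statistics import mean, median

# Hypothetical variable with two missing values (None).
ages = [34, 41, None, 29, 38, None, 45]

def summarize(values):
    """Return basic summary statistics and a missing-value count."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),  # simple data-quality check
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }

summary = summarize(ages)
print(summary)
```

A high missing-value count or an implausible minimum or maximum is the kind of initial insight that feeds into the data preparation phase.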
11
New cards
CRISP-DM Phase Three
Data Preparation Phase
12
New cards
Data Preparation Phase
This labor-intensive phase covers all aspects of preparing the final data set, which will be used for subsequent phases, from the initial, raw, dirty data. Select the cases and variables you want to analyze and that are appropriate for your analysis. Perform transformations on certain variables, if needed. Clean the raw data so that it is ready for the modeling tools.
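The three activities named above (selecting cases, cleaning, and transforming variables) can be sketched as follows. The records, the mean-imputation choice, and the min-max scaling are all assumptions made for illustration; they are one simple option among many, not a prescribed method.

```python
from statistics import mean

# Hypothetical raw records: one missing income, one invalid age.
raw_records = [
    {"age": 34, "income": 52000.0},
    {"age": 41, "income": None},
    {"age": -1, "income": 61000.0},
    {"age": 29, "income": 47000.0},
]

def prepare(records):
    # Select cases: drop records whose age is missing or negative.
    valid = [dict(r) for r in records if r["age"] is not None and r["age"] >= 0]
    # Clean: fill missing income with the mean of the observed incomes.
    observed = [r["income"] for r in valid if r["income"] is not None]
    fill = mean(observed)
    for r in valid:
        if r["income"] is None:
            r["income"] = fill
    # Transform: min-max scale income to the range [0, 1].
    lo = min(r["income"] for r in valid)
    hi = max(r["income"] for r in valid)
    for r in valid:
        r["income_scaled"] = (r["income"] - lo) / (hi - lo) if hi > lo else 0.0
    return valid

prepared = prepare(raw_records)
for row in prepared:
    print(row)
```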
13
New cards
CRISP-DM Phase Four
Modeling Phase
14
New cards
Modeling Phase
Select and apply appropriate modeling techniques. Calibrate model settings to optimize results. Often, several different techniques may be applied to the same data mining problem. This phase may require looping back to the data preparation phase in order to bring the form of the data into line with the specific requirements of a particular data mining technique.
15
New cards
CRISP-DM Phase Five
Evaluation Phase
16
New cards
Evaluation Phase

The modeling phase has delivered one or more models. These models must be evaluated for quality and effectiveness before we deploy them for use in the field. Also, determine whether the model in fact achieves the objectives set for it in Phase 1. Establish whether some important facet of the business problem has not been sufficiently accounted for. Finally, come to a decision regarding the use of the data mining results.

17
New cards
CRISP-DM Phase Six
Deployment Phase
18
New cards
Deployment Phase
Model creation does not signify the completion of the project. The created models need to be put to use. For businesses, the customer often carries out the deployment based on your model.
19
New cards
Example of a simple deployment
Generate a report.
20
New cards
Example of a more complex deployment
Implement a parallel data mining process in another department.
21
New cards
Fallacy One
There are data mining tools that we can turn loose on our data repositories, and find answers to our problems.
22
New cards
Fallacy One Reality
There are no automatic data mining tools that will mechanically solve your problems “while you wait”. Rather, data mining is a process. CRISP-DM is one method for fitting the data mining process into the overall business or research plan of action.
23
New cards
Fallacy Two
The data mining process is autonomous, requiring little or no human oversight.
24
New cards
Fallacy Two Reality
Data mining is not magic. Without skilled human supervision, blind use of data mining software will only provide you with the wrong answer to the wrong question applied to the wrong type of data. Further, the wrong analysis is worse than no analysis, since it leads to policy recommendations that will probably turn out to be expensive failures. Even after the model is deployed, the introduction of new data often requires an updating of the model. Continuous quality monitoring and other evaluative measures must be assessed by human analysts.
25
New cards
Fallacy Three
Data mining pays for itself quite quickly.
26
New cards
Fallacy Three Reality
The return rates vary, depending on the start-up costs, analysis personnel costs, data warehousing preparation costs, and so on.
27
New cards
Fallacy Four
Data mining software packages are intuitive and easy to use.
28
New cards
Fallacy Four Reality
Again, ease of use varies. However, regardless of what some software vendor advertisements may claim, you cannot just purchase some data mining software, install it, sit back, and watch it solve all your problems. For example, the algorithms require specific data formats, which may require substantial preprocessing. Data analysts must combine subject matter knowledge with an analytical mind, and a familiarity with the overall business or research model.
29
New cards
Fallacy Five
Data mining will identify the causes of our business or research problems.
30
New cards
Fallacy Five Reality
The knowledge discovery process will help you uncover patterns of behavior. Again, it is up to the humans to identify the causes.
31
New cards
Fallacy Six
Data mining will automatically clean up our messy database.
32
New cards
Fallacy Six Reality
Not automatically. As a preliminary phase in the data mining process, data preparation often deals with data that has not been examined or used in years. Therefore, organizations beginning a new data mining operation will often be confronted with the problem of data that has been lying around for years, is stale, and needs considerable updating.
33
New cards
Fallacy Seven
Data mining always provides positive results.
34
New cards
Fallacy Seven Reality
There is no guarantee of positive results when mining data for actionable knowledge. Data mining is not a panacea for solving business problems. But, used properly, by people who understand the models involved, the data requirements, and the overall project objectives, data mining can indeed provide actionable and highly profitable results.
35
New cards
Data Mining Tasks
Description, Estimation, Prediction, Classification, Clustering, Association
36
New cards

Description

To describe patterns and trends lying within the data. Descriptions often suggest possible explanations for such patterns and trends. For example, if a poll shows that people who have been laid off are less likely to support the incumbent, one possible explanation is that they are now less well off financially than before the incumbent was elected, and so would tend to prefer an alternative.

37
New cards
Estimation
We approximate the value of a numeric target variable using a set of numeric and/or categorical predictor variables. Models are built using “complete” records, which provide the value of the target variable, as well as the predictors. Then, for new observations, estimates of the value of the target variable are made, based on the values of the predictors.
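As a minimal sketch of estimation, the snippet below fits a least-squares line to a few "complete" records and then estimates the target for a new observation. Simple linear regression is only one possible estimation technique, chosen here for brevity; the data values and the `fit_line` helper are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one numeric predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Complete records: predictor values with known numeric targets (made up).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# New observation: only the predictor is known, so the target is estimated.
estimate = slope * 5.0 + intercept
print(slope, intercept, estimate)
```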
38
New cards

Prediction

Similar to classification and estimation, except that the results lie in the future.

39
New cards

Classification

Similar to estimation, except that the target variable is categorical rather than numeric. There is a target categorical variable, such as income bracket, which could, for example, be partitioned into three categories: high income, middle income, and low income.
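A minimal sketch of classification with a categorical income-bracket target. It uses a k-nearest-neighbor majority vote, which is just one of many classification techniques. The training records, the two predictors (age and years of education), and the bracket labels "low", "middle", and "high" are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical training records: (age, years of education) -> income bracket.
training = [
    ((25.0, 12.0), "low"),
    ((30.0, 12.0), "low"),
    ((35.0, 16.0), "middle"),
    ((40.0, 16.0), "middle"),
    ((50.0, 20.0), "high"),
    ((55.0, 20.0), "high"),
]

def classify(point, records, k=3):
    """Assign the majority label among the k nearest training records."""
    nearest = sorted(records, key=lambda rec: math.dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((52.0, 19.0), training))
print(classify((27.0, 12.0), training))
```

In practice the predictors would usually be rescaled first, since unscaled distances let the variable with the largest range dominate.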

40
New cards

Clustering

Refers to the grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable. The task does not try to classify, estimate, or predict the value of a target variable. Instead, algorithms seek to segment the whole data set into relatively homogeneous subgroups, where the similarity of the records within the group is maximized and the similarity to records outside the group is minimized.
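A minimal sketch of clustering on one numeric variable, using a basic k-means loop. K-means is only one clustering algorithm, and the data values, starting centroids, and fixed iteration count here are assumptions chosen to keep the example short. Note that no target variable appears anywhere in the code.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and
    moving each centroid to the mean of its assigned values."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```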

41
New cards
Association
The association task for data mining is the job of finding which attributes “go together”. Most prevalent in the business world, where it is known as affinity analysis or market basket analysis, the task of association seeks to uncover rules for quantifying the relationship between two or more attributes. Association rules are of the form “If antecedent, then consequent”, together with a measure of the support and confidence associated with the rule.
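Support and confidence can be computed directly from a list of transactions. In this sketch, support is the fraction of all transactions containing both the antecedent and the consequent, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The baskets and item names are made up for illustration.

```python
# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(antecedent, consequent, baskets):
    """Fraction of all baskets that contain antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for b in baskets if both <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    both = antecedent | consequent
    return sum(1 for b in with_antecedent if both <= b) / len(with_antecedent)

# Rule: "If diapers, then beer".
rule_support = support({"diapers"}, {"beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here three of the five baskets contain both items, and three of the four baskets with diapers also contain beer.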
42
New cards
Estimation
The Boston Celtics would like to approximate how many points their next opponent will score against them.
43
New cards
Description
A military intelligence officer is interested in learning about the respective proportions of Sunnis and Shias in a particular strategic region.
44
New cards
Classification
A NORAD defense computer must decide immediately whether a blip on the radar is a flock of geese or an incoming nuclear missile.
45
New cards
Clustering
A political strategist is seeking the best groups to canvass for donations in a particular county.
46
New cards
Association
A Homeland Security official would like to determine whether a certain sequence of financial and residence moves implies a tendency to terrorist acts.
47
New cards
Prediction
A Wall Street analyst has been asked to find out the expected change in stock price for a set of companies with similar price/earnings ratios.
48
New cards
Evaluation Phase
Managers want to know by next week whether deployment will take place. Therefore, analysts meet to discuss how useful and accurate their model is.
49
New cards
Data Understanding Phase
The data mining project manager meets with the data warehousing manager to discuss how the data will be collected.
50
New cards
Business Understanding Phase
The data mining consultant meets with the Vice President for Marketing, who says that he would like to move forward with customer relationship management.
51
New cards
Deployment Phase
The data mining project manager meets with the production line supervisor, to discuss implementation of changes and improvements.
52
New cards
Modeling Phase
The analysts meet to discuss whether the neural network or decision tree models should be applied.