Data Mining Intro

26 Terms

1

Four V’s of Big Data

Volume

Variety

Velocity

Veracity

2

What is data mining?

The science of discovering interesting knowledge automatically from huge data repositories

3

Basic elements of data mining

data: very large, huge

mining techniques: automatic, semi-automatic, non-trivial, efficient

new knowledge: implicit, previously unknown, potentially useful, with meaningful and unexpected patterns

4

why is data mining so important?

it helps in automated decision making

it helps in cost reduction (use and allocate resources more efficiently)

it helps in precise prediction and forecasting

it improves overall customer experience

5

DM origins

draws ideas from machine learning/AI, pattern recognition, statistics and database systems

built for data that are:

  • large scale

  • high dimensional

  • heterogeneous - complex

  • distributed

6

data mining

techniques for analyzing existing data to find patterns and extract insights

7

artificial intelligence

creating machines that perform intelligent tasks autonomously

8

machine learning

enables a computer to learn and improve on its own, based on experience

9

data science

end-to-end solution for collecting, processing, analyzing, and interpreting vast amounts of data to solve complex problems

10

data mining process

data → data preprocessing → data mining → data post-processing → information
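
The chain above can be sketched as three small functions applied in order. This is only an illustrative toy, not a standard implementation: the function names, the shopping-basket records, and the `min_count` threshold are all assumptions made for the example.

```python
from collections import Counter

def preprocess(raw_records):
    """Preprocessing: drop empty records and normalize item spelling."""
    return [[item.strip().lower() for item in rec] for rec in raw_records if rec]

def mine(records):
    """Mining: count how many records contain each item."""
    counts = Counter()
    for rec in records:
        counts.update(set(rec))  # set() so an item counts once per record
    return counts

def postprocess(counts, min_count):
    """Post-processing: keep only patterns frequent enough to be interesting."""
    return {item: n for item, n in counts.items() if n >= min_count}

# Hypothetical raw data: inconsistent case, stray spaces, one empty record.
raw = [["Milk", " bread"], [], ["milk", "Eggs"], ["BREAD", "milk"]]
information = postprocess(mine(preprocess(raw)), min_count=2)
print(sorted(information.items()))  # [('bread', 2), ('milk', 3)]
```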

11

data

collect

describe

explore

verify

12

data preprocessing

feature selection

cleaning

dimensionality reduction

construct

integrate data

normalization
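
As one concrete case of the steps listed above, normalization can be sketched with min-max scaling, which rescales an attribute to the range [0, 1]. Min-max is only one of several normalization schemes, and the `ages` values are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale, so map to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]  # hypothetical attribute values
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```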

13

data mining

select modeling techniques

generate test design

build model

assess model

14

data post-processing

filter patterns

visualization

pattern interpretation

15

pre & post processing objectives

Pre

  • convert the data into the right format for subsequent analysis by selecting the appropriate data segments and extracting attributes that are relevant to the data mining task

Post

  • make the data mining results more accessible and easier for analysts to interpret, e.g. remove uninteresting patterns and apply visualization techniques to explore and interact with the data mining results

16

data mining tasks - predictive tasks

use some variables to predict unknown or future values of other variables

e.g. predict which users will buy a specific product

17

data mining tasks - descriptive tasks

find human-interpretable patterns that describe the data

e.g. find the set of documents that share similar topics

18

predictive modeling

use some variables to predict unknown or future values of other variables

types of variables

  • explanatory - define the properties of data

  • target - whose value is to be predicted by the DM task

subcategories:

  • classification - predict the values of discrete target variables

  • regression - predict the values of continuous target variables
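
The difference between the two subcategories can be sketched with two tiny models. The choice of a 1-nearest-neighbor rule for classification and a least-squares line for regression is an assumption made for illustration, as are the toy data; many other models fit either category.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the discrete label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def fit_line(xs, ys):
    """Regression: least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b

# Explanatory variable: hours spent on a site. Discrete target: buys or not.
train = [(1.0, "no"), (2.0, "no"), (8.0, "yes"), (9.0, "yes")]
print(nearest_neighbor_label(train, 7.0))  # yes

# Same kind of explanatory variable, continuous target: amount spent.
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(a * 5.0 + b)  # 11.0
```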

19

association analysis

find hidden associations in transaction data

produces a set of dependence rules that predict the occurrences of other variables
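
A dependence rule such as {diapers} → {beer} is commonly scored by its support (the fraction of transactions containing all the items) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the basket data and function names are invented for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions with the antecedent, the fraction that also has the consequent."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies, so report no confidence
    return count(antecedent | consequent, transactions) / base

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.75
```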

20

clustering

group similar data together

homogeneous groupings of data points; data points belonging in one cluster are more similar to each other than to data points from a separate cluster

Applications

  • understand the data

    • e.g. land segmentation according to vegetation cover

  • summarize the data to reduce their size
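
To make the grouping idea concrete, here is a minimal one-dimensional k-means sketch. K-means is only one of many clustering algorithms and is an assumption of this example, as are the vegetation-cover percentages and the starting centers.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical vegetation cover (%) for six land parcels.
cover = [5.0, 10.0, 15.0, 70.0, 80.0, 90.0]
centers, clusters = k_means_1d(cover, centers=[5.0, 10.0])
print(centers)   # [10.0, 80.0]
print(clusters)  # [[5.0, 10.0, 15.0], [70.0, 80.0, 90.0]]
```

The two final centers also serve the summarization use: six values are reduced to two representative ones.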

21

anomaly detection

find outliers: data points that do not fit and are significantly different from the rest of the data
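
One simple way to flag such points is a z-score test: measure how many standard deviations each value lies from the mean. The sketch below assumes roughly bell-shaped data, an arbitrary threshold of 2.0, and made-up sensor readings; it is one technique among many.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

readings = [10, 11, 9, 10, 12, 10, 50]  # hypothetical sensor readings
print(z_score_outliers(readings))  # [50]
```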

22

scalability

need for efficient data structures, parallel algorithms, etc

23

dimensionality

the number of dimensions (attributes, features) can be too large, often because of the temporal, spatial, or sequential nature of the data

24

heterogeneity

complicated data types (graph-based, free-form text, structured and semi-structured) that traditional statistical methods might not be able to handle

25

imperfection

missing values and noise; even a perfect algorithm applied to imperfect data produces wrong information
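
Missing values are usually handled before mining. One common but simplistic option, shown here as an assumption and not as something the card prescribes, is to fill each gap with the mean of the known values. `None` marks a missing entry in this sketch.

```python
def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from, so leave gaps as they are
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

print(impute_mean([4.0, None, 6.0, None, 8.0]))  # [4.0, 6.0, 6.0, 6.0, 8.0]
```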

26

data ownership and distribution

need to develop distributed data mining solutions, efficient algorithms to cope with the distributed datasets to minimize the cost of communications, along with data security and data ownership issues