Data Mining Intro

26 Terms

New cards

Four V’s of Big Data

Volume

Variety

Velocity

Veracity

New cards

What is data mining?

The science of discovering interesting knowledge automatically from huge data repositories

New cards

Basic elements of data mining

data: very large, huge

mining techniques: automatic, semi-automatic, non-trivial, efficient

new knowledge: implicit, previously unknown, potentially useful, w/ meaningful and unexpected patterns

New cards

why is data mining so important

it helps in automated decision making

it helps in cost reduction (use and allocate resources more efficiently)

it helps in precise prediction and forecasting

it improves overall customer experience

New cards

DM origins

draws ideas from machine learning/AI, pattern recognition, statistics and database systems

built for data that are:

large scale
high dimensional
heterogeneous - complex
distributed

New cards

data mining

techniques for analyzing existing data to find patterns and extract insights

New cards

artificial intelligence

creating machines that perform intelligent tasks autonomously

New cards

machine learning

enables a computer to learn and improve on its own, based on experience

New cards

data science

end-to-end solution for collecting, processing, analyzing and interpreting vast amounts of data to solve complex problems

New cards

data mining process

data → data preprocessing → data mining → data post-processing → information
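
The flow above can be sketched as three chained functions. This is only an illustrative toy: the function names (`preprocess`, `mine`, `postprocess`), the `min_count` parameter, and the sample records are assumptions made for this example, and simple frequency counting stands in for a real mining technique.

```python
from collections import Counter

def preprocess(records):
    """Preprocessing: drop empty entries and normalise case and whitespace."""
    return [r.strip().lower() for r in records if r and r.strip()]

def mine(items):
    """Mining: count how often each item occurs (a stand-in for a real technique)."""
    return Counter(items)

def postprocess(counts, min_count=2):
    """Post-processing: filter out patterns seen fewer than min_count times,
    then sort by descending count (ties broken alphabetically)."""
    kept = {item: n for item, n in counts.items() if n >= min_count}
    return sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical raw data with inconsistent case, stray spaces and an empty entry.
data = ["Milk", "bread ", "", "milk", "Beer", "bread", "milk"]
information = postprocess(mine(preprocess(data)))
print(information)
```

Each stage takes the previous stage's output, mirroring data → preprocessing → mining → post-processing → information.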

New cards

data

collect

describe

explore

verify

New cards

data preprocessing

feature selection

cleaning

dimensionality reduction

construct data

integrate data

normalization
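
Two of the steps listed above, cleaning and normalization, can be sketched in a few lines. This is a minimal sketch under assumptions: `drop_missing` and `min_max_normalize` are hypothetical helper names, missing values are represented as `None`, and min-max scaling is just one common normalization choice.

```python
def drop_missing(rows):
    """Cleaning: keep only rows that have no missing (None) entries."""
    return [row for row in rows if all(v is not None for v in row)]

def min_max_normalize(values):
    """Normalization: rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical (age, income) records; one record has a missing income.
raw = [(20, 1500), (35, None), (50, 4500), (65, 3000)]
clean = drop_missing(raw)
ages = min_max_normalize([r[0] for r in clean])
incomes = min_max_normalize([r[1] for r in clean])
print(clean, ages, incomes)
```

After normalization both attributes share the same scale, so neither dominates a distance-based technique simply because of its units.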

New cards

data mining

select modeling techniques

generate test design

build model

assess model

New cards

data post-processing

filter patterns

visualization

pattern interpretation

New cards

pre & post processing objectives

Pre

convert the data into the right format for subsequent analysis by selecting the appropriate data segments and extracting attributes that are relevant to the data mining task

Post

make the data mining results more accessible and easier for analysts to interpret, e.g. remove uninteresting patterns, or apply visualization techniques to explore and interact with the data mining results

New cards

data mining tasks - predictive tasks

use some variables to predict unknown or future values of other variables

e.g. predict which users will buy a specific product

New cards

data mining tasks - descriptive tasks

find human-interpretable patterns that describe the data

e.g. find the set of documents that share similar topics

New cards

predictive modeling

use some variables to predict unknown or future values of other variables

types of variables

explanatory - define the properties of the data
target - the variable whose value is to be predicted by the DM task

subcategories:

classification - predict the values of discrete target variables
regression - predict the values of continuous target variables
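
The difference between the two subcategories can be illustrated with one explanatory variable. The techniques chosen here (a 1-nearest-neighbour classifier and an ordinary least-squares line) are assumptions picked for brevity, not methods named by the cards, and the function names and data are hypothetical.

```python
def nearest_neighbor_classify(train, x):
    """Classification: return the discrete label of the closest training point."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]

def fit_line(xs, ys):
    """Regression: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Discrete target: will the user buy ("yes" / "no")?
labelled = [(1, "no"), (2, "no"), (8, "yes"), (9, "yes")]
label = nearest_neighbor_classify(labelled, 7)

# Continuous target: a numeric value to be predicted for x = 5.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = slope * 5 + intercept
print(label, predicted)
```

In both cases the explanatory variable is the input; only the type of the target (discrete label versus continuous number) changes.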

New cards

association analysis

find hidden associations in transaction data

produces a set of dependence rules that predict the occurrence of some items based on the occurrence of others
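
A dependence rule such as {diapers} → {beer} is usually scored with support and confidence. Those two measures are not defined on the card; they are standard textbook quantities added here for illustration, and the basket data and function names are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is typically kept only if both values exceed thresholds chosen by the analyst.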

New cards

clustering

group similar data together

homogeneous groupings of data points; data points belonging in one cluster are more similar to each other than to data points from a separate cluster

Applications

understand the data
- e.g. land segmentation according to vegetation cover
summarize the data to reduce their size
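
The grouping idea can be sketched with k-means on one-dimensional points. k-means is one clustering technique among many and is not named on the card; the function name, the fixed starting centers, and the data are assumptions for this example.

```python
def k_means_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: repeatedly assign each point to its nearest center,
    then move each center to the mean of the points assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

points = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(points, centers=[0, 5])
print(centers, clusters)
```

The two resulting centers also act as a compact summary of the six original points, which is the size-reduction use mentioned above.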

New cards

anomaly detection

find outliers: data points that do not fit and are significantly different from the rest of the data
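
One simple way to flag such points is a z-score test: measure how many standard deviations each value lies from the mean. This is a sketch of one possible approach, not the card's prescribed method; the threshold of 2.0 and the sample readings are assumptions.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(readings)
print(outliers)
```

Note that the outlier itself inflates the mean and standard deviation, so on small samples a more robust statistic such as the median may be a better choice.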

New cards

scalability

need for efficient data structures, parallel algorithms, etc

New cards

dimensionality

number of dimensions (attributes, features) is too large, often because of the temporal, spatial, and sequential nature of the data

New cards

heterogeneity

complicated data types (graph-based, free-form text, structured and semi-structured) that traditional statistical methods might not be able to handle

New cards

imperfection

missing values and noise; perfect algorithm + imperfect data = wrong info

New cards

data ownership and distribution

need to develop distributed data mining solutions: efficient algorithms that cope with distributed datasets while minimizing communication cost, and that address data security and data ownership issues