data analysis
Test models and hypotheses on datasets
data fishing / data dredging
Analysing data without an a-priori hypothesis (1960s)
Data mining
The process of extracting and discovering patterns within large datasets
Information age
Data collection: Automated data collection tools and mature data technologies lead to tremendous amounts of data stored in databases and other information repositories
Petabytes of data are produced every day.
The need for tools and processes to move from the Data Age to the Information Age
Data mining
can be viewed as a result of the natural evolution of information technology
data mining
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large datasets
what is not data mining
(Deductive) query processing.
Expert systems or small ML/statistical programs
Data mining applications
Sports - Advanced analysis of game statistics (shots blocked, assists, and fouls) to gain a competitive advantage
Science- Astronomy, earth science, meteorology, experimental physics,
Web mining - Surf-Aid: Many companies apply data mining algorithms to Web access logs for market-related pages to discover customer preferences and behaviour, analyse the effectiveness of Web marketing, improve Web site organisation, etc.
Fraud Detection
Market Analysis and Management
Market management
customer profiling
identifying customer requirements
providing summary information
Corporations
Finance planning and asset evaluation: Cash flow analysis and prediction, Contingent claim analysis to evaluate assets, cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
Resource planning: Summarise and compare the resources and spending
Competition: Monitor competitors and market directions, group customers into classes and set a class-based pricing procedure, set pricing strategy in a highly competitive market
data mining
the core of knowledge discovery process
KDD process
Learning the application domain: relevant prior knowledge and goals of the given applications.
Creating a target data set: data selection
Cleaning and pre-processing the data (may take 60% of effort!)
Performing data reduction and transformation: Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarisation, classification, regression, association, clustering
Choosing the mining algorithms
Data mining: search for patterns of interest
Testing & Evaluating the patterns and presenting the knowledge: visualisation, transformation, removing redundant patterns, etc.
Using the discovered knowledge
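A minimal sketch of how these KDD steps might look in Python with pandas and scikit-learn; the file name sales.csv, the column names, and the choice of k-means as the mining function are illustrative assumptions, not part of the process description above.

```python
# Hypothetical end-to-end sketch of the KDD steps above.
# "sales.csv", the column names and k=3 are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1-2. Create the target data set: select relevant attributes
data = pd.read_csv("sales.csv")[["age", "income", "spend"]]

# 3. Cleaning / pre-processing: drop incomplete records
data = data.dropna()

# 4. Reduction / transformation: scale features to comparable ranges
X = StandardScaler().fit_transform(data)

# 5-6. Choose a mining function (clustering) and run it
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# 7. Evaluate / present the discovered patterns
data["cluster"] = model.labels_
print(data.groupby("cluster").mean())
```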
DM Software Tools
Commercial and free (open source) software tools
Commercial: KXEN Modeler, IBM SPSS Modeler, Oracle Data Mining, Angoss KnowledgeSTUDIO, etc.
Free/open source: RapidMiner, Weka, KNIME, SCaViS, Kaggle, Rattle, Tanagra, R, Orange (free version), …
purpose of the analysis
Identify population groups and domains of interest.
Example: Population groups: customers, personnel, etc; Domain of interest: sales, profit, stock, products, etc
audience for the analysis
Agencies, companies, directors, communities, etc
why data pre-processing?
Data in the real world is dirty: incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data); noisy (containing errors or outliers); inconsistent (containing discrepancies in codes or names)
No quality data, no quality mining results!: Quality decisions must be based on quality data; Data warehouse needs consistent integration of quality data
perfect data
Data is valid, complete, and reliable. No data extrapolation is needed
not perfect data
Data with NO serious flaws, but needs some pre-processing
verbal/inspection data
Data with serious gaps → requires additional documentation and verification prior to its inclusion in the DM process
soft data
Data that relies on the memories of experienced personnel at the participating facility
The most difficult to summarise
example of not perfect data
The data recorded in a dimension which is not important
example of verbal/inspection data
Wrong or non-recorded values of an airplane flight parameter
example of soft data
Memories of an experienced analyst who dealt with the same problem
before
data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
data integration
Integration of multiple databases, data cubes, or files
data transformation
Normalisation and aggregation
data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
data discretization
Part of data reduction but with particular importance, especially for numerical
data
dataset
Collection of data objects and their attributes
A data object is also called a record, point,
entity, or instance
data object
A collection of attributes describing a data
object
attribute
a property or characteristic of an object
Also called a feature, dimension, variable
Examples: gender, age, income, etc.
categorical (nominal) variable
The value of a categorical variable can take more than 2 states.
Example: Green, Blue, Black, Brown,
binary variable
Its value can take two categories: 0 or 1, True or False
ordinal variable
Possible values that have a meaningful order or ranking among
them, but the magnitude between successive values is not known
Example: Excellent, very good, good, average
interval-scaled variable
The values are measured on a scale of equal-size units.
Example: age, weight (kg), …,
ratio-scaled variable
A value is ratio-scaled if it is a multiple (or ratio) of another value.
Example: exponential scale
discrete vs continuous variable
A discrete attribute has a finite or countably infinite set of values; a continuous attribute takes real numbers as values
descriptive data summarization
Identify typical properties of the data
Highlight which data values should be treated as noise or outliers
descriptive statistics
Understand the distribution of the data
Central Tendency: mean, median, midrange
Data Dispersion: quartiles, inter-quartile range (IQR), variance
arithmetic mean
Effective numerical measure of the centre
Let {x1, x2, …, xN} be a set of observations; the mean is (x1 + x2 + … + xN)/N
Drawbacks
Sensitivity to extreme values (outliers)
Trimmed mean: obtained after removing the extreme values
median
Used for skewed (asymmetric data)
Let {x1, x2, …, xN} be a set of ordered observations; the median is the middle value if N is odd and the average of the two middle values if N is even
mode
Indicates the value that occurs most frequently in the set
range
max(X) - min(X)
midrange
[max(X) + min(X)]/2
dispersion of the data
The degree to which numerical data tend to spread
the most common measures are range, five-number summary, IQR, and standard deviation
standard deviation
Measures spread about the mean
Can only be used when the mean is chosen as the measure of the centre
Let X = {x1, x2, …, xN} with mean x̄; the standard deviation is σ = sqrt( (1/N) Σ (xi − x̄)² )
five-number summary
{min(X), Q1, median, Q3, max(X)}
IQR
Q3-Q1
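A short NumPy example computing the central tendency and dispersion measures above on a made-up sample:

```python
# Central tendency and dispersion measures for a small made-up sample.
import numpy as np

x = np.array([13, 15, 16, 16, 19, 20, 21, 22, 25, 30, 35, 52])

mean       = x.mean()
median     = np.median(x)
mode       = np.bincount(x).argmax()   # works here because the values are small non-negative integers
data_range = x.max() - x.min()
midrange   = (x.max() + x.min()) / 2
q1, q3     = np.percentile(x, [25, 75])
iqr        = q3 - q1
std        = x.std()                   # population standard deviation, sqrt((1/N) * sum((xi - mean)^2))

print("five-number summary:", x.min(), q1, median, q3, x.max())
print(f"mean={mean:.2f} mode={mode} midrange={midrange} range={data_range} IQR={iqr} std={std:.2f}")
```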
histograms
Frequency histograms
A graphical method for summarising the distribution of a given attribute
quantile plot
Simple way to have a 1st look at a univariate data distribution
Allows us to compare different distributions based on their quantiles
Quantile-Quantile Plot (q-q plot)
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Is a powerful visualisation tool
scatter plot
Each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane
Most effective graphical methods for determining if there appears to be relationship, patterns, or trends between two numerical attributes
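A sketch of these plot types with matplotlib on synthetic data; the attribute names in the comments are only illustrative:

```python
# Frequency histogram, quantile plot, and scatter plot for two numeric attributes.
# The data here is synthetic, generated only to illustrate the plots.
import numpy as np
import matplotlib.pyplot as plt

gen = np.random.default_rng(0)
x = gen.normal(50, 10, 200)          # e.g. unit price
y = 3 * x + gen.normal(0, 20, 200)   # e.g. items sold, loosely related to x

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(x, bins=10)                        # frequency histogram
axes[0].set_title("histogram")

f = (np.arange(len(x)) + 0.5) / len(x)          # f-values for the quantile plot
axes[1].plot(f, np.sort(x), marker=".")
axes[1].set_title("quantile plot")

axes[2].scatter(x, y, s=10)                     # scatter plot of the two attributes
axes[2].set_title("scatter plot")

plt.tight_layout()
plt.show()
```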
Missing values may be due to:
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to a misunderstanding
Certain data may not be considered important at the time of entry
History or changes of the data were not registered
how to handle missing values
Ignore the tuple: not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
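A small pandas sketch of three of these fill-in strategies; the column names ("income", "class") and values are hypothetical:

```python
# Three of the fill-in strategies above, using pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

# Global constant
filled_const = df["income"].fillna(-1)

# Attribute (column) mean
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean of the samples in the same class (smarter)
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_class_mean)
```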
noise
random error or variance in a measured variable
Incorrect attribute values may be due to
Faulty data collection instruments
Data entry problems
Data transmission problems
Technology limitation
Inconsistency in naming conventions
how to handle noisy data
Binning method: first sort data and partition into bins (buckets); then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Clustering: detect and remove outliers
Combined computer and human inspection: detect suspicious values and check by human
Regression: smooth by fitting the data into regression function
equal-width (distance) partitioning
It divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
The most straightforward
But outliers may dominate presentation
Skewed data is not handled well
equal-depth (frequency) partitioning
It divides the range into N intervals, each containing approximately the same number of objects
Good data scaling
Managing categorical attributes can be tricky
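A sketch of both partitioning schemes, plus smoothing by bin means, using pandas.cut and pandas.qcut on a made-up price attribute:

```python
# Equal-width vs. equal-depth partitioning, and smoothing by bin means.
import pandas as pd

price = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width: each bin spans (max - min) / N units
equal_width = pd.cut(price, bins=N)

# Equal-depth: each bin holds roughly the same number of values
equal_depth = pd.qcut(price, q=N)

# Smoothing by bin means: replace each value with the mean of its (equal-depth) bin
smoothed = price.groupby(equal_depth, observed=True).transform("mean")
print(pd.DataFrame({"price": price, "bin": equal_depth, "smoothed": smoothed}))
```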
data integration
combines data from multiple sources into a coherent store
schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data
sources, e.g., A.cust-id =B.cust-#
smoothing
remove noise from data
aggregation
summarisation, data cube construction
generalization
concept hierarchy climbing
normalization
scaled to fall within a small, specified range
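A minimal NumPy sketch of min-max normalisation (and, for comparison, z-score normalisation) on illustrative income values:

```python
# Min-max normalisation to [0, 1] and z-score normalisation of a numeric attribute.
import numpy as np

income = np.array([12_000, 28_000, 35_000, 54_000, 98_000], dtype=float)

# Min-max: scale to fall within the specified range [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min

# Z-score: centre on the mean, scale by the standard deviation
zscore = (income - income.mean()) / income.std()

print(minmax)
print(zscore)
```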
attribute/feature construction
New attributes constructed from the given one
data reduction
Obtains a reduced representation of the dataset that is much smaller in volume but yet produces the same (or almost the same) analytical results
data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretisation and concept hierarchy generation
feature selection
Select a minimum set of features such that the probability distribution of
different classes given the values for those features is as close as possible to
the original distribution given the values of all features
reduce the number of patterns ==> easier to understand
heuristic methods
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
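A hedged sketch of step-wise forward selection from the list above; the scorer (cross-validated accuracy of a decision tree on the Iris data) is one possible choice of evaluation function, not one prescribed by the material:

```python
# Sketch of step-wise forward selection: start with an empty set and greedily
# add the feature that most improves a score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def score(feature_idx):
    """Cross-validated accuracy using only the selected feature columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, feature_idx], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best_so_far = -np.inf
while remaining:
    candidate = max(remaining, key=lambda f: score(selected + [f]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_so_far:      # stop when no feature improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_so_far = candidate_score

print("selected feature indices:", selected, "score:", round(best_so_far, 3))
```

Step-wise backward elimination works the same way in reverse: start from all features and greedily drop the one whose removal hurts the score least.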
string compression
There are extensive theories and well-tuned algorithms
typically lossless
But only limited manipulation is possible without expansion
audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without reconstructing
the whole
principal component analysis (PCA)
Given M data vectors in n dimensions, find k ≤ n orthogonal vectors that can best be used to represent the data: the original dataset is reduced to one consisting of M data vectors on the k principal components (reduced dimensions)
Each data vector is a linear combination of the k principal component vectors
Works for numerical data only
Used when the number of dimensions is large
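A minimal NumPy sketch of PCA via the singular value decomposition, on random data:

```python
# PCA via the singular value decomposition: project M data vectors in n
# dimensions onto k <= n orthogonal directions. The data here is random.
import numpy as np

gen = np.random.default_rng(0)
X = gen.normal(size=(200, 5))          # M = 200 vectors, n = 5 dimensions
k = 2                                  # number of principal components to keep

Xc = X - X.mean(axis=0)                # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                    # k orthogonal principal component vectors

X_reduced = Xc @ components.T          # 200 x k: each row is a linear combination
X_approx  = X_reduced @ components + X.mean(axis=0)   # reconstruction from k PCs

print(X_reduced.shape)                 # (200, 2)
```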
parametric methods
Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces
non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
linear regression
Data are modelled to fit a straight line
Often uses the least-square method to fit the line
multiple regression
allows a response variable Y to be modelled as a linear function of
multidimensional feature vector
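A NumPy least-squares sketch of both cases (straight-line fit and a multi-feature linear model) on synthetic data:

```python
# Least-squares fit of a straight line and of a multiple (multi-feature)
# linear model. The data is synthetic.
import numpy as np

gen = np.random.default_rng(0)
x  = gen.uniform(0, 10, 50)
X2 = gen.uniform(0, 10, (50, 2))
y  = 2.0 * x + 1.0 + gen.normal(0, 0.5, 50)             # y ~ w*x + b

# Linear regression: fit y = w*x + b by least squares
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Multiple regression: y modelled as a linear function of a feature vector
y2 = X2 @ np.array([1.5, -0.7]) + 3.0 + gen.normal(0, 0.5, 50)
A2 = np.column_stack([X2, np.ones(len(X2))])
coef, *_ = np.linalg.lstsq(A2, y2, rcond=None)

print(w, b)        # close to 2.0 and 1.0
print(coef)        # close to [1.5, -0.7, 3.0]
```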
log-linear model
approximates discrete multidimensional probability distribution
clustering
Partition the dataset into clusters, and one can store cluster
representation only
Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering
algorithms
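A sketch, assuming scikit-learn's k-means, of using clustering for numerosity reduction by storing only the cluster representations (centroids and counts); k = 10 is an arbitrary choice:

```python
# Numerosity reduction via clustering: keep only the cluster representations
# instead of all points. The data is random and k = 10 is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

gen = np.random.default_rng(0)
X = gen.normal(size=(10_000, 3))                 # original data: 10,000 points

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Reduced representation: one centroid plus a count per cluster
centroids = km.cluster_centers_                  # shape (10, 3)
counts = np.bincount(km.labels_, minlength=10)

print(centroids.shape, counts)                   # 10 rows now stand in for 10,000
```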
hierarchical reduction
Use multi-resolution structure with different degrees of reduction
Hierarchical clustering is often performed but tends to define
partitions of data sets rather than “clusters”
Parametric methods are usually not amenable to hierarchical
representation
Hierarchical aggregation: An index tree hierarchically divides a data set into partitions by value range of some attributes; Each partition can be considered as a bucket; Thus an index tree with aggregates stored at each node is a hierarchical histogram
nominal attributes
values from an unordered set
ordinal attributes
values from an ordered set
continuous attributes
real numbers
discretization
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretisation
Prepare for further analysis
concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).
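A small pandas sketch of climbing such a hierarchy for age; the cut points 30 and 60 are assumptions made only for illustration:

```python
# Replace raw ages by the higher-level concepts young / middle-aged / senior.
import pandas as pd

age = pd.Series([17, 23, 35, 41, 58, 63, 70])
age_level = pd.cut(age,
                   bins=[0, 30, 60, 120],
                   labels=["young", "middle-aged", "senior"])
print(age_level.value_counts())
```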
Discretisation and concept hierarchy generation for numeric data
Binning
Histogram analysis
Clustering analysis
Entropy-based discretisation
Segmentation by natural partitioning
Concept hierarchy generation for categorical data
Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes but not of their partial ordering
Specification of only a partial set of attributes
multidimensional data model
views data in the form of a data cube
The lattice of cuboids forms a data cube
data cube
allows data to be modelled and viewed in multiple dimensions
dimension tables
represent dimensions
fact table
contains keys to each of the related dimension tables and measures
star schema
A fact table in the middle connected to a set of dimension tables
snowflake schema
A refinement of star schema where some dimensional hierarchy is normalised
into a set of smaller dimension tables, forming a shape similar to snowflake
fact constellation
Multiple fact tables share dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
distributive
if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
Examples: count(), sum(), min(), max()
Algebraic
if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
Examples: avg(), min_N(), standard_deviation()
Holistic
if there is no constant bound on the storage size needed to describe a sub-aggregate.
Examples: median(), mode(), rank()
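A small NumPy example of why count()/sum()/min() are distributive, avg() is algebraic, and median() is holistic, using an arbitrary partitioning of made-up data:

```python
# count()/sum()/min() combine from per-partition sub-aggregates; avg() combines
# from two distributive aggregates (sum, count); median() does not.
import numpy as np

data = np.array([3, 7, 8, 12, 15, 21, 22, 30])
part1, part2 = data[:4], data[4:]               # any partitioning of the data

# Distributive: applying the function to sub-aggregates gives the same result
assert part1.sum() + part2.sum() == data.sum()
assert min(part1.min(), part2.min()) == data.min()

# Algebraic: avg() is computable from the distributive aggregates sum and count
avg = (part1.sum() + part2.sum()) / (len(part1) + len(part2))
assert avg == data.mean()

# Holistic: the medians of the partitions do not determine the overall median;
# in general the whole data set is needed.
print(np.median(part1), np.median(part2), np.median(data))
```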
roll up (drill up)
summarize data
by climbing up a hierarchy or by dimension reduction
drill down (roll down)
reverse of roll-up
from higher level summary to lower level summary or detailed data
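A pandas sketch of roll-up and drill-down over a location hierarchy (city → country) on a made-up sales table:

```python
# Roll-up and drill-down on a tiny sales fact table. Table contents are made up.
import pandas as pd

sales = pd.DataFrame({
    "country": ["UK", "UK", "UK", "FR", "FR"],
    "city":    ["London", "London", "Leeds", "Paris", "Lyon"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "amount":  [100, 120, 80, 90, 60],
})

# Detailed level: (city, quarter)
detail = sales.groupby(["city", "quarter"])["amount"].sum()

# Roll-up: climb the location hierarchy from city to country
rollup = sales.groupby(["country", "quarter"])["amount"].sum()

# Roll-up by dimension reduction: drop the quarter dimension entirely
by_country = sales.groupby("country")["amount"].sum()

# Drill-down is the reverse: go back from by_country to rollup to detail
print(by_country, rollup, detail, sep="\n\n")
```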