Data Mining CA1

1

data analysis

Test models and hypotheses on datasets

2

data fishing / data dredging

Analysing data without an a-priori hypothesis (a term coined in the 1960s)

3

Data mining

The process of extracting and discovering patterns within large datasets

4

Information age

  • Data collection: Automated data collection tools and mature data technologies lead to tremendous amounts of data stored in databases and other information repositories

  • Petabytes of data are produced every day.

  • The need for tools and processes to move from the Data Age to the Information Age

5

Data mining

can be viewed as a result of the natural evolution of information technology

6

data mining

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large datasets

7

what is not data mining

  • (Deductive) query processing.

  • Expert systems or small ML/statistical programs

8

Data mining applications

  • Sports - Advanced analysis of game statistics (shots blocked, assists, and fouls) to gain a competitive advantage

  • Science - Astronomy, earth science, meteorology, experimental physics, etc.

  • Web mining - Surf-Aid: Many companies apply data mining algorithms to Web access logs for market-related pages to discover customer preferences and behaviour, analyse the effectiveness of Web marketing, improve Web site organisation, etc.

  • Fraud Detection

  • Market Analysis and Management

9

Market management

  • customer profiling

  • identifying customer requirements

  • providing summary information

10

Corporations

  • Financial planning and asset evaluation: Cash flow analysis and prediction, contingent claim analysis to evaluate assets, cross-sectional and time series analysis (financial ratios, trend analysis, etc.)

  • Resource planning: Summarise and compare the resources and spending

  • Competition: Monitor competitors and market directions, group customers into classes and set a class-based pricing procedure, set pricing strategy in a highly competitive market

11

data mining

the core of the knowledge discovery process

12

KDD process

  • Learning the application domain: relevant prior knowledge and goals of the given applications.

  • Creating a target data set: data selection

  • Cleaning and pre-processing the data (may take 60% of effort!)

  • Performing data reduction and transformation: Find useful features, dimensionality/variable reduction, invariant representation

  • Choosing functions of data mining: summarisation, classification, regression, association, clustering

  • Choosing the mining algorithms

  • Data mining: search for patterns of interest

  • Testing & Evaluating the patterns and presenting the knowledge: visualisation, transformation, removing redundant patterns, etc.

  • Using the discovered knowledge

13

DM Software Tools

  • Commercial and free (open source) software tools

  • KXEN Modeler, IBM SPSS Modeler, Oracle Data Mining, Angoss KnowledgeSTUDIO, etc.

  • RapidMiner, Weka, KNIME, SCaViS, Kaggle, Rattle, Tanagra, R, Orange (free version), …

14

purpose of the analysis

  • Identify population groups and domains of interest.

  • Example: Population groups: customers, personnel, etc; Domain of interest: sales, profit, stock, products, etc

15

audience for the analysis

Agencies, companies, directors, communities, etc

16

why data pre-processing?

  • Data in the real world is dirty: incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data); noisy (containing errors or outliers); inconsistent (containing discrepancies in codes or names)

  • No quality data, no quality mining results! Quality decisions must be based on quality data; a data warehouse needs consistent integration of quality data

17

perfect data

Data is valid, complete, and reliable. No data extrapolation is needed

18

not perfect data

Data with no serious flaws, but which needs some pre-processing

19

verbal/inspection data

Data with serious gaps → requires additional documentation and verification prior to its inclusion in the DM process

20

soft data

  • Data that relies on the memories of experienced personnel at the participating facility

  • The most difficult to summarise

21

example of not perfect data

The data recorded in a dimension which is not important

22

example of verbal/inspection data

Wrong or non-recorded values of an airplane flight parameter

23

example of soft data

Memories of an experienced analyst who dealt with the same problem before

24

data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

25

data integration

Integration of multiple databases, data cubes, or files

26

data transformation

Normalisation and aggregation

27

data reduction

Obtains reduced representation in volume but produces the same or similar analytical results

28

data discretization

Part of data reduction but with particular importance, especially for numerical data

29

dataset

  • Collection of data objects and their attributes

  • A data object is also called a record, point, entity, or instance

30

data object

An entity described by a collection of attributes

31

attribute

  • a property or characteristic of an object

  • Also called a feature, dimension, variable

  • Examples: gender, age, income, etc.

32

categorical (nominal) variable

  • The value of a categorical variable can take more than 2 states.

  • Example: Green, Blue, Black, Brown, …

33

binary variable

Its value can take two categories: 0 or 1, True or False

34

ordinal variable

  • Possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known

  • Example: Excellent, very good, good, average

35

interval-scaled variable

  • The values are measured on a scale of equal-size units.

  • Example: age, weight (kg), …

36

ratio-scaled variable

  • A value is ratio-scaled if it is a multiple (or ratio) of another value.

  • Example: exponential scale

37

discrete vs continuous variable

A discrete attribute has a finite or countably infinite set of values; a continuous attribute takes real-number values

38

descriptive data summarization

  • Identify typical properties of the data

  • Highlight which data values should be treated as noise or outliers

39

descriptive statistics

  • Understand the distribution of the data

  • Central Tendency: mean, median, midrange

  • Data Dispersion: quartiles, inter-quartile range (IQR), variance

40

arithmetic mean

  • Effective numerical measure of the centre

  • Let x1, x2, …, xN be a set of observations; the mean is x̄ = (x1 + x2 + … + xN)/N

41

Drawbacks of the mean

  • Sensitivity to extreme values (outliers)

  • Trimmed mean: obtained after removing the extreme values

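To make the last two cards concrete, here is a minimal Python sketch (sample values invented) of the arithmetic mean and a trimmed mean:

```python
def mean(xs):
    # Arithmetic mean: sum of the observations divided by their count
    return sum(xs) / len(xs)

def trimmed_mean(xs, k=1):
    # Trimmed mean: drop the k smallest and k largest values
    # (the extremes) before averaging, reducing outlier sensitivity
    xs = sorted(xs)[k:len(xs) - k]
    return sum(xs) / len(xs)

data = [3, 4, 5, 6, 100]        # 100 is an outlier
print(mean(data))               # 23.6 -- pulled up by the outlier
print(trimmed_mean(data))       # 5.0  -- robust to the extremes
```
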
42

median

  • Used for skewed (asymmetric) data

  • Let {x1, x2, …, xN} be a set of ordered observations

  • The median is the middle value if N is odd, and the average of the two middle values if N is even

43

mode

Indicates the value that occurs most frequently in the set

44

range

max(X) - min(X)

45

midrange

[max(X) + min(X)]/2

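The centre and spread measures from the last few cards in one short sketch, using Python's statistics module (sample values invented):

```python
from statistics import median, mode

X = [52, 56, 60, 63, 70, 70, 110]    # already ordered, N = 7 (odd)

print(median(X))                # 63: the middle value
print(mode(X))                  # 70: occurs most frequently
print(max(X) - min(X))          # range: 110 - 52 = 58
print((max(X) + min(X)) / 2)    # midrange: (110 + 52) / 2 = 81.0
```
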
46

dispersion of the data

  • The degree to which numerical data tend to spread

  • the most common measures are range, five-number summary, IQR, and standard deviation

47

standard deviation

  • Measures spread about the mean

  • Can only be used when the mean is chosen as the measure of the centre

  • Let X = {x1, x2, …, xN}; then σ = sqrt( (1/N) · Σ (xi − x̄)² )

48

five-number summary

{min(X), Q1, median, Q3, max(X)}

49

IQR

Q3-Q1

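A NumPy sketch of the dispersion measures above. Quartile conventions differ slightly between tools (this uses NumPy's default linear interpolation), and the data is invented:

```python
import numpy as np

X = np.array([47, 49, 52, 54, 56, 60, 63, 66, 70, 75, 110])

q1, med, q3 = np.percentile(X, [25, 50, 75])
print(X.min(), q1, med, q3, X.max())   # five-number summary
print(q3 - q1)                         # IQR = Q3 - Q1
print(X.std())                         # standard deviation about the mean
```
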
50

histograms

  • Frequency histograms

  • A graphical method for summarising the distribution of a given attribute

51

quantile plot

  • Simple way to have a 1st look at a univariate data distribution

  • Allows us to compare different distributions based on their quantiles

52

Quantile-Quantile Plot (q-q plot)

  • Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

  • Is a powerful visualisation tool

53

scatter plot

  • Each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane

  • One of the most effective graphical methods for determining if there appears to be a relationship, pattern, or trend between two numerical attributes

54

Missing values may be due to

  • equipment malfunction

  • inconsistent with other recorded data and thus deleted

  • data not entered due to a misunderstanding

  • Certain data may not be considered important at the time of entry

  • No registered history or changes in the data

55

how to handle missing values

  • Ignore the tuple: not effective when the percentage of missing values per attribute varies considerably

  • Fill in the missing value manually: tedious + infeasible?

  • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!

  • Use the attribute mean to fill in the missing value

  • Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter

  • Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree

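Three of the strategies above sketched in pandas (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 50.0, 52.0, np.nan],
    "city":   ["Cork", None, "Dublin", "Galway", None],
})

# Global constant for a missing categorical value
df["city"] = df["city"].fillna("unknown")

# Attribute mean over the whole column
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: attribute mean computed per class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(df)
```
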
56

noise

random error or variance in a measured variable

57

Incorrect attribute values may be due to

  • Faulty data collection instruments

  • Data entry problems

  • Data transmission problems

  • Technology limitation

  • inconsistency in naming convention

58

how to handle noisy data

  • Binning method: first sort data and partition into bins (buckets); then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

  • Clustering: detect and remove outliers

  • Combined computer and human inspection: detect suspicious values and check by human

  • Regression: smooth by fitting the data into regression function

59

equal-width (distance) partitioning

  • It divides the range into N intervals of equal size: uniform grid

  • If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N

  • The most straightforward

  • But outliers may dominate presentation

  • Skewed data is not handled well

60

equal-depth (frequency) partitioning

  • It divides the range into N intervals, each containing approximately the same number of objects

  • Good data scaling

  • Managing categorical attributes can be tricky

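A sketch contrasting the two partitioning schemes, plus smoothing by bin means from the binning method above (toy prices, purely illustrative):

```python
def equal_width_bins(xs, n):
    # Interval width W = (B - A) / N for lowest value A, highest B
    a, b = min(xs), max(xs)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for x in sorted(xs):
        i = min(int((x - a) / w), n - 1)   # clamp max value into last bin
        bins[i].append(x)
    return bins

def equal_depth_bins(xs, n):
    # Each bin holds approximately the same number of objects
    xs, k = sorted(xs), len(xs) // n
    return [xs[i * k:] if i == n - 1 else xs[i * k:(i + 1) * k]
            for i in range(n)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equal_depth_bins(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: replace every value by its bin's mean
print([[sum(b) / len(b)] * len(b) for b in equal_depth_bins(prices, 3)])
# [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```
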
61

data integration

combines data from multiple sources into a coherent store

62

schema integration

  • integrate metadata from different sources

  • Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id = B.cust-#

63

smoothing

remove noise from data

64

aggregation

summarisation, data cube construction

65

generalization

concept hierarchy climbing

66

normalization

scaled to fall within a small, specified range

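Normalisation sketched as min-max scaling to [0, 1], one common way of meeting the definition above (values invented):

```python
def min_max(xs, new_min=0.0, new_max=1.0):
    # x' = (x - min) / (max - min) * (new_max - new_min) + new_min
    a, b = min(xs), max(xs)
    return [(x - a) / (b - a) * (new_max - new_min) + new_min for x in xs]

print(min_max([12000, 73600, 98000]))   # [0.0, 0.716..., 1.0]
```
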
67

attribute/feature construction

New attributes constructed from the given ones

68

data reduction

Obtains a reduced representation of the dataset that is much smaller in volume but yet produces the same (or almost the same) analytical results

69

data reduction strategies

  • Data cube aggregation

  • Dimensionality reduction

  • Numerosity reduction

  • Discretisation and concept hierarchy generation

70

feature selection

  • Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features

  • reduce the number of patterns ==> easier to understand

71

heuristic methods

  • step-wise forward selection

  • step-wise backward elimination

  • combining forward selection and backward elimination

  • decision-tree induction 

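A minimal sketch of the first heuristic, step-wise forward selection. The score function is a stand-in for whatever evaluation the mining task uses (e.g. cross-validated accuracy); the toy scorer below is invented:

```python
def forward_selection(all_features, score, max_features=None):
    # Start empty; repeatedly add the single best remaining feature
    selected, best = [], float("-inf")
    while len(selected) < len(all_features) and len(selected) != max_features:
        candidates = [(score(selected + [f]), f)
                      for f in all_features if f not in selected]
        top_score, top_feature = max(candidates)
        if top_score <= best:   # stop: no remaining feature improves the score
            break
        best = top_score
        selected.append(top_feature)
    return selected

def toy_score(fs):
    # Reward informative names, penalise set size (illustration only)
    return sum(1 for f in fs if not f.startswith("noise")) - 0.1 * len(fs)

print(forward_selection(["age", "income", "noise1", "noise2"], toy_score))
# ['income', 'age']
```
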
72

string compression

  • There are extensive theories and well-tuned algorithms

  • typically lossless

  • But only limited manipulation is possible without expansion

73

audio/video compression

  • Typically lossy compression, with progressive refinement

  • Sometimes small fragments of signal can be reconstructed without reconstructing the whole

74

principal component analysis (PCA)

  • Given M data vectors from n dimensions, find k ≤ n orthogonal vectors that can best be used to represent the data: the original data set is reduced to one consisting of M data vectors on k principal components (reduced dimensions)

  • Each data vector is a linear combination of the k principal component vectors

  • Works for numerical data only

  • Used when the number of dimensions is large

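A PCA sketch using NumPy's SVD on mean-centred data, keeping k ≤ n components (random data; k = 1 chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # M = 100 data vectors, n = 3 dimensions

Xc = X - X.mean(axis=0)          # centre each attribute on its mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                            # keep k <= n orthogonal component vectors
Z = Xc @ Vt[:k].T                # each row: weights of the k components
print(Z.shape)                   # (100, 1): the reduced representation
```
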
75

parametric methods

  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)

  • Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces

76

non-parametric methods

  • Do not assume models

  • Major families: histograms, clustering, sampling

77

linear regression

  • Data are modelled to fit a straight line

  • Often uses the least-square method to fit the line

78

multiple regression

allows a response variable Y to be modelled as a linear function of a multidimensional feature vector

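Least-squares line fitting and its multidimensional form sketched with NumPy (synthetic data):

```python
import numpy as np

# Linear regression: fit y = w*x + b by least squares
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
w, b = np.polyfit(x, y, deg=1)
print(w, b)                      # slope near 2, intercept near 0

# Multiple regression: Y as a linear function of a feature vector;
# a column of ones lets the intercept be estimated as well
X = np.column_stack([x, x ** 2, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                      # weights for x, x^2, and the intercept
```
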
79

log-linear model

approximates discrete multidimensional probability distributions

80

clustering

  • Partition the dataset into clusters, and one can store the cluster representation only

  • Can have hierarchical clustering and be stored in multi-dimensional index tree structures

  • There are many choices of clustering definitions and clustering algorithms

81

hierarchical reduction

  • Use multi-resolution structure with different degrees of reduction

  • Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters”

  • Parametric methods are usually not amenable to hierarchical representation

  • Hierarchical aggregation: An index tree hierarchically divides a data set into partitions by value range of some attributes; each partition can be considered as a bucket; thus an index tree with aggregates stored at each node is a hierarchical histogram

82

nominal attributes

values from an unordered set

83

ordinal attributes

values from an ordered set

84

continuous attributes

real numbers

85

discretization

  • divide the range of a continuous attribute into intervals

  • Some classification algorithms only accept categorical attributes

  • Reduce data size by discretisation

  • Prepare for further analysis

86

concept hierarchies

reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior)

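The age example above as a sketch, replacing low-level numeric values with higher-level concepts; the cut-off points are assumptions, not from the source:

```python
def age_concept(age):
    # Climb the concept hierarchy: numeric age -> categorical label
    if age < 35:                 # assumed cut-offs, purely illustrative
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

ages = [22, 34, 35, 47, 61, 78]
print([age_concept(a) for a in ages])
# ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```
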
87

Discretisation and concept hierarchy generation for numeric data

  • Binning

  • Histogram analysis

  • Clustering analysis

  • Entropy-based discretisation

  • Segmentation by natural partitioning

88

Concept hierarchy generation for categorical data

  • Specification of a partial ordering of attributes explicitly at the schema level by users or experts

  • Specification of a portion of a hierarchy by explicit data grouping

  • Specification of a set of attributes but not of their partial ordering

  • Specification of only a partial set of attributes

89

multidimensional data model

  • views data in the form of a data cube

  • The lattice of cuboids forms a data cube

90

data cube

allows data to be modelled and viewed in multiple dimensions

91

dimension tables

represent dimensions

92

fact table

contains keys to each of the related dimension tables and measures

93

star schema

A fact table in the middle connected to a set of dimension tables

94

snowflake schema

A refinement of the star schema where some dimensional hierarchy is normalised into a set of smaller dimension tables, forming a shape similar to a snowflake

95

fact constellation

Multiple fact tables share dimension tables; viewed as a collection of stars, hence also called a galaxy schema or fact constellation

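A star-schema sketch in pandas: a fact table holding foreign keys and a measure, joined to two hypothetical dimension tables (all names and rows invented):

```python
import pandas as pd

# Dimension tables: one row per dimension member
dim_time = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
dim_item = pd.DataFrame({"item_key": [10, 11], "brand": ["Acme", "Zeta"]})

# Fact table: a key into each dimension table plus the measure
fact_sales = pd.DataFrame({
    "time_key":   [1, 1, 2],
    "item_key":   [10, 11, 10],
    "units_sold": [5, 3, 7],
})

# Resolving the keys reconstructs the multidimensional view
cube = (fact_sales
        .merge(dim_time, on="time_key")
        .merge(dim_item, on="item_key"))
print(cube[["quarter", "brand", "units_sold"]])
```
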
96

distributive

  • if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

  • Examples: count(), sum(), min(), max()

97

Algebraic

  • if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function

  • Examples: avg(), min_N(), standard_deviation()

98

Holistic

  • if there is no constant bound on the storage size needed to describe a sub-aggregate.

  • Examples: median(), mode(), rank()

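A quick numerical illustration of the three categories: sum() is distributive, avg() is algebraic (derivable from the distributive pair sum and count), and median() is holistic. Data invented:

```python
part1, part2 = [2, 4, 9], [1, 8]      # two partitions of the data
whole = part1 + part2

# Distributive: partition-wise sums combine into the overall sum
assert sum([sum(part1), sum(part2)]) == sum(whole)

# Algebraic: avg() follows from M = 2 distributive values, sum and count
print((sum(part1) + sum(part2)) / (len(part1) + len(part2)))   # 4.8

# Holistic: the partition medians (4 and 4.5) do not determine the
# overall median (4) -- no constant-size sub-aggregate suffices
```
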
99

roll up (drill up)

  • summarize data

  • by climbing up a hierarchy or by dimension reduction

100

drill down (roll down)

  • reverse of roll-up

  • from higher-level summary to lower-level summary or detailed data
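Roll-up and drill-down sketched with pandas group-bys over a tiny invented sales table: dropping the city level rolls up the location hierarchy, while adding month drills down to more detail:

```python
import pandas as pd

sales = pd.DataFrame({
    "country": ["IE", "IE", "IE", "FR"],
    "city":    ["Cork", "Cork", "Dublin", "Paris"],
    "month":   ["Jan", "Feb", "Jan", "Jan"],
    "amount":  [100, 150, 200, 300],
})

# Roll-up: climb the hierarchy (city -> country), summarising as we go
print(sales.groupby("country")["amount"].sum())

# Drill-down: the reverse -- to a lower-level, more detailed summary
print(sales.groupby(["country", "city", "month"])["amount"].sum())
```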