data analysis
Test models and hypotheses on datasets
data fishing / data dredging
Analysing data without an a-priori hypothesis (1960s)
Data mining
The process of extracting and discovering patterns within large datasets
Information age
Data collection: Automated data collection tools and mature data technologies lead to tremendous amounts of data stored in databases and other information repositories
Petabytes of data are produced every day.
The need for tools and processes to move from the Data Age to the Information Age
Data mining
can be viewed as a result of the natural evolution of information technology
data mining
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from large datasets
what is not data mining
(Deductive) query processing.
Expert systems or small ML/statistical programs
Data mining applications
Sports - Advanced analysis of game statistics (shots blocked, assists, and fouls) to gain a competitive advantage
Science- Astronomy, earth science, meteorology, experimental physics,
Web mining - Surf-Aid: Many companies apply data mining algorithms to Web access logs for market-related pages to discover customer preferences and behaviour, analyse the effectiveness of Web marketing, improve Web site organisation, etc.
Fraud Detection
Market Analysis and Management
Market management
customer profiling
identifying customer requirements
providing summary information
Corporations
Finance planning and asset evaluation: Cash flow analysis and prediction, Contingent claim analysis to evaluate assets, cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
Resource planning: Summarise and compare the resources and spending
Competition: Monitor competitors and market directions, group customers into classes and set a class-based pricing procedure, set pricing strategy in a highly competitive market
data mining
the core of knowledge discovery process
KDD process
Learning the application domain: relevant prior knowledge and goals of the given applications.
Creating a target data set: data selection
Cleaning and pre-processing the data (may take 60% of effort!)
Performing data reduction and transformation: Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining: summarisation, classification, regression, association, clustering
Choosing the mining algorithms
Data mining: search for patterns of interest
Testing & Evaluating the patterns and presenting the knowledge: visualisation, transformation, removing redundant patterns, etc.
Using the discovered knowledge
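A minimal sketch of how these KDD steps might look in Python with pandas and scikit-learn; the file name sales.csv, the column names, and the choice of k-means as the mining function are illustrative assumptions, not part of the process description above.

```python
# Hypothetical end-to-end sketch of the KDD steps above.
# "sales.csv", the column names and k=3 are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1-2. Create the target data set: select relevant attributes
data = pd.read_csv("sales.csv")[["age", "income", "spend"]]

# 3. Cleaning / pre-processing: drop incomplete records
data = data.dropna()

# 4. Reduction / transformation: scale features to comparable ranges
X = StandardScaler().fit_transform(data)

# 5-6. Choose a mining function (clustering) and run it
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# 7. Evaluate / present the discovered patterns
data["cluster"] = model.labels_
print(data.groupby("cluster").mean())
```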
DM Software Tools
Commercial and free (open source) software tools
Commercial: KXEN Modeler, IBM SPSS Modeler, Oracle Data Mining, Angoss KnowledgeSTUDIO, etc.
Free/open source: RapidMiner, Weka, KNIME, SCaViS, Kaggle, Rattle, Tanagra, R, Orange (free version), …
purpose of the analysis
Identify population groups and domains of interest.
Example: Population groups: customers, personnel, etc; Domain of interest: sales, profit, stock, products, etc
audience for the analysis
Agencies, companies, directors, communities, etc
why data pre-processing?
Data in the real world is dirty: incomplete (lacking attribute values, lacking certain attributes of interest, or containing only aggregate data); noisy (containing errors or outliers); inconsistent (containing discrepancies in codes or names)
No quality data, no quality mining results!: Quality decisions must be based on quality data; Data warehouse needs consistent integration of quality data
perfect data
Data is valid, complete, and reliable. No data extrapolation is needed
not perfect data
Data with NO serious flaws, but needs some pre-processing
verbal/inspection data
Data with serious gaps → requires additional documentation and verification prior to its inclusion in the DM process
soft data
Data that relies on the memories of experienced personnel at the participating facility
The most difficult to summarise
example of not perfect data
The data recorded in a dimension which is not important
example of verbal/inspection data
Wrong or non-recorded values of an airplane flight parameter
example of soft data
Memories of an experienced analyst who dealt with the same problem
before
data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
data integration
Integration of multiple databases, data cubes, or files
data transformation
Normalisation and aggregation
data reduction
Obtains reduced representation in volume but produces the same or similar
analytical results
data discretization
Part of data reduction but with particular importance, especially for numerical
data
dataset
Collection of data objects and their attributes
A data object is also called a record, point,
entity, or instance
data object
A collection of attributes describing a data
object
attribute
a property or characteristic of an object
Also called a feature, dimension, variable
Examples: gender, age, income, etc.
categorical (nominal) variable
The value of a categorical variable can take more than 2 states.
Example: Green, Blue, Black, Brown,
binary variable
Its value can take two categories: 0 or 1, True or False
ordinal variable
Possible values that have a meaningful order or ranking among
them, but the magnitude between successive values is not known
Example: Excellent, very good, good, average
interval-scaled variable
The values are measured on a scale of equal-size units.
Example: age, weight (kg), …,
ratio-scaled variable
A value is ratio-scaled if it is a multiple (or ratio) of another value.
Example: exponential scale
discrete vs continuous variable
A discrete attribute has a finite or countably infinite set of values; a continuous attribute takes real numbers as values
descriptive data summarization
Identify typical properties of the data
Highlight which data values should be treated as noise or outliers
descriptive statistics
Understand the distribution of the data
Central Tendency: mean, median, midrange
Data Dispersion: quartiles, inter-quartile range (IQR), variance
arithmetic mean
Effective numerical measure of the centre
Let {x1, x2, …, xN} be a set of observations; the mean is (x1 + x2 + … + xN)/N
Drawbacks
Sensitivity to extreme values (outliers)
Trimmed mean: obtained after removing the extreme values
median
Used for skewed (asymmetric data)
Let {x1, x2, …, xN} be a set of ordered observations; the median is the middle value if N is odd and the average of the two middle values if N is even
mode
Indicates the value that occurs most frequently in the set
range
max(X) - min(X)
midrange
[max(X) + min(X)]/2
dispersion of the data
The degree to which numerical data tend to spread
the most common measures are range, five-number summary, IQR, and standard deviation
standard deviation
Measures spread about the mean
Can only be used when the mean is chosen as the measure of the centre
Let X = {x1, x2, …, xN} with mean x̄; the standard deviation is σ = sqrt( (1/N) Σ (xi − x̄)² )
five-number summary
{min(X), Q1, median, Q3, max(X)}
IQR
Q3-Q1
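A short NumPy example computing the central tendency and dispersion measures above on a made-up sample:

```python
# Central tendency and dispersion measures for a small made-up sample.
import numpy as np

x = np.array([13, 15, 16, 16, 19, 20, 21, 22, 25, 30, 35, 52])

mean       = x.mean()
median     = np.median(x)
mode       = np.bincount(x).argmax()   # works here because the values are small non-negative integers
data_range = x.max() - x.min()
midrange   = (x.max() + x.min()) / 2
q1, q3     = np.percentile(x, [25, 75])
iqr        = q3 - q1
std        = x.std()                   # population standard deviation, sqrt((1/N) * sum((xi - mean)^2))

print("five-number summary:", x.min(), q1, median, q3, x.max())
print(f"mean={mean:.2f} mode={mode} midrange={midrange} range={data_range} IQR={iqr} std={std:.2f}")
```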
histograms
Frequency histograms
A graphical method for summarising the distribution of a given attribute
quantile plot
Simple way to have a 1st look at a univariate data distribution
Allows us to compare different distributions based on their quantiles
Quantile-Quantile Plot (q-q plot)
Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Is a powerful visualisation tool
scatter plot
Each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane
Most effective graphical methods for determining if there appears to be relationship, patterns, or trends between two numerical attributes
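A sketch of these plot types with matplotlib on synthetic data; the attribute names in the comments are only illustrative:

```python
# Frequency histogram, quantile plot, and scatter plot for two numeric attributes.
# The data here is synthetic, generated only to illustrate the plots.
import numpy as np
import matplotlib.pyplot as plt

gen = np.random.default_rng(0)
x = gen.normal(50, 10, 200)          # e.g. unit price
y = 3 * x + gen.normal(0, 20, 200)   # e.g. items sold, loosely related to x

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(x, bins=10)                        # frequency histogram
axes[0].set_title("histogram")

f = (np.arange(len(x)) + 0.5) / len(x)          # f-values for the quantile plot
axes[1].plot(f, np.sort(x), marker=".")
axes[1].set_title("quantile plot")

axes[2].scatter(x, y, s=10)                     # scatter plot of the two attributes
axes[2].set_title("scatter plot")

plt.tight_layout()
plt.show()
```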
Missing values may be due to:
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to a misunderstanding
Certain data may not be considered important at the time of entry
History or changes of the data were not registered
how to handle missing values
Ignore the tuple: not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
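A small pandas sketch of three of these fill-in strategies; the column names ("income", "class") and values are hypothetical:

```python
# Three of the fill-in strategies above, using pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

# Global constant
filled_const = df["income"].fillna(-1)

# Attribute (column) mean
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean of the samples in the same class (smarter)
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_class_mean)
```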
noise
random error or variance in a measured variable
Incorrect attribute values may be due to
Faulty data collection instruments
Data entry problems
Data transmission problems
Technology limitation
Inconsistency in naming conventions
how to handle noisy data
Binning method: first sort data and partition into bins (buckets); then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Clustering: detect and remove outliers
Combined computer and human inspection: detect suspicious values and check by human
Regression: smooth by fitting the data into regression function
equal-width (distance) partitioning
It divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
The most straightforward
But outliers may dominate presentation
Skewed data is not handled well
equal-depth (frequency) partitioning
It divides the range into N intervals, each containing approximately the same number of objects
Good data scaling
Managing categorical attributes can be tricky
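A sketch of both partitioning schemes, plus smoothing by bin means, using pandas.cut and pandas.qcut on a made-up price attribute:

```python
# Equal-width vs. equal-depth partitioning, and smoothing by bin means.
import pandas as pd

price = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width: each bin spans (max - min) / N units
equal_width = pd.cut(price, bins=N)

# Equal-depth: each bin holds roughly the same number of values
equal_depth = pd.qcut(price, q=N)

# Smoothing by bin means: replace each value with the mean of its (equal-depth) bin
smoothed = price.groupby(equal_depth, observed=True).transform("mean")
print(pd.DataFrame({"price": price, "bin": equal_depth, "smoothed": smoothed}))
```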
data integration
combines data from multiple sources into a coherent store
schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data
sources, e.g., A.cust-id =B.cust-#
smoothing
remove noise from data
aggregation
summarisation, data cube construction
generalization
concept hierarchy climbing
normalization
scaled to fall within a small, specified range
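A minimal NumPy sketch of min-max normalisation (and, for comparison, z-score normalisation) on illustrative income values:

```python
# Min-max normalisation to [0, 1] and z-score normalisation of a numeric attribute.
import numpy as np

income = np.array([12_000, 28_000, 35_000, 54_000, 98_000], dtype=float)

# Min-max: scale to fall within the specified range [new_min, new_max]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) \
         * (new_max - new_min) + new_min

# Z-score: centre on the mean, scale by the standard deviation
zscore = (income - income.mean()) / income.std()

print(minmax)
print(zscore)
```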
attribute/feature construction
New attributes constructed from the given one
data reduction
Obtains a reduced representation of the dataset that is much smaller in volume but yet produces the same (or almost the same) analytical results
data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretisation and concept hierarchy generation
feature selection
Select a minimum set of features such that the probability distribution of
different classes given the values for those features is as close as possible to
the original distribution given the values of all features
reduce the number of patterns ==> easier to understand
heuristic methods
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
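A hedged sketch of step-wise forward selection from the list above; the scorer (cross-validated accuracy of a decision tree on the Iris data) is one possible choice of evaluation function, not one prescribed by the material:

```python
# Sketch of step-wise forward selection: start with an empty set and greedily
# add the feature that most improves a score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def score(feature_idx):
    """Cross-validated accuracy using only the selected feature columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, feature_idx], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best_so_far = -np.inf
while remaining:
    candidate = max(remaining, key=lambda f: score(selected + [f]))
    candidate_score = score(selected + [candidate])
    if candidate_score <= best_so_far:      # stop when no feature improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_so_far = candidate_score

print("selected feature indices:", selected, "score:", round(best_so_far, 3))
```

Step-wise backward elimination works the same way in reverse: start from all features and greedily drop the one whose removal hurts the score least.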
string compression
There are extensive theories and well-tuned algorithms
typically lossless
But only limited manipulation is possible without expansion
audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed without reconstructing
the whole
principal component analysis (PCA)
Given M data vectors in n dimensions, find k ≤ n orthogonal vectors that can best be used to represent the data: the original dataset is reduced to one consisting of M data vectors on the k principal components (reduced dimensions)
Each data vector is a linear combination of the k principal component vectors
Works for numerical data only
Used when the number of dimensions is large
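A minimal NumPy sketch of PCA via the singular value decomposition, on random data:

```python
# PCA via the singular value decomposition: project M data vectors in n
# dimensions onto k <= n orthogonal directions. The data here is random.
import numpy as np

gen = np.random.default_rng(0)
X = gen.normal(size=(200, 5))          # M = 200 vectors, n = 5 dimensions
k = 2                                  # number of principal components to keep

Xc = X - X.mean(axis=0)                # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                    # k orthogonal principal component vectors

X_reduced = Xc @ components.T          # 200 x k: each row is a linear combination
X_approx  = X_reduced @ components + X.mean(axis=0)   # reconstruction from k PCs

print(X_reduced.shape)                 # (200, 2)
```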
parametric methods
Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces
non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
linear regression
Data are modelled to fit a straight line
Often uses the least-square method to fit the line
multiple regression
allows a response variable Y to be modelled as a linear function of
multidimensional feature vector
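A NumPy least-squares sketch of both cases (straight-line fit and a multi-feature linear model) on synthetic data:

```python
# Least-squares fit of a straight line and of a multiple (multi-feature)
# linear model. The data is synthetic.
import numpy as np

gen = np.random.default_rng(0)
x  = gen.uniform(0, 10, 50)
X2 = gen.uniform(0, 10, (50, 2))
y  = 2.0 * x + 1.0 + gen.normal(0, 0.5, 50)             # y ~ w*x + b

# Linear regression: fit y = w*x + b by least squares
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Multiple regression: y modelled as a linear function of a feature vector
y2 = X2 @ np.array([1.5, -0.7]) + 3.0 + gen.normal(0, 0.5, 50)
A2 = np.column_stack([X2, np.ones(len(X2))])
coef, *_ = np.linalg.lstsq(A2, y2, rcond=None)

print(w, b)        # close to 2.0 and 1.0
print(coef)        # close to [1.5, -0.7, 3.0]
```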
log-linear model
approximates discrete multidimensional probability distribution
clustering
Partition the dataset into clusters, and one can store cluster
representation only
Can have hierarchical clustering and be stored in multi-dimensional
index tree structures
There are many choices of clustering definitions and clustering
algorithms
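A sketch, assuming scikit-learn's k-means, of using clustering for numerosity reduction by storing only the cluster representations (centroids and counts); k = 10 is an arbitrary choice:

```python
# Numerosity reduction via clustering: keep only the cluster representations
# instead of all points. The data is random and k = 10 is arbitrary.
import numpy as np
from sklearn.cluster import KMeans

gen = np.random.default_rng(0)
X = gen.normal(size=(10_000, 3))                 # original data: 10,000 points

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Reduced representation: one centroid plus a count per cluster
centroids = km.cluster_centers_                  # shape (10, 3)
counts = np.bincount(km.labels_, minlength=10)

print(centroids.shape, counts)                   # 10 rows now stand in for 10,000
```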
hierarchical reduction
Use multi-resolution structure with different degrees of reduction
Hierarchical clustering is often performed but tends to define
partitions of data sets rather than “clusters”
Parametric methods are usually not amenable to hierarchical
representation
Hierarchical aggregation: An index tree hierarchically divides a data set into partitions by value range of some attributes; Each partition can be considered as a bucket; Thus an index tree with aggregates stored at each node is a hierarchical histogram
nominal attributes
values from an unordered set
ordinal attributes
values from an ordered set
continuous attributes
real numbers
discretization
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretisation
Prepare for further analysis
concept hierarchies
reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).
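A small pandas sketch of climbing such a hierarchy for age; the cut points 30 and 60 are assumptions made only for illustration:

```python
# Replace raw ages by the higher-level concepts young / middle-aged / senior.
import pandas as pd

age = pd.Series([17, 23, 35, 41, 58, 63, 70])
age_level = pd.cut(age,
                   bins=[0, 30, 60, 120],
                   labels=["young", "middle-aged", "senior"])
print(age_level.value_counts())
```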
Discretisation and concept hierarchy generation for numeric data
Binning
Histogram analysis
Clustering analysis
Entropy-based discretisation
Segmentation by natural partitioning
Concept hierarchy generation for categorical data
Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
Specification of a portion of a hierarchy by explicit data grouping
Specification of a set of attributes but not of their partial ordering
Specification of only a partial set of attributes
multidimensional data model
views data in the form of a data cube
The lattice of cuboids forms a data cube
data cube
allows data to be modelled and viewed in multiple dimensions
dimension tables
represent dimensions
fact table
contains keys to each of the related dimension tables and measures
star schema
A fact table in the middle connected to a set of dimension tables
snowflake schema
A refinement of star schema where some dimensional hierarchy is normalised
into a set of smaller dimension tables, forming a shape similar to snowflake
fact constellation
Multiple fact tables share dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
distributive
if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
Examples: count(), sum(), min(), max()
Algebraic
if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
Examples: avg(), min_N(), standard_deviation()
Holistic
if there is no constant bound on the storage size needed to describe a sub-aggregate.
Examples: median(), mode(), rank()
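A small NumPy example of why count()/sum()/min() are distributive, avg() is algebraic, and median() is holistic, using an arbitrary partitioning of made-up data:

```python
# count()/sum()/min() combine from per-partition sub-aggregates; avg() combines
# from two distributive aggregates (sum, count); median() does not.
import numpy as np

data = np.array([3, 7, 8, 12, 15, 21, 22, 30])
part1, part2 = data[:4], data[4:]               # any partitioning of the data

# Distributive: applying the function to sub-aggregates gives the same result
assert part1.sum() + part2.sum() == data.sum()
assert min(part1.min(), part2.min()) == data.min()

# Algebraic: avg() is computable from the distributive aggregates sum and count
avg = (part1.sum() + part2.sum()) / (len(part1) + len(part2))
assert avg == data.mean()

# Holistic: the medians of the partitions do not determine the overall median;
# in general the whole data set is needed.
print(np.median(part1), np.median(part2), np.median(data))
```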
roll up (drill up)
summarize data
by climbing up a hierarchy or by dimension reduction
drill down (roll down)
reverse of roll-up
from higher level summary to lower level summary or detailed data
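A pandas sketch of roll-up and drill-down over a location hierarchy (city → country) on a made-up sales table:

```python
# Roll-up and drill-down on a tiny sales fact table. Table contents are made up.
import pandas as pd

sales = pd.DataFrame({
    "country": ["UK", "UK", "UK", "FR", "FR"],
    "city":    ["London", "London", "Leeds", "Paris", "Lyon"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "amount":  [100, 120, 80, 90, 60],
})

# Detailed level: (city, quarter)
detail = sales.groupby(["city", "quarter"])["amount"].sum()

# Roll-up: climb the location hierarchy from city to country
rollup = sales.groupby(["country", "quarter"])["amount"].sum()

# Roll-up by dimension reduction: drop the quarter dimension entirely
by_country = sales.groupby("country")["amount"].sum()

# Drill-down is the reverse: go back from by_country to rollup to detail
print(by_country, rollup, detail, sep="\n\n")
```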