Four V’s of Big Data
Volume
Variety
Velocity
Veracity
What is data mining?
The science of discovering interesting knowledge automatically from huge data repositories
Basic elements of data mining
data: very large, huge
mining techniques: automatic, semi-automatic, non-trivial, efficient
new knowledge: implicit, previously unknown, potentially useful, w/ meaningful and unexpected patterns
why is data mining so important
it helps in automated decision making
it helps in cost reduction (use and allocate resources more efficiently)
it helps in precise prediction and forecasting
it improves overall customer experience
DM origins
draws ideas from machine learning/AI, pattern recognition, statistics and database systems
built for data that are:
large scale
high dimensional
heterogeneous - complex
distributed
data mining
techniques for analyzing existing data to find patterns and extract insights
artificial intelligence
creating machines that perform intelligent tasks autonomously
machine learning
enables a computer to learn and improve on its own, based on experience
data science
end-to-end solution for collecting, processing, analyzing and interpreting vast amounts of data to solve complex problems
data mining process
data → data preprocessing → data mining → data post-processing → information
data
collect
describe
explore
verify
data preprocessing
feature selection
cleaning
dimensionality reduction
construct
integrate data
normalization
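A minimal sketch of two of the preprocessing steps listed above, cleaning and normalization, assuming missing values are marked with None and that min-max scaling is the normalization wanted. The helper names drop_missing and min_max_normalize and the sample data are hypothetical, chosen only for illustration.

```python
def drop_missing(rows):
    """Cleaning: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]


def min_max_normalize(rows):
    """Normalization: rescale each column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            # A constant column has no spread, so map it to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Hypothetical records: two numeric attributes, one record with a gap.
raw = [[1.0, 200.0], [2.0, None], [3.0, 400.0], [5.0, 300.0]]
clean = drop_missing(raw)
scaled = min_max_normalize(clean)
print(scaled)
```

Dropping incomplete records is only one cleaning option; filling in the gaps is another, and which is better depends on the data.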
data mining
select model techniques
generate test design
build model
assess model
data post-processing
filter patterns
visualization
pattern interpretation
pre & post processing objectives
Pre
convert the data into the right format for subsequent analysis by selecting the appropriate data segments and extracting attributes that are relevant to the data mining task
Post
make the data mining results more accessible and easier for analysts to interpret, e.g. remove uninteresting patterns, or apply visualization techniques to explore and interact with the data mining results
data mining tasks - predictive tasks
use some variables to predict unknown or future values of other variables
e.g. predict which users will buy a specific product
data mining tasks - descriptive tasks
find human-interpretable patterns that describe the data
e.g. find the set of documents that share similar topics
predictive modeling
use some variables to predict unknown or future values of other variables
types of variables
explanatory - define the properties of data
target - whose value is to be predicted by the DM task
subcategories:
classification - predict the values of discrete target variables
regression - predict the values of continuous target variables
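The difference between the two subcategories can be sketched with two toy predictors. This is an illustration under assumptions, not a method the notes prescribe: nearest-neighbor lookup stands in for classification and a least-squares line stands in for regression, and all function names and data are hypothetical.

```python
def nearest_neighbor_label(train, query):
    """Classification: return the discrete label of the closest training point."""
    closest = min(train, key=lambda pair: abs(pair[0] - query))
    return closest[1]


def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Discrete target ("yes"/"no"): explanatory value -> class label.
labeled = [(1.0, "no"), (2.0, "no"), (8.0, "yes"), (9.0, "yes")]
predicted_class = nearest_neighbor_label(labeled, 7.5)

# Continuous target: explanatory value -> numeric value.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
predicted_value = slope * 5.0 + intercept

print(predicted_class, predicted_value)
```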
association analysis
find hidden associations in transaction data
produces a set of dependence rules that predict the occurrences of other variables
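One common way to score a dependence rule such as {diapers} -> {beer} is by its support and confidence. The sketch below assumes those two standard measures; the notes do not name a specific scoring method, and the basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matches = sum(1 for t in transactions if itemset <= set(t))
    return matches / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule is usually kept only if both numbers clear chosen thresholds, which ties in with filtering out uninteresting patterns during post-processing.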
clustering
group similar data together
homogeneous groupings of data points; data points belonging in one cluster are more similar to each other than to data points from a separate cluster
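As an illustration of grouping similar points, here is a minimal one-dimensional k-means sketch. K-means is one of many clustering techniques and is an assumption here, since the notes do not name an algorithm; the starting centers, the fixed iteration count, and the data are also hypothetical.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it in place if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

Points inside each resulting group are closer to one another than to points in the other group, which matches the definition of a cluster given above.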
Applications
understand the data
e.g. land segmentation according to vegetation cover
summarize the data to reduce their size
anomaly detection
find outliers, data points that do not fit and are significantly different than the rest of the data
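A simple way to flag points that are significantly different from the rest is a z-score test. This is a sketch under assumptions: the threshold of 2.0 standard deviations, the function name, and the sample readings are hypothetical, and z-scores are only one of several outlier detection approaches.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # All values are identical, so nothing stands out.
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one value far from the rest.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0]
outliers = z_score_outliers(readings)
print(outliers)
```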
scalability
need for efficient data structures, parallel algorithms, etc
dimensionality
number of dimensions (attributes, features) is too large, because of the temporal, spatial, and sequential nature of the data
heterogeneity
complicated data types (graph-based, free-form text, structured and semi-structured) that traditional statistical methods might not be able to handle
imperfection
missing values and noise; perfect algorithm + imperfect data = wrong info
data ownership and distribution
need to develop distributed data mining solutions, efficient algorithms to cope with the distributed datasets to minimize the cost of communications, along with data security and data ownership issues