DATA MINING PRELIMS (INTRO TO DATA MINING)

147 Terms

1
New cards

DATA

Facts and statistics collected together for reference or analysis.

Things known or assumed as facts, making the basis of reasoning or calculation.

Ex: History of corps, Comments/Reviews, Transactions

2
New cards

INFORMATION

What is conveyed or represented by a particular arrangement or sequence of things.

Data processed, stored, or transmitted by a computer.

Ex: Analysis of comments/reviews, Determine anomalous transactions

3
New cards

Qualitative
Quantitative

TYPES OF DATA

4
New cards

QUALITATIVE

Associated with details in either verbal or narrative form (ex: interview transcripts)

Implemented when data can be segregated into well-defined groups

Collected data can just be observed and not evaluated

Ex: Scents, Appearance, Beauty, Colors, Flavors, etc.

5
New cards

QUANTITATIVE

Associated with numbers or numerical values that may correspond to specific label or category (ex: enrolment statistics)

Implemented when data is numerical

Collected data can be statistically analyzed

Examples: Height, Weight, Time, Price, Temperature

6
New cards

Quantitative

Identify if Qualitative or Quantitative:
Website upload/download speed

7
New cards

Quantitative

Identify if Qualitative or Quantitative:
Conversion rate

8
New cards

Qualitative

Identify if Qualitative or Quantitative:
Computer Assisted Personal Interview

9
New cards

Quantitative

Identify if Qualitative or Quantitative:
54% of people prefer shopping online instead of going to the mall

10
New cards

Qualitative

Identify if Qualitative or Quantitative:
Better standard of living

11
New cards

Qualitative

Identify if Qualitative or Quantitative:
Home schooling over traditional schooling

12
New cards

LEVELS OF MEASUREMENT

A classification that relates the values assigned to variables

A scale of measurement used to describe information within the values

13
New cards

Nominal
Ordinal
Interval
Ratio

Enumerate Levels of Measurement

14
New cards

NOMINAL

Used for labeling and can only be categorized (ex: hair color, gender (1-male, 2-female))

15
New cards

ORDINAL

A scale to arrange or assign order and can be used to categorize or classify (rank) (ex: 1st-2nd-3rd, fair-good-best)

16
New cards

INTERVAL

A scale that has equal distances between adjacent values (ex: the gap between 10°C and 20°C equals the gap between 90°C and 100°C)

NO TRUE ZERO: 0°C doesn’t mean “No Temp”

17
New cards

RATIO

Used to depict order and has equal intervals (ex: height, weight) with a fixed point of 0

WITH TRUE ZERO: 0ft means “No Height”

18
New cards

Nominal

Identify the Level of Measurement:
Nationality

19
New cards

Ordinal

Identify the Level of Measurement:
Level of service

20
New cards

Ratio

Identify the Level of Measurement:
Annual sales

21
New cards

Ordinal

Identify the Level of Measurement:
Educational level

22
New cards

Interval

Identify the Level of Measurement:
IQ test

23
New cards

Nominal

Identify the Level of Measurement:
Hair color

24
New cards

Ratio

Identify the Level of Measurement:
Voltage

25
New cards

Ratio

Identify the Level of Measurement:
Crime rate

26
New cards

Ratio

Identify the Level of Measurement:
Height

27
New cards

Record
Graph
Ordered Data
Time Series

TYPES OF DATASET

28
New cards

Record

data matrix or document-term matrix (ex: transaction dataset or market basket)

29
New cards

Graph

depicts interactions of multiple entities

30
New cards

Ordered Data

spatial, temporal, sequential, genetic sequence

31
New cards

Time Series

A single attribute of interest over time

32
New cards

VOLUME
VARIETY
VELOCITY
VERACITY
VALUE
VARIABILITY

Six V’s of Big Data

33
New cards

VARIABILITY

ways in which big data can be used and formatted

34
New cards

VALUE

business value of the collected data

35
New cards

VERACITY

degree of which big data can be trusted

36
New cards

VELOCITY

speed at which big data is generated

37
New cards

VARIETY

types of data: STRUCTURED (Excel rows and columns), UNSTRUCTURED (Tweets/Posts/Images), and SEMI-STRUCTURED (XML/JSON)

38
New cards

Structured
Unstructured
Semi-Structured

Types of Variety

39
New cards

VOLUME

amount of data from myriad sources

41
New cards

Analytics Sophistication

Foundations of Data Analytics
(Descriptive, Predictive, Prescriptive)

42
New cards

Captured
Detected
Inferred

Foundations of Data Analytics
(Made consumable and accessible to everyone, optimized for their specific purpose, at the point of impact to deliver better decisions and actions through)

43
New cards

Structured
Unstructured

Foundations of Data Analytics

Use _____ and _____ data
(Numeric, Text, Image, Audio, Video)

44
New cards

DESCRIPTIVE/EXPLORATORY (Hindsight)
PREDICTIVE (Foresight)
DIAGNOSTIC (Insight)
PRESCRIPTIVE (Wide sight)
COGNITIVE (Deep sight)

TAXONOMY OF DATA ANALYTICS

45
New cards


DESCRIPTIVE/EXPLORATORY (Hindsight)

summarizes or condenses data to extract patterns

data is described and summarized using basic statistical tools and graphs to produce reports and dashboards for decision making

What happened?

46
New cards

PREDICTIVE (Foresight)

extracts models from data to be used for future predictions

What will happen?

47
New cards

Supervised Learning
Unsupervised Learning

Types of Predictive Analytics

48
New cards

Classification
Regression
Time Series Analysis

Types of Supervised Learning

49
New cards

Clustering
Association Analysis
Sequential Pattern Analysis
Text Mining/Social Media Sentiment Analysis

Types of Unsupervised Learning

50
New cards

DIAGNOSTIC (Insight)

Finds the causes of problems that are exhibited through the data

Why did it happen?

51
New cards

PRESCRIPTIVE (Wide sight)

combines insights from the first three, allowing companies to make decisions based on them

is an application of analytics that recommends the optimal solution to a problem given constraints.

This application also seeks to find the best solution given multiple what-if scenarios

How can we make it happen?

52
New cards

COGNITIVE (Deep sight)

unfolds hidden patterns and replicates human thought

What is the extent of what can happen?

53
New cards

DATA ANALYTICS FRAMEWORK

Data from source systems are collected, processed and loaded into the DATA WAREHOUSE, a centralized database that holds large amounts of data. ANALYSTS then perform exploratory data analysis, data mining, simulation and optimization to gain insights. Then, DECISION MAKERS use the analysis to make business decisions.

54
New cards

DATA WAREHOUSE

a centralized database that holds large amounts of data

55
New cards

ANALYSTS

then perform exploratory data analysis, data mining, simulation and optimization to gain insights

56
New cards

DECISION MAKERS

use the analysis to make business decisions.

57
New cards

Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability

Measures for data quality

58
New cards

Accuracy

correct or wrong, accurate or not
How correct and reliable the data is in reflecting real-world facts.

59
New cards

Completeness

not recorded, unavailable,
Whether all required data is available and fully recorded.

60
New cards

Consistency

some modified but some not, dangling
Whether data is uniformly stored and maintained across systems without contradictions.

61
New cards

Timeliness

timely update?
How up-to-date the data is, ensuring it reflects the most current information.

62
New cards

Believability

how trustworthy and correct the data are?
The degree to which the data is trusted to be true and credible.

63
New cards

Interpretability

How easily the data can be understood and used by its audience.

64
New cards

Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization

Major Tasks in Data Preprocessing

65
New cards

DATA CLEANING

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
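The cleaning tasks above can be sketched in a few lines of Python. The sensor readings below are made up for illustration, and the 2-standard-deviation cutoff is just one common rule of thumb for flagging outliers:

```python
import statistics

# Illustrative readings; None marks a missing value.
readings = [12.0, None, 11.5, 12.4, None, 98.7, 11.9]

# 1) Fill in missing values with the mean of the observed ones.
observed = [x for x in readings if x is not None]
mean = sum(observed) / len(observed)
filled = [x if x is not None else mean for x in readings]

# 2) Identify/remove outliers: drop values more than 2 standard
#    deviations from the mean.
mu = statistics.mean(filled)
sigma = statistics.stdev(filled)
cleaned = [x for x in filled if abs(x - mu) <= 2 * sigma]

print(cleaned)  # the extreme 98.7 reading is dropped
```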

66
New cards

DATA INTEGRATION

Combines data from multiple sources into a coherent store such as multiple databases, data cubes, or files

67
New cards

Schema integration

A.cust-id ≡ B.cust-#
Integrate metadata from different sources

68
New cards

Entity identification problem

Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

69
New cards

Detecting and resolving data value conflicts

For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units

70
New cards

Object identification

The same attribute or object may have different names in different databases

71
New cards

Derivable data

One attribute may be a “derived” attribute in another table, e.g., annual revenue

72
New cards

correlation analysis
covariance analysis

Redundant attributes may be detected by _____ _____ and _____ ______

73
New cards

redundancies
inconsistencies

Careful integration of the data from multiple sources may help reduce/avoid ______ and ______ and improve mining speed and quality

74
New cards

CORRELATION ANALYSIS (NOMINAL DATA)

χ² (chi-square) test

The larger the χ² value, the more likely the variables are related

The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
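A minimal hand computation of the χ² statistic for a 2×2 contingency table (the counts below are illustrative); each cell contributes (observed − expected)²/expected:

```python
# Rows: plays chess / doesn't; columns: likes sci-fi / doesn't.
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected count
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # a large value -> variables likely related
```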

75
New cards

causality

Correlation does not imply _____

76
New cards

1 - confidence level

Chi-Square alpha formula?
α = ?

77
New cards

Degrees of freedom (DOF)

refer to the number of independent variables or values in a dataset that can vary without breaking any constraints. It’s used to describe the flexibility of a model in fitting the data.

78
New cards

CORRELATION COEFFICIENT

also called Pearson’s product moment coefficient

79
New cards

COVARIANCE (NUMERIC DATA)

similar to correlation

Cov(A,B) = (1/n) Σᵢ (aᵢ − Ā)(bᵢ − B̄)

where n is the number of tuples, and Ā and B̄ are the respective mean or expected values of A and B
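Both covariance and Pearson's correlation coefficient can be computed directly from their definitions; the paired samples A and B below are made up:

```python
A = [2.0, 3.0, 5.0, 4.0, 6.0]
B = [5.0, 8.0, 10.0, 11.0, 14.0]

n = len(A)
mean_a = sum(A) / n
mean_b = sum(B) / n

# Cov(A,B) = (1/n) * sum((a_i - mean_a) * (b_i - mean_b))
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / n

# Pearson's product moment coefficient: covariance scaled by the
# standard deviations, so it always lies in [-1, 1].
std_a = (sum((a - mean_a) ** 2 for a in A) / n) ** 0.5
std_b = (sum((b - mean_b) ** 2 for b in B) / n) ** 0.5
r = cov / (std_a * std_b)

print(cov, r)  # positive covariance: A and B rise together
```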

80
New cards

Positive covariance

If Cov(A,B) > 0, then A and B both tend to be larger than their expected values.

81
New cards

Negative covariance

If Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.

82
New cards

Independence

If A and B are independent, Cov(A,B) = 0, but the converse is not true:

Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence

83
New cards

DATA REDUCTION

Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results

Why reduce data?: A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.

84
New cards

DIMENSIONALITY REDUCTION
NUMEROSITY/DATA REDUCTION
DATA COMPRESSION

DATA REDUCTION STRATEGIES

85
New cards

DIMENSIONALITY REDUCTION

e.g., remove unimportant attributes

Help eliminate irrelevant features and reduce noise

Reduce time and space required in data mining

Allow easier visualization

86
New cards

Curse of dimensionality

When dimensionality increases, data becomes increasingly sparse

Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful

The possible combinations of subspaces will grow exponentially

87
New cards

Wavelet transforms

Principal Component Analysis

Supervised and nonlinear techniques

Dimensionality reduction techniques

88
New cards

PRINCIPAL COMPONENT ANALYSIS (PCA)

Find a projection that captures the largest amount of variation in data

The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
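A minimal 2-D sketch of the procedure described above (covariance matrix → eigenvectors → projection), using illustrative points; a real analysis would use numpy or scikit-learn:

```python
import math

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]
n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Covariance matrix [[sxx, sxy], [sxy, syy]] (sample covariance).
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix via the quadratic formula.
tr, det = sxx + syy, sxx * syy - sxy * sxy
lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # largest eigenvalue

# Its eigenvector is the first principal component: the direction
# capturing the largest amount of variation in the data.
v = (sxy, lam1 - sxx)
norm = math.hypot(*v)
pc1 = (v[0] / norm, v[1] / norm)

# Project each centered point onto PC1 -> 1-D representation.
projected = [x * pc1[0] + y * pc1[1] for x, y in centered]
print(pc1, projected[:3])
```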

89
New cards

ATTRIBUTE SUBSET SELECTION

Another way to reduce dimensionality of data

Redundant attributes

Duplicate much or all of the information contained in one or more other attributes

purchase price of a product and the amount of sales tax paid

Irrelevant attributes

Contain no information that is useful for the data mining task at hand

students' ID is often irrelevant to the task of predicting students' GPA

90
New cards

HEURISTIC SEARCH IN ATTRIBUTE SELECTION

There are 2^d possible attribute combinations of d attributes
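The 2^d count is easy to verify with the standard library; the attribute names below are illustrative:

```python
from itertools import combinations

attrs = ["height", "weight", "age"]  # d = 3

# Every subset of every size, including the empty subset.
subsets = [c for r in range(len(attrs) + 1)
           for c in combinations(attrs, r)]
print(len(subsets))  # 2^3 = 8
```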

91
New cards

ATTRIBUTE CREATION (FEATURE GENERATION)

Create new attributes (features) that can capture the important information in a data set more effectively than the original ones

92
New cards

Attribute extraction
Attribute construction

ATTRIBUTE CREATION (FEATURE GENERATION) Methodologies

93
New cards

NUMEROSITY/DATA REDUCTION

Reduce data volume by choosing alternative, smaller forms of data representation

94
New cards

Parametric
Non-parametric

NUMEROSITY/DATA REDUCTION Methods

95
New cards

Parametric methods

Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)

96
New cards

Regression

Enumerate Parametric method/s

97
New cards

Linear
Multiple
Log-Linear

Enumerate different Regression

98
New cards

Regression Analysis

Modeling and analysis techniques of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (or explanatory variables or predictors)

Parameters are estimated to give the best fit of the data

Most commonly the best fit is evaluated by using the least squares method
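For simple linear regression y = b0 + b1·x, the least squares estimates have a closed form; a sketch with made-up (x, y) pairs:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.2, 6.1, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: b1 = sum((x_i - x̄)(y_i - ȳ)) / sum((x_i - x̄)^2)
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x  # intercept

def predict(xi):
    return b0 + b1 * xi

print(b0, b1)  # store only these parameters, not the data
```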

99
New cards

dependent variable

(also called response variable or measurement)

100
New cards

independent variables

(also called explanatory variables or predictors)