DATA
Facts and statistics collected together for reference or analysis.
Things known or assumed as facts, making the basis of reasoning or calculation.
Ex: History of corps, Comments/Reviews, Transactions
INFORMATION
What is conveyed or represented by a particular arrangement or sequence of things.
Data processed, stored, or transmitted by a computer.
Ex: Analysis of comments/reviews, Determine anomalous transactions
Qualitative
Quantitative
TYPES OF DATA
QUALITATIVE
Associated with details in verbal or narrative form (ex: interview transcripts)
Implemented when data can be segregated into well-defined groups
Collected data can only be observed, not evaluated
Ex: Scents, Appearance, Beauty, Colors, Flavors, etc.
QUANTITATIVE
Associated with numbers or numerical values that may correspond to a specific label or category (ex: enrolment statistics)
Implemented when data is numerical
Collected data can be statistically analyzed
Examples: Height, Weight, Time, Price, Temperature
Quantitative
Identify if Qualitative or Quantitative:
Website upload/download speed
Quantitative
Identify if Qualitative or Quantitative:
Conversion rate
Qualitative
Identify if Qualitative or Quantitative:
Computer Assisted Personal Interview
Quantitative
Identify if Qualitative or Quantitative:
54% of people prefer shopping online instead of going to the mall
Qualitative
Identify if Qualitative or Quantitative:
Better standard of living
Qualitative
Identify if Qualitative or Quantitative:
Home schooling over traditional schooling
LEVELS OF MEASUREMENT
A classification that relates the values assigned to variables
A scale of measurement used to describe information within the values
Nominal
Ordinal
Interval
Ratio
Enumerate Levels of Measurement
NOMINAL
Used for labeling and can only be categorized (ex: hair color, gender (1-male, 2-female))
ORDINAL
A scale to arrange or assign order and can be used to categorize or classify (rank) (ex: 1st-2nd-3rd, fair-good-best)
INTERVAL
A scale with equal distances between adjacent values (ex: the difference between 10°C and 20°C equals the difference between 90°C and 100°C)
NO TRUE ZERO: 0°C doesn’t mean “No Temp”
RATIO
Used to depict order and has equal intervals (ex: height, weight) with a fixed point of 0
WITH TRUE ZERO: 0ft means “No Height”
Nominal
Identify the Level of Measurement:
Nationality
Ordinal
Identify the Level of Measurement:
Level of service
Ratio
Identify the Level of Measurement:
Annual sales
Ordinal
Identify the Level of Measurement:
Educational level
Interval
Identify the Level of Measurement:
IQ test
Nominal
Identify the Level of Measurement:
Hair color
Ratio
Identify the Level of Measurement:
Voltage
Ratio
Identify the Level of Measurement:
Crime rate
Ratio
Identify the Level of Measurement:
Height
Record
Graph
Ordered Data
Time Series
TYPES OF DATASET
Record
document matrix (ex: transaction dataset or market basket)
Graph
depicts interactions of multiple entities
Ordered Data
spatial, temporal, sequential, genetic sequence
Time Series
A single attribute of interest over time
VOLUME
VARIETY
VELOCITY
VERACITY
VALUE
VARIABILITY
Six V’s of Big Data
VARIABILITY
ways in which big data can be used and formatted
VALUE
business value of the collected data
VERACITY
degree to which big data can be trusted
VELOCITY
speed at which big data is generated
VARIETY
types of data: STRUCTURED (Excel rows and columns), UNSTRUCTURED (Tweets/Posts/Images), and SEMI-STRUCTURED (XML/JSON)
Structured
Unstructured
Semi-Structured
Types of Variety
VOLUME
amount of data from myriad sources
Analytics Sophistication
Foundations of Data Analytics
(Descriptive, Predictive, Prescriptive)
Captured
Detected
Inferred
Foundations of Data Analytics
(Made consumable and accessible to everyone, optimized for their specific purpose, at the point of impact to deliver better decisions and actions through)
Structured
Unstructured
Foundations of Data Analytics
Use _____ and _____ data
(Numeric, Text, Image, Audio, Video)
DESCRIPTIVE/EXPLORATORY (Hindsight)
PREDICTIVE (Insight)
DIAGNOSTIC (Foresight)
PRESCRIPTIVE (Wide sight)
COGNITIVE (Deep sight)
TAXONOMY OF DATA ANALYTICS
DESCRIPTIVE/EXPLORATORY (Hindsight)
summarizes or condenses data to extract patterns
data is described and summarized using basic statistical tools and graphs to produce reports and dashboards for decision making
What happened?
PREDICTIVE (Insight)
extracts models from data to be used for future predictions
What will happen?
Supervised Learning
Unsupervised Learning
Types of Predictive Analytics
Classification
Regression
Time Series Analysis
Types of Supervised Learning
Clustering
Association Analysis
Sequential Pattern Analysis
Text Mining/Social Media Sentiment Analysis
Types of Unsupervised Learning
DIAGNOSTIC (Foresight)
finds out problems that are exhibited through the data
Why did it happen?
PRESCRIPTIVE (Wide sight)
combines insights from the first three, allowing companies to make decisions based on them
is an application of analytics that recommends the optimal solution to a problem given constraints.
This application also seeks to find the best solution given multiple what-if scenarios
How can we make it happen?
COGNITIVE (Deep sight)
unfolds hidden patterns and replicates human thought
What is the extent of what can happen?
DATA ANALYTICS FRAMEWORK
Data from source systems are collected, processed and loaded into the DATA WAREHOUSE, a centralized database that holds large amounts of data. ANALYSTS then perform exploratory data analysis, data mining, simulation and optimization to gain insights. Then, DECISION MAKERS use the analysis to make business decisions.
DATA WAREHOUSE
a centralized database that holds large amounts of data
ANALYSTS
then perform exploratory data analysis, data mining, simulation and optimization to gain insights
DECISION MAKERS
use the analysis to make business decisions.
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Measures for data quality
Accuracy
correct or wrong, accurate or not
How correct and reliable the data is in reflecting real-world facts.
Completeness
not recorded, unavailable,
Whether all required data is available and fully recorded.
Consistency
some modified but some not, dangling
Whether data is uniformly stored and maintained across systems without contradictions.
Timeliness
timely update?
How up-to-date the data is, ensuring it reflects the most current information.
Believability
how trustworthy is the data?
The degree to which the data is trusted to be true and credible.
Interpretability
How easily the data can be understood and used by its audience.
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Major Tasks in Data Preprocessing
DATA CLEANING
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
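As a rough illustration of two of these cleaning steps, here is a minimal sketch (hypothetical helper, pure Python, `None` marks a missing value) that fills missing values with the attribute mean and flags values far from the mean as outliers:

```python
# Hypothetical cleaning sketch: impute missing values with the mean,
# then flag values more than `sigmas` standard deviations away.
def clean(values, sigmas=3.0):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)                        # mean of recorded values
    sd = (sum((v - mean) ** 2 for v in known) / len(known)) ** 0.5
    filled = [mean if v is None else v for v in values]   # fill in missing values
    outliers = [v for v in filled if abs(v - mean) > sigmas * sd]
    return filled, outliers
```

Real pipelines would also smooth noisy data (e.g., by binning) and resolve inconsistencies, which this toy helper does not attempt.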
DATA INTEGRATION
Combines data from multiple sources into a coherent store such as multiple databases, data cubes, or files
Schema integration
A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem
Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real-world entity, attribute values from different sources are different
Possible reasons: different representations, different scales, e.g., metric vs. British units
Object identification
The same attribute or object may have different names in different databases
Derivable data
One attribute may be a “derived” attribute in another table, e.g., annual revenue
correlation analysis
covariance analysis
Redundant attributes may be detected by _____ _____ and _____ ______
redundancies
inconsistencies
Careful integration of the data from multiple sources may help reduce/avoid ______ and ______ and improve mining speed and quality
CORRELATION ANALYSIS (NOMINAL DATA)
Χ2 (chi-square) test
The larger the Χ2 value, the more likely the variables are related
The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count
causality
Correlation does not imply _____
1 - confidence level
Chi-Square alpha formula?
α = ?
Degrees of freedom (DOF)
refer to the number of independent variables or values in a dataset that can vary without breaking any constraints. It’s used to describe the flexibility of a model in fitting the data.
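The Χ2 statistic can be computed directly from a contingency table. A pure-Python sketch (the counts below are made-up example data, not from the course):

```python
# Chi-square test of independence on an r x c contingency table.
def chi_square(observed):
    n = sum(sum(row) for row in observed)
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n   # expected count under independence
            chi2 += (o - e) ** 2 / e          # cells far from expected dominate
    dof = (len(observed) - 1) * (len(col_tot) - 1)
    return chi2, dof
```

For a 2x2 table, DOF = (2−1)(2−1) = 1; the statistic is then compared against the critical value at α = 1 − confidence level.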
CORRELATION COEFFICIENT
also called Pearson’s product moment coefficient
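Computed directly from its standard definition (covariance divided by the product of the standard deviations), a sketch in pure Python:

```python
# Pearson's product moment coefficient from its definition.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)   # r always lies in [-1, 1]
```

A perfectly increasing linear relationship gives r = 1, a perfectly decreasing one gives r = −1.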
COVARIANCE (NUMERIC DATA)
similar to correlation
Cov(A,B) = E[(A − Ā)(B − B̄)] = (1/n) Σᵢ (aᵢ − Ā)(bᵢ − B̄), where n is the number of tuples, and Ā and B̄ are the respective mean or expected values of A and B
Positive covariance
If Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
Negative covariance
If Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
Independence
If A and B are independent, Cov(A,B) = 0, but the converse is not true:
Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence
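A small sketch of the definition (pure Python, made-up data). The third case below demonstrates the caveat: y = x² with symmetric x is clearly dependent on x, yet its covariance with x is 0.

```python
# Covariance of two numeric attributes from its definition.
def cov(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
```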
DATA REDUCTION
Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Why reduce data?: A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
DIMENSIONALITY REDUCTION
NUMEROSITY/DATA REDUCTION
DATA COMPRESSION
DATA REDUCTION STRATEGIES
DIMENSIONALITY REDUCTION
e.g., remove unimportant attributes
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques
Dimensionality reduction techniques
PRINCIPAL COMPONENT ANALYSIS (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
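The eigenvector procedure just described can be sketched with numpy (assumes rows of X are data points and columns are attributes; a minimal illustration, not a production implementation):

```python
import numpy as np

# PCA via the eigenvectors of the covariance matrix.
def pca(X, k):
    Xc = X - X.mean(axis=0)                  # center each attribute
    C = np.cov(Xc, rowvar=False)             # covariance matrix of attributes
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors = new space
    return Xc @ W                            # project onto the smaller space
```

Points lying on a line in 2-D, for example, reduce to a single dimension with essentially no loss of variation.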
ATTRIBUTE SUBSET SELECTION
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in one or more other attributes
purchase price of a product and the amount of sales tax paid
Irrelevant attributes
Contain no information that is useful for the data mining task at hand
students' ID is often irrelevant to the task of predicting students' GPA
HEURISTIC SEARCH IN ATTRIBUTE SELECTION
There are 2^d possible attribute combinations of d attributes
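To see why exhaustive search over subsets quickly becomes impractical, a toy enumeration (standard library only) of all 2^d subsets of d attributes:

```python
from itertools import combinations

# Enumerate every attribute subset: 2**d in total for d attributes,
# which is why heuristic (e.g., greedy) search is used instead.
def all_subsets(attrs):
    subs = []
    for r in range(len(attrs) + 1):   # subset sizes 0..d
        subs.extend(combinations(attrs, r))
    return subs
```

Already at d = 20 this yields over a million candidate subsets.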
ATTRIBUTE CREATION (FEATURE GENERATION)
Create new attributes (features) that can capture the important information in a data set more effectively than the original ones
Attribute extraction
Attribute construction
ATTRIBUTE CREATION (FEATURE GENERATION) Methodologies
NUMEROSITY/DATA REDUCTION
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric
Non-parametric
NUMEROSITY/DATA REDUCTION Methods
Parametric methods
Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
Regression
Enumerate Parametric method/s
Linear
Multiple
Log-Linear
Enumerate different Regression
Regression Analysis
Modeling and analysis techniques of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (or explanatory variables or predictors)
Parameters are estimated to give best fit of the data
Most commonly the best fit is evaluated by using the least squares method
dependent variable
(also called response variable or measurement)
independent variables
(also called explanatory variables or predictors)
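A minimal least squares fit with one independent variable illustrates the parametric idea: estimate the parameters b0 and b1 for the best-fitting line y = b0 + b1·x, then keep only those two numbers in place of the raw data.

```python
# Ordinary least squares for the line y = b0 + b1*x (one predictor).
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
         / sum((xi - mx) ** 2 for xi in x)   # slope: Cov(x,y)/Var(x)
    b0 = my - b1 * mx                        # line passes through the means
    return b0, b1
```

Multiple and log-linear regression generalize this to several predictors, but the parameters are estimated by the same least squares criterion.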