Online Transaction Processing (OLTP)
Type of computer processing where the computer responds immediately to user requests and focuses on data capture
Transaction databases such as ATM, ERP, SCM, CRM, …
Main focus is on efficiency of routine tasks
Data capture
Online Analytical Processing (OLAP)
Processing for end-user ad hoc reports, queries, and analysis used for decision support
Data Warehouses or Data Marts
Main focus is converting data into information for decision support (queries)
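A minimal sketch of the OLTP/OLAP contrast from the two cards above, using Python's built-in sqlite3 module. The table name `atm_transactions`, the helper `record_withdrawal`, and the sample rows are hypothetical; in practice the analytical query would usually run against a separate data warehouse or data mart rather than the transactional database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE atm_transactions (id INTEGER PRIMARY KEY, branch TEXT, amount REAL)"
)

def record_withdrawal(branch, amount):
    """OLTP-style work: capture one routine transaction immediately."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO atm_transactions (branch, amount) VALUES (?, ?)",
            (branch, amount),
        )

for branch, amount in [("North", 40.0), ("North", 60.0), ("South", 100.0)]:
    record_withdrawal(branch, amount)

# OLAP-style work: an ad hoc query that summarizes many rows for decision support.
totals = dict(
    conn.execute(
        "SELECT branch, SUM(amount) FROM atm_transactions GROUP BY branch"
    ).fetchall()
)
print(totals)
```

The inserts touch one row at a time (efficiency of routine tasks), while the GROUP BY query reads every row to turn data into information.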
Business Intelligence
An umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies
Enables interactive access to data
Provide business managers with the ability to conduct appropriate analysis
Critical BI System Considerations
Developing or Acquiring BI Systems
Make versus Buy
BI Shells
Justification and Cost-Benefit Analysis
Security and Protection of Privacy
Integration to Other Systems and Applications
Analytics
The process of developing actionable decisions or recommendations for actions based on insights generated from historical data
Combination of technology, science, and statistics to solve problems
Descriptive Analytics
Refers to knowing what is happening in the organization and understanding some underlying trends and causes of such occurrences
Answering the question of what happened
Analysis of historical data
Enablers
DW
Data Visualization
Dashboards and Scorecards
Predictive Analytics
Aims to determine what is likely to happen in the future
Used to forecast whether customers are likely to switch to a competitor, what customers are likely to buy, and how likely customers are to respond to a promotion
Looking at the past to determine the future
Enablers
Data Mining
Text Mining / Web Mining
Forecasting (e.g., time series)
Prescriptive Analytics
Aims to determine the best possible solution
To identify decisions or actions that will optimize the performance of a system
Used to set prices, create production plans, and identify the best locations for facilities
Uses both descriptive and predictive to create the alternatives, and determines the best one
Enablers
Optimization
Simulation
Multi-Criteria Decision Modeling
Big Data Analytics
Data that cannot be stored or processed easily using traditional tools/means
Data that comes in different forms: structured, unstructured, large, continuous, etc.
Major sources include clickstreams from websites, postings on social media, and data from traffic, sensors, and weather
Is worthless if it does not provide business value
Data
Collection of facts usually obtained as the result of experiments, observations, transactions, or experiences
The Nature of Data
Is the main ingredient in all forms of analytics
Usually obtained as a result of experiences, observations, and experiments
It may consist of numbers, words, images, etc.
Lowest level of abstraction (from which information and knowledge are derived)
Has to be carefully created/identified, collected, integrated, cleaned, transformed
Data quality and data integrity → Critical to analysis
Metrics for Analytics Ready Data
Source reliability
Content accuracy
Accessibility
Security and privacy
Richness
Currency/timeliness (up to date)
Validity and Relevancy
Structured Data
Organized information in a fixed format, easily searchable and analyzed, typically stored in databases
Targeted for computers to process
Unstructured/Textual Data
Data not organized in a pre-defined manner, often textual and challenging to analyze.
Targeted for humans to process
Need to be converted into some form of categorical or numeric representation
Semi-Structured Data
Data that does not have a fixed format but contains tags or markers to separate elements.
XML, HTML, Log files, etc.
Steps for readying data for analytics
Data Consolidation: Get data together
Data Cleaning: Dealing with missing values
Data Transformation: Formatting
Data reduction
a. Variables (dimensional reduction, variable selection)
b. Cases / Samples (Sampling, Balancing)
Statistics
A collection of mathematical techniques to characterize and interpret data
Descriptive Statistics
Describing the data (as it is) (mean, median, mode, etc.)
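A quick illustration of the three measures named above, using Python's standard statistics module. The exam scores are made-up sample data.

```python
import statistics

scores = [70, 85, 85, 90, 100]  # hypothetical exam scores

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print(mean_score, median_score, mode_score)
```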
Inferential Statistics
Drawing inferences about the population based on the sample data (Regression…)
Dispersion
Measure of data spread around a central point.
If it is large, the mean is not a good representation of the data because there are larger differences between individual scores
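A small sketch of why a large dispersion makes the mean less representative. Both hypothetical data sets below have the same mean, but very different spreads; the population standard deviation (`statistics.pstdev`) is used here as one possible measure of dispersion.

```python
import statistics

tight = [48, 49, 50, 51, 52]    # values cluster near the mean
spread = [10, 30, 50, 70, 90]   # values sit far from the mean

tight_mean = statistics.mean(tight)
spread_mean = statistics.mean(spread)

tight_sd = statistics.pstdev(tight)     # population standard deviation
spread_sd = statistics.pstdev(spread)

tight_range = max(tight) - min(tight)
spread_range = max(spread) - min(spread)

print(tight_mean, tight_range, tight_sd)
print(spread_mean, spread_range, spread_sd)
```

A mean of 50 describes every value in `tight` well, but almost no value in `spread`.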
Kurtosis
Describes the shape of the distribution's peak (tall and skinny versus flat, etc.)
Regression
Part of inferential statistics
Most widely known and used analytics technique in statistics
Used to characterize relationships between explanatory (input) and response (output) variables
Can be used for
Hypothesis testing (explanation)
Forecasting (prediction)
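A minimal sketch of simple linear regression by ordinary least squares, showing the forecasting use listed above. The function name `fit_line` and the advertising/sales numbers are hypothetical, and the data is deliberately an exact line so the fitted values are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

ad_spend = [1, 2, 3, 4, 5]   # explanatory (input) variable
sales = [3, 5, 7, 9, 11]     # response (output) variable

intercept, slope = fit_line(ad_spend, sales)
forecast = intercept + slope * 6  # predict the response for an unseen input

print(intercept, slope, forecast)
```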
Correlation vs. Regression
Correlation makes no a priori assumption of whether one variable is dependent on the other(s)
Gives an estimate of the degree of association between the variables
Regression implicitly assumes that there is a one-way effect from the explanatory variable(s) to the response variable
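One way to see the difference in code: Pearson correlation gives the same value whichever variable comes first, while a regression slope depends on which variable is treated as explanatory. The helper names and the sample data below are hypothetical.

```python
from math import sqrt

def _centered_sums(xs, ys):
    """Sums of cross-products and squares around the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    """Degree of association; symmetric in its two arguments."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)

def regression_slope(explanatory, response):
    """Least-squares slope; the argument order encodes the assumed direction."""
    sxy, sxx, _ = _centered_sums(explanatory, response)
    return sxy / sxx

hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r_forward = pearson_r(hours, score)
r_reverse = pearson_r(score, hours)
slope_score_on_hours = regression_slope(hours, score)
slope_hours_on_score = regression_slope(score, hours)

print(r_forward, r_reverse, slope_score_on_hours, slope_hours_on_score)
```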
Logistic Regression
A statistical method used to model relationships between a binary dependent variable and one or more independent variables. It estimates the probability of the outcome.
Ex: Will the student pass the class? → Yes/No
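A sketch of how a fitted logistic regression turns an input into a probability for the pass/fail example. The intercept and coefficient are invented for illustration, not estimated from data, and the 0.5 cutoff is an assumed decision threshold.

```python
from math import exp

def pass_probability(hours_studied, intercept=-4.0, coef=1.0):
    """Logistic (sigmoid) function applied to a linear score.

    The default intercept and coefficient are hypothetical values.
    """
    z = intercept + coef * hours_studied
    return 1.0 / (1.0 + exp(-z))

def will_pass(hours_studied, threshold=0.5):
    """Convert the estimated probability into a Yes/No outcome."""
    return pass_probability(hours_studied) >= threshold

for hours in (2, 4, 6):
    print(hours, round(pass_probability(hours), 3), will_pass(hours))
```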
Business Report
A written document that contains information regarding business matters
Purpose: To improve managerial decisions
Sources: Data from inside and outside the organization (extract, transform, and load)
Format: Text + Tables + Graphs/Charts
Distribution: In print, email, portal/internet
Metric Management Reports
Help manage business performance through metrics (KPIs for internals)
Dashboard-Type Reports
Graphical presentation of several performance indicators in a single page using dials/gauges
Balanced Scorecard-Type Reports
A performance measurement and management methodology that helps translate an organization’s financial, customer, internal process, and learning and growth objectives and targets into a set of actionable initiatives
Strategic management system
Identifies measurements around vision and values
Focuses on growth
Heavy on strategic content
Data Visualization
The use of visual representations to explore, make sense of, and communicate data
Often includes charts, graphs, and other illustrations
Information
Aggregation, summarization, and contextualization of data
Visual analytics
The science of analytical reasoning facilitated by interactive visual interfaces
May use descriptive, predictive, and prescriptive analytics
Information Visualization
Graphical representation of data and information.
It uses visual elements like charts, graphs, maps, and infographics to present data
Dashboard Design
The fundamental challenge of dashboard design is to display all the required information on a single screen, clearly and without distraction, in a manner that can be assimilated quickly
What to look for in a dashboard
Use of visual components to highlight data and exceptions that require actions
Transparent to the user, meaning that they require minimal training and are extremely easy to use
Combine data from a variety of systems into a single summarized, unified view of the business
Enable drill-down or drill-through to underlying data sources and reports
Present a dynamic, real-world view with timely data
Require little coding to implement, deploy, and maintain
Performance Dashboards
Provide visual displays of important information that is consolidated and arranged on a single screen so that it can be digested at a single glance and easily drilled into and explored further
Commonly used in BPM software suites and BI Platforms
Data Mining
Used to describe the process of discovering new patterns and developing intelligence from collected, organized, and stored data
The process of finding mathematical patterns from (usually) large data sets such as correlations, trends, or prediction models
Allows a better understanding of customers, operations, and solving organizational problems
Other names: Knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, etc.
Machine Learning
A subset of artificial intelligence that involves the use of algorithms and statistical models to enable computers to perform specific tasks without explicit instructions. It relies on patterns and inference from data to improve performance over time.
Applications include image recognition, natural language processing, robotics, and predictive analytics.
BI is the entry level to what?
Descriptive Analytics
Data Warehouse
A collection of integrated, subject-oriented databases designed to support DSS functions
Characteristics of Data Warehouses
Subject-Oriented: Data organized by subject
Integrated: different sources
Time Variant (time-series): Historical data over time
Nonvolatile: Data can’t be changed or updated
Metadata
Client/server, real-time/right-time/active
Data Mart
A departmental scale “DW” that stores only limited/relevant data
Enterprise Data Warehouse (EDW)
A data warehouse for an enterprise (CRM, SCM)
Metadata
Data about Data
Describes the contents of the data warehouse and its acquisition and use
DW Architecture
Three Tier Architecture
Data acquisition software
The data warehouse that contains the data & software
Client (front-end) software that allows users to access and analyze data from the warehouse
Two Tier Architecture
The first two tiers from the three-tier structure are combined
Data Integration
Integration that combines three major processes
Accessing the data
Combining different views of the data (federation)
Capturing changes to the data
Enterprise Application Integration (EAI)
A technology that provides a vehicle for pushing data from source systems into a data warehouse
ETL (Extract, Transform, Load)
Reading data from a database (extract), converting the extracted data into the required format (transform), and writing the data into the target database (load)
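A minimal sketch of the three ETL steps using only the standard library. The CSV text stands in for a source system, the in-memory SQLite database stands in for the target, and the column names and date format are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: padded amounts and US-style dates.
SOURCE_CSV = """order_id,amount,order_date
1, 19.99 ,01/15/2024
2, 5.00 ,02/03/2024
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix types and convert dates to ISO format (YYYY-MM-DD)."""
    cleaned = []
    for row in rows:
        month, day, year = row["order_date"].split("/")
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), f"{year}-{month}-{day}")
        )
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into the target database."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```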
Inmon Model: EDW Approach (top-down)
A data warehousing approach that starts with an enterprise data warehouse, integrating data across the organization before creating data marts.
Kimball Model: Data Mart (DM) Approach (bottom-up)
A bottom-up data warehousing approach that creates data marts from operational systems.
Additional DW Considerations: Hosted Data Warehouses (Outsourcing)
Benefits:
Requires minimal investment in in-house systems
Frees up capacity on in-house systems
Frees up cash flow
Makes powerful solutions affordable
Enables solutions that provide for growth
Offers better quality equipment and software
Provides faster connections
Dimensional Modeling
A retrieval-based system that supports high-volume query access
Star Schema
Most commonly used and simplest style of dimensional modeling
Contains a fact table surrounded by and connected to several dimension tables
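A small star schema sketched in SQLite through Python, with one fact table joined to two dimension tables. All table names, column names, and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT);
-- The fact table holds the measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Laptops'), (2, 'Software');
INSERT INTO dim_date VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO fact_sales VALUES
    (1, 1, 1000.0), (1, 2, 1500.0), (2, 1, 200.0), (2, 2, 300.0);
""")

# Analyze the revenue measure by two dimensions at once.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
print(rows)
```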
Snowflake Schema
An extension of star schema where the diagram resembles a snowflake
Multidimensionality
The ability to organize, present, and analyze data by several dimensions
Dimensions: Products, salespeople, market segments, business units, geographical locations, distribution channels, etc.
Measures: Money, sales volume, head count, inventory profit, actual vs. forecast
Time: Daily, weekly, monthly, quarterly, or yearly
Scalability
Refers to the degree to which a system can adjust to changes in demand without major additional changes or investments
Main issues of scalability:
The amount of data in the warehouse
How quickly the warehouse is expected to grow
The number of concurrent users
The complexity of user queries
Good Scalability
Queries and other data-access functions will grow linearly with the size of the warehouse
Business Performance Management
Strategy Focused
A real-time system that alerts managers to potential opportunities and impending threats, and empowers them to react through models and collaboration
AKA Corporate Performance Management (CPM), Enterprise Performance Management (EPM), Strategic Enterprise Management (SEM)
Performance Measurement System
A system that assists managers in tracking the implementations of business strategy by comparing actual results against strategic goals and objectives
Key Performance Indicator
A representation of a strategic objective and metric(s) that measures performance against a goal
Outcome: revenues, lagging indicators
Driver: Sales leads, leading indicators
Six Sigma
A performance management methodology that aims to reduce the number of defects in a business process to as close to zero defects per million opportunities (DPMO) as possible; the Six Sigma target is 3.4 DPMO
Performance measurement system
Establishes accountability for leadership for wellness and profitability
Maximizing profitability
Heavy on execution for profitability
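A worked sketch of the defects-per-million-opportunities figure from the Six Sigma card. The function names and the inspection counts are hypothetical; the 3.4 target is the one stated on the card.

```python
SIX_SIGMA_TARGET = 3.4  # defects per million opportunities

def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities for a batch of inspected units."""
    return defects * 1_000_000 / (units * opportunities_per_unit)

def meets_six_sigma(defects, units, opportunities_per_unit):
    """True when the observed defect rate is at or below the target."""
    return dpmo(defects, units, opportunities_per_unit) <= SIX_SIGMA_TARGET

# 17 defects found in 1,000 units with 5 defect opportunities each.
observed = dpmo(17, 1000, 5)
print(observed, meets_six_sigma(17, 1000, 5))
```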
Effective Performance Measurement Should:
Focus on key factors
Be a mix of past, present, and future
Balance the needs of shareholders, employees, partners, suppliers, and other stakeholders
Start at the top and flow down to the bottom
Have targets that are based on research and reality rather than being arbitrary
Closed Loop Process to Optimize Business Performance:
Process Steps
Strategize
Plan
Monitor / analyze
Act / adjust
Has sub-process steps
Strategize (Process Step 1)
Conduct a current situation analysis
Determine planning horizon
Conduct environment scan
Identify critical success factors
Complete a gap analysis
Create a strategic vision
Develop a business strategy
Identify strategic objectives and goals
Operational Planning (Process Step 2)
Plan that translates an organization’s strategic objectives and goals into a set of well-defined tactics and initiatives, resource requirements, and expected results for some future time period (usually a year)
Operational planning can be:
Tactic-centric (operationally focused)
Budget-centric (financially focused)
Monitor / Analyze: How are We Doing? (Process Step 3)
A comprehensive framework for monitoring performance that should address two key issues:
What to monitor?
Critical success factors
Strategic goals and targets
Here is where KPIs, dashboards, reporting, and analytics are helpful
Act and Adjust: What Do We Need to Do Differently?
Success (or mere survival) depends on new projects: creating new products, entering new markets, acquiring new customers (or businesses), or streamlining some process
Types of Data Mining Patterns
Association: Establishing relationships among items
Prediction: Act of telling about the future
Cluster (Segmentation): Finding groups of entities with similar characteristics (unknown class labels)
Sequential (or time series) relationships
Time Series Forecasting
Values of the same variable are captured over time
Data Mining Process: CRISP-DM (Cross-Industry Standard Process for Data Mining)
Highly repetitive and experimental
Proposed in 1990s by a European consortium
Step 1: Business Understanding
Step 2: Data understanding
Step 3: Data Preparation
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
Data Mining Process: SEMMA
Developed by SAS Institute
Sample: Generate representative sample of data
Explore: Visualization and basic description of the data
Modify: Select variables, transform variable representations
Model: Use variety of statistical and machine learning models
Assess: Evaluate the accuracy and usefulness of the models
Data Mining Process (Knowledge Discovery in Databases; KDD)
Sources of raw data (Data Selection) → Target data (Data Cleaning) → Preprocessed data (Data Transformation) → Transformed data (Data Mining) → Extracted patterns (Internalization) → Knowledge (“actionable insight”)
Data Mining Methods: Classification
Most frequently used DM method
Part of the machine-learning family
Employ supervised learning
Learn patterns from past data (of previously labeled items), classify new data
The output variable is categorical (nominal or ordinal) in nature.
Classification versus regression? If numeric → Regression, If non-numeric → classification
Predictive Accuracy
The accuracy of a model in predicting class labels for new data.
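A minimal sketch of measuring predictive accuracy as the share of new cases whose class label was predicted correctly. The function name and the labels are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the predicted class label matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual_labels = ["yes", "no", "yes", "yes", "no"]
predicted_labels = ["yes", "no", "no", "yes", "no"]

model_accuracy = accuracy(actual_labels, predicted_labels)
print(model_accuracy)
```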
Speed
Model building versus predicting/usage speed
Robustness
The ability of a model to make accurate predictions consistently, even in the presence of variability or uncertainty.
Sample Split
Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)
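A sketch of a simple random 70/30 split. The function name, the fixed seed, and the use of integers as stand-in records are assumptions for the example.

```python
import random

def split_sample(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_sample(range(100))
print(len(train), len(test))
```

Because each record lands in exactly one slice, the two sets are mutually exclusive.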
Decision Trees
Income + credit score (attributes) → class (loan risk low, medium, high)
Recursively divides a training set until each division consists of examples from one class
Create a root node and assign all the training data to it
Select the best splitting attribute
Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split
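A sketch of the "select the best splitting attribute" step for the loan-risk example. The card does not say how "best" is measured; Gini impurity is assumed here as one common criterion, and the applicant records and function names are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every example belongs to one class."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, attribute, target):
    """Average impurity of the subsets created by splitting on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())

def best_split(rows, attributes, target):
    """Pick the attribute whose split leaves the purest subsets."""
    return min(attributes, key=lambda a: weighted_gini(rows, a, target))

applicants = [
    {"income": "high", "credit": "good", "risk": "low"},
    {"income": "high", "credit": "poor", "risk": "high"},
    {"income": "low", "credit": "good", "risk": "low"},
    {"income": "low", "credit": "poor", "risk": "high"},
]

root_attribute = best_split(applicants, ["income", "credit"], "risk")
print(root_attribute)
```

In this toy data, splitting on credit yields subsets that each contain a single class, so the recursion would stop after one split.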
Cluster Algorithms
When the data records do not have predefined class identifiers. Sort cases into groups. Members in each group are similar
Used for automatic identification of natural groupings of things
Part of the machine-learning family
Employ unsupervised learning (Includes only descriptive attributes)
Learns the clusters of things from past data, then assigns new instances
There is not an output/target variable
What do Cluster Algorithms Do?
Identify natural groupings of customers
Provide characterization, definition, labeling of populations
k-Means Clustering Algorithm
k: pre-determined number of clusters
Algorithm (Step 0: determine the value of k)
Step 1: Randomly generate k points as initial cluster centers
Step 2: Assign each point to the nearest cluster center
Step 3: Re-compute the new cluster centers
Repetition Step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
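The steps above can be sketched in plain Python for two-dimensional points. Two assumptions go beyond the card: the initial centers are drawn from the data points themselves (one common way to do Step 1), and "nearest" means squared Euclidean distance. The function name and sample points are hypothetical.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of (x, y) points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Step 1: k random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        new_assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2
                + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        if new_assignment == assignment:  # assignments are stable: converged
            break
        assignment = new_assignment
        # Step 3: re-compute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, assignment

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```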
Association Rule Mining
A very popular DM method in business
Finds interesting relationships (affinities) between variables (items or events)
Part of machine learning family
Employs unsupervised learning
There is no output variable
Also known as market basket analysis
Find strong relationships between products (e.g., beer and diapers)
The generic rule of Association Rule Mining
X → Y [S%, C%]
X, Y: Products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support (frequency): how often X and Y go together
C: Confidence: how often Y appears in transactions that contain X
Example: {Laptop Computer, Antivirus Software} → {Extended Service Plan} [30%, 70%]
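A sketch of how support and confidence for a rule X → Y could be computed from a list of transactions. The baskets below are invented and do not reproduce the 30% / 70% figures in the example above.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Share of transactions containing the LHS that also contain the RHS."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)

baskets = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "service plan"},
    {"laptop"},
    {"printer"},
]

lhs = {"laptop", "antivirus"}
rhs = {"service plan"}
rule_support = support(baskets, lhs | rhs)
rule_confidence = confidence(baskets, lhs, rhs)
print(rule_support, rule_confidence)
```

Here the rule holds in 2 of 5 baskets (support), and in 2 of the 3 baskets that contain the left-hand side (confidence).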
Association Rule Mining Algorithms
Several algorithms have been developed for discovering (identifying) association rules
Apriori
Eclat
FP-Growth
The algorithms help identify frequent item sets, which are then converted to association rules
Data Mining Software Tools
Commercial
IBM SPSS Modeler (formerly Clementine)
SAS Enterprise Miner
Statistica - Dell/Statsoft
…many more
Free and/or Open Source
RapidMiner
Weka
R, …
Data Mining Mistakes
Selecting the wrong problem for data mining
Beginning without the end in mind
Not leaving sufficient time for data acquisition, selection, and preparation
Looking only at aggregated results and not at individual records/predictions
Supervised Learning
A machine learning approach where a model is trained using labeled data to learn patterns and make predictions on new data.
Classification: K-nearest neighbors, naïve Bayes, decision trees, logistic regression, sentiment analysis
Regression: Least Squares, Linear Regression, Forecasting, Non-Linear Regression, Intervention Analysis
Unsupervised Learning
A machine learning approach where a model identifies patterns and structures in unlabeled data without predefined outputs.
Clustering / Segmentation
K-Means
Outlier Detection
Markov Chains
Customer Relationship Management (Data Mining Applications)
Maximize return on marketing campaigns
Improve customer retention
Maximize Customer Value
Identify and treat most valued customers
Banking and Other Financial (Data Mining Applications)
Automate the loan application process
Detecting fraudulent transactions
Maximize customer value (cross-, up-selling)
Optimizing cash reserves with forecasting
Retailing and Logistics (Data Mining Applications)
Optimize inventory levels at different locations
Improve store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life
Manufacturing and Maintenance (Data Mining Applications)
Predict / prevent machinery failures
Identify anomalies in production systems to optimize the use of manufacturing capacity
Discover novel patterns and improve product quality
Data Mining Applications
Computer hardware and software
Government and defense
Homeland security and law enforcement
Travel, entertainment, sports
Healthcare and medicine
…virtually everywhere…