Chapter 6: Business Intelligence: Big Data & Analytics
Business Intelligence (BI)
Definition: Business Intelligence (BI) is a set of technological processes for collecting, managing, and analyzing organizational data, producing insights that inform business strategies and operations (Source: IBM).
Big Data & Analytics
Definition: Big data analytics is a form of advanced analytics that involves complex applications featuring:
Predictive models
Statistical algorithms
What-if analyses powered by analytics systems.
Importance of Learning About Big Data
Reasons to Learn:
New data is constantly generated from countless sources.
Approximately one zettabyte of data is produced per year:
Equivalent to one trillion gigabytes, i.e., 10^21 bytes (a 1 followed by 21 zeros).
Organizations must therefore analyze vast quantities of data to:
Measure past and current performance.
Predict future trends and outcomes.
Drive anticipatory business actions.
Improve business strategies and operations.
Enrich decision-making processes.
Enhance competitive advantage.
Characteristics of Big Data
Volume: The sheer amount of data generated and stored.
Velocity: Represents the speed of data generation and how quickly it is processed.
Value: The importance or business worth of the data.
Variety: Denotes the different forms or types of data involved.
Veracity: Indicates the accuracy and truthfulness of the data.
Sources of Useful Data
Organizations accumulate useful data from a wide range of sources, such as transaction processing systems, websites and social media, mobile devices, and sensors.
Technologies for Managing and Processing Big Data
Different technologies and systems are employed to handle big data:
Data Warehouses: Serve as central repositories for data collected from various sources.
ETL Process: Extract, Transform, Load process for managing data.
Data Marts: Subsets of data warehouses tailored for specific business areas.
Data Lakes: Storage systems that maintain all data in its raw form.
NoSQL Databases: Non-relational databases that store data outside of tabular relations.
Hadoop: An open-source framework designed to process large datasets.
In-Memory Databases: Databases that store all data in RAM for faster access.
Online Transaction Processing (OLTP)
Definition: OLTP systems have traditionally been used to capture transaction data as it occurs, but they provide little in the way of analytical capability.
Examples include:
ATMs
Financial transaction systems
Online banking
Reservation systems (booking and ticketing).
Data Warehousing: Facilitates decision-making by providing access to data from OLTP systems, which is extracted, transformed, and loaded via ETL processes.
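To make the ETL idea concrete, here is a minimal sketch that moves rows from a transactional source into a warehouse table. Both "systems" are stand-ins built with Python's sqlite3 module, and the table and column names (orders, fact_orders) are invented for the example.

```python
import sqlite3

# Hypothetical setup: the OLTP source and the warehouse are both
# local SQLite databases here; table and column names are invented.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, order_date TEXT)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 19.999, "2024-03-01"), (2, 5.5, "2024-03-02")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, year TEXT)")

# Extract: pull the raw transaction rows from the OLTP system.
rows = oltp.execute("SELECT order_id, amount, order_date FROM orders").fetchall()

# Transform: clean and reshape for analysis (round amounts, derive the year).
transformed = [(oid, round(amt, 2), date[:4]) for oid, amt, date in rows]

# Load: write the cleaned rows into the warehouse fact table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)
warehouse.commit()
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```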
Comparison of Data Structures
Data Warehouse:
Extensive database that integrates data from various sources across an organization, encompassing all processes, products, and customer information.
Data Mart:
A focused subset of a data warehouse, beneficial for smaller businesses or departments, optimized for specific decision-making needs.
Data Lake:
Takes a store-everything approach: all data is kept in its raw, unprocessed format until it is needed for analysis.
NoSQL Databases
NoSQL databases differ from traditional relational databases by:
Structuring data without fixed table relations, allowing for flexible data management.
Utilizing horizontal scaling and not requiring predefined schemas.
Offering improved speed and redundancy.
Types of NoSQL Databases:
Document: Stores data in documents.
Graph: Manages interconnected data.
Key-Value: Data is organized as simple key-value pairs.
Wide-column: Stores data in flexible column families, so rows within a table need not share the same columns (e.g., Cassandra, HBase).
Examples:
Document databases include MongoDB and CouchDB; graph databases include Neo4j.
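To illustrate the schema flexibility of the key-value and document models, here is a minimal sketch using plain Python structures rather than a real NoSQL engine; all keys, fields, and records are invented.

```python
# Key-value model: each record is addressed by a unique key (invented data).
kv_store = {}
kv_store["session:42"] = {"user": "alice", "expires": "2024-12-31"}
print(kv_store["session:42"])

# Document model: records in one collection need not share a schema.
customers = [
    {"_id": 1, "name": "Acme Corp", "industry": "retail"},
    {"_id": 2, "name": "Bea's Bakery", "phone": "555-0100"},  # different fields
]
# Query by any field a document happens to contain.
print([doc for doc in customers if doc.get("industry") == "retail"])
```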
Hadoop Framework
Definition: Hadoop is an open-source software framework that provides modules for the storage and processing of vast data sets.
Hadoop Distributed File System (HDFS):
A distributed file system used for data storage. It splits data into blocks and distributes them across multiple servers so they can be processed in parallel.
MapReduce Program
Functionality: A composite program that consists of:
Map Procedure: Responsible for filtering and sorting data.
Reduce Method: Conducts summary operations on the filtered data.
Limitations: Restricted to batch processing; not suited to real-time or interactive workloads.
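The classic MapReduce illustration is a word count. The sketch below runs both phases in plain Python on invented input; a real Hadoop job would distribute the map and reduce tasks across many servers.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data big insights", "big data analytics"]  # invented input

# Map: emit a (word, 1) pair for every word (the filter/sort phase).
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group the pairs by key so each word's counts sit together.
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts per word (the summary phase).
counts = {word: sum(count for _, count in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'analytics': 1, 'big': 3, 'data': 2, 'insights': 1}
```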
In-Memory Databases (IMDB)
Storage Mechanism: Stores the complete database within RAM.
Advantages:
Significantly faster data retrieval compared to secondary storage solutions.
Efficient in analyzing big data and complex data processing applications.
Enabling Factors:
Increase in RAM capacity.
Decrease in RAM costs.
IMDB Providers: Include Altibase (HDB), Oracle (TimesTen), SAP (HANA), and others.
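SQLite's in-memory mode gives a quick feel for the IMDB idea: the entire database lives in RAM, so queries never touch disk. This is only an illustration with invented sales data; the production IMDBs listed above add persistence, concurrency, and scale-out features.

```python
import sqlite3

# The whole database lives in RAM; nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("east", 120.0), ("west", 95.5), ("east", 80.0)])

# Analytical queries run directly against memory-resident data.
for region, total in db.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
```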
Analytics and Business Intelligence (BI)
Extensive use of data and quantitative analysis supports fact-based decision-making within organizations.
Benefits:
Fraud detection
Improved forecasting
Enhanced sales performance
Operational optimizations
Cost reductions
Role of a Data Scientist
Skill Set: Combines business acumen, in-depth analytics knowledge, and an understanding of the limitations of tools and techniques.
Analysis Impact: Aims to improve decision-making significantly.
Job Market: Promising career prospects and rigorous educational requirements.
Required Components for Effective BI & Analytics
A comprehensive data management program, inclusive of governance principles.
Expertise from creative data scientists.
Strong organizational commitment to data-driven decision-making.
Analytics Techniques
Types of Analysis include:
Descriptive Analysis
Predictive Analytics
Text Analysis
Video Analysis
Optimization Techniques
Simulation Techniques
Descriptive Analysis
Primary Data Processing Stage: Acts as a preliminary step in data analysis, identifying patterns and answering critical questions about data.
Typical Questions Addressed:
Who, what, where, when, and to what extent.
Types of Descriptive Analysis:
Visual analytics
Regression analysis
Visual Analytics: Presentation of data through graphs or pictorial means (e.g., word clouds, conversion funnels).
Regression Analysis: Establishes relationships between dependent and independent variables, producing regression equations for prediction.
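As a small worked example, the sketch below fits a simple linear regression y = a + b·x by ordinary least squares and uses the resulting regression equation for prediction; the data points are invented.

```python
# Invented data: advertising spend (x) versus sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.3, 6.2, 8.1, 9.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Ordinary least squares: b = cov(x, y) / var(x), a = mean_y - b * mean_x.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

print(f"regression equation: y = {a:.2f} + {b:.2f} * x")
print("prediction for x = 6:", a + b * 6)
```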
Predictive Analytics
Definition: Focuses on analyzing current and historical data to identify probabilities and trends, allowing for predictions about future outcomes.
Time Series Analysis: A statistical method for analyzing time-ordered data (see the moving-average sketch after this list).
Data Mining Techniques: Include association analysis, neural computing, and case-based reasoning to uncover hidden patterns and guide decision making.
CRISP-DM: The Cross-Industry Standard Process for Data Mining, a six-phase framework (business understanding, data understanding, data preparation, modeling, evaluation, deployment) for structuring a data mining project.
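As a minimal illustration of time series analysis, the sketch below smooths an invented monthly demand series with a three-period moving average and uses the latest average as a naive next-period forecast.

```python
# Invented monthly demand figures, in chronological order.
demand = [120, 132, 128, 141, 150, 147, 155]
window = 3  # average over the three most recent periods

# Simple moving average: smooth the series to reveal the underlying trend.
smoothed = [sum(demand[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(demand))]
print("smoothed series:", smoothed)

# Naive forecast: take the latest moving average as next month's estimate.
print("forecast for next month:", smoothed[-1])
```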
Optimization Techniques
Purpose: Allocates limited resources to either minimize costs or maximize profits.
Genetic Algorithm: Mimics natural evolutionary processes (selection, crossover, mutation) to derive solutions to complex problems; see the first sketch below.
Linear Programming: Finds the values of decision variables that optimize a linear objective function subject to a set of linear constraints; see the second sketch below.
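As a rough sketch of the genetic algorithm idea, the example below evolves a population of candidate solutions toward the maximum of an invented one-variable objective, using selection, crossover, and mutation in their simplest forms.

```python
import random

random.seed(1)

def fitness(x):
    # Invented objective: maximize f(x) = 9 - (x - 3)^2 over [0, 6].
    return 9 - (x - 3) ** 2

population = [random.uniform(0, 6) for _ in range(20)]

for generation in range(50):
    # Selection: keep the fitter half of the population as parents.
    population.sort(key=fitness, reverse=True)
    parents = population[:10]
    # Crossover: each child blends two randomly chosen parents.
    children = [(random.choice(parents) + random.choice(parents)) / 2
                for _ in range(10)]
    # Mutation: a small random nudge preserves diversity; clamp to [0, 6].
    children = [min(6.0, max(0.0, c + random.gauss(0, 0.1))) for c in children]
    population = parents + children

print("best solution found:", max(population, key=fitness))  # approx. 3.0
```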
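And here is a small linear programming example, sketched with SciPy's linprog: a hypothetical two-product mix with invented profits and resource limits. Note that linprog minimizes, so the profit coefficients are negated.

```python
from scipy.optimize import linprog

# Hypothetical product mix: maximize profit 3*x1 + 5*x2.
# linprog minimizes, so the profit coefficients are negated.
c = [-3, -5]

# Constraints (invented resource limits):
#   x1           <= 4   (machine A hours)
#        2*x2    <= 12  (machine B hours)
#   3*x1 + 2*x2  <= 18  (labor hours)
A_ub = [[1, 0], [0, 2], [3, 2]]
b_ub = [4, 12, 18]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("optimal production plan:", result.x)  # units of each product
print("maximum profit:", -result.fun)        # undo the negation
```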
Simulation Techniques
Functionality: Emulates real-world systems' responses to varying inputs.
Scenario Analysis: Projects future values based on selected events.
Monte Carlo Simulation: Evaluates numerous potential outcomes, factoring in multiple influencing variables.
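A minimal Monte Carlo sketch: estimating the probability that a project overruns its budget when two cost components are uncertain. The distributions and figures are invented for the example.

```python
import random

random.seed(42)
trials = 100_000
budget = 100.0
overruns = 0

for _ in range(trials):
    # Two uncertain cost components with invented distributions.
    labor = random.gauss(60, 10)        # mean 60, standard deviation 10
    materials = random.uniform(25, 45)  # equally likely between 25 and 45
    if labor + materials > budget:
        overruns += 1

# The fraction of simulated outcomes exceeding the budget approximates
# the true probability of a cost overrun.
print("estimated P(cost > budget):", overruns / trials)
```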
Text and Video Analysis
Text Analysis: Aims to derive insights from large volumes of unstructured text data.
Video Analysis: Seeks to extract information from video footage to support decision making.
Self-Service Analytics
Objective: Provides users with training, tools, and techniques to independently analyze data.
Advantages:
Facilitates access to valuable data for end users.
Encourages data-driven decision making.
Aids in addressing the shortage of data scientists.