Chapter 6: Business Intelligence: Big Data & Analytics

Business Intelligence (BI)

  • Definition: Business Intelligence (BI) is defined as a set of technological processes designed for collecting, managing, and analyzing organizational data, resulting in insights that inform business strategies and operations (Source: IBM).

Big Data & Analytics

  • Definition: Big data analytics is a form of advanced analytics involving complex applications that feature:

    • Predictive models

    • Statistical algorithms

    • What-if analyses powered by analytics systems.

Importance of Learning About Big Data

  • Reasons to Learn:

    • New data is constantly generated from countless sources.

    • Approximately one zettabyte of data is produced per year:

      • Equivalent to 1 trillion gigabytes, or a 1 followed by 21 zeros.

    • Necessitates analysis of vast quantities of data to:

      • Measure past and current performance.

      • Predict future trends and outcomes.

      • Drive anticipatory business actions.

      • Improve business strategies and operations.

      • Enrich decision-making processes.

      • Enhance competitive advantage.

Characteristics of Big Data

  1. Volume: Refers to a large amount of data.

  2. Velocity: Represents the speed of data generation and how quickly it is processed.

  3. Value: Refers to the magnitude of importance or worth of data.

  4. Variety: Denotes the different forms or types of data involved.

  5. Veracity: Indicates the accuracy and truthfulness of the data.

Sources of Useful Data

  • Organizations accumulate useful data from various sources.

Technologies for Managing and Processing Big Data

  • Different technologies and systems are employed to handle big data:

    • Data Warehouses: Serve as central repositories for data collected from various sources.

    • ETL Process: The Extract, Transform, Load process used to move data from source systems into a warehouse.

    • Data Marts: Subsets of data warehouses tailored for specific business areas.

    • Data Lakes: Storage systems that maintain all data in its raw form.

    • NoSQL Databases: Non-relational databases that store data outside of fixed tabular relations.

    • Hadoop: An open-source framework designed to process large datasets.

    • In-Memory Databases: Databases that store all data in RAM for faster access.

Online Transaction Processing (OLTP)

  • Definition: OLTP systems are traditionally used to capture transaction data, but they offer little analytical capability on their own.

  • Examples include:

    • ATMs

    • Financial transaction systems

    • Online banking

    • Reservation systems (booking and ticketing).

  • Data Warehousing: Facilitates decision-making by providing access to OLTP data. Data is extracted from OLTP systems, then transformed and loaded into the warehouse via ETL processes.
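
Below is a minimal, hedged sketch of that ETL flow, using Python's built-in sqlite3 module. The table names, columns, and figures are invented for illustration; real warehouses use dedicated ETL tooling and far larger volumes.

```python
import sqlite3

# Hypothetical example: extract rows from an OLTP source, transform them
# (derive a revenue column), and load the result into a warehouse table.
oltp = sqlite3.connect(":memory:")        # stand-in for the OLTP system
warehouse = sqlite3.connect(":memory:")   # stand-in for the data warehouse

oltp.execute("CREATE TABLE sales (id INTEGER, qty INTEGER, unit_price REAL)")
oltp.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 3, 9.99), (2, 1, 24.50), (3, 5, 2.00)])

# Extract
rows = oltp.execute("SELECT id, qty, unit_price FROM sales").fetchall()

# Transform: compute revenue per transaction
transformed = [(sale_id, qty * price) for sale_id, qty, price in rows]

# Load
warehouse.execute("CREATE TABLE fact_sales (sale_id INTEGER, revenue REAL)")
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?)", transformed)

print(warehouse.execute("SELECT SUM(revenue) FROM fact_sales").fetchone())
```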

Comparison of Data Structures

  1. Data Warehouse:

    • Extensive database that integrates data from various sources across an organization, encompassing all processes, products, and customer information.

  2. Data Mart:

    • A focused subset of a data warehouse, beneficial for smaller businesses or departments, optimized for specific decision-making needs.

  3. Data Lake:

    • Embraces a "store everything" strategy in which all data is kept in its raw, unprocessed format.

NoSQL Databases

  • Differ from traditional relational databases by:

    • Structuring data without fixed table relations, allowing for flexible data management.

    • Utilizing horizontal scaling and not requiring predefined schemas.

    • Offering improved speed and redundancy.

  • Types of NoSQL Databases:

    • Document: Stores data as self-describing documents (e.g., JSON).

    • Graph: Manages highly interconnected data as nodes and edges.

    • Key-Value: Organizes data as key-value pairs.

    • Wide-column: Stores data in flexible columns grouped into column families, providing high flexibility (e.g., Cassandra, HBase).

  • Examples:

    • Document databases include MongoDB, CouchDB; graph databases include Neo4j.
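
To make the document and key-value models above concrete, here is a minimal sketch using plain Python dictionaries as a stand-in; it is not the API of MongoDB or any real NoSQL product, and the customer records are invented.

```python
# Each record is a self-describing document, so two customer documents
# need not share the same fields or a predefined schema.
customers = {
    "cust:1001": {"name": "Avery", "email": "avery@example.com"},
    "cust:1002": {"name": "Blake", "loyalty_tier": "gold",
                  "addresses": [{"city": "Oslo"}, {"city": "Bergen"}]},
}

# Key-value style access: fetch a whole document by its key.
print(customers["cust:1002"]["loyalty_tier"])  # -> gold
```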

Hadoop Framework

  • Definition: Hadoop is an open-source software framework that provides modules for the storage and processing of vast data sets.

  • Hadoop Distributed File System (HDFS):

    • A distributed file system used for data storage. It splits data into blocks and distributes them across multiple servers so they can be processed in parallel.

MapReduce Program

  • Functionality: A programming model that consists of two procedures:

    • Map Procedure: Filters and sorts the input data.

    • Reduce Procedure: Performs summary operations on the filtered, sorted data.

  • Limitations: Processing capabilities are restricted to batch processing only.
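
The classic illustration of MapReduce is a word count. The single-process Python sketch below only mimics the shape of the map, shuffle, and reduce phases; in Hadoop these steps run in parallel across many nodes, and the sample documents are invented.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for later grouping by the framework
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: perform a summary operation (here, a sum) per key
    return word, sum(counts)

documents = ["big data needs big tools", "data drives decisions"]

# Shuffle: group intermediate pairs by key before reducing
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

print(dict(reduce_phase(w, c) for w, c in grouped.items()))
```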

In-Memory Databases (IMDB)

  • Storage Mechanism: Stores the complete database within RAM.

  • Advantages:

    • Significantly faster data retrieval compared to secondary storage solutions.

    • Well suited to big data analysis and other complex data-processing applications.

  • Enabling Factors:

    • Increase in RAM capacity.

    • Decrease in RAM costs.

  • IMDB Providers: Include Altibase (HDB), Oracle (TimesTen), SAP (HANA), and others.
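
As a small-scale stand-in for the commercial products listed above (which have their own engines and APIs), SQLite can hold an entire database in RAM; the sketch below uses its ":memory:" mode with made-up sensor readings.

```python
import sqlite3

# The whole database lives in RAM, so queries avoid disk I/O entirely.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)",
               [("s1", 21.4), ("s1", 22.0), ("s2", 19.8)])
print(db.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor").fetchall())
```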

Analytics and Business Intelligence (BI)

  • The extensive use of data and quantitative analysis supports fact-based decision-making within organizations.

  • Benefits:

    • Fraud detection

    • Improved forecasting

    • Enhanced sales performance

    • Operational optimizations

    • Cost reductions

Role of a Data Scientist

  • Skill Set: Combines business acumen, in-depth analytics knowledge, and an understanding of the limitations of tools and techniques.

  • Analysis Impact: Aims to improve decision-making significantly.

  • Job Market: Promising career prospects and rigorous educational requirements.

Required Components for Effective BI & Analytics

  1. A comprehensive data management program, inclusive of governance principles.

  2. Expertise from creative data scientists.

  3. Strong organizational commitment to data-driven decision-making.

Analytics Techniques

  • Types of Analysis include:

    • Descriptive Analysis

    • Predictive Analysis

    • Text Analysis

    • Video Analysis

    • Optimization Techniques

    • Simulation Analysis

Descriptive Analysis

  • Primary Data Processing Stage: Acts as a preliminary step in data analysis, identifying patterns and answering critical questions about data.

  • Typical Questions Addressed:

    • Who, what, where, when, and to what extent are key queries.

  • Types of Descriptive Analysis:

    • Visual analytics

    • Regression analysis

  • Visual Analytics: Presentation of data through graphs or pictorial means (e.g., word clouds, conversion funnels).

  • Regression Analysis: Establishes relationships between dependent and independent variables, producing regression equations for prediction.
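
A minimal regression sketch, assuming NumPy is available: it fits a straight line to made-up advertising-spend and sales figures and uses the resulting equation for a prediction.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable, e.g. ad spend
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # dependent variable, e.g. sales

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares regression line
print(f"sales ~ {intercept:.2f} + {slope:.2f} * spend")
print("predicted sales at spend=6:", intercept + slope * 6)
```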

Predictive Analytics

  • Definition: Analyzes current and historical data to identify likely future trends and outcomes, enabling predictions.

  • Time Series Analysis: A statistical method for analyzing time-dependent data (a simple forecasting sketch follows this list).

  • Data Mining Techniques: Include association analysis, neural computing, and case-based reasoning to uncover hidden patterns and guide decision making.

  • CRISP-DM: The Cross Industry Standard Process for Data Mining, a widely used methodology that structures a data mining project into standard phases.
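
A deliberately simple time series sketch: it forecasts the next period as the average of the last three observations (a moving average). The monthly demand figures are invented, and real predictive work would typically use richer models such as exponential smoothing or ARIMA.

```python
demand = [120, 132, 128, 141, 150, 147, 155]   # hypothetical monthly demand

def moving_average_forecast(series, k=3):
    # Forecast the next value as the mean of the last k observations
    return sum(series[-k:]) / k

print("next-period forecast:", moving_average_forecast(demand))
```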

Optimization Techniques

  • Purpose: Allocates limited resources to either minimize costs or maximize profits.

  • Genetic Algorithm: Mimics natural evolutionary processes to derive solutions to complex problems.

  • Linear Programming: Finds the values of the decision variables that optimize (minimize or maximize) a linear objective function while satisfying a set of constraints.
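
A minimal linear programming sketch, assuming SciPy is installed: a made-up product-mix problem that maximizes profit subject to labor and material constraints. Because scipy.optimize.linprog minimizes, the objective is negated.

```python
from scipy.optimize import linprog

c = [-30, -20]            # negated profit per unit of products 1 and 2
A_ub = [[2, 1],           # labor hours consumed per unit
        [1, 3]]           # material consumed per unit
b_ub = [100, 90]          # available labor hours and material

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("units to produce:", res.x)
print("maximum profit:", -res.fun)
```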

Simulation Techniques

  • Functionality: Emulates real-world systems' responses to varying inputs.

  • Scenario Analysis: Projects future values based on selected events.

  • Monte Carlo Simulation: Evaluates numerous potential outcomes, factoring in multiple influencing variables.
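
A small Monte Carlo sketch using only the Python standard library: it estimates monthly profit when demand and unit cost are both uncertain. The price, distributions, and parameters are assumptions chosen purely to illustrate the technique.

```python
import random

PRICE = 25.0
profits = []
for _ in range(10_000):
    demand = random.gauss(1_000, 150)        # uncertain demand
    unit_cost = random.uniform(12.0, 18.0)   # uncertain unit cost
    profits.append(demand * (PRICE - unit_cost))

profits.sort()
print("mean profit:", sum(profits) / len(profits))
print("5th-percentile (downside) profit:", profits[int(0.05 * len(profits))])
```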

Text and Video Analysis

  • Text Analysis: Aims to derive insights from large volumes of unstructured text data (a small word-frequency sketch follows this list).

  • Video Analysis: Seeks to extract information from video footage to support decision making.
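
A tiny text-analysis sketch: it extracts the most frequent terms from unstructured feedback. The review strings are invented, and real text analytics would add stop-word removal, stemming, sentiment scoring, and similar steps.

```python
from collections import Counter
import re

reviews = [
    "Fast delivery, great support, will order again",
    "Support was slow but delivery was fast",
]
words = re.findall(r"[a-z]+", " ".join(reviews).lower())
print(Counter(words).most_common(5))
```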

Self-Service Analytics

  • Objective: Provides users with training, tools, and techniques to independently analyze data.

  • Advantages:

    • Facilitates access to valuable data for end users.

    • Encourages data-driven decision making.

    • Aids in addressing the shortage of data scientists.