Data Warehousing and Data Mining - Study Notes

Basic Concepts

  • Data Warehouse (DW): a data repository maintained separately from an organization's operational databases. It enables integration of multiple application systems and provides a consolidated historic data platform for analysis.
  • William H. Inmon definition (foundation):
    • A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.
    • Key distinction keywords: subject-oriented, integrated, time-variant, nonvolatile.
    • These four keywords separate data warehouses from generic data repositories like relational databases, transaction processing systems, and files.
  • Four distinguishing keywords (as above) summarized:
    • Subject-oriented
    • Integrated
    • Time-variant
    • Nonvolatile
  • Subject-oriented: organized around major subjects (customers, suppliers, products, sales) to support decision making; excludes irrelevant data.
  • Integrated: data from multiple heterogeneous sources (relational DBs, flat files, OLTP, etc.) cleaned and harmonized for consistency in naming, encoding, measures, and attributes.
  • Time-variant: stored with a historical perspective (e.g., 5–10 years); each key structure includes a time element to enable trend discovery.
  • Nonvolatile: DW is a physically separate store; once data are loaded, they are not overwritten by transaction processing data; typically only loading and querying operations are performed.
  • Important note on nonvolatile behavior: previous data are not erased when new data are added.

Data Warehouse Architecture and Components

  • Data warehouse serves as a platform for analysis and decision support, contrasting with OLTP systems which support day-to-day transactions.
  • How organizations use DW information: support decision-making across multiple areas such as
    • increasing customer focus (analyzing buying patterns, timing, budgets, spending tendencies)
    • repositioning products and managing portfolios (compare sales by quarter, year, region to tune production)
    • analyzing operations for profit sources
    • managing customer relationships, environmental corrections, and corporate asset costs
  • DW vs OLTP: key differences include users (knowledge workers vs clerks/IT), data contents (historic/aggregated vs current detailed data), design (star/snowflake vs ER models), and access patterns (read-heavy, ad-hoc queries vs frequent transactions).

Data Warehouse Models, Schemas, and Architecture

  • Enterprise Warehouse: an enterprise-wide repository that aggregates information about all subjects, providing corporate-wide data integration.

  • Data Mart: a subset of DW focusing on a specific group or subject area (e.g., customer, item, sales); scope is narrower and tailored to particular user needs.

  • Virtual Warehouse: a set of views over operational databases; efficient for query processing but may require excess capacity on operational servers.

  • Independent vs dependent data marts: independent marts are stand-alone without a DW; dependent marts source data from an enterprise DW.

  • Three-Tier Data Warehouse Architecture:

    • Top tier (front-end/tools): query/reporting, analysis, data mining tools for end users.
    • Middle tier (OLAP server): handles multidimensional analysis; can be realized as:
    • ROLAP: Relational OLAP, maps multidimensional operations to standard relational DBMS operations.
    • MOLAP: Multidimensional OLAP, specialized server implementing multidimensional data and operations directly.
    • Bottom tier (data warehouse server): DW database (usually a relational DBMS); back-end tools extract, clean, transform, load data from operational and external sources; metadata repository present.
  • ETL (Extraction, Transformation, Loading) and refresh cycle:

    • Extraction: gather data from multiple heterogeneous sources.
    • Cleaning: detect and rectify data quality issues.
    • Transformation: convert legacy/host formats to the warehouse format; merge and reconcile data.
    • Load: sort, summarize, consolidate, compute views, build indices/partitions.
    • Refresh: propagate updates from sources to the warehouse.
  • Gateways: application interfaces that generate SQL code to be executed at the server; gateways are supported by the DBMS.

  • Metadata Repository: stores information about the warehouse and its contents, including structure, lineage, schedules, and performance data.

Data Warehouse Modeling: Data Cube and OLAP

  • Multidimensional data model underpins DW and OLAP tools; data are viewed as a data cube with multiple dimensions around a central theme (facts).
  • Dimensions: perspectives/entities (e.g., time, item, branch, location) used to organize data.
  • Facts: numeric measures (e.g., dollarssold, unitssold) that are analyzed across dimensions.
  • Example: a sales data warehouse with dimensions time, item, branch, location and measures dollarssold, unitssold.
  • Central theme example: sales; data organized around this fact table with foreign keys to each dimension and the measures.
  • Data cube views can be 2D, 3D, or higher-dimensional representations; a lattice of cuboids represents various levels of summarization across dimensions.
  • A 4-D data cube example includes time, item, location, and supplier as dimensions.
  • Cuboids and summarization:
    • Each cuboid represents a level of summarization for a subset of dimensions.
    • Lattice of cuboids forms the data cube; the base cuboid corresponds to the most detailed level, while the apex cuboid (0-D) corresponds to the most generalized level (often denoted as all).
    • Apex cuboid: 0-D, contains the total sum across all dimensions.
    • Base cuboid: most detailed, lowest level of granularity.
    • Notation: apex cuboid is sometimes denoted as all.
  • Data cube operations enable interactive querying and analysis across views and levels of granularity.

OLAP Schemas: Star, Snowflake, and Constellations

  • Star Schema: central fact table surrounded by dimension tables; dimension tables are denormalized; simple and fast for queries.
    • Fact table contains foreign keys to dimensions and measures (e.g., dollarssold, unitssold).
    • Dimension tables contain attributes for each dimension (e.g., time, item, location, branch).
  • Snowflake Schema: dimension tables are normalized into multiple related tables, reducing redundancy but requiring more joins and potentially slower queries.
    • Example: item dimension expands to itemkey, itemname, brand, type, supplierkey; supplierkey links to a supplier dimension table.
    • Location dimension may split into location and city tables, with city_key linking to city details.
  • Fact Constellation (Galaxy) Schema: multiple fact tables share dimension tables; supports modeling of interrelated subjects.
    • Example: sales fact table and shipping fact table share time, item, and location dimensions; both can co-exist in a single schema.
  • Data Marts and Data Warehouses: data marts typically use star or snowflake schemas focused on single subject area; data warehouses use one or more of the above schemas to model enterprise data.

Dimensions and Concept Hierarchies

  • Concept hierarchy: a sequence of mappings from low-level concepts to higher-level, more general concepts, enabling roll-up and drill-down operations.
  • Location example: city values (Vancouver, Toronto, New York, Chicago) map to provinces/states (British Columbia, Ontario, Illinois, etc.), which in turn map to countries (Canada, United States).
  • Implicit hierarchies often exist in schema (e.g., location attributes: street < city < province/state < country).
  • Time dimension lattice: day < month < quarter < year, and possibly week < year in a lattice structure.
  • Schema hierarchy: total or partial order among attributes forming a concept hierarchy; may be defined manually by domain experts or automatically generated by the system.
  • Discretization/grouping hierarchies: discretizing values into groups (e.g., price intervals [$X, Y]) to create higher-level groupings.

OLAP Operations

  • Multidimensional model enables viewing data from multiple perspectives; OLAP data cube operations materialize these views for interactive analysis.
  • Example cube: dimensions location, time, and item; aggregations could be:
    • location aggregated by city values
    • time aggregated by quarters
    • item aggregated by item types
  • Roll-up (drill-up): aggregate data along a dimension by climbing up a hierarchy or reducing dimensions.
    • Example: roll-up from city to country in the location dimension.
    • Note: dimension reduction removes one or more dimensions from the cube.
  • Drill-down: reverse of roll-up; navigate from less detailed to more detailed data by stepping down a hierarchy or adding dimensions.
    • Example: drill-down from quarter to month in the time dimension to view monthly sales.
  • Slice and Dice: select a subcube by applying a filter on one or more dimensions.
  • Pivot (Rotate): change the orientation of the data axes to obtain alternative views for visualization.
  • Other OLAP operations: drill-across (across multiple fact tables), drill-through (drill through to back-end relational tables via SQL), ranking (top N or bottom N items), moving averages, growth rates, currency conversions, and various statistical functions.

Data Mining, OLAP Integration, and Examples

  • Data mining is often discussed alongside DW and OLAP as complementary to decision support; the provided content focuses on DW/OLAP concepts and operations with examples and figures.
  • Example figures and snapshots illustrate OLAP operations on multidimensional data, including 2D/3D views and pivoting across dimensions like time, location, and item.

Practical Details and Real-World Relevance

  • Data extraction and loading often leverage gateways and back-end utilities to interface with different data sources (operational databases, external sources, etc.).
  • Metadata plays a crucial role in governance, data lineage, data quality, performance tuning, and maintaining the semantics of the warehouse content.
  • The architecture supports scalability via a three-tier approach, enabling separation of concerns between data storage, analytical processing, and user interface components.
  • The decision support emphasis emphasizes the value of historical data and multidimensional views to identify trends, support planning, optimize processes, and improve customer relationships.

Key Terms and Concepts (Glossary)

  • Data Warehouse (DW)
  • OLTP vs OLAP
  • Subject-oriented, Integrated, Time-variant, Nonvolatile (Inmon definition)
  • Integrated data cleaning and data integration
  • Time-variant data and historical perspective
  • Nonvolatile storage and separation from operational data
  • ETL: Extraction, Transformation, Loading, and Refresh
  • Gateways (SQL generation for the server)
  • Metadata Repository (structure, lineage, performance, business metadata)
  • Enterprise Data Warehouse (EDW)
  • Data Mart, Independent vs Dependent Marts
  • Virtual Warehouse (views over operational data)
  • Star Schema, Snowflake Schema, Fact Constellation (Galaxy)
  • Dimensions, Facts, and Measures (e.g., time, item, location; dollarssold, unitssold)
  • Concept Hierarchies (total/partial order, lattice, group-by hierarchies)
  • Data Cube, Apex Cuboid (0-D, all), Base Cuboid (most detailed)
  • OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot; Drill-across, Drill-through

Summary of Notable Formulas and Notations

  • Inmon definition terms (for quick recall):
    • Subject-oriented, Integrated, Time-variant, Nonvolatile
  • Data cube and cuboids:
    • Apex cuboid: 0-D; denoted as all; total aggregation across all dimensions.
    • Base cuboid: most detailed level; highest granularity for all dimensions.
    • A lattice of cuboids covers all combinations of dimensional subsets for summarization.
  • Example dimensional structures:
    • Time, Item, Location, Supplier as a 4-D cube; measures include dollarssold and unitssold.
    • Location hierarchy: street < city < province/state < country.
    • Time lattice: day < month < quarter < year; possibly week < year.
  • OLAP operations in notation:
    • Roll-up: aggregate along a hierarchy or reduce dimensionality; e.g., roll-up from city to country.
    • Drill-down: expand details; e.g., from quarter to month.
    • Slice: select a single dimension to create a subcube.
    • Dice: select multiple dimensions to create a subcube.
    • Pivot: rotate axes for alternate presentation.

Connections to Foundational Principles and Real-World Applications

  • The DW approach supports decision making by enabling historic data analysis, trend detection, and performance assessment across time, geography, and product lines.
  • By structuring data into subject-oriented, integrated, time-variant, and nonvolatile repositories, organizations can achieve consistent analytics, cross-functional visibility, and scalable reporting.
  • The concept hierarchies and lattices enable flexible aggregation levels, supporting roll-ups for high-level summaries and drill-downs for in-depth investigation.
  • The architectural separation (data storage, processing, and presentation) aligns with best practices in information systems for scalability, maintainability, and governance.

References to Formed Examples in the Transcript (Illustrative)

  • 4-D data cube with dimensions: time, item, location, supplier; measures: dollarssold, unitssold.
  • Base cuboid vs apex cuboid concepts demonstrated in OLAP discussions.
  • Star versus snowflake examples: item dimension normalization leads to item and supplier tables; location normalization leads to location and city tables.
  • Example OLAP queries showcase filtering by location cities (e.g., New York, Vancouver) and time quarters (Q1, Q2) with item types such as home entertainment or computer.
  • Example OLAP operations visualized: roll-up from city to country; drill-down from quarters to months; slice by location and time; dice by multiple dimensions; pivot to reorient axes.
  • Metadata examples include data warehouse structure descriptions, data lineage, currency of data, monitoring and audit trails, data on performance and refresh schedules.

Quick Reference Table (conceptual)

  • DW vs OLTP: DW is for analysis; OLTP for transactions.
  • DW Models: Enterprise Warehouse, Data Mart, Virtual Warehouse
  • Schemas: Star, Snowflake, Fact Constellation
  • OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot; (plus) Drill-across, Drill-through, ranking, moving averages, currency conversions, etc.
  • ETL Cycle: Extract → Clean → Transform → Load → Refresh
  • Metadata: structure, lineage, currency, monitoring, business terms, data ownership, charging policies