Data Warehousing and Data Mining - Study Notes

Basic Concepts

Data Warehouse (DW): a data repository maintained separately from an organization's operational databases. It enables integration of multiple application systems and provides a consolidated historic data platform for analysis.
William H. Inmon definition (foundation):
- A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.
- Key distinction keywords: subject-oriented, integrated, time-variant, nonvolatile.
- These four keywords separate data warehouses from generic data repositories like relational databases, transaction processing systems, and files.
Four distinguishing keywords (as above) summarized:
- Subject-oriented
- Integrated
- Time-variant
- Nonvolatile
Subject-oriented: organized around major subjects (customers, suppliers, products, sales) to support decision making; excludes irrelevant data.
Integrated: data from multiple heterogeneous sources (relational DBs, flat files, OLTP, etc.) cleaned and harmonized for consistency in naming, encoding, measures, and attributes.
Time-variant: stored with a historical perspective (e.g., 5–10 years); each key structure includes a time element to enable trend discovery.
Nonvolatile: DW is a physically separate store; once data are loaded, they are not overwritten by transaction processing data; typically only loading and querying operations are performed.
Important note on nonvolatile behavior: previous data are not erased when new data are added.

Data Warehouse Architecture and Components

Data warehouse serves as a platform for analysis and decision support, contrasting with OLTP systems which support day-to-day transactions.
How organizations use DW information: support decision-making across multiple areas such as
- increasing customer focus (analyzing buying patterns, timing, budgets, spending tendencies)
- repositioning products and managing portfolios (compare sales by quarter, year, region to tune production)
- analyzing operations for profit sources
- managing customer relationships, environmental corrections, and corporate asset costs
DW vs OLTP: key differences include users (knowledge workers vs clerks/IT), data contents (historic/aggregated vs current detailed data), design (star/snowflake vs ER models), and access patterns (read-heavy, ad-hoc queries vs frequent transactions).

Data Warehouse Models, Schemas, and Architecture

Enterprise Warehouse: an enterprise-wide repository that aggregates information about all subjects, providing corporate-wide data integration.
Data Mart: a subset of DW focusing on a specific group or subject area (e.g., customer, item, sales); scope is narrower and tailored to particular user needs.
Virtual Warehouse: a set of views over operational databases; efficient for query processing but may require excess capacity on operational servers.
Independent vs dependent data marts: independent marts are stand-alone without a DW; dependent marts source data from an enterprise DW.
Three-Tier Data Warehouse Architecture:
- Top tier (front-end/tools): query/reporting, analysis, data mining tools for end users.
- Middle tier (OLAP server): handles multidimensional analysis; can be realized as:
- ROLAP: Relational OLAP, maps multidimensional operations to standard relational DBMS operations.
- MOLAP: Multidimensional OLAP, specialized server implementing multidimensional data and operations directly.
- Bottom tier (data warehouse server): DW database (usually a relational DBMS); back-end tools extract, clean, transform, load data from operational and external sources; metadata repository present.
ETL (Extraction, Transformation, Loading) and refresh cycle:
- Extraction: gather data from multiple heterogeneous sources.
- Cleaning: detect and rectify data quality issues.
- Transformation: convert legacy/host formats to the warehouse format; merge and reconcile data.
- Load: sort, summarize, consolidate, compute views, build indices/partitions.
- Refresh: propagate updates from sources to the warehouse.
Gateways: application interfaces that generate SQL code to be executed at the server; gateways are supported by the DBMS.
Metadata Repository: stores information about the warehouse and its contents, including structure, lineage, schedules, and performance data.

Data Warehouse Modeling: Data Cube and OLAP

Multidimensional data model underpins DW and OLAP tools; data are viewed as a data cube with multiple dimensions around a central theme (facts).
Dimensions: perspectives/entities (e.g., time, item, branch, location) used to organize data.
Facts: numeric measures (e.g., dollarssold, unitssold) that are analyzed across dimensions.
Example: a sales data warehouse with dimensions time, item, branch, location and measures dollarssold, unitssold.
Central theme example: sales; data organized around this fact table with foreign keys to each dimension and the measures.
Data cube views can be 2D, 3D, or higher-dimensional representations; a lattice of cuboids represents various levels of summarization across dimensions.
A 4-D data cube example includes time, item, location, and supplier as dimensions.
Cuboids and summarization:
- Each cuboid represents a level of summarization for a subset of dimensions.
- Lattice of cuboids forms the data cube; the base cuboid corresponds to the most detailed level, while the apex cuboid (0-D) corresponds to the most generalized level (often denoted as all).
- Apex cuboid: 0-D, contains the total sum across all dimensions.
- Base cuboid: most detailed, lowest level of granularity.
- Notation: apex cuboid is sometimes denoted as all.
Data cube operations enable interactive querying and analysis across views and levels of granularity.

OLAP Schemas: Star, Snowflake, and Constellations

Star Schema: central fact table surrounded by dimension tables; dimension tables are denormalized; simple and fast for queries.
- Fact table contains foreign keys to dimensions and measures (e.g., dollarssold, unitssold).
- Dimension tables contain attributes for each dimension (e.g., time, item, location, branch).
Snowflake Schema: dimension tables are normalized into multiple related tables, reducing redundancy but requiring more joins and potentially slower queries.
- Example: item dimension expands to itemkey, itemname, brand, type, supplierkey; supplierkey links to a supplier dimension table.
- Location dimension may split into location and city tables, with city_key linking to city details.
Fact Constellation (Galaxy) Schema: multiple fact tables share dimension tables; supports modeling of interrelated subjects.
- Example: sales fact table and shipping fact table share time, item, and location dimensions; both can co-exist in a single schema.
Data Marts and Data Warehouses: data marts typically use star or snowflake schemas focused on single subject area; data warehouses use one or more of the above schemas to model enterprise data.

Dimensions and Concept Hierarchies

Concept hierarchy: a sequence of mappings from low-level concepts to higher-level, more general concepts, enabling roll-up and drill-down operations.
Location example: city values (Vancouver, Toronto, New York, Chicago) map to provinces/states (British Columbia, Ontario, Illinois, etc.), which in turn map to countries (Canada, United States).
Implicit hierarchies often exist in schema (e.g., location attributes: street < city < province/state < country).
Time dimension lattice: day < month < quarter < year, and possibly week < year in a lattice structure.
Schema hierarchy: total or partial order among attributes forming a concept hierarchy; may be defined manually by domain experts or automatically generated by the system.
Discretization/grouping hierarchies: discretizing values into groups (e.g., price intervals [$X, Y]) to create higher-level groupings.

OLAP Operations

Multidimensional model enables viewing data from multiple perspectives; OLAP data cube operations materialize these views for interactive analysis.
Example cube: dimensions location, time, and item; aggregations could be:
- location aggregated by city values
- time aggregated by quarters
- item aggregated by item types
Roll-up (drill-up): aggregate data along a dimension by climbing up a hierarchy or reducing dimensions.
- Example: roll-up from city to country in the location dimension.
- Note: dimension reduction removes one or more dimensions from the cube.
Drill-down: reverse of roll-up; navigate from less detailed to more detailed data by stepping down a hierarchy or adding dimensions.
- Example: drill-down from quarter to month in the time dimension to view monthly sales.
Slice and Dice: select a subcube by applying a filter on one or more dimensions.
Pivot (Rotate): change the orientation of the data axes to obtain alternative views for visualization.
Other OLAP operations: drill-across (across multiple fact tables), drill-through (drill through to back-end relational tables via SQL), ranking (top N or bottom N items), moving averages, growth rates, currency conversions, and various statistical functions.

Data Mining, OLAP Integration, and Examples

Data mining is often discussed alongside DW and OLAP as complementary to decision support; the provided content focuses on DW/OLAP concepts and operations with examples and figures.
Example figures and snapshots illustrate OLAP operations on multidimensional data, including 2D/3D views and pivoting across dimensions like time, location, and item.

Practical Details and Real-World Relevance

Data extraction and loading often leverage gateways and back-end utilities to interface with different data sources (operational databases, external sources, etc.).
Metadata plays a crucial role in governance, data lineage, data quality, performance tuning, and maintaining the semantics of the warehouse content.
The architecture supports scalability via a three-tier approach, enabling separation of concerns between data storage, analytical processing, and user interface components.
The decision support emphasis emphasizes the value of historical data and multidimensional views to identify trends, support planning, optimize processes, and improve customer relationships.

Key Terms and Concepts (Glossary)

Data Warehouse (DW)
OLTP vs OLAP
Subject-oriented, Integrated, Time-variant, Nonvolatile (Inmon definition)
Integrated data cleaning and data integration
Time-variant data and historical perspective
Nonvolatile storage and separation from operational data
ETL: Extraction, Transformation, Loading, and Refresh
Gateways (SQL generation for the server)
Metadata Repository (structure, lineage, performance, business metadata)
Enterprise Data Warehouse (EDW)
Data Mart, Independent vs Dependent Marts
Virtual Warehouse (views over operational data)
Star Schema, Snowflake Schema, Fact Constellation (Galaxy)
Dimensions, Facts, and Measures (e.g., time, item, location; dollarssold, unitssold)
Concept Hierarchies (total/partial order, lattice, group-by hierarchies)
Data Cube, Apex Cuboid (0-D, all), Base Cuboid (most detailed)
OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot; Drill-across, Drill-through

Summary of Notable Formulas and Notations

Inmon definition terms (for quick recall):
- Subject-oriented, Integrated, Time-variant, Nonvolatile
Data cube and cuboids:
- Apex cuboid: 0-D; denoted as all; total aggregation across all dimensions.
- Base cuboid: most detailed level; highest granularity for all dimensions.
- A lattice of cuboids covers all combinations of dimensional subsets for summarization.
Example dimensional structures:
- Time, Item, Location, Supplier as a 4-D cube; measures include dollarssold and unitssold.
- Location hierarchy: street < city < province/state < country.
- Time lattice: day < month < quarter < year; possibly week < year.
OLAP operations in notation:
- Roll-up: aggregate along a hierarchy or reduce dimensionality; e.g., roll-up from city to country.
- Drill-down: expand details; e.g., from quarter to month.
- Slice: select a single dimension to create a subcube.
- Dice: select multiple dimensions to create a subcube.
- Pivot: rotate axes for alternate presentation.

Connections to Foundational Principles and Real-World Applications

The DW approach supports decision making by enabling historic data analysis, trend detection, and performance assessment across time, geography, and product lines.
By structuring data into subject-oriented, integrated, time-variant, and nonvolatile repositories, organizations can achieve consistent analytics, cross-functional visibility, and scalable reporting.
The concept hierarchies and lattices enable flexible aggregation levels, supporting roll-ups for high-level summaries and drill-downs for in-depth investigation.
The architectural separation (data storage, processing, and presentation) aligns with best practices in information systems for scalability, maintainability, and governance.

References to Formed Examples in the Transcript (Illustrative)

4-D data cube with dimensions: time, item, location, supplier; measures: dollarssold, unitssold.
Base cuboid vs apex cuboid concepts demonstrated in OLAP discussions.
Star versus snowflake examples: item dimension normalization leads to item and supplier tables; location normalization leads to location and city tables.
Example OLAP queries showcase filtering by location cities (e.g., New York, Vancouver) and time quarters (Q1, Q2) with item types such as home entertainment or computer.
Example OLAP operations visualized: roll-up from city to country; drill-down from quarters to months; slice by location and time; dice by multiple dimensions; pivot to reorient axes.
Metadata examples include data warehouse structure descriptions, data lineage, currency of data, monitoring and audit trails, data on performance and refresh schedules.

Quick Reference Table (conceptual)

DW vs OLTP: DW is for analysis; OLTP for transactions.
DW Models: Enterprise Warehouse, Data Mart, Virtual Warehouse
Schemas: Star, Snowflake, Fact Constellation
OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot; (plus) Drill-across, Drill-through, ranking, moving averages, currency conversions, etc.
ETL Cycle: Extract → Clean → Transform → Load → Refresh
Metadata: structure, lineage, currency, monitoring, business terms, data ownership, charging policies