Data Warehousing and Data Mining - Study Notes
Basic Concepts
- Data Warehouse (DW): a data repository maintained separately from an organization's operational databases. It enables integration of multiple application systems and provides a consolidated historic data platform for analysis.
- William H. Inmon definition (foundation):
- A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.
- Key distinction keywords: subject-oriented, integrated, time-variant, nonvolatile.
- These four keywords separate data warehouses from generic data repositories like relational databases, transaction processing systems, and files.
- Four distinguishing keywords (as above) summarized:
- Subject-oriented
- Integrated
- Time-variant
- Nonvolatile
- Subject-oriented: organized around major subjects (customers, suppliers, products, sales) to support decision making; excludes irrelevant data.
- Integrated: data from multiple heterogeneous sources (relational DBs, flat files, OLTP, etc.) cleaned and harmonized for consistency in naming, encoding, measures, and attributes.
- Time-variant: stored with a historical perspective (e.g., 5–10 years); each key structure includes a time element to enable trend discovery.
- Nonvolatile: DW is a physically separate store; once data are loaded, they are not overwritten by transaction processing data; typically only loading and querying operations are performed.
- Important note on nonvolatile behavior: previous data are not erased when new data are added.
Data Warehouse Architecture and Components
- Data warehouse serves as a platform for analysis and decision support, contrasting with OLTP systems which support day-to-day transactions.
- How organizations use DW information: support decision-making across multiple areas such as
- increasing customer focus (analyzing buying patterns, timing, budgets, spending tendencies)
- repositioning products and managing portfolios (compare sales by quarter, year, region to tune production)
- analyzing operations for profit sources
- managing customer relationships, environmental corrections, and corporate asset costs
- DW vs OLTP: key differences include users (knowledge workers vs clerks/IT), data contents (historic/aggregated vs current detailed data), design (star/snowflake vs ER models), and access patterns (read-heavy, ad-hoc queries vs frequent transactions).
Data Warehouse Models, Schemas, and Architecture
Enterprise Warehouse: an enterprise-wide repository that aggregates information about all subjects, providing corporate-wide data integration.
Data Mart: a subset of DW focusing on a specific group or subject area (e.g., customer, item, sales); scope is narrower and tailored to particular user needs.
Virtual Warehouse: a set of views over operational databases; efficient for query processing but may require excess capacity on operational servers.
Independent vs dependent data marts: independent marts are stand-alone without a DW; dependent marts source data from an enterprise DW.
Three-Tier Data Warehouse Architecture:
- Top tier (front-end/tools): query/reporting, analysis, data mining tools for end users.
- Middle tier (OLAP server): handles multidimensional analysis; can be realized as:
- ROLAP: Relational OLAP, maps multidimensional operations to standard relational DBMS operations.
- MOLAP: Multidimensional OLAP, specialized server implementing multidimensional data and operations directly.
- Bottom tier (data warehouse server): DW database (usually a relational DBMS); back-end tools extract, clean, transform, load data from operational and external sources; metadata repository present.
ETL (Extraction, Transformation, Loading) and refresh cycle:
- Extraction: gather data from multiple heterogeneous sources.
- Cleaning: detect and rectify data quality issues.
- Transformation: convert legacy/host formats to the warehouse format; merge and reconcile data.
- Load: sort, summarize, consolidate, compute views, build indices/partitions.
- Refresh: propagate updates from sources to the warehouse.
Gateways: application interfaces that generate SQL code to be executed at the server; gateways are supported by the DBMS.
Metadata Repository: stores information about the warehouse and its contents, including structure, lineage, schedules, and performance data.
Data Warehouse Modeling: Data Cube and OLAP
- Multidimensional data model underpins DW and OLAP tools; data are viewed as a data cube with multiple dimensions around a central theme (facts).
- Dimensions: perspectives/entities (e.g., time, item, branch, location) used to organize data.
- Facts: numeric measures (e.g., dollarssold, unitssold) that are analyzed across dimensions.
- Example: a sales data warehouse with dimensions time, item, branch, location and measures dollarssold, unitssold.
- Central theme example: sales; data organized around this fact table with foreign keys to each dimension and the measures.
- Data cube views can be 2D, 3D, or higher-dimensional representations; a lattice of cuboids represents various levels of summarization across dimensions.
- A 4-D data cube example includes time, item, location, and supplier as dimensions.
- Cuboids and summarization:
- Each cuboid represents a level of summarization for a subset of dimensions.
- Lattice of cuboids forms the data cube; the base cuboid corresponds to the most detailed level, while the apex cuboid (0-D) corresponds to the most generalized level (often denoted as all).
- Apex cuboid: 0-D, contains the total sum across all dimensions.
- Base cuboid: most detailed, lowest level of granularity.
- Notation: apex cuboid is sometimes denoted as all.
- Data cube operations enable interactive querying and analysis across views and levels of granularity.
OLAP Schemas: Star, Snowflake, and Constellations
- Star Schema: central fact table surrounded by dimension tables; dimension tables are denormalized; simple and fast for queries.
- Fact table contains foreign keys to dimensions and measures (e.g., dollarssold, unitssold).
- Dimension tables contain attributes for each dimension (e.g., time, item, location, branch).
- Snowflake Schema: dimension tables are normalized into multiple related tables, reducing redundancy but requiring more joins and potentially slower queries.
- Example: item dimension expands to itemkey, itemname, brand, type, supplierkey; supplierkey links to a supplier dimension table.
- Location dimension may split into location and city tables, with city_key linking to city details.
- Fact Constellation (Galaxy) Schema: multiple fact tables share dimension tables; supports modeling of interrelated subjects.
- Example: sales fact table and shipping fact table share time, item, and location dimensions; both can co-exist in a single schema.
- Data Marts and Data Warehouses: data marts typically use star or snowflake schemas focused on single subject area; data warehouses use one or more of the above schemas to model enterprise data.
Dimensions and Concept Hierarchies
- Concept hierarchy: a sequence of mappings from low-level concepts to higher-level, more general concepts, enabling roll-up and drill-down operations.
- Location example: city values (Vancouver, Toronto, New York, Chicago) map to provinces/states (British Columbia, Ontario, Illinois, etc.), which in turn map to countries (Canada, United States).
- Implicit hierarchies often exist in schema (e.g., location attributes: street < city < province/state < country).
- Time dimension lattice: day < month < quarter < year, and possibly week < year in a lattice structure.
- Schema hierarchy: total or partial order among attributes forming a concept hierarchy; may be defined manually by domain experts or automatically generated by the system.
- Discretization/grouping hierarchies: discretizing values into groups (e.g., price intervals [$X, Y]) to create higher-level groupings.
OLAP Operations
- Multidimensional model enables viewing data from multiple perspectives; OLAP data cube operations materialize these views for interactive analysis.
- Example cube: dimensions location, time, and item; aggregations could be:
- location aggregated by city values
- time aggregated by quarters
- item aggregated by item types
- Roll-up (drill-up): aggregate data along a dimension by climbing up a hierarchy or reducing dimensions.
- Example: roll-up from city to country in the location dimension.
- Note: dimension reduction removes one or more dimensions from the cube.
- Drill-down: reverse of roll-up; navigate from less detailed to more detailed data by stepping down a hierarchy or adding dimensions.
- Example: drill-down from quarter to month in the time dimension to view monthly sales.
- Slice and Dice: select a subcube by applying a filter on one or more dimensions.
- Pivot (Rotate): change the orientation of the data axes to obtain alternative views for visualization.
- Other OLAP operations: drill-across (across multiple fact tables), drill-through (drill through to back-end relational tables via SQL), ranking (top N or bottom N items), moving averages, growth rates, currency conversions, and various statistical functions.
Data Mining, OLAP Integration, and Examples
- Data mining is often discussed alongside DW and OLAP as complementary to decision support; the provided content focuses on DW/OLAP concepts and operations with examples and figures.
- Example figures and snapshots illustrate OLAP operations on multidimensional data, including 2D/3D views and pivoting across dimensions like time, location, and item.
Practical Details and Real-World Relevance
- Data extraction and loading often leverage gateways and back-end utilities to interface with different data sources (operational databases, external sources, etc.).
- Metadata plays a crucial role in governance, data lineage, data quality, performance tuning, and maintaining the semantics of the warehouse content.
- The architecture supports scalability via a three-tier approach, enabling separation of concerns between data storage, analytical processing, and user interface components.
- The decision support emphasis emphasizes the value of historical data and multidimensional views to identify trends, support planning, optimize processes, and improve customer relationships.
Key Terms and Concepts (Glossary)
- Data Warehouse (DW)
- OLTP vs OLAP
- Subject-oriented, Integrated, Time-variant, Nonvolatile (Inmon definition)
- Integrated data cleaning and data integration
- Time-variant data and historical perspective
- Nonvolatile storage and separation from operational data
- ETL: Extraction, Transformation, Loading, and Refresh
- Gateways (SQL generation for the server)
- Metadata Repository (structure, lineage, performance, business metadata)
- Enterprise Data Warehouse (EDW)
- Data Mart, Independent vs Dependent Marts
- Virtual Warehouse (views over operational data)
- Star Schema, Snowflake Schema, Fact Constellation (Galaxy)
- Dimensions, Facts, and Measures (e.g., time, item, location; dollarssold, unitssold)
- Concept Hierarchies (total/partial order, lattice, group-by hierarchies)
- Data Cube, Apex Cuboid (0-D, all), Base Cuboid (most detailed)
- OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot; Drill-across, Drill-through
Summary of Notable Formulas and Notations
- Inmon definition terms (for quick recall):
- Subject-oriented, Integrated, Time-variant, Nonvolatile
- Data cube and cuboids:
- Apex cuboid: 0-D; denoted as all; total aggregation across all dimensions.
- Base cuboid: most detailed level; highest granularity for all dimensions.
- A lattice of cuboids covers all combinations of dimensional subsets for summarization.
- Example dimensional structures:
- Time, Item, Location, Supplier as a 4-D cube; measures include dollarssold and unitssold.
- Location hierarchy: street < city < province/state < country.
- Time lattice: day < month < quarter < year; possibly week < year.
- OLAP operations in notation:
- Roll-up: aggregate along a hierarchy or reduce dimensionality; e.g., roll-up from city to country.
- Drill-down: expand details; e.g., from quarter to month.
- Slice: select a single dimension to create a subcube.
- Dice: select multiple dimensions to create a subcube.
- Pivot: rotate axes for alternate presentation.
Connections to Foundational Principles and Real-World Applications
- The DW approach supports decision making by enabling historic data analysis, trend detection, and performance assessment across time, geography, and product lines.
- By structuring data into subject-oriented, integrated, time-variant, and nonvolatile repositories, organizations can achieve consistent analytics, cross-functional visibility, and scalable reporting.
- The concept hierarchies and lattices enable flexible aggregation levels, supporting roll-ups for high-level summaries and drill-downs for in-depth investigation.
- The architectural separation (data storage, processing, and presentation) aligns with best practices in information systems for scalability, maintainability, and governance.
References to Formed Examples in the Transcript (Illustrative)
- 4-D data cube with dimensions: time, item, location, supplier; measures: dollarssold, unitssold.
- Base cuboid vs apex cuboid concepts demonstrated in OLAP discussions.
- Star versus snowflake examples: item dimension normalization leads to item and supplier tables; location normalization leads to location and city tables.
- Example OLAP queries showcase filtering by location cities (e.g., New York, Vancouver) and time quarters (Q1, Q2) with item types such as home entertainment or computer.
- Example OLAP operations visualized: roll-up from city to country; drill-down from quarters to months; slice by location and time; dice by multiple dimensions; pivot to reorient axes.
- Metadata examples include data warehouse structure descriptions, data lineage, currency of data, monitoring and audit trails, data on performance and refresh schedules.
Quick Reference Table (conceptual)
- DW vs OLTP: DW is for analysis; OLTP for transactions.
- DW Models: Enterprise Warehouse, Data Mart, Virtual Warehouse
- Schemas: Star, Snowflake, Fact Constellation
- OLAP Operations: Roll-up, Drill-down, Slice, Dice, Pivot; (plus) Drill-across, Drill-through, ranking, moving averages, currency conversions, etc.
- ETL Cycle: Extract → Clean → Transform → Load → Refresh
- Metadata: structure, lineage, currency, monitoring, business terms, data ownership, charging policies