Ch1_Introduction_to_SW
Course Information
Course Resources
Blackboard is the main platform for class syllabi, project descriptions, slides, announcements, etc.
Email will also be utilized for announcements.
Course Evaluation
Midterm exam: 30%
Project: 30%
Final exam: 40%
Recommended Literature
Essential Texts
The Data Warehouse Toolkit by Ralph Kimball & Margy Ross
Building the Data Warehouse by William H. Inmon
Data Warehouse Systeme by Andreas Bauer & Holger Günzel
Further Readings
The Data Warehouse ETL Toolkit by Ralph Kimball & Joe Caserta
OLAP Solutions by Erik Thomsen
Data Warehouses and OLAP by Robert Wrembel & Christian Koncilia
Introduction to Data Warehousing
Why a Data Warehouse is Needed
Definition of a Data Warehouse
Core Concepts in Data Warehousing
Applications and Users of Data Warehouses
Importance of Data
Data may soon be more important than software (Tim O'Reilly)
Data is precious and lasts longer than systems (Tim Berners-Lee)
Information is likened to oil in the 21st century (Peter Sondergaard)
Data lives indefinitely beneath applications (Tom Kyte)
Major Challenges for Effective Decision Support
Distributed Data
Problem: Data exists across various systems and technologies.
Solution: Consolidate data into a single system for analysis.
Different Data Structures
Problem: Inconsistencies across systems regarding data types and modeling.
Solution: Implement a dedicated system to standardize data.
Issues with Historic Data
Problem: Data is often deleted or archived after a short period.
Solution: Store all changes in a system suited for large data volumes.
System Workload Challenges
Problem: Systems are not optimized for analytical workloads, risking performance issues.
Solution: Use dedicated systems designed for complex queries on large data sets.
What is a Data Warehouse?
A data warehouse is a system where data is stored and analyzed to inform decisions.
It is specifically designed to support business intelligence activities.
Characteristics of a Data Warehouse
Large Repository
A data warehouse is a large-scale repository of organizational data.
It begins around 1 TB and can extend to several PB across multiple servers.
Data Cleansing and Snapshots
Contains historical snapshots of operational data, obtained through ETL processes.
Analytical Focus
Tailored for tactical and strategic reporting, capturing changes over time.
Definitions by Experts
Ralph Kimball:
Describes a data warehouse as a structured copy of transactional data for queries and analysis.
Bill Inmon:
Defines it as a subject-oriented, integrated, non-volatile, time-variant collection of data for decision support.
Functional Characteristics
Subject Oriented:
Data is organized around real-world events or objects (e.g., Customer, Product).
Integrated:
Data sourced from various operational systems, harmonized for consistency.
Non-Volatile:
Data is static and retained for future reporting; once committed, data isn't deleted or changed directly.
Time-Variant:
Tracks changes, enabling historical reporting and analysis over time.
Utilization of Data Warehouses
OLTP (Online Transaction Processing):
Used for operational data entry and transaction processing, reflecting current states.
OLAP (Online Analytical Processing):
Used for reporting and decision-making analytics based on DW data.
Typical Applications of Data Warehousing
Investment and Insurance
Market analysis and customer behavior insights.
Retail
Monitoring products, pricing strategies, and purchasing patterns.
Healthcare
Generating treatment reports and collaborating with insurance and research departments.
Education
Real-time analytics for informed decision-making regarding faculty and students.
Banking
Monitoring resources and assessing product performance.
Manufacturing and Distribution
Consolidating data for forecasting and trend analysis.
Government
Analyzing tax information and health records.
Telecommunications
Informing distribution and sales strategies.