DW_Lecture1.
Course Information
Course Resources
Blackboard is the primary platform for accessing essential course materials, including:
Class syllabus: outlines course schedule and policies.
Project description: provides guidelines and expectations for the project to be undertaken during the course.
Course slides: slides used in lectures for better understanding of the topics covered.
Turn project reports: facilitates the submission of project-related documents.
Announcements: course updates and important information will be communicated here.
Email will also serve as a communication channel for important announcements not covered on Blackboard.
Course Evaluation
The evaluation of students' performance will consist of the following components:
Midterm exam (30%): assesses understanding of the first half of the course material.
Project (30%): practical application of learned concepts to a real-world scenario.
Final exam (40%): a comprehensive evaluation of all course content covered.
Recommended Literature
Books
The Data Warehouse Toolkit by Ralph Kimball & Margy Ross
Building the Data Warehouse by William H. Inmon
Data Warehouse-Systeme by Andreas Bauer & Holger Günzel
The Data Warehouse ETL Toolkit by Ralph Kimball & Joe Caserta
OLAP Solutions by Erik Thomsen
Data Warehouses and OLAP by Robert Wrembel & Christian Koncilia
Introduction to Data Warehousing
Key Questions
Why do we need a data warehouse (DW)?
To consolidate data for effective analysis and reporting, eliminating the need for disparate systems.
What is a data warehouse?
A specialized database optimized for query and analysis.
Key concepts in data warehousing:
Understanding the difference between operational databases and analytical databases.
Applications and users of DW:
Ranging from corporate sectors to educational institutions, serving various functional needs.
Source Reference
Insights into internet data usage, specifically examining the implications of data consumption in an internet minute, highlighting the advancing digital footprint.
Perspectives on Data
Quotations
"We’re entering a new world in which data may be more important than software." – Tim O’Reilly
"Data is a precious thing and will last longer than the systems themselves." – Tim Berners-Lee
"Information is the oil of the 21st century." – Peter Sondergaard, Gartner
"Applications come, applications go. The data, however, lives forever... It really is about the data underneath these applications." – Tom Kyte, Oracle
Major Problems for Effective Decision Support
Issues Identified
Distributed Data:
Data is scattered across various systems, leading to inconsistency.
Solution: Accumulate data on a unified system for comprehensive analysis.
Different Data Structures:
Independent development of systems results in inconsistent data formats (e.g., zip codes, units).
Solution: A harmonization framework that standardizes data.
Historic Data:
Quick deletion or archiving of historical data restricts in-depth analysis.
Solution: Establish a centralized data storage solution supporting long-term retention for historical queries.
System Workload:
Increased workload can lead to system performance issues causing potential failures.
Solution: Implement dedicated systems optimized for complex analytical queries.
Why Data Warehousing?
Operative systems alone do not suffice for analytical purposes.
A separated system is crucial to prevent interference with daily business transactions and to maintain efficiency in operations.
Definition of a Data Warehouse (DW)
General Description
A data warehouse serves as a large repository designed for storing, organizing, and analyzing data to enhance decision-making processes.
Specifics of DW
Collective data repository: Incorporates snapshots of transactional data for evaluation.
Designed for reporting and long-term analysis: Typically starts at a size of 1TB, expanding to several PB to accommodate large volumes of data over time.
Key Characteristics
Subject-oriented:
Focus on specific subjects or events (e.g., customers, products) rather than on individual transactions.
Integrated:
Merges data from various operational systems, ensuring consistency in datasets.
Non-volatile:
Data remains static and read-only after being committed, with subsequent changes leading to the creation of new records.
Time-variant:
Records changes over time supporting solid historical reporting.
Key Definitions by Experts
Ralph Kimball: "A copy of transaction data specifically structured for query and analysis."
Bill Inmon: "A subject-oriented, integrated, non-volatile, time-variant collection of data supporting decision-making."
Characteristics of a Data Warehouse
Subject-Oriented: Data is organized around real-world subjects, facilitating easier access to relevant information.
Integrated: Brings together data from diverse source systems ensuring definitive consistency across datasets.
Non-Volatile: Maintains static data properties, focusing on creating new records in the event of updates rather than altering existing entries.
Time-Varying: Ensures the retention of historical data to enable a consistent overview across different time frames.
Typical Features of a Data Warehouse
Operates on scalable DBMS (Database Management Systems) like Oracle, IBM DB2, which can handle increased data loads over time.
Retains data over extensive periods sourced from various operational systems, ready for analysis as needed.
Data Processing Types
OLTP (Online Transaction Processing):
Designed for daily operations, offering real-time data reflecting current statuses.
OLAP (Online Analytical Processing):
Tailored for reporting and analytical queries generated from data warehouse repositories, enabling deep data analysis.
Applications of Data Warehousing
Investment and Insurance:
Analyzing market factors and customer data for improved financial decision-making.
Retail:
Tracking products efficiently and strategizing on pricing for maximizing sales.
Healthcare:
Managing treatment reports and sharing vital data securely across platforms.
Education:
Real-time access to data for informed operational and academic decision-making.
Banking:
Evaluating market conditions and tracking financial performance metrics.
Manufacturing:
Conducting trend analysis and accurate forecasting to optimize production.
Government:
Monitoring key records like taxation and healthcare for better public service management.
Telecommunications:
Making informed decisions regarding distribution and marketing strategies.