Data Warehousing Notes

Data Warehouse

Warehousing Strategy

The data warehouse strategy is defined as part of the information technology strategy of the enterprise.
Traditional Information Strategy Plan (ISP) may not give sufficient attention to decisional information requirements.
A data warehouse strategy remedies this by focusing on the decisional needs of the enterprise.

Strategy Components

Strategy Components
Determine Organizational Context
Conduct Preliminary Survey of Requirements
- Interview Categories and Sample Questions
- Interviewing Tips
Conduct Preliminary Source System Audit
Identify External Data Sources (If Applicable)
Define Warehouse Rollouts (Phased Implementation)
Define Preliminary Data Warehouse Architecture
Evaluate Development and Production Environment and Tools

Strategy Components

Preliminary data warehouse rollout plan.
- Prioritize user requirements and assign them to different warehouse rollouts.
- Divide warehouse development into phased, successive rollouts.
Preliminary data warehouse architecture.
- Define the overall data warehouse architecture for the pilot and subsequent warehouse rollouts.
- Define the initial technical architecture of each rollout.
Short-listed data warehouse environment and tools.
- Create a short-list for the tools and environments that appear to meet warehousing needs.

Determine Organizational Context

Understanding the organization helps to establish the context of the project.
Answers to organizational background questions are typically obtained from the Project Sponsor, the CIO, or the Project Manager.
Typical organizational background questions include:
- Who is the Project Sponsor for this project?
- What are the IS or IT groups in the organization? Which are involved in the data warehousing effort?
- What are the roles and responsibilities of the individuals who have been assigned to this effort?

Conduct Preliminary Survey of Requirements

Obtain an inventory of the requirements of business users through individual and group interviews with the end-user community.
Obtain layouts of the current management reports (and their planned enhancements).
The requirements inventory provides the breadth of information that the warehouse is expected to eventually provide.
The objective is to understand the user needs enough to prioritize the requirements.
This is a critical input for identifying the scope of each data warehouse rollout.

Interview Categories & Sample Questions

Functions:
- What is the mission of your group or unit?
- How do you go about fulfilling this mission?
Customers:
- How do you group or classify your customers?
- Do these groupings change over time?
- Does your grouping affect how you treat your customers?
Profit:
- At what level do you measure profitability in your group?
- Per agent? Per customer? Per product? Per region?
- At what level of detail are costs and revenues tracked in your organization?
- How do you track costs and revenues now?
Systems:
- What systems do you use as part of your job?
- What systems are you aware of in other groups that contain information you require?
Time:
- How many months or years of data do you need to track?
- Do you analyze performance across years?
- At what level of detail do you need to see figures? Daily? Weekly? Monthly? Quarterly? Yearly?

Interviewing Tips

Avoid making commitments about warehouse scope.
Keep the interview objective in mind.
Don't overwhelm the interviewees.
Record the session if the interviewee lets you.
Change the interviewing style depending on the interviewee.

Conduct Preliminary Source System Audit

Conduct the audit of the source system with respect to:
- Current architecture:
  - What is the current technology architecture of the organization?
  - What kind of systems, hardware, DBMS, network, end-user tools, development tools, and data access tools are currently in use?
- Source system relationships:
  - Are the source systems related in any way?
  - Does one system provide information to another?
- Network facilities:
  - Is it possible to use a single terminal or PC to access the different operational systems, from all locations?
- Data quality:
  - How much cleaning, scrubbing, deduplication, and integration do you suppose will be required?
  - What areas (tables or fields) in the source systems are currently known to have poor data quality?
- Documentation:
  - How much documentation is available for the source systems?

Identify External Data Sources (if Applicable)

The enterprise may also make use of external data sources to augment the data from internal source systems.
Examples of external data that can be used are:
- Data from credit agencies
- Statistical or census data
- Data from industry organizations
- Data from publications and news agencies
Use of external data presents opportunities for enriching the data warehouse

Define Warehouse Rollouts (Phased Implementation)

Dividing the data warehouse development into phased, successive rollouts helps to manage user expectations through the clear definition of scope for each rollout.
Figure shows a sample table listing all requirements identified during the initial round of interviews with end users. Each requirement is assigned a priority level.

Define Preliminary DW Architecture

Define the preliminary architecture of each rollout based on the approved rollout scope.
At a minimum, the preliminary architecture should indicate the following:
- Data warehouses and data marts
- Number of users
- Location
Following figure shows Sample Preliminary Architecture per Rollout

Evaluate Development & Production Environments and Tools

Enterprises can choose from several environments and tools for the data warehouse initiative.
Select the combination of tools that best meets the needs of the enterprise.
At present, no single vendor provides an integrated suite of warehousing tools.
There are, however, clear leaders for each tool category.
Produce a short-list from which each rollout or project will choose its tool set as shown in figure (sample tool short list).

Warehouse Management and Support Processes

Warehouse management and support processes are designed to address aspects of planning and managing a data warehouse project that are critical to the successful implementation and subsequent extension of the data warehouse.
These processes are defined to assist the project manager and warehouse driver during warehouse development projects.
Following issues are covered in it:
1. Define Issue Tracking and Resolution Process
2. Perform Capacity Planning
3. Define Warehouse Purging Rules
4. Define Security Measures
5. Define Backup and Recovery Strategy
6. Set Up Collection of Warehouse Usage Statistics

Define Issue Tracking and Resolution Process

During the course of a project, it is inevitable that a number of business and technical issues will surface.
The project will quickly be delayed by unresolved issues if an issue tracking and resolution process is not in place.
Figure shows a sample issue log that tracks all the issues that arise during the course of the project.

Issue Tracking Guidelines:

Issue description: State the issue briefly in two to three sentences. Provide a more detailed description of the issue as a separate paragraph.
Urgency: Indicate the priority level of the issue: high, medium, or low.
Raised by: Identify the person who raised the issue.
Assigned to: Identify the person on the team who is responsible for resolving the issue.
Date opened: This is the date when the issue was first logged.
Date closed: This is the date when the issue was finally resolved.
Resolved by: The person who resolved the issue.
Resolution description: State briefly the resolution of this issue in two or three sentences.
Issue logs formalize the issue resolution process. They also serve as a formal record of key decisions made throughout the project.

Perform Capacity Planning

Warehouse capacity requirements come in the following forms: space required, machine processing power, network bandwidth, and number of concurrent users.
These requirements increase with each rollout of the data warehouse.
Space Requirements: Space requirements are determined by the following:
- schema design, expected volume, and expected growth rate;
- indexing strategy used;
- backup and recovery strategy;
- aggregation strategy;
- staging and deduplication area required; and
- metadata space requirements.
Machine Processing Power: MPP (massively parallel processing) and SMP (symmetric multiprocessing) machines are the ideal hardware platform for data warehousing.
- Choose a configuration that is scalable and that meets the minimum processing requirements.
Network Bandwidth: The network bandwidth must not be allowed to slow down the warehouse extraction and warehouse performance.
- Verify all assumptions about the network bandwidth before proceeding with each rollout.

Define Warehouse Purging Rules

Purging rules specify when data are to be removed from the data warehouse.
Define the mechanisms for archiving or removing older data from the data warehouse.
Check for any legal, regulatory, or auditing requirements that may warrant the storage of data in other media prior to actual purging from the warehouse.
Acquire the software and devices that are required for archiving.

Define Security Measures

Keep the data warehouse secure to prevent the loss of competitive information either to unforeseen disasters or to unauthorized users.
Define the security measures for the data warehouse, taking into consideration both physical security (i.e., where the data warehouse is physically located), as well as user-access security.
Additional precautions are required if either the warehouse data or warehouse reports are available to users through an intranet or over the public Internet infrastructure.

Define Backup and Recovery Strategy

Define the backup and recovery strategy for the warehouse, taking into consideration the following factors:
- Data to be backed up
- Batch window of the warehouse
- Maximum acceptable time for recovery
- Acceptable costs for backup and recovery
Also consider the following when selecting the backup mechanism:
- Archive format
- Automatic backup devices
- Parallel data streams
- Incremental backups
- Offsite backups
- Backup and recovery procedures

Set Up Collection of Warehouse Usage Statistics

Warehouse usage statistics are collected to provide the data warehouse designer with inputs for further refining the data warehouse design and to track general usage and acceptance of the warehouse.
Define the mechanism for collecting these statistics, and assign resources to monitor and review these regularly.

Data Warehouse Planning

Data warehouse planning is conducted to clearly define the scope of one data warehouse rollout.
Each rollout repeatedly executes the four tracks (top-down, bottom-up, back-end and front-end), and the scope of the data warehouse iteratively extended as a result. Figure illustrates the concept.
The combination of the top-down and bottom-up tracks gives the planning process—a requirements-driven approach that is grounded on available data.
The clear separation of the front-end and back-end tracks encourages the development of warehouse subsystems for extracting, transporting, cleaning, and loading warehouse data independently of the front-end tools that will be used to access the warehouse.

Activities in Data Warehouse Planning

Assemble and Orient Team
Conduct Decisional Requirements Analysis
Conduct Decisional Source System Audit
Design Logical and Physical Warehouse Schema
Produce Source-to-Target Field Mapping
Select Development and Production Environment and Tools
Create Prototype for This Rollout
Create Implementation Plan of This Rollout

Assemble and Orient Team

Identify all parties who will be involved in the data warehouse implementation and brief them about the project.
Distribute copies of the warehouse strategy as background material for the planning activity.

Conduct Decisional Requirements Analysis

The objective of Decisional Requirements Analysis is to gain a thorough understanding of the information needs of decision-makers.
Decisional requirements analysis represents the top-down aspect of data warehousing.

Conduct Decisional Source System Audit

The decisional source system audit is a survey of all information systems that are current or potential sources of data for the data warehouse.
A preliminary source system audit during warehouse strategy formulation should provide a complete inventory of data sources.
Identify all possible source systems for the warehouse if this information is currently unavailable.
Data sources are primarily internal.
If external data sources are also available, these may be integrated into the warehouse.

Design Logical and Physical Warehouse Schema

Design the data warehouse schema that can best meet the information requirements of the rollout.
Two main schema design techniques are available:
- Normalization:
  - The database schema is designed using the normalization techniques traditionally used for OLTP applications;
- Dimensional modeling:
  - This technique produces denormalized, star schema designs consisting of fact and dimension tables.
  - A variation of the dimensional star schema also exists (i.e., snowflake schema).

Produce Source -to-Target Field Mapping

The Source-To-Target Field Mapping documents how fields in the operational (source) systems are transformed into data warehouse fields.

Select Development and Production Environment and Tools

Finalize the computing environment and tool set for the rollout based on the results of the development and production environment and tools study during the data warehouse strategy definition.

Create Prototype for This Rollout

Using the short-listed or final tools and production environment, create a prototype of the data warehouse.

Create Implementation Plan of This Rollout

With the scope now fully defined and the source-to-target field mapping fully specified, it is now possible to draft an implementation plan for the rollout.
Consider the following factors when creating the implementation plan:
- Number of source systems, and their related extraction mechanisms and logistics
- Number of decisional business processes supported
- Number of subject areas involved
- Estimated database size
- Availability and quality of source system documentation
- Data quality issues and their impact on the schedule
- Time frame required for IT infrastructure upgrades
- IT support and participation

Data Warehouse Implementation

The data Warehouse implementation approach describes the activities related to implementing one rollout of the date warehouse.
The data warehouse implementation team builds or extends an existing warehouse schema based on the final logical schema design produced during planning.
The team also builds the warehouse subsystems that ensure a steady, regular flow of clean data from the operational systems into the data warehouse.
An implementation project should be scoped to last between three to six months.

Activities in Data Warehouse Implementation

Acquire and Set Up Development Environment
Obtain Copies of Operational Tables
Finalize Physical Warehouse Schema Design
Build or Configure Extraction and Transformation Subsystems
Build or Configure Data Quality Subsystem
Build Warehouse Load Subsystem
Set Up Warehouse Metadata
Set Up Data Access and Retrieval Tools
Perform the Production Warehouse Load
Conduct User Training
Conduct User Testing and Acceptance

Acquire and Set Up Development Environment

This activity includes the following tasks, among others: install the hardware, the operating system, the relational database engine; install all warehousing tools; create all necessary network connections; and create all required user IDs and user access definitions.

Obtain Copies of Operational Tables

There may be instances where the team has no direct access to the operational source systems from the warehouse development environment.
Regardless of the reason for the lack of access, the warehousing team must establish and document a consistent, reliable, and easy-to-follow procedure for obtaining copies of the relevant tables from the operational systems.

Finalize Physical Warehouse Schema Design

Translate the detailed logical and physical warehouse design from the warehouse planning stage into a final physical warehouse design, taking into consideration the specific, selected database management system.

Build or Configure Extraction and Transformation Subsystems

Easily 60 percent to 80 percent of a warehouse implementation project is devoted to the back-end of the warehouse.
The back-end subsystems must extract, transform, clean, and load the operational data into the data warehouse.

Build or Configure Data Quality Subsystem

A data quality subsystem is on the back-end of the warehouse therefore is a critical component of the overall warehouse architecture.

Build Warehouse Load Subsystem

The warehouse load subsystem takes the load images created by the extraction and transformation subsystems and loads these images directly into the data warehouse.
Much of the challenge in building the load subsystem therefore lies in optimizing the load process to reduce the total time required.
The team may want to test the accuracy and performance of the warehouse load subsystem on dummy data before attempting a real load with actual load images.
Create the data warehouse schema in the development environment while the team is constructing or configuring the warehouse back-end subsystems.

Set Up Warehouse Metadata

It describe the contents of the data warehouse, indicate where the warehouse data originally came from, and document the business rules that govern the transformation of the data.

Set Up Data Access and Retrieval Tools

These tools are critical to the acceptance and usability of the warehouse.

Perform the Production Warehouse Load

The production data warehouse load can be performed only when the load images are ready and both the warehouse schema and metadata are set up.

Conduct User Training

The IT organization is encouraged to fully take over the responsibility of conducting user training.

Conduct User Testing and Acceptance

The data warehouse, like any system, must undergo user testing and acceptance.

Hardware and Operating Systems for Data Warehousing

The term hardware and operating systems refers to the server platforms and operating systems that serve as the computing environment of the data warehouse.
Warehousing environments are typically separate from the operational computing environments (i.e., a different machine is used) to avoid potential resource contentions between operational and decisional processing.
The two primary categories of parallel hardware used for data warehousing are the symmetric multiprocessing (SMP) machines and massively parallel processing (MPP) machines.

SMP

SMPs have multiple CPUs that share a common memory and input/output.
Also known as a "Shared Everything" architecture and is limited by the scalability and performance limits of the bus that connects its various components.
Such architectures scale up by adding more CPUs, upgrading existing ones, or by clustering together several SMP machines.

MPP

MPPs in contrast, allow multiple, independent CPUs, connected to each other by a high-speed network.
Each CPU has its own copy of the operating system and can essentially function as an independent processor.
MPP architectures scale up by adding nodes or CPUs.
Unfortunately, not all applications can take advantage of the parallel architecture of MPPs; applications that have been designed to work on only one processor will fail to take advantage of parallel processing on multiple processors.

Hardware Selection Criteria

The following selection criteria are recommended for hardware selection.

Scalability: The warehouse solution is able to scale up in terms of space and processing power. This is particularly important if the warehouse is projected to grow at a rapid rate.
Financial stability: The product vendor has proven to be a strong and visible player in the hardware segment, and its financial performance indicates growth or stability.
Price/performance: The product performs well in a price/performance comparison with other vendors of similar products.
Delivery lead time: The product vendor can deliver the hardware or an equivalent service unit within the required time frame.
Reference sites: The hardware vendor has a reference site that is using a similar unit for the same purpose.
Availability of support: Support for the hardware and its operating system is available, and support response times are within the acceptable down time for the warehouse.

Client/Server Computing Model & Data Warehousing

Purpose of Application Servers:
1. To run middleware and establish connectivity
2. To execute management and control software
3. To handle data access from the Web
4. To manage metadata
5. For authentication
6. As front end
7. For managing and running standard reports
8. For sophisticated query management
9. For OLAP applications
Generally, the client workstations still handle the presentation logic and provide the presentation services

Parallel Processors & Cluster Systems

The data warehouse is a user-centric and query-intensive environment where users will constantly be executing complex queries to perform all types of analysis.
Each query would need to read large volumes of data to produce results sets.
If the data warehouse is not tuned properly for handling large, complex, simultaneous queries efficiently, the value of the data warehouse will be lost. Performance is of primary importance.
To speed up query processing, data loading, and index creation, a very effective way to accomplish this is to use parallel processing.
In parallel processing, Split problem in smaller tasks that are executed concurrently.

Advantages:

Increasing speed & optimizing resources utilization

Disadvantages:

Complex programming models-difficult development

Steps of Parallel Processing Software

Analyzing a large task to identify independent units that can be executed in parallel
Identifying which of the smaller units must be executed one after the other
Executing the independent units in parallel and the dependent units in the proper sequence
Collecting, collating, and consolidating the results returned by the smaller units

Advantages of Using Parallel Processing in Data Warehouse

Performance improvement for query processing, data loading, and index creation.
Scalability allowing the addition of CPUs and memory modules without any changes to the existing-application.
Fault tolerance so that the database would be available even when some of the parallel processors fail
Single logical view of the database even though the data may reside on the disks of multiple nodes.

What is cluster computing?

A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer.
The components of a cluster are commonly, but not always, connected to each other through fast local area networks.
Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

Cluster consists of

Nodes(master+computing)
Network
OS
Cluster middleware: It permits compute clustering programs to be portable to a wide variety of clusters

Distributed DBMS Implementation

A distributed database management system (DDBMS) is a set of multiple, logically interrelated databases distributed over a network.
They Provide a mechanism that makes the distribution of data transport to users.
DDBMS is widely used in data warehousing, where huge volumes of data are processed and accessed by numerous users or database clients at the same time.
This database system is used to manage data in networks, maintain confidentiality and handle data integrity.

The advantages of DDBS

Local autonomy
Good availability (because support multi copies)
Good flexibility
Low system cost
High efficiency (most access processed locally, less communication comparing to centralized database system)
Parallel process

The disadvantages of DDBS

Hard to integrate existing databases
Too complex (system itself and its using, maintenance, etc. such as DDB design)

Distributed Processing

Shares database's logical processing among physically, networked independent sites.

Warehousing Software

A warehousing team will require several different types of tools during the course of a warehousing project.
These software products generally fall into one or more of the following categories.

Extraction and transformation:

As part of the data extraction and transformation process, the warehouse team requires tools that can extract, transform, integrate, clean, and load data from source systems into one or more data warehouse databases.

Warehouse storage:

Software products are also required to store warehouse data and their accompanying metadata.
Relational database management systems in particular are well suited to large and growing warehouses.

Data access and retrieval:

Different types of software are required to access, retrieve, distribute, and present warehouse data to its end users.

Extraction Tools

Tool Selection: Warehouse teams have many options when it comes to extraction tools.
In general, the choice of tool depends greatly on the following factors:
- The source system platform and database:
  - Extraction and transformation tools cannot access all types of data sources on all types of computing platforms.
  - Unless the team is willing to invest in middleware, the tool options are limited to those that can work with the enterprise's source systems.
- Built-in extraction or duplication functionality:
  - The source systems may have built-in extraction or duplication features, either through application code or through database technology.
  - The availability of these built-in tools may help reduce the technical difficulties inherent in the data extraction process.
- The batch windows of the operational systems:
  - Some extraction mechanisms are faster or more efficient than others.
  - The batch windows of the operational systems determine the available time frame for the extraction process and therefore may limit the team to a certain set of tools or extraction techniques.

Extraction Methods

Change-based replication:
- Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse.
- This approach places less stress on the network (due to the smaller volume of data to be transported) but requires more complex programming to determine when a new warehouse record must be inserted or when an existing warehouse record must be updated.
Bulk extractions:
- The entire data warehouse is refreshed periodically by extractions from the source systems.
- All applicable data are extracted from the source systems for loading into the warehouse.
- This approach heavily taxes the network connection between source and target databases, but such warehouses are easier to set up and maintain.

Transformation tools

Transformation tools are aptly named; they transform extracted data into the appropriate format, data structure, and values that are required by the data warehouse.

Data Quality Tools:

Data quality tools assist warehousing teams with the task of locating and correcting data errors that exist in the source system or in the data warehouse.
Data quality tools can help identify and correct data errors, ideally at the source systems.
If corrections at the source are not possible, data quality tools can also be used on the warehouse load images or on the warehouse data itself.

Data Loaders:

Data loaders load transformed data (i.e., load images) into the data warehouse.
If load images are available on the same RDBMS engine as the warehouse, then stored procedures can be used to handle the warehouse loading.
If the load images do not yet have warehouse keys, then data loaders must generate the appropriate warehouse keys as part of the load process.

Database Management Systems:

A database management system is required to store the cleansed and integrated data for easy retrieval by business users.
Two flavors of database management systems are currently popular: relational databases and Multidimensional databases.

Data Access & Retrieval Tools:

Data warehouse users derive and obtain information through these types of tools.
Data access and retrieval tools are currently classified into the subcategories below.
- OLAP tools (ROLAP, MOLAP, HOLAP)
- Data Mining
- Exception Reporting & Alert Systems

Data Modeling Tools:

Data modeling tools allow users to prepare and maintain an information model of both the source database and the target database.
Some of these tools also generate the data structures based on the models that are stored or are able to create models by reverse engineering existing databases.

Warehouse Management Tools:

These tools assist warehouse administrators in the day-to-day management and administration of the warehouse.
Different warehouse management tools support or automate different aspects of the warehouse administration and management tasks.
For example, some tools focus on the load process and therefore track the load histories of the warehouse.
Other tools track the types of queries that users direct to the warehouse and identify which data are not used and therefore are candidates for removal.

Warehouse Schema Design

Dimensional Modeling for Decisional Systems
Dimensional modeling provides a number of techniques or principles for denormalizing the database structure to create schemas that are suitable for supporting decisional processing.
These modeling principles are discussed below:
- Two Types of Tables: Facts and Dimensions
- Two types of tables are used in dimensional modeling: Fact tables and Dimensional tables.

Fact Tables

Fact tables are used to record actual facts or measures in the business. Facts are the numeric data items that are of interest to the business.
Facts are the numbers that users analyze and summarize to gain a better understanding of the business.

Below are examples of facts for different industries:

Retail: Number of units sold, sales amount
Telecommunications: Length of call in minutes, average number of calls
Banking: Average daily balance, transaction amount
Insurance: Claims amounts
Airline: Ticket cost, baggage weight

Dimension Tables

Dimension tables, on the other hand, establish the context of the facts. Dimensional tables store fields that describe the facts.

Below are examples of dimensions facts for the same industries:

Retail: Store name, store zip code, product name, product category, day of week
Telecommunications: Call origin, call destination
Banking: Customer name, account number, data, branch, account officer
Insurance: Policy type, insured party
Airline: Flight number, flight destination, airfare class

A Schema Is a Fact Table Plus Its Related Dimension Tables

Visually, a dimensional schema looks very much like a star, hence the use of the term star schema to describe dimensional models.
Fact tables reside at the center of the schema, and their dimensions are typically drawn around it.

Dimensional Hierarchies and Hierarchical Drilling

As a result of denormalization of the dimensions, each dimension will quite likely have hierarchies that imply the grouping and structure.
The easiest example can be found in the Time dimension.
When warehouse users drill up and down for detail, they typically drill up and down these dimensional hierarchies to obtain more or less detail about the business.

The Granularity of the fact Table

The term granularity is used to indicate the level of detail stored in the fact table.
The granularity of the Fact table follows naturally from the level of detail of its related dimensions.
For example, if each Time record represents a day, each Product record represents a product, and each Organization record represents one branch, then the grain of a sales Fact table with these dimensions would likely be: sales per product per day per branch.