scalable 7

Leveraging PIT and Bridge Tables for Virtualization

Leveraging PIT (Point-in-Time) and bridge tables is crucial for virtualization within a Data Vault 2.0 architecture, enhancing performance and flexibility in data warehousing by optimizing query speeds and simplifying complex joins.

PIT Tables for Virtualization
Loading Data into PIT Tables
  • The Business Vault, part of the EDW (Enterprise Data Warehouse) layer, uses a DataVault connection manager to connect Execute SQL Tasks. This ensures reliable and efficient data transfer between different layers of the data warehouse.

  • An SQL query editor is utilized to load snapshots into the PIT table, which involves crafting specific queries tailored to the data structure and snapshot requirements.

  • The SQL statement used includes a BETWEEN clause, assuming no overlap between the load date and load end date (the latter being one nanosecond before the subsequent record's load date). This assumption is critical for temporal accuracy in the PIT table.

  • The primary key column's key value (e.g., AirportKey) is employed in downstream dimensional tables, ensuring proper linkage and referential integrity across the data warehouse.

  • The source statement generates a record for each hash key in the parent hub along with the snapshot date, controlled by an SSIS variable. This approach ensures that every relevant entity has a corresponding snapshot record in the PIT table.

  • A WHERE condition enables recovery, ensuring new business keys populate the target for past snapshot dates, which maintains equi-joins, and can be optionally rerun in SSIS for past snapshots. This condition is vital for historical accuracy and consistency.

  • The snapshot date is stored in the dSnapshotDate variable within the SSIS control flow, providing a central point of management for the snapshot date.

  • Parameter mapping within the Execute SQL Task editor binds the dSnapshotDate variable to SQL statement parameters. This ensures that the snapshot date is correctly applied throughout the SQL query.

  • Because the snapshot date is used multiple times in the SQL statement, and the parameters are not named in OLE DB statements, the variable has to be bound multiple times. This is done by adding multiple parameter mappings to the list and selecting the dSnapshotDate variable in the User namespace. The DATE data type is selected, and each parameter is set to one instance of the parameter reference (from 0 to 4), leaving the parameter size as is. This meticulous parameter binding is crucial for accurate data loading.

  • To insert multiple snapshot dates, the task can be executed within a for loop container (for example during an initial load). This is particularly useful during the initial population of the data warehouse.

  • To load only descriptive data on Type 2 dimension changes into the PIT table, the WHERE condition of the above statement is modified, using the load date of the referenced satellites for change detection, as the load date changes per change that is added to the satellite. This ensures that only relevant descriptive data is loaded, optimizing storage and performance.

Creating Virtualized Dimensions with PIT Tables
  • A T-SQL view is created to implement Type 2 dimensions, leveraging the PIT table. This view provides a dynamic and efficient way to access historical dimension data.

  • This view provides a list of airports with descriptive information from the destination airport source, and alternatively loads descriptive data from the origin airport satellite if the primary source lacks data at that snapshot date. This ensures data completeness and accuracy.

  • Joins are converted from LEFT OUTER JOIN to INNER JOIN to enhance performance, and are all equi-joins. This optimization significantly improves query execution time.

  • Query performance is improved by basing the query on the PIT table instead of collecting load dates from multiple satellites. This simplifies the query and reduces the amount of data that needs to be processed.

  • The airport code hub is joined to retrieve the business key. This is essential for identifying and linking airport entities.

  • Both satellites that provide descriptive data are joined. Joining both satellites ensures that all available descriptive data is included in the view.

  • A check on the referenced hash key in the PIT table can ensure if the leading satellite provides useful data, using the ghost record to retrieve data from the other satellite. This check ensures that the most accurate and relevant data is used.

Example T-SQL View:

-- Example view definition (hypothetical)
CREATE VIEW VirtualAirportDimension AS
SELECT
    H.AirportCode,
    S1.AirportName AS DestinationAirportName,
    S2.AirportName AS OriginAirportName
FROM
    HubAirport H
INNER JOIN
    PitTable P ON H.AirportHashKey = P.AirportHashKey
LEFT JOIN
    SatelliteDestinationAirport S1 ON P.DestAirportHashKey = S1.DestAirportHashKey AND P.SnapshotDate = S1.LoadDate
LEFT JOIN
    SatelliteOriginAirport S2 ON P.OriginAirportHashKey = S2.OriginAirportHashKey AND P.SnapshotDate = S2.LoadDate

Implementing such changes is cost-effective because the data is virtually provided, meeting business users' performance needs. This approach avoids the need for costly physical data duplication.

The same PIT table can be used to create various virtual dimensions with differing business logic. This provides flexibility and adaptability in meeting diverse business requirements.

For example, a view can be created that provides descriptive information from the destination airport source only.

  • Additional filtering on airports (e.g. airports from California) becomes part of the business rule implemented in this view. This allows for specific and targeted data analysis.

The advantages of separating the join operation from other business logic is highlighted through the examples described above. Similar concepts can be applied to fact tables as well.

Loading Bridge Tables

The purpose of a bridge table is to ensure that the performance requirements of business users are met for virtual fact tables. Bridge tables streamline complex relationships, making data access faster and more efficient.

Key Factors Affecting Fact Table Performance:
  1. Joins Between Links: Pre-joining improves performance where the desired grain requires joining multiple links. This reduces the complexity of the query and speeds up execution.

  2. Required Aggregations and Computed Values: Materializing aggregations and complex computations enhances virtual fact table performance. Pre-calculation reduces the computational load during query execution.

  3. Additional Customization: Defined by individual business specification, the customization involves deriving a number of dimensions and measures that have to be derived from the raw data. This allows for tailored data presentation to meet specific business needs.

Performing these activities and separating it from customization provides an advantage in the information delivery of fact tables, which is similar to the method for PIT tables with dimension tables.

Bridge Table Structure and Loading
  • Bridge tables materialize the basis for virtual fact tables, meeting business user performance requirements. These tables provide a pre-processed foundation for data analysis.

  • A simple bridge table pre-joins non-historized links, like TLinkFlight and TDiversionFlight to analyze flight diversions including planned flight information. Pre-joining these tables simplifies the analysis of flight diversions.

  • The grain defines the primary key of the bridge table and is used by the SQL statement that loads the bridge table incrementally. Proper grain definition ensures data accuracy and consistency.

  • Indices on hash keys improve join performance between hubs, satellites, and the bridge table, particularly for ad-hoc queries. Indexing optimizes query performance, especially for unstructured queries.

  • An incremental INSERT statement within SSIS loads the bridge, handling standard Data Vault 2.0 links similarly. Incremental loading ensures efficient data updates.

  • The bridge joins links while avoiding Cartesian products through appropriate conditions in the WHERE clause. Avoiding Cartesian products is crucial for data accuracy and performance.

  • Hubs are not joined at this time and business keys are not prejoined into the bridge by default. Only the hash key is added to the bridge table, but this key is already available in the link structures. This design decision optimizes performance.

  • A sub-select statement in the WHERE clause supports incremental loading and is based on the primary key of the bridge table. This incremental loading approach ensures efficient data updates.

  • The load date from the source is calculated to the next snapshot date. If the snapshot date follows a regular pattern then it can be calculated but if it follows a different patter then a lookup into a reference table for the snapshots of the target might be required. Accurate snapshot dating is essential for temporal data analysis.

  • Having calculated the snapshot date makes it easier when retrieving information from or via PIT tables in the virtual fact table, for example to retrieve additional descriptive information from dependent satellites of the PIT table. This simplifies data retrieval and enhances query performance.

  • The technical load date can be added to the bridge, however other dates provide more business value, for example the flight date. Choosing the right date enhances the business relevance of the data.

  • The calculation requires daily snapshot dates and ensures that load dates without time are mapped to the same snapshot date. This ensures consistency in temporal data representation.

Improving Performance

To improve overall query performance, pre-joining information from hubs and dependent satellites can be adopted, using business keys from hubs into the bridge table. This can significantly reduce query execution time. However, if the table becomes wide then the bridge table performance decreases. Balancing table width and performance is crucial for optimization.

It is important to experiment to find the correct mixes for every individual bridge table. Each bridge table should be optimized based on its specific data and usage patterns.

The bridge table can be set to be in the same grain as the target fact table to achieve more performance gains. Aligning the grain of the bridge and fact tables optimizes query performance.

Grain Shift

To change the grain, the following options are available:

  • Removing hub references: reduces the granularity, for example, by using a GROUP BY clause in the INSERT statement. Reducing granularity can simplify data analysis.

  • Adding hub references: increases the granularity, for example by joining additional links to the current set of used links. Increasing granularity can provide more detailed data insights.

  • The table is loaded via an incremental statement, often through an Execute SQL task in SSIS. Incremental loading ensures data freshness and efficiency.

Example SQL Statement performing Grain Shift:

-- Hypothetical example illustrating grain shift
INSERT INTO BridgeTable (FlightHashKey, DiversionHashKey, AggregatedMeasure)
SELECT LF.FlightHashKey, DF.DiversionHashKey, SUM(SomeMeasure)
FROM TLinkFlight LF
JOIN TDiversionFlight DF ON LF.FlightHashKey = DF.FlightHashKey
GROUP BY LF.FlightHashKey, DF.DiversionHashKey;
Implementation of Grain Shift
  • The INSERT statement implements the grain shift by removing hub references and applying a GROUP BY clause on the data. This provides flexibility in data representation.

  • The performance of virtualized fact tables is further improved by adding the results of required aggregations to the materialized bridge. Pre-calculated aggregations enhance query performance.

  • Set a fixed value in the load statement for the record source because aggregations represent business logic that is implemented by this bridge. This ensures data consistency and accuracy.

  • The flight date was used as a snapshot date because it is not sufficient to calculate the snapshot date from the load date due to the aggregation. This is more appropriate for the target information mart. Using the appropriate snapshot date enhances temporal data analysis.

These bridge tables can now be used as the basis to provide virtual fact tables.

Creating Virtualized Facts
  • They implement additional customization required by individual fact tables in various information marts, which differ in its representation. This allows for tailored data presentation to meet specific business needs.

  • Fact tables that depend on the same bridge table should have the similar grain. Maintaining consistent grain ensures data accuracy and simplifies analysis.

  • If the grains differ between the virtual fact table and the bridge table, then introduce a new bridge in the Business Vault to ensure that the performance meets the requirements of business users. This ensures optimal performance and data consistency.

A DDL statement creates a virtual fact entity based on BrDiversionFlight.

Key Aspects of the SQL Optimizer:

Relatively cheap for the SQL optimizer to execute this statement due to inner joins with equi-join conditions. This design choice optimizes query performance.

  • The statement joins measures from the nonhistorialized satellite TSatFlight. All the joins are the same as the other statements and is based on an inner join with an equi-join condition. This ensures efficient and accurate data retrieval.

  • If a business key is required in the fact table (for example, because not all dimensions are provided via dimension tables in this example), the hub is joined on the hash key only in order to retrieve the business key or composite key from the hub. This ensures that all necessary business keys are available for analysis. This statement concatenates the composite key from hub HubFlightNum into a format that the user is familiar with.

This business logic could be extended to serve other requirements by the end-user, which might differ per information mart or fact table.

  • The flight date could be easily sourced from the bridge table source. This simplifies data retrieval and enhances query performance.

  • The statement joins TLinkFlight for demonstrative purposes, where the grain makes it easy to join the original source of the bridge table to include additional or missing data into the fact table. This ensures data completeness and accuracy. To support such joins, the link hash key, in this example FlightHashKey, should be included.

  • If dimension tables should be provided for some dimensions, it is easy to retrieve the required key values for the dimension entry. This hash value comes from the PIT table. However, because the PIT table provides multiple snapshots for the same hash key, the equi-join condition for the PIT table is extended to include the snapshot date of the fact. This snapshot date was calculated from the technical load date and helps us to retrieve the appropriate entry in the PIT table.

Example:

-- Example of virtualized fact table creation
CREATE VIEW VirtualFactFlight AS
SELECT
    BF.FlightHashKey,
    SF.DistanceFlown,
    DA.AirportName AS DestinationAirport,
    OA.AirportName AS OriginAirport
FROM
    BrFlight BF
INNER JOIN
    TSatFlight SF ON BF.FlightHashKey = SF.FlightHashKey
INNER JOIN
    DimAirport DA ON BF.DestAirportKey = DA.AirportKey
INNER JOIN
    DimAirport OA ON BF.OriginAirportKey = OA.AirportKey;
  • The PIT is used to retrieve a load date for satellites included in the PIT table to retrieve additional descriptive data from dependent satellites. This simplifies data retrieval and enhances query performance.

  • Joining the satellite requires only INNER JOINs with equi-join condition because of the PIT table. If the PIT wasn't available then a more complex join would be required to find the appropriate delta record which would involve a BETWEEN condition and a LEFT JOIN because it is not guaranteed that the satellite provides a record for the given snapshot date.

-Another example implements a virtual fact table on the aggregated bridge BrConnection. This table doesn't include any business logic for the aggregations because these operations were already performed when loading the bridge table.

  • This fact table doesn't require any additional data from the no-history link or satellite because these aggregated measures are included in the bridge table as well. Includes the key values to dimension tables for origin airport and destination airport in addition to the measures.

-Additional descriptive data could be joined from hubs and satellites if required for a specific target.

Implementing Temporal Dimensions

The previous sections showed how to provide type 2 dimensions using joins between the hub table and dependent satellites or the PIT table and dependent satellites.

Load Date Based Joins

-All of these joins were based on the load date to find the record in the dependent satellite that is current for a given snapshot date. This ensures accurate temporal data representation.

-The load date was used because it provides information about the technical validity of a record in the history of the data: which data was current at a given point in time, from a technical perspective.

Temporal Perspective via Effectivity Dates

Instead of technical perspective, business users can use the temporal perspective that is based on the effectivity dates that come in various ways like valid from and valid to dates and membership start and membership end dates. This provides a more business-relevant view of the data.

Chapter 5 explained how to store effectivity dates in effectivity satellites, which are added to hubs and links to indicate if business keys in hubs are deleted in the source system or have become invalid, and the validity of relationships between business keys in Data Vault 2.0 links. Track changes to effectivity dates in also tracked in an auditable manner via these satellites.

A special form of a PIT table can be used to create Type 2 dimensions that reflect the temporal perspective.

Temporal PIT Table

There is no structural difference between a standard PIT and a temporal PIT table. However, instead of prejoining the data based on load date, the effectivity date or any other descriptive date is used when loading the temporal PIT table.

-The join conditions are based on the current active record from the satellite, as indicated by a load end date of NULL, and the snapshot date from the PIT between the valid from and valid to dates from the descriptive satellite (not an effectivity satellite).

Updating and Partitioning:
  • If the effectivity dates change in the raw data, the PIT needs to be updated for past records or records are deleted from the PIT and the above statement inserts the current view requested by the business. This ensures data accuracy and consistency.

  • Add a load date to the temporal PIT table and partition over it and remove old partitions whenever the current partition was successfully loaded, as fully updating the table will be costly from the prespective of performance. This will turn the temporal PIT table table into a rolling history of joins. This approach optimizes performance and storage.

Presenting the data to the business user is done by using a similar view as described in section 14.3.3:

Only difference between the two views the primary source of the fact table, which is the above temporal PIT table instead of the standard PIT.

Other than that, the view definition is exactly as before, making it easy for power users who directly access the Data Vault 2.0 model because they only need to change the PIT source in order to access a temporal view of the data instead of the technically historized view.

Implementing Data Quality using PIT Tables

Another application of PIT tables is to use it for data cleansing purposes. In some (rare) cases, it might be appropriate to use a PIT table for master and duplicate resolution and other data cleansing activities.

DQ PIT (Data Quality PIT)

Name is the only difference between this PIT table and the ones used before. The column definitions remain the same. However, the loading statement differs:

The preceding statement loads the DQ PIT but cleanses the data by setting the DestAirportHashKey and DestLoadDate to the ghost record in the event that the DestCityName is “Frisco, CA”.

Any other business logic could also be applied when loading the DQ PIT table. The business logic in the dependent dimension view would automatically pick a record from an alternate satellite if the ghost record is found.

Advantages of DQ PIT

The user transparency is an advantage of this approach, allowing power users to source cleansed dimension data by using the DQ PIT.

However, adding a computed satellite with cleansed data that could also be added to the standard (or temporal) PIT table could be achieved by a similar approach.

If the power user wants to use cleansed information, this computed satellite is joined to the PIT table instead of the raw satellite.

Dealing with Reference Data

Reference codes are included in Raw Data Vault satellites in many cases. Although these codes may vary from source to source, business users need a conformed view on such codes to run analytical statements across source systems.

Instead of modifying the code in the Raw Data Vault satellite, which would compromise the audit- ability of the Raw Data Vault, the code from a specific source system is replaced or enriched by de- scriptive data from reference tables when the Business Vault or information marts are being built.

Computed Satellite for Reference Data

A simplified computed satellite can be created in the Business Vault that is directly based on the satellite TSatFlight in the Raw Data Vault. It is merely a selection of some columns without any filtering or computing of rows that only serves as the playground for the example in this section.

The computed satellite can be modified by adding descriptive data from a reference table to the satellite by joining the descriptive data from the reference table RefDelayGroup, which is just a view on a master data table in MDS. The reference table is joined twice because it is used to describe two codes in the satellite (DepartureDelayGroups and ArrivalDelayGroups).

The descriptive data is added to the computed satellite by adding some of the attributes from the joined reference table to the view. This approach is useful for creating star schemas.

Dimension Based on Reference Table:

Instead of joining the data in the computed satellite, it could also be joined when creating the dimension tables or fact tables.

A dimension can also be created based on reference tables, where the code is used as the dimension key because it is defined as a unique value that never changes. The code is not hashed when creating dimension tables for reference data because of simplicity.

Once the dimension table is defined, the dimension key is referenced by a fact table (using the code). It is also possible to reference the dimension from another dimension, creating a snow-flake schema.

Mapping Tables

Mapping tables help to map between codes from individual source systems to conformed codes for the data warehouse.

The easiest approach is to create a master data entity with the code column used for the conformed code and individual columns per source system code.

There are also options if multiple codes from the source system need to be mapped to the same conformed code in the data warehouse.

About Hash Keys in the Information Mart

The information mart uses hash keys to define the relationships between facts and dimensions. This presents no problem in the information mart because it is provided virtually. However, the OLAP cubes are often materialized and storage might become an issue or affect performance to some extent.

Advantages of Using Hash Keys in the Information Mart

It is easy to source data when using hash keys in the information mart because all other options require more complex solutions and loading processes.

Chapter 15 explains how hash keys can be used when creating the multidimensional database without any problems from a structural perspective.

Type 1 dimensions

Hash key from the hub is used to populate the key attribute in the dimension

Type 2 dimensions

Key from the PIT table is used because it identifies each change in the PIT (and thus the dimension table) uniquely.

Using hash keys in the data warehouse, including in the dimensional model, is future-proof for all requirements regarding the volume, variety and velocity of data and thus the recommended approach for building information marts and multidimensional databases. Only deviate from this recommendation if it's really needed.

Mitigating Storage and Performance Issues

These are some ways to deal with storage and performance bottlenecks that might be due to hash keys.

Reduce the Number of Dimensions in Cube

Review the number of dimensions in the information mart and dependent cubes to determine if it is possible to reduce the number of dimensions by providing multiple cubes that are more tailored to specific business cases. By doing so, the storage required per cube is reduced which improves the overall performance of the solution while maintaining ease of development and use.

Use Fixed Binary Data Type for Hash Values

If there is still an issue after reducing the number of dimensions in a solution or if reducing the dimensions is not an option, store the hash keys using a binary datatype such as binary(16) in Microsoft SQL Server. Avoid usage of a BLOB, CLOB, TEXT, or LOB data type.

The drawback of this solution is that some tools that are part of the BI stack of your organization might not support reading or writing to binary columns if transparent access using views is not possible. It is recommended to review the tools to be used in the future, includ- ing enterprise service buses (ESBs) (such as Microsoft BizTalk, which doesn’t directly support binary datatypes) and the workarounds available (SQL views, stored procedures, etc.). The advantage of storing the hash keys as character strings is that the string is supported in any tool.

Reduce the Size of the Hash Key

Use MD5 for hashing business keys because it provides an appropriate solution while maintaining a manageable hash value size if SHA-1 (or higher) is being used.

Avoid using anything higher than SHA-1 because you will get anything worth the additional storage or reduced performance. For collisions, avoid using CRC or MD4 functions.

It is recommended to use the MD5 hash function for calculating the hash keys.

Introduce Additional Sequence Numbers

This should be the last resort because it makes dealing with the data more complex. Introduce sequence numbers in the PIT tables (for using them in Type 2 dimensions) or close to hubs (for example in hubs themselves or other structures).

Such a solution requires the sequence number to be materialized (again, for example in the PIT table) to be stable for virtualization. It is also not possible to perform cross-business queries that join multiple information marts or cubes and to incrementally load information marts and OLAP cubes if the sequence number is not materialized somewhere.

This solution requires the replacement of the key column (the hash value that is calculated per parent business key and snapshot date) by an integer sequence value. Use a bigint data type because the PIT table probably contains many records.

For Type 1 dimensions, a hash key should be introduced per business key in hubs.

Multidimensional Database

This chapter explains how to setup the relation part of the data warehouse and how it provides the raw data and information for users to do their analytical work using either ad-hoc SQL queries or other analytical applications.

Online Analytical Processing (OLAP)

Casual business users interact with multidimensional databases to get access to the information for analytical purposes. These multidimensional databases are designed to support Online Analytic Processing (OLAP) and are an effective tool to consume information because it allows business users to formulate queries and get quick responses [1]. A multidimensional database is best suited when the end-user wants to work with aggregated information instead of raw facts.

Microsoft SQL Server Analysis Services (SSAS) provides an OLAP solution developed by Mi- crosoft as part of their SQL Server Business Intelligence stack. SSAS also provides a Tabular mode that contains the detailed data [2]. It provides the presentation database that includes aggregations and indexes to provide high query performance [1].

SSAS Characteristics:
  • User-defined metadata: the multidimensional database consists of one or more OLAP cubes, which are based on facts and dimensions, closely following the definitions known from the information marts created in the previous chapter. SSAS also supports hierarchies, and the option to combine facts from multiple OLAP cubes. If the relational foundation was built using the principles in the previous chapter, it becomes possible to run queries across multiple OLAP cubes.

  • Query performance: SSAS provides superior query performance when dealing with analytic queries. Such queries are characterized by grouping and aggregating when executed against a relational database.

  • Aggregation management: OLAP cubes provide superior performance because they precompute the data at different grain levels. This aggregation management is transparent to the business user.

  • Calculations: SSAS provides calculated measures among other calculations to add calculations to the OLAP cube.

  • Security provisions: SSAS provides complex security rules to protect the aggregated data against unauthorized use.

SSAS is best for providing aggregated information. It is not a good database to provide detailed information, typically dealt with in operational business processes. A relational database or a NoSQL environment becomes more appropriate in this case.

End Goal:

To complete the end-to-end discussion about how to build a scalable data warehouse. Focus is on characteristics that are unique for information marts prepared and built during the previous chapters because we’re dealing with virtualized informa- tion marts.

Accessing the Information Mart

The information mart provides all the information that should be presented to the end-user. It should be included by the information mart first if some data from the Raw Data Vault is missing. The SSAS database is built after.

Avoid Accessing the Raw Data Vault or the Business Vault directly to retrieve additional information or raw data that is not available in the information mart.

Each OLAP cube should only access one information mart in most cases. If data is required from other information marts, this data should be added to the primary information mart. Do not increase the storage usage by using virtualized facts and dimensions, which makes this easy.

Business logic needs to be moved up- stream from the loading procedures of the second information mart into the Business Vault in order to be accessible for the primary information mart. The Business Vault should provide reusable business rules, not any one of the information marts.

Since there should be only one information mart required to build the SSAS database, a new Analysis Services Multidimensional and Data Mining Project is created in Microsoft SQL Server Data Tools and add a new data source to the project.

Creating a Data Source
  1. Create a new data source in the solution explorer.

  2. The data source wizard is presented. After solving the initial welcome page, a page appears that is used to select the connection string.

  3. If the connection string for the information mart is not available yet, create one by selecting the new button.

  4. Enter the server name, user credentials to the information mart and the database that provides the relational information mart entities.

  5. Select the OK button to show the connection in the previous dialog.

  6. Select the connection to the information mart.

Impersonation Information

Configures the impersonation information of Analysis Services.

SSAS connects to the relational information mart via the settings, where the following options are available [3]:

  1. Use a specific Windows user name and password: Used to provide a specific user account and its password to access the relational information mart. Use if there is a dedicated Windows user account was created to access the information mart with only read-only privileges.

  2. Use the service account: The service account that is running the Analysis Services database is used to access the information mart. It requires creating a database login for the service account and granting access to the information mart.

  3. Use the credentials of the current user: The current user is used for accessing the relational information mart. This option is not available for multidimensional databases if they are located on the database backend.

  4. Inherit: Uses the impersonation options of the parent database (the Analysis Service database). Use when setting the options centrally.

Select the appropriate setting and provide user credentials if necessary. It will require that the service account (the Windows account that is running Analysis Services) has access to the relational information mart to select the use the service account option.

To continue, select the next button. The data source wizard is completed with a page that checks if the name of the data source is correct. To complete the wizard, select the finish button.

Creating Data Source View

Once the data source for the information mart has been successfully configured, create a data source view which is used to access the relational database model in the information mart.

Steps:

  1. From the solution explorer, select new data source view from the context menu of the data source views folder.

  2. The data source view wizard appears, select the data source to the relational information mart.

  3. Information marts don’t use foreign key references in most cases. Therefore, Analysis Services offers the option to create logical relationships from the metadata of the tables found in the informa- tion mart, that are found based on the column and table names.

  4. SSAS has the following methods to detect the foreign key relationships in the information mart [4]:

    • Same name as primary key: The logical relationship is created based on the equality of the fact table column name and the name of the dimension’s primary key.

    • Same name as destination table name: The logical relationship is created when the fact table column name matches the name of the dimension table.

    • Destination table name + primary key name: Only if the fact table column name matches the dimension’s table name concatenated with the name of the primary key column.

  5. Select an appropriate matching method, based on naming conventions.

  6. Select the tables and views (usually the facts and dimensions in the dimensional model) that should be included in the data source view. Ensure to have at least one selected fact table and all re- quired dimension tables on the right list of included objects.

  7. The next page shows the settings of the configured data source view.

  8. Review the settings for the data source view and select the finish button to complete the wizard.

  9. The selected tables from the information mart are added to the data source view and presented as a logical model in the center of the application.

  10. Ensure if there are some missing logical relationships, especially between the fact table and the airport dimension, which can be fixed manually by dragging the column from the fact table to the primary key of the corresponding dimension to create a logical relationship in both cases.

This completes the first step to access the information mart that was created throughout the book. The next steps are creating dimensions based on the dimension tables and the cube itself, based on the fact table.

Creating Dimensions

The data source view, created in the previous section, provides SSAS access to and defines the relational information mart. The following sections of this chapter describe how the multidimensional database and the OLAP cubes are defined based on this data source view. The first step is to define the dimensions of the database and where the dimension data is being sourced.

To create a new dimension, select new dimension from the context menu of the dimension folder in the solution explorer. The page will allow you to define the data source of the dimension or to create new data for some specific cases.

Data Source Options:
  1. Use an existing table: Create a dimension from an existing table in the information mart. The information mart table will influence which attributes will be available for inclusion to the dimension.

  2. Generate a time table in the data source: create a new time table in the information mart and use this table as the source for the dimension.

  3. Generate a time table on the server: create a new time table on the SSAS server database and not the information mart.

  4. Generate a nontime table in the data source: create a new table in the information mart, based on dimension attributes created in this wizard. An ETL job is required to load the dimension table created in this top-down approach.

Select the first option to create a new dimension in the multidimensional database based on data from the information mart. Refer to the next section which provides an alternate method to create a time table with more business-user involvement than the standard approach in SSIS. After the next button is selected, the page will ask for the table and columns information to source a new dimension from an existing table.

Configure New Dimension:
  1. Select the data source view that was created in section 15.1.2 and one of the dimension tables.

  2. Make sure that the key column was selected as key column, as there should be only one key column per dimension table

  3. The name column defines the default caption for the member in the dimension. Select an appropri- ate and useful column that provides a distinguishable and understandable name for each member in the dimension.

  4. Select all the columns from the source table that should be usable by the business user and include them into the dimension.

  5. Set up various attribute types that change the behavior of SSAS for common use cases, such as currency and geography dimensions [6].

  6. Setup up the attributes, review the settings for the dimension and make sure to provide a dimension name that is meaningful for the business user. Select finish to create the dimension in the multidimensional database.

  7. Repeat the process for other dimensions in the dimensional model.

Date Dimension

There is some advantage of running the process on your own, as opposed to setting up a date dimension (time table) from the dimension wizard. This approach will let business users change date and months abbreviations or captions.

Analytical master data table is setup and contains the dates that should be included in the date dimension and their corresponding descriptive fields. The Microsoft Data Services (MDS) DWH model supplied with this book provides members for a date dimension with descriptive attributes that can be overwritten in