Data Vault Link and Satellites
Link Definition
Purpose of Link Entities
Link entities serve as the cornerstone for modeling relationships, transactions, and associations within a Data Vault architecture. They establish connections between business keys located in hubs and define the granularity of relationships between data elements.
Key Characteristics of Links
Links capture relationships across different time horizons.
Links are immutable records of inter-hub relations.
Links avoid temporal attributes to preserve timeline agnosticism.
Links focus on relationship existence rather than specific time frames.
Temporary deletions are managed for audit trails rather than direct link modification.
Implementation of Data Vault Links
Links are implemented as tables for representing many-to-many relationships.
Links connect multiple hubs, capturing intricate relationships.
Each link contains hash keys of its connected hubs and associated metadata.
The primary key is a hash key, unique to each link instance.
Satellites enrich links with contextual data, describing the link. Other links can then reference these.
Hash keys enhance lookup speeds during data loading, optimizing ETL processes.
Attributes like Load Date and Record Source provide traceability of data origin and timing.
Logical Symbol and Scalability
Links are symbolized by a chain, representing their connective function.
Links ensure scalability by enabling the addition of further hubs and links without disrupting the existing structure.
Reasons for Many-to-Many Relationships
Flexibility accommodates changing business rules without requiring link redesigns.
Granularity is defined by the number of referenced hubs, offering adaptable data representation.
Links facilitate the absorption of changes in both data and business rules, minimizing impacts on existing systems.
Flexibility of Links
Links enhance model flexibility, simplifying the introduction and modification of relationships.
New functionalities are added by incorporating new hubs and linking them to the current structure.
This approach minimizes disruption to the existing data warehouse.
Granularity of Links
Granularity is determined by the number of connected hubs; increasing hubs refines granularity.
Refactoring practices favor creating new links over modifying existing ones.
Modifying links necessitates ETL job redesign and complicates historical data management.
Instead, a new link strategy involves 'closing' the old link and redirecting new data flows to the new link.
Link Unit-of-Work
The unit-of-work represents a cohesive data set that keeps related key sets unified.
Avoid decomposing links into smaller parts to normalize data, which could compromise data integrity.
Splitting links risks capturing incorrect or incomplete data.
Link Entity Structure
The core structure includes hash keys derived from referenced hubs.
Required metadata includes Load Date and Record Source for auditing.
Optional metadata may feature Last Seen Date for tracking.
An optional Dependent Child Key can specify grain.
Hash Key
Necessary for every link.
ETL processes use hash keys to prevent duplicates by verifying existing relationships.
Replaces traditional joins during data loading to improve efficiency.
Dependent Child Key
This key, such as a line-item number on an invoice, refines data granularity.
The dependent child key influences the uniqueness of the data set.
It is used in the derivation of the hash key to ensure uniqueness.
Link Examples
The LinkConnection example links carrier airports, source airports, destination airports, and flight numbers.
This establishes a multifaceted relationship with key metadata.
Links can also exclusively reference two hubs like HubCarrier and HubAirport.
Satellite Definition
Satellites store descriptive attributes over time adding context to both hubs and links that they are connected to.
Satellites are connected to only one Hub or Link.
Identified by the parent’s hash key and the timestamp of the change.
Satellites preserve a history of non-identifying business elements.
Satellites use the closest possible data type that the source systems use (including NULL values).
Attributes can be stored in individual satellites by category.
Satellite logical Symbol
Satellites are strictly single-parent and cannot be parents themselves.
No introduction of hash keys occurs because of the Satellite architecture.
Importance of Keeping History
Complete historical data in satellites ensures auditability.
The data warehouse acts as the definitive system of record.
Data should never be directly modified, except for the Load End Date.
Satellites are delta-driven, akin to Type II dimensions.
Splitting Satellites
Distribute data across satellites by source system and change frequency.
Splitting By Source System
This approach enhances auditing and enables parallel loading.
Prevents data contamination and inter-system dependencies.
The Record Source remains essential for identifying the data origin.
Splitting By Rate of Change
Separates attributes by how often they change, optimizing storage.
Improves performance by isolating frequently updated data.
Satellite Entity Structure
Required metadata includes the Load Date and Record Source.
Necessary attributes are Parent Hash Key and Load End Date.
Optional attributes include Hash Difference and Extract Date.
Parent Hash Key
This is the satellite's primary key. All satellites depend on one hub or link.
Load Date
Marks the time of a change, forming part of the primary key.
This key is crucial when you need to load historical data.
Load End Date
This date marks when a satellite entry becomes invalid.
Hash Difference
Allows easy value comparisons between rows.
Extract Date
The extraction timestamp aids in tracing data origins.
Satellite Examples
SatAirport depends on HubAirport, using AirportSeq + LoadDate as the primary key, and includes attributes like LoadEndDate and RecordSource.
SatAirportTZ is separate from SatAirport due to the frequent changes in GMTOffset.
A satellite can add descriptive data to a Data Vault link, such as LinkConnection.
Link Driving Key
For SatConnection attached to LinkConnection, a driving key (source structure) is used.
This organizes hub references, which is crucial for satellite data loading.
It maintains the relationships between hub entries, ensuring data integrity.