Data Vault Link and Satellites

Link Definition

Purpose of Link Entities

Link entities serve as the cornerstone for modeling relationships, transactions, and associations within a Data Vault architecture. They establish connections between business keys located in hubs and define the granularity of relationships between data elements.

Key Characteristics of Links
  • Links capture relationships across different time horizons.

  • Links are immutable records of inter-hub relations.

  • Links avoid temporal attributes to preserve timeline agnosticism.

  • Links focus on relationship existence rather than specific time frames.

  • Temporary deletions are managed for audit trails rather than direct link modification.

Implementation of Data Vault Links
  • Links are implemented as tables for representing many-to-many relationships.

  • Links connect multiple hubs, capturing intricate relationships.

  • Each link contains hash keys of its connected hubs and associated metadata.

  • The primary key is a hash key, unique to each link instance.

  • Satellites enrich links with contextual data, describing the link. Other links can then reference these.

  • Hash keys enhance lookup speeds during data loading, optimizing ETL processes.

  • Attributes like Load Date and Record Source provide traceability of data origin and timing.

Logical Symbol and Scalability
  • Links are symbolized by a chain, representing their connective function.

  • Links ensure scalability by enabling the addition of further hubs and links without disrupting the existing structure.

Reasons for Many-to-Many Relationships
  • Flexibility accommodates changing business rules without requiring link redesigns.

  • Granularity is defined by the number of referenced hubs, offering adaptable data representation.

  • Links facilitate the absorption of changes in both data and business rules, minimizing impacts on existing systems.

Flexibility of Links
  • Links enhance model flexibility, simplifying the introduction and modification of relationships.

  • New functionalities are added by incorporating new hubs and linking them to the current structure.

  • This approach minimizes disruption to the existing data warehouse.

Granularity of Links
  • Granularity is determined by the number of connected hubs; increasing hubs refines granularity.

  • Refactoring practices favor creating new links over modifying existing ones.

  • Modifying links necessitates ETL job redesign and complicates historical data management.

  • Instead, a new link strategy involves 'closing' the old link and redirecting new data flows to the new link.

Link Unit-of-Work
  • The unit-of-work represents a cohesive data set that keeps related key sets unified.

  • Avoid decomposing links into smaller parts to normalize data, which could compromise data integrity.

  • Splitting links risks capturing incorrect or incomplete data.

Link Entity Structure
  • The core structure includes hash keys derived from referenced hubs.

  • Required metadata includes Load Date and Record Source for auditing.

  • Optional metadata may feature Last Seen Date for tracking.

  • An optional Dependent Child Key can specify grain.

Hash Key
  • Necessary for every link.

  • ETL processes use hash keys to prevent duplicates by verifying existing relationships.

  • Replaces traditional joins during data loading to improve efficiency.

Dependent Child Key
  • This key, such as a line-item number on an invoice, refines data granularity.

  • The dependent child key influences the uniqueness of the data set.

  • It is used in the derivation of the hash key to ensure uniqueness.

Link Examples
  • The LinkConnection example links carrier airports, source airports, destination airports, and flight numbers.

  • This establishes a multifaceted relationship with key metadata.

  • Links can also exclusively reference two hubs like HubCarrier and HubAirport.

Satellite Definition
  • Satellites store descriptive attributes over time adding context to both hubs and links that they are connected to.

  • Satellites are connected to only one Hub or Link.

  • Identified by the parent’s hash key and the timestamp of the change.

  • Satellites preserve a history of non-identifying business elements.

  • Satellites use the closest possible data type that the source systems use (including NULL values).

  • Attributes can be stored in individual satellites by category.

Satellite logical Symbol
  • Satellites are strictly single-parent and cannot be parents themselves.

  • No introduction of hash keys occurs because of the Satellite architecture.

Importance of Keeping History
  • Complete historical data in satellites ensures auditability.

  • The data warehouse acts as the definitive system of record.

  • Data should never be directly modified, except for the Load End Date.

  • Satellites are delta-driven, akin to Type II dimensions.

Splitting Satellites
  • Distribute data across satellites by source system and change frequency.

Splitting By Source System
  • This approach enhances auditing and enables parallel loading.

  • Prevents data contamination and inter-system dependencies.

  • The Record Source remains essential for identifying the data origin.

Splitting By Rate of Change
  • Separates attributes by how often they change, optimizing storage.

  • Improves performance by isolating frequently updated data.

Satellite Entity Structure
  • Required metadata includes the Load Date and Record Source.

  • Necessary attributes are Parent Hash Key and Load End Date.

  • Optional attributes include Hash Difference and Extract Date.

Parent Hash Key
  • This is the satellite's primary key. All satellites depend on one hub or link.

Load Date
  • Marks the time of a change, forming part of the primary key.

  • This key is crucial when you need to load historical data.

Load End Date
  • This date marks when a satellite entry becomes invalid.

Hash Difference
  • Allows easy value comparisons between rows.

Extract Date
  • The extraction timestamp aids in tracing data origins.

Satellite Examples
  • SatAirport depends on HubAirport, using AirportSeq + LoadDate as the primary key, and includes attributes like LoadEndDate and RecordSource.

  • SatAirportTZ is separate from SatAirport due to the frequent changes in GMTOffset.

  • A satellite can add descriptive data to a Data Vault link, such as LinkConnection.

Link Driving Key
  • For SatConnection attached to LinkConnection, a driving key (source structure) is used.

  • This organizes hub references, which is crucial for satellite data loading.

  • It maintains the relationships between hub entries, ensuring data integrity.