Introduction: Why “The Elephant in the Fridge”?
The title is a deliberate link to the author's previous well-received book, “The Nimble Elephant: Agile Delivery of Data Models using a Pattern-Based Approach”. This connection emphasizes that the core theme of applying "Agile" development principles to pattern-based models is directly applicable and highly advantageous when generating Data Vault models. The intention is to draw experienced readers of the previous book into this new context, suggesting a continuation and evolution of familiar concepts.
The phrase “in the Fridge” is strategically used as an approachable and business-friendly alternative to the more technical term “vault.” This substitution is designed to immediately convey the idea that the data within should be easily and readily accessible to end-users and business stakeholders, avoiding the perception of a complex or locked-down system.
A traditional bank vault is presented as a metaphor for how data is often perceived: secure but difficult to access. The analogy highlights that while putting data in is relatively straightforward, extracting it involves cumbersome processes such as filling out forms, providing identification, and navigating complex permissions structures. The author points out that some individuals mistakenly portray Data Vault in this unfavorable light, creating a barrier to adoption.
The author consciously uses the more relatable and user-friendly image of a fridge to counter the negativity associated with the bank vault metaphor. A fridge is universally understood as a convenient storage space where items can be easily placed and retrieved, even at a moment's notice (like a midnight snack). This comparison underscores the desired ease of access and usability for data consumers.
The author playfully introduces the term Data Fridge as a substitute for Data Vault to emphasize the intent of making data more accessible and user-friendly for the business. This lighthearted approach aims to reduce the perceived complexity of Data Vault, suggesting ways to streamline processes and provide quicker access to the data that business users need, precisely when they need it.
Acknowledgments
Rob Barnard is singularly credited with introducing the author to the Data Vault methodology, profoundly expanding his understanding and vision in the field. His influence was pivotal; the author asserts that this book would not have been conceived or written without Rob's initial guidance and ongoing support.
Natalia Bulashenko and Emma Farrow are recognized for significantly enriching the author's journey and deepening his expertise in subsequent stages of his Data Vault career. Their contributions are acknowledged as invaluable in shaping the later directions of the book.
Roelant Vos is specifically highlighted as an Australian expert, internationally respected for his pragmatic approach and engaging personality. His thorough and insightful review of the author's “top-down” papers on TDAN.com (The Data Administration Newsletter) played a crucial role in refining and shaping many of the key ideas presented in the book.
Johnny Mackle and Peter Dudley are thanked as front-line developers whose support, collaboration, and insights have been deeply appreciated. Their transition from work colleagues to trusted friends is noted as a testament to the collaborative spirit surrounding the project.
Larry Burns is acknowledged for providing quiet, consistent encouragement from within the broader data modeling community. His behind-the-scenes support was valuable in maintaining momentum and fostering a sense of community.
Steve Hoberman is thanked as the publisher of the book and an enthusiastic supporter of all aspects of “data.” His commitment to the data community is recognized as instrumental in bringing the book to publication.
The author expresses deep appreciation for the indispensable support of his family, especially his wife, acknowledging that undertaking a large project such as writing a book would be impossible without their love and understanding.
Chapter 1: Setting the Scene
Figure 1 (not included in this text) evidently provides a visual representation of the essential components and relationships within the Data Vault 2.0 architecture. This figure serves as a foundational reference point for readers, ensuring all have a common understanding of the core elements.
Possessing a strong understanding of Data Vault fundamentals is deemed absolutely necessary for successfully applying the techniques and strategies discussed throughout the book. The book assumes a certain level of familiarity with these fundamentals.
Recognizing that some readers may be new to Data Vault, the book incorporates an optional primer. However, it is explicitly stated that the primary focus of the book is not on introductory material but rather on advanced techniques to optimize Data Vault model design for overall project success.
While understanding the basic principles of Data Vault is important, the author emphasizes that practical experience from Data Vault projects indicates that this knowledge alone is generally insufficient to guarantee success. Real-world challenges often go beyond the textbook definitions.
The author notes that Data Vault projects may have solid foundations in methodology and architecture but still encounter difficulties primarily related to modeling challenges. These challenges can be subtle and less obvious than infrastructural or procedural issues.
The author acknowledges that the fundamental modeling concepts of Data Vault—Hubs, Links, and Satellites—are relatively straightforward and can be quickly learned. However, the simplicity of these components belies the complexities that arise in their practical application.
Dan Linstedt, the creator of Data Vault, is referenced as having issued clear warnings about potential pitfalls and common mistakes in Data Vault implementations. Despite these warnings, Data Vault projects still frequently fail, highlighting a gap between theoretical knowledge and practical execution. The persistence of these failures is viewed as unfortunate, given the substantial benefits that a well-implemented Data Vault can provide.
This book aims to improve the success rate of Data Vault projects by serving as a guide to navigate common modeling pitfalls. The author employs the metaphor of lighthouses and navigation maps to illustrate how the book will highlight dangers and provide clear paths for successful implementation.
The author expresses a sincere hope that by avoiding common pitfalls and heeding the advice in this book, readers will be better equipped to deliver measurable business value from their Data Vault investments. The ultimate goal is to translate Data Vault implementations into tangible benefits for the organization.
Analogy: The Chainsaw and the 4WD
The author introduces a story about a country lad who purchased a chainsaw but, lacking the understanding of its engine, resorted to using it manually as a hand saw. This anecdote illustrates the theme of possessing a powerful tool without knowing how to fully utilize its capabilities.
Paralleling the chainsaw story, the author reflects on his experience in the 1990s of buying a four-wheel drive (4WD) vehicle and acquiring a workshop manual that presupposed a practical understanding of off-road driving techniques, which the author lacked. This highlights the issue of documentation being insufficient without practical expertise.
The narrative shifts to recount the tragic event of a tourist in Australia who hired a 4WD but, due to a lack of knowledge, became stranded in sand and perished from thirst, unaware of the technique of deflating tires to improve traction. This cautionary tale emphasizes the potentially dire consequences of misusing powerful tools due to insufficient training and understanding.
The analogies culminate in the assertion that Data Vaults, like chainsaws and 4WDs, are highly capable tools designed for specific purposes. However, their effectiveness hinges on the user's knowledge of how to “drive” or operate them correctly, suggesting that without the right expertise, projects may not realize their intended benefits and could lead to disappointment.
Analogy: The Aussie Outback
The Australian Outback is depicted as a vast, predominantly flat landscape characterized by dryness, intense heat, and extreme remoteness. This setting serves as a metaphor for the challenging environment of data management and the potential isolation of data professionals.
Specifically mentioned is the Nullarbor Plain, home to one of the world’s longest straight roads and devoid of any trees or significant landmarks. This highlights the monotony and potential for disorientation in complex data projects.
The author recounts a personal experience involving a trip with his wife and friends through a remote “shortcut” that turned out to be an overgrown and barely passable track. This anecdote underscores the risks associated with poorly planned or executed data initiatives.
The narrative details that a particular 70-kilometer (roughly 40-mile) segment of the track took two full days to traverse, illustrating the significant time and resource costs that can arise from underestimated challenges in data projects.
Despite the arduous journey, the group eventually reached a picturesque beach, only to discover that it was inhabited by saltwater crocodiles. This unexpected hazard reinforces the theme that even seemingly successful outcomes can present unforeseen risks.
The overarching message of the analogy is that while Australia offers many wonderful places, it also harbors real dangers. The author advises that heeding warnings is crucial for a successful experience, while ignoring these warnings can lead to serious negative consequences.
The analogy is then directly applied to Dan Linstedt's warnings regarding the construction of Data Vaults, drawing a clear parallel between the potential dangers of the Australian Outback and the risks associated with improper Data Vault implementation.
The author concludes by noting that, as with the warnings about the Australian Outback, some individuals disregard Dan’s advice on Data Vault construction and then wonder why their projects do not yield the desired results, thereby reinforcing the importance of following expert guidance.
Dan Linstedt's Warnings
The first warning from Dan Linstedt emphasizes that “Data Vault modeling was, is, and always will be ABOUT THE BUSINESS.” This assertion underscores the necessity of aligning data models with business needs and objectives, not merely technical considerations. He continues: “And if the Data Vault you have in place today is not currently about the business, then unfortunately you’ve hired the wrong people, and those people need to go back to school and re-learn what Data Vault really means. OR you’ve built the wrong solution, and you need to fix it – immediately.” This strong language stresses the urgency and importance of business alignment.
The second warning highlights the importance of ontologies: “Ontologies are a very very important asset to the corporation – if built at the enterprise level, you must focus on ontologies while you are building the Data Vault solution, or the full value of the … Data Vault cannot be realized.” This statement suggests that an enterprise-level ontology is crucial for unlocking the full potential of a Data Vault implementation.
The author interprets Linstedt’s warnings to mean that the data model underlying the Data Vault design must be fundamentally business-centric. The method for achieving this is to begin with an enterprise ontology that reflects and codifies the organization’s business concepts and relationships.
The author clarifies that an enterprise ontology can be practically understood as a data model that describes business concepts, their inter-relationships, and their major attributes. This definition seeks to demystify the somewhat abstract term “ontology” for data professionals.
For practical purposes within the context of this book, the author equates an enterprise ontology with a top-down, big-picture enterprise data model. This rephrasing is intended to make the concept more accessible and actionable for readers who may be more familiar with data modeling terminology.
Questions to be Answered
The first question posed is: “What on earth is an ‘enterprise ontology’, ’cause I won’t know if I’ve got one if I don’t know what I’m looking for.” This reflects a need for a clear and understandable definition of enterprise ontology and guidance on how to identify one.
The second question is: “If I can’t find one, and Dan says I need one, how do I get my hands on one, or create one?” This underscores the practical challenge of obtaining or developing an enterprise ontology when one is not readily available.
The third question asks: “And even if I have one of these wonderful things, how do I apply it to get the sort of Data Vault that Dan recommends?” This addresses the critical issue of how to effectively use an enterprise ontology to design and implement a Data Vault aligned with Dan Linstedt’s recommendations.
The author explicitly states that the book will provide answers to each of these questions, setting a clear expectation for readers and outlining the core objectives of the material that follows.
Roadmap to Data Vault Success
The author asserts that by performing the subsequent four tasks, readers will be well-positioned for Data Vault success, offering a structured approach to implementation.
The first step involves seeding the Data Vault project by collecting freely available knowledge within the organization, which is essential for building a comprehensive understanding of the business context.
It is crucial to precisely define the target of the project to ensure that all efforts are aligned with clear, measurable goals.
Knowledge should be gathered regarding the intended data marts and similar applications that the new Data Vault will support, ensuring that the architecture meets downstream requirements.
The approach should be proactive, with the project team acting as “sponges” to absorb information from diverse sources, building a robust knowledge base.
This includes engaging with subject matter experts (SMEs) throughout the business to capture their insights and tacit knowledge.
Thoroughly reviewing all available business process documentation to understand data flows and business rules is essential.
Examining business glossaries to identify and understand the definitions of key business terms such as Customers, Assets, etc., is also vital.
It is important to gain familiarity with established data model patterns, leveraging proven, generic structures that fit most situations and provide a solid starting point.
These patterns should be specialized and tuned to meet the specific and unique needs of the organization, ensuring a tailored solution.
Task #1: Define how the business sees their data
This task is described as central to the top-down design approach for a Data Vault, underscoring the importance of aligning the data model with business perspectives.
The author reiterates Dan Linstedt’s assertion that an “enterprise ontology” is needed to properly shape the Data Vault design, reinforcing the significance of this concept.
The author acknowledges that various professionals refer to similar concepts using different terms, such as enterprise data models, business data models, enterprise conceptual models, or enterprise logical models. This recognizes the diversity of terminology in the field.
The author references David Hay’s book, “Achieving Buzzword Compliance,” which provides precise definitions of these different types of models, including the identification of three distinct types of conceptual models. This acknowledges the nuances within data modeling practices.
For the sake of clarity and consistency within the book, the author adopts a flexible stance on the specific terminology used to describe this essential model. This avoids getting bogged down in semantic debates and focuses on practical application.
The author affirms that the essential characteristic is that it provides an enterprise-wide view of the data of interest. The author suggests that this makes it an enterprise data model by definition, despite specific nomenclature.
The author specifies that, for this book, the enterprise data model consists of two simple parts, making it easier to implement:
It starts with a one-page diagram that identifies the major data subject areas of the enterprise, each represented by an icon typically based on generic data model patterns. This provides a high-level overview of the key data entities.
It then drills down from the generic data model patterns into the specifics of the organization. This involves creating a “taxonomy” that represents the hierarchy from supertypes (generic patterns) to subtypes (business-specific entities).
The author characterizes this as a light-weight framework that can be assembled in weeks or even days, assuming a working familiarity with the underlying data model patterns. This emphasizes the practicality and efficiency of the approach.
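The drill-down from generic patterns to business-specific subtypes can be pictured as a small tree. What follows is only a minimal sketch in Python, not the author's notation. The class `Concept`, the helper `paths`, and the entity names (Party, Person, Customer and so on) are hypothetical and chosen purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in the supertype/subtype taxonomy (hypothetical structure)."""
    name: str
    subtypes: list[Concept] = field(default_factory=list)

    def add(self, *children: Concept) -> Concept:
        """Attach subtypes and return self so calls can be nested."""
        self.subtypes.extend(children)
        return self


def paths(node: Concept, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Return every path from the generic pattern down to each subtype."""
    here = prefix + (node.name,)
    result = [here]
    for child in node.subtypes:
        result.extend(paths(child, here))
    return result


# Hypothetical example: a generic "Party" pattern specialized for one business.
party = Concept("Party").add(
    Concept("Person").add(Concept("Employee"), Concept("Customer")),
    Concept("Organization").add(Concept("Supplier")),
)

for path in paths(party):
    print(" > ".join(path))
```

Each printed path runs from supertype to subtype, which is the hierarchy from which Task #2 later selects its Hubs.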
Task #2: Design the Data Vault, based on the business view
This task involves mapping the high-level, business-centric view of data into the detailed Data Vault design, ensuring alignment between business needs and the data architecture.
The author reiterates that this is a top-down approach, starting with the business perspective and then translating it into technical specifications.
Hubs are selected from a “sweet spot” on the supertype/subtype hierarchy or taxonomy, ensuring that they represent core business concepts.
Links are constructed based on the business relationships between the identified business Hubs, reflecting actual interactions and dependencies.
Business Satellites are designed to capture data attributes as the business wants to see them, not as they are named or held in source systems, ensuring a business-friendly view of the data.
The author notes that these business-centric Satellites are likely to become conformed Satellites, used later to drive Data Marts, highlighting their reusability and importance.
At this stage, the Satellites represent a preliminary capture of the target information that the business desires, setting the stage for more detailed data integration.
The author emphasizes that for these two tasks, the focus remains on delivering tangible value to the business by reflecting their perspective of the data. Source systems are deliberately not considered initially to avoid technical biases.
The author acknowledges that the reality of available data sources must eventually be addressed, but this is purposefully deferred until after the business view has been clearly defined.
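To make the three building blocks of Task #2 concrete, here is a minimal sketch of Hubs, Links, and business Satellites as plain Python records. It is an illustration under stated assumptions rather than the book's design: the class fields and the names used (Customer, Account, `customer_number`, and so on) are hypothetical, and source systems are deliberately absent.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hub:
    name: str           # a core business concept
    business_key: str   # how the business identifies one instance


@dataclass(frozen=True)
class Link:
    name: str
    hubs: tuple[Hub, ...]   # the Hubs this business relationship connects


@dataclass(frozen=True)
class Satellite:
    name: str
    parent: Hub | Link             # a Satellite hangs off one Hub or one Link
    attributes: tuple[str, ...]    # named as the business wants to see them


# Hypothetical business view.
customer = Hub("Customer", business_key="customer_number")
account = Hub("Account", business_key="account_number")
holds = Link("Customer holds Account", hubs=(customer, account))
customer_details = Satellite(
    "Customer details (business view)",
    parent=customer,
    attributes=("full_name", "date_of_birth"),
)

print(holds.name, "connects", [hub.name for hub in holds.hubs])
```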
Task #3: Bottom-up Source-to-Data Vault mapping
This task marks the shift from a top-down business view to a bottom-up consideration of available data sources, ensuring that the Data Vault can be populated with real-world data.
The author stresses the importance of mapping source data to the established business objects rather than creating source-centric Hubs or Links unless absolutely necessary. This avoids data silos and promotes integration.
The author uses the example of having 50 source data feeds with Customer data to illustrate the problem of creating 50 Customer Hubs, each with its own Satellite. This highlights the need for consolidation and a business-centric approach.
Instead, the author advocates for the creation of source-specific Satellites attached to a single, business-centric Hub, allowing for the capture of source-specific details without fragmenting the core business concepts.
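The idea of one business-centric Hub with a Satellite per source can be sketched with in-memory structures. This is a simplified illustration, not a loading routine from the book; the feed names (`crm`, `billing`, `web`), the keys, and the attributes are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows from three source feeds that all describe customers:
# (source feed, business key, source-specific attributes)
feed_rows = [
    ("crm", "C001", {"name": "Ann Lee"}),
    ("billing", "C001", {"balance": 120.0}),
    ("billing", "C002", {"balance": 0.0}),
    ("web", "C002", {"email": "bo@example.com"}),
]

customer_hub = set()                   # one Hub: the distinct business keys
source_satellites = defaultdict(list)  # one Satellite per source feed

for source, business_key, attributes in feed_rows:
    customer_hub.add(business_key)
    source_satellites[source].append({"customer": business_key, **attributes})

print(f"{len(customer_hub)} customers in one Hub, "
      f"{len(source_satellites)} source-specific Satellites")
```

However many feeds arrive, the number of Customer Hubs stays at one; only the number of source-specific Satellites grows.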
The short version is that if a source feed has Links that map neatly to the previously identified business-centric Links, then that is where they should be mapped, maintaining consistency and alignment.
The author notes that it is often necessary to construct new Links to represent transactions or events as presented by source system data feeds, which may not directly align with the business-centric Links.
These new Links may also require their own Satellites, opening a contentious topic that will be addressed separately, acknowledging the complexity of handling transactional data.
Task #4: Define business rules
This task focuses on defining the business rules that fill the gaps between the raw data and the desired business view, adding necessary transformations and logic.
The author states that the two most common forms of business rules are:
Rules to map multiple source-specific Satellites into a single, consumption-ready “conformed” Satellite, consolidating data from various sources into a unified view.
Rules to map source-specific “Event / Transaction” Links to their corresponding Hubs, Links, and/or Satellites, integrating transactional data into the broader data model.
The author acknowledges that there are other types of business rules, such as those related to de-duplication of instances within a Hub, which are necessary to ensure data quality and consistency.
The author explains that in cases where duplicates exist, the business may want them consolidated to present a de-duplicated view ready for consumption, enhancing usability and accuracy.
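The first kind of rule, mapping several source-specific Satellites into one conformed Satellite, can be sketched as a declarative mapping. This is a minimal sketch under assumptions: the rule format, the precedence order (first non-missing value wins), and all attribute names are hypothetical, not taken from the book.

```python
# Hypothetical source-specific Satellite rows for one customer.
crm_row = {"cust_nm": "Ann Lee", "dob": "1980-02-01"}
billing_row = {"customer_name": "A. Lee", "balance": 120.0}

# Hypothetical rule set: for each business attribute, the candidate
# (source, source attribute) pairs in order of preference.
CONFORM_RULES = {
    "full_name": [("crm", "cust_nm"), ("billing", "customer_name")],
    "date_of_birth": [("crm", "dob")],
    "balance": [("billing", "balance")],
}


def conform(rows_by_source, rules=CONFORM_RULES):
    """Build one conformed row, taking the first available value per attribute."""
    conformed = {}
    for target, candidates in rules.items():
        for source, attribute in candidates:
            value = rows_by_source.get(source, {}).get(attribute)
            if value is not None:
                conformed[target] = value
                break
    return conformed


conformed = conform({"crm": crm_row, "billing": billing_row})
print(conformed)
```

Keeping the rules as data rather than burying them in code is one way to keep business rules explicitly declared, which the primer in Chapter 2 describes as the mark of “business” data.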
In summary
The author concludes that these four tasks collectively define a roadmap for the end-to-end design of a Data Vault, providing a comprehensive framework for implementation.
The author emphasizes that the diagram (not included in this text) presents the essence of sound Data Vault design, serving as a visual summary of the key principles.
The author outlines the subsequent structure of the book:
A primer on the Data Vault elements will be provided to ensure a common understanding of the core components.
Detailed explanations of the four tasks introduced above will be presented, guiding readers through each step of the process.
A deeper dive into some of the more intricate details of Data Vault will be offered, addressing advanced topics and potential challenges.
An appendix will introduce several common data model patterns that can aid in performing Task #1, providing practical tools and techniques.
Chapter 2: A Data Vault Primer
The author introduces Chapter 2 as a Data Vault primer for those who are not yet familiar with this data warehousing approach, providing foundational knowledge for beginners.
The author suggests that readers already well-versed in Data Vault and its relation to other approaches like those of Ralph Kimball and Bill Inmon, and comfortable with Data Vault modeling fundamentals, may skip this primer. They can proceed to the section on “Task #1 – Form the Enterprise View” to engage with more advanced material on Dan Linstedt’s “ontology” concept.
A bit of data warehouse history
The author begins with an overview of Ralph Kimball’s contributions to data warehousing, specifically his approach to dimensional modeling.
The author notes that Kimball’s work is often associated with terms like Dimensional Modeling, Star Schemas, Cubes, and Data Marts, all of which are foundational concepts in business intelligence.
Transaction-processing systems, which handle the day-to-day operations of the business, are depicted on the left side of an implied diagram. These systems are optimized for transactional efficiency.
The author points out that while these systems are effective for operational purposes, they often struggle to support robust business reporting due to their design and data structures.
On the right-hand side of the diagram are purpose-built data marts, which are specialized databases designed to support specific reporting and analytical needs.
Each data mart focuses on a particular set of “facts,” such as complaints against medical practitioners, classified by “dimensions” like the practitioner, complaint type, and dates. This highlights the dimensional structure common in Kimball’s approach.
The author explains that enterprise-wide integration in Kimball’s model is achieved through the use of common, shared dimensions across multiple data marts, known as “conformed dimensions.” This ensures consistency and comparability of data across different reporting areas.
The author notes that these data marts are well-suited for slicing-and-dicing data in various ways, enabling users to explore and analyze data from multiple perspectives.
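Slicing-and-dicing can be illustrated with a handful of fact rows counted by whichever dimensions are of interest. The sketch below is a toy example, not a real data mart; the complaint rows and the helper `slice_by` are hypothetical.

```python
from collections import Counter

# Hypothetical fact rows: one per complaint, each classified by dimensions.
complaints = [
    {"practitioner": "Dr A", "complaint_type": "billing", "year": 2018},
    {"practitioner": "Dr A", "complaint_type": "conduct", "year": 2018},
    {"practitioner": "Dr B", "complaint_type": "billing", "year": 2019},
]


def slice_by(facts, *dimensions):
    """Count facts grouped by the chosen dimensions."""
    return Counter(tuple(fact[d] for d in dimensions) for fact in facts)


print(slice_by(complaints, "complaint_type"))
print(slice_by(complaints, "practitioner", "year"))
```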
One advantage of the Kimball approach is the ability to quickly and easily build the first data mart, allowing for rapid deployment and demonstration of value.
Another advantage is the good selection of tools that enable non-technical users to visualize the data, empowering business users to perform their own analysis.
The author notes that a perceived disadvantage of the Kimball approach emerges over time as complexity increases with the addition of more data marts and the extension of existing ones to meet new demands. This scaling challenge is a common critique.
The author then introduces Bill Inmon and his Third Normal Form (3NF) Data Warehouse, which represents a centralized approach for supporting organization-wide reporting.
In Inmon’s model, the same operational systems feed into one central “normalized” repository, ensuring a single source of truth.
Data from this central repository is then transformed and pushed out into formats ready for easy consumption, including, of course, Data Marts. This highlights the flexibility of the Inmon approach.
One advantage of this centralized approach is consistency, as all reports are sourced from one clean, uniform version of the “truth.” This addresses a key challenge in data warehousing.
Another advantage that appears over time is that the data supplied from each single source feed can potentially be reused to support multiple Data Marts, improving efficiency and reducing redundancy.
The author acknowledges that this approach comes with a price tag: the first Data Mart to be supported incurs an overhead not needed in the more agile and iterative Kimball approach. This initial investment is a trade-off for long-term benefits.
The author then introduces Dan Linstedt and his Data Vault architecture, positioning it as a response to some of the limitations of the Kimball and Inmon approaches.
The author anticipates potential skepticism by asking, “We’ve already got turf wars between the followers of the Kimball versus Inmon approaches, so why on earth do we need a new player in the field?” This sets the stage for explaining the unique benefits of Data Vault.
The author suggests that the origins of Data Vault provide an answer to this question, highlighting that necessity is often the mother of invention.
The author recounts that Dan Linstedt developed Data Vault in response to a client in the 1990s whose demands pushed the boundaries of data warehousing, specifically the need to store petabytes of data. The author clarifies that a petabyte is a thousand terabytes, or roughly a million gigabytes.
The author emphasizes that storing vast amounts of data is only one part of the challenge; the other is obtaining timely results from queries, which Data Vault was designed to address.
The author cites a more recent Data Vault implementation at Micron, the computer chip manufacturer, where they add billions of rows to their Data Vault each day, showcasing the scalability of the architecture.
The author states that the title of the book by Dan Linstedt and Michael Olschimke, “Building a Scalable Data Warehouse…,” highlights one of the primary reasons people adopt Data Vault—scalability. However, the author notes that scale is just one of several reasons.
The author admits that most of his clients have very modest requirements compared to “big data” benchmarks like volume and velocity, but that they are often interested in another of the big data V’s: variety.
The author explains that when required to ingest not just predictable, structured data, but also unstructured and semi-structured data, Data Vault’s ability to accommodate all such forms is impressive, making it a versatile choice.
The author highlights other benefits such as agility and extensibility, noting the desire to quickly incorporate changes and get things “right” the first time.
The author acknowledges that Data Vault is not a silver bullet that solves all problems but that it certainly eases the pain associated with managing complex data environments.
For instance, the author praises the philosophy of always inserting new data and never updating existing data. This approach avoids tedious reloading and simplifies data management.
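The insert-only philosophy can be sketched as an append-only list in which a change produces a new row and history is never overwritten. This is a minimal sketch, not a Data Vault loader: the row layout is hypothetical, and skipping unchanged rows is an added assumption rather than something stated in the text.

```python
from datetime import datetime, timezone

satellite_rows = []  # append-only: rows are never updated or deleted


def current(business_key):
    """Return the latest row for a key (rows are appended in load order)."""
    for row in reversed(satellite_rows):
        if row["business_key"] == business_key:
            return row
    return None


def load(business_key, attributes, load_time=None):
    """Insert a new row; skip it when nothing changed (an assumption here)."""
    latest = current(business_key)
    if latest is not None and latest["attributes"] == attributes:
        return False
    satellite_rows.append({
        "business_key": business_key,
        "load_time": load_time or datetime.now(timezone.utc),
        "attributes": dict(attributes),
    })
    return True


load("C001", {"city": "Perth"})
load("C001", {"city": "Perth"})      # unchanged, so nothing is inserted
load("C001", {"city": "Melbourne"})  # changed, so a new row is appended

print(len(satellite_rows), "rows; current city:",
      current("C001")["attributes"]["city"])
```

Because the earlier row is still present, the full history remains available without any reloading.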
The author summarizes that the “sales pitch” for Data Vault goes on and on, reflecting ever-changing demands, such as real-time feeds and service-oriented architectures. These capabilities make Data Vault a modern and adaptable solution.
The author touches on the need for full traceability of data lineage—from the view presented to data consumers back to the operational systems that sourced the data—and notes that Data Vault supports this requirement.
The author briefly discusses different batch processing approaches based on ETL (extract-transform-load) and ELT (extract-load-transform), noting their different philosophies.
The author explains that the ETL approach often aims to clean up the data on the way into the data repository, while Data Vault aims to load “all the data, all the time” and then apply business rules to clean it up. This allows for greater flexibility and the ability to correct mistakes retroactively.
The author stresses that even “dirty” data can be very instructive and should not be discarded, as it can provide valuable insights into data quality issues.
The author then describes the Data Vault architecture, which at first glance looks similar to Inmon's style: operational systems flow data through a staging area into a centralized Data Warehouse, which then feeds Data Marts and flat extracts. This emphasizes the elements the architectures have in common.
A few things are worth noting
The author emphasizes that the staging area does not necessarily need to persist (store) all the data that passes through it. The Data Vault itself can record these details, though organizations may choose to use persistent staging areas if they already exist.
The author points out that a single Data Vault can hold data from multiple distinct sources, consolidating information from across the enterprise.
The author distinguishes between “raw” data, which faithfully captures all the data as presented from operational sources, and “business” data, which is transformed using explicitly declared business rules.
The author explains that some “raw” data may be ready for direct consumption, while other data requires transformation, such as renaming attributes or computing derivable values.
The consumption layer, as well as objects in the Data Vault generated by business rules, can be virtual, enabling fast construction and reconstruction as needs change. This highlights the agile nature of Data Vault.
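The split between stored “raw” data and a virtual “business” view can be sketched with a generator that applies declared rules on demand. This is an illustration only: the column names, the rename table, and the derived `line_total` are hypothetical, and a real implementation would more likely use database views.

```python
# Hypothetical raw rows, stored exactly as the source presented them.
raw_rows = [
    {"CUST_NM": "Ann Lee", "QTY": 3, "UNIT_PRC": 2.50},
    {"CUST_NM": "Bo Tran", "QTY": 1, "UNIT_PRC": 10.00},
]

# Declared business rules: rename attributes, then compute a derived value.
RENAMES = {"CUST_NM": "customer_name", "QTY": "quantity", "UNIT_PRC": "unit_price"}


def business_view(rows):
    """Yield business rows on demand; nothing is stored, so it is 'virtual'."""
    for row in rows:
        renamed = {RENAMES[key]: value for key, value in row.items()}
        renamed["line_total"] = renamed["quantity"] * renamed["unit_price"]
        yield renamed


for row in business_view(raw_rows):
    print(row)
```

Changing a rule only means changing `RENAMES` or the derivation and regenerating the view; the raw rows are left untouched.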
The author mentions that technologies such as NoSQL databases can be an integral part of the Data Vault and that the Data Vault can be constructed on platforms like Hadoop. This reflects the adaptability of Data Vault to modern data ecosystems.
The author adds that cloud solutions such as Snowflake are increasingly playing an exciting role in Data Vault deployment, underscoring the technology’s relevance to current trends.
The author notes the potential for a services layer to enable two-way communication between operational systems and the Data Vault. This allows for dynamic interaction and feedback loops.
For example, a data load into the Data Vault may trigger an alert to be managed by an operational system, showcasing the integration capabilities of Data Vault.
The author declares that his passion lies in the modeling part of Data Vault and, like his story of buying a workshop manual for his 4WD, acknowledges that others have more expertise in the technology aspects. He includes the above details to set Data Vault in context.
The author concludes by shifting focus to the data model components inside a Data Vault, signaling the primary subject of the remainder of the book.
Data Vault made (too) easy
One position on Data Vault standards
The author notes that Data Vault standards, like all standards, are subject to change over time, acknowledging the evolving nature of the methodology.
The author highlights that the most notable change was the introduction of Data Vault 2.0 as an upgrade to the original Data Vault, marking a significant evolution in the approach.
The author mentions that Dan Linstedt encourages people to challenge his standards but suggests that any proposed variation be thoroughly and objectively debated before being adopted, emphasizing the need for careful consideration.
The author clarifies that this book represents only one perspective and that objective evaluation involving multiple parties is not possible in a one-way communication. This acknowledges the potential for alternative viewpoints.
The author recognizes the existence of alternative views and that each variation may have strengths and weaknesses in certain circumstances, reiterating the need for careful evaluation.
For the sake of simplicity and consistency, the author commits to following the modeling standards as published in “Building a Scalable Data Warehouse with Data Vault 2.0,” providing a clear reference point.
The author promises to explicitly present some aspects of alternative views towards the end of the book, so that different perspectives are acknowledged.
The author suggests that, for the sake of progressing the wider discussion on how an enterprise view can contribute to a Data Vault design, readers may find it simpler to assume Dan’s perspective, at least for now. This is intended to provide a common ground for discussion.
A case study
The author introduces a personal anecdote about a time when his children were threatened by a severe firestorm and he was unable to contact them due to communication failures.
The author explains that this event occurred before the advent of mobile phones and that landlines were down, highlighting the sense of helplessness and urgency.
The author expresses gratitude that, as it turned out, his family was safe, but acknowledges that many others suffered on that day, emphasizing the real-world impact of emergency situations.
The author reveals that he has worked on IT solutions for emergency response organizations and has a deep passion for using technology to protect the community, providing context for the case study.
Wildfire Data
The author introduces the case study, noting that each fire is uniquely identified by a business key known as the Fire Reference Number, providing a specific example of a business key.
The author adds that each fire truck is uniquely identified by a business key known as the Registration Number, providing another specific example of a business key.
The author explains that at any given point in time, each fire truck may be assigned to only one fire, while each fire may have zero, one, or more fire trucks assigned to it. This describes the relationship between fire trucks and fires.
The author highlights the phrase “point in time,” noting that many operational systems hold only the current values, not historical data, emphasizing the need for a Data Warehouse.
The author clarifies that keeping a record of past data values is one of the reasons people consider a Data Warehouse, of which Data Vault is a more modern form, reinforcing the value of historical data.
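The assignment rule and the "point in time" idea can be sketched as follows. This is a plain illustration, not a Data Vault structure and not taken from the book; the class names and the truck identifiers `TRUCK-1` and `TRUCK-2` are hypothetical. Every assignment change is kept, so a question about any past moment can still be answered.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Set


@dataclass(frozen=True)
class Assignment:
    registration_number: str              # Fire Truck business key
    fire_reference_number: Optional[str]  # Fire business key; None means unassigned
    effective_from: datetime


class AssignmentHistory:
    """Keeps every assignment change instead of overwriting the current value."""

    def __init__(self) -> None:
        self._rows: List[Assignment] = []

    def assign(self, truck: str, fire: Optional[str], when: datetime) -> None:
        self._rows.append(Assignment(truck, fire, when))

    def fire_for_truck(self, truck: str, as_at: datetime) -> Optional[str]:
        """Return the single fire (or None) the truck was assigned to at as_at."""
        rows = [r for r in self._rows
                if r.registration_number == truck and r.effective_from <= as_at]
        if not rows:
            return None
        # The latest change on or before as_at wins: one fire per truck at a time.
        return max(rows, key=lambda r: r.effective_from).fire_reference_number

    def trucks_for_fire(self, fire: str, as_at: datetime) -> Set[str]:
        """Return the zero, one, or more trucks assigned to the fire at as_at."""
        trucks = {r.registration_number for r in self._rows}
        return {t for t in trucks if self.fire_for_truck(t, as_at) == fire}


if __name__ == "__main__":
    history = AssignmentHistory()
    history.assign("TRUCK-1", "WF2018-123", datetime(2018, 10, 10, 12, 0))
    history.assign("TRUCK-2", "WF2018-123", datetime(2018, 10, 10, 13, 0))
    history.assign("TRUCK-1", "WF2018-124", datetime(2018, 10, 11, 9, 0))
    print(history.trucks_for_fire("WF2018-123", datetime(2018, 10, 10, 18, 0)))
    print(history.trucks_for_fire("WF2018-123", datetime(2018, 10, 11, 10, 0)))
```

An operational system that stores only the current value could answer the second question but not the first.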
Hubs
The author explains that Hubs are at the center of data structures in a Data Vault, serving as the core entities around which data is organized.
The author emphasizes that Hubs are based on business concepts and hold business keys, underscoring the business-centric nature of Data Vault.
The author reiterates the word “business” to draw attention to the fact that building a Data Vault is not just a technical exercise but must involve the business, reinforcing the importance of collaboration.
For simplicity, the author assumes that a Fire is a recognized business concept with a business key of Fire Reference Number, providing a concrete example.
Similarly, the author assumes that a Fire Truck is a recognized business concept with a business key of Registration Number, providing another concrete example.
The author concludes that the model now has two Hubs, Fire and Fire Truck, each with a unique index for the nominated business key, establishing the fundamental building blocks of the Data Vault.
Hub Keys
The author notes that relational databases typically expect a single key to be nominated as each table's “primary key” and that every row in the table must have a unique value for its primary key, providing context for key management.
The author clarifies that the business key could be the primary key for the table, as it is unique, but that Data Vault often uses another mechanism called a “surrogate key.”
The author explains that a surrogate key is typically a unique but meaningless number generated by software and that greater detail is not needed here, keeping the focus on core concepts.
Here, in a Data Vault, one reason for using a surrogate is to improve performance, providing a justification for this practice.
The author references Data Vault 2.0 (according to the standard published in “Building a Scalable Data Warehouse with Data Vault 2.0”) and notes that this “surrogate primary key” is a hash key, another technical artifact.
The author simplifies the explanation by stating that the text string value of the natural business key is thrown into a mathematical algorithm, and out pops some number, clarifying the hashing process.
The author emphasizes that the same text string will always produce the same hash key value, ensuring consistency and reliability.
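A minimal sketch of this behaviour is shown below. The choice of MD5 over UTF-8 bytes is an assumption made for illustration, since the text here speaks only of "a mathematical algorithm"; the function name `hub_hash_key` is likewise hypothetical.

```python
import hashlib


def hub_hash_key(business_key: str) -> str:
    """Return a hexadecimal hash of the business key's text value."""
    # Assumption: MD5 over UTF-8 bytes; any deterministic hash shows the same idea.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = hub_hash_key("WF2018-123")
    second = hub_hash_key("WF2018-123")
    print(first == second)  # True: same text string, same hash key
    print(len(first))       # 32 hexadecimal characters for an MD5 digest
```

The hexadecimal string stands in for the "number" that pops out: it is the same value written in base 16.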
The author notes that the previous version of Data Vault had a sequence number that served a similar purpose and stresses that whether the Data Vault is 1.0 or 2.0, the Hub’s surrogate key is for Data Vault internal use only.
The author warns that the Hub's surrogate key should not be used as an alternative business key, reinforcing the separation of technical and business keys.
The author jokingly doubts that anyone looking at the big, ugly structure of the value generated for a hash key would want to use it as a business key, adding a touch of humor.
Hash Keys
The author anticipates the question of why a hash key might be used and provides several good reasons for its adoption.
The author states that this approach helps deliver good performance in several situations, including loading data (facilitating massive parallel processing) and relational join performance on data retrieval.
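One common explanation for the loading benefit, offered here as an assumption rather than something spelled out in the text, is that a hash key can be computed from the business key alone, so parallel loaders never need to consult a shared counter or lookup table. The sketch below uses a hypothetical `load_batch` function and an assumed MD5 hash to show two loaders arriving at the same key independently.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(business_key: str) -> str:
    # Assumption: MD5 over UTF-8 text; any deterministic hash would do here.
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()


def load_batch(batch):
    """Hypothetical loader: derives keys locally, with no shared lookup table."""
    return {key: hash_key(key) for key in batch}


if __name__ == "__main__":
    # Two batches from different feeds; "WF2018-123" appears in both.
    batches = [["WF2018-123", "WF2018-124"], ["WF2018-123", "WF2018-125"]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        first, second = pool.map(load_batch, batches)
    # Both loaders produce the same key without coordinating with each other.
    print(first["WF2018-123"] == second["WF2018-123"])
```

With a sequence number, by contrast, the second loader would have to wait for, or look up, the value the first loader had assigned.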
The author adds that hash keys assist with the joining of relational and non-relational data, which is increasingly important in modern data environments.
The author admits that the last point on non-relational data is a bit technical and not his area of strength, so he is happy to accept that there are sound computer science reasons for this, deferring to other expertise.
The author concludes by stating that he is happy to get back to modeling, reinforcing his focus on the higher-level design aspects.
Hub columns
The author notes that there are two more columns in the Hub tables: the Record Source column, which notes the source system that first presented the business key value to the Data Vault, and the Load Date / Time, which notes the precise date and time that this occurred.
The author explains that these columns may prove useful for audit and diagnostics purposes, providing value for data governance and troubleshooting.
The author cautions against getting side-tracked and emphasizes that a Hub table fundamentally exists to hold the set of business keys for a given business concept, and nothing more, reinforcing its core purpose.
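The Hub columns described above can be sketched as a small in-memory structure. This is an illustration under stated assumptions, not the book's implementation: the class names, the MD5 hash, and the source labels `SOURCE-A` and `SOURCE-B` are all hypothetical. The point it shows is that a business key is stored once, and the Record Source and Load Date / Time of its first arrival are retained.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class HubRow:
    business_key: str        # e.g. Fire Reference Number
    hash_key: str            # surrogate key, for Data Vault internal use only
    load_datetime: datetime  # when the row was loaded into the Data Vault
    record_source: str       # system that first presented the business key


class Hub:
    def __init__(self) -> None:
        # Keying the dict on the business key plays the role of the unique index.
        self._rows: Dict[str, HubRow] = {}

    def load(self, business_key: str, record_source: str,
             load_datetime: Optional[datetime] = None) -> bool:
        """Add the key if it is new; return True only when a row was inserted."""
        if business_key in self._rows:
            return False  # keep the first record source and load time
        # Assumption: MD5 hex digest as the hash key; the algorithm is illustrative.
        hash_key = hashlib.md5(business_key.encode("utf-8")).hexdigest()
        self._rows[business_key] = HubRow(
            business_key, hash_key,
            load_datetime or datetime.now(timezone.utc), record_source)
        return True

    def get(self, business_key: str) -> Optional[HubRow]:
        return self._rows.get(business_key)

    def __len__(self) -> int:
        return len(self._rows)


if __name__ == "__main__":
    fire_hub = Hub()
    fire_hub.load("WF2018-123", "SOURCE-A", datetime(2018, 10, 10, 12, 34, 56))
    fire_hub.load("WF2018-123", "SOURCE-B", datetime(2018, 10, 11, 8, 0, 0))
    print(len(fire_hub), fire_hub.get("WF2018-123").record_source)
```

The second load is ignored because the Hub already holds that business key, which keeps the table to the set of business keys and nothing more.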
The author provides an example of the attribute values that one instance in the Fire Hub table might have:
Fire Reference Number: “WF2018-123” (Wild Fire number 123 in the year 2018), providing a specific value for the business key.
Fire Hash Key: “27a3f042…”, being a hexadecimal representation of the hash key generated by presenting the text string “WF2018-123” to some hashing algorithm, illustrating the hash key value.
Load Date / Time: “10/10/2018 12:34:56.789” being the very moment (to a nominated level of precision) the row was loaded into the Data Vault table, not the moment when the row was created in some source operational system, clarifying the meaning of the load date.
Record Source: “ER/RA”, being an acronym (“ER” for the Emergency Response IT operational system,