
Data Vault PITs


Definitions, Importance, & Functionality
By Jim Mihalick

Prerequisites

It is assumed the reader has a basic familiarity with Raw Data Vault HUBs, LINKs, and SATs. This article refers to them throughout as part of explaining the details of PIT tables, but it does not deeply explain these 3 basic Data Vault object types.


 

Background

Raw Data Vaults consist of three object types: HUBs, LINKs, and SATs (Satellites).

HUB: Holds business keys in your enterprise. Each different kind of business key (clients, accounts, orders, etc.) has its own HUB object.

LINK: Holds all the "Relationships" between the various HUBs. One link may show how clients associate with accounts, a different link may show how clients associate with orders, and so on.

SAT: Holds all the attributes that describe the Business Key in a HUB, or the attributes that describe the relationship in a LINK. These are either "HUB" SATs or "LINK" SATs. All SATs behave the same way, but a particular SAT instance cannot be associated with both a HUB and a LINK, nor with multiple HUBs or multiple LINKs, at the same time. A SAT object associates with exactly one HUB or LINK object. That is a DV 2.0 standard.

However, a HUB or LINK object can have 0 to n SAT objects associated with it. So the relationship between HUBs/LINKs and SATs is one-to-many, not many-to-many.

There can be multiple SATs associated with a particular HUB or LINK for a multitude of reasons. This article won’t detail all of them; however, you can find them detailed in Dan Linstedt’s Data Vault 2.0 standard if you’re interested in knowing more about why a HUB or LINK can have multiple SATs associated with it.

One of the reasons for having multiple SATs for a HUB or LINK is that the attributes in the SATs may change at different frequencies. If there are 10 attributes for a HUB and 7 of them change very infrequently, those 7 attributes may be placed in one SAT. The other 3 attributes reside in a different SAT because their values change frequently. It would be inefficient to continually copy the values of the 7 infrequently changing attributes every time one of the 3 frequently changing attributes is modified; the outcome would be a tremendous amount of unnecessary data duplication in the Raw Data Vault. But again, this is not the only reason you may decide to have multiple SATs for a particular HUB or LINK.

As you probably guessed from the above paragraph, SATs are the object in a RAW Data Vault where ALL changes to data are tracked over time. It is exactly these changes to a SAT’s attributes that drive the need for PIT objects. PIT stands for “Point-In-Time”.


 

Assume we have a HUB called Sample_h and it has 3 SATs (named Sample_SlowChg_sh, Sample_MedChg_sh, and Sample_FastChg_sh).

I name the SATs starting with the ‘Base’ HUB name, then a “RoleName” the SAT is performing, followed by _sh so we know it is a SAT attached to a HUB.

Here is the “for demonstration purposes” DDL of each SAT. There are other possible things we can include in SATs depending on what kind of SAT it is, but that is beyond the scope of this document. These fields will suffice for our demonstration of what PITs are for.

 

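Since the original DDL images are not reproduced here, the following is a minimal illustrative sketch of what the SlowChg SAT could look like. The data types, the attribute names Attrib_A and Attrib_B, and the composite primary key are assumptions made purely for this demonstration; real SATs typically also carry columns such as a record source and a hash diff, which are omitted here for brevity.

CREATE TABLE Sample_SlowChg_sh (
    hk_Sample_h CHAR(32)  NOT NULL,  -- hash key of the parent HUB's business key
    LoadDate    TIMESTAMP NOT NULL,  -- when this version of the attributes arrived in the vault
    Attrib_A    VARCHAR(50),         -- slowly changing attribute (assumed name)
    Attrib_B    VARCHAR(50),         -- slowly changing attribute (assumed name)
    PRIMARY KEY (hk_Sample_h, LoadDate)
);

-- Sample_MedChg_sh and Sample_FastChg_sh follow the same pattern, holding
-- Attrib_C/Attrib_D and Attrib_E/Attrib_F respectively.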

Data Value Scenarios in the SATs

Let's assume we have sample data as follows in the 3 SATs:

Figure: sample data in the three SATs (Sample_SlowChg_sh, Sample_MedChg_sh, Sample_FastChg_sh)

You can see that the attributes in the 3 different SATs for the Sample HUB changed at different frequencies and at different days and times.

The Performance Reporting Problem

Let’s assume that a reporting user wants to know what HUB Sample and all of the attributes in its 3 SATs looked like at a particular date and time. Let’s select Apr 26th, 2021 at 9 am. This date and time is referred to as the Point-In-Time the reporting user cares about: what did the data from the source systems look like at that exact Point-In-Time?

So for HUB key “Biz_Value_123”, which is hash key 0x7F587….895, we need the most appropriate matching record from each of the 3 SAT tables.

Not one of those SAT records’ LoadDates matched the “Point-In-Time” requested by the reporting user exactly. We had to find, in each SAT table, the closest record that was already in existence at the requested date and time; in other words, the record with the maximum LoadDate that is less than or equal to the requested Snapshot Date. Finding the most appropriate SAT record for a requested point in time can involve significant compute time. This is further compounded by adding more SATs to the HUB/LINK, adding more records/business keys to the SATs, adding more changes over time to the SATs, or adding more simultaneous reporting users requesting different Points-In-Time at the same moment. This kind of Point-In-Time request can be a significant drain on the Data Vault’s compute resources every time a reporting user wants a different Point-In-Time analysis of a HUB/LINK and its SATs.
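To make that cost concrete, here is a minimal sketch (using the SAT names assumed above) of the kind of lookup that must be run against every SAT when there is no PIT; the :hub_hash_key bind variable stands in for the hash of “Biz_Value_123”.

SELECT s.*
FROM Sample_SlowChg_sh s
WHERE s.hk_Sample_h = :hub_hash_key              -- hash key of the requested business key
  AND s.LoadDate = (SELECT MAX(s2.LoadDate)      -- latest version at or before the Point-In-Time
                    FROM Sample_SlowChg_sh s2
                    WHERE s2.hk_Sample_h = s.hk_Sample_h
                      AND s2.LoadDate <= TIMESTAMP '2021-04-26 09:00:00');

-- ...and the same query again for Sample_MedChg_sh and Sample_FastChg_sh.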

 


Let’s try something else...

Assume that a reporting user wants to know what the HUB Sample and its attributes in its 3 SATs looked like at a different Date and Time. Let’s choose Nov 10th 2020 at midnight.

For HUB key “Biz_Value_123”, which is hash key 0x7F587….895, the most appropriate matching records from each SAT table are now the following:

Figure: matching SAT records for the Nov 10th, 2020 snapshot

 

Because attributes Attrib_C and Attrib_D of the Sample_MedChg_sh SAT and Attrib_E and Attrib_F of the Sample_FastChg_sh SAT didn’t exist yet on Nov 10th 2020 at midnight, we assign the Ghost values to those attributes. You’ll see why in a bit. The Ghost record is chosen whenever there is no appropriate SAT record in existence at the Point-In-Time the reporting user is interested in.

Using PIT tables to Solve the Performance Problem

You can easily see how performance can degrade quickly depending on how much data exists in the SATs, how many concurrent reporting users are analyzing data in the Data Vault, and so on. So what can be done to improve performance?

The resolution: the PIT table. PIT tables set up “Points-In-Time” in advance, saving the reporting users from having to find the most appropriate matching SAT record per business key for each requested Point-In-Time. The PIT table has it all figured out for the reporting users in advance. This is where huge performance increases will come back to you.

What does a PIT look like?

A PIT table has these main DDL components according to the Data Vault 2.0 standard. I will also cover some optimizations that can be applied to PITs that don’t break the Data Vault 2.0 standard concept but can make the storage needs of the PIT much more efficient. Note that other things can be added to a PIT table as well, but that is beyond the scope of this document. An important goal of PIT design is to keep the PIT record as small as you feasibly can.

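Since the original image is not reproduced here, the following is a minimal sketch of such a PIT for the Sample HUB and its three SATs. The hk_ hash key field names match those used later in this article; the ld_ LoadDate field names and the data types are assumptions made for this example.

CREATE TABLE Sample_p (
    hk_Sample_h          CHAR(32)  NOT NULL,  -- hash key of the HUB's business key
    SnapshotDate         TIMESTAMP NOT NULL,  -- the Point-In-Time this row represents
    hk_Sample_SlowChg_sh CHAR(32)  NOT NULL,  -- hash key of the matching SlowChg SAT record
    ld_Sample_SlowChg_sh TIMESTAMP NOT NULL,  -- LoadDate of the matching SlowChg SAT record
    hk_Sample_MedChg_sh  CHAR(32)  NOT NULL,
    ld_Sample_MedChg_sh  TIMESTAMP NOT NULL,
    hk_Sample_FastChg_sh CHAR(32)  NOT NULL,
    ld_Sample_FastChg_sh TIMESTAMP NOT NULL,
    PRIMARY KEY (hk_Sample_h, SnapshotDate)
);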

Per business requirements, we determine the “Grain” of the PIT, i.e., the frequency that the SnapshotDate field will represent: once a year, once a month, once a day, once a minute, etc. It can be an evenly spaced frequency as just mentioned, or the frequency can jump around, from daily to weekly for a span of time, to monthly during another stretch, then back to daily (that is not typical, and not recommended, but it can be accomplished).

Once the frequency is chosen, the PIT is filled with all the necessary Snapshot frequency dates from a beginning to ending date decided by the business.
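One straightforward way to fill in those snapshot dates is with a small date-spine table. The sketch below is illustrative only: the helper table name Snapshot_Dates and the use of PostgreSQL’s generate_series are assumptions made for this example, and the range matches the monthly Jan–Apr 2021 example that follows.

CREATE TABLE Snapshot_Dates (
    SnapshotDate TIMESTAMP NOT NULL PRIMARY KEY
);

INSERT INTO Snapshot_Dates (SnapshotDate)
SELECT generate_series(TIMESTAMP '2021-01-01',  -- beginning date decided by the business
                       TIMESTAMP '2021-04-01',  -- ending date decided by the business
                       INTERVAL  '1 month');    -- the chosen snapshot grain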

Matching against each of those Snapshot Dates to find the most appropriate records from all of the HUB’s SATs is performed Snapshot Date by Snapshot Date. Once the best matching SAT records are located, the Hash Key and LoadDate of each SAT record are loaded into their respective columns in the PIT record for that Snapshot Date.

Assume the frequency grain of our Sample_p PIT table is monthly. We will start the PIT on Jan 1, 2021, and end the PIT on Apr 1, 2021. The PIT table would look like this:

Figure: the Sample_p PIT populated with monthly snapshots from Jan 1, 2021 through Apr 1, 2021

You can see that, many times, the LoadDate of the SAT record that best matches the Snapshot DateTime can be quite far back in time, because there were no changes to that SAT record for quite some time.

But it is important to note that the exact LoadDate from each matching SAT record must be captured in the PIT record, because the LoadDate is one of the key fields of the SAT table. When reporting users find the PIT record they are interested in, it is then a simple INNER JOIN on the SAT’s hash key and LoadDate back to the appropriate SAT table to pull the exact SAT attributes as they looked on that Snapshot/Point-In-Time date.

You can see how valuable the PIT table is with pulling together all of the right SAT records for a HUB or LINK for a particular Point-In-Time (Snapshot DateTime) in advance, saving the reporting user large amounts of computing time (and thus reducing “wait” time) each time they want to pull reporting data from the SATs for a particular point in time.

The example above didn’t need to pull in any “Ghost” records from the SATs because there was always a good record that matched each Snapshot DateTime. Had there been a Snapshot Date where no SAT record existed yet at that point in time, the Ghost record for that SAT would have been chosen and placed in the PIT.

Sample Code to Build a PIT table

The following code is the general pattern for loading data into a PIT table. Views can be used as an alternative for a more “Virtualized” instantiation of a PIT table, where the data isn’t actually moved into the PIT table and the PIT just points to the correct data back in the Raw Data Vault tables. The PIT View can be dynamic or it can be materialized. Materialized is of course much more performant, but you may not be able to determine your range of Snapshot Dates in advance, in which case you would implement the PIT as a more dynamic, virtualized View. Try to keep dynamic virtualized Views of a PIT to a minimum to avoid unnecessary performance issues. The different types of PITs will be covered in a future article.

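Since the original code images are not reproduced here, the sketch below illustrates that general load pattern using the table and column names assumed earlier in this article. The all-zeros hash and the Jan 1, 1900 LoadDate stand in for the Ghost record values and are assumptions for this example.

INSERT INTO Sample_p
    (hk_Sample_h, SnapshotDate,
     hk_Sample_SlowChg_sh, ld_Sample_SlowChg_sh,
     hk_Sample_MedChg_sh,  ld_Sample_MedChg_sh,
     hk_Sample_FastChg_sh, ld_Sample_FastChg_sh)
SELECT
    h.hk_Sample_h,
    d.SnapshotDate,
    -- fall back to the Ghost record values when no SAT record exists yet at this snapshot
    COALESCE(slow.hk_Sample_h, '00000000000000000000000000000000'),
    COALESCE(slow.LoadDate,    TIMESTAMP '1900-01-01'),
    COALESCE(med.hk_Sample_h,  '00000000000000000000000000000000'),
    COALESCE(med.LoadDate,     TIMESTAMP '1900-01-01'),
    COALESCE(fast.hk_Sample_h, '00000000000000000000000000000000'),
    COALESCE(fast.LoadDate,    TIMESTAMP '1900-01-01')
FROM Sample_h h                                  -- the HUB, assumed to expose hk_Sample_h
CROSS JOIN Snapshot_Dates d                      -- one PIT row per business key per snapshot date
LEFT JOIN Sample_SlowChg_sh slow
       ON  slow.hk_Sample_h = h.hk_Sample_h
       AND slow.LoadDate = (SELECT MAX(s2.LoadDate)   -- latest version at or before the snapshot
                            FROM Sample_SlowChg_sh s2
                            WHERE s2.hk_Sample_h = h.hk_Sample_h
                              AND s2.LoadDate <= d.SnapshotDate)
LEFT JOIN Sample_MedChg_sh med
       ON  med.hk_Sample_h = h.hk_Sample_h
       AND med.LoadDate = (SELECT MAX(s2.LoadDate)
                           FROM Sample_MedChg_sh s2
                           WHERE s2.hk_Sample_h = h.hk_Sample_h
                             AND s2.LoadDate <= d.SnapshotDate)
LEFT JOIN Sample_FastChg_sh fast
       ON  fast.hk_Sample_h = h.hk_Sample_h
       AND fast.LoadDate = (SELECT MAX(s2.LoadDate)
                            FROM Sample_FastChg_sh s2
                            WHERE s2.hk_Sample_h = h.hk_Sample_h
                              AND s2.LoadDate <= d.SnapshotDate);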

 

Standard DV 2.0 SQL to Access Data in the PIT Table

The following code is the general pattern for selecting data from a PIT table. You can also add a WHERE clause for row filtering, GROUP BY clause for aggregating SAT attributes, etc. if desired for even further reporting enhancement for the report user.

Notice how we are now able to use simple INNER JOINs back to the Satellites, which increases the efficiency of the query against the PIT.

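Since the original code image is not reproduced here, the sketch below shows the general pattern using the names assumed throughout this article.

SELECT
    p.hk_Sample_h,
    p.SnapshotDate,
    slow.Attrib_A, slow.Attrib_B,
    med.Attrib_C,  med.Attrib_D,
    fast.Attrib_E, fast.Attrib_F
FROM Sample_p p
INNER JOIN Sample_SlowChg_sh slow
        ON  slow.hk_Sample_h = p.hk_Sample_SlowChg_sh  -- equality join on the stored hash key...
        AND slow.LoadDate    = p.ld_Sample_SlowChg_sh  -- ...and on the exact stored LoadDate
INNER JOIN Sample_MedChg_sh med
        ON  med.hk_Sample_h  = p.hk_Sample_MedChg_sh
        AND med.LoadDate     = p.ld_Sample_MedChg_sh
INNER JOIN Sample_FastChg_sh fast
        ON  fast.hk_Sample_h = p.hk_Sample_FastChg_sh
        AND fast.LoadDate    = p.ld_Sample_FastChg_sh
WHERE p.SnapshotDate = TIMESTAMP '2021-04-01';         -- the Point-In-Time of interest

The INNER JOINs still work when a snapshot had no real SAT record, because the Ghost record exists in each SAT and matches the Ghost hash key and LoadDate stored in the PIT.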

 

Optimizing the Data Storage Footprint of the PIT Without Sacrificing Performance

PIT tables can require significant amounts of storage to represent all the “Points In Time” requested by the users for all the Business keys in the HUB or LINK. There are techniques beyond the scope of this document that can be employed against a PIT to significantly reduce the data size footprint of a PIT, but you lose some Point-In-Time granularity implementing them.

However, I developed the following technique to significantly reduce the data size required by a PIT by making some modifications to the DDL of the PIT, slight code changes when creating the data in the PIT, and a small modification when querying the PIT. First, recall the DV 2.0 standard DDL of a PIT table shown earlier in this document.

You’ll notice that all the hash key fields (those starting with hk_) require a significant amount of the storage per PIT record. A CHAR(32) is required to store an MD5 hash, and it gets even larger for stronger hash value types (e.g. SHA1, SHA256, etc.).

It would be very helpful to reduce or eliminate those.

If you refer back to the sample data earlier in this article, you will notice that unless the SAT hash key value in the PIT was set to the Ghost record value, its hash value was always equal to the HUB hash key field value in the PIT. In the DDL above, this means the hk_Sample_SlowChg_sh, hk_Sample_MedChg_sh, and hk_Sample_FastChg_sh hash values are identical to the hk_Sample_h hash value for the associated HUB hash key value (or Link if this was a PIT for a LINK’s SATs).

We can make use of that fact and eliminate the SAT hk hash key fields from the PIT. Setting aside the HUB business key, which will vary by PIT, removing the SAT hk fields will reduce the size of the PIT by 34% - 56% or more for MD5 hashes depending on the number of SATs in the PIT, 36% to 60% or more for SHA1 hashes, and 40% to 67% or more for SHA256 hashes. Smaller PIT record sizes will likely equate to faster performance, while not losing any PIT functionality.

The new PIT structure would look like this:

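Again as a minimal sketch under the same assumed names, with the SAT hash key columns removed:

CREATE TABLE Sample_p (
    hk_Sample_h          CHAR(32)  NOT NULL,  -- the HUB hash key is kept
    SnapshotDate         TIMESTAMP NOT NULL,  -- the Point-In-Time this row represents
    ld_Sample_SlowChg_sh TIMESTAMP NOT NULL,  -- LoadDate of the matching SlowChg SAT record
    ld_Sample_MedChg_sh  TIMESTAMP NOT NULL,
    ld_Sample_FastChg_sh TIMESTAMP NOT NULL,
    PRIMARY KEY (hk_Sample_h, SnapshotDate)
);
-- hk_Sample_SlowChg_sh, hk_Sample_MedChg_sh, and hk_Sample_FastChg_sh are no longer stored.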

To build the PIT for this new DDL, the only change to the load code shown earlier is to leave out the hk SAT columns from the INSERT column list and their corresponding expressions from the SELECT list. Everything else stays the same.

With those SAT hash key columns removed, you have a storage-optimized PIT without any loss of PIT functionality.

But to SELECT data from the storage-optimized PIT, the selection SQL changes slightly, as shown in the sketch below.
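Since the original code image is not reproduced here, the sketch below (same assumed names, with the all-zeros hash again standing in for the Ghost hash key) illustrates the adjusted join logic.

SELECT
    p.hk_Sample_h,
    p.SnapshotDate,
    slow.Attrib_A, slow.Attrib_B,
    med.Attrib_C,  med.Attrib_D,
    fast.Attrib_E, fast.Attrib_F
FROM Sample_p p
INNER JOIN Sample_SlowChg_sh slow
        ON slow.hk_Sample_h =
           CASE WHEN p.ld_Sample_SlowChg_sh = TIMESTAMP '1900-01-01'
                THEN '00000000000000000000000000000000'  -- Ghost hash key (assumed value)
                ELSE p.hk_Sample_h END                   -- otherwise the PIT's HUB hash key
       AND slow.LoadDate = p.ld_Sample_SlowChg_sh        -- same exact LoadDate comparison as before
INNER JOIN Sample_MedChg_sh med
        ON med.hk_Sample_h =
           CASE WHEN p.ld_Sample_MedChg_sh = TIMESTAMP '1900-01-01'
                THEN '00000000000000000000000000000000'
                ELSE p.hk_Sample_h END
       AND med.LoadDate = p.ld_Sample_MedChg_sh
INNER JOIN Sample_FastChg_sh fast
        ON fast.hk_Sample_h =
           CASE WHEN p.ld_Sample_FastChg_sh = TIMESTAMP '1900-01-01'
                THEN '00000000000000000000000000000000'
                ELSE p.hk_Sample_h END
       AND fast.LoadDate = p.ld_Sample_FastChg_sh
WHERE p.SnapshotDate = TIMESTAMP '2021-04-01';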

The first thing to note is that I’m selecting the data fields the exact same way as we did before using the DV 2.0 standard. I am also still using the very important INNER JOIN technique back to the SATs, and I am still doing the same exact SAT LoadDate comparison.

But when deriving the SAT hash key value to compare against the SAT’s hash key field, I’ve made an important enhancement, which is the key to being able to remove the SAT hash keys from the PIT.

I’m using the SAT LoadDate fields that still remain in the PIT and checking for the “Ghost” value LoadDate (Jan 1, 1900). If that Ghost date is found, I return the Ghost hash key to look up the matching SAT record; if I don’t detect a Ghost date for the SAT in the PIT, then I use the PIT’s HUB hash key to look up the matching SAT record. Because, as I mentioned prior, the PIT’s HUB hash key is identical to the SAT hash key that the standard DV 2.0 PIT would have stored.

This is a small enhancement tweak to the PIT selection SQL and should perform just about as efficiently as the DV 2.0 standard code, but you cut down 34%-60% or more of the PIT storage. In many cases, less storage also means better response time because there is less data to scan through.


 

Conditions When the SAT LoadDate May Not Be the Best Point-In-Time Field Choice

 

According to the DV 2.0 standard, PIT tables are based on a HUB or LINK hash key and a LoadDate from each SAT, which are stored in the PIT so the proper SAT record can be found later when accessing the PIT records.

But occasionally there are data circumstances with the source data that make using the LoadDate in the SAT a difficult way to manage Point-In-Time requests from the reporting users.

One such case is late-arriving data from the source. Some source systems can send data in a different order than it happened in real life. This can happen, for instance, if a transaction starts before another transaction but takes longer to commit, so the later transaction commits first. Hence the source records can arrive in the DV backwards from the order in which they actually happened.

This would mess up the natural/correct history of how the data changes really occurred in the source system, and the SAT’s LoadDates would not represent that history correctly.

In these cases (or other cases like this), it would be better to use a reliable source-system-provided DateTime in place of the LoadDate when managing the PIT table. Using a business DateTime that represents the true order in which the source data was processed at the source allows the SATs to be “Self Healing”: the late-arriving data finds its rightful order in the PIT as though it had come into the Data Vault in that order in the first place.

So based on the knowledge of the source data, it may be necessary to make adjustments to how the PITs operate to best capture the proper order of the data, thus creating the proper “Point-In-Time” experience from the PITs for the reporting user.

Summary of the Value of PITs

PIT tables help tremendously by reducing the coding complexity a reporting user needs to find the correct SAT record for a particular point in time. Most importantly, PITs will greatly reduce the amount of compute time needed to find the correct set of SAT records for a HUB or LINK for a particular point in time, saving valuable resources and possibly budget.

Don’t forget that a particular PIT table is associated with ONLY one HUB or one LINK object along with one to all of the HUB or LINK’s associated SAT objects. A different HUB or LINK object would need a different PIT table. Additionally, a particular HUB or LINK object may have more than one PIT table, because different Snapshot frequencies may need to be supported against that HUB or LINK depending on the reporting user’s requirements.

I hope you’ve found value in this article about the journey of a PIT and the importance of creating them as part of your Data Vault implementation.

Let me or any of us at Teknion Data Solutions know if there is anything we can help you with concerning your Data Vault projects or any data processing needs. jmihalick@teknionusa.com, www.teknionusa.com
