Home Artificial Intelligence Navigating Slowly Changing Dimensions (SCD) and Data Restatement: A Comprehensive Guide

Navigating Slowly Changing Dimensions (SCD) and Data Restatement: A Comprehensive Guide

1
Navigating Slowly Changing Dimensions (SCD) and Data Restatement: A Comprehensive Guide

Strategies for efficiently managing dimension changes and data restatement in enterprise data warehousing

Imagine this, you’re an information engineer working for a big retail company that utilizes the incremental load technique in data warehousing. This method involves selectively updating or loading only the brand new or modified data for the reason that last update. What could occur when the product R&D department decides to alter the name or description of a current product? How would such updates impact your existing data pipeline and data warehouse? How do you propose to deal with challenges like these? This text provides a comprehensive guide with solutions, utilizing Slowly Changing Dimensions (SCD), to tackle potential issues during data restatement.

Image retrieved from: https://unsplash.com/photos/macbook-pro-with-images-of-computer-language-codes-fPkvU7RDmCo

What are Slowly Changing Dimensions (SCD)?

Slowly changing dimensions confer with infrequent changes in dimension values, which occur sporadically and usually are not tied to a each day or regular time-based schedule, as dimensions typically change less steadily than transaction entries in a system. For instance, a jewellery company that has its customers placing a recent order on their website will grow to be a recent row within the order fact table. However, the jewellery company rarely changes their product name and their product description but that doesn’t mean it’s going to never occur in the longer term.

Managing changes in these dimensions requires employing Slowly Changing Dimension (SCD) management techniques, that are categorized into defined SCD types, starting from Type 0 through Type 6, including some combination or hybrid types. We will employ one in all the next methods:

SCD Type 0: Ignore

Changes to dimension values are completely disregarded, and the values of dimensions remain unchanged from the time they were initially created in the information warehouse.

SCD Type 1: Overwrite/ Replace

This approach is applicable when the previous value of the dimension attribute is not any longer relevant or vital. Nevertheless, historical tracking of changes just isn’t essential.

SCD Type 2: Create a Latest Dimension Row

This approach is really helpful as the first technique for addressing changing dimension values, involving the creation of a second row for the dimension with a start date, end date, and potentially a “current/expired” flag. It’s suitable for our scenarios like product description or address changes, ensuring a transparent partitioning of history. The brand new dimension row is linked to newly inserted fact rows, with each dimension record linked to a subset of fact rows based on insertion times — those before the change linked to the old dimension row, and people after linked to the brand new dimension row.

Figure 1 (Image by the creator): PRODUCT_KEY = “cd3004” is the restatement for PRODUCT_KEY = “cd3002”

SCD Type 3: Create a “PREV” Column

This method is suitable when each the old and recent values are relevant, and users will want to conduct historical evaluation using either value. Nevertheless, it just isn’t practical to use this method to all dimension attributes, as it will involve providing two columns for every attribute in dimension tables or more if multiple “PREV” values need preservation. It needs to be selectively used where appropriate.

Figure 2 (Image by the creator): PRODUCT_KEY = “cd3002” is restated with recent PRODUCT_NAME, the old PRODUCT_NAME is stored in NAME_PREV column

SCD Type 4: Rapidly Changing Large Dimensions

What if in a scenario you could capture every change to each dimension attribute for a really large dimension of retail, say 1,000,000 plus customers of your huge jewelry company? Using type 2 above will in a short time explode the variety of rows in the shopper dimension table to tens and even a whole bunch of hundreds of thousands of rows and using type 3 just isn’t viable.

A more practical solution for rapidly changing and enormous volume dimension tables is to categorize attributes (e.g., customer age category, gender, purchasing power, birthday, etc.) and separate them right into a secondary dimension, like a customer profile dimension. This table, acting as a “full coverage” dimension table all potential values for each category of dimension attributes preloaded into the table, which might higher manage the granularity of changes while avoiding excessive row expansion within the major customer dimension.

For instance, if we’ve 8 age categories, 3 different genders, 6 purchasing power categories, and 366 possible birthdays. Our “full coverage” dimension table for customer profiles that comprises all of the above combos will likely be 8 x 3 x 6 x 366 combos or 52704 rows.

We’ll have to generate surrogate_key for this dimension table and establish a connection to a recent foreign key in the actual fact table. When a modification occurs in one in all these dimension categories, there’s no necessity so as to add one other row to the shopper dimension. As an alternative, we generate a recent fact row and associate it with each the shopper dimension and the brand new customer profile dimension.

Figure 3 (Image by the creator): Entity relationship diagram for a “Full Coverage Dimension” table

SCD Type 5: An Extension to Type 4

To reinforce the Type 4 approach mentioned earlier, we are able to establish a connection between the shopper dimension and the shopper profile dimension. This linkage enables the tracking of the “current” customer profile for a particular customer. The important thing facilitates the connection of the shopper with the most recent customer profile, which allows seamless traversal from the shopper dimension to essentially the most recent customer profile dimension without the necessity to link through the actual fact table.

Figure 4 (Image by the creator): Entity relationship diagram shows the linkage between the customer_dim to the cust_profile_dimension

SCD Type 6: A Hybrid Technique

With this approach, you integrate each Type 2 (recent row) and Type 3 (“PREV” column). This blended approach offers the benefits of each methodologies. You’ll be able to retrieve facts using the “ PREV “ column, which provides historical values and presents facts related to the product category at that specific time. Concurrently, querying by the “recent” column provides all facts for each the present and all preceding values of the product category.

Figure 5 (Image by the creator): PRODUCT_ID = “cd3004” is the restatement for PRODUCT_ID = “cd3002”, which PRODUCT_ID = “cd3001” is marked as “EXPIRED” in LAST_ACTION column

Bonus and Conclusion

Normally, data extraction is available in STAR schema, which incorporates one fact table and multiple dimension tables in an enterprise. While the dimension tables store all of the descriptive data and first keys, the actual fact table comprises numeric and additive data that references the first keys of every dimension around it.

Figure 6 (Image by the creator): Illustration of Star Schema

Nevertheless, in case your marketing sales data extract is provided as a single denormalized table without distinct dimension tables and lacks the first key for its descriptive data, future updates to product names may pose challenges. Handling such scenarios in your existing pipeline could be more complicated.

The absence of primary keys within the descriptive data can result in issues during data restatement, especially if you end up coping with large datasets. As an illustration, if a product name is updated within the restatement extract and not using a unique product_key, the incremental load pipeline may treat it as a recent product, impacting the historical data in your consumption layer. To handle this, creating surrogate_key for the product dimension and a mapping table to link original and restated product names is essential for maintaining data integrity.

In conclusion, every aspect of information warehouse design needs to be fastidiously considered, making an allowance for potential edge cases.

1 COMMENT

  1. I just could not depart your web site prior to suggesting that I really loved the usual info an individual supply in your visitors? Is gonna be back regularly to check up on new posts.

LEAVE A REPLY

Please enter your comment!
Please enter your name here