Wu Yiji's counting Guide_ 5.1 dimension modeling skills: SCD2 type slow change dimension construction

0. Introduction

Slow Changing Dimension (SCD) is a technology used almost all the time in dimension modeling. It provides the ability of slicing and chunking dimensions based on historical data.

1.SCD type

According to master kimball's theory, there are seven types of SCD, but SCD2 is the most commonly used in daily warehouse development. The so-called SCD2 type is to add a new row to record the latest dimension in the dimension table.

2. Take a chestnut

Suppose we have a regional dimension table, which records the administrative area code and place name information. The table creation statement is as follows:

--be based on HQL dialect
CREATE TABLE DIM_ADD(
ADD_NO STRING COMMENT 'Region code',
ADD_NAME STRING COMMENT 'Region name'
)

There is one line of data:

ADD_NOADD_NAME
010Beijing

We know that 100 years ago, Beijing was not called "Beijing" but "Beiping". If we want to analyze based on the data 100 years ago, the place name field at that time should be written as "Beiping". However, there is no "Peiping" in our dimension table. What should we do?

3.SCD2

3.1 what is SCD2

In order to solve the problem in Section 2, SCD2 is a very elegant method.

We can use a dimension table of SCD2 type to maintain historical data. The table creation statement is as follows:

--be based on HQL dialect
CREATE TABLE DIM_ADD_SCD2(
ADD_NO STRING COMMENT 'Region code',
ADD_NAME STRING COMMENT 'Region name',
FROM_DATE STRING COMMENT 'Start time',
TO_DATE STRING COMMENT 'End time'
)

Then, the data of Peiping can be maintained

ADD_NOADD_NAMEFROM_DATETO_DATE
010Peiping1368-09-121947-09-27
010Beijing1947-09-289999-12-31

We can see that in SCD2 type, from is maintained_ Date and TO_DATE represents the start time and end time.

be careful
1. For the same dimension value as the natural key, there shall be no intersection between the start and end time spans of each line, which is illogical in terms of business and will double the data at the same time.
2. For the latest data, write "9999-12-31" at the end time.
3. If it is developed based on HQL, the start time and end time can be directly compared with STRING and DATE variables.

3.2 how to use SCD2

It is also very simple to use SCD2 type dimension tables. You only need to use the business time in the WHERE condition or JOIN condition to limit the start time and end time.

If there is a demand for analysis, I want to calculate the number of registered naturalized people in each city in 1942 from the household registration fact table.

The statement of account fact table is as follows:

--be based on HQL dialect
CREATE TABLE FACT_HOUSEHOLD(
HOUSEHOLD_NO STRING COMMENT 'Account code',
ADD_NO STRING COMMENT 'Region code',
HOUSEHOLD_CNT INT COMMENT 'Number of registered persons',
HOUSEHOLD_RECORD_DATE STRING COMMENT 'Naturalization time'
)

Data example:

HOUSEHOLD_NOADD_NOHOUSEHOLD_CNTHOUSEHOLD_RECORD_DAY
000000000101021937-09-27
000000000302932013-09-01
0000000004053332010-09-01

Then the demand indicator calculation SQL is:

--be based on HQL dialect
SELECT T2.ADD_NAME AS ADD_NAME
      ,SUM(NVL(T1.HOUSEHOLD_CNT,0)) AS ADD_POPULATION
  FROM FACT_HOUSEHOLD T1
  JOIN DIM_ADD_SCD2 T2
     ON T1.ADD_NO = T2.ADD_NO
    AND SUBSTR(T2.TO_DATE,1,4) >= '1942'
    AND SUBSTR(T2.TO_DATE,1,4) <= '1942'
  WHERE SUBSTR(HOUSEHOLD_RECORD_DAY,1,4) = '1942'
  ;
--Generally speaking, the placeholder for transfer to the conference is generated based on business time. This is an example, so it is written as a fixed value.

The above is a simple demo, which introduces how to use SCD2 type dimension table when calculating indicators.

3.3 how to realize SCD2

Still use dim in Section 3.2_ ADD_ For example, scd2 should be updated based on the latest regional dimension information table - dim obtained through ETL_ ADD_ NEW.

The table creation statement is as follows:

--be based on HQL dialect
CREATE TABLE DIM_ADD_NEW(
ADD_NO STRING COMMENT 'Region code',
ADD_NAME STRING COMMENT 'Region name'
)

The data are as follows:

ADD_NOADD_NAME
010Beijing
002A3
003B1
004C1

We are interested in dim_ ADD_ Divide and conquer data in new:

In case 1, when the natural key corresponds, DIM_ADD_NEW and dim_ ADD_ If there is no difference in the latest data of scd2, no operation will be performed.

In case 2, when the natural key corresponds, DIM_ADD_NEW and dim_ ADD_ If the latest data of scd2 is different, the natural key dim_ ADD_ The end time of the latest data in scd2 is set to business time (or business time - 1d, depending on the specific business), and the latest dimension data is inserted. The start time is set to business time + 1d (or business time, depending on the specific business), and the end time is set to '9999-12-31'.

Case 3, for a natural key, dim_ ADD_ Dim exists in New_ ADD_ It does not exist in scd2 and is added. The start time is set to business time + 1d (or business time, depending on the specific business), and the end time is set to '9999-12-31'.

Let's write a small demo code based on this proposition:

3.3.1 obtain dimension benchmark

The dimension table has the following data:

ADD_NOADD_NAMEFROM_DATETO_DATE
010Peiping1368-09-121947-09-27
010Beijing1947-09-289999-12-31
002A11234-01-011949-10-01
002A21949-10-029999-12-31
003B11234-01-019999-12-31

Get dim_ ADD_ The data with scd2 end date of 9999-12-31 is used as the benchmark.

--be based on HQL dialect
CREATE TABLE DIM_ADD_SCD2_NEWEST AS
SELECT T1.ADD_NO
      ,T1.ADD_NAME
  FROM DIM_ADD_SCD2 T1
 WHERE T1.TO_DATE = '9999-12-31'
 ;

The processed data are as follows:

ADD_NOADD_NAME
010Beijing
002A2
003B1

3.3.2 divide and mark according to the situation

For the three cases mentioned in 3.3, mark the latest data. The marking field is named FLAG. 0 corresponds to case 1, 1 corresponds to case 2, and 2 corresponds to case 3.

The table creation statement is as follows:

--be based on HQL dialect
CREATE TABLE DIM_ADD_NEW_FLAG(
ADD_NO STRING COMMENT 'Region code',
ADD_NAME STRING COMMENT 'Region name',
FLAG STRING COMMENT 'Situation identification'
)

Corresponding to case 1, there are codes:

--be based on HQL dialect
INSERT INTO TABLE DIM_ADD_NEW_FLAG
SELECT T1.ADD_NO AS ADD_NO 
      ,T1.ADD_NAME AS ADD_NAME 
      ,'1' AS FLAG
  FROM DIM_ADD_NEW T1
  JOIN DIM_ADD_SCD2_NEWEST T2 --JOIN Filtering function
    ON T1.ADD_NO = T2.ADD_NO 
   AND T1.ADD_NAME = T2.ADD_NAME 
   ;

Corresponding to case 2, there are codes:

--be based on HQL dialect
INSERT INTO TABLE DIM_ADD_NEW_FLAG
SELECT T1.ADD_NO AS ADD_NO 
      ,T1.ADD_NAME AS ADD_NAME 
      ,'2' AS FLAG
  FROM DIM_ADD_NEWT1
  LEFT JOIN DIM_ADD_SCD2_NEWEST T2
    ON T1.ADD_NO = T2.ADD_NO 
   AND T1.ADD_NAME = T2.ADD_NAME 
 WHERE T2.ADD_NO IS NULL --If it is not associated, it means there is updated data
   AND T1.ADD_NO  IN (SELECT ADD_NO  FROM OLD_NEWEST) --The natural keys need to coincide
   ;

Corresponding to case 3, there are codes:

--be based on HQL dialect
INSERT INTO TABLE DIM_ADD_NEW_FLAG
SELECT T1.ADD_NO AS ADD_NO 
      ,T1.ADD_NAME AS ADD_NAME 
      ,'3' AS FLAG
  FROM DIM_ADD_NEW T1
  LEFT JOIN DIM_ADD_SCD2_NEWEST T2
    ON T1.ADD_NO = T2.ADD_NO 
   AND T1.ADD_NAME = T2.ADD_NAME 
 WHERE T2.ADD_NO IS NULL --If it is not associated, it means it is new data
   AND T1.ADD_NO  NOT IN (SELECT ADD_NO  FROM OLD_NEWEST) --Natural bonds need to be asymmetric
   ;

At this time, DIM_ADD_NEW_FLAG data:

ADD_NOADD_NAMEFLAG
010Beijing1
002A32
003B11
004C13

3.3.3 update dimension table

This demo is based on HQL dialect, so the UPDATE operation cannot be performed. The following code should be used when updating:

First, a temporary table is generated as the carrier of the processing results, and the dimension table is updated to create the table statement and dim_ ADD_ Same as scd2:

--be based on HQL dialect
CREATE TABLE DIM_ADD_SCD2_TMP(
ADD_NO STRING COMMENT 'Region code',
ADD_NAME STRING COMMENT 'Region name',
FROM_DATE STRING COMMENT 'Start time',
TO_DATE STRING COMMENT 'End time'
)

For case 1, use the natural key to set the DIM_ADD_SCD2 and DIM_ADD_NEW_FLAG Association, which can be loaded into the temporary table as it is:

--be based on HQL dialect
INSERT INTO TABLE DIM_ADD_SCD2_TMP
SELECT T1.ADD_NO AS ADD_NO 
      ,T1.ADD_NAME AS ADD_NAME 
      ,T1.FROM_DATE AS FROM_DATE 
      ,T1.TO_DATE AS TO_DATE 
  FROM DIM_ADD_SCD2 T1
  JOIN DIM_ADD_NEW_FLAG T2
    ON T1.ADD_NO  = T2.ADD_NO 
   AND T2.FLAG = '1'
    ;

For case 2, we need to load the non latest data of the original dimension table as it is. The end time of the latest data is set as the business time. And based on DIM_ADD_NEW_FLAG adds a new piece of data. The start time is business time + 1d and the end time is' 9999-12-31 '.

--be based on HQL dialect
--Load non latest data as is
INSERT INTO TABLE DIM_ADD_SCD2_TMP
SELECT T1.ADD_NO AS ADD_NO 
      ,T1.ADD_NAME AS ADD_NAME 
      ,T1.FROM_DATE AS FROM_DATE 
      ,T1.TO_DATE AS TO_DATE 
  FROM DIM_ADD_SCD2 T1
  JOIN DIM_ADD_NEW_FLAG T2
    ON T1.ADD_NO  = T2.ADD_NO 
   AND T2.FLAG = '2'
WHERE T1.TO_DATE != '9999-12-31'

UNION ALL

--Set the latest data to end time as business time
SELECT T1.ADD_NO  AS ADD_NO 
      ,T1.ADD_NAME AS ADD_NAME 
      ,T1.FROM_DATE AS FROM_DATE 
      ,'${business_tm}'  AS TO_DATE --Generally speaking, a business time will be passed in, which is represented by a placeholder here
  FROM DIM_ADD_SCD2 T1
  JOIN DIM_ADD_NEW_FLAG T2
    ON T1.ADD_NO  = T2.ADD_NO 
   AND T2.FLAG = '2'
WHERE T1.TO_DATE = '9999-12-31' 

UNION ALL

--Insert latest dimension value
SELECT ADD_NO  AS ADD_NO 
      ,ADD_NAME AS ADD_NAME 
      ,DATE_ADD('${business_tm}',1) AS FROM_DATE 
      ,'9999-12-31' AS TO_DATE
  FROM DIM_ADD_NEW_FLAG
WHERE FLAG = '2' 
    ;

For case 3, load the new data into the dimension table. The code is as follows:

--be based on HQL dialect
INSERT INTO TABLE DIM_ADD_SCD2_TMP
SELECT ADD_NO  AS ADD_NO 
      ,ADD_NAME AS ADD_NAME 
      ,DATE_ADD('${business_tm}',1) AS FROM_DATE 
      ,'9999-12-31' AS TO_DATE
  FROM DIM_ADD_NEW_FLAG
WHERE FLAG = '3' 

Finally, we overwrite the data of the temporary table to the dimension table.

--be based on HQL dialect
INSERT OVERWRITE TABLE DIM_ADD_SCD2
SELECT ADD_NO  AS ADD_NO 
      ,ADD_NAME AS ADD_NAME 
      ,'${business_tm}' AS FROM_DATE 
      ,'9999-12-31' AS TO_DATE
  FROM DIM_ADD_SCD2_TMP

The data in the new dimension table is as follows:

ADD_NOADD_NAMEFROM_DATETO_DATE
010Peiping1368-09-121947-09-27
010Beijing1947-09-289999-12-31
002A11234-01-011949-10-01
002A21949-10-022021-11-20
002A32021-11-219999-12-31
003B11234-01-019999-12-31
004C12021-11-209999-12-31

So far, a round of SCD2 dimension table update has been completed.

The above is the content of this chapter. I believe you have a clear understanding of what SCD2 type is, how to use it and how to build it.

The next chapter explains the reduction and optimization of centipede fact table. Please look forward to it.

[all rights reserved. Any reprint must be investigated.]

Tags: Big Data Data Mining Data Warehouse

Posted on Sun, 21 Nov 2021 03:45:10 -0500 by cstevio