# Wu Yiji's counting Guide_ 5.1 dimension modeling skills: SCD2 type slow change dimension construction

## 0. Introduction

Slow Changing Dimension (SCD) is a technology used almost all the time in dimension modeling. It provides the ability of slicing and chunking dimensions based on historical data.

## 1.SCD type

According to master kimball's theory, there are seven types of SCD, but SCD2 is the most commonly used in daily warehouse development. The so-called SCD2 type is to add a new row to record the latest dimension in the dimension table.

## 2. Take a chestnut

Suppose we have a regional dimension table, which records the administrative area code and place name information. The table creation statement is as follows:

```--be based on HQL dialect
)
```

There is one line of data:

010Beijing

We know that 100 years ago, Beijing was not called "Beijing" but "Beiping". If we want to analyze based on the data 100 years ago, the place name field at that time should be written as "Beiping". However, there is no "Peiping" in our dimension table. What should we do?

## 3.SCD2

### 3.1 what is SCD2

In order to solve the problem in Section 2, SCD2 is a very elegant method.

We can use a dimension table of SCD2 type to maintain historical data. The table creation statement is as follows:

```--be based on HQL dialect
FROM_DATE STRING COMMENT 'Start time',
TO_DATE STRING COMMENT 'End time'
)
```

Then, the data of Peiping can be maintained

010Peiping1368-09-121947-09-27
010Beijing1947-09-289999-12-31

We can see that in SCD2 type, from is maintained_ Date and TO_DATE represents the start time and end time.

be careful
1. For the same dimension value as the natural key, there shall be no intersection between the start and end time spans of each line, which is illogical in terms of business and will double the data at the same time.
2. For the latest data, write "9999-12-31" at the end time.
3. If it is developed based on HQL, the start time and end time can be directly compared with STRING and DATE variables.

### 3.2 how to use SCD2

It is also very simple to use SCD2 type dimension tables. You only need to use the business time in the WHERE condition or JOIN condition to limit the start time and end time.

If there is a demand for analysis, I want to calculate the number of registered naturalized people in each city in 1942 from the household registration fact table.

The statement of account fact table is as follows:

```--be based on HQL dialect
CREATE TABLE FACT_HOUSEHOLD(
HOUSEHOLD_NO STRING COMMENT 'Account code',
HOUSEHOLD_CNT INT COMMENT 'Number of registered persons',
HOUSEHOLD_RECORD_DATE STRING COMMENT 'Naturalization time'
)
```

Data example:

000000000101021937-09-27
000000000302932013-09-01
0000000004053332010-09-01

Then the demand indicator calculation SQL is:

```--be based on HQL dialect
FROM FACT_HOUSEHOLD T1
AND SUBSTR(T2.TO_DATE,1,4) >= '1942'
AND SUBSTR(T2.TO_DATE,1,4) <= '1942'
WHERE SUBSTR(HOUSEHOLD_RECORD_DAY,1,4) = '1942'
;
--Generally speaking, the placeholder for transfer to the conference is generated based on business time. This is an example, so it is written as a fixed value.
```

The above is a simple demo, which introduces how to use SCD2 type dimension table when calculating indicators.

### 3.3 how to realize SCD2

Still use dim in Section 3.2_ ADD_ For example, scd2 should be updated based on the latest regional dimension information table - dim obtained through ETL_ ADD_ NEW.

The table creation statement is as follows:

```--be based on HQL dialect
)
```

The data are as follows:

010Beijing
002A3
003B1
004C1

We are interested in dim_ ADD_ Divide and conquer data in new:

In case 1, when the natural key corresponds, DIM_ADD_NEW and dim_ ADD_ If there is no difference in the latest data of scd2, no operation will be performed.

In case 2, when the natural key corresponds, DIM_ADD_NEW and dim_ ADD_ If the latest data of scd2 is different, the natural key dim_ ADD_ The end time of the latest data in scd2 is set to business time (or business time - 1d, depending on the specific business), and the latest dimension data is inserted. The start time is set to business time + 1d (or business time, depending on the specific business), and the end time is set to '9999-12-31'.

Case 3, for a natural key, dim_ ADD_ Dim exists in New_ ADD_ It does not exist in scd2 and is added. The start time is set to business time + 1d (or business time, depending on the specific business), and the end time is set to '9999-12-31'.

Let's write a small demo code based on this proposition:

#### 3.3.1 obtain dimension benchmark

The dimension table has the following data:

010Peiping1368-09-121947-09-27
010Beijing1947-09-289999-12-31
002A11234-01-011949-10-01
002A21949-10-029999-12-31
003B11234-01-019999-12-31

Get dim_ ADD_ The data with scd2 end date of 9999-12-31 is used as the benchmark.

```--be based on HQL dialect
WHERE T1.TO_DATE = '9999-12-31'
;
```

The processed data are as follows:

010Beijing
002A2
003B1

#### 3.3.2 divide and mark according to the situation

For the three cases mentioned in 3.3, mark the latest data. The marking field is named FLAG. 0 corresponds to case 1, 1 corresponds to case 2, and 2 corresponds to case 3.

The table creation statement is as follows:

```--be based on HQL dialect
FLAG STRING COMMENT 'Situation identification'
)
```

Corresponding to case 1, there are codes:

```--be based on HQL dialect
,'1' AS FLAG
;
```

Corresponding to case 2, there are codes:

```--be based on HQL dialect
,'2' AS FLAG
WHERE T2.ADD_NO IS NULL --If it is not associated, it means there is updated data
;
```

Corresponding to case 3, there are codes:

```--be based on HQL dialect
,'3' AS FLAG
WHERE T2.ADD_NO IS NULL --If it is not associated, it means it is new data
;
```

010Beijing1
002A32
003B11
004C13

#### 3.3.3 update dimension table

This demo is based on HQL dialect, so the UPDATE operation cannot be performed. The following code should be used when updating:

First, a temporary table is generated as the carrier of the processing results, and the dimension table is updated to create the table statement and dim_ ADD_ Same as scd2:

```--be based on HQL dialect
FROM_DATE STRING COMMENT 'Start time',
TO_DATE STRING COMMENT 'End time'
)
```

For case 1, use the natural key to set the DIM_ADD_SCD2 and DIM_ADD_NEW_FLAG Association, which can be loaded into the temporary table as it is:

```--be based on HQL dialect
,T1.FROM_DATE AS FROM_DATE
,T1.TO_DATE AS TO_DATE
AND T2.FLAG = '1'
;
```

For case 2, we need to load the non latest data of the original dimension table as it is. The end time of the latest data is set as the business time. And based on DIM_ADD_NEW_FLAG adds a new piece of data. The start time is business time + 1d and the end time is' 9999-12-31 '.

```--be based on HQL dialect
--Load non latest data as is
,T1.FROM_DATE AS FROM_DATE
,T1.TO_DATE AS TO_DATE
AND T2.FLAG = '2'
WHERE T1.TO_DATE != '9999-12-31'

UNION ALL

--Set the latest data to end time as business time
,T1.FROM_DATE AS FROM_DATE
,'\${business_tm}'  AS TO_DATE --Generally speaking, a business time will be passed in, which is represented by a placeholder here
AND T2.FLAG = '2'
WHERE T1.TO_DATE = '9999-12-31'

UNION ALL

--Insert latest dimension value
,'9999-12-31' AS TO_DATE
WHERE FLAG = '2'
;
```

For case 3, load the new data into the dimension table. The code is as follows:

```--be based on HQL dialect
,'9999-12-31' AS TO_DATE
WHERE FLAG = '3'
```

Finally, we overwrite the data of the temporary table to the dimension table.

```--be based on HQL dialect
,'9999-12-31' AS TO_DATE
```

The data in the new dimension table is as follows:

010Peiping1368-09-121947-09-27
010Beijing1947-09-289999-12-31
002A11234-01-011949-10-01
002A21949-10-022021-11-20
002A32021-11-219999-12-31
003B11234-01-019999-12-31
004C12021-11-209999-12-31

So far, a round of SCD2 dimension table update has been completed.

The above is the content of this chapter. I believe you have a clear understanding of what SCD2 type is, how to use it and how to build it.

The next chapter explains the reduction and optimization of centipede fact table. Please look forward to it.