Digital China Innovation Competition (DCIC): an open-source walkthrough of the championship solution for consumer credit scoring

Project introduction

Techniques covered: data mining, feature engineering, machine learning, deep learning

Competition: Digital China Innovation Competition - Consumer Persona and Intelligent Credit Scoring

Competition link

Background: written with my friend Liu Xin, with whom I also co-authored the third-runner-up solution for the 58.com AI algorithm competition

Solution link

Competition background

As the construction of the social credit system deepens, work on social credit standards has accelerated, and related standards have been released one after another. A multi-level standard system is urgently needed, covering credit services, credit data collection and services, credit repair, urban credit, and industry credit, and the overall social credit standard system is expected to advance rapidly. Credit service institutions across society are deeply involved in building credit systems for advertising, government affairs, finance, bike sharing, tourism, major investment projects, education, environmental protection, and more. Building a social credit system is a systematic undertaking; as indispensable members of society, telecom operators also need to build enterprise credit scoring systems to help upgrade the credit system of society as a whole. At the same time, the state encourages data exchange between third-party credit service institutions and the government, so as to strengthen the core competitiveness of the government's public credit information centers.

Traditional credit scores are measured mainly along a few dimensions such as spending power, which makes it hard to reflect a customer's credit comprehensively, objectively, and in a timely way. As a telecom operator, China Mobile holds massive, broad, high-quality, and timely data. How to score customers intelligently from such rich big data is a key problem for China Mobile and Newland Technology Group. An operator-side intelligent credit scoring system would not only strengthen the social credit system but also provide rich application value inside China Mobile, including improving service quality for all customers, controlling arrears risk, and granting business discounts by credit rating. Through this modeling competition, the organizers hope to collect excellent model systems that accurately evaluate users' credit scores.

Data description

The data provided covers several aspects of user information: identity characteristics, spending power, contacts, location trajectory, and app behavior preferences. The fields are described below:

  1. User ID (a unique value)
  2. Whether the user is real-name verified (1 = yes, 0 = no)
  3. User age
  4. Whether the user is a college student (1 = yes, 0 = no)
  5. Whether the user is a blacklisted customer (1 = yes, 0 = no)
  6. Whether the user is a 4G "unhealthy" customer (1 = yes, 0 = no)
  7. User network age (months)
  8. Time since the user's last payment (months)
  9. Amount of the user's last payment (yuan)
  10. User's average phone bill over the last 6 months (yuan)
  11. Total charges on the user's current-month bill (yuan)
  12. User's current-month account balance (yuan)
  13. Whether the paying user is currently in arrears (1 = yes, 0 = no)
  14. User fee-sensitivity level, where level 1 is the most sensitive. Levels are generated from an extreme-value calculation and leaf-index weights as follows: sensitive users are sorted in descending order by an intermediate score; the top 5% of users are level 1, the next 15% level 2, the next 15% level 3, the next 25% level 4, and the last 40% level 5.
  15. Number of people in the user's communication circle in the current month
  16. Whether the user often goes shopping (1 = yes, 0 = no)
  17. Average number of mall appearances over the last three months
  18. Whether the user visited Fuzhou Cangshan Wanda in the current month (1 = yes, 0 = no)
  19. Whether the user visited Fuzhou Sam's Club in the current month (1 = yes, 0 = no)
  20. Whether the user watched a movie in the current month (1 = yes, 0 = no)
  21. Whether the user visited a scenic spot in the current month (1 = yes, 0 = no)
  22. Whether the user spent money at a stadium or gym in the current month (1 = yes, 0 = no)
  23. Number of uses of online-shopping apps in the current month
  24. Number of uses of logistics/express-delivery apps in the current month
  25. Number of uses of financial apps in the current month
  26. Number of uses of video-playback apps in the current month
  27. Number of uses of flight apps in the current month
  28. Number of uses of train apps in the current month
  29. Number of uses of travel-information apps in the current month
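The percentile rule in field 14 can be sketched in a few lines of pandas. This is an illustration on synthetic scores only; the real intermediate score and the extreme-value/leaf-index weighting are internal to the organizer:

```python
import numpy as np
import pandas as pd

# Hypothetical intermediate sensitivity scores for 1000 users
rng = np.random.default_rng(0)
scores = pd.Series(rng.random(1000))

# Percentile rank in descending order of score, then band into levels:
# top 5% -> 1, next 15% -> 2, next 15% -> 3, next 25% -> 4, last 40% -> 5
pct = scores.rank(ascending=False, pct=True)
levels = pd.cut(pct, bins=[0, 0.05, 0.20, 0.35, 0.60, 1.0], labels=[1, 2, 3, 4, 5])

print(levels.value_counts().sort_index())  # 50, 150, 150, 250, 400 users per level
```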

Evaluation method

The competition uses MAE (mean absolute error) as the evaluation metric.

Mean absolute error measures how close the model's predictions are to the ground truth. It is computed as:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left| pred_{i} - y_{i} \right|$$

where $pred_{i}$ is the predicted value for sample $i$ and $y_{i}$ is its true value. The smaller the $MAE$, the closer the predictions are to the real data. The final result is:

$$Score = \frac{1}{1 + MAE}$$

The closer the final result is to 1, the higher the score.
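A quick sketch of the metric in Python (`pred` and `y` here are made-up placeholder arrays):

```python
import numpy as np

def competition_score(pred, y):
    """Score = 1 / (1 + MAE); the closer to 1, the better."""
    mae = np.mean(np.abs(pred - y))
    return 1 / (1 + mae)

y = np.array([620.0, 585.0, 640.0])      # true credit scores
pred = np.array([622.0, 580.0, 641.0])   # model predictions
print(round(competition_score(pred, y), 4))  # MAE = 8/3, so Score = 3/11 ≈ 0.2727
```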

Overall exploration

First, as competitors we should analyze and observe the data to build a general understanding of the problem type and the data itself. Let's start with an overall exploration of the data.

""" Import base library """
import tqdm
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
plt.rc('font', family='SimHei', size=13)
pd.set_option('display.width', 1000)

From the data listing we know the competition provides a training-set archive and a test-set archive. After unzipping them into the data folder, we read both files and concatenate them so that all subsequent transformations are applied uniformly.

train_data = pd.read_csv('data/data113102/train_dataset.csv')
test_data = pd.read_csv('data/data113102/test_dataset.csv')
columns = ['user_id', 'real_name', 'age', 'whether_college_students',
           'whether_blacklist_customer', 'whether_4G_unhealthy_customers',
           'user_network_age', 'last_payment_long', 'last_payment_amount',
           'average_consumption_value', 'all_fee', 'balance', 'whether_payment_owed',
           'call_sensitivity', 'number_people_circle', 'whether_often_shopping',
           'average_number_appearance', 'whether_visited_Wanda',
           'whether_visited_member_store', 'whether_watch_movie',
           'whether_attraction', 'whether_stadium_consumption',
           'shopping_app_usage', 'express_app_usage', 'financial_app_usage',
           'video_app_usage', 'aircraft_app_usage', 'train_app_usage',
           'tourism_app_usage', 'label']
train_data.columns = columns
test_data.columns = columns[:-1]
df_data = pd.concat([train_data, test_data], ignore_index=True)


""" Data properties """
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 30 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   user_id                         100000 non-null  object 
 1   real_name                       100000 non-null  int64  
 2   age                             100000 non-null  int64  
 3   whether_college_students        100000 non-null  int64  
 4   whether_blacklist_customer      100000 non-null  int64  
 5   whether_4G_unhealthy_customers  100000 non-null  int64  
 6   user_network_age                100000 non-null  int64  
 7   last_payment_long               100000 non-null  int64  
 8   last_payment_amount             100000 non-null  float64
 9   average_consumption_value       100000 non-null  float64
 10  all_fee                         100000 non-null  float64
 11  balance                         100000 non-null  int64  
 12  whether_payment_owed            100000 non-null  int64  
 13  call_sensitivity                100000 non-null  int64  
 14  number_people_circle            100000 non-null  int64  
 15  whether_often_shopping          100000 non-null  int64  
 16  average_number_appearance       100000 non-null  int64  
 17  whether_visited_Wanda           100000 non-null  int64  
 18  whether_visited_member_store    100000 non-null  int64  
 19  whether_watch_movie             100000 non-null  int64  
 20  whether_attraction              100000 non-null  int64  
 21  whether_stadium_consumption     100000 non-null  int64  
 22  shopping_app_usage              100000 non-null  int64  
 23  express_app_usage               100000 non-null  int64  
 24  financial_app_usage             100000 non-null  int64  
 25  video_app_usage                 100000 non-null  int64  
 26  aircraft_app_usage              100000 non-null  int64  
 27  train_app_usage                 100000 non-null  int64  
 28  tourism_app_usage               100000 non-null  int64  
 29  label                           50000 non-null   float64
dtypes: float64(4), int64(25), object(1)
memory usage: 22.9+ MB
print("Common dataset:", df_data.shape[0])
print("Common test set:", test_data.shape[0])
print("Common training sets:", train_data.shape[0])
Total dataset: 100000
 Total test set: 50000
 Total training set: 50000
# "" "data category" ""

for i,name in enumerate(df_data.columns):
    name_sum = df_data[name].value_counts().shape[0] 
    print("{},{}      The number of types of features is: {}".format(i + 1, name, name_sum))
1,user_id      The number of types of features is: 100000
2,real_name      The number of types of features is: 2
3,age      The number of types of features is: 88
4,whether_college_students      The number of types of features is: 2
5,whether_blacklist_customer      The number of types of features is: 2
6,whether_4G_unhealthy_customers      The number of types of features is: 2
7,user_network_age      The number of types of features is: 283
8,last_payment_long      The number of types of features is: 2
9,last_payment_amount      The number of types of features is: 532
10,average_consumption_value      The number of types of features is: 22520
11,all_fee      The number of types of features is: 16597
12,balance      The number of types of features is: 316
13,whether_payment_owed      The number of types of features is: 2
14,call_sensitivity      The number of types of features is: 6
15,number_people_circle      The number of types of features is: 554
16,whether_often_shopping      The number of types of features is: 2
17,average_number_appearance      The number of types of features is: 93
18,whether_visited_Wanda      The number of types of features is: 2
19,whether_visited_member_store      The number of types of features is: 2
20,whether_watch_movie      The number of types of features is: 2
21,whether_attraction      The number of types of features is: 2
22,whether_stadium_consumption      The number of types of features is: 2
23,shopping_app_usage      The number of types of features is: 8382
24,express_app_usage      The number of types of features is: 239
25,financial_app_usage      The number of types of features is: 7232
26,video_app_usage      The number of types of features is: 16067
27,aircraft_app_usage      The number of types of features is: 209
28,train_app_usage      The number of types of features is: 180
29,tourism_app_usage      The number of types of features is: 934
30,label      The number of types of features is: 278
# "" "Statistics" ""


# "" "observe the same distribution of training / test set data" ""


**Conclusion 1:** the dataset is in good shape and all features are numeric, so we can build a model directly and validate it with offline scores. However, most features have long tails; for example, the maximum of tourism_app_usage (uses of travel-information apps in the current month) is 87681, which deviates sharply from its mean of about 19. We will analyze these features separately later.

**Conclusion 2:** long-tailed values appear in both the training set and the test set, so they are not necessarily outliers; only offline validation can decide how to treat them.

Feature exploration

Next, we analyze the correlation between each feature and the credit score and explore the features in detail.

# "" "trailing / sequential feature analysis" ""

f, ax = plt.subplots(figsize=(20, 6))
sns.scatterplot(data=df_data, x='number_people_circle', y='label', color='k', ax=ax)

name_list = ['shopping_app_usage', 'express_app_usage', 'financial_app_usage','video_app_usage', 'aircraft_app_usage', 'train_app_usage',
           'tourism_app_usage', 'last_payment_amount', 'average_consumption_value', 'all_fee']

f, ax = plt.subplots(3, 4, figsize=(20, 20))

for i,name in enumerate(name_list):     
    sns.scatterplot(data=df_data, x=name, y='label', color='b', ax=ax[i // 4][i % 4])

f, ax = plt.subplots(1, 3, figsize=(20, 6))

sns.kdeplot(data=df_data['aircraft_app_usage'], color='r', shade=True, ax=ax[0])
sns.kdeplot(data=df_data['train_app_usage'], color='c', shade=True, ax=ax[1])
sns.kdeplot(data=df_data['tourism_app_usage'], color='b', shade=True, ax=ax[2])

**Conclusion:** the scatter plots confirm a long-tail distribution in the features above, but long-tail values are not necessarily invalid, just as a missing value may itself carry meaning. When handling long-tailed data later, decisions should be verified offline together with the model.
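Two common options for taming such long tails, each of which should be validated offline as the conclusion advises, are clipping at a high quantile and a log1p transform. A sketch on synthetic counts (not the contest data):

```python
import numpy as np
import pandas as pd

# Synthetic long-tailed usage counts: mostly small, a few huge values
rng = np.random.default_rng(1)
usage = pd.Series(rng.lognormal(mean=2, sigma=1.5, size=10000).astype(int))

# Option 1: clip everything above the 99.5th percentile
clipped = usage.clip(upper=usage.quantile(0.995))

# Option 2: log1p compresses the tail while preserving order
logged = np.log1p(usage)

print(usage.max(), round(float(clipped.max()), 1), round(float(logged.max()), 2))
```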

# "" "discrete feature analysis" ""

f, ax = plt.subplots(1, 2, figsize=(20, 6))
sns.boxplot(data=df_data, x='last_payment_long', y='label', ax=ax[0])
sns.boxplot(data=df_data, x='whether_payment_owed', y='label', ax=ax[1])

name_list = ['whether_college_students','whether_blacklist_customer', 'whether_4G_unhealthy_customers','whether_often_shopping',
           'whether_visited_Wanda','whether_visited_member_store', 'whether_watch_movie',
           'whether_attraction', 'whether_stadium_consumption','whether_payment_owed']
f, ax = plt.subplots(2, 5, figsize=(20, 12))

for i,name in enumerate(name_list):
    sns.boxplot(data=df_data, x=name, y='label', ax=ax[i // 5][i % 5])

f, ax = plt.subplots(figsize=(10, 6))

sns.boxplot(data=df_data, x='call_sensitivity', y='label', ax=ax)

Data preprocessing covers a lot of ground, and together with feature engineering it is the largest part of the work. For readability, the methods used in this part are listed below.

  1. Data cleaning: missing values, outliers, consistency;

  2. Feature encoding: one-hot and label encoding;

  3. Feature binning: equal-frequency, equal-width, clustering, etc.;

  4. Derived variables: strongly interpretable, suitable as model inputs;

  5. Feature selection: variance threshold, chi-squared test, regularization, etc.
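Of these five, binning is the only technique not demonstrated in the code below, so here is a small equal-frequency vs. equal-width sketch with pandas (illustrative data, not part of the original pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
fee = pd.Series(rng.exponential(60, 1000))  # synthetic skewed fee amounts

# Equal-frequency binning: each of the 5 bins holds the same number of rows
fee_bin_freq = pd.qcut(fee, q=5, labels=False)
print(fee_bin_freq.value_counts().sort_index().tolist())  # [200, 200, 200, 200, 200]

# Equal-width binning: bins span equal ranges, so counts vary with the skew
fee_bin_width = pd.cut(fee, bins=5, labels=False)
```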

# df_data = df_data[df_data['number_people_circle'] <= 1750].reset_index(drop=True)
df_data['fee_ratio'] = df_data['all_fee'] / (df_data['balance'] + 1)                        # bill-to-balance ratio
df_data['fee_diff'] = df_data['all_fee'] - df_data['average_consumption_value']             # current bill vs. 6-month average
df_data['5month_all_fee'] = df_data['average_consumption_value'] * 6 - df_data['all_fee']   # total fees of the previous 5 months
df_data['fee_tend'] = df_data['last_payment_amount'] / (df_data['average_consumption_value'] + 1)
df_data['is_bazaar'] = (df_data['whether_visited_Wanda'] + df_data['whether_visited_member_store']).map(lambda x: 1 if x > 0 else 0)
df_data['count_sum'] = (df_data['shopping_app_usage'] + df_data['express_app_usage'] + df_data['financial_app_usage']
                        + df_data['video_app_usage'] + df_data['aircraft_app_usage'] + df_data['train_app_usage']
                        + df_data['tourism_app_usage'])                                     # total app usage
df_data['user_network_age_month'] = df_data['user_network_age'].apply(lambda x: x % 12)     # month within the year
def get_count(df, column, feature):
    """Frequency encoding: count how many users share each value of `column`."""
    df['idx'] = range(len(df))
    temp = df.groupby(column)['user_id'].agg([(feature, 'count')]).reset_index()
    df = df.merge(temp)
    df = df.sort_values('idx').drop('idx', axis=1).reset_index(drop=True)
    return df
for i in ['last_payment_amount', 'all_fee', 'average_consumption_value', 'fee_diff', ['all_fee', 'average_consumption_value']]:
    df_data = get_count(df_data, i, 'cnt_{}'.format(i))
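What get_count produces is frequency encoding: each new cnt_ column records how many users share the same value of the grouping column(s). A tiny self-contained illustration (the function is restated so the snippet runs on its own; the toy DataFrame is made up):

```python
import pandas as pd

def get_count(df, column, feature):
    # Count, per group, how many rows (users) share the value(s) in `column`
    df['idx'] = range(len(df))
    temp = df.groupby(column)['user_id'].agg([(feature, 'count')]).reset_index()
    df = df.merge(temp)
    df = df.sort_values('idx').drop('idx', axis=1).reset_index(drop=True)
    return df

toy = pd.DataFrame({'user_id': ['a', 'b', 'c', 'd'],
                    'all_fee': [50.0, 50.0, 80.0, 50.0]})
toy = get_count(toy, 'all_fee', 'cnt_all_fee')
print(toy['cnt_all_fee'].tolist())  # [3, 3, 1, 3]
```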
from sklearn.model_selection import train_test_split
feature_col = [tmp_col for tmp_col in df_data.columns if tmp_col not in ['user_id', 'label']]

st_model = StandardScaler()
for i in feature_col:
    df_data[i] = st_model.fit_transform(df_data[[i]].values)
df_data['label'] = st_model.fit_transform(df_data[['label']].values)

train_data = df_data[:train_data.shape[0]]
test_data = df_data[train_data.shape[0]:]
X_train, X_vaild, y_train, y_vaild = train_test_split(train_data[feature_col],train_data['label'],test_size = 0.2,random_state = 0)
print(X_train.shape, X_vaild.shape)
(40000, 40) (10000, 40)

# for i in feature_col:
#     X_train[i] = st_model.fit_transform(X_train[[i]].values)
#     X_vaild[i] = st_model.fit_transform(X_vaild[[i]].values)
#     test_data[i] = st_model.fit_transform(test_data[[i]].values)
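The commented-out block above refits the scaler separately on each split; the conventional leak-free pattern instead fits the scaler on the training split only and reuses its statistics everywhere. A self-contained sketch (synthetic arrays; the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X_tr = rng.normal(100, 20, (800, 3))  # stand-in for the training features
X_va = rng.normal(100, 20, (200, 3))  # stand-in for the validation features

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)   # statistics estimated from train only
X_va_s = scaler.transform(X_va)       # validation reuses the train statistics

print(X_tr_s.mean(axis=0).round(6))   # ~0 on the training split
```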

Algorithm model

import paddle

class WB_Dataset(paddle.io.Dataset):
    """Wrap a DataFrame as a Paddle dataset; dev=True means no labels (test set)."""
    def __init__(self, data, feature_cols, lab=None, dev=False):
        super().__init__()
        self.feature_cols = feature_cols
        self.lens = data.shape[0]

        self.features = data[feature_cols].values.astype(np.float32)
        self.dev = dev
        if not dev:
            self.labels = lab.values.astype(np.float32)  # float labels: this is regression

    def __getitem__(self, idx):
        features_input = list(self.features[[idx]])
        if not self.dev:
            label_input = self.labels[[idx]]
            return features_input, label_input
        return features_input

    def __len__(self):
        return self.lens
train_dataset = WB_Dataset(X_train, feature_col, y_train, False)
valid_dataset = WB_Dataset(X_vaild, feature_col, y_vaild, False)
test_dataset = WB_Dataset(test_data[feature_col], feature_col, None, True)
import paddle.nn as nn

class Mlp(nn.Layer):
    """Two-layer perceptron: fc1 -> activation -> dropout -> fc2."""
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.ReLU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x
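At inference time (dropout disabled) the Mlp above is just fc2(relu(fc1(x))). A numpy sketch of that forward pass, with random weights standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, hidden_features, out_features = 40, 20, 1

# Hypothetical weights; in the real model these come from training
W1, b1 = rng.normal(size=(in_features, hidden_features)), np.zeros(hidden_features)
W2, b2 = rng.normal(size=(hidden_features, out_features)), np.zeros(out_features)

def mlp_forward(x):
    h = np.maximum(x @ W1 + b1, 0)  # fc1 followed by ReLU
    return h @ W2 + b2              # fc2 (dropout acts as identity at eval time)

x = rng.normal(size=(8, in_features))  # a batch of 8 users
print(mlp_forward(x).shape)  # (8, 1)
```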
# Define an LSTM network (an alternative model)
class MyLSTMModel(nn.Layer):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.ReLU, drop=0.):
        super().__init__()
        self.rnn = paddle.nn.LSTM(in_features, hidden_features, 2, dropout=drop)  # 2 stacked LSTM layers
        self.flatten = paddle.nn.Flatten()

    def forward(self, input):
        """Forward pass: run the LSTM, then flatten its output."""
        out, (h, c) = self.rnn(input)
        out = self.flatten(out)
        return out
def train_model(train_dataset, valid_dataset, model, optimizer, verbose=100, epochs=5, batch_size=64, shuffle=True):

    # test_dataset comes from the enclosing scope; valid_loader is unused in this simplified loop
    train_loader = paddle.io.DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)
    valid_loader = paddle.io.DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)
    test_loader = paddle.io.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    print('start training ... ')

    loss_list = []

    for epoch in range(epochs):
        for t_batch_id, t_data in enumerate(train_loader()):

            label_data = paddle.to_tensor(t_data[1], dtype='float32')
            feature = paddle.to_tensor(t_data[0][0], dtype='float32')
            predicts = model(feature)

            loss = nn.functional.mse_loss(predicts, label_data, reduction='mean')
            loss_list.append(float(loss.numpy()))
            if t_batch_id % verbose == 0:
                # Labels are standardized, so the RMSE printed here hovers around 1
                print("epoch: {}, batch_id: {}, loss : {}".format(epoch, t_batch_id, math.sqrt(loss.numpy())))
            # Backpropagate, update the weights, then clear the gradients
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()

    # Predict on the test set after training, then save model and optimizer state
    predict_list = []
    for v_batch_id, v_data in enumerate(test_loader()):
        feature = paddle.to_tensor(v_data[0][0], dtype='float32')
        predicts = model(feature)
        predict_list.extend(predicts.numpy().flatten())

    paddle.save(model.state_dict(), "./model/model.pdparams")
    paddle.save(optimizer.state_dict(), "./model/adam.pdopt")

    return predict_list, loss_list

# Model parameters
feature_number = 40
hidden_features = 20
out_features = 1

# Training parameters (verbose and batchsize are reconstructed from the log below; the learning rate is an assumed typical value)
learning_rate = 0.001
verbose = 10
batchsize = 512
epochs = 2

model = Mlp(feature_number, hidden_features, out_features)
optim = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())  # Adam optimizer

result, loss_list = train_model(train_dataset, valid_dataset, model, optimizer=optim,
                                verbose=verbose, epochs=epochs, batch_size=batchsize, shuffle=True)
start training ... 
epoch: 0, batch_id: 0, loss : 1.0221488935549963
epoch: 0, batch_id: 10, loss : 1.0319473914036887
epoch: 0, batch_id: 20, loss : 1.1616030832364013
epoch: 0, batch_id: 30, loss : 0.9790661875256286
epoch: 0, batch_id: 40, loss : 1.0165814517740945
epoch: 0, batch_id: 50, loss : 0.9623618385930143
epoch: 0, batch_id: 60, loss : 1.050502584157359
epoch: 0, batch_id: 70, loss : 1.0234238718304822
epoch: 1, batch_id: 0, loss : 1.0975484557523207
epoch: 1, batch_id: 10, loss : 1.086625683856127
epoch: 1, batch_id: 20, loss : 0.9919022578127885
epoch: 1, batch_id: 30, loss : 1.0404108806648333
epoch: 1, batch_id: 40, loss : 1.093106216609532
epoch: 1, batch_id: 50, loss : 1.018006920849071
epoch: 1, batch_id: 60, loss : 1.0005267663184374
epoch: 1, batch_id: 70, loss : 1.1276594622696936
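One step the notebook does not show: because the label was standardized with StandardScaler before training, the network's predictions live on the standardized scale and must be inverse-transformed before submission. A self-contained sketch with made-up numbers (in the notebook you would reuse the scaler that was fitted on the label column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the scaler fitted on the label during preprocessing
labels = np.array([[620.0], [585.0], [640.0], [600.0]])
label_scaler = StandardScaler().fit(labels)

# Stand-in for the network's standardized predictions
std_preds = np.array([[0.5], [-1.0], [1.2]])
scores = label_scaler.inverse_transform(std_preds).round().astype(int)
print(scores.ravel())  # [622 591 636]
```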

Post game summary

Afterwards, we assembled a group of excellent teammates and designed model ensembles tailored to the business, finishing first on both the A and B leaderboards and winning the championship among 2,522 teams from China and abroad.

In practice, the time you spend in a competition should usually break down as: feature engineering > model fusion > model selection > parameter tuning, or: model fusion > feature engineering > model selection > parameter tuning.

This article shares my experience in the China Mobile consumer-persona competition and walks through a basic reproduction from getting started to the championship solution. You can practice with the competition described here; much knowledge is only truly understood through practice. Competition results are useful, but what matters more is learning through competing. Achieving good results requires a large investment of time, and often that time goes unrewarded; don't be discouraged, believe in yourself, and keep putting in effort and practice.

Introduction to the author

Zheng Yuxuan holds a master's degree in computer science from East China Normal University, focusing on multimodal recommender systems and natural language processing. He has published a CCF-A paper and has repeatedly participated in and won top algorithm competitions at home and abroad.

He won the 2019 Alibaba Tianchi Global Data Intelligence Competition, the 2019 Digital China Innovation Competition, the 2019 Big Data Application Innovation Competition, and the 2019 Shanghai University Computer League championship.

He also won the track championship and overall championship of the 2021 DeeCamp AI training camp with the team project GeneBERT, the first open multimodal gene pre-training model in China based on large-scale data, which was highly praised by Zhang Yaqin, Chen Weiying, and other mentors during its final report.

Tags: Big Data AI paddlepaddle

Posted on Fri, 03 Dec 2021 19:29:44 -0500 by gabe33