DC Employee Turnover Prediction: A Case Analysis

1. Requirement Description

This analysis uses DC employee data. After examining the factors that influence the turnover rate, we build a model to predict which employees are most likely to leave.

2. Dataset description

The DC employee dataset contains 31 variables and 1100 observations. The key variables are described below.

Employee characteristics fall into the following categories:

  • Basic identity variables: gender, age, marital status, education level, and field of study;
  • Company tenure variables: total years of service, years at the company, job role, job level, number of companies worked for, department, business travel frequency, and years with the current manager;
  • Compensation and benefits variables: monthly income, job involvement, overtime status, performance rating, stock option level, salary hike percentage, number of training sessions in the previous year, and years since the last promotion;
  • Quality-of-life variables: environment satisfaction, job satisfaction, relationship satisfaction, work-life balance, and commuting distance.

3. Feature analysis

3.1 Statistical Analysis

  • Basic employee information (reproduced in the pandas sketch below)
    • The average employee age is about 37 years; the oldest is 60 and the youngest is 18.
    • Of the 1100 employees, 178 have left, a turnover rate of 16.2%.
    • The average monthly income is 6483.6, with a median of 4857.0, a minimum of 1009, and a maximum of 19999.
    • There are 653 male and 447 female employees; among those who left, 61.2% were male and 38.8% were female.
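
These figures can be reproduced with a few pandas calls. A minimal sketch follows; the file name is hypothetical, and the column names ('Age', 'Attrition', 'MonthlyIncome', 'Gender') are assumed from the modeling code later in this article, with 'Attrition' coded Yes/No:

    import pandas as pd

    train_local = pd.read_csv('dc_employees.csv')         # hypothetical file name

    print(train_local['Age'].describe())                  # mean ~37, min 18, max 60
    left = train_local['Attrition'] == 'Yes'              # assumes Yes/No coding
    print(left.sum(), left.mean())                        # 178 leavers, 16.2% turnover
    print(train_local['MonthlyIncome'].describe())        # income distribution
    print(train_local['Gender'].value_counts())           # 653 male, 447 female
    print(train_local.loc[left, 'Gender'].value_counts(normalize=True))  # gender share among leavers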

3.2 Distribution Analysis

Observing the distributions of several variables reveals the following patterns (each rate below comes from a simple group aggregation; see the sketch after this list):

  • The turnover rate of employees aged 18-23 is over 40%, consistent with the market's view of the post-95 generation. Employees over 25 are noticeably more stable, with turnover holding between 20% and 40%.

  • The R&D department has the largest number of departures, mainly because it has the largest headcount. Despite the high absolute count, R&D has the lowest turnover rate. The HR department has the highest turnover rate and is also the smallest department, so its base is small.

  • Employee turnover is negatively correlated with the salary hike percentage overall, yet turnover exceeds 40% at the highest hike percentages, which is odd. If the analysis is correct, it suggests these employees start from a small salary base, so even a large percentage raise leaves pay low and turnover high.

  • Employees at the lowest job involvement level (level 1) have a turnover rate of nearly 40%, reaching 38%! As the saying goes, no effort, no harvest, and with no harvest there is little reason to stay.

  • The turnover rate of employees who work overtime is three times that of those who do not! No wonder flexible hours and freedom from overtime have come to be seen as benefits.

  • Longer tenure generally means greater stability: turnover is 35.8% for employees with 0-2 years at the company, ticks up slightly in the 20-25-year range (possibly employees leaving to start a business, or seeing no hope of promotion), and rises to 33% in the 31-32-year range.

  • Work environment and job satisfaction matter: low job satisfaction and low environment satisfaction both lead to a higher turnover rate.
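
A minimal sketch of these aggregations, assuming the IBM-style column names used in the modeling code below ('Age', 'Department', 'OverTime', 'JobInvolvement') and the train_local DataFrame from above:

    import pandas as pd

    attr = (train_local['Attrition'] == 'Yes').astype(int)   # 1 = employee left

    # Turnover rate by age bucket
    age_bins = pd.cut(train_local['Age'], bins=[17, 23, 30, 40, 50, 60])
    print(attr.groupby(age_bins).mean())

    # Number of departures and turnover rate per department
    print(attr.groupby(train_local['Department']).agg(['sum', 'mean']))

    # Overtime vs. no overtime, and job involvement levels
    print(attr.groupby(train_local['OverTime']).mean())
    print(attr.groupby(train_local['JobInvolvement']).mean())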

4. Feature Selection

4.1 Data Preprocessing

  • Split the features into numeric and non-numeric columns
      import pandas as pd

      def obtain_x(train_df, xtype):
          # Return the names of all columns whose dtype is NOT xtype
          dtype_df = train_df.dtypes.reset_index()
          dtype_df.columns = ['col', 'type']
          return dtype_df[dtype_df.type != xtype].col.values

      float64_col = obtain_x(train_local, 'object')  # numeric columns
      # Categorical columns (used in the encoding step below)
      object_col = [c for c in train_local.columns if c not in float64_col]
    
  • Normalize the numeric features
      from sklearn import preprocessing

      # Scale every numeric column to the [0, 1] range
      min_max_scaler = preprocessing.MinMaxScaler()
      X = min_max_scaler.fit_transform(X)
    
  • Encode the categorical columns
      le = preprocessing.LabelEncoder()
      ohe = preprocessing.OneHotEncoder()
      train = pd.DataFrame()
      # Stack train and test so both share one consistent encoding
      local_all = pd.concat([train_local, test_local], axis=0).reset_index(drop=True)
      for col in object_col:
          le.fit(local_all[col])
          local_all[col] = le.transform(local_all[col])
          ohe.fit(local_all[col].values.reshape(-1, 1))
          ohecol = ohe.transform(local_all[col].values.reshape(-1, 1)).toarray()
          # Drop the first dummy column to avoid perfect collinearity
          ohecol = pd.DataFrame(ohecol[:, 1:], index=None)
          ohecol.columns = ohecol.columns.map(lambda x: str(x) + col)
          train = pd.concat([train, ohecol], axis=1, ignore_index=False)
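
For reference, newer pandas versions can perform the same label-plus-one-hot step in a single call on the raw, un-encoded local_all; a minimal equivalent sketch:

      # drop_first=True mirrors dropping the first dummy column above
      train = pd.get_dummies(local_all, columns=list(object_col), drop_first=True)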
    

4.2 Computing the correlation coefficient

  • Calculate the impact of each variable on turnover rate
    import matplotlib.pyplot as plt

    print(train.corr()['Attrition'])
    x = train.corr()['Attrition'].index
    y = train.corr()['Attrition'].values
    plt.figure(figsize=(12, 5))
    plt.plot(x, y, 'b.-', label="Pearson correlation coefficient")
    plt.legend()
    plt.ylabel("Pearson correlation coefficient")
    plt.xlabel("Variable name")
    plt.xticks(rotation=45)
    plt.grid()
    # Annotate each point with its coefficient
    for i, j in zip(x, y):
        plt.text(i, j + 0.005, '%.2f' % j, ha='center', va='bottom')
    plt.show()
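
Sorting the coefficients by absolute value makes the strongest candidates easy to read off; a small follow-up sketch:

    # Rank variables by the absolute strength of their correlation with Attrition
    corr = train.corr()['Attrition'].drop('Attrition')
    print(corr.abs().sort_values(ascending=False).head(10))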
     
    

4.3 Feature selection using feature statistics

  • Remove Low Variance Features
    from sklearn.feature_selection import VarianceThreshold
    import numpy as np

    sel = VarianceThreshold()                        # removes zero-variance features
    sel.fit(train_all)
    train_all = train_all.loc[:, sel.get_support()]  # keep column names, unlike fit_transform
    train_local4 = train_all.iloc[0:len(train_local)]
    test_local4 = train_all.iloc[len(train_local):]

    # Per-column descriptive statistics
    def col_Sta(train_df):
        rows = []
        for col in train_df.columns:
            rows.append({'_columns': col, '_min': np.min(train_df[col]), '_max': np.max(train_df[col]),
                         '_median': np.median(train_df[col]), '_mean': np.mean(train_df[col]),
                         '_ptp': np.ptp(train_df[col]), '_std': np.std(train_df[col]), '_var': np.var(train_df[col])})
        return pd.DataFrame(rows)

    train_col_sta = col_Sta(train_local4)
    test_col_sta = col_Sta(test_local4)
    train_col_sta
    
  • Calculate chi-square scores
    from sklearn.feature_selection import chi2  # chi-square test between each feature and the class
    # chi2 requires non-negative features (satisfied after MinMax scaling)
    chi2_col, chi2_pval = chi2(train_local4[feature_end].values, train_y)
    
  • Calculate F scores
    from sklearn.feature_selection import f_classif  # ANOVA F-test between each feature and the class
    f_col, f_pval = f_classif(train_local4[feature_end].values, train_y)
    
  • Calculate mutual information
    from sklearn.feature_selection import mutual_info_classif  # mutual information between each feature and the class
    mic_col = mutual_info_classif(train_local4[feature_end].values, train_y)
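
Because the three statistics live on different scales, collecting them into one table makes cross-checking easier; a minimal sketch using the variables computed above (feature_end is the candidate feature list defined elsewhere in the original notebook):

    score_df = pd.DataFrame({'feature': feature_end,
                             'chi2': chi2_col,
                             'f_score': f_col,
                             'mutual_info': mic_col})
    print(score_df.sort_values('mutual_info', ascending=False).head(10))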
    

5. Modeling

5.1 Logistic Regression Prediction

 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import cross_val_score

 y_col = 'Attrition'  # target column

 # Baseline model: four satisfaction/involvement features
 train_x_0 = train_local[['EnvironmentSatisfaction', 'JobInvolvement',
                          'JobSatisfaction', 'YearsSinceLastPromotion']]  # RelationshipSatisfaction
 train_y_0 = train_local[y_col]
 clf = LogisticRegression(C=10)
 clf.fit(train_x_0, train_y_0)
 scores = cross_val_score(clf, train_x_0, train_y_0)
 print(scores.mean())  # 0.8370006974003799

 # Full model: every numeric feature except the target
 x_col = [x for x in float64_col if x not in ['Attrition']]
 train_x = train_local[x_col]
 train_y = train_local[y_col]
 clf = LogisticRegression(C=10)
 clf.fit(train_x, train_y)
 scores = cross_val_score(clf, train_x, train_y)
 print(scores.mean())  # 0.8469807373205397

5.2 Random Forest Prediction

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=5,
                             min_samples_split=2, random_state=0)
clf = clf.fit(train_local4[feature_end], train_y)
scores = cross_val_score(clf, train_local4[feature_end], train_y)
print(scores.mean())  # 0.8470288338984681
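
The feature_sel table filtered in the next step is not defined in this excerpt; a plausible reconstruction from the fitted forest's importances would be:

# Hypothetical reconstruction: pair each candidate feature with its importance
feature_sel = pd.DataFrame({'feature': feature_end,
                            'importance': clf.feature_importances_})
feature_sel = feature_sel.sort_values('importance', ascending=False)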

5.3 Modeling with Combined Feature Filtering

# Keep only features whose random-forest importance is at least 0.005
feature_a = feature_sel[feature_sel['importance'] >= 0.005].feature.values
clf = LogisticRegression(C=10)
clf.fit(train_x3[feature_a], train_y)
scores = cross_val_score(clf, train_x3[feature_a], train_y)
print(scores.mean())  # 0.8729528894019191

5.4 Confusion Matrix Evaluation

The model's confusion matrix counts are:

  • TP (predicted 1, actually 1): 7
  • FP (predicted 1, actually 0): 3
  • TN (predicted 0, actually 0): 83
  • FN (predicted 0, actually 1): 7

From these counts:

  • Precision: P = TP / (TP + FP) = 7 / 10 = 0.7
  • Recall: R = TP / (TP + FN) = 7 / 14 = 0.5
  • Accuracy: ACC = (TP + TN) / total = 90 / 100 = 0.9

Precision and recall are both acceptable, so the model's predictions are reasonably good.
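
These counts and rates can be computed directly with scikit-learn; a minimal sketch, assuming a held-out validation split test_x / test_y and the fitted clf from above:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score

pred_y = clf.predict(test_x)
tn, fp, fn, tp = confusion_matrix(test_y, pred_y).ravel()
print(tp, fp, tn, fn)                    # 7, 3, 83, 7 in this run
print(precision_score(test_y, pred_y))   # P = TP / (TP + FP)
print(recall_score(test_y, pred_y))      # R = TP / (TP + FN)
print(accuracy_score(test_y, pred_y))    # ACC = (TP + TN) / total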
