1. Requirement Description
This analysis uses DC employee data. After examining the factors that influence the turnover rate, a model is built to predict which employees are most likely to leave.
2. Dataset Description
The DC employee dataset contains 31 variables and 1100 observations. The key variables are described below.
Employee characteristics fall into the following categories:
- Basic identity variables: gender, age, marital status, education level, and field of study;
- Company-related identity variables: total working years, years at the company, job role, job level, number of companies worked for, department, business travel, and years with the current manager;
- Compensation and benefits variables: monthly income, job involvement, overtime status, performance rating, stock option level, salary-increase percentage, number of training sessions in the previous year, and years since the last promotion;
- Quality-of-life variables: work-environment satisfaction, job satisfaction, relationship satisfaction, work-life balance, and commute distance.
3. Feature Analysis
3.1 Statistical Analysis

Basic employee information:
- The average employee age is about 37; the oldest employee is 60 and the youngest is 18.
- Of the 1100 employees, 178 left, a turnover rate of 16.2%.
- The average monthly income is 6483.6, with a median of 4857.0, a minimum of 1009, and a maximum of 19999.
- There are 653 male and 447 female employees; among those who left, 61.2% were male and 38.8% were female.
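These summary statistics all come from a handful of pandas calls. The sketch below is a minimal illustration on a hypothetical four-row stand-in for the dataset; the column names Age, MonthlyIncome, Attrition, and Gender are assumptions about the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the DC employee data; columns are assumptions.
train_local = pd.DataFrame({
    "Age": [18, 37, 60, 30],
    "MonthlyIncome": [1009, 4857, 19999, 6000],
    "Attrition": [1, 0, 0, 1],  # 1 = left the company
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Age range and mean
age_summary = train_local["Age"].agg(["mean", "min", "max"])
# Turnover rate: mean of the 0/1 attrition flag
attrition_rate = train_local["Attrition"].mean()
# Income statistics
income_summary = train_local["MonthlyIncome"].agg(["mean", "median", "min", "max"])
# Gender breakdown among leavers only
gender_share_of_leavers = (
    train_local.loc[train_local["Attrition"] == 1, "Gender"]
    .value_counts(normalize=True)
)
print(attrition_rate)  # 0.5 on this toy data
```

On the real data the same calls reproduce the figures above (mean age about 37, turnover 16.2%, and so on).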
3.2 Distribution Analysis
By observing and analyzing some variables, the following problems were found:

The turnover rate of 18-23-year-old employees exceeds 40%, consistent with the market's view of the post-95 generation. After age 25, employees become more stable, with turnover holding between 20% and 40%.

The R&D department has the largest number of departures, mainly because it is the company's largest department; despite the high headcount of leavers, it actually has the lowest turnover rate. The HR department has the highest turnover rate and is also the smallest department (so its base is small).

There is a negative correlation between turnover rate and salary-increase percentage. Oddly, turnover exceeds 40% at the largest increase percentage; if the analysis is correct, this suggests that these employees' small salary base leads to high turnover despite the large percentage increase.

The turnover rate of employees with a job-involvement level of 1 reaches 38%, nearly 40%. When effort goes unrewarded, why would anyone stay?

The turnover rate of employees who work overtime is three times that of those who do not. No wonder flexible hours and no forced overtime are now pitched as benefits.

Turnover generally falls as tenure grows: it is 35.8% in the 0-2-year range, rises slightly in the 20-25-year range (perhaps due to starting a business or fading hopes of promotion), and climbs back to 33% in the 31-32-year range.

Work-environment and job-satisfaction data show that low job satisfaction and low environment satisfaction both drive the turnover rate up.
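Each of the comparisons in this section reduces to the same operation: averaging the 0/1 attrition flag within groups. A minimal sketch on toy stand-in data (the column names are assumptions):

```python
import pandas as pd

# Toy stand-in data; "OverTime" and "Attrition" column names are assumptions.
df = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "No", "No", "No", "No"],
    "Attrition": [1,     0,     0,    1,    0,    0],
})

# Turnover rate per group = mean of the 0/1 flag within each category
rate_by_overtime = df.groupby("OverTime")["Attrition"].mean()
print(rate_by_overtime["Yes"], rate_by_overtime["No"])  # 0.5 0.25
```

Swapping in any other grouping column (age bucket, department, job-involvement level, tenure range) yields the corresponding rates discussed above.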
4. Feature Selection
4.1 Data Preprocessing

Split the features into numeric and non-numeric columns:

def obtain_x(train_df, xtype):
    dtype_df = train_df.dtypes.reset_index()
    dtype_df.columns = ['col', 'type']
    # Columns whose dtype is NOT xtype
    return dtype_df[dtype_df.type != xtype].col.values

float64_col = obtain_x(train_local, 'object')  # numeric columns

Normalize the numeric data:

min_max_scaler = preprocessing.MinMaxScaler()
X = min_max_scaler.fit_transform(X)

Encode the categorical data:

le = preprocessing.LabelEncoder()
ohe = preprocessing.OneHotEncoder()
train = pd.DataFrame()
local_all = pd.concat([train_local, test_local], axis=0)
for col in object_col:
    le.fit(local_all[col])
    local_all[col] = le.transform(local_all[col])
    # One column reshaped to (n_samples, 1); reshape(1, 1) was a bug
    ohe.fit(local_all[col].values.reshape(-1, 1))
    ohecol = ohe.transform(local_all[col].values.reshape(-1, 1)).toarray()
    ohecol = pd.DataFrame(ohecol[:, 1:], index=None)  # drop the first dummy column
    ohecol.columns = ohecol.columns.map(lambda x: str(x) + col)
    train = pd.concat([train, ohecol], axis=1, ignore_index=False)
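As an aside, the whole LabelEncoder/OneHotEncoder loop can be collapsed into a single pandas call. The sketch below uses an illustrative column name (an assumption); drop_first=True mirrors dropping the first dummy column ([:, 1:]).

```python
import pandas as pd

# Illustrative categorical column; the name and values are assumptions.
local_all = pd.DataFrame({
    "BusinessTravel": ["Rarely", "Frequently", "Non-Travel", "Rarely"],
})

# One-hot encode and drop the first dummy level in one call
dummies = pd.get_dummies(local_all, columns=["BusinessTravel"], drop_first=True)
print(dummies.shape)  # (4, 2): three categories become two dummy columns
```

This also sidesteps the reshape bookkeeping entirely, at the cost of less control over how train and test encodings are kept in sync.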
4.2 Computing the correlation coefficient

Calculate the impact of each variable on turnover rate
print(train.corr()['Attrition'])
x = train.corr()['Attrition'].index
y = train.corr()['Attrition'].values
plt.figure(figsize=(12, 5))
plt.plot(x, y, 'b.', label="Pearson correlation coefficient")
plt.legend()
plt.ylabel("Pearson correlation coefficient")
plt.xlabel("Variable Name")
plt.xticks(rotation=45)
plt.grid()
for i, j in zip(x, y):
    plt.text(i, j + 0.005, '%.2f' % j, ha='center', va='bottom')
4.3 Feature selection using feature statistics

Remove Low Variance Features
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold()
train_all_sel = sel.fit_transform(train_all)
train_local4 = train_all.iloc[0:len(train_local)]
test_local4 = train_all.iloc[len(train_local):]

# Summary statistics for each feature column (DataFrame.append is deprecated,
# so rows are collected in a list first)
def col_Sta(train_df):
    rows = []
    for col in train_df.columns:
        rows.append({'_columns': col,
                     '_min': np.min(train_df[col]), '_max': np.max(train_df[col]),
                     '_median': np.median(train_df[col]), '_mean': np.mean(train_df[col]),
                     '_ptp': np.ptp(train_df[col]), '_std': np.std(train_df[col]),
                     '_var': np.var(train_df[col])})
    return pd.DataFrame(rows, columns=['_columns', '_min', '_max', '_median',
                                       '_mean', '_ptp', '_std', '_var'])

train_col_sta = col_Sta(train_local4)
test_col_sta = col_Sta(test_local4)
train_col_sta

Calculate the chi-square statistic:

from sklearn.feature_selection import chi2  # chi-square test

chi2_col, pval_col = chi2(train_local4[feature_end].values, train_y)

Calculate the F statistic:

from sklearn.feature_selection import f_classif  # ANOVA F-test

f_col, pval_col = f_classif(train_local4[feature_end].values, train_y)

Calculate mutual information:

from sklearn.feature_selection import mutual_info_classif  # mutual information

mic_col = mutual_info_classif(train_local4[feature_end].values, train_y)
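The three statistics are easiest to compare when collected into one table per feature. A sketch on synthetic data (chi2 requires non-negative inputs, hence the shift; on the real data, X and y would be train_local4[feature_end] and train_y):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

# Synthetic classification data standing in for the real feature matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = X - X.min(axis=0)  # chi2 requires non-negative features

chi2_col, _ = chi2(X, y)
f_col, _ = f_classif(X, y)
mic_col = mutual_info_classif(X, y, random_state=0)

# One table of all three statistics per feature, sorted by mutual information
scores = pd.DataFrame({
    "feature": [f"f{i}" for i in range(X.shape[1])],
    "chi2": chi2_col,
    "F": f_col,
    "MI": mic_col,
}).sort_values("MI", ascending=False)
print(scores)
```

Features that rank low on all three statistics are natural candidates to drop before modeling.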
5. Modeling
5.1 Logistic Regression Prediction

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

y_col = 'Attrition'

# Model on four satisfaction-related features
train_x_0 = train_local[['EnvironmentSatisfaction', 'JobInvolvement',
                         'JobSatisfaction', 'YearsSinceLastPromotion']]  # RelationshipSatisfaction
train_y_0 = train_local[y_col]
clf = LogisticRegression(C=10)
clf.fit(train_x_0, train_y_0)
scores = cross_val_score(clf, train_x_0, train_y_0)
print(scores.mean())  # 0.8370006974003799

# Model on all numeric features
x_col = [x for x in float64_col if x not in ['Attrition']]
train_x = train_local[x_col]
train_y = train_local[y_col]
clf = LogisticRegression(C=10)
clf.fit(train_x, train_y)
scores = cross_val_score(clf, train_x, train_y)
print(scores.mean())  # 0.8469807373205397
5.2 Random Forest Prediction

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=5,
                             min_samples_split=2, random_state=0)
clf = clf.fit(train_local4[feature_end], train_y)
scores = cross_val_score(clf, train_local4[feature_end], train_y)
print(scores.mean())  # 0.8470288338984681
5.3 Model with Combined Feature Filtering

# Keep features whose importance score is at least 0.005, then refit
feature_a = feature_sel[feature_sel['importance'] >= 0.005].feature.values
clf = LogisticRegression(C=10)
clf.fit(train_x3[feature_a], train_y)
scores = cross_val_score(clf, train_x3[feature_a], train_y)
print(scores.mean())  # 0.8729528894019191
5.4 Evaluating the Model with a Confusion Matrix
TP (predicted 1, actual 1): 7
FP (predicted 1, actual 0): 3
TN (predicted 0, actual 0): 83
FN (predicted 0, actual 1): 7
Precision (P): 0.7
Recall (R): 0.5
Accuracy (ACC): 0.9
Both precision and recall are satisfactory, and the model's predictions are good.
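The three metrics follow directly from the four counts; recomputing them confirms the reported values:

```python
# Confusion-matrix counts from the evaluation above
TP, FP, TN, FN = 7, 3, 83, 7

precision = TP / (TP + FP)                    # 7 / 10   = 0.7
recall = TP / (TP + FN)                       # 7 / 14   = 0.5
accuracy = (TP + TN) / (TP + FP + TN + FN)    # 90 / 100 = 0.9
print(precision, recall, accuracy)  # 0.7 0.5 0.9
```

Note that with only 178 leavers among 1100 employees the classes are imbalanced, so accuracy alone overstates performance; precision and recall are the more informative numbers here.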