Fraud phone identification

I recently came across this competition by chance and am sharing my personal notes here. My experience is limited, so discussion and corrections are very welcome!
Data source: "Digital Sichuan Innovation Competition - Fraud Phone Recognition"
The training set consists of four parts:
user: basic information about each user

voc: call records covering 8 months

sms: SMS records covering 8 months

app: mobile-internet (app) usage records covering 8 months

The test set has a similar composition, except that the phone-fee consumption data covers only one month.
For each of these tables, features are aggregated per user, keyed on the phone_no_m column: call, SMS, and internet features are extracted in turn for every phone_no_m.
First, for each phone_no_m in voc, count the number of incoming calls, outgoing calls, and other call types, plus the total call duration.

import numpy as np
import pandas as pd
from tqdm import tqdm

user=pd.read_csv('./train_user.csv')
print(user.info())
voc=pd.read_csv('./train/train_voc.csv')
print(voc.info())
print(voc.head())
user_m = user['phone_no_m'].values
call_out=[]
call_in=[]
call_other=[]
call_time=[]
for name in tqdm(user_m):
    one = voc[voc['phone_no_m'] == name]
    call_out.append(one[one['calltype_id']==1]['phone_no_m'].count())    # outgoing calls
    call_in.append(one[one['calltype_id']==2]['phone_no_m'].count())     # incoming calls
    call_other.append(one[one['calltype_id']==3]['phone_no_m'].count())  # other call types
    call_time.append(one['call_dur'].sum())  # total call duration ('call_dur' assumed from the voc schema)
user_voc = pd.DataFrame({'phone_no_m':user_m,'call_out_times':call_out,
                         'cal_in_times':call_in,  # spelling kept to match the feature list used later
                         'call_other_times':call_other,'call_time':call_time})
user_voc.to_csv('user_voc.csv',index=False)
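The per-number loop above filters the full table once for every user, which gets slow on large data; a pandas groupby can compute the same counts in a single pass. A minimal, self-contained sketch with made-up toy data (the column names mirror the ones used above):

```python
import pandas as pd

# Toy records: each row is one event for a phone number;
# calltype_id 1 = outgoing/sent, 2 = incoming/received.
records = pd.DataFrame({
    'phone_no_m':  ['A', 'A', 'A', 'B', 'B'],
    'calltype_id': [1,    1,   2,   2,   2],
})
users = pd.DataFrame({'phone_no_m': ['A', 'B', 'C']})

# Count rows per (phone, calltype) in one pass, then pivot to columns.
counts = (records.groupby(['phone_no_m', 'calltype_id'])
                 .size()
                 .unstack(fill_value=0)
                 .rename(columns={1: 'up', 2: 'down'})
                 .reset_index())

# Left-merge onto the user list so phones with no records get 0.
features = users.merge(counts, on='phone_no_m', how='left').fillna(0)
print(features)
```

The left merge is what keeps users with no activity in the feature table, which the loop version gets for free by iterating over user_m.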

Next, for each phone_no_m in sms and app, compute the number of SMS messages sent and received, the number of internet-access records, the traffic consumed, and similar features.

user=pd.read_csv('./train_user.csv')
print(user.info())
sms=pd.read_csv('./train/train_sms.csv')
print(sms.info())
print(sms.head())
user_m = user['phone_no_m'].values
print(user_m)
sms_up=[]    # SMS sent (calltype_id == 1)
sms_down=[]  # SMS received (calltype_id == 2)
for name in tqdm(user_m):
    sms_up.append(sms[(sms['phone_no_m'] == name) & (sms['calltype_id']==1)]['phone_no_m'].count())
    sms_down.append(sms[(sms['phone_no_m'] == name) & (sms['calltype_id'] == 2)]['phone_no_m'].count())
user_m=np.array(user_m)
sms_up=np.array(sms_up)
sms_down=np.array(sms_down)
user_sms = pd.DataFrame({'phone_no_m':user_m,'sms_up':sms_up,'sms_down':sms_down})
user_sms.to_csv('user_sms.csv',index=False)
#The block above tallies each user's SMS send/receive counts
user=pd.read_csv('./train_user.csv')
print(user.info())
app=pd.read_csv('./train/train_app.csv')
print(app.info())
print(app.head())
user_m = user['phone_no_m'].values
print(user_m)
app_time=[]
app_flow=[]
for name in tqdm(user_m):
    app_time.append(app[(app['phone_no_m'] == name)]['phone_no_m'].count())  # number of app-usage records
    app_flow.append(app[(app['phone_no_m'] == name)]['flow'].sum())  # total data traffic
user_m=np.array(user_m)
app_time=np.array(app_time)
app_flow=np.array(app_flow)
user_app = pd.DataFrame({'phone_no_m':user_m,'app_time':app_time,'app_flow':app_flow})
user_app.to_csv('user_app.csv',index=False)
#The block above tallies each user's internet-access count and total traffic

Finally, merge all the per-user features into a single training table train_data. Of the location fields, only the city is kept, and rows whose city_name is missing in the training set are filled with "unknown".
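A sketch of this merge step with toy data, since the exact merge code is not shown in the post (the column names are taken from the feature files built above; the toy values are made up):

```python
import pandas as pd

# Hypothetical reconstruction: combine per-user feature tables on
# phone_no_m and fill missing city_name with "unknown".
user = pd.DataFrame({'phone_no_m': ['A', 'B'],
                     'city_name':  ['Chengdu', None]})
user_sms = pd.DataFrame({'phone_no_m': ['A', 'B'],
                         'sms_up': [3, 0], 'sms_down': [1, 5]})
user_app = pd.DataFrame({'phone_no_m': ['A', 'B'],
                         'app_time': [10, 2], 'app_flow': [512.0, 64.0]})

train_data = (user.merge(user_sms, on='phone_no_m', how='left')
                  .merge(user_app, on='phone_no_m', how='left'))
train_data['city_name'] = train_data['city_name'].fillna('unknown')
print(train_data)
```

In the real pipeline the same pattern would be applied to the saved user_voc.csv, user_sms.csv, and user_app.csv files.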

For training, the ratio of outgoing to incoming calls is added as a new feature. The call-in/call-out counts, call charges, and traffic in train_data are then divided by 8 to get monthly averages. (This is a bit crude: some fraud numbers were not active for all 8 months and rack up very high charges in just one or two months, but re-aggregating the large voc table would take a while, so I have left it as is for now and may revisit it later.)
All columns except city are then normalized, and finally get_dummies generates the final training matrix.

import pandas as pd

train_data=pd.read_csv('train_data.csv')
#city_name is kept as a categorical column for get_dummies below
quitmonth=['call_out_times','cal_in_times','call_other_times','call_time','sms_up','sms_down','app_time','app_flow']
train_label = pd.read_csv('./train/train_user.csv')['label'].values
train_data.pop('phone_no_m')
train_data.pop('mean_arpu')
#ratio of outgoing to incoming calls (epsilon avoids division by zero)
train_data['out/in']=train_data['call_out_times']/(train_data['cal_in_times']+1e-8)
data_type = train_data.columns[train_data.dtypes != 'object']
train_data.loc[:,quitmonth]=train_data.loc[:,quitmonth]/8  #convert 8-month totals to monthly averages
#z-score normalize numeric columns (epsilon guards against zero std)
train_data.loc[:, data_type] = (train_data.loc[:, data_type] - train_data.loc[:, data_type].mean()) / (train_data.loc[:, data_type].std()+1e-4)
train_data = pd.get_dummies(train_data)
train_data=train_data.values

Finally, we define a model that takes the layer configuration as a parameter, and try to improve the result by tuning the learning rate, depth, regularization, number of iterations, and other hyperparameters.
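The mymodel class itself is not shown in the post. A minimal sketch of a depth-configurable network consistent with the call mymodel([1,2,2,8]) might look like the following, interpreting the list entries as per-layer width multipliers on an assumed base width (the original definition may well differ):

```python
import tensorflow as tf

def mymodel(widths, base=8):
    # Hypothetical reconstruction: one Dense ReLU layer per list entry,
    # sized widths[i] * base, followed by a sigmoid output for the
    # binary fraud label. `base` is an assumption, not from the post.
    layers = [tf.keras.layers.Dense(w * base, activation='relu') for w in widths]
    layers.append(tf.keras.layers.Dense(1, activation='sigmoid'))
    return tf.keras.Sequential(layers)
```

Passing the layer list as a constructor argument makes it easy to sweep over different depths and widths when tuning.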

import tensorflow as tf
import matplotlib.pyplot as plt

nets=[[1,2,2,8]]  # candidate layer configurations to try
k=0
model = mymodel(nets[k])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
            loss=tf.keras.losses.binary_crossentropy,
            metrics = ['accuracy'])
model.build(input_shape=(None,33))
history=model.fit(train_data,train_label,batch_size=64,epochs=100,validation_split=0.05)
plt.plot(history.history['val_accuracy'])
plt.plot(history.history['val_loss'])
plt.legend([str(nets[k])+'val_accuracy',str(nets[k])+'val_loss'])
plt.show()

That is the whole pipeline. Feedback and discussion are welcome!


Posted on Sun, 07 Jun 2020 04:36:09 -0400 by tripleM