Data mining training camp: learning notes on understanding the competition problem

1, Summary of learning points

  1. Information obtained from the competition description
  2. Reading data
  3. Evaluation and calculation of classification metrics
  4. Evaluation and calculation of regression metrics
  5. Understanding of some terms

2, Learning content:

1. New knowledge learned from the competition

a. Desensitization: masking private information, for example showing a mobile phone number as 186****7392.
b. Label encoding: converting categorical values into numeric form.
c. Anonymity: the meaning of a data column is not disclosed.
d. The evaluation metric measures the gap between the model's predictions and the actual values (the specific metrics are described later).
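As a rough illustration of points a and b, here is a minimal sketch. The helper `mask_phone`, the phone number, and the color column are made up for this example; `LabelEncoder` is the scikit-learn class for label encoding, which may or may not be what the competition organizers actually used.

```python
from sklearn.preprocessing import LabelEncoder


def mask_phone(number: str) -> str:
    """Hypothetical helper: keep the first 3 and last 4 digits, mask the rest."""
    return number[:3] + "*" * (len(number) - 7) + number[-4:]


# Desensitization of a made-up 11-digit phone number
masked = mask_phone("18612347392")
print(masked)

# Label encoding of a made-up categorical column
colors = ["red", "green", "blue", "green"]
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Classes are sorted alphabetically, then mapped to 0, 1, 2, ...
print(list(encoder.classes_))
print(codes.tolist())
```

The encoder keeps the mapping in `classes_`, so the numeric codes can be turned back into the original values with `encoder.inverse_transform`.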

2. Lessons learned from the task code:


From the code, I learned:
a. The head function in pandas displays the first rows of the data (five by default)
b. How to read data
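A minimal sketch of both points, assuming the data is in CSV format. The column names and values are invented, and the text is read from an in-memory string so the example runs without a file; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the competition file
csv_text = (
    "id,price,label\n"
    "1,10.5,0\n"
    "2,8.0,1\n"
    "3,12.3,0\n"
    "4,9.9,1\n"
    "5,11.1,0\n"
    "6,7.4,1\n"
)

df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()    # first five rows by default
first_two = df.head(2)    # or pass the number of rows explicitly

print(df.shape)
print(first_rows)
```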

3. Evaluation and calculation of classification indicators


import numpy as np
from sklearn.metrics import accuracy_score

We use the accuracy_score function here, which is one evaluation method. The classification accuracy score is the fraction of samples that are classified correctly.
sklearn.metrics.accuracy_score(y_true, y_pred, *, normalize=True, sample_weight=None)
If normalize is False, the function returns the number of samples for which y_pred matches y_true. The default is True, which returns the fraction of correct predictions.
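The effect of normalize can be checked with a small sketch; the labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score

# Made-up labels: three of the four predictions are correct
y_true = [0, 1, 1, 1]
y_pred = [0, 1, 0, 1]

# Default (normalize=True): fraction of correct predictions
acc_ratio = accuracy_score(y_true, y_pred)

# normalize=False: number of correct predictions
acc_count = accuracy_score(y_true, y_pred, normalize=False)

print('ACC (ratio):', acc_ratio)
print('ACC (count):', acc_count)
```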

  • Prerequisite knowledge:
    T means the prediction is correct and F means the prediction is wrong
    P means the positive class and N means the negative class
    TP: the prediction is positive and the judgment is correct
    FP: the prediction is positive and the judgment is wrong
    FN: the prediction is negative and the judgment is wrong
    TN: the prediction is negative and the judgment is correct
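One way to obtain these four counts is scikit-learn's `confusion_matrix`. This is a minimal sketch with made-up binary labels; for labels 0 and 1 the matrix is laid out as [[TN, FP], [FN, TP]], so `ravel()` yields the counts in that order.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels, with 1 as the positive class
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print('TP:', tp, 'FP:', fp, 'FN:', fn, 'TN:', tn)
```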
  1. Precision
## Precision
from sklearn import metrics
y_pred = [1,1,0,0,0]
y_true = [1,0,0,1,0]
print('Precision',metrics.precision_score(y_true, y_pred))
'''Analysis: average is not specified here, so it takes the default value 'binary'. 1 is the default positive class, so TP (a positive sample predicted as positive, that is, 1) is 1,
and TP+FP (the total number of samples predicted as 1) is 2, so P = TP/(TP+FP) = 0.5'''

The precision_score function is used here. Its average parameter is 'binary' by default, so by default y_true and y_pred may only contain two classes (for example 0 and 1). Another parameter is involved as well: pos_label specifies which label is treated as the positive class. It is 1 by default, that is, 1 is the positive class.

  • Other values of average, such as the commonly used 'macro' and 'weighted', are also built on the per-class precision P.
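To make the averaging options concrete, here is a small sketch on made-up binary labels. 'macro' is the unweighted mean of the per-class precisions, while 'weighted' weights each class by its number of true samples.

```python
from sklearn.metrics import precision_score

# Made-up binary labels
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0]

# average=None returns one precision per class: [class 0, class 1]
per_class = precision_score(y_true, y_pred, average=None)

# Unweighted mean of the per-class precisions
macro = precision_score(y_true, y_pred, average='macro')

# Mean weighted by class support (3 samples of class 0, 2 of class 1)
weighted = precision_score(y_true, y_pred, average='weighted')

# Treat 0 as the positive class instead of the default 1
precision_for_0 = precision_score(y_true, y_pred, pos_label=0)

print('Per class:', per_class)
print('Macro:', macro)
print('Weighted:', weighted)
print('Precision with pos_label=0:', precision_for_0)
```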
  2. Recall
    Here, as above, average takes the default value 'binary'
print('Recall:',metrics.recall_score(y_true, y_pred))
'''Analysis: first find TP, which is 1. Then find TP+FN (FN: predicted negative but the judgment is wrong, that is, the predicted value is 0 but the true value is 1), which is 2,
so recall = TP/(TP+FN) = 0.5'''
  3. F1 score
print('F1-score:',metrics.f1_score(y_true, y_pred))
# The calculation is (2*P*R) / (P+R), where P and R are the precision and recall computed above
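The formula can be verified by hand against f1_score. This sketch redefines the same example labels so that it runs on its own.

```python
from sklearn import metrics

y_pred = [1, 1, 0, 0, 0]
y_true = [1, 0, 0, 1, 0]

p = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP) = 1 / 2
r = metrics.recall_score(y_true, y_pred)     # TP / (TP + FN) = 1 / 2

# Harmonic mean of precision and recall
f1_manual = 2 * p * r / (p + r)
f1_sklearn = metrics.f1_score(y_true, y_pred)

print('F1 (manual):', f1_manual)
print('F1 (sklearn):', f1_sklearn)
```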

4. Evaluation and calculation of regression metrics

(the relevant information and explanations are written in my notes)

# coding=utf-8
import numpy as np
from sklearn import metrics

# MAPE is implemented by hand here; scikit-learn 0.24 and later also provide metrics.mean_absolute_percentage_error
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE mean square error
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE root mean square error
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE mean absolute error
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE mean absolute percentage error
print('MAPE:',mape(y_true, y_pred))
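To see what each metric computes, the same quantities can be written directly in NumPy. This is a sketch for understanding rather than a replacement for the library functions; it reuses the example arrays from above.

```python
import numpy as np

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

errors = y_pred - y_true

mse = np.mean(errors ** 2)               # mean of the squared errors
rmse = np.sqrt(mse)                      # square root of MSE, in the units of y
mae = np.mean(np.abs(errors))            # mean of the absolute errors
mape = np.mean(np.abs(errors / y_true))  # undefined if any true value is 0

print('MSE:', mse)
print('RMSE:', rmse)
print('MAE:', mae)
print('MAPE:', mape)
```

Because MAPE divides by the true values, it cannot be used when any of them is zero and becomes unstable when they are close to zero.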

In addition to these, there is a metric for goodness of fit:

## R2 score: the coefficient of determination (goodness of fit); the closer it is to 1, the better
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',r2_score(y_true, y_pred))

The closer R2 is to 1, the better the predictions fit the true values!
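R2 is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal NumPy sketch of that definition, using the same example values:

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print('R2 (manual):', r2_manual)
```

A model that always predicts the mean of y_true gets R2 = 0, and a model that does worse than that gets a negative value.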

3, Learning questions and answers

  1. When I encounter an unknown function, I first look for information myself, on CSDN and in the official documentation, and then solve the problem

4, Learning, thinking and summary

  1. I've just started learning, so it's normal to run into things I don't know. I hope to stay calm, find my weaknesses, and fill in the gaps
  2. Try to keep the daily study time at about 2 hours. First, this ensures sustained thinking. Second, it puts some pressure on myself so that I don't slack off
  3. Learn while doing and review in time.

Tags: Python Data Mining sklearn

Posted on Fri, 01 Oct 2021 19:35:04 -0400 by davidguz