Cancer classification with logistic regression - predicting benign / malignant breast tumors and drawing the ROC curve

logistic regression

Logistic regression is a classification algorithm in machine learning, despite the word "regression" in its name. Because it is simple and efficient, it is widely used in practice, especially for binary classification.
The input of logistic regression is the output of a linear regression, i.e. a weighted sum of the features plus a bias.
sigmoid function

The linear regression result is fed into the sigmoid function.
Output: a probability in the interval [0, 1]. The default decision threshold is 0.5: a sample whose probability is above 0.5 is assigned to the positive class, otherwise to the negative class.
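
To make the two steps concrete, here is a minimal sketch in plain Python: a linear score is computed first, then squashed by the sigmoid and compared with the threshold. The weights, bias and sample values are made up for illustration and do not come from the breast cancer data.

```python
import math

def sigmoid(z):
    """Map a real-valued linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample."""
    # Linear regression part: weighted sum of the features plus a bias.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    # Positive class (1) only if the probability is above the threshold.
    return p, int(p > threshold)

# Hypothetical weights and bias, chosen only for illustration.
weights = [0.8, -0.4]
bias = -0.2

prob, label = predict_label([1.5, 0.5], weights, bias)
print("probability:", round(prob, 3), "label:", label)
```

A linear score of 0 maps to exactly 0.5, which is why 0.5 is the natural default threshold.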

sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C = 1.0)

Options for solver: {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}

Default: 'liblinear' in scikit-learn versions before 0.22, 'lbfgs' from 0.22 onward; this is the algorithm used to solve the optimization problem.
For small datasets, 'liblinear' is a good choice, while 'sag' and 'saga' are faster on large datasets.

For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle the multinomial loss; 'liblinear' is limited to one-versus-rest schemes.

penalty: the type of regularization ('l2' by default)

C: inverse of the regularization strength; smaller values mean stronger regularization
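
The effect of C can be checked with a small experiment. The sketch below is only an illustration under stated assumptions: it uses synthetic data from make_classification (not the breast cancer data), and the three C values are arbitrary. Since C is the inverse of the regularization strength, a small C should shrink the learned coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used here only for illustration.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # penalty is left at its default ('l2'); only solver and C are set.
    model = LogisticRegression(solver="liblinear", C=C)
    model.fit(X, y)
    # Length of the coefficient vector: smaller means stronger shrinkage.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: ||w|| = {coef_norms[C]:.3f}")
```

The printed norm should be clearly smaller for C=0.01 than for C=100.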

Case: cancer classification prediction - benign / malignant breast cancer prediction
Dataset link:

from sklearn.model_selection import train_test_split   # train/test split
from sklearn.preprocessing import StandardScaler   # standardization
from sklearn.linear_model import LogisticRegression  # logistic regression
import joblib    # model save/load (sklearn.externals.joblib was removed in scikit-learn 0.23)
import pandas as pd
import numpy as np
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# 1. Access to data
# 1.1 set the column names
names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                   'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                   'Normal Nucleoli', 'Mitoses', 'Class']
# 1.2 read data (the dataset location is missing in the source, so the path is left empty)
data = pd.read_csv("", names=names)
# 2. Data set division
# 2.1 handle missing values (marked with "?" in the raw data)
data = data.replace(to_replace="?", value=np.nan)
data = data.dropna()
# 2.2 set feature values and target value
c_data = data.iloc[:, 1:10]  # feature columns 1-9: skip the sample code number, exclude Class
c_target = data["Class"]
# 2.3 data division
x_train, x_test, y_train, y_test = train_test_split(c_data , c_target , random_state=22)

# 3. Feature Engineering - Standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# 4. Machine learning - logistic regression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)

# 5. Model evaluation
y_predict = estimator.predict(x_test)
print("Accuracy:", estimator.score(x_test, y_test))



Classification evaluation method

In a classification task, the predicted condition and the true condition can combine in four ways (TP, FP, FN, TN), which together form the confusion matrix (the idea also extends to multiclass classification)

Common evaluation indicators:

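Accuracy: proportion of all predictions that are correct, (TP+TN) / (TP+TN+FN+FP)
Precision: how accurate the positive predictions are, TP / (TP+FP)
Recall: how completely the actual positives are found, TP / (TP+FN)
F1 score: harmonic mean of precision and recall, 2 * Precision * Recall / (Precision + Recall); it reflects the robustness of the model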
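
The indicators above can be computed by hand from the four confusion-matrix counts. The following is a minimal sketch in plain Python; the label lists are made up for illustration, and the helper names confusion_counts and classification_metrics are hypothetical, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 as a dict."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 1 = malignant (positive), 0 = benign (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

In a cancer screening setting recall usually matters most, because a false negative (a missed malignant tumor) is more costly than a false positive.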

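ROC curve and AUC index
ROC curve: plot TPR against FPR as the decision threshold varies; the curve is then summarized by a single index, the AUC
AUC: area under the ROC curve; the closer to 1, the better the classifier
An AUC close to 0.5 means the classifier is no better than random guessing; an AUC below 0.5 means it ranks samples worse than random, i.e. its predictions are inverted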
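
How the curve and its area arise can be sketched without any library: lower the threshold step by step, record one (FPR, TPR) point per threshold, then sum the trapezoids under the points. This is an illustrative sketch with made-up labels and scores; roc_points and trapezoid_auc are hypothetical helper names, and library implementations are more careful and more efficient.

```python
def roc_points(y_true, y_score):
    """Sweep the threshold over every distinct score, highest first,
    and return the resulting FPR and TPR lists (starting at (0, 0))."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(y_score), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area

# Made-up true labels (0/1) and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, y_score)
area = trapezoid_auc(fpr, tpr)
print(fpr, tpr, area)
```

Here one negative sample (score 0.4) is ranked above one positive sample (score 0.35), so one of the four positive/negative pairs is ordered wrongly and the area comes out as 0.75.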

Drawing of ROC curve


AUC calculation API
from sklearn.metrics import roc_auc_score
sklearn.metrics.roc_auc_score(y_true, y_score): computes the area under the ROC curve, i.e. the AUC value
y_true: the true class of each sample, which must be labeled 0 (negative example) or 1 (positive example)
y_score: the predicted score, which can be the estimated probability of the positive class, a confidence value, or the return value of the classifier's decision function

y_test = np.where(y_test > 2.5, 1, 0)  # recode classes: 4 (malignant) -> 1, 2 (benign) -> 0
print("AUC index:", roc_auc_score(y_test, y_predict))


AUC index: 0.97432432432432432433
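
Note that the call above passes hard class predictions as y_score. That works, but it reduces the ROC curve to a single operating point. The sketch below uses made-up values (not results from the breast cancer model) to show that the AUC computed from probabilities can differ from the AUC computed from thresholded labels. With a fitted scikit-learn classifier, the positive-class probabilities would typically come from estimator.predict_proba(x_test)[:, 1].

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and positive-class probabilities.
y_true = [0, 0, 1, 1]
y_proba = [0.1, 0.3, 0.4, 0.8]

# Hard labels obtained with the 0.5 threshold: [0, 0, 0, 1].
y_label = [int(p > 0.5) for p in y_proba]

auc_from_proba = roc_auc_score(y_true, y_proba)
auc_from_labels = roc_auc_score(y_true, y_label)
print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

The probabilities rank every positive sample above every negative one, so their AUC is 1.0, while the hard labels lose that ranking information and score lower.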

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.metrics import roc_curve, auc  # compute the ROC curve and the AUC
from sklearn import model_selection
# Recode the target: 2 (benign) -> 0, 4 (malignant) -> 1
c_target = data["Class"].replace(to_replace=2, value=0)
c_target = c_target.replace(to_replace=4, value=1)
X = np.array(c_data, dtype=float)  # "Bare Nuclei" was read as strings because of the "?" markers
y = np.array(c_target)
# Add noise features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
train_data, test_data, train_label, test_label = model_selection.train_test_split(X, y, test_size=.3,random_state=0)
# train_data / test_data: training and test sample sets; train_label / test_label: the corresponding label sets
# Create the classifier: SVM with a linear kernel, other parameters left at their defaults
classifier = svm.SVC(kernel='linear', probability=True, random_state=random_state)
# Train on the training set with fit(), then use decision_function() to score the test samples;
# these scores are the input of roc_curve()
test_predict_label = classifier.fit(train_data, train_label).decision_function(test_data)
# Compute the ROC curve: roc_curve() compares the true test labels with the scores and
# sweeps the decision threshold to obtain one (FPR, TPR) pair per threshold
fpr, tpr, threshold = roc_curve(test_label, test_predict_label)  # false and true positive rates
roc_auc = auc(fpr, tpr)  # AUC is the area under the curve; the larger the better
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)  # FPR on the x-axis, TPR on the y-axis
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()



Posted on Tue, 16 Jun 2020 02:10:02 -0400 by imagenesis