Machine Learning with Python (Part 3)

Model selection and tuning

1. Cross validation


Cross validation process
Cross validation: the available data are split into training and validation sets. The figure below is an example: the data are divided into 5 folds, and one fold is used as the validation set each round. After five rounds, each with a different validation set, we obtain five model scores and take their average as the final result. This is known as 5-fold cross validation.
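
As a minimal sketch of the idea (using scikit-learn's cross_val_score on a built-in toy dataset; the iris data and the classifier here are placeholders, not the check-in example below):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# A small built-in dataset, just to show the mechanics
x, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)

# cv=5 splits the data into 5 folds; each fold serves once as the validation set
scores = cross_val_score(knn, x, y, cv=5)

print("Score of each of the 5 folds:", scores)
print("Average score:", scores.mean())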

2. Grid search

Hyperparameter search: grid search


Cross validation used together with grid search:

Hyperparameter search: grid search API


Taking the K-nearest neighbors code as an example, add cross validation and grid search:

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier


def knncls():
    """
    K-nearest neighbors: predict a user's check-in location
    :return: None
    """
    # Read data
    data = pd.read_csv("./data/FBlocation/train.csv")
    # print(data.head(10))
    # Process the data
    # 1. Shrink the dataset: keep only the check-ins inside a small square region
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
    # Process the time column (seconds -> datetime)
    time_value = pd.to_datetime(data['time'], unit='s')
    print(time_value)
    # Convert to a DatetimeIndex so date components can be extracted
    time_value = pd.DatetimeIndex(time_value)
    # Construct some features from the timestamp
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday
    # Delete timestamp feature
    data = data.drop(['time'], axis=1)
    print(data)
    # Drop place_ids with too few check-ins (keep those with more than 3)
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]
    # Extract the feature values and the target value from the data
    y = data['place_id']
    x = data.drop(['place_id'], axis=1)
    # Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # Feature Engineering (Standardization)
    std = StandardScaler()
    # Standardize the features of the training and test sets
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)
    # Run the algorithm flow (hyperparameter: n_neighbors)
    knn = KNeighborsClassifier()
    # # fit, predict,score
    # knn.fit(x_train, y_train)
    #
    # # Get forecast results
    # y_predict = knn.predict(x_test)
    #
    # print("predicted target check-in location is:", y_predict)
    #
    # # Get the accuracy
    # print("forecast accuracy:", knn.score(x_test, y_test))

    # Construct the parameter values to search over
    param = {"n_neighbors": [3, 5, 10]}

    # Grid search
    gc = GridSearchCV(knn, param_grid=param, cv=2)

    gc.fit(x_train, y_train)

    # Prediction accuracy
    print("Accuracy on test set:", gc.score(x_test, y_test))
    print("Best results in cross validation:", gc.best_score_)
    print("The best model to choose is:", gc.best_estimator_)
    print("Results of each cross validation for each super parameter:", gc.cv_results_)

    return None

Run result:

Summary

cv=2 in grid search means two-fold cross validation: the data are split into two parts, one used as the training set and the other as the validation set, and an accuracy score is computed. In the second round the roles of the two parts are swapped, another score is computed, and the two scores are averaged. In the same way, 3-fold, ..., n-fold cross validation is just such a sequence of role swaps; in every round there is exactly one validation set. The parameter list in the grid search means, for example: which value should K take in K-nearest neighbors? Each candidate value is tried in turn to see which one gives the highest accuracy. If there are several hyperparameters, every combination of them is tried one by one.
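
To make the "every combination of hyperparameters" point concrete, here is a minimal sketch (assuming scikit-learn; the second parameter, weights, is only added here for illustration):

from sklearn.model_selection import ParameterGrid

# Two hyperparameters with several candidate values each
param = {"n_neighbors": [3, 5, 10], "weights": ["uniform", "distance"]}

# ParameterGrid enumerates the Cartesian product: 3 x 2 = 6 combinations
for combination in ParameterGrid(param):
    print(combination)

# GridSearchCV fits the model once per combination and per fold,
# so with cv=2 that would be 6 x 2 = 12 fits in total.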

Classification algorithms: decision tree and random forest

Decision tree


For instance:

The dialogue above is something a person handles: each question checks whether a condition for the next step is met. In other words, deciding whether the two should meet is a judgment made step by step from experience.
A decision tree can do this for us. Generally speaking, the most important criterion for the date is age, so age is placed at the root node: if that condition is not met, there is no need to look at anything else; if it is met, move on to the next condition.

Another example: bank loan data

In the past, a person would make an empirical judgment from the borrower's information about whether or not to grant the loan. Now we hand this decision over to a decision tree. Which features matter most can be analyzed case by case for the specific project, as shown in the figure below:

As the figure above shows, owning a house is the most important feature. Not owning one is not fatal, but then you need to have a job to get the loan.

Measurement and function of information


Guess who will be the champion? If there are 32 teams and I have no record of their previous matches, then all 32 teams have an equal chance of winning the title. The more I learn about their head-to-head history, the more accurate my prediction becomes. When I know nothing about that history, I can work out the answer with the following method:
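
A quick sketch of that method (a yes/no guessing game; the code is only an illustration, assuming each question halves the remaining candidates):

import math

teams = 32

# With no prior information every team is equally likely to win.
# Each yes/no question ("is the champion in this half?") halves the candidates,
# so the number of questions needed is log2(32) = 5.
questions = math.log2(teams)
print("Questions needed to identify the champion:", questions)  # 5.0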


Information entropy


The quantity H is called information entropy, and its unit is the bit.
Formula: H(X) = -Σ p(x_i) · log2 p(x_i), summed over all possible outcomes x_i.
When the 32 teams have the same chance of winning the title, the corresponding information entropy equals 5 bits.

As the figure above shows, once some information is revealed, the information entropy must be less than 5 bits. In other words, information is tied to the elimination of uncertainty: the more specific the information, the smaller the information entropy and the smaller the uncertainty.
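
A minimal sketch verifying the 5-bit figure and the "less than 5" claim (the uneven probabilities below are made up purely for illustration):

import math

def entropy(probs):
    """Information entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 32 teams, all equally likely: H = 5 bits
uniform = [1 / 32] * 32
print(entropy(uniform))   # 5.0

# With extra knowledge the distribution is no longer uniform and H drops below 5
skewed = [0.5, 0.25] + [0.25 / 30] * 30
print(entropy(skewed))    # about 2.7 bits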

One of the criteria for decision tree splitting: information gain



Note: information gain measures how much the uncertainty about the class Y is reduced by knowing the value of feature X: g(D, A) = H(D) - H(D|A).

Calculation of information gain

Calculation of information entropy: H(D) = -Σ_k (|C_k| / |D|) · log2(|C_k| / |D|), where C_k is the set of samples belonging to class k.

Calculation of conditional entropy: H(D|A) = Σ_i (|D_i| / |D|) · H(D_i), where D_i is the subset of samples for which feature A takes its i-th value.

These formulas can look a bit intimidating, so let's work through the loan example:
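
Here is a minimal sketch of that computation on a tiny made-up loan table (the rows below are invented just to show the arithmetic, not taken from the original figure):

from collections import Counter
import math

def entropy(labels):
    """H(D) = -sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# Each row: (has_house, loan_granted) -- invented toy data
rows = [("yes", "yes"), ("yes", "yes"), ("yes", "yes"),
        ("no", "yes"), ("no", "no"), ("no", "no"),
        ("no", "no"), ("no", "yes"), ("no", "no"), ("no", "no")]

labels = [loan for _, loan in rows]
h_d = entropy(labels)  # H(D): entropy of the whole dataset

# Conditional entropy H(D | has_house): weighted entropy of each subset
h_d_a = 0.0
for value in {"yes", "no"}:
    subset = [loan for house, loan in rows if house == value]
    h_d_a += len(subset) / len(rows) * entropy(subset)

print("Information gain g(D, has_house) =", h_d - h_d_a)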

Algorithms commonly used in decision trees

Decision tree API:
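
A quick sketch of the scikit-learn API (the parameter values shown are just the common defaults):

from sklearn.tree import DecisionTreeClassifier

# criterion: 'gini' (default) or 'entropy' (split on information gain instead)
# max_depth: maximum depth of the tree (None means grow until the leaves are pure)
dec = DecisionTreeClassifier(criterion='gini', max_depth=None)
# dec.fit(x_train, y_train)       # train
# dec.score(x_test, y_test)       # accuracy on the test set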

Decision tree case: predicting the survival of Titanic passengers


The data are as follows:

The code is as follows:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier


def decision():
    """
    Decision tree (and random forest) to predict Titanic passenger survival
    :return: None
    """
    # get data
    titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")

    # Process the data: pick out the feature values and the target value
    x = titan[['pclass', 'age', 'sex']]

    y = titan['survived']

    print(x)
    # Handle missing values in the age column
    x['age'] = x['age'].fillna(x['age'].mean())

    # Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering: one-hot encode the categorical features with DictVectorizer
    dv = DictVectorizer(sparse=False)

    # to_dict(orient="records") turns each row into a dict such as
    # {'pclass': '1st', 'age': 29.0, 'sex': 'female'}
    x_train = dv.fit_transform(x_train.to_dict(orient="records"))

    print(dv.get_feature_names())

    x_test = dv.transform(x_test.to_dict(orient="records"))
    print(x_train)
    # Forecasting with decision tree
    # dec = DecisionTreeClassifier()
    #
    # dec.fit(x_train, y_train)
    #
    # # Prediction accuracy
    # print("forecast accuracy:", dec.score(x_test, y_test))
    #
    # # Export the structure of the decision tree to a .dot file
    # export_graphviz(dec, out_file="./tree.dot", feature_names = ['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'female', 'male'])

    # Predict with a random forest (with hyperparameter tuning)
    rf = RandomForestClassifier()

    param = {"n_estimators": [120, 200, 300, 500, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}

    # Grid search and cross validation
    gc = GridSearchCV(rf, param_grid=param, cv=2)

    gc.fit(x_train, y_train)

    print("Accuracy:", gc.score(x_test, y_test))

    print("To view the selected parameter model:", gc.best_params_)

    return None

The results are as follows:

Structure of the decision tree and saving it locally


The code that exports the structure of the decision tree is the commented-out export_graphviz call in the Titanic example above. Let's take a look at the tree's structure; the splitting criterion defaults to the Gini coefficient.

Clearly the tree is too complex. The Gini criterion by itself makes the splits very fine-grained, so during training some abnormal points also get their own branch, which hurts accuracy on the test set. The extreme case is a single sample occupying a leaf node, which is clearly wrong: why should the tree route down a whole branch just because of that one point?

samples is the number of samples at each node.
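
A minimal sketch of how the exported tree can be kept simpler (reusing x_train and y_train from the Titanic example above; the max_depth and min_samples_leaf values are arbitrary choices for illustration, not tuned ones):

from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Limiting depth and requiring a minimum number of samples per leaf
# prevents a single outlier from getting its own branch
dec = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
dec.fit(x_train, y_train)

export_graphviz(dec, out_file="./tree.dot",
                feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'female', 'male'])

# The .dot file can then be converted to an image with Graphviz, e.g.:
#   dot -Tpng tree.dot -o tree.png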

Advantages and disadvantages of decision trees, and how to improve them



Pruning, as the name implies, means cutting off the branches created by those abnormal points.
Here we focus on the random forest; notes on pruning will be added later.

Ensemble learning method: random forest


What is a random forest?





Ensemble learning API
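
A quick sketch of the scikit-learn API (the parameter values shown are just examples of the ones tuned in the Titanic code):

from sklearn.ensemble import RandomForestClassifier

# n_estimators: number of trees in the forest
# max_depth: maximum depth of each tree
# bootstrap: whether each tree is trained on a bootstrap sample (default True)
rf = RandomForestClassifier(n_estimators=120, max_depth=5, bootstrap=True)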

Advantages of random forest

The code is in the Titanic example above.
