Machine Learning [Zhou Zhihua's "watermelon book" - Notes] Day 13: Semi-Supervised Learning

13.1 Unlabeled samples
Let's start with two concepts:

1) Labeled samples

Training sample set Dl = {(x1, y1), (x2, y2), ..., (xl, yl)}; the class labels of these l samples are known.

2) Unlabeled samples

Training sample set Du = {xl+1, xl+2, ..., xl+u}, where u is usually much larger than l; the class labels of these u samples are unknown.

Supervised learning builds a model from the labeled set Dl alone and ignores the information contained in the unlabeled set Du. When Dl is small, the generalization ability of the learned model tends to be weak, so we should consider exploiting Du. There are two ways to learn with Du:

1) Active learning

First train a model on Dl, then use this model to select a sample from Du and ask a human expert for its label, turning an unlabeled sample into a labeled one; the model is then retrained on the enlarged labeled set. This can greatly reduce labeling cost: only a small number of informative samples need to be labeled by the expert, so good performance is obtained with few queries.

Active learning introduces expert knowledge, transforming some unlabeled samples into labeled ones through external interaction. If we can exploit unlabeled samples without such external labeling, we enter the scope of semi-supervised learning.
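As a concrete illustration, here is a minimal active-learning loop using uncertainty sampling; the logistic-regression learner, the synthetic dataset, and the query budget are illustrative assumptions (not from the book), and `y[query]` stands in for the expert's answer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
labeled = list(np.where(y == 0)[0][:3]) + list(np.where(y == 1)[0][:3])  # small Dl
unlabeled = [i for i in range(len(y)) if i not in labeled]               # large Du

model = LogisticRegression()
for _ in range(10):                        # query budget: 10 expert labels
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    # query the sample the current model is least certain about
    query = unlabeled[int(np.argmin(np.abs(proba[:, 1] - 0.5)))]
    labeled.append(query)                  # the "expert" reveals y[query]
    unlabeled.remove(query)
model.fit(X[labeled], y[labeled])          # retrain on the enlarged labeled set
```

Selecting the least-certain sample is only one query strategy; the point is that each query converts one unlabeled sample into a labeled one.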

2) Semi supervised learning

Unlabeled samples do not directly carry label information, but if they are drawn independently from the same distribution as the labeled samples, the information they contain about the data distribution is helpful for building the model.

Semi-supervised learning lets the learner exploit unlabeled samples automatically, without relying on external interaction, to improve learning performance. In real tasks it is common that unlabeled samples are plentiful while labeled samples are scarce; how to make good use of unlabeled samples to improve the generalization ability of the model is the focus of semi-supervised learning research.

To make use of unlabeled samples, we must assume that the data-distribution information they reveal is related to the class labels.

One common assumption is the cluster assumption: the data has a cluster structure, and samples in the same cluster belong to the same category. The other is the manifold assumption: the data lies on a manifold, and neighboring samples have similar output values; proximity is described by similarity, so distance computation is fundamental.

The manifold assumption can be regarded as a generalization of the cluster assumption; since it places no restriction on the output values, it applies to more learning tasks. In essence both assumptions say the same thing: similar samples have similar outputs.

Semi-supervised learning can be further divided into pure semi-supervised learning and transductive learning. Pure semi-supervised learning assumes that the unlabeled samples in the training data are not the data to be predicted, while transductive learning assumes that the unlabeled samples seen during training are exactly the data to be predicted, and the goal is optimal generalization performance on those unlabeled samples.

Pure semi-supervised learning is based on an open-world assumption: the learned model should also apply to data not observed during training. Transductive learning is based on a closed-world assumption: it only tries to predict the unlabeled data observed during learning.

13.2 Generative methods
Generative methods are directly based on a generative model. They assume that all data, labeled or not, are generated by the same underlying model. Under this assumption, unlabeled data are linked to the learning objective through the parameters of the latent model; the labels of unlabeled data can be treated as missing values and solved by maximum-likelihood estimation via the EM algorithm. The crux of a generative method is the model assumption: different model assumptions produce different methods, and the assumed generative model must match the true data distribution, otherwise using unlabeled data can actually degrade generalization performance. The approach is simple to implement, but in real tasks it is often hard to make an accurate model assumption in advance unless sufficient and reliable domain knowledge is available. Below, generative semi-supervised learning is illustrated with a Gaussian mixture model solved by EM.
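As a minimal sketch of the idea, here is EM for a 1-D two-component Gaussian mixture in which labeled samples have their component responsibilities fixed by their class labels, while unlabeled samples get soft posterior responsibilities (the book gives the full multivariate derivation; the data, true means of -2 and 3, initial parameters, and iteration count below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
# two classes, each generated by one Gaussian component (true means -2 and 3)
x_l = np.concatenate([rng.normal(-2, 1, 5), rng.normal(3, 1, 5)])       # labeled
y_l = np.array([0] * 5 + [1] * 5)
x_u = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])   # unlabeled

mu = np.array([-1.0, 1.0])                 # initial component means
sigma = np.array([1.0, 1.0])               # initial standard deviations
pi = np.array([0.5, 0.5])                  # initial mixing weights

def normal_pdf(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

x_all = np.concatenate([x_l, x_u])
for _ in range(50):
    # E-step: unlabeled responsibilities are posteriors under the current
    # model; labeled responsibilities are fixed by the known class label
    p = pi * np.stack([normal_pdf(x_u, mu[k], sigma[k]) for k in range(2)], axis=1)
    gamma = np.vstack([np.eye(2)[y_l], p / p.sum(axis=1, keepdims=True)])
    # M-step: weighted maximum-likelihood parameter updates
    Nk = gamma.sum(axis=0)
    mu = (gamma * x_all[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x_all[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / Nk.sum()
```

The labeled samples anchor the correspondence between mixture components and classes, which is exactly how the unlabeled data get tied to the learning objective.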

Other generative semi-supervised learning methods can be derived by replacing the Gaussian mixture model with, for example, a mixture-of-experts model or a naive Bayes model.
13.3 Semi-supervised SVM

13.4 Graph-based semi-supervised learning
Given a data set, we map it into a graph: each sample corresponds to a node, and if the similarity (or correlation) between two samples is high, there is an edge between the corresponding nodes, with edge strength proportional to that similarity. Nodes corresponding to labeled samples are "colored", while nodes for unlabeled samples are not; semi-supervised learning then corresponds to the process of color diffusing, or propagating, over the graph. Since a graph corresponds to a matrix, graph-based semi-supervised learning algorithms can be derived and analyzed in terms of matrix operations.
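The color-diffusion picture above can be sketched as an iterative label-propagation loop; the Gaussian similarity graph, the clamping scheme, and the iteration count below are illustrative choices rather than the book's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(1)
# two clusters in the plane; only one point in each cluster is labeled
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([4, 4], 0.5, (20, 2))])
y = np.full(40, -1)                        # -1 marks "unlabeled"
y[0], y[20] = 0, 1

# weight matrix: edge strength proportional to sample similarity
W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=2))
np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)       # row-normalized propagation matrix

F = np.zeros((40, 2))                      # soft label ("color") per node
F[y >= 0] = np.eye(2)[y[y >= 0]]
for _ in range(100):
    F = P @ F                              # diffuse label color along the edges
    F[y >= 0] = np.eye(2)[y[y >= 0]]       # clamp the labeled nodes
pred = F.argmax(axis=1)
```

Note that the whole algorithm is two matrix operations repeated: multiply by the propagation matrix, then re-clamp the labeled rows.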
13.5 Disagreement-based methods
Unlike generative methods, semi-supervised SVM, and graph-based semi-supervised learning, which exploit unlabeled data with a single learner, disagreement-based methods use multiple learners, and the disagreement between the learners is crucial for exploiting the unlabeled data. Co-training is an important representative: it was designed for multi-view data and is a representative of multi-view learning.

1) Multi-view data

Multi-view data means a data object simultaneously has multiple attribute sets, each constituting a view. For example, a movie has the attribute set of its image information, of its sound information, of its subtitle information, and of its online publicity and discussion. Considering only the image view and the sound view, a movie-clip sample is represented as (<x1, x2>, y), where xi is the sample in view i, i.e. the attribute vector obtained from that view's attribute description: x1 is the attribute vector in the image view, x2 the attribute vector in the sound view, and y the label, e.g. the movie genre.

2) Compatibility

Suppose the different views are compatible, i.e. they carry the same information about the output space y: if y1 denotes the label space that can be discriminated from the image information and y2 the label space that can be discriminated from the sound information, then y = y1 = y2.

On the basis of compatibility, different views carry complementary information, which is convenient for constructing learners. For example, from the image of a movie clip showing two people looking at each other, the genre cannot be determined; but once the sound information "I love you" is added, the clip can be identified as a romance.

3) Co-training

Co-training exploits precisely the compatibility and complementarity of multi-view data. It assumes the data has two sufficient and conditionally independent views: sufficiency means each view contains enough information to produce an optimal learner; conditional independence means the views are independent given the class label.
13.6 Semi-supervised clustering

Clustering is a typical unsupervised learning task, but in real clustering tasks we can often obtain some additional supervision information, so semi-supervised clustering can be used to obtain a better clustering result.
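As one concrete example of using such supervision, here is a minimal sketch of seeded (constrained) k-means, where a few labeled samples supply the initial cluster centers and are clamped to their known classes; the synthetic data, fixed cluster centers, and seed counts are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150, centers=[[0, 0], [6, 0], [3, 6]],
                  cluster_std=1.0, random_state=2)
seed_idx = np.concatenate([np.where(y == k)[0][:3] for k in range(3)])  # 3 seeds/class

# initial centers: mean of each class's labeled seeds
centers = np.vstack([X[np.where(y == k)[0][:3]].mean(axis=0) for k in range(3)])
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (150, 3)
    assign = d.argmin(axis=1)
    assign[seed_idx] = y[seed_idx]      # labeled seeds keep their known class
    centers = np.vstack([X[assign == k].mean(axis=0) for k in range(3)])
acc = float(np.mean(assign == y))
```

Because the seeds fix which cluster corresponds to which class, the clustering result can be compared against the true labels directly, with no label-permutation ambiguity.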

Now, straight to the code: a density-peaks style clustering implementation used to find cluster centers.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

test_data = make_blobs(n_samples=375, n_features=2, centers=3, cluster_std=1)
test_data = test_data[0]

def distanceNorm(Norm, D_value):
    # compute the chosen norm of the difference vector D_value
    if Norm == '1':            # L1 (Manhattan) norm
        counter = np.sum(np.absolute(D_value))
    elif Norm == '2':          # L2 (Euclidean) norm
        counter = np.sqrt(np.sum(np.power(D_value, 2)))
    elif Norm == 'Infinity':   # L-infinity norm
        counter = np.max(np.absolute(D_value))
    else:
        raise Exception('Unsupported norm: ' + str(Norm))

    return counter

def chi(x):
    # indicator function: counts a point as inside the cutoff range if x < 0
    if x < 0:
        return 1
    else:
        return 0

def fit(features, t, distanceMethod='2'):
    '''For every point, compute:
    1. Its local density: first choose the cutoff distance dc, then subtract
       dc from the distances of all points to that point; the number of points
       whose result is below 0 (i.e. within the cutoff) is the density.
    2. Its delta distance: the minimum distance to any point of higher local
       density; if the point already has the largest local density, delta is
       assigned the maximum distance from all other points to it.
    3. Cluster centers are the points whose density and delta are both
       relatively large; if several nearby points have similarly large delta,
       any one of them can serve as the cluster center.
    '''
    # initialization
    labels = list(np.arange(features.shape[0]))
    distance = np.zeros((len(labels), len(labels)))   # pairwise distance matrix
    distance_sort = list()                            # flat list of distances, for sorting
    density = np.zeros(len(labels))                   # one density per point (per sample)
    distance_higherDensity = np.zeros(len(labels))    # delta distance per point

    # compute the distance matrix (upper triangle, indexed by row/column)
    for index_i in range(len(labels)):
        for index_j in range(index_i + 1, len(labels)):
            D_value = features[index_i] - features[index_j]
            distance[index_i, index_j] = distanceNorm(distanceMethod, D_value)
            distance_sort.append(distance[index_i, index_j])
    distance += distance.T                            # symmetrize

    # calculate the cutoff: sort all pairwise distances ascending and take the
    # distance at the t-quantile (t = 0.02, i.e. the first 2%; this choice is
    # justified elsewhere in the literature), rounded to an integer index
    distance_sort = np.sort(np.array(distance_sort))
    cutoff = distance_sort[int(np.round(len(distance_sort) * t))]

    # compute the density of every point
    for index_i in range(len(labels)):
        # row of distances from point i minus the cutoff: a negative entry
        # means that point lies within point i's cutoff range
        distance_cutoff_i = distance[index_i] - cutoff
        for index_j in range(len(labels)):
            if index_j != index_i:
                # density of point i = number of points within its cutoff range
                density[index_i] += chi(distance_cutoff_i[index_j])

    # search for the points of maximum density
    Max = np.max(density)
    MaxIndexList = list()
    for index_i in range(len(labels)):
        if density[index_i] == Max:
            # collect the index of every point whose density equals the maximum
            MaxIndexList.append(index_i)

    # compute distance_higherDensity (delta); delta is taken in one of two ways:
    # 1. for an ordinary point, delta is the minimum distance to any point of
    #    higher local density: start Min at the maximum distance in that row,
    #    then shrink it over every point that is both denser and closer;
    # 2. for a point that already has the highest local density, delta is the
    #    maximum distance from all other points to it.
    for index_i in range(len(labels)):
        if index_i in MaxIndexList:
            distance_higherDensity[index_i] = np.max(distance[index_i])
            continue
        Min = np.max(distance[index_i])
        for index_j in range(len(labels)):
            if density[index_i] < density[index_j] and distance[index_i, index_j] < Min:
                Min = distance[index_i, index_j]
        distance_higherDensity[index_i] = Min

    return density, distance_higherDensity, cutoff

#########  This part is my own idea, used to find the cluster centers automatically #################
dense, dis_higher, cutoff = fit(test_data, t=0.02)
gamma = np.sort(dense * dis_higher, axis=0)
gamma_sorted = gamma[::-1]                  # gamma = density * delta, descending
P = 2
dn = int(test_data.shape[0] * P / 100)      # only inspect the top 2% of gamma values
# theta: average second difference of the sorted gamma curve
sum_sj = 0
for i in range(2, dn):
    sj = abs(abs(gamma_sorted[i - 1] - gamma_sorted[i]) - abs(gamma_sorted[i] - gamma_sorted[i + 1]))
    sum_sj += sj
theta = sum_sj / (dn - 2)
print('this is theta:', theta)
# collect every position whose jump exceeds theta; the largest jump separates
# the cluster centers from the ordinary points
arr = []
idx = []
for i in range(2, dn):
    sj = abs(abs(gamma_sorted[i - 1] - gamma_sorted[i]) - abs(gamma_sorted[i] - gamma_sorted[i + 1]))
    if sj > theta:
        arr.append(sj)
        idx.append(i)
        print('this is abs:', sj)
print('this is arr:', arr)
ap = idx[np.argmax(arr)]                    # position of the largest jump in gamma_sorted
print('this is gamma_sorted[ap]:', gamma_sorted[ap - 1], gamma_sorted[ap], gamma_sorted[ap + 1])

plt.scatter(np.arange(dn), gamma_sorted[:dn], s=20, c='black', alpha=1)
plt.scatter(np.arange(gamma_sorted.shape[0]), gamma_sorted, s=20, c='r', alpha=1)
plt.show()

Here are some result plots I made. They show that this algorithm can find the cluster centers of some data sets, but for other data sets finding the cluster centers automatically remains somewhat difficult.


Posted on Mon, 17 Feb 2020 01:12:26 -0500 by biggieuk