Particle swarm optimization for K-means clustering (MATLAB source code implementation)

Introduction to K-means

K-means is the most commonly used clustering algorithm based on Euclidean distance: it assumes that the closer two points are, the more similar they are.

1. Implementation of traditional K-means

1.1 algorithm flow

1. Import the data, obtain its dimension and value range, and choose the number of clusters K;
2. Initialize the K cluster centers, usually with random numbers (because the initial centers are random, the algorithm occasionally fails to converge);
3. Compute the Euclidean distance from each point to the K cluster centers and assign each point to its nearest center (minimum-distance rule);
4. Recompute each of the K cluster centers as the mean of the points assigned to it;
5. Repeat steps 3-4 until convergence: (a) the maximum number of iterations is reached, or (b) the cluster centers (or the total distance) no longer change noticeably. A vectorized sketch of one iteration follows this list.
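A minimal vectorized sketch of steps 3-4, assuming dataset (m-by-dim) and centers (K-by-dim) are in the workspace and MATLAB is R2017b or newer (for vecnorm); the loop-based implementation actually used in this article is given in section 1.2.3:

[m,~] = size(dataset);
K = size(centers,1);
d = zeros(m,K);
for k = 1:K
    d(:,k) = vecnorm(dataset - centers(k,:),2,2);   % distance of every point to center k
end
[~,label] = min(d,[],2);                            % step 3: minimum-distance rule
for k = 1:K
    centers(k,:) = mean(dataset(label == k,:),1);   % step 4: mean update
end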

1.2 algorithm implementation

1.2.1 dataset

This experiment uses a two-dimensional, three-class data set named R3; it contains 120 points, 40 per class, which the plotting code below relies on.
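The original download link for R3 is not preserved here. If the file is unavailable, a stand-in with the same shape (120 two-dimensional points, 40 per class, with centers roughly matching the results reported below) can be generated; the spread of 0.3 is an assumption:

rng(1);                                    % reproducible
c = [11.2 11.6; 10.0 10.0; 12.1 10.0];     % rough class centers, taken from the results below
R3 = [c(1,:) + 0.3*randn(40,2);
      c(2,:) + 0.3*randn(40,2);
      c(3,:) + 0.3*randn(40,2)];
save('R3.txt','R3','-ascii');              % written in the format load("R3.txt") expects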

1.2.2 main script (mostly plotting)

clc;clear;close all;
tic;
% data 
dataset = load("R3.txt");
dataset = dataset(:,1:2);
%% K-means 
K =3;
iter_max = 1000;
tol = 0.0001;
[point,total_dist] = myk_means(dataset,K,iter_max,tol);
%% plot origin dataset
color = {'r','g','b'};
% plot original points
subplot(1,2,1);
title("orgin point");
hold on;
scatter(dataset(1:40,1),dataset(1:40,2),36,color{1},'.');
scatter(dataset(41:80,1),dataset(41:80,2),36,color{2},'.');
scatter(dataset(81:120,1),dataset(81:120,2),36,color{3},'.');
legend("cluster1","cluster2","cluster3");
hold off;
%% plot after cluster using K-means
subplot(1,2,2);
title("after Kmeans");
hold on;
for i = 1:K
    point(i).point = [];
     for j = 1:length(point(i).cluster)
            point(i).point = [point(i).point ; dataset(point(i).cluster(j),:)];
     end
     
     scatter(point(i).point(:,1),point(i).point(:,2),36,color{i},'.');
     scatter(point(i).mean(1),point(i).mean(2),36,color{i},'^');
end
toc;

final_dist = 0;
for i =1:K
    for j = 1:length(point(i).cluster)
        final_dist = final_dist + norm(dataset(point(i).cluster(j),:) - point(i).mean);
    end  
end
disp(['Optimal distance:',num2str(final_dist)]);
figure(2)
plot(total_dist,"-b");
title(['k means',' ','Optimal distance:',num2str(total_dist(end))]);
xlabel('iter');
ylabel("total distance");

1.2.3 function body of the traditional K-means

% dataset    : input data set
% K          : number of clusters
% iter_max   : maximum number of iterations
% tol        : convergence tolerance
% point      : returned structure array holding, for each of the K classes, its cluster center and the indices of its points in the original data
% total_dist : total within-cluster distance recorded at each iteration
function [point,total_dist] = myk_means(dataset,K,iter_max,tol)

total_dist = [];
%% initialization: randomly generate K cluster centers
[m,dim] = size(dataset);
max1 = max(dataset);
min1 = min(dataset);
for i = 1:K
    point(i).mean = rand(1,dim).*(max1-min1) + min1;
end
iter = 1;
flag = 1;
while iter < iter_max  && flag
%% assign every point to a  particular cluster using min-distance-rule
count = zeros(K,1);
for i = 1:m
    temp_distance = inf;
    for  j = 1:K
         distance = norm(dataset(i,:)-point(j).mean);
         if distance <temp_distance
             temp_distance = distance;
             index = j;    % record the nearest cluster so far
         end
    end
    count(index) = count(index)+1;
    point(index).cluster(count(index)) = i;
end

%% truncate each cluster's index list to this iteration's count
for i = 1:K
    point(i).cluster = point(i).cluster(1:count(i));
    temp_mean(i,:) = point(i).mean;   % record the previous mean point
end

%% compute the new mean points
for i = 1:K
    % mean over all points currently assigned to cluster i (an empty
    % cluster yields NaN, which is the non-convergence case noted above)
    point(i).mean = mean(dataset(point(i).cluster,:),1);
end

%% compute distance between last_mean_point and new_mean_point
delta = 0;
for i =1:K
    delta = delta + norm(temp_mean(i,:)-point(i).mean);
end

if delta <= tol
    flag = 0;
end

iter = iter +1;
%% record the total within-cluster distance of this iteration
tp_dist = 0;
for i =1:K
    for j = 1:length(point(i).cluster)
        tp_dist = tp_dist + norm(dataset(point(i).cluster(j),:) - point(i).mean);
    end  
end
total_dist = [total_dist;tp_dist];
end
end

1.3 experimental results

The cluster centers (rounded to two decimal places) are: [11.24, 11.62], [9.99, 10.05], [12.06, 10.03].

1.3.1 comparison of the original labels with the clustering results

1.3.2 convergence curve of the total clustering distance


Summary of the traditional K-means: it is very fast and delivers a result after only a few iterations, but it occasionally fails to converge because it is sensitive to the initial cluster centers.

2. Particle swarm optimization implementation of K-means

The same R3 data set as in the traditional K-means experiment is used.

2.1 construction of optimization function (fitness function)

The K-means clustering problem is recast as an optimization problem: find K cluster centers that minimize the sum of the distances from all points to their assigned centers.
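With data points x_i (i = 1, ..., m) and candidate centers c_1, ..., c_K, the fitness to be minimized is

$$ f(c_1,\dots,c_K) = \sum_{i=1}^{m} \min_{1 \le k \le K} \lVert x_i - c_k \rVert $$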

2.2 code implementation of fitness function

 This part consists of two pieces:
	1. Allocation: for each candidate set of cluster centers, compute the Euclidean distance from every data point to every center, then assign each point to a class by the minimum-distance rule;
	2. Total distance (fitness): sum the Euclidean distances from all data points to their assigned centers and return that sum as the fitness of the candidate centers.

2.2.1 minimum distance criterion allocation

% one_pop : one particle, i.e. the K_C centers packed row-wise (length K_C*dim)
% dataset : data set (data_num-by-dim)
% K_C     : number of clusters
function [total_dist,res] = assignment(one_pop,dataset,K_C)
[data_num,dim] = size(dataset);

% unpack the K_C centers: entries (k-1)*dim+1 .. k*dim hold center k
for k = 1:K_C
    res(k).mean(1,1:dim) = one_pop(1,(k-1)*dim+1:k*dim);
end

% assign each point by the minimum-distance rule
total_dist = 0; 
count = zeros(K_C,1);

for i = 1:data_num
    temp_dist = inf;
    dist_record = zeros(K_C,1);
    for k =1:K_C
        dist_record(k) = norm(dataset(i,1:dim)- res(k).mean(1,1:dim));
        if(temp_dist > dist_record(k))
            count_flag = k;                                         % record the nearest class so far
            temp_dist = dist_record(k);
        end
    end


    count(count_flag,1) = count(count_flag,1) + 1;
    total_dist = total_dist + dist_record(count_flag);
    res(count_flag).index(count(count_flag,1)) = i;           % record the point's index under its class
end

end
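A quick check of assignment (illustrative; for K_C = 3 on two-dimensional data, one_pop packs the centers row-wise as [c1x c1y c2x c2y c3x c3y]; the numbers below are the centers reported in section 1.3):

one_pop = [11.24 11.62 9.99 10.05 12.06 10.03];   % three packed 2-D centers
[d,res] = assignment(one_pop,dataset,3);          % d: total distance, res: per-class indices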

2.2.2 population fitness: calls assignment for every individual

% x_pop   : population, one candidate set of centers per row
% dataset : data set
% K_C     : number of clusters
function fitness = Fitness(x_pop,dataset,K_C)
[pop_num,~] = size(x_pop);

fitness = zeros(pop_num,1);
% evaluate each individual's set of cluster centers
for i = 1:pop_num
    [fitness(i),~] = assignment(x_pop(i,:),dataset,K_C);
end
end
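A small sanity check of the fitness evaluation (a sketch; any pop_num-by-K_C*dim matrix of candidate centers works):

lo = repmat(min(dataset),1,3);          % 1-by-6 lower bounds (K_C = 3, dim = 2)
hi = repmat(max(dataset),1,3);          % 1-by-6 upper bounds
x_pop = rand(20,6).*(hi - lo) + lo;     % 20 random individuals inside the data range
fitness = Fitness(x_pop,dataset,3);     % one total-distance value per individual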

2.3 main function (particle swarm optimization)

Particle swarm optimization process (the update formulas are sketched after this list):
1. Initialize the population positions and velocities, and the parameters c1, c2 and the inertia weight w;
2. Compute the fitness of the initial population and initialize: a. the individual best positions, b. the individual best fitness values, c. the global best position, d. the global best fitness;
3. Update each particle's velocity (handling the velocity bounds) and position (handling out-of-range positions);
4. Check and update the individual best positions and fitness values and the global best position and fitness;
5. Repeat steps 3-4 until convergence or until the maximum number of iterations is reached.
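For reference, the standard updates behind step 3 (r_1, r_2 are uniform random numbers in [0,1], p_i is particle i's best position, g the global best):

$$ v_i \leftarrow w\,v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i $$

The script below deviates slightly by damping the position step, i.e. it uses x_i ← x_i + 0.5 v_i.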

2.3.1 PSO-K-means main script

1. The particle length is determined by the number of cluster centers K_C and the data dimension dim: each particle has K_C*dim entries;
2. The global best may fail to improve between two consecutive generations, so the stopping criterion is relaxed: iteration stops only when the summed change of the best fitness over three consecutive generations falls below the tolerance.

clc;clear;close all;
% particle swarm optimization applied to the k-means problem (minimization)
tic;
dataset = load("R3.txt");
dataset = dataset(:,1:2);

[~,dim] = size(dataset);

K_C = 3;         % number of clusters
tol = 1e-5;      % stopping tolerance

% pso parameter
pop_num = 100;
iter_max = 100;
c1 = 1.5;
c2 = 1.5;
w = 0.9;      % inertia weight

pos_max = max(dataset);
pos_min = min(dataset);
V_max = [0.5,0.5];
V_min = [-0.5,-0.5];


% replicate the per-dimension bounds across the K_C centers of a particle
for k = 1:K_C
    x_max(1,(k-1)*dim+1:k*dim) = pos_max;
    x_min(1,(k-1)*dim+1:k*dim) = pos_min;
    v_max(1,(k-1)*dim+1:k*dim) = V_max;
    v_min(1,(k-1)*dim+1:k*dim) = V_min;
end
x_pop = zeros(pop_num,K_C*dim);          % population positions
v_pop = zeros(pop_num,K_C*dim);          % population velocities
for i = 1:pop_num
    x_pop(i,:) = rand(1,K_C*dim).*(x_max-x_min) + x_min;
    v_pop(i,:) = rand(1,K_C*dim).*(v_max-v_min) + v_min;
end

fitness = Fitness(x_pop,dataset,K_C);
[~,index] = sort(fitness);
gbest = x_pop(index(1),:);                % global best position
gbest_fitness = fitness(index(1));        % global best fitness
pbest = x_pop;                            % individual best positions
pbest_fitness = fitness;                  % individual best fitness values

fit_iter = zeros(iter_max,1);
fit_iter(1,1) = gbest_fitness; 
iter = 2;
last_iter = iter_max;   % last recorded iteration, used for plotting below
flag = 1;               % accuracy control, set to 0 once the tolerance is met
while(iter <= iter_max && flag == 1)    
    %Update crowd speed and position
    for i = 1:pop_num
        %Update group speed
        v_pop(i,:) = w*v_pop(i,:) + c1*rand*(pbest(i,:)-x_pop(i,:)) + c2*rand*(gbest-x_pop(i,:));   % rand is a uniform random scalar in [0,1]
        
        for j = 1:K_C*dim                  % velocity bound handling: clamp to the limits
            if(v_pop(i,j) > v_max(1,j))
                v_pop(i,j) = v_max(1,j);
            end
            if(v_pop(i,j) < v_min(1,j))
                v_pop(i,j) = v_min(1,j);
            end
        end
        
        % update the particle's position (the 0.5 factor damps the step)
        x_pop(i,:) = x_pop(i,:) + 0.5 * v_pop(i,:);
       
        for j = 1:K_C*dim                  % position bound handling: clamp out-of-range entries
            if(x_pop(i,j) > x_max(1,j))
                x_pop(i,j) = x_max(1,j);   % clamp to the maximum
            end
            if(x_pop(i,j) < x_min(1,j))
                x_pop(i,j) = x_min(1,j);   % clamp to the minimum
            end
        end
    end % end of particle loop i
    
    % Recalculate fitness
    fitness = Fitness(x_pop,dataset,K_C);
    for i = 1:pop_num
        % update the individual best position and fitness
        if (fitness(i) < pbest_fitness(i))
            pbest(i,:) = x_pop(i,:);
            pbest_fitness(i) = fitness(i);
            % a new individual best may in turn improve the global best
            if (pbest_fitness(i) < gbest_fitness)
                gbest = pbest(i,:);
                gbest_fitness = pbest_fitness(i);
            end
        end
    end % end of particle loop i
    fit_iter(iter,1) = gbest_fitness;
    
    delta_sum = 0;
    % sum the change of the best fitness over the last three generations
    if( iter > 3 )
        for co = 1:3
            delta_sum = delta_sum + abs(fit_iter(iter+1-co,1) - fit_iter(iter-co,1));
        end
        if(delta_sum < tol)
            flag = 0;
            last_iter = iter;
        end
    end
    
    iter = iter + 1;
end

[~,res] = assignment(gbest,dataset,K_C);
toc;

figure(1);
hold on;
plot(fit_iter(1:last_iter));
xlabel("iteration");
ylabel('total distance');
title(['PSO k-means',' ','Optimal distance:',num2str(fit_iter(last_iter))]);
hold off;
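Section 2 plots only the convergence curve. A sketch for also visualizing the resulting clusters, mirroring the plot from section 1 (an illustration; it assumes every cluster received at least one point):

figure(2);
hold on;
color = {'r','g','b'};
for k = 1:K_C
    pts = dataset(res(k).index,:);                           % points assigned to cluster k
    scatter(pts(:,1),pts(:,2),36,color{k},'.');              % cluster points
    scatter(res(k).mean(1),res(k).mean(2),36,color{k},'^');  % cluster center
end
title('after PSO k-means');
hold off;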

2.3.2 results

The cluster centers (rounded to two decimal places; their order is arbitrary) are: [11.19, 11.65], [10.01, 10.01], [12.09, 10.00].

Summary of PSO K-means: this evolutionary approach needs no hand-picked initial cluster centers, converges stably, and tends to find the global optimum.
Disadvantage: its computational cost is higher than that of the traditional method.
