# K-means Introduction

K-means is the most commonly used clustering algorithm. It is based on Euclidean distance and assumes that the closer two points are, the more similar they are.

## 1. Implementation of traditional K-means

### 1.1 algorithm flow

1. Import the data, get its dimension and value range, and choose the number of clusters K;
2. Initialize the cluster centers, usually with random numbers (because the initial centers are random, the algorithm occasionally fails to converge);
3. Compute the (Euclidean) distance from each point to the K cluster centers and assign each point to the nearest of the K classes;
4. Using the points in each of the K classes, recompute the mean to update the K cluster centers;
5. Repeat steps 3-4 until convergence (a. the maximum number of iterations is reached, or b. the cluster centers or total distance no longer change significantly).
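The five steps above can be sketched in Python (a minimal NumPy sketch, not the MATLAB implementation used later; function and parameter names are illustrative):

```python
import numpy as np

def kmeans(data, k, iter_max=1000, tol=1e-4, rng=None):
    """Plain K-means following the five steps above."""
    rng = np.random.default_rng(rng)
    lo, hi = data.min(axis=0), data.max(axis=0)
    # Step 2: random initial centers inside the data's value range
    centers = rng.uniform(lo, hi, size=(k, data.shape[1]))
    labels = np.zeros(len(data), dtype=int)
    for _ in range(iter_max):
        # Step 3: assign each point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the mean of its points (keep empty clusters in place)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers barely move
        if np.linalg.norm(new_centers - centers) <= tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

As noted in step 2, a bad random initialization can still produce a poor or empty cluster; the sketch simply keeps an empty cluster's center in place.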

### 1.2 algorithm implementation

#### 1.2.1 dataset

This experiment uses a two-dimensional, three-class dataset; the dataset link is R3.

#### 1.2.2 main function (mainly plotting)

```matlab
clc; clear; close;
tic;
%% data (the variable "dataset" is assumed to already be in the workspace)
dataset = dataset(:,1:2);
%% K-means
K = 3;
iter_max = 1000;
tol = 0.0001;
[point,total_dist] = myk_means(dataset,K,iter_max,tol);
%% plot the original dataset
color = {'r','g','b'};
subplot(1,2,1);
title("original points");
hold on;
scatter(dataset(1:40,1),dataset(1:40,2),'.',color{1});
scatter(dataset(41:80,1),dataset(41:80,2),'.',color{2});
scatter(dataset(81:120,1),dataset(81:120,2),'.',color{3});
legend("cluster1","cluster2","cluster3");
hold off;
%% plot the clusters found by K-means
subplot(1,2,2);
title("after K-means");
hold on;
for i = 1:K
    % collect the rows of dataset belonging to cluster i
    point(i).point = [];
    for j = 1:length(point(i).cluster)
        point(i).point = [point(i).point ; dataset(point(i).cluster(j),:)];
    end
    scatter(point(i).point(:,1),point(i).point(:,2),".",color{i});
    scatter(point(i).mean(1),point(i).mean(2),"^",color{i});   % cluster center
end
toc;

% total distance of the final clustering
final_dist = 0;
for i = 1:K
    for j = 1:length(point(i).cluster)
        final_dist = final_dist + norm(dataset(point(i).cluster(j),:) - point(i).mean);
    end
end
disp(['Optimal distance: ',num2str(final_dist)]);

% convergence curve of the total distance
figure(2);
plot(total_dist,"-b");
title(['K-means',' ','Optimal distance: ',num2str(total_dist(end))]);
xlabel('iteration');
ylabel('total distance');
```

#### 1.2.3 implementation of the traditional K-means core

```matlab
% dataset   input data
% K         number of clusters
% iter_max  maximum number of iterations
% tol       convergence tolerance
% point     struct array holding, for each of the K clusters, its center (mean)
%           and the indices of its points in the original data
function [point,total_dist] = myk_means(dataset,K,iter_max,tol)

total_dist = [];
%% initialization: randomly generate K centers inside the data's value range
[m,dim] = size(dataset);
max1 = max(dataset);
min1 = min(dataset);
for i = 1:K
    point(i).mean = rand(1,dim).*(max1-min1) + min1;
end

iter = 1;
flag = 1;
while iter < iter_max && flag
    %% assign every point to a cluster by the minimum-distance rule
    count = zeros(K,1);
    for i = 1:m
        temp_distance = inf;
        for j = 1:K
            distance = norm(dataset(i,:)-point(j).mean);
            if distance < temp_distance
                temp_distance = distance;
                index = j;              % nearest cluster so far
            end
        end
        count(index) = count(index)+1;
        point(index).cluster(count(index)) = i;
    end

    %% trim stale entries from each cluster and record the previous centers
    for i = 1:K
        point(i).cluster = point(i).cluster(1:count(i));
        temp_mean(i,:) = point(i).mean;
    end

    %% compute the new centers (mean of each cluster)
    for i = 1:K
        s = zeros(1,dim);               % 1*dim sum of the cluster's attribute values
        for j = 1:length(point(i).cluster)
            s = s + dataset(point(i).cluster(j),:);
        end
        point(i).mean = s./length(point(i).cluster);
    end

    %% total movement of the centers in this iteration
    delta = 0;
    for i = 1:K
        delta = delta + norm(temp_mean(i,:)-point(i).mean);
    end
    if delta <= tol
        flag = 0;
    end
    iter = iter + 1;

    %% total distance of the current clustering
    tp_dist = 0;
    for i = 1:K
        for j = 1:length(point(i).cluster)
            tp_dist = tp_dist + norm(dataset(point(i).cluster(j),:) - point(i).mean);
        end
    end
    total_dist = [total_dist;tp_dist];
end
end
```

### 1.3 experimental results

The cluster centers (rounded to two decimal places) are: [11.24, 11.62], [9.99, 10.05], [12.06, 10.03].

#### 1.3.2 convergence diagram of total clustering distance

Summary of the traditional K-means method: it is very fast, converging after only a few iterations; occasionally it fails to converge, since it is sensitive to the initial cluster centers.

## 2. K-means via particle swarm optimization

The same dataset R3 as for traditional K-means is used.

### 2.1 construction of optimization function (fitness function)

The K-means clustering problem is recast as an optimization problem: find K cluster centers that minimize the sum of the distances from all points to their assigned centers.

### 2.2 code implementation of fitness function

This part consists of two pieces:

1. Allocation: for each particle's set of cluster centers, compute the Euclidean distance of every data point to each center, then assign each point to a class by the minimum-distance rule;
2. Total distance (fitness): sum the Euclidean distances from all points to their assigned centers and return the sum as that particle's fitness.
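The two pieces above can be sketched in Python (a NumPy sketch mirroring the MATLAB functions below; names are illustrative):

```python
import numpy as np

def assignment(centers_flat, data, k):
    """Unpack one particle (flat vector) into k centers, assign every point
    to its nearest center, and return (total distance, labels)."""
    dim = data.shape[1]
    centers = centers_flat.reshape(k, dim)
    # Euclidean distance from every point to every center
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                    # minimum-distance rule
    total_dist = dists[np.arange(len(data)), labels].sum()
    return total_dist, labels

def fitness(population, data, k):
    """Fitness of every particle = total distance for its set of centers."""
    return np.array([assignment(p, data, k)[0] for p in population])
```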

#### 2.2.1 minimum distance criterion allocation

```matlab
function [total_dist,res] = assignment(one_pop,dataset,K_C)
% unpack one particle (a row vector) into K_C cluster centers
[data_num,dim] = size(dataset);
for k = 1:K_C
    res(k).mean(1,1:dim) = one_pop(1,dim*k-1:dim*k);    % indexing valid for dim = 2
end

% minimum-distance rule
total_dist = 0;
count = zeros(K_C,1);
for i = 1:data_num
    temp_dist = inf;
    dist_record = zeros(K_C,1);
    for k = 1:K_C
        dist_record(k) = norm(dataset(i,1:dim) - res(k).mean(1,1:dim));
        if temp_dist > dist_record(k)
            count_flag = k;                             % nearest cluster so far
            temp_dist = dist_record(k);
        end
    end
    count(count_flag,1) = count(count_flag,1) + 1;
    total_dist = total_dist + dist_record(count_flag);
    res(count_flag).index(count(count_flag,1)) = i;     % record the point's index
end
end
```

#### 2.2.2 population fitness (calls assignment for each particle)

```matlab
function fitness = Fitness(x_pop,dataset,K_C)
% fitness of each particle = total distance of its cluster assignment
[pop_num,~] = size(x_pop);
fitness = zeros(pop_num,1);
for i = 1:pop_num
    [fitness(i),~] = assignment(x_pop(i,:),dataset,K_C);
end
end
```

### 2.3 main function (particle swarm optimization)

Particle swarm optimization process:
1. Initialize the population positions and velocities, and the parameters c1, c2 and the inertia factor w;
2. Compute the fitness of the initial population and initialize: a. the personal best positions, b. the personal best fitness values, c. the global best position, d. the global best fitness;
3. Update each particle's velocity (handling the velocity bounds) and position (handling out-of-bounds positions);
4. Check and update the personal best positions/fitness and the global best position/fitness;
5. Repeat steps 3-4 until convergence or the maximum number of iterations is reached.
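The loop above can be sketched as a generic PSO minimizer in Python (a NumPy sketch; parameter defaults mirror the MATLAB code below, and the function names are illustrative):

```python
import numpy as np

def pso(objective, x_min, x_max, pop_num=100, iter_max=100,
        w=0.9, c1=1.5, c2=1.5, v_limit=0.5, rng=None):
    """Minimal particle swarm minimizer following steps 1-5 above."""
    rng = np.random.default_rng(rng)
    dim = len(x_min)
    # Step 1: random positions inside the bounds, random velocities
    x = rng.uniform(x_min, x_max, (pop_num, dim))
    v = rng.uniform(-v_limit, v_limit, (pop_num, dim))
    # Step 2: initial personal and global bests
    fit = np.apply_along_axis(objective, 1, x)
    pbest, pbest_fit = x.copy(), fit.copy()
    g = fit.argmin()
    gbest, gbest_fit = x[g].copy(), fit[g]
    for _ in range(iter_max):
        # Step 3: velocity update, clamped to the bounds, then position update
        r1, r2 = rng.random((pop_num, 1)), rng.random((pop_num, 1))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        v = np.clip(v, -v_limit, v_limit)
        x = np.clip(x + v, x_min, x_max)
        # Step 4: refresh personal and global bests
        fit = np.apply_along_axis(objective, 1, x)
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = x[better], fit[better]
        g = pbest_fit.argmin()
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest, gbest_fit
```

Plugging in the fitness function from section 2.2 as `objective` turns this into PSO-K-means.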

#### 2.3.1 PSO-K-means implementation

1. The individual (particle) length is determined by the number of cluster centers K_C and the data dimension dim: K_C*dim;
2. The global best may not change between two consecutive generations, so the convergence condition is strengthened: the sum of the fitness changes over three consecutive generations must be below the tolerance, i.e. three generations must pass essentially unchanged before the iteration stops.

```matlab
clc; clear; close;
% Particle swarm optimization applied to the K-means problem (minimization)
tic;
% the variable "dataset" is assumed to already be in the workspace
dataset = dataset(:,1:2);
[~,dim] = size(dataset);

K_C = 3;          % number of clusters
tol = 10e-6;      % convergence tolerance

% PSO parameters
pop_num = 100;
iter_max = 100;
c1 = 1.5;
c2 = 1.5;
w = 0.9;          % inertia factor

pos_max = max(dataset);
pos_min = min(dataset);
V_max = [0.5,0.5];
V_min = [-0.5,-0.5];

% replicate the bounds across the K_C centers packed into one particle
for k = 1:K_C
    x_max(1,k*dim-1:k*dim) = pos_max;
    x_min(1,k*dim-1:k*dim) = pos_min;
    v_max(1,k*dim-1:k*dim) = V_max;
    v_min(1,k*dim-1:k*dim) = V_min;
end

x_pop = zeros(pop_num,K_C*dim);          % population positions
v_pop = zeros(pop_num,K_C*dim);          % population velocities
for i = 1:pop_num
    x_pop(i,1:K_C*dim) = rand(1,K_C*dim).*(x_max-x_min) + x_min;
    v_pop(i,1:K_C*dim) = rand(1,K_C*dim).*(v_max-v_min) + v_min;
end

fitness = Fitness(x_pop,dataset,K_C);
[~,index] = sort(fitness);
gbest = x_pop(index(1),:);              % global best position
gbest_fitness = fitness(index(1));      % global best fitness
pbest = x_pop;                          % personal best positions
pbest_fitness = fitness;                % personal best fitness

fit_iter = zeros(iter_max,1);
fit_iter(1,1) = gbest_fitness;
cooo = iter_max;   % iteration at which convergence was reached
iter = 2;
flag = 1;          % set to 0 once the tolerance is met
while iter <= iter_max && flag == 1
    % update velocities and positions
    for i = 1:pop_num
        % velocity update (rand is a random number in [0,1])
        v_pop(i,:) = w*v_pop(i,:) + c1*rand*(pbest(i,:)-x_pop(i,:)) ...
                     + c2*rand*(gbest-x_pop(i,:));
        % clamp the velocity to the bounds
        for j = 1:K_C*dim
            if v_pop(i,j) > v_max(1,j)
                v_pop(i,j) = v_max(1,j);
            end
            if v_pop(i,j) < v_min(1,j)
                v_pop(i,j) = v_min(1,j);
            end
        end

        % position update
        x_pop(i,:) = x_pop(i,:) + 0.5*v_pop(i,:);

        % clamp the position to the bounds
        for j = 1:K_C*dim
            if x_pop(i,j) > x_max(1,j)
                x_pop(i,j) = x_max(1,j);
            end
            if x_pop(i,j) < x_min(1,j)
                x_pop(i,j) = x_min(1,j);
            end
        end
    end

    % recompute fitness and update the personal and global bests
    fitness = Fitness(x_pop,dataset,K_C);
    for i = 1:pop_num
        if fitness(i) < pbest_fitness(i)
            pbest(i,:) = x_pop(i,:);
            pbest_fitness(i) = fitness(i);
            % a personal-best update may also improve the global best
            if pbest_fitness(i) < gbest_fitness
                gbest = pbest(i,:);
                gbest_fitness = pbest_fitness(i);
            end
        end
    end
    fit_iter(iter,1) = gbest_fitness;

    % sum of the fitness changes over the last three generations
    s = 0;
    if iter > 3
        for co = 1:3
            s = s + abs(fit_iter(iter+1-co,1) - fit_iter(iter-co,1));
        end
        if s < tol
            flag = 0;
            cooo = iter;
        end
    end

    iter = iter + 1;
end

[~,res] = assignment(gbest,dataset,K_C);
toc;

figure(1);
hold on;
plot(fit_iter(1:cooo));
xlabel('iteration');
ylabel('total distance');
title(['PSO k-means',' ','Optimal distance: ',num2str(fit_iter(cooo))]);
hold off;
```

#### 2.3.2 results

The cluster centers (rounded to two decimal places; the order of the centers is arbitrary) are: [11.19, 11.65], [10.01, 10.01], [12.09, 10.00].

PSO-K-means summary: the advantage of this evolutionary approach is that it does not depend on the choice of initial centers, converges stably, and tends toward the global optimum.
Disadvantage: the amount of computation is larger than that of the traditional method.

Posted on Thu, 02 Sep 2021 18:49:47 -0400 by sundawg