# Data dimensionality reduction and visualization -- t-SNE

t-SNE is one of the most effective methods for dimensionality reduction and visualization, but its drawbacks are just as clear: it needs a lot of memory and takes a long time to run. It is still very useful when we want to classify high-dimensional data and do not know whether the data set is well separated (i.e. small distances within a class and large distances between classes). In that case we can project the data into 2- or 3-dimensional space with t-SNE. If the data is separable in the low-dimensional space, it is separable in the original space. If it is not separable in the low-dimensional space, the data may really be inseparable, or it may simply be that it cannot be projected into a low-dimensional space without losing that structure.

The principle, parameters and examples of t-SNE are briefly introduced below.

## t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE (`TSNE` in scikit-learn) converts the similarities between data points into probabilities. Similarities in the original space are represented by Gaussian joint probabilities, and similarities in the embedded space are represented by a Student's t-distribution.
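To make the two kinds of similarity concrete, here is a minimal NumPy sketch. It is a simplification, not the scikit-learn implementation: the helper names are made up for illustration, and a single fixed bandwidth `sigma` is assumed for every point, whereas real t-SNE tunes one bandwidth per point to match the perplexity.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def gaussian_joint_probabilities(X, sigma=1.0):
    """Joint probabilities P in the original space (fixed bandwidth, an assumption)."""
    cond = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(cond, 0.0)               # a point is not its own neighbor
    cond /= cond.sum(axis=1, keepdims=True)   # conditional p(j|i): each row sums to 1
    n = X.shape[0]
    return (cond + cond.T) / (2.0 * n)        # symmetrize into a joint distribution


def student_t_joint_probabilities(Y):
    """Joint probabilities Q in the embedded space (Student t, one degree of freedom)."""
    weights = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))   # 6 made-up points in the original 5-D space
Y = rng.normal(size=(6, 2))   # a made-up 2-D embedding of the same 6 points

P = gaussian_joint_probabilities(X)
Q = student_t_joint_probabilities(Y)
print("sum of P:", P.sum())
print("sum of Q:", Q.sum())
```

Both matrices are symmetric, have a zero diagonal and sum to 1, so each can be read as a probability distribution over pairs of points.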

Manifold learning methods such as Isomap, LLE and their variants are better suited to unfolding a single continuous low-dimensional manifold. If we instead want to visualize the similarity relationships between samples accurately, t-SNE performs better, as shown for the S-curve in the following figure (different colors represent different categories of data), because t-SNE focuses mainly on the local structure of the data.

The quality of the visualization is measured by the Kullback-Leibler (KL) divergence between the joint probabilities of the original space and those of the embedded space. The KL divergence is used as the loss function and is minimized by gradient descent until it converges. Note that the loss is not convex, so runs with different initializations can converge to different local minima of the KL divergence and produce different results. It is therefore sometimes useful to try several random seeds (in Python, a different seed gives a different random initialization) and keep the result with the lowest KL divergence.
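The seed-selection idea can be sketched as follows. The data set is a small synthetic stand-in generated with `make_blobs`, and the three seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small synthetic data set (a stand-in for real data).
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

results = {}
for seed in (0, 1, 2):
    # Random initialization, so each seed starts from a different layout.
    tsne = TSNE(n_components=2, init="random", random_state=seed)
    embedding = tsne.fit_transform(X)
    results[seed] = (tsne.kl_divergence_, embedding)
    print(f"seed={seed}: KL divergence = {tsne.kl_divergence_:.4f}")

# Keep the run with the lowest final KL divergence.
best_seed = min(results, key=lambda s: results[s][0])
best_kl, best_embedding = results[best_seed]
print("best seed:", best_seed)
```

Comparing KL divergences this way is only meaningful between runs that share the same data and the same perplexity.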

The main disadvantages of t-SNE are:

- t-SNE is computationally expensive. It can take several hours on data sets with millions of samples, where PCA finishes in seconds or minutes.
- The Barnes-Hut t-SNE method (described below) is limited to 2D or 3D embeddings.
- The algorithm is stochastic, so runs with different seeds can produce different embeddings. Choosing the result with the smallest loss is fine, but selecting the hyperparameters may take many runs.
- The global structure is not explicitly preserved. This can be mitigated by initializing the points with PCA (using `init='pca'`).

### Optimized t-SNE

The main purpose of t-SNE is the visualization of high-dimensional data, so it works best when the data is embedded in 2D or 3D. Optimizing the KL divergence can sometimes be a bit tricky. Five parameters control the optimization of t-SNE and therefore the quality of the final visualization:

- perplexity
- early exaggeration factor
- learning rate
- maximum number of iterations
- angle
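As a rough illustration, the five parameters map onto `TSNE` keyword arguments as sketched below. The values are arbitrary and the data is synthetic. One assumption to note: the maximum-iteration argument is called `n_iter` in this article, while newer scikit-learn releases name it `max_iter`, so the sketch checks which name the installed version accepts.

```python
import inspect

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data: 200 samples, 20 features.
X, _ = make_blobs(n_samples=200, n_features=20, centers=4, random_state=0)

# Pick whichever name the installed scikit-learn uses for the iteration limit.
tsne_args = inspect.signature(TSNE).parameters
iter_name = "max_iter" if "max_iter" in tsne_args else "n_iter"

params = {
    "perplexity": 20.0,          # roughly the number of effective neighbors
    "early_exaggeration": 12.0,  # early exaggeration factor
    "learning_rate": 200.0,      # step size of the gradient descent
    "angle": 0.5,                # Barnes-Hut accuracy/speed trade-off
    iter_name: 500,              # maximum number of iterations
}

tsne = TSNE(n_components=2, init="random", random_state=0, **params)
Y = tsne.fit_transform(X)
print("iterations run:", tsne.n_iter_)
print("final KL divergence:", tsne.kl_divergence_)
```

Changing one value at a time and watching `kl_divergence_` together with the plot is a simple way to see what each parameter does.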

### Barnes-Hut t-SNE

Barnes-Hut t-SNE mainly improves the speed of traditional t-SNE and is the most popular t-SNE method. It differs from traditional t-SNE in several ways:

- Barnes-Hut only works when the target dimensionality is 3 or less. It is mainly used for 2D visualization.
- Barnes-Hut only works with dense input data. Sparse data matrices can only be embedded with the exact method, or can be approximated by a dense low-rank projection, for example with `sklearn.decomposition.TruncatedSVD`.
- Barnes-Hut is an approximation of the exact method. The `angle` parameter controls the approximation. With `method="exact"`, `TSNE()` uses the traditional method and the `angle` parameter is not used.
- Barnes-Hut scales to more data. It can be used to embed hundreds of thousands of data points.
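The sparse-data workaround mentioned in the list could look like the sketch below. The sparse matrix is random made-up data, and the choice of 30 SVD components is an arbitrary assumption, not a recommendation.

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Made-up sparse matrix: 200 samples, 500 features, 5% non-zero entries.
X_sparse = sp.random(200, 500, density=0.05, format="csr", random_state=0)

# Step 1: project the sparse data onto a dense, low-rank representation.
svd = TruncatedSVD(n_components=30, random_state=0)
X_dense = svd.fit_transform(X_sparse)

# Step 2: run Barnes-Hut t-SNE (the default method) on the dense projection.
tsne = TSNE(n_components=2, method="barnes_hut", init="random", random_state=0)
Y = tsne.fit_transform(X_dense)
print(X_sparse.shape, "->", X_dense.shape, "->", Y.shape)
```

Reducing the dimensionality first also shortens the distance computations, so the same two-step pattern is often used for dense data with many features.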

For visualization purposes (the main use of t-SNE), the Barnes-Hut method is strongly recommended. The traditional method, `method="exact"`, can reach the theoretical limit of the algorithm and gives better results, but its computational cost restricts it to small data sets.

For MNIST, the digits separate naturally by label after t-SNE visualization (see the example at the end of this article), whereas after PCA the different classes overlap in the plot, which illustrates the strength of the nonlinear nature of t-SNE. Note that a failure to show well-separated, homogeneously labeled groups with t-SNE in 2D does not necessarily mean that a supervised model cannot classify the data correctly. It may simply be that two dimensions are not enough to represent the internal structure of the data accurately.

### Notes

- All features of the data set should be on the same scale.
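One common way to put all features on the same scale is to standardize them before running t-SNE. The sketch below uses `StandardScaler` as one possible choice; the three-feature data set with very different scales is made up for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: three features whose scales differ by orders of magnitude.
X = np.column_stack([
    rng.normal(0.0, 1.0, 200),
    rng.normal(0.0, 1000.0, 200),
    rng.normal(0.0, 0.01, 200),
])

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print("std before:", X.std(axis=0))
print("std after: ", X_scaled.std(axis=0))

Y = TSNE(n_components=2, init="random", random_state=0).fit_transform(X_scaled)
```

Without this step the second feature would dominate every pairwise distance, and the other two would have almost no influence on the embedding.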

### Parameter description

| Parameter | Description |
|---|---|
| `n_components` | int, default: 2. Dimension of the embedded space (i.e. the output space). |
| `perplexity` | float, default: 30. Larger data sets usually require a larger value. Recommended values are between 5 and 50. |
| `early_exaggeration` | float, default: 12.0. Controls how tight the natural clusters of the original space are in the embedded space and how much space lies between them. Larger values leave more space between natural clusters in the embedded space. The choice of this parameter is not very critical. If the cost function increases during the initial optimization, the value may be too high. |
| `learning_rate` | float, default: 200.0. Learning rate. Recommended values are between 10.0 and 1000.0. |
| `n_iter` | int, default: 1000. Maximum number of iterations. |
| `n_iter_without_progress` | int, default: 300. Maximum number of iterations without progress before the optimization is aborted. Progress is only checked every 50 iterations, so the value is rounded to a multiple of 50. |
| `min_grad_norm` | float, default: 1e-7. The optimization stops if the gradient norm falls below this value. |
| `metric` | string or callable. Distance metric used between samples. |
| `init` | string or numpy array, default: `"random"`. Can be `'random'`, `'pca'` or a numpy array of shape `(n_samples, n_components)`. |
| `verbose` | int, default: 0. Verbosity level, i.e. whether the training process is printed. |
| `random_state` | int, RandomState instance or None, default: None. Controls the random number generation. |
| `method` | string, default: `'barnes_hut'`. Use the default for large data sets and `'exact'` for small data sets. |
| `angle` | float, default: 0.5. Only used when `method='barnes_hut'`. |

| Attribute | Description |
|---|---|
| `embedding_` | Embedding vectors |
| `kl_divergence_` | Final KL divergence |
| `n_iter_` | Number of iterations run |

| Method | Description |
|---|---|
| `fit` | Project X into the embedded space |
| `fit_transform` | Project X into the embedded space and return the transformed result |
| `get_params` | Get the parameters of t-SNE |
| `set_params` | Set the parameters of t-SNE |
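The attributes and methods listed above can be exercised in a few lines. This is only an illustrative sketch on synthetic data, and the perplexity value of 15 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data: 100 samples, 8 features.
X, _ = make_blobs(n_samples=100, n_features=8, centers=3, random_state=0)

tsne = TSNE(n_components=2, init="random", random_state=0)
tsne.set_params(perplexity=15.0)                       # change a parameter after construction
print("perplexity:", tsne.get_params()["perplexity"])  # read it back

Y = tsne.fit_transform(X)   # fit and return the embedding in one call

print("embedding shape:", tsne.embedding_.shape)
print("final KL divergence:", tsne.kl_divergence_)
print("iterations run:", tsne.n_iter_)
print("same as return value:", np.array_equal(Y, tsne.embedding_))
```

Note that the table lists no separate `transform` method: the embedding is only defined for the samples passed to `fit` or `fit_transform`.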

### Examples

#### Hello World

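As a simple example, take four three-dimensional data points and reduce them to two dimensions with t-SNE.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
# perplexity must be smaller than the number of samples (4 here);
# recent scikit-learn versions raise an error for the default of 30.
tsne = TSNE(n_components=2, perplexity=3)
tsne.fit_transform(X)
print(tsne.embedding_)

# Output of the original run (the values change with the seed, the
# parameters and the scikit-learn version):
# [[   3.17274952 -186.43092346]
#  [  43.70787048 -283.6920166 ]
#  [ 100.43157196 -145.89025879]
#  [ 140.96669006 -243.15138245]]
```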

#### Dimension reduction and visualization of S-curve

The points on the S-curve live in three-dimensional space, and different colors mark different parts of the curve (treated here as categories). When we embed the data into two-dimensional space with t-SNE, points of the same color stay together, so this information is well preserved.

```python
"""t-SNE dimensionality reduction on the S-curve data set."""
from time import time

import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed for 3D plots on older matplotlib)
from sklearn import datasets, manifold

n_points = 1000
# X has shape (1000, 3); color has shape (1000,)
X, color = datasets.make_s_curve(n_points, random_state=0)
n_neighbors = 10
n_components = 2

fig = plt.figure(figsize=(8, 8))
# Figure title: "Manifold Learning with 1000 points, 10 neighbors"
plt.suptitle("Manifold Learning with %i points, %i neighbors"
             % (n_points, n_neighbors), fontsize=14)

# Draw the S-curve in 3D
ax = fig.add_subplot(211, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)  # set the initial viewing angle

# t-SNE
t0 = time()
tsne = manifold.TSNE(n_components=n_components, init='pca', random_state=0)
Y = tsne.fit_transform(X)  # embedded output
t1 = time()
print("t-SNE: %.2g sec" % (t1 - t0))  # running time

ax = fig.add_subplot(2, 1, 2)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("t-SNE (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())  # hide the tick labels
ax.yaxis.set_major_formatter(NullFormatter())
# plt.axis('tight')
plt.show()
```

#### Dimension reduction and visualization of handwritten digits

The handwritten digits data set used here is a collection of 8 × 8 arrays, each representing one handwritten digit, as shown in the figure below.

t-SNE reduces the 8 × 8, i.e. 64-dimensional, data to 2 dimensions and displays it in the plane. Only the six digits 0-5 are used here.

```python
"""Visualize handwritten digits with t-SNE."""
from time import time

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.manifold import TSNE


def get_data():
    """Load the digits 0-5 and return the data, the labels and the data shape."""
    digits = datasets.load_digits(n_class=6)
    data = digits.data
    label = digits.target
    n_samples, n_features = data.shape
    return data, label, n_samples, n_features


def plot_embedding(data, label, title):
    """Draw every sample as its digit label at its embedded position."""
    # Rescale the embedding to the unit square
    x_min, x_max = np.min(data, 0), np.max(data, 0)
    data = (data - x_min) / (x_max - x_min)

    fig = plt.figure()
    ax = plt.subplot(111)
    for i in range(data.shape[0]):
        plt.text(data[i, 0], data[i, 1], str(label[i]),
                 color=plt.cm.Set1(label[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([])
    plt.yticks([])
    plt.title(title)
    return fig


def main():
    data, label, n_samples, n_features = get_data()
    print('Computing t-SNE embedding')
    tsne = TSNE(n_components=2, init='pca', random_state=0)
    t0 = time()
    result = tsne.fit_transform(data)
    plot_embedding(result, label,
                   't-SNE embedding of the digits (time %.2fs)' % (time() - t0))
    plt.show()  # plt.show() does not take the figure as an argument


if __name__ == '__main__':
    main()
```