Guide map of this section: https://www.processon.com/view/link/5fd9f5dc5653bb06f344d655
Outline
1. Numerical calculation library SciPy

Built on NumPy, SciPy adds many modules commonly used in mathematical, scientific, and engineering computing, such as linear algebra, equation solving, signal processing, statistics, image processing, and sparse matrices.
Below, we introduce some of the important submodules, which will matter a great deal in our later NLP work.
2. Fitting and optimization

The scipy.optimize module provides many numerical optimization algorithms, which we can use for least-squares fitting and function minimization.
Least-squares fitting
The least-squares method is a mathematical optimization technique that finds the best-fitting function for the data by minimizing the sum of squared errors.
If p denotes the parameters to be determined, the goal is to find the p that minimizes the value of the error function S, so least-squares fitting can be written as:

$$\hat{p} = \arg\min_{p} S(p), \quad S(p) = \sum_{i=1}^{n} \left[y_i - f(x_i, p)\right]^2$$
In machine learning, the least-squares method is used to estimate the parameters of a linear regression model. A linear regression model uses a linear function to solve real-valued prediction problems.
Here, through an example, we use the optimize.leastsq() function to perform a least-squares fit on two-dimensional data points and compute the parameters (slope and intercept) of the fitted line.
```python
import numpy as np
from scipy import optimize

# 7 two-dimensional data points, with X and Y coordinates as follows
X = np.array([8.19, 2.72, 6.39, 8.71, 4.7, 2.66, 3.78])
Y = np.array([7.01, 2.78, 6.47, 6.71, 4.1, 4.23, 4.05])

def residuals(p):
    """Compute the error between the line parameterized by p and the original data."""
    k, b = p  # parameter p consists of slope k and intercept b
    return Y - (k * X + b)

# Call leastsq with the residual function and the initial parameters
r = optimize.leastsq(residuals, [1, 0])
k, b = r[0]
print('Slope k={}, intercept b={}'.format(k, b))
```
```
Slope k=0.6134953491930442, intercept b=1.794092543259387
```
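As an aside (an addition, not in the original), scipy.optimize also provides curve_fit, a higher-level wrapper around least squares that takes the model function directly instead of a residual function. A minimal sketch, reusing the X and Y arrays from above:

```python
def line(x, k, b):
    """Straight-line model y = k*x + b."""
    return k * x + b

# curve_fit returns the fitted parameters and their covariance matrix;
# it should recover essentially the same slope and intercept as leastsq above.
popt, pcov = optimize.curve_fit(line, X, Y)
print(popt)
```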
Calculating function extrema
The optimize module also provides many algorithms for finding the minimum of a function, such as Nelder-Mead, Powell, CG, BFGS, Newton-CG, and L-BFGS-B.
Here, the BFGS and CG algorithms are used to find the **minimum** of the Rosenbrock function. The Rosenbrock function is often used to test the convergence speed of minimization algorithms.
The function formula is as follows:

$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$$

The function image is as follows:

(figure: surface plot of the Rosenbrock function)
This function is characterized by a long, flat valley: it is easy to converge into the valley, but difficult to find the minimum point within it. Combining the formula and the image, it is easy to see that the minimum occurs at (1, 1) and the minimum value is 0.
Some optimization algorithms require the fprime parameter, i.e. the partial derivatives of the objective function. The partial derivatives of f(x, y) with respect to x and y are:

$$\frac{\partial f}{\partial x} = -2(1 - x) - 400x\,(y - x^2), \qquad \frac{\partial f}{\partial y} = 200\,(y - x^2)$$
```python
import numpy as np
from scipy import optimize

def target_func(p):
    """Objective function: the Rosenbrock function."""
    x, y = p
    z = (1 - x) ** 2 + 100 * (y - x ** 2) ** 2
    return z

def prime_func(p):
    """Partial derivatives of the objective function."""
    x, y = p
    dx = -2 + 2 * x - 400 * x * (y - x ** 2)
    dy = 200 * y - 200 * x ** 2
    return np.array([dx, dy])

init_point = np.array([0, 2])  # choose (0, 2) as the initial point

# Using the BFGS algorithm: pass the objective function, initial point, and derivative function
result1 = optimize.fmin_bfgs(target_func, init_point, prime_func)
print(result1)

# Using the CG algorithm
result2 = optimize.fmin_cg(target_func, init_point, prime_func)
print(result2)
```
```
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 23
         Function evaluations: 30
         Gradient evaluations: 30
[0.99999989 0.99999977]
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 19
         Function evaluations: 43
         Gradient evaluations: 43
[1.00000005 1.00000009]
```
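Note that fmin_bfgs and fmin_cg are legacy interfaces; newer SciPy code typically goes through the unified optimize.minimize function. A minimal sketch (an addition), reusing target_func, prime_func, and init_point from above:

```python
# method selects the algorithm; jac passes the gradient (derivative) function
res = optimize.minimize(target_func, init_point, jac=prime_func, method='BFGS')
print(res.x)  # should converge near the minimum at (1, 1)
```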
3. Linear algebra and matrix decomposition

Both NumPy and SciPy provide a linear algebra module named linalg, but SciPy's linear algebra module is more comprehensive than NumPy's.
Here, we mainly demonstrate computing the eigenvalues and eigenvectors of a matrix, and matrix decomposition.
Calculate the eigenvalues and eigenvectors of the matrix
An n*n matrix A can be regarded as a linear transformation of n-dimensional space. If x is a vector in that space, then Ax (the matrix product of A and x) is a linear transformation of x, and the result is a new vector.
If x is an eigenvector of the transformation matrix A, then after the linear transformation Ax, the new vector still points in the same direction as x, but its length may change; the scaling factor is determined by the eigenvalue λ.
Therefore, the eigenvector and eigenvalue satisfy the following formula:

$$A x = \lambda x$$
Take a linear transformation matrix on the two-dimensional plane as an example.
```python
import numpy as np
from scipy import linalg

# Define the transformation matrix A
A = np.array([[1, -0.3],
              [-0.1, 0.9]])

# Solve for its eigenvalues and eigenvectors
evalues, evectors = linalg.eig(A)
print(evalues)
print(evectors)
```
```
[1.13027756+0.j 0.76972244+0.j]
[[ 0.91724574  0.79325185]
 [-0.3983218   0.60889368]]
```
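To see that these results satisfy the eigen equation, here is a minimal check (an addition, reusing A, evalues, and evectors from above). Each column of evectors is an eigenvector, so Ax should equal λx:

```python
x0 = evectors[:, 0]  # eigenvector corresponding to the first eigenvalue
# A @ x0 points in the same direction as x0, scaled by the eigenvalue
print(np.allclose(A @ x0, evalues[0].real * x0))  # True
```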
Singular value decomposition
Singular value decomposition (SVD) is an important matrix factorization technique in linear algebra, with important applications in machine learning.
Assuming X is an m*n matrix, there exists a decomposition such that:

$$X = U \Sigma V^T$$
Here, U is an m*m matrix, Σ is an m*n diagonal matrix, and V is an n*n matrix. Such a decomposition is the singular value decomposition; the elements on the diagonal of Σ are the singular values of the matrix X, arranged from largest to smallest.
Let's perform singular value decomposition on a matrix X:
```python
import numpy as np
from scipy import linalg

# Define matrix X (3*5)
X = np.arange(15).reshape(3, 5)
print(X, X.shape, '\n')

# Return only the singular values
svd_values = linalg.svdvals(X)
print(svd_values, '\n')

# Singular value decomposition, returning three matrices
U, sigma, V = linalg.svd(X)
print(U, U.shape, '\n')
print(sigma, sigma.shape, '\n')  # sigma is a 1-D array of the diagonal elements of the matrix Σ
print(V, V.shape, '\n')
```
```
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]] (3, 5) 

[3.17420265e+01 2.72832424e+00 9.19513106e-16] 

[[-0.15425367 -0.89974393  0.40824829]
 [-0.50248417 -0.28432901 -0.81649658]
 [-0.85071468  0.3310859   0.40824829]] (3, 3) 

[3.17420265e+01 2.72832424e+00 9.19513106e-16] (3,) 

[[-0.34716018 -0.39465093 -0.44214167 -0.48963242 -0.53712316]
 [ 0.69244481  0.37980343  0.06716206 -0.24547932 -0.55812069]
 [ 0.45650723 -0.61153491  0.23629065 -0.46400549  0.38274252]
 [-0.2754905  -0.21468338  0.83080946  0.08439319 -0.42502878]
 [-0.34015605  0.52908988  0.23221189 -0.69106924  0.26992351]] (5, 5)
```
The size of a singular value measures the importance of some latent concept. We often keep only the most important singular values to reduce dimensionality, reconstructing reduced versions of the three matrices U, Σ, and V.
```python
keep_dims = 2  # keep 2 dimensions
sigma2 = np.diag(sigma[:keep_dims])  # keep two singular values and rebuild the Σ matrix
U2 = U[:, :keep_dims]  # reduced matrix U
V2 = V[:keep_dims, :]  # reduced matrix V
print(U2.shape, sigma2.shape, V2.shape)
```
```
(3, 2) (2, 2) (2, 5)
```
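As a quick sanity check (an addition, not in the original): multiplying the three reduced matrices back together reconstructs X. Since the third singular value here is essentially zero (~9e-16), the rank-2 reconstruction is practically exact.

```python
X2 = U2 @ sigma2 @ V2  # low-rank reconstruction from the truncated factors
print(np.allclose(X, X2))  # True: the discarded singular value carried no information
```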
In NLP, latent semantic analysis (LSA) is performed using SVD. In that case, the input matrix A is a word-document matrix: each row represents a word, each column represents a document, and the value A(i, j) is the frequency of word i in document j.

(figure: an example word-document matrix A)
We decompose the word-document matrix A via SVD and keep only the most important 100 singular values, obtaining three matrices X, B, and Y:
- X is the reduced word matrix: each row represents a word, and each column represents a latent semantic concept;
- Y is the reduced document matrix: each column represents a document, and each row represents a latent topic concept;
- B captures the correlation between each semantic class and each topic class.
After the decomposition, if we need to compute similarities between words or between documents, we only have to work with 100 dimensions, instead of the original 500,000 or even a million dimensions.
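As an illustration (a sketch, not part of the original LSA pipeline), similarity in the reduced space is usually computed with cosine similarity. Here we reuse the toy U2 matrix from above as a stand-in for the reduced word matrix:

```python
def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# similarity between the words in rows 0 and 1 of the reduced word matrix
print(cosine_sim(U2[0], U2[1]))
```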
4. Statistical distributions and tests

The stats module in SciPy contains a variety of random variables with probability distributions, including continuous and discrete random variables.
```python
from scipy import stats
import numpy as np
```
Normal distribution
Let's take the normal distribution as an example, and define a random variable X that follows the normal distribution:
```python
X = stats.norm(loc=1.0, scale=2.0)  # mean 1 and standard deviation 2
X.stats()  # print the statistics of the random variable
```
```
(array(1.), array(4.))  # mean and variance
```
Next, call the rvs() method of the random variable X to draw many random samples into an array:
```python
samples = X.rvs(size=1000)  # sample 1000 times to obtain the array samples
samples.shape
```
```
(1000,)
```
Next, verify the sample mean and variance with the np.mean() and np.var() methods:
```python
np.mean(samples), np.var(samples)
```
```
(1.038664111644651, 3.831619958926069)
```
We can also use the fit() method to fit the random samples; it returns the parameters of the normal distribution (mean and standard deviation):
```python
stats.norm.fit(samples)
```
```
(1.038664111644651, 1.9574524154947086)
```
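The frozen random variable X also exposes pdf() and cdf() methods. A small addition showing how to compute an interval probability:

```python
# P(-1 < X < 3), i.e. the probability mass within one standard deviation of the mean
print(X.cdf(3.0) - X.cdf(-1.0))  # ~0.683
```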
Chi-square distribution and chi-square test
Chi-square distribution
If n independent random variables ξ₁, ξ₂, …, ξₙ all follow the standard normal distribution, then the sum of the squares of these n random variables constitutes a new random variable, whose distribution is called the chi-square distribution.
Chi-square test
The chi-square distribution can be used to test independence, i.e., to judge whether the difference between observed values and theoretical values is caused only by random error. An independence test, roughly speaking, judges whether two things are unrelated.
Suppose there are two bags, each containing five kinds of balls, with a different distribution of balls in each bag. We now need to judge whether the balls in each of the two bags are uniformly distributed.
```python
def choose_balls(probs, size):
    """Set the distribution of balls in the bag to probs, draw size times,
    and return the number of times each ball is drawn."""
    n_ball = len(probs)  # number of ball kinds
    # Define a discrete random variable over the ball ids (0 ~ n_ball-1) with distribution probs
    rv = stats.rv_discrete(values=(range(n_ball), probs))
    samples = rv.rvs(size=size)  # draw size samples from the random variable rv
    counts = np.bincount(samples)  # count how many times each ball appears
    return counts

np.random.seed(42)  # fix the random seed so the sampling is reproducible

# First bag: five kinds of balls with unequal probabilities, drawn 500 times
counts1 = choose_balls([0.18, 0.24, 0.25, 0.16, 0.17], 500)
# Second bag: five kinds of balls uniformly distributed, drawn 500 times
counts2 = choose_balls([0.2] * 5, 500)

# Feed the observed counts of each ball into the chisquare function
chi1, p1 = stats.chisquare(counts1)  # chi-square test on the 1st bag
print("counts =", counts1, "chi1 =", chi1, "p1 =", p1)
chi2, p2 = stats.chisquare(counts2)  # chi-square test on the 2nd bag
print("counts =", counts2, "chi2 =", chi2, "p2 =", p2)
```
```
counts = [ 95 106 137  80  82] chi1 = 21.54 p1 = 0.00024741435438580667
counts = [ 99  91 101 111  98] chi2 = 2.08 p2 = 0.7210475511959114
```
The null hypothesis of the chi-square test is that the sample follows the expected (here, uniform) distribution. From the test results above:
- The p-value for the first bag is only about 0.00025; that is, if the balls in the first bag really were uniformly distributed, the probability of obtaining the observation [95 106 137 80 82] would be only about 0.00025. The null hypothesis can therefore be rejected: the balls in this bag are unlikely to be uniformly distributed.
- Conversely, under a uniform distribution in the second bag, the probability of obtaining the observation [99 91 101 111 98] is about 0.72, so the null hypothesis cannot be rejected: we cannot deny that the balls in this bag are uniformly distributed.
We can plot the probability density function of the chi-square distribution; see the sketch below. On such a plot, chi1 and chi2 mark the positions $\chi^2_1$ and $\chi^2_2$ on the horizontal axis, while p1 and p2 correspond to the areas under the curve to the right of $\chi^2_1$ and $\chi^2_2$, respectively.
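A minimal plotting sketch (an addition; matplotlib is assumed here and not used elsewhere in this section). With 5 categories, the chisquare test above has 4 degrees of freedom:

```python
import matplotlib.pyplot as plt

xs = np.linspace(0, 30, 300)
plt.plot(xs, stats.chi2.pdf(xs, df=4))      # chi-square density with 4 degrees of freedom
plt.axvline(chi1, color='r', label='chi1')  # p1 is the area to the right of this line
plt.axvline(chi2, color='g', label='chi2')  # p2 is the area to the right of this line
plt.legend()
plt.show()
```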
In machine learning, we often need to perform feature selection, i.e., remove unimportant features that are unrelated to the category. The chi-square test can be used to select features.
We use the chi-square statistic to measure the dependence between a feature and the category: the larger the chi-square value, the weaker the independence and the stronger the correlation, i.e., the more important the feature. Below are the document counts for the feature word **"gambling"** in a binary mail classification task (spam or not):

|  | spam | not spam |
| --- | --- | --- |
| contains "gambling" | 40 | 10 |
| does not contain "gambling" | 60 | 90 |
```python
table = [  # enter the table above
    [40, 10],
    [60, 90]
]
chi2, p, _, _ = stats.chi2_contingency(table)  # chi-square test of independence
print(chi2, p)
```
```
22.42666666666667 2.1832165337148537e-06
```

5. Sparse matrices
Why use sparse matrices
In machine learning modeling, we often encounter large matrices in which most of the elements are 0; these are sparse matrices. Storing a sparse matrix directly as a dense array is very wasteful. Thanks to the sparsity, we can store only the non-zero elements and thereby save memory.
In the traditional vector space model, we represent each document as a word vector arranged in a fixed order, where each value in the vector is the frequency of the corresponding word in that document, as shown below:

(figure: an example document-word frequency matrix)
This matrix is sparse because many of its elements are 0, i.e. the word frequency is 0. There are often hundreds of thousands of words in the vocabulary, and there may be thousands of documents. Without sparse matrix storage, this causes a huge waste of memory; your program may even crash due to insufficient memory. See the memory comparison sketch below.
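To make the saving concrete, here is a minimal sketch (an addition, using the CSR format from scipy.sparse) comparing the memory footprint of a dense array against its sparse counterpart:

```python
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 1000))
dense[0, :100] = 1.0  # only 100 of the 1,000,000 elements are non-zero

sp = sparse.csr_matrix(dense)  # compressed sparse row representation
print(dense.nbytes)  # 8,000,000 bytes for the dense array
print(sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes)  # only a few KB
```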
Several commonly used sparse matrices
scipy.sparse provides many sparse matrix formats, each with different uses. Among them, dok_matrix and lil_matrix are suitable for incrementally adding elements.
```python
from scipy import sparse
```
First, let's look at dok_matrix:
- It inherits from dict and uses a dictionary to store the non-zero elements of the matrix;
- The dictionary keys are tuples holding (row, column) information, and the values are the corresponding element values;
- Obviously, dok_matrix is well suited to adding, deleting, and modifying elements.
```python
a = sparse.dok_matrix((10, 5))  # define a dok sparse matrix with shape (10, 5); all elements start at 0
a[1, :3] = 1.0, 2.0, 3.0  # set the first three columns of row 1
print(list(a.keys()))    # print the keys of the sparse matrix, i.e. the indices of the non-zero elements
print(list(a.values()))  # print the values of the sparse matrix, i.e. the corresponding non-zero values
print(a[0, 0], a[1, 1])  # print the elements at positions (0, 0) and (1, 1)
```
```
[(1, 0), (1, 1), (1, 2)]
[1.0, 2.0, 3.0]
0.0 2.0
```
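Continuing the example (a small addition): once incremental construction is done, the dok matrix can be inspected as a dense array or converted to a format better suited for arithmetic:

```python
print(a.toarray())  # dense view: row 1 holds 1.0, 2.0, 3.0 in its first three columns
csr = a.tocsr()     # convert to CSR for efficient arithmetic afterwards
print(csr.shape)    # (10, 5)
```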
Next, let's look at lil_matrix:
- lil_matrix uses two lists to store the non-zero elements;
- data stores the non-zero values in each row, and rows stores the column indices where those non-zero values are located;
- This format is also suitable for adding, deleting, and modifying elements, and gives fast access to row data.
```python
b = sparse.lil_matrix((10, 5))  # define a lil sparse matrix with shape (10, 5), initially all 0

# Set non-zero elements
b[0, 1] = 1
b[1, 0] = 2
b[1, 1] = 3

print(b.data, b.data.shape, '\n')  # the data attribute: 10 rows, a list of non-zero values per row
print(b.rows, b.rows.shape)        # the rows attribute: 10 rows, a list of non-zero column indices per row
```
```
[list([1.0]) list([2.0, 3.0]) list([]) list([]) list([]) list([]) list([]) list([]) list([]) list([])] (10,) 

[list([1]) list([0, 1]) list([]) list([]) list([]) list([]) list([]) list([]) list([]) list([])] (10,)
```
Finally, let's look at coo_matrix:
- coo_matrix uses three arrays, row, col, and data, to store the information of the non-zero elements;
- The three arrays have the same length: row stores the row indices of the non-zero elements, col stores their column indices, and data stores their values;
- coo_matrix does not support adding, deleting, or retrieving elements; once created, it can almost only be converted to other formats for processing;
- It supports duplicate elements, i.e. the same position can appear multiple times; when converted to another format, the multiple values at the same position are summed.
```python
# Define row, col, and data for 4 non-zero elements
row = [2, 3, 3, 2]
col = [3, 4, 2, 3]
data = [1, 2, 3, 10]

# Build the coo sparse matrix from row, col, and data
c = sparse.coo_matrix((data, (row, col)), shape=(5, 6))

# Print the three attributes row, col, and data
print(c.col)
print(c.row)
print(c.data, '\n')

print(c.toarray())  # convert to an array; the two elements 1 and 10 at position (2, 3) are summed
```
```
[3 4 2 3]
[2 3 3 2]
[ 1  2  3 10] 

[[ 0  0  0  0  0  0]
 [ 0  0  0  0  0  0]
 [ 0  0  0 11  0  0]
 [ 0  0  3  0  2  0]
 [ 0  0  0  0  0  0]]
```
```python
c[0, 0] = 1  # trying to modify an element raises an error
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-c3fc3415dffd> in <module>()
----> 1 c[0, 0] = 1  # trying to modify an element raises an error

TypeError: 'coo_matrix' object does not support item assignment
```
```python
c[0, 0]  # also raises an error; elements cannot be retrieved either
```
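Since coo_matrix supports neither assignment nor element access, the usual workflow (a brief addition) is to convert it to another format first:

```python
d = c.tolil()       # convert coo -> lil, which supports item assignment
d[0, 0] = 1         # now the modification succeeds
print(d[0, 0])      # 1.0
print(d.toarray())  # the rest of the matrix is unchanged
```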