# Machine Learning (Coursera): Chapter 3 Programming Homework

## Multi-class Classification and Neural Networks

### lrCostFunction

The exercise provides two data sets: one contains X and y, the other contains theta. Each row of X is one training example, that is, a bitmap of a handwritten digit. Each image is 20 × 20 pixels, so X has 400 columns, and each column holds the grayscale value of one pixel in the image.

The first step is to write the loss function in vectorized form.

Note that the dimension of X is (m, n). From the hint in the code comments, we can infer that stacking the parameters of all classifiers gives a theta of dimension (n, k), where k is the number of classes. Here k is the number of distinct handwritten digits, i.e. the 10 digits from 0 to 9. (Inside lrCostFunction itself, theta is a single (n, 1) column for one classifier.)

The dimension of X*theta is then (m, k). Because this is a classification problem, we apply the sigmoid function to squash each value into the range 0 to 1. The value in the i-th column of sigmoid(X*theta) is the probability that the example belongs to class i: the closer it is to 1, the more likely the example is of class i. In this problem, class i is the digit i (the 10th class stands for the digit 0).
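As a quick shape check, the sketch below builds a toy X and a stacked theta and applies the sigmoid. The sizes (m = 5, n = 4, k = 3) are made up for illustration, and `sigmoid_fn` is a stand-in name for the `sigmoid.m` helper that ships with the exercise.

```octave
% Toy sizes for illustration only; they are not the assignment's values.
m = 5; n = 4; k = 3;

sigmoid_fn = @(z) 1 ./ (1 + exp(-z));   % stand-in for the exercise's sigmoid.m

X     = reshape(1:m*n, m, n) / (m*n);   % (m, n) toy design matrix
theta = reshape(-5:6, n, k) / 10;       % (n, k): one column of parameters per class

P = sigmoid_fn(X * theta);              % (m, k): row i, column c is P(example i is class c)
disp(size(P));
```

Every entry of P lies strictly between 0 and 1, which is what lets us read it as a probability.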

This is the one-vs-all idea from the previous lecture: turn a multi-class problem into several binary classification problems. For each digit i, we treat class i as one category and lump all other classes into a second category. After 10 iterations, we have, for every example, a predicted value p (0 <= p <= 1) for each class.
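The relabeling step can be sketched with a comparison, which is also what the hint in the exercise template suggests. The label vector below is invented toy data.

```octave
% Toy labels; as in the exercise, the label 10 stands for the digit 0.
y = [1; 3; 10; 3; 2];

% Binary targets for two of the one-vs-all sub-problems.
y_is_3  = (y == 3);    % 1 where the example is a "3", 0 elsewhere
y_is_10 = (y == 10);   % 1 where the example is a "0" (label 10), 0 elsewhere

disp([y, y_is_3, y_is_10]);
```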

For classification problems we use the logistic regression cost function, because it has a useful property: when y is 0, the closer the prediction h is to 0, the closer J is to 0; conversely, when y is 1, the closer h is to 1, the closer J is to 0. If the prediction moves toward the wrong label instead, J grows toward infinity.
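For reference, the unregularized cost with this property can be written as follows. Here $h^{(i)}$ denotes the sigmoid output for example $i$; this notation is an assumption chosen to match the variable `h` in the code.

```latex
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log h^{(i)} - \left(1-y^{(i)}\right)\log\left(1-h^{(i)}\right)\right]
```

When $y^{(i)} = 1$ only the first term is active and it vanishes as $h^{(i)} \to 1$; when $y^{(i)} = 0$ only the second term is active and it vanishes as $h^{(i)} \to 0$.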

After writing the expression for J, we need to regularize it, that is, add the term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$. Note that j starts from 1, so $\theta_0$ is not penalized when computing the cost, and after differentiation it contributes no regularization term to the gradient either. Without loss of generality, we can set $\theta_0 = 0$ in the regularization term, which removes the need to handle it as a special case. So we build a vector `temp = [0; theta(2:end)];` and use temp instead of theta in the regularization terms (the hypothesis h is still computed with the full theta).

Note that temp is a column vector. To get the sum of its squared entries, sum(temp.^2), we just need to compute temp' * temp.

When calculating grad, pay special attention to the dimensions of each variable: X is (m, n), h is (m, 1), and y is (m, 1), so X' * (h - y) is (n, 1), the same shape as theta.
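A small check of this identity on a made-up theta (the values are arbitrary):

```octave
theta = [2; -1; 3];            % arbitrary example values
temp  = [0; theta(2:end)];     % zero out the bias entry

inner_product  = temp' * temp;     % 0^2 + (-1)^2 + 3^2
sum_of_squares = sum(temp .^ 2);   % same quantity, written element-wise

disp([inner_product, sum_of_squares]);
```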

```octave
function [J, grad] = lrCostFunction(theta, X, y, lambda)
%LRCOSTFUNCTION Compute cost and gradient for logistic regression with
%regularization
%   J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Hint: The computation of the cost function and gradients can be
%       efficiently vectorized. For example, consider the computation
%
%           sigmoid(X * theta)
%
%       Each row of the resulting matrix will contain the value of the
%       prediction for that example. You can make use of this to vectorize
%       the cost function and gradient computations.
%
% Hint: When computing the gradient of the regularized cost function,
%       there're many possible vectorized solutions, but one solution
%       looks like:
%           grad = (unregularized gradient for logistic regression)
%           temp = theta;
%           temp(1) = 0;   % because we don't add anything for j = 0
%           grad = grad + YOUR_CODE_HERE (using the temp variable)
%

h = sigmoid(X * theta);

% unregularized cost for logistic regression
J = (1.0/m) * sum(-y.*log(h) - (1-y).*log(1-h));

% regularized cost
temp = [0; theta(2:end)];
J = J + (lambda/(2.0*m)) * temp' * temp;

% unregularized gradient for logistic regression
grad = (1.0/m) * X' * (h - y);

% regularized gradient
grad = grad + (1.0/m) * lambda * temp;

% =============================================================

grad = grad(:);

end
```
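One way to sanity-check the gradient is to compare it with a finite-difference estimate of the cost. The sketch below re-implements the same cost and gradient as anonymous functions so that it runs without the exercise files; the names `sigmoid_fn`, `no_bias`, `cost_fn`, and `grad_fn` are hypothetical, and the data is invented.

```octave
% Stand-ins so the snippet is self-contained (names are hypothetical).
sigmoid_fn = @(z) 1 ./ (1 + exp(-z));
no_bias    = @(theta) [0; theta(2:end)];   % same role as temp in the text

% Regularized cost and gradient, mirroring the formulas in lrCostFunction.
cost_fn = @(theta, X, y, lambda) ...
    (1 / numel(y)) * sum(-y .* log(sigmoid_fn(X * theta)) ...
                         - (1 - y) .* log(1 - sigmoid_fn(X * theta))) ...
    + (lambda / (2 * numel(y))) * (no_bias(theta)' * no_bias(theta));
grad_fn = @(theta, X, y, lambda) ...
    (1 / numel(y)) * (X' * (sigmoid_fn(X * theta) - y)) ...
    + (lambda / numel(y)) * no_bias(theta);

% Toy problem: 5 examples, a bias column plus 2 features.
X      = [ones(5, 1), [0.1 0.2; 0.4 0.1; 0.5 0.9; 0.8 0.3; 0.2 0.7]];
y      = [0; 0; 1; 1; 0];
theta  = [0.1; -0.2; 0.3];
lambda = 1;

analytic = grad_fn(theta, X, y, lambda);

% Central finite differences, one coordinate at a time.
numeric = zeros(size(theta));
delta   = 1e-6;
for j = 1:numel(theta)
    bump    = zeros(size(theta));
    bump(j) = delta;
    numeric(j) = (cost_fn(theta + bump, X, y, lambda) ...
                  - cost_fn(theta - bump, X, y, lambda)) / (2 * delta);
end

max_diff = max(abs(numeric - analytic));
fprintf('max difference between numeric and analytic gradient: %g\n', max_diff);
```

If the two gradients disagree noticeably, the usual culprits are a dimension mix-up in X' * (h - y) or regularizing the bias entry by mistake.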

### oneVsAll

Note that for a binary classification problem, y must be 0 or 1 to indicate which category an example belongs to. So we loop over each class c: training examples that belong to c are labeled 1, all others are labeled 0, and the trained theta is stored in the c-th row of all_theta. That row, multiplied with an example from X, tells us whether the example belongs to class c. So in all_theta * X', row c indicates whether each example belongs to class c (after the sigmoid, the value is a probability).

We also note that the function passed to fmincg is an anonymous function with a parameter t. This t takes the place of the theta argument of the lrCostFunction we wrote, while X, (y == c), and lambda are fixed when the anonymous function is created. The reason theta is left as the free parameter is that fmincg is an iterative, gradient-based optimizer, and setting 'GradObj' to 'on' tells it that our function returns the gradient along with the cost. Each iteration updates theta, and the next iteration needs the cost and gradient at the updated value, so fmincg needs a function of theta that it can call repeatedly.
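The way an anonymous function fixes some arguments and leaves one free can be seen in a minimal sketch. The function `f` and the variable `a` below are illustrative only and are unrelated to the exercise files.

```octave
a = 3;
f = @(t) (t - a)^2;   % a is captured when f is created; t stays free

a = 100;              % changing a afterwards does not affect f
value = f(5);         % evaluates (5 - 3)^2

disp(value);
```

In oneVsAll the anonymous function is re-created on every pass of the loop, so each classifier is trained against the current value of c.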

```octave
function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta
%corresponds to the classifier for label i
%   [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
%   logistic regression classifiers and returns each of these classifiers
%   in a matrix all_theta, where the i-th row of all_theta corresponds
%   to the classifier for label i

% Some useful variables
m = size(X, 1);
n = size(X, 2);

% You need to return the following variables correctly
all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the following code to train num_labels
%               logistic regression classifiers with regularization
%               parameter lambda.
%
% Hint: theta(:) will return a column vector.
%
% Hint: You can use y == c to obtain a vector of 1's and 0's that tell you
%       whether the ground truth is true/false for this class.
%
% Note: For this assignment, we recommend using fmincg to optimize the cost
%       function. It is okay to use a for-loop (for c = 1:num_labels) to
%       loop over the different classes.
%
%       fmincg works similarly to fminunc, but is more efficient when we
%       are dealing with large number of parameters.
%
% Example Code for fmincg:
%
%     % Set Initial theta
%     initial_theta = zeros(n + 1, 1);
%
%     % Set options for fminunc
%     options = optimset('GradObj', 'on', 'MaxIter', 50);
%
%     % Run fmincg to obtain the optimal theta
%     % This function will return theta and the cost
%     [theta] = ...
%         fmincg (@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
%                 initial_theta, options);
%

initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'On', 'MaxIter', 50);

for c = 1:num_labels
    [theta] = ...
        fmincg(@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
               initial_theta, options);
    all_theta(c, :) = theta';
end

% =========================================================================

end
```

### predictOneVsAll

From the analysis above, X has dimension (m, n) and all_theta has dimension (class, n), where n here counts the bias column. Each row of all_theta is the theta for one class. Multiplying X by that theta and applying the sigmoid gives the probability that each example belongs to that class. Multiplying X by the transpose of the whole all_theta gives the probability that each example belongs to each class.

The dimension of X * all_theta' is (m, class). In each row, column c is the probability that this example belongs to class c. We need to find the maximum probability in each row and take its column index as the predicted value.
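The row-wise maximum with its index can be sketched on a small, made-up probability matrix:

```octave
% Rows are examples, columns are classes (values invented for illustration).
P = [0.1 0.7 0.2;
     0.8 0.1 0.3;
     0.2 0.3 0.9];

% The second output of max is the column index of each row's maximum.
[best_prob, p] = max(P, [], 2);

disp([best_prob, p]);
```

Here the predicted classes are 2, 1 and 3 for the three rows.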

```octave
function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels
%are in the range 1..K, where K = size(all_theta, 1).
%  p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
%  for each example in the matrix X. Note that X contains the examples in
%  rows. all_theta is a matrix where the i-th row is a trained logistic
%  regression theta vector for the i-th class. You should set p to a vector
%  of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
%  for 4 examples)

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly
p = zeros(size(X, 1), 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters (one-vs-all).
%               You should set p to a vector of predictions (from 1 to
%               num_labels).
%
% Hint: This code can be done all vectorized using the max function.
%       In particular, the max function can also return the index of the
%       max element, for more information see 'help max'. If your examples
%       are in rows, then, you can use max(A, [], 2) to obtain the max
%       for each row.
%

% X is (m, n) and all_theta is (class, n), where n counts the bias column.
% The result is (m, class): in the i-th row, the j-th column is
% the i-th example's probability of being in the j-th class.
[~, p] = max(sigmoid(X * all_theta'), [], 2);

% =========================================================================

end
```

### predict

Finally, let's complete the prediction step of a simple neural network. There are 400 input units, one per pixel. There is a single hidden layer with 25 units, and the output layer has 10 units. In the vectorized computation each unit's activation is a vector of length m, so each output unit gives, for every example in the data set, the probability that it belongs to that unit's class.

As covered in the lecture, the theta that maps layer j to layer j+1 has dimension $(R_{j+1}, R_j + 1)$, where $R_j$ is the number of units in layer j. Accordingly, we use the formulas

$$
z = \theta a, \qquad a = g(z)
$$

With these, the code can be written out directly.
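The dimension bookkeeping can be traced on a toy network. The layer sizes and weights below are made up for illustration (the exercise uses 400, 25 and 10 units with pre-trained weights), and `sigmoid_fn` is a stand-in for the exercise's `sigmoid.m`.

```octave
sigmoid_fn = @(z) 1 ./ (1 + exp(-z));   % stand-in for the exercise's sigmoid.m

% Toy sizes: 4 examples, 3 inputs, 2 hidden units, 5 output classes.
m = 4; n_in = 3; n_hidden = 2; n_out = 5;

X      = reshape(1:m*n_in, m, n_in) / 10;                      % (m, n_in)
Theta1 = 0.1 * ones(n_hidden, n_in + 1);                       % (n_hidden, n_in + 1)
Theta2 = reshape(1:n_out*(n_hidden + 1), n_out, n_hidden + 1) / 10;  % (n_out, n_hidden + 1)

A1 = [ones(m, 1) X];                 % add bias column: (m, n_in + 1)
A2 = sigmoid_fn(A1 * Theta1');       % hidden activations: (m, n_hidden)
A2 = [ones(m, 1) A2];                % add bias column: (m, n_hidden + 1)
A3 = sigmoid_fn(A2 * Theta2');       % output activations: (m, n_out)

[~, p] = max(A3, [], 2);             % predicted class per example: (m, 1)

disp(size(A3));
disp(p');
```

Because examples are stored as rows, each layer computes A * Theta' rather than Theta * a; this is the same formula with both sides transposed.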

```octave
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly
p = zeros(size(X, 1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned neural network. You should set p to a
%               vector containing labels between 1 to num_labels.
%
% Hint: The max function might come in useful. In particular, the max
%       function can also return the index of the max element, for more
%       information see 'help max'. If your examples are in rows, then, you
%       can use max(A, [], 2) to obtain the max for each row.
%

A = [ones(size(X, 1), 1) X];

% Theta1(len(j+1), len(j)+1)   A(m, len(j)+1)
z = A * Theta1';
A = sigmoid(z);

A = [ones(size(A, 1), 1) A];
z = A * Theta2';
A = sigmoid(z);

[~, p] = max(A, [], 2);

% =========================================================================

end
```