# FM Algorithm: Principle and Python Practice

### 1. Introducing FM

In a traditional linear model such as logistic regression (LR), each feature is treated independently. To capture interactions between features, we have to cross-combine them manually; a nonlinear SVM can apply a kernel mapping to the features, but it cannot learn well when the data are highly sparse.

The Factorization Machine (FM), proposed by Steffen Rendle in 2010, is a machine-learning algorithm based on matrix factorization. The core of the algorithm is automatic feature combination, which reduces the manual work of crossing features. In computational advertising and recommender systems, CTR (click-through rate) prediction is a crucial step: whether an item is recommended is decided by ranking the estimated CTR. As a running example for the FM model, suppose an ad classification problem: given features of the user and the ad slot, predict whether the user will click on the ad.

We manually generate the source data as follows:

```python
import pandas as pd

train = [
    {"country": "China", "type": "Game", "age": 19, "Clicked": 1},
    {"country": "USA", "type": "Game", "age": 33, "Clicked": 1},
    {"country": "UK", "type": "Music", "age": 55, "Clicked": 1},
    {"country": "China", "type": "Movie", "age": 20, "Clicked": 0},
    {"country": "China", "type": "Music", "age": 25, "Clicked": 1},
]
df = pd.DataFrame(train)
```

"Clicked" is a label, and Age, Country and Type are characteristics. Since the former and the latter features are of category Type, they need to be transformed into numerical features by one hot coding.

```python
pd.get_dummies(df)
```

As the resulting table shows, after one-hot encoding most of the feature values are zero, i.e. the sample data are sparse.

In the example above, each sample has 7-dimensional features, but on average only 3 of them are non-zero. This situation is not unique to the example; it is common in real applications. In CTR/CVR prediction, for instance, the user's gender, occupation, education level, and category preferences, the item's category, and so on all yield sparse sample data after one-hot encoding. The item-category feature is especially extreme: with hundreds of leaf-level categories, one-hot encoding produces hundreds of numerical features, of which only one is non-zero for any given sample. Data sparsity is therefore an unavoidable challenge in practice. Another consequence of one-hot encoding is a large feature space.
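To make the sparsity concrete, here is a small sketch (the counting code and variable names are my own) that one-hot encodes the toy data above and counts the non-zero features per sample:

```python
import pandas as pd

train = [
    {"country": "China", "type": "Game", "age": 19, "Clicked": 1},
    {"country": "USA", "type": "Game", "age": 33, "Clicked": 1},
    {"country": "UK", "type": "Music", "age": 55, "Clicked": 1},
    {"country": "China", "type": "Movie", "age": 20, "Clicked": 0},
    {"country": "China", "type": "Music", "age": 25, "Clicked": 1},
]
df = pd.DataFrame(train)

# One-hot encode the features only; "Clicked" is the label
encoded = pd.get_dummies(df.drop(columns=["Clicked"]))

n_rows, n_cols = encoded.shape                 # 5 samples, 7 feature columns
nonzero_per_row = (encoded != 0).sum(axis=1)   # age + one country + one type
print(n_cols)                       # 7
print(nonzero_per_row.mean())       # 3.0
```

So each sample carries only 3 non-zero values out of 7 columns, and the gap widens quickly as categorical cardinality grows.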

For example, if the item category has 100 values, a single categorical feature becomes 100 numerical features, so the feature space grows sharply. At the same time, by observing large amounts of sample data, we find that combining certain features increases their correlation with the label. For example, "USA" combined with "Thanksgiving", or "China" combined with "Chinese New Year", has a positive impact on whether a user clicks.

In other words, users from "China" are likely to browse and buy heavily around "Chinese New Year", whereas around "Thanksgiving" they show no special consumption behavior. Such positive correlation between combined features and the label is common in practice: for example, "cosmetics" products with "female" users, "ball-game accessories" with "male" users, or "movie tickets" with a "movie" category preference. Introducing pairwise feature combinations is therefore very meaningful.
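For contrast, the manual feature crossing that FM automates looks like this in pandas (the cross-column name `country_x_type` is my own):

```python
import pandas as pd

df = pd.DataFrame([
    {"country": "China", "type": "Game"},
    {"country": "USA", "type": "Game"},
    {"country": "UK", "type": "Music"},
])

# Manually cross two categorical features into one combined feature,
# then one-hot encode the cross; FM learns such combinations automatically
df["country_x_type"] = df["country"] + "_" + df["type"]
crossed = pd.get_dummies(df["country_x_type"])
print(list(crossed.columns))   # ['China_Game', 'UK_Music', 'USA_Game']
```

Every manual cross multiplies the feature space, and combinations unseen in training get no weight at all, which is exactly the problem FM's factorization addresses.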

### 2. Introducing the latent vectors V

For simplicity, we only consider the case of second-order crosses; the model is:

$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w_{ij}\, x_i x_j
$$

Here $n$ is the number of features of a sample, $x_i$ is the value of the $i$-th feature, and $w_0$, $w_i$, $w_{ij}$ are model parameters. A cross term is meaningful only when both $x_i$ and $x_j$ are non-zero. With sparse data, very few samples have a given cross term non-zero; when training samples are insufficient, the parameters $w_{ij}$ are trained inadequately and inaccurately, which ultimately hurts the model. The training problem of the cross-term parameters can, however, be solved approximately by matrix factorization, replacing each $w_{ij}$ with the inner product of two latent vectors:

$$
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j
$$

where $v_i \in \mathbb{R}^k$ is the latent vector of feature $i$.

For any positive definite matrix $W$, as long as $k$ is large enough there exists a matrix $V$ such that $W = VV^{\top}$. In the sparse case, however, we should choose a small $k$, because there is not enough data to estimate complex interactions $w_{ij}$. Limiting the size of $k$ improves the generalization ability of the model. Why does it? Let's take an example from the original paper.
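The factorization $W = VV^{\top}$ is easy to check numerically; a small sketch (the sizes $n = 5$, $k = 2$ are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2                      # n features, k latent factors (k << n)
V = rng.normal(size=(n, k))      # row i is the latent vector v_i of feature i

# Implied interaction-weight matrix: w_ij = <v_i, v_j>
W = V @ V.T

# W is symmetric and positive semi-definite by construction,
# and its rank is at most k
assert np.allclose(W, W.T)
eigvals = np.linalg.eigvalsh(W)
print(np.all(eigvals >= -1e-10))   # True
print(np.linalg.matrix_rank(W))    # 2
```

Note the parameter count: $V$ has $nk$ entries instead of the $n(n-1)/2$ independent weights $w_{ij}$, and every pair $(i, j)$ shares information through the rows of $V$.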

Suppose we want to estimate the interaction between user A and movie ST. There is obviously no training sample in which both are non-zero, so a directly trained weight would give $w_{A,ST} = 0$; with factorization, however, we can approximate it through $\langle v_A, v_{ST} \rangle$. First, users B and C have similar latent vectors $v_B$ and $v_C$, because they have similar prediction scores for SW, so $\langle v_B, v_{ST} \rangle$ and $\langle v_C, v_{ST} \rangle$ are similar. Users A and C have different vectors, because their prediction scores for TI and SW are completely different. Next, the vectors of SW and ST are likely to be similar, because user B's ratings for the two films are similar. Finally, it follows that $\langle v_A, v_{ST} \rangle$ will be similar to $\langle v_A, v_{SW} \rangle$.

Let's verify the example from the beginning of the article in code:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from pyfm import pylibfm

# Note: DictVectorizer takes symbolic (non-numeric) feature data with some
# structure, such as dicts, and converts each symbol to a 0/1 numeric feature
v = DictVectorizer()
# Fit on the feature dicts only; "Clicked" is the label, not a feature
X = v.fit_transform([{k: val for k, val in row.items() if k != "Clicked"}
                     for row in train])
y = np.array([1, 1, 1, 0, 1])
fm = pylibfm.FM(num_factors=5,
                num_iter=200,
                # verbose=True,
                initial_learning_rate=0.001,
                learning_rate_schedule="optimal")
fm.fit(X, y)

fm.predict(v.transform({"country": "USA", "type": "Movie", "age": 33}))
```

```
array([0.05668886])
```

For the test point "country": "USA" with "type": "Movie", this combination never appears in the training data. But from the first and fourth samples, type=Game was clicked while type=Movie was not, suggesting the two behave oppositely; and the first and second samples suggest that country=USA and country=China behave similarly. Hence this test case is predicted as unlikely to be clicked.

```python
fm.predict(v.transform({"country": "USA", "type": "Music", "age": 20}))
```

```
array([0.98671174])
```

By the same reasoning, the first and fifth samples show that type=Game and type=Music were both clicked, suggesting they behave similarly; and the first and second samples suggest that country=USA and country=China are similar. So this test case is predicted as likely to be clicked.

### 3. Formula transformation

Through an algebraic transformation (using the identity $xy = \frac{1}{2}[(x+y)^2 - x^2 - y^2]$), the cross term can be computed in linear time:

$$
\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2\, x_i^2 \right]
$$

which reduces the complexity from $O(kn^2)$ to $O(kn)$.
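The equivalence of the naive pairwise sum and the reduced form can be checked numerically; a small sketch with random data (sizes and names are my own):

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 8, 3
x = rng.normal(size=n)           # one sample's feature vector
V = rng.normal(size=(n, k))      # latent factor matrix

# Naive O(k n^2): sum over all pairs i < j of <v_i, v_j> x_i x_j
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Reduced O(k n): 1/2 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
s = V.T @ x                                   # per-factor sums, shape (k,)
reduced = 0.5 * np.sum(s ** 2 - (V ** 2).T @ x ** 2)

print(np.isclose(naive, reduced))             # True
```

This reduced form is exactly what the `inter_1` / `inter_2` computation in the gradient-descent code below implements.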

FM can be applied to many prediction tasks, such as regression, classification, and ranking:

1. Regression: $\hat{y}(x)$ is used directly as the predicted value; the loss function can be the least-squares error;

2. Binary classification: $\hat{y}(x)$ is thresholded into binary labels such as 0/1; the loss function can be the hinge loss or the logit loss;

3. Ranking: the input may need to be converted to pairwise form, e.g. $(x_a, x_b)$; the loss function can be a pairwise loss.

### 4. Loss function

In the binary classification problem, the logit loss is used as the optimization criterion:

$$
\mathrm{loss}(\hat{y}, y) = -\ln \sigma(y\hat{y}) = \ln\left(1 + e^{-y\hat{y}}\right), \qquad y \in \{-1, +1\}
$$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function.

Notes:

One might ask why this loss function differs from the one we usually see. Commonly, the cross-entropy loss is written as

$$
\mathrm{loss} = -\left[ y \ln \hat{p} + (1 - y) \ln (1 - \hat{p}) \right], \qquad y \in \{0, 1\}
$$

with $\hat{p} = \sigma(\hat{y})$. This form is cumbersome in code, so we convert the labels to $y \in \{-1, +1\}$, for which the two forms are equivalent and the loss becomes $\ln(1 + e^{-y\hat{y}})$. Gradient descent then uses the derivative

$$
\frac{\partial\, \mathrm{loss}}{\partial \theta} = \left[ \sigma(y\hat{y}) - 1 \right] y\, \frac{\partial \hat{y}}{\partial \theta}
$$

where $\theta$ is any model parameter.
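As a sanity check on the gradient used in the update rules, the analytic derivative $[\sigma(y\hat{y}) - 1]\,y$ of the logit loss with respect to the model output can be compared against a finite difference (the function names here are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_loss(y, y_hat):
    # y in {-1, +1}; equals log(1 + exp(-y * y_hat))
    return -np.log(sigmoid(y * y_hat))

def analytic_grad(y, y_hat):
    # derivative of the loss w.r.t. the model output y_hat
    return (sigmoid(y * y_hat) - 1.0) * y

y, y_hat, eps = -1.0, 0.7, 1e-6
numeric = (logit_loss(y, y_hat + eps) - logit_loss(y, y_hat - eps)) / (2 * eps)
print(np.isclose(numeric, analytic_grad(y, y_hat)))   # True
```

The factor $\sigma(y\hat{y}) - 1$ is exactly the `loss` variable computed inside the training loop below.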

### 5. Algorithm flow

1. Initialize the weights $w_0, w_1, \dots, w_n$ and the factor matrix $V$.

2. For each sample, compute the prediction $\hat{y}$ and update the parameters by stochastic gradient descent:

$$
w_0 \leftarrow w_0 - \alpha \left[ \sigma(y\hat{y}) - 1 \right] y
$$

$$
w_i \leftarrow w_i - \alpha \left[ \sigma(y\hat{y}) - 1 \right] y\, x_i
$$

$$
v_{i,f} \leftarrow v_{i,f} - \alpha \left[ \sigma(y\hat{y}) - 1 \right] y \left( x_i \sum_{j=1}^{n} v_{j,f}\, x_j - v_{i,f}\, x_i^2 \right)
$$

### 6. Python practice

We demonstrate both a hand-written gradient-descent implementation and direct prediction with pylibfm.

1. Gradient-descent implementation

We use the diabetes data set; first, load it as a training set:

```python
import numpy as np

def normalize(row, max_list, min_list):
    # Min-max normalization of one row, column by column
    return [(val - mn) / (mx - mn) for val, mx, mn in zip(row, max_list, min_list)]

# Function for importing the data
def loadDataSet(data, featureNum=8):
    global max_list
    global min_list
    dataMat = []
    labelMat = []

    fr = open(data)  # open the file
    for line in fr.readlines():
        currLine = line.strip().split(',')
        lineArr = []
        # The data set has 9 columns: the first 8 are features, the last is the label
        for i in range(featureNum):
            lineArr.append(float(currLine[i]))
        dataMat.append(lineArr)
        # Map labels from {0, 1} to {-1, 1}
        labelMat.append(float(currLine[-1]) * 2 - 1)
    fr.close()

    data_array = np.array(dataMat)
    # Column-wise max and min, used for normalization
    max_list = np.max(data_array, axis=0)
    min_list = np.min(data_array, axis=0)

    scalar_dataMat = []
    for row in dataMat:
        # Normalize each row
        scalar_row = normalize(row, max_list, min_list)
        scalar_dataMat.append(scalar_row)
    return scalar_dataMat, labelMat
```
```python
from numpy import mat, shape, zeros, multiply, random, sum, exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# dataMatrix is a numpy mat, classLabels is a list
def stocGradAscent(dataMatrix, classLabels, k, iter):
    m, n = shape(dataMatrix)
    alpha = 0.01  # learning rate
    # Initialize the parameters
    w = zeros((n, 1))  # n is the number of features
    w_0 = 0.
    # Random init; a constant init would make all factor columns identical
    v = random.normal(0, 0.2, (n, k))

    for it in range(iter):
        for x in range(m):  # stochastic optimization: one sample at a time
            inter_1 = dataMatrix[x] * v
            inter_2 = multiply(dataMatrix[x], dataMatrix[x]) * multiply(v, v)  # element-wise products
            # Cross term, via the O(kn) formula
            interaction = sum(multiply(inter_1, inter_1) - inter_2) / 2.

            p = w_0 + dataMatrix[x] * w + interaction  # model output
            loss = sigmoid(classLabels[x] * p[0, 0]) - 1

            # Update w_0
            w_0 = w_0 - alpha * loss * classLabels[x]

            for i in range(n):
                if dataMatrix[x, i] != 0:
                    # Update w_i
                    w[i, 0] = w[i, 0] - alpha * loss * classLabels[x] * dataMatrix[x, i]
                    # Update v
                    for j in range(k):
                        v[i, j] = v[i, j] - alpha * loss * classLabels[x] * (
                            dataMatrix[x, i] * inter_1[0, j]
                            - v[i, j] * dataMatrix[x, i] * dataMatrix[x, i])

    return w_0, w, v
```

Then compute the error rate (the accuracy is 1 minus this value):

```python
def getAccuracy(dataMatrix, classLabels, w_0, w, v):
    m, n = shape(dataMatrix)
    allItem = 0
    error = 0
    result = []
    for x in range(m):
        allItem += 1
        # Plug in the learned parameters w_0, w, v
        inter_1 = dataMatrix[x] * v
        inter_2 = multiply(dataMatrix[x], dataMatrix[x]) * multiply(v, v)  # element-wise products
        # Cross term
        interaction = sum(multiply(inter_1, inter_1) - inter_2) / 2.
        p = w_0 + dataMatrix[x] * w + interaction  # model output
        pre = sigmoid(p[0, 0])

        result.append(pre)
        if pre < 0.5 and classLabels[x] == 1.0:
            error += 1
        elif pre >= 0.5 and classLabels[x] == -1.0:
            error += 1

    print(result)
    return float(error) / allItem
```

Finally, we can test it:

```python
if __name__ == '__main__':
    # 'diabetes_train.txt' / 'diabetes_test.txt' are placeholder names;
    # substitute the actual paths to the data files
    dataTrain, labelTrain = loadDataSet('diabetes_train.txt')
    dataTest, labelTest = loadDataSet('diabetes_test.txt')
    # k = 20 latent factors, 200 iterations
    w_0, w, v = stocGradAscent(mat(dataTrain), labelTrain, 20, 200)
    print("The test accuracy is: %f" % (1 - getAccuracy(mat(dataTest), labelTest, w_0, w, v)))
```

The final prediction accuracy is:

```
The test accuracy is: 0.791045
```

2. Prediction with pylibfm

```python
from sklearn.metrics import log_loss, accuracy_score
from scipy.sparse import csr_matrix
from pyfm import pylibfm

# Convert to compressed sparse matrices
dataTrain = csr_matrix(dataTrain)
fm = pylibfm.FM(num_factors=8,
                num_iter=200,
                verbose=True,
                initial_learning_rate=0.001,
                learning_rate_schedule="optimal")
fm.fit(dataTrain, labelTrain)

dataTest = csr_matrix(dataTest)
y_preds = fm.predict(dataTest)
y_preds_label = y_preds > 0.5
print("Validation log loss: %.4f" % log_loss(labelTest, y_preds))
print("accuracy: %.4f" % accuracy_score(labelTest, y_preds_label))
```

The final output is:

```
Validation log loss: 0.4851
accuracy: 0.7948
```

Installation: `pip install git+https://github.com/coreylynch/pyFM`

Parameter settings:

- num_factors: int, the dimensionality of the factorized 2-way interactions
- num_iter: int, the number of iterations
- learning_rate_schedule: string (optional), the learning-rate schedule
- initial_learning_rate: double, the initial learning rate (default 0.01)
- task: "regression" (labels are real values) or "classification" (labels are 1 or 0)
- verbose: bool, whether to print the training error at each iteration
- power_t: double, the exponent for the inverse-scaling learning rate (default 0.5)
- t0: double, the initial value of the learning-rate schedule (default 0.001)


References:

- S. Rendle, "Factorization Machines" (2010): https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
- https://zhuanlan.zhihu.com/p/50426292