# Theory and practice of PNN model

Reposted from
https://blog.csdn.net/livan1234/article/details/84990287?spm=1001.2014.3001.5502

1. Principle
PNN, short for Product-based Neural Network, starts from the observation that feeding embedded features straight into an MLP (multi-layer perceptron, i.e. a fully connected neural network) does not learn cross features expressively enough. It therefore proposes a product layer: a DNN structure that captures feature crosses through multiplication.

[Figure: PNN network structure]

Following the paper, we walk through the network structure from top to bottom:

Output layer
The output layer is simple. The output of the layer below passes through a fully connected layer and a sigmoid function, which maps it to the interval (0,1) and gives the predicted click-through rate:

$$\hat{y} = \sigma(W_3 l_2 + b_3)$$

l2 layer
The l2 layer takes the output of the l1 layer, applies a fully connected layer and activates the result with relu:

$$l_2 = \mathrm{relu}(W_2 l_1 + b_2)$$

l1 layer
The output of the l1 layer is computed as follows:

$$l_1 = \mathrm{relu}(l_z + l_p + b_1)$$

This is the key part. The l1 layer has three inputs: lz, lp and b1. b1 is the bias term and can be ignored here. The computation of lz and lp is the essence of PNN, so let's take it step by step.
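The three layers described above can be sketched in a few lines of NumPy. This is only an illustration for a single sample: the function name `pnn_top_layers`, the layer widths and the random weights are assumptions made for the example, not part of the original code.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pnn_top_layers(lz, lp, b1, W2, b2, W3, b3):
    """Hypothetical helper: l1 -> l2 -> predicted CTR for one sample."""
    l1 = relu(lz + lp + b1)        # l1 = relu(lz + lp + b1)
    l2 = relu(W2 @ l1 + b2)        # l2 = relu(W2 l1 + b2)
    return sigmoid(W3 @ l2 + b3)   # y_hat = sigmoid(W3 l2 + b3)

rng = np.random.default_rng(0)
D1, D2 = 4, 3                      # assumed widths of l1 and l2
y_hat = pnn_top_layers(
    lz=rng.normal(size=D1), lp=rng.normal(size=D1), b1=np.zeros(D1),
    W2=rng.normal(size=(D2, D1)), b2=np.zeros(D2),
    W3=rng.normal(size=D2), b3=0.0)
print(y_hat)                       # a single probability in (0, 1)
```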

Product Layer

The idea of the product comes from the observation that, in CTR estimation, the relationship between features is more of an "and" relationship than an "add" relationship. For example, for users who are male and like games, the "and" combination (a product) reflects the feature interaction better than the "add" combination (a sum).

The product layer has two parts: the linear part lz and the nonlinear part lp. They take the following form:

$$l_z = (l_z^1, l_z^2, \dots, l_z^{D_1}), \qquad l_z^n = W_z^n \odot z$$

$$l_p = (l_p^1, l_p^2, \dots, l_p^{D_1}), \qquad l_p^n = W_p^n \odot p$$

Here ⊙ is an operation defined in the paper: multiply two matrices element by element, then sum all the products:

$$A \odot B \triangleq \sum_{i,j} A_{i,j} B_{i,j}$$

Let's continue with the rest of the network structure. Section 2 covers the Product Layer in more detail.
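As a small sketch of the operation used by the product layer (multiply element by element, then sum everything), assuming NumPy and a hypothetical helper name `tensor_inner`:

```python
import numpy as np

def tensor_inner(A, B):
    """Hypothetical helper: element-wise product of A and B, summed to a scalar."""
    return float(np.sum(A * B))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.5, 0.0], [1.0, 2.0]])
value = tensor_inner(A, B)   # 1*0.5 + 2*0 + 3*1 + 4*2
print(value)                 # 11.5
```

Each entry of lz (and of lp) is one such scalar, computed with its own weight matrix.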

Embedding Layer

The Embedding Layer is the same as in DeepFM. The features of each field are converted into a vector of the same length, denoted by f.

Loss function
PNN uses the same loss function as logistic regression (log loss):

$$L(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$$

2. Product Layer details
As mentioned earlier, the product layer has two parts: the linear part lz and the nonlinear part lp. To evaluate their formulas we first need z and p, both of which are obtained from the embedding layer. z is the linear signal vector, so it is taken directly from the embedding layer:

$$z = (z_1, z_2, \dots, z_N) \triangleq (f_1, f_2, \dots, f_N)$$

The symbol ≜ (an equals sign with a triangle on top) used in the paper means "is defined as", so z is simply a copy of the embedding layer output.

For p, a mapping function g is needed:

$$p = \{p_{i,j}\}, \quad i = 1, \dots, N, \; j = 1, \dots, N, \qquad p_{i,j} = g(f_i, f_j)$$

Different choices of g give two variants of PNN: Inner PNN (IPNN for short) and Outer PNN (OPNN for short).

Next, we introduce these two PNN models in detail. Because complexity analysis is involved, we first define the embedding size as M, the number of fields as N, and the length of lz and lp as D1.

2.1 IPNN
[Figure: schematic diagram of IPNN]

In IPNN, p is computed with the inner product:

$$g(f_i, f_j) = \langle f_i, f_j \rangle$$

Each pij is therefore a scalar. Computing one pij costs M and p has N * N entries, so computing p costs N * N * M. Computing lp from p costs N * N * D1. The total time complexity of IPNN is therefore N * N * (D1 + M).

The paper optimizes this structure. Since p is a symmetric matrix, the weight matrix can also be taken to be symmetric, and it is factorized as follows:

$$W_p^n = \theta^n (\theta^n)^T$$

Therefore:

$$W_p^n \odot p = \sum_{i=1}^{N} \sum_{j=1}^{N} \theta_i^n \theta_j^n \langle f_i, f_j \rangle = \Big\langle \sum_{i=1}^{N} \delta_i^n, \sum_{i=1}^{N} \delta_i^n \Big\rangle, \qquad \delta_i^n = \theta_i^n f_i$$

Thus:

$$l_p = \Big( \Big\| \sum_i \delta_i^1 \Big\|^2, \dots, \Big\| \sum_i \delta_i^{D_1} \Big\|^2 \Big)$$

The quadratic weights now only need D1 * N parameters, and the time complexity drops to D1 * M * N.
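The IPNN optimization can be checked numerically. The sketch below, using NumPy with made-up sizes and random data, computes one entry of lp in two ways: the naive way through the full N * N matrix p, and the factorized way through the weighted sum of embeddings. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 4                        # number of fields, embedding size
f = rng.normal(size=(N, M))        # one embedding vector f_i per field
theta = rng.normal(size=N)         # factor of one weight matrix: W = theta theta^T

# Naive: build p (all pairwise inner products) and W, then multiply and sum.
p = f @ f.T                        # N x N, costs N * N * M
W_p = np.outer(theta, theta)       # N x N
lp_naive = np.sum(W_p * p)

# Factorized: scale each f_i by theta_i, sum over fields, take the squared norm.
delta_sum = (theta[:, None] * f).sum(axis=0)   # length M, costs N * M
lp_fast = delta_sum @ delta_sum

print(np.allclose(lp_naive, lp_fast))
```

Repeating the factorized computation for each of the D1 output units gives the D1 * M * N cost.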

2.2 OPNN
[Figure: schematic diagram of OPNN]

In OPNN, p is computed with the outer product:

$$g(f_i, f_j) = f_i f_j^T$$

Now each pij is an M * M matrix, so computing one pij costs M * M. p consists of N * N such matrices, so computing p costs N * N * M * M, and computing lp costs D1 * N * N * M * M. This is clearly expensive. To reduce the complexity, the paper uses the idea of superposition and redefines p as:

$$p = \sum_{i=1}^{N} \sum_{j=1}^{N} f_i f_j^T = f_\Sigma (f_\Sigma)^T, \qquad f_\Sigma = \sum_{i=1}^{N} f_i$$

With this definition, the time complexity of computing lp becomes D1 * M * (M + N).
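The superposition trick can be verified with a short NumPy sketch: summing all N * N outer products gives the same M * M matrix as summing the embeddings first and taking a single outer product. Sizes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 4                        # number of fields, embedding size
f = rng.normal(size=(N, M))        # one embedding vector f_i per field

# Naive: add up every outer product f_i f_j^T (N * N matrices of size M * M).
p_naive = sum(np.outer(f[i], f[j]) for i in range(N) for j in range(N))

# Superposition: sum the embeddings once (N * M), then one outer product (M * M).
f_sum = f.sum(axis=0)
p_fast = np.outer(f_sum, f_sum)

# One entry of lp: multiply by an M x M weight matrix element-wise and sum.
W_p = rng.normal(size=(M, M))
lp_n = np.sum(W_p * p_fast)

print(np.allclose(p_naive, p_fast))
```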

3. Code practice
Finally, the exciting part: code practice. I looked for a well-implemented version to build on, but found nothing suitable in TensorFlow (there is a good one in PyTorch), so I had to implement it myself. The code in this article therefore follows the paper closely. Corrections and suggestions for improvement are welcome.

https://github.com/princewen/tensorflow_practice/tree/master/Basic-PNN-Demo.

The code in this article is adapted from the earlier DeepFM code. Only the model implementation is covered here. For the data processing details, refer to the code on my GitHub.

Model input

The model input consists of the following parts. feat_index holds the index of each feature and is used to select the embedding through embedding_lookup. feat_value is the corresponding feature value: 1 for a discrete feature, and the original value otherwise. label is the ground-truth value. Dropout is also defined to prevent overfitting.
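To make the two inputs concrete, here is a minimal sketch of how one sample could be turned into feat_index and feat_value. The feature dictionary `feature_ids`, the field names and the `encode` helper are all hypothetical and are not taken from the repository's data processing code.

```python
# Hypothetical feature dictionary: two categorical fields and one numeric field.
feature_ids = {
    ("gender", "male"): 0, ("gender", "female"): 1,
    ("hobby", "games"): 2, ("hobby", "sports"): 3,
    ("age", None): 4,              # numeric field: one index for the whole field
}

def encode(sample):
    """Return (feat_index, feat_value) for one sample given as a dict."""
    feat_index, feat_value = [], []
    for field, value in sample.items():
        if isinstance(value, str):             # discrete feature: value is 1
            feat_index.append(feature_ids[(field, value)])
            feat_value.append(1.0)
        else:                                  # continuous feature: keep raw value
            feat_index.append(feature_ids[(field, None)])
            feat_value.append(float(value))
    return feat_index, feat_value

feat_index, feat_value = encode({"gender": "male", "hobby": "games", "age": 23})
print(feat_index, feat_value)      # [0, 2, 4] [1.0, 1.0, 23.0]
```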

Weight construction

The weights consist of four parts: first the weights of the embedding layer; then the weights of the product layer, which include the linear signal weights and the quadratic signal weights, the latter defined separately for IPNN and OPNN; and finally the weights of each deep layer and of the output layer.

For the linear signal weights, the size is D1 * N * M.
For the quadratic signal weights, the size is D1 * N for IPNN and D1 * M * M for OPNN.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

def _initialize_weights(self):
    weights = dict()

    # embeddings
    weights['feature_embeddings'] = tf.Variable(
        tf.random_normal([self.feature_size, self.embedding_size], 0.0, 0.01),
        name='feature_embeddings')
    weights['feature_bias'] = tf.Variable(
        tf.random_normal([self.feature_size, 1], 0.0, 1.0), name='feature_bias')

    # Product Layers: quadratic signal weights
    if self.use_inner:
        # IPNN: D1 * N
        weights['product-quadratic-inner'] = tf.Variable(
            tf.random_normal([self.deep_init_size, self.field_size], 0.0, 0.01))
    else:
        # OPNN: D1 * M * M
        weights['product-quadratic-outer'] = tf.Variable(
            tf.random_normal([self.deep_init_size, self.embedding_size, self.embedding_size], 0.0, 0.01))

    # Product Layers: linear signal weights (D1 * N * M) and bias (D1)
    weights['product-linear'] = tf.Variable(
        tf.random_normal([self.deep_init_size, self.field_size, self.embedding_size], 0.0, 0.01))
    weights['product-bias'] = tf.Variable(tf.random_normal([self.deep_init_size], 0.0, 1.0))

    # deep layers
    num_layer = len(self.deep_layers)
    input_size = self.deep_init_size
    glorot = np.sqrt(2.0 / (input_size + self.deep_layers[0]))

    weights['layer_0'] = tf.Variable(
        np.random.normal(loc=0, scale=glorot, size=(input_size, self.deep_layers[0])),
        dtype=np.float32)
    weights['bias_0'] = tf.Variable(
        np.random.normal(loc=0, scale=glorot, size=(1, self.deep_layers[0])),
        dtype=np.float32)

    for i in range(1, num_layer):
        glorot = np.sqrt(2.0 / (self.deep_layers[i - 1] + self.deep_layers[i]))
        weights["layer_%d" % i] = tf.Variable(
            np.random.normal(loc=0, scale=glorot, size=(self.deep_layers[i - 1], self.deep_layers[i])),
            dtype=np.float32)  # layers[i-1] * layers[i]
        weights["bias_%d" % i] = tf.Variable(
            np.random.normal(loc=0, scale=glorot, size=(1, self.deep_layers[i])),
            dtype=np.float32)  # 1 * layer[i]

    # output layer
    glorot = np.sqrt(2.0 / (input_size + 1))
    weights['output'] = tf.Variable(
        np.random.normal(loc=0, scale=glorot, size=(self.deep_layers[-1], 1)), dtype=np.float32)
    weights['output_bias'] = tf.Variable(tf.constant(0.01), dtype=np.float32)

    return weights
```

Embedding Layer
This part is simple. Select the embeddings from weights['feature_embeddings'] according to feat_index, then multiply them by the corresponding feat_value:

```python
# Embeddings (shape comments: N = batch size, F = number of fields, K = embedding size)
self.embeddings = tf.nn.embedding_lookup(self.weights['feature_embeddings'], self.feat_index)  # N * F * K
feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
self.embeddings = tf.multiply(self.embeddings, feat_value)  # N * F * K
```
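The same lookup-and-scale step can be reproduced in plain NumPy, which makes the shapes easy to inspect. This is a sketch with a tiny made-up embedding table and batch. Integer-array indexing plays the role of embedding_lookup here.

```python
import numpy as np

feature_size, embedding_size = 5, 3
# Made-up embedding table: row r is [3r, 3r+1, 3r+2].
table = np.arange(feature_size * embedding_size, dtype=np.float32).reshape(
    feature_size, embedding_size)

feat_index = np.array([[0, 2, 4], [1, 3, 4]])                  # batch of 2, 3 fields
feat_value = np.array([[1.0, 1.0, 23.0], [1.0, 1.0, 0.5]], dtype=np.float32)

embeddings = table[feat_index]                     # (2, 3, 3): one row per feature index
embeddings = embeddings * feat_value[:, :, None]   # scale each embedding by its value

print(embeddings.shape)     # (2, 3, 3)
print(embeddings[0, 2])     # row 4 of the table times 23
```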

Product Layer
Following the earlier description, we compute the linear signal vector, the quadratic signal vector and the bias term, add them together, and apply a relu activation to get the input of the deep network.

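```python
# Shape comments: N = batch size, F = number of fields, K = embedding size
# Linear Signal
linear_output = []
for i in range(self.deep_init_size):
    linear_output.append(tf.reshape(
        tf.reduce_sum(tf.multiply(self.embeddings, self.weights['product-linear'][i]), axis=[1, 2]),
        shape=(-1, 1)))  # N * 1

self.lz = tf.concat(linear_output, axis=1)  # N * init_deep_size

# Quadratic Signal
# The lines that fill quadratic_output are missing from the source text;
# they are reconstructed here from the IPNN / OPNN formulas.
quadratic_output = []
if self.use_inner:
    for i in range(self.deep_init_size):
        theta = tf.multiply(self.embeddings, tf.reshape(self.weights['product-quadratic-inner'][i], (1, -1, 1)))  # N * F * K
        # squared norm of sum_i theta_i * f_i
        quadratic_output.append(tf.reshape(
            tf.reduce_sum(tf.square(tf.reduce_sum(theta, axis=1)), axis=1), shape=(-1, 1)))  # N * 1
else:
    embedding_sum = tf.reduce_sum(self.embeddings, axis=1)  # N * K
    p = tf.matmul(tf.expand_dims(embedding_sum, 2), tf.expand_dims(embedding_sum, 1))  # N * K * K
    for i in range(self.deep_init_size):
        theta = tf.multiply(p, tf.expand_dims(self.weights['product-quadratic-outer'][i], 0))  # N * K * K
        quadratic_output.append(tf.reshape(tf.reduce_sum(theta, axis=[1, 2]), shape=(-1, 1)))  # N * 1

self.lp = tf.concat(quadratic_output, axis=1)  # N * init_deep_size

# Reconstructed from the description above: add lz, lp and the bias, then relu.
self.y_deep = tf.nn.relu(self.lz + self.lp + self.weights['product-bias'])
self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep)
```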
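The loop over deep_init_size in the linear signal computes one output unit at a time. As a possible alternative (not what the original code does), the same result can be obtained with a single einsum contraction. The NumPy sketch below uses made-up shapes to check that both forms agree.

```python
import numpy as np

rng = np.random.default_rng(0)
B, F, K, D1 = 2, 3, 4, 5            # batch size, fields, embedding size, product-layer width
emb = rng.normal(size=(B, F, K))    # stands in for self.embeddings
W_lin = rng.normal(size=(D1, F, K)) # stands in for weights['product-linear']

# Loop form, mirroring the code above: one element-wise product and sum per unit.
lz_loop = np.stack(
    [(emb * W_lin[i]).sum(axis=(1, 2)) for i in range(D1)], axis=1)   # B * D1

# Vectorized form: contract the field and embedding axes in one call.
lz_vec = np.einsum("bfk,dfk->bd", emb, W_lin)                         # B * D1

print(np.allclose(lz_loop, lz_vec))
```

TensorFlow offers tf.einsum with the same subscript notation, so the idea should carry over, though that variant is not tested here.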

Deep Part
The Deep Part in the paper actually has only one layer, but we can configure any number of layers here, and finally get the output:

```python
# Deep component
for i in range(0,len(self.deep_layers)):