# Industrial Implementation | Parallel Two-Tower CTR Structure in Tencent Feed Recommendation and Ranking @ Recommendation & Computational Advertising Series

• Author: Han Xinzi @ ShowMeAI, Joan @ Tencent

The two-tower model is the most commonly used and classic structure deployed in recommendation, search, advertising, and other fields. In practice, each tower of the structure can be upgraded by replacing the fully connected DNN with newer CTR-estimation network structures. In this article we look at how the Tencent browser team cleverly applied parallel CTR models inside the two towers of their recommendation scenario.

# The Full Article in One Figure

Implementation code

For implementations of the CTR estimation methods covered here (DCN / FM / DeepFM / FFM / CIN (xDeepFM)), see GitHub: https://github.com/ShowMeAI-H...

For the related recommendation and CTR datasets, reply to the official account (AI Algorithm Institute) to get the download links.

For industrial implementations of multi-task and multi-objective modeling in CTR estimation, see the related articles in this series.

# 1. Two-Tower Model Structure

## 1.1 Introduction to the model structure

The two-tower model is widely used in the recall and ranking stages of recommendation, search, advertising, and other fields. The model structure is as follows:

In the two-tower structure, the User tower is on the left and the Item tower on the right. Correspondingly, the features can be divided into two categories:

• User-side features: basic user information, group statistical attributes, and sequences of interacted Items;
• Item-side features: basic item information, attribute information, etc.

If there are Context features, they can be placed in the User tower.

In the original version of the structure, a classical DNN (i.e., a fully connected network) sits inside each tower. The feature Embeddings pass through several MLP hidden layers, and the two towers output the User Embedding and Item Embedding respectively.

During training, an inner product or cosine similarity is computed between the User Embedding and Item Embedding, pulling the current User closer to positive Items and pushing it away from negative Items in the Embedding space. The loss function can be a standard cross-entropy loss (treating it as a classification problem), or a BPR or Hinge loss (treating it as a representation-learning problem).
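As a minimal illustration of this training setup (a NumPy sketch with hypothetical dimensions, not the team's actual code), the cosine scores and cross-entropy loss for one user against one positive and several negative items can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, items):
    """Cosine similarity between one user vector and each row of an item matrix."""
    u = u / np.linalg.norm(u)
    items = items / np.linalg.norm(items, axis=-1, keepdims=True)
    return items @ u

# Hypothetical tower outputs: one user vector, one positive + three negative items.
user_emb = rng.normal(size=16)
item_embs = rng.normal(size=(4, 16))   # row 0 plays the positive example

logits = cosine(user_emb, item_embs)
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the candidates
loss = -np.log(probs[0])                        # cross-entropy w.r.t. the positive
```

Swapping `cosine` for a plain inner product, or the softmax loss for BPR/Hinge, changes only the last few lines.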

## 1.2 Advantages and disadvantages of the two-tower model

The advantages of the two-tower model are obvious:

• Clear structure. User and Item are modeled and learned separately, and the prediction is completed by a simple interaction (inner product) of the two.
• Efficient online serving with excellent performance. At serving time, the Item vectors are precomputed; the User vector only needs to be computed once per request from the current features, followed by an inner product or cosine calculation.

The two-tower model also has disadvantages:

• The original two-tower structure has limited feature expressiveness and cannot use cross features.
• Constrained by the model structure, User and Item are modeled separately and can only interact through the final inner product, which hinders the learning of User-Item interactions.

## 1.3 Optimization of the two-tower model

To address these limitations, the Tencent information-flow team (QQ browser novel recommendation scenario) optimized the two-tower structure, strengthening both the model and its effectiveness with good gains. The specific methods are as follows:

• Replace the plain DNN inside each tower with a "parallel" combination of effective CTR modules (MLP, DCN, FM, FFM, CIN), making full use of the feature-crossing strengths of the different structures and widening the model to ease the inner-product bottleneck of the two towers;
• Use LR to learn the weights of the multiple "parallel" towers. The LR weights are finally folded into the User Embedding, so the final model still keeps the inner-product form.

# 2. Parallel Two-Tower Model Structure

The parallel two-tower model can be divided into three layers: the input layer, the representation layer, and the matching layer. Corresponding to the three levels in the figure, the processing at each level is as follows.

## 2.1 Input Layer

Tencent QQ browser's novel scenario has the following two groups of features:

• User features: user id, user profile (age, gender, city), behavior sequences (clicks, reads, collections), external behavior (browser information, Tencent Video, etc.);
• Item features: novel content features (novel id, category, tags, etc.), statistical features, etc.

The User and Item features are discretized and mapped into Feature Embeddings, ready for network construction in the representation layer.
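A minimal sketch of this step (the vocabulary size, dimension, and hash-bucket fallback below are illustrative assumptions, not the team's actual discretization logic):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 16                       # hypothetical vocabulary size / dim
emb_table = rng.normal(0.0, 0.01, size=(VOCAB, DIM))

def embed(feature_ids):
    """Map discretized feature ids to their embedding vectors."""
    ids = np.asarray(feature_ids) % VOCAB   # hash-bucket out-of-range ids
    return emb_table[ids]

user_feature_emb = embed([17, 203, 4999])   # (3, 16): three user-side fields
```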

## 2.2 Representation Layer

• Deep CTR modules (MLP, DCN, FM, CIN, etc.) are applied to the inputs; different modules learn the fusion and interaction of the input-layer features in different ways.
• Based on the representations learned by the different modules, a parallel structure is built for the matching-layer computation.
• User-User and Item-Item feature interactions (crossing within a tower) can be achieved inside each tower's branches, whereas User-Item feature interactions can only be realized by the layer above.

## 2.3 Matching Layer

• For each parallel model, the Hadamard product of the User and Item vectors from the representation layer is computed; the results are concatenated and fused through LR to produce the final score.
• At serving time, the per-dimension LR weights can be pre-folded into the User Embedding, so online scoring remains a single inner-product operation.
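The weight-folding trick can be verified with a small NumPy sketch (hypothetical tower outputs and dimensions): since the per-dimension LR weight w satisfies w · (u ⊙ v) = (w ⊙ u) · v, scaling the concatenated User vector by w leaves serving as a plain inner product:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical outputs of two parallel tower branches (e.g. MLP and DCN), dim 8 each.
u = [rng.normal(size=8) for _ in range(2)]
v = [rng.normal(size=8) for _ in range(2)]
w = rng.normal(size=16)                      # LR weight per concatenated dimension

# Matching layer: Hadamard product per branch, concatenate, then LR.
score = w @ np.concatenate([u[k] * v[k] for k in range(2)])

# Serving trick: fold w into the User side so scoring is a plain inner product.
user_vec = np.concatenate(u) * w
item_vec = np.concatenate(v)
assert np.isclose(score, user_vec @ item_vec)
```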

# 3. Representation Layer of the Two Towers: MLP/DCN Structure

An MLP (multi-layer fully connected) structure is standard in the two towers. The Tencent QQ browser team also introduced the Cross Network structure from DCN to explicitly construct high-order feature interactions, referencing DCN-Mix, the improved version from Google's paper.

## 3.1 DCN structure

DCN is characterized by introducing the Cross Network, a crossing structure that extracts combinatorial features and avoids the manual feature engineering of traditional machine learning. The network structure is simple, its complexity is controllable, and higher-order cross features are obtained as the depth increases.

The specific structure of the DCN model is shown in the figure above:

• The bottom is the Embedding layer, whose embeddings are stacked together.
• Above it, a Cross Network and a Deep Network run in parallel.
• At the top, the Combination Layer stacks the outputs of the Cross Network and Deep Network to produce the Output.

## 3.2 The improved DCN-V2 structure

Building on DCN, Google proposed the improved DCN-Mix / DCN-V2, whose main change is to the Cross Network. We focus on the change in how the Cross Network is computed:

### 3.2.1 Original Cross Network computation

The original Cross Network computes

$$x_{l+1}=x_{0} x_{l}^{T} w_{l}+b_{l}+x_{l}$$

After stacking several layers, high-order feature interactions are learned explicitly. The problem is that the final $k$-th order interaction result $x_{k}$ can be shown to equal $x_{0}$ multiplied by a scalar (the scalar differs for different $x_{0}$, so $x_{k}$ is not a linear function of $x_{0}$). Under this computation, the expressiveness of the Cross Network is limited.
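This collapse can be checked numerically. The sketch below (bias dropped for simplicity; dimensions are hypothetical) iterates the original cross layer and confirms the output stays a scalar multiple of the base input:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x0 = rng.normal(size=d)
ws = [rng.normal(size=d) for _ in range(3)]

# Original cross layer with bias dropped: x_{l+1} = x0 * (x_l . w_l) + x_l
x = x0
for w in ws:
    x = x0 * (x @ w) + x

# However many layers are stacked, x remains a scalar multiple of x0.
scalar = (x @ x0) / (x0 @ x0)
assert np.allclose(x, scalar * x0)
```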

### 3.2.2 Improved Cross Network computation

Google's improved DCN-Mix makes the following changes:

• $w$ is changed from a vector to a matrix $W$; the larger parameter count brings stronger expressiveness (in practice $W$ can also be low-rank factorized).
• The feature-interaction mode is changed: instead of the outer product, the Hadamard (element-wise) product is applied:

$$x_{l+1}=x_{0} \odot\left(W_{l} x_{l}+b_{l}\right)+x_{l}$$
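A one-step NumPy sketch of the improved cross layer (hypothetical dimensions), showing both the full-rank matrix form and the low-rank factorization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # input dim and (hypothetical) projection dim
x0 = rng.normal(size=d)
b = rng.normal(size=d)

# Full-rank version: W is a d x d matrix, interaction via Hadamard product.
W = rng.normal(size=(d, d))
x1_full = x0 * (W @ x0 + b) + x0

# Low-rank variant: W factored as U @ V to cut parameters and compute.
U, V = rng.normal(size=(d, r)), rng.normal(size=(r, d))
x1_low = x0 * (U @ (V @ x0) + b) + x0
```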

### 3.2.3 DCN-V2 code reference

For the DCN-V2 implementation and CTR application examples, refer to Google's official implementation (https://github.com/tensorflow...).

The core improved deep-cross layer code is as follows:

```python
from typing import Optional, Text, Union

import tensorflow as tf


class Cross(tf.keras.layers.Layer):
  """Cross Layer in Deep & Cross Network to learn explicit feature interactions.

  A layer that creates explicit and bounded-degree feature interactions
  efficiently. The call method accepts inputs as a tuple of size 2
  tensors. The first input x0 is the base layer that contains the original
  features (usually the embedding layer); the second input xi is the output
  of the previous Cross layer in the stack, i.e., the i-th Cross
  layer. For the first Cross layer in the stack, x0 = xi.

  The output is x_{i+1} = x0 .* (W * xi + bias + diag_scale * xi) + xi,
  where .* designates elementwise multiplication, W could be a full-rank
  matrix, or a low-rank matrix U*V to reduce the computational cost, and
  diag_scale increases the diagonal of W to improve training stability (
  especially for the low-rank case).

  References:
    1. [R. Wang et al.](https://arxiv.org/pdf/2008.13535.pdf)
       See Eq. (1) for full-rank and Eq. (2) for low-rank version.
    2. [R. Wang et al.](https://arxiv.org/pdf/1708.05123.pdf)

  Example:

      # after embedding layer in a functional model:
      input = tf.keras.Input(shape=(None,), name='index', dtype=tf.int64)
      x0 = tf.keras.layers.Embedding(input_dim=32, output_dim=6)
      x1 = Cross()(x0, x0)
      x2 = Cross()(x0, x1)
      logits = tf.keras.layers.Dense(units=10)(x2)
      model = tf.keras.Model(input, logits)

  Args:
    projection_dim: project dimension to reduce the computational cost.
      Default is None such that a full (input_dim by input_dim) matrix
      W is used. If enabled, a low-rank matrix W = U*V will be used, where U
      is of size input_dim by projection_dim and V is of size
      projection_dim by input_dim. projection_dim need to be smaller
      than input_dim/2 to improve the model efficiency. In practice, we've
      observed that projection_dim = d/4 consistently preserved the
      accuracy of a full-rank version.
    diag_scale: a non-negative float used to increase the diagonal of the
      kernel W by diag_scale, that is, W + diag_scale * I, where I is an
      identity matrix.
    use_bias: whether to add a bias term for this layer. If set to False,
      no bias term will be used.
    kernel_initializer: Initializer to use on the kernel matrix.
    bias_initializer: Initializer to use on the bias vector.
    kernel_regularizer: Regularizer to use on the kernel matrix.
    bias_regularizer: Regularizer to use on the bias vector.

  Input shape: A tuple of 2 (batch_size, input_dim) dimensional inputs.
  Output shape: A single (batch_size, input_dim) dimensional output.
  """

  def __init__(
      self,
      projection_dim: Optional[int] = None,
      diag_scale: Optional[float] = 0.0,
      use_bias: bool = True,
      kernel_initializer: Union[
          Text, tf.keras.initializers.Initializer] = "truncated_normal",
      bias_initializer: Union[
          Text, tf.keras.initializers.Initializer] = "zeros",
      kernel_regularizer: Union[
          Text, None, tf.keras.regularizers.Regularizer] = None,
      bias_regularizer: Union[
          Text, None, tf.keras.regularizers.Regularizer] = None,
      **kwargs):

    super(Cross, self).__init__(**kwargs)

    self._projection_dim = projection_dim
    self._diag_scale = diag_scale
    self._use_bias = use_bias
    self._kernel_initializer = tf.keras.initializers.get(kernel_initializer)
    self._bias_initializer = tf.keras.initializers.get(bias_initializer)
    self._kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
    self._bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
    self._input_dim = None

    if self._diag_scale < 0:
      raise ValueError(
          "diag_scale should be non-negative. Got diag_scale = {}".format(
              self._diag_scale))

  def build(self, input_shape):
    last_dim = input_shape[-1]

    if self._projection_dim is None:
      # Full-rank case: one dense layer holds the d x d kernel W.
      self._dense = tf.keras.layers.Dense(
          last_dim,
          kernel_initializer=self._kernel_initializer,
          bias_initializer=self._bias_initializer,
          kernel_regularizer=self._kernel_regularizer,
          bias_regularizer=self._bias_regularizer,
          use_bias=self._use_bias,
      )
    else:
      # Low-rank case: W = U * V, realized as two stacked dense layers.
      self._dense_u = tf.keras.layers.Dense(
          self._projection_dim,
          kernel_initializer=self._kernel_initializer,
          kernel_regularizer=self._kernel_regularizer,
          use_bias=False,
      )
      self._dense_v = tf.keras.layers.Dense(
          last_dim,
          kernel_initializer=self._kernel_initializer,
          bias_initializer=self._bias_initializer,
          kernel_regularizer=self._kernel_regularizer,
          bias_regularizer=self._bias_regularizer,
          use_bias=self._use_bias,
      )
    self.built = True

  def call(self, x0: tf.Tensor, x: Optional[tf.Tensor] = None) -> tf.Tensor:
    """Computes the feature cross.

    Args:
      x0: The input tensor.
      x: Optional second input tensor. If provided, the layer will compute
        crosses between x0 and x; if not provided, the layer will compute
        crosses between x0 and itself.

    Returns:
      Tensor of crosses.
    """

    if not self.built:
      self.build(x0.shape)

    if x is None:
      x = x0

    if x0.shape[-1] != x.shape[-1]:
      raise ValueError(
          "x0 and x dimension mismatch! Got x0 dimension {}, and x "
          "dimension {}. This case is not supported yet.".format(
              x0.shape[-1], x.shape[-1]))

    if self._projection_dim is None:
      prod_output = self._dense(x)
    else:
      prod_output = self._dense_v(self._dense_u(x))

    if self._diag_scale:
      prod_output = prod_output + self._diag_scale * x

    # x_{i+1} = x0 .* (W x + b + diag_scale * x) + x
    return x0 * prod_output + x

  def get_config(self):
    config = {
        "projection_dim": self._projection_dim,
        "diag_scale": self._diag_scale,
        "use_bias": self._use_bias,
        "kernel_initializer":
            tf.keras.initializers.serialize(self._kernel_initializer),
        "bias_initializer":
            tf.keras.initializers.serialize(self._bias_initializer),
        "kernel_regularizer":
            tf.keras.regularizers.serialize(self._kernel_regularizer),
        "bias_regularizer":
            tf.keras.regularizers.serialize(self._bias_regularizer),
    }
    base_config = super(Cross, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))
```

# 4. Representation Layer of the Two Towers: FM/FFM/CIN Structure

Another family of structures commonly used in CTR estimation is the FM series; typical models include FM, FFM, DeepFM, and xDeepFM. Their particular modeling approach can also mine effective information, and substructures of these models are used in the Tencent QQ browser team's final model.

The feature interactions of the MLP and DCN above are implicit crossings; some feature interactions cannot be specified explicitly. The FM / FFM / CIN structures of the FM series operate explicitly at feature granularity, and their formulas all have a convenient inner-product form, which can directly realize feature-level User-Item interactions in two-tower modeling.

## 4.1 Introduction to the FM structure

$$y = \omega_{0}+\sum_{i=1}^{n} \omega_{i} x_{i}+\sum_{i=1}^{n-1} \sum_{j=i+1}^{n}<v_{i}, v_{j}>x_{i} x_{j}$$

FM is the most common model structure in CTR prediction; it constructs second-order feature interactions through matrix factorization. The formula is expressed as a sum of pairwise inner products of the feature vectors $v_{i}$ and $v_{j}$ (which, in deep learning, can be regarded as inner products of grouped feature Embeddings). By the distributive law, the sum of inner products can be converted into the inner product of sums:

$$\begin{array}{c} y=\sum_{i} \sum_{j}\left\langle V_{i}, V_{j}\right\rangle=\left\langle\sum_{i} V_{i}, \sum_{j} V_{j}\right\rangle \\ i \in \text { user fea, } \quad j \in \text { item fea } \end{array}$$

In the Tencent QQ browser novel recommendation scenario, only the User-Item interactions are considered (second-order interactions within User or within Item are already captured by the models above). In the formula, $i$ ranges over User-side features and $j$ over Item-side features. Applying the distributive law via the inner product, the second-order User-Item feature interaction can be transformed into sum-pooling the User and Item feature vectors and then taking a single inner product, which maps naturally onto the two-tower structure.
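The distributive-law identity underlying this conversion can be verified directly (a NumPy sketch with hypothetical feature counts and dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 8))   # 4 user-side feature embeddings, dim 8
V = rng.normal(size=(5, 8))   # 5 item-side feature embeddings, dim 8

# Pairwise form: sum of <u_i, v_j> over every User-Item feature pair.
pairwise = sum(U[i] @ V[j] for i in range(4) for j in range(5))

# Two-tower form: sum-pool each side once, then a single inner product.
pooled = U.sum(axis=0) @ V.sum(axis=0)

assert np.isclose(pairwise, pooled)
```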

## 4.2 Introduction to the FFM structure

The FFM model is an upgraded version of FM that adds the concept of a field. FFM assigns features of the same nature to the same field; the latent vectors it constructs depend not only on the feature but also on the field, so feature interactions can live in different latent spaces, improving discrimination and effectiveness. FFM can also be transformed into a two-tower inner-product structure with some manipulation.

$$y(\mathbf{x})=w_{0}+\sum_{i=1}^{n} w_{i} x_{i}+\sum_{i=1}^{n} \sum_{j=i+1}^{n}\left\langle\mathbf{v}_{i f_{j}}, \mathbf{v}_{j f_{i}}\right\rangle x_{i} x_{j}$$

An example of conversion is as follows:

img10.png

The User has two feature fields and the Item has three. Every pair of interacting features in the figure has its own independent Embedding vector. According to the FFM formula, computing the second-order User-Item interaction requires computing and summing all of these inner products.

img11.png

By reordering the User and Item feature Embeddings and then concatenating them, FFM is converted into a two-tower inner-product form. The User-User and Item-Item interaction terms of FFM each live within a single tower, so they can be precomputed and folded into the first-order terms.

img12.png
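The "rearrange + concatenate" equivalence can be verified numerically (a NumPy sketch; the field counts and indexing convention here are illustrative, not the team's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 2, 3, 4   # 2 user fields, 3 item fields, embedding dim 4
# U[i, j]: embedding of user field i in the space of item field j;
# V[j, i]: embedding of item field j in the space of user field i.
U = rng.normal(size=(m, n, d))
V = rng.normal(size=(n, m, d))

# FFM User-Item term: sum over all field pairs of <v_{i,f_j}, v_{j,f_i}>.
ffm = sum(U[i, j] @ V[j, i] for i in range(m) for j in range(n))

# Rearrange + concatenate so the same sum becomes one wide inner product
# (tower width m * n * d, which is where the width blow-up comes from).
user_vec = np.concatenate([U[i, j] for i in range(m) for j in range(n)])
item_vec = np.concatenate([V[j, i] for i in range(m) for j in range(n)])
assert np.isclose(ffm, user_vec @ item_vec)
```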

In practice, the Tencent QQ browser team found that the two towers with FFM significantly improved AUC on the training data, but the increased parameter count caused serious overfitting, and the tower width after the structural adjustment above becomes extremely large (possibly reaching the 10,000 level), heavily impacting serving efficiency. Further optimizations were:

• Manually screen the User and Item feature fields involved in FFM feature interactions to control the width of the two towers (around 1,000).
• Adjust FFM's Embedding parameter initialization (closer to 0) and learning rate (reduced).

The final effect was still not ideal, so the team did not actually deploy FFM online.

## 4.3 Introduction to the CIN structure

The FM and FFM above realize second-order feature interactions, while the CIN structure proposed in the xDeepFM model can realize higher-order feature interactions (such as User-User-Item, User-User-Item-Item, User-Item-Item, etc.). The Tencent QQ browser team tried two ways of applying CIN to the two-tower structure:
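For reference, one CIN layer amounts to taking all pairwise Hadamard products between the base field embeddings and the previous layer's feature maps, then compressing them with learned weights. A single-sample NumPy sketch (hypothetical shapes, not the xDeepFM reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def cin_step(x0, xk, w):
    """One CIN layer for a single sample.
    x0: (m, d) base field embeddings; xk: (h, d) previous-layer feature maps;
    w:  (h * m, h_next) compression weights. Returns (h_next, d)."""
    h, d = xk.shape
    z = xk[:, None, :] * x0[None, :, :]   # all pairwise Hadamard products, (h, m, d)
    z = z.reshape(h * x0.shape[0], d)     # flatten the interaction maps
    return w.T @ z                        # compress to h_next feature maps

m, d, h_next = 4, 8, 3
x0 = rng.normal(size=(m, d))
x1 = cin_step(x0, x0, rng.normal(size=(m * m, h_next)))       # 2nd-order maps
x2 = cin_step(x0, x1, rng.normal(size=(h_next * m, h_next)))  # 3rd-order maps
```

Sum-pooling `x1`, `x2`, … over the embedding dimension then yields the per-order outputs that the two usages below combine in different ways.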

### 4.3.1 CIN(User) * CIN(Item)

Each tower generates its own multi-order CIN results for User and Item; sum pooling then produces the User/Item vectors, and finally the inner product of the two vectors is taken.

By the distributive law, expanding the sum-pooling-then-inner-product formula shows that this computation already realizes multi-order User-Item interactions:

$$\begin{aligned} & \left(U^{1}+U^{2}+U^{3}\right) *\left(I^{1}+I^{2}+I^{3}\right) \\ =\; & U^{1} I^{1}+U^{1} I^{2}+U^{1} I^{3}+U^{2} I^{1}+U^{2} I^{2}+U^{2} I^{3}+U^{3} I^{1}+U^{3} I^{2}+U^{3} I^{3} \end{aligned}$$

This usage is simple to implement: run CIN inside each tower to generate each order's result, sum-pool the results, and finally, as with FM, realize each order of User-Item interaction via the inner product.

This approach has a drawback: the second-order and higher User-Item interactions it produces have FM-like limitations (for example, $U^{1}$ is the sum pooling of multiple User-side features; when its inner product with the Item-side result is computed, the sum pooling forces every User feature to carry the same importance).

### 4.3.2 CIN( CIN(User), CIN(Item) )

The second approach: after each tower generates the multi-order CIN results for User and Item, the User and Item CIN results are explicitly interacted with each other again (rather than sum-pooled and then inner-producted) and converted into a two-tower inner product, as shown in the figure below:

The figure below shows the CIN computation formula. After sum pooling, the form of the multiple convolution results is unchanged (a weighted sum of pairwise Hadamard products).

The form of CIN is similar to FFM's: it can also be converted into a two-tower inner-product form via the "rearrange + concatenate" operation, and the resulting tower width is likewise very large (10,000-level). Unlike FFM, however, all CIN features interact with each other and the underlying feature Embeddings are shared, whereas FFM keeps an independent Embedding for every second-order interaction. As a result, the Tencent QQ browser team saw essentially no overfitting in practice, and the second approach performed slightly better than the first in experiments.

# 5. Experimental Analysis

Below are the experimental results on Tencent QQ browser's novel recommendation service (comparing various single CTR models against parallel two-tower structures):

## 5.1 Analyses given by the team

• Among single-structure two-tower models, CIN2 performs best, followed by the DCN and CIN1 two-tower structures;
• Compared with single two-tower structures, the parallel two-tower structures also bring significant improvements;
• Parallel scheme 2 uses the CIN2 structure, and its two-tower width reaches 20,000+, which challenges online serving performance. Weighing effect against deployment efficiency, parallel two-tower scheme 1 can be chosen.

## 5.2 Training details and experience given by the team

• Given the computational complexity of FM/FFM/CIN and similar structures, they are trained only on a selected feature subset, mainly higher-dimensional categorical features such as User id, behavior-history ids, novel id, tag id, etc., plus a small number of statistical features. Fewer than 20 feature fields are selected on each of the User and Item sides;
• The parallel two-tower structures do not share the underlying feature Embeddings; each model trains its own Embeddings;
• Embedding dimensions: for MLP/DCN, categorical features use dimension 16 and non-categorical features use dimension 32;
• The feature Embedding dimension for FM/FFM/CIN is uniformly 32.

# 6. Online Experiment Results of the Tencent Team

An A/B test was launched in the pre-ranking stage of the novel recommendation scenario. The click-through-rate and reading-conversion models of the experimental group used "parallel two-tower scheme 1", while the control group used the "MLP two-tower model". As shown in the figure below, business metrics improved significantly:

• Click-through rate +6.8752%
• Reading conversion rate +6.2250%
• Book-adding conversion rate +6.5775%

# 7. Code Implementation References

Implementation code

For implementations of the CTR estimation methods covered here (DCN / FM / DeepFM / FFM / CIN (xDeepFM)), see GitHub: https://github.com/ShowMeAI-H...

For the related recommendation and CTR datasets, reply to the official account (AI Algorithm Institute) to get the download links.

# 8. References

• [1] Huang, Po-Sen, et al. "Learning deep structured semantic models for web search using clickthrough data." Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013.
• [2] S. Rendle. "Factorization machines." Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 995–1000, 2010.
• [3] Yuchin Juan, et al. "Field-aware Factorization Machines for CTR Prediction." Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), 2016.
• [4] Jianxun Lian, et al. "xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1754–1763.
• [5] Ruoxi Wang, et al. "Deep & Cross Network for Ad Click Predictions." Proceedings of ADKDD'17, 2017, Article 12.
• [6] Wang, Ruoxi, et al. "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems." Proceedings of the Web Conference 2021 (WWW '21). doi:10.1145/3442381.3450078