Tianchi Zero-Foundation Introduction to Recommendation Systems: News Recommendation, Multi-Channel Recall (02)

What is recall?

Let's first take a look at the recommendation system architecture.

About multi-channel recall

   The recall layer coarsely filters the original, large-scale data with simple features and models, reducing the data size without losing the key information, and returns a candidate set to be ranked. Each path in multi-channel recall can use different features and algorithm models, and the paths are independent of one another. The recall layer is a product of engineering practice; in real production each path uses an appropriate model, and when the results are fed into the ranking layer, different weights can be assigned to fuse the paths.

  This news recommendation task has 250,000 lines of data, so building a recall layer is necessary. Three running modes are involved in the competition (a minimal loading sketch follows the list):

  • Debug mode: sample part of the training set for debugging, to make sure the model runs end to end (baseline)
  • Offline mode: load the full training set for training
  • Online mode: load the full data set (training set + test set) and produce the final submission
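Below is a minimal sketch of how the offline and online modes might load the data (debug-mode sampling is shown in the next section). The function name get_all_click_df and the file name train_click_log.csv are assumptions based on the competition data layout, not code from the original notebook:

import pandas as pd

# Hedged sketch: load data according to the chosen mode
def get_all_click_df(data_path, offline=True):
    if offline:
        # offline mode: only the full training set (assumed file name)
        all_click = pd.read_csv(data_path + 'train_click_log.csv')
    else:
        # online mode: training set + test set
        trn_click = pd.read_csv(data_path + 'train_click_log.csv')
        tst_click = pd.read_csv(data_path + 'testA_click_log.csv')
        all_click = pd.concat([trn_click, tst_click], ignore_index=True)
    # drop duplicate clicks
    all_click = all_click.drop_duplicates(['user_id', 'click_article_id', 'click_timestamp'])
    return all_click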

  The recall layer section covers:

  • Data loading in the different modes
  • Loading the data needed by each recall path
  • Design of each recall path's algorithm
  • Merging the recall paths

Data loading in debug mode

import pandas as pd
import numpy as np
import pickle

#debug mode: sample training data for quick runs
def get_train_sample(path, samples=10000):
    #Read the click log and collect the unique user_id values
    all_click = pd.read_csv(path + 'testA_click_log.csv')
    all_user_ids = all_click.user_id.unique() 
    #Sample user ids without replacement, so no user is chosen twice
    sample_user_ids = np.random.choice(all_user_ids, size=samples, replace=False) 
    all_click = all_click[all_click['user_id'].isin(sample_user_ids)]
    #De-duplicate by (user_id, article, timestamp)
    all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp'])) 
    return all_click
#Get article attribute data
def get_item_info(path):
    item_info = pd.read_csv(path + 'articles.csv')
    item_info = item_info.rename(columns={'article_id': 'click_article_id'})
    return item_info
#Read the embedding content of the article
def get_item_emb_dict(path):
    item_emb = pd.read_csv(path + 'articles_emb.csv')
    #Keep only the embedding columns (drop the article_id column)
    item_emb_cols = [x for x in item_emb.columns if 'emb' in x]
    
    # Normalize: divide each vector by its L2 norm
    # axis=1: compute the norm row by row
    # keepdims: keep the original number of dimensions so broadcasting works
    item_emb_np = item_emb[item_emb_cols].values
    item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)
    
    #Zip the article ids and the normalized vectors into a dictionary
    item_emb_dict = dict(zip(item_emb['article_id'], item_emb_np))
    
    #Save the dictionary locally; read it back later with pickle.load()
    pickle.dump(item_emb_dict, open(path + 'item_content_emb.pkl', 'wb'))
    
    return item_emb_dict
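A possible way to call the three loaders above in debug mode. The variable names and the data_path value are illustrative, not from the original notebook:

# Illustrative usage
data_path = './data_raw/'          # assumed location of the competition csv files
all_click_df = get_train_sample(data_path, samples=10000)
item_info_df = get_item_info(data_path)
item_emb_dict = get_item_emb_dict(data_path)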

Recall strategy list

Listing the recall strategies first serves two purposes: to state the principle behind each recall method, and to figure out what data each strategy requires.
The recall strategies covered in this article:

  • YouTube DNN recall
  • Item-based collaborative filtering (ItemCF) for articles
  • User-based collaborative filtering (UserCF)

Key points of the YouTube DNN recall layer (the classification objective is sketched after the list below):
Reference: "Rereading the YouTube deep-learning recommendation system paper: every word is a gem, an amazing read"
Reference: "Interpretation of the word2vec word-vector model"

  1. Embed the videos with a word2vec-style method
  2. After feature engineering, concatenate the newly added features with the video embedding as input; every user's input feature length must be the same
  3. Train a feed-forward neural network with ReLU activations
  4. Cast recommendation as a classification problem: negative sampling speeds up the computation, and a softmax predicts the clicked item
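The classification view in point 4 can be written out explicitly. With u the user vector produced by the DNN and v_i the embedding of article i, the probability that the user clicks article i next is a softmax over the whole corpus, which negative sampling approximates during training:

P(i | u) = exp(u · v_i) / Σ_{j in corpus} exp(u · v_j)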

Key points of item-based collaborative filtering for articles (a weight sketch follows this list):

  1. Recall articles similar to the user's historical clicked articles
  2. When recalling similar articles, the following rules are applied:
    1. A weight for the click-order distance between the similar article and the historical clicked article
    2. A weight for the creation-time difference between the similar article and the historical clicked article
    3. A content-similarity weight (added only when an embedding similarity entry exists between the similar article and the historical clicked article)
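The implementation further below splits these rules across two places: itemcf_sim accumulates click-order and time weights into the similarity matrix, and item_based_recommend applies the content-similarity weight at recall time. In rough formula form, matching the code:

sim(i, j) += loc_weight * click_time_weight * created_time_weight / log(1 + number of articles the user clicked)
sim(i, j)  = sim(i, j) / sqrt(cnt(i) * cnt(j))     # final normalization by how many users clicked each article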

Key points of user-based collaborative filtering (a weight sketch follows this list):

  1. Recommend to the user the articles clicked by similar users
  2. When recalling those articles, the following rules are applied:
    1. Compute the similarity between the candidate article and the target user's historical clicked articles
    2. Compute the creation-time difference between the candidate article and the target user's historical clicked articles
    3. Compute the sum of the relative-position weights of the target user's historical clicked articles
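Similarly, the usercf_sim implementation below builds the user-to-user similarity from a user-activity weight only, and applies rules 2.1–2.3 later inside user_based_recommend. Roughly:

sim(u, v) += 100 * 0.5 * (activity(u) + activity(v)) / log(1 + number of users who clicked the article)
sim(u, v)  = sim(u, v) / sqrt(cnt(u) * cnt(v))      # final normalization by each user's click count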

From the requirements and rules above, you can roughly tell what data format is needed. The following sections prepare the data in that format.

Data preparation for each recall path

  1. To recall similar articles, you need an item-item similarity matrix
#Build each user's clicked-article sequence ordered by click time: {user1: [(item1, time1), (item2, time2)...], ...}
#Similarity is accumulated between pairs of different articles clicked by the same user; an article is never compared with itself
'''
When the loop reaches item1 of user1, the similarity between item1 and the other items clicked by that user is computed,
and the weights described above (2.1, 2.2, 2.3) are taken into account in the similarity calculation
'''
def get_user_item_time(click_df):
    
    click_df = click_df.sort_values('click_timestamp')
    
    def make_item_time_pair(df):
        return list(zip(df['click_article_id'], df['click_timestamp']))
    
    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(
        lambda x: make_item_time_pair(x)).reset_index().rename(columns={0: 'item_time_list'})
    user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))
    
    return user_item_time_dict

#Article creation time
def item_created_time(df):
    item_created_time_dict = dict(zip(df["click_article_id"] , df["created_at_ts"]))
    return item_created_time_dict
   
import math
from collections import defaultdict

#The tricky part: building the item-item similarity matrix
def itemcf_sim(df, item_created_time_dict):
    """
        Calculation of similarity matrix between articles
        :param df: data sheet
        :item_created_time_dict:  Dictionary of article creation time
        return : Similarity matrix between articles
    """
    
    user_item_time_dict = get_user_item_time(df)
    
    # Calculate item similarity
    i2i_sim = {}
    item_cnt = defaultdict(int)
    for user, item_time_list in user_item_time_dict.items():
        # each entry looks like user1: [(item1, time1), (item2, time2)...]
        # the user_id itself is not needed here; we work with the item/time list
        for loc1, (i, i_click_time) in enumerate(item_time_list):
            # loc1 is the item's position in the list; items are ordered by the user's click time
            item_cnt[i] += 1
            i2i_sim.setdefault(i, {}) # if i is not in the dict yet, insert it with value {}; otherwise keep the existing value
            for loc2 , (j , j_click_time) in enumerate(item_time_list):
                #Similarity is not calculated for the same items
                if i == j:
                    continue
                
                # Don't worry if the weight formulas look arbitrary; they encode the rules listed above
                # Forward-order clicks (j after i) get a higher weight than reverse-order clicks
                loc_alpha = 1.0 if loc2 > loc1 else 0.7
                # Position (click-order distance) weight
                loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1))
                # Click-time difference weight
                click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time))
                # Creation-time difference weight of the two articles
                created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))
                
                i2i_sim[i].setdefault(j, 0)
                # Combine the weights; dividing by log(list length + 1) down-weights very active users
                i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log(len(item_time_list) + 1)
                
    
    i2i_sim_ = i2i_sim.copy()
    for i, related_items in i2i_sim.items():
        for j, wij in related_items.items():
            i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])
            
    # The obtained similarity matrix is saved locally
    pickle.dump(i2i_sim_, open(path + 'itemcf_i2i_sim.pkl', 'wb'))
    
    return i2i_sim_

'''
The final i2i_sim_ result looks like this:
{195839: {191971: 0.17002324109946493,
  194300: 0.14028226677401606,
  166581: 0.004738731071607025,
  272143: 0.0030777894292628007,
  285298: 0.014267230873702937,
  39857: 0.03787405668573889,
  194717: 0.0762200687996933,
  233717: 0.010576962590031045,
  36399: 0.00891114964092883,
  195868: 0.0353609868198151,
  ...
  }
'''
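A possible call sequence for the two helpers above (item_info_df from get_item_info and all_click_df from the loading step are assumed to exist):

# Illustrative usage
item_created_time_dict = item_created_time(item_info_df)   # item_info_df carries created_at_ts after get_item_info
i2i_sim = itemcf_sim(all_click_df, item_created_time_dict)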
  2. Compute the user similarity matrix following the same logic as the item similarity matrix
#The logic is the same, but the weight calculation is different
#Get each article's clicked-by-user sequence {item1: [(user1, time1), (user2, time2)...], ...}
def get_item_user_time_dict(click_df):
    def make_user_time_pair(df):
        return list(zip(df['user_id'], df['click_timestamp']))
    
    click_df = click_df.sort_values('click_timestamp')
    item_user_time_df = click_df.groupby('click_article_id')[['user_id', 'click_timestamp']].apply(
        lambda x: make_user_time_pair(x)).reset_index().rename(columns={0: 'user_time_list'})
    
    item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list']))
    return item_user_time_dict


from sklearn.preprocessing import MinMaxScaler

#Compute each user's activity level, used as a weight
def get_user_activate_degree_dict(all_click_df):
    all_click_df_ = all_click_df.groupby('user_id')['click_article_id'].count().reset_index()
    
    # User activity normalization
    mm = MinMaxScaler()
    all_click_df_['click_article_id'] = mm.fit_transform(all_click_df_[['click_article_id']])
    user_activate_degree_dict = dict(zip(all_click_df_['user_id'], all_click_df_['click_article_id']))
    
    return user_activate_degree_dict


#The overall logic mirrors itemcf_sim
def usercf_sim(click_df, user_activate_degree_dict):
    """
        Compute the user-to-user similarity matrix
        :param click_df: the click-log DataFrame
        :param user_activate_degree_dict: dictionary of user activity levels
        return: user-to-user similarity matrix
    """
    item_user_time_dict = get_item_user_time_dict(click_df)
    
    u2u_sim = {}
    user_cnt = defaultdict(int)
    for item, user_time_list in item_user_time_dict.items():
        for u, click_time in user_time_list:
            user_cnt[u] += 1
            u2u_sim.setdefault(u, {})
            for v, click_time in user_time_list:
                u2u_sim[u].setdefault(v, 0)
                if u == v:
                    continue
                # The average user activity is used as the weight of activity
                activate_weight = 100 * 0.5 * (user_activate_degree_dict[u] + user_activate_degree_dict[v])   
                u2u_sim[u][v] += activate_weight / math.log(len(user_time_list) + 1)
    
    u2u_sim_ = u2u_sim.copy()
    for u, related_users in u2u_sim.items():
        for v, wij in related_users.items():
            u2u_sim_[u][v] = wij / math.sqrt(user_cnt[u] * user_cnt[v])
    
    # The obtained similarity matrix is saved locally
    pickle.dump(u2u_sim_, open(path + 'usercf_u2u_sim.pkl', 'wb'))

    return u2u_sim_

# %%time = 54.1 s

'''
The final u2u_sim_ result looks like this:
{203172: {203172: 0.0,
  238503: 0.14830159613688895,
  230262: 0.43069329645723503,
  228301: 0.1689511714607187,
  242286: 0.1329408251817319,
  245198: 0.15484630639404126,
  227057: 0.41426036891152124,
  238944: 0.07430792178193243,
  241560: 0.08650412384989443,
  228013: 0.08088853325519581,
  ...
  }
'''
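And a possible call for the user-side similarity (all_click_df as in the loading step above):

# Illustrative usage
user_activate_degree_dict = get_user_activate_degree_dict(all_click_df)
u2u_sim = usercf_sim(all_click_df, user_activate_degree_dict)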
  3. Data needed for the YouTube DNN recall

Every user's input must have the same length: each user's data is converted into a 1×N row, i.e. a single line.
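For intuition, this is what pad_sequences does (a tiny standalone example, not part of the original notebook):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# short sequences are padded with 0, long sequences are cut at the tail
pad_sequences([[1, 2], [3, 4, 5, 6]], maxlen=3, padding='post', truncating='post', value=0)
# -> [[1, 2, 0],
#     [3, 4, 5]]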

'''
Construct the YoutubeDNN input data:
 gen_data_set filters the data features, builds positive and negative samples, and outputs the training and test sets
 gen_model_input pads the data to a consistent length and outputs the model input and its labels
'''
import random

def gen_data_set(data, negsample=2):
    data.sort_values("click_timestamp", inplace=True)
    #Get the id of all clicked articles
    item_ids = data['click_article_id'].unique()

    train_set = [] #Training set
    test_set = [] #Test set
    #Each time, the user id and the user's action are traversed
    for reviewerID, hist in data.groupby('user_id'):
        #The user has clicked on the article list
        pos_list = hist['click_article_id'].tolist()
        #Negative sample
        if negsample > 0:
        	# Select negative samples from articles that the user has not read
            candidate_set = list(set(item_ids) - set(pos_list))  
            # For each positive sample, select n negative samples
            neg_list = np.random.choice(candidate_set,size=len(pos_list)*negsample,replace=True)  
            
        # If a user has only one click, put that single sample into both the training and test sets; otherwise the embedding learned for it would be missing
        if len(pos_list) == 1:
            train_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))
            test_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))
            
        # Sliding window constructs positive and negative samples to supplement training samples
        for i in range(1, len(pos_list)):
            hist = pos_list[:i]
            
            if i != len(pos_list) - 1:
            	# Positive sample [user_id, his_item, pos_item, label, len(his_item)]
            	# Description [user id, historical clicked articles, last clicked articles, label 1 or 0, number of historical clicked articles]
                train_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))  
                for negi in range(negsample):
               		 # Negative sample [user_id, his_item, neg_item, label, len(his_item)]
                    train_set.append((reviewerID, hist[::-1], neg_list[i*negsample+negi], 0,len(hist[::-1]))) 
            else:
                # Use the last click, with the full (longest) history, as the test sample
                test_set.append((reviewerID, hist[::-1], pos_list[i],1,len(hist[::-1])))
                
    # Shuffle the samples
    random.shuffle(train_set)
    random.shuffle(test_set)
    
    return train_set, test_set

#Build the model input
from tensorflow.keras.preprocessing.sequence import pad_sequences

def gen_model_input(train_set, user_profile, seq_max_len):

    train_uid = np.array([line[0] for line in train_set])
    train_seq = [line[1] for line in train_set]
    train_iid = np.array([line[2] for line in train_set])
    train_label = np.array([line[3] for line in train_set])
    train_hist_len = np.array([line[4] for line in train_set])

    # Pad the click sequences so that every user has the same length (seq_max_len)
    # value=0 is used as the padding token
    train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0)
    train_model_input = {"user_id": train_uid, "click_article_id": train_iid, "hist_article_id": train_seq_pad,
                         "hist_len": train_hist_len}

    return train_model_input, train_label
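A possible way to wire the two functions together outside of youtubednn_u2i_dict (variable names are illustrative; SEQ_LEN mirrors the value used in the function below):

# Illustrative usage
SEQ_LEN = 30
train_set, test_set = gen_data_set(all_click_df, negsample=0)
user_profile = all_click_df[["user_id"]].drop_duplicates('user_id')
train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)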

The tricky part: the full YouTube DNN recall function

# Imports assumed for the recall function below (deepctr / deepmatch / faiss / tensorflow environment)
import collections
import faiss
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Model
from deepctr.feature_column import SparseFeat, VarLenSparseFeat
from deepmatch.models import YoutubeDNN
from deepmatch.utils import sampledsoftmaxloss

def youtubednn_u2i_dict(data, topk=20):    
    sparse_features = ["click_article_id", "user_id"]
    SEQ_LEN = 30 # length of the user click sequence: short sequences are padded, long ones are truncated
    
    user_profile_ = data[["user_id"]].drop_duplicates('user_id')
    item_profile_ = data[["click_article_id"]].drop_duplicates('click_article_id')  
    
    # Category code
    features = ["click_article_id", "user_id"]
    feature_max_idx = {}
    
    for feature in features:
        lbe = LabelEncoder()
        data[feature] = lbe.fit_transform(data[feature])
        feature_max_idx[feature] = data[feature].max() + 1
    
    # Extract the user and item profiles; which additional features to include deserves further analysis
    user_profile = data[["user_id"]].drop_duplicates('user_id')
    item_profile = data[["click_article_id"]].drop_duplicates('click_article_id')  
    
    user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id']))
    item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id']))
    
    # Partition training and test sets
    # Because the amount of data required for deep learning is usually very large, in order to ensure the effect of recall, the training samples are often expanded in the form of sliding window
    train_set, test_set = gen_data_set(data, 0)
    # Sort out the input data. See the above functions for specific operations
    train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
    test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)
    
    # Determine the dimension of Embedding
    embedding_dim = 16
    
    # Organize the data into a form that the model can input directly
    user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),
                            VarLenSparseFeat(SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim,
                                                        embedding_name="click_article_id"), SEQ_LEN, 'mean', 'hist_len'),]
    item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)]
    
    # Definition of model 
    # num_sampled: number of samples in negative sampling
    # YouTube DNN from deepmatch
    model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, embedding_dim))
    # Model compilation
    model.compile(optimizer="adam", loss=sampledsoftmaxloss)  
    
    # For model training, the proportion of validation set can be defined here. If it is set to 0, it means that the full amount of data is trained directly
    history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0)
    
    # After training, extract the learned embeddings on both the user side and the item side
    test_user_model_input = test_model_input
    all_item_model_input = {"click_article_id": item_profile['click_article_id'].values}

    user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
    item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)
    
    # Save the current item_embedding and user_embedding; they may be used in the ranking stage. Note that they must be keyed by the original raw ids when saved
    user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)
    item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)
    
    # embedding normalize before saving
    user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)
    item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    
    # Convert Embedding into a dictionary to facilitate query
    raw_user_id_emb_dict = {user_index_2_rawid[k]: \
                                v for k, v in zip(user_profile['user_id'], user_embs)}
    raw_item_id_emb_dict = {item_index_2_rawid[k]: \
                                v for k, v in zip(item_profile['click_article_id'], item_embs)}
    # Save Embedding locally
    pickle.dump(raw_user_id_emb_dict, open(path + 'user_youtube_emb.pkl', 'wb'))
    pickle.dump(raw_item_id_emb_dict, open(path + 'item_youtube_emb.pkl', 'wb'))
    
    # Use faiss to search, for each user_embedding, the topk items with the highest inner-product similarity
    index = faiss.IndexFlatIP(embedding_dim)
    # The vectors were already normalized above, so there is no need to normalize them again here
#     faiss.normalize_L2(user_embs)
#     faiss.normalize_L2(item_embs)
    index.add(item_embs) # index the item vectors
    sim, idx = index.search(np.ascontiguousarray(user_embs), topk) # for each user, query the topk most similar items
    
    user_recall_items_dict = collections.defaultdict(dict)
    for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):
        target_raw_id = user_index_2_rawid[target_idx]
        # Start from index 1 to drop the article itself, so only topk-1 similar articles remain
        for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): 
            rele_raw_id = item_index_2_rawid[rele_idx]
            user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {})\
                                                                    .get(rele_raw_id, 0) + sim_value
            
    # Sort each user's recalled items by score
    user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in user_recall_items_dict.items()}
    
    # Save the recall results
    # Here the recall results come directly from vector search; the ItemCF/UserCF methods above only produce i2i and u2u
    # similarity matrices, and a collaborative-filtering recall step is still needed to turn them into recall results
    # These results can be evaluated directly; for convenience, one evaluation function can score every recall result
    pickle.dump(user_recall_items_dict, open(path + 'youtube_u2i_dict.pkl', 'wb'))
    return user_recall_items_dict

Recall of each path

For each of the three recall methods, the recall effect needs to be evaluated, so an evaluation function metrics_recall is added.

# Evaluate the hit rate within the top 10, 20, ..., topk recalled articles in turn (call with topk >= 10)
def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=5):
    last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id']))
    user_num = len(user_recall_items_dict)
    
    for k in range(10, topk+1, 10):
        hit_num = 0
        for user, item_list in user_recall_items_dict.items():
            # Get the results of the first k recalls
            tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]]
            if last_click_item_dict[user] in set(tmp_recall_items):
                hit_num += 1
        
        hit_rate = round(hit_num * 1.0 / user_num, 5)
        print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num)

# Get the historical click and last click of the current data
def get_hist_and_last_click(all_click):
    
    all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])
    click_last_df = all_click.groupby('user_id').tail(1)

    # If a user has only one click, hist would be empty and the user would be invisible during training;
    # in that case the single click is kept by default (accepting a small amount of leakage)
    def hist_func(user_df):
        if len(user_df) == 1:
            return user_df
        else:
            return user_df[:-1]

    click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)

    return click_hist_df, click_last_df

#Get hot items
def get_item_topk_click(click_df, k):
    topk_click = click_df['click_article_id'].value_counts().index[:k]
    return topk_click
#Youtube DNN recall
# all_click_df is the click log loaded above; user_multi_recall_dict collects each path's results (e.g. initialized as an empty dict)
trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)
user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)
# Recall effect evaluation
metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)

#itemCF recall
def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim):
    """
        Recall based on article collaborative filtering
        :param user_id: user id
        :param user_item_time_dict: Dictionaries, Get the user's click article sequence according to the click time   {user1: [(item1, time1), (item2, time2)..]...}
        :param i2i_sim: Dictionary, article, similarity matrix
        :param sim_item_topk: Integer, select the first integer that is most similar to the current article k Article
        :param recall_item_num: Integer, number of last recalled articles
        :param item_topk_click: List, the list of articles with the most clicks, and users recall and complete it
        :param emb_i2i_sim: Dictionary content based embedding Calculated article similarity matrix
        
        return: List of recalled articles [(item1, score1), (item2, score2)...]
    """
    # Get articles of user history interaction
    user_hist_items = user_item_time_dict[user_id]
    user_hist_items_ = {user_id for user_id, _ in user_hist_items}
    
    item_rank = {}
    for loc, (i, click_time) in enumerate(user_hist_items):
        for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:
            if j in user_hist_items_:
                continue
            
            # Article creation time difference weight
            created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))
            # Position weight of the historical article within the user's click sequence
            loc_weight = (0.9 ** (len(user_hist_items) - loc))
            
            content_weight = 1.0
            if emb_i2i_sim.get(i, {}).get(j, None) is not None:
                content_weight += emb_i2i_sim[i][j]
            if emb_i2i_sim.get(j, {}).get(i, None) is not None:
                content_weight += emb_i2i_sim[j][i]
                
            item_rank.setdefault(j, 0)
            item_rank[j] += created_time_weight * loc_weight * content_weight * wij
    
    # If fewer than recall_item_num articles were recalled, pad with popular articles
    if len(item_rank) < recall_item_num:
        for i, item in enumerate(item_topk_click):
            if item in item_rank: # the padded article must not already be in the recall list
                continue
            item_rank[item] = - i - 100 # any negative score works; padded articles simply rank last
            if len(item_rank) == recall_item_num:
                break
    
    item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]
        
    return item_rank



user_recall_items_dict = collections.defaultdict(dict)
user_item_time_dict = get_user_item_time(trn_hist_click_df)
# item_created_time_dict and emb_i2i_sim are assumed to have been built or loaded earlier
i2i_sim = pickle.load(open(path + 'emb_i2i_sim.pkl','rb'))

sim_item_topk = 20
recall_item_num = 10

item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)

for user in trn_hist_click_df['user_id'].unique():
    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, 
                                                        recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)
    
user_multi_recall_dict['embedding_sim_item_recall'] = user_recall_items_dict
pickle.dump(user_multi_recall_dict['embedding_sim_item_recall'], open(path + 'embedding_sim_item_recall.pkl', 'wb'))
#Recall evaluation
metrics_recall(user_multi_recall_dict['embedding_sim_item_recall'], trn_last_click_df, topk=recall_item_num)
#UserCF
def user_based_recommend(user_id, user_item_time_dict, u2u_sim, sim_user_topk, recall_item_num, 
                         item_topk_click, item_created_time_dict, emb_i2i_sim):
    """
        Recall based on article collaborative filtering
        :param user_id: user id
        :param user_item_time_dict: Dictionaries, Get the user's click article sequence according to the click time   {user1: [(item1, time1), (item2, time2)..]...}
        :param u2u_sim: Dictionary, article, similarity matrix
        :param sim_user_topk: Integer, select the first integer that is most similar to the current user k Users
        :param recall_item_num: Integer, number of last recalled articles
        :param item_topk_click: List, the list of articles with the most clicks, and users recall and complete it
        :param item_created_time_dict: Article creation time list
        :param emb_i2i_sim: Dictionary content based embedding Calculated article similarity matrix
        
        return: List of recalled articles [(item1, score1), (item2, score2)...]
    """
    # Historical interaction
    user_item_time_list = user_item_time_dict[user_id]    #  [(item1, time1), (item2, time2)..]
    user_hist_items = set([i for i, t in user_item_time_list])   # There are multiple interactions between a user and an article, which must be repeated here
    
    items_rank = {}
    for sim_u, wuv in sorted(u2u_sim[user_id].items(), key=lambda x: x[1], reverse=True)[:sim_user_topk]:
        for i, click_time in user_item_time_dict[sim_u]:
            if i in user_hist_items:
                continue
            items_rank.setdefault(i, 0)
            
            loc_weight = 1.0
            content_weight = 1.0
            created_time_weight = 1.0
            
            # Weight the candidate article against every article in the target user's click history
            for loc, (j, click_time) in enumerate(user_item_time_list):
                # Relative position weight when clicking
                loc_weight += 0.9 ** (len(user_item_time_list) - loc)
                # Content similarity weight
                if emb_i2i_sim.get(i, {}).get(j, None) is not None:
                    content_weight += emb_i2i_sim[i][j]
                if emb_i2i_sim.get(j, {}).get(i, None) is not None:
                    content_weight += emb_i2i_sim[j][i]
                
                # Create time difference weights
                created_time_weight += np.exp(0.8 * np.abs(item_created_time_dict[i] - item_created_time_dict[j]))
                
            items_rank[i] += loc_weight * content_weight * created_time_weight * wuv
        
    # Pad with popular articles if needed
    if len(items_rank) < recall_item_num:
        for i, item in enumerate(item_topk_click):
            if item in items_rank: # the padded article must not already be in the recall list
                continue
            items_rank[item] = - i - 100 # any negative score works; padded articles simply rank last
            if len(items_rank) == recall_item_num:
                break
        
    items_rank = sorted(items_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]    
    
    return items_rank
user_recall_items_dict = collections.defaultdict(dict)
user_item_time_dict = get_user_item_time(trn_hist_click_df)

u2u_sim = pickle.load(open(path + 'usercf_u2u_sim.pkl', 'rb'))

sim_user_topk = 20
recall_item_num = 10
item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)

for user in tqdm(trn_hist_click_df['user_id'].unique()):
    user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \
                                                        recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)    

pickle.dump(user_recall_items_dict, open(path + 'usercf_u2u2i_recall.pkl', 'wb'))
#Recall evaluation
metrics_recall(user_recall_items_dict, trn_last_click_df, topk=recall_item_num)

Merging the recall results

def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25):
    final_recall_items_dict = {}
    
    # Normalize each recall result per user so that, below, the scores of the same user's items can be weighted and summed across recall paths
    def norm_user_recall_items_sim(sorted_item_list):
        # If there are no articles or only one (for example, a cold-start recall returned too few articles and
        # rule-based filtering removed the rest), return the list directly; other filtering strategies could also be applied here
        if len(sorted_item_list) < 2:
            return sorted_item_list
        
        min_sim = sorted_item_list[-1][1]
        max_sim = sorted_item_list[0][1]
        
        norm_sorted_item_list = []
        for item, score in sorted_item_list:
            if max_sim > 0:
                norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 1.0
            else:
                norm_score = 0.0
            norm_sorted_item_list.append((item, norm_score))
            
        return norm_sorted_item_list
    
    print('Multiple recall merging...')
    for method, user_recall_items in tqdm(user_multi_recall_dict.items()):
        print(method + '...')
        # A weight can also be assigned to each recall path when computing the final result
        if weight_dict is None:
            recall_method_weight = 1
        else:
            recall_method_weight = weight_dict[method]
        
        for user_id, sorted_item_list in user_recall_items.items(): # Normalization
            user_recall_items[user_id] = norm_user_recall_items_sim(sorted_item_list)
        
        for user_id, sorted_item_list in user_recall_items.items():
            # print('user_id')
            final_recall_items_dict.setdefault(user_id, {})
            for item, score in sorted_item_list:
                final_recall_items_dict[user_id].setdefault(item, 0)
                final_recall_items_dict[user_id][item] += recall_method_weight * score  
    
    final_recall_items_dict_rank = {}
    # In case of multi-channel recall, the final recall quantity can also be controlled
    for user, recall_item_dict in final_recall_items_dict.items():
        final_recall_items_dict_rank[user] = sorted(recall_item_dict.items(), key=lambda x: x[1], reverse=True)[:topk]

    # Save the final result dictionary after multi-channel recall to local
    pickle.dump(final_recall_items_dict_rank, open(os.path.join(path, 'final_recall_items_dict.pkl'),'wb'))

    return final_recall_items_dict_rank
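A possible call, assuming every recall path's result has been stored in user_multi_recall_dict under these keys; the weight values here are illustrative and not tuned:

# Illustrative usage
weight_dict = {'youtubednn_recall': 1.0,
               'embedding_sim_item_recall': 1.0,
               'usercf_u2u2i_recall': 1.0}
final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, weight_dict, topk=25)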

Reference:
Tianchi zero-foundation introduction to recommendation systems
