## Preface

There are many recommendation algorithms, and one of the most basic is collaborative filtering. Some time ago I became interested in KTV data: when people go to sing, they tend to pick the same familiar songs, so is there a way to suggest songs that broaden what they sing? A KTV recommender could take many factors into account, such as the singer's vocal range, age, region, and preferences. This first version of the algorithm recommends songs purely from the item-based perspective. Because this is a personal-interest project, it has no iterative model-feedback loop; interested readers can implement that themselves.

## Collaborative filtering algorithm

Collaborative filtering, also called behavioral-similarity recall, is essentially a similarity calculation based on co-occurrence. The item-based (ItemBase) collaborative filtering algorithm rests on a few key concepts:

### Similarity calculation

There are many similarity measures: co-occurrence similarity, Euclidean distance, the Pearson correlation coefficient, and so on. Co-occurrence similarity is used here, with the following formula:

$$
w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}
$$

where $N(i)$ is the set of users who like song i, $N(j)$ is the set of users who like song j, and the numerator is the number of users who like both i and j. This is the improved form of the formula: $|N(j)|$ is included in the denominator to penalize the similarity of very popular songs. I won't go into the details here.
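
As a small illustration of the similarity measure, the sketch below assumes the form |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|), which is what the implementation later in this article computes. The function name `cooccurrence_similarity` and the user sets are hypothetical, chosen only for the example.

```python
import math


def cooccurrence_similarity(users_i, users_j):
    """Similarity of two songs, given the sets of users who like each one."""
    if not users_i or not users_j:
        # A song nobody likes has no meaningful similarity
        return 0.0
    both = len(users_i & users_j)
    return both / math.sqrt(len(users_i) * len(users_j))


# Hypothetical data: three users like song i, two of them also like song k
likes_i = {"A", "B", "C"}
likes_k = {"A", "B"}
sim_ik = cooccurrence_similarity(likes_i, likes_k)
```

Because the denominator grows with the popularity of both songs, a song that almost everyone likes does not automatically look similar to everything else.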

### ItemBase and UserBase

UserBase

Find users with similar interests, then recommend to the target user the songs those similar users like. In the table, users A and C both like songs i and k, so the two users are similar, and song l, which C likes, is recommended to A. In co-occurrence terms, A and C are similar because they co-occur on songs i and k. The full calculation also involves scoring, ranking, and aggregating data from similar users; only an overview is given here.

Users / songs | Song i | Song j | Song k | Song l |
---|---|---|---|---|
User A | 1 | | 1 | Recommend |
User B | 1 | | | |
User C | 1 | | 1 | 1 |
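
The UserBase idea can be sketched on a toy data set. This is only an illustration under assumptions: the `likes` dictionary follows the description above (A and C both like i and k, C also likes l), user B's single like is assumed, and the helper names are hypothetical. It omits the scoring and aggregation details that a full implementation would need.

```python
import math

# Hypothetical likes: user -> set of songs
likes = {"A": {"i", "k"}, "B": {"i"}, "C": {"i", "k", "l"}}


def user_similarity(u, v):
    """Co-occurrence similarity between two users' song sets."""
    return len(likes[u] & likes[v]) / math.sqrt(len(likes[u]) * len(likes[v]))


def recommend_user_based(target):
    """Score songs the target has not sung by the similarity of users who have."""
    scores = {}
    for other in likes:
        if other == target:
            continue
        sim = user_similarity(target, other)
        for song in likes[other] - likes[target]:
            scores[song] = scores.get(song, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recs_for_a = recommend_user_based("A")
```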

ItemBase

ItemBase works much like UserBase, except that similarity is computed over the song matrix to find similar songs, which are then recommended according to each user's history. The general principle is shown in the following table: songs i and k are both liked by users A and B, so i and k are similar. If user C likes song i, C should also like the similar song k.

Users / songs | Song i | Song j | Song k |
---|---|---|---|
User A | 1 | | 1 |
User B | 1 | 1 | 1 |
User C | 1 | | Recommend |

This article uses ItemBase.
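
To make the ItemBase principle concrete, here is a minimal sketch on toy data that follows the description above (A likes i and k, B likes i, j, and k, C likes only i). The names `likes`, `fans`, and `recommend_item_based` are hypothetical and not part of the implementation later in this article.

```python
import math
from itertools import combinations

# Hypothetical likes: user -> set of songs
likes = {"A": {"i", "k"}, "B": {"i", "j", "k"}, "C": {"i"}}

# Invert the mapping: song -> set of users who like it
fans = {}
for user, songs in likes.items():
    for song in songs:
        fans.setdefault(song, set()).add(user)

# Pairwise song similarity: |N(a) ∩ N(b)| / sqrt(|N(a)| * |N(b)|)
sim = {}
for a, b in combinations(sorted(fans), 2):
    value = len(fans[a] & fans[b]) / math.sqrt(len(fans[a]) * len(fans[b]))
    sim[(a, b)] = sim[(b, a)] = value


def recommend_item_based(user):
    """Score unseen songs by their similarity to the songs the user likes."""
    scores = {}
    for liked in likes[user]:
        for candidate in fans:
            if candidate in likes[user]:
                continue
            scores[candidate] = scores.get(candidate, 0.0) + sim[(liked, candidate)]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


recs_for_c = recommend_item_based("C")
```

On this data, song k ranks above song j for user C, because k shares two listeners with i while j shares only one.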

### Algorithm implementation

Build the one-hot matrix of users to songs

- Deduplicate songs and sort by song name
- Build the conversion dictionaries between songs and indexes

Compute the song-to-song co-occurrence matrix

- Count the co-occurrences of each pair of songs
- Count the number of occurrences of each single song
- Compute the co-occurrence rate (similarity) with the formula above

Recommend

- Suppose a user likes song i
- Look up song i in the similarity matrix to get the recommended song k

## Code implementation

### Get data

```python
import re

import elasticsearch
import elasticsearch.helpers
import numpy as np


def trim_song_name(song_name):
    """Clean a song name: strip whitespace and drop bracketed annotations."""
    song_name = song_name.strip()
    # Remove "[...]", full-width "（...）" and "(...)" annotations, each optionally
    # preceded by "-". Brackets are escaped so they match literally; the
    # full-width variant is an assumption about the duplicated pattern.
    song_name = re.sub(r"-?\[.*?\]", "", song_name)
    song_name = re.sub(r"-?（.*?）", "", song_name)
    song_name = re.sub(r"-?\(.*?\)", "", song_name)
    return song_name


def get_data(size=0):
    """Return a dictionary of uid => list of work (song) names.

    If size > 0, stop after reading that many users.
    """
    cur_size = 0
    ret = {}
    es_client = elasticsearch.Elasticsearch()
    search_result = elasticsearch.helpers.scan(
        es_client,
        index="ktv_works",
        doc_type="ktv_works",
        scroll="10m",
        query={},
    )
    for hit_item in search_result:
        cur_size += 1
        if size > 0 and cur_size > size:
            break
        item = hit_item["_source"]
        work_list = item["item_list"]
        ret[item["uid"]] = [trim_song_name(work["songname"]) for work in work_list]
    return ret


def get_uniq_song_sort_list(song_dict):
    """Merge duplicate songs and sort by song name."""
    return sorted({song for songs in song_dict.values() for song in songs})
```
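
The later code relies on `song_count`, `song_to_int`, `int_to_song`, and `user_song_one_hot_matrix`, which are not shown being built. The sketch below is one possible way to construct them, assuming the uid-to-song-list dictionary shape that `get_data()` returns. The `song_dict` contents and the `user_ids` ordering are hypothetical stand-ins so the example runs without Elasticsearch.

```python
import numpy as np

# Hypothetical stand-in for get_data(): uid -> list of cleaned song names
song_dict = {
    "u1": ["Song A", "Song B"],
    "u2": ["Song A"],
    "u3": ["Song B", "Song C"],
}

# Deduplicate songs and sort by name
uniq_songs = sorted({song for songs in song_dict.values() for song in songs})
song_count = len(uniq_songs)

# Conversion dictionaries between song names and column indexes
song_to_int = {song: idx for idx, song in enumerate(uniq_songs)}
int_to_song = {idx: song for song, idx in song_to_int.items()}

# One row per user, one column per song; 1 means the user sang the song
user_ids = sorted(song_dict)
user_song_one_hot_matrix = np.zeros((len(user_ids), song_count))
for row, uid in enumerate(user_ids):
    for song in song_dict[uid]:
        user_song_one_hot_matrix[row, song_to_int[song]] = 1
```

Assigning 1 rather than incrementing keeps the matrix strictly binary even if a user sang the same song more than once.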

### Similarity calculation

```python
import math

# user_song_one_hot_matrix (users x songs) and song_count come from the
# one-hot step of the algorithm.

# Co-occurrence matrix: entry (i, j) is the number of users who sang both i and j
col_show_count_matrix = np.zeros((song_count, song_count))
for i in range(song_count):
    for j in range(i):
        # The matrix is symmetric, so only the lower triangle is computed
        one_trik_matrix = np.zeros(song_count)
        one_trik_matrix[i] = 1
        one_trik_matrix[j] = 1
        # A user who sang both songs gets a dot product of exactly 2
        ret_m = user_song_one_hot_matrix.dot(one_trik_matrix)
        col_show_value = int(np.sum(ret_m == 2))
        col_show_count_matrix[i, j] = col_show_value
        col_show_count_matrix[j, i] = col_show_value

# Song count N(i): number of users who sang each song
song_count_matrix = np.zeros(song_count)
for i in range(song_count):
    song_col = user_song_one_hot_matrix[:, i]
    song_count_matrix[i] = int(np.sum(song_col >= 1))

# Similarity matrix
col_show_rate_matrix = np.zeros((song_count, song_count))
for i in range(song_count):
    for j in range(i):
        # Similarity: |N(i) ∩ N(j)| / sqrt(|N(i)| * |N(j)|)
        rate_value = col_show_count_matrix[i, j] / math.sqrt(
            song_count_matrix[i] * song_count_matrix[j]
        )
        col_show_rate_matrix[i, j] = rate_value
        col_show_rate_matrix[j, i] = rate_value
```
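
The nested loops above do one matrix-vector product per song pair, which gets slow as the song count grows. As an alternative sketch, the same similarity values can be obtained with a single matrix product, assuming the one-hot matrix contains only 0 and 1. The small input matrix here is hypothetical; the variable names mirror the ones used above.

```python
import numpy as np

# Hypothetical one-hot matrix: rows are users, columns are songs
user_song_one_hot_matrix = np.array(
    [
        [1, 0, 1],
        [1, 1, 1],
        [1, 0, 0],
    ],
    dtype=float,
)

# Entry (i, j) counts users who sang both songs; the diagonal holds N(i)
col_show_count_matrix = user_song_one_hot_matrix.T @ user_song_one_hot_matrix
song_count_matrix = np.diag(col_show_count_matrix).copy()

# Divide by sqrt(N(i) * N(j)), leaving 0 where a song has no listeners
norm = np.sqrt(np.outer(song_count_matrix, song_count_matrix))
col_show_rate_matrix = np.divide(
    col_show_count_matrix,
    norm,
    out=np.zeros_like(col_show_count_matrix),
    where=norm > 0,
)

# The loop version leaves the diagonal at 0, so do the same here
np.fill_diagonal(col_show_rate_matrix, 0.0)
```

Unlike the loop version, this co-occurrence matrix keeps N(i) on its diagonal; only the similarity matrix has its diagonal reset, so a song is never recommended as similar to itself.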

### Recommend

```python
def get_songs_from_recommand(col_recommand_matrix):
    """Return (song name, score) pairs for every song with a positive score."""
    return [
        (int_to_song[k], r_value)
        for k, r_value in enumerate(col_recommand_matrix)
        if r_value > 0
    ]


input_song = "Decade"

# Build the one-hot query vector for the input song
one_trik_matrix = np.zeros(song_count)
one_trik_matrix[song_to_int[input_song]] = 1

# Multiplying by the similarity matrix selects the input song's similarity column
col_recommand_matrix = col_show_rate_matrix.dot(one_trik_matrix)
recommand_array = get_songs_from_recommand(col_recommand_matrix)

# Sort by similarity, highest first, and print the recommended results
sorted_x = sorted(recommand_array, key=lambda k: k[1], reverse=True)
print(sorted_x)
```
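
The code above recommends from a single input song. A possible extension, sketched here under assumptions, sums the similarity columns of several liked songs, excludes songs the user already sings, and keeps the top N. The function `recommend_top_n` and the small similarity matrix are hypothetical and only illustrate the idea.

```python
import numpy as np

# Hypothetical index mapping and similarity matrix for four songs
int_to_song = {0: "Song A", 1: "Song B", 2: "Song C", 3: "Song D"}
song_to_int = {name: idx for idx, name in int_to_song.items()}
col_show_rate_matrix = np.array(
    [
        [0.0, 0.5, 0.2, 0.0],
        [0.5, 0.0, 0.4, 0.1],
        [0.2, 0.4, 0.0, 0.3],
        [0.0, 0.1, 0.3, 0.0],
    ]
)


def recommend_top_n(liked_songs, n=2):
    """Return up to n (song, score) pairs, best first, for a list of liked songs."""
    liked_idx = [song_to_int[s] for s in liked_songs]
    # Sum each candidate's similarity to every liked song
    scores = col_show_rate_matrix[:, liked_idx].sum(axis=1)
    # Never recommend a song the user already sings
    scores[liked_idx] = 0.0
    order = np.argsort(-scores, kind="stable")
    return [(int_to_song[int(k)], float(scores[k])) for k in order[:n] if scores[k] > 0]


top = recommend_top_n(["Song A", "Song B"])
```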

### Result

[('Sansheng Sanshi', 0.5773502691896258), ('see you at the next intersection', 0.5773502691896258), ('love without breaking up', 0.5773502691896258),...]