# KTV song recommendation - Collaborative Filtering

## Preface

There are many recommendation algorithms, and the most basic one is collaborative filtering. Some time ago I became interested in KTV data: when people go to sing, they tend to pick only the songs they already know. Is there a way to give them suggestions that broaden their repertoire? A KTV recommender could take many factors into account, such as the singer's vocal range, age, region, and preferences. This first version of the algorithm recommends songs purely from an item-based perspective. Because this is a personal-interest project, there is no iterative model-feedback loop; interested readers can implement one themselves.

## Collaborative filtering algorithm

Collaborative filtering, also called behavioral-similarity recall, is essentially a similarity calculation based on co-occurrence. The item-based (ItemBase) collaborative filtering algorithm rests on a few key concepts:

### Similarity calculation

There are many similarity measures: co-occurrence similarity, Euclidean distance, the Pearson correlation coefficient, and so on. Co-occurrence similarity is used here, with the following formula:

$$
w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)| \, |N(j)|}}
$$

Here $N(i)$ is the set of users who like song $i$, so $|N(i)|$ is the number of such users; $N(j)$ is defined the same way for song $j$, and the numerator is the number of users who like both $i$ and $j$. This is the improved form of the formula: $|N(j)|$ is included in the denominator to penalize popular songs, which would otherwise look similar to everything. I won't go into details here.
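As a minimal sketch of the co-occurrence similarity described above (the number of users who like both songs, divided by the square root of the product of the two songs' user counts, as in the implementation later in this article), the function below works on plain sets of user IDs. The name `cooccurrence_similarity` and the sample users are hypothetical and only serve the illustration.

```python
import math


def cooccurrence_similarity(users_i, users_j):
    """Similarity of two songs, given the sets of users who like each one."""
    if not users_i or not users_j:
        return 0.0  # a song nobody likes has no meaningful similarity
    both = len(users_i & users_j)  # users who like both songs
    return both / math.sqrt(len(users_i) * len(users_j))


# Hypothetical data: song i is liked by A, B and C; song k by A and B.
sim_ik = cooccurrence_similarity({"A", "B", "C"}, {"A", "B"})
print(round(sim_ik, 4))  # 2 / sqrt(3 * 2)
```

Two songs liked by exactly the same users score 1, and songs with no user in common score 0.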

### ItemBase and UserBase

UserBase

Find users with similar interests, then recommend to the target user the songs that those similar users like. Expressed in terms of co-occurrence: in the table below, users A and C both like songs i and k, so the two users are similar, and song l, which C likes, is recommended to A. The detailed calculation involves scoring and ranking users and aggregating the data of similar users; only an overview is given here.

| Users / songs | Song i | Song j | Song k | Song l |
| --- | --- | --- | --- | --- |
| User A | 1 | | 1 | Recommend |
| User B | | 1 | | |
| User C | 1 | | 1 | 1 |
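The UserBase idea in the table above can be sketched in a few lines. This is only an illustration under simplifying assumptions: user similarity is taken to be the raw number of shared songs, and the names `likes` and `recommend_user_based` are hypothetical.

```python
# Likes taken from the UserBase table above.
likes = {
    "A": {"i", "k"},
    "B": {"j"},
    "C": {"i", "k", "l"},
}


def recommend_user_based(target, likes):
    """Score songs the target has not sung by the overlap with other users."""
    scores = {}
    for other, songs in likes.items():
        if other == target:
            continue
        overlap = len(likes[target] & songs)  # songs both users like
        if overlap == 0:
            continue  # no shared taste, so this user contributes nothing
        for song in songs - likes[target]:  # songs new to the target
            scores[song] = scores.get(song, 0) + overlap
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


user_recs = recommend_user_based("A", likes)
print(user_recs)  # [('l', 2)]
```

User C shares two songs with A, so C's remaining song l is the only recommendation; B shares nothing with A and is ignored.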

ItemBase

ItemBase works like UserBase, except that similarity is computed between songs: we use the song matrix to find similar songs and then recommend them according to each user's history. The table below shows the general idea. Songs i and k are both liked by users A and B, so i and k are similar. Since user C likes song i, C will probably also like the similar song k.

| Users / songs | Song i | Song j | Song k |
| --- | --- | --- | --- |
| User A | 1 | | 1 |
| User B | 1 | 1 | 1 |
| User C | 1 | | Recommend |

The rest of this article uses ItemBase.
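To make the ItemBase table concrete, here is a small sketch that scores candidate songs for user C with the co-occurrence similarity. The helper names `song_users` and `recommend_item_based` are hypothetical, and the code uses plain dictionaries rather than the matrices used in the implementation further down.

```python
import math

# Likes taken from the ItemBase table above.
item_likes = {
    "A": {"i", "k"},
    "B": {"i", "j", "k"},
    "C": {"i"},
}


def song_users(likes):
    """Invert user -> songs into song -> set of users."""
    inverted = {}
    for user, songs in likes.items():
        for song in songs:
            inverted.setdefault(song, set()).add(user)
    return inverted


def recommend_item_based(target, likes):
    """Sum, per candidate song, its similarity to each song the target likes."""
    users_of = song_users(likes)
    scores = {}
    for liked in likes[target]:
        for candidate, users in users_of.items():
            if candidate in likes[target]:
                continue  # skip songs the target already sings
            both = len(users_of[liked] & users)
            if both:
                sim = both / math.sqrt(len(users_of[liked]) * len(users))
                scores[candidate] = scores.get(candidate, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


item_recs = recommend_item_based("C", item_likes)
print(item_recs)  # song k ranks ahead of song j
```

Song k shares two users with song i while song j shares only one, so k is ranked first, matching the "Recommend" cell in the table.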

### Algorithm implementation

Get the one-hot matrix of users to songs

• De-duplicate the songs and sort them by song name
• Build the conversion dictionaries between songs and indexes

Calculate the song-to-song co-occurrence matrix

• Calculate the co-occurrence matrix
• Count the number of occurrences of each single song
• Calculate the co-occurrence rate with the similarity formula

Recommend

If a user likes song i, the recommended song is the similar song k.

## Code implementation

### Get data

```python
import re

import elasticsearch
import elasticsearch.helpers
import numpy as np  # used by the matrix code in the following sections


def trim_song_name(song_name):
    """Process a song name: strip whitespace and bracketed annotations."""
    song_name = song_name.strip()
    # Brackets must be escaped: unescaped, "[.*?]" is a character class and
    # "(.*?)" is a group that matches the empty string.
    song_name = re.sub(r"-?\[.*?\]", "", song_name)
    song_name = re.sub(r"-?\(.*?\)", "", song_name)
    # Assumption: the duplicated pattern originally targeted full-width parentheses.
    song_name = re.sub(r"-?（.*?）", "", song_name)
    return song_name


def get_data(size=0):
    """Get the dictionary of uid => list of work (song) names.

    size limits the number of documents read; 0 means no limit.
    """
    cur_size = 0
    ret = {}

    es_client = elasticsearch.Elasticsearch()
    search_result = elasticsearch.helpers.scan(
        es_client,
        index="ktv_works",
        doc_type="ktv_works",  # only meaningful on older Elasticsearch versions with mapping types
        scroll="10m",
        query={},
    )

    for hit_item in search_result:
        cur_size += 1
        if size > 0 and cur_size > size:
            break

        item = hit_item["_source"]
        work_list = item["item_list"]
        ret[item["uid"]] = [trim_song_name(work["songname"]) for work in work_list]

    return ret


def get_uniq_song_sort_list(song_dict):
    """Merge duplicate songs and sort by song name."""
    return sorted({song for songs in song_dict.values() for song in songs})
```
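The snippets in the following sections use `song_count`, `song_to_int`, `int_to_song`, and `user_song_one_hot_matrix`, which the listing does not define. The sketch below shows one way they might be built, following the steps listed under "Algorithm implementation". The sample `song_dict` is a hypothetical stand-in for the output of `get_data()`, and the construction is an assumption about the intended layout (one row per user, one column per song), not the author's original code.

```python
import numpy as np

# Hypothetical stand-in for get_data(): uid -> list of song names.
song_dict = {
    "u1": ["song_a", "song_c"],
    "u2": ["song_a", "song_b", "song_c"],
    "u3": ["song_a"],
}

# De-duplicate the songs and sort them by name.
uniq_songs = sorted({song for songs in song_dict.values() for song in songs})
song_count = len(uniq_songs)

# Conversion dictionaries between song names and column indexes.
song_to_int = {song: idx for idx, song in enumerate(uniq_songs)}
int_to_song = {idx: song for song, idx in song_to_int.items()}

# One row per user, one column per song; 1 means the user has sung the song.
user_song_one_hot_matrix = np.zeros((len(song_dict), song_count))
for row, songs in enumerate(song_dict.values()):
    for song in songs:
        user_song_one_hot_matrix[row, song_to_int[song]] = 1

print(user_song_one_hot_matrix)
```

Assigning 1 rather than incrementing keeps every entry at 0 or 1 even if a user sang the same song several times, which the co-occurrence count below relies on.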

### Similarity calculation

```python
import math

# Co-occurrence matrix: entry (i, j) is the number of users who sang both songs
col_show_count_matrix = np.zeros((song_count, song_count))
for i in range(song_count):
    for j in range(song_count):
        if i > j:  # the matrix is symmetric, so only the lower triangle is computed
            # Vector with ones at positions i and j
            one_trik_matrix = np.zeros(song_count)
            one_trik_matrix[i] = 1
            one_trik_matrix[j] = 1

            # A user's dot product equals 2 only if the user sang both songs
            ret_m = user_song_one_hot_matrix.dot(one_trik_matrix)
            col_show_value = int(np.sum(ret_m == 2))
            col_show_count_matrix[i, j] = col_show_value
            col_show_count_matrix[j, i] = col_show_value

# Similarity matrix
col_show_rate_matrix = np.zeros((song_count, song_count))

# Song count N(i): number of users who sang each song
song_count_matrix = np.zeros(song_count)
for i in range(song_count):
    song_col = user_song_one_hot_matrix[:, i]
    song_count_matrix[i] = np.sum(song_col >= 1)

# Similarity matrix calculation
for i in range(song_count):
    for j in range(song_count):
        if i > j:  # symmetric again, so only the lower triangle is computed
            # Similarity: |N(i) ∩ N(j)| / sqrt(N(i) * N(j))
            rate_value = col_show_count_matrix[i, j] / math.sqrt(song_count_matrix[i] * song_count_matrix[j])
            col_show_rate_matrix[i, j] = rate_value
            col_show_rate_matrix[j, i] = rate_value
```
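The nested loops above run one matrix-vector product per pair of songs, which becomes slow as the number of songs grows. For a 0/1 matrix the same counts can be obtained from a single matrix product. The following is a sketch under that assumption; `demo_matrix`, `co_counts`, `song_counts`, and `rate_fast` are hypothetical names, and the small matrix is made-up data.

```python
import numpy as np

# Hypothetical 0/1 matrix: 3 users (rows) x 3 songs (columns).
demo_matrix = np.array([
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
], dtype=float)

# Entry (i, j) of X^T X counts the users with a 1 in both column i and column j.
co_counts = demo_matrix.T @ demo_matrix

# The diagonal holds N(i), the number of users per song.
song_counts = np.diag(co_counts).copy()

# Divide by sqrt(N(i) * N(j)); assumes every song has at least one user.
rate_fast = co_counts / np.sqrt(np.outer(song_counts, song_counts))

# The loop version leaves the diagonal at 0, so do the same here.
np.fill_diagonal(rate_fast, 0.0)

print(np.round(rate_fast, 4))
```

With a sparse matrix type the same product scales to far more songs, but that is outside the scope of this first version.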

### Recommend

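```python
def get_songs_from_recommand(col_recommand_matrix):
    """Return (song name, score) pairs for every song with a positive score."""
    return [(int_to_song[k], r_value) for k, r_value in enumerate(col_recommand_matrix) if r_value > 0]


# Construct the one-hot vector for the song the user likes.
# input_song must be set beforehand to a song name present in song_to_int.
one_trik_matrix = np.zeros(song_count)
one_trik_matrix[song_to_int[input_song]] = 1

# The product selects the similarity of every song to input_song
col_recommand_matrix = col_show_rate_matrix.dot(one_trik_matrix)
recommand_array = get_songs_from_recommand(col_recommand_matrix)
sorted_x = sorted(recommand_array, key=lambda k: k[1], reverse=True)

# Print the recommended results, highest similarity first
print(sorted_x)
```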
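For repeated lookups it may be convenient to wrap the recommendation step in a function that also limits the number of results. The sketch below is one possible shape, not part of the original code: `recommend_for_song` and its `top_n` parameter are hypothetical, and the demo similarity matrix is made-up data. Because the similarity matrix is symmetric, reading one row gives the same values as the dot product with a one-hot vector.

```python
import numpy as np


def recommend_for_song(input_song, rate_matrix, song_to_int, int_to_song, top_n=3):
    """Return up to top_n (song, score) pairs most similar to input_song."""
    if input_song not in song_to_int:
        return []  # unknown song: nothing to recommend
    scores = rate_matrix[song_to_int[input_song]]  # similarities to every song
    order = np.argsort(-scores, kind="stable")  # highest score first
    return [
        (int_to_song[int(k)], float(scores[k]))
        for k in order[:top_n]
        if scores[k] > 0
    ]


# Hypothetical similarity matrix for three songs.
demo_songs = ["song_a", "song_b", "song_c"]
demo_to_int = {song: k for k, song in enumerate(demo_songs)}
demo_to_song = {k: song for song, k in demo_to_int.items()}
demo_rates = np.array([
    [0.0, 0.5, 0.8],
    [0.5, 0.0, 0.7],
    [0.8, 0.7, 0.0],
])

top = recommend_for_song("song_a", demo_rates, demo_to_int, demo_to_song, top_n=2)
print(top)  # [('song_c', 0.8), ('song_b', 0.5)]
```

The input song itself never appears in the result because its diagonal entry is 0.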

### Result

[('Sansheng Sanshi', 0.5773502691896258), ('see you at the next intersection', 0.5773502691896258), ('love without breaking up', 0.5773502691896258),...]



Posted on Fri, 13 Mar 2020 05:49:33 -0400 by control