Building a recommendation system with user-based collaborative filtering

1. Overview

In an earlier post, I introduced how to build a recommendation system. Today, I will show how to build a practical recommender using user-based collaborative filtering.

2. Content

Collaborative filtering is widely used in recommendation systems, and it is a rapidly developing research field. The two common approaches are memory-based and model-based.

  • Memory-based: recommendations are made mainly by computing similarity, as in user-based and item-based collaborative filtering. In both variants a user-item interaction matrix is built first; its row vectors then represent users and its column vectors represent items, and the similarity between users (or between items) is computed to make recommendations;
  • Model-based: a model is mainly used to fill in the interaction matrix, predicting how likely a user is to buy an item.
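To make the interaction matrix concrete, here is a minimal sketch on a made-up rating log. The user names, item names and values are hypothetical and are not taken from the data set used later in this post.

```python
import pandas as pd

# Hypothetical toy rating log; names and values are invented for illustration.
log = pd.DataFrame({
    "user": ["ann", "ann", "bob", "bob", "cat"],
    "item": ["book", "pen", "book", "lamp", "pen"],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Rows are users, columns are items; pairs that never occurred stay NaN.
interaction = log.pivot_table(index="user", columns="item", values="rating")

user_vector = interaction.loc["ann"]   # row vector: one user across all items
item_vector = interaction["book"]      # column vector: one item across all users

print(interaction)
```

Memory-based methods compare such row vectors (user-based) or column vectors (item-based) directly.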

To put this into practice, a collaborative filtering model can be built that recommends products to customers from their purchase data. Below, we work through the details step by step with a hands-on example of user-based (memory-based) collaborative filtering. User-based collaborative filtering embodies the idea that people with similar characteristics have similar preferences. For example, suppose user A rates item C highly, and user B has purchased many of the same items as A and rated them similarly. Then user B will probably buy item C in the future, so the system recommends item C to user B based on the similarity measure.

2.1 user-to-user collaborative filtering

In this approach, users similar to the query user are identified, and the expected score is estimated as the weighted average of those similar users' scores. The hands-on example is written in Python and relies on the following libraries:

  • pandas
  • numpy
  • sklearn

Python environment:

  • Version 3.7.6
  • Anaconda3

2.2 scoring function

For non-personalized collaborative filtering (which ignores the active user's likes, dislikes and historical scores), we can define a function that takes user u and item i as input parameters and returns a score. This function outputs a score that quantifies how strongly user u prefers item i. It is usually computed from the ratings of people who are similar to the user. The formula involved is as follows:

$$s(u, i) = \bar{r}_u + \frac{\sum_{v \in V} (r_{vi} - \bar{r}_v)\, w_{uv}}{\sum_{v \in V} w_{uv}}$$

Where s is the predicted score, u is the user, i is the item, r is the score given by a user, \bar{r} is a user's average score, w is the weight, and V is the set of users similar to u. In other words, the score equals user u's average rating plus a weighted average of deviations: for each similar user v, take v's rating of the item minus v's average rating, multiply it by a weight, sum the results, and divide by the sum of the weights. The weight w_{uv} indicates how similar user v is to user u, or how much user v contributes to the prediction for user u. It is described here as a value between 0 and 1, where 0 is the lowest and 1 is the highest; note that cosine similarity computed on mean-centered ratings can also be negative.

In theory, this looks perfect. But why subtract the average score from each user's score? And why use a weighted average instead of a simple average? The reason lies in the kinds of users we deal with. People usually rate on different scales. User A may be a positive, optimistic user who gives the movies they like a 4 or 5. User B may be less optimistic, or have a stricter rating standard, and rate their favorite movies anywhere between 2 and 5. A 2 from user B then corresponds to a 4 from user A.

Normalizing the user ratings improves the algorithm. One way to do so is to compute s(u, i) as the average rating the user gives to items plus some deviation. We use cosine similarity to calculate the weights in the formula above, normalize the data as just described, and analyze the data with pandas.
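The mean-centered weighted average described above can be checked by hand on a tiny example. This is a minimal sketch: the function name predict_score and all numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def predict_score(user_mean, neighbour_ratings, neighbour_means, weights):
    """Return user_mean + sum(w * (r - mean)) / sum(w) over the neighbours."""
    ratings = np.asarray(neighbour_ratings, dtype=float)
    means = np.asarray(neighbour_means, dtype=float)
    weights = np.asarray(weights, dtype=float)
    deviations = ratings - means  # how far each neighbour is from their own average
    return user_mean + np.sum(weights * deviations) / np.sum(weights)

# Hypothetical numbers: the target user averages 3.0; two neighbours rated the
# item 5.0 and 4.0, and their own averages are 4.0 and 3.5.
score = predict_score(
    user_mean=3.0,
    neighbour_ratings=[5.0, 4.0],
    neighbour_means=[4.0, 3.5],
    weights=[0.75, 0.25],
)
print(score)  # 3.0 + (0.75 * 1.0 + 0.25 * 0.5) / 1.0 = 3.875
```

Both neighbours rated the item above their own average, so the prediction lands above the target user's average, and the more similar neighbour pulls harder.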

2.2.1 importing the Python dependencies

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances

2.2.2 loading data sources

The sample code for loading data is as follows:

movies = pd.read_csv("data/movies.csv")
Ratings = pd.read_csv("data/ratings.csv")
Tags = pd.read_csv("data/tags.csv")

Preview the loaded data as follows:

print(movies.head())
print(Ratings.head())
print(Tags.head())

 

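Build the data by subtracting each user's average rating from that user's ratings:

# Average rating per user
Mean = Ratings.groupby(by="userId", as_index=False)['rating'].mean()
# After the merge, rating_x is the original rating and rating_y is the user's average
Rating_avg = pd.merge(Ratings, Mean, on='userId')
# Mean-centered (adjusted) rating
Rating_avg['adg_rating'] = Rating_avg['rating_x'] - Rating_avg['rating_y']
print(Rating_avg.head())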

The results are as follows:

 

2.3 cosine similarity

To apply the formula above, we need to find users with similar tastes. Finding users who share the same likes and dislikes sounds interesting, but how do we measure that similarity? This is where cosine similarity comes in: it tells us how similar two users are, and it is usually calculated from the users' past ratings.
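As a small illustration of what cosine similarity measures, the sketch below compares three hypothetical users by hand and with sklearn. The helper name cosine and the rating vectors are invented for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical mean-centered ratings of three users for the same four movies
alice = [1.0, 0.5, -1.0, -0.5]
bob = [2.0, 1.0, -2.0, -1.0]      # same taste as alice, stronger reactions
carol = [-1.0, -0.5, 1.0, 0.5]    # opposite taste to alice

print(cosine(alice, bob))    # approximately 1: same direction
print(cosine(alice, carol))  # approximately -1: opposite direction

# sklearn computes all pairwise similarities between the rows at once
matrix = cosine_similarity(np.array([alice, bob, carol]))
print(matrix.shape)  # (3, 3)
```

Only the direction of the rating vectors matters, not their length, which is why alice and bob come out as nearly identical even though bob's deviations are twice as large.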

Here we use the cosine_similarity function from Python's sklearn to calculate the similarity, after some data preprocessing and cleaning. The example code is as follows:

check = pd.pivot_table(Rating_avg,values='rating_x',index='userId',columns='movieId')
print(check.head())
final = pd.pivot_table(Rating_avg,values='adg_rating',index='userId',columns='movieId')
print(final.head())

The results are as follows:

The output above contains many NaN values because no user has seen every movie; a matrix like this is called a sparse matrix. Methods such as matrix factorization are used to deal with this kind of sparsity. Here, we simply substitute suitable averages for the NaN values.

There are usually two ways:

  1. Use the user's average across the row;
  2. Use the movie's average down the column.
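The difference between the two options is easiest to see on a toy matrix. The sketch below uses a hypothetical 3 × 3 matrix of mean-centered ratings (rows are users, columns are movies); the labels and values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical mean-centered ratings; NaN marks a movie the user has not rated.
toy = pd.DataFrame(
    [[1.0, np.nan, -1.0],
     [np.nan, 0.5, -0.5],
     [2.0, -2.0, np.nan]],
    index=["u1", "u2", "u3"],
    columns=["m1", "m2", "m3"],
)

# Option 1: fill each gap with that user's (row) average
by_user = toy.apply(lambda row: row.fillna(row.mean()), axis=1)

# Option 2: fill each gap with that movie's (column) average
by_movie = toy.fillna(toy.mean(axis=0))

print(by_user)
print(by_movie)
```

Because each row here is already centered on the user's own average, the user-average fill amounts to inserting zeros, while the movie-average fill carries over how other users felt about that movie.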

The code is as follows:

# Replacing NaN by Movie Average
final_movie = final.fillna(final.mean(axis=0))
print(final_movie.head())

# Replacing NaN by user Average
final_user = final.apply(lambda row: row.fillna(row.mean()), axis=1)
print(final_user.head())

The results are as follows:

 

Next, we calculate the similarity between users. The code is as follows:

# User similarity after replacing NaN with the item (movie) average
cosine = cosine_similarity(final_movie)
np.fill_diagonal(cosine, 0)  # ignore each user's similarity with themselves
similarity_with_movie = pd.DataFrame(cosine, index=final_movie.index)
similarity_with_movie.columns = final_movie.index
# print(similarity_with_movie.head())

# User similarity after replacing NaN with the user average
b = cosine_similarity(final_user)
np.fill_diagonal(b, 0)  # ignore each user's similarity with themselves
similarity_with_user = pd.DataFrame(b, index=final_user.index)
similarity_with_user.columns = final_user.index
# print(similarity_with_user.head())

The results are as follows:

 

Then, let's check whether the similarity makes sense. The code is as follows:

def get_user_similar_movies(user1, user2):
    # Movies rated by both users, joined with the movie titles
    common_movies = Rating_avg[Rating_avg.userId == user1].merge(
        Rating_avg[Rating_avg.userId == user2],
        on="movieId",
        how="inner")
    return common_movies.merge(movies, on='movieId')

a = get_user_similar_movies(370, 86309)
# rating_x_x is the first user's rating, rating_x_y is the second user's rating
a = a.loc[:, ['rating_x_x', 'rating_x_y', 'title']]
print(a.head())

The results are as follows:

 

From the output above, we can see that the two users gave almost the same ratings to the movies they have in common, which is consistent with the high similarity computed for them.

2.4 adjacent users

Section 2.3 calculated the similarity between all users, but recommendation systems often have to work at big-data scale. In the movie example we built an 862 × 862 matrix, which is very small compared with real user data (millions or tens of millions of users, or more). So looking at every other user each time we score an item is not a good solution. Instead, using the idea of neighboring users, only the K most similar users are selected for each user.
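One possible way to pick the K most similar users is to sort each row of the similarity matrix. The following is a sketch under assumptions, not the implementation used in this post: the function name top_k_neighbours, the user ids and the similarity values are all hypothetical.

```python
import numpy as np
import pandas as pd

def top_k_neighbours(similarity, k):
    """For each row user, return the labels of the k most similar users, best first."""
    values = similarity.to_numpy()
    # Sorting the negated matrix orders each row by descending similarity
    order = np.argsort(-values, axis=1)[:, :k]
    labels = similarity.columns.to_numpy()[order]
    return pd.DataFrame(
        labels,
        index=similarity.index,
        columns=["top{}".format(i) for i in range(1, k + 1)],
    )

# Hypothetical user-user similarity matrix with the diagonal already set to 0
users = [10, 20, 30, 40]
sim = pd.DataFrame(
    [[0.0, 0.9, 0.1, 0.4],
     [0.9, 0.0, 0.3, 0.2],
     [0.1, 0.3, 0.0, 0.8],
     [0.4, 0.2, 0.8, 0.0]],
    index=users,
    columns=users,
)

neighbours = top_k_neighbours(sim, 2)
print(neighbours)
```

Setting the diagonal to 0 beforehand keeps a user from being selected as their own nearest neighbour.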

Next, we set K to 30, so every user has 30 neighboring users. The code is as follows:

def find_n_neighbours(df, n):
    # For each user (row), keep the ids of the n most similar users, most similar first
    top_labels = ['top{}'.format(i) for i in range(1, n + 1)]
    df = df.apply(lambda x: pd.Series(x.sort_values(ascending=False)
           .iloc[:n].index,
          index=top_labels), axis=1)
    return df

# top 30 neighbours for each user
sim_user_30_u = find_n_neighbours(similarity_with_user,30)
print(sim_user_30_u.head())

sim_user_30_m = find_n_neighbours(similarity_with_movie,30)
print(sim_user_30_m.head())

The results are as follows:

 

2.5 calculate final score

The implementation code is as follows:

def User_item_score(user, item):
    # ids of the user's 30 nearest neighbours
    a = sim_user_30_m[sim_user_30_m.index == user].values
    b = a.squeeze().tolist()
    # The neighbours' adjusted ratings for this item
    c = final_movie.loc[:, item]
    d = c[c.index.isin(b)]
    f = d[d.notnull()]
    avg_user = Mean.loc[Mean['userId'] == user, 'rating'].values[0]
    index = f.index.values.squeeze().tolist()
    # Similarity between the user and each of those neighbours
    corr = similarity_with_movie.loc[user, index]
    fin = pd.concat([f, corr], axis=1)
    fin.columns = ['adg_score', 'correlation']
    fin['score'] = fin['adg_score'] * fin['correlation']
    nume = fin['score'].sum()
    deno = fin['correlation'].sum()
    # The user's average rating plus the weighted average deviation
    final_score = avg_user + (nume / deno)
    return final_score

score = User_item_score(320,7371)
print("score (u,i) is",score)

The results are as follows:

 

The predicted score calculated here is 4.25, so we can expect that the queried user may like the movie with ID 7371. Next, we recommend movies to a user (for example, user 370). The implementation code is as follows:

Rating_avg = Rating_avg.astype({"movieId": str})
# For each user, a comma-separated string of the movie ids they rated
Movie_user = Rating_avg.groupby(by='userId')['movieId'].apply(lambda x: ','.join(x))

def User_item_score1(user):
    # Movies the user has already rated
    Movie_seen_by_user = check.columns[check[check.index == user].notna().any()].tolist()
    # ids of the user's 30 nearest neighbours
    a = sim_user_30_m[sim_user_30_m.index == user].values
    b = a.squeeze().tolist()
    # Movies rated by the neighbours, minus those the user has already rated
    d = Movie_user[Movie_user.index.isin(b)]
    l = ','.join(d.values)
    Movie_seen_by_similar_users = l.split(',')
    Movies_under_consideration = list(set(Movie_seen_by_similar_users) - set(map(str, Movie_seen_by_user)))
    Movies_under_consideration = list(map(int, Movies_under_consideration))
    # The user's average does not depend on the item, so look it up once
    avg_user = Mean.loc[Mean['userId'] == user, 'rating'].values[0]
    score = []
    for item in Movies_under_consideration:
        # Same calculation as User_item_score, applied to each candidate movie
        c = final_movie.loc[:, item]
        d = c[c.index.isin(b)]
        f = d[d.notnull()]
        index = f.index.values.squeeze().tolist()
        corr = similarity_with_movie.loc[user, index]
        fin = pd.concat([f, corr], axis=1)
        fin.columns = ['adg_score', 'correlation']
        fin['score'] = fin['adg_score'] * fin['correlation']
        nume = fin['score'].sum()
        deno = fin['correlation'].sum()
        final_score = avg_user + (nume / deno)
        score.append(final_score)
    # Keep the five highest-scoring movies and look up their titles
    data = pd.DataFrame({'movieId': Movies_under_consideration, 'score': score})
    top_5_recommendation = data.sort_values(by='score', ascending=False).head(5)
    Movie_Name = top_5_recommendation.merge(movies, how='inner', on='movieId')
    Movie_Names = Movie_Name.title.values.tolist()
    return Movie_Names

user = int(input("Enter the user id to whom you want to recommend : "))
predicted_movies = User_item_score1(user)
print(" ")
# Print the id that was actually entered instead of a hard-coded 370
print("The Recommendations for User Id : {}".format(user))
print("   ")
for i in predicted_movies:
    print(i)

The results are as follows:

 

3. Summary

The user-based collaborative filtering process can be summarized as follows:

  1. Collect & store data
  2. Load data
  3. Data modeling (data preprocessing & data cleaning)
  4. Similarity calculation (cosine similarity, nearest neighbors)
  5. Score prediction (prediction and final score calculation)
  6. Item recommendation
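The steps above can be sketched end to end on a tiny, made-up data set. This is a simplified illustration rather than the code from section 2: the function name recommend and the toy ratings are hypothetical, missing values are filled with 0 (the user average of mean-centered ratings), and neighbours who have not rated a movie are skipped when scoring it.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Steps 1-2: hypothetical toy ratings standing in for the loaded data
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 40, 10, 20, 40],
    "rating":  [5.0, 1.0, 3.0, 4.0, 2.0, 5.0, 1.0, 5.0, 3.0],
})

def recommend(ratings, user, k=1, top_n=1):
    # Step 3: mean-center each user's ratings and pivot to a user x movie matrix
    user_mean = ratings.groupby("userId")["rating"].mean()
    centred = ratings.assign(adj=ratings["rating"] - ratings["userId"].map(user_mean))
    matrix = centred.pivot_table(index="userId", columns="movieId", values="adj")
    filled = matrix.fillna(0.0)  # assumption: fill gaps with the centered user average

    # Step 4: cosine similarity between users, then the k nearest neighbours
    sim = pd.DataFrame(cosine_similarity(filled), index=filled.index, columns=filled.index)
    neighbours = sim.loc[user].drop(user).nlargest(k)

    # Step 5: predict a score for every movie the user has not rated
    unseen = matrix.columns[matrix.loc[user].isna()]
    scores = {}
    for movie in unseen:
        dev = matrix.loc[neighbours.index, movie].dropna()
        if dev.empty:
            continue  # no neighbour rated this movie
        weights = neighbours.loc[dev.index]
        if weights.sum() == 0:
            continue  # avoid dividing by zero
        scores[movie] = user_mean[user] + (dev * weights).sum() / weights.sum()

    # Step 6: return the highest-scoring movies
    return pd.Series(scores, dtype=float).sort_values(ascending=False).head(top_n)

rec = recommend(ratings, user=1)
print(rec)
```

In this toy data, user 2 is user 1's nearest neighbour and rated movie 40 above their own average, so movie 40 is recommended to user 1 with a predicted score above user 1's average of 3.0.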

4. Conclusion

That is all for this post. If you run into any questions while studying this material, you can join the discussion group or send me an email. I will do my best to answer them.



Posted on Fri, 26 Jun 2020 00:30:55 -0400 by lakshmiyb