1, Analysis background
online shopping has become an indispensable part of people's life. Based on the data of Taobao app platform, this project analyzes user behavior through relevant indicators, so as to explore user related behavior patterns.
The data set used in this paper contains the user behavior data of Taobao App mobile terminal within one month from November 18, 2014 to December 18, 2014. The data is recorded for 12256906 days, with a total of 6 columns of data.
Data link: Dataset - Alibaba cloud Tianchi
-
user_id: user identity
-
item_id: Commodity id
-
behavior_type: user behavior type (including clicking, collecting, adding to shopping cart and paying, represented by numbers 1, 2, 3 and 4 respectively)
-
user_geohash: geographic location
-
item_category: category ID (category to which the commodity belongs)
-
Time: the time when the user behavior occurs
2, Analytical framework
3, Analysis and visualization
1. Import related libraries
%matplotlib inline import numpy as np import pandas as pd import matplotlib as mpl import matplotlib.pyplot as plt import seaborn as sns from datetime import datetime import warnings mpl.rcParams["font.family"] = "SimHei" mpl.rcParams["axes.unicode_minus"] = False warnings.filterwarnings('ignore')
2. Data preprocessing
2.1 data preview
df = pd.read_csv("tianchi_mobile_recommend_train_user.csv",dtype=str) df.shape df.info() df.sample(10)
2.2 calculation of missing rate
df.isnull().sum()/len(df['user_id'])
2.3 delete the geographic location column
del df['user_geohash'] df.head()
two point four The processing time column is divided into date column and hour column
df['date'] = df['time'].str[0:10] df['hour'] = df['time'].str[11:] df.head()
two point five Change the time and date columns to the standard date format, and the hour column to the int format
df['date'] = pd.to_datetime(df['date']) df["time"] = pd.to_datetime(df["time"]) df["hour"] = df["hour"].astype(int) df.info()
two point six Arrange the data in ascending order according to the time column and generate a new index
df = df.sort_values(by='time',ascending=True).reset_index(drop=True) df.head()
2.7 viewing data distribution
df.describe(include='all')
two point eight Give an overview of the time data
df["date"].unique()
3. User behavior analysis
3.1 calculation of PV and UV
total_pv = df["user_id"].count() total_pv total_uv = df["user_id"].nunique() total_uv
As can be seen from the figure, the total page views of the website are 12256906 times, and the number of independent visitors on the page is 10000.
3.2 PV and UV in date dimension
pv_daily = df.groupby('date')['user_id'].count().reset_index() uv_daily = df.groupby('date')['user_id'].nunique().reset_index() pv_uv_daily = pv_daily.merge(uv_daily,how='left',on='date') pv_uv_daily.columns = ['date',"pv","uv"] pv_uv_daily.head() pv_daily.set_index('date',inplace=True) uv_daily.set_index('date',inplace=True) plt.figure(figsize=(16,10)) plt.subplot(211) plt.plot(pv_daily['user_id'],c="r") plt.title("Total page views per day(PV)") plt.xticks(fontsize=12) plt.yticks(fontsize=12) plt.subplot(212) plt.plot(uv_daily['user_id'],c="g") plt.title("Number of unique visitors per page per day(UV)") plt.xticks(fontsize=12) plt.yticks(fontsize=12) plt.tight_layout() plt.savefig("PV and UV Change trend of",dpi=300)
It can be seen from the figure that pv and uv data show a high positive correlation. Before and after the double 12, pv and uv fluctuated between 350000 and 400000. When the double 12, page views increased sharply, which proved that the effect of this activity was very good.
3.3 PV and UV in time dimension
pv_hour = df.groupby('hour')['user_id'].count() uv_hour = df.groupby('hour')['user_id'].nunique() pv_uv_hour = pd.concat([pv_hour,uv_hour],axis=1) pv_uv_hour.columns=["pv_hour","uv_hour"] pv_uv_hour.head() fig = plt.figure(figsize=(20,10)) ax1 = fig.add_subplot(111) ax1.plot(pv_uv_hour["pv_hour"],c="steelblue",label="Total page views per hour") plt.ylabel("Page views") plt.legend(loc = 2) ax2 = ax1.twinx() ax2.plot(pv_uv_hour["uv_hour"],c="red",label="Page unique visitors per hour") plt.ylabel("Page unique visitors") plt.xticks(range(0,24),pv_uv_hour.index) plt.legend(loc = 6, bbox_to_anchor=(0, 0.9)) plt.tight_layout() plt.savefig("Hourly PV and UV Change trend of",dpi=300)
As can be seen from the figure, from 22:00 p.m. to 5:00 a.m., the number of users and traffic on the page gradually decrease, and many people are resting during this period. From 6:00 to 10:00 in the morning, the number of users gradually shows an upward trend, and there is a relatively stable state from 10:00 to 18:00. This time period is normal working hours. However, from 18:00 to 22:00 in the evening, the number of users surged sharply, reaching the maximum number of users visited in a day. Operators can refer to the active time period of users and take some promotional activities.
3.4 calculate the number of users who click, collect, add shopping cart and pay
type_1 = df[df['behavior_type']=="1"]["user_id"].count() type_2 = df[df['behavior_type']=="2"]["user_id"].count() type_3 = df[df['behavior_type']=="3"]["user_id"].count() type_4 = df[df['behavior_type']=="4"]["user_id"].count() print("Click user:",type_1) print("Favorite users:",type_2) print("Add cart user:",type_3) print("Payment user:",type_4)
As can be seen from the figure, users click on the page – > collect and add shopping cart – > pay, which gradually shows a downward trend. This analysis will be further explained in the funnel diagram below.
3.5 click, collect, add shopping cart and pay users in the date dimension
pv_date_type = pd.pivot_table(df, index='date',columns='behavior_type',\ values='user_id',aggfunc='count') pv_date_type.columns=['click','Collection','add to cart','payment'] pv_date_type.head() fig = plt.figure(figsize=(20,10)) ax1 = fig.add_subplot(111) ax1.plot(pv_date_type['Collection'],label='Collection',linestyle = '--') ax1.plot(pv_date_type['add to cart'],label= 'add to cart',linestyle = '-.') ax1.plot(pv_date_type['payment'],label= 'payment') plt.ylim(0,30000) plt.legend(loc=2) ax2 = ax1.twinx() ax2.plot(pv_date_type['click'],label='click',color='r',linestyle = ':') plt.legend(loc = 'upper right')
As can be seen from the figure, clicking, collecting, adding to the shopping cart and paying fluctuated sharply around the double 12, which also proved that the effect of this activity was very good.
3.6 click, collect, add shopping cart and pay users in week dimension
pv_time_week_type = pd.pivot_table(df,index='time_week',columns='behavior_type',\ values='user_id',aggfunc='count') pv_time_week_type.columns=['click','Collection','add to cart','payment'] pv_time_week_type.head() fig = plt.figure(figsize=(20,10)) ax1 = fig.add_subplot(111) ax1.plot(pv_time_week_type['Collection'],label='Collection',linestyle = '--') ax1.plot(pv_time_week_type['add to cart'],label= 'add to cart',linestyle = '-.') ax1.plot(pv_time_week_type['payment'],label= 'payment') plt.ylim(0,60000) plt.legend(loc=2) ax2 = ax1.twinx() ax2.plot(pv_time_week_type['click'],label='click',color='r',linestyle = ':') plt.legend(loc = 'upper right') plt.savefig("Different user behaviors in different weeks PV Change trend",dpi=300)
As can be seen from the above figure, the payment behavior is mainly completed on Friday in the whole week, and it is relatively flat on Saturday and Sunday. Other behaviors also have the same pattern, especially click behavior, which increases rapidly from Monday to decreases rapidly after Thursday, showing relatively violent fluctuations.
3.7 click, collect, add shopping cart and pay users in the time dimension
pv_hour_type = pd.pivot_table(df,index='hour',columns='behavior_type',\ values='user_id',aggfunc='count') pv_hour_type.columns=['click','Collection','add to cart','payment'] pv_hour_type.head() plt.figure(figsize=(16,10)) sns.lineplot(data = pv_hour_type[['Collection', 'add to cart', 'payment']]) plt.legend(loc="best") pv_hour_type["click"].plot(c="r",linewidth=2,label="click",secondary_y=True) plt.legend(loc = 1) plt.tight_layout() plt.savefig("Different user behaviors in different hours PV Change trend",dpi=300)
As can be seen from the above figure, the four behaviors show a high positive correlation, which gradually decreases from 22:00 p.m. to 5:00 a.m., and the number of users gradually increases from 6:00 a.m. to 10:00 a.m. there is a relatively stable state from 10:00 to 18:00 p.m. from 18:00 to 22:00 p.m., users surge sharply, reaching the maximum number of users in a day.
three point eight Breakdown of user behavior in the top 10 payment times
df['user_id1'] = df['user_id'] buy_first = pd.pivot_table(df, index = 'user_id', columns = 'behavior_type',\ values = 'user_id1', aggfunc='count') buy_first.columns = ['click','Collection','add to cart','payment'] buy_first_10 = buy_first.sort_values(by='payment',ascending = False)[0:10] buy_first_10 plt.figure(figsize=(16,10)) plt.subplot(311) plt.plot(buy_first_10["click"],c='steelblue') plt.title("Change trend of hits") plt.subplot(312) plt.plot(buy_first_10["Collection"],c='g') plt.title("Change trend of collection number") plt.subplot(313) plt.plot(buy_first_10["add to cart"],c='r') plt.title("Change trend of adding shopping cart") plt.tight_layout() plt.savefig("The top 10 paying users have the change trend of clicking, collecting and adding to the shopping cart",dpi=300)
It can be seen from the above figure that the users who purchase the most often do not necessarily click, collect and join the car purchase most often.
4. Analysis of user consumption behavior
4.1 per capita consumption times of daily consumption users
total_custome = df[df['behavior_type'] == "4"].groupby(["date","user_id"])["behavior_type"].count().reset_index().rename(columns={"behavior_type":"total"}) total_custome.head() total_custome2 = total_custome.groupby("date")["total"].sum()/total_custome.groupby("date")["total"].count() total_custome2.head(10) x = len(total_custome2.index) y = total_custome2.index.astype(str) plt.figure(figsize=(20,7)) plt.plot(total_custome2.values) plt.xticks(range(0,31,5),[y[i] for i in range(0,x,5)]) plt.title('Per capita consumption per day') plt.tight_layout() plt.savefig("Per capita consumption per day",dpi=300)
4.2 daily consumption times of active users
df["operation"] = 1 aa = df.groupby(["date","user_id",'behavior_type'])["operation"].count().\ reset_index().rename(columns={"operation":"total"}) aa.head(10) #Total daily consumption times / daily consumption number (nunique) aa1 = aa.groupby("date").apply(lambda x: x[x["behavior_type"]=="4"]["total"].sum()/x["user_id"].nunique()) aa1.head(10) x = len(aa1.index) y = aa1.index.astype(str) plt.figure(figsize=(20,7)) plt.plot(aa1.values) plt.xticks(range(0,31,5),[y[i] for i in range(0,x,5)]) plt.title('Consumption times of active users per day') plt.tight_layout() plt.savefig('Consumption times of active users per day',dpi=300)
4.3 payment rate analysis
rate = aa.groupby('date').apply(lambda x:x[x['behavior_type']=='4']['user_id'].nunique()/x['user_id'].nunique()) rate.head() x=len(rate.index) y=rate.index.astype(str) plt.figure(figsize=(20,7)) plt.plot(rate.values) plt.xticks(range(0,31,5),[y[i] for i in range(0,31,5)]) plt.title('Payment rate analysis') plt.tight_layout() plt.savefig('Payment rate analysis',dpi=300)
four point four Repurchase rate analysis
re_buy = df[df['behavior_type']=='4'].groupby('user_id')['date'].nunique() re_buy.head() re_buy[re_buy>1].count()/re_buy.count()
4.5 conversion analysis
df_count = df.groupby("behavior_type").size().reset_index().\ rename(columns={"behavior_type":"link",0:"Number of people"}) type_dict = { "1":"click", "2":"Collection", "3":"add to cart", "4":"payment" } df_count["link"] = df_count["link"].map(type_dict) a = df_count.iloc[0]["Number of people"] b = df_count.iloc[1]["Number of people"] c = df_count.iloc[2]["Number of people"] d = df_count.iloc[3]["Number of people"] funnel = pd.DataFrame({"link":["click","Collect and add shopping cart","payment"],"Number of people":[a,b+c,d]}) funnel["Overall conversion"] = [i/funnel["Number of people"][0] for i in funnel["Number of people"]] funnel["Single conversion"] = np.array([1.0,2.0,3.0]) for i in range(0,len(funnel["Number of people"])): if i == 0: funnel["Single conversion"][i] = 1.0 else: funnel["Single conversion"][i] = funnel["Number of people"][i] / funnel["Number of people"][i-1] # The drawing is as follows import plotly.express as px import plotly.graph_objs as go trace = go.Funnel( y = ["click", "Collect and add shopping cart", "purchase"], x = [funnel["Number of people"][0], funnel["Number of people"][1], funnel["Number of people"][2]], textinfo = "value+percent initial", marker=dict(color=["deepskyblue", "lightsalmon", "tan"]), connector = {"line": {"color": "royalblue", "dash": "solid", "width": 3}}) data =[trace] fig = go.Figure(data) fig.show()
As can be seen from the above figure, since collecting and joining car purchase are user behaviors with purchase intention, and in no order, we regard their combination as a stage. As can be seen from the funnel chart and the funnel table above, the conversion rate from browsing to having purchase intention (collecting and adding to the shopping cart) is only 5%, but the conversion rate from real purchase to purchase is only 1%. Looking at the "single conversion rate", the conversion rate from having purchase intention to real purchase has reached 20%. It shows that the stage from browsing to collecting and adding to the shopping cart is an important link in the improvement of indicators.
5. RFM analysis
recent_buy = df[df['behavior_type']=='4'].groupby('user_id')['date'].max().\ apply(lambda x:(datetime.strptime('2014-12-20','%Y-%m-%d')-x).days).reset_index().\ rename(columns={"date":"recent"}) buy_freq = df[df['behavior_type']=='4'].groupby('user_id')['date'].count().reset_index().\ rename(columns={"date":"freq"}) rfm = pd.merge(recent_buy,buy_freq,how='inner',on='user_id') r_bins = [0,5,10,15,20,50] f_bins = [1,30,60,90,120,900] rfm['r_score'] = pd.cut(rfm['recent'],bins=r_bins,labels=[5.0,4.0,3.0,2.0,1.0],right = False) rfm['f_score'] = pd.cut(rfm['freq'],bins=f_bins,labels=[1.0,2.0,3.0,4.0,5.0],right = False) rfm['r_score']=rfm['r_score'].astype(float) rfm['f_score']=rfm['f_score'].astype(float) avg_r =rfm['r_score'].mean() avg_f =rfm['f_score'].mean() def func1(x): if x > avg_r: return 'high' else: return 'low' def func2(x): if x > avg_f: return 'high' else: return 'low' rfm["r"] =rfm['r_score'].apply(func1) rfm["f"] =rfm['f_score'].apply(func2) rfm['rfm'] = rfm["r"] + '-' + rfm["f"] def trans_labels(x): if x == "high-high": return"Important value customers" elif x == "low-high": return"Important call back customers" elif x == "high-low": return"Important deep ploughing customers" else: return"Important customer recovery" rfm["label"] = rfm['rfm'].apply(trans_labels) rfm["label"].value_counts() rfm.loc[rfm["label"] =='Important value customers','color']='g' rfm.loc[~(rfm["label"] =='Important value customers'),'color']='r' rfm.plot.scatter('freq','recent',c= rfm.color)
It can be seen from the figure that important value customers are mainly concentrated in the range of 20-100 times of consumption frequency in recent 5 days, which is also related to the recent double 12 activities.
6, User activity analysis
pivoted_counts = df.pivot_table(index= 'user_id', columns= 'date', values= 'user_id1', aggfunc= 'count').fillna(0) # All greater than once are set to 1 df_purchase = pivoted_counts.applymap(lambda x:1 if x>0 else 0) df_purchase.head() def active_status(data): status = [] for i in range(31): if data[i] == 0: if len(status)>0: if status[i-1] == "unreg": # Unregistered customer status.append("unreg") else: # Inactive user status.append("unactive") else: status.append("unreg") # If you spend more this month else: if len(status) == 0: # new user status.append("new") else: if status[i-1] == "unactive": # Return user status.append("return") elif status[i-1] == "unreg": status.append("new") else: status.append("active") return pd.Series(status) purchase_stats = df_purchase.apply(active_status,axis =1) purchase_stats.columns = df_purchase.columns purchase_stats.head() purchase_stats_ct = purchase_stats.replace('unreg',np.NaN).apply(lambda x:pd.value_counts(x)) purchase_stats_ct.fillna(0).T.plot.area()
As can be seen from the above figure, both active customers and returning customers fluctuated significantly before and after December 12, which is related to the double 12 activity of Taobao. In other times, active customers, new customers, returning customers and inactive customers maintain a relatively stable state.
7, Analysis and summary
- Business process
Users need to be guided to collect and join the shopping cart, such as issuing coupons, full reduction activities, giving gifts and other preferential ways to users, so as to improve the purchase conversion rate through mass sales. And for high click through goods, we should focus on analysis and optimize the recommendation mechanism of goods, so that users can click to buy.
- In the time dimension
In the user's leisure time period, such as 7-10 o'clock in commuting time, 12-13 o'clock in lunch time and 6-10 o'clock in the evening, promotional activities and some related marketing should be carried out. In the choice of week, the focus should be on Monday to Friday, especially the activities on Thursday and Friday, which can greatly increase the conversion rate of user purchase.
- User group
VIP services need to be provided to users of important value, for example, VIP can get coupons; For important retained customers, send SMS and e-mail, or push by APP client, so that customers can come back for consumption. In addition, customers can be recalled again through some festival discounts. For important development and maintenance users, we can focus on increasing consumption frequency and stimulating consumption through activities such as coupons.