Taobao user behavior analysis project: Python

1, Analysis background

  online shopping has become an indispensable part of people's life. Based on the data of Taobao app platform, this project analyzes user behavior through relevant indicators, so as to explore user related behavior patterns.

        The data set used in this paper contains the user behavior data of Taobao App mobile terminal within one month from November 18, 2014 to December 18, 2014. The data is recorded for 12256906 days, with a total of 6 columns of data.

  Data link: Dataset - Alibaba cloud Tianchi

  • user_id: user identity

  • item_id: Commodity id

  • behavior_type: user behavior type (including clicking, collecting, adding to shopping cart and paying, represented by numbers 1, 2, 3 and 4 respectively)

  • user_geohash: geographic location

  • item_category: category ID (category to which the commodity belongs)

  • Time: the time when the user behavior occurs

2, Analytical framework

3, Analysis and visualization

 

1. Import related libraries

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as  mpl
import matplotlib.pyplot as  plt
import seaborn as sns
from datetime import datetime
import warnings

mpl.rcParams["font.family"] = "SimHei"
mpl.rcParams["axes.unicode_minus"] = False

warnings.filterwarnings('ignore')

2. Data preprocessing

2.1 data preview

df = pd.read_csv("tianchi_mobile_recommend_train_user.csv",dtype=str)
df.shape
df.info()
df.sample(10)

  2.2 calculation of missing rate

df.isnull().sum()/len(df['user_id'])

   2.3 delete the geographic location column

del df['user_geohash']
df.head()

  two point four   The processing time column is divided into date column and hour column

df['date'] = df['time'].str[0:10]
df['hour'] = df['time'].str[11:]
df.head()

  two point five   Change the time and date columns to the standard date format, and the hour column to the int format

df['date'] = pd.to_datetime(df['date'])
df["time"] = pd.to_datetime(df["time"])
df["hour"] = df["hour"].astype(int)
df.info()

  two point six   Arrange the data in ascending order according to the time column and generate a new index

df = df.sort_values(by='time',ascending=True).reset_index(drop=True)
df.head()

  2.7 viewing data distribution

df.describe(include='all')

  two point eight   Give an overview of the time data

df["date"].unique()

3. User behavior analysis

3.1 calculation of PV and UV

total_pv = df["user_id"].count()
total_pv
total_uv = df["user_id"].nunique()
total_uv

      As can be seen from the figure, the total page views of the website are 12256906 times, and the number of independent visitors on the page is 10000.

3.2 PV and UV in date dimension

pv_daily = df.groupby('date')['user_id'].count().reset_index()
uv_daily = df.groupby('date')['user_id'].nunique().reset_index()
pv_uv_daily = pv_daily.merge(uv_daily,how='left',on='date')
pv_uv_daily.columns = ['date',"pv","uv"]
pv_uv_daily.head()
pv_daily.set_index('date',inplace=True)
uv_daily.set_index('date',inplace=True)

plt.figure(figsize=(16,10))
plt.subplot(211)
plt.plot(pv_daily['user_id'],c="r")
plt.title("Total page views per day(PV)")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.subplot(212)
plt.plot(uv_daily['user_id'],c="g")
plt.title("Number of unique visitors per page per day(UV)")
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()
plt.savefig("PV and UV Change trend of",dpi=300)

        It can be seen from the figure that pv and uv data show a high positive correlation. Before and after the double 12, pv and uv fluctuated between 350000 and 400000. When the double 12, page views increased sharply, which proved that the effect of this activity was very good.

3.3 PV and UV in time dimension

pv_hour = df.groupby('hour')['user_id'].count()
uv_hour = df.groupby('hour')['user_id'].nunique()
pv_uv_hour = pd.concat([pv_hour,uv_hour],axis=1)
pv_uv_hour.columns=["pv_hour","uv_hour"]
pv_uv_hour.head()

fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)

ax1.plot(pv_uv_hour["pv_hour"],c="steelblue",label="Total page views per hour")
plt.ylabel("Page views")
plt.legend(loc = 2)

ax2 = ax1.twinx()
ax2.plot(pv_uv_hour["uv_hour"],c="red",label="Page unique visitors per hour")
plt.ylabel("Page unique visitors")
plt.xticks(range(0,24),pv_uv_hour.index)
plt.legend(loc = 6, bbox_to_anchor=(0, 0.9))
plt.tight_layout()
plt.savefig("Hourly PV and UV Change trend of",dpi=300)

        As can be seen from the figure, from 22:00 p.m. to 5:00 a.m., the number of users and traffic on the page gradually decrease, and many people are resting during this period. From 6:00 to 10:00 in the morning, the number of users gradually shows an upward trend, and there is a relatively stable state from 10:00 to 18:00. This time period is normal working hours. However, from 18:00 to 22:00 in the evening, the number of users surged sharply, reaching the maximum number of users visited in a day. Operators can refer to the active time period of users and take some promotional activities.

  3.4 calculate the number of users who click, collect, add shopping cart and pay

type_1 = df[df['behavior_type']=="1"]["user_id"].count()
type_2 = df[df['behavior_type']=="2"]["user_id"].count()
type_3 = df[df['behavior_type']=="3"]["user_id"].count()
type_4 = df[df['behavior_type']=="4"]["user_id"].count()
print("Click user:",type_1)
print("Favorite users:",type_2)
print("Add cart user:",type_3)
print("Payment user:",type_4)

        As can be seen from the figure, users click on the page – > collect and add shopping cart – > pay, which gradually shows a downward trend. This analysis will be further explained in the funnel diagram below.

  3.5 click, collect, add shopping cart and pay users in the date dimension

pv_date_type = pd.pivot_table(df, index='date',columns='behavior_type',\
               values='user_id',aggfunc='count')
pv_date_type.columns=['click','Collection','add to cart','payment']
pv_date_type.head()

fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.plot(pv_date_type['Collection'],label='Collection',linestyle = '--')
ax1.plot(pv_date_type['add to cart'],label= 'add to cart',linestyle = '-.')
ax1.plot(pv_date_type['payment'],label= 'payment')
plt.ylim(0,30000)
plt.legend(loc=2)

ax2 = ax1.twinx()
ax2.plot(pv_date_type['click'],label='click',color='r',linestyle = ':')

plt.legend(loc = 'upper right')

    As can be seen from the figure, clicking, collecting, adding to the shopping cart and paying fluctuated sharply around the double 12, which also proved that the effect of this activity was very good.

   3.6 click, collect, add shopping cart and pay users in week dimension

pv_time_week_type = pd.pivot_table(df,index='time_week',columns='behavior_type',\
                              values='user_id',aggfunc='count')
pv_time_week_type.columns=['click','Collection','add to cart','payment']
pv_time_week_type.head()



fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(111)
ax1.plot(pv_time_week_type['Collection'],label='Collection',linestyle = '--')
ax1.plot(pv_time_week_type['add to cart'],label= 'add to cart',linestyle = '-.')
ax1.plot(pv_time_week_type['payment'],label= 'payment')
plt.ylim(0,60000)
plt.legend(loc=2)

ax2 = ax1.twinx()
ax2.plot(pv_time_week_type['click'],label='click',color='r',linestyle = ':')

plt.legend(loc = 'upper right')
plt.savefig("Different user behaviors in different weeks PV Change trend",dpi=300)

        As can be seen from the above figure, the payment behavior is mainly completed on Friday in the whole week, and it is relatively flat on Saturday and Sunday. Other behaviors also have the same pattern, especially click behavior, which increases rapidly from Monday to decreases rapidly after Thursday, showing relatively violent fluctuations.

  3.7 click, collect, add shopping cart and pay users in the time dimension

pv_hour_type = pd.pivot_table(df,index='hour',columns='behavior_type',\
                              values='user_id',aggfunc='count')
pv_hour_type.columns=['click','Collection','add to cart','payment']
pv_hour_type.head()


plt.figure(figsize=(16,10))

sns.lineplot(data = pv_hour_type[['Collection', 'add to cart', 'payment']])
plt.legend(loc="best")
pv_hour_type["click"].plot(c="r",linewidth=2,label="click",secondary_y=True)
plt.legend(loc = 1)

plt.tight_layout()
plt.savefig("Different user behaviors in different hours PV Change trend",dpi=300)

      As can be seen from the above figure, the four behaviors show a high positive correlation, which gradually decreases from 22:00 p.m. to 5:00 a.m., and the number of users gradually increases from 6:00 a.m. to 10:00 a.m. there is a relatively stable state from 10:00 to 18:00 p.m. from 18:00 to 22:00 p.m., users surge sharply, reaching the maximum number of users in a day.

    three point eight   Breakdown of user behavior in the top 10 payment times

df['user_id1'] = df['user_id']
buy_first = pd.pivot_table(df, index = 'user_id', columns = 'behavior_type',\
            values = 'user_id1', aggfunc='count')
buy_first.columns = ['click','Collection','add to cart','payment']
buy_first_10 = buy_first.sort_values(by='payment',ascending = False)[0:10]
buy_first_10

plt.figure(figsize=(16,10))

plt.subplot(311)
plt.plot(buy_first_10["click"],c='steelblue')
plt.title("Change trend of hits")

plt.subplot(312)
plt.plot(buy_first_10["Collection"],c='g')
plt.title("Change trend of collection number")

plt.subplot(313)
plt.plot(buy_first_10["add to cart"],c='r')
plt.title("Change trend of adding shopping cart")

plt.tight_layout()
plt.savefig("The top 10 paying users have the change trend of clicking, collecting and adding to the shopping cart",dpi=300)

 

    It can be seen from the above figure that the users who purchase the most often do not necessarily click, collect and join the car purchase most often.

4. Analysis of user consumption behavior

4.1 per capita consumption times of daily consumption users

total_custome = df[df['behavior_type'] == "4"].groupby(["date","user_id"])["behavior_type"].count().reset_index().rename(columns={"behavior_type":"total"})
total_custome.head()

total_custome2 = total_custome.groupby("date")["total"].sum()/total_custome.groupby("date")["total"].count()
total_custome2.head(10)


x = len(total_custome2.index)
y = total_custome2.index.astype(str)

plt.figure(figsize=(20,7))


plt.plot(total_custome2.values)
plt.xticks(range(0,31,5),[y[i] for i in range(0,x,5)])

plt.title('Per capita consumption per day')
plt.tight_layout()
plt.savefig("Per capita consumption per day",dpi=300)

  4.2 daily consumption times of active users

df["operation"] = 1
aa = df.groupby(["date","user_id",'behavior_type'])["operation"].count().\
      reset_index().rename(columns={"operation":"total"})
aa.head(10)

#Total daily consumption times / daily consumption number (nunique)
aa1 = aa.groupby("date").apply(lambda x: x[x["behavior_type"]=="4"]["total"].sum()/x["user_id"].nunique())
aa1.head(10)

x = len(aa1.index)
y = aa1.index.astype(str)

plt.figure(figsize=(20,7))


plt.plot(aa1.values)
plt.xticks(range(0,31,5),[y[i] for i in range(0,x,5)])

plt.title('Consumption times of active users per day')
plt.tight_layout()
plt.savefig('Consumption times of active users per day',dpi=300)

  4.3 payment rate analysis

rate = aa.groupby('date').apply(lambda x:x[x['behavior_type']=='4']['user_id'].nunique()/x['user_id'].nunique())
rate.head()


x=len(rate.index)
y=rate.index.astype(str)

plt.figure(figsize=(20,7))
plt.plot(rate.values)

plt.xticks(range(0,31,5),[y[i] for i in range(0,31,5)])

plt.title('Payment rate analysis')
plt.tight_layout()
plt.savefig('Payment rate analysis',dpi=300)

  four point four   Repurchase rate analysis

re_buy = df[df['behavior_type']=='4'].groupby('user_id')['date'].nunique()
re_buy.head()

re_buy[re_buy>1].count()/re_buy.count()

4.5 conversion analysis

df_count = df.groupby("behavior_type").size().reset_index().\
           rename(columns={"behavior_type":"link",0:"Number of people"})
           
type_dict = {
    "1":"click",
    "2":"Collection",
    "3":"add to cart",
    "4":"payment"
}
df_count["link"] = df_count["link"].map(type_dict)

a = df_count.iloc[0]["Number of people"]
b = df_count.iloc[1]["Number of people"]
c = df_count.iloc[2]["Number of people"]
d = df_count.iloc[3]["Number of people"]
funnel = pd.DataFrame({"link":["click","Collect and add shopping cart","payment"],"Number of people":[a,b+c,d]})

funnel["Overall conversion"] = [i/funnel["Number of people"][0] for i in funnel["Number of people"]]
funnel["Single conversion"] = np.array([1.0,2.0,3.0])
for i in range(0,len(funnel["Number of people"])):
    if i == 0:
        funnel["Single conversion"][i] = 1.0
    else:
        funnel["Single conversion"][i] = funnel["Number of people"][i] / funnel["Number of people"][i-1]
# The drawing is as follows
import plotly.express as px
import plotly.graph_objs as go

trace = go.Funnel(
    y = ["click", "Collect and add shopping cart", "purchase"],
    x = [funnel["Number of people"][0], funnel["Number of people"][1], funnel["Number of people"][2]],
    textinfo = "value+percent initial",
    marker=dict(color=["deepskyblue", "lightsalmon", "tan"]),
    connector = {"line": {"color": "royalblue", "dash": "solid", "width": 3}})
    
data =[trace]

fig = go.Figure(data)

fig.show()

        As can be seen from the above figure, since collecting and joining car purchase are user behaviors with purchase intention, and in no order, we regard their combination as a stage. As can be seen from the funnel chart and the funnel table above, the conversion rate from browsing to having purchase intention (collecting and adding to the shopping cart) is only 5%, but the conversion rate from real purchase to purchase is only 1%. Looking at the "single conversion rate", the conversion rate from having purchase intention to real purchase has reached 20%. It shows that the stage from browsing to collecting and adding to the shopping cart is an important link in the improvement of indicators.

5. RFM analysis

recent_buy = df[df['behavior_type']=='4'].groupby('user_id')['date'].max().\
             apply(lambda x:(datetime.strptime('2014-12-20','%Y-%m-%d')-x).days).reset_index().\
             rename(columns={"date":"recent"})

buy_freq = df[df['behavior_type']=='4'].groupby('user_id')['date'].count().reset_index().\
          rename(columns={"date":"freq"})

rfm = pd.merge(recent_buy,buy_freq,how='inner',on='user_id')

r_bins = [0,5,10,15,20,50]
f_bins = [1,30,60,90,120,900]
rfm['r_score'] = pd.cut(rfm['recent'],bins=r_bins,labels=[5.0,4.0,3.0,2.0,1.0],right = False)
rfm['f_score'] = pd.cut(rfm['freq'],bins=f_bins,labels=[1.0,2.0,3.0,4.0,5.0],right = False)
rfm['r_score']=rfm['r_score'].astype(float)
rfm['f_score']=rfm['f_score'].astype(float)

avg_r =rfm['r_score'].mean()
avg_f =rfm['f_score'].mean()

def func1(x):
    if x > avg_r:
        return 'high'
    else:
        return 'low'

def func2(x):
    if x > avg_f:
        return 'high'
    else:
        return 'low'

rfm["r"] =rfm['r_score'].apply(func1)
rfm["f"] =rfm['f_score'].apply(func2)

rfm['rfm'] = rfm["r"] + '-' + rfm["f"] 


def trans_labels(x):
    if x == "high-high":
        return"Important value customers"
    elif x == "low-high":
        return"Important call back customers"
    elif x == "high-low":
        return"Important deep ploughing customers"
    else:
        return"Important customer recovery"

rfm["label"] = rfm['rfm'].apply(trans_labels)

rfm["label"].value_counts()

rfm.loc[rfm["label"] =='Important value customers','color']='g'
rfm.loc[~(rfm["label"] =='Important value customers'),'color']='r'
rfm.plot.scatter('freq','recent',c= rfm.color)

 

 

 

        It can be seen from the figure that important value customers are mainly concentrated in the range of 20-100 times of consumption frequency in recent 5 days, which is also related to the recent double 12 activities.

6, User activity analysis

pivoted_counts = df.pivot_table(index= 'user_id',
                               columns= 'date',
                               values= 'user_id1',
                               aggfunc= 'count').fillna(0)
# All greater than once are set to 1
df_purchase = pivoted_counts.applymap(lambda x:1 if x>0 else 0)
df_purchase.head()


def active_status(data):
    status = []
    for i in range(31):
        if data[i] == 0:
            if len(status)>0:
                if status[i-1] == "unreg":
                 # Unregistered customer
                    status.append("unreg")
                else:
                 # Inactive user
                    status.append("unactive")
            else:
                status.append("unreg")
            
        # If you spend more this month
        else:
            if len(status) == 0:
             # new user
                status.append("new")
            else:
                if status[i-1] == "unactive":
                 # Return user
                    status.append("return")
                elif status[i-1] == "unreg":
                    status.append("new")
                else:
                    status.append("active")
    return pd.Series(status)  

purchase_stats = df_purchase.apply(active_status,axis =1)
purchase_stats.columns = df_purchase.columns
purchase_stats.head()



purchase_stats_ct = purchase_stats.replace('unreg',np.NaN).apply(lambda x:pd.value_counts(x))

purchase_stats_ct.fillna(0).T.plot.area()

         As can be seen from the above figure, both active customers and returning customers fluctuated significantly before and after December 12, which is related to the double 12 activity of Taobao. In other times, active customers, new customers, returning customers and inactive customers maintain a relatively stable state.

7, Analysis and summary

  • Business process

      Users need to be guided to collect and join the shopping cart, such as issuing coupons, full reduction activities, giving gifts and other preferential ways to users, so as to improve the purchase conversion rate through mass sales. And for high click through goods, we should focus on analysis and optimize the recommendation mechanism of goods, so that users can click to buy.

  • In the time dimension

        In the user's leisure time period, such as 7-10 o'clock in commuting time, 12-13 o'clock in lunch time and 6-10 o'clock in the evening, promotional activities and some related marketing should be carried out. In the choice of week, the focus should be on Monday to Friday, especially the activities on Thursday and Friday, which can greatly increase the conversion rate of user purchase.

  • User group

        VIP services need to be provided to users of important value, for example, VIP can get coupons; For important retained customers, send SMS and e-mail, or push by APP client, so that customers can come back for consumption. In addition, customers can be recalled again through some festival discounts. For important development and maintenance users, we can focus on increasing consumption frequency and stimulating consumption through activities such as coupons.

Tags: Python Data Analysis

Posted on Tue, 12 Oct 2021 00:16:45 -0400 by Corpheous