Taobao APP user behavior data analysis case -- Python


1, Project introduction

1.1 analysis background

E-commerce has grown rapidly over the years, and online shopping has become part of everyday life; the spread of the mobile internet has made it faster and more convenient still. As one of the largest e-commerce platforms in China, Taobao has a huge traffic advantage. The Taobao APP provides search, product category navigation, personalized recommendation and advertising, guiding users toward products they are likely to want and encouraging consumption; together, these features distribute the enormous traffic of the home page effectively. By collecting and analyzing user data, the platform can keep tuning and upgrading these systems, continuously improving the shopping experience and forming a virtuous cycle. In addition, features such as favorites and the shopping cart give users a convenient path to purchase.
Based on the characteristics of the data set, this article analyzes users' behavior paths within the APP to understand the current state of the business and users' habits, digs deeper into potential problems, and proposes corresponding solutions.

1.2 data set introduction

The data set comes from the official Alibaba Cloud Tianchi data set: User Behavior Data from Taobao for Recommendation.
It contains all behaviors (click, purchase, add-to-cart and favorite) of about one million randomly sampled users between November 25 and December 3, 2017, more than 100 million records in total. Each row represents one user behavior and consists of user ID, item ID, item category ID, behavior type and timestamp, separated by commas. The columns are described in detail below:

Column name    Explanation
user_id        Integer, serialized user ID
item_id        Integer, serialized item ID
category_id    Integer, serialized category ID of the item
behavior       String, enumerated type, one of ("pv", "buy", "cart", "fav")
timestamp      Unix timestamp of when the behavior occurred

The following is a description of the behavior types

Behavior type    Explanation
pv               Page view of an item's detail page, equivalent to a click
buy              Purchase of an item
cart             Add an item to the shopping cart
fav              Favorite (collect) an item

1.3 data understanding

  1. The data set only covers 9 days of user behavior, a relatively short span;
  2. The data set contains only user ID, item ID, item category ID, behavior type and timestamp, which limits the analysis considerably. The four dimensions with analytical value are behavior, item ID, time and item category ID.
  3. With more than 100 million records, the full data set is too large, so this article extracts 2 million records, covering 19,544 users, as a representative sample. To avoid misunderstanding, all figures below are based on the behavior of these 19,544 users over the 9 days from November 25 to December 3, 2017, not on the full data set.

2, Problem definition & analysis framework construction

2.1 raising questions

According to the data content and characteristics of the data set, this paper mainly analyzes the following aspects:

  1. Analyze common traffic metrics of the Taobao APP.
  2. Analyze common user metrics of the Taobao APP.
  3. Study the distribution of user behavior in the APP across different time scales and whether behavior follows temporal patterns.
  4. Analyze users' behavior paths through the APP (click, favorite, add to cart, purchase).
  5. Analyze whether the purchased products include hot-selling "blockbuster" items.

2.2 analysis framework construction

3, Data processing

3.1 importing libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import time
import pyecharts.options as opts
from pyecharts.charts import Funnel
import copy

%matplotlib inline
# Set matplotlib parameters
plt.rcParams["font.sans-serif"] = ["SimHei"]   # Display Chinese labels correctly
plt.rcParams["axes.unicode_minus"] = False     # Display minus signs correctly
plt.style.use("ggplot")

3.2 import data

# set field name
columns = ["user_id","item_id","category_id","behavior","timestamp"]

# Read the data; use get_chunk to fetch 2 million rows
data = pd.read_csv("E:/Data analysis/data set/UserBehavior.csv",header=None,sep=",",iterator=True,names=columns)
user_behavior = data.get_chunk(2000000)

# Preview the data
user_behavior.head()

3.3 data cleaning

3.3.1 view overall data information

The data has 2,000,000 rows × 5 columns, each field's dtype is as expected, and the extracted sample occupies 76.3 MB of memory.

3.3.2 missing value handling

See how many records have missing values.
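The dropped check can be sketched as follows (a minimal sketch on a toy frame standing in for `user_behavior`; the real frame has the same five columns):

```python
import pandas as pd

# Toy frame standing in for user_behavior
user_behavior = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 11, 12],
    "category_id": [100, 100, 101],
    "behavior": ["pv", "buy", "pv"],
    "timestamp": [1511658000, 1511658060, 1511658120],
})

# Missing values per column, and the number of rows affected
missing_per_column = user_behavior.isnull().sum()
rows_with_missing = user_behavior.isnull().any(axis=1).sum()
print(missing_per_column)
print("rows with missing values:", rows_with_missing)
```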


Running the check shows no missing values; the data is complete and clean.

3.3.3 repeated value processing

See how many records are duplicate.
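The check can be sketched with `DataFrame.duplicated` (a minimal sketch on toy data; one row is deliberately duplicated here to show what the count looks like):

```python
import pandas as pd

# Toy frame with one fully duplicated row
user_behavior = pd.DataFrame({
    "user_id": [1, 1, 2],
    "item_id": [10, 10, 12],
    "behavior": ["pv", "pv", "pv"],
    "timestamp": [1511658000, 1511658000, 1511658120],
})

# duplicated() marks repeats of an earlier identical row; sum() counts them
duplicate_rows = user_behavior.duplicated().sum()
print("duplicate rows:", duplicate_rows)
```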


Running the check shows no duplicate records in the data set, so no further processing is needed.

3.3.4 splitting the time dimension into new columns

# Keep the original timestamp column and add new time, date and hour columns
# time.localtime converts a Unix timestamp to local time

user_behavior = user_behavior.assign(time=user_behavior["timestamp"].map(lambda x: time.strftime("%Y-%m-%d %H:%M:%S",time.localtime(x))))
user_behavior = user_behavior.assign(date=user_behavior["timestamp"].map(lambda x: time.strftime("%Y-%m-%d",time.localtime(x))))
user_behavior = user_behavior.assign(hour=user_behavior["timestamp"].map(lambda x: time.strftime("%H",time.localtime(x))))
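An equivalent vectorized conversion with `pd.to_datetime` avoids the per-row `map` (a sketch on a toy frame; note that `unit="s"` interprets the integers as Unix seconds in UTC, whereas `time.localtime` above uses the local timezone):

```python
import pandas as pd

# Toy frame with two Unix timestamps (UTC midnights)
user_behavior = pd.DataFrame({"timestamp": [1511654400, 1511740800]})

# Vectorized conversion: one call instead of a Python-level map
ts = pd.to_datetime(user_behavior["timestamp"], unit="s")
user_behavior["date"] = ts.dt.strftime("%Y-%m-%d")
user_behavior["hour"] = ts.dt.strftime("%H")
print(user_behavior)
```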

3.3.5 removing outliers

According to the description of the source data set, valid records fall between November 25 and December 3, 2017, so records outside this range can be treated as outliers and removed.

# Filtering time, excluding records not in the required time interval
user_behavior = user_behavior[(user_behavior["date"]>"2017-11-24")&(user_behavior["date"]<"2017-12-04")]

# After processing, view the overall data dimension again

Output: (1999009, 7), indicating that the original sample contained 991 records outside the November 25 - December 3, 2017 range. Cleaning is now complete and the analysis can begin.

4, Traffic metric analysis

4.1 absolute traffic metrics

Based on the content of the data set, this analysis uses the following 4 absolute traffic metrics:

  1. Daily pv: number of page views (clicks) in the APP that day;
  2. Daily uv: number of distinct users entering the APP that day (deduplicated);
  3. Daily active users: defined here as users generating three or more behavior records that day (deduplicated);
  4. Daily paying users: number of distinct users with a purchase ("buy") behavior that day (deduplicated).

The specific code is as follows:

# Daily pv
daily_pv = user_behavior[user_behavior["behavior"]=="pv"][["user_id","date"]].groupby("date").count()
daily_pv = daily_pv.rename(columns={"user_id":"pv"})

# Daily uv
daily_uv = user_behavior[["user_id","date"]].groupby("date").nunique()
daily_uv = daily_uv.drop("date",axis=1)   # older pandas also counts the group key; drop it
daily_uv = daily_uv.rename(columns={"user_id":"uv"})

# Daily active users: a user with 3 or more behaviors in a day counts as active
daily_active = user_behavior[["user_id","behavior","date"]].groupby(["date","user_id"]).count()
daily_active = daily_active[daily_active["behavior"]>2]
daily_active = daily_active.groupby(level=0).count()
daily_active = daily_active.rename(columns={"behavior":"active_user"})

# Daily paying users
daily_purchaser = user_behavior[user_behavior["behavior"]=="buy"]
daily_purchaser = daily_purchaser[["user_id","date","behavior"]].groupby(["date","user_id"]).count()
daily_purchaser = daily_purchaser.groupby(level=0).count()
daily_purchaser = daily_purchaser.rename(columns={"behavior":"buy_user"})

# Merge the metrics into one frame
df_target = daily_pv.join([daily_uv,daily_active,daily_purchaser])
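Daily uv can also be computed more directly with a grouped `nunique`, without the extra drop/rename steps (a sketch on toy data of my own, not the project frame):

```python
import pandas as pd

# Toy behavior log: user 1 appears twice on the 25th
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "date":    ["2017-11-25", "2017-11-25", "2017-11-25", "2017-11-26"],
})

# Distinct users per day in one step
daily_uv = toy.groupby("date")["user_id"].nunique().rename("uv")
print(daily_uv)
```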

Use matplotlib to draw a line chart.

plt.figure(figsize=(10,4),dpi=200)  # Set the figure size

plt.ylabel("Unit: 10,000",fontsize=10)

plt.title("Absolute traffic metrics",fontsize=14)

plt.plot(df_target["pv"]/10000,linewidth=2,marker="o",color="#00868B",label="pv")
plt.plot(df_target["uv"]/10000,linewidth=2,marker="o",color="#ff5338",label="uv")
plt.plot(df_target["active_user"]/10000,linewidth=2,marker="o",color="#fe8c00",label="Active users")
plt.plot(df_target["buy_user"]/10000,linewidth=2,marker="o",color="#595959",label="Paying users")
plt.legend()


Daily pv is an order of magnitude higher than the other series. With pv excluded, the remaining series look as follows:

Summary of absolute traffic metrics:

  • 1. Daily pv, daily uv, daily active users and daily paying users all trend upward, with a clear jump on December 2-3: daily pv up 33.0%, daily uv up 34.7%, daily active users up 32.0% and daily paying users up 24.0% (figures comparing the average of the last two days with the average of the first seven days).
  • 2. Although December 2-3 is a weekend, the previous weekend (November 25-26) shows no comparable lift, so the weekend effect can be ruled out. Since December 2-3 is close to Double 12 (December 12), the most likely explanation is extra traffic brought in by the warm-up of the Double 12 promotional campaigns.

4.2 relative traffic metrics

Absolute traffic metrics reflect the overall operation of the APP, while relative metrics reveal a deeper picture. Based on the metrics above, the corresponding relative metrics are:

  1. pv per user per day: daily pv / daily uv;
  2. Daily active-user ratio: daily active users / daily uv;
  3. Daily paying-user ratio: daily paying users / daily uv;
  4. Bounce rate: the number of users who viewed only one page within the statistical window, divided by uv; the window used here is one day.

The specific code is as follows:

# pv per capita per day
df_target["pv/uv"] = df_target["pv"]/df_target["uv"]

# Proportion of daily active users
df_target["active_rate"] = df_target["active_user"]/df_target["uv"]

# Proportion of daily paying users
df_target["buy_rate"] = df_target["buy_user"]/df_target["uv"]



# Set drawing size
fig, ax1 = plt.subplots(1,1,figsize=(10,4),dpi=200)

n = df_target.shape[0]
x = np.arange(n)

color = '#595959'

ax1.set_title("Daily relative traffic metrics")
ax1.set_ylabel('proportion', color=color,fontsize=14)

# Bar chart: proportion of daily active users (bar offsets assumed for side-by-side bars)
ax1.bar(x-0.15, df_target["active_rate"], color="#fe8c00", width=0.3, label="Active-user ratio")

# Bar chart: proportion of daily paying users
ax1.bar(x+0.15, df_target["buy_rate"], width=0.3, color=color, label="Paying-user ratio")

ax1.tick_params(axis='y', labelcolor=color, labelsize=14)
ax1.set_xticks(x)
ax1.set_xticklabels(df_target.index, fontsize=8)

# Secondary axis: daily uv and pv per user
ax2 = ax1.twinx()
color = '#ff5338'

# Daily uv
ax2.plot(x, df_target["uv"]/1000, color=color, marker="o", linewidth=2, label="daily uv (unit: thousand users)")

# pv per user per day
ax2.plot(x, df_target["pv/uv"], color="#00868B", marker="o", linewidth=2, label="pv per user (unit: views)")

fig.legend(bbox_to_anchor=(1.15, 1.0))  # Legend position

Bounce rate statistics:
(1) With a statistical window of one day, i.e., the daily bounce rate of the APP

df = user_behavior[["date","user_id","behavior"]]
# Assign a power-of-two value to each behavior
df = df.assign(behavior_num = df["behavior"].map({"pv":1,"fav":2,"cart":4,"buy":8}))

# Compute the daily bounce rate of the APP

# Group by date and user_id, and sum behavior_num
df1 = df.groupby(["date","user_id"]).sum()

# A sum of 1 can only mean a single pv and nothing else that day
daily_loss = df1[df1["behavior_num"]==1].groupby(level=0).count()
daily_loss = daily_loss.rename(columns={"behavior_num":"user_lose"})

# Daily bounce rate
daily_loss = daily_uv.join(daily_loss)
daily_loss["lose_rate"] = daily_loss["user_lose"]/daily_loss["uv"]


The line chart of the daily bounce rate is as follows:

plt.figure(figsize=(10,2),dpi=200)  # Set the figure size

plt.title("Daily bounce rate",fontsize=12)

plt.plot(daily_loss["lose_rate"],linewidth=2,marker="o",color="#ff5338")

Summary of relative traffic metrics:

  1. pv per user fluctuated mildly around an average of 13.0, i.e., each user viewed about 13 pages per day.
  2. The daily active-user ratio and paying-user ratio held steady at about 79% and 19% respectively.
  3. On December 2-3, uv rose 34.7%, but the active ratio and paying ratio fell by 2.0% and 7.9% respectively (comparing the average of the last two days with the average of the first seven days). The users attracted by the promotion did not pay more often than regular users, so the promotional traffic lowered the overall payment conversion rate. More precise targeting of promotional traffic is recommended.
  4. Before the promotion (the first 7 days), the daily bounce rate was about 10.3%, indicating very high stickiness; on December 2-3 it rose to 11.0%, 7.0% higher than before, which again suggests the promotional traffic was less precisely targeted.

5, User metric analysis

5.1 number of user purchases

1. Descriptive statistics of user purchase counts (within the 9 days)

df_buy_times = user_behavior[user_behavior["behavior"]=="buy"][["user_id","behavior"]].groupby(["user_id"]).count()

# Descriptive statistics of purchase counts
df_buy_times.describe()

Over these 9 days, 13,330 of the 19,544 users made at least one purchase. Buyers purchased about 3 times each on average; more than half bought fewer than 3 times and most bought fewer than 5. The distribution is right-skewed: a small number of users bought many times, up to a maximum of 72 purchases.

2. Distribution of user purchase counts

plt.figure(figsize=(10,4),dpi=200)   # figure size assumed

plt.title("Distribution of user purchase counts",fontsize=12)
plt.xlabel("Number of purchases",fontsize=10)
plt.ylabel("Number of users",fontsize=10)

plt.hist(df_buy_times["behavior"],bins=72)   # one bin per possible purchase count

Most users purchased no more than 10 times during the period; the chart below shows the distribution for those users.
The number of users falls as the purchase count rises, consistent with typical consumption behavior. Over these 9 days, 72.8% of buyers purchased below the average, while high-frequency buyers (more than 10 purchases) account for only 2.2%. The latter can be regarded as highly loyal users and deserve focused attention.

5.2 repurchase rate

Repurchase rate: the proportion of users with two or more purchases among all purchasing users within the statistical period.
1. Repurchase rate over the 9 days:

# Filter the purchase records
df_buy = user_behavior[user_behavior["behavior"]=="buy"][["user_id","date","behavior"]]

# 9-day repurchase rate
df_rebuy_9 = df_buy.groupby("user_id").behavior.count()
rebuy_rate_9 = (df_rebuy_9 >= 2).sum() / df_rebuy_9.count()

2. Repurchase rate per day:

# Group-count by date and user ID
df_rebuy_1 = df_buy.groupby(["date","user_id"]).count()

# Number of purchasing users per day
buy_user_total = df_rebuy_1.groupby(level=0).count()
buy_user_total = buy_user_total.rename(columns={"behavior":"total"})

# Number of users with two or more purchases per day
repurchase_user = df_rebuy_1[df_rebuy_1["behavior"]>=2].groupby(level=0).count()
repurchase_user = repurchase_user.rename(columns={"behavior":"repur_amount"})

# Daily repurchase rate
df_repur_rate = buy_user_total.join(repurchase_user)
df_repur_rate["repur_rate"] = df_repur_rate["repur_amount"]/df_repur_rate["total"]

# Line chart of the daily repurchase rate
plt.figure(figsize=(10,3),dpi=200)  # Set the figure size

plt.title("Daily repurchase rate",fontsize=12)
plt.plot(df_repur_rate["repur_rate"],linewidth=2,marker="o",color="#ff5338")


Summary of the repurchase rate:

  1. According to the benchmark given in the book Lean Analytics, an e-commerce business whose 90-day repeat purchase rate exceeds 30% is in "loyalty mode". Taobao's repurchase rate is already 66% over just 9 days, so Taobao is firmly in loyalty mode, and the business focus should be customer loyalty, i.e., encouraging loyal repeat customers to buy even more frequently.
  2. The daily repurchase rate fluctuates around 30%, with a small dip at the beginning of December. Users are presumably holding back and waiting for better deals ahead of the Double 12 promotion; whether the dip is normal could only be confirmed by comparison with the same period last year.

5.3 buy-back rate

Because the data set spans only 9 days, this article splits it into 3 time windows of 3 days each and analyzes the users' buy-back rate across adjacent windows.

# Later steps add and subtract dates, so convert the date field to datetime
user_behavior["date"] = pd.to_datetime(user_behavior["date"])

# Split the 9 days into three 3-day windows
# Add a column date1 holding the last day of each window
for m, n in zip(range(1, 10), pd.date_range("2017-11-25", "2017-12-03")):
    if m % 3 == 0:
        user_behavior.loc[((n - np.timedelta64(2,"D")) <= user_behavior["date"]) & (user_behavior["date"] <= n), 'date1'] = n
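The same window assignment can be sketched with plain date arithmetic, mapping each date straight to the last day of its 3-day window (my own illustration on toy dates, not the project's code):

```python
import pandas as pd

# Toy dates spanning the 9-day range
dates = pd.to_datetime(pd.Series(["2017-11-25", "2017-11-27", "2017-11-28", "2017-12-03"]))
start = pd.Timestamp("2017-11-25")

# Integer division by 3 identifies the window; +2 days lands on its last day
window_end = start + pd.to_timedelta((dates - start).dt.days // 3 * 3 + 2, unit="D")
print(window_end.dt.strftime("%Y-%m-%d").tolist())
```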

# Group by time window
df_buy_back = user_behavior[user_behavior.behavior == 'buy'].groupby('date1').user_id.unique()

# Empty lists to record each window and its buy-back rate
days = []
back_buy_rates = []

# Compute the buy-back rate between adjacent windows
for i in range(0,len(df_buy_back.index)-1):
    users = df_buy_back.values[i].tolist()   # Buyers in this window
    count = 0
    for user in users:
        if user in df_buy_back.values[i+1].tolist():    # The user also bought in the next window
            count += 1
    # Compute and record the buy-back rate
    buy_back_rate = count/len(users)
    back_buy_rates.append(buy_back_rate)
    # Record the corresponding window (its last day)
    days.append(df_buy_back.index[i])

# Assemble the output DataFrame
back_buy_rate = pd.DataFrame({'date': days, 'buy_back_rate': back_buy_rates})


Summary of the buy-back rate: in the first time window (November 25-27) the buy-back rate is 44.5%; in the second window (November 28-30) it is 46.7%, slightly higher than the first, suggesting that user loyalty is increasing.

5.4 retention rate

Retention rate is the proportion of users who first use the application in a given period and are still using it after some time. Since the data set covers only 9 days, genuinely new users cannot be identified, so this article treats a user's first appearance within the 9 days as a first login and computes the Taobao APP's retention rate on that basis.

# Retention-rate function; n is the day offset
def nday_retention_rate(df, n):
    # A set to track seen user IDs; lists to record dates and retention rates
    users2 = set()
    days = []
    nday_retentions = []
    # Keep only dates whose day n still falls inside the data set
    dates = np.sort(df["date"].unique())[:-n]
    # Loop over each day
    for date in dates:
        # User IDs seen before this day
        users1 = copy.deepcopy(users2)
        ids = df[df["date"] == date].user_id.unique().tolist()
        for i in ids:
            users2.add(i)      # Record the IDs active today
        users = users2 - users1      # Set difference: today's new users
        # User IDs active n days later
        nday_users = df[df["date"] == date + np.timedelta64(n,'D')].user_id.unique()
        counts = 0
        # Count the new users who are active again n days later
        for nday_user in nday_users:
            if nday_user in users:
                counts += 1
        # Compute the retention rate and record the date and rate
        rate = counts / len(users)
        days.append(date)
        nday_retentions.append(rate)
    # Assemble the result frame
    df_retention_rate = pd.DataFrame({'date': days, 'day-{} retention rate'.format(n): nday_retentions})
    return df_retention_rate

# Call the function to compute day-1, day-3 and day-7 retention
retention_rate1 = nday_retention_rate(user_behavior,1)
retention_rate3 = nday_retention_rate(user_behavior,3)
retention_rate7 = nday_retention_rate(user_behavior,7)

# merge
retention_rate = pd.merge(retention_rate1,retention_rate3,how="left",on="date")
retention_rate =pd.merge(retention_rate,retention_rate7,how="left",on="date")

6, User behavior analysis

6.1 user behavior time analysis

In this data set, user behaviors comprise page view/click (pv), favorite (fav), add-to-cart (cart) and purchase (buy). This section analyzes user behavior by day and by hour to uncover temporal patterns at different scales.

1. Counts of clicks, favorites, add-to-carts and payments by date

daily_behavior = user_behavior.pivot_table(index="date",columns="behavior",aggfunc="count",values="item_id")

plt.figure(figsize=(12,4),dpi=200)  # Set the figure size

# Plot each behavior series (pv is an order of magnitude larger than the rest)
plt.plot(daily_behavior["pv"],marker="o",label="pv")
plt.plot(daily_behavior["cart"],marker="o",label="cart")
plt.plot(daily_behavior["fav"],marker="o",label="fav")
plt.plot(daily_behavior["buy"],marker="o",label="buy")
plt.legend()

plt.title("Daily user behavior statistics")


2. Counts of clicks, favorites, add-to-carts and payments by hour

hourly_behavior = user_behavior.pivot_table(index="hour",columns="behavior",aggfunc="count",values="item_id")

plt.figure(figsize=(12,4),dpi=200)  # Set the figure size

# Plot each behavior series by hour of day
plt.plot(hourly_behavior["pv"],marker="o",label="pv")
plt.plot(hourly_behavior["cart"],marker="o",label="cart")
plt.plot(hourly_behavior["fav"],marker="o",label="fav")
plt.plot(hourly_behavior["buy"],marker="o",label="buy")
plt.legend()

plt.title("Hourly user behavior statistics")


Summary of user behavior across time scales:

  1. By date: favorites, add-to-carts, payments and page views all rose sharply on December 2-3, with favorites up 34.7%, add-to-carts up 38.9% and payments up 19.7%;
  2. By hour: 19:00-23:00 is the activity peak. Activity declines from 22:00 through about 4:00, reaching the daily minimum, then climbs from 5:00 to 10:00 and stays roughly stable from 10:00 to 18:00, which corresponds to working hours. Most users' active hours follow the human body clock, so it is recommended to schedule promotional activities around users' active periods.
  3. Given more data, the characteristics of the dominant user groups in each time period could be analyzed further, and different products pushed to different groups at different times.

6.2 user behavior path analysis

6.2.1 funnel analysis of "view - add to cart - purchase"

The Taobao APP's favorites feature mainly lets users save items they like but will not buy immediately, while the shopping cart serves a similar saving purpose but is mainly for buying items in batches; both improve the shopping experience. Since a favorited item still faces the choice between adding to cart and buying directly, this funnel analysis excludes favorites and only analyzes the pv - cart - buy conversion, with a conversion window of 9 days.

# Extract the data
df_route = user_behavior[["user_id","behavior"]]
# Drop a user's repeated behaviors: the analysis only needs whether a behavior occurred, not how often
df_route = df_route.drop_duplicates(["user_id","behavior"])

# With 4 behaviors there are 16 presence/absence combinations. Assign the values 1, 2, 4, 8; the per-user sums then range over 0-15
df_route = df_route.assign(behavior_num = df_route["behavior"].map({"pv":1,"fav":2,"cart":4,"buy":8}))
# Group by user and sum the values to identify each user's behavior combination
df_route = df_route.groupby("user_id").sum()

# Count users per behavior combination
user_route = df_route.reset_index().groupby("behavior_num").count()
user_route = user_route.rename(columns={"user_id":"user_amount"})

# Behavior path corresponding to each combination value (1-15)
user_route["behavior_com"] = ["pv","fav","pv-fav","cart","pv-cart","fav-cart","pv-fav-cart","buy","pv-buy",
                              "fav-buy","pv-fav-buy","cart-buy","pv-cart-buy","fav-cart-buy","pv-fav-cart-buy"]
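The power-of-two encoding above makes each behavior a bit flag, so a user's summed value uniquely identifies their behavior combination. A minimal illustration on toy data (my own example, not the project's frame):

```python
import pandas as pd

# Toy behavior log: user 1 browses, adds to cart and buys; user 2 only browses
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2],
    "behavior": ["pv", "cart", "buy", "pv"],
})

# pv=1, fav=2, cart=4, buy=8: each behavior occupies one bit
toy["behavior_num"] = toy["behavior"].map({"pv": 1, "fav": 2, "cart": 4, "buy": 8})
combo = toy.groupby("user_id")["behavior_num"].sum()
print(combo)

# A bitwise AND decodes a combination: 13 = pv + cart + buy
decoded = [name for name, bit in [("pv", 1), ("fav", 2), ("cart", 4), ("buy", 8)] if combo.loc[1] & bit]
print(decoded)
```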

Next, count the users in each funnel stage.

# Users who browsed: odd combination values (the pv bit is set)
pv = user_route.loc[::2, "user_amount"].sum()

# Users who browsed and added to cart: pv-cart: 5, pv-fav-cart: 7, pv-cart-buy: 13, pv-fav-cart-buy: 15
pv_cart = user_route.loc[[5,7,13,15], "user_amount"].sum()

# Users who browsed, added to cart and bought: pv-cart-buy: 13, pv-fav-cart-buy: 15
pv_cart_buy = user_route.loc[[13,15], "user_amount"].sum()

funnel_data = pd.DataFrame({"link":["pv","cart","buy"],"Number of people":[pv,pv_cart,pv_cart_buy]})

Calculate the conversion rate of each stage and draw the funnel chart.

con_list = [100 if i==0 else round(funnel_data["Number of people"][i]/funnel_data["Number of people"][i-1]*100,2) for i in range(0,len(funnel_data["Number of people"]))]
funnel_data["Link conversion rate"] = con_list

x_data = funnel_data["link"]
y_data = funnel_data["Link conversion rate"]

# Draw the funnel
data = [[x_data[i], y_data[i]] for i in range(len(x_data))]
funnel1 = Funnel(init_opts=opts.InitOpts(width="800px", height="400px", renderer="svg"))
funnel1.add(series_name="Conversion",
            data_pair=data,
            tooltip_opts=opts.TooltipOpts(is_show=True, trigger="item", formatter="{a} <br/>{b} : {c}%"),
            label_opts=opts.LabelOpts(is_show=True, position="inside", formatter="{b} {c}%"),
            itemstyle_opts=opts.ItemStyleOpts(border_color="#fff", border_width=1))
# Set title and legend
funnel1.set_global_opts(title_opts=opts.TitleOpts(title="User single conversion funnel", pos_left="center"),
                        legend_opts=opts.LegendOpts(is_show=False))
funnel1.render_notebook()

Funnel analysis summary:

  1. The statistical window of this funnel analysis is 9 days. Of the 19,544 users in the sample, only 19,463 have page-view records; the remaining 81 presumably browsed outside the data set's time range, which is why the behavior-encoding approach above excludes them first.
  2. Over these 9 days, 75% of browsing users went on to add items to the shopping cart, and 71.93% of those went on to purchase.
  3. Because of the data set's limited time range, whether these conversion rates are normal cannot yet be judged; comparison with historical data would be needed for a final verdict.

6.2.2 user behavior path distribution

# Keep only users with pv (odd combination values), sorted by user count
user_route = user_route.loc[[1,3,5,7,9,11,13,15]]
user_route = user_route.sort_values(by="user_amount",ascending=False)

# Draw the bar chart
plt.figure(figsize=(10,4),dpi=200)   # figure size assumed

x = range(len(user_route["behavior_com"]))

plt.title("User behavior path distribution")

rects = plt.bar(x,user_route["user_amount"],width=0.6,color="#6462cc")
plt.xticks(x,user_route["behavior_com"],fontsize=8)

# Data labels: user count and share of each path
count = 0
Sum = user_route["user_amount"].sum()
for rect in rects:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width()/2,height+400,str(height)+' users', ha='center',fontsize=8)
    plt.text(rect.get_x() + rect.get_width()/2,height+100,'{:.2f}'.format(user_route["user_amount"].values[count]/Sum*100) + "%",ha='center',fontsize=10)
    count += 1   # advance to the next bar

Summary of the user behavior path distribution:

  1. Taking browsing as the starting point, there are 8 paths in total. Over these 9 days, the three most common paths are browse - add to cart - buy, browse - favorite - add to cart - buy and browse - add to cart, at 32.72%, 21.22% and 14.36% respectively, together 68.3% of the total; the remaining paths have similar shares.
  2. Over the 9 days, 68.14% of users paid for a purchase, 75% used the shopping cart, and 40.12% used favorites (each figure accumulated over the paths containing that behavior).
  3. 28.48% of users both used favorites and paid, so the payment conversion rate among favorites users is 71.00% (28.48% / 40.12%), versus 66.23% (39.66% / 59.88%) among users who did not favorite. Favorites therefore raises payment conversion to some extent, by about 7.18%.
  4. 53.94% of users both used the cart and paid, giving a conversion rate of 71.92% (53.94% / 75%) among cart users versus 56.80% (14.20% / 25%) among non-cart users; likewise, the cart lifts payment conversion, by as much as 26.62%.
  5. In short, users of the favorites and cart features convert better than non-users. It is therefore recommended to optimize both features to encourage more frequent use and thereby lift overall payment conversion. Reminders could also be set for users with items sitting in favorites or the cart: for example, prompt a user whose cart items have gone unpurchased for 3 days, or notify users promptly when a cart item gets a discount.

6.3 distribution of the favorite/cart-to-purchase time interval

The analysis above shows that users of the favorites and cart features convert at a relatively high rate within the 9 days. This section analyzes the distribution of the interval between favoriting or adding an item to the cart and purchasing it; operators can tune the timing of reminders based on this distribution.

# Filter data set
df_cart = user_behavior[user_behavior["behavior"]=="cart"][["user_id","item_id","time"]]
df_buy = user_behavior[user_behavior["behavior"]=="buy"][["user_id","item_id","time"]]

# Join two tables
df_cart_buy = pd.merge(df_cart,df_buy,how="inner",on=["user_id","item_id"],suffixes=('_cart', '_buy'))
df_cart_buy["diff_time"] = df_cart_buy["time_buy"]-df_cart_buy["time_cart"]

# Time format conversion, unit: hour
df_cart_buy['diff_time'] = df_cart_buy['diff_time'].map(lambda x : x.days * 24 +x.seconds/3600)

# Keep only non-negative time differences, since the data set also contains cases where an item is added to the cart after it was purchased
df_cart_buy = df_cart_buy[df_cart_buy["diff_time"]>=0]

# Group, 9 days in total, divided into 9 groups
bins = [0,24,48,72,96,120,144,168,192,240]
df_cart_buy["time_bins"] = pd.cut(df_cart_buy["diff_time"],bins=bins,include_lowest=False)

# Aggregate by group and plot
df = df_cart_buy.groupby("time_bins").user_id.count()
total = df.values.sum()

rects = plt.bar(range(len(df.index)), df.values, color="#6462cc")
plt.xticks(range(len(df.index)), df.index.astype(str), rotation=45, fontsize=8)
plt.xlabel("Unit: h", fontsize=10)
plt.title("Add-to-cart to purchase time interval distribution (within 9 days)", pad=15, fontsize=12)

# Add data labels: the count and the share for each bar
for count, rect in enumerate(rects):
    height = rect.get_height()
    x_mid = rect.get_x() + rect.get_width() / 2
    plt.text(x_mid, height + 50, str(height) + ' users', ha='center', fontsize=9)
    plt.text(x_mid, height + 300, '{:.2f}%'.format(df.values[count] / total * 100), ha='center', fontsize=9)
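
The per-bin shares labeled on the bars can also be accumulated to back statements such as "x% of users purchase within 24 hours". A minimal sketch with hypothetical bin counts (in the analysis the real counts come from the `groupby("time_bins")` above):

```python
import pandas as pd

# Hypothetical per-bin purchase counts, for illustration only
counts = pd.Series([6000, 2000, 800, 500, 300, 200, 100, 60, 40],
                   index=["0-24h", "24-48h", "48-72h", "72-96h", "96-120h",
                          "120-144h", "144-168h", "168-192h", "192-240h"])

# Cumulative share of purchases completed by the end of each bin
cum_share = counts.cumsum() / counts.sum() * 100
print(cum_share["0-24h"])   # share purchased within 24 hours
print(cum_share["48-72h"])  # share purchased within 3 days
```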

Collection / shopping cart - purchase time interval distribution summary:

  1. As the figure shows, over these 9 days nearly 60% of users who added an item to the cart with intent to purchase completed the purchase within 24 hours, and over 87.9% purchased within 3 days; among users who collected an item with intent to purchase, nearly 77% completed the purchase within 24 hours and over 92% within 3 days.
  2. It follows that when a user has not purchased an item 3 days after collecting it or adding it to the cart, the purchase intent can be presumed weak. At that point the user can be guided toward purchase with well-timed prompts or promotional activities, or similar items can be recommended instead.

7, Analysis of the products users purchase

The 9 days of behavior data cover 6,492 commodity categories, of which 3,149 (48.5%) were purchased at least once; of the nearly 630,000 distinct commodities, only about 32,000 (5.1%) have purchase records. The following sections analyze which categories and commodities users prefer to buy.
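
The coverage ratios above follow from quick arithmetic on the reported totals:

```python
# Totals as reported in this analysis
categories_total, categories_bought = 6492, 3149
items_total, items_bought = 630_000, 32_000

category_coverage = categories_bought / categories_total * 100  # ~48.5%
item_coverage = items_bought / items_total * 100                # ~5.1%
print(round(category_coverage, 1), round(item_coverage, 1))
```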

7.1 analysis of purchase counts by commodity category

1. Descriptive statistics of category purchase counts

# Filter record information purchased by users
df_buy = user_behavior[user_behavior["behavior"]=="buy"]

# Group statistics for commodity categories
df_buy_category = df_buy[["user_id","category_id"]].groupby("category_id").count()
df_buy_category = df_buy_category.rename(columns={"user_id":"buy_times"})

# Descriptive statistics of purchase counts per commodity category
df_buy_category.describe()

Over these 9 days, each commodity category was purchased 12.8 times on average, but more than 50% of categories were purchased at most 3 times: the distribution is clearly right-skewed, meaning a small set of hot categories pulls up the average. Next, we find the top 20 categories.
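
The right skew can be read straight off the descriptive statistics: the mean sits well above the median. A minimal illustration with hypothetical purchase counts:

```python
import pandas as pd

# Hypothetical purchase counts per category: one hot category inflates the mean
buy_times = pd.Series([1, 2, 3, 3, 4, 5, 120])
print(buy_times.mean())    # mean is pulled up by the outlier
print(buy_times.median())  # median stays small
```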
2. Top 20 commodity categories

# Ranking, top 20 hot sellers
df_cat_top20= df_buy_category.sort_values(by="buy_times",ascending=False).head(20)

# Add the proportion in the total number of top 20 hot sales purchases
df_cat_top20["percent"] = (df_cat_top20.buy_times)/df_cat_top20.buy_times.sum()*100

# Draw pie chart
plt.pie(df_cat_top20["percent"], labels=df_cat_top20.index.astype(str), autopct="%.2f%%", textprops={"fontsize": 8})
plt.legend(loc="upper right",fontsize=8,bbox_to_anchor=(1.4,0.9),borderaxespad=0.3)
plt.title("Top 20 hot-selling categories: share of purchases",pad=10,fontsize=12)

Over these 9 days, the top 20 hot categories accounted for 20.6% of all purchases: just 0.64% of the purchased categories generated 20.6% of the purchase volume, broadly consistent with the 80/20 (Pareto) rule. It is therefore suggested to pay closer attention to these top 20 categories.
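
The 0.64% figure follows from the 20 hot categories out of the 3,149 categories that were purchased at all:

```python
top_n, purchased_categories = 20, 3149
share_of_categories = top_n / purchased_categories * 100
# ~0.64% of categories, yet they take 20.6% of purchases
print(round(share_of_categories, 2))
```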

7.2 analysis of purchase counts per commodity

1. Descriptive statistics of commodity purchase counts

# Filter record information purchased by users
df_buy = user_behavior[user_behavior["behavior"]=="buy"]

# Group statistics for goods
df_buy_item = df_buy[["user_id","item_id"]].groupby("item_id").count()
df_buy_item = df_buy_item.rename(columns={"user_id":"buy_times"})

# Descriptive statistics of purchase counts per commodity
df_buy_item.describe()

Over these 9 days, each commodity was purchased 1.2 times on average, more than 75% of commodities were purchased only once, and the most-purchased commodity sold only 35 times; no single commodity concentrated a large share of purchases. This indicates that sales in this period relied on the cumulative effect of long-tail commodities rather than being driven by blockbuster "hit" products.
2. Top 20 products

# Ranking, top 20 hot sellers
df_item_top20= df_buy_item.sort_values(by="buy_times",ascending=False).head(20)

# Add the proportion in the total number of top 20 hot sales purchases
df_item_top20["percent"] = (df_item_top20.buy_times)/df_item_top20.buy_times.sum()*100

# Draw pie chart
plt.pie(df_item_top20["percent"], labels=df_item_top20.index.astype(str), autopct="%.2f%%", textprops={"fontsize": 8})
plt.legend(loc="upper right",fontsize=8,bbox_to_anchor=(1.4,0.9),borderaxespad=0.3)
plt.title("Top 20 hot-selling commodities: share of purchases",pad=10,fontsize=12)

Over these 9 days, the top 20 hot-selling commodities accounted for only 0.71% of all purchases, which is quite low and again indicates the absence of blockbuster products during the period.

7.3 distribution of commodity conversion rate

# Filter user browsed records
df_pv = user_behavior[user_behavior["behavior"]=="pv"]

# Group by commodity and count the distinct users who viewed each one
df_pv_item = df_pv.groupby("item_id")["user_id"].nunique().to_frame("pv_user_amount")

# Filter record information purchased by users
df_buy = user_behavior[user_behavior["behavior"]=="buy"]

# Group by commodity and count the distinct users who purchased each one
df_buy_item = df_buy.groupby("item_id")["user_id"].nunique().to_frame("buy_user_amount")

# Join two tables and calculate the purchase conversion rate of goods
df_item = pd.merge(df_buy_item,df_pv_item,left_index=True,right_index=True,how="inner")
df_item["con_rate"] = df_item["buy_user_amount"]/df_item["pv_user_amount"]

# Exclude items with more buyers than visitors
df_item = df_item[df_item["con_rate"]<=1]

# Draw histogram of commodity conversion rates
plt.hist(df_item["con_rate"], bins=20, color="#6462cc")
plt.title("Distribution of commodity conversion rate",fontsize=12)
plt.xlabel("Commodity conversion rate",fontsize=10)
plt.ylabel("Number of goods",fontsize=10)



# Commodities with a conversion rate of at least 90% that were bought by more than one user
df = df_item[(df_item["con_rate"]>=0.9) & (df_item["con_rate"]<=1.0) & (df_item["buy_user_amount"]>1)].sort_values("buy_user_amount",ascending=False)

  1. As the figure shows, the distribution of commodity conversion rates over the 9 days has two concentration areas: nearly 67% of commodities convert below 50%, while nearly 32% convert above 90%. However, 97.3% of the high-conversion commodities were purchased by only a single user, so they cannot yet be classed as genuinely high-conversion commodities; their traffic base is too small. It is suggested to watch how their conversion rates change as traffic grows.
  2. For the low-conversion commodities, the number of purchasing users is not large, but because their UV is high, the overall conversion rate is dragged down. For these commodities, the merchants concerned should run a traffic-loss analysis and improve the user experience from clicking into the product detail page through to final purchase, so as to lift the conversion rate.
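
The suggested follow-up, singling out high-UV but low-conversion commodities for traffic-loss analysis, can be sketched as follows. The tiny `df_item` and the cutoff values here are hypothetical; in the analysis `df_item` comes from the merge in section 7.3.

```python
import pandas as pd

# Hypothetical stand-in for the df_item built in section 7.3
df_item = pd.DataFrame(
    {"buy_user_amount": [1, 3, 40], "pv_user_amount": [1, 500, 800]},
    index=["item_A", "item_B", "item_C"])
df_item["con_rate"] = df_item["buy_user_amount"] / df_item["pv_user_amount"]

# Commodities with plenty of traffic but weak conversion (hypothetical cutoffs)
watchlist = df_item[(df_item["pv_user_amount"] >= 100) & (df_item["con_rate"] < 0.05)]
print(watchlist.index.tolist())
```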

8, Summary

This article analyzed 2 million behavior records generated by 19,544 Taobao APP users over 9 days. Based on the characteristics of the data set, the analysis covered four aspects: traffic metrics, user-type metrics, user behavior, and user purchases. The main conclusions and suggestions are as follows.

  1. The traffic metric analysis found that although the warm-up for the Double 12 campaign brought a clear increase in visitors to the Taobao APP, the share of daily active users and daily paying users fell somewhat, the bounce rate rose slightly, and overall user quality declined, suggesting that the campaign's traffic acquisition was not precise enough. It is recommended to combine historical data and business goals to decide whether the personalized recommendation function needs to be improved and strengthened.
  2. The user metric analysis covered purchase counts, repurchase rate, buy-back rate, and retention rate. The repurchase-rate analysis shows that Taobao operates in the loyalty mode of e-commerce: the business focus should be on customer loyalty, that is, encouraging loyal repeat customers to buy more frequently. The other metrics can serve as baselines for the overall state of users over these 9 days; any marked rise or fall would call for deeper breakdown analysis.
  3. The user behavior analysis found that:
    (1) Users follow clear time patterns: within a week they are more active on weekends, and within a day 19:00-23:00 is the activity peak. Important marketing campaigns should therefore cover these windows so that more active users are reached and the campaigns perform better.
    (2) The payment conversion rate of users who use the collection and shopping-cart functions is significantly higher than that of users who do not. A/B testing is recommended to make these two functions easier and more convenient to use.
    (3) The vast majority of users with purchase intent buy within 3 days of collecting an item or adding it to the cart. It is therefore recommended to notify users of promotions on their collected and carted items; if an item remains unpurchased 3 days after being carted or collected, remind the user promptly, or recommend similar items for them to choose from.
  4. On the commodity side, purchase counts by category over the 9 days reflect the 80/20 rule common in consumer industries; however, most commodities were purchased only once, there were no blockbuster hits, and commodity conversion rates cluster in two areas. For commodities with low conversion but high exposure, it is suggested to optimize the presentation and improve the purchase experience so as to raise conversion; for commodities with high conversion but low exposure, optimize the exposure channels and actively drive traffic. At the same time, the Taobao APP can tilt resources toward commodities with both high conversion and high exposure to form a virtuous circle, while strengthening supervision against cheating.

This concludes my analysis of the Taobao APP user behavior data set. Lacking the business problem background and historical data for comparison, this article mainly demonstrates my understanding of the relevant e-commerce metrics and of the problems the data results reveal. There is still a long way to go before data analysis truly drives operations in practice, so guidance and discussion are very welcome.


Posted on Wed, 10 Jun 2020 00:37:26 -0400 by lordfrikk