Data analysis of E-Commerce orders based on Python

1, Data information

Data source: https://www.heywhale.com/mw/dataset/5f64a07b71c700003072df60/file

Note: there are 104557 pieces of data in this data set, with 11 fields in total

Field:

  • id: serial number
  • orderID: order id
  • userID: user id
  • goodsID: Commodity id
  • orderAmount: total order amount
  • payment: the actual amount paid by the buyer
  • Channelid: channel id
  • platfromType: purchase channel
  • orderTime: order generation time
  • payTime: order payment time, which is the data from January 1, 2019 to May 28, 2020 (mainly based on order payment time)
  • chargeback: refund or not

2, Conclusion first

1. Sales analysis: the overall conversion rate of users is high, and the effect of double 11 and double 12 activities is obvious

  • Continue to maintain the user conversion rate. In the future, we can use more holidays to copy the promotion activities of double 11 and double 12 to increase sales
  • Targeted shop optimization and channel promotion for web pages and Ali platforms, and draw on the popular points of two shopping channels from WechatMP WeChat official account platform and APP, which can be targeted at embedding these shopping channels and analyzing user paths and page evaluation.

2. Analysis of user purchase times: the repurchase rate and per capita purchase times are very low, and the platform is mainly "one-time sale"

  • Improve brand influence, cultivate user loyalty, and further tap the purchasing power of existing customers

3. Use RFM model to classify user value: take targeted operation measures for different customer groups

  • Important value users: invest in farming resources, provide vip services and personalized services, and continue to guide their consumption while maintaining their stickiness
  • Potential customers: these customers need to dig deep into them, sell some products with higher value, ask them for comments, find out their demands and pain points according to the comments, and attract them
  • Important deep ploughing users: send promotion reminders and coupons to them from time to time
  • New users: organize community activities, provide free trial or lucky draw of small samples in the group from time to time, and get large coupons when the number of new users reaches a certain number, so as to improve user interest and create brand awareness
  • Important exchange for users: precision marketing provides users with goods they may be concerned about or interested in according to user portraits, and wins them back by continuing to close stores or producing more attractive products
  • Ordinary users: they can punch in the store every day to get points, which can be exchanged for popular product samples or in store lucky draw to keep them paying attention to the store
  • Important recovery of users: focus on return visits, ask targeted questions about the possible reasons why users will no longer buy, and re market to customers on this basis, such as answering service attitude and appeasing; If you answered that the product quality has been improved, please try again and send a sample, etc
  • Lost customers: try to restore their interest. If it fails, they will no longer waste time

3, Analytical framework

1. Sales volume analysis: purchase time, commodity category, purchase channel and conversion rate

  • Conversion rate: the conversion process is order creation - > order payment - > order closing (not refunded)

2. Analysis of user purchase times: distribution of user purchase times (frequency chart), per capita purchase times (payment without refund), repurchase rate (repurchase rate = number of people who have purchased twice or more / number of people with purchase behavior)

3. Use RFM model to classify user value

4, Data cleaning

import pandas as pd
import matplotlib.pyplot as plt
# Preliminary understanding of data
data = pd.read_excel('/home/kesci/input/dssj9032/order2019.xlsx')
data.info()
data.sample(10)

It can be seen from the observation that only the missing value in the changelid column indicates that someone directly enters the platform to place an order without channel recommendation. The missing value will not affect the analysis results, so it will not be handled temporarily
Meaningless spaces exist in the purchase channel, which shall be removed:  

data['platfromType'] = [i.replace(' ','') for i in data['platfromType']]

  Extract the date from the order payment time to facilitate subsequent sales calculation:

data['payDate'] = data['payTime'].dt.date
data['payDate'] = pd.to_datetime(data['payDate'])

# At the same time, the "days" column is added for subsequent auxiliary calculation of "R" indicators
data['days'] = (pd.to_datetime('today') - data['payDate']).dt.days

5, Data analysis

1. Sales analysis

All transactions are calculated based on the final profit. Therefore, considering the refund, the actual payment amount of the buyer is summed for the existing orders according to the order payment time. Due to the large time span of data, statistics are carried out in monthly units

data_not_chargeback = data[data.chargeback=='no']
df = data_not_chargeback.groupby(pd.Grouper(key='payDate',freq='M'))['payment'].sum()   
x =[f'{y}year{m}month' for y,m in zip(df.index.year, df.index.month)] 
plt.figure(figsize=(16,9),dpi=160)
plt.plot(range(len(x)), df)
plt.text(0,df.max(),f'The total sales are{round(df.sum(),2)}element',fontsize=20)
plt.xticks(range(len(x)), x, rotation=45)
for x,y in zip(range(len(x)),df.values):
     plt.text(x, y+7000,int(y),ha='center')   #Label data points
plt.ylabel('sales volume')
plt.xlabel('time')
plt.title('Monthly trend of sales',fontsize=25)
plt.show()

 

The following information can be obtained from the "monthly trend of sales":

  1. During the period from January 1, 2019 to May 28, 2020, the highest sales volume occurs in the month of "double 11" activity - November, with sales reaching 10.64 million, accounting for 10.08% of the total sales volume in 2019
  2. There was a significant decline from January to February, and it began to rise rapidly in March. It reached a small peak in May, and then it was in the peak period. It fell back to a certain extent in October. Then, the month of double 11 and double 12 ushered in the annual peak. After entering 2020, due to the impact of the epidemic, it quickly fell to the bottom and continued to be depressed. Until May, the sales volume was only 434 yuan
# Carry out sales statistics according to commodity categories and find out the top ten best commodities sold in actual transactions
hot_10 = pd.DataFrame({
    'sales volume':data['payment'].value_counts(),
    'Actual sales volume':data_not_chargeback['payment'].value_counts()
}).sort_values(by='sales volume', ascending=False)
hot_10['Proportion of total sales(%)'] = hot_10['sales volume'].apply(lambda x : round((x/hot_10['sales volume'].sum())*100,2))
hot_10['Actual sales'] = data_not_chargeback.groupby(by='orderAmount')['payment'].sum()
hot_10['Proportion of sales(%)']=hot_10['Actual sales'].apply(lambda x : round((x/hot_10['Actual sales'].sum())*100,2))
print(f'{"-"*20}Top 10{"-"*20}')
print(hot_10.head(10))

 

 

# Make sales statistics according to purchase channels
saleChannel = data_not_chargeback.groupby(pd.Grouper(key='platfromType'))['payment'].sum() 
x_ = list(saleChannel.index)
plt.figure(figsize=(16,9),dpi=160)
plt.bar(range(len(x_)), height=saleChannel, width=0.5)
plt.xticks(range(len(x_)), x_, rotation=45)
for x,y in zip(range(len(x_)),saleChannel.values):
     plt.text(x, y+8000,int(y),ha='center')   #Label data points
plt.ylabel('sales volume')
plt.xlabel('Purchase channel')
plt.title('Sales by purchase channel',fontsize=25)
plt.show()

 

As can be seen from the above picture, users purchase mainly through WechatMP WeChat official account platform and APP, followed by web page version, and then Ali platform, WeChat store and wap are the lowest.  

  Now view the distribution quantity of chargeback field values:

import seaborn as sns
print(data['chargeback'].value_counts())
sns.countplot(x='chargeback', data=data)

# Conversion rate
from pyecharts.charts import Funnel
from pyecharts import options as opts

# All customers in this dataset have payment and purchase behavior, but there is partial refund behavior after payment. Therefore, the actual transaction is indicated only when the value of chargeback field is "no"
rates = pd.Series({
    'establish':data['orderTime'].count(),
    'payment':data['payTime'].count(),
    'Actual transaction':data[data.chargeback=='no'].shape[0]
},name='Order quantity').to_frame()  

# Absolute conversion rate = number of orders in each phase / number of orders created
rates['Overall conversion'] = rates['Order quantity'].apply(lambda x: round(x*100/rates.iloc[0,0],3))
print(rates)

c=(
    Funnel() 
    .add(
        'Conversion rate',  
        [list(z) for z in zip(rates.index,rates['Overall conversion'])],   
        # Set label position and data display form
        label_opts=opts.LabelOpts(position='inside',formatter='{b}:{c}')
    )  # Filling funnel chart content
    .set_global_opts(title_opts=opts.TitleOpts(title='Overall conversion(%)'))
)
c.render_notebook()  

From the "overall conversion rate", we can see that the overall conversion rate of users is still high, and the actual transaction conversion rate reaches 86.82%, which needs to be maintained

Summary: Double eleven double 12 activities are very effective. Follow up can be made to use more months of holidays to duplicate sales promotion activities of double eleven and double sales, increase sales volume, and then optimize shop design and channel promotion for web pages and Ali platforms. Meanwhile, draw the user's welcome points from the two purchasing channels of WechatMP WeChat official account platform and APP. - targeted analysis and page evaluation of these purchase channels as well as user behavior path

2. Analysis of user purchase times

# Draw the distribution frequency diagram of user purchase times
user_buy_time = data[data.chargeback=='no'].groupby('userID').count()['goodsID']
plt.hist(x=user_buy_time, bins=10, range=[0,100])
plt.xlabel('buy_time')
plt.ylabel('user_num')
plt.title('user_buy_time')

 

# Number of purchases per capita
total_buy_time = data[data.chargeback=='no'].count()['userID']
total_paying_user_num = data[data.chargeback=='no'].nunique()['userID']
user_avg_buy_time = total_buy_time/total_paying_user_num
user_avg_buy_time

 

# Repurchase rate = number of people who have purchased twice or more / number of people who have purchased
user_buy_time = data[data.chargeback=='no'].groupby('userID').count()
user_twice_time = user_buy_time[user_buy_time['goodsID']>2].count()['goodsID']
user_buy = data[data.chargeback=='no'].nunique()['userID']
rebuy_rate = user_twice_time/user_buy
rebuy_rate

​ 

Observation results: it can be seen from the distribution frequency chart of user purchase times that all users who successfully complete this transaction (payment is successful and no refund) have a transaction frequency of less than 10 times; The number of purchases per user is 1, and the repurchase rate is only 3.93%

Summary: Based on the above results, the e-commerce platform is basically a one-time transaction, and the repurchase rate of paying users is very low, so we need to find ways to cultivate a group of loyal users; Combined with the frequency chart of user purchase times, all paying users of the platform have only a transaction frequency of less than 10 times. It is necessary to vigorously tap the purchasing power of these customers and launch targeted marketing solutions

3. Use RFM model to classify user value

1) Calculate three indicators of RFM

# Establish a PivotTable and calculate the three indicators of RFM
df2 = data_not_chargeback.pivot_table(index="userID",
                    values=["orderID", "days", "payment"],
                    aggfunc={"orderID":"count", "days":"min", "payment":"sum"})
df2 = df2[['days', 'orderID', 'payment']]
df2.columns = ['R', 'F', 'M']
df2.reset_index()
df2.head(10)

 

# Score the three indicators of RFM respectively
avg_r = df2['R'].mean()
avg_f = df2['F'].mean()
avg_m = df2['M'].mean()
 
df2['R_score'] = [1 if i>avg_r else 0 for i in df2['R']]
df2['F_score'] = [1 if i>avg_f else 0 for i in df2['F']]
df2['M_score'] = [1 if i>avg_m else 0 for i in df2['M']]

# Label users
def functions(x):
    if x.iloc[0]==1 and x.iloc[1]==1 and x.iloc[2]==1:
        return "Important value customers"
    elif x.iloc[0]==1 and x.iloc[1]==1 and x.iloc[2]==0:
        return "Potential customers"
    elif x.iloc[0]==1 and x.iloc[1]==0 and x.iloc[2]==1:
        return "Important deep ploughing customers"
    elif x.iloc[0]==1 and x.iloc[1]==0 and x.iloc[2]==0:
        return "New customers"
    elif x.iloc[0]==0 and x.iloc[1]==1 and x.iloc[2]==1:
        return "Important call back customers"
    elif x.iloc[0]==0 and x.iloc[1]==1 and x.iloc[2]==0:
        return "General customers"
    elif x.iloc[0]==0 and x.iloc[1]==0 and x.iloc[2]==1:
        return "Important customer recovery"
    elif x.iloc[0]==0 and x.iloc[1]==0 and x.iloc[2]==0:
        return "Lost customers"
    
df2['label'] = df2[['R_score', 'F_score', 'M_score']].apply(functions,axis=1)
df2.sample(10)

 

# Draw a comparison chart of the number of different types of customers
df3 = df2.groupby("label").agg({"label":"count"})
df3['Proportion of different customers'] = df3["label"].apply(lambda x:x/np.sum(df3["label"]))  
df3.index.name = 'Customer label'
df3 = df3.sort_values(by="label", ascending=True)
plt.figure(figsize=(6,4), dpi=100)
x = df3.index
y = df3['label']
plt.barh(x, height=0.5, width=y, align='center')
plt.title('Comparison of different types of customers')
for x,y in enumerate(y):
    plt.text(y+450, x, y, va='center', fontsize=14)    
plt.xticks(np.arange(0,30001,2000), rotation=45)

plt.tight_layout()

 

# Draw cumulative consumption amount of different types of customers
df3 = df2.groupby("label").agg({"M":"sum"})
df3["M"] = df3["M"].apply(lambda x:round(x))
df3["Proportion of different customers"] = df3["M"].apply(lambda x:x/np.sum(df3["M"]))
df3 = df3.sort_values(by="M",ascending=True)
plt.figure(figsize=(6,4),dpi=100)
x = df3.index
y = df3["M"]
plt.barh(x,height=0.5,width=y,align="center")
plt.title("Cumulative consumption amount of different types of customers")
for x,y in enumerate(y):
    plt.text(y+500,x,y,va="center",fontsize=14)

plt.tight_layout()

 

 

Key value users are the most important group, but the number accounts for only 2359 people in the total customer group, while new customers account for the largest proportion in actual transactions, which shows that the platform is still doing well. The focus is how to retain customers, become loyal and practical users of the platform and buy back repeatedly.

Put forward targeted operation suggestions for the above customer groups:

  1. Important value users: invest in farming resources, provide vip services and personalized services, and continue to guide their consumption while maintaining their stickiness  
    2. Potential customers: these customers need to dig deep into them, sell some products with higher value, ask them for comments, find out their demands and pain points according to the comments, and attract them  
    3. Important deep ploughing users: send them promotion reminders and coupons from time to time  
    4. New users: organize community activities, provide free trial or lucky draw of small samples in the group from time to time, and get large coupons when the number of new users reaches a certain number, so as to improve users' interest and create brand awareness  
    5. Important exchange for users: precision marketing provides users with goods they may be concerned about or interested in according to user portraits, and wins them back by continuing to close stores or producing more attractive products  
    6. Ordinary users: they can punch in the store every day to get points, which can be exchanged for popular product samples or in store lucky draw, so that they can continue to pay attention to the store  
    7. Important recovery of users: focus on return visits, ask targeted questions about the possible reasons why users will no longer buy, and re market to customers on this basis, such as answering service attitude and appeasing; If you answered that the product quality has been improved, please try again and send a sample, etc  
    8. Losing customers: try to restore their interest. If there is no result, no more time will be wasted  

Tags: Python Data Analysis

Posted on Mon, 08 Nov 2021 03:52:51 -0500 by CyberGarage