Consumer behavior analysis of an online music retail platform

Source: Ustinian - kesci.com
Original link: Consumer behavior analysis of CDNow website users

1, Project background

CDNow, once an online music retail platform, was acquired by Germany's Bertelsmann entertainment group and was worth more than US $1 billion at its peak. This article analyzes the purchase records of CDNow's users to characterize their consumption behavior, so that the operations department can target its marketing more precisely, saving cost and improving efficiency.

2, Problem statement

Analyze the consumption characteristics of users. (The original post shows the analysis framework as a diagram here.)

3, Data processing

3.1 import data

#Import common libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
plt.style.use('ggplot')  #Use the built-in ggplot style sheet, modeled on R's ggplot2
#Import source data
columns = ['user ID','Date of purchase','Order number','Order amount']
df = pd.read_table("/home/kesci/input/cdnow5658/CDNOW.txt",names = columns,sep = '\s+')
  • Load the libraries and the data. The file is a txt, opened with read_table; the raw data has no header row, so column names must be supplied. Fields are separated by whitespace, and the regex \s+ matches any run of whitespace characters.
  • CSV data is usually comma-separated, but this file uses runs of spaces.
  • Retail and e-commerce data are generally analyzed through four fields: order number, order amount, purchase date and user ID. These four alone already support a very rich analysis.
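To see how the whitespace separator behaves, here is a minimal sketch; a StringIO with made-up sample rows stands in for CDNOW.txt, using the same headerless four-column layout.

```python
import io
import pandas as pd

# Hypothetical sample rows in the same four-column layout as CDNOW.txt:
# no header, and runs of spaces between fields.
raw = io.StringIO(
    "1  19970101  1  11.77\n"
    "2  19970112  1  12.00\n"
    "2  19970112  5  77.00\n"
)
columns = ['user ID', 'Date of purchase', 'Order number', 'Order amount']
df = pd.read_table(raw, names=columns, sep=r'\s+')  # \s+ = any run of whitespace
print(df.shape)  # (3, 4)
```

Because sep is a regex, one space, two spaces or a tab between fields all parse the same way.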
#head() shows the first five rows by default
df.head()

  • Observing the data, the purchase-date column represents time but is currently just an integer concatenation of year, month and day; it has no time semantics and must be converted. The purchase amount is a decimal.
  • A user may order several times on the same day or across days; for example, the user with ID 2 bought twice on January 12.

3.2 descriptive statistics

#Summary statistics for numeric columns
df.describe()

describe() gives the summary statistics and a first check on the completeness of the data:

  • From the summary statistics, each order contains 2.41 items and costs 35.89 on average.
  • The standard deviation of items per order is 2.33, so there is some volatility; the median is 2 items and the 75th percentile 3, so most orders are small, while the maximum of 99 is exceptionally high. The order amount behaves similarly: most orders are small.
  • Consumption data is generally long-tailed: most users spend little, while a small share of users contributes most of the revenue, the familiar 80/20 (Pareto) pattern.
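The 80/20 claim is easy to check on any order table. A toy sketch with made-up numbers (the real data would be grouped the same way):

```python
import pandas as pd

# Toy order table (hypothetical values) with the same fields as the real data.
orders = pd.DataFrame({
    'user ID':      [1, 1, 2, 3, 4, 5],
    'Order amount': [500.0, 300.0, 50.0, 30.0, 10.0, 10.0],
})
per_user = orders.groupby('user ID')['Order amount'].sum().sort_values(ascending=False)
top_n = max(1, int(len(per_user) * 0.2))            # the top 20% of users, at least one
share = per_user.iloc[:top_n].sum() / per_user.sum()
print(f'top 20% of users contribute {share:.0%} of revenue')  # top 20% of users contribute 89% of revenue
```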

3.3 data processing

#Index, data type and memory information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69659 entries, 0 to 69658
Data columns (total 4 columns):
user ID             69659 non-null int64
Date of purchase    69659 non-null int64
Order number        69659 non-null int64
Order amount        69659 non-null float64
dtypes: float64(1), int64(3)
memory usage: 2.1 MB

The dtypes are appropriate and there are no null values; the raw data is very clean. Next, the time column needs a type conversion.
When processing data with pandas, type problems are common: always confirm the dtypes first, and convert any incorrect ones before moving on to the actual analysis.

# Convert the purchase-date column to a datetime type
df['Date of purchase'] = pd.to_datetime(df['Date of purchase'], format='%Y%m%d')  #%Y = 4-digit year, %y = 2-digit year
df['Month'] = df['Date of purchase'].values.astype('datetime64[M]')
df.head()

  • pd.to_datetime converts a suitable string or number into a time type; the format parameter controls the matching. For 19970101, %Y matches the first four digits 1997 (a lowercase %y would match only a two-digit 97), %m matches 01, and %d matches 01.
  • The hour is %H and the minute %M; note that hour and minute use the opposite case from year and month. The second is %S. A string like 1997-01-01 would use %Y-%m-%d, and so on.
  • astype can also coerce the time unit; datetime64[M] truncates each date to its month. We take the month as the main time window for consumption behavior; the right window depends on purchase frequency.
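A quick check of the format codes and the month truncation described above, on toy values:

```python
import pandas as pd

s = pd.Series([19970101, 19970312])                 # integers in yyyymmdd form
dates = pd.to_datetime(s, format='%Y%m%d')          # %Y 4-digit year, %m month, %d day
months = dates.values.astype('datetime64[M]')       # truncate each date to its month
print(dates.iloc[0])   # 1997-01-01 00:00:00
print(months)          # ['1997-01' '1997-03']
```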

4, Analysis on the general consumption trend of users

# Set the size of the graph and add sub graphs
plt.figure(figsize=(20,15))
# Total monthly sales
plt.subplot(221)
df.groupby('Month')['Order amount'].sum().plot(fontsize=24) 
plt.title('Total sales',fontsize=24) 
 
#Monthly number of purchases
plt.subplot(222)
df.groupby('Month')['Date of purchase'].count().plot(fontsize=24)
plt.title('Consumption frequency',fontsize=24) 

#Monthly number of items sold
plt.subplot(223)
df.groupby('Month')['Order number'].sum().plot(fontsize=24)
plt.title('Total sales volume',fontsize=24)

#Number of consumers per month
plt.subplot(224)
df.groupby('Month')['user ID'].nunique().plot(fontsize=24)
plt.title('Consumption figures',fontsize=24)
plt.tight_layout() # Set spacing for subgraphs
plt.show()

  • The four line charts share essentially the same trend: sales in the first three months of 1997 were exceptionally high, rising sharply and then plummeting after March to a nearly flat level.
  • Why? One hypothesis is abnormal users in the early period; another is promotions or marketing campaigns at the time. With only consumption data, we cannot tell which.
  • Also, between February and March the number of consumers dipped slightly while total revenue and total items sold still rose. Does that mean March contains high-value customers worth focusing on?

5, Analysis of individual consumption data of users

5.1 description and statistics of user consumption amount and consumption times

# Group by user id
group_user = df.groupby('user ID')[['Order number','Order amount']].sum()
group_user.describe()

From the user's perspective, each user bought 7 CDs on average, while the heaviest buyer purchased 1,033. The average spend per user (the customer unit price) is 100, with a standard deviation of 240. Since the mean sits close to the 75th percentile, a small number of high spenders must be pulling it up, again matching the 80/20 pattern.

5.2 scatter diagram of user consumption amount and consumption times

#Filter: order amount < 4000 (backticks let query() reference a column name containing spaces)
group_user.query('`Order amount` < 4000').plot.scatter(x='Order amount',y='Order number')

  • The scatter plot of users looks healthy and regular. Since this is a CD retailer with a fairly uniform product line, amount and quantity are nearly linear, with few outliers.

5.3 distribution chart of user consumption amount (80/20 rule)

group_user['Order amount'].plot.hist(bins = 20)
#bins = 20 splits the range into 20 buckets; with a maximum around 14000, each bucket spans about 700

  • The histogram is squeezed to the left, probably because a few extreme values stretch the axis.
  • We can exclude the extremes and look at the distribution again:
group_user.query("`Order amount` < 800")['Order amount'].plot.hist(bins=20)

After restricting to users whose total spend is below 800, we can see:

  • Most users' spending power is low: nearly half spent less than 40, and fewer than 2,000 users spent more than 200.
  • High spenders are barely visible on the chart; even with the extreme values removed, the bulk of users sit at a very low spending level, which matches the usual pattern of consumer behavior.

5.4 distribution chart of user consumption times (80/20 rule)

group_user.query('`Order number` < 100')['Order number'].hist(bins = 40)

  • Most users bought fewer than 3 CDs in total; only a small number bought in bulk.

6, Analysis of consumption cycles

6.1 user purchase cycle (by order)

#Time interval between consecutive purchases for each user
order_diff = df.groupby('user ID').apply(lambda x:x['Date of purchase'] - x['Date of purchase'].shift())
order_diff.head(10)
User ID   
1     0        NaT
2     1        NaT
      2     0 days
3     3        NaT
      4    87 days
      5     3 days
      6   227 days
      7    10 days
      8   184 days
4     9        NaT
 Name: Date of purchase, dtype: timedelta64[ns]
order_diff.describe()
count                      46089
mean     68 days 23:22:13.567662
std      91 days 00:47:33.924168
min              0 days 00:00:00
25%             10 days 00:00:00
50%             31 days 00:00:00
75%             89 days 00:00:00
max            533 days 00:00:00
Name: Date of purchase, dtype: object
  • The average interval between a user's purchases is 68 days, and the longest is 533 days. For recall campaigns, an interval of around 60 days is a sensible trigger point.
  • Most purchase intervals are under 100 days.
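The shift() trick above can be illustrated on a toy log with hypothetical dates: within each user, shift() moves the dates down one row, so the subtraction gives the gap to the previous purchase (NaT for each user's first order).

```python
import pandas as pd

# Toy purchase log (hypothetical dates).
log = pd.DataFrame({
    'user ID': [1, 1, 1, 2],
    'Date of purchase': pd.to_datetime(
        ['1997-01-01', '1997-01-11', '1997-02-10', '1997-03-01']),
})
# Per user: each date minus the previous date; the first purchase has no predecessor.
gaps = log.groupby('user ID')['Date of purchase'].apply(lambda x: x - x.shift())
print(gaps.dropna().tolist())  # [Timedelta('10 days 00:00:00'), Timedelta('30 days 00:00:00')]
print(gaps.mean())             # 20 days 00:00:00
```

Equivalently, `log.groupby('user ID')['Date of purchase'].diff()` produces the same gaps without the lambda.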

6.1.2 distribution of user consumption cycle

plt.figure(figsize=(15,5))
plt.hist((order_diff / np.timedelta64(1, 'D')).dropna(), bins = 50)
plt.xlabel('Consumption cycle',fontsize=24)
plt.ylabel('frequency',fontsize=24)
plt.title('Distribution of consumer cycle',fontsize=24);

  • The intervals follow a typical long-tail (roughly exponential) distribution; most users' gaps are indeed short. We could stage recall actions along it: issue a coupon immediately after a purchase, ask how the user likes the goods 10 days later, remind them of coupon expiry at 20 days, and send a message at 30 days.

6.2 user life cycle (by first & last consumption)

orderdt_min = df.groupby('user ID')['Date of purchase'].min()  #First purchase
orderdt_max = df.groupby('user ID')['Date of purchase'].max()  #Last purchase
(orderdt_max - orderdt_min).head()
User ID
1     0 days
2     0 days
3   511 days
4   345 days
5   367 days
 Name: Date of purchase, dtype: timedelta64[ns]
#Calculate the average life cycle of users
(orderdt_max-orderdt_min).mean()
Timedelta('134 days 20:55:36.987696')
  • The average life cycle across all users is 134 days, higher than expected; but the mean alone is not representative, so let's look at the distribution next.
((orderdt_max-orderdt_min)/np.timedelta64(1,'D')).hist(bins=15)
#The data type is timedelta, which cannot be histogrammed directly, so convert it to a number first.
#Dividing by np.timedelta64(1,'D') converts the values: 'D' means day, so the unit is one day.
#Since max - min is already a duration, the division gives the life cycle in days.

  • Most users consumed only once, so a large share of life cycles is concentrated at 0 days.
  • That is not the answer we want, so let's exclude the one-purchase customers and compute the life cycle of customers who purchased twice or more.
#Compute each user's life cycle first; the one-purchase users are filtered out below
life_time = (orderdt_max - orderdt_min).reset_index()
life_time.head()

#User life cycle distribution
plt.figure(figsize=(10,5))
life_time['life_time'] = life_time['Date of purchase'] / np.timedelta64(1,'D')
life_time[life_time.life_time > 0].life_time.hist(bins = 100, figsize = (12,6))

  • The chart shows a bimodal life-cycle distribution: users within a roughly 20-day life cycle form one peak, and users in the 400-500 day range form another.
  • Accordingly, customers should be guided within their first 20 days toward a second purchase, building a habit that extends the life cycle; users in the 100-400 day range deserve targeted campaigns to keep them buying.
    Next, compute the average life cycle after removing the 0-day users.
#Average user life cycle after removing users with 0 day life cycle
life_time[life_time.life_time > 0]['Date of purchase'].mean()
Timedelta('276 days 01:04:31.344216')
  • So if users are guided toward repeat purchases after their first one, the average life cycle can be roughly doubled (276 days versus 134).

7, User hierarchy

Layering by user value: the RFM model
For refined operations, the RFM model can be used to score each user's value (measuring the contribution from history to the present), where:
Recency - R: the time since the customer's last transaction. The larger R is, the longer ago the last purchase happened; the smaller, the more recent.
Frequency - F: the number of transactions in the recent period. The higher F is, the more frequently the customer buys; the lower, the less active.
Monetary - M: the amount spent in the recent period. The higher M is, the more valuable the customer; the lower, the less valuable.
Customers are segmented along these three dimensions:

rfm = df.pivot_table(index = 'user ID',
                     values = ['Order amount','Date of purchase','Order number'],
                     aggfunc = {'Order amount':'sum',
                                'Date of purchase':'max',
                                'Order number':'sum'})
rfm.head()

# The difference between the maximum date and the current date is R
rfm['R'] = (rfm['Date of purchase'].max() - rfm['Date of purchase']) / np.timedelta64(1,'D')
rfm.rename(columns = {'Order amount':'M',
                     'Order number':'F'},
          inplace=True)

# Building rfm model formula
def get_rfm(x):
    level = x.apply(lambda x:'1' if x>=0 else '0')
    label = level['R'] + level['F'] + level['M']
    d = {'111':'Important value customers',
        '011':'Important keep customers',
        '101':'Important win-back customers',
        '001':'Important develop customers',
        '110':'General value customers',
        '010':'General keep customers',
        '100':'General win-back customers',
        '000':'General develop customers'}
    
    result = d[label]
    return result
rfm['label'] = rfm[['R','F','M']].apply(lambda x:(x-x.mean()) / x.std()).apply(get_rfm,axis=1)
rfm.head()
#Sum each metric within each segment
rfm.groupby('label').sum()

  • In column M, comparing the cumulative spend of each segment, important keep customers contribute the most at 159,203.62, ranking first.
rfm.groupby('label').size()
label
 General value customers          77
 General keep customers          206
 General develop customers      3300
 General win-back customers    14074
 Important value customers       787
 Important keep customers       4554
 Important develop customers     331
 Important win-back customers    241
dtype: int64
  • These are the consumer counts per segment. General win-back customers rank first with 14,074, and important keep customers second with 4,554; the latter group is far smaller, yet contributes the largest cumulative spend. The business side can treat each segment differently based on these results, reducing marketing cost and improving ROI.
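The 0/1 encoding at the heart of get_rfm can be exercised in isolation; `encode_rfm` below is a stripped-down version of that step, and the row uses made-up, already-standardized values.

```python
import pandas as pd

def encode_rfm(x):
    # '1' where the standardized value is at or above the mean, else '0'
    level = x.apply(lambda v: '1' if v >= 0 else '0')
    return level['R'] + level['F'] + level['M']

row = pd.Series({'R': -0.5, 'F': 1.2, 'M': 0.3})  # hypothetical standardized scores
print(encode_rfm(row))  # 011
```

A '011' user is below average on R (i.e. purchased relatively recently, since R counts days since the last order) but above average on frequency and amount.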

8, User quality analysis

a = df.groupby('user ID')['Date of purchase'].agg(['min','max']).reset_index()
new_old = (a['min'] == a['max']).value_counts().values  #True = first and last purchase on the same day, i.e. bought only once

plt.pie(x = new_old,
       autopct = '%.1f%%',
       shadow = True,
       explode = [0.08,0],
       textprops = {'fontsize' : 11})
plt.axis('equal') 
plt.legend(['Only once','Multiple consumption'])

  • More than half of the users consumed only once, which suggests retention is poor and operations have room to improve.

  • Definition of the repurchase rate: within a given time window, the proportion of consuming users who consumed two or more times. The window here is the month, and a user who places two orders on the same day also counts as a repurchase user.
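Before computing it on the real table, the metric can be illustrated on a toy log (hypothetical rows, one per order); crosstab here plays the role of the pivot table.

```python
import numpy as np
import pandas as pd

# Toy log: user 1 orders twice in January, user 3 once in January and twice in February.
log = pd.DataFrame({
    'user ID': [1, 1, 2, 3, 3, 3],
    'Month':   ['1997-01', '1997-01', '1997-01', '1997-01', '1997-02', '1997-02'],
})
counts = pd.crosstab(log['user ID'], log['Month'])   # orders per user per month
# 2+ orders -> 1, exactly 1 -> 0, none -> NaN (so inactive users drop out of the ratio)
flags = counts.applymap(lambda x: 1 if x > 1 else np.nan if x == 0 else 0)
rate = flags.sum() / flags.count()                   # repurchasers / active users
print(rate.tolist())   # [0.3333333333333333, 1.0]
```

In January one of three active users repurchased; in February the only active user did.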

#Number of orders per user per month
pivoted_df=df.pivot_table(index='user ID',columns='Month',values='Date of purchase',  #Pivot table
                       aggfunc='count').fillna(0)  #Months without a purchase appear as NaN; fill them with 0

pivoted_df.head()

#Recode: 2+ purchases in a month -> 1, exactly 1 purchase -> 0, no purchase -> NaN
#applymap applies the function to every cell of the DataFrame
pivoted_df_transf=pivoted_df.applymap(lambda x: 1 if x>1 else np.nan if x==0 else 0)
pivoted_df_transf.head()

#count tallies non-null cells (all users active that month); sum adds up the 1s (users with two or more purchases)
df_duplicate =pd.DataFrame(pivoted_df_transf.sum()/pivoted_df_transf.count()).reset_index()
df_duplicate.columns = ['Date', 'DuplicatedRate']
df_duplicate['Date'] = df_duplicate.Date.astype(str).apply(lambda x:x[:-3])

plt.figure(figsize = (15,6))
plt.plot(df_duplicate.Date, df_duplicate.DuplicatedRate)
plt.xlabel('time', fontsize=24)
plt.ylabel('Repeat purchase rate',fontsize=24)
# plt.ylim(0,1)
plt.title('Change of repurchase rate',fontsize=24)

  • Note: the repurchase rate is low in the early months because large numbers of new users were joining, and new customers rarely repurchase: in January only about 6% did. Later on, the remaining users are seasoned repeat customers, and the rate stabilizes at around 20%. Comparing new with old customers, the repurchase rate differs by roughly a factor of three.

Buy-back rate: among users who consumed in one time window, the proportion who consume again in the next window. For example, if 1,000 users consumed in January and 300 of them consume again in February, the January buy-back rate is 30%.
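The definition above, sketched on a toy flag table (hypothetical values, 1 = bought that month):

```python
import pandas as pd

# Toy monthly purchase flags; rows are user IDs.
flags = pd.DataFrame(
    {'1997-01': [1, 1, 0], '1997-02': [1, 0, 1], '1997-03': [0, 0, 1]},
    index=[1, 2, 3])
months = list(flags.columns)
rates = {}
for this_m, next_m in zip(months[:-1], months[1:]):
    buyers = flags[flags[this_m] == 1]             # users active in month t
    rates[this_m] = float(buyers[next_m].mean())   # share of them active in t+1
print(rates)  # {'1997-01': 0.5, '1997-02': 0.5}
```

Of the two January buyers, only user 1 returned in February; of the two February buyers, only user 3 returned in March.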

#repurchasing
#Average monthly consumption amount of each user
pivoted_money=df.pivot_table(index='user ID',columns='Month',values='Order amount',
                             aggfunc='mean').fillna(0)

columns_month=df.Month.sort_values().astype('str').unique()
pivoted_money.columns=columns_month
pivoted_money.head()

#Record those with consumption as 1 and those without consumption as 0
pivoted_purchase=pivoted_money.applymap(lambda x:1 if x>0 else 0)
pivoted_purchase.head()

#If a user consumed this month and again next month, record 1; consumed this month but not next, record 0;
#no consumption this month, record NaN. The last month has no following month, so it is always NaN.
def purchase_return(data):
    status=[]
    for i in range(17):          #loop over the first 17 of the 18 months
        if data[i]==1:           #consumed this month
            if data[i+1]==1:     #and consumed next month: buy-back
                status.append(1)
            else:                #but not next month
                status.append(0)
        else:                    #no consumption this month
            status.append(np.nan)
    status.append(np.nan)        #the 18th month has no next month
    return pd.Series(status, index=columns_month)

pivoted_purchase_return=pivoted_purchase.apply(purchase_return,axis=1)  #axis=1 applies the function row by row, i.e. one user at a time
pivoted_purchase_return.head()

df_purchase = (pivoted_purchase_return.sum() / pivoted_purchase_return.count()).reset_index()
df_purchase.columns = ['Date', 'PurchaseRate']
df_purchase['Date'] = df_purchase.Date.astype(str).apply(lambda x:x[:-3])

plt.figure(figsize = (15,5))
plt.plot(df_purchase.Date, df_purchase.PurchaseRate)
plt.xlabel('time', fontsize=24)
plt.ylabel('Buy-back rate', fontsize=24)
plt.title('Change in buy-back rate', fontsize=24);

  • As the chart shows, the buy-back rate starts low, at only about 15% in January, and stabilizes at around 30% from April onward.
  • The monthly count of buy-back users trends downward overall.
  • The buy-back analysis again shows that the three months after a user's first purchase are critical: marketing should actively guide a second and then continued purchase in that window.
  • For regular customers who buy continuously, timely promotions should reward them and reinforce their loyalty.


Posted on Mon, 16 Mar 2020 23:48:26 -0400 by roice