Analyze the sales data of hospital drugstores and see what amazing findings can be found in Python data analysis results

preface

The text and pictures of this article are from the Internet, only for learning and communication, not for any commercial purpose. The copyright belongs to the original author. If you have any questions, please contact us in time for handling.

Generally speaking, the basic process of data analysis includes the following steps:

 

1. Ask questions - the indicators we want to know (average consumption, age distribution of customers, turnover trends, etc.)

2. Import data - import the original data source into Jupyter Notebook (web crawler, data reading, etc.)

3. Data cleaning - data cleaning refers to finding and correcting identifiable errors in data files (checking data consistency, handling invalid and missing values, etc.)

4. Build model (advanced model building will use machine learning algorithm)

5. Data visualization -- matplotib library, etc

 

Specific basic knowledge such as numpy library has been introduced in previous articles:

Numy basic knowledge sharing

Introduction to Pandas

 

Let's look at an example -- Analysis of sales data of hospital drugstores

 

 

Ask questions:

The information we want to know is: monthly average consumption, monthly average consumption times, customer unit price and consumption trend

 

Understanding data:

1. Read Excel data (it's better not to have Chinese or special symbols in the path, otherwise the path will prompt that the error cannot be found.

It is better to put the file in a simple English path.)

import pandas as pd
fileNameStr='D:\Sales data of Chaoyang Hospital in 2018.xlsx'          #read Ecxcel data
xls = pd.ExcelFile(fileNameStr, dtype='object')
salesDf = xls.parse('Sheet1',dtype='object')

 

We can first view the basic information of the following data:

salesDf.head()          #Print out the first 5 lines to make sure the data is working properly

 

 

 

salesDf.shape           #How many rows, how many columns

 

 

salesDf.dtypes          #View data types for each column

 

 

 

Data cleaning:

 

 

1. Select a subset (not used in this case):

Using the loc method to select a subset

#subSalesDf=salesDf.loc[0:4, 'purchase time': 'sales quantity']

 

2. Rename column name:

colNameDict = {'Purchase time':'Sales time'}                  #take'Purchase time'Change to'Sales time'
salesDf.rename(columns = colNameDict,inplace=True)
salesDf.head()                                       #View the first five lines

 

 

 

inplace=False, the data frame itself will not change, but a new data frame will be created after the change. The default inplace is False, inplace=True, and the data frame itself will change

3. Missing data processing:

There are three kinds of python missing values: None, NA, NaN

1) Python's built-in None value

2) In pandas, the missing value is expressed as NA, indicating that it is not available.

3) For numeric data, pandas uses the floating-point value NaN (Not a Number) for missing data.

Data comes out later. If an error is encountered:... Foliot error, there is a missing value, which needs to be handled

print('Size before deleting missing value',salesDf.shape)
salesDf=salesDf.dropna(subset=['Sales time','Social Security Card No'],how='any') #Delete the blank row in the column (sales time, social security card number)
print('Size after deletion',salesDf.shape)

 

 

how='any 'means to delete any missing value in any given column

If there is too much missing data, we can build a model and use the method of inserting values to supplement the data (introduced in the article of machine learning later)

4. Data type conversion

At the beginning of import, we imported all data by string type. Now we need to change the data type of sales quantity, receivable amount and actual received amount to numerical type.

salesDf['sales volumes'] = salesDf['sales volumes'].astype('float')
salesDf['Amount receivable'] = salesDf['Amount receivable'].astype('float')
salesDf['Paid in amount'] = salesDf['Paid in amount'].astype('float')
print('Converted data type:\n',salesDf.dtypes)

 

Using the astype () method for type conversion

 

Let's change the date format

 

 

Use the split() method to split the string with ''. Return the first element of the list, uniformly change the format of adult month day, and remove the day of the week.

Next, we divide the sales time by defining a function to split the string:

def splitSaletime(timeColSer):
    timeList=[]
    for value in timeColSer:           #For example, 2018-01-01 Friday, divided into: 2018-01-01
        dateStr=value.split(' ')[0]
        timeList.append(dateStr)
 
    timeSer=pd.Series(timeList)        #Turn the list into one-dimensional data Series type
    return timeSer

 

Input: timeColSer - sales time column, a Series data type

Output: the time after splitting. The return is also a Series data type

timeSer=salesDf.loc[:,'Sales time']    #Get the sales time column
dateSer=splitSaletime(timeSer)      #Split the string to get the sales date

salesDf.loc[:,'Sales time']=dateSer    #Modify the value of the sales time column
salesDf.head()

 

 

 

 

(Note: if an error is reported after running: AttributeError: 'float' object has no attribute 'split' is because the empty cell in Excel reads into pandas as as a null value (NaN). This NaN is a floating-point type and is generally treated as a null value. So remove the NaN to separate the strings)

 

 

5. String conversion date

We use the pd.to'datetime method to convert strings to date format. The format of the incoming data is the date format of the original data - Format =% Y -% m -% d 'fixed writing: Y represents year, M represents month, and D represents day.

salesDf.loc[:,'Sales time']=pd.to_datetime(salesDf.loc[:,'Sales time'],
                                    format='%Y-%m-%d',
                                    errors='coerce')
salesDf.dtypes

 

 

 

errors='coerce ': if the original data does not conform to the date format, the converted value is null NaT

Therefore, we need to run the code to delete the null value after the conversion, because the date that does not conform to the format needs to be deleted when it is converted to null value.

salesDf=salesDf.dropna(subset=['Sales time','Social Security Card No'],how='any')

 

6. Data sorting

Use pd.sort ﹣ values method to sort the data, by means of which columns, ascending=True, descending = false

print('Data set before sorting')
salesDf.head()
salesDf=salesDf.sort_values(by='Sales time',     #Ascending by sales date
                    ascending=True)
print('Sorted dataset')
salesDf.head(3)

 

 

 

 

 

Next, we rename the line number: the reset UU index method generates index values from 0 to N in order

salesDf=salesDf.reset_index(drop=True)
salesDf.head()

 

 

 

7. Handling of outliers

First, we use the description () method to view the description statistics of all the data in each column of the data frame:

salesDf.describe()

 

 

 

(count: total, mean: average, std: standard deviation, min: minimum, 25%: lower quartile, 50%: median, 75%: upper quartile, max: maximum)

We found that the minimum value is less than 0. The analysis should be caused by errors in the recording process.

Let's delete the outliers: filter out the data whose sales quantity is greater than 0 through condition judgment

#Set query criteria
querySer=salesDf.loc[:,'sales volumes']>0
#Apply query criteria
print('Before deleting outliers:',salesDf.shape)
salesDf=salesDf.loc[querySer,:]
print('After deleting the outliers:',salesDf.shape)

 

 

In this way, we can basically complete the steps of data cleaning.

 

Build the model:

First indicator: monthly average consumption times = total consumption times / months

Note: in the same day, all the consumption of the same person is counted as one consumption. According to the column name (sales time, community card number), if the two column values are the same at the same time, only one will be kept, and the duplicate data will be deleted by using drop_duplicates

kpi1_Df=salesDf.drop_duplicates(
    subset=['Sales time', 'Social Security Card No']
)

totalI=kpi1_Df.shape[0]             #Total number of consumption -- how many lines

print('Total consumption times=',totalI)

 

 

To calculate the number of months, we need to know the earliest consumption time and the latest consumption time:

#Step 1: sort by sales time in ascending order
kpi1_Df=kpi1_Df.sort_values(by='Sales time',
                    ascending=True)
kpi1_Df=kpi1_Df.reset_index(drop=True)     #Rename row name( index)

#Step 2: get the time range
startTime=kpi1_Df.loc[0,'Sales time']         #Minimum time value
endTime=kpi1_Df.loc[totalI-1,'Sales time']    #Maximum time value

#Step 3: calculate the number of months
daysI=(endTime-startTime).days             #Days
monthsI=daysI//30                          #Months: Operators“//”It means to take the integral division and return the integral part of quotient, for example, 9 / / 2 the output result is 4
print('Months:',monthsI)

 

 

Calculate the number of months with days / 30 (discard the remainder)

Final calculated monthly average consumption times = total consumption times / months

kpi1_I=totalI // monthsI
print('Business indicator 1: monthly average consumption times=',kpi1_I)

 

 

 

Second indicator: monthly average consumption amount = total consumption amount / number of months

The total consumption amount is equal to the sum of the received amount, which can be obtained quickly by sum function

totalMoneyF=salesDf.loc[:,'Paid in amount'].sum()   #Total consumption amount
monthMoneyF=totalMoneyF / monthsI            #Monthly average consumption amount
print('Business indicator 2: monthly average consumption amount=',monthMoneyF)

 

 

 

Third indicator: customer unit price = average transaction amount = total consumption amount / total consumption times

'''
totalMoneyF: Total consumption amount
totalI: Total consumption times
'''
pct=totalMoneyF / totalI
print('Customer unit price:',pct)

 

 

Fourth indicator: consumption trend

#Before operation, copy the data to another data frame to prevent impact on the previously cleaned data frame
groupDf=salesDf

#Step 1: rename the row name( index)Is the value of the column where the sales time is
groupDf.index=groupDf['Sales time']

#Step 2: Group
gb=groupDf.groupby(groupDf.index.month)

#Step 3: apply the function to calculate the total consumption of each month
mounthDf=gb.sum()

mounthDf

 

 

 

If you want to learn Python or are learning python, there are many Python tutorials, but are they up to date? Maybe you have learned something that someone else probably learned two years ago. Share a wave of the latest Python tutorials in 2020 in this editor. Access to the way, private letter small "information", you can get free Oh!

Tags: Python Excel jupyter Attribute

Posted on Tue, 12 May 2020 03:52:20 -0400 by BIOSTALL