[Python office automation] quickly split excel into batches according to different values in a column

This situation is often encountered in the work: for the sake of data security, it is not convenient to send the full amount of data to all our objects. A full amount of data needs to be separated according to different channels (with field identification), and then sent to the corresponding objects respectively.

If this type of work is more and more, it needs to do an automatic small program, one key split, so there is this post. Of course, after the split, you will also need to automatically send email to different objects, which is not described in this post.
- - - - - - -
Here, we use the rental data from the shell to find the house, assuming that we need to split excel according to different apartment brands. The split has been tested successfully, and the results are as follows:

All codes are attached as follows for reference.

# 2020-03-06 sufm#

import pandas as pd

data = pd.read_excel(
    "D:\\Python_Efficiency\\sources\\20200305 Guangzhou renting_Shell finding room_Houyi collector.xlsx")
data.head(3)
Title content__list--item--des content__list--item-price brand
0 Room 2, hall 2, wangkeshan scenic city Huangpu - Science City - Vanke mountain view city \ n / \ n 65 ㎡ \ n 2200 yuan / month NaN
1 Whole rental, Xinrun garden, room 1, hall 1, compound East Panyu Shiqiao North Xinrun Garden 2300 yuan / month Chain family
2 Room 1, hall 1, Jindao Garden Liwan - Hedong - Jindao garden \ n / \ n 43 ㎡ \ n / East 1500 yuan / month NaN
rows = data.shape[0]  #Get rows shape[1] get columns
rows
2967
a=data["brand"].unique() #View the unique value of the brand column
b=len(a)   #See how many different unique values are in the brand column
c=data["brand"].value_counts() #View the number of unique values in the brand column
print("[See brand Unique value of column]",a,"\n","- - -")
print("[See brand How many different unique values does the column have]",b,"\n","- - -")
print(c)

# Found NAN value in brand column. For ease of understanding, we will replace the Chinese "A non brand intermediary rental"
[view the unique value of brand column] [nan 'chain home' 'free', 'Guangzhou Youjia Apartment', 'Linyou', 'Wo Apartment', 'one room in room apartment', 'long rent by the builder', 'Longquan Apartment' group creation property '
 'Huiting Apartment', 'shengjiayu Apartment', 'Jiyu Apartment', 'Fengyu Apartment', 'Murphy Apartment', 'maishang Apartment', 'Junyu Apartment', 'comma Apartment', 'Hanshi Apartment' medium rise long rent '
 'love waiting Apartment', 'Xinghe mengke Apartment', 'Yijia Apartment', 'Anyu', '52 group rent', 'Guangzhou Lingyu', 'Youjia Youth Apartment', 'she Apartment', 'Changsheng Apartment' Lemon small house '
 'Junli International Apartment' 'Sen yuechuangxiang'] 
 - - -
[see how many different unique values are in the brand column] 32 
 - - -
Free 740
 Chainman 734
 Guangzhou Youjia apartment 63
 Junyu 39
 52 group rental 22
 Long term rental of construction party 22
 One room apartment 22
 Longquan apartment 19
 Linyou 18
 Shengjia apartment 17
 Medium rise long rent 17
 Fengyu 16
 Maishang apartment 13
 Murphy apartment 13
 Wo apartment 9
 Her apartment 7
 Trace apartment 6
 Youjia youth apartment 5
 Changsheng apartment 5
 Guangzhou Lingyu 4
 House 4
 Hanshi apartment 3
 Group creation property 2
 Yijia apartment 2
 Junli international apartment 2
 Comma apartment 2
 Love waiting apartment 1
 Huiting apartment 1
 Xinghe mengke apartment 1
 Lemon house 1
 Sen yuechuangxiang 1
Name: brand, dtype: int64
data.isnull().sum(axis = 0)    #View missing values of data columns
# 1156 missing values were found in the brand column.
Title                              0
content__list--item--des        0
content__list--item-price       0
brand                        1156
dtype: int64
data.fillna(value={"brand":"A Non brand agency rental"},inplace=True)    
# Using the fillna function, replace the missing value of the brand column with "A non brand intermediary rental", and the inplace parameter is required to change the original table 

# Reference source: 
# data3.fillna(value = {'gender': data3['gender'].mode()[0], # Replace missing gender with mode of gender
# 'age':data3['age'].mean() # Replace missing age with average age
# },
# inplace = True# Modify data in place
# )
data.head(3)
Title content__list--item--des content__list--item-price brand
0 Room 2, hall 2, wangkeshan scenic city Huangpu - Science City - Vanke mountain view city \ n / \ n 65 ㎡ \ n 2200 yuan / month A non brand intermediary rental
1 Whole rental, Xinrun garden, room 1, hall 1, compound East Panyu Shiqiao North Xinrun Garden 2300 yuan / month Chain family
2 Room 1, hall 1, Jindao Garden Liwan - Hedong - Jindao garden \ n / \ n 43 ㎡ \ n / East 1500 yuan / month A non brand intermediary rental
# data.brand = data.brand.map({"A Non brand agency rental":"B Non brand agency rental"}) # Try replacing values
# # Using map function, replace "A non brand intermediary rental" in brand column with "B non brand intermediary rental". The original table can be changed without inplace parameter or other similar parameters 
# # However, for the value that is not "A non brand intermediary rental", such as "chain home" and other brand apartment names, it becomes A "NaN" null value. The gain is not worth the loss. We should add another condition to explain that when brand="A", the replacement of brand="B" can only be performed. This is not in-depth 
# data.head(10)
data["brand2"]=data["brand"]
data.head(3)
Title content__list--item--des content__list--item-price brand brand2
0 Room 2, hall 2, wangkeshan scenic city Huangpu - Science City - Vanke mountain view city \ n / \ n 65 ㎡ \ n 2200 yuan / month A non brand intermediary rental A non brand intermediary rental
1 Whole rental, Xinrun garden, room 1, hall 1, compound East Panyu Shiqiao North Xinrun Garden 2300 yuan / month Chain family Chain family
2 Room 1, hall 1, Jindao Garden Liwan - Hedong - Jindao garden \ n / \ n 43 ㎡ \ n / East 1500 yuan / month A non brand intermediary rental A non brand intermediary rental
data["brand"][0]
'A Non brand agency rental'
test_df = pd.DataFrame()
test_df
department_list = []
 
for i in range(rows):
    temp = data["brand"][i]
    if temp not in department_list:
        department_list.append(temp)   #Store the classification of demand department in a list
 
for department in department_list:
    new_df = pd.DataFrame()
 
    for i in range (0, rows):
        if data["brand"][i] == department:
            new_df = pd.concat([new_df, data.iloc[[i],:]], axis = 0, ignore_index = True)
    
#     new_df.to_excel(str(department)+".xls", sheet_name=department, index = False)   #Save each brand apartment as a new excel
    new_df.to_excel("D:\\Python_Efficiency\\sources\\20200305 Guangzhou renting_Shell finding room_Houyi collector_Split result\\"+ str(department)+ "-20200305 Guangzhou renting_Shell finding room.xlsx", sheet_name=department, index = False)   #Save each brand apartment as a new excel

43 original articles published, 10 praised, 20000 visitors+
Private letter follow

Tags: Excel shell

Posted on Mon, 09 Mar 2020 01:23:09 -0400 by phpn00bf4life