The sweet vs. salty zongzi debate: using Python to crawl and analyze zongzi data on Taobao


The text and pictures of this article are from the Internet, only for learning and communication, not for any commercial purpose. The copyright belongs to the original author. If you have any questions, please contact us in time for handling.


To collect the Taobao data, this time we use Selenium to drive the Chrome browser automatically. We could also construct requests against the Ajax interface directly, but that is cumbersome (it involves encryption keys, among other things), so simulating a browser with Selenium saves a lot of trouble.

The most common problem is that the ChromeDriver version does not match the installed Chrome version, which is easily solved by installing the matching driver. Next, we use Selenium to crawl Taobao product listings, use XPath to parse the product name, price, number of payers, shop name, and shipping address, and finally save the data locally.

The crawler's workflow is as follows:


Automated crawling with Selenium (requires scanning the QR code to log in to Taobao once)

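import re

from selenium import webdriver

# Search for the product and return the total number of result pages.
# `browser` is assumed to be a Chrome webdriver instance created elsewhere in the full script.
def search_product(key_word):
    # Locate the search input box and enter key_word (code not shown in this excerpt)
    # Locate the search button and click it (code not shown in this excerpt)
    # Maximize the window so the login QR code is easy to scan
    # Wait 15 seconds to leave enough time to scan the QR code
    # Locate the page-count element, whose text contains the total number of pages
    # (find_element_by_xpath is the Selenium 3 API; Selenium 4 uses find_element(By.XPATH, ...))
    page_info = browser.find_element_by_xpath('//div[@class="total"]').text
    # Note: re.findall() returns a list, even when there is only one match
    page = re.findall(r"(\d+)", page_info)[0]
    return page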
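
The page-count extraction can be tried without a browser. The following is only a sketch, not the author's code: `parse_page_count` is a hypothetical helper, and the sample string is an assumed rendering of the text in the `total` element. Unlike the snippet above, it returns an integer and guards against an empty match list.

```python
import re


def parse_page_count(page_info):
    """Return the first integer found in the page-info text, or None if there is none."""
    # re.findall() always returns a list, so check it before indexing
    numbers = re.findall(r"\d+", page_info)
    return int(numbers[0]) if numbers else None


# Assumed sample text; the real element text on Taobao may differ
print(parse_page_count("100 pages in total"))
```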


See the end of the article for how to download the complete crawler code.

Data cleaning

At this point, the crawled data looks like this:


Data before cleaning

The data is still fairly rough. We need to deal with several problems:

  1. Add column names
  2. Remove duplicate rows (paging through the results while crawling produces duplicates)
  3. Replace blank payer counts with "0 people paid"
  4. Convert the number of payers into a sales volume (note that some values are given in units of 10,000)
  5. Delete items without a shipping address, and extract the province

Part of the code:

# Delete items without a shipping address and extract the province
df = df[df['Shipping Address'].notna()].copy()
df['province'] = df['Shipping Address'].str.split(' ').str[0]

# Delete columns that are no longer needed
df.drop(['Number of payers', 'Shipping Address', 'num', 'unit'], axis=1, inplace=True)

# Reset the index
df = df.reset_index(drop=True)
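
The conversion from payer count to sales volume (step 4) is not shown in the excerpt; the dropped `num` and `unit` columns suggest the count was first split into a number and a unit. Below is a minimal sketch of one way to do it in plain Python. The helper name `payers_to_sales` and the raw string format (for example `1.5万+人付款`) are assumptions, not taken from the author's code.

```python
import re


def payers_to_sales(text):
    """Convert a payer-count string to an integer sales figure."""
    if not text:
        # Blank records are treated as zero payers
        return 0
    match = re.search(r"(\d+(?:\.\d+)?)(万?)", text)
    if match is None:
        return 0
    num, unit = match.groups()
    # "万" means 10,000; without it the number is taken at face value
    multiplier = 10000 if unit == "万" else 1
    # round() absorbs floating-point noise in products such as 2.3 * 10000
    return round(float(num) * multiplier)


# Assumed sample strings for illustration only
for raw in ["6000+人付款", "1.5万+人付款", ""]:
    print(repr(raw), "->", payers_to_sales(raw))
```

Because Taobao shows counts such as "6000+", the result is a lower bound on the real sales rather than an exact figure.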



Data after cleaning

With that, the data cleaning is done, and the data is ready for visualization.

While we are at it, let's sort by price to see which zongzi are the most expensive!

df1 = df.sort_values(by="Price", axis=0, ascending=False)



Top 5 most expensive zongzi

The top three all come from the flagship store of imperial tea restaurant. Let's see what the 1,780 yuan zongzi looks like!


Want to try it

Data visualization

In this article, we use pyecharts for visualization. Some readers may still be on the old version (0.5.x), and pyecharts 1.x is not compatible with it. If an import fails, this is the likely cause.

All the visualization code here is based on v1.7.1. You can check your pyecharts version with the following statements:

import pyecharts
print(pyecharts.__version__)
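
If you would rather check the version without importing the package (for example, when the import itself fails), the standard library can read the installed metadata. This is only a sketch under assumptions: the helper names `installed_version` and `major_version` are made up for illustration, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '1.7.1'."""
    return int(version.split(".")[0])


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyecharts")
if version is None:
    print("pyecharts is not installed")
elif major_version(version) < 1:
    print(f"pyecharts {version} is an old 0.5.x release")
else:
    print(f"pyecharts {version} is 1.x or later")
```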


Pie chart

The most expensive zongzi costs 1,780 yuan, which most of us probably can't afford. So at what price do people actually buy?

First, we split the prices into the ranges recommended by Taobao:

def price_range(x):  # Price ranges recommended by Taobao
    if x <= 22:
        return 'Under 22 yuan'
    elif x <= 115:
        return '22-115 yuan'
    elif x <= 633:
        return '115-633 yuan'
    else:
        return 'Over 633 yuan'


Then pyecharts is used to generate the sales proportion of zongzi in different price ranges.
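
The numbers behind such a chart are each price range's share of total sales. A minimal sketch of that aggregation in plain Python follows. The function `sales_share_by_range` and the sample data are made up for illustration; the author's actual code presumably works on the DataFrame instead.

```python
from collections import defaultdict


def price_range(price):
    """Map a price in yuan to the price ranges used in this article."""
    if price <= 22:
        return "Under 22 yuan"
    elif price <= 115:
        return "22-115 yuan"
    elif price <= 633:
        return "115-633 yuan"
    return "Over 633 yuan"


def sales_share_by_range(items):
    """Return each price range's share of total sales, in percent."""
    totals = defaultdict(int)
    for price, sales in items:
        totals[price_range(price)] += sales
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {band: round(100 * total / grand_total, 1) for band, total in totals.items()}


# Made-up (price, sales) pairs for illustration only
sample = [(5, 600), (30, 300), (200, 90), (1780, 10)]
print(sales_share_by_range(sample))
```

The resulting (range, share) pairs are the kind of data a pie chart takes as input.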


Share of zongzi sales in different price ranges

It seems that zongzi (gift boxes) under 100 yuan are what most people are comfortable paying for, but I will still go for the ones sold at the gate of my residential compound at 3 for 5 yuan.

Word cloud

We use jieba to segment the product names we collected and generate a word cloud.

from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType  # presumably for a shape argument not shown in this excerpt

# Word cloud; key_words is a table of words and their counts built beforehand
word1 = WordCloud(init_opts=opts.InitOpts(width='1350px', height='750px'))
word1.add("", [*zip(key_words.words, key_words.num)],
          word_size_range=[20, 200])
word1.set_global_opts(title_opts=opts.TitleOpts(title='Word cloud of zongzi product names'))
word1.render("Word cloud of zongzi product names.html")
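
The snippet above relies on a `key_words` table of words and counts whose construction is not shown. As a rough sketch of the counting step, the hypothetical helper `count_keywords` below tallies tokens across product titles. It splits on whitespace by default so that it runs without extra packages, and dropping one-character tokens is an assumption of this sketch.

```python
from collections import Counter


def count_keywords(titles, tokenize=str.split, top_n=None):
    """Count how often each token appears across all product titles."""
    counts = Counter()
    for title in titles:
        # Skip one-character tokens, which are mostly particles or punctuation
        counts.update(token for token in tokenize(title) if len(token) > 1)
    return counts.most_common(top_n)


# Made-up, pre-spaced titles so that the default whitespace tokenizer works
titles = ["Jiaxing zongzi gift box", "egg yolk zongzi", "zongzi gift box"]
print(count_keywords(titles, top_n=3))
```

For real Chinese product names, a segmenter such as `jieba.lcut` could be passed as `tokenize`. The (word, count) pairs returned here have the same shape as the list passed to `word1.add`.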



Word cloud of zongzi product names

Around the huge word "zongzi" are several prominent keywords: gift box, fresh meat, egg yolk, Jiaxing, red bean paste (dousha), and Dragon Boat Festival. Apart from the vocabulary related to the Dragon Boat Festival, the size of each keyword gives a rough idea of how popular the different flavors are.

This is basically consistent with the reference data.

I love zongzi.

As for Jiaxing, we will continue to mention it later.


Bar chart

We found the most expensive zongzi above, so which zongzi and which shops sell the best?
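
The ranking code is not included in this excerpt. One way to compute a shop-level top N, sketched in plain Python with made-up records, is to sum the sales per shop and keep the largest totals. The helper name `top_shops` is hypothetical.

```python
from collections import Counter


def top_shops(records, n=10):
    """Sum sales per shop and return the n shops with the highest totals."""
    totals = Counter()
    for shop, sales in records:
        totals[shop] += sales
    return totals.most_common(n)


# Made-up (shop, sales) records for illustration only
records = [("Shop A", 500), ("Shop B", 1200), ("Shop A", 900), ("Shop C", 300)]
print(top_shops(records, n=2))
```

The product-level ranking works the same way, using the product name as the key instead of the shop name.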



Top 10 zongzi by sales volume

Wufangzhai has 4 products in the top 10, one of which shows sales of 1,000,000+. The real figure is probably higher, since the display is capped (much like the "100k+" read counts on WeChat). Zhenzhen Laolao follows closely with 3 products in the top 10. The other brands are Daoxiangcun and Zhiweiguan. Interestingly, the ninth item is zongzi leaves, so demand for making zongzi at home seems to be quite large.


Top 10 zongzi shops by sales volume

The top 10 shops look much like the top 10 products. The Wufangzhai official flagship store and the Zhenzhen Laolao flagship store take the lead.

After checking, Wufangzhai and Zhenzhen Laolao turn out to be two zongzi brands from Jiaxing. No wonder Jiaxing is so prominent in the word cloud. Jiaxing is in Zhejiang Province, and the top two sellers are both based there. Doesn't that mean Zhejiang accounts for a very large share of sales?


We continue with pyecharts to generate a map of zongzi sales in each province:

from pyecharts import options as opts
from pyecharts.charts import Map

# Total sales volume per province, largest first
province_num = df.groupby('province')['sales volume'].sum().sort_values(ascending=False)

# Draw the map (any further arguments are not shown in this excerpt)
map1 = Map(init_opts=opts.InitOpts(width='1350px', height='750px'))
map1.add("", [list(z) for z in zip(province_num.index.tolist(), province_num.values.tolist())])
map1.set_global_opts(title_opts=opts.TitleOpts(title='Sales distribution of zongzi in different provinces'))
map1.render("Sales distribution of zongzi in different provinces.html")



Sales distribution of zongzi in different provinces

The differences in sales between provinces are huge.

You could say that for Chinese zongzi, look to Zhejiang, and for Zhejiang zongzi, look to Jiaxing [3]. (Based on the shipping addresses, Zhejiang accounts for 70.6% of total sales, and Jiaxing accounts for 87.4% of Zhejiang's sales.)
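
Shares like these are nested percentages: the province's share of all sales, then the city's share within that province. The sketch below shows the arithmetic on made-up records. The helper names are hypothetical, the numbers are not the article's data, and the "Province City" address format is assumed from the way the province is split off the shipping address earlier.

```python
def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1) if whole else 0.0


def regional_shares(records, province, city):
    """Return (province share of all sales, city share of the province's sales)."""
    total = province_total = city_total = 0
    for address, sales in records:
        parts = address.split(" ")
        total += sales
        if parts[0] == province:
            province_total += sales
            # Some addresses contain only a province, so check the length first
            if len(parts) > 1 and parts[1] == city:
                city_total += sales
    return percent(province_total, total), percent(city_total, province_total)


# Made-up (shipping address, sales) records for illustration only
records = [
    ("Zhejiang Jiaxing", 700),
    ("Zhejiang Hangzhou", 100),
    ("Beijing", 150),
    ("Guangdong Guangzhou", 50),
]
print(regional_shares(records, "Zhejiang", "Jiaxing"))
```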


Jiaxing zongzi in documentary "China on the tip of the tongue"

Tags: Python Selenium Google Lambda

Posted on Mon, 22 Jun 2020 03:20:35 -0400 by sps