To collect Taobao data, this time we let Selenium drive the Chrome browser automatically. We could also construct request links against the Ajax interface, but that is cumbersome (it involves encryption keys, among other things), so simulating a browser with Selenium saves a lot of trouble.
The most common problem is that the ChromeDriver version does not match the installed version of Chrome, which is easily solved by downloading the matching driver. Next, we use Selenium to crawl Taobao products, use XPath to parse the product name, price, number of payers, shop name and shipping address, and finally save the data locally.
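For the driver mismatch mentioned above, a quick check is to compare major version numbers, since ChromeDriver releases follow Chrome's major version. The following is only a minimal sketch: the helper name `same_major_version` is hypothetical, and it assumes you have already read the numeric part of both version strings yourself (for example from `chrome://version` and `chromedriver --version`).

```python
def same_major_version(chrome_version: str, driver_version: str) -> bool:
    """Return True when both version strings share the same major number."""
    # "83.0.4103.97" -> "83"; only the part before the first dot matters here.
    chrome_major = chrome_version.strip().split(".")[0]
    driver_major = driver_version.strip().split(".")[0]
    return chrome_major == driver_major


# Version strings below are examples only.
print(same_major_version("83.0.4103.97", "83.0.4103.39"))  # True
print(same_major_version("84.0.4147.89", "83.0.4103.39"))  # False
```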
The crawler workflow is as follows:
Automated crawling with Selenium (you need to scan the QR code to log in to Taobao once)
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

# `browser` is assumed to be a webdriver.Chrome() instance, created elsewhere,
# that has already opened the Taobao home page.

# Search for the product and get the number of result pages
def search_product(key_word):
    # Locate the search input box and type the keyword
    browser.find_element(By.ID, "q").send_keys(key_word)
    # Locate the search button and click it
    browser.find_element(By.CLASS_NAME, 'btn-search').click()
    # Maximize the window for convenience
    browser.maximize_window()
    # Wait 15 seconds to leave enough time to scan the QR code
    time.sleep(15)
    # Locate the page-count element, whose text reads like "100 pages in total"
    page_info = browser.find_element(By.XPATH, '//div[@class="total"]').text
    # Note: findall() returns a list even when there is only one match,
    # so take the first element.
    page = re.findall(r"(\d+)", page_info)[0]
    return page
The full crawler code can be downloaded at the end of the article.
At this point, we have the crawled data:
Data before cleaning
The data is still fairly rough. We need to deal with several problems:
- Add column names
- Remove duplicate records (page turning during crawling produces duplicates)
- Replace blank number-of-payers records with "0 people paid"
- Convert the number of payers into a sales volume (note that some values are given in units of 10,000)
- Delete items without a shipping address and extract the province
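The deduplication and sales-volume steps above can be sketched in plain Python. This is a minimal sketch, not the article's own code: the function names are hypothetical, and it assumes the raw text looks like Taobao's usual "6000+人付款" or "1.5万+人付款" (where 万 means 10,000).

```python
import re

# Assumed format of the "number of payers" text, e.g. "6000+人付款" or
# "1.5万+人付款" (万 = 10,000). Adjust the pattern if your data differs.
_PAYERS = re.compile(r"(\d+(?:\.\d+)?)(万?)")


def payers_to_sales(text):
    """Convert a 'number of payers' string into an integer sales volume."""
    if not text:  # a blank or missing record counts as 0
        return 0
    match = _PAYERS.search(str(text))
    if match is None:
        return 0
    number, unit = match.groups()
    factor = 10000 if unit == "万" else 1
    return int(round(float(number) * factor))


def clean_records(rows):
    """Drop exact duplicate rows (hashable tuples), keeping the first of each."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Made-up rows of (product name, number-of-payers text).
rows = [("A", "1.5万+人付款"), ("A", "1.5万+人付款"), ("B", ""), ("C", "233人付款")]
sales = [(name, payers_to_sales(payers)) for name, payers in clean_records(rows)]
print(sales)  # [('A', 15000), ('B', 0), ('C', 233)]
```

With a pandas DataFrame, the same conversion could be applied column-wise, for example with `df['Number of payers'].apply(payers_to_sales)`.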
# Delete items without a shipping address and extract the province
df = df[df['Shipping Address '].notna()].copy()
# The address reads "province city", so the first piece is the province
df['province'] = df['Shipping Address '].str.split(' ').apply(lambda x: x[0])
# Delete the columns that are no longer needed
df.drop(['Number of payers', 'Shipping Address ', 'num', 'unit'], axis=1, inplace=True)
# Reset the index
df = df.reset_index(drop=True)
df.head(10)
This completes the data cleaning and makes the next step, visualization, much easier.
While we are at it, let's sort the data to see which zongzi are the most expensive!
df1 = df.sort_values(by="Price", axis=0, ascending=False)
df1.iloc[:5, :]
Top 5 zongzi
The top three are from the flagship store of imperial tea restaurant. Let's see what the 1780 yuan zongzi looks like!
Want to try it
In this article we use pyecharts for visualization. Some readers may still have the old version (0.5.x) installed. The 1.x releases of pyecharts are not compatible with 0.5.x, so if an import fails, this may be the cause.
All visualization code here is based on v1.7.1. You can check your pyecharts version with the following statements:
import pyecharts
print(pyecharts.__version__)
The most expensive zongzi costs 1780 yuan, which most of us probably cannot afford. So at what price do people actually buy?
First, we split the prices into the ranges recommended by Taobao:
def price_range(x):
    # Price ranges recommended by Taobao
    if x <= 22:
        return 'Under 22 yuan'
    elif x <= 115:
        return '22-115 yuan'
    elif x <= 633:
        return '115-633 yuan'
    else:
        return 'Over 633 yuan'
Then pyecharts is used to generate the sales proportion of zongzi in different price ranges.
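The article does not show this step, so here is a minimal sketch of how the shares might be computed and drawn. The helper names `sales_share_by_range` and `render_pie` are hypothetical, the sample numbers are made up, and the pie chart uses the pyecharts 1.x `Pie` chart as one possible choice; the original chart may have been built differently.

```python
from collections import defaultdict


def price_range(price):
    """Same brackets as the article's price_range function."""
    if price <= 22:
        return "Under 22 yuan"
    elif price <= 115:
        return "22-115 yuan"
    elif price <= 633:
        return "115-633 yuan"
    return "Over 633 yuan"


def sales_share_by_range(items):
    """items: iterable of (price, sales). Returns [(label, percent), ...]."""
    totals = defaultdict(int)
    for price, sales in items:
        totals[price_range(price)] += sales
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(label, round(100 * value / grand_total, 1))
            for label, value in totals.items()]


def render_pie(pairs, path="price_range_share.html"):
    """Draw the shares as a pie chart; requires pyecharts 1.x to be installed."""
    from pyecharts import options as opts
    from pyecharts.charts import Pie

    pie = Pie()
    pie.add("", list(pairs))
    pie.set_global_opts(title_opts=opts.TitleOpts(title="Sales share by price range"))
    pie.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    return pie.render(path)


# Made-up (price, sales) pairs, only to show the shape of the result.
sample = [(10, 100), (50, 600), (200, 250), (1780, 50)]
shares = sales_share_by_range(sample)
print(shares)
# [('Under 22 yuan', 10.0), ('22-115 yuan', 60.0), ('115-633 yuan', 25.0), ('Over 633 yuan', 5.0)]
```

Calling `render_pie(shares)` would then write the chart to an HTML file.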
Proportion of zongzi sales in different price ranges
It seems that zongzi (gift boxes) priced within about 100 yuan are what most people are willing to pay for, but I will stick with the ones sold at the gate of my residential compound, three for 5 yuan.
Word cloud
We use jieba to segment the product names we collected and generate a word cloud.
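The segmentation and counting step is not shown in the article, so the following is only a sketch of one way to do it. The function name `count_words` is hypothetical, filtering out single-character tokens and stop words is an assumption, and `jieba.lcut` is used as the default tokenizer (imported lazily so that another tokenizer can be passed in).

```python
from collections import Counter


def count_words(names, tokenize=None, stop_words=(), top_n=100):
    """Segment product names and return the top_n (word, count) pairs."""
    if tokenize is None:
        import jieba  # third-party; only needed when no tokenizer is supplied
        tokenize = jieba.lcut
    counts = Counter()
    for name in names:
        for word in tokenize(name):
            word = word.strip()
            # Assumption: blanks, single characters and stop words are noise.
            if len(word) > 1 and word not in stop_words:
                counts[word] += 1
    return counts.most_common(top_n)


# Illustration with English names and a whitespace tokenizer instead of jieba.
names = ["Jiaxing fresh meat zongzi", "egg yolk zongzi gift box"]
print(count_words(names, tokenize=str.split, top_n=1))  # [('zongzi', 2)]
```

The resulting list of (word, count) pairs has the shape that `WordCloud.add` accepts as its data; it could also be loaded into a DataFrame with `words` and `num` columns.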
from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType

# Word cloud. key_words comes from the segmentation step (not shown here);
# it exposes the words as .words and their counts as .num.
word1 = WordCloud(init_opts=opts.InitOpts(width='1350px', height='750px'))
word1.add("", [*zip(key_words.words, key_words.num)],
          word_size_range=[20, 200],
          shape=SymbolType.DIAMOND)
word1.set_global_opts(title_opts=opts.TitleOpts('Word cloud of zongzi product names'),
                      toolbox_opts=opts.ToolboxOpts())
word1.render("Word cloud of zongzi product names.html")
Word cloud of zongzi product names
Surrounding the huge word "zongzi" are several prominent keywords: gift box, fresh meat, egg yolk, Jiaxing, red bean paste (dousha) and Dragon Boat Festival. Apart from the Dragon Boat Festival vocabulary, the size of each keyword gives a rough idea of how popular the different flavors are.
This is basically consistent with the reference data.
I love zongzi.
As for Jiaxing, we will come back to it later.
We found the most expensive zongzi above, so which zongzi and which shops sell the most?
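The ranking code is not shown in the article. A minimal sketch of the idea, using made-up numbers and a hypothetical helper `top_by_sales`, is to sum the sales volume per product or per shop and keep the ten largest totals:

```python
from collections import Counter


def top_by_sales(pairs, n=10):
    """pairs: iterable of (name, sales). Sum sales per name, return the top n."""
    totals = Counter()
    for name, sales in pairs:
        totals[name] += sales
    return totals.most_common(n)


# Made-up (shop, sales) pairs; one shop can appear with several products.
sample = [("Shop A", 500), ("Shop B", 300), ("Shop A", 200), ("Shop C", 100)]
print(top_by_sales(sample, n=2))  # [('Shop A', 700), ('Shop B', 300)]
```

On the cleaned DataFrame, a pandas equivalent would be a `groupby` on the shop-name column followed by `['sales volume'].sum().nlargest(10)`; the exact name of the shop column is not given in the article.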
Top 10 zongzi by sales volume
Wufangzhai has four products in the top 10, one of which shows a sales volume of 1 million+; the real figure is probably higher (compare the "100,000+" display on WeChat). Zhenzhen Laolao follows closely, with three zongzi in the top 10. The other brands are Daoxiangcun and Zhiweiguan. Interestingly, the ninth item is zongzi leaves, so the demand for making zongzi at home seems to be quite large.
Top 10 zongzi shops by sales volume
The top 10 shops by sales volume look much like the top 10 products. The Wufangzhai official flagship store and the Zhenzhen Laolao flagship store take the lead.
A quick check shows that Wufangzhai and Zhenzhen Laolao are both zongzi brands from Jiaxing. No wonder Jiaxing is so prominent in the word cloud. Jiaxing belongs to Zhejiang Province, and the top two sellers are both based there. So does Zhejiang account for a large share of sales?
We continue to use pyecharts to generate a map of zongzi sales in each province:
from pyecharts import options as opts
from pyecharts.charts import Map

# Calculate the sales volume of each province
province_num = df.groupby('province')['sales volume'].sum().sort_values(ascending=False)

# Draw the map
map1 = Map(init_opts=opts.InitOpts(width='1350px', height='750px'))
map1.add("", [list(z) for z in zip(province_num.index.tolist(), province_num.values.tolist())],
         maptype='china')
map1.set_global_opts(title_opts=opts.TitleOpts(title='Sales distribution of zongzi in different provinces'),
                     visualmap_opts=opts.VisualMapOpts(max_=300000),
                     toolbox_opts=opts.ToolboxOpts())
map1.render("Sales distribution of zongzi in different provinces.html")
Sales distribution of zongzi in different provinces
The difference in sales between provinces is huge.
You could say that for zongzi, China looks to Zhejiang, and Zhejiang looks to Jiaxing: counting by shipping address, Zhejiang accounts for 70.6% of total sales, and Jiaxing accounts for 87.4% of Zhejiang's sales.
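The two percentages are nested shares: the province's sales over the national total, then the city's sales over the province's total. The sketch below shows only the arithmetic. The function names are hypothetical, and the sample numbers are made up and are not the article's data.

```python
def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1) if whole else 0.0


def regional_shares(records, province="Zhejiang", city="Jiaxing"):
    """records: iterable of (province, city, sales).

    Returns (province share of all sales, city share of the province's sales).
    """
    total = in_province = in_city = 0
    for prov, cty, sales in records:
        total += sales
        if prov == province:
            in_province += sales
            if cty == city:
                in_city += sales
    return share(in_province, total), share(in_city, in_province)


# Made-up records; in the real data the names come from the shipping address.
sample = [
    ("Zhejiang", "Jiaxing", 700),
    ("Zhejiang", "Hangzhou", 100),
    ("Shanghai", "Shanghai", 200),
]
print(regional_shares(sample))  # (80.0, 87.5)
```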
Jiaxing zongzi in the documentary "A Bite of China"