Introduction to this article
The Dragon Boat Festival is coming. Are you traveling? Going home? Visiting relatives and friends? Whichever it is, you will need to bring zongzi. So:
- What brand of zongzi do you choose?
- What flavor of zongzi do you choose?
- What price range do you choose?
This year, Mr. Huang used Python to crawl the "zongzi data" on JD.com and analyzed it to see what turns up. This article covers data crawling, data cleaning and data visualization. It is simple, but it makes a complete small data analysis project in which you can apply what you have learned end to end.
The whole idea is as follows:
- Crawl web pages: https://www.jd.com/
- Crawling description: on the Jingdong (JD) website we search for "zongzi", which returns about 100 pages of results. The fields we crawl include information from the primary (search results) page and some information from the secondary (product detail) page;
- Crawling idea: first parse the primary page for one page of results, then parse the secondary pages, and finally turn the page;
- Crawling fields: name (title), price, brand (store) and category (taste) of zongzi;
- Tools used: requests + lxml + chardet + pandas + time + re + pyecharts
- Website parsing method: xpath
The final effect is as follows:
Data crawling
The Jingdong website loads its content dynamically, so a plain request only retrieves the first 30 items of a page (a full page has 60 items).
In this article I use only the most basic method and crawl the first 30 items of each page (if you are interested, you can try crawling all the data yourself).
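As a minimal sketch of the page turning, the search URLs can be generated up front. It assumes, as this article's crawl does, that result page k is requested with page=2k-1 (1, 3, 5, ...); the helper name build_page_urls is made up for illustration, and the query parameters are the ones that appear in this article's search URL.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://search.jd.com/Search"


def build_page_urls(keyword, n_pages):
    """Return the search URLs for result pages 1..n_pages.

    Assumption: result page k is addressed with page=2k-1.
    """
    urls = []
    for k in range(1, n_pages + 1):
        params = {
            "keyword": keyword,
            "qrst": 1,
            "wq": keyword,
            "stock": 1,
            "page": 2 * k - 1,
        }
        # urlencode percent-encodes the Chinese keyword as UTF-8
        urls.append(f"{SEARCH_URL}?{urlencode(params)}")
    return urls


urls = build_page_urls("粽子", 100)
print(len(urls))
print(urls[0])
```

Building the list first keeps the crawl loop simple: it only has to iterate over the URLs and pause between requests.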
So which fields are crawled in this article? They are shown here. If you are interested, you can crawl more fields and do a more detailed analysis.
Here is the crawler code:
import pandas as pd
import requests
from lxml import etree
import chardet
import time
import re

def get_CI(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'}
    rqg = requests.get(url, headers=headers)
    rqg.encoding = chardet.detect(rqg.content)['encoding']
    html = etree.HTML(rqg.text)

    # Price
    p_price = html.xpath('//div/div[@class="p-price"]/strong/i/text()')

    # Name
    p_name = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/em')
    p_name = [str(em.xpath('string(.)')) for em in p_name]

    # Deep url
    deep_ur1 = html.xpath('//div/div[@class="p-name p-name-type-2"]/a/@href')
    deep_url = ["http:" + i for i in deep_ur1]

    # From here, we get the information of the "secondary page"
    brands_list = []
    kinds_list = []
    for i in deep_url:
        rqg = requests.get(i, headers=headers)
        rqg.encoding = chardet.detect(rqg.content)['encoding']
        html = etree.HTML(rqg.text)

        # Brand
        brands = html.xpath('//div/div[@class="ETab"]//ul[@id="parameter-brand"]/li/@title')
        brands_list.append(brands)

        # Category
        kinds = re.findall('>Category:(.*?)</li>', rqg.text)
        kinds_list.append(kinds)

    data = pd.DataFrame({'name': p_name, 'Price': p_price, 'brand': brands_list, 'category': kinds_list})
    return data

x = "https://search.jd.com/Search?keyword=%E7%B2%BD%E5%AD%90&qrst=1&wq=%E7%B2%BD%E5%AD%90&stock=1&page="
url_list = [x + str(i) for i in range(1, 200, 2)]
res = pd.DataFrame(columns=['name', 'Price', 'brand', 'category'])

# Page turning is performed here
for url in url_list:
    res0 = get_CI(url)
    res = pd.concat([res, res0])
    time.sleep(3)

# Save data
res.to_csv('aliang.csv', encoding='utf_8_sig')
The crawled data ends up looking like this:
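The category field is pulled from the detail page with re.findall rather than xpath. The sketch below shows how that pattern behaves on a made-up HTML fragment. The fragment, the helper name extract_field and the label text "Category:" are assumptions for illustration; on the live page the label must match whatever text the site actually uses.

```python
import re

# Hypothetical fragment standing in for a product detail page.
sample_html = """
<ul>
<li title="Wufangzhai gift box">Product name:Wufangzhai gift box</li>
<li title="Fresh meat zongzi">Category:Fresh meat zongzi</li>
</ul>
"""

LABEL = "Category:"  # assumption: must match the label text on the real page


def extract_field(html_text, label):
    """Return every value that follows `label` inside an <li> element."""
    # Non-greedy group: capture up to the first closing </li>
    pattern = ">" + re.escape(label) + "(.*?)</li>"
    return re.findall(pattern, html_text)


kinds = extract_field(sample_html, LABEL)
missing = extract_field("<li>Weight:500g</li>", LABEL)
print(kinds)
print(missing)
```

Because re.findall always returns a list (an empty one when nothing matches), each cell of the category column holds a list, and the saved file keeps the square brackets around the value.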
Data cleaning
As the figure above shows, the data is quite tidy and not particularly messy, so only a few simple operations are needed.
First, read the data with the pandas library.
import pandas as pd

df = pd.read_excel("traditional Chinese rice-pudding.xlsx", index_col=False)
df.head()
The results are as follows:
Next, we strip the square brackets from the "brand" and "category" fields.
df["brand"] = df["brand"].apply(lambda x: x[1:-1])
df["category"] = df["category"].apply(lambda x: x[1:-1])
df.head()
The results are as follows:
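If the cells are stringified Python lists such as "['Wufangzhai']", slicing with [1:-1] removes the brackets but leaves the inner quotes. As an alternative sketch (not the method used in this article), the cell can be parsed with ast.literal_eval and its first item taken; the helper name first_item is made up for illustration.

```python
import ast


def first_item(cell):
    """Parse a cell such as "['Wufangzhai']" and return its first item.

    Returns "" for an empty list and leaves anything that is not a
    list literal unchanged.
    """
    if not isinstance(cell, str):
        return cell
    try:
        values = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell  # not a Python literal: keep the original text
    if isinstance(values, list):
        return str(values[0]).strip() if values else ""
    return cell


print(first_item("['Wufangzhai']"))
print(repr(first_item("[]")))
```

With pandas this would be applied as df["brand"].apply(first_item), in place of the lambda that slices the string.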
① Top 10 zongzi brands (stores)
df["brand"].value_counts()[:10]
The results are as follows:
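② Top 5 zongzi flavors
def func1(x):
    # Group every flavor whose name contains "sweet" under one label
    if "sweet" in x:
        return "Sweet zongzi"
    else:
        return x

df["category"] = df["category"].apply(func1)
# [1:6] skips the most frequent entry and keeps the next five
df["category"].value_counts()[1:6]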
The results are as follows:
③ Price ranges of zongzi
def price_range(x):
    # Divide the price range according to Taobao's recommendation
    if x <= 50:
        return '<50 yuan'
    elif x <= 100:
        return '50-100 yuan'
    elif x <= 300:
        return '100-300 yuan'
    elif x <= 500:
        return '300-500 yuan'
    elif x <= 1000:
        return '500-1000 yuan'
    else:
        return '>1000 yuan'

df["Price range"] = df["Price"].apply(price_range)
df["Price range"].value_counts()
The results are as follows:
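The same binning can also be written with pd.cut, which assigns each price to an interval in one vectorized call. This is an alternative sketch, not the code used in this article; the demo prices are made up, and the bin edges mirror the thresholds of price_range (each interval includes its upper edge, matching the <= comparisons).

```python
import pandas as pd

# Edges mirror the thresholds in price_range; pd.cut intervals are
# right-inclusive by default, so 50 falls in the first bin.
BINS = [float("-inf"), 50, 100, 300, 500, 1000, float("inf")]
LABELS = ["<50 yuan", "50-100 yuan", "100-300 yuan",
          "300-500 yuan", "500-1000 yuan", ">1000 yuan"]


def add_price_range(df, price_col="Price"):
    """Return a copy of df with a categorical "Price range" column."""
    out = df.copy()
    out["Price range"] = pd.cut(out[price_col], bins=BINS, labels=LABELS)
    return out


# Made-up prices for illustration
demo = pd.DataFrame({"Price": [29.9, 50, 128, 499, 1288]})
result = add_price_range(demo)
ranges = result["Price range"].astype(str).tolist()
print(ranges)
```

A side benefit is that the resulting column is an ordered categorical, so value_counts(sort=False) lists the ranges from cheapest to most expensive.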
Because there is not much data and there are few fields, there is little messy data. So no deduplication or missing-value filling was done here. You can collect more fields and more data yourself for a fuller analysis.
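For a larger crawl, those skipped steps might look like the sketch below. It assumes the column names used in this article (name, Price, brand, category); the helper name basic_clean, the "unknown" placeholder and the demo rows are made up for illustration.

```python
import pandas as pd


def basic_clean(df):
    """Drop duplicate rows, coerce prices to numbers, fill empty categories."""
    out = df.drop_duplicates().copy()
    # Non-numeric prices become NaN instead of raising an error
    out["Price"] = pd.to_numeric(out["Price"], errors="coerce")
    # Treat empty strings and missing values the same way
    out["category"] = out["category"].replace("", "unknown").fillna("unknown")
    # A row without a usable price cannot be assigned a price range
    out = out.dropna(subset=["Price"])
    return out.reset_index(drop=True)


# Made-up rows for illustration
demo = pd.DataFrame({
    "name": ["A", "A", "B", "C"],
    "Price": ["39.9", "39.9", "n/a", "128"],
    "brand": ["X", "X", "Y", "Z"],
    "category": ["Sweet zongzi", "Sweet zongzi", "", None],
})
cleaned = basic_clean(demo)
print(cleaned)
```

Whether to drop or fill incomplete rows depends on the question being asked; here rows without a price are dropped because the price-range analysis cannot use them.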
Data visualization
As the saying goes: words are not as good as tables, and tables are not as good as diagrams. Through visual analysis, we can show the "hidden" information behind the data.
Extension: of course, this is only a starting point. I did not collect much data or many fields, so consider it an exercise for you: go on and do a more thorough analysis with more data and more fields.
Here, we will make a visual display based on the following questions:
- ① Bar chart of the top 10 zongzi stores;
- ② Bar chart of the top 5 zongzi flavors;
- ③ Pie chart of zongzi price ranges;
- ④ Word cloud of zongzi product names;
To keep the article compact, the code for the visualization part is not shown inline; it can be obtained at the end of this article.
① Bar chart of the top 10 zongzi stores
Analysis: last year we analyzed some mooncake data, and the brands "Wufangzhai" and "Beijing Daoxiang Village" are still fresh in memory. They are long-established makers of both mooncakes and zongzi. As for "Sanquan" and "Sinian", I always thought they only made dumplings. Are their zongzi worth a try? There are also some newer brands, such as "Zhu eldest brother" and "Daoxiang private house", that show up in the search results. Choose carefully when shopping; the brand matters too.
② Bar chart of the top 5 zongzi flavors
Analysis: as I remember it, "sweet zongzi" was what I ate most as a child. I did not know zongzi could contain meat until junior high school. The chart shows that most shops sell "fresh meat zongzi"; after all, it looks more upscale as a gift. There are also flavors I have never tried, such as "honey jujube zongzi" and "bean paste zongzi". If you were giving zongzi as a gift, which flavor would you choose?
③ Pie chart of zongzi price ranges
Analysis: here I deliberately subdivided the price ranges. The pie chart is quite realistic: the Dragon Boat Festival comes only once a year, and sales are still dominated by low margins and high volume. Nearly 80% of the zongzi sell for less than 100 yuan. There are also some mid-range zongzi priced at 100-300 yuan. Above 300 yuan, I don't think it is worth it; I, for one, won't spend that much on zongzi.
④ Word cloud of zongzi product names
Analysis: the word cloud gives a rough view of the sellers' selling points. Since it is a festival, words like "gifts" and "gift-giving" reflect the festive atmosphere, while "pork" and "bean paste" reflect the flavors. Is zongzi also a good choice for "breakfast"? And for bulk purchases, "group buying" is supported too.
⑤ Combining the charts into a dashboard
The visualizations in this article are drawn with the pyecharts library. We first make each chart separately and then combine them into a single attractive dashboard. If you want to know how it is made, you can get the code by private message!