While reading Toutiao (Today's Headlines) yesterday, I saw that the number of marriage registrations has fallen for seven consecutive years, hitting a 17-year low last year.
I was stunned.
A closer look shows that 8.1433 million couples officially registered their marriages in 2020, 1.13 million fewer than in 2019.
This is the seventh consecutive annual decline since the peak of 13.4693 million couples in 2013. The 2020 figure is also the lowest in 17 years, that is, since 2003, for which the National Bureau of Statistics website lists 8.114 million couples.
In the comment section, everyone has their own view on why the marriage rate is so low.
Today, let's use a crawler to collect these comments and see what reasons come up beyond the ones we already know.
Requirement analysis
For each comment under the article, we want to collect:
User name
Comment content
Number of replies
Number of likes
Comment time, etc.
Web page analysis
First, open the browser's developer tools (F12):
Locate the comments, as shown in the figure above, and find the real URL that the page requests.
Look at the URL parameters: count=20 means 20 comments per page, offset=0, 20, 40, ... controls paging, and the other parameters do not change.
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=0&count=20&group_id=7032951744313164295&item_id=7032951744313164295
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=20&count=20&group_id=7032951744313164295&item_id=7032951744313164295
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=40&count=20&group_id=7032951744313164295&item_id=7032951744313164295
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=60&count=20&group_id=7032951744313164295&item_id=7032951744313164295
Based on this, we can construct the request URL for any page, where page starts at 1:
url = f'https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset={(page-1)*20}&count=20&group_id=7032951744313164295&item_id=7032951744313164295'
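The same URL can also be assembled from a parameter dictionary, which makes the paging rule easier to read. This is only a sketch: the helper name build_comment_url and the constants are illustrative and not part of the original script, and the parameter values are copied from the requests observed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.toutiao.com/article/v2/tab_comments/"
PAGE_SIZE = 20  # matches count=20 in the observed requests


def build_comment_url(page, group_id="7032951744313164295"):
    """Return the comment URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    params = {
        "aid": 24,
        "app_name": "toutiao_web",
        "offset": (page - 1) * PAGE_SIZE,  # 0, 20, 40, ...
        "count": PAGE_SIZE,
        "group_id": group_id,
        "item_id": group_id,  # same value as group_id in the observed URLs
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_comment_url(3))
```

Page 3 maps to offset=40, which matches the third URL listed above.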
Send request
We first use the URL found above to request a single page:
import requests

page = 1  # page number, starting at 1
url = f'https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset={(page-1)*20}&count=20&group_id=7032951744313164295&item_id=7032951744313164295'
headers = {
    'cookie': 'xxxxxxxxxx',
    'referer': 'xxxxxxxxxx',
    'user-agent': 'xxxxxxxxxx'
}
resp = requests.get(url, headers=headers)
The results are as follows:
The response is JSON, and everything we need sits in the 'comment' object of each entry in the 'data' list.
With the structure clear, extracting the fields is simple.
import time
from icecream import ic

# The comments are in the 'data' list of the JSON response
json_data = resp.json()['data']
for item in json_data:
    # User name
    user = item['comment']['user_name']
    # Comment content
    text = item['comment']['text']
    # Number of replies
    reply = item['comment']['reply_count']
    # Comment time (Unix timestamp converted to local time)
    times = item['comment']['create_time']
    rls_time = time.strftime('%Y-%m-%d %H:%M', time.localtime(times))
    # Number of likes
    stars = item['comment']['digg_count']
    ic(user, stars, rls_time, reply, text)

'''
ic| user: 'Happy biscuit Zp'
    stars: 1741
    rls_time: '2021-11-21 17:42'
    reply: 239
    text: 'Don't say it's the epidemic'
ic| user: 'Tonglu night reading'
    stars: 253
    rls_time: '2021-11-21 17:47'
    reply: 43
    text: 'The marriage rate has declined for seven consecutive years. Where does the fertility rate come from without marriage'
ic| user: 'Lily Wang Zhihan'
    stars: 148
    rls_time: '2021-11-21 17:50'
    reply: 59
    text: '2020 I am one of the newlyweds who got married in[lovely]'
ic| user: 'Xiaotaozi's life video'
    stars: 206
    rls_time: '2021-11-21 17:52'
    reply: 43
    text: 'There are too many single men in rural areas. They can't afford to get married and there are no women. There are more than 30 single men in our village, but there is no unmarried woman. It's true'
ic| user: 'Tomato 1543353620246856'
    stars: 197
    rls_time: '2021-11-21 18:48'
    reply: 11
    text: ('How many people dare to get married now? In case of a scum, talking about divorce is trouble. '
           'My friend met a scum man and appealed. Finally, the man agreed to divorce. My friend gave me the fare from Guangdong to Sichuan. If you don't give it, you won't come. '
           'Here you are. Let's apply for divorce first. As a result, there was an epidemic in the middle and it dragged on. Missed it. Start over again. Then the man was in trouble and didn't come... '
           'The man hasn't been in charge since he got married and had children in 13 years. And hit my friend. Beat my friend away. My friend took care of the children himself. The man kept telling the child that his father was dead.')
'''
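The extraction step can also be written as a small function that tolerates entries without a 'comment' object. The following is a minimal sketch, not the original script: parse_comments, the output keys and the sample payload are made up for illustration, and it assumes times should be shown in China Standard Time (UTC+8) regardless of where the script runs.

```python
from datetime import datetime, timedelta, timezone

# Assumption: display times in China Standard Time (UTC+8)
CST = timezone(timedelta(hours=8))


def parse_comments(payload):
    """Flatten a comment response into a list of dicts (hypothetical helper)."""
    rows = []
    for item in payload.get("data") or []:
        comment = item.get("comment")
        if not comment:
            continue  # skip entries that carry no comment object
        created = datetime.fromtimestamp(comment.get("create_time", 0), tz=CST)
        rows.append({
            "user": comment.get("user_name", ""),
            "likes": comment.get("digg_count", 0),
            "time": created.strftime("%Y-%m-%d %H:%M"),
            "replies": comment.get("reply_count", 0),
            "text": comment.get("text", ""),
        })
    return rows


# Made-up payload that mimics the field names used above
sample = {
    "data": [
        {"comment": {"user_name": "demo_user", "text": "hello", "reply_count": 2,
                     "create_time": 1637487720, "digg_count": 5}},
        {"other": 1},  # entry without a comment is ignored
    ]
}
print(parse_comments(sample))
```

Using .get with defaults keeps one malformed entry from stopping the whole crawl, at the cost of silently filling in empty values.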
Multi page acquisition
For testing, first request 200 pages (at 20 comments per page, at most 4,000 comments):
for page in range(1, 200+1):
    url = f'https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset={(page-1)*20}&count=20&group_id=7032951744313164295&item_id=7032951744313164295'
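A paging loop usually benefits from stopping as soon as a page comes back empty instead of always requesting every page. The sketch below shows that idea with the network call injected as a function, so it can be tried without hitting the site. iter_comments and fake_fetch are hypothetical names; in the real script, the fetch function would presumably wrap requests.get(...) and return the 'data' list.

```python
import time


def iter_comments(fetch_page, max_pages=200, delay=0.0):
    """Yield comments page by page until a page is empty (hypothetical helper).

    fetch_page(page) must return the list of comments for a 1-based page.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # no more comments, stop early
        yield from items
        if delay:
            time.sleep(delay)  # optional pause between requests


def fake_fetch(page):
    """Stand-in for the real request: two pages of data, then nothing."""
    pages = {1: ["a", "b"], 2: ["c"]}
    return pages.get(page, [])


print(list(iter_comments(fake_fetch)))
```

Passing a non-zero delay spaces out the requests, which is gentler on the server during long crawls.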
Data saving
Next, we use openpyxl to save the data to an Excel file.
More than 1,500 comments were saved in total.
import openpyxl as op

# Create workbook
ws = op.Workbook()
# Create a worksheet
wb = ws.create_sheet(index=0)
# Create header
wb.cell(row=1, column=1, value='User name')
wb.cell(row=1, column=2, value='Comment like')
wb.cell(row=1, column=3, value='Comment time')
wb.cell(row=1, column=4, value='Post reply')
wb.cell(row=1, column=5, value='Comment content')
# Data rows go below the header, one per comment inside the scraping loop,
# e.g. wb.append([user, stars, rls_time, reply, text])
# Save data
ws.save('Marriage rate.xlsx')
print('Data saved!')
Data preprocessing
We first use pandas to read the Excel file, then remove duplicate rows and missing values.
import pandas as pd

# Read data
rcv_data = pd.read_excel('./Marriage rate.xlsx')
# Delete duplicate records
rcv_data = rcv_data.drop_duplicates()
# Delete missing values
rcv_data = rcv_data.dropna()
# Sample and display 5 pieces of data
print(rcv_data.sample(5))

'''
                             User name  Comment like      Comment time  Post reply  Comment content
943                 User 4947984566248             1  2021-11-21 17:50           0  Isn't it good to live together? What do you get married?
635   Chengdu gentleman Xichen Tianjie             1  2021-11-21 19:00           0  That's normal
1594        Black Dwarf ReFuelYourlife             0  2021-11-21 23:15           0  The Internet is both a good thing and a bad thing. In the 1980s and 1990s, not so many people compared with each other, because they didn't know a lot of data...
12                     Equal name, etc           188  2021-11-21 19:05          11  Good thing, the house price is higher, come on ,,[facepalm ][facepalm ][facepalm ]
1854               kevin master worker             0  2021-11-21 21:17           0  Can't afford to marry Divorce is also inseparable
'''
Word cloud display
Segment the comment text with jieba ("stutter") word segmentation.
Finally, use stylecloud to draw an attractive word cloud:
from stylecloud import gen_stylecloud

# Word cloud display
# result: the segmented comment text joined by spaces; stop_words: the stop-word list
def visual_ciyun():
    pic = './img.jpg'
    gen_stylecloud(text=result,
                   icon_name='fas fa-feather-alt',
                   font_path='msyh.ttc',
                   background_color='white',
                   output_name=pic,
                   custom_stopwords=stop_words
                   )
    print('Word cloud drawing succeeded!')
If you are interested in word clouds, see the article "How to use python to implement an elegant word cloud? (super detailed)".
Word frequency display
The ten most frequent words in the article's comments are as follows:
from collections import Counter

def visual_cipin():
    # Word frequency: keep words longer than one character that are not stop words
    all_words = [word for word in result.split(' ') if len(word) > 1 and word not in stop_words]
    wordcount = Counter(all_words).most_common(10)
    # Split the (word, count) pairs into two parallel tuples
    x1_data, y1_data = list(zip(*wordcount))

'''
('marry', 'divorce', 'No', 'divorce rate', 'children', 'housing price', 'single', 'betrothal gifts', 'house', 'population')
(805, 211, 210, 113, 98, 98, 79, 73, 63, 63)
'''
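To see the counting step in isolation, here is a small self-contained sketch on made-up English text. top_words and the demo string are illustrative only; it uses split() rather than split(' ') so that repeated spaces do not produce empty tokens.

```python
from collections import Counter


def top_words(segmented_text, stop_words, n=10):
    """Return the n most common words, skipping one-character words and stop words."""
    words = [w for w in segmented_text.split()
             if len(w) > 1 and w not in stop_words]
    return Counter(words).most_common(n)


# Made-up, already segmented text
demo = "marry divorce marry house a the marry divorce"
pairs = top_words(demo, {"the"}, n=2)
print(pairs)  # [('marry', 3), ('divorce', 2)]

# Split the pairs into labels and counts for plotting
labels, counts = zip(*pairs)
print(labels, counts)
```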
Next, we visualize these word frequencies:
Histogram
Pie chart
Bubble chart
Most likes & most replies
We use the following function to find the comment with the most likes and the comment with the most replies:
def datas_anay():
    # Row(s) whose like count equals the maximum
    max_stars = rcv_data[rcv_data['Comment like'] == rcv_data['Comment like'].max()]
    ic(max_stars)
    # Row(s) whose reply count equals the maximum
    max_reply = rcv_data[rcv_data['Post reply'] == rcv_data['Post reply'].max()]
    ic(max_reply)

'''
          User name  Comment like      Comment time  Post reply              Comment content
0  Happy biscuit Zp          1615  2021-11-21 17:42         216  Don't say it's the epidemic

                              User name  Comment like      Comment time  Post reply  Comment content
27  You city people can really play 111           182  2021-11-21 17:58         285  Now it's more cost-effective to raise a daughter than anything. Small investment, little risk and more money. The bride price my three sisters received at that time was 5000/2000...
'''
Opinion of the user with the most likes
The user named Happy biscuit Zp received the most likes: 1970.
Don't say it's the epidemic
Let's take a look at the ranking of comments by number of likes:
Opinion of the user with the most replies
The user who received the most replies is named You city people can really play 111, with 285 replies. It seems that many people took issue with this comment.
Now it's more cost-effective to raise a daughter than anything. Small investment, little risk and more money. My three sisters received 5000 / 20000 / 20000 as betrothal gifts at that time. The bride price was used to buy household appliances, furniture and motorcycles. Asking for bride price meant the man paid and the woman chose the household appliances. Few people pocketed the bride price. Anyone known to have done so would be accused behind their back of selling their daughter. Wealthier brides' families would also add money of their own. In the current form of population trading, there are three sisters
Let's look at the ranking of comments by number of replies:
Comment time and likes
From the figure below, we can see that the comments receiving the most likes were mostly posted between 17:00 and 19:00.
Commenting in this window is more likely to earn likes.
Comment time and replies
To get more replies, comment on the article between 17:00 and 18:00.
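The time distributions above come down to grouping comments by the hour they were posted and summing likes and replies per hour. The original plotting code is not shown, so the following is only a sketch of one way to compute those totals. totals_by_hour is a hypothetical helper, and the three rows are taken from the sample output earlier in the article.

```python
from collections import defaultdict
from datetime import datetime


def totals_by_hour(rows):
    """Sum likes and replies per posting hour.

    rows: iterable of (time_string, likes, replies) with times like '2021-11-21 17:42'.
    """
    likes = defaultdict(int)
    replies = defaultdict(int)
    for time_string, n_likes, n_replies in rows:
        hour = datetime.strptime(time_string, "%Y-%m-%d %H:%M").hour
        likes[hour] += n_likes
        replies[hour] += n_replies
    return dict(likes), dict(replies)


rows = [
    ("2021-11-21 17:42", 1615, 216),
    ("2021-11-21 17:58", 182, 285),
    ("2021-11-21 19:05", 188, 11),
]
likes_by_hour, replies_by_hour = totals_by_hour(rows)
print(likes_by_hour)    # {17: 1797, 19: 188}
print(replies_by_hour)  # {17: 501, 19: 11}
```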
Sentiment analysis
We use the collected comments to analyze how Toutiao readers feel about this topic.
The library we use is SnowNLP
SnowNLP is a Python library for Chinese text processing. It supports Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, keyword extraction, and more.
SnowNLP's sentiment score ranges from 0 to 1. The larger the value, the more positive the sentiment.
from snownlp import SnowNLP

# Sentiment analysis
def anay_data():
    all_words = [word for word in result.split(' ') if len(word) > 1 and word not in stop_words]
    positive = negative = middle = 0
    for i in all_words:
        # Score each word once: > 0.7 positive, < 0.3 negative, otherwise neutral
        score = SnowNLP(i).sentiments
        if score > 0.7:
            positive += 1
        elif score < 0.3:
            negative += 1
        else:
            middle += 1
    print(positive, negative, middle)

'''
2499 919 7662
'''
From the figure, we can see that about 22% of the scored words are positive, 69% are neutral and only 8% are negative. Overall, commenters' attitudes seem fairly calm.
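The percentages follow directly from the three counts printed by the script. As a quick check, this sketch applies the same 0.7 / 0.3 thresholds and converts the counts into shares. bucket and shares are illustrative helper names, not part of the original code.

```python
def bucket(score, hi=0.7, lo=0.3):
    """Map a sentiment score in [0, 1] to a label using the thresholds above."""
    if score > hi:
        return "positive"
    if score < lo:
        return "negative"
    return "neutral"


def shares(counts):
    """Convert label counts into percentages rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts printed by the script: positive, negative, neutral
counts = {"positive": 2499, "negative": 919, "neutral": 7662}
print(shares(counts))  # {'positive': 22.6, 'negative': 8.3, 'neutral': 69.2}
```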
Sentiment analysis tree chart