The national marriage rate has declined for seven consecutive years. Why don't young people dare to get married?

While reading the headlines yesterday, I saw that the number of marriage registrations has fallen for seven consecutive years, hitting a 17-year low last year.

I was stunned.

A closer look shows that in 2020 there were 8.1433 million officially registered marriages, a decrease of 1.13 million from 2019.

This is the seventh consecutive annual decline since the figure reached 13.4693 million in 2013. The 2020 figure of 8.1433 million is also the lowest in the 17 years since 2003 (8.114 million, according to the official website of the National Bureau of Statistics).

Looking at the comments, we can see that everyone has their own view on the low marriage rate.

Today, let's use a crawler to collect this comment data and see what reasons there are beyond the ones we already know.

Requirement analysis

The data we want to obtain are the fields of each comment under the article:

User name

Comment content

Number of replies

Number of likes

Comment time, etc.

Web page analysis

First, we press F12 in the browser to open the developer tools, as follows:

Find the location of the comments as shown in the figure above, along with the real URL that the page requests.

Observe the URL pattern: count=20 means 20 comments per page, offset=0, 20, 40, ... controls paging, and the other parameters do not change.

https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=0&count=20&group_id=7032951744313164295&item_id=7032951744313164295
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=20&count=20&group_id=7032951744313164295&item_id=7032951744313164295
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=40&count=20&group_id=7032951744313164295&item_id=7032951744313164295
https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset=60&count=20&group_id=7032951744313164295&item_id=7032951744313164295

Based on this, we can construct the request URL for any page:

url = f'https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset={(page-1)*20}&count=20&group_id=7032951744313164295&item_id=7032951744313164295'
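The paging rule above can also be wrapped in a small helper. This is only a sketch: the names `build_comment_url`, `BASE_URL` and `GROUP_ID` are made up for illustration, and the parameter values are copied from the URLs listed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.toutiao.com/article/v2/tab_comments/"
GROUP_ID = "7032951744313164295"  # article id copied from the URLs above


def build_comment_url(page, count=20):
    """Return the comment URL for a 1-based page number."""
    params = {
        "aid": 24,
        "app_name": "toutiao_web",
        "offset": (page - 1) * count,  # 0, 20, 40, ...
        "count": count,
        "group_id": GROUP_ID,
        "item_id": GROUP_ID,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_comment_url(1))
print(build_comment_url(3))
```

Page 1 reproduces the first URL in the list, and page 3 gives offset=40.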

Send request

We first use the URL we just found to request a single page:

import requests

page = 1  #  First page
url = f'https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset={(page-1)*20}&count=20&group_id=7032951744313164295&item_id=7032951744313164295'
#  Replace the placeholders with the values from your own browser session
headers = {
    'cookie': 'xxxxxxxxxx',
    'referer': 'xxxxxxxxxx',
    'user-agent': 'xxxxxxxxxx'
}

resp = requests.get(url, headers=headers)

The result is as follows:

You can see that the response is JSON, and all the information we want is in the comment field of each item in the data list.

With the analysis done, the rest is very simple.

import time
from icecream import ic

#  The comments are in the 'data' list of the JSON response
json_data = resp.json()['data']

for item in json_data:

    #  User name
    user = item['comment']['user_name']

    #  Comment content
    text = item['comment']['text']

    #  Number of replies
    reply = item['comment']['reply_count']

    #  Comment time (Unix timestamp converted to local time)
    times = item['comment']['create_time']
    rls_time = time.strftime('%Y-%m-%d %H:%M', time.localtime(times))

    #  Number of likes
    stars = item['comment']['digg_count']

    ic(user, stars, rls_time, reply, text)
    
    '''
    ic| user: 'Happy biscuit Zp'
    stars: 1741
    rls_time: '2021-11-21 17:42'
    reply: 239
    text: 'Don't say it's the epidemic'
ic| user: 'Tonglu night reading'
    stars: 253
    rls_time: '2021-11-21 17:47'
    reply: 43
    text: 'The marriage rate has declined for seven consecutive years. Where does the fertility rate come from without marriage'
ic| user: 'Lily Wang Zhihan'
    stars: 148
    rls_time: '2021-11-21 17:50'
    reply: 59
    text: '2020 I am one of the newlyweds who got married in[lovely]'
ic| user: 'Xiaotaozi's life video'
    stars: 206
    rls_time: '2021-11-21 17:52'
    reply: 43
    text: 'There are too many single men in rural areas. They can't afford to get married and there are no women. There are more than 30 single men in our village, but there is no unmarried woman. It's true'
ic| user: 'Tomato 1543353620246856'
    stars: 197
    rls_time: '2021-11-21 18:48'
    reply: 11
    text: ('How many people dare to get married now? In case of a scum, talking about divorce is trouble.\n'
           'My friend met a scum man and appealed. Finally, the man agreed to divorce. My friend gave me the fare from Guangdong to Sichuan. If you don't give it, you won't come.\n'
           'Here you are. Let's apply for divorce first. As a result, there was an epidemic in the middle and it dragged on. Missed it. Start over again. Then the man was in trouble and didn't come...\n'
           'The man hasn't been in charge since he got married and had children in 13 years. And hit my friend. Beat my friend away. My friend took care of the children himself. The man kept telling the child that his father was dead.')

    '''
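The field extraction above can be collected into one function so that later steps work with plain dictionaries. This is a sketch, not the author's code: `parse_comment` is a hypothetical helper, the `sample` record is made up, and falling back to defaults when a key is missing is an assumption on my part.

```python
import time


def parse_comment(item):
    """Flatten one entry of the `data` list into a plain dict."""
    comment = item.get("comment", {})
    created = comment.get("create_time")
    return {
        "user": comment.get("user_name", ""),
        "stars": comment.get("digg_count", 0),
        "reply": comment.get("reply_count", 0),
        # Unix timestamp -> local time string; empty if the field is missing
        "rls_time": (
            time.strftime("%Y-%m-%d %H:%M", time.localtime(created))
            if created is not None else ""
        ),
        "text": comment.get("text", ""),
    }


# Made-up record that mimics the structure described above.
sample = {"comment": {"user_name": "demo", "digg_count": 3,
                      "reply_count": 1, "create_time": 0, "text": "hello"}}
print(parse_comment(sample))
```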

Multi page acquisition

First, request 200 pages for testing (at count=20 per page, that is up to 4,000 comments):

for page in range(1, 200 + 1):
    url = f'https://www.toutiao.com/article/v2/tab_comments/?aid=24&app_name=toutiao_web&offset={(page-1)*20}&count=20&group_id=7032951744313164295&item_id=7032951744313164295'
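A fixed page count can overshoot the real number of comments, so one option is to stop as soon as a page comes back empty. The sketch below assumes a caller-supplied `fetch_page(page)` function (hypothetical) that returns the data list for one page. Here it is replaced by a fake fetcher so the example runs offline.

```python
import time


def fetch_all(fetch_page, max_pages=200, delay=0.0):
    """Collect comment items page by page until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # no more comments: stop early
            break
        items.extend(batch)
        if delay:
            time.sleep(delay)  # pause between requests
    return items


# Stand-in fetcher: 3 pages of 20 fake items, then nothing.
def fake_fetch(page):
    return [{"page": page, "n": i} for i in range(20)] if page <= 3 else []


print(len(fetch_all(fake_fetch)))
```

In the real crawler, `fetch_page` would send the request for that page with `requests.get` and return the `data` list from the JSON response.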

Data saving

Next, we use openpyxl to save the data in an Excel file.

More than 1,500 comments were saved in total.

import openpyxl as op

#  Create the workbook (called ws in this script)
ws = op.Workbook()
#  Create a worksheet at position 0 (called wb in this script)
wb = ws.create_sheet(index=0)

#  Create header
wb.cell(row=1, column=1, value='User name')
wb.cell(row=1, column=2, value='Comment like')
wb.cell(row=1, column=3, value='Comment time')
wb.cell(row=1, column=4, value='Post reply')
wb.cell(row=1, column=5, value='Comment content')

#  Save data
ws.save('Marriage rate.xlsx')
print('Data saved!')
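The snippet above only shows the header row. One way to write the comment rows is to build a list per comment in the same column order and append it to the sheet. This is a sketch under assumptions: `comment_rows` and `save_rows` are hypothetical helpers, the input dictionaries use the variable names from the extraction step as keys, and the sample record is made up.

```python
HEADER = ["User name", "Comment like", "Comment time",
          "Post reply", "Comment content"]


def comment_rows(comments):
    """Turn parsed comments into rows matching the header order."""
    return [
        [c["user"], c["stars"], c["rls_time"], c["reply"], c["text"]]
        for c in comments
    ]


def save_rows(rows, path="Marriage rate.xlsx"):
    """Write the header and all rows to an Excel file."""
    import openpyxl as op  # imported here so comment_rows works without it

    workbook = op.Workbook()
    sheet = workbook.active  # the default first worksheet
    sheet.append(HEADER)
    for row in rows:
        sheet.append(row)  # one comment per row
    workbook.save(path)


rows = comment_rows([
    {"user": "demo", "stars": 3, "rls_time": "2021-11-21 17:42",
     "reply": 1, "text": "hello"},  # made-up record
])
print(rows)
```

Calling `save_rows(rows)` would then produce a file with the same five columns that the preprocessing step reads.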

Data preprocessing

We first use pandas to read the Excel file, then remove duplicate records and missing values.

import pandas as pd

#  Read data
rcv_data = pd.read_excel('./Marriage rate.xlsx')

#  Delete duplicate records
rcv_data = rcv_data.drop_duplicates()
#  Delete missing values
rcv_data = rcv_data.dropna()

#  Sample and display 5 pieces of data
print(rcv_data.sample(5))

'''
                   User name  Comment like              Comment time  Post reply                                               Comment content
943     User 4947984566248     1  2021-11-21 17:50     0                                    Isn't it good to live together? What do you get married?
635          Chengdu gentleman Xichen Tianjie     1  2021-11-21 19:00     0                                              That's normal
1594  Black Dwarf ReFuelYourlife     0  2021-11-21 23:15     0  The Internet is both a good thing and a bad thing. In the 1980s and 1990s, not so many people compared with each other, because they didn't know a lot of data...
12                  Equal name, etc   188  2021-11-21 19:05    11                       Good thing, the house price is higher, come on ,,[facepalm ][facepalm ][facepalm ]
1854            kevin master worker     0  2021-11-21 21:17     0                                       Can't afford to marry Divorce is also inseparable
'''

Word cloud display

First, segment the comment text into words with jieba (the "stuttering" word segmenter).
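The segmentation step itself is not shown in the article. A minimal sketch, assuming the `result` variable used by the later functions is one space-separated string of tokens, could look like this. `segment_comments` is a hypothetical helper, and the demonstration swaps jieba for a whitespace tokenizer so it runs without the library.

```python
def segment_comments(comments, cut=None):
    """Join all comments into one space-separated string of tokens."""
    if cut is None:
        import jieba  # third-party; only needed for real Chinese text
        cut = jieba.lcut  # returns a list of words for one string
    words = []
    for text in comments:
        words.extend(w.strip() for w in cut(text) if w.strip())
    return " ".join(words)


# Offline demonstration with a whitespace tokenizer instead of jieba.
result = segment_comments(["marry late", "house price"], cut=str.split)
print(result)
```

With real data, the call would be `segment_comments(list_of_comment_texts)` so that jieba does the splitting.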

Finally, use stylecloud to draw a good-looking word cloud.

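from stylecloud import gen_stylecloud

#  Word cloud display
#  result: the segmented comment text; stop_words: the stop-word list
def visual_ciyun():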
    pic = './img.jpg'
    gen_stylecloud(text=result,
                   icon_name='fas fa-feather-alt',
                   font_path='msyh.ttc',
                   background_color='white',
                   output_name=pic,
                   custom_stopwords=stop_words
                   )
    print('Word cloud drawing succeeded!')

If you are interested in word clouds, you can refer to: How to use Python to implement an elegant word cloud? (super detailed)

Word frequency display

The top ten words with the highest frequency of article comments are as follows:

from collections import Counter

def visual_cipin():
    #  Word frequency setting
    all_words = [word for word in result.split(' ') if len(word) > 1 and word not in stop_words]
    wordcount = Counter(all_words).most_common(10)

    x1_data, y1_data = list(zip(*wordcount))
  
'''
('marry', 'divorce', 'No', 'divorce rate', 'children', 'housing price', 'single', 'betrothal gifts', 'house', 'population')
(805, 211, 210, 113, 98, 98, 79, 73, 63, 63)
'''

Next, we visualize these word frequencies in several chart types:

Bar chart

Pie chart

Bubble Diagram

Most likes & most replies

We use the following functions to find the comments with the most likes and the comments with the most replies

def datas_anay():
    max_stars = rcv_data[rcv_data['Comment like'] == rcv_data['Comment like'].max()]
    ic(max_stars)

    max_reply = rcv_data[rcv_data['Post reply'] == rcv_data['Post reply'].max()]
    ic(max_reply)
  
'''
     User name  Comment like              Comment time  Post reply       Comment content
               0  Happy biscuit Zp  1615  2021-11-21 17:42   216  Don't say it's the epidemic
     
     User name  Comment like              Comment time  Post reply                                               Comment content
               27  You city people can really play 111   182  2021-11-21 17:58   285  Now it's more cost-effective to raise a daughter than anything. Small investment, little risk and more money. The bride price my three sisters received at that time was 5000/2000...
'''

Opinion of the commenter with the most likes

The user named Happy biscuit Zp got the most likes, with 1615 likes in the data above.

Don't say it's the epidemic

Let's take a look at the ranking of comments by likes:

Opinion of the commenter with the most replies

The user who received the most replies is named You city people can really play 111, whose comment got 285 replies. It seems that many readers took issue with this comment.

Now it's more cost-effective to raise a daughter than anything. Small investment, little risk and more money. My three sisters received 5000 / 20000 / 20000 as betrothal gifts at that time. The bride price is used to buy household appliances, furniture and motorcycles. To ask for bride price money is for the man to pay and the woman to choose household appliances. Few people embezzle the bride price. It is known that corruption will say behind their back that they sell their daughter. The rich woman will also give money. In the current form of population trading, there are three sisters

Let's look at the ranking of user comments and replies:
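The two rankings boil down to sorting the comments by one column and keeping the top few. Below is a sketch in plain Python with a hypothetical `top_n` helper. The records are copied from the sample output of the extraction step.

```python
def top_n(comments, key, n=3):
    """Return the n comments with the largest value under `key`."""
    return sorted(comments, key=lambda c: c[key], reverse=True)[:n]


# A few records copied from the sample output of the extraction step.
comments = [
    {"user": "Happy biscuit Zp", "stars": 1741, "reply": 239},
    {"user": "Tonglu night reading", "stars": 253, "reply": 43},
    {"user": "Lily Wang Zhihan", "stars": 148, "reply": 59},
    {"user": "Tomato 1543353620246856", "stars": 197, "reply": 11},
]
print([c["user"] for c in top_n(comments, "stars", 2)])
print([c["user"] for c in top_n(comments, "reply", 2)])
```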

Comment like time

From the figure below, we can see that most likes go to comments posted between 17:00 and 19:00.

Comments posted in this window tend to collect more likes.

Comment reply time

To get more replies, comment on the article between 17:00 and 18:00.
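The distribution by time of day can be computed by grouping on the hour part of the comment time. This is a sketch with a hypothetical `likes_by_hour` helper, assuming the time strings use the `%Y-%m-%d %H:%M` format produced earlier. The three records are taken from the sample output of the extraction step.

```python
from collections import defaultdict


def likes_by_hour(records):
    """Sum likes per posting hour from (time_string, likes) pairs."""
    totals = defaultdict(int)
    for time_str, likes in records:
        hour = int(time_str[11:13])  # "YYYY-MM-DD HH:MM" -> HH
        totals[hour] += likes
    return dict(sorted(totals.items()))


records = [
    ("2021-11-21 17:42", 1741),
    ("2021-11-21 17:47", 253),
    ("2021-11-21 18:48", 197),
]
print(likes_by_hour(records))
```

Passing reply counts instead of likes gives the reply distribution in the same way.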

Sentiment analysis

We take the comments collected above as an example to analyze the audience's views on this topic.

The library we use is SnowNLP.

SnowNLP is a Python library for processing Chinese text, which can carry out Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, keyword extraction, etc.

The sentiment score of SnowNLP ranges from 0 to 1. The larger the value, the more positive the sentiment.

from snownlp import SnowNLP

#  Sentiment analysis
def anay_data():
    all_words = [word for word in result.split(' ') if len(word) > 1 and word not in stop_words]
    positive = negative = middle = 0
    for i in all_words:
        pingfen = SnowNLP(i)
        #  Score above 0.7: positive; below 0.3: negative; otherwise neutral
        if pingfen.sentiments > 0.7:
            positive += 1
        elif pingfen.sentiments < 0.3:
            negative += 1
        else:
            middle += 1
    print(positive, negative, middle)
  
'''
2499 919 7662
'''

From the figure, we can see that about 22% of the comment text scores as positive, 69% as neutral and only 8% as negative. Overall, the attitude seems quite calm.
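These percentages follow directly from the three counts printed above. Here is a short check with a hypothetical `sentiment_shares` helper that rounds to one decimal place:

```python
def sentiment_shares(positive, negative, neutral):
    """Convert the three counts into percentages of the total."""
    total = positive + negative + neutral
    return {
        "positive": round(100 * positive / total, 1),
        "negative": round(100 * negative / total, 1),
        "neutral": round(100 * neutral / total, 1),
    }


# Counts printed by the sentiment analysis function: 2499 919 7662
shares = sentiment_shares(2499, 919, 7662)
print(shares)
```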

Sentiment analysis tree chart

Tags: Python Back-end

Posted on Mon, 29 Nov 2021 02:06:17 -0500 by hbalagh