PaddleHub in practice: a Chinese micro (fine-grained) emotion analysis system based on OCEMOTION

1, Project results showcase

Video link: https://www.bilibili.com/video/BV1944y1C7FQ/

2, Project introduction

2.1 project introduction:

This project uses PaddleHub to fine-tune the pre-trained model ERNIE Tiny on OCEMOTION, a Chinese 7-class emotion dataset, to build a 7-class sentiment analysis model, and then develops a Chinese micro emotion analysis system with PyQt5 that supports fine-grained emotion prediction for both single texts and batches of text. Fine-grained sentiment analysis is both a cutting-edge research topic and widely applicable in practice. The full-process tutorial walks you through developing a complete text classification project from start to finish!

2.2 project highlights:

a. Unlike traditional binary sentiment classification (positive vs. negative), this project uses the 7-class dataset OCEMOTION, which enables more fine-grained emotion analysis and therefore better captures the emotions expressed in user comments; this has both research value and broad practical applications.

b. The sentiment analysis model is built with PaddleHub by fine-tuning the pre-trained model ERNIE Tiny. Pre-trained models (PTMs) trained on large-scale unlabeled corpora learn general language representations; fine-tuning them on downstream tasks usually outperforms traditional classifiers such as LSTMs, which is why they have become the mainstream choice in competitions and projects. Fine-tuning a pre-trained model also avoids training a model from scratch.

c. A full-process, hands-on tutorial aimed at beginners, with detailed explanations of every step, that walks you through a complete text classification project! The project is highly extensible; if you are interested, you can optimize it further or migrate it to similar text classification tasks. If there is enough interest, an advanced tutorial may follow!

2.3 research significance of emotion analysis:

Review sites, forums, blogs and social media provide a huge amount of opinionated text. This text data is unstructured, not organized in any predefined way, and enormous in volume, so analyzing, understanding and classifying it manually is difficult, time-consuming and expensive. With a sentiment analysis system, this unstructured information can be turned into large-scale structured data through an automated pipeline in an effective and low-cost way, greatly reducing manual labeling cost and improving efficiency. Sentiment analysis has important application value in business analysis scenarios such as public opinion monitoring, topic supervision and word-of-mouth analysis, and the technology is already widely used: Sina Weibo uses sentiment analysis to mine network-wide data and build a public opinion big data platform; e-commerce platforms mine product reviews with sentiment analysis as part of their recommendation systems to improve marketing; chatbots identify the user's emotion in a conversation to select replies that better match it. In the near future, sentiment analysis will become an indispensable tool for modern companies. However, current sentiment analysis is still mostly limited to a few simple classes, mainly binary classification, and such coarse classification cannot capture the subtle (micro) emotions contained in text or fully meet real needs. Research on fine-grained sentiment analysis therefore has both cutting-edge and wider application value.

2.4 operating environment requirements:

Note that training the model requires a GPU environment. Click Fork and select a GPU environment to run!

The core code of the visual interface is in the "Chinese micro emotion analysis system" folder under the work directory. Select the folder and click Download to get the whole folder locally, then follow the "environment configuration guide and instructions" provided in the folder. It can also run locally in a CPU-only environment.

Github project address: https://github.com/hchhtc123/Emotion-analysis-system

2.5 general technical route of the project

a. The Chinese 7-class emotion dataset OCEMOTION is cleaned, and training, validation and test sets are split per class in an 8:1:1 ratio.

b. Based on PaddleHub, the 7-class Chinese micro emotion analysis model is trained and optimized by fine-tuning a pre-trained model.

c. A visual interface is developed with PyQt5 that supports emotion classification for both single texts and batches of text; finally the system is packaged with PyInstaller for demonstration.

3, OCEMOTION dataset

OCEMOTION is a fine-grained sentiment analysis dataset with 7 emotion categories: sadness, happiness, disgust, anger, like, surprise and fear, which makes it suitable for building a fine-grained sentiment analysis model. Each line has the format: id sentence label, separated by '\t'.

Dataset reference description:

Minglei Li, Yunfei Long, Qin Lu, and Wenjie Li. "Emotion Corpus Construction Based on Selection from Hashtags." In Proceedings of International Conference on Language Resources and Evaluation (LREC). Portorož, Slovenia, 2016

Paper link: https://www.aclweb.org/anthology/L16-1291.pdf

The OCEMOTION dataset has been uploaded to AI Studio; you can add it to your project by searching the datasets for "OCEMOTION - Chinese 7-class fine-grained emotion analysis dataset".

The OCEMOTION data used here come from the training data provided by the NLP Global Artificial Intelligence Technology Innovation Competition [Warm-up Competition II] that the author took part in. Competition address: https://Tianchi.aliyun.com/competition/entry/531865/information . Those who are interested can also look into the competition; it is a classic multi-task problem and well suited for learning!

Extra: some good dataset collections and dataset-search websites from my personal bookmarks:

Graviti Open Datasets: https://gas.graviti.cn/open-datasets

Tianchi dataset: https://tianchi.aliyun.com/dataset

Papers With Code dataset: https://www.paperswithcode.com/datasets

AI Studio dataset: https://aistudio.baidu.com/aistudio/datasetoverview

3.1 unzip and view the dataset

# Decompress dataset
%cd /home/aistudio/data/data100731/
!unzip shuju.zip
/home/aistudio/data/data100731
Archive:  shuju.zip
  inflating: OCEMOTION.csv           
# Reading datasets using pandas
import pandas as pd
data = pd.read_csv('OCEMOTION.csv', sep='\t',header=None)
# Because the dataset has no column names, you need to add column names to it for better processing
data.columns = ["id", "text_a", "label"]
# View the first 5 items of data
data.head()
index | id | text_a | label
0 | 0 | Do you know what's near Toronto? Ha ha, there's a rag. It's really written in the book. Listen... Your rag is the most | sadness
1 | 1 | Christmas Eve and Christmas have passed. I'm very sad. I quarreled with my mother for two days and ended the war by force of death. Now I'm still in the cold war. | sadness
2 | 2 | I'm just selfish and do what I want to do! | sadness
3 | 3 | What moved me was not only the sunny day after the rain, but also the charming eyes with tears flowing down. | happiness
4 | 4 | good days | happiness
# Viewing the data file information, you can see that there are 35315 pieces of data in total
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35315 entries, 0 to 35314
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      35315 non-null  int64 
 1   text_a  35315 non-null  object
 2   label   35315 non-null  object
dtypes: int64(1), object(2)
memory usage: 827.8+ KB
# The length information of comment text is counted. It can be seen from the average length that it belongs to short text
data['text_a'].map(len).describe()
count    35315.000000
mean        48.214328
std         84.391942
min          3.000000
25%         18.000000
50%         34.000000
75%         67.000000
max      12326.000000
Name: text_a, dtype: float64
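
# Quick check (illustrative): what fraction of texts would be truncated at 128 characters?
# Useful as a rough guide when choosing max_seq_len for the tokenizer later on.
print((data['text_a'].map(len) > 128).mean())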
# Distribution of 7 emotion category labels in the statistical data set
data['label'].value_counts()
sadness      12475
happiness     8894
disgust       4347
anger         4068
like          4042
surprise       899
fear           590
Name: label, dtype: int64
# Visual label distribution
%matplotlib inline
data['label'].value_counts(normalize=True).plot(kind='bar');

(Figure: bar chart of the normalized label distribution)

3.2 data cleaning

# Import required packages
import re
import os
import shutil
from tqdm import tqdm
from collections import defaultdict

# Define data cleaning function:

# Cleaning separator characters
def clean_duplication(text):
    left_square_brackets_pat = re.compile(r'\[+')
    right_square_brackets_pat = re.compile(r'\]+')
    # Collapse runs of repeated punctuation (full-width Chinese and half-width marks)
    punct = ['，', '。', '！', '？', ',', '\\.', '\\!', '\\?']

    def replace(string, char):
        pattern = char + '{2,}'
        if char.startswith('\\'):
            char = char[1:]
        string = re.sub(pattern, char, string)
        return string

    text = left_square_brackets_pat.sub('', text)
    text = right_square_brackets_pat.sub('', text)
    for p in punct:
        text = replace(text, p)
    return text

# Replace emoji/emoticon tokens in the text with Chinese words via the mapping dict
def emoji2zh(text, inverse_emoji_dict):
    for emoji, ch in inverse_emoji_dict.items():
        text = text.replace(emoji, ch)
    return text
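
# Quick sanity check of the two helpers above (illustrative only; the tiny mapping
# below is made up, the real emoji-to-Chinese mapping is loaded from emoji2zh.json later)
toy_mapping = {':)': 'smile'}              # hypothetical entry, for demonstration only
sample_text = '[[Great!!!! :)]]'
print(clean_duplication(emoji2zh(sample_text, toy_mapping)))   # -> Great! smile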

# Clean the special expressions in the dataset, and replace the expressions with Chinese through the mapping of json files
def clean_emotion(data_path, emoji2zh_data, save_dir, train=True):
    data = defaultdict(list)
    filename = os.path.basename(data_path)
    with open(data_path, 'r', encoding='utf8') as f:
        texts = f.readlines()
        for line in tqdm(texts, desc=data_path):
            if train:
                id_, text, label = line.strip().split('\t')
            else:
                id_, text = line.strip().split('\t')
            data['id'].append(id_)
            text = emoji2zh(text, emoji2zh_data)
            text = clean_duplication(text)
            data['text_a'].append(text)
            if train:
                data['label'].append(label)
    df = pd.DataFrame(data)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    df.to_csv(os.path.join(save_dir, filename), index=False,
              encoding='utf8', header=False, sep='\t')
    return df
# Read the expression mapping json file (placed in the work directory and named emoji2zh.json) to replace the expression with Chinese characters
import json
emoji2zh_data = json.load(open('/home/aistudio/work/emoji2zh.json', 'r', encoding='utf8'))
# Data cleaning of data
data = clean_emotion('/home/aistudio/data/data100731/OCEMOTION.csv',emoji2zh_data,'./')
/home/aistudio/data/data100731/OCEMOTION.csv: 100%|██████████| 35694/35694 [00:04<00:00, 7890.77it/s]
# Remove the useless id column and save it as text_a,label
data = data[['text_a', 'label']]

3.3 converting emotion category labels

Since the original class names are in English, they are converted here to the labels used by the Chinese emotion analysis system.

# Replace the labels in the dataset: {'sadness': 'Sad', 'happiness': 'cheerful', 'like': 'like', 'anger': 'anger', 'fear': 'fear', 'surprise': 'surprised', 'disgust': 'hate'}
data.loc[data['label']=='sadness', 'label'] = 'Sad'
data.loc[data['label']=='happiness', 'label'] = 'cheerful'
data.loc[data['label']=='like', 'label'] = 'like'
data.loc[data['label']=='anger', 'label'] = 'anger'
data.loc[data['label']=='fear', 'label'] = 'fear'
data.loc[data['label']=='surprise', 'label'] = 'surprised'
data.loc[data['label']=='disgust', 'label'] = 'hate'
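
# The seven .loc assignments above can also be written as a single call (illustrative
# equivalent, shown commented out since the labels have already been converted):
# label_zh = {'sadness': 'Sad', 'happiness': 'cheerful', 'like': 'like', 'anger': 'anger',
#             'fear': 'fear', 'surprise': 'surprised', 'disgust': 'hate'}
# data['label'] = data['label'].replace(label_zh)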

3.4 manually splitting the training, validation and test sets

Why split into training, validation and test sets:

a) The training set directly participates in fitting the model parameters, so it obviously cannot be used to measure the model's real ability (think of a student who memorizes the textbook by rote and scores perfectly on it; this is what overfitting looks like).

b) The validation set participates in manual (hyper)parameter tuning, so it cannot be used for the final judgment of a model either (a student who has drilled the question bank cannot simply be declared a good student).

c) Therefore a final exam, i.e. the test set, is needed to assess the model's real ability.

Two common dataset splitting methods are provided below; choose according to your needs or results:

# ### Splitting method 1: split directly into training, validation and test sets by proportion
# from sklearn.model_selection import train_test_split
# train_data, test_data = train_test_split(data, test_size=0.2)
# train_data, valid_data = train_test_split(train_data, test_size=0.2)

# # Randomly shuffle the data
# from sklearn.utils import shuffle
# train_data = shuffle(train_data)
# valid_data = shuffle(valid_data)
# test_data = shuffle(test_data)

# # Save the split dataset files
# train_data.to_csv('./train.csv', index=False, sep="\t")  # Training set
# valid_data.to_csv('./valid.csv', index=False, sep="\t")  # Validation set
# test_data.to_csv('./test.csv', index=False, sep="\t")    # Test set

# print('Training set length:', len(train_data), 'Validation set length:', len(valid_data), 'Test set length:', len(test_data))

# ### Splitting method 2: split the training, validation and test sets 8:1:1 within each class, so that the label distribution stays as balanced as possible

from sklearn.utils import shuffle
train = pd.DataFrame()  # Training set
valid = pd.DataFrame()  # Validation set
test = pd.DataFrame()  # Test set

tags = data['label'].unique().tolist()  # Extract in equal proportion according to the label

# For each class in the dataset, split into training, validation and test sets 8:1:1, shuffle randomly, and save
for tag in tags:
    # Randomly sample 20% of this class as the validation + test pool
    target = data[(data['label'] == tag)]
    sample = target.sample(int(0.2 * len(target)))
    sample_index = sample.index
    # The remaining 80% of this class is used as the training set
    all_index = target.index
    residue_index = all_index.difference(sample_index)  # Data remaining after sample removal
    residue = target.loc[residue_index]
    # Split the sampled 20% equally into test and validation sets
    test_sample = sample.sample(int(0.5 * len(sample)))
    test_sample_index = test_sample.index
    valid_sample_index = sample_index.difference(test_sample_index)
    valid_sample = sample.loc[valid_sample_index]
    # Append this class's portion to each split
    test = pd.concat([test, test_sample], ignore_index=True)
    valid = pd.concat([valid, valid_sample], ignore_index=True)
    train = pd.concat([train, residue], ignore_index=True)
    # Shuffle the accumulated data
    train = shuffle(train)
    valid = shuffle(valid)
    test = shuffle(test)

# Save as tab delimited text
train.to_csv('train.csv', sep='\t', index=False)  # Training set
valid.to_csv('valid.csv', sep='\t', index=False)  # Validation set
test.to_csv('test.csv', sep='\t', index=False)    # Test set

print('Training set length:', len(train), 'Validation set length:', len(valid), 'Test set length', len(test))
Training set length: 28558 Validation set length: 3570 Test set length 3566
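
# Alternative (illustrative, not run here): the same per-class 8:1:1 split can be done
# more concisely with scikit-learn's stratify option:
# from sklearn.model_selection import train_test_split
# train_alt, rest = train_test_split(data, test_size=0.2, stratify=data['label'], random_state=42)
# valid_alt, test_alt = train_test_split(rest, test_size=0.5, stratify=rest['label'], random_state=42)
# print(len(train_alt), len(valid_alt), len(test_alt))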

4, Constructing micro emotion analysis model based on PaddleHub

About PaddleHub:

PaddleHub is a pre-trained model management and transfer learning tool for PaddlePaddle. With PaddleHub, developers can combine high-quality pre-trained models with the Fine-tune API to quickly complete the whole workflow from transfer learning to application deployment. It provides high-quality pre-trained models from the PaddlePaddle ecosystem, covering mainstream models for image classification, object detection, lexical analysis, semantic models, sentiment analysis, video classification, image generation, image segmentation, text review, keypoint detection and more.

For more model details, please check the official website: https://www.paddlepaddle.org.cn/hub

If you run into problems while using PaddleHub, you can open an issue at: https://github.com/PaddlePaddle/PaddleHub/issues

Based on the pre training model, PaddleHub supports the following functions:

1. Models as software: fast prediction via the Python API or command line, making the PaddlePaddle model library more convenient to use.

2. Transfer learning: with the Fine-tune API, users can complete deep transfer learning for natural language processing and computer vision scenarios with only a small amount of code.

3. Serving deployment: a simple command line is enough to build an API service for your own model.

4. Hyperparameter optimization: automatically search for the best hyperparameters to get better model results.

4.1 environment preparation

# Download the latest version of paddlehub
!pip install -U paddlehub -i https://pypi.tuna.tsinghua.edu.cn/simple
# Import paddlehub and paddle packages
import paddlehub as hub
import paddle

4.2 loading the pre-trained model ERNIE Tiny

ERNIE Tiny compresses the ERNIE 2.0 Base model, mainly through model structure compression and model distillation. Its characteristics and advantages are as follows:

a. A 3-layer Transformer structure is adopted, which by itself brings a 4x linear speedup;

b. The model's hidden size is widened from 768 in ERNIE 2.0 to 1024;

c. The input sequence length is shortened to reduce computational complexity: coarser-grained Chinese (subword-level) input is adopted for the first time, shortening the average length by about 40%;

d. During training, ERNIE Tiny plays the student role and uses model distillation to learn the distributions and outputs of the corresponding Transformer layers and prediction layer of the teacher model ERNIE 2.0;

Taken together, these optimizations bring roughly a 4.3x prediction speedup and make the model much more practical for industrial deployment.

# Set up 7 emotion categories requiring classification
label_list=list(data.label.unique())
print(label_list)

label_map = { 
    idx: label_text for idx, label_text in enumerate(label_list)
}
print(label_map)
['Sad', 'cheerful', 'like', 'anger', 'fear', 'surprised', 'hate']
{0: 'Sad', 1: 'cheerful', 2: 'like', 3: 'anger', 4: 'fear', 5: 'surprised', 6: 'hate'}
# Just specify the model name and the number of classes to define the fine-tune network: a fully connected classification layer is appended after the pre-trained model
# Here the ernie_tiny pre-trained model is selected and the fine-tuning task is set to a 7-class classification task
model = hub.Module(name="ernie_tiny", task='seq-cls', num_classes=7, label_map=label_map)
Download https://bj.bcebos.com/paddlehub/paddlehub_dev/ernie_tiny_2.0.2.tar.gz
[##################################################] 100.00%
Decompress /home/aistudio/.paddlehub/tmp/tmpyvupawg3/ernie_tiny_2.0.2.tar.gz
[##################################################] 100.00%


[2021-08-04 23:21:25,328] [    INFO] - Successfully installed ernie_tiny-2.0.2
[2021-08-04 23:21:25,332] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-tiny
[2021-08-04 23:21:25,335] [    INFO] - Downloading ernie_tiny.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams
100%|██████████| 354158/354158 [00:04<00:00, 71445.77it/s]
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))

The parameter usage of hub.Module is as follows:

  • name: model name. Options include ernie, ernie_tiny, bert-base-cased, bert-base-chinese, roberta-wwm-ext, roberta-wwm-ext-large, etc.
  • task: the fine-tune task. Here it is seq-cls, indicating a text (sequence) classification task.
  • num_classes: the number of classes for the current text classification task, determined by the dataset used (default 2); set it according to your specific task.

PaddleHub Model Search

PaddleHub also provides models such as BERT to choose from. The loading examples corresponding to the models currently supporting text classification tasks are as follows:

Model name                        | PaddleHub Module
ERNIE, Chinese                    | hub.Module(name='ernie')
ERNIE Tiny, Chinese               | hub.Module(name='ernie_tiny')
ERNIE 2.0 Base, English           | hub.Module(name='ernie_v2_eng_base')
ERNIE 2.0 Large, English          | hub.Module(name='ernie_v2_eng_large')
BERT-Base, English Cased          | hub.Module(name='bert-base-cased')
BERT-Base, English Uncased        | hub.Module(name='bert-base-uncased')
BERT-Large, English Cased         | hub.Module(name='bert-large-cased')
BERT-Large, English Uncased       | hub.Module(name='bert-large-uncased')
BERT-Base, Multilingual Cased     | hub.Module(name='bert-base-multilingual-cased')
BERT-Base, Multilingual Uncased   | hub.Module(name='bert-base-multilingual-uncased')
BERT-Base, Chinese                | hub.Module(name='bert-base-chinese')
BERT-wwm, Chinese                 | hub.Module(name='chinese-bert-wwm')
BERT-wwm-ext, Chinese             | hub.Module(name='chinese-bert-wwm-ext')
RoBERTa-wwm-ext, Chinese          | hub.Module(name='roberta-wwm-ext')
RoBERTa-wwm-ext-large, Chinese    | hub.Module(name='roberta-wwm-ext-large')
RBT3, Chinese                     | hub.Module(name='rbt3')
RBTL3, Chinese                    | hub.Module(name='rbtl3')
ELECTRA-Small, English            | hub.Module(name='electra-small')
ELECTRA-Base, English             | hub.Module(name='electra-base')
ELECTRA-Large, English            | hub.Module(name='electra-large')
ELECTRA-Base, Chinese             | hub.Module(name='chinese-electra-base')
ELECTRA-Small, Chinese            | hub.Module(name='chinese-electra-small')

With the single hub.Module call above, the model is initialized for the text classification task: a fully connected classification layer is appended after the ERNIE pre-trained model.
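
If you want to try one of the other Chinese models from the table, only the module name changes and the rest of the training code stays the same. For example (a sketch, not run in this project):

# model = hub.Module(name='roberta-wwm-ext', task='seq-cls', num_classes=7, label_map=label_map)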

4.3 loading and processing data

# Import dependent Libraries
import os, io, csv
from paddlehub.datasets.base_nlp_dataset import InputExample, TextClassificationDataset
# Data set storage location
DATA_DIR="/home/aistudio/data/data100731/"
# Process the data into a format acceptable to the model
class OCEMOTION(TextClassificationDataset):
    def __init__(self, tokenizer, mode='train', max_seq_len=128):
        if mode == 'train':
            data_file = 'train.csv'  # Training set
        elif mode == 'test':
            data_file = 'test.csv'   # Test set
        else:
            data_file = 'valid.csv'  # Validation set
        
        super(OCEMOTION, self).__init__(
            base_path=DATA_DIR,
            data_file=data_file,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            is_file_with_header=True,
            label_list=label_list
            )

    # Parsing samples in text files
    def _read_file(self, input_file, is_file_with_header: bool = False):
        if not os.path.exists(input_file):
            raise RuntimeError("The file {} is not found.".format(input_file))
        else:
            with io.open(input_file, "r", encoding="UTF-8") as f:
                reader = csv.reader(f, delimiter="\t")
                examples = []
                seq_id = 0
                header = next(reader) if is_file_with_header else None
                for line in reader:
                    try:
                        example = InputExample(guid=seq_id, text_a=line[0], label=line[1])
                        seq_id += 1
                        examples.append(example)
                    except:
                        continue
                return examples
                
train_dataset = OCEMOTION(model.get_tokenizer(), mode='train', max_seq_len=128)  # max_seq_len is chosen according to the actual text length, but must not exceed 512
dev_dataset = OCEMOTION(model.get_tokenizer(), mode='dev', max_seq_len=128)
test_dataset = OCEMOTION(model.get_tokenizer(), mode='test', max_seq_len=128)

# View the first 3 training sets
for e in train_dataset.examples[:3]:
    print(e)
# View the first 3 validation sets
for e in dev_dataset.examples[:3]:
    print(e)
# View the first 3 test sets
for e in test_dataset.examples[:3]:
    print(e)
[2021-08-04 23:21:44,835] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/vocab.txt
100%|██████████| 459/459 [00:00<00:00, 5903.16it/s]
[2021-08-04 23:21:45,047] [    INFO] - Downloading spm_cased_simp_sampled.model from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/spm_cased_simp_sampled.model
100%|██████████| 1083/1083 [00:00<00:00, 11921.14it/s]
[2021-08-04 23:21:45,199] [    INFO] - Downloading dict.wordseg.pickle from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/dict.wordseg.pickle
100%|██████████| 161822/161822 [00:02<00:00, 65316.99it/s]
[2021-08-04 23:22:00,079] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-08-04 23:22:00,083] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-08-04 23:22:00,087] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle
[2021-08-04 23:22:05,911] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-08-04 23:22:05,914] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-08-04 23:22:05,916] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle


text=Is there a kind of people in the world who are not willing to yield to others,Is there a kind of person who can't fight all his life,Is there a kind of people who would rather die than live? Is there a kind of people who are forced not to give up. It has been more than half a year,I'm tired.,Jin Xi,But it's not time to go to bed.	label=like
text=I like the flowers in Daguan county,Are incomparably beautiful	label=like
text=Can't find.	label=Sad
text=Xu Jieqi and dog day's literature class ppt	label=anger
text=-I want to take a nap,Ann.	label=hate
text=attend a meeting,I went up and did it without preparation presentation,Someone is sweating a lot for me. Fortunately, I work clearly at ordinary times,Clear the two items in three minutes,progress,problem,Action plan,Explain the dead line of time one by one,The house is full of Chinese and foreign friends nodding frequently,Someone and someone stared at me and smiled,Qi Qi secretly gave me a thumbs up. ha-ha,Speaking in English is my strong point. oh dear,That's great.	label=cheerful
text=You still need to write yourself a prescription,This is a necessary process based on the experience of everyone. 2. We need constant positive psychological hints. 3. We need to get a sense of satisfaction and affirmation from our daily English practice. 4. This is what I said at the top,There must be one thing that can give me strong positive stimulation 5 continuous exercise,To release psychological energy	label=Sad
text=Woo woo.I forgot my password.I can't get on.	label=Sad
text=Hee hee has refused to speak Tang poetry before going to bed for more than two months,Today, I took out Tang poetry and prepared to speak,As a result, she recited it herself,I said the title of the poem,She recites poetry,I recited fifty or sixty songs in one breath,It really surprised me!	label=cheerful

4.4 optimization strategy and runtime configuration

# The AdamW optimizer is used here
optimizer = paddle.optimizer.AdamW(learning_rate=4e-5, parameters=model.parameters())
# Runtime configuration
trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt', use_gpu=True, use_vdl=True)      # The Trainer drives the fine-tuning task
[2021-08-04 23:22:30,859] [ WARNING] - PaddleHub model checkpoint not found, start from scratch...

Runtime configuration:

Trainer controls the training of the fine-tune task and is the initiator of the task. Its main configurable parameters are:

  • model: the model to be optimized;
  • optimizer: the optimizer to use;
  • use_gpu: whether to train on the GPU;
  • use_vdl: whether to use VisualDL to visualize the training process;
  • checkpoint_dir: the directory where model parameters are saved;
  • compare_metrics: the metric used to select and save the best model.

4.5 model training and validation

Note that a GPU environment is required for model training. During training, you can check GPU memory usage in the "performance monitoring" panel below, or by running the nvidia-smi command in a terminal. If GPU memory is insufficient, reduce batch_size appropriately.

trainer.train(train_dataset, epochs=5, batch_size=256, eval_dataset=dev_dataset, save_interval=1)   # Configure the training parameters, start training, and specify the validation set.

trainer.train controls the training process. Its main configurable parameters are:

  • train_dataset: the dataset used for training;
  • epochs: the number of training epochs;
  • batch_size: the training batch size; if using a GPU, adjust batch_size according to the available memory;
  • num_workers: the number of data-loading workers, 0 by default;
  • eval_dataset: the validation set;
  • log_interval: the interval for printing logs, in number of training batches;
  • save_interval: the interval for saving the model, in number of training epochs.
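
For example, if GPU memory is limited, a variant of the training call above (illustrative, not the configuration used in this run) could reduce the batch size and log more frequently:

# trainer.train(train_dataset, epochs=3, batch_size=64, eval_dataset=dev_dataset, log_interval=10, save_interval=1)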

4.6 evaluating the current training model on the test set

# Evaluate the current training model on the test set
result = trainer.evaluate(test_dataset, batch_size=128) 
[2021-08-04 23:30:35,095] [    INFO] - Evaluation on validation dataset:
[2021-08-04 23:30:38,511] [    EVAL] - [Evaluation result] avg_acc=0.5939
# Advanced extension: evaluate the test-set performance more rigorously with the F1 score
import numpy as np
# Read test set file
df = pd.read_csv('./test.csv',sep = '\t')

news1 = pd.DataFrame(columns=['label'])
news1['label'] = df["label"]
news = pd.DataFrame(columns=['text_a'])
news['text_a'] = df["text_a"]

# First, convert the data read by pandas into array
data_array = np.array(news)
# Then it is converted to list form
data_list =data_array.tolist()

# Predict the test set to get the predicted category label
y_pre = model.predict(data_list, max_seq_len=128, batch_size=128, use_gpu=True)

# True category label for the test set
data_array1 = np.array(news1)
y_val =data_array1.tolist()

# Calculate F1 score of prediction results
from sklearn.metrics import precision_recall_fscore_support,f1_score,precision_score,recall_score
f1 = f1_score(y_val, y_pre, average='macro')
p = precision_score(y_val, y_pre, average='macro')
r = recall_score(y_val, y_pre, average='macro')
print(f1, p, r)
[2021-08-04 23:30:38,536] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-08-04 23:30:38,539] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-08-04 23:30:38,542] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:


0.48899184212664987 0.5591275726811291 0.4574688432467382
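
On top of the macro-averaged scores above, a per-class breakdown can be obtained with sklearn's classification_report. A minimal sketch, assuming y_val and y_pre from the cell above (y_val is flattened first because it was built from a single-column DataFrame):

from sklearn.metrics import classification_report
y_true = [y[0] for y in y_val]   # flatten the one-element lists
print(classification_report(y_true, y_pre, digits=4))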

ps: if you are interested, you can further improve the results on top of this baseline by tuning hyperparameters, choosing other pre-trained models, optimizing the network structure, or doing continued pre-training!

4.7 model prediction

# Data to be predicted
data = [
    # Sad
    ["You don't have to say sorry,But if we cherish each other"],
    # cheerful
    ["Happiness is actually very simple"],
    # fear
    ["Fear. fall ill"],
    # like
    ["When your long hair reaches your waist,We worked together. I will wait"]
]

# Define 7 categories for emotion classification
label_list=['Sad', 'cheerful', 'like', 'anger', 'fear', 'surprised', 'hate']
label_map = {
    idx: label_text for idx, label_text in enumerate(label_list)
}

# Load the trained model
model = hub.Module(
    name='ernie_tiny',
    task='seq-cls',
    num_classes=7,
    load_checkpoint='./ckpt/best_model/model.pdparams',
    label_map=label_map)

# Model prediction
results = model.predict(data, max_seq_len=128, batch_size=1, use_gpu=True)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text[0], results[idx]))
[2021-08-04 23:30:48,174] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-tiny/ernie_tiny.pdparams
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
[2021-08-04 23:30:54,070] [    INFO] - Loaded parameters from /home/aistudio/data/data100731/ckpt/best_model/model.pdparams
[2021-08-04 23:30:54,077] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt
[2021-08-04 23:30:54,080] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model
[2021-08-04 23:30:54,083] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:


Data: You don't have to say sorry,But if we cherish each other 	 Label: Sad
Data: Happiness is actually very simple 	 Label: cheerful
Data: Fear. fall ill 	 Label: fear
Data: When your long hair reaches your waist,We worked together. I will wait 	 Label: like

5, Complete visual interface demonstration based on PyQt5

ps: the core code of the PyQt5-based visual interface has been placed in work/Chinese micro emotion analysis system. After downloading the whole folder locally, follow the provided environment configuration guide and instructions!

# Move the best model just trained to the work directory to save it better!
!cp -r /home/aistudio/data/data100731/ckpt/best_model/model.pdparams "/home/aistudio/work/Chinese micro emotion analysis system/best_model/"

5.1 PyQt5 introduction:

PyQt5 is a Python binding for Digia's powerful Qt5 graphics framework. Programs built with it can run on multiple platforms, including Unix, Windows and macOS. The biggest convenience Qt brings is Qt Designer: it makes page layout easy. A layout that would require a lot of code in Tkinter can be done in Qt simply by dragging and dropping controls.

5.2 preparation: installing and configuring PyQt5 + PyCharm:

There are many detailed, illustrated online tutorials on installing and configuring PyQt5 with PyCharm; just follow one of them.

5.3 tutorial:

After installing and configuring PyQt5, use the Qt Designer tool provided by Qt to design the page UI file by dragging and dropping, then export it as a .py file, i.e. the page design code. Then bind the corresponding functions to interface elements such as buttons and input boxes to add the functionality of the whole system. (There are many online tutorials, so the details are not repeated here; the best way is to get hands-on directly. A minimal sketch is shown below.)
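
To make the button-binding idea concrete, here is a minimal, self-contained PyQt5 sketch (illustrative only; it is not the project's interface.py or gui.py). It shows a text box and a button whose click handler calls a placeholder predict function; in the real system that placeholder would call model.predict as in section 4.7.

import sys
from PyQt5.QtWidgets import (QApplication, QWidget, QVBoxLayout,
                             QTextEdit, QPushButton, QLabel)

def predict_emotion(text):
    # Placeholder for the real model call, e.g. model.predict([[text]], ...)
    return 'cheerful' if text.strip() else '(empty input)'

class DemoWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle('Emotion analysis demo')
        self.input = QTextEdit()
        self.result = QLabel('The result will appear here')
        button = QPushButton('Analyze')
        button.clicked.connect(self.on_click)   # bind the button click to a handler
        layout = QVBoxLayout(self)
        layout.addWidget(self.input)
        layout.addWidget(button)
        layout.addWidget(self.result)

    def on_click(self):
        self.result.setText('Emotion: ' + predict_emotion(self.input.toPlainText()))

if __name__ == '__main__':
    app = QApplication(sys.argv)
    window = DemoWindow()
    window.show()
    sys.exit(app.exec_())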

5.4 interface beautification:

For beautifying a Python graphical interface, learn to make full use of QSS (Qt style sheets) and images.

For the project's interface-related code, see interface.py (the interface design program) and gui.py (the main program) in the 'Chinese micro emotion analysis system' folder.

Visual interface demonstration:

  1. Single text emotion analysis page:

  2. Batch text emotion analysis page:

6, System packaging

6.1 application scenario:

After the system is developed, the whole Python program can be packaged with PyInstaller or similar tools, so that it can be used directly on various platforms without complicated environment setup. This makes it easier to hand over to non-technical or non-Python users, and it is also convenient for demos.

6.2 packaging tutorial:

There are many online tutorials on packaging with PyInstaller, so they are not repeated here; just make good use of a search engine (for example, a detailed tutorial on packaging an exe with PyInstaller).
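
For reference, a typical minimal invocation looks like the following (illustrative; gui.py is the main program from section 5.4, adjust the script name to your project):

pip install pyinstaller
pyinstaller -F -w gui.py    # -F: bundle into a single exe, -w: hide the console window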

Besides the commonly used PyInstaller, Windows users can also try packaging the program with QPT for a better experience; just follow the tutorials it provides.

QPT (Quick Packaging Tool) is a multi-functional packaging tool that can "virtualize" the development environment. It can package an ordinary Python script into an EXE executable with as little as one command, and can optionally add CUDA and no-AVX support to be compatible with as many user environments as possible.

7, Project summary

This project is a derivative sub-project, the "micro emotion analysis system", of the "public opinion analysis system" I am currently working on. I am open-sourcing this tutorial in the hope that it helps you get started on the complete development workflow of a text classification project. Full-workflow practical tutorials are still relatively scarce on AI Studio; I hope this project inspires or helps you in competitions or your own projects!

That's it for this project. If you like it, please fork, like and follow! ❤

8, Personal introduction

Huang Canhua, undergraduate (class of 2019) majoring in software engineering, School of Software, South China Normal University

Main direction: development. Currently focused on NLP and data mining competitions and projects.

Github address: https://github.com/hchhtc123

Nickname: Alchemist 233

Follow me on AI Studio for more projects to come: https://aistudio.baidu.com/aistudio/personalcenter/thirdview/330406

Tags: NLP paddlepaddle
