PaddleHub in Practice: A Chinese Micro Emotion Analysis System Based on OCEMOTION
1. Project Results Showcase
Video link: https://www.bilibili.com/video/BV1944y1C7FQ/
2. Project Introduction
2.1 Project overview:
This project uses PaddleHub to fine-tune the pre-trained ERNIE Tiny model on the Chinese 7-class emotion dataset OCEMOTION, building a 7-class emotion analysis model and, on top of it, a Chinese micro emotion analysis system developed with PyQt5 that supports fine-grained emotion prediction for both single texts and batches of texts. The task is both cutting-edge and widely applicable, and this end-to-end tutorial walks you through the development of a complete text classification project!
2.2 Project highlights:
a. Unlike traditional binary sentiment classification (positive vs. negative), this project uses the 7-class OCEMOTION dataset, enabling finer-grained emotion analysis that better captures the emotions expressed in user comments; this is both cutting-edge and widely applicable.
b. The emotion analysis model is built with PaddleHub by fine-tuning the pre-trained ERNIE Tiny model. Pre-trained models (PTMs) learn general language representations from large-scale unlabeled corpora; fine-tuning them on downstream tasks usually outperforms traditional classifiers such as LSTMs, which is why they have become the mainstream choice in competitions and real projects. Fine-tuning also avoids training a model from scratch.
c. A complete, beginner-friendly walkthrough with detailed explanations of every step, taking you through a full text classification project! The project is highly extensible: those interested can optimize it further or migrate it to similar text classification tasks. If enough people like it, advanced tutorials may follow!
2.3 Research significance of emotion analysis:
Review sites, forums, blogs and social media provide huge amounts of opinionated text. This data is unstructured, not organized in any predefined way, and its sheer volume makes it difficult, time-consuming and expensive to analyze, understand and classify manually. An emotion analysis system can turn this unstructured information into large-scale structured data through an automated pipeline, effectively and at low cost, greatly reducing manual labeling effort and improving efficiency. Emotion analysis is therefore highly valuable in business scenarios such as public opinion monitoring, topic supervision and word-of-mouth analysis, and the technology is already widely deployed: Sina Weibo uses it to mine network-wide data for its public opinion big-data platform; e-commerce platforms mine product reviews with it as part of recommendation systems to improve marketing; chatbots use it to recognize the user's emotion and pick replies that better match it. In the near future, emotion analysis will become an indispensable tool for modern companies. However, current emotion analysis is still mostly limited to simple, mainly binary, classification, which cannot capture the subtle emotions contained in text and cannot fully meet real needs. Research on fine-grained emotion analysis therefore has cutting-edge and broader application value.
2.4 Operating environment requirements:
Note that model training requires a GPU environment: click Fork and select a GPU environment to run!
The core code of the visual interface is in the "Chinese micro emotion analysis system" folder under the work directory. Select the folder and click Download to get the whole folder locally, then follow the "environment configuration guide and instructions" provided inside it. The interface itself can also run locally in a CPU environment.
Github project address: https://github.com/hchhtc123/Emotion-analysis-system
2.5 Overall technical route of the project
a. Clean the 7-class Chinese emotion dataset OCEMOTION and split it into training, validation and test sets in an 8:1:1 ratio, stratified by category.
b. Fine-tune a pre-trained model with PaddleHub to train and optimize the 7-class Chinese micro emotion analysis model.
c. Develop the visual interface with PyQt5, supporting emotion classification for both single texts and batches of texts, and finally package the system with PyInstaller for demonstration.
3. The OCEMOTION Dataset
OCEMOTION is a fine-grained emotion analysis dataset with 7 categories: sadness, happiness, disgust, anger, like, surprise and fear, making it suitable for building fine-grained emotion analysis models. The file format is: id, sentence, label, separated by '\t'.
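To make the raw layout concrete, here is a minimal sketch of how one line splits into its three fields; the sample line below is invented purely for illustration, while real rows follow the same id&lt;TAB&gt;sentence&lt;TAB&gt;label pattern.

```python
# Hypothetical sample line, used only to illustrate the id \t sentence \t label layout
sample_line = "0\t今天真开心\thappiness"
id_, sentence, label = sample_line.split("\t")
print(id_, sentence, label)
```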
Dataset reference description:
Minglei Li, Yunfei Long, Qin Lu, and Wenjie Li. "Emotion Corpus Construction Based on Selection from Hashtags." In Proceedings of International Conference on Language Resources and Evaluation (LREC). Portorož, Slovenia, 2016
Paper link: https://www.aclweb.org/anthology/L16-1291.pdf
The OCEMOTION dataset has been uploaded to AI Studio; search for the dataset "OCEMOTION - Chinese 7-class fine-grained emotion analysis dataset" and add it.
The OCEMOTION data itself comes from the NLP training data provided by the Global Artificial Intelligence Technology Innovation Competition [Warm-up Round II] that the author took part in; competition address: https://Tianchi.aliyun.com/competition/entry/531865/information . Those interested can also check out the competition itself: it is a classic multi-task problem and well worth learning from!
Extra: some good dataset search websites from my personal collection:
Graviti Open Datasets: https://gas.graviti.cn/open-datasets
Tianchi dataset: https://tianchi.aliyun.com/dataset
Papers With Code dataset: https://www.paperswithcode.com/datasets
AI Studio dataset: https://aistudio.baidu.com/aistudio/datasetoverview
3.1 Unzip and view the dataset
```python
# Decompress the dataset
%cd /home/aistudio/data/data100731/
!unzip shuju.zip
```
/home/aistudio/data/data100731 Archive: shuju.zip inflating: OCEMOTION.csv
```python
# Read the dataset using pandas
import pandas as pd

data = pd.read_csv('OCEMOTION.csv', sep='\t', header=None)
# The dataset has no column names, so add them for easier processing
data.columns = ["id", "text_a", "label"]
```
```python
# View the first 5 rows of data
data.head()
```
 | id | text_a | label |
---|---|---|---|
0 | 0 | 'Do you know what's near Toronto? Ha ha, there's a rag. It's really written in the book. Listen... Your rag is the most | sadness |
1 | 1 | Christmas Eve and Christmas have passed. I'm very sad. I quarreled with my mother for two days and ended the war by force of death. Now I'm still in the cold war. | sadness |
2 | 2 | I'm just selfish and do what I want to do! | sadness |
3 | 3 | What moved me was not only the sunny day after the rain, but also the charming eyes with tears flowing down. | happiness |
4 | 4 | good days | happiness |
```python
# View the data file info: there are 35315 rows in total
data.info()
```
<class 'pandas.core.frame.DataFrame'> RangeIndex: 35315 entries, 0 to 35314 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 35315 non-null int64 1 text_a 35315 non-null object 2 label 35315 non-null object dtypes: int64(1), object(2) memory usage: 827.8+ KB
```python
# Check the length of the comment texts; the average length shows these are short texts
data['text_a'].map(len).describe()
```
count 35315.000000 mean 48.214328 std 84.391942 min 3.000000 25% 18.000000 50% 34.000000 75% 67.000000 max 12326.000000 Name: text_a, dtype: float64
```python
# Distribution of the 7 emotion category labels in the dataset
data['label'].value_counts()
```
sadness 12475 happiness 8894 disgust 4347 anger 4068 like 4042 surprise 899 fear 590 Name: label, dtype: int64
```python
# Visualize the label distribution
%matplotlib inline
data['label'].value_counts(normalize=True).plot(kind='bar');
```
(Figure: bar chart of the normalized label distribution)
3.2 Data cleaning
```python
# Import required packages
import re
import os
import shutil
from tqdm import tqdm
from collections import defaultdict

# Define data cleaning functions:
# Clean repeated separator characters
def clean_duplication(text):
    left_square_brackets_pat = re.compile(r'\[+')
    right_square_brackets_pat = re.compile(r'\]+')
    punct = [',', '\\.', '\\!', ',', '. ', '!', ',', '\?', '?']

    def replace(string, char):
        pattern = char + '{2,}'
        if char.startswith('\\'):
            char = char[1:]
        string = re.sub(pattern, char, string)
        return string

    text = left_square_brackets_pat.sub('', text)
    text = right_square_brackets_pat.sub('', text)
    for p in punct:
        text = replace(text, p)
    return text


def emoji2zh(text, inverse_emoji_dict):
    for emoji, ch in inverse_emoji_dict.items():
        text = text.replace(emoji, ch)
    return text


# Clean the special emojis in the dataset, replacing them with Chinese via the json mapping file
def clean_emotion(data_path, emoji2zh_data, save_dir, train=True):
    data = defaultdict(list)
    filename = os.path.basename(data_path)
    with open(data_path, 'r', encoding='utf8') as f:
        texts = f.readlines()
    for line in tqdm(texts, desc=data_path):
        if train:
            id_, text, label = line.strip().split('\t')
        else:
            id_, text = line.strip().split('\t')
        data['id'].append(id_)
        text = emoji2zh(text, emoji2zh_data)
        text = clean_duplication(text)
        data['text_a'].append(text)
        if train:
            data['label'].append(label)
    df = pd.DataFrame(data)
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    df.to_csv(os.path.join(save_dir, filename), index=False,
              encoding='utf8', header=False, sep='\t')
    return df
```
```python
# Read the emoji mapping json file (placed in the work directory and named emoji2zh.json),
# used to replace emojis with Chinese characters
import json

emoji2zh_data = json.load(open('/home/aistudio/work/emoji2zh.json', 'r', encoding='utf8'))
```
```python
# Clean the data
data = clean_emotion('/home/aistudio/data/data100731/OCEMOTION.csv', emoji2zh_data, './')
```
/home/aistudio/data/data100731/OCEMOTION.csv: 100%|██████████| 35694/35694 [00:04<00:00, 7890.77it/s]
```python
# Drop the useless id column, keeping text_a and label
data = data[['text_a', 'label']]
```
3.3 Converting emotion category labels
The class names in the dataset are in English; here they are mapped to the class names displayed by the Chinese micro emotion analysis system.
```python
# Replace the labels in the dataset:
# {'sadness': 'Sad', 'happiness': 'cheerful', 'like': 'like', 'anger': 'anger',
#  'fear': 'fear', 'surprise': 'surprised', 'disgust': 'hate'}
data.loc[data['label'] == 'sadness', 'label'] = 'Sad'
data.loc[data['label'] == 'happiness', 'label'] = 'cheerful'
data.loc[data['label'] == 'like', 'label'] = 'like'
data.loc[data['label'] == 'anger', 'label'] = 'anger'
data.loc[data['label'] == 'fear', 'label'] = 'fear'
data.loc[data['label'] == 'surprise', 'label'] = 'surprised'
data.loc[data['label'] == 'disgust', 'label'] = 'hate'
```
3.4 Manually splitting training, validation and test sets
Why split into training, validation and test sets:
a) The training set directly participates in tuning the model parameters, so it obviously cannot reflect the model's true ability (a student who memorizes the textbook by rote should not be judged the best; this guards against overfitting).
b) The validation set participates in manual tuning of hyperparameters, so it cannot be used for the final judgement of a model either (a student who grinds through the question bank is not necessarily a good student).
c) Therefore a final exam (the test set) is needed to assess the student's, i.e. the model's, true ability.
Two common data set division methods are provided below, which can be selected according to specific needs or effects:
```python
# # Division method 1: directly split into training, validation and test sets by ratio
# from sklearn.model_selection import train_test_split
# train_data, test_data = train_test_split(data, test_size=0.2)
# train_data, valid_data = train_test_split(train_data, test_size=0.2)

# # Randomly shuffle the data
# from sklearn.utils import shuffle
# train_data = shuffle(train_data)
# valid_data = shuffle(valid_data)
# test_data = shuffle(test_data)

# # Save the split dataset files
# train_data.to_csv('./train.csv', index=False, sep="\t")   # Training set
# valid_data.to_csv('./valid.csv', index=False, sep="\t")   # Validation set
# test_data.to_csv('./test.csv', index=False, sep="\t")     # Test set

# print('Training set length:', len(train_data), 'Validation set length:', len(valid_data), 'Test set length:', len(test_data))
```
```python
# Division method 2: split into training, validation and test sets 8:1:1 within each category,
# so that the classes are distributed as evenly as possible
from sklearn.utils import shuffle

train = pd.DataFrame()  # Training set
valid = pd.DataFrame()  # Validation set
test = pd.DataFrame()   # Test set

tags = data['label'].unique().tolist()  # Sample proportionally within each label

# For each category, split 8:1:1 into training, validation and test sets, then shuffle and save
for tag in tags:
    # Randomly pick 20% of this category as the validation + test pool
    target = data[(data['label'] == tag)]
    sample = target.sample(int(0.2 * len(target)))
    sample_index = sample.index

    # The remaining 80% of the data is used as the training set
    all_index = target.index
    residue_index = all_index.difference(sample_index)  # Data remaining after removing the sample
    residue = target.loc[residue_index]

    # Split the 20% pool evenly into test and validation sets
    test_sample = sample.sample(int(0.5 * len(sample)))
    test_sample_index = test_sample.index
    valid_sample_index = sample_index.difference(test_sample_index)
    valid_sample = sample.loc[valid_sample_index]

    # Concatenate each category
    test = pd.concat([test, test_sample], ignore_index=True)
    valid = pd.concat([valid, valid_sample], ignore_index=True)
    train = pd.concat([train, residue], ignore_index=True)

# Randomly shuffle the data
train = shuffle(train)
valid = shuffle(valid)
test = shuffle(test)

# Save as tab-delimited text
train.to_csv('train.csv', sep='\t', index=False)  # Training set
valid.to_csv('valid.csv', sep='\t', index=False)  # Validation set
test.to_csv('test.csv', sep='\t', index=False)    # Test set

print('Training set length:', len(train), 'Validation set length:', len(valid), 'Test set length', len(test))
```
Training set length: 28558 verification set length: 3570 test set length: 3566
4. Building the Micro Emotion Analysis Model with PaddleHub
About PaddleHub:
PaddleHub is PaddlePaddle's pre-trained model management and transfer learning tool. With PaddleHub, developers can combine high-quality pre-trained models with the fine-tune API to quickly go all the way from transfer learning to application deployment. It provides high-quality pre-trained models within the PaddlePaddle ecosystem, covering image classification, object detection, lexical analysis, semantic models, sentiment analysis, video classification, image generation, image segmentation, text moderation, keypoint detection and other mainstream models.
For more model details, please check the official website: https://www.paddlepaddle.org.cn/hub
If you encounter problems while using PaddleHub, you can file an issue at: https://github.com/PaddlePaddle/PaddleHub/issues
Based on the pre training model, PaddleHub supports the following functions:
1. Models as software: fast prediction through the Python API or the command line, making the PaddlePaddle model library more convenient to use.
2. Transfer learning: with the fine-tune API, users can complete deep transfer learning for natural language processing and computer vision scenarios with only a small amount of code.
3. Service deployment: a simple command line builds an API service for your own model.
4. Hyperparameter optimization: automatically search for the best hyperparameters to obtain a better model.
4.1 Environment preparation
```python
# Install the latest version of paddlehub
!pip install -U paddlehub -i https://pypi.tuna.tsinghua.edu.cn/simple
```
```python
# Import the paddlehub and paddle packages
import paddlehub as hub
import paddle
```
4.2 Loading the pre-trained model ERNIE Tiny
ERNIE Tiny is obtained by compressing the ERNIE 2.0 Base model, mainly through model structure compression and model distillation. Its characteristics and advantages are:
a. It uses a 3-layer transformer structure, giving a linear 4x speedup;
b. The hidden size is widened from 768 in ERNIE 2.0 to 1024;
c. It shortens the input sequence length to reduce computational complexity, being the first to use Chinese subword-granularity inputs, which shortens the input length by about 40% on average;
d. During training, ERNIE Tiny acts as the student and uses model distillation to learn the distributions and outputs of the corresponding Transformer and prediction layers of the teacher model ERNIE 2.0;
Together, these optimizations bring about 4.3x faster prediction and make the model much easier to deploy in industry.
```python
# Set up the 7 emotion categories to classify
label_list = list(data.label.unique())
print(label_list)

label_map = {
    idx: label_text
    for idx, label_text in enumerate(label_list)
}
print(label_map)
```
['Sad', 'cheerful', 'like', 'anger', 'fear', 'surprised', 'hate'] {0: 'Sad', 1: 'cheerful', 2: 'like', 3: 'anger', 4: 'fear', 5: 'surprised', 6: 'hate'}
```python
# Just specify the model name and the number of classes to complete the fine-tune network
# definition: a fully connected layer is appended to the pre-trained model for classification.
# Here the ernie_tiny pre-trained model is chosen and the fine-tuning task is set to 7 classes
model = hub.Module(name="ernie_tiny", task='seq-cls', num_classes=7, label_map=label_map)
```
Download https://bj.bcebos.com/paddlehub/paddlehub_dev/ernie_tiny_2.0.2.tar.gz [##################################################] 100.00% Decompress /home/aistudio/.paddlehub/tmp/tmpyvupawg3/ernie_tiny_2.0.2.tar.gz [##################################################] 100.00% [2021-08-04 23:21:25,328] [ INFO] - Successfully installed ernie_tiny-2.0.2 [2021-08-04 23:21:25,332] [ INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-tiny [2021-08-04 23:21:25,335] [ INFO] - Downloading ernie_tiny.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/ernie_tiny.pdparams 100%|██████████| 354158/354158 [00:04<00:00, 71445.77it/s] /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict. warnings.warn(("Skip loading for {}. ".format(key) + str(err))) /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict. warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
The parameter usage of hub.Module is as follows:
- name: model name, e.g. ernie, ernie_tiny, bert-base-cased, bert-base-chinese, roberta-wwm-ext, roberta-wwm-ext-large, etc.
- task: fine-tune task. Here it is seq-cls, i.e. a text classification task.
- num_classes: the number of classes for the current text classification task, determined by the specific dataset used; the default is 2, so set it according to your classification task.
PaddleHub also provides models such as BERT to choose from. The loading examples corresponding to the models currently supporting text classification tasks are as follows:
Model name | PaddleHub Module |
---|---|
ERNIE, Chinese | hub.Module(name='ernie') |
ERNIE tiny, Chinese | hub.Module(name='ernie_tiny') |
ERNIE 2.0 Base, English | hub.Module(name='ernie_v2_eng_base') |
ERNIE 2.0 Large, English | hub.Module(name='ernie_v2_eng_large') |
BERT-Base, English Cased | hub.Module(name='bert-base-cased') |
BERT-Base, English Uncased | hub.Module(name='bert-base-uncased') |
BERT-Large, English Cased | hub.Module(name='bert-large-cased') |
BERT-Large, English Uncased | hub.Module(name='bert-large-uncased') |
BERT-Base, Multilingual Cased | hub.Module(name='bert-base-multilingual-cased') |
BERT-Base, Multilingual Uncased | hub.Module(name='bert-base-multilingual-uncased') |
BERT-Base, Chinese | hub.Module(name='bert-base-chinese') |
BERT-wwm, Chinese | hub.Module(name='chinese-bert-wwm') |
BERT-wwm-ext, Chinese | hub.Module(name='chinese-bert-wwm-ext') |
RoBERTa-wwm-ext, Chinese | hub.Module(name='roberta-wwm-ext') |
RoBERTa-wwm-ext-large, Chinese | hub.Module(name='roberta-wwm-ext-large') |
RBT3, Chinese | hub.Module(name='rbt3') |
RBTL3, Chinese | hub.Module(name='rbtl3') |
ELECTRA-Small, English | hub.Module(name='electra-small') |
ELECTRA-Base, English | hub.Module(name='electra-base') |
ELECTRA-Large, English | hub.Module(name='electra-large') |
ELECTRA-Base, Chinese | hub.Module(name='chinese-electra-base') |
ELECTRA-Small, Chinese | hub.Module(name='chinese-electra-small') |
The single line of code above initializes the model for text classification tasks: a fully connected layer is appended after the ERNIE pre-trained model.
4.3 Loading and processing the data
```python
# Import the required libraries
import os, io, csv
from paddlehub.datasets.base_nlp_dataset import InputExample, TextClassificationDataset
```
```python
# Dataset storage location
DATA_DIR = "/home/aistudio/data/data100731/"
```
```python
# Process the data into a format acceptable to the model
class OCEMOTION(TextClassificationDataset):
    def __init__(self, tokenizer, mode='train', max_seq_len=128):
        if mode == 'train':
            data_file = 'train.csv'   # Training set
        elif mode == 'test':
            data_file = 'test.csv'    # Test set
        else:
            data_file = 'valid.csv'   # Validation set
        super(OCEMOTION, self).__init__(
            base_path=DATA_DIR,
            data_file=data_file,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            is_file_with_header=True,
            label_list=label_list)

    # Parse the samples in the text file
    def _read_file(self, input_file, is_file_with_header: bool = False):
        if not os.path.exists(input_file):
            raise RuntimeError("The file {} is not found.".format(input_file))
        else:
            with io.open(input_file, "r", encoding="UTF-8") as f:
                reader = csv.reader(f, delimiter="\t")
                examples = []
                seq_id = 0
                header = next(reader) if is_file_with_header else None
                for line in reader:
                    try:
                        example = InputExample(guid=seq_id, text_a=line[0], label=line[1])
                        seq_id += 1
                        examples.append(example)
                    except:
                        continue
                return examples


# max_seq_len is chosen according to the actual text length, but must not exceed 512
train_dataset = OCEMOTION(model.get_tokenizer(), mode='train', max_seq_len=128)
dev_dataset = OCEMOTION(model.get_tokenizer(), mode='dev', max_seq_len=128)
test_dataset = OCEMOTION(model.get_tokenizer(), mode='test', max_seq_len=128)

# View the first 3 training examples
for e in train_dataset.examples[:3]:
    print(e)
# View the first 3 validation examples
for e in dev_dataset.examples[:3]:
    print(e)
# View the first 3 test examples
for e in test_dataset.examples[:3]:
    print(e)
```
[2021-08-04 23:21:44,835] [ INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/vocab.txt 100%|██████████| 459/459 [00:00<00:00, 5903.16it/s] [2021-08-04 23:21:45,047] [ INFO] - Downloading spm_cased_simp_sampled.model from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/spm_cased_simp_sampled.model 100%|██████████| 1083/1083 [00:00<00:00, 11921.14it/s] [2021-08-04 23:21:45,199] [ INFO] - Downloading dict.wordseg.pickle from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_tiny/dict.wordseg.pickle 100%|██████████| 161822/161822 [00:02<00:00, 65316.99it/s] [2021-08-04 23:22:00,079] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt [2021-08-04 23:22:00,083] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model [2021-08-04 23:22:00,087] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle [2021-08-04 23:22:05,911] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt [2021-08-04 23:22:05,914] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model [2021-08-04 23:22:05,916] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle text=Is there a kind of people in the world who are not willing to yield to others,Is there a kind of person who can't fight all his life,Is there a kind of people who would rather die than live? Is there a kind of people who are forced not to give up. It has been more than half a year,I'm tired.,Jin Xi,But it's not time to go to bed. label=like text=I like the flowers in Daguan county,Are incomparably beautiful label=like text=Can't find. label=Sad text=Xu Jieqi and dog day's literature class ppt label=anger text=-I want to take a nap,Ann. label=hate text=attend a meeting,I went up and did it without preparation presentation,Someone is sweating a lot for me. Fortunately, I work clearly at ordinary times,Clear the two items in three minutes,progress,problem,Action plan,Explain the dead line of time one by one,The house is full of Chinese and foreign friends nodding frequently,Someone and someone stared at me and smiled,Qi Qi secretly gave me a thumbs up. ha-ha,Speaking in English is my strong point. oh dear,That's great. label=cheerful text=You still need to write yourself a prescription,This is a necessary process based on the experience of everyone. 2. We need constant positive psychological hints. 3. We need to get a sense of satisfaction and affirmation from our daily English practice. 4. This is what I said at the top,There must be one thing that can give me strong positive stimulation 5 continuous exercise,To release psychological energy label=Sad text=Woo woo.I forgot my password.I can't get on. label=Sad text=Hee hee has refused to speak Tang poetry before going to bed for more than two months,Today, I took out Tang poetry and prepared to speak,As a result, she recited it herself,I said the title of the poem,She recites poetry,I recited fifty or sixty songs in one breath,It really surprised me! label=cheerful
4.4 Optimization strategy and run configuration
```python
# Use the AdamW optimizer here
optimizer = paddle.optimizer.AdamW(learning_rate=4e-5, parameters=model.parameters())
```
```python
# Run configuration
trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt', use_gpu=True, use_vdl=True)  # Executor of the fine-tune task
```
[2021-08-04 23:22:30,859] [ WARNING] - PaddleHub model checkpoint not found, start from scratch...
The Trainer controls the training of the fine-tune task and is its initiator. Its main configurable parameters are:
- model: the model to be optimized;
- optimizer: the chosen optimizer;
- use_gpu: whether to train on GPU;
- use_vdl: whether to use VisualDL to visualize the training process;
- checkpoint_dir: the directory in which model parameters are saved;
- compare_metrics: the metric used to select the best model to save.
4.5 Model training and validation
Note that a GPU environment is required for model training. During training you can check GPU memory usage via the "performance monitoring" panel below or by running the nvidia-smi command in the terminal. If GPU memory is insufficient, reduce batch_size appropriately.
```python
# Configure the training parameters, start training, and specify the validation set
trainer.train(train_dataset, epochs=5, batch_size=256, eval_dataset=dev_dataset, save_interval=1)
```
trainer.train mainly controls the specific training process, including the following controllable parameters:
- train_dataset: the dataset used in training;
- epochs: number of training epochs;
- batch_size: the training batch size; if using a GPU, adjust batch_size according to the actual situation;
- num_workers: the number of workers, 0 by default;
- eval_dataset: validation set;
- log_interval: the interval between printing the log, in the number of times the batch training is executed.
- save_interval: the interval frequency of saving the model. The unit is the number of rounds of training.
4.6 Evaluating the trained model on the test set
```python
# Evaluate the currently trained model on the test set
result = trainer.evaluate(test_dataset, batch_size=128)
```
[2021-08-04 23:30:35,095] [ INFO] - Evaluation on validation dataset: [2021-08-04 23:30:38,511] [ EVAL] - [Evaluation result] avg_acc=0.5939
```python
# Advanced extension: use the F1 score to evaluate the results on the test set more formally
import numpy as np

# Read the test set file
df = pd.read_csv('./test.csv', sep='\t')
news1 = pd.DataFrame(columns=['label'])
news1['label'] = df["label"]
news = pd.DataFrame(columns=['text_a'])
news['text_a'] = df["text_a"]

# First convert the data read by pandas into an array
data_array = np.array(news)
# Then convert it into list form
data_list = data_array.tolist()

# Predict on the test set to get the predicted category labels
y_pre = model.predict(data_list, max_seq_len=128, batch_size=128, use_gpu=True)

# True category labels of the test set
data_array1 = np.array(news1)
y_val = data_array1.tolist()

# Calculate the F1 score of the prediction results
from sklearn.metrics import precision_recall_fscore_support, f1_score, precision_score, recall_score

f1 = f1_score(y_val, y_pre, average='macro')
p = precision_score(y_val, y_pre, average='macro')
r = recall_score(y_val, y_pre, average='macro')
print(f1, p, r)
```
[2021-08-04 23:30:38,536] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt [2021-08-04 23:30:38,539] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model [2021-08-04 23:30:38,542] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations if data.dtype == np.object: 0.48899184212664987 0.5591275726811291 0.4574688432467382
PS: if you are interested, you can further improve the results on top of this baseline by tuning hyperparameters, choosing other pre-trained models, optimizing the network structure, or continuing pre-training.
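For example, swapping in a different backbone from the module table above only changes the hub.Module call. The sketch below is illustrative only: it assumes label_map and the datasets defined earlier are still in scope, and checkpoint_dir/batch_size are arbitrary choices; the rest of the pipeline stays the same.

```python
# Sketch: fine-tune roberta-wwm-ext instead of ernie_tiny; the rest of the pipeline is unchanged
model = hub.Module(name='roberta-wwm-ext', task='seq-cls', num_classes=7, label_map=label_map)
optimizer = paddle.optimizer.AdamW(learning_rate=4e-5, parameters=model.parameters())
trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt_roberta', use_gpu=True, use_vdl=True)
trainer.train(train_dataset, epochs=5, batch_size=128, eval_dataset=dev_dataset, save_interval=1)
```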
4.7 Model prediction
```python
# Data to be predicted
data = [
    # Sad
    ["You don't have to say sorry,But if we cherish each other"],
    # cheerful
    ["Happiness is actually very simple"],
    # fear
    ["Fear. fall ill"],
    # like
    ["When your long hair reaches your waist,We worked together. I will wait"]
]

# Define the 7 emotion categories
label_list = ['Sad', 'cheerful', 'like', 'anger', 'fear', 'surprised', 'hate']
label_map = {idx: label_text for idx, label_text in enumerate(label_list)}

# Load the trained model
model = hub.Module(
    name='ernie_tiny',
    task='seq-cls',
    num_classes=7,
    load_checkpoint='./ckpt/best_model/model.pdparams',
    label_map=label_map)

# Model prediction
results = model.predict(data, max_seq_len=128, batch_size=1, use_gpu=True)
for idx, text in enumerate(data):
    print('Data: {} \t Lable: {}'.format(text[0], results[idx]))
```
[2021-08-04 23:30:48,174] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-tiny/ernie_tiny.pdparams /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict. warnings.warn(("Skip loading for {}. ".format(key) + str(err))) /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict. warnings.warn(("Skip loading for {}. ".format(key) + str(err))) [2021-08-04 23:30:54,070] [ INFO] - Loaded parameters from /home/aistudio/data/data100731/ckpt/best_model/model.pdparams [2021-08-04 23:30:54,077] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/vocab.txt [2021-08-04 23:30:54,080] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/spm_cased_simp_sampled.model [2021-08-04 23:30:54,083] [ INFO] - Found /home/aistudio/.paddlenlp/models/ernie-tiny/dict.wordseg.pickle /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations if data.dtype == np.object: Data: You don't have to say sorry,But if we cherish each other Lable: Sad Data: Happiness is actually very simple Lable: cheerful Data: Fear. fall ill Lable: fear Data: When your long hair reaches your waist,We worked together. I will wait Lable: like
5. Visual Interface Demo Based on PyQt5
PS: the core code of the PyQt5 visual interface has been placed under work/Chinese micro emotion analysis system. After downloading the whole folder locally, follow the environment configuration guide and instructions provided inside it!
```python
# Move the best model just trained into the work directory for safekeeping
!cp -r /home/aistudio/data/data100731/ckpt/best_model/model.pdparams /home/aistudio/work/Chinese micro emotion analysis system/best_model/
```
5.1 Introduction to PyQt5:
PyQt5 is the Python binding for Digia's powerful Qt5 GUI framework, and programs built with it run on multiple platforms, including Unix, Windows and macOS. Qt's biggest convenience is Qt Designer, which makes page layout easy: layouts that would require a lot of code in Tkinter can be done in Qt simply by dragging controls.
5.2 Preparation - installing and configuring PyQt5 + PyCharm:
There are plenty of illustrated online tutorials on installing and configuring PyQt5 with PyCharm; just follow one of them.
5.3 Tutorial:
After installing and configuring PyQt5, use Qt Designer to lay out the page by dragging and dropping, save the .ui file, and export it to a .py file (the page design module). Then bind the corresponding functions to interface elements such as buttons and input boxes to give the whole system its functionality. (There are many online tutorials, so this is not repeated here; the best way to learn is to get hands-on!)
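As a rough idea of what that button-to-function binding looks like, here is a minimal, self-contained PyQt5 sketch. It is not the project's actual interface.py/gui.py, and predict_emotion is a hypothetical stand-in for the real PaddleHub model.predict call.

```python
# Minimal PyQt5 sketch: a text box plus a button whose clicked signal is bound to a handler
import sys
from PyQt5.QtWidgets import (QApplication, QWidget, QVBoxLayout,
                             QTextEdit, QPushButton, QLabel)


def predict_emotion(text):
    # Placeholder: the real system would call model.predict([[text]], ...) here
    return 'cheerful' if text.strip() else ''


class DemoWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle('Emotion analysis demo')
        layout = QVBoxLayout(self)
        self.input_box = QTextEdit()
        self.result_label = QLabel('Result: ')
        analyze_button = QPushButton('Analyze')
        analyze_button.clicked.connect(self.on_analyze)  # Bind the button to the handler
        layout.addWidget(self.input_box)
        layout.addWidget(analyze_button)
        layout.addWidget(self.result_label)

    def on_analyze(self):
        text = self.input_box.toPlainText()
        self.result_label.setText('Result: ' + predict_emotion(text))


if __name__ == '__main__':
    app = QApplication(sys.argv)
    window = DemoWindow()
    window.show()
    sys.exit(app.exec_())
```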
5.4 Interface beautification:
For beautifying a Python GUI, learn to make full use of QSS (Qt Style Sheets) and add images to polish the interface.
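A tiny QSS illustration (the colors and selectors below are invented, not the project's actual style sheet): setStyleSheet accepts CSS-like Qt Style Sheet strings on a single widget or on the whole application.

```python
# Hypothetical QSS styling applied application-wide
import sys
from PyQt5.QtWidgets import QApplication, QPushButton

app = QApplication(sys.argv)
app.setStyleSheet(
    "QPushButton { background-color: #2d8cf0; color: white;"
    " border-radius: 6px; padding: 6px 12px; }"
)
button = QPushButton("Analyze")
button.show()
sys.exit(app.exec_())
```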
For the project's interface code, see interface.py (the interface design module) and gui.py (the main program) in the 'Chinese micro emotion analysis system' folder.
Visual interface demonstration:
- Single text emotion analysis page:
- Batch text emotion analysis page:
6. System Packaging
6.1 Application scenario:
Once the system is developed, the whole Python program can be packaged with PyInstaller or similar tools so that it can be used directly on various platforms without complicated environment configuration. This makes it much easier to hand over to beginners or non-Python colleagues, and is also convenient for demos.
6.2 Packaging tutorial:
There are many online tutorials on packaging with PyInstaller, which will not be repeated here; just make good use of a search engine. [Reference] Detailed tutorial on packaging exe files with PyInstaller
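As a minimal illustration only (the exact options depend on your project; gui.py refers to the main program mentioned above), a typical one-file, windowed PyInstaller build could look like this, run locally where the GUI's dependencies are installed:

```python
# Hedged example of a typical PyInstaller invocation (adjust paths/options for your project):
# -F bundles everything into a single executable, -w hides the console window
!pip install pyinstaller
!pyinstaller -F -w gui.py
```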
Besides the commonly used PyInstaller, Windows users can also try packaging the program with GT's QPT for a better experience; just follow the tutorials it provides.
QPT - Quick Packaging Tool: QPT is a multi-functional packaging tool that can "simulate" the development environment. It can package an ordinary Python script into an EXE executable with as little as a single command, and can optionally add CUDA and NoAVX support to be compatible with as many user environments as possible.
7. Project Summary
This project, the "micro emotion analysis system", is a spin-off of the "public opinion analysis system" I am currently working on. I am open-sourcing this tutorial in the hope that it helps you get started with the full development workflow of a complete text classification project. End-to-end, full-pipeline practical projects are still relatively scarce on AI Studio, and I hope this one can inspire or help you, whether for competitions or your own projects!
That is all for this project. If you like it, please fork, like and follow! ❤
8. About the Author
Huang Canhua, class of 2019 undergraduate in Software Engineering, School of Software, South China Normal University.
Main interests: development, currently focusing on NLP and data mining competitions and projects.
GitHub: https://github.com/hchhtc123
Nickname: Alchemist 233
https://aistudio.baidu.com/aistudio/personalcenter/thirdview/330406 Follow me for more interesting projects to come!