[NLP] Text sentiment analysis

It was late last night and the code was not finished; on top of that, the accuracy of my PSO-LSTM could not be reproduced. Miserable. I'll fill in the details today.

1, Introduction to sentiment analysis

  Sentiment analysis is the computational study of people's opinions, sentiments, emotions, evaluations and attitudes towards products, services, organizations, individuals, issues, events, topics and their attributes. Text sentiment analysis is a common application of natural language processing (NLP) methods, and an interesting basic task in its own right, especially classification aimed at extracting the emotional content of a text. It is the process of analyzing, processing, summarizing and reasoning over subjective, emotionally colored text.
  This article introduces sentiment polarity (tendency) analysis, which judges whether a text is positive, negative or neutral. In most application scenarios there are only two categories; for example, the words "like" and "hate" express opposite polarities (a toy sketch of this idea follows below).
  This article describes in detail how to preprocess text data and how to use the LSTM model, a deep learning model, to perform text sentiment analysis.
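
To make "polarity" concrete before any modeling, here is a minimal lexicon-based sketch; the word lists are invented for demonstration and have nothing to do with the LSTM approach used later:

# Toy polarity check (illustrative only; the word lists are made up for this sketch)
positive_words = {'like', 'love', 'great'}
negative_words = {'hate', 'bad', 'terrible'}

def polarity(text):
    tokens = text.lower().split()
    score = sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

print(polarity('I like this product'))  # positive
print(polarity('I hate this product'))  # negative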

2, Text introduction and corpus analysis

  This project uses the reviews of a commodity on an e-commerce website as its corpus (corpus.csv, loaded as data/data_single.csv in the code below). The data set contains 4310 review records, and the sentiment of each text is labeled with one of two classes, "positive" and "negative". The first few lines of the data set are as follows:
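
Since the preview of the first lines appears as an image in the original post, you can reproduce it yourself. The sketch below assumes the CSV has the two columns, evaluation and label, that the code in the following sections relies on:

import pandas as pd

# Peek at the corpus; file and column names match those used in the code below
df = pd.read_csv('data/data_single.csv')
print(df.head())   # columns: evaluation (review text), label (positive/negative)
print(df.shape)    # expected (4310, 2) given the description above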

3, Data set analysis

  • Sentiment distribution in the data set
  • Comment sentence length distribution in the data set

The following code computes the sentiment distribution and the comment sentence length distribution of the data set:

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager
from itertools import accumulate

# Set the font used by matplotlib when drawing (SimHei supports Chinese labels);
# a raw string avoids backslash-escape problems in the Windows path
my_font = font_manager.FontProperties(fname=r"C:\Windows\Fonts\simhei.ttf")

# Sentiment (label) distribution in the data set
df = pd.read_csv('data/data_single.csv')
print(df.groupby('label')['label'].count())

# Count sentence lengths and their frequencies
df['length'] = df['evaluation'].apply(lambda x: len(x))
len_df=df.groupby('length').count()
sent_length=len_df.index.tolist()
sent_freq=len_df['evaluation'].tolist()

# Draw a statistical chart of sentence length and frequency
plt.bar(sent_length,sent_freq)
plt.title('Statistical chart of sentence length and frequency',fontproperties=my_font)
plt.xlabel('Sentence length',fontproperties=my_font)
plt.ylabel('Frequency of sentence length',fontproperties=my_font)
plt.show()
plt.close()
# Compute the cumulative distribution (CDF) of sentence length
sent_percentage_list = [count / sum(sent_freq) for count in accumulate(sent_freq)]

# Draw CDF
plt.plot(sent_length, sent_percentage_list)

# Find the sentence length at the given quantile
quantile = 0.91
for length, per in zip(sent_length, sent_percentage_list):
    if round(per, 2) == quantile:
        index = length
        break
print('\nSentence length at the %s quantile: %d.' % (quantile, index))

plt.show()
plt.close()

# Draw the cumulative distribution function of sentence length
plt.plot(sent_length, sent_percentage_list)
plt.hlines(quantile,0,index,colors='c',linestyles='dashed')
plt.vlines(index,0,quantile,colors='c',linestyles='dashed')
plt.text(0,quantile,str(quantile))
plt.text(index,0,str(index))
plt.title('Cumulative distribution function of sentence length',fontproperties=my_font)
plt.xlabel('Sentence length',fontproperties=my_font)
plt.ylabel('Cumulative frequency of sentence length',fontproperties=my_font)
plt.show()
plt.close()

The output results are as follows:

The statistics of sentence length and frequency are as follows:

The cumulative distribution function of sentence length is as follows:

As the plots above show, the lengths of most samples are concentrated between 1 and 200 characters. Taking the 0.91 quantile of the cumulative sentence length distribution gives a length of about 183.
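
As a cross-check, the same quantile can be computed directly with pandas instead of scanning the CDF by hand (a sketch assuming the same data/data_single.csv as above):

import pandas as pd

df = pd.read_csv('data/data_single.csv')
# 0.91 quantile of the character-length distribution; should come out near 183
print(df['evaluation'].str.len().quantile(0.91))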

4, LSTM model

The model framework implemented is as follows:

The code is as follows:

import pickle
import numpy as np
import pandas as pd
from keras.utils import np_utils
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Dense, Embedding,Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data set: the 'evaluation' column is the feature, the 'label' column is the label
def load_data(filepath,input_shape=20):
    df=pd.read_csv(filepath)

    # Label set and raw vocabulary (the unique review texts)
    labels,vocabulary=list(df['label'].unique()),list(df['evaluation'].unique())

    # Construct character-level features: concatenate all reviews, then take the set of characters
    string=''
    for word in vocabulary:
        string+=word

    vocabulary=set(string)

    # Character-to-index dictionary (indices start at 1; 0 is reserved for padding)
    word_dictionary={word:i+1 for i,word in enumerate(vocabulary)}
    with open('word_dict.pk','wb') as f:
        pickle.dump(word_dictionary,f)
    inverse_word_dictionary={i+1:word for i,word in enumerate(vocabulary)}
    label_dictionary={label:i for i,label in enumerate(labels)}
    with open('label_dict.pk','wb') as f:
        pickle.dump(label_dictionary,f)
    output_dictionary = {i: label for i, label in enumerate(labels)}

    # Vocabulary size
    vocab_size=len(word_dictionary.keys())
    # Number of label categories
    label_size=len(label_dictionary.keys())

    # Pad each sequence to length input_shape; sequences shorter than that are padded with 0 at the end
    x=[[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x=pad_sequences(maxlen=input_shape,sequences=x,padding='post',value=0)
    y=[[label_dictionary[sent]] for sent in df['label']]
    '''
    np_utils.to_categorical converts integer labels into a one-hot matrix of
    shape (nb_samples, nb_classes).
    Assume num_classes = 8.
    Then [1, 2, 3, ..., 4] is converted to:
    [[0, 1, 0, 0, 0, 0, 0, 0]
     [0, 0, 1, 0, 0, 0, 0, 0]
     [0, 0, 0, 1, 0, 0, 0, 0]
    ......
    [0, 0, 0, 0, 1, 0, 0, 0]]
    '''
    y=[np_utils.to_categorical(label,num_classes=label_size) for label in y]
    y=np.array([list(_[0]) for _ in y])

    return x,y,output_dictionary,vocab_size,label_size,inverse_word_dictionary

# Create a deep learning model, Embedding + LSTM + Softmax
def create_LSTM(n_units,input_shape,output_dim,filepath):
    x,y,output_dictionary,vocab_size,label_size,inverse_word_dictionary=load_data(filepath)
    model=Sequential()
    model.add(Embedding(input_dim=vocab_size+1,output_dim=output_dim,
                        input_length=input_shape,mask_zero=True))
    model.add(LSTM(n_units))  # input shape is already defined by the Embedding layer above
    model.add(Dropout(0.2))
    model.add(Dense(label_size,activation='softmax'))
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

    '''
    Error: ImportError: ('You must install pydot (`pip install pydot`) and install graphviz
    (see instructions at https://graphviz.gitlab.io/download/)', 'for plot_model/model_to_dot to work.')
    Version issue: use from keras.utils.vis_utils import plot_model
    Full solution: https://www.pianshen.com/article/6746984081/
    '''

    plot_model(model,to_file='./model_lstm.png',show_shapes=True)
    # Output model information
    model.summary()

    return model

# model training
def model_train(input_shape,filepath,model_save_path):
    # Split the data set into training and test sets at a ratio of 9:1
    # input_shape=100
    x,y,output_dictionary,vocab_size,label_size,inverse_word_dictionary=load_data(filepath,input_shape)
    train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.1,random_state=42)

    # Model input parameters need to be adjusted according to your own needs
    n_units=100
    batch_size=32
    epochs=5
    output_dim=20

    # model training
    lstm_model=create_LSTM(n_units,input_shape,output_dim,filepath)
    lstm_model.fit(train_x,train_y,epochs=epochs,batch_size=batch_size,verbose=1)

    # Model saving
    lstm_model.save(model_save_path)

    # Number of test samples
    N= test_x.shape[0]
    predict=[]
    label=[]
    for start,end in zip(range(0,N,1),range(1,N+1,1)):
        print(f'start:{start}, end:{end}')
        sentence=[inverse_word_dictionary[i] for i in test_x[start] if i!=0]
        y_predict=lstm_model.predict(test_x[start:end])
        print('y_predict:',y_predict)
        label_predict=output_dictionary[np.argmax(y_predict[0])]
        label_true=output_dictionary[np.argmax(test_y[start:end])]
        print(f'label_predict:{label_predict}, label_true:{label_true}')
        # Output prediction results
        print(''.join(sentence),label_true,label_predict)
        predict.append(label_predict)
        label.append(label_true)

    # Prediction accuracy
    acc=accuracy_score(predict,label)
    print('Accuracy of the model on the test set: %s' % acc)

if __name__=='__main__':
    filepath='data/data_single.csv'
    input_shape=180
    model_save_path='data/corpus_model.h5'
    model_train(input_shape,filepath,model_save_path)
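
Once training has finished, the saved artifacts can be reused for inference on new text. The following is a minimal sketch under the assumptions of the code above (model saved at data/corpus_model.h5, character dictionary in word_dict.pk, label dictionary in label_dict.pk, input_shape=180); characters unseen during training are simply skipped here:

import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# Load the trained model and the dictionaries saved by load_data()
model = load_model('data/corpus_model.h5')
with open('word_dict.pk', 'rb') as f:
    word_dictionary = pickle.load(f)
with open('label_dict.pk', 'rb') as f:
    label_dictionary = pickle.load(f)
output_dictionary = {v: k for k, v in label_dictionary.items()}

# Encode a new sentence at character level; out-of-vocabulary characters are dropped
sentence = 'An example review to classify'
x = [[word_dictionary[ch] for ch in sentence if ch in word_dictionary]]
x = pad_sequences(maxlen=180, sequences=x, padding='post', value=0)

y_predict = model.predict(x)
print(output_dictionary[np.argmax(y_predict[0])])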

5, Key function explanation

plot_model

If from keras.utils import plot_model reports an error in your code, change it to from keras.utils.vis_utils import plot_model.
After I changed it, I still got an error: ImportError: ('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/)', 'for plot_model/model_to_dot to work.')
Here are the solutions:

  • (1) pip install pydot_ng
  • (2) Install graphviz. Rather than installing it directly with pip, it is recommended to download it from the official website; I downloaded the version below.

    Unzip it into the site-packages directory of the corresponding Anaconda environment, then copy the path of its bin directory.
  • (3) Modify site-packages\pydot_ng\__init__.py and add, in Method 3: path = r"D:\app\tech\anaconda3\envs\NLP\lib\site-packages\graphviz\bin"  # this path points to the bin directory just copied, as shown in the figure:
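
If you would rather not install pydot/graphviz at all, a pragmatic workaround (my suggestion, not part of the original fix) is to make the plotting step optional, for example:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([Dense(2, input_shape=(4,), activation='softmax')])  # placeholder model for this sketch

# Optional plotting: skip gracefully when pydot/graphviz are missing
try:
    from keras.utils.vis_utils import plot_model
    plot_model(model, to_file='./model_lstm.png', show_shapes=True)
except ImportError as e:
    print('Skipping plot_model (pydot/graphviz not installed):', e)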

np_utils.to_categorical

np_utils.to_categorical is used to convert integer labels into a one-hot matrix of shape (nb_samples, nb_classes).
Assume num_classes = 8.
Then [1, 2, 3, ..., 4] is converted to:
[[0, 1, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0]
......
[0, 0, 0, 0, 1, 0, 0, 0]]
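
A quick runnable check of this behavior, using the old keras.utils.np_utils API that this article uses (in newer Keras the same function is keras.utils.to_categorical):

import numpy as np
from keras.utils import np_utils

labels = np.array([1, 2, 3, 4])
one_hot = np_utils.to_categorical(labels, num_classes=8)
print(one_hot.shape)  # (4, 8)
print(one_hot)
# [[0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]]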

model.summary()

model.summary() prints the parameter status of each layer of the model, as shown in the figure:
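
As a sanity check on those numbers (my own calculation, not from the original article), the counts can be derived by hand from the hyperparameters used above (output_dim=20, n_units=100, label_size=2):

# Embedding: (vocab_size + 1) * 20 parameters - one 20-dim vector per character id, including padding id 0
# LSTM: 4 gates, each with (input_dim + n_units) * n_units weights + n_units biases
lstm_params = 4 * ((20 + 100) * 100 + 100)
# Dense: (n_units + 1) * label_size weights and biases
dense_params = (100 + 1) * 2
print(lstm_params, dense_params)  # 48400 202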

Special thanks

This article refers to the blog "The farmer's three fists hurt a little" and the error-resolution reference link above.

Tags: Python NLP

Posted on Fri, 19 Nov 2021 23:40:57 -0500 by symchicken