It got too late last night and the code wasn't finished; the accuracy of the PSO-LSTM just could not be recovered, which was miserable /(o)/. I'll fill in the details today.
1, Introduction to sentiment analysis
Sentiment analysis is the computational study of people's opinions, sentiments, evaluations, and attitudes towards products, services, organizations, individuals, issues, events, topics, and their attributes. Text sentiment analysis is a common application of natural language processing (NLP) methods and an interesting fundamental task, especially classification aimed at extracting the emotional content of text. It is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional coloring.
This article introduces sentiment polarity (tendency) analysis, which means judging whether a text is positive, negative, or neutral. In most application scenarios there are only two classes; for example, the words "like" and "hate" belong to opposite polarities.
This article describes in detail how to preprocess the text data and how to use an LSTM, a deep learning model, to perform text sentiment analysis.
2, Text introduction and corpus analysis
This project uses the reviews of a product on an e-commerce website as the corpus (corpus.csv). The dataset contains 4,310 review records, and the sentiment of each text is labeled with one of two classes: "positive" and "negative". The first few lines of the dataset are as follows:
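As a quick sanity check, the first rows can be inspected with pandas. A minimal sketch; the file path and the column names `evaluation` and `label` follow the training code later in this article:

```python
import pandas as pd

# Load the corpus; path and column names follow the code below
df = pd.read_csv('data/data_single.csv')
print(df.head())   # first few reviews with their labels
print(df.shape)    # the article states 4310 rows
```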
3, Data set analysis
- Sentiment distribution in the dataset
- Distribution of review sentence lengths in the dataset
The following code computes the sentiment distribution and the review sentence length distribution of the dataset:
```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager
from itertools import accumulate

# Set the font used by matplotlib (needed for Chinese characters)
my_font = font_manager.FontProperties(fname=r"C:\Windows\Fonts\simhei.ttf")

# Count sentence lengths and their frequencies
df = pd.read_csv('data/data_single.csv')
print(df.groupby('label')['label'].count())

df['length'] = df['evaluation'].apply(lambda x: len(x))
len_df = df.groupby('length').count()
sent_length = len_df.index.tolist()
sent_freq = len_df['evaluation'].tolist()

# Plot a histogram of sentence length vs. frequency
plt.bar(sent_length, sent_freq)
plt.title('Statistical chart of sentence length and frequency', fontproperties=my_font)
plt.xlabel('Sentence length', fontproperties=my_font)
plt.ylabel('Frequency of sentence length', fontproperties=my_font)
plt.show()
plt.close()

# Cumulative distribution function (CDF) of sentence length
sent_percentage_list = [(count / sum(sent_freq)) for count in accumulate(sent_freq)]
plt.plot(sent_length, sent_percentage_list)

# Find the sentence length at the chosen quantile
quantile = 0.91
print(list(sent_percentage_list))
for length, per in zip(sent_length, sent_percentage_list):
    if round(per, 2) == quantile:
        index = length
        break
print('\nSentence length at quantile %s: %d.' % (quantile, index))
plt.show()
plt.close()

# Plot the CDF of sentence length with the quantile marked
plt.plot(sent_length, sent_percentage_list)
plt.hlines(quantile, 0, index, colors='c', linestyles='dashed')
plt.vlines(index, 0, quantile, colors='c', linestyles='dashed')
plt.text(0, quantile, str(quantile))
plt.text(index, 0, str(index))
plt.title('Cumulative distribution function of sentence length', fontproperties=my_font)
plt.xlabel('Sentence length', fontproperties=my_font)
plt.ylabel('Cumulative frequency of sentence length', fontproperties=my_font)
plt.show()
plt.close()
```
The output results are as follows:
The statistics of sentence length and frequency are as follows:
The cumulative distribution function of sentence length is as follows:
As can be seen from the figures above, the sentence lengths of most samples are concentrated between 1 and 200. Taking the 0.91 quantile of the cumulative sentence length distribution gives a length of about 183.
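Equivalently, the quantile length can be computed directly with pandas instead of scanning the CDF. A minimal sketch under the same column names:

```python
import pandas as pd

df = pd.read_csv('data/data_single.csv')
# 0.91 quantile of review lengths; the article reports roughly 183
print(int(df['evaluation'].str.len().quantile(0.91)))
```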
4, LSTM model
The model framework implemented is as follows:
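(As built in the code below, the network is an Embedding layer with `mask_zero=True`, followed by an LSTM layer, a Dropout layer, and a Dense output layer with softmax activation.)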
The code is as follows:
```python
import pickle
import numpy as np
import pandas as pd
from keras.utils import np_utils
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.layers import LSTM, Dense, Embedding, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Load the dataset: 'evaluation' is the feature, 'label' is the label
def load_data(filepath, input_shape=20):
    df = pd.read_csv(filepath)

    # Labels and vocabulary
    labels, vocabulary = list(df['label'].unique()), list(df['evaluation'].unique())

    # Build character-level features
    string = ''
    for word in vocabulary:
        string += word
    vocabulary = set(string)

    # Dictionaries mapping characters and labels to integer ids (0 is reserved for padding)
    word_dictionary = {word: i + 1 for i, word in enumerate(vocabulary)}
    with open('word_dict.pk', 'wb') as f:
        pickle.dump(word_dictionary, f)
    inverse_word_dictionary = {i + 1: word for i, word in enumerate(vocabulary)}
    label_dictionary = {label: i for i, label in enumerate(labels)}
    with open('label_dict.pk', 'wb') as f:
        pickle.dump(label_dictionary, f)
    output_dictionary = {i: label for i, label in enumerate(labels)}

    # Vocabulary size and number of label classes
    vocab_size = len(word_dictionary.keys())
    label_size = len(label_dictionary.keys())

    # Pad each sequence to input_shape; shorter sequences are padded with 0
    x = [[word_dictionary[word] for word in sent] for sent in df['evaluation']]
    x = pad_sequences(maxlen=input_shape, sequences=x, padding='post', value=0)
    y = [[label_dictionary[sent]] for sent in df['label']]
    # np_utils.to_categorical converts integer labels into one-hot vectors of
    # shape (nb_samples, nb_classes); see Section 5 for details
    y = [np_utils.to_categorical(label, num_classes=label_size) for label in y]
    y = np.array([list(_[0]) for _ in y])

    return x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary


# Create the deep learning model: Embedding + LSTM + Softmax
def create_LSTM(n_units, input_shape, output_dim, filepath):
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = load_data(filepath)
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size + 1, output_dim=output_dim,
                        input_length=input_shape, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(Dropout(0.2))
    model.add(Dense(label_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # plot_model may raise: ImportError: ('You must install pydot (`pip install pydot`)
    # and install graphviz ...'); see Section 5 for the fix
    plot_model(model, to_file='./model_lstm.png', show_shapes=True)
    # Print model information
    model.summary()

    return model


# Model training
def model_train(input_shape, filepath, model_save_path):
    # Split the dataset into training and test sets at a ratio of 9:1
    x, y, output_dictionary, vocab_size, label_size, inverse_word_dictionary = load_data(filepath, input_shape)
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.1, random_state=42)

    # Model hyperparameters; adjust these to your own needs
    n_units = 100
    batch_size = 32
    epochs = 5
    output_dim = 20

    # Train the model
    lstm_model = create_LSTM(n_units, input_shape, output_dim, filepath)
    lstm_model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=1)

    # Save the model
    lstm_model.save(model_save_path)

    # Number of test samples
    N = test_x.shape[0]
    predict = []
    label = []
    for start, end in zip(range(0, N, 1), range(1, N + 1, 1)):
        print(f'start:{start}, end:{end}')
        sentence = [inverse_word_dictionary[i] for i in test_x[start] if i != 0]
        y_predict = lstm_model.predict(test_x[start:end])
        print('y_predict:', y_predict)
        label_predict = output_dictionary[np.argmax(y_predict[0])]
        label_true = output_dictionary[np.argmax(test_y[start:end])]
        print(f'label_predict:{label_predict}, label_true:{label_true}')
        # Print the prediction for this sample
        print(''.join(sentence), label_true, label_predict)
        predict.append(label_predict)
        label.append(label_true)

    # Prediction accuracy
    acc = accuracy_score(predict, label)
    print('Accuracy of the model on the test set: %s' % acc)


if __name__ == '__main__':
    filepath = 'data/data_single.csv'
    input_shape = 180
    model_save_path = 'data/corpus_model.h5'
    model_train(input_shape, filepath, model_save_path)
```
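Once training finishes, the saved artifacts can be reused for inference on a new review. A minimal sketch, assuming the paths and pickled dictionaries produced by the code above; the input sentence is hypothetical, and unseen characters fall back to the padding id:

```python
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# Load the artifacts written by the training script above
model = load_model('data/corpus_model.h5')
with open('word_dict.pk', 'rb') as f:
    word_dictionary = pickle.load(f)
with open('label_dict.pk', 'rb') as f:
    label_dictionary = pickle.load(f)
output_dictionary = {i: label for label, i in label_dictionary.items()}

sentence = 'An example review'  # hypothetical input sentence
# Map characters to ids; characters not seen during training map to 0 (padding)
x = [[word_dictionary.get(ch, 0) for ch in sentence]]
x = pad_sequences(maxlen=180, sequences=x, padding='post', value=0)

y_predict = model.predict(x)
print(output_dictionary[np.argmax(y_predict[0])])
```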
5, Key function explanation
plot_model
If `from keras.utils import plot_model` reports an error in the code, change it to `from keras.utils.vis_utils import plot_model`.
After I changed it, I still got an error: `ImportError: ('You must install pydot (pip install pydot) and install graphviz (see instructions at https://graphviz.gitlab.io/download/)', 'for plot_model/model_to_dot to work.')`
Here are the solutions:
- (1) `pip install pydot_ng`
- (2) Install graphviz. Rather than running `pip install graphviz` directly, it is recommended to download it from the official website; I downloaded the version shown below. Unzip it into the site-packages directory of the corresponding Anaconda environment, then copy the path of its `bin` directory.
- (3) Modify `site-packages\pydot_ng\__init__.py` and add `path = r"D:\app\tech\anaconda3\envs\NLP\lib\site-packages\graphviz\bin"` in `Method 3` (the path points to the `bin` directory just copied), as shown in the figure:
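An alternative that avoids editing `pydot_ng` is to append the graphviz `bin` directory to the process `PATH` before calling `plot_model`. A sketch; the path below is the hypothetical one from step (3) and must match your own installation:

```python
import os

# Make the graphviz executables visible to pydot without editing site-packages
os.environ['PATH'] += os.pathsep + r"D:\app\tech\anaconda3\envs\NLP\lib\site-packages\graphviz\bin"
```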
np_utils.to_categorical
`np_utils.to_categorical` converts integer labels into a one-hot (binary) matrix of shape (nb_samples, nb_classes).
Assume num_classes = 10.
Then [1, 2, 3, ...... 4] is converted to:
[[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
......
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]
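This can be verified directly. A minimal sketch using the same old-style Keras import as the training code (in newer Keras versions the function lives at `keras.utils.to_categorical`):

```python
import numpy as np
from keras.utils import np_utils

labels = np.array([1, 2, 3, 4])
one_hot = np_utils.to_categorical(labels, num_classes=10)
print(one_hot.shape)  # (4, 10)
print(one_hot)        # each row has a single 1 at the label's index
```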
model.summary()
model.summary() prints the output shape and parameter count of each layer of the model, as shown in the figure:
Special thanks
This article draws on the blog of "The farmer's three fists hurt a little" and on an error-resolution reference link.