Text sentiment analysis: text representation based on word2vec, GloVe and fastText word vectors

In the last post, we used bag-of-words models, including the word-frequency matrix, TF-IDF matrix, LSA and n-grams, to construct text features and classify the sentiment of movie reviews on Kaggle.

This post is still about text feature engineering, this time using word embeddings to construct the text features: we represent the text with word2vec, GloVe and fastText word vectors and train a random forest classifier.

1. Training word2vec and fastText word vectors

Kaggle's sentiment analysis competition provides three datasets. One is a labeled training set with 25,000 reviews. Another is an unlabeled test set, used to make predictions and submit results. These two datasets were used in the last post.

There is also an unlabeled dataset with 50,000 reviews, which comes in handy: we can use the unlabeled data to train word2vec word vectors. Compared with the bag-of-words model, word2vec vectors avoid the extremely high dimensionality of the text representation and take the positional relationship between words into account. Representing the text with word2vec vectors may therefore give better predictions.

In addition, we can train fastText word vectors. fastText is a model for text classification, and word vectors are its by-product. Its architecture is similar to the CBOW model of word2vec, but the input is the whole text rather than just a context window. Moreover, it uses character-level n-grams to build the vector representation of a word, which captures the semantic association between words that share the same affixes.
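
As a rough illustration of the character n-gram idea (a toy sketch, not gensim's internal code), a word such as "where" with n-grams of length 3 decomposes like this:

# Toy sketch of fastText-style character n-gram decomposition (illustration only)
def char_ngrams(word, min_n=3, max_n=3):
    token = '<' + word + '>'   # fastText pads each word with boundary symbols
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]

print(char_ngrams('where'))
# ['<wh', 'whe', 'her', 'ere', 're>']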

Gensim provides training implementations for both word2vec and fastText word vectors, with very similar usage. However, the fastText module in gensim can apparently only be used to train word vectors, not to do fastText text classification.
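
As an aside, and only as a hedged sketch: supervised fastText classification would be done with Facebook's standalone fasttext package rather than gensim (assuming that package is installed); 'train.txt' below is a hypothetical training file in fastText's "__label__<y> <text>" line format.

# Aside: supervised fastText classification via the standalone fasttext package (not gensim)
# 'train.txt' is a hypothetical training file with lines like "__label__1 some review text"
import fasttext
clf = fasttext.train_supervised(input='train.txt')
print(clf.predict('what a wonderful movie'))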

First import the required libraries.

import os,re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup
from gensim import models

Then read the labeled training data and unlabeled data, and merge the movie reviews into a single list.

"""Read data, including labeled and unlabeled data"""

# Define a function to read the data
def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    data_file = os.path.join('..', 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of reviews: {}'.format(len(df)))
    return df

# Read labeled and unlabeled data
df_labeled = load_dataset('labeled_train')
df_unlabeled = load_dataset('unlabeled_train')

sentences = []

for s in df_labeled['review']:
    sentences.append(s)

for s in df_unlabeled['review']:
    sentences.append(s)
    
print('{} reviews loaded.'.format(len(sentences)))

Number of reviews: 25000
Number of reviews: 50000
75000 reviews loaded.

Then the data is preprocessed into the format required by gensim. This step matters: I fumbled around for a while before figuring out what the correct input format is.

The input format looks like this: if there are two texts, they are processed into [['with', 'all', 'this', 'stuff', 'going', ...], ['movie', 'but', 'MJ', 'and', 'most', ...]]. Each text becomes a list whose elements are individual words. This is easy for English, which needs no word segmentation: just call text.split() to split on whitespace.
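
For example, a toy sketch with two made-up snippets, just to show the shape of the input:

# Toy example of the input format gensim expects: a list of tokenized texts
texts = ["With all this stuff going down at the moment",
         "The movie is great and MJ is most impressive"]
tokenized = [t.lower().split() for t in texts]
print(tokenized)
# [['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment'],
#  ['the', 'movie', 'is', 'great', 'and', 'mj', 'is', 'most', 'impressive']]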

Because word2vec depends on context, and that context may include stop words, we choose not to remove stop words here.

"""Data preprocessing: remove HTML tags and non-alphabetic characters"""

eng_stopwords = {}.fromkeys([ line.rstrip() for line in open('../stopwords.txt')])

# You can choose whether to remove stop words. Because word2vec depends on context,
# which may include stop words, we do not remove them here.
def clean_text(text, remove_stopwords=False):
    text = BeautifulSoup(text,'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words

sentences = [clean_text(s) for s in sentences]
# This is the key step: gensim needs each review in the format ['with', 'all', 'this', 'stuff', 'going', ...]
# Again, this is crucial. With the wrong format, the model cannot learn anything.

Now we can feed this input into word vector training. After training, there are two ways to keep the model. The first is to save the model itself as a binary file: you cannot read the words and vectors by opening it, but you can continue training it on more data later. The second is to save the words and their vectors in a plain txt format: no further training is possible, but you can open the file and read the words and vectors directly. The following code saves the model in both ways.

"""Print log information"""

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

"""Set the parameters of word vector training and start training"""

num_features = 300      # 300-dimensional word vectors
min_word_count = 40     # Ignore words that appear fewer than 40 times
num_workers = 4         # Number of parallel worker threads
context = 10            # Size of the context window
model_ = 0              # 0 = train with the CBOW model (1 would be skip-gram)

model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)

print('Training model...')
model = models.Word2Vec(sentences, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sg=model_)

# Save the model
# The first method saves a binary file that cannot be read in a text editor, but it keeps all the training state, so training can be resumed after loading
# The second method saves the word2vec text format; some information such as the vocabulary tree is lost, so no additional training is possible

model.save(os.path.join('..', 'models', model_name))

model.wv.save_word2vec_format(os.path.join('..','models','word2vec_txt.txt'),binary = False)

Load the saved model and extract the word vector.

# There are two ways to load the model, depending on the format it was saved in

model = models.Word2Vec.load(os.path.join('..', 'models', model_name))
model_txt = models.KeyedVectors.load_word2vec_format(os.path.join('..','models','word2vec_txt.txt'),binary = False)

# The word vectors of several words can be extracted at once
model.wv[['man','woman','guy']]
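
Because the first format keeps the full training state, the loaded model can also be trained further on extra data. A minimal sketch, assuming gensim 3.x and a hypothetical list new_sentences of additional tokenized reviews:

# Sketch: continue training the fully saved model (the txt format cannot do this)
# 'new_sentences' is a hypothetical extra corpus in the same tokenized-list format
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)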

To check how well the model trained, look at the words most similar to "man"; the results look reasonable.

model.wv.most_similar("man")
[('woman', 0.6039960384368896),
 ('lady', 0.5690498948097229),
 ('lad', 0.5434065461158752),
 ('guy', 0.4913134276866913),
 ('person', 0.4771265387535095),
 ('monk', 0.47647857666015625),
 ('widow', 0.47423964738845825),
 ('millionaire', 0.4719209671020508),
 ('soldier', 0.4717007279396057),
 ('men', 0.46545034646987915)]

Next we train the fastText word vectors. The input format is the same as for word2vec, so we train directly on the data above.

A small complaint: although fastText is said to be blazingly fast at text classification, training the word vectors is very slow here, noticeably slower than word2vec, and memory usage hits 99% from time to time.

model_name_2 = 'fasttext.model'

print('Training model...')
model_2 = models.FastText(sentences, size=num_features, window=context, min_count=min_word_count,\
                        sg = model_, min_n = 2 , max_n = 3)

# Save the model
# The first method saves a binary file that cannot be read in a text editor, but it keeps all the training state, so training can be resumed after loading
# The second method saves the word2vec text format; some information such as the vocabulary tree is lost, so no additional training is possible
model_2.save(os.path.join('..', 'models', model_name_2))
model_2.wv.save_word2vec_format(os.path.join('..','models','fasttext.txt'),binary = False)

Again look at the words most similar to "man" (this query is also quite slow). The results differ markedly from word2vec: most of the neighbors simply share the suffix "man".

model_2.wv.most_similar("man")
[('woman', 0.6353151798248291),
 ('boman', 0.6015676856040955),
 ('wolfman', 0.5951900482177734),
 ('wyman', 0.5888750553131104),
 ('snowman', 0.5807067155838013),
 ('madman', 0.5781949162483215),
 ('gunman', 0.5617127418518066),
 ('henchman', 0.5536723136901855),
 ('guffman', 0.5454517006874084),
 ('kidman', 0.5268094539642334)]

2. Text representation with word2vec, GloVe and fastText word vectors

Next, we use word2vec, GloVe and fastText word vectors to represent the movie review texts and train a random forest classifier on each, to see which word vectors work best.

I opened a new Jupyter notebook for this part. First import the required libraries.

import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

from nltk.corpus import stopwords

from gensim import models

from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics

Read training set data.

"""Read training set data"""

def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    data_file = os.path.join('..', 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of reviews: {}'.format(len(df)))
    return df

df = load_dataset('labeled_train')

Load the word2vec vectors we just trained, plus the pre-trained GloVe vectors (you need to download them first). Since I'm worried about memory, I don't load the fastText vectors yet.

"""Load the trained word2vec model"""

model_name_w2v = '300features_40minwords_10context.model'
word2vec_embedding = models.Word2Vec.load(os.path.join('..', 'models', model_name_w2v))

"""Read the GloVe word vectors"""

glove_embedding = {}
f = open('../glove.6B/glove.6B.300d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    glove_embedding[word] = coefs
f.close()

Each movie review in the training set is represented by a single vector: we look up the word vector of every word in the review and then average them to get a vector representation of the sentence or text.

This gives us the word2vec and GloVe representations of the movie reviews.

"""Data preprocessing: get word vectors and sentence vectors"""

# The encoding is a bit crude: simply average the word vectors of the words in the review

eng_stopwords = set(stopwords.words('english'))

# Cleaning text data
def clean_text(text, remove_stopwords=False):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words

# Look up word2vec, GloVe or fastText word vectors and average them over a review
def to_review_vector(review, model='word2vec'):
    words = clean_text(review, remove_stopwords=True)
    if model == 'word2vec':
        array = np.asarray([word2vec_embedding.wv[w] for w in words if w in word2vec_embedding.wv], dtype='float32')
    elif model == 'glove':
        array = np.asarray([glove_embedding[w] for w in words if w in glove_embedding], dtype='float32')
    elif model == 'fasttext':
        array = np.asarray([fasttext_embedding.wv[w] for w in words if w in fasttext_embedding.wv], dtype='float32')
    else:
        raise ValueError("model must be one of: 'word2vec', 'glove', 'fasttext'")
    if array.size == 0:
        # No word of this review is in the vocabulary: fall back to a zero vector (all embeddings here are 300-d)
        return np.zeros(300, dtype='float32')
    return array.mean(axis=0)


"""Samples represented with word2vec"""
train_data_word2vec = [to_review_vector(text, 'word2vec') for text in df['review']]

"""Samples represented with GloVe"""
train_data_glove = [to_review_vector(text, 'glove') for text in df['review']]

The word2vec-represented samples are used to train a random forest model, with the out-of-bag (OOB) estimate as the measure of generalization error.

From the results, the out-of-bag estimate is 0.83568, while the model trained on the word-frequency matrix had an out-of-bag estimate of 0.84232, so the result is slightly worse than the earlier bag-of-words model.

def model_eval(train_data):
    # Note: uses the global `forest` classifier fitted below

    print("1. Confusion matrix:\n")
    print(metrics.confusion_matrix(df.sentiment, forest.predict(train_data)))

    print("\n2. Precision, recall and F1 score:\n")
    print(metrics.classification_report(df.sentiment, forest.predict(train_data)))

    print("\n3. Out-of-bag estimate:\n")
    print(forest.oob_score_)

    print("\n4. AUC score:\n")
    y_predprob = forest.predict_proba(train_data)[:,1]
    print(metrics.roc_auc_score(df.sentiment, y_predprob))


"""Train and evaluate a model on the word2vec representation"""

forest = RandomForestClassifier(oob_score=True, n_estimators=200, random_state=42)
forest = forest.fit(train_data_word2vec, df.sentiment)
print("\n==================== Evaluating the model trained on the word2vec text representation ====================\n")
model_eval(train_data_word2vec)

Then a model is trained on the training set represented by the GloVe word vectors. Unfortunately, the out-of-bag estimate is 0.78556, so its generalization performance is worse.

"""Train and evaluate a model on the GloVe representation"""

forest = RandomForestClassifier(oob_score=True, n_estimators=200, random_state=42)
forest = forest.fit(train_data_glove, df.sentiment)
print("\n==================== Evaluating the model trained on the GloVe text representation ====================\n")
model_eval(train_data_glove)

 

Finally, a classifier is trained on the samples represented by the fastText word vectors. The out-of-bag estimate is 0.81112, again worse than word2vec.

# Free memory before loading the fastText model
del word2vec_embedding
del glove_embedding
del train_data_word2vec
del train_data_glove
del forest

"""Load the trained fastText model"""

model_name_fast = 'fasttext.model'
fasttext_embedding = models.FastText.load(os.path.join('..', 'models', model_name_fast))

"""Samples represented with fastText"""
train_data_fasttext = [to_review_vector(text, 'fasttext') for text in df['review']]

"""Train and evaluate a model on the fastText representation"""

forest = RandomForestClassifier(oob_score=True, n_estimators=200, random_state=42)
forest = forest.fit(train_data_fasttext, df.sentiment)
print("\n==================== Evaluating the model trained on the fastText text representation ====================\n")
model_eval(train_data_fasttext)

3. Postscript

I had trained Chinese word vectors with gensim before, but after not using it for a while I forgot the input format. This time it served as a good refresher.

As the results above show, at least on this task, word2vec performs better than GloVe and fastText.


Posted on Tue, 21 Apr 2020 06:58:19 -0400 by chamal