[NLP]⚠️Hit me if you can't learn it!⚠️ Basic Operations in Half an Hour, Part 9: Jingdong Review Classification

Summary

Starting today, we begin a natural language processing (NLP) journey. Natural language processing lets machines process, understand, and use human language, bridging the gap between machine language and human language.

RNN

RNN stands for Recurrent Neural Network. Compared with a CNN, an RNN is better suited to processing sequential information and to mining the relationships between earlier and later elements. In NLP tasks there is strong correlation between adjacent parts of a sentence; for example, P("The weather tomorrow is good") > P("The weather tomorrow is basketball").

Weight Sharing

Traditional neural networks:

RNN:

The weight sharing of an RNN is similar to that of a CNN: sharing one set of weights across all time steps greatly reduces the number of parameters.

Calculation process


Calculating the state:

Calculating the output:
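The figures for these two computations did not survive, so here is the standard recurrent update they depict, with assumed names (U, W, V, b, c are not from the original): the state is h_t = tanh(U·x_t + W·h_{t-1} + b) and the output is o_t = V·h_t + c. A minimal NumPy sketch:

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """One recurrent step: the same U, W, V are shared across all time steps."""
    h_t = np.tanh(U @ x_t + W @ h_prev + b)  # new state from input + previous state
    o_t = V @ h_t + c                        # output at this time step
    return h_t, o_t

rng = np.random.default_rng(0)
U = rng.standard_normal((2, 3))  # input -> state
W = rng.standard_normal((2, 2))  # state -> state
V = rng.standard_normal((1, 2))  # state -> output
b, c = np.zeros(2), np.zeros(1)

h = np.zeros(2)                           # initial state
for x_t in rng.standard_normal((4, 3)):   # a length-4 sequence of 3-dim inputs
    h, o = rnn_step(x_t, h, U, W, V, b, c)
print(h.shape, o.shape)
```

Because the same U, W, V are applied at every step, the parameter count does not grow with sequence length.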

LSTM

LSTM (Long Short-Term Memory) is a special kind of RNN that mitigates the vanishing- and exploding-gradient problems that arise when training on long sequences, and it performs better than a plain RNN on longer sequences. Compared with an RNN, which passes along only one state, ht (hidden state), an LSTM passes along two: ct (cell state) and ht (hidden state).

Stages

An LSTM controls what gets passed along through gates.

An LSTM step is divided into three phases:

  1. Forget phase: selectively forget parts of the state passed in from the previous node
  2. Selective memory phase: selectively process the current input, writing down the important parts and less of the unimportant ones
  3. Output phase: decide which parts of the state are emitted as the current output
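The three phases above map directly onto the gate equations of an LSTM cell. A minimal NumPy sketch of one step (bias terms omitted for brevity; the weight names Wf, Wi, Wo, Wc are assumptions, not from the original):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc):
    """One LSTM step; every gate reads the concatenation [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)             # forget gate: what to keep from c_prev
    i = sigmoid(Wi @ z)             # input gate: what to write
    o = sigmoid(Wo @ z)             # output gate: what to expose as h_t
    c_tilde = np.tanh(Wc @ z)       # candidate cell content
    c_t = f * c_prev + i * c_tilde  # phase 1 (forget) + phase 2 (selective memory)
    h_t = o * np.tanh(c_t)          # phase 3 (output)
    return h_t, c_t

rng = np.random.default_rng(1)
n_h, n_x = 4, 3
Wf, Wi, Wo, Wc = (rng.standard_normal((n_h, n_h + n_x)) for _ in range(4))
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.standard_normal(n_x), h, c, Wf, Wi, Wo, Wc)
```

Note how the cell state c_t is updated only by elementwise scaling and addition, which is what lets gradients flow across long sequences.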

Data introduction

About 30,000 reviews, divided into positive and negative reviews.


Good reviews:

0  To be a parent, we must have such a mentality as Liu Yong. We should keep learning, making progress and supplying ourselves with fresh blood to keep ourselves in good condition....
1  The author really has an English rigorous style, puts forward opinions, makes argument and argument, although I don't know much about physics, I still feel it...
2  The author's long argument supports his new idea by borrowing detailed reports on data processing and calculation results. Why once the Netherlands had the highest production in Europe in its counties...
3  The author's use of the word "embrace" before the war is stunning. If Japan is not defeated, it will be occupied by the U.S. Army and not delayed by bureaucracy....
4  As a young man, the author likes to read and can see that he has read so many classics that he has a huge inner world. His works are the most valuable....
5  The author has a professional caution that it might be better if he had the privilege to learn the original version. Simplified versions of books have more printing errors that affect the understanding of scholars....
6  The author talks about clear and transparent thoughts like water in the language of a poem, like an experienced and wise old man who helps us understand one after another...
7  The author suggests a way to work and live. As an old consultant, he can not only put forward ideas, but also practise them physically....
8  The author's wisdom connects the rock stars who emerged in endless numbers through the 1960s and 70s with his own story. What is nostalgia? What is rock...
9  The author's logic is tight and complete. There is no crap, it goes deep into the shallow, follows the good temptation and interacts with each other....

Negative reviews:

0  As a popular book with a reputation outside, it still says that foreign enterprises in Guangzhou should reasonably be similar to my living environment. But at a glance...
1  The author has a marked narcissism, and only wives whose husbands are not in work can live like her. Many methods are not practical and there are plagiarism....
2  The author writes this question purely from the perspective of a self-identified successor, which is very objective. Although he does not like it very much, however,...
3  The authors agree that they advocate internal tone and do not trust cosmetics. However, the methods listed are too cumbersome and the ingredients are not easy to find. They are not too practical.
4  The author's writing is general and his opinions are similar to those of his counterparts on the market. Readers are not recommended to buy them.
5  The author's writing is okay, but it feels too trivial and a little morbid and groaning. A liberal. The author's character is not exact, and he has no nationality....
6  The author is rather a small capitalist, but a little narcissistic; the books don't help much
7  As a novel describing the emotional life of the past, the author clearly has insufficient life experience and general literary background, and feels wasted after reading it...
8  As a personal experience, you can talk about it on the Internet, but you've already taken the book out a bit and it still has some obvious fallacies. But the writing is good, I suggest...
9  I just wrote a comment excitedly yesterday, and a fuss happened today: having recommended this set of books to many friends, my friends dragged me into buying online, but before the results...

Code

Preprocessing

import numpy as np
import pandas as pd
import jieba


# Read stopwords
stop_words = pd.read_csv("stopwords.txt", index_col=None, names=["stop_word"])
stop_words = stop_words["stop_word"].values.tolist()

def load_data():

    # Read data
    neg = pd.read_excel("neg.xls", header=None)
    pos = pd.read_excel("pos.xls", header=None)

    # Debug Output
    print(neg.head(10))
    print(pos.head(10))

    # Combine positive and negative samples
    x = np.concatenate((pos[0], neg[0]))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))

    # Generate df
    data = pd.DataFrame({"content": x, "label": y})
    print(data.head())


    data.to_csv("data.csv")

def pre_process(text):

    # Tokenize with jieba
    text = jieba.lcut(text)


    # Remove Numbers
    text = [w for w in text if not str(w).isdigit()]

    # Remove empty / whitespace-only tokens
    text = list(filter(lambda w: w.strip(), text))

    # # Remove characters of length 1
    # text = list(filter(lambda w: len(w) > 1, text))

    # Remove stopwords
    text = list(filter(lambda w: w not in stop_words, text))

    return " ".join(text)

if __name__ == '__main__':

    # Build data.csv from the raw Excel files, then read it back
    load_data()
    data = pd.read_csv("data.csv")

    # Preprocessing
    data["content"] = data["content"].apply(pre_process)

    # Save
    data.to_csv("processed.csv", index=False)
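The filters inside pre_process can be sanity-checked in isolation. This sketch applies the same three filters to an already-tokenized list (the token list and the stopword list here are made-up stand-ins for jieba's output and stopwords.txt):

```python
# Stand-in stopword list; the real script loads stopwords.txt
stop_words = ["the", "a", "is"]

def clean_tokens(tokens):
    tokens = [w for w in tokens if not str(w).isdigit()]  # remove pure-digit tokens
    tokens = [w for w in tokens if w.strip()]             # remove blank tokens
    tokens = [w for w in tokens if w not in stop_words]   # remove stopwords
    return " ".join(tokens)

print(clean_tokens(["the", "battery", "is", "great", "5", " ", "stars"]))
# battery great stars
```

The result is re-joined with spaces so that Keras's whitespace-based Tokenizer can split it again later.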

Main script

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


def tokenizer():

    # Read data
    data = pd.read_csv("processed.csv", index_col=False)
    print(data.head())

    # Convert to tuple
    X = tuple(data["content"])

    # Instantiate tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)

    # Fit the tokenizer on the corpus
    tokenizer.fit_on_texts(X)

    # Vocabulary (word -> index)
    word_index = tokenizer.word_index
    # print(word_index)
    print(len(word_index))

    # Convert texts to integer sequences
    sequence = tokenizer.texts_to_sequences(X)

    # Pad / truncate every sequence to length 100
    characters = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=100)

    # One-hot encode labels
    labels = tf.keras.utils.to_categorical(data["label"])

    # Split Dataset
    X_train, X_test, y_train, y_test = train_test_split(characters, labels, test_size=0.2,
                                                        random_state=0)

    return X_train, X_test, y_train, y_test


def main():

    # Load the tokenized data
    X_train, X_test, y_train, y_test = tokenizer()
    print(X_train[:5])
    print(y_train[:5])

    # Hyperparameters
    EMBEDDING_DIM = 200  # embedding dimension
    optimizer = tf.keras.optimizers.RMSprop()  # optimizer
    loss = tf.keras.losses.CategoricalCrossentropy()  # from_logits=False: the final layer already applies softmax

    # Model
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(30001, EMBEDDING_DIM),
        tf.keras.layers.LSTM(200, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax")
    ])
    model.build(input_shape=[None, 100])  # match maxlen=100 used in pad_sequences
    print(model.summary())

    # Compile
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

    # Checkpoint: keep the model with the best validation accuracy
    checkpoint = tf.keras.callbacks.ModelCheckpoint("model/jindong.h5py", monitor='val_accuracy', verbose=1,
                                                    save_best_only=True,
                                                    mode='max')

    # train
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=32, callbacks=[checkpoint])


if __name__ == '__main__':
    main()
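To make the padded matrices in the output below easier to read, here is a pure-Python sketch of what Tokenizer.texts_to_sequences and pad_sequences do: words are ranked by frequency (index 1 = most frequent), each text becomes a list of those indices, and every sequence is left-padded with zeros to a fixed length. The example texts are made up, and the real Tokenizer additionally lowercases and strips punctuation:

```python
from collections import Counter

texts = ["good product fast delivery", "bad product"]

# Rank words by frequency, like Tokenizer.fit_on_texts (1 = most frequent)
counts = Counter(w for t in texts for w in t.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

# texts_to_sequences: replace each word with its index
sequences = [[word_index[w] for w in t.split()] for t in texts]

def pad(seq, maxlen):
    # pad_sequences defaults: left-pad with zeros, truncate from the left
    return [0] * max(0, maxlen - len(seq)) + seq[-maxlen:]

padded = [pad(s, 6) for s in sequences]
print(padded)
```

Index 0 is reserved for padding, which is why the Embedding layer above has 30001 rows for num_words=30000.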

Output results:

   Unnamed: 0                                            content  label
0           0  To be a parent, we must have the mentality of Liu Yong, who keeps learning and progressing, and keeps giving ...      1
1           1  The author really has the rigorous style of the British people to put forward a point of view for discussion and demonstration, despite my own knowledge of physics....      1
2           2  Author's lengthy argument supports his new idea by borrowing detailed reports on data processing and calculation results....      1
3           3  The author used the word "embrace" before the war. Stunning. If Japan is not defeated ...      1
4           4  When he was a teenager, the author liked to read so much that he could read so many classics that he had a huge size...      1
49366
[[    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0   205  1808   119    40    56  2139  1246   434  3594  1321  1715
      9   165    15    22]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0  1157     8  3018     1    62   851    34     4    23   455   365
     46   239  1157  3903]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0  1579    53   388   958   294  1146    18     1    49  1146   305
   2365     1   496   235]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0   213  4719   509
    730 21403   524    42]
 [    0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0   105   159     1     5    16    11
     24     2   299   294     8    39   306 16796    11  1778    29  2674
    640     2   543  1820]]
[[0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
2021-09-20 18:59:07.031583: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2021-09-20 18:59:07.031928: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-20 18:59:07.037546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-VVCH1JQ
2021-09-20 18:59:07.037757: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-VVCH1JQ
2021-09-20 18:59:07.043925: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 200)         6000200   
_________________________________________________________________
lstm (LSTM)                  (None, 200)               320800    
_________________________________________________________________
dropout (Dropout)            (None, 200)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                12864     
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
=================================================================
Total params: 6,333,994
Trainable params: 6,333,994
Non-trainable params: 0
_________________________________________________________________
None
2021-09-20 18:59:07.470578: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/2
C:\Users\Windows\Anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:4870: UserWarning: "`categorical_crossentropy` received `from_logits=True`, but the `output` argument was produced by a sigmoid or softmax activation and thus does not represent logits. Was this intended?"
  '"`categorical_crossentropy` received `from_logits=True`, but '
528/528 [==============================] - 272s 509ms/step - loss: 0.3762 - accuracy: 0.8476 - val_loss: 0.2835 - val_accuracy: 0.8839

Epoch 00001: val_accuracy improved from -inf to 0.88391, saving model to model\jindong.h5py
2021-09-20 19:03:40.563733: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Epoch 2/2
528/528 [==============================] - 299s 566ms/step - loss: 0.2069 - accuracy: 0.9266 - val_loss: 0.2649 - val_accuracy: 0.9005

Epoch 00002: val_accuracy improved from 0.88391 to 0.90050, saving model to model\jindong.h5py

Tags: neural networks Deep Learning NLP

Posted on Tue, 21 Sep 2021 12:03:23 -0400 by tlavelle