Finally updated! Let's explore natural language processing here. As we all know, RNNs are making waves in the NLP field, but I won't use an RNN here. Haha, so the next article will be an RNN application. Updates may come more slowly for a while; I want to catch up with the trend first, and I have other things going on, so everything feels a bit scattered.
Actually, I imported this post from markdown, but it was too large for CSDN to handle, so I deleted a lot of the output content. I could only keep the outputs up to the "Establish a dictionary" step in data processing, plus the NLP basics at the front. These are all very important parts.
In the field of artificial intelligence there are roses 🌹, but there are also thorns. I hope I can keep going!
Next up is RNN. See you there!!
I hope you will support my tutorials! I also hope my followers can apply for creator status as soon as possible! Thanks, everyone!!
TensorFlow system tutorial
The hard part of applying NLP is how to turn the input text into numbers, because a neural network is essentially a pile of numbers multiplied by weights plus biases.
- Text to numbers: word segmentation ==> word vector ==> word embedding
import tensorflow as tf
tf.__version__
'2.5.0'
tf.test.is_gpu_available()
WARNING:tensorflow:From C:\Users\LENOVO\AppData\Local\Temp/ipykernel_14068/337460670.py:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version. Instructions for updating: Use `tf.config.list_physical_devices('GPU')` instead. True
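As the warning says, tf.test.is_gpu_available() is deprecated. A minimal sketch of the recommended replacement, assuming the same environment:

# List the GPUs TensorFlow can see; a non-empty list means a GPU is available
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
print(len(gpus) > 0)  # plays the same role as the True above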
Project introduction
Sentiment analysis: also known as opinion mining or tendency analysis, it is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional color.
NLP prerequisite: word segmentation
- Install the jieba library: pip install jieba
- Take Chinese word segmentation as an example
pip install jieba
Note: you may need to restart the kernel to use updated packages. Incorrect syntax for file name, directory name, or volume label.
- jieba.cut takes three parameters: the first is the string to be segmented; the cut_all parameter controls whether full mode is used; the HMM parameter controls whether the HMM model is used
import jieba
text = 'jupyter is a very excellent AI author, handsome and nice, love it, love it!'
word_generator = jieba.cut(text)
print(list(word_generator))  # Precise mode
Building prefix dict from the default dictionary ... Loading model from cache C:\Users\LENOVO\AppData\Local\Temp\jieba.cache Loading model cost 0.618 seconds. Prefix dict has been built successfully. ['jupyter', 'yes', 'one person', 'very', 'excellent', 'of', 'AI', 'author', ',', 'Handsome man', 'also', 'good', ',', 'love', 'Yes', 'love', 'Yes', '!']
print(list(jieba.cut(text,cut_all=True,HMM=False))) # Full mode
['jupyter', 'yes', 'one person', 'very', 'excellent', 'of', 'AI', 'author', ',', 'people', 'Handsome', 'also', 'good', ',', 'love', 'Yes', 'love', 'Yes', '!']
print(list(jieba.cut_for_search(text))) # Search engine mode
['jupyter', 'yes', 'one person', 'very', 'excellent', 'of', 'AI', 'author', ',', 'Handsome man', 'also', 'good', ',', 'love', 'Yes', 'love', 'Yes', '!']
# You can also return a list directly
print(jieba.lcut(text))
['jupyter', 'yes', 'one person', 'very', 'excellent', 'of', 'AI', 'author', ',', 'Handsome man', 'also', 'good', ',', 'love', 'Yes', 'love', 'Yes', '!']
print(jieba.lcut_for_search(text))
['jupyter', 'yes', 'one person', 'very', 'excellent', 'of', 'AI', 'author', ',', 'Handsome man', 'also', 'good', ',', 'love', 'Yes', 'love', 'Yes', '!']
# Developers can specify their own custom dictionary to include words that are not in jieba's built-in vocabulary
jieba.load_userdict('Dictionaries.txt')
'''
Dictionaries.txt contents:
AI author
Love, love
'''
print(jieba.lcut(text))
['jupyter', 'yes', 'one person', 'very', 'excellent', 'of', 'AI author', ',', 'Handsome man', 'also', 'good', ',', 'Love, love', '!']
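Besides loading a dictionary file, jieba can also adjust the vocabulary at runtime. A minimal sketch, assuming the original Chinese text, where a custom entry would be a single word such as 'AI作者':

# Add or remove individual entries at runtime instead of editing Dictionaries.txt
jieba.add_word('AI作者')   # from now on 'AI作者' is kept as one token
jieba.del_word('AI作者')   # remove it again if it causes bad splits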
1, Data set
1. The IMDB dataset built into TF
imdb = tf.keras.datasets.imdb
(train_data,train_labels),(test_data,test_labels) = imdb.load_data()
<__array_function__ internals>:5: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. C:\Users\LENOVO\anaconda3\envs\tf\lib\site-packages\tensorflow\python\keras\datasets\imdb.py:155: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx]) C:\Users\LENOVO\anaconda3\envs\tf\lib\site-packages\tensorflow\python\keras\datasets\imdb.py:156: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
train_data.shape,train_labels.shape,test_data.shape,test_labels.shape
((25000,), (25000,), (25000,), (25000,))
train_labels[0] # 1 is a positive comment
1
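Each review in the built-in dataset is already a list of word indices. As a sanity check, here is a hedged sketch of decoding the first review back into words; the offset of 3 follows the default index_from argument of load_data, where indices 0 to 2 are reserved for padding, start and unknown tokens:

# Map indices back to words and decode the first training review
word_index = imdb.get_word_index()
reverse_index = {value + 3: key for key, value in word_index.items()}
decoded = ' '.join(reverse_index.get(i, '?') for i in train_data[0])
print(decoded[:200])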
2. Building the dataset yourself
- This approach is more realistic
- Basic idea:
- Obtain data and determine the data format specification
- Word segmentation: English text can be split on spaces, while Chinese segmentation can use jieba as shown above
- Build a word index table and give each word a numerical index number
- Paragraph text to word index vector
- Paragraph text to word embedding matrix
import os
import tarfile
import urllib.request
import numpy as np
import re
from random import randint
# Data address
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
# Data storage path
file_path = 'data/aclImdb_v1.tar.gz'

if not os.path.exists('data'):
    os.mkdir('data')

if not os.path.isfile(file_path):
    print('downloading')
    result = urllib.request.urlretrieve(url, filename=file_path)
    print('ok', result)
else:
    print(file_path, 'is existed!')
data/aclImdb_v1.tar.gz is existed!
# Decompress the data
if not os.path.exists('data/aclImdb'):
    tfile = tarfile.open(file_path, 'r:gz')
    print('extracting...')
    result = tfile.extractall('data/')  # extract the archive into the data directory
    print('ok', result)
else:
    print('data/aclImdb is existed')
data/aclImdb is existed
# Read the dataset (a side note: I'm not very familiar with re and need to study it more)
# Remove unnecessary characters from the text, such as HTML tags like <br />
def remove_tags(text):
    re_tag = re.compile(r'<[^>]+>')  # compile() turns the regular expression into a Pattern object
    return re_tag.sub('', text)      # sub('', text) replaces every matched tag with an empty string
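A quick check of remove_tags on a made-up snippet (the sample string below is just for illustration):

# Hypothetical sample to verify the tag-stripping behaviour
sample = 'A great movie.<br /><br />Would watch it again.'
print(remove_tags(sample))  # -> 'A great movie.Would watch it again.'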
# Wrap dataset reading in a function
def read_files(file_type):
    # 1) Save the paths of all files into file_list and count the number of positive and negative samples
    path = 'data/aclImdb/'
    file_list = []

    positive_file_path = path + file_type + '/pos/'
    for f in os.listdir(positive_file_path):
        file_list.append(positive_file_path + f)
    positive_num = len(file_list)

    negitave_file_path = path + file_type + '/neg/'
    for f in os.listdir(negitave_file_path):
        file_list.append(negitave_file_path + f)
    negitave_num = len(file_list) - positive_num

    print('read', file_type, ':', len(file_list))
    print('positive_num', positive_num)
    print('negitave_num', negitave_num)

    # 2) Build the labels ourselves, because the folder name is the label for each sample
    # (adding lists concatenates them; multiplying a list by a number repeats its contents)
    labels = [[1, 0]] * positive_num + [[0, 1]] * negitave_num

    # 3) Read all the text
    features = []
    for fi in file_list:
        with open(fi, 'rt', encoding='utf8') as f:
            features += [remove_tags(''.join(f.readlines()))]
    return features, labels
train_x, train_y = read_files('train')
test_x, test_y = read_files('test')
test_y = np.array(test_y)
train_y = np.array(train_y)
read train : 21247
positive_num 8747
negitave_num 12500
read test : 25000
positive_num 12500
negitave_num 12500
train_x[0] # features
'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
train_y[0] # Positive comments
array([1, 0])
2, Data processing
1. Establish a dictionary
token = tf.keras.preprocessing.text.Tokenizer(num_words=4000)  # keep only the 4000 most frequent words
token.fit_on_texts(train_x)  # build the dictionary from train_x
# Check how many documents (files) the token has read
token.document_count
21247
# Map words to their frequency rank (index)
print(token.word_index)
{'the': 1, 'a': 2, 'and': 3, 'of': 4, 'to': 5, 'is': 6, 'br': 7, 'in': 8, 'it': 9, 'i': 10, 'this': 11, 'that': 12, 'was': 13, 'as': 14, 'for': 15, 'movie': 16, 'with': 17, 'but': 18, 'film': 19, 'on': 20, 'not': 21, 'you': 22, 'are': 23, 'his': 24, 'have': 25, 'be': 26, 'he': 27, 'one': 28, 'all': 29, 'at': 30, 'by': 31, 'they': 32, 'an': 33, 'so': 34, 'like': 35, 'who': 36, 'from': 37, 'or': 38, 'just': 39, 'her': 40, 'about': 41, 'if': 42, 'out': 43, ckman': 5828, 'kennedy': 5829, 'net': 5830, 'creek': 5831, 'sniper': 5832, 'beowulf': 5833, 'headache': 5834, 'ariel': 5835, 'programs': 5836, 'insightful': 5837, 'gods': 5838, 'leaders': 5839, 'prominent': 5840, 'files': 5841, 'eleven': 5842, 'choosing': 5843, 'refers': 5844, 'evolution': 5845, 'hepburn': 5846, 'uplifting': 5847, 'triangle': 5848, 'lex': 5849, 'garner': 5850, 'accepts': 5851, 'outright': 5852, 'lasts': 5853, 'representation': 5854, 'teaches': 5855, 'spit': 5856, "anyone's": 5857, 'occasions': 5858, 'hats': 5859, 'popping': 5860, 'survives': 5861, 'studies': 5862, 'tossed': 5863, 'landed': 5864, 'terminator': 5865, 'femme': 5866, 'ish': 5867, 'continually': 5868, 'centre': 5869, 'incidentally': 5870, 'dismal': 5871, 'communicate': 5872, 'caricature': 5873, 'coat': 5874, 'chills': 5875, 'trivia': 5876, 'myth': 5877, '200': 5878, 'respective': 5879, 'damaged': 5880, 'marvel': 5881, 'affairs': 5882, "hitler's": 5883, 'motive': 5884, 'transformed': 5885, 'refuse': 5886, 'breakfast': 5887, 'unattractive': 5888, 'claude': 5889, 'underwear': 5890, 'pacific': 5891, 'misfortune': 5892, 'derivative': 5893, }
# Map each word to the number of documents (texts) it appears in during training
print(token.word_docs)
defaultdict(<class 'int'>, {'immediately': 363, 'some': 8229, '35': 79, 'right': 2349, 'recalled': 14, "teachers'": 1, 'believe': 1913, 'me': 6291, 'many': 4228, 'student': 271, 'pomp': 7, 'which': 6380, 'welcome': 171, 'school': 1065, 'who': 9377, 'remind': 128, 'inspector': 97, 'than': 6115, 'is': 19060, 'your': 3729, "isn't": 2242, 'situation': 478, 'through': 3449, 'years': 2977, 'of': 20179, 'financially': 20, 'students': 246, 'tried': 639, 'think': 4629, 'time': 7438, 'pettiness': 2, 'closer': 161, 'knew': 699, 'sack': 40, 'programs': 52, 'profession': 53, 'teaching': 68, 'to': 19978, 'high': 1587, 'the': 21072, 'burn': 106, 'their': 5874, 'episode': 836, 'see': 6790, 'insightful': 54, 'one': 11980, 'ran': 184, 'that': 17054, 'far': 2208, 'here': 3620, "high's": 1, 'expect': 924, 'i': 16439, 'my': 6881, 'repeatedly': 95, 'it': 18166, 'adults': 274, 'as': 13603, 'can': 6656, 'cartoon': 318, 'saw': 2319, 'line': 1398, 'pity': 194, 'satire': 183, 'in': 18691, "i'm": 3223, 'same': 2848, 'much': 6082, 'pathetic': 410, 'bromwell': 4, 'all': 11137, 'when': 7652, 'other': 5584, 'down': 2618, 'a': 20532, 'what': 8249, 'schools': 46, 'at': 11099, 'classic': 1247, 'about': 8957, 'such': 3461, 'comedy': 1960, 'lead': 991, 'whole': 2300, 'scramble': 6, 'teachers': 54, 'reality': 666, 'life': 3762, 'and': 20504, 'survive': 181, 'fetched': 85, 'age': 795, 'photography': 320, "i'd": 1016
# View the number of occurrences of each word in the Token
print(token.word_counts)
OrderedDict([('bromwell', 8), ('high', 1844), ('is', 90075), ('a', 137721), ('cartoon', 473), ('comedy', 2681), ('it', 67260), ('ran', 191), ('at', 20123), ('the', 283652), ('same', 3488), ('time', 10745), ('as', 39107), ('some', 13483), ('other', 7556), ('programs', 56), ('about', 14798), ('school', 1371), ('life', 5313), ('such', 4403), ('teachers', 59), ('my', 10528), ('35', 80), ('years', 3684), ('in', 78849), ('teaching', 72), ('profession', 57), ('lead', 1105), ('me', 9171), ('to', 115333), ('believe', 2168), ('that', 59452), ("high's", 1), ('satire', 226), ('much', 8375), ('closer', 174), ('reality', 814), ('than', 8513), ('scramble', 6), ('survive', 198), ('financially', 21), ('insightful', 56), ('students', 316), ('who', 17329), ('can', 9386), ('see', 9621), ('right', 2796), ('through', 4307), ('their', 9431), ('pathetic', 441), ("teachers'", 1), ('pomp', 8), ('pettiness', 2), ('of', 122635), ('whole', 2702), ('situation', 530), ('all', 20397), ('remind', 132), ('schools', 53), ('i', 66219), ('knew', 762), ('and', 136984), ('when', 11932), ('saw', 2643), ('episode', 1363), ('which', 10116), ('student', 319), ('repeatedly', 97), ('tried', 703), ('burn', 107), ('down', 3146), ('immediately', 386), ('recalled', 14), ('classic', 1458), ('line', 1613), ('inspector', 146), ("i'm", 4167), ('here', 4749), ('sack', 42), ('one', 22498), ('your', 4963), ('welcome', 179), ('expect', 1018), ('many', 5583), ('adults', 313), ('age', 912), ('think', 6182), ('far', 2591), ('fetched', 90), ('what', 13107), ('pity', 198), ("isn't", 2761), ('liked', 1214), ('film', 32981), ('action', 2839), ('scenes', 4471), ('were', 9332), ('very', 11536), ('interesting', 2666), ('tense', 123), ('well', 8701), ('done',
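Since word_counts holds the raw counts, the most frequent words, which are exactly what num_words=4000 keeps, can be inspected with a small sort. A minimal sketch:

# Show the 10 most frequent words in the training corpus
top10 = sorted(token.word_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)  # starts with ('the', 283652), ('a', 137721), ('and', 136984), ...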
2. Text to number list (word vector)
# Each text is mapped to a list of numbers, i.e. the frequency rank of each of its words
train_sequences = token.texts_to_sequences(train_x)
test_sequences = token.texts_to_sequences(test_x)
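To double-check the mapping, a sequence can be translated back into words by reversing word_index. A small sketch:

# Build a reverse lookup from index to word and decode the start of the first training sequence
index_word = {index: word for word, index in token.word_index.items()}
print(train_sequences[0][:10])
print([index_word[i] for i in train_sequences[0][:10]])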
3. Pad the converted number lists to the same length
'''
tf.keras.preprocessing.sequence.pad_sequences(
    sequences,           # a two-level nested list of integers or floats
    padding='post',      # 'pre' or 'post': pad with 0 at the beginning or at the end of each sequence
    truncating='post',   # 'pre' or 'post': truncate the sequence from the beginning or from the end
    maxlen=400)          # None or an integer, the maximum sequence length;
                         # longer sequences are truncated and shorter ones are padded with 0
'''
train_x = tf.keras.preprocessing.sequence.pad_sequences(train_sequences,
                                                        padding='post',
                                                        truncating='post',
                                                        maxlen=400)
test_x = tf.keras.preprocessing.sequence.pad_sequences(test_sequences,
                                                       padding='post',
                                                       truncating='post',
                                                       maxlen=400)
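The effect of padding and truncating is easiest to see on a toy input. A small sketch with the same settings but maxlen=5:

# Shorter sequences get 0s appended, longer ones are cut at the end
toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]
print(tf.keras.preprocessing.sequence.pad_sequences(toy, padding='post', truncating='post', maxlen=5))
# [[1 2 3 0 0]
#  [4 5 6 7 8]]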
3, Modeling
model = tf.keras.models.Sequential()

# Word embedding layer, which also acts as the input layer
'''
model.add(tf.keras.layers.Embedding(output_dim=32,    # dimension of the output word vectors
                                    input_dim=4000,   # size of the vocabulary: the maximum word index + 1
                                    input_length=400)) # length of each input sequence
'''
model.add(tf.keras.layers.Embedding(output_dim=32,
                                    input_dim=4000,
                                    input_length=400))
# Pooling layer (used here instead of a Flatten layer)
model.add(tf.keras.layers.GlobalAveragePooling1D())
# model.add(tf.keras.layers.Flatten())
# Fully connected layer
model.add(tf.keras.layers.Dense(units=256, activation='relu'))
# Dropout layer to prevent overfitting
model.add(tf.keras.layers.Dropout(0.3))
# Output layer
model.add(tf.keras.layers.Dense(units=2, activation='softmax'))
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 400, 32) 128000 _________________________________________________________________ global_average_pooling1d (Gl (None, 32) 0 _________________________________________________________________ dense (Dense) (None, 256) 8448 _________________________________________________________________ dropout (Dropout) (None, 256) 0 _________________________________________________________________ dense_1 (Dense) (None, 2) 514 ================================================================= Total params: 136,962 Trainable params: 136,962 Non-trainable params: 0 _________________________________________________________________
4, Training
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_x, train_y, validation_split=0.2, epochs=10, batch_size=128, verbose=1)
Epoch 1/10
133/133 [==============================] - 3s 14ms/step - loss: 0.6625 - accuracy: 0.6158 - val_loss: 0.6072 - val_accuracy: 0.6784
Epoch 2/10
133/133 [==============================] - 2s 12ms/step - loss: 0.3943 - accuracy: 0.8412 - val_loss: 0.3679 - val_accuracy: 0.8511
Epoch 3/10
133/133 [==============================] - 2s 12ms/step - loss: 0.2833 - accuracy: 0.8893 - val_loss: 0.3094 - val_accuracy: 0.8779
Epoch 4/10
133/133 [==============================] - 1s 10ms/step - loss: 0.2439 - accuracy: 0.9038 - val_loss: 0.3789 - val_accuracy: 0.8386
Epoch 5/10
133/133 [==============================] - 1s 11ms/step - loss: 0.2217 - accuracy: 0.9148 - val_loss: 0.2759 - val_accuracy: 0.8934
Epoch 6/10
133/133 [==============================] - 1s 11ms/step - loss: 0.2000 - accuracy: 0.9255 - val_loss: 0.3568 - val_accuracy: 0.8640
Epoch 7/10
133/133 [==============================] - 1s 11ms/step - loss: 0.1890 - accuracy: 0.9283 - val_loss: 0.3279 - val_accuracy: 0.8798
Epoch 8/10
133/133 [==============================] - 1s 11ms/step - loss: 0.1769 - accuracy: 0.9347 - val_loss: 0.3767 - val_accuracy: 0.8619
Epoch 9/10
133/133 [==============================] - 1s 10ms/step - loss: 0.1687 - accuracy: 0.9384 - val_loss: 0.3250 - val_accuracy: 0.8882
Epoch 10/10
133/133 [==============================] - 2s 12ms/step - loss: 0.1610 - accuracy: 0.9430 - val_loss: 0.4318 - val_accuracy: 0.8522
import matplotlib.pyplot as plt

def show_train_history(train_history, train_metrics, val_metrics):
    plt.plot(train_history[train_metrics])
    plt.plot(train_history[val_metrics])
    plt.title('Train History')
    plt.ylabel(train_metrics)
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()
show_train_history(history.history,'loss','val_loss')
show_train_history(history.history,'accuracy','val_accuracy')
The validation accuracy and loss keep fluctuating while the training metrics keep improving, so a rough read of the curves suggests the model is overfitting a little.
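One common mitigation is early stopping on the validation loss. A hedged sketch of how the fit call could be set up (the patience value is just a guess, not tuned here, and this would replace the fit above rather than run after it):

# Stop when val_loss has not improved for 2 consecutive epochs and restore the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2,
                                              restore_best_weights=True)
history = model.fit(train_x, train_y, validation_split=0.2, epochs=10,
                    batch_size=128, verbose=1, callbacks=[early_stop])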
5, Evaluation and prediction
model.evaluate(test_x, test_y, verbose=1)  # verbose: 0 = silent, 1 = progress bar, 2 = one line per epoch
782/782 [==============================] - 2s 3ms/step - loss: 0.3558 - accuracy: 0.8661
[0.35584765672683716, 0.8661199808120728]
pre = model.predict(test_x)
pre[0]
array([0.97998744, 0.0200126 ], dtype=float32)
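Each prediction is a pair of class probabilities, so the predicted label is just the argmax. A small sketch that also recomputes the test accuracy from the predictions:

# Index 0 means positive and index 1 means negative, following the [[1,0]] / [[0,1]] labels above
pred_labels = np.argmax(pre, axis=1)
true_labels = np.argmax(test_y, axis=1)
print(pred_labels[0])                       # 0, i.e. a positive review
print((pred_labels == true_labels).mean())  # should be close to the 0.8661 reported by evaluate()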
# Applying the model to a sentence I wrote myself
x = ["This is really a junk movie. Jupyter doesn't like it. Thank you! It's really bad"]
x = token.texts_to_sequences(x)
x = tf.keras.preprocessing.sequence.pad_sequences(x,
                                                  padding='post',
                                                  truncating='post',
                                                  maxlen=400)
x
array([[ 11, 6, 62, 2, 2356, 16, 147, 35, 9, 1298, 22, 44, 62, 71, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
y = model.predict(x)
y
array([[0.44359663, 0.55640334]], dtype=float32)
state = {0:'pos', 1:'neg'}
state[np.argmax(y)]
'neg'
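To reuse this on arbitrary text, the steps can be wrapped in a small helper. A minimal sketch (the function name is my own, not part of the original code):

# Hypothetical helper that reuses the tokenizer, padding settings and label mapping defined above
def predict_sentiment(text):
    seq = token.texts_to_sequences([text])
    seq = tf.keras.preprocessing.sequence.pad_sequences(seq, padding='post',
                                                        truncating='post', maxlen=400)
    prob = model.predict(seq)
    return state[np.argmax(prob)], prob

print(predict_sentiment("This movie is wonderful, I love it"))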