Using Keras to build a story generator based on an LSTM model

Schematic diagram of LSTM network operation

What is an LSTM network?

LSTM (Long Short-Term Memory) is a special kind of recurrent neural network (RNN).
By updating its cell state, an LSTM can learn long-term dependencies between the elements of a sequence. It is widely used in machine translation, speech recognition and other fields.
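
To make "updating the cell state" more concrete, here is a minimal NumPy sketch of one step of a standard LSTM cell, applied to a toy sequence. The helper name lstm_step, the stacked weight layout and the random weights are assumptions made for illustration; this is not how Keras implements the layer internally.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell (hypothetical helper, illustration only).

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    Assumed row order: input gate, forget gate, candidate, output gate.
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much of the old state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the state to expose
    c = f * c_prev + i * g                # updated cell state (the long-term memory)
    h = o * np.tanh(c)                    # new hidden state (the step's output)
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape, c.shape)
```

The forget gate f decides how much of the previous cell state survives each step, which is what lets information from early in a sequence reach later steps.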

Background of LSTM

As you read this article, you understand each word in light of the words you have read before.
You could not grasp the meaning of the text if you had to interpret each part in isolation, without that earlier context.

One of the main disadvantages of a traditional neural network is that it cannot work this way: it can only use short-term memory.
Once the data sequence becomes long, it is difficult to carry information from the early steps through to the later ones.

Consider the following two sentences.
If we want to predict the missing word in the first sentence, the best prediction is "Telugu", because the context tells us that the sentence is about the mother tongue of Hyderabad.
Such a prediction is very basic for a human being, but it is very difficult for an artificial neural network.

The word "Hyderabad" indicates that the language should be Telugu, but "Hyderabad" appears at the beginning of the sentence.
Therefore, in order to predict accurately, the neural network must remember the whole sequence of words.
That is exactly what an LSTM can do.

Programming LSTM

This article develops a story generator with an LSTM network. It uses natural language processing (NLP) for data preprocessing and a bidirectional LSTM for the model.

Step 1: dataset preparation

Create a text library of short stories on a variety of subjects and save it as "stories.txt".
A fragment of the text library is as follows:

Frozen grass crunched beneath the steps of a shambling man. His shoes were crusted and worn, and dirty toes protruded from holes in the sides. His quivering eye scanned the surroundings: a freshly paved path through the grass, which led to a double swingset, and a picnic table off to the side with a group of parents lounging in bundles, huddled to keep warm. Squeaky clean-and-combed children giggled and bounced as they weaved through the pathways with their hot breaths escaping into the air like smoke.

Step 2: import the libraries and load the data

Next, we import the necessary libraries and load the dataset.
We use the Keras framework that ships with TensorFlow 2.0.

    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras import regularizers
    import tensorflow.keras.utils as ku
    import numpy as np
    import tensorflow as tf
    import pickle

    data = open('stories.txt', encoding="utf8").read()

Step 3: preprocess the data using the NLP library

First, we convert all the data to lowercase and split it into lines to get a Python list of sentences.
We convert to lowercase because a word means the same thing whatever its case. For example, "Doctor" and "doctor" are the same word, but the model would otherwise treat them as two different words.

Then we tokenize the words, that is, we map each word to an integer. After fitting, the tokenizer's word_index attribute holds a dictionary of key-value pairs, where the key is the word and the value is the token for that word.

    # Converting the text to lowercase and splitting it
    corpus = data.lower().split("\n")

    # Tokenization
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    print(total_words)
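
As a rough illustration of what such a word index contains, the following pure-Python sketch builds a similar dictionary by hand: words are lowercased, counted, and numbered from 1 in order of decreasing frequency. The helper build_word_index and the two toy lines are hypothetical. The Keras Tokenizer also strips punctuation and has further options that this sketch ignores.

```python
from collections import Counter

def build_word_index(lines):
    """Hypothetical helper: map each word to an integer, most frequent word first."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # most_common() orders by descending count; index 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = ["The doctor saw the Doctor", "the grass crunched"]
word_index = build_word_index(toy_corpus)
total_words = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(total_words)
```

Because of the lowercasing, "Doctor" and "doctor" share a single entry, and the extra 1 in total_words accounts for index 0, which is reserved for padding.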

The next step is to convert each sentence to a list of integers based on this word index. This turns a line of text, such as "frozen grass crunched beneath the steps", into a list of tokens that represent the words. Each line is then expanded into all of its prefixes of two or more tokens, so that every word after the first serves once as a prediction target.

Then we make all the sequences the same length, since sequences of varying length are difficult to train a neural network on. We traverse all the sequences to find the longest one, and once we have its length we pad every sequence to that length.

At the same time, we split the data into inputs (features) and outputs (labels). The input is every token of a sequence except the last, and the output is the last token.

Finally, we one-hot encode the labels, because this is really a classification problem: given a sequence of words, the model predicts the next word by choosing among all the words in the vocabulary.

    # create input sequences using list of tokens
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    # pad sequences
    max_sequence_len = max([len(x) for x in input_sequences])
    print(max_sequence_len)
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

    # create predictors and label
    predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
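
To see what these steps produce, here is a small worked example in plain Python on a single made-up token list. The token values and the vocabulary size are invented for illustration; the logic mirrors the prefix, pre-padding, split and one-hot steps described above without using Keras.

```python
token_list = [4, 7, 2, 9]   # one tokenized line (made-up token values)
total_words = 10            # made-up vocabulary size, including padding index 0

# every prefix of two or more tokens becomes one training sequence
sequences = [token_list[:i + 1] for i in range(1, len(token_list))]

# pre-padding: zeros go in front, so the label is always the last element
max_len = max(len(s) for s in sequences)
padded = [[0] * (max_len - len(s)) + s for s in sequences]

# inputs are all tokens but the last; the label is the last token
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

one_hot_labels = [one_hot(y, total_words) for y in labels]

for x, y in zip(predictors, labels):
    print(x, "->", y)
# [0, 0, 4] -> 7
# [0, 4, 7] -> 2
# [4, 7, 2] -> 9
```

Padding at the front rather than the back keeps the most recent words next to the prediction target and makes the last column the label for every row.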

Step 4: build a model

With the training data set, we can build the required model:

    model = Sequential()
    model.add(Embedding(total_words, 300, input_length=max_sequence_len-1))
    model.add(Bidirectional(LSTM(200, return_sequences=True)))
    model.add(Dropout(0.2))
    model.add(LSTM(100))
    # Dense expects an integer number of units, so use integer division
    model.add(Dense(total_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()  # summary() prints the table itself
    history = model.fit(predictors, label, epochs=200, verbose=0)

The first layer is the embedding layer. Its first parameter is the number of distinct words the model handles; since we want to handle every word, we pass total_words. The second parameter is the dimension of the word embedding vectors; it can be tuned freely, and different values will give different predictions. The third parameter is the length of the input sequence: because the input is every token of the original sequence except the last, we subtract one from max_sequence_len.
This is followed by a bidirectional LSTM layer, a dropout layer, a second LSTM layer and two Dense layers, the last of which uses softmax to output a probability for every word.
For the loss function we use categorical cross-entropy, and for the optimizer we choose Adam.
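
For a one-hot label, categorical cross-entropy reduces to the negative logarithm of the probability that the softmax output assigns to the correct word. The sketch below works this out for a single example with made-up probabilities. The function is a simplified illustration: Keras additionally clips the probabilities for numerical stability and averages the loss over the batch.

```python
import math

def categorical_crossentropy(one_hot_label, predicted_probs):
    """Cross-entropy for a single example: -sum(y * log(p))."""
    return -sum(
        y * math.log(p)
        for y, p in zip(one_hot_label, predicted_probs)
        if y > 0
    )

label = [0, 0, 1, 0]                    # the correct next word has index 2
confident = [0.05, 0.05, 0.85, 0.05]    # made-up softmax output, mostly right
unsure = [0.25, 0.25, 0.25, 0.25]       # made-up softmax output, uniform guess

loss_confident = categorical_crossentropy(label, confident)
loss_unsure = categorical_crossentropy(label, unsure)

print(round(loss_confident, 4))
print(round(loss_unsure, 4))
```

The loss is small when the model puts most of its probability on the correct word and grows as that probability shrinks, which is why a falling training loss goes together with rising accuracy.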

Step 5: result analysis

To judge how well the training went, we mainly look at the accuracy and the loss.

    import matplotlib.pyplot as plt

    acc = history.history['accuracy']
    loss = history.history['loss']
    epochs = range(len(acc))

    plt.plot(epochs, acc, 'b', label='Training accuracy')
    plt.title('Training accuracy')
    plt.figure()

    plt.plot(epochs, loss, 'b', label='Training Loss')
    plt.title('Training loss')
    plt.legend()
    plt.show()

The graphs show the training accuracy rising steadily while the loss falls, which indicates that the model is fitting the training data better and better.

Step 6: save model

The following code saves the trained model for later deployment.

    # serialize model to JSON
    model_json = model.to_json()
    with open("model.json", "w") as json_file:
        json_file.write(model_json)

    # serialize weights to HDF5
    model.save_weights("model.h5")
    print("Saved model to disk")
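
The saved architecture and weights are not enough on their own to generate text later, because prediction also relies on the tokenizer's word index and on max_sequence_len. One possible way to keep them is sketched here with the standard pickle module and a stand-in dictionary. The file name, the dictionary layout and the toy values are assumptions and are not part of the original code.

```python
import os
import pickle
import tempfile

# stand-ins for tokenizer.word_index and max_sequence_len from the earlier steps
word_index = {"the": 1, "boy": 2, "walked": 3}
max_sequence_len = 12

# hypothetical file name; a temporary directory keeps the sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "preprocessing.pkl")

# save the preprocessing state next to the model files
with open(path, "wb") as f:
    pickle.dump({"word_index": word_index, "max_sequence_len": max_sequence_len}, f)

# later, for example in a deployment script, load it back
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["word_index"] == word_index)
print(restored["max_sequence_len"])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.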

Step 7: prediction

Next, we use the trained model to predict words and generate a story.
The user first supplies a seed sentence. The sentence is preprocessed in the same way as the training data and fed into the LSTM model, which returns a predicted next word. That word is appended to the sentence, and repeating the process generates the story. The code is as follows:

    seed_text = "As i walked, my heart sank"
    next_words = 100

    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes was removed in later TensorFlow 2 releases;
        # taking the argmax of predict gives the same class index
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word

    print(seed_text)
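
The loop above scans the whole word index on every step to turn the predicted index back into a word. A reverse dictionary makes that lookup direct (the Keras Tokenizer also keeps such a mapping in its index_word attribute). The sketch below shows the same predict, append, repeat pattern with a stand-in predict_next function in place of the trained model. The tiny vocabulary and the function are hypothetical and only illustrate the control flow.

```python
word_index = {"the": 1, "boy": 2, "walked": 3, "home": 4}    # toy vocabulary
index_word = {index: word for word, index in word_index.items()}

def predict_next(token_list):
    """Stand-in for the trained model: returns the following index, wrapping around."""
    last = token_list[-1] if token_list else 0
    return last % len(word_index) + 1

seed_text = "the boy"
for _ in range(3):
    # preprocess the current text the same way on every step
    token_list = [word_index[w] for w in seed_text.lower().split() if w in word_index]
    predicted = predict_next(token_list)
    # direct lookup instead of scanning word_index.items()
    seed_text += " " + index_word[predicted]

print(seed_text)
# the boy walked home the
```

Picking the single most probable word at each step, as the article's code does, is usually called greedy decoding.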

The generated story is as follows:

As i walked, my heart sank until he was alarmed by the voice of the hunter
and realised what could have happened with him he flew away the boy crunched
before it disguised herself as another effort to pull out the bush which he did
the next was a small tree which the child had to struggle a lot to pull out
finally the old man showed him a bigger tree and asked the child to pull it
out the boy did so with ease and they walked on the morning she was asked
how she had slept as a while they came back with me

Full text library: https://gist.github.com/jayashree8/08448d1b6610e444dc7a033ef4a5aae7#file-stories-txt
Source code of this article: https://github.com/jayashree8/Story_Generator/blob/master/Story_Generator.ipynb
By Jayashree Domala
Deep hub translation team: Oliver Lee