Natural language processing: preprocessing of PTB data set


Reference book

TensorFlow: A Practical Google Deep Learning Framework (2nd edition)

First, assign each word an ID in order of descending word frequency, then save the vocabulary to a separate vocab file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@author: Li Tian
@contact: [email protected]
@software: pycharm
@file: word_deal1.py
@time: 2019/2/20 10:42
@desc: Assign each word an ID in order of descending word frequency, then save the vocabulary to a separate vocab file.
"""
import codecs
import collections
from operator import itemgetter

# Training set data file
RAW_DATA = "./simple-examples/data/ptb.train.txt"
# Output vocabulary file
VOCAB_OUTPUT = "ptb.vocab"

# Count the frequency of each word
counter = collections.Counter()
with codecs.open(RAW_DATA, "r", "utf-8") as f:
    for line in f:
        for word in line.strip().split():
            counter[word] += 1

# Sort the words by descending frequency
sorted_word_to_cnt = sorted(counter.items(), key=itemgetter(1), reverse=True)
sorted_words = [x[0] for x in sorted_word_to_cnt]

# The sentence terminator "<eos>" will be added at each line break of the text later,
# so add it to the vocabulary in advance.
sorted_words = ["<eos>"] + sorted_words
# When processing machine translation data later, in addition to "<eos>", the
# unknown-word token "<unk>" and the sentence-start token "<sos>" must also be added
# to the vocabulary, and low-frequency words must be removed from it. In the PTB
# data, the low-frequency words have already been replaced with "<unk>", so this
# step is not needed.
# sorted_words = ["<unk>", "<sos>", "<eos>"] + sorted_words
# if len(sorted_words) > 10000:
#     sorted_words = sorted_words[:10000]

# Write one word per line; a word's line position becomes its ID
with codecs.open(VOCAB_OUTPUT, 'w', 'utf-8') as file_output:
    for word in sorted_words:
        file_output.write(word + "\n")
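To see the ranking step without the PTB files, here is a minimal sketch that applies the same counting and sorting to a small in-memory corpus. The function name build_vocab and the toy sentences are made up for illustration and are not part of the original script.

```python
import collections
from operator import itemgetter


def build_vocab(lines):
    """Return words ranked by descending frequency, with "<eos>" first.

    Hypothetical helper: it mirrors the counting and sorting logic of the
    script above, but reads from a list of strings instead of a file.
    """
    counter = collections.Counter()
    for line in lines:
        counter.update(line.strip().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counter.items(), key=itemgetter(1), reverse=True)
    return ["<eos>"] + [word for word, _ in ranked]


# Toy corpus (assumed data, not taken from PTB)
toy_corpus = ["the cat sat", "the cat ran", "the dog ran"]
vocab = build_vocab(toy_corpus)
print(vocab)  # ['<eos>', 'the', 'cat', 'ran', 'sat', 'dog']
```

Because "<eos>" is prepended after sorting, it always takes the first position regardless of how often sentence ends occur.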


Once the vocabulary is fixed, the training file, test file, and so on are converted to word IDs using the vocabulary file. Each word's ID is its line number in the vocabulary file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@author: Li Tian
@contact: [email protected]
@software: pycharm
@file: word_deal2.py
@time: 2019/2/20 11:10
@desc: Once the vocabulary is fixed, convert the training file, test file, and so on to word IDs using the vocabulary file. Each word's ID is its line number in the vocabulary file.
"""
import codecs

# Original training set data file
RAW_DATA = "./simple-examples/data/ptb.train.txt"
# Vocabulary file generated above
VOCAB = "ptb.vocab"
# Output file in which each word is replaced by its ID
OUTPUT_DATA = "ptb.train"

# Read the vocabulary and build a word-to-ID mapping.
with codecs.open(VOCAB, "r", "utf-8") as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]
# A word's ID is its line position in the vocabulary file, counted from 0 here.
word_to_id = {word: idx for idx, word in enumerate(vocab)}


# If a removed low-frequency word appears, replace it with "<unk>".
def get_id(word):
    return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]


fin = codecs.open(RAW_DATA, "r", "utf-8")
fout = codecs.open(OUTPUT_DATA, 'w', 'utf-8')
for line in fin:
    # Read the words and append the "<eos>" terminator
    words = line.strip().split() + ["<eos>"]
    # Replace each word with its ID in the vocabulary
    out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
    fout.write(out_line)
fin.close()
fout.close()
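The conversion can be checked with a round trip: encode a sentence to IDs, then map the IDs back to words. The sketch below assumes a tiny hand-written vocabulary; the names load_word_to_id, encode_line, decode_ids, and toy_vocab are hypothetical and do not appear in the original script.

```python
def load_word_to_id(vocab_words):
    """Map each word to its position in the vocabulary list (0-based)."""
    return {word: idx for idx, word in enumerate(vocab_words)}


def encode_line(line, word_to_id):
    """Split a line, append "<eos>", and replace each word with its ID."""
    words = line.strip().split() + ["<eos>"]
    unk_id = word_to_id["<unk>"]
    # Words missing from the vocabulary fall back to the "<unk>" ID.
    return [word_to_id.get(w, unk_id) for w in words]


def decode_ids(ids, vocab_words):
    """Turn a list of IDs back into a space-separated string of words."""
    return " ".join(vocab_words[i] for i in ids)


# Toy vocabulary (assumed data); in the script above it comes from ptb.vocab.
toy_vocab = ["<eos>", "the", "<unk>", "cat", "sat"]
word_to_id = load_word_to_id(toy_vocab)

ids = encode_line("the cat sat on the mat", word_to_id)
print(ids)                         # [1, 3, 4, 2, 1, 2, 0]
print(decode_ids(ids, toy_vocab))  # the cat sat <unk> the <unk> <eos>
```

The round trip is lossy by design: any word outside the vocabulary decodes to "<unk>", which is why the vocabulary must contain that token before encoding.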


2 December 2019, 23:16 | Views: 6203
