Natural language processing: preprocessing the PTB data set

Reference book

TensorFlow: Google Deep Learning Framework in Practice (2nd edition)

First, assign each word a number in order of word frequency, then save the vocabulary to a separate vocab file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

"""
@author: Li Tian
@contact: 694317828@qq.com
@software: pycharm
@file: word_deal1.py
@time: 2019/2/20 10:42
@desc: First, assign each word a number in order of word frequency, then save the vocabulary to a separate vocab file.
"""

import codecs
import collections
from operator import itemgetter

# Training set data file
RAW_DATA = "./simple-examples/data/ptb.train.txt"
# Output vocabulary file
VOCAB_OUTPUT = "ptb.vocab"

# Count the frequency of words
counter = collections.Counter()
with codecs.open(RAW_DATA, "r", "utf-8") as f:
    for line in f:
        for word in line.strip().split():
            counter[word] += 1

# Sort the words according to the word frequency order
sorted_word_to_cnt = sorted(counter.items(), key=itemgetter(1), reverse=True)
sorted_words = [x[0] for x in sorted_word_to_cnt]

# Later we will insert the sentence terminator "<eos>" at every line break in
# the text, so add it to the vocabulary in advance.
sorted_words = ["<eos>"] + sorted_words
# When processing machine translation data later, "<unk>" and the sentence-start
# token "<sos>" also need to be added to the vocabulary, and low-frequency words
# removed from it. In the PTB data the input has already replaced low-frequency
# words with "<unk>", so this step is not needed here.
# sorted_words = ["<unk>", "<sos>", "<eos>"] + sorted_words
# if len(sorted_words) > 10000:
#     sorted_words = sorted_words[:10000]

with codecs.open(VOCAB_OUTPUT, 'w', 'utf-8') as file_output:
    for word in sorted_words:
        file_output.write(word + "\n")

Run results:


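As a quick illustration of the counting and sorting steps in the script above, here is a self-contained sketch on a made-up two-sentence corpus (the sentences are invented for the example; real input comes from ptb.train.txt):

```python
import collections
from operator import itemgetter

# Toy corpus standing in for ptb.train.txt (hypothetical data).
lines = ["the cat sat on the mat", "the dog sat"]

# Count word frequencies exactly as the script does.
counter = collections.Counter()
for line in lines:
    for word in line.strip().split():
        counter[word] += 1

# Sort by descending frequency and prepend the sentence terminator.
sorted_words = ["<eos>"] + [w for w, _ in
                            sorted(counter.items(), key=itemgetter(1), reverse=True)]
print(sorted_words[:3])  # ['<eos>', 'the', 'sat'] -- "the" occurs 3 times, "sat" twice
```

Because the sort is by frequency, the most common words get the smallest numbers, which is the convention the next script relies on.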
After the vocabulary is determined, convert the training file, test file, etc. to word numbers according to the vocabulary file. Each word's number is its line number in the vocabulary file.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

"""
@author: Li Tian
@contact: 694317828@qq.com
@software: pycharm
@file: word_deal2.py
@time: 2019/2/20 11:10
@desc: After the vocabulary is determined, convert the training file, test file, etc. to word numbers according to the vocabulary file. Each word's number is its line number in the vocabulary file.
"""

import codecs
import sys

# Original training set data file
RAW_DATA = "./simple-examples/data/ptb.train.txt"
# Vocabulary file generated above
VOCAB = "ptb.vocab"
# Output file with words replaced by their numbers
OUTPUT_DATA = "ptb.train"

# Read the vocabulary and build a word-to-id mapping.
with codecs.open(VOCAB, "r", "utf-8") as f_vocab:
    vocab = [w.strip() for w in f_vocab.readlines()]
word_to_id = {w: i for i, w in enumerate(vocab)}


# If a removed low-frequency word appears, replace it with "<unk>".
def get_id(word):
    return word_to_id[word] if word in word_to_id else word_to_id["<unk>"]


fin = codecs.open(RAW_DATA, "r", "utf-8")
fout = codecs.open(OUTPUT_DATA, 'w', 'utf-8')
for line in fin:
    # Read the words and append the <eos> terminator
    words = line.strip().split() + ["<eos>"]
    # Replace each word with its number in the vocabulary
    out_line = ' '.join([str(get_id(w)) for w in words]) + '\n'
    fout.write(out_line)
fin.close()
fout.close()

Run results:

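Since a word's number is just its line index in ptb.vocab, decoding is a plain list lookup. A minimal round-trip sketch, using a made-up in-memory vocabulary in place of the real file:

```python
# Hypothetical stand-in for ptb.vocab; position in the list = word number.
vocab = ["<eos>", "the", "sat", "cat", "on", "mat", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def get_id(word):
    # The real PTB vocabulary contains "<unk>"; this toy one does not,
    # so fall back to id 0 just to keep the sketch self-contained.
    return word_to_id.get(word, word_to_id.get("<unk>", 0))

# Encode a sentence the same way the script does, then decode it back.
encoded = ' '.join(str(get_id(w)) for w in "the cat sat".split() + ["<eos>"])
decoded = ' '.join(vocab[int(i)] for i in encoded.split())
print(encoded)  # 1 3 2 0
print(decoded)  # the cat sat <eos>
```

Decoding like this is a handy sanity check that ptb.train really is a faithful re-encoding of ptb.train.txt.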

Posted on Mon, 02 Dec 2019 23:16:24 -0500 by olanjouw