NLP Star Sky Intelligent Dialogue Robot Series on Natural Language Processing: In-Depth Understanding of the Transformer and the Workshop on Machine Translation (WMT)

Machine transduction and translation

Evaluating machine translation is one way to measure the progress of natural language processing technology. To establish that one solution is better than another, every NLP challenger, laboratory, or organization must evaluate against the same dataset for the comparison to be valid. We will now study the WMT dataset.

Vaswani et al. (2017) demonstrated the Transformer's achievements on the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, where the Transformer obtained state-of-the-art BLEU scores.
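
For context, BLEU measures the n-gram overlap between a candidate translation and one or more reference translations. A minimal sketch using NLTK's sentence_bleu (the sentences below are illustrative and not from the original post):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# one reference translation (a list of token lists) and one candidate
reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
# smoothing avoids zero scores when a higher-order n-gram is missing
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print('BLEU: %.4f' % score)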

The 2014 Workshop on Machine Translation (WMT) included European language datasets, one of which contains the Europarl corpus. We will use the French-English dataset from the European Parliament Proceedings Parallel Corpus, available at https://www.statmt.org/europarl/v7/fr-en.tgz.

The archive contains two files:

  • europarl-v7.fr-en.en
  • europarl-v7.fr-en.fr
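
To fetch and unpack the archive programmatically, here is a minimal sketch using only the standard library (the local file name fr-en.tgz is an assumption; downloading and extracting manually works just as well):

import tarfile
import urllib.request

URL = 'https://www.statmt.org/europarl/v7/fr-en.tgz'
ARCHIVE = 'fr-en.tgz'

# download the archive, then extract the .en and .fr files
urllib.request.urlretrieve(URL, ARCHIVE)
with tarfile.open(ARCHIVE, 'r:gz') as tar:
    tar.extractall('.')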

We will load, clean, and reduce the size of the corpus.

Preprocessing a WMT dataset

We will preprocess europarl-v7.fr-en.en and europarl-v7.fr-en.fr in read.py and dump the serialized output files using standard Python functions and pickle:

import pickle
from pickle import dump

Load file into memory:

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

Split the loaded document into sentences:

# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')

Find the shortest and longest sentence lengths:

# shortest and longest sentence lengths
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

The sentence text is preprocessed so that the model is not trained on invalid, noisy tokens. Each line is normalized to ASCII, tokenized on white space, and converted to lowercase; punctuation and non-printable characters are removed from each token; tokens containing numbers are excluded; and the cleaned line is stored as a string:

# clean lines
import re
import string
import unicodedata
def clean_lines(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))  
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable chars from each token
        line = [re_print.sub('', w) for w in line]
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    return cleaned
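
As a quick sanity check, clean_lines can be applied to a single illustrative line (the sentence below is an example, not taken from the corpus):

sample = ["Resumption of the session: 1999-2000!"]
print(clean_lines(sample))
# expected: ['resumption of the session'] (punctuation, casing, and numeric tokens removed)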

Now that the key functions have been defined, we can load and clean the English dataset:

# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences),minlen, maxlen))
cleanf = clean_lines(sentences)
English data: sentences=2007723, min=0, max=668

The dataset is now clean; we use pickle to dump it to the serialized file English.pkl:

filename = 'English.pkl'
outfile = open(filename,'wb')
pickle.dump(cleanf,outfile)
outfile.close()
print(filename," saved")
English.pkl  saved

Now we repeat the same process for the French data and dump it to the serialized file French.pkl:

# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf = clean_lines(sentences)
filename = 'French.pkl'
outfile = open(filename,'wb')
pickle.dump(cleanf,outfile)
outfile.close()
print(filename," saved")
French data: sentences=2007723, min=0, max=693
French.pkl  saved
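
Both corpora are now serialized. To reduce the size of the corpus for faster experimentation, one option is to reload the pickled lists and keep only the first N sentence pairs. A minimal sketch (the subset size of 50,000 pairs is an arbitrary assumption, not a value from the original post):

from pickle import load

# reload the serialized datasets
with open('English.pkl', 'rb') as f:
    english = load(f)
with open('French.pkl', 'rb') as f:
    french = load(f)

# keep a smaller aligned subset of sentence pairs
n = 50000
english_small, french_small = english[:n], french[:n]
print(len(english_small), len(french_small))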

The header of the preprocessing script credits its sources:

#Pre-Processing datasets for Machine Translation
#Copyright 2020, Denis Rothman, MIT License
#Denis Rothman modified the code for educational purposes.
#Reference:
#Jason Brownlee PhD, 'How to Prepare a French-to-English Dataset for Machine Translation'
# https://machinelearningmastery.com/prepare-french-english-dataset-machine-translation/
