NLP Star Sky Intelligent Dialogue Robot series on natural language processing: in-depth understanding of the Transformer and the Workshop on Machine Translation (WMT)
Machine transduction and translation
The evaluation of machine translation reflects the progress of natural language processing technology. To show that one solution is better than another, every NLP challenger, laboratory, or organization must benchmark against the same dataset for the comparison to be valid. We will now study the WMT dataset.
Vaswani et al. (2017) demonstrated the power of the Transformer on the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, where the Transformer obtained state-of-the-art BLEU scores.
The 2014 Workshop on Machine Translation (WMT) included several European language datasets, one of which contains the Europarl corpus. We will use the French-English dataset from the European Parliament Proceedings Parallel Corpus, available at https://www.statmt.org/europarl/v7/fr-en.tgz.
The archive includes two files: europarl-v7.fr-en.en and europarl-v7.fr-en.fr.
We will load, clean, and reduce the size of the corpus.
Preprocessing a WMT dataset
We will preprocess europarl-v7.fr-en.en and europarl-v7.fr-en.fr, then dump the serialized output files using standard Python functions and pickle:
import pickle
from pickle import dump
Load the file into memory:
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
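As a quick sanity check, load_doc can be exercised on a small file we create ourselves (the file name and contents below are hypothetical, for illustration only):

```python
# load doc into memory (same helper as above)
def load_doc(filename):
    file = open(filename, mode='rt', encoding='utf-8')
    text = file.read()
    file.close()
    return text

# write a tiny hypothetical document to demonstrate the round trip
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('resumption of the session\n')

text = load_doc('demo.txt')
print(text)  # resumption of the session
```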
Split the loaded document into sentences:
# split a loaded document into sentences
def to_sentences(doc):
    return doc.strip().split('\n')
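Since the corpus stores one sentence per line, splitting on newlines is enough. A minimal check on a hypothetical two-line document:

```python
# split a loaded document into sentences (one sentence per line)
def to_sentences(doc):
    return doc.strip().split('\n')

# hypothetical two-line document for illustration
doc = 'resumption of the session\ni declare resumed the session\n'
print(to_sentences(doc))
# ['resumption of the session', 'i declare resumed the session']
```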
Query the shortest and longest sentence lengths:
# shortest and longest sentence lengths
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)
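For example, on a few hypothetical sentences the helper returns the token counts of the shortest and longest lines:

```python
# shortest and longest sentence lengths, measured in whitespace tokens
def sentence_lengths(sentences):
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths)

# hypothetical sentences for illustration
sentences = ['madam president',
             'resumption of the session',
             'i declare resumed the session of parliament']
print(sentence_lengths(sentences))  # (2, 7)
```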
The sentence text is preprocessed to avoid training on invalid and noisy tokens. Each line is normalized: Unicode characters are folded to ASCII, the text is converted to lowercase, punctuation and non-printable characters are removed from tokens, tokens containing numbers are excluded, and each cleaned line is stored as a string:
# clean lines
import re
import string
import unicodedata

def clean_lines(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # convert to lower case
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable chars from each token
        line = [re_print.sub('', w) for w in line]
        # remove tokens with numbers in them
        line = [word for word in line if word.isalpha()]
        # store as string
        cleaned.append(' '.join(line))
    return cleaned
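The effect of each cleaning step can be traced on a single made-up French line (the line below is hypothetical, chosen to exercise accents, mixed case, punctuation, and a numeric token):

```python
import re
import string
import unicodedata

re_print = re.compile('[^%s]' % re.escape(string.printable))
table = str.maketrans('', '', string.punctuation)

# hypothetical noisy line for illustration
line = 'Reprise de la Session: le 17, à Strasbourg!'
# strip accents via unicode normalization
line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
tokens = line.split()                           # whitespace tokenization
tokens = [w.lower() for w in tokens]            # lowercase
tokens = [w.translate(table) for w in tokens]   # drop punctuation
tokens = [re_print.sub('', w) for w in tokens]  # drop non-printable chars
tokens = [w for w in tokens if w.isalpha()]     # drop tokens containing digits
print(' '.join(tokens))  # reprise de la session le a strasbourg
```

Note that the accented 'à' survives as 'a', while the numeric token '17' is removed entirely by the isalpha filter.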
With the key functions defined, we can now load and clean the English dataset:
# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf = clean_lines(sentences)
English data: sentences=2007723, min=0, max=668
The dataset is now clean, and pickle dumps it to a serialized file named English.pkl:
filename = 'English.pkl'
outfile = open(filename, 'wb')
pickle.dump(cleanf, outfile)
outfile.close()
print(filename, " saved")
Now we repeat the same process for the French data and dump it to a serialized file named French.pkl:
# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences=%d, min=%d, max=%d' % (len(sentences), minlen, maxlen))
cleanf = clean_lines(sentences)

filename = 'French.pkl'
outfile = open(filename, 'wb')
pickle.dump(cleanf, outfile)
outfile.close()
print(filename, " saved")
French data: sentences=2007723, min=0, max=693
French.pkl saved
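Before moving on, it is worth verifying that a pickled list round-trips correctly. Below is a self-contained sketch using a small hypothetical stand-in list; the real files produced above are English.pkl and French.pkl:

```python
import pickle

# hypothetical stand-in for the cleaned corpus
cleaned = ['resumption of the session', 'i declare resumed the session']

# serialize the list, as done for English.pkl and French.pkl above
with open('sample.pkl', 'wb') as outfile:
    pickle.dump(cleaned, outfile)

# reload and verify the round trip
with open('sample.pkl', 'rb') as infile:
    reloaded = pickle.load(infile)

print(reloaded == cleaned)  # True
```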
The script's header credits its sources:
#Pre-Processing datasets for Machine Translation
#Copyright 2020, Denis Rothman, MIT License
#Denis Rothman modified the code for educational purposes.
#Reference:
#Jason Brownlee PhD, 'How to Prepare a French-to-English Dataset for Machine Translation'
#https://machinelearningmastery.com/prepare-french-english-dataset-machine-translation/