Following on from the third blog post, we continue with the doc2vec model by analyzing train_model.py, which shows how the doc2vec model is actually built. The code also supports training with pre-trained word embeddings, and the implementation is based on gensim.
1, Code analysis of train_model.py
    import gensim.models as gm
    import logging

    # doc2vec parameters
    vector_size = 256
    window_size = 15
    min_count = 1
    sampling_threshold = 1e-5
    negative_size = 5
    train_epoch = 100
    dm = 0  # 0 = dbow; 1 = dmpv
    worker_count = 1  # number of worker threads used for training
This section initializes the relevant parameters of doc2vec. We will introduce them one by one:
- vector_size: word vector length; the default value is 100
- window_size: window size, indicating the maximum distance between the current word and the predicted word in a sentence
- min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded. The default value is 5
- sampling_threshold: the threshold that controls random downsampling of high-frequency words. The default value is 1e-3 and the useful range is (0, 1e-5)
- negative_size: if > 0, negative sampling is used, and the value sets how many noise words are drawn (generally 5-20)
- train_epoch: number of iterations (epochs) over the corpus
- dm: selects the training algorithm. dm = 0 uses the dbow mode, and dm = 1 uses the dmpv mode
- worker_count: number of worker threads used in parallel for training
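To make the effect of sampling_threshold more concrete, here is a minimal sketch of the downsampling heuristic from the original word2vec code, which gensim follows as far as we know. The function name keep_probability and the toy counts are made up for illustration, and the sketch only covers the case where the threshold is a fraction between 0 and 1.

```python
import math

def keep_probability(word_count, total_words, sample=1e-5):
    """Probability of keeping one occurrence of a word during training.

    Sketch of the word2vec downsampling heuristic; assumes `sample` is a
    fraction in (0, 1). A value of 0 or below disables downsampling here.
    """
    if sample <= 0:
        return 1.0
    threshold_count = sample * total_words
    prob = (math.sqrt(word_count / threshold_count) + 1) * (threshold_count / word_count)
    return min(prob, 1.0)  # rare words are always kept

# Toy corpus statistics (hypothetical numbers)
total = 1_000_000
for word, count in [("the", 50_000), ("doc2vec", 10)]:
    print(word, keep_probability(count, total, sample=1e-5))
```

A smaller threshold discards frequent words more aggressively, while words at or below the threshold frequency are never dropped.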
    # Pre-trained word embeddings
    # (omit this parameter if there are no pre-trained word embeddings)
    pretrained_emb = "toy_data/pretrained_word_embeddings.txt"

    # Input corpus
    train_corpus = "toy_data/wiki_en.txt"

    # Model output
    save_model_name = 'wiki_en_doc2vec.model'
    saved_path = "toy_data/model/wiki_en_doc2vec.model.bin"
This section declares the paths of pretrained_emb and the input corpus, as well as the path where the trained doc2vec model will be stored.
    # Enable log output
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # Train the doc2vec model
    docs = gm.doc2vec.TaggedLineDocument(train_corpus)  # load the corpus
    # Note: pretrained_emb is not a parameter of the upstream gensim Doc2Vec;
    # it requires a gensim build that supports it.
    model = gm.Doc2Vec(docs, vector_size=vector_size, window=window_size,
                       min_count=min_count, sample=sampling_threshold,
                       workers=worker_count, hs=0, dm=dm, negative=negative_size,
                       dbow_words=1, dm_concat=1, epochs=train_epoch,
                       pretrained_emb=pretrained_emb)

    # Save the model
    model.save(saved_path)
The doc2vec model is trained with the previously declared parameters and the loaded corpus, producing the bin file. This file is the pre-trained model required by extract.py in our project.
2, Code analysis of infer_test.py
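To illustrate what the corpus loader hands to the model, here is a small stand-in written without gensim. As far as we know, TaggedLineDocument treats each line of the file as one document, splits it on whitespace, and tags it with its line number. The names TaggedDoc and tagged_line_documents, and the two sample sentences, are hypothetical and only used for this sketch.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for gensim's TaggedDocument (words plus tags)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

def tagged_line_documents(lines):
    """Yield one tagged document per line: whitespace tokens, tag = line index."""
    for line_no, line in enumerate(lines):
        yield TaggedDoc(words=line.split(), tags=[line_no])

# In-memory sample corpus instead of toy_data/wiki_en.txt
sample_corpus = io.StringIO("the cat sat on the mat\ndogs chase cats\n")
docs = list(tagged_line_documents(sample_corpus))
for doc in docs:
    print(doc.tags, doc.words)
```

This is why the training corpus is expected to contain one already tokenized document per line.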
    import gensim.models as gm
    import codecs
    import numpy as np

    # Parameters
    model = "toy_data/model/wiki_en_doc2vec.model.bin"
    test_docs = "toy_data/test.txt"  # test.txt is the text to be vectorized
    output_file = "toy_data/test_vector.txt"  # vector representation of each line of the test text

    # Hyperparameters
    start_alpha = 0.01
    infer_epoch = 1000

    # Load the model
    m = gm.Doc2Vec.load(model)
    with codecs.open(test_docs, "r", "utf-8") as f:
        test_docs = [x.strip().split() for x in f.readlines()]
This part first specifies the path of the previously trained doc2vec model, the path of the text to be vectorized, and the file test_vector.txt in which the resulting vectors are stored. It then initializes the hyperparameters, loads the trained model, reads test.txt line by line, splits each line with split(), and assigns the resulting list of token lists to test_docs.
    # Infer test vectors
    output = open(output_file, "w")
    for d in test_docs:
        output.write(" ".join([str(x) for x in m.infer_vector(d, alpha=start_alpha)]) + "\n")
    output.flush()
    output.close()
    # print(len(test_docs))  # number of lines of test text
    print(m.wv.most_similar("party", topn=5))  # find the 5 words closest to "party"

    # Save as numpy
    test_vector = np.loadtxt('toy_data/test_vector.txt')
    np.save('toy_data/test_vector', test_vector)  # np.save returns None and appends .npy
This section opens output_file for writing and, for each tokenized document in test_docs, calls infer_vector to vectorize it. The most_similar function can then be called to find the words closest to a given word. Finally, the contents of test_vector.txt are saved in NumPy format as toy_data/test_vector.npy. The results of running infer_test.py are as follows:
Figure 1: Contents of test_vector.txt
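The ranking returned by most_similar is based on cosine similarity between vectors. The following is a minimal pure-Python sketch of that idea, not gensim's implementation. The helper names and the three-dimensional toy vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=5):
    """Rank every other key in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    scored = [(word, cosine(q, vec)) for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up toy vectors, for illustration only
toy_vectors = {
    "party": [1.0, 0.0, 0.0],
    "election": [0.9, 0.1, 0.0],
    "vote": [0.5, 0.5, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
ranking = most_similar("party", toy_vectors, topn=2)
print(ranking)
```

Words whose vectors point in nearly the same direction as the query come first, regardless of vector length.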
Figure 2: generated vectorized txt and npy files
3, Implementation and analysis of part of speech tagging
In utils.py of this project, the extract function needs to build a syntax tree, which involves the representation of chunks: tags and trees. We will first briefly analyze how part-of-speech tagging is implemented, and then look specifically at how utils.py processes the syntax tree.
Part of speech is a way of classifying the words of a language. It is the result of dividing words according to their grammatical features while also taking lexical meaning into account. There are 14 common parts of speech, such as nouns, verbs, and adjectives. Part-of-speech (POS) tagging marks the part of speech of each word in a text. POS tagging builds on word segmentation and offers another perspective for understanding text.
Chinese word segmentation based on jieba