Word2Vec development and code implementation

Word2Vec

Language model

Calculation
Shortcomings
Markov assumption
n-gram models
Constructing the language model

Word vector

One-hot encoding
Generating word vectors from a language model
word2vec

Continuous Bag of Words (CBOW)
Skip-gram
Training tricks
Objective function
Formula derivation

Code implementation

Language model

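Language model: a model that predicts the probability of each sentence occurring in a language.

For a sentence S = w_1, w_2, ..., w_n, the chain rule gives
P(S) = P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1}).
P(S) is itself called the language model: it is what is used to calculate the probability of a sentence.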

Calculation

P(w_i | w_1, ..., w_{i-1}) is estimated from counts: the number of times the whole sequence w_1 ... w_i appears in the training corpus, divided by the number of times the same sequence without the current word w_i (that is, w_1 ... w_{i-1}) appears.

Shortcomings

Data is too sparse
Parameter space is too large
Explanation: the chain rule can express the probability of any sentence, but every word is conditioned on its entire history. The longer the combination of preceding words, the less often it appears in the training corpus, so most counts are zero or very small: the data is too sparse. And because every distinct history needs its own conditional distribution, the number of parameters to estimate grows rapidly with the length of the history: the parameter space is too large.

Markov assumption

To address these shortcomings, we introduce the Markov assumption.
Markov assumption: the next word depends only on the one or few words immediately before it.

If the next word depends only on the single word before it (bigram):
P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ... P(w_n | w_{n-1})

If the next word depends only on the two words before it (trigram):
P(S) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_{n-2}, w_{n-1})

n-gram models

n-gram model: assume that the probability of the current word depends only on the n-1 words before it.
How to select n:

Larger n: places more constraints on the next word and is more discriminative;
Smaller n: each n-gram appears more often in the training corpus, so the statistics are more reliable.
In theory, the larger n is, the better. In practice, trigrams are used the most. Even so, as a rule of thumb: if a bigram model solves the problem, do not use a trigram.

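Constructing the language model

The language model is constructed by maximum likelihood estimation:
P(w_i | w_{i-n+1}, ..., w_{i-1}) = Count(w_{i-n+1}, ..., w_{i-1}, w_i) / Count(w_{i-n+1}, ..., w_{i-1})

Count(X): the number of occurrences of the word sequence X in the training corpus.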
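
As a minimal sketch of this count-based estimate, the snippet below builds a bigram model (n = 2) from a toy corpus. The corpus, the `<s>` / `</s>` boundary markers, and the function names are illustrative assumptions, not part of the original text.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return prob(prev, word) = Count(prev, word) / Count(prev)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # <s> and </s> are assumed sentence-boundary markers
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # a token counts as a history only when another token follows it
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob

def sentence_prob(prob, sentence):
    """Multiply the bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= prob(prev, word)
    return p

corpus = ["i like nlp", "i like deep learning", "i enjoy nlp"]
prob = train_bigram_mle(corpus)
print(prob("i", "like"))                        # Count(i, like) / Count(i)
print(sentence_prob(prob, "i like nlp"))
print(sentence_prob(prob, "i like learning"))   # unseen bigram: probability 0
```

The last call shows the sparsity problem in miniature: a single bigram that never occurred in training drives the whole sentence probability to zero.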

Word vector

Each word is represented as a dense real-valued vector, for example [0.792, -0.177, -0.107, 0.109, 0.542]
Common dimensions are 50 or 100
Solves the "semantic gap" problem
The similarity between words can be measured by the distance between their vectors (Euclidean distance, cosine distance, etc.)
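
The two distance measures named above can be sketched in a few lines. The vectors here are made-up 5-dimensional values chosen only for illustration (the first one reuses the example vector above), and the word labels are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical word vectors, not taken from any trained model.
king = [0.792, -0.177, -0.107, 0.109, 0.542]
queen = [0.750, -0.150, -0.120, 0.130, 0.500]
apple = [-0.300, 0.800, 0.450, -0.200, 0.050]

print(cosine_similarity(king, queen), cosine_similarity(king, apple))
print(euclidean_distance(king, queen), euclidean_distance(king, apple))
```

With these values the first pair is both closer in Euclidean distance and higher in cosine similarity than the second pair.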

One-hot encoding
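
The body of this section is not reproduced in the text, so the following is only a minimal sketch of the standard idea: each word becomes a vector as long as the vocabulary, with a single 1 at the word's index. The three-word vocabulary is a made-up example.

```python
def one_hot(word, vocab):
    """Vector of len(vocab) zeros with a 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

vocab = ["hotel", "motel", "apple"]  # hypothetical vocabulary
hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)

# Any two different one-hot vectors are orthogonal.
dot = sum(a * b for a, b in zip(hotel, motel))
print(hotel, motel, dot)
```

Because the dot product of any two distinct one-hot vectors is zero, this encoding carries no notion of similarity between words, which is the "semantic gap" that dense word vectors address.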

Generating word vectors from a language model

Neural network language model (NNLM): starts directly from the language model and turns the process of optimizing the model into the process of learning word vector representations.
Objective function:

Model structure:

NNLM starts from the language model (that is, from the perspective of computing probabilities) and builds a neural network that is optimized against the objective function. Training sets out to use the neural network as a language model for the word prediction task; the word vectors are a by-product of the optimized model.
When training the neural network, the goal is to predict the probability of a word given its context, that is, to predict the next word from the words before it; this is what the objective function expresses. Once the network has been trained sufficiently, the final model parameters can be used as word vectors.

Recurrent neural network language model (RNNLM): a language model based on a recurrent neural network.
w(t) is the input word at time step t, one-hot encoded as a vector of dimension V, where V is the vocabulary size.
s(t-1) is the output of the hidden layer at the previous time step.
y(t) is the output.

The main advantage of a recurrent neural network is that it can, in principle, use the entire preceding context to predict the next word, rather than only a fixed window of n words.
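
The quantities above are usually tied together by a recurrence like the one below. This is a sketch of the common simple recurrent formulation, not something spelled out in this text: the weight matrices U, W, V, the sigmoid f, and the softmax g are assumed names and choices.

```latex
\begin{aligned}
s(t) &= f\big(U\,w(t) + W\,s(t-1)\big), \qquad f(z) = \frac{1}{1 + e^{-z}} \\
y(t) &= g\big(V\,s(t)\big), \qquad g(z)_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}
\end{aligned}
```

Since s(t-1) itself depends on every earlier input, y(t) is conditioned on the whole history rather than on a fixed window.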

Disadvantages:

High computational complexity
Many parameters

This is where word2vec comes in.

word2vec

Continuous Bag of Words (CBOW)

Continuous Bag of Words (CBOW) predicts the center word (Wt) from the words in its surrounding context.

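Objective function: maximize the average log-probability of each center word given its context, (1/T) Σ_{t=1..T} log P(w_t | w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c}), where c is the window size and T is the number of words in the corpus.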

No hidden layer
Uses a two-sided context window (words before and after the center word)
Ignores the order of the context words (bag of words)
The input layer is represented directly by low-dimensional dense vectors
The projection layer is simplified to a sum (or average) of the context vectors
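
The properties listed above can be illustrated with a toy forward pass. This is a sketch under stated assumptions: the vocabulary size, the dimensions, and the random initial vectors are made up, and a full softmax is used on the output for simplicity instead of the hierarchical softmax or negative sampling used in practice.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, input_vectors, output_vectors):
    """Return P(center word | context) for every word in the vocabulary."""
    dim = len(input_vectors[0])
    # Projection layer: average the context word vectors; word order is lost here.
    projection = [
        sum(input_vectors[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    # Output layer: one score per vocabulary word, normalized with a full softmax.
    scores = [sum(p * u for p, u in zip(projection, out)) for out in output_vectors]
    return softmax(scores)

# Toy setup: a 6-word vocabulary with 4-dimensional vectors, randomly initialized.
rng = random.Random(0)
vocab_size, dim = 6, 4
input_vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
output_vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]

probs = cbow_forward([0, 1, 3, 4], input_vectors, output_vectors)     # ids of the context words
shuffled = cbow_forward([4, 0, 3, 1], input_vectors, output_vectors)  # same words, different order
print(probs)
```

Reordering the context ids leaves the output unchanged up to floating-point rounding, which is the bag-of-words property in action.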

CBOW model structure:

Skip-gram

The skip-gram model predicts the surrounding words from the center word (Wt); that is, it predicts the context.

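Objective function: maximize the average log-probability of the context words given each center word, (1/T) Σ_{t=1..T} Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t), where c is the window size and T is the number of words in the corpus.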

Input layer: contains only the word vector of the center word w of the current sample
Projection layer: an identity projection, kept only for symmetry with the CBOW model
Output layer: as in the CBOW model, the output layer is a Huffman tree
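
To make the prediction task concrete, the sketch below lists the (center word, context word) training pairs that skip-gram would see for a short sentence. The sentence, the window size, and the function name are illustrative assumptions.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs for every position and every neighbor in the window."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training example: the input is the center word alone and the target is one of its neighbors.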

Skip-gram model structure:

Training tricks

Hierarchical Softmax
Negative Sampling
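
Both tricks avoid computing a softmax over the whole vocabulary. As a minimal sketch of the second one, the function below evaluates the usual negative-sampling loss for one (center, context) pair: the true context vector u_o is pushed toward the center vector v_c, and k sampled "negative" vectors are pushed away. The vectors are made-up values; how negatives are sampled is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(v_c, u_o, negatives):
    """-log sigmoid(u_o . v_c) - sum over negatives of log sigmoid(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

v_c = [1.0, 0.0]  # hypothetical center-word vector

# True context aligned with v_c, negative pointing away: low loss.
good = negative_sampling_loss(v_c, [2.0, 0.0], [[-2.0, 0.0]])
# The reverse arrangement: high loss.
bad = negative_sampling_loss(v_c, [-2.0, 0.0], [[2.0, 0.0]])
print(good, bad)
```

Only k + 1 dot products are needed per training pair, instead of one per vocabulary word.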

Objective function

Minimizing the objective function ⟺ maximizing prediction accuracy

To minimize the objective function, we use two vectors for each word w:

v_w : used when w is the center word
u_w : used when w is a context word

Then, for a center word c and a context word o:
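
A common way to complete this definition, sketched here as an assumption because the formula is not reproduced in this text, is a softmax over dot products, with the objective being the average negative log-likelihood over a corpus of T words and a window of radius m. The symbols T, m, V (the vocabulary) and θ (all the u and v vectors) are names introduced for this sketch.

```latex
P(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\left(u_w^{\top} v_c\right)},
\qquad
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\left(w_{t+j} \mid w_t; \theta\right)
```

Minimizing J is the same as maximizing the probability assigned to the observed context words, which is the sense in which minimizing the objective maximizes prediction accuracy.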

Formula derivation
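
The derivation itself is not reproduced in this text. Assuming the softmax form P(o | c) = exp(u_o · v_c) / Σ_w exp(u_w · v_c), the standard result is that the gradient of log P(o | c) with respect to v_c equals u_o − Σ_x P(x | c) u_x: the observed context vector minus the expected context vector under the model. The sketch below checks that formula numerically on made-up vectors; all names and values are illustrative.

```python
import math

def softmax_probs(v_c, U):
    """P(x | c) for every word x, given center vector v_c and context vectors U."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in U]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(o, v_c, U):
    return math.log(softmax_probs(v_c, U)[o])

def analytic_gradient(o, v_c, U):
    """u_o minus the probability-weighted average of all u_x."""
    probs = softmax_probs(v_c, U)
    expected = [sum(p * u[d] for p, u in zip(probs, U)) for d in range(len(v_c))]
    return [U[o][d] - expected[d] for d in range(len(v_c))]

def numeric_gradient(o, v_c, U, eps=1e-6):
    """Central finite differences of log P(o | c) with respect to v_c."""
    grad = []
    for d in range(len(v_c)):
        plus, minus = list(v_c), list(v_c)
        plus[d] += eps
        minus[d] -= eps
        grad.append((log_prob(o, plus, U) - log_prob(o, minus, U)) / (2 * eps))
    return grad

# Hypothetical context vectors for a 4-word vocabulary and one center vector.
U = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1], [0.7, 0.0, -0.2], [0.1, 0.3, 0.3]]
v_c = [0.5, -0.4, 0.2]

analytic = analytic_gradient(2, v_c, U)
numeric = numeric_gradient(2, v_c, U)
print(analytic)
print(numeric)
```

The two printed gradients agree to within finite-difference error, which supports the closed-form expression.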

Code implementation

Learning word2vec with gensim (gensim is an NLP package).
The data set used below is the novel "In the Name of the People", available at:
Link: https://pan.baidu.com/s/1ojWGMI756SO93OCAMNXFVg
Extraction code: l0zv

# -*- coding: utf-8 -*-
import jieba

# Register the novel's character names so that jieba keeps each one as a single token.
# The corpus is Chinese, so the names must be given in Chinese (romanized in the comments).
names = [
    '沙瑞金',  # Sha Ruijin
    '田国富',  # Tian Guofu
    '高育良',  # Gao Yuliang
    '侯亮平',  # Hou Liangping
    '钟小艾',  # Zhong Xiaoai
    '陈岩石',  # Chen Yanshi
    '欧阳菁',  # Ouyang Jing
    '易学习',  # Yi Xuexi
    '王大路',  # Wang Dalu
    '蔡成功',  # Cai Chenggong
    '孙连城',  # Sun Liancheng
    '季昌明',  # Ji Changming
    '丁义珍',  # Ding Yizhen
    '郑西坡',  # Zheng Xipo
    '赵东来',  # Zhao Donglai
    '高小琴',  # Gao Xiaoqin
    '赵瑞龙',  # Zhao Ruilong
    '林华华',  # Lin Huahua
    '陆亦可',  # Lu Yike
    '刘新建',  # Liu Xinjian
    '刘庆祝',  # Liu Qingzhu
]
for name in names:
    jieba.suggest_freq(name, True)

# Read the raw novel.
with open('./in_the_name_of_people.txt', encoding='utf-8') as f:
    document = f.read()

# jieba.cut returns a generator that can be consumed only once: printing it
# before the join below would leave `result` empty.
document_cut = jieba.cut(document)
result = ' '.join(document_cut)

# Write the space-separated tokens out as UTF-8; the `with` blocks close both files.
with open('./in_the_name_of_people_segment.txt', 'w', encoding='utf-8') as f2:
    f2.write(result)

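# Import modules and set up logging
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Each line of the segmented file is read as one sentence of space-separated tokens.
sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')

# hs=1 selects hierarchical softmax, min_count=1 keeps every word, window=3 is the context window.
# vector_size is the embedding dimension (this parameter is named `size` in gensim versions before 4.0).
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, vector_size=100)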

# Find the words most similar to a given word
def print_similar_names(model, word, req_count=5):
    # Print up to req_count similar words, keeping only three-character words
    # (the usual length of a full name in the Chinese corpus).
    for similar_word, score in model.wv.similar_by_word(word, topn=100):
        if len(similar_word) == 3:
            req_count -= 1
            print(similar_word, score)
            if req_count == 0:
                break

print_similar_names(model, '李达康')  # Li Dakang

print_similar_names(model, '赵东来')  # Zhao Donglai

print_similar_names(model, '高育良')  # Gao Yuliang

print_similar_names(model, '沙瑞金')  # Sha Ruijin

# Similarity between two word vectors
print(model.wv.similarity('沙瑞金', '高育良'))  # Sha Ruijin, Gao Yuliang
print(model.wv.similarity('李达康', '王大路'))  # Li Dakang, Wang Dalu

# Find the word that does not belong with the others
# (Sha Ruijin, Gao Yuliang, Li Dakang, Liu Qingzhu)
print(model.wv.doesnt_match('沙瑞金 高育良 李达康 刘庆祝'.split()))