Word2Vec
- Language model
- Word vector
- Code implementation
Language model: predicts the probability of a sentence appearing in the language.
P(S) is called the language model; it is used to calculate the probability of a sentence S = w1, w2, ..., wn.
Calculation
The probability of a whole sentence is decomposed with the chain rule, and each conditional probability is estimated by counting: the number of times the word sequence up to and including the current word wi appears in the training corpus, divided by the number of times the same sequence without wi appears.
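Written out explicitly (a reconstruction of the standard chain-rule formulas the paragraph describes):

$$
P(S) = P(w_1, w_2, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_1, \dots, w_{n-1})
$$

$$
P(w_i \mid w_1, \dots, w_{i-1}) = \frac{\mathrm{Count}(w_1, \dots, w_{i-1}, w_i)}{\mathrm{Count}(w_1, \dots, w_{i-1})}
$$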
Shortcomings
- Data is too sparse
- Parameter space is too large
Explanation: although this approach can express the probability of a sentence, the amount of computation is very large and the data becomes extremely sparse, because each word is conditioned on all of the words before it, and most long combinations of preceding words rarely (or never) appear in the corpus; the longer the history, the rarer it is. Because the data is so sparse, the parameter space also becomes too large.
Markov assumption
To overcome the shortcomings above, the Markov assumption is introduced.
Markov assumption: the appearance of the next word depends only on the one or several words immediately before it; the corresponding factorizations are written out below.
- Suppose the appearance of the next word depends only on the single word before it
- Suppose the appearance of the next word depends only on the two words before it
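Under these two assumptions, the sentence probability factorizes as below (the standard bigram and trigram forms, written out here for reference):

$$
P(S) \approx \prod_{i} P(w_i \mid w_{i-1}) \quad \text{(conditioned on the previous word)}
$$

$$
P(S) \approx \prod_{i} P(w_i \mid w_{i-2}, w_{i-1}) \quad \text{(conditioned on the previous two words)}
$$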
n-gram models
n-gram model: assume that the probability of the current word depends only on the N-1 words before it.
How to choose n:
- Larger n: more constraining information about the next word, so greater discriminative power;
- Smaller n: each n-gram appears more often in the training corpus, so the statistics are more reliable.
In theory, the larger n is, the better. In practice, trigram is used the most; as a rule of thumb, if a bigram solves the problem, do not use a trigram.
Construct the language model
Construct the language model with maximum likelihood estimation:
P(wi | wi-n+1, ..., wi-1) = Count(wi-n+1, ..., wi-1, wi) / Count(wi-n+1, ..., wi-1)
Count(X): the number of occurrences of the word sequence X in the training corpus.
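A minimal sketch of this maximum-likelihood estimate for the bigram case (the toy corpus and function name below are illustrative, not from the original post):

```python
from collections import Counter

# Toy corpus: a list of tokenized sentences
corpus = [["i", "like", "nlp"],
          ["i", "like", "deep", "learning"],
          ["i", "enjoy", "flying"]]

# Count(w) and Count(w_prev, w)
unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p(word, prev):
    """Maximum-likelihood estimate P(word | prev) = Count(prev word) / Count(prev)."""
    return bigram[(prev, word)] / unigram[prev]

print(p("like", "i"))   # Count("i like") / Count("i") = 2 / 3
```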
Word vector
- A word is represented as a dense real-valued vector, e.g. [0.792, -0.177, -0.107, 0.109, 0.542]
- Common dimensions are 50 or 100
- Solves the "semantic gap" problem
- The similarity between words can be measured by the distance between their vectors (Euclidean distance, cosine distance, etc.); see the sketch after this list
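For example, cosine and Euclidean distance between two such vectors (toy vectors, for illustration only):

```python
import numpy as np

# Two toy word vectors of the kind shown above
v_king = np.array([0.792, -0.177, -0.107, 0.109, 0.542])
v_queen = np.array([0.750, -0.150, -0.090, 0.120, 0.510])

# Cosine similarity (cosine distance is 1 minus this value)
cosine_sim = v_king @ v_queen / (np.linalg.norm(v_king) * np.linalg.norm(v_queen))
euclidean = np.linalg.norm(v_king - v_queen)
print(cosine_sim, euclidean)
```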
One-hot encoding
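For contrast with the dense vectors above, a minimal sketch of a one-hot representation (toy vocabulary assumed):

```python
import numpy as np

vocab = ["king", "queen", "man", "woman"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A V-dimensional vector with a single 1: sparse, and every pair of words is equally distant
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("queen"))   # [0. 1. 0. 0.]
```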
Generating word vectors from a language model
Neural network language model (NNLM): starts directly from the language model and turns the process of optimizing the model into the process of learning word vector representations.
Objective function: maximize L = Σ_t log P(w_t | w_{t-n+1}, ..., w_{t-1})
Model structure: input layer (the n-1 context words) → projection layer (their word vectors, concatenated) → tanh hidden layer → softmax output over the vocabulary.
- NNLM starts from the language model (i.e. from the perspective of computing probabilities) and builds a neural network that is optimized against the objective function above. The starting point of training is to solve the word-prediction task with a neural network language model; the word vectors are a by-product of the optimized model.
- When training the neural network, the goal is to predict the probability of a word, i.e. given the preceding words, predict what the next word is; the objective function is the one given above. After the network has been trained sufficiently, the learned projection parameters can be used as the word vectors. A toy sketch of this forward pass is given below.
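A minimal sketch of the NNLM forward pass described above (toy sizes; the matrix names C, H, U are assumptions, not taken from the original figure):

```python
import numpy as np

V, m, h, n = 10, 5, 8, 4          # vocabulary size, vector dimension, hidden size, n-gram order
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))                 # word-vector matrix: the "by-product" of training
H = rng.normal(size=(h, (n - 1) * m))       # projection -> hidden weights
U = rng.normal(size=(V, h))                 # hidden -> output weights

def predict_next(context_ids):
    """Return P(w | previous n-1 words) for every word in the vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])   # look up and concatenate context vectors
    hidden = np.tanh(H @ x)                           # hidden layer
    scores = U @ hidden                               # one score per vocabulary word
    e = np.exp(scores - scores.max())
    return e / e.sum()                                # softmax over the vocabulary

probs = predict_next([1, 4, 7])   # distribution over the next word, given 3 context words
```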
Recurrent neural network language model (RNNLM): a language model based on a recurrent neural network.
w(t) is the input word at time step t. Its dimension is V, where V is the dictionary size, and it uses a one-hot representation.
s(t-1) is the output of the hidden layer at the previous time step,
y(t) is the output, a probability distribution over the next word.
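Putting these definitions together, the standard RNNLM recurrence is (the weight-matrix names U, W, V below are assumed, not taken from the original figure):

$$
s(t) = f\big(U\,w(t) + W\,s(t-1)\big), \qquad y(t) = g\big(V\,s(t)\big)
$$

where f is a sigmoid-like activation and g is the softmax that turns the output into a probability distribution over the vocabulary.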
The biggest advantage of the recurrent neural network is that it can truly make full use of all of the preceding context when predicting the next word, instead of opening a window of only n words.
Disadvantages:
- High computational complexity
- Many parameters
So let's talk about word2vec
word2vec
Continuous Bag of Words (CBOW)
Predicts the center word (w_t) from the words around it.
- Objective function: maximize (1/T) Σ_t log P(w_t | w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c})
- No hidden layer
- Uses a two-way (bidirectional) context window
- Context word order is ignored (bag of words)
- The input layer uses low-dimensional dense vectors directly
- The projection layer is simplified to a sum (average) of the context vectors
CBOW model structure:
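A minimal numerical sketch of the projection step described above, i.e. averaging the context vectors and scoring candidate center words (toy sizes; a plain softmax output is used here instead of the Huffman-tree output, for brevity):

```python
import numpy as np

V, d = 8, 4                              # toy vocabulary size and vector dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, d))           # input-side (context) word vectors
W_out = rng.normal(size=(V, d))          # output-side (center-word) vectors

context_ids = [1, 2, 4, 5]               # indices of the words in the two-way window
h = W_in[context_ids].mean(axis=0)       # projection layer: average of the context vectors
scores = W_out @ h                       # one score per candidate center word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                     # P(center word w_t | context)
```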
Skip-gram
The skip-gram model predicts the surrounding words, i.e. the context, given the center word (w_t).
- Objective function: maximize (1/T) Σ_t Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)
- Input layer: only the word vector of the center word w of the current sample
- Projection layer: an identity projection, kept only for comparison with the CBOW model
- Output layer: like the CBOW model, the output layer is a Huffman tree
Skip-gram model structure:
Training tricks (both are available in gensim, as sketched below)
- Hierarchical Softmax
- Negative Sampling
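In gensim these choices map onto constructor arguments, roughly as follows (a hedged sketch on a toy corpus; parameter names are from gensim's Word2Vec API, and in gensim >= 4.0 `size` is called `vector_size`):

```python
from gensim.models import word2vec

# Toy corpus: a list of tokenized sentences (illustrative only)
sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]

# sg=0 -> CBOW, sg=1 -> skip-gram
# hs=1 -> hierarchical softmax; hs=0 with negative > 0 -> negative sampling
model_hs = word2vec.Word2Vec(sentences, sg=1, hs=1, min_count=1, window=3, size=100)
model_neg = word2vec.Word2Vec(sentences, sg=1, hs=0, negative=5, min_count=1, window=3, size=100)
```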
Objective function
Minimizing the objective function ⟺ maximizing prediction accuracy.
Since we need to minimize the objective function, for each word w we use two vectors:
- v_w : used when w is the center word
- u_w : used when w is a context word
Then for the center word c and the context word o:
P(o | c) = exp(u_o · v_c) / Σ_{w∈V} exp(u_w · v_c)
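A numerical sketch of this probability (toy vectors; the array names are assumptions):

```python
import numpy as np

# P(o | c) = exp(u_o . v_c) / sum over w of exp(u_w . v_c)
V, d = 6, 3
rng = np.random.default_rng(0)
v = rng.normal(size=(V, d))     # v_w: used when w is the center word
u = rng.normal(size=(V, d))     # u_w: used when w is a context word

c, o = 2, 5                     # indices of the center word c and the context word o
scores = u @ v[c]               # u_w . v_c for every word w in the vocabulary
p_o_given_c = np.exp(scores[o]) / np.exp(scores).sum()
print(p_o_given_c)
```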
Formula derivation: taking the gradient of log P(o | c) with respect to v_c gives ∂/∂v_c log P(o | c) = u_o − Σ_{w∈V} P(w | c) u_w, i.e. the observed context vector minus its expected value under the model.
Code implementation
Learning word2vec with gensim (gensim is an NLP library).
The data set used below is the novel "In the Name of the People"; it can be downloaded from the following link.
Link: https://pan.baidu.com/s/1ojWGMI756SO93OCAMNXFVg
Extraction code: l0zv
```python
# -*- coding: utf-8 -*-
import jieba
import jieba.analyse

# Tell jieba to keep the main character names as single tokens
jieba.suggest_freq('Sarekin', True)
jieba.suggest_freq('Tian Guofu', True)
jieba.suggest_freq('Gao Yuliang', True)
jieba.suggest_freq('Hou Liangping', True)
jieba.suggest_freq('Zhong Xiaoai', True)
jieba.suggest_freq('Chenyan', True)
jieba.suggest_freq('Ouyang Jing', True)
jieba.suggest_freq('Easy to learn', True)
jieba.suggest_freq('Wangda Road', True)
jieba.suggest_freq('Cai Chenggong', True)
jieba.suggest_freq('Sun Liancheng', True)
jieba.suggest_freq('Ji Changming', True)
jieba.suggest_freq('Ding Yizhen', True)
jieba.suggest_freq('Zheng Xipo', True)
jieba.suggest_freq('Zhao Donglai', True)
jieba.suggest_freq('Gaoxiaoqin', True)
jieba.suggest_freq('Zhao Ruilong', True)
jieba.suggest_freq('Lin Huahua', True)
jieba.suggest_freq('Lu Yi', True)
jieba.suggest_freq('Liu Xinjian', True)
jieba.suggest_freq('Liu celebrates', True)

with open('./in_the_name_of_people.txt', encoding='utf-8') as f:
    document = f.read()
    #document_decode = document.decode('GBK')
    document_cut = jieba.cut(document)
    # note: printing ' '.join(document_cut) here would exhaust the generator,
    # so nothing would be written to the segment file below
    result = ' '.join(document_cut)
    result = result.encode('utf-8')
    with open('./in_the_name_of_people_segment.txt', 'wb') as f2:
        f2.write(result)
```
```python
# import modules & set up logging
import logging
import os
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')
# note: in gensim >= 4.0 the 'size' parameter is called 'vector_size'
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
```
```python
# Find the words most similar to a given word's vector
req_count = 5  # number of results to output
for key in model.wv.similar_by_word('Li Dakang', topn=100):
    if len(key[0]) == 3:   # keep only 3-character tokens (full person names)
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break
```
```python
req_count = 5
for key in model.wv.similar_by_word('Zhao Donglai', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break
```

```python
req_count = 5
for key in model.wv.similar_by_word('Gao Yuliang', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break
```

```python
req_count = 5
for key in model.wv.similar_by_word('Sarekin', topn=100):
    if len(key[0]) == 3:
        req_count -= 1
        print(key[0], key[1])
        if req_count == 0:
            break
```
```python
# Similarity between two word vectors
print(model.wv.similarity('Sarekin', 'Gao Yuliang'))
print(model.wv.similarity('Li Dakang', 'Wangda Road'))
```
```python
# Find the word that does not belong with the others
print(model.wv.doesnt_match(u"Sha Ruijin, Gao Yuliang, Li Dakang, Liu celebrate".split()))
```