·Introduction to Word2Vec
One of the core problems of natural language processing is how to quantify words and expressions so that they can be used in a model. This mapping of language elements to numerical representations is called word embedding.
Word2Vec is a word embedding process. The concept is relatively simple: the model iterates sentence by sentence through the corpus and learns to predict the current word from the neighboring words inside a predefined window.
For this purpose it uses a shallow neural network, but in the end we do not actually use the predictions. Once the model is trained, we only keep the weights of the hidden layer. In the pre-trained model we will use, the hidden layer has 300 units, so each word is represented by a 300-dimensional vector.
Note that two words do not have to appear close to each other to be considered similar. Even if two words never appear in the same sentence, as long as they are usually surrounded by the same words, we can be fairly confident that they have similar meanings.
There are two modeling approaches in Word2Vec: skip-gram and continuous bag of words (CBOW). Both have their own advantages and their own sensitivity to certain hyperparameters.
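Both can be trained with gensim; the sg flag of the Word2Vec class switches between them (sg=0 is CBOW, the default, sg=1 is skip-gram). Below is a minimal sketch with a tiny made-up corpus, assuming gensim 4:
from gensim.models import Word2Vec

# Tiny made-up corpus: a list of tokenized sentences
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'lay', 'on', 'the', 'rug']]

# CBOW (sg=0, the default): predicts the current word from its context window
cbow_model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)

# Skip-gram (sg=1): predicts the context words from the current word
skipgram_model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)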
Of course, the word vectors you get depend on the corpus the model was trained on. Generally speaking, you need a huge corpus, such as all of Wikipedia or news articles from many different sources. The vectors we are going to use were trained on Google News.
·Simple visualization
Let's build a small custom corpus and try a simple visualization of Word2Vec:
import gensim
%matplotlib inline
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# Training corpus
sentences = [['this', 'is', 'the', 'an', 'apple', 'for', 'you'],
             ['this', 'is', 'the', 'an', 'orange', 'for', 'you'],
             ['this', 'is', 'the', 'an', 'banana', 'for', 'you'],
             ['apple', 'is', 'delicious'],
             ['apple', 'is', 'sad'],
             ['orange', 'is', 'delicious'],
             ['orange', 'is', 'sad'],
             ['apple', 'tests', 'delicious'],
             ['orange', 'tests', 'delicious']]

# Train the model on the corpus
model = Word2Vec(sentences, window=5, min_count=1)

# Project the word vectors to 2 dimensions with PCA
# X = model[model.wv.vocab]          # old gensim 3.x API
X = model.wv[model.wv.key_to_index]  # gensim 4.x API
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# Visual display
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.key_to_index)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
Because the corpus was made up at random and is very small, the correlations between words shown by the trained vectors are not very strong. The main purpose here is simply to show what kind of output we can get from a Word2Vec model when we feed it a series of sentences.
·Hands-on practice
Using a model pre-trained on the Google News corpus, let's see how the word vectors produced by Word2Vec can be used.
First, you need to download the pre-trained Word2Vec vectors. These are available for a variety of domains; the model trained on the Google News corpus can be found by searching for "GoogleNews-vectors-negative300". The file is about 1.53 GB and contains 300-dimensional representations of 3 million words and phrases.
As with the simple visualization above, you need the gensim library in Python. Suppose the downloaded file is saved in the "wordpretrain" folder on drive E of your computer.
from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format(
    'E:/wordpretrain/GoogleNews-vectors-negative300.bin.gz',
    binary=True, limit=1000000)
In this way, we have a ready-made word vector model in which each word is uniquely represented by a 300-dimensional vector. Let's look at some simple uses of it.
1. You can actually view the vector representation of any word:
word_vectors['dog']
But it's hard to explain what each dimension of this vector means.
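If you want a quick sanity check of what is returned, the vector is just a NumPy array of length 300 (assuming the 300-dimensional Google News model loaded above):
import numpy as np

vec = word_vectors['dog']
print(type(vec), vec.shape)  # numpy.ndarray with shape (300,)
print(np.linalg.norm(vec))   # its Euclidean norm (the raw vectors are generally not unit length)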
2. The most_similar function finds words with similar meanings. The topn parameter defines how many words to list:
word_vectors.most_similar(positive = ['nice'], topn = 5)
The numbers in parentheses indicate the cosine similarity to the query word.
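The same cosine similarity can also be queried directly for a specific pair of words with the similarity method; the word pairs below are only illustrative:
# Cosine similarity between two explicit words (the same measure most_similar ranks by)
print(word_vectors.similarity('nice', 'pleasant'))
print(word_vectors.similarity('nice', 'car'))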
3. If we add the vectors of "father" and "woman" and then subtract the vector of "man", we get:
word_vectors.most_similar( positive = ['father', 'woman'], negative = ['man'], topn = 1)
In fact, it is easy to see why this works. Suppose there are only two dimensions (parent-child relationship and gender), and the vector of "woman" is (0, 1), "man" is (0, -1), "father" is (1, -1) and "mother" is (1, 1). Then "father" + "woman" - "man" = (1, -1) + (0, 1) - (0, -1) = (1, 1) = "mother". Of course, the difference is that here we have 300 dimensions, but the principle is the same.
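The same toy arithmetic can be checked in a few lines of NumPy (the two axes here are the made-up dimensions from the paragraph above, not real Word2Vec dimensions):
import numpy as np

# Toy 2-d space: axis 0 = parent-child relationship, axis 1 = gender
woman  = np.array([0,  1])
man    = np.array([0, -1])
father = np.array([1, -1])
mother = np.array([1,  1])

result = father + woman - man
print(result)                          # [1 1]
print(np.array_equal(result, mother))  # True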
4. Visualization:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import adjustText
from jupyterthemes import jtplot
jtplot.style(theme='onedork')  # Select a plotting theme
def plot_2d_representation_of_words(
        word_list,
        word_vectors,
        flip_x_axis=False,
        flip_y_axis=False,
        label_x_axis="x",
        label_y_axis="y",
        label_label="fruit"):

    pca = PCA(n_components=2)

    # Build a DataFrame: first column is the word, the remaining columns are its 300 coordinates
    word_plus_coordinates = []
    for word in word_list:
        current_row = []
        current_row.append(word)
        current_row.extend(word_vectors[word])
        word_plus_coordinates.append(current_row)
    word_plus_coordinates = pd.DataFrame(word_plus_coordinates)

    # Project the 300-dimensional vectors (all columns after the word) down to 2 dimensions
    coordinates_2d = pca.fit_transform(
        word_plus_coordinates.iloc[:, 1:])
    coordinates_2d = pd.DataFrame(
        coordinates_2d, columns=[label_x_axis, label_y_axis])
    coordinates_2d[label_label] = word_plus_coordinates.iloc[:, 0]

    if flip_x_axis:
        coordinates_2d[label_x_axis] = coordinates_2d[label_x_axis] * (-1)
    if flip_y_axis:
        coordinates_2d[label_y_axis] = coordinates_2d[label_y_axis] * (-1)

    plt.figure(figsize=(15, 10))
    p1 = sns.scatterplot(
        data=coordinates_2d, x=label_x_axis, y=label_y_axis)

    # Label each point with its word and let adjustText untangle overlapping labels
    x = coordinates_2d[label_x_axis]
    y = coordinates_2d[label_y_axis]
    label = coordinates_2d[label_label]
    texts = [plt.text(x[i], y[i], label[i]) for i in range(len(x))]
    adjustText.adjust_text(texts)
fruits = ['apple','orange','banana','lemon','car','tram','boat','bicycle', 'cherry','mango','grape','durian','watermelon','train','motorbike','ship', 'peach','pear','pomegranate','strawberry','bike','bus','truck','subway','airplane']
plot_2d_representation_of_words(
    word_list=fruits,
    word_vectors=word_vectors,
    flip_y_axis=True)
Here I mixed a few transportation-related words into the list of fruits. The result is quite good: you can clearly see the correlations between words, and the two groups cluster automatically.
Of course, the above only scratches the surface of the Word2Vec model. Word vectors can not only be used for word-level tasks, but can also serve as input to many downstream models, including but not limited to:
· Similarity computation
· Finding similar words
· Information retrieval
· Input to models such as SVMs and LSTMs (see the averaging sketch after this list)
· Chinese word segmentation
· Named entity recognition
· Sentence representation
· Sentiment analysis
· Document representation
· Document topic classification
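As a concrete illustration of the "sentence representation" and "model input" items above, one common baseline is simply to average the word vectors of a sentence; the sketch below assumes the word_vectors object loaded earlier and gensim 4:
import numpy as np

def sentence_vector(tokens, word_vectors):
    # Average the vectors of in-vocabulary tokens into one fixed-size sentence vector;
    # fall back to a zero vector if none of the tokens are in the vocabulary.
    vecs = [word_vectors[t] for t in tokens if t in word_vectors.key_to_index]
    if not vecs:
        return np.zeros(word_vectors.vector_size)
    return np.mean(vecs, axis=0)

# A 300-dimensional feature vector that could be fed to an SVM, an LSTM, etc.
features = sentence_vector(['this', 'movie', 'was', 'great'], word_vectors)
print(features.shape)  # (300,)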
·Summary
From the Word2Vec practice and simple applications above, we can extract the core idea behind word vector training: if two words appear in similar contexts, their vectors will also be similar.