Learning Natural Language Processing 09: Building word vectors (Word2Vec) with the Gensim library

There is a path through the mountain of books, and diligence is the way; the sea of learning is boundless, and hard work is the boat.

1, Building word vectors with Gensim

1.1 Data preprocessing

from gensim.models import word2vec
import logging  # configure logging so gensim prints training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Original corpus

raw_sentences = ['the quick brown fox jumps over the lazy dogs','you go home now to sleep']

Word segmentation (tokenization)

sentences = [s.split() for s in raw_sentences]
print(sentences)
[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['you', 'go', 'home', 'now', 'to', 'sleep']]

1.2 Training the model

min_count:

  • Different corpus sizes call for different minimum word-frequency thresholds. In a large corpus, for example, we usually want to ignore words that appear only once or twice; the min_count parameter controls this. Reasonable values generally fall between 0 and 100.

size (dimensionality of the word vectors):

  • The size parameter sets the dimensionality of the word vectors, i.e., the size of the network's hidden layer; Word2Vec defaults to 100. A larger value requires more training data but can improve overall accuracy. Reasonable settings range from tens to a few hundred, and the default of 100 is usually fine.
model = word2vec.Word2Vec(sentences, min_count=1)
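
For reference, a minimal sketch that sets both parameters explicitly on the same toy corpus. It assumes gensim 4.x, where the dimensionality parameter is named vector_size (gensim 3.x calls it size):

from gensim.models import word2vec

# Same toy corpus, with both parameters set explicitly (gensim 4.x names)
model = word2vec.Word2Vec(
    sentences,
    min_count=1,      # keep even words that appear only once (tiny corpus)
    vector_size=100,  # dimensionality of the word vectors (size=100 on gensim 3.x)
)
print(model.wv['fox'].shape)  # -> (100,)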

Compute the similarity between two words

y1 = model.wv.similarity('dogs','you')
print(y1)
0.0048460397

2, Constructing word vectors from Wikipedia data

Wikipedia data download

The download is in .xml.bz2 format and needs to be converted into a plain-text (.txt) file with a script.
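
One way to do that conversion is with gensim's WikiCorpus; the sketch below is only illustrative, and the dump filename zhwiki-latest-pages-articles.xml.bz2 and output name wiki_texts.txt are placeholders for whatever files you actually use:

from gensim.corpora import WikiCorpus

# Stream articles out of the compressed dump and write one article per line
wiki = WikiCorpus('zhwiki-latest-pages-articles.xml.bz2', dictionary={})
with open('wiki_texts.txt', 'w', encoding='utf-8') as out:
    for i, tokens in enumerate(wiki.get_texts()):
        out.write(' '.join(tokens) + '\n')
        if (i + 1) % 10000 == 0:
            print('Processed {} articles'.format(i + 1))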

The Chinese Wikipedia dump is in Traditional Chinese, so it needs to be converted to Simplified Chinese, and the documents also need to be segmented into words.


To convert Traditional Chinese to Simplified Chinese with opencc, first download and unzip the opencc package.

Copy the extracted Traditional Chinese text file into the opencc folder.

Open a cmd command prompt, change into the opencc installation directory, and run:

opencc -i wiki_texts.txt -o test.txt -c t2s.json

-i specifies the input file, -o specifies the output file, and -c specifies the conversion configuration (t2s.json converts Traditional to Simplified Chinese).

The word vector model is then trained on the word-segmented text, as sketched below.
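
A minimal sketch of those two steps, assuming the jieba segmenter, the simplified-Chinese file test.txt produced by opencc above, and an illustrative intermediate file name wiki_seg.txt (the parameter values are also just examples):

import jieba
from gensim.models import word2vec

# Segment each line of the simplified-Chinese text into space-separated words
with open('test.txt', 'r', encoding='utf-8') as fin, \
     open('wiki_seg.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')

# Train Word2Vec on the segmented file; LineSentence streams it line by line
sentences = word2vec.LineSentence('wiki_seg.txt')
model = word2vec.Word2Vec(sentences, vector_size=250, min_count=5, workers=4)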

Save the constructed word vector model

model.save('path')

3, Similarity calculation

from gensim.models import word2vec
en_wiki_word2vec_model = word2vec.Word2Vec.load("wiki.zh.text.model")
testwords = ['Apple', 'mathematics', 'learning', 'idiot', 'Basketball']

for i in range(5):
    res = en_wiki_word2vec_model.wv.most_similar(testwords[i])
    print(testwords[i])
    print(res)

4, Sentiment classification based on word2vec

Import libraries

import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # download the stopword corpus (run once); NLTK handles most tasks well, except Chinese word segmentation

Reading training data with pandas

df = pd.read_csv("  ", sep="\t", escapechar="\\")
print("Number of reviews: {}".format(len(df)))
df.head()
Number of reviews: 25000


The preprocessing of the movie review data includes the following steps (a combined sketch follows the examples below):

  • 1. Remove HTML tags
  • 2. Remove punctuation
  • 3. Split into words / tokens
  • 4. Remove stop words
  • 5. Re-join into a new sentence
df['review'][1000]

# Example: remove HTML tags from a review
example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
example

# Remove punctuation: keep letters only
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
example_letters
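
Putting the five steps together, a minimal sketch of a cleaning function; the name clean_text is just illustrative, and it assumes the English stopword list has already been downloaded via nltk.download('stopwords'):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

eng_stopwords = set(stopwords.words('english'))

def clean_text(text):
    # 1. Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    # 2. Remove punctuation, keeping letters only
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # 3. Split into lowercase tokens
    words = text.lower().split()
    # 4. Remove stop words
    words = [w for w in words if w not in eng_stopwords]
    # 5. Re-join into a cleaned sentence
    return ' '.join(words)

df['clean_review'] = df['review'].apply(clean_text)
df.head()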
