User comment text mining

Learning objectives

  • Understand the role of comment text mining
  • Master the use of nltk and gensim for basic NLP processing

1. Introduction to comment text mining

  • Text mining means extracting the content we are interested in from text data
  • Why should data analysts care about text data?
    • In daily product and operations work, most of the analysis we do is descriptive analysis of numbers (values); such data is called structured data
    • Pictures, text and video, by contrast, are collectively referred to as unstructured data
    • Unstructured data contains a great deal of information; text in particular (user comments) is an important way to learn whether users are satisfied with products and services
    • In practical product and operations work, it is very important to find out the underlying reasons why users like, buy/use, or churn from a product
  • For a non-proprietary e-commerce business, text data is extremely important
    • With a self-owned app we can collect the data we want through event tracking. As a third-party seller, however, the means of reaching users are limited, and data can only be obtained through the interfaces exposed by the platform
    • The unstructured data available through the platform is mainly user comment data
    • User comment data contains information about users, competitors and the product itself
    • As users of e-commerce sites ourselves, we all know how important comments are, and many marketing activities revolve around them: buying positive reviews, deleting reviews, offering red-envelope rebates in exchange for reviews, and so on
  • The following goals can be achieved through comment text mining:
    • Operations optimization: mine user preferences, track competitor dynamics, and improve the competitiveness of our own products
    • Product iteration: discover product update trends and spot product problems reported by users in time
    • Word-of-mouth management: identify the differences between our own products and competitors'

2. Project background

  • We want to understand competing products and the market from the users' perspective

  • Introduction to ukulele attributes

    • Size: soprano 21", concert 23", tenor 26"
    • Material: basswood, ebony, mahogany, plastic
    • Color: natural wood, red, blue, black
  • Take a look at the structure of an Amazon review; our main object of analysis is the review body

  • Project requirements:

    • Sales breakdown of competing products: infer each competitor's main selling models from the historical number of comments on the different models mentioned in its reviews
    • Voice of high-score and low-score reviews of competing products: what do users say in high-score (4-5 star) reviews, what do they say in low-score (1-2 star) reviews, and which aspects do they care about most
  • Technical implementation:

    • Sales breakdown of competing products
    • Keyword extraction for high-score and low-score reviews

3. Introduction to text mining related methods

1. How to represent text with numeric values

  • Machines do not understand human natural language. We need to convert natural language into a "language" that machines can easily work with; this is the task of NLP (Natural Language Processing)

  • In NLP the finest-grained unit is the word. Words form sentences, and sentences form paragraphs, chapters and documents, so to do NLP we must first deal with words

  • The ultimate goal of word processing is to represent words as vectors

    ① Get original text: Yes, Everjoys ukulele is the right Soprano Ukulele I have been looking for. It arrived earlier and very well packed, just the way I expected

    ② Tokenization: ['Yes', 'Everjoys', 'ukulele', 'is', 'the', 'right', 'Soprano', 'Ukulele', 'I', 'have', 'been', 'looking', 'for', 'It', 'arrived', 'earlier', 'and', 'very', 'well', 'packed', 'just', 'the', 'way', 'I', 'expected']

    (for Chinese text, a tokenizer such as jieba would be used at this step)

    ③ Vectorization (encoding): [1, 0, 1, 0, 1, 0, 1, 0, ...]

    Male → [1, 0]
    Female → [0, 1]

    Beijing → [1, 0, 0]
    Shanghai → [0, 1, 0]
    Shenzhen → [0, 0, 1]

    This is one-hot encoding; in pandas it can be produced with pd.get_dummies()
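  • A minimal sketch of one-hot encoding with pandas (the city values are only the illustration above, not columns from the review dataset; depending on the pandas version the dummy columns are shown as 0/1 or True/False):
import pandas as pd

# One-hot encode a small categorical series with pd.get_dummies()
cities = pd.Series(['Beijing', 'Shanghai', 'Shenzhen'])
print(pd.get_dummies(cities))
#    Beijing  Shanghai  Shenzhen
# 0        1         0         0
# 1        0         1         0
# 2        0         0         1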

  • Content to be processed in word segmentation stage:

    • First, split a sentence into individual words. English word segmentation is simple: just split on spaces. For Chinese, a third-party library such as jieba can be used

    • Common abbreviations also need to be expanded, e.g. I'd → I would, I've → I have

    • Next, we need to restore inflected words (tense, number, etc.) to their base forms

      • Stemming

        Stemming obtains the root of a word by removing its affixes.

        Common affixes include the noun plural ending, the progressive ending and the past participle ending

      • Lemmatization

        Lemmatization is dictionary-based and transforms the complex forms of a word into its most basic form.

        It does not simply strip prefixes and suffixes but converts the word according to a dictionary; for example, "drove" is converted to "drive".

    • After obtaining the base words, we also need to remove stop words as well as auxiliary words, function words and conjunctions

      • Stop words: you can manually specify which words will not be kept in the tokenized result
      • Generally we only care about nouns, verbs and adjectives
    • Third-party libraries can help us implement all of the steps above; see the sketch below
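    • A small sketch contrasting stemming and lemmatization with nltk (assuming the WordNet data has been downloaded; exact outputs may vary slightly across nltk versions):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('looking'))                 # 'look'   (suffix stripped)
print(stemmer.stem('studies'))                 # 'studi'  (a stem is not always a real word)
print(lemmatizer.lemmatize('studies'))         # 'study'  (dictionary-based)
print(lemmatizer.lemmatize('drove', pos='v'))  # 'drive'  (needs the verb POS tag)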

4. Code implementation

1. Import & load data

import pandas as pd
import re
import math
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
# Ignore unnecessary warnings
warnings.filterwarnings('ignore')

# nltk: toolkit for natural language processing
import nltk
from nltk.stem.wordnet import WordNetLemmatizer # lemmatization
from nltk.corpus import wordnet as wn
from collections import Counter

import pyecharts.options as opts
from pyecharts.charts import WordCloud

%matplotlib inline
  • Load data
df_reviews=pd.read_csv('data/reviews.csv')
df_reviews.head()
  • View data
df_reviews.info()
  • From the output above we can see that the short_d, content and name fields each have a few missing values, which need to be handled
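  • As a supplementary check (not part of the original notebook), the number of missing values per column can be listed directly:
df_reviews.isnull().sum()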

2. Data processing

  • Since what we analyze is the text content of the comments, rows with missing values can simply be dropped
# Drop rows that contain missing values (e.g. empty comment content)
df_reviews = df_reviews.dropna()

# Extract the numeric star rating from the stars text
def get_stars(n):
    return float(n.replace(' out of 5 stars',''))

# Map the star rating to an evaluation category: positive (4 stars and above), neutral (3 stars) and negative (2 stars and below)
def stars_cat(n):
    '''
    Convert the score into positive / neutral / negative: 1-2 stars are negative, 3 stars neutral, 4-5 stars positive
    '''
    if n<=2:
        return 'negative comment'  
    elif n ==3:
        return 'Middle evaluation' 
    else:
        return 'Praise'

# Extract the date from the comment metadata and convert it to a date string
def get_date(x):
    '''
    Process a comment date such as "Reviewed in the United States on June 24, 2020"
    First split on 'on ' to separate the prefix from the date text,
    then split on ', ' to get ['Month day', 'year'],
    and finally split 'Month day' on the space to get the month and the day
    '''
    x = x.split('on ')[1] # keep the part after 'on ', e.g. 'June 24, 2020'
    x = x.split(', ') 
    y= x[1]
    x = x[0].split(' ')
    m,d = x[0],x[1]
    if m=='January' or m=='Jan':
        on_date='01-'+d+'-'+y
    elif m=='February' or m=='Feb':
        on_date='02-'+d+'-'+y
    elif m=='March' or m=='Mar':
        on_date='03-'+d+'-'+y
    elif  m=='April' or m=='Apr':
        on_date='04-'+d+'-'+y
    elif  m=='May':
        on_date='05-'+d+'-'+y
    elif  m=='June' or m=='Jun':
        on_date='06-'+d+'-'+y
    elif  m=='July' or m=='Jul':
        on_date='07-'+d+'-'+y
    elif m=='August' or m=='Aug':
        on_date='08-'+d+'-'+y
    elif m=='September' or m=='Sep':
        on_date='09-'+d+'-'+y
    elif m=='October' or m=='Oct':
        on_date='10-'+d+'-'+y
    elif m=='November' or m=='Nov':
        on_date='11-'+d+'-'+y
    elif m=='December' or m=='Dec':
        on_date='12-'+d+'-'+y    
    #on_date=datetime.datetime.strptime(on_date, '%m-%d-%Y').strftime('%Y-%m-%d')
    return on_date

# Apply the functions above to create the new columns
df_reviews['stars_num']=df_reviews['stars'].apply(get_stars)
df_reviews['content_cat']=df_reviews['stars_num'].apply(stars_cat)
df_reviews['date_d']=df_reviews['date'].apply(get_date)

(figure: img/reviews3.png)
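  • As an aside, a more concise alternative to the month mapping above (a sketch, assuming every row follows the "... on <Month> <day>, <year>" pattern, not the notebook's original approach) is to strip the prefix and let pandas parse the date directly:
# Alternative sketch: pandas/dateutil can parse "June 24, 2020" directly
df_reviews['date_d'] = pd.to_datetime(df_reviews['date'].str.split(' on ', n=1).str[1])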

3. Analysis of non-text data

  • Count the number of comments per product
  • Count the number of comments for the different product types
  • Plot the star distribution of product reviews
# View the total number of comments for each product
sns.set(font_scale=1)
df_reviews['product_name'].value_counts().plot(kind='bar')
# Count the number of comments posted over time and check whether there is a periodic pattern
df_reviews['date_d'] = pd.to_datetime(df_reviews['date_d'])
df_reviews['y_m'] = df_reviews['date_d'].astype('datetime64[M]') # truncate the date to the month
df_reviews.head()

(figure: img/reviews5.png)

# The different products: everjoys, ranch, kala, donner
# Build a structured grid of subplots and draw the same kind of plot on different subsets of the data --> FacetGrid()
# FacetGrid parameters: data is the DataFrame to plot from; col is the column whose values split the data, one subplot per value;
# col_wrap is the number of subplot columns; sharex/sharey control whether subplots share the x/y axis;
# height is the subplot height and aspect the width-to-height ratio
g = sns.FacetGrid(data = df_reviews,col = 'product_name',col_wrap = 2,sharex=False,sharey=False,height = 5, aspect= 1.2)
# g.map draws each subplot of the grid: sns.countplot is applied to each product_name group of df_reviews,
# counting the comments in each content_cat category; order fixes the order of the bars
g.map(sns.countplot,'content_cat',order=['Praise','negative comment','Middle evaluation'])
# Count the number of comments for each product by month
df_content=df_reviews.groupby(['product_name','y_m'])['content'].count().reset_index()
g=sns.FacetGrid(data=df_content,col='product_name',col_wrap=2,sharey=False,sharex=False,height=4,aspect =2)
# The first argument is the plotting function to call; the following arguments are the column names passed to that call (plt.plot)
g.map(plt.plot,"y_m",'content',marker='1') # marker='1' draws each point of the line chart as a short tick
# Distribution of good, medium and bad comments over time
df_content=df_reviews.groupby(['product_name','y_m','content_cat'])['content'].count().reset_index()
g=sns.FacetGrid(data=df_content,col='product_name',hue='content_cat',col_wrap=2,sharey=False,sharex=False,height=4,aspect =2)
g.map(plt.plot,"y_m",'content',marker='.')#marker = '.' each point in the line chart will be represented by a point
g.add_legend()# Add legend

# Distribution of different models of the same product
df_content=df_reviews.groupby(['product_name','y_m','type'])['content'].count().reset_index()
g=sns.FacetGrid(data=df_content,col='product_name',hue='type',col_wrap=2,sharey=False,sharex=False,height=4,aspect =2)
g.map(plt.plot,"y_m",'content',marker='.')
g.add_legend()

4. Text mining

  • Data deduplication
df_data = df_reviews.drop_duplicates(subset=["product_name","type","date_d","content_cat","content","stars_num","name"])
df_text=df_data['content']
df_text[0]

"This is for children, not adults. I cannot use the tuner, so I use one on my phone. It doesn't stay in tune longer than a couple minutes."

  • Split the positive and negative comments for one product (everjoys-Soprano)
sample_positive=df_data[(df_data.product_name=='everjoys-Soprano') & (df_data.content_cat=='Praise')]
sample_negative=df_data[(df_data.product_name=='everjoys-Soprano') & (df_data.content_cat=='negative comment')]
len(sample_positive)

1037

len(sample_negative)

223

  • Restore common abbreviations
# As a first step of corpus processing, regular expressions are used to clean the text
# A regular expression is an expression made of ordinary and special characters that describes a text pattern
# The re package is Python's regular expression package
# re.sub(pattern, replacement, string): find all matches of pattern in string, replace them, and return the resulting string
# ?  the preceding character appears at most once: <= 1
# *  the preceding character appears at least 0 times: >= 0
# +  the preceding character appears at least once: >= 1
# ^  the string starts with the character that follows
# .  matches any single character
# $  the string ends with the character that precedes it
# () marks a group; group(0) is the whole match of the regular expression, group(1) is the first parenthesised group
# [] the characters inside the brackets give the allowed set of values for a single character
# {} the number inside the braces is how many times the preceding character is repeated
# \  the backslash removes the special meaning of a metacharacter so it is treated as an ordinary character
# |  or
def replace_abbreviations(text):
    # Expand common abbreviations: it's, I'd, he's ...
    new_text = re.sub(r"(it|he|she|that|this|there|here)(\'s)", r"\1 is", text, flags=re.I)  # note: re.I must be passed as flags, not as count
    # (?<=pattern)xxx matches xxx only when it is preceded by pattern (lookbehind)
    new_text = re.sub(r"(?<=[a-zA-Z])n\'t", " not", new_text)   # abbreviation of not: aren't -> are not
    new_text = re.sub(r"(?<=[a-zA-Z])\'d", " would", new_text)  # abbreviation of would: I'd -> I would
    new_text = re.sub(r"(?<=[a-zA-Z])\'ll", " will", new_text)  # abbreviation of will
    new_text = re.sub(r"(?<=[Ii])\'m", " am", new_text)         # abbreviation of am
    new_text = re.sub(r"(?<=[a-zA-Z])\'re", " are", new_text)   # abbreviation of are
    new_text = re.sub(r"(?<=[a-zA-Z])\'ve", " have", new_text)  # abbreviation of have
    new_text = new_text.replace('\'', ' ').replace('.', '. ')
    return new_text
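  • A quick check of the abbreviation handling (the input sentence is purely illustrative, not from the dataset):
print(replace_abbreviations("It's great, I'd buy it again, they don't regret it"))
# -> It is great, I would buy it again, they do not regret it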

  • Lemmatization
    • We use the morphy method from the nltk package to restore word forms

nltk (Natural Language Toolkit) is an open-source Python library widely used in natural language processing. It provides methods for common NLP operations such as part-of-speech tagging, stemming and tokenization, and offers interfaces to more than 50 corpora and lexical resources

def get_lemma(word):
    lemma=wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

  • Remove stop words
#punctuation
punctuation = [",", ":", ";", ".", "!", "'", '"', "'", "?", "/", "-", "+", "&", "(", ")"]
stop_words=nltk.corpus.stopwords.words('english')+punctuation
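  • If the nltk corpora are not yet present locally, they need to be downloaded once (these are the standard nltk resource names; recent nltk versions may additionally ask for 'punkt_tab'):
nltk.download('stopwords')  # stop word lists used by nltk.corpus.stopwords
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('wordnet')    # WordNet dictionary used by wn.morphy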

  • We encapsulate the processing above into a single method
    • Abbreviation expansion → stemming / lemmatization → stop word removal
# Encapsulated as a pipeline
def prepare_text(n):
    tx = replace_abbreviations(str(n)) # expand abbreviations
    # Tokenization: for English this essentially splits on whitespace and punctuation
    tokens = nltk.word_tokenize(tx)
    # Lemmatization
    tokens = [get_lemma(token) for token in tokens]
    # Remove stop words
    tokens = [ i for i in tokens if i not in stop_words] # keep only the tokens that are not in the stop word list
    return tokens

  • Positive and negative comments are handled separately
clean_txt_positive=[prepare_text(s) for s in sample_positive['content']]
clean_txt_negative=[prepare_text(s) for s in sample_negative['content']]

  • View original text
sample_positive['content'][2]

'Very nice product! The ukulele is very light and the craftsmanship is great. Everything it came with was good as well. Needs lots of tuning in the beginning'

  • View processed text
clean_txt_positive[0]

['nice',
 'product',
 'ukulele',
 'light',
 'craftsmanship',
 'great',
 'everything',
 'come',
 'wa',
 'good',
 'well',
 'need',
 'lots',
 'tuning',
 'beginning']

  • Count word frequencies
Counter(clean_txt_positive[0]).most_common(2)

[('nice', 1), ('product', 1)]

5. Create word cloud

  • Create a method that returns the word counts, the average number of words per comment, and the vocabulary richness (number of distinct words / total number of words)
# clean_text is the list of processed comments: a two-dimensional list in which each comment is a list of its processed keywords
def get_words(clean_text):
    words_all = [] # list that will hold the words of all comments
    for words in clean_text:
        for word in words:
            words_all.append(word) # traverse every word of every comment and append it to the list
    total_words = list(set(words_all)) # how many distinct words appear in total: deduplicate words_all and convert back to a list
    all_words = Counter(words_all) # count how many times each word appears
    content_mean = len(words_all)/len(clean_text)  # average number of keywords per comment: total words / number of comments
    words_cap =  len(total_words)/len(words_all) # number of distinct words / total number of words
    return all_words,content_mean,total_words,words_cap
    
words_all_positive,content_mean_positive,total_words_positive,words_cap_positive=get_words(clean_txt_positive)
words_all_negative,content_mean_negative,total_words_negative,words_cap_negative=get_words(clean_txt_negative)

len(total_words_positive)

1832

len(total_words_negative)

959

  • View the average number of words per comment and the vocabulary richness: negative comments are noticeably longer and use a more varied vocabulary
content_mean_positive,words_cap_positive

(15.278152069297402, 0.11540884465163159)

content_mean_negative,words_cap_negative

(19.6457399103139, 0.21889979456745035)

  • Take the most frequent words in preparation for drawing the word cloud
positive_words_wordcloud=words_all_positive.most_common(100) # take the 100 most frequent words
negative_words_wordcloud=words_all_negative.most_common(100)
positive_words_wordcloud

[('ukulele', 402),
 ('love', 390),
 ('great', 381),
 ('wa', 356),
 ('good', 252),
 ('play', 236),
 ('tune', 219),
 ('come', 201),
 ('get', 200),
 ('tuner', 192),
 ('beginner', 189),
 ('daughter', 184),
 ......

  • Draw the word cloud of positive comments
(WordCloud()
    .add(series_name="Positive comment word cloud",
         data_pair=positive_words_wordcloud,  # data used to draw the word cloud
         word_size_range=[16, 80])  # word_size_range: range of font sizes
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Positive comment word cloud", 
            title_textstyle_opts=opts.TextStyleOpts(font_size=23) # set the title font size
        ),
        tooltip_opts=opts.TooltipOpts(is_show=True),  # Set to True, a prompt box will pop up when the mouse slides over the text
    )
    .render_notebook()
)

  • Draw the word cloud of negative comments
(WordCloud()
    .add(series_name="Negative comment word cloud", data_pair=negative_words_wordcloud, word_size_range=[16, 80])
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="Negative comment word cloud", title_textstyle_opts=opts.TextStyleOpts(font_size=23)
        ),
        tooltip_opts=opts.TooltipOpts(is_show=True),
    )
    .render_notebook()
)

Summary

  • Comment text mining
    • Mine user preferences and competitor dynamics to improve the competitiveness of our own products
    • Discover product update trends and spot product problems reported by users in time
  • Basic pipeline for English text processing
    • Tokenization → abbreviation expansion → stemming / lemmatization → stop word removal
    • Library used: nltk
  • word2vec word vectors
    • Train a word vector model on a corpus. The model corresponds to an N-dimensional semantic space (N must be specified manually), and every word in the corpus corresponds to a word vector
    • The similarity between word vectors can be used to judge semantic similarity
    • The gensim library can help us train a word vector model, as sketched below
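  • A minimal sketch of training word vectors with gensim on the tokenised comments (clean_txt_positive is the list of token lists built above; vector_size / window / min_count are illustrative values, not tuned ones, and in gensim 3.x the parameter is called size instead of vector_size):
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=clean_txt_positive,  # one token list per comment
    vector_size=100,               # dimensionality N of the semantic space
    window=5,                      # context window size
    min_count=2,                   # ignore words appearing fewer than 2 times
    workers=4)

# Words closest to 'tuner' in the learned space
# (only meaningful if 'tuner' survived the min_count filter)
print(w2v_model.wv.most_similar('tuner', topn=5))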
