Visualization of community distribution with Gephi and the community package

The data set consists of 5000 comment texts, each made up of two parts: the comment itself and a score label. A label of 1 means a positive review (a 5-point rating in the text below) and 0 means a negative review (a 1-point rating). A few sample lines follow, and a quick label-count check is sketched right after them.

I have been using Lenovo computer. I believe Lenovo very much. This product is absolutely authentic and is completely consistent with the description	1
 This computer is very good. It runs very fast. It doesn't need cards to play games. It's still cost-effective	1
 It's disgusting to have one star at most for this price fluctuation and full gift activity	0
 Abnormal noise, screen light leakage, immovable blue screen, acer The technical support is too poor. Is there no test? These are obvious problems,	0
 Buy a day to reduce the price by 200. What else do you say about 30 day insurance? Is Jingdong still credible?	0
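
A minimal sketch to confirm the label split before any processing (the file name matches the one used in the code further below):

from collections import Counter

labels = Counter()
with open('1 5 points and 5 points comment text.txt', 'r', encoding='utf8') as f:
    for line in f:
        items = line.strip().split('\t')
        if len(items) == 2:
            labels[items[1]] += 1
print(labels)   # counts of the '1' (5-point) and '0' (1-point) labels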

Code flow

  1. Load the stop word list and segment the 1-point and 5-point comment texts with jieba. Write the data, with stop words removed, to json files for later use.
  2. Count word frequencies and obtain the high-frequency words of the 1-point and 5-point comment texts.
  3. Implement the wordNetwork class, which inherits from networkx's Graph and overrides the edge-addition and edge-filtering methods.
  4. Draw the word networks and visualize them with Gephi to observe the connection structure of common words.
  5. Use the community package and Gephi to visualize the community distribution.
  6. Compute the vector difference of all common words between the 1-point and 5-point comments.
import json
import jieba
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx 
from networkx import Graph
import collections
import numpy as np
import community  # the python-louvain package
from scipy import stats

plt.rcParams['font.sans-serif']=['SimHei'] #Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus']=False #Used to display negative signs normally

1. Load the stop word list and segment the 1-point and 5-point comment texts with jieba. After removing stop words, the results are stored in the good and bad lists respectively, where each sub-list holds the segmented words of one comment, and the data is written to json files for later use.

def load_sw(filepath):
    # Load the stop word list, one word per line
    sw = [line.strip() for line in open(filepath, 'r', encoding='utf8')]
    return sw


def cut_words(file, stopwords, separator='\t'):
    good = []
    bad = []
    i = j = 0
    with open(file, 'r', encoding='utf8') as f:
        for line in f:
            items = line.strip().split(separator)
            if len(items) == 2:
                if items[1] == '0':
                    # Negative (1-point) comment: segment with jieba and drop stop words
                    bad.append([])
                    for w in jieba.cut(items[0].strip()):
                        if w not in stopwords and w != ' ':
                            bad[i].append(w)
                    i += 1
                if items[1] == '1':
                    # Positive (5-point) comment
                    good.append([])
                    for w in jieba.cut(items[0].strip()):
                        if w not in stopwords and w != ' ':
                            good[j].append(w)
                    j += 1
    with open('good.json', 'w', encoding='gb18030') as fjson:
        json.dump(good, fjson, ensure_ascii=False, indent=2)
    with open('bad.json', 'w', encoding='gb18030') as fjson:
        json.dump(bad, fjson, ensure_ascii=False, indent=2)
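
For intuition, jieba is what splits each Chinese comment into words before the stop-word filter is applied; a minimal check on a made-up sentence:

import jieba
print('/'.join(jieba.cut('这台电脑运行速度很快')))   # prints the segmented words joined by '/'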


2. Count word frequencies and obtain the high-frequency words of the 1-point and 5-point comment texts.

# Statistical word frequency
def get_freq(wd_list):
    wf = {}
    for wd in wd_list:
        for w in wd:
            if w in wf:
                wf[w] += 1
            else:
                wf[w] = 1
    return wf


# Get topn word
def topn_w(topn, sorted_wf):
    top_list = []
    for i in range(topn):
        top_list.append(sorted_wf[i][0])
    return top_list
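
Since collections is already imported, the word-frequency count and the top-n lookup can also be written with collections.Counter; a minimal drop-in sketch (an alternative, not what the code above uses):

from collections import Counter

def get_freq_counter(wd_list):
    # Flatten the segmented comments and count every word
    return Counter(w for comment in wd_list for w in comment)

def topn_w_counter(topn, wd_list):
    # most_common returns (word, count) pairs sorted by descending count
    return [w for w, _ in get_freq_counter(wd_list).most_common(topn)]
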
def main():
    stopwords = load_sw('stopwords_list.txt')
    cut_words('1 5 points and 5 points comment text.txt', stopwords)
    
    # Read json file
    with open('good.json', 'r', encoding='gb18030') as f:
        good_data = json.load(f)
    with open('bad.json', 'r', encoding='gb18030') as f:
        bad_data = json.load(f)
    # Highly praised high-frequency words
    word_freq_good = get_freq(good_data)
    sorted_wf_good = sorted(word_freq_good.items(),key=lambda x: x[1],reverse=True)
    top_good = topn_w(25, sorted_wf_good)
    print('topn good:')
    print(top_good)
    # High frequency words with bad comments
    word_freq_bad = get_freq(bad_data)
    sorted_wf_bad = sorted(word_freq_bad.items(),key=lambda x: x[1],reverse=True)
    top_bad = topn_w(25, sorted_wf_bad)
    print('topn bad:')
    print(top_bad)


3. Implement the wordNetwork class, which inherits from networkx's Graph and overrides the edge-addition and edge-filtering methods.

# Inherit from networkx's Graph class
class wordNetwork(Graph):
    # Add co-occurrence edges: each word is linked to the following w-1 words
    # in the same comment (the default window w is 4)
    def add_edges_from_list(self, wd_list, w=4):
        for comment in wd_list:
            for idx, wd in enumerate(comment):
                for k in range(idx + 1, min(idx + w, len(comment))):
                    if not self.has_edge(wd, comment[k]):
                        self.add_edge(wd, comment[k], weight=1)
                    else:
                        self[wd][comment[k]]['weight'] += 1

    # Edge filtering: drop edges whose weight is below t, then remove isolated nodes
    def filt_edge(self, t=20):
        # Iterate over a copy of the edge list, since edges are removed inside the loop
        for edge in list(self.edges()):
            if self[edge[0]][edge[1]]['weight'] < t:
                self.remove_edge(edge[0], edge[1])
        del_list = []
        for n, nbrs in self.adj.items():
            if not nbrs:
                del_list.append(n)
        for item in del_list:
            self.remove_node(item)
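
A quick sanity check of the class on a toy comment list (made-up words, not from the data set):

toy = [['screen', 'bright', 'fast', 'cheap'],
       ['screen', 'fast', 'cheap', 'quiet']]
net = wordNetwork()
net.add_edges_from_list(toy, w=3)     # each word links to the next two words
print(net['screen']['fast'])          # {'weight': 2}: the pair co-occurs in both comments
net.filt_edge(t=2)                    # keep only edges seen at least twice
print(list(net.edges(data=True)))     # the surviving edges with their weights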
4. Draw the word networks, output the key words of the 1-point and 5-point texts (the top 10 words with the highest degree, i.e. number of edges), write the gexf files, and observe them with Gephi.
    # Construction of comment network
    good_net = wordNetwork()
    good_net.add_edges_from_list(good_data)
    good_net.filt_edge(t=130)
    # The word with the largest number of output edges
    good_degree_topn = sorted(good_net.degree,key=lambda x: x[1],reverse=True)
    print('topn degree good')
    print(good_degree_topn[0:10])
    # Output gexf file
    nx.write_gexf(good_net,'good.gexf')
    # Painting word network
    nx.draw(good_net, pos=nx.random_layout(good_net),with_labels=True, font_size=7,edge_color='r', node_size=300)
    plt.show()
    
    bad_net = wordNetwork()
    bad_net.add_edges_from_list(bad_data)
    bad_net.filt_edge(t=130)
    bad_degree_topn = sorted(bad_net.degree, key=lambda x: x[1],reverse=True)
    print('topn degree bad')
    print(bad_degree_topn[0:10])
    nx.write_gexf(bad_net,'bad.gexf')
    nx.draw(bad_net, pos=nx.random_layout(bad_net), with_labels=True,font_size=7, edge_color='r', node_size=300)
    plt.show()




Observed in Gephi the structure is clearer: with t=130, the 5-point word network is more complex, with more common words and denser connections.





5. Community detection
Visualization with the community package:

    # Partition the network into communities with the Louvain method
    part = community.best_partition(bad_net)
    #print(part)
    # Calculate modularity of the partition
    mod = community.modularity(part, bad_net)
    #print(mod)
    # Draw the network with nodes colored by community
    values = [part.get(node) for node in bad_net.nodes()]
    nx.draw_spring(bad_net, cmap=plt.get_cmap('jet'), node_color=values, with_labels=True)
    plt.show()
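
To inspect the same partition in Gephi rather than matplotlib, one option (an extra step, not shown in the original code) is to store each node's community id as a node attribute and export another gexf file:

    # Attach the Louvain community id to every node, then export for Gephi,
    # where nodes can be colored by the 'community' attribute
    nx.set_node_attributes(bad_net, part, 'community')
    nx.write_gexf(bad_net, 'bad_community.gexf')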


Visualization of community distribution with Gephi:
The 5-point comments are divided into 14 communities, three of which are notably large.
The community containing "computer" consists of overall positive words such as good, satisfied, good-looking, beautiful and easy to use.
The community containing "customer service" includes: solution, answer, attitude, patience, etc.
Win10, startup, system, operation and fluency form a community evaluating computer performance.


The 1-point comments are divided into 12 communities, three of which are notably large.
The community containing "computer" includes: garbage, laggy, bad, etc.
"Bad review", "regret" and "don't buy" form a community, reflecting a poor purchase experience.
"Customer service" and "return" form a community related to after-sales service.


6. Calculate the vector difference of all common words between the 1-point and 5-point comments
(1) Find the nodes shared by the 1-point and 5-point word networks; these are the common words.
(2) Count the total number of times each common word appears in the 1-point and in the 5-point comment texts as the word's importance (build a vector whose components are the total frequency of the corresponding word in the text).
(3) Normalize each vector: freq / sum(freq).
(4) Compute the Pearson correlation coefficient of the 1-point and 5-point vectors.
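For reference, the Pearson correlation that stats.pearsonr computes for two vectors x and y is r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² ), and it also returns a two-sided p-value for the test of non-correlation.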

# Correlation of common word vectors
def common_words_simi(good_w, bad_w, good_data, bad_data):
    common_w = {}         # words shared by the 1-point and 5-point networks, initialized to 0
    for w in good_w:
        if w in bad_w:
            common_w[w] = 0
    print(common_w.keys())
    # Importance vector of the common words in the 5-point text
    common = dict(common_w)        # fresh zeroed copy for the 5-point counts
    vec_good = []
    for wd in good_data:
        for w in wd:
            if w in common:
                common[w] += 1
    for key in common_w.keys():
        vec_good.append(common[key])
    # Normalization: freq / sum(freq)
    total = sum(vec_good)
    for i in range(len(vec_good)):
        vec_good[i] = vec_good[i] / total
    print(vec_good)

    # Importance vector of the common words in the 1-point text
    common = dict(common_w)        # reset counts before counting the 1-point text
    vec_bad = []
    for wd in bad_data:
        for w in wd:
            if w in common:
                common[w] += 1
    for key in common_w.keys():
        vec_bad.append(common[key])
    # Normalization
    total = sum(vec_bad)
    for i in range(len(vec_bad)):
        vec_bad[i] = vec_bad[i] / total
    print(vec_bad)
    # Pearson correlation
    print("relevance:")
    print(stats.pearsonr(vec_good, vec_bad))
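
A plausible call site inside main (an assumption; the original post does not show how the function is invoked) passes the node lists of the two filtered networks along with the segmented comment lists:

    # Node lists of the filtered networks serve as the 5-point and 1-point word sets
    common_words_simi(list(good_net.nodes()), list(bad_net.nodes()), good_data, bad_data)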

From the results, we can see that the common words are: 'power on', 'goods', 'computer', 'quality', 'delivery', 'things', 'system', 'buy', 'logistics', 'receive', 'speed', 'hard disk', 'solid state', 'service', 'mouse', 'seller', 'customer service', 'a few days', 'really', 'found', 'first time', 'online', 'attitude', 'ask', 'win10', 'slow', 'new'.

Pearson correlation of the two vectors: (0.9447392932592632, 1.2953042118156725e-13)
The first number is the correlation coefficient, which reflects the degree of linear correlation between the two vectors: 0 means no correlation, and values close to 1 mean strong positive correlation. A value of 0.9447 indicates the two vectors are highly positively correlated, i.e. the relative frequencies of the common words in the 1-point and 5-point comment texts are very similar, so their importance is almost the same.
The second number, 1.2953042118156725e-13, is the p-value of the test of non-correlation reported by stats.pearsonr.
Null hypothesis H0: there is no linear correlation between the two vectors.
Alternative hypothesis H1: there is a linear correlation between the two vectors.
At a significance level of 0.05, p < 0.05, so H0 is rejected: the two vectors are linearly correlated.
