Graduation project: design and implementation of a news mining system based on the FP-growth algorithm

1 project background

News today is overwhelming in volume, and a single topic can generate countless articles. On specialist news websites, people follow different hot topics according to their careers and interests. Analyzing which news stories attract attention therefore reveals what the public cares about in a given field and which problems society needs to solve. It also helps identify the current focus of public opinion, which helps the government understand and properly guide public opinion and contributes to a more stable and harmonious society. Taking the financial field as an example, this project crawls a large volume of financial news from the web and finds hot spots by preprocessing the news text and applying density-based cluster analysis. For each hot spot found, word-level cluster analysis is then carried out to identify the people or things involved, so as to analyze the problems that society is concerned about in the economic field and the problems that remain to be solved.

2 algorithm architecture

The project analyzes hot news issues through text mining: hot spots are obtained by clustering the content of financial news captured from the Internet; each hot spot is then analyzed by clustering the words related to it, which yields the people, industries or organizations involved in the hot issue.

1. Capture a large number of financial news reports from three professional financial news websites (Sina Finance, Sohu Finance and Xinhuanet Finance) using news APIs, a crawler and multi-threaded parallel fetching;

2. Deduplicate the news and filter it by time period, then segment the news text with jieba and tag parts of speech, keeping parts of speech such as nouns, verbs and abbreviations. Before word segmentation, load a user-defined dictionary to increase segmentation accuracy. After word segmentation, use the stop word dictionary, disambiguation dictionary and reserved single-word dictionary to filter out words that are irrelevant to the topic and hurt clustering accuracy. The word list of each news item is then built, the news is clustered with DBSCAN after TF-IDF feature extraction, and the clusters are sorted by size;

3. For each cluster of news, to obtain the topic of the hot spot, extract the titles of its articles, rank the titles by importance with the TextRank algorithm, and use the highest-ranked title to describe the hot topic;

4. Segment all news content with jieba and train a word2vec word embedding model. Then, for each cluster of news, extract the segmented content, obtain the word vector of each word from the word2vec model, and use the FP-growth algorithm to mine related news.
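The clustering part of step 2 can be illustrated without any third-party library. The following is a minimal pure-Python sketch of TF-IDF weighting followed by DBSCAN over cosine distance. It is not the project's code, which would normally rely on jieba for segmentation and on a library implementation of both algorithms. The token lists, the smoothed IDF variant, and the `eps` and `min_pts` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one L2-normalised {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (one common variant, assumed here): ln((1+n)/(1+df)) + 1
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    # Vectors are already normalised, so the dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_pts):
    """Returns one cluster label per vector; -1 marks noise."""
    labels = [None] * len(vectors)

    def neighbours(i):
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

# Hypothetical, already segmented and filtered news texts
docs = [
    ["bank", "rate", "cut"],
    ["bank", "rate", "cut", "loan"],
    ["rate", "cut", "bank", "policy"],
    ["stock", "market", "rally"],
    ["stock", "market", "rally", "tech"],
    ["weather", "rain"],
]
labels = dbscan(tfidf_vectors(docs), eps=0.5, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Sorting the clusters by size, as step 2 describes, then amounts to counting how often each non-negative label occurs.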

3 principle of FP growth algorithm

3.1 FP Tree

An FP Tree is a tree structure for storing data, as shown in the figure below. Each branch represents an itemset of the data set, and the number on each node is the number of times that element occurs on that branch.

3.2 algorithm process

1 build FP Tree

  • Traverse the data set to count the occurrences of each element item, and remove the element items that do not meet the minimum support
  • Build FP Tree: read in each itemset and add it to an existing path. If the path does not exist, create a new path (each itemset is an unordered set, so its items are first sorted by global occurrence count before being added)

2 mining frequent itemsets from FP Tree

  • Get conditional pattern base from FP Tree
  • The conditional FP Tree of the corresponding element is constructed from the conditional pattern base, and the process is repeated until the tree contains a single element item

The algorithm process is relatively simple; the details become clearer in the implementation in the next section.
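The first step of building the tree (count, filter by minimum support, then put every transaction into one global order) can be sketched on its own. This is a minimal illustration using the sample transactions from the implementation section; breaking ties between equally frequent items alphabetically is an assumption made here so that the order is deterministic.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
min_sup = 3

# Count in how many transactions each item occurs.
counts = Counter(item for t in transactions for item in set(t))

# Keep only the items that meet the minimum support.
frequent = {item: c for item, c in counts.items() if c >= min_sup}

def sort_key(item):
    # Descending count; alphabetical tie-break (assumption, any fixed rule works)
    return (-frequent[item], item)

# Every transaction is reduced to its frequent items and sorted the same way,
# so that transactions sharing items also share a path prefix in the tree.
ordered = [sorted((i for i in set(t) if i in frequent), key=sort_key)
           for t in transactions]

for row in ordered:
    print(row)
```

With a minimum support of 3, only z, x, r, y, t and s survive, and the first transaction shrinks to `['z', 'r']`.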

3.3 algorithm implementation

3.3.1 building FP Tree

class treeNode:
    def __init__(self,nameValue,numOccur,parentNode):
        self.name=nameValue #Node name
        self.count=numOccur #Number of occurrences of node element
        self.nodeLink=None #Next node holding the same element in the element's linked list
        self.parent=parentNode #Parent node, needed to trace a prefix path back to the root
        self.children={} #Child nodes of this node: key is the child element name, value is the child node
    def inc(self,numOccur):
        self.count+=numOccur #Increase the count by the given number of occurrences
    def disp(self,ind=1):
        print("   "*ind,self.name,self.count) #Output the node name and count. The indentation indicates the depth of the node in the tree
        for child in self.children.values():
            child.disp(ind+1) #For child nodes, depth + 1

# Construct FP Tree
# dataSet is a dictionary: each key is an itemset (frozenset) and each value is the number of times that itemset occurs in the data
# minSup is the minimum support. The first step in constructing the FP Tree is to calculate the support of each element in the dataset and keep only the elements that meet the minimum support
def createTree(dataSet,minSup=1):
    headerTable={}
    #Traverse each set and count the occurrence times of each element in the data set
    for key in dataSet.keys():
        for item in key:
            headerTable[item]=headerTable.get(item,0)+dataSet[key]
    #Traverse each element and delete the elements that do not meet the minimum support
    for key in list(headerTable.keys()):
        if headerTable[key]<minSup:
            del headerTable[key]
    freqItemSet=set(headerTable.keys())
    #If no element meets the minimum support requirement, return None and end the function
    if len(freqItemSet)==0:
        return None,None
    for key in headerTable.keys():
        headerTable[key]=[headerTable[key],None] #[number of element occurrences, pointer to the first node of this element in the tree]
    retTree=treeNode("Null Set",1,None) #Initializes the top node of the FP Tree
    for tranSet,count in dataSet.items():
        localD={} #Frequent elements of this itemset and their global occurrence counts, used to sort the itemset
        for item in tranSet:
            if item in freqItemSet:
                localD[item]=headerTable[item][0]
        if len(localD)>0:
            #Sort by global occurrences (descending); ties are broken by element name so that every itemset uses the same order
            orderedItems=[v[0] for v in sorted(localD.items(),key=lambda p:(-p[1],p[0]))]
            updateTree(orderedItems,retTree,headerTable,count) #Populate the tree with the sorted itemset
    return retTree,headerTable

#Tree update function
#items is the itemset sorted by occurrence times, which is the itemset to be added to the tree; count is the number of occurrences of this itemset in the dataset
#inTree is the tree to be updated; headerTable is a header pointer table that stores all elements that meet the minimum support requirement
def updateTree(items,inTree,headerTable,count):
    #If the most frequent element of items is already a child of the current node, increase the count of that child by count
    if items[0] in inTree.children:
        inTree.children[items[0]].inc(count)
    else:#Otherwise the branch does not exist yet, so a new child node is added through the treeNode class
        inTree.children[items[0]]=treeNode(items[0],count,inTree)
        #If the header table has no node for this element yet, the new node becomes the head of the element's linked list
        if headerTable[items[0]][1] is None:
            headerTable[items[0]][1]=inTree.children[items[0]]
        else:#Otherwise append the new node to the end of the element's linked list
            updateHeader(headerTable[items[0]][1],inTree.children[items[0]])
    #When items has more than one element, updateTree is called recursively for the remaining elements
    if len(items)>1:
        updateTree(items[1:],inTree.children[items[0]],headerTable,count)

#Element list update function
#nodeToTest is the head of the linked list of the element to be updated
#targetNode is the element node to be added to the element linked list
def updateHeader(nodeToTest,targetNode):
    #Walk along the linked list until the last node is reached
    while nodeToTest.nodeLink is not None:
        nodeToTest=nodeToTest.nodeLink #Move from the first node towards the last, one link at a time
    #After finding the last node of the list, append targetNode as the new last node
    nodeToTest.nodeLink=targetNode


#Load simple dataset
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

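#Convert a data set in list format to dictionary format
def createInitSet(dataSet):
    retDict={}
    for trans in dataSet:
        key=frozenset(trans)
        retDict[key]=retDict.get(key,0)+1 #Identical transactions are counted, not overwritten
    return retDict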


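Input data: the simple dataset returned by loadSimpDat(), converted to dictionary format with createInitSet().

Build the FP Tree from this dataset with createTree() and print it with disp() to check whether it matches the FP Tree structure described in the previous section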
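As a rough cross-check of the tree shape, the same structure can be built with nested dictionaries. The sketch below is a simplified stand-in, not the implementation above: it keeps only the counted prefix tree and omits the header table and node links. The function names `build_fp_tree` and `render`, the minimum support of 3 and the alphabetical tie-break are assumptions for illustration.

```python
from collections import Counter

def build_fp_tree(transactions, min_sup):
    """Return a nested dict: item -> [count, children]."""
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    root = {}
    for t in transactions:
        # Frequent items only, descending count, alphabetical tie-break (assumption)
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            entry = node.setdefault(item, [0, {}])  # reuse the path or create it
            entry[0] += 1
            node = entry[1]
    return root

def render(node, depth=0):
    """Return the tree as indented text lines, children in alphabetical order."""
    lines = []
    for item in sorted(node):
        count, children = node[item]
        lines.append("   " * depth + f"{item} {count}")
        lines.extend(render(children, depth + 1))
    return lines

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
tree = build_fp_tree(transactions, min_sup=3)
print("\n".join(render(tree)))
```

With this ordering the root has two branches: x with count 4, shared by four transactions, and z with count 1 for the transaction that contains z and r but not x.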

3.4 mining frequent itemsets from FP Tree

The specific process is as follows:

1 get conditional pattern base from FP Tree

  • Conditional pattern base: the set of paths that end with the element item being searched for. Each path is a prefix path, and the set stores every prefix path together with its count value.
  • For example, the conditional pattern base of element "r" is {x,s}1,{z,x,y}1,{z}1: r occurs in three transactions, and each contributes one prefix path with count 1. Which items appear in each prefix depends on how items with equal occurrence counts are ordered
  • Prefix path: everything between the element you are looking for and the root node of the tree
  • Path count value: the count of the node of the element being searched for, at which the prefix path starts

2 construct the conditional FP Tree of corresponding elements by using the conditional pattern base

  • For each frequent item, a conditional FP Tree is created.
  • For example, to create the conditional FP Tree for element t, use the conditional pattern base of t as input and build the tree with the same logic as building the full FP Tree

3 iterate steps (1) and (2) until the tree contains a single element item

  • Next, continue to build the conditional FP Trees corresponding to {t,x}, {t,y} and {t,z} (tx, ty and tz are the frequent itemsets found in the conditional FP Tree of t) until a conditional FP Tree contains no elements
  • So far, we can get the frequent itemsets related to element t, including 2-element itemsets, 3-element itemsets...
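The three steps above can be sketched compactly if the conditional pattern bases are kept as plain lists of (prefix path, count) pairs instead of conditional FP Trees. This is a simplification for illustration, not the tree-based implementation: it follows the same recursion but skips the tree compression. `ITEM_ORDER` is an assumption: one valid descending-count order for the sample data, with ties arranged so that the prefix paths {x,s}, {z,x,y} and {z} for r come out as in the example.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
MIN_SUP = 3
# Items with support >= 3: z and x occur 4 times; y, s, r and t occur 3 times.
# The order among equally frequent items is arbitrary (assumption, see lead-in).
ITEM_ORDER = ['z', 'x', 'y', 's', 'r', 't']
RANK = {item: pos for pos, item in enumerate(ITEM_ORDER)}

def order(transaction):
    """Drop infrequent items and sort the rest by the global order."""
    return tuple(sorted((i for i in set(transaction) if i in RANK), key=RANK.get))

def conditional_pattern_base(pattern_base, item):
    """Step 1: non-empty prefix path and count for every path containing item."""
    return [(path[:path.index(item)], count)
            for path, count in pattern_base
            if item in path and path.index(item) > 0]

def mine(pattern_base, min_sup, suffix=frozenset()):
    """Steps 2 and 3: keep the items frequent in this base, recurse on each."""
    counts = Counter()
    for path, count in pattern_base:
        for item in path:
            counts[item] += count
    found = {}
    for item, support in counts.items():
        if support >= min_sup:
            itemset = suffix | {item}
            found[itemset] = support
            found.update(mine(conditional_pattern_base(pattern_base, item),
                              min_sup, itemset))
    return found

database = [(order(t), 1) for t in transactions]
r_base = conditional_pattern_base(database, 'r')
frequent_itemsets = mine(database, MIN_SUP)

print(r_base)                  # [(('z',), 1), (('x', 's'), 1), (('z', 'x', 'y'), 1)]
print(len(frequent_itemsets))  # 18
```

Any fixed global order yields the same frequent itemsets; the descending-count order matters for how compact the FP Tree becomes, not for the result.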

#Trace the whole path from a node back to the root
#leafNode is a node in treeNode format; prefixPath is the list that collects the path of the node. Pay attention to the existing contents of prefixPath before calling this function
def ascendTree(leafNode,prefixPath):
    if leafNode.parent!=None:
        prefixPath.append(leafNode.name) #Record the current node, then continue with its parent
        ascendTree(leafNode.parent,prefixPath)
#Gets the conditional pattern base of the specified element
#basePat is the specified element; treeNode is the first element node of the specified element list. If the "r" element is specified, treeNode is the first r node of the r element list
def findPrefixPath(basePat,treeNode):
    condPats={} #The conditional pattern base that holds the specified element
    while treeNode!=None: #While the linked list of the specified element has not been fully traversed
        prefixPath=[] #Start from an empty path for every node in the linked list
        ascendTree(treeNode,prefixPath) #Trace back the path of the current node of the element
        if len(prefixPath)>1:
            condPats[frozenset(prefixPath[1:])]=treeNode.count #prefixPath[0] is the element itself, the rest is its prefix path
        treeNode=treeNode.nodeLink #Points to the next node in the linked list of this element
    return condPats

#Mining frequent itemsets with FP Tree
#inTree: built FP Tree of the whole dataset
#headerTable: header pointer table of FP Tree
#minSup: minimum support, used to build conditional FP Tree
#preFix: cache of the frequent itemset found so far, set([]) format
#freqItemList: frequent itemset collection, list format

def mineTree(inTree,headerTable,minSup,preFix,freqItemList):
    #Sort in ascending order according to the occurrence times of elements in the header pointer table, that is, search for frequent itemsets from the bottom of the header pointer table
    bigL=[v[0] for v in sorted(headerTable.items(),key=lambda p:p[1][0])]
    for basePat in bigL:
        #Append the frequent item of the current depth to the existing frequent itemset, and then append the new itemset to the frequent itemset list
        newFreqSet=preFix.copy()
        newFreqSet.add(basePat)
        print("freqItemList add newFreqSet",newFreqSet)
        freqItemList.append(newFreqSet)
        #Gets the conditional pattern base of the current frequent item
        condPattBases=findPrefixPath(basePat,headerTable[basePat][1])
        #Use the conditional pattern base of the current frequent item to construct its conditional FP Tree
        myCondTree,myHead=createTree(condPattBases,minSup)
        #Recurse until the conditional FP Tree of the current frequent item is empty
        if myHead!=None:
            mineTree(myCondTree,myHead,minSup,newFreqSet,freqItemList)

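Next, mine the FP Tree just built: call mineTree on the tree and header table returned by createTree, with an empty set as preFix and an empty list as freqItemList.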


The frequent itemsets we mine from the FP Tree are as follows. The minimum support set here is 3:

The above figure shows the itemsets with a support of at least 3 (3 or more occurrences), that is, the frequent itemsets.
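Because the sample dataset is tiny, the expected result can also be cross-checked by brute force. The sketch below is not FP-growth: it simply enumerates candidate itemsets with `itertools.combinations` and counts their support directly, which is only practical for very small data. It assumes the five sample transactions and a minimum support of 3.

```python
from itertools import combinations

transactions = [set(t) for t in [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]]
min_sup = 3

# Only items that are frequent on their own can appear in a frequent itemset.
items = sorted(i for i in set().union(*transactions)
               if sum(i in t for t in transactions) >= min_sup)

frequent = {}
for size in range(1, len(items) + 1):
    level = {}
    for combo in combinations(items, size):
        support = sum(1 for t in transactions if t.issuperset(combo))
        if support >= min_sup:
            level[frozenset(combo)] = support
    if not level:
        break  # no frequent itemset of this size, so none of a larger size either
    frequent.update(level)

# Print smaller itemsets first, each with its support count.
for itemset, support in sorted(frequent.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

On this data the check finds 18 frequent itemsets, the largest being {t, x, y, z} with support 3, which is the list mineTree should produce for a minimum support of 3.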

4 system design display

To make the system easier to operate and understand, the project uses Python's tkinter module to provide an operation interface

Analysis Visualization

(to be continued...)

Tags: Python Big Data Algorithm

Posted on Wed, 10 Nov 2021 01:51:41 -0500 by pplexr