FP growth algorithm finds frequent itemsets -- building FP Tree

There are two common algorithms for mining frequent itemsets: Apriori and FP-growth. Apriori mines frequent itemsets by repeatedly generating candidate sets and filtering them, which requires scanning the original data many times. When the original data is large, this causes a lot of disk I/O and the efficiency is relatively low. FP-growth differs from Apriori's "trial" strategy: it scans the original data only twice and compresses it into an FP Tree data structure, which is more efficient.

FP stands for frequent pattern. The algorithm consists of two steps: building the FP Tree and mining frequent itemsets from it.

FP Tree Representation

The FP Tree is constructed by reading in transactions one by one and mapping each of them to a path in the tree. Because different transactions may share items, their paths may partially overlap. The more the paths overlap, the better the compression achieved by the FP Tree structure. If the FP Tree is small enough to fit in memory, frequent itemsets can be extracted directly from this in-memory structure without repeatedly scanning the data stored on disk.

An FP Tree is shown below:

Generally, the FP Tree is smaller than the uncompressed data, because transactions often share items. In the best case, all transactions have the same itemset, and the FP Tree contains only a single path of nodes. The worst case occurs when every transaction has a unique itemset: because the transactions share no items, the FP Tree is effectively the same size as the original data.
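The best and worst cases can be checked with a small sketch. This is not the FP Tree implementation used later in this article: it only counts the nodes a prefix tree would need, by collecting the distinct prefixes of each transaction. The function name count_tree_nodes and the toy transactions are made up for illustration, and the transactions are assumed to be already filtered and sorted.

```python
def count_tree_nodes(transactions):
    """Count the nodes (excluding the root) of a prefix tree built from the transactions.

    Each distinct prefix of a transaction corresponds to exactly one tree node.
    """
    prefixes = set()
    for trans in transactions:
        for i in range(1, len(trans) + 1):
            prefixes.add(tuple(trans[:i]))
    return len(prefixes)


# Hypothetical toy data, assumed to be already filtered and sorted.
best_case = [['a', 'b', 'c']] * 4                   # every transaction has the same items
worst_case = [['a', 'b'], ['c', 'd'], ['e', 'f']]   # no items in common

for name, data in (('best', best_case), ('worst', worst_case)):
    total_items = sum(len(t) for t in data)
    print(name, 'items:', total_items, 'nodes:', count_tree_nodes(data))
# best items: 12 nodes: 3
# worst items: 6 nodes: 6
```

In the best case, 12 item occurrences collapse into 3 nodes; in the worst case, nothing is shared and the node count equals the number of item occurrences.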

The root node of the FP Tree is denoted by φ. Every other node holds a data item and its support count on that path. Each path consists of data items that meet the minimum support in the training data. The FP Tree also links all nodes holding the same item into a linked list, shown as blue lines in the figure above.

To access the same item quickly throughout the tree, a header table (headTable) is also maintained, which connects the nodes holding the same item. Each entry contains the data item, the item's global support count, and a pointer to the head of that item's linked list in the FP Tree.
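As a rough illustration of how such a table is used, the sketch below follows the linked list for one item and adds up the counts on the individual paths. The Node class, the linked_nodes helper and the two hand-built 'x' nodes are hypothetical and exist only for this example; the entry layout [support count, first node] is an assumption modelled on the description above.

```python
class Node:
    """Hypothetical minimal tree node: item name, count on its path,
    and a link to the next node holding the same item."""

    def __init__(self, name, count):
        self.name = name
        self.count = count
        self.node_link = None


def linked_nodes(head_table, item):
    """Yield every node for the item by following the linked list from the header table."""
    node = head_table[item][1]
    while node is not None:
        yield node
        node = node.node_link


# Two 'x' nodes on different paths, with counts 3 and 1 (made-up values).
x_first = Node('x', 3)
x_second = Node('x', 1)
x_first.node_link = x_second

# Each entry: item -> [global support count, first node of the linked list]
head_table = {'x': [4, x_first]}

path_counts = [n.count for n in linked_nodes(head_table, 'x')]
print(path_counts, sum(path_counts) == head_table['x'][0])
# [3, 1] True
```

The counts along the linked list add up to the item's global support, which is why the list is enough to find every occurrence of an item without walking the whole tree.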

Build FP Tree

Now we have the following data:

The FP-growth algorithm needs to scan the original training set twice to build the FP Tree.

In the first scan, all items that do not meet the minimum support are filtered out, and the remaining items are sorted in descending order of their global support. On top of this, for convenience of processing, items with equal support can additionally be sorted by key (the item name).
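The first scan can be sketched as follows. The transactions and the minimum support of 3 are taken from the code listing later in this article; the names support, frequent and filter_and_sort are introduced here for illustration only, and ties in support are broken by key in descending order, as that listing does.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
min_sup = 3

# Count how many transactions contain each item.
support = Counter(item for trans in transactions for item in set(trans))

# Drop every item that does not meet the minimum support.
frequent = {item: n for item, n in support.items() if n >= min_sup}


def filter_and_sort(trans):
    """Keep the frequent items, ordered by support and then by key, both descending."""
    kept = [item for item in trans if item in frequent]
    return sorted(kept, key=lambda item: (frequent[item], item), reverse=True)


for trans in transactions:
    print(filter_and_sort(trans))
# ['z', 'r']
# ['z', 'x', 'y', 't', 's']
# ['z']
# ['x', 's', 'r']
# ['z', 'x', 'y', 't', 'r']
# ['z', 'x', 'y', 't', 's']
```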

Results after the first scan

Second scan: construct the FP Tree.

The second scan works on the filtered data. If an item does not yet have a node at the current position on the path, a node is created for it; if this is the first node for that item, a pointer to it is added to the headTable. Otherwise, the item's existing node is found by following the path and its count is incremented. The specific process is as follows:

Transaction 001, {z,r}

Transaction 002, {z,x,y,t,s}

Transaction 003, {z}

Transaction 004, {x,s,r}


Transaction 005, {z,x,y,t,r}

Transaction 006, {z,x,y,t,s}

As shown above, the headTable is not created along with the FP Tree; it has already been created in the first scan. While building the FP Tree, its pointers only need to be set to the corresponding nodes. Starting from transaction 004, links between nodes must also be created so that nodes holding the same item on different paths are chained into linked lists.

The code is as follows:

def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat


def createInitSet(dataSet):
    # Map each transaction (as a frozenset) to the number of times it occurs
    retDict = {}
    for trans in dataSet:
        fset = frozenset(trans)
        retDict.setdefault(fset, 0)
        retDict[fset] += 1
    return retDict


class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None  # next node holding the same item
        self.parent = parentNode
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        print('   ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)


def createTree(dataSet, minSup=1):
    headerTable = {}
    # First pass over the data set: record the support of each item
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]

    # Remove the items that do not meet the minimum support
    lessThanMinsup = [k for k, v in headerTable.items() if v < minSup]
    for k in lessThanMinsup:
        del headerTable[k]

    freqItemSet = set(headerTable.keys())
    # If no item meets the minimum support, return None, None
    if len(freqItemSet) == 0:
        return None, None

    # Each header table entry becomes [support, first node of the linked list]
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]

    retTree = treeNode('φ', 1, None)
    # Second pass over the data set: build the FP Tree
    for tranSet, count in dataSet.items():
        # Keep only the frequent items of this transaction
        # key: item, value: global support of the item
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]

        if len(localD) > 0:
            # Sort the items of the transaction by global support, then by key;
            # equivalent to ORDER BY p[1] DESC, p[0] DESC
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: (p[1], p[0]), reverse=True)]
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable


def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:  # items[0] already has a node here
        inTree.children[items[0]].inc(count)  # increment its count
    else:  # add items[0] to inTree.children
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:  # first node for this item
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:  # append the new node to the item's linked list
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])

    if len(items) > 1:  # call updateTree() with the remaining ordered items
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)


def updateHeader(nodeToTest, targetNode):
    # Walk the linked list iteratively; recursion could exceed the recursion limit on long lists
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode


simpDat = loadSimpDat()
dictDat = createInitSet(simpDat)
myFPTree, myheader = createTree(dictDat, 3)
myFPTree.disp()

The code above does not sort the filtered items of each transaction during the first scan; the sorting is done during the second scan instead, which keeps the code simpler.

Console information:


The influence of item order on FP Tree

It is worth noting that the key ordering of items affects the structure of the FP Tree. The following two figures show FP Trees generated from the same training set. In Figure 1, items are sorted only by support, with no further ordering by key; in Figure 2, items with equal support are additionally sorted by key in descending order. The structure of the tree in turn affects how the subsequent frequent itemset mining proceeds.

Figure 1. No key sorting for items

Figure 2. Items sorted by key in descending order
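The effect of the tie-breaking order can also be checked numerically. The sketch below does not reproduce the two figures: it compares descending and ascending key order for items with equal support (an assumption chosen for the comparison) and counts the nodes a prefix tree would need in each case. The data and minimum support come from the code listing above; the function name count_tree_nodes is made up for this example.

```python
from collections import Counter

transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]
min_sup = 3

support = Counter(item for trans in transactions for item in set(trans))
frequent = {item: n for item, n in support.items() if n >= min_sup}


def count_tree_nodes(descending_keys):
    """Count prefix-tree nodes (excluding the root) for one tie-breaking order."""
    prefixes = set()
    for trans in transactions:
        kept = [item for item in trans if item in frequent]
        # Sort by key first, then by support; the second sort is stable,
        # so items with equal support keep their key order.
        kept.sort(reverse=descending_keys)
        kept.sort(key=lambda item: frequent[item], reverse=True)
        for n in range(1, len(kept) + 1):
            prefixes.add(tuple(kept[:n]))
    return len(prefixes)


print(count_tree_nodes(descending_keys=True))   # 10
print(count_tree_nodes(descending_keys=False))  # 12
```

For this data set, breaking ties by descending key gives a tree with 10 nodes, while ascending key order gives 12, so the same transactions compress differently depending on the order.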


The next article continues with how to discover frequent itemsets.


Source: WeChat official account "I am the 8."

This article is intended for learning, research and sharing. If you want to reprint it, please contact me and credit the author and source; non-commercial use only.



Posted on Mon, 18 May 2020 02:30:54 -0400 by wilbur