There are two common algorithms for mining frequent itemsets: Apriori and FP-Growth. Apriori mines frequent itemsets by repeatedly constructing candidate sets and filtering them, which requires scanning the original data many times; when the dataset is large, the disk I/O cost becomes high and efficiency suffers. FP-Growth abandons Apriori's generate-and-test strategy: it scans the original data only twice and compresses it into an FP Tree data structure, which makes it considerably more efficient.
FP stands for "frequent pattern". The algorithm consists of two steps: building the FP Tree and mining frequent itemsets from it.
FP Tree Representation
The FP Tree is constructed by reading in transactions one by one and mapping each onto a path in the tree. Because different transactions may share some items, their paths may partially overlap. The more the paths overlap, the better the compression achieved by the FP Tree structure. If the FP Tree is small enough to be stored in memory, frequent itemsets can be extracted directly from this in-memory structure without repeatedly scanning the data stored on disk.
An FP Tree is shown below:
Generally, the FP Tree is smaller than the uncompressed data, because transactions often share common items. In the best case, all transactions contain the same itemset, and the FP Tree consists of a single path. The worst case occurs when every transaction has a unique itemset: since the transactions share no common items, the FP Tree is effectively the same size as the original data.
The root node of the FP Tree is denoted φ; every other node stores a data item and that item's support count along its path. Each path corresponds to a set of items that meet the minimum support in the training data. The FP Tree also links all nodes holding the same item into a linked list, drawn as blue lines in the figure above.
To access all occurrences of the same item in the tree quickly, a header table (headTable) is maintained, connecting the nodes that hold the same item. Each entry contains: the data item, the item's global support count, and a pointer to the head of that item's linked list in the FP Tree.
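A minimal sketch of this header-table layout in Python (the `TreeNode` class and `link_node` helper here are illustrative stand-ins, not part of the listing later in the article):

```python
class TreeNode:
    def __init__(self, name, count):
        self.name = name
        self.count = count
        self.nodeLink = None   # next node in the tree that carries the same item

# Each entry maps an item to [global support count, head of its node-link list].
header_table = {
    'z': [5, None],  # list head is filled in while the tree is being built
    'x': [4, None],
}

def link_node(header_table, item, node):
    # Append a newly created tree node to the tail of the item's node-link list.
    head = header_table[item][1]
    if head is None:
        header_table[item][1] = node
    else:
        while head.nodeLink is not None:
            head = head.nodeLink
        head.nodeLink = node

n1, n2 = TreeNode('z', 3), TreeNode('z', 2)
link_node(header_table, 'z', n1)
link_node(header_table, 'z', n2)
print(header_table['z'][1].count, header_table['z'][1].nodeLink.count)  # 3 2
```

Following the `nodeLink` chain from a header-table entry visits every occurrence of that item in the tree, which is exactly what the mining step needs.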
Build FP Tree
Now we have the following data:
The FP-Growth algorithm scans the original training set twice to build the FP Tree.
In the first scan, all items that do not meet the minimum support are filtered out, and the remaining items are sorted in descending order of global support. On this basis, for convenience of processing, ties can additionally be broken by sorting on the item keys.
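As a standalone sketch, using the same sample data and the same tie-breaking rule (support descending, then item key descending) as the full listing later in the article, the first scan can be written as:

```python
transactions = [['r', 'z', 'h', 'j', 'p'],
                ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
                ['z'],
                ['r', 'x', 'n', 'o', 's'],
                ['y', 'r', 'x', 'z', 'q', 't', 'p'],
                ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
minSup = 3

# Count the global support of every item.
support = {}
for trans in transactions:
    for item in trans:
        support[item] = support.get(item, 0) + 1

# Keep only items meeting the minimum support, then fix a global ordering.
frequent = {k: v for k, v in support.items() if v >= minSup}
order = sorted(frequent, key=lambda k: (frequent[k], k), reverse=True)
print(order)  # ['z', 'x', 'y', 't', 's', 'r']
```

Items `z` (5) and `x` (4) come first by support; `y`, `t`, `s`, `r` all have support 3, so their order is decided by the key tie-break.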
Results after the first scan
In the second scan, the FP Tree is constructed.
Each scanned transaction is filtered and sorted first. If a data item is encountered for the first time on the current path, a node is created for it and a pointer to that node is added to the headTable; otherwise the item's existing node is located along the path and its count is updated. The specific process is as follows:
Transaction 001,
Transaction 002,
Transaction 003,
Transaction 004,
Transaction 005,
Transaction 006,
As can be seen above, the headTable is not created together with the FP Tree; it is created during the first scan, and while building the FP Tree only the pointers need to be set to the corresponding nodes. Starting from transaction 004, links between nodes must be created so that the same item on different paths is chained into a linked list.
The code is as follows:
def loadSimpDat():
    simpDat = [['r', 'z', 'h', 'j', 'p'],
               ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
               ['z'],
               ['r', 'x', 'n', 'o', 's'],
               ['y', 'r', 'x', 'z', 'q', 't', 'p'],
               ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
    return simpDat

def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        fset = frozenset(trans)
        retDict.setdefault(fset, 0)
        retDict[fset] += 1
    return retDict

class treeNode:
    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue
        self.count = numOccur
        self.nodeLink = None
        self.parent = parentNode
        self.children = {}

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        print(' ' * ind, self.name, ' ', self.count)
        for child in self.children.values():
            child.disp(ind + 1)


def createTree(dataSet, minSup=1):
    headerTable = {}
    # First pass over the data set: record the support of each item
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + 1

    # Filter out items below the minimum support
    lessThanMinsup = list(filter(lambda k: headerTable[k] < minSup, headerTable.keys()))
    for k in lessThanMinsup:
        del headerTable[k]

    freqItemSet = set(headerTable.keys())
    # If no item meets the minimum support, return None, None
    if len(freqItemSet) == 0:
        return None, None

    # Expand each entry to [global support, head of the node-link list]
    for k in headerTable:
        headerTable[k] = [headerTable[k], None]

    retTree = treeNode('φ', 1, None)
    # Second pass over the data set: build the FP Tree
    for tranSet, count in dataSet.items():
        # Keep only the globally frequent items of this transaction,
        # key: item, value: the item's global support
        localD = {}
        for item in tranSet:
            if item in freqItemSet:
                localD[item] = headerTable[item][0]

        if len(localD) > 0:
            # Sort the items of this transaction by global support,
            # equivalent to: order by p[1] desc, p[0] desc
            orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: (p[1], p[0]), reverse=True)]
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable


def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:  # check if orderedItems[0] is in retTree.children
        inTree.children[items[0]].inc(count)  # increment count
    else:  # add items[0] to inTree.children
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:  # update header table
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])

    if len(items) > 1:  # call updateTree() with the remaining ordered items
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)


def updateHeader(nodeToTest, targetNode):  # this version does not use recursion
    while nodeToTest.nodeLink is not None:  # do not use recursion to traverse a linked list
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode

simpDat = loadSimpDat()
dictDat = createInitSet(simpDat)
myFPTree, myheader = createTree(dictDat, 3)
myFPTree.disp()
Note that the code does not sort each transaction's filtered items right after the first scan; instead, the sorting is deferred to the second scan, which keeps the code simpler.
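One detail worth noting: createInitSet keys the transactions by frozenset, so identical transactions collapse into a single entry with a count, and the order of items within a transaction is discarded (the global sort in the second scan restores a canonical order anyway). A small standalone illustration:

```python
def createInitSet(dataSet):
    retDict = {}
    for trans in dataSet:
        fset = frozenset(trans)
        retDict.setdefault(fset, 0)
        retDict[fset] += 1
    return retDict

data = [['a', 'b'], ['b', 'a'], ['a']]
counted = createInitSet(data)
print(counted[frozenset(['a', 'b'])])  # 2 -- ['a', 'b'] and ['b', 'a'] are the same set
print(counted[frozenset(['a'])])       # 1
```

This is why updateTree receives a count argument rather than being called once per raw transaction.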
Console information:
The influence of item order on FP Tree
It is worth noting that the ordering of items affects the structure of the FP Tree. The two figures below show FP Trees generated from the same training set. In Figure 1, the items are only sorted by support; in Figure 2, ties are additionally broken by sorting item keys in descending order. The structure of the tree in turn affects the results of the subsequent frequent-itemset discovery.
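A small sketch of why the ordering matters for compression: the same two transactions, ordered two different ways. Under one ordering they share a two-item prefix, i.e. one merged path in the FP Tree; under the other they share nothing. (The data and helper here are illustrative, not from the article's training set.)

```python
transactions = [['c', 'a', 'b'], ['d', 'b', 'a']]

# Global support counts: a=2, b=2, c=1, d=1.
support = {}
for t in transactions:
    for item in t:
        support[item] = support.get(item, 0) + 1

def prefix_len(p, q):
    # Length of the common prefix of two ordered transactions.
    n = 0
    for x, y in zip(p, q):
        if x != y:
            break
        n += 1
    return n

# Ordering 1: support descending, key descending as tie-break (as in createTree).
by_support = [sorted(t, key=lambda k: (support[k], k), reverse=True) for t in transactions]
# Ordering 2: purely by key, descending.
by_key = [sorted(t, reverse=True) for t in transactions]

print(by_support, prefix_len(*by_support))  # [['b', 'a', 'c'], ['b', 'a', 'd']] 2
print(by_key, prefix_len(*by_key))          # [['c', 'b', 'a'], ['d', 'b', 'a']] 0
```

Support-based ordering pushes the most common items toward the root, so transactions are more likely to share prefixes and merge into common paths.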
Figure 1. Items not additionally sorted by key
Figure 2. Items additionally sorted by key in descending order
The next article continues with how to discover frequent itemsets.
Source: WeChat official account "I am the 8."
This article is intended for learning, research, and sharing. For reprints, please contact me and credit the author and source; non-commercial use only!