[Hand-coding the algorithm] [Apriori] Principle and code implementation of the Apriori association rule algorithm

1. Preface

⭐ Ready to eat, copy and go, lazy portal: section 4.1. Fast food for the lazy (ready-to-copy code)

⭐ This article analyzes the Apriori algorithm from two angles: the principle (theory) and the code implementation (practice)

⭐ The theoretical part is mainly about what are frequent itemsets, support and confidence

⭐ The hands-on part reproduces the algorithm in code according to the principle and verifies the results on a simple data set. The code for this article is available at: Apriori code

⭐ If you are too lazy even to write the code, you can skip straight to section 5 and install a ready-made dependency package instead

2. Introduction

⭐ Association rule mining is usually an unsupervised learning task: it mines potential association rules by analyzing a data set. The most typical example is the beer-and-diapers story

⭐ Apriori is the first and most classic association rule mining algorithm. It uses an iterative, level-by-level search to find relationships between itemsets in the database and turn them into rules. Each iteration consists of a join step (a matrix-like combination of itemsets) and a pruning step (removing unnecessary intermediate results).

⭐ The algorithm can not only discover frequent itemsets, but also mine association rules between items.

⭐ The algorithm uses support to quantify frequent itemsets and confidence to quantify association rules

3. Principle

3.0. Example

We use the following transaction table to walk through the working steps of the Apriori algorithm and to introduce its concepts and principles.

The transactions are as follows:
Transaction ID | Product list
1 | Potatoes, diapers, beer
2 | Chocolate, milk, potatoes, beer
3 | Milk, diapers, beer
4 | Chocolate, diapers, beer
5 | Chocolate, beer

3.1. Concept introduction

⭐ Support: $P(A \cap B)$, the probability that A and B occur together. For example, if we have 5 transactions in total and 3 of them contain both beer and diapers, then $P(A \cap B) = 3/5 = 0.6$, i.e. the support is 0.6

⭐ Confidence: $P(B \mid A) = P(AB) / P(A)$, the probability that B occurs given that A has occurred. It measures how strongly the two events are correlated; more precisely, it is the correlation of A to B, which is not the same as the correlation of B to A. For example, with 5 transactions in total, 3 of which contain both beer (A) and diapers (B), we have $P(AB) = 3/5 = 0.6$; all 5 transactions contain beer, so $P(A) = 5/5 = 1.0$; 3 transactions contain diapers, so $P(B) = 3/5 = 0.6$. The confidence of beer → diapers is then $P(B \mid A) = P(AB)/P(A) = 0.6/1.0 = 0.6$, while the confidence of diapers → beer is $P(A \mid B) = P(AB)/P(B) = 0.6/0.6 = 1.0$ (a short code sketch after this list makes the arithmetic concrete)

⭐ Frequent k-itemset: an itemset is a collection of items; items can be commodities, so an itemset is a set of commodities. A frequent itemset is an itemset that appears often, i.e. meets the minimum support. For example, {diapers, beer} is called a frequent 2-itemset
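
To make the arithmetic above concrete, here is a minimal sketch (not the full algorithm from section 4) that computes support and confidence on the five example transactions; the helper names are just for illustration:

# A minimal sketch of the support / confidence arithmetic on the example transactions
transactions = [
    {'potatoes', 'diapers', 'beer'},
    {'chocolate', 'milk', 'potatoes', 'beer'},
    {'milk', 'diapers', 'beer'},
    {'chocolate', 'diapers', 'beer'},
    {'chocolate', 'beer'},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = P(antecedent and consequent) / P(antecedent)
    return support(antecedent | consequent) / support(antecedent)

print(support({'beer', 'diapers'}))        # 0.6
print(confidence({'beer'}, {'diapers'}))   # 0.6
print(confidence({'diapers'}, {'beer'}))   # 1.0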

3.2. Apriori principle

⭐ If an itemset is frequent, then all its subsets are also frequent. For example, if the minimum support is 0.5 and the support of {beer, diapers} is 0.6, then {beer, diapers} is frequent, and its subsets {beer} and {diapers} are frequent as well, with supports of 1.0 and 0.6 respectively

⭐ If an itemset is infrequent, then all its supersets are also infrequent. For example, if the minimum support is 0.5 and milk appears in only two orders, then the support of {milk} is 2/5 = 0.4, which is below the minimum support, so {milk} is infrequent. Its supersets include {beer, milk} and {diapers, milk}, with supports of 2/5 = 0.4 and 1/5 = 0.2 respectively, so they are infrequent too; in other words, every superset of {milk} is infrequent.
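
Both statements above are two sides of the anti-monotonicity of support. Written compactly (the $\mathrm{supp}(\cdot)$ notation is mine, not from the original text): $X \subseteq Y \implies \mathrm{supp}(X) \ge \mathrm{supp}(Y)$, i.e. a superset can never have higher support than any of its subsets. This is exactly what justifies the pruning step.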

3.3. Advantages

⭐ The association rules produced by the algorithm are generated from frequent itemsets, which ensures that the support of these rules reaches the specified level, making them general and convincing

⭐ The algorithm is simple, easy to understand and has low requirements for data

3.4. Disadvantages

⭐ When generating candidate itemsets at each step, the loop produces a huge number of combinations; the more items there are, the more computing resources are consumed. Without pruning, the time complexity is $O(2^n)$, where n is the number of items. For example, with 6 items the number of possible combinations (single items, pairs, triples, and so on) is $C_6^1 + C_6^2 + C_6^3 + C_6^4 + C_6^5 + C_6^6 = 2^6 - 1$ (a quick numeric check of this count follows this list)

⭐ Every time the item set support is calculated, all the data in the database are scanned and compared, resulting in a large I/O load.
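
A throwaway check of the combination count above (not part of the algorithm itself):

import math

# C(6,1) + C(6,2) + ... + C(6,6) should equal 2^6 - 1 = 63
n = 6
total = sum(math.comb(n, k) for k in range(1, n + 1))
print(total, 2 ** n - 1)   # 63 63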

3.5. Algorithm steps

⭐ Let's use the following steps to illustrate our algorithm

Step 1: input the data set X; our example consists of five transactions.
Step 2: collect the distinct items contained in X; in our example this is {milk, chocolate, diapers, beer, potatoes}.
Step 3: first iteration: scan the data set and count, for each single item, how many transaction records it appears in; record these items as the candidate 1-itemset C1 and compute the support of each item.
Step 4: set the minimum support; keep the members of C1 whose support is greater than or equal to the minimum support to obtain the frequent 1-itemset L1, then join the members of L1 to generate the candidate 2-itemset C2.
Step 5: keeping the minimum support unchanged, repeat step 4 (scan Ck to obtain Lk, then join Lk to generate Ck+1) until no itemset meets the minimum support any more, and output the final frequent itemsets.
Step 6: set the minimum confidence; for each frequent k-itemset, compute the confidence of the candidate rules it generates and keep the rules that meet the minimum confidence.
Step 7: repeat step 6 for every size of frequent itemset, from the smallest to the largest, and return the list of strong association rules.

⭐ Let's illustrate steps 1 to 5 with the tables below

----------------------- The following describes the working steps of the algorithm using the example from section 3.0 -----------------------

⭐ Prepare data: transaction data in the database

Transaction ID | Product list
1 | Potatoes, diapers, beer
2 | Chocolate, milk, potatoes, beer
3 | Milk, diapers, beer
4 | Chocolate, diapers, beer
5 | Chocolate, beer

⭐ The first round: scan the database to generate the candidate 1-itemset C1

Itemset | Support
{potatoes} | 0.40
{diapers} | 0.60
{beer} | 1.00
{milk} | 0.40
{chocolate} | 0.60

⭐ The first round: call the scan function to filter out itemsets with support less than 0.5, obtaining the frequent 1-itemset L1

Itemset | Support
{diapers} | 0.60
{beer} | 1.00
{chocolate} | 0.60

⭐ The second round: generate the candidate 2-itemset C2 from the frequent 1-itemset L1

Itemset | Support
{diapers, beer} | 0.60
{chocolate, beer} | 0.60
{chocolate, diapers} | 0.20

⭐ The second round: call the scan function to filter out itemsets with support less than 0.5, obtaining the frequent 2-itemset L2

Itemset | Support
{diapers, beer} | 0.60
{chocolate, beer} | 0.60

⭐ The third round: generate the candidate 3-itemset C3 from the frequent 2-itemset L2

Itemset | Support
{diapers, beer, chocolate} | 0.20

⭐ The third round: call the scan function to filter out itemsets with support less than 0.5, obtaining the frequent 3-itemset L3. Since L3 is empty, the algorithm ends here

Itemset | Support
{None} | None
-------------------------------We use the following steps to mine our strong association rules---------------------------------------

⭐ Step 1: get all the mined rules

Rule | Confidence
{beer} -> {diapers} | 0.60
{chocolate} -> {diapers} | 0.33
{diapers} -> {beer} | 1.00
{chocolate} -> {beer} | 1.00
{diapers, beer} -> {chocolate} | 0.33
{chocolate, beer} -> {diapers} | 0.33

⭐ Step 2: filter out the rules with confidence less than 0.7, which leaves the strong rules

Rule | Confidence
{diapers} -> {beer} | 1.00
{chocolate} -> {beer} | 1.00

🎉 So far we have obtained our two strong rules: a customer who buys diapers buys beer with 100% probability, and a customer who buys chocolate also buys beer with 100% probability

4. Code implementation

4.1. Fast food for the lazy (ready-to-copy code)

⭐ The code below has been verified and runs end to end; copy the whole block to use it directly

#1. Build candidate item set C1
def createC1(dataSet):
    c1 = list(set([y for x in dataSet for y in x]))
    c1.sort()
    c2 = [[x] for x in c1]
    return list(map(frozenset, c2))

#Convert candidate set Ck to frequent itemset Lk
#D: Raw data set
#Ck: candidate itemset Ck
#minSupport: minimum support
def scanD(D, Ck, minSupport):
    #Candidate set count
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt.keys(): ssCnt[can] = 1
                else: ssCnt[can] += 1

    numItems = float(len(D))
    Lk= []     # Frequent itemset Lk generated by candidate set item Cn
    supportData = {}    #Support Dictionary of candidate set item Cn

    #Calculate the support of candidate item set, supportData key: candidate, value: support
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            Lk.append(key)
        supportData[key] = support
    return Lk, supportData

#Join operation to convert frequent (k-1)-itemsets into candidate k-itemsets through splicing
def aprioriGen(Lk_1, k):
    Ck = []
    lenLk = len(Lk_1)
    for i in range(lenLk):
        #Sort the whole itemset first, then take the first k-2 items for comparison
        L1 = sorted(Lk_1[i])[:k - 2]
        for j in range(i + 1, lenLk):
            #If the first k-2 items are the same, merge the two sets
            L2 = sorted(Lk_1[j])[:k - 2]
            if L1 == L2:
                Ck.append(Lk_1[i] | Lk_1[j])
    return Ck

def apriori(dataSet, minSupport = 0.5):
    C1 = createC1(dataSet)
    L1, supportData = scanD(dataSet, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Lk_1 = L[k-2]
        Ck = aprioriGen(Lk_1, k)
        print("ck:",Ck)
        Lk, supK = scanD(dataSet, Ck, minSupport)
        supportData.update(supK)
        print("lk:", Lk)
        L.append(Lk)
        k += 1
    return L, supportData

#Generate association rules
#L: list of frequent itemsets (one list per level)
#supportData: a dictionary containing frequent itemset support data
#minConf minimum confidence
def generateRules(L, supportData, minConf=0.7):
    #List of rules with confidence
    bigRuleList = []
    #Traversal from frequent binomial set
    for i in range(1, len(L)):
        for freqSet in L[i]: 
            H1 = [frozenset([item]) for item in freqSet] # Split itemset
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList


# Whether the calculation meets the minimum reliability
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []
    #Use each conseq as the last part
    for conseq in H:
        # Calculate confidence
        P_A = supportData[freqSet.difference(conseq)]
        conf = supportData[freqSet] / P_A
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            # Three elements in a tuple: antecedent, consequent, and confidence
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    #Return to the following list
    return prunedH


# Evaluate rules
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if (len(freqSet) > (m + 1)):
        Hmp1 = aprioriGen(H, m + 1)
       # print(1,H, Hmp1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 0):
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

dataset = [['potato', 'baby diapers', 'Beer'], ['Chocolates', 'milk', 'potato', 'Beer'] , ['milk', 'baby diapers', 'Beer'], \
                ['Chocolates', 'baby diapers', 'Beer'], ['Chocolates', 'Beer']]
L, supportData = apriori(dataset, minSupport=0.5)
rules = generateRules(L, supportData, minConf=0.7)
for e in rules:
    print(e)

4.2. Code details

⭐ createC1(dataSet): scans all events (transactions) and builds the candidate 1-itemset C1, i.e. one frozenset per distinct item (commodity).

#1. Build candidate item set C1
def createC1(dataSet):
    c1 = list(set([y for x in dataSet for y in x]))
    c1.sort()
    c2 = [[x] for x in c1]
    return list(map(frozenset, c2))
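
For example, using the data set from section 4.1 (an illustrative run; the exact printed form may differ from yours):

dataset = [['potato', 'baby diapers', 'Beer'], ['Chocolates', 'milk', 'potato', 'Beer'],
           ['milk', 'baby diapers', 'Beer'], ['Chocolates', 'baby diapers', 'Beer'], ['Chocolates', 'Beer']]
C1 = createC1(dataset)
print(C1)
# [frozenset({'Beer'}), frozenset({'Chocolates'}), frozenset({'baby diapers'}),
#  frozenset({'milk'}), frozenset({'potato'})]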

⭐ scanD(D, Ck, minSupport): selects from the candidate itemset Ck the frequent itemset Lk that meets the minimum support

#Convert candidate set Ck to frequent itemset Lk
#D: Raw data set
#Ck: candidate itemset Ck
#minSupport: minimum support
def scanD(D, Ck, minSupport):
    #Candidate set count
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                if can not in ssCnt.keys(): ssCnt[can] = 1
                else: ssCnt[can] += 1

    numItems = float(len(D))
    Lk= []     # Frequent itemset Lk generated by candidate set item Cn
    supportData = {}    #Support Dictionary of candidate set item Cn
    #Calculate the support of candidate item set, supportData key: candidate, value: support
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            Lk.append(key)
        supportData[key] = support
    return Lk, supportData
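
Continuing the example run from above (the order of the returned itemsets may vary):

L1, supportData = scanD(dataset, C1, 0.5)
print(L1)
# e.g. [frozenset({'Beer'}), frozenset({'baby diapers'}), frozenset({'Chocolates'})]
print(supportData[frozenset(['milk'])])   # 0.4, so {milk} is filtered out of L1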

⭐ aprioriGen(Lk_1, k): splices the frequent (k-1)-itemsets Lk-1 into the candidate k-itemset Ck, which is then passed to scanD(D, Ck, minSupport) to generate the frequent itemset Lk

#Join operation to convert frequent (k-1)-itemsets into candidate k-itemsets through splicing
def aprioriGen(Lk_1, k):
    Ck = []
    lenLk = len(Lk_1)
    for i in range(lenLk):
        #Sort the whole itemset first, then take the first k-2 items for comparison
        L1 = sorted(Lk_1[i])[:k - 2]
        for j in range(i + 1, lenLk):
            #If the first k-2 items are the same, merge the two sets
            L2 = sorted(Lk_1[j])[:k - 2]
            if L1 == L2:
                Ck.append(Lk_1[i] | Lk_1[j])
    return Ck
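
Continuing the example, joining the frequent 1-itemsets yields the candidate 2-itemsets (element order inside each frozenset may vary):

C2 = aprioriGen(L1, 2)
print(C2)
# e.g. [frozenset({'Beer', 'baby diapers'}), frozenset({'Beer', 'Chocolates'}),
#       frozenset({'baby diapers', 'Chocolates'})]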

⭐ apriori(dataSet, minSupport = 0.5): obtains and saves the frequent itemsets L of every level, so that the confidence of rules over these frequent itemsets can be computed later

# Get and save the frequent itemset L of each level
def apriori(dataSet, minSupport = 0.5):
    C1 = createC1(dataSet)
    L1, supportData = scanD(dataSet, C1, minSupport)
    L = [L1]
    k = 2
    while (len(L[k-2]) > 0):
        Lk_1 = L[k-2]
        Ck = aprioriGen(Lk_1, k)
        # print("ck:",Ck)
        Lk, supK = scanD(dataSet, Ck, minSupport)
        supportData.update(supK)
        # print("lk:", Lk)
        L.append(Lk)
        k += 1
    return L, supportData
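
Running the whole pipeline on the example data set gives the frequent itemsets of every level; the last level is always empty, which is what stops the loop (printed order may vary):

L, supportData = apriori(dataset, minSupport=0.5)
for level, Lk in enumerate(L, start=1):
    print(level, Lk)
# 1 [frozenset({'Beer'}), frozenset({'baby diapers'}), frozenset({'Chocolates'})]
# 2 [frozenset({'baby diapers', 'Beer'}), frozenset({'Chocolates', 'Beer'})]
# 3 []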

⭐ generateRules(L, supportData, minConf=0.7): generates strong association rules from the frequent itemsets L and the minimum confidence

#Generate association rules
#L: list of frequent itemsets (one list per level)
#supportData: a dictionary containing frequent itemset support data
#minConf minimum confidence
def generateRules(L, supportData, minConf=0.7):
    #List of rules with confidence
    bigRuleList = []
    #Traversal from frequent binomial set
    for i in range(1, len(L)):
        for freqSet in L[i]: 
            H1 = [frozenset([item]) for item in freqSet] # Split itemset
            if (i > 1):
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList
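
Running the rule generation on the example reproduces the two strong rules from section 3; each tuple holds antecedent, consequent, and confidence (order may vary):

rules = generateRules(L, supportData, minConf=0.7)
for rule in rules:
    print(rule)
# (frozenset({'baby diapers'}), frozenset({'Beer'}), 1.0)
# (frozenset({'Chocolates'}), frozenset({'Beer'}), 1.0)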

⭐ calcConf(freqSet, H, supportData, brl, minConf=0.7): computes the confidence of each candidate rule and keeps only those that satisfy the minimum confidence (pruning the rest)

# Whether the calculation meets the minimum reliability
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []
    #Use each conseq as the last part
    for conseq in H:
        # Calculate confidence
        P_A = supportData[freqSet.difference(conseq)]
        conf = supportData[freqSet] / P_A
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            # Three elements in a tuple: antecedent, consequent, and confidence
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    #Return to the following list
    return prunedH

⭐ rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7): recursively builds rules with progressively larger consequents and collects all strong association rules that satisfy the minimum confidence

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    if (len(freqSet) > (m + 1)):
        Hmp1 = aprioriGen(H, m + 1)
       # print(1,H, Hmp1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        if (len(Hmp1) > 0):
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

5. Summary

⭐ After writing all of this, it turns out there are ready-made packages written by others that you can use directly. The code is as follows:

# Installation: pip install efficient-apriori
from efficient_apriori import apriori
# dataset is the list of transactions defined in section 4
freqItemSet, rules = apriori(dataset, 0.5, 0.7)
print(rules)

⭐ The Apriori algorithm is relatively inefficient. It is recommended to use the FP-Growth algorithm, which was developed as an improvement on Apriori; because it uses a tree data structure (the FP-tree), it is much more efficient. It is available through the fpgrowth_py package; the code is as follows:

# Installation: pip install fpgrowth_py
from fpgrowth_py import fpgrowth
dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
freqItemSet, rules = fpgrowth(dataset, 0.5, 0.7)
print(rules) 

6. References

📗 Baidu Encyclopedia: APRIORI
📗 Zhihu: Apriori, the top ten algorithms of data mining
📗 CSDN: Apriori algorithm - association analysis algorithm (I)
📗 efficient-apriori 2.0.1
📗 fpgrowth-py 1.0.0
