Frequent Pattern Growth Algorithm

The Frequent Pattern Growth (FP-Growth) algorithm is a highly efficient and popular algorithm for mining frequent itemsets from large transaction databases. It was proposed by Han, Pei, and Yin in 2000 as an improvement over the Apriori algorithm, primarily by avoiding the computationally expensive candidate generation step.

It’s widely used in areas like market basket analysis (e.g., “customers who buy bread and milk also tend to buy butter”), web usage mining, and bioinformatics, to discover hidden relationships and patterns in data.

Why FP-Growth? (Comparison to Apriori)

The traditional Apriori algorithm works by repeatedly scanning the database to generate candidate itemsets and then testing their support (frequency). This can be very inefficient, especially with large datasets or when dealing with long frequent itemsets, as it leads to:

  • High computational cost: Due to numerous candidate generations.
  • Multiple database scans: Which is time-consuming for large databases.

FP-Growth overcomes these limitations by using a divide-and-conquer strategy and a compact data structure called an FP-tree. It requires only two passes over the dataset.

Core Concepts:

  • Itemset: A collection of one or more items (e.g., {Milk, Bread}).
  • Support: The frequency of an itemset in the dataset. If {Milk, Bread} appears in 3 out of 10 transactions, its support count is 3 and its relative support is 30%.
  • Minimum Support (Min_Sup): A user-defined threshold. Only itemsets with support greater than or equal to this threshold are considered “frequent.”
  • Frequent Itemset: An itemset whose support is greater than or equal to the minimum support.
  • FP-Tree (Frequent Pattern Tree): A highly compressed tree-like structure that stores the frequent itemsets in a hierarchical manner. It’s designed to be compact by sharing common prefixes among transactions.
  • Conditional Pattern Base: For a given frequent item, the set of prefix paths in the FP-tree leading to that item’s nodes (excluding the item itself), each weighted by the count of the item node it reaches.
  • Conditional FP-Tree: A smaller FP-tree constructed from a conditional pattern base.
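The first few concepts can be pinned down with a few lines of plain Python. This sketch uses the five-transaction example developed later in this article with Min_Sup = 2; the function and variable names are illustrative:

```python
from collections import Counter

transactions = [
    {"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "E"},
    {"B", "D", "F"}, {"C", "E"},
]
MIN_SUP = 2  # minimum support count

def support(itemset, transactions):
    """Support count: number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets: items whose individual support meets MIN_SUP.
item_counts = Counter(i for t in transactions for i in t)
frequent_items = {i: c for i, c in item_counts.items() if c >= MIN_SUP}

print(support({"C", "E"}, transactions))  # 2 -> relative support 2/5 = 40%
print(sorted(frequent_items))             # ['A', 'B', 'C', 'D', 'E'] ('F' is dropped)
```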

How FP-Growth Algorithm Works (Two Main Steps):

Step 1: Construct the FP-Tree

This step involves two sub-passes over the original transaction database:

  1. First Pass (Counting Frequencies):

    • Scan the entire transaction database once.
    • Count the frequency (support) of each individual item.
    • Identify all items that meet the Min_Sup threshold. These are the frequent 1-itemsets.
    • Sort these frequent items in descending order of their frequencies. This order is crucial for FP-tree compaction. Let’s call this the F_list (or header table).
  2. Second Pass (Building the Tree):

    • Create the root node of the FP-tree, labeled “null”.
    • Scan the transaction database a second time. For each transaction:
      • Filter out the infrequent items from the transaction.
      • Sort the remaining frequent items in the transaction according to the F_list (descending order of global frequency).
      • Insert the sorted transaction into the FP-tree:
        • Start from the root node.
        • For each item in the sorted transaction, if the current node has a child corresponding to that item, traverse to that child and increment its count.
        • If no such child exists, create a new child node for the item with a count of 1 and link it from the current node.
        • Each newly created node also needs a node-link pointer that connects it to the next occurrence of the same item in the tree. These node-links are maintained in the header table, which contains each frequent item and a pointer to the first node in the FP-tree representing that item.
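The two passes above can be sketched in plain Python. This is a minimal illustration, not a production implementation; the FPNode class and all names are my own, and node-links are threaded onto the front of each chain for simplicity:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}     # item -> FPNode
        self.node_link = None  # next node in the tree carrying the same item

def build_fptree(transactions, min_sup):
    # Pass 1: count item supports and fix the global frequency order (F_list).
    counts = Counter(i for t in transactions for i in t)
    f_list = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
              if c >= min_sup]
    rank = {item: r for r, item in enumerate(f_list)}

    root = FPNode(None, None)
    header = {item: None for item in f_list}  # item -> head of its node-link chain

    # Pass 2: insert each filtered, F_list-sorted transaction into the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Thread the new node onto the front of the item's node-link chain.
                child.node_link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header, f_list

transactions = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "E"},
                {"B", "D", "F"}, {"C", "E"}]
root, header, f_list = build_fptree(transactions, min_sup=2)
print(f_list)                    # ['A', 'B', 'C', 'D', 'E']
print(root.children["A"].count)  # 3 (A:1 -> A:2 -> A:3 across T1-T3)
```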

FP-Growth Algorithm Example: Tabular Walkthrough

Let’s illustrate the FP-Growth algorithm with a step-by-step example in tabular form.

Original Transaction Database

Assume our minimum support count (Min_Sup) is 2.

Transaction ID | Items
T1 | {A, B, C}
T2 | {A, B, D}
T3 | {A, C, E}
T4 | {B, D, F}
T5 | {C, E}

Step 1: Construct the FP-Tree

This step involves two passes.

Pass 1: Count Item Frequencies & Create Sorted F_List

Item | Frequency | Frequent (>= Min_Sup)?
A | 3 | Yes
B | 3 | Yes
C | 3 | Yes
D | 2 | Yes
E | 2 | Yes
F | 1 | No

Sorted Frequent Items (F_list), in descending order of frequency: A, B, C, D, E (ties broken arbitrarily; here alphabetically). This order is crucial for tree compression.

Pass 2: Build the FP-Tree and Header Table

Here’s how each transaction is processed to build the tree. The FP-Tree is a conceptual structure, but we’ll show the path added and how the Header Table (which links to the actual nodes in the tree) is updated.

Initial State:

  • FP-Tree: Just a “null” root.
  • Header Table: Empty for all items.

Transaction ID | Filtered & Sorted Transaction (per F_list) | Path Added to FP-Tree (Nodes & Counts) | Header Table Update
T1 | {A, B, C} | (Root) -> A:1 -> B:1 -> C:1 | A: NodeA1, B: NodeB1, C: NodeC1
T2 | {A, B, D} | (Root) -> A:2 -> B:2 -> D:1 | D: NodeD1 (child of NodeB1)
T3 | {A, C, E} | (Root) -> A:3 -> C:1 -> E:1 | C: NodeC2 (new child of NodeA1), E: NodeE1 (child of NodeC2)
T4 | {B, D} | (Root) -> B:1 -> D:1 | B: NodeB2 (new path from root), D: NodeD2 (child of NodeB2)
T5 | {C, E} | (Root) -> C:1 -> E:1 | C: NodeC3 (new path from root), E: NodeE2 (child of NodeC3)

Note that the actual FP-Tree shares prefixes: the A:1, A:2, and A:3 entries in the “Path Added” column are all the same node (NodeA1), whose count is incremented to 2 by T2 and to 3 by T3; likewise, B:1 and B:2 on that path are the same node (NodeB1).
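The “Filtered & Sorted Transaction” column can be reproduced in a couple of lines (a sketch; `f_list` is the order derived in Pass 1, and `filter_and_sort` is an illustrative name):

```python
f_list = ["A", "B", "C", "D", "E"]  # global frequency order from Pass 1
rank = {item: r for r, item in enumerate(f_list)}

def filter_and_sort(transaction):
    """Drop infrequent items, then order the rest by the F_list."""
    return sorted((i for i in transaction if i in rank), key=rank.get)

print(filter_and_sort({"B", "D", "F"}))  # ['B', 'D']  (F is infrequent, dropped)
print(filter_and_sort({"E", "C", "A"}))  # ['A', 'C', 'E']
```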

Step 2: Mine Frequent Itemsets from the FP-Tree

This is a recursive process. We start from the least frequent item in our F_list (E), and work our way up to the most frequent (A).



1. Item: E

Node for E (from Header Table) | Path to Root (Conditional Pattern) | Count
NodeE1 (from T3 path) | (A, C) | 1
NodeE2 (from T5 path) | (C) | 1

  • Conditional Pattern Base for E: { (A, C):1, (C):1 }
  • Frequencies in CPB for E: A:1, C:2
  • Conditional FP-Tree for E (Min_Sup=2): Only ‘C’ is frequent. The tree is (null) -> C:2.
  • Frequent Itemsets from E’s Conditional Tree: {C}
  • Frequent Itemsets with E: {C, E} (Support = 2)
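The arithmetic for E’s conditional pattern base can be checked mechanically. In this sketch the pattern base is transcribed from the table above; `cpb_E` and `conditional_frequent` are illustrative names:

```python
from collections import Counter

MIN_SUP = 2
# Conditional pattern base for E: prefix path -> count (read off the FP-tree).
cpb_E = {("A", "C"): 1, ("C",): 1}

# Item frequencies within the pattern base.
freq = Counter()
for path, count in cpb_E.items():
    for item in path:
        freq[item] += count

conditional_frequent = {i: c for i, c in freq.items() if c >= MIN_SUP}
print(conditional_frequent)  # {'C': 2} -> frequent itemset {C, E} with support 2
```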

2. Item: D

Node for D (from Header Table) | Path to Root (Conditional Pattern) | Count
NodeD1 (from T2 path) | (A, B) | 1
NodeD2 (from T4 path) | (B) | 1

  • Conditional Pattern Base for D: { (A, B):1, (B):1 }
  • Frequencies in CPB for D: A:1, B:2
  • Conditional FP-Tree for D (Min_Sup=2): Only ‘B’ is frequent. The tree is (null) -> B:2.
  • Frequent Itemsets from D’s Conditional Tree: {B}
  • Frequent Itemsets with D: {B, D} (Support = 2)

3. Item: C

Node for C (from Header Table) | Path to Root (Conditional Pattern) | Count (from C’s node)
NodeC1 (from T1 path) | (A, B) | 1
NodeC2 (from T3 path) | (A) | 1
NodeC3 (from T5 path) | (empty prefix) | 1

  • Conditional Pattern Base for C: { (A, B):1, (A):1, (empty):1 }
  • Frequencies in CPB for C: A:2, B:1
  • Conditional FP-Tree for C (Min_Sup=2): Only ‘A’ is frequent. The tree is (null) -> A:2.
  • Frequent Itemsets from C’s Conditional Tree: {A}
  • Frequent Itemsets with C: {A, C} (Support = 2)

4. Item: B

Node for B (from Header Table) | Path to Root (Conditional Pattern) | Count (from B’s node)
NodeB1 (from T1 path) | (A) | 1
NodeB2 (from T2 path) | (A) | 1
NodeB3 (from T4 path) | (empty prefix) | 1

  • Conditional Pattern Base for B: { (A):1, (A):1, (empty):1 }
  • Frequencies in CPB for B: A:2
  • Conditional FP-Tree for B (Min_Sup=2): Only ‘A’ is frequent. The tree is (null) -> A:2.
  • Frequent Itemsets from B’s Conditional Tree: {A}
  • Frequent Itemsets with B: {A, B} (Support = 2)

5. Item: A

  • ‘A’ is the most frequent item, so its nodes hang directly off the root and its conditional pattern base is empty: no itemsets beyond {A} itself have A as their least-frequent member. Larger itemsets such as {A, B, C} would surface when recursing deeper into the conditional trees of B and C; in this example {A, B, C} has support 1, below Min_Sup, so it is not frequent.

Final Frequent Itemsets Mined:

Based on this walkthrough (and assuming a comprehensive recursive mining of all conditional trees for A, B, C, D, E):

  • {A} (Support 3)
  • {B} (Support 3)
  • {C} (Support 3)
  • {D} (Support 2)
  • {E} (Support 2)
  • {A, B} (Support 2)
  • {A, C} (Support 2)
  • {B, D} (Support 2)
  • {C, E} (Support 2)
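The whole mining process can be sketched end-to-end. This compact recursion follows FP-Growth’s divide-and-conquer scheme but, for brevity, represents each conditional database as a weighted transaction list rather than an explicit FP-tree; all names are illustrative:

```python
from collections import Counter

def fpgrowth(transactions, min_sup):
    """Mine all frequent itemsets; returns {frozenset: support count}.

    `transactions` is a list of (item-set, count) pairs so the same code
    can recurse on conditional pattern bases.
    """
    counts = Counter()
    for items, n in transactions:
        for item in items:
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_sup}
    # Global frequency order (ties broken alphabetically), as in the F_list.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {i: r for r, i in enumerate(order)}

    result = {}
    # Mine suffixes from least to most frequent item, as in Step 2.
    for item in reversed(order):
        result[frozenset([item])] = frequent[item]
        # Conditional pattern base: for each transaction containing `item`,
        # keep only the frequent items that precede `item` in F_list order.
        cpb = []
        for items, n in transactions:
            if item in items:
                prefix = frozenset(i for i in items
                                   if i in rank and rank[i] < rank[item])
                if prefix:
                    cpb.append((prefix, n))
        # Recurse, then append `item` to every itemset mined from the CPB.
        for sub, sup in fpgrowth(cpb, min_sup).items():
            result[sub | {item}] = sup
    return result

db = [({"A", "B", "C"}, 1), ({"A", "B", "D"}, 1), ({"A", "C", "E"}, 1),
      ({"B", "D", "F"}, 1), ({"C", "E"}, 1)]
patterns = fpgrowth(db, 2)
print(len(patterns))  # 9 itemsets, matching the list above
```

Restricting each conditional pattern base to items ranked above the suffix item guarantees every frequent itemset is generated exactly once.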

Advantages of FP-Growth:

  • No Candidate Generation: This is the biggest advantage, significantly reducing computation, especially for dense datasets.
  • Fewer Database Scans: Typically only two full passes over the database are required.
  • Compact Data Structure: The FP-tree compresses the dataset, making it memory-efficient if many transactions share common prefixes.
  • Faster Execution: Generally outperforms Apriori, especially for large datasets.

Disadvantages of FP-Growth:

  • Memory Intensive for Sparse Data: If the dataset is very sparse (few common items per transaction), the FP-tree can become wide and not offer significant compression, potentially consuming a lot of memory.
  • Complex Implementation: Compared to Apriori, building and recursively mining the FP-tree is more complex to implement.
  • Not Suitable for Dynamic Data: If transactions are constantly added or removed, the FP-tree needs to be rebuilt, which can be inefficient.

Overall, FP-Growth is a powerful and efficient algorithm for frequent itemset mining, particularly well-suited for large and dense datasets where the overhead of candidate generation in Apriori would be prohibitive.
