Data mining involves information collection, storage, analysis and statistics.

Analysis means discovering hidden information in what are often very large data sets. Several methods exist; this post focuses on dependency modelling, that is, finding patterns and building rules to model the dependencies in the data.

Dependency Modelling

Finding frequent patterns is the bulk of the work; their form ranges from simple groups of items to non-trivial graph structures. Efficient search algorithms are key, because without them the search space grows exponentially even for the simplest patterns.

Building rules to express the dependencies and relationships in the data follows the pattern search, and it is these rules that provide insight. Examples include identifying supermarket products that sell together, which helps shop-floor designers, or discovering protein interactions, which helps speed up drug design in medicine.

Itemset Mining

Itemset mining identifies items that frequently occur together.
For example, if bread is being bought, what is the chance that butter is also bought?

The answer depends on how often the two are bought together, known as the support of the itemset, supp(XY), and likewise on the support of the itemset containing just bread, supp(X).

With these numbers in hand, we can create an association rule that tells us whether butter is likely to be bought when bread is already in the basket. An association rule is strong if it reaches a minimum confidence threshold:

Confidence(Y|X) = \frac{supp(XY)}{supp(X)} > minConfidence
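As a rough illustration, here is a minimal Python sketch of the calculation; the tiny transaction list and the minConfidence value of 0.7 are invented purely for this example.

# Sketch: support and confidence for the rule bread -> butter.
# The transactions and the threshold below are made up for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

supp_x = support({"bread"}, transactions)             # supp(X)  = 0.8
supp_xy = support({"bread", "butter"}, transactions)  # supp(XY) = 0.6
confidence = supp_xy / supp_x                         # 0.75

min_confidence = 0.7
print(confidence, confidence > min_confidence)        # 0.75 True

Here the rule bread => butter would count as strong, because its confidence (0.75) exceeds the chosen threshold.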

Efficient Search

Obviously, to build such rules and infer dependencies between itemsets, we need to count the support values. A brute-force algorithm might simply count the support of every possible itemset. As you will see, this is a bad idea.

Brute-Force
The naive algorithm iteratively generates all candidates. Starting with itemsets of size 1, it enumerates every possible itemset of size k, increasing k at each step until itemsets of the largest possible size have been generated.
With the candidates in place, it checks each one against every transaction in the database (D); if the candidate is a subset of the transaction, the candidate's support count is incremented by one.
This candidate search can be viewed as a lattice, which can be traversed in a breadth-first or depth-first manner.

[Figure: an itemset lattice. A large number of candidates, from just 5 items.]

Checking every candidate against every transaction takes O(2^k \cdot |D| \cdot w) operations, where k is the number of distinct items and w the maximum transaction width, and counting each candidate with its own pass requires up to 2^k - 1 database scans.
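A small Python sketch of this brute-force counting (the helper name is my own) makes the blow-up easy to see:

from itertools import combinations

# Naive sketch: enumerate every possible itemset and count its support with
# a full pass over the database for each candidate size.
def brute_force_supports(transactions):
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    supports = {}
    for k in range(1, len(items) + 1):            # candidate sizes 1..k
        for candidate in combinations(items, k):  # every possible k-itemset
            candidate = frozenset(candidate)
            supports[candidate] = sum(candidate <= t for t in transactions)
    return supports

# Even with only 5 distinct items there are already 2^5 - 1 = 31 candidates to count.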

Apriori (1994)
The Apriori algorithm can save a lot of this computation. It relies on a simple fact:

X \subseteq Y \implies supp(X) \geq supp(Y)

This is the anti-monotone property, and it tells us that if Y is frequent, then all of its subsets must also be frequent; conversely, if X is infrequent, then all of its supersets are also infrequent.
Apriori uses this to perform support-based pruning, shrinking the candidate search space as infrequent candidates are found. The algorithm begins with all candidates of size 1 and scans the database for their support counts. It then generates candidates one size larger (k+1) using only the frequent candidates from the previous round, and counts their support. Each time a candidate turns out to be infrequent, a whole branch of the candidate search space is pruned.

[Figure: support-based pruning. An infrequent itemset prunes all of its supersets.]
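A hedged sketch of this level-wise loop in Python, assuming min_support is an absolute count and keeping the join and prune steps deliberately simple:

from itertools import combinations

# Sketch of Apriori: only frequent k-itemsets are joined into (k+1)-candidates,
# and a candidate is kept only if every k-subset of it is frequent (pruning).
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = set().union(*transactions)

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    frequent = {c: s for c, s in count({frozenset([i]) for i in items}).items()
                if s >= min_support}                        # level 1
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}               # join step
        candidates = {c for c in candidates                 # prune step
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c: s for c, s in count(candidates).items() if s >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

Note that each pass over the transactions corresponds to one database scan per candidate size, rather than one per candidate as in the brute-force version.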

Apriori falters when it has to generate large numbers of candidates and repeatedly scan the database. It is these costs that FP-growth tackles.

FP-growth (2000)
FP-growth takes a pattern-fragment-growth approach, removing the cost of generating candidates altogether. It has two stages: first it constructs an FP-tree to act as an index for the database, and second it applies a divide-and-conquer mining algorithm to find frequent patterns using that FP-tree.

FP-tree construction requires two database scans: one to identify the frequent items, and another to build the tree. The resulting tree represents the complete information required for frequent pattern mining.

[Figure: example FP-tree for the frequent items {f, c, a, b, m, p}.]
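A simplified Python sketch of the two-pass construction, again assuming min_support is an absolute count; the header table here is just a dict of node lists, a simplification of the node-link structure in the paper.

from collections import defaultdict

class Node:
    # One FP-tree node: an item, a count, a parent link and child nodes.
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count individual items and keep only the frequent ones.
    item_counts = defaultdict(int)
    for t in transactions:
        for item in t:
            item_counts[item] += 1
    frequent = {i: c for i, c in item_counts.items() if c >= min_support}

    # Scan 2: insert each transaction with its frequent items sorted by
    # descending frequency, so shared prefixes share a single path.
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header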

The FP-growth mining algorithm takes the initial FP-tree and recursively finds the complete set of frequent patterns. At each step, for a given itemset, it extracts a conditional pattern base and a conditional FP-tree from which further frequent patterns can be mined. It begins with length-1 itemsets and progressively grows each one by mining the corresponding conditional pattern base, thus avoiding costly candidate generation.

[Figure: conditional pattern base and conditional FP-tree for item 'm'.]
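Building on the FP-tree sketch above, extracting the conditional pattern base for a single item might look like this; the recursion that builds the conditional FP-tree and grows the suffix pattern is deliberately omitted.

def conditional_pattern_base(item, header):
    # Walk from every node holding `item` up to the root, recording the
    # prefix path together with that node's count.
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# For item 'm' in the figure above this yields the prefix paths that FP-growth
# would mine recursively, via a conditional FP-tree, to grow patterns ending in 'm'.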

FP-growth performs pattern mining on conditional pattern bases that are much smaller than the original data, as are the conditional FP-trees generated from them. Moreover, its operations consist of prefix-path count adjustments, counting, and pattern fragment growth, which are far less costly than candidate generation and testing.

External Links

Agrawal, R. & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. VLDB. http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
Han, J., Pei, J. & Yin, Y. (2000). Mining Frequent Patterns without Candidate Generation. SIGMOD. http://www.cs.sfu.ca/~jpei/publications/sigmod00.pdf
