Association Rule Mining

Association Rule Mining [ Implementation using R here]

Association Rule mining is one of the classical DM technique. Association Rule mining is a very powerful technique of analysing / finding patterns in the data set. It is a supervised learning technique in the sense that we feed the Association Algorithm with a training data set( as called Experience E in machine learning context) to formulate hypothesis(H) . The input data to a association rule mining algorithm requires a format which will be detailed shortly.
Ok let me first introduce the readers with some of the application areas of this DM technique and motivation for the study of Association analysis. The classic application of the association rule mining is to analyse the Market Basket Data of a retail store. For example, Retail stores like Wal-Mart, Reliance fresh, big bazaar gather data about customer purchase behaviour and they have complete details of the goods purchased as part of a single bill. This is called Market basket data and its analysis is termed “market basket analysis”. It has been found that customers who buy diapers are more likely to buy beer. This is a pattern discovered by association analysis. Other applications include but not limited to scientific data analysis (earth science to study ocean, land and atm. Processes) and in the field of bioinformatics (genome sequence mining, etc.) Also it is used in document analysis for determining the words that often occur together and weblog mining temporal data for any pattern in online behaviour and website navigation. There are numerous other examples of association analysis which is only bounded by human imagination and capability.
Let’s start with Association mining with market basket data as the example. An itemset is the group of items. A k-itemset indicates the no. of items under study is K numbers. As part of a transaction (purchase by customer) one or more items from the itemset may be included. The occurrence/purchase of an item is indicated by a value 1 while non-inclusion is indicated by a value 0. Hence a typical market basket data like the one below:

	Book	Pen	Pencil	Eraser	Sharpener	Crayons	Maps	A4 sheets
T1	1	1	0	1	0	1	0	1
T2	0	1	0	1	0	1	0	0
T3	1	0	0	1	0	1	0	0

Hence it means
T1 – {Book,Pen,Eraser,Crayons}
T2 – {Pen,Eraser,Crayons}
T3 – {Book,Eraer,Crayons}
In the above example, we call it an 8-itemset.

If you see the above representation of market basket data, one may think there are few additional info which are missing like the quantity purchased, Amount involved in the transactions/purchase. Of course, the association analysis can be extended to involve such detail.
The application of Association rule mining algorithm results in the discovery of rules/patterns of the following form:
{Pencil} – > {Eraser}
{Book,Maps}- > {Pen}
What it simply says is “If a customer bought a pencil he is more likely to buy an eraser”. Mathematically, it says “Purchase of Pencil implies purchase of eraser”. Once a pattern is discovered, it can used/integrated into decision support system to form strategies based on the rule. In the above case, the company may use this rule to do cross-selling i.e place pencil and eraser as close to each other which increases the sales and hence profit. Just imagine, if a number of such strong rules are disovered in a jewellery shop it would result in tremendous value. Now let me answer what “strong” rule means?
Now that we have defined what a rule is, we are posed with two important questions. Are all rules discovered by my algorithm is really useful / meaningful? How confident i am about the rule? To answer these questions, we use some mathematical measure to quantify the usefulness and confidence.
Most common evaluation measures for a rule are support and confidence measures. There are other measures namely lift, interest factor, correlation. We will talk about it a bit later. A support measure answers the first question, the interestingness measure. It is represented in percentage. It defines how many of my transactions support this rule. If it is say 4/100 it means just 4 out of 100 transactions involve this rule, then probably this is uninteresting so we may choose to ignore it. Hence our Association rule mining algorithm sets some threshold/min value for the support to eliminate uninteresting rules and retain the interesting ones. An example of uninteresting rule could be {pen} – > {eraser}, this could be an uninteresting rule as pen and eraser might be purchased as a matter of chance, i.e it has lower support.
Now having answered the interestingness criteria, we are left with determining the confidence of the rule. A confidence measure quantifies the confidence as a ratio of no. of transaction holding this rule valid against the no. of transactions involving this rule. Higher the value, more reliable is the rule. A strong rule indicates a rule with higher confidence value.

Lets quickly jump into details of the algorithm. The Association rule mining is carried out using the famous Apriori Algorithm. We will also talk about the variations of this algorithm to apply it for continous data and hierarchial data. Before that, let’s formalize the definition of the association analysis problem:
“Given a set of transactions, the problem is to find all rules/patterns with support >= minsup and confidence >= minconf”

The Apriori Algorithm:
A brute force approach is very expensive task. Hence the approach followed by apriori algorithm is to break up the requirement of computing support and confidence as a two separate tasks. In the first step, frequent itemsets are generated i.e those itemsets which holds the criteria of minimum support. In the second and final step, Rule generation is made possible by evaluation the confidence measure. Let’s visualize the approach diagrammatically as shown below:

Apriori Algorithm

Measures could be classified into two categories – subjective and objective. A subjective measure often involves some heuristics and involves domain expertise to eliminate un interesting rules while objective measure are domain independent measures. Support and confidence are good examples of objective measures. Objective measures could be either symmetric binary or asymmetric binary. The choice of measure depends on the type of application and it must be carefully chosen to get quality results.
Simpson’s paradox states that there is a possibility of misinterpretation due to the hidden variable not as part of the analysis influencing the rules/patterns.

The apriori algorithm can be extended to solving various other problems by making little modifications to the data representation methods, Data structures and algorithm.

To handle categorital and continous data. For example gender is categorical attribute and can be represented using two items namely gender=’M’ and gender = ‘F’.
To handle concept of hierarchy in itemsets. For example, if IPod, Smartphone are two specific itemsets, then we can define a hierarchy item called electronic goods as a parent item.
To handle sequential pattern mining. Example: weblog mining, genome sequence mining, customer purchase behaviour.
Graph and sub-graph mining: eample – weblogs to identify navigation patterns, chemical structure analysis.
To identify infrequent patterns, negatively correlated patterns, etc.

You can also download a copy of the above material here: Association Rule Mining. Please feel free to comment on the topic. You can also subscribe to this blog by clicking on the subscribe button on the right side of the page.

10 comments

Analisa Pearle said: March 14, 20106:40 am

Heard about this site from my friend. He pointed me here and told me I’d find what I need. He was right! I got all the questions I had, answered. Didn’t even take long to find it. Love the fact that you made it so easy for people like me. More power

Reply ↓
Pingback: Market Basket Analysis/Association Rule Mining using R package – arules « Next generation BI
蛙鏡 said: November 15, 20107:41 pm

Nice post. All very good points. There are a few sites that I’ve registered at in order to comment, but they are few and far between.

Reply ↓
Jayakrishna said: November 26, 20104:58 pm

great pradeep! thanks a lot for all the infor! keep up the great work…it helps a lot of people like me!

Reply ↓
Raaj said: February 28, 20113:30 pm

Excellent Bro. I am trying to do a project for my friend using this concept.

Reply ↓
Kelli said: December 12, 20121:16 am

Hi all, here every one is sharing these kinds
of knowledge, so it’s fastidious to read this webpage, and I used to visit this webpage all the time.

Reply ↓
Swan-Iyah said: October 18, 201412:05 am

Hello, your post helped me for my assignment on ARM. I wanted to tell you that you missed A4 sheets for T1. There’s a 1 there 🙂
Cheers.

Reply ↓
- prdeepakbabu said: November 7, 20143:07 pm
  
  Hey ! Thats good to know. Wish you all the very best for your course.
  On the A4 sheets – Yup, you got it! My bad. will correct it. 🙂
  
  Reply ↓
ayane said: December 17, 20147:15 pm

Hey, great blogpost. Unfortunately the link to Association Rule Mining is not working anymore! Can upload it again please?
Thanks!

Reply ↓
- prdeepakbabu said: April 6, 201510:45 pm
  
  Hey Ayane. let me try to restore. Thanks for reporting.
  
  Reply ↓