INFORMS NY Meeting Notes


Notes from May 19, 2004

Market Basket Analysis
by Leonardo E. Auslender of SAS Institute, Inc. 


The talk featured Dr. Auslender's newest results in Association (Market Affinity) Analysis (AA) from the speaker's upcoming book.  As usual for this speaker, the talk was illustrated by an effusive 81-slide show.

Association rules (heuristic patterns) categorize customers by means of actionable profiles.  To reduce the typically large (exponential) number of combinations of binary attributes, various 'interestingness' functions are used.  The example shown used the function LIFT(A->B), defined as S(A->B) / (S(A) * S(B)), where S stands for the 'support' (relative frequency) of an event and '->' means 'causes' (or rather 'coincides with', since a causal relation should be antisymmetric).  LIFT measures how far two events are from being independent.  Values of LIFT > 1 make events A and B (and the rule A->B) 'interesting', that is, positively correlated.  Note that if LIFT(A->B) > 1 then LIFT(A->~B) < 1, which leads us to be even more selective and to set the interest threshold higher.  LIFT values prove to be more selective than regression coefficients, particularly when no reliable regression is available.
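The LIFT definition above can be tried out on a handful of baskets. The following is a minimal sketch, not the speaker's code: the baskets, the item names and the helper functions `support`, `lift` and `lift_not_b` are all hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical toy baskets; the item names are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "milk", "eggs"},
    {"eggs"},
]


def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))


def lift(a, b, transactions):
    """S(A and B) / (S(A) * S(B)); equals 1 when A and B are independent."""
    return support({a, b}, transactions) / (
        support({a}, transactions) * support({b}, transactions)
    )


def lift_not_b(a, b, transactions):
    """Lift of A with the complement of B, i.e. LIFT(A->~B)."""
    s_a = support({a}, transactions)
    s_not_b = 1 - support({b}, transactions)
    hits = sum(1 for t in transactions if a in t and b not in t)
    return Fraction(hits, len(transactions)) / (s_a * s_not_b)


print(lift("bread", "milk", baskets))        # 32/25
print(lift_not_b("bread", "milk", baskets))  # 8/15
```

With supports taken as plain frequencies, S(B) * LIFT(A->B) + S(~B) * LIFT(A->~B) = 1 whenever 0 < S(B) < 1, so a lift above 1 for A->B goes together with a lift below 1 for A->~B; in the toy data, 5/8 * 32/25 + 3/8 * 8/15 = 1.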

Association Analysis is also used for knowledge discovery (mining for nuggets of relevant information) and is useful for variable/rule selection when combined with an 'interestingness' function greater than 1, adequate support (frequency) and a high confidence percentage.  Market Basket Analysis typically prunes rules with support below 5%, which tends to remove uncorrelated or negatively correlated rules.
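One way to picture this kind of pruning is a filter over all two-item rules. The sketch below is an assumption-laden illustration, not the procedure from the talk: the function `mine_pair_rules`, the toy baskets and the 60% confidence threshold are hypothetical; only the 5% support cut and the lift > 1 criterion come from the notes.

```python
from itertools import permutations

# Hypothetical toy baskets; the item names are illustrative only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "milk", "eggs"},
    {"eggs"},
]


def mine_pair_rules(transactions, min_support=0.05, min_confidence=0.6,
                    min_lift=1.0):
    """Return rules A->B as (A, B, support, confidence, lift) tuples.

    A rule is kept only if its support reaches `min_support`, its
    confidence reaches `min_confidence` and its lift exceeds `min_lift`.
    The 0.6 confidence default is an assumed value, not one from the talk.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    count = {i: sum(1 for t in transactions if i in t) for i in items}
    rules = []
    for a, b in permutations(items, 2):
        both = sum(1 for t in transactions if a in t and b in t)
        support = both / n
        if support < min_support:
            continue  # pruned: the pair is too rare to act on
        confidence = both / count[a]  # estimate of P(B | A)
        lift = support / ((count[a] / n) * (count[b] / n))
        if confidence >= min_confidence and lift > min_lift:
            rules.append((a, b, support, confidence, lift))
    # Most 'interesting' rules first.
    return sorted(rules, key=lambda r: r[4], reverse=True)


for a, b, s, c, l in mine_pair_rules(baskets):
    print(f"{a} -> {b}: support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

On the toy data only bread -> milk, milk -> bread and butter -> bread survive; pairs such as milk and butter are dropped because their lift is below 1, and butter and eggs never occur together at all.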

Unfortunately, despite all that pruning, many of the remaining rules may be irrelevant or anecdotal.  Simpson's paradox casts even more doubt on the usefulness of the methodology: the direction of an association may be reversed when another factor is added to the analysis.  This was illustrated with an example of racial bias in death penalty sentencing (slide 60).  Overall, 11% of white defendants received the death penalty, compared with 7.9% of black defendants.  However, when the cases are broken down by the race of the victim, with white victims 11.3% of white defendants received the death penalty versus 22.9% of black defendants, and with black victims 0% of white defendants versus 2.8% of black defendants.
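The reversal can be checked with simple arithmetic. In the sketch below the case counts are an assumption: they were chosen so that the rates round to the percentages quoted above and are not taken from slide 60. The function name `death_rate` is likewise hypothetical.

```python
# (defendant race, victim race) -> (death sentences, total cases).
# ASSUMED counts, picked to reproduce the percentages quoted in the notes.
cases = {
    ("white", "white"): (53, 467),
    ("black", "white"): (11, 48),
    ("white", "black"): (0, 16),
    ("black", "black"): (4, 143),
}


def death_rate(defendant, victim=None):
    """Percent of cases ending in a death sentence.

    With `victim=None` the rate is pooled over both victim groups;
    otherwise it is restricted to cases with that victim race.
    """
    selected = [
        counts
        for (d, v), counts in cases.items()
        if d == defendant and (victim is None or v == victim)
    ]
    deaths = sum(c[0] for c in selected)
    total = sum(c[1] for c in selected)
    return 100.0 * deaths / total


for victim in (None, "white", "black"):
    label = "all victims" if victim is None else f"{victim} victims"
    print(label,
          round(death_rate("white", victim), 1),
          round(death_rate("black", victim), 1))
```

Under these assumed counts, almost all white defendants' cases involve white victims, the group in which death sentences are far more common, which is why the pooled rate for white defendants ends up higher even though it is lower within each victim group.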

Other methods mentioned were:
- Association tree - structure induced by descending Confidence values
- Association Chi-Square - items deemed dependent by a Chi-Square test are merged into composite items (clustering), and the process continues until no more composite items can be formed
- Terse Representations of AA - uses a regression-type representation of items
- Bayes Nets - classification rules on items/variables linked by probability 'statements'
- Log-linear model to summarize the results (?)
- Tree modeling - un-interpretable and non-intuitive
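The dependence check behind the Association Chi-Square item can be sketched for a single pair of items. This is only the standard Pearson chi-square test on a 2x2 table; the merging loop that builds composite items is not shown because the notes do not specify it. The function names and the example counts are hypothetical.

```python
# Chi-square critical value for 1 degree of freedom at the 5% level.
CRITICAL_5PCT_1DF = 3.841


def chi_square_2x2(both, a_only, b_only, neither):
    """Pearson chi-square statistic for a 2x2 table of basket counts.

    both    : baskets containing A and B
    a_only  : baskets containing A but not B
    b_only  : baskets containing B but not A
    neither : baskets containing neither item
    """
    n = both + a_only + b_only + neither
    row_a, row_not_a = both + a_only, b_only + neither
    col_b, col_not_b = both + b_only, a_only + neither
    denom = row_a * row_not_a * col_b * col_not_b
    if denom == 0:
        return 0.0  # an empty margin: the test is undefined, report no evidence
    return n * (both * neither - a_only * b_only) ** 2 / denom


def looks_dependent(both, a_only, b_only, neither):
    """True if independence is rejected at the 5% level."""
    return chi_square_2x2(both, a_only, b_only, neither) > CRITICAL_5PCT_1DF


print(chi_square_2x2(40, 10, 10, 40))  # 36.0
print(looks_dependent(25, 25, 25, 25))  # False
```

A pair flagged by `looks_dependent` would be a candidate for merging into a composite item; with very small cell counts the chi-square approximation is unreliable, so such pairs deserve extra caution.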

An interesting method presented for visualizing correspondence/association data is the link graph (Giudici and Passerone), which graphically depicts the coincidence/causality of events.  By varying the threshold on coincidence/odds ratio levels, we can obtain event/transaction clusterings of varying importance.  Promotional variables are added to the clusters to check whether promotions affect sales.
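The effect of moving the odds ratio threshold can be illustrated with a small graph. The sketch below is not the Giudici and Passerone procedure itself: it simply links two items when their odds ratio reaches a threshold and reports the connected components as clusters. The baskets, the function names and the 0.5 added to every cell (a common continuity correction that avoids division by zero) are assumptions.

```python
from itertools import combinations

# Hypothetical toy baskets; the item names are illustrative only.
baskets = [
    {"chips", "salsa"},
    {"chips", "salsa"},
    {"chips", "salsa", "beer"},
    {"chips", "beer"},
    {"pasta", "sauce"},
    {"pasta", "sauce"},
    {"pasta", "sauce"},
    {"pasta", "sauce", "beer"},
    {"beer"},
    {"chips", "pasta"},
]


def odds_ratio(a, b, transactions):
    """Odds ratio of items a and b, with 0.5 added to each cell (assumed)."""
    n11 = n10 = n01 = n00 = 0
    for t in transactions:
        in_a, in_b = a in t, b in t
        if in_a and in_b:
            n11 += 1
        elif in_a:
            n10 += 1
        elif in_b:
            n01 += 1
        else:
            n00 += 1
    return ((n11 + 0.5) * (n00 + 0.5)) / ((n10 + 0.5) * (n01 + 0.5))


def link_clusters(transactions, threshold):
    """Connected components of the graph whose edges have odds ratio >= threshold."""
    items = sorted(set().union(*transactions))
    neighbours = {i: set() for i in items}
    for a, b in combinations(items, 2):
        if odds_ratio(a, b, transactions) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in items:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over linked items
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


for threshold in (0.9, 2.0, 20.0):
    print(threshold, link_clusters(baskets, threshold))
```

Raising the threshold removes weaker links first, so clusters split into smaller and more strongly associated groups; in the toy data beer drops out of the chips and salsa cluster at 2.0, and only pasta and sauce stay linked at 20.0.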

For all of its shortcomings and paradoxes, AA is expected to help with many hard practical problems, such as:
click stream analysis, creating profiles for fraud detection and police work, identification of party voting affinity, book purchase associations (Amazon, Vignette), e-mail link analysis and web search, printing/selection of point-of-sale retail coupons, cross-selling opportunities, and sequencing of promotional offers.

Fortunately, nobody said statistics and data analysis were easy (the speaker said several times that there is no rose garden), or we wouldn't expect to earn big bucks doing it.

The talk slides are available under 'Practice'