INFORMS NY Meeting Notes



Notes from March 17, 2004

A Bayesian Approach to Database Integration
by Dr. Jim Arvesen,
Strategic Solutions and Services


The talk focused on methods for analyzing and integrating correlated data coming from different sources, and on WinBUGS, a scripted freeware program implementing them.  Much of the WinBUGS software was demonstrated at the Deming Conference in Atlantic City last December.

It has been known since James and Stein's discovery over fifty years ago that, in three or more dimensions, the sample mean is inadmissible as an estimator of the mean of a multivariate normal distribution.  (This may have to do with the joint behavior of Gaussian variables.)  In 1975 Efron and Morris showed by example the power of this result.  However, only recently has the available computing power 'caught up' enough for practical use of such calculation-intensive alternatives, like non-parametric methods.
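The James-Stein result can be sketched numerically.  Below is a minimal simulation (the data and dimensions are made up for illustration, not from the talk): for a 10-dimensional normal mean, the shrinkage estimator has lower average squared error than the raw observation, the one-sample analogue of the sample mean.

```python
import random

def james_stein(x):
    """Shrink a p-dimensional observation toward the origin (requires p >= 3)."""
    p = len(x)
    s = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (p - 2) / s)  # positive-part variant of the JS factor
    return [shrink * v for v in x]

def sq_err(est, theta):
    return sum((e - t) ** 2 for e, t in zip(est, theta))

random.seed(42)
p, trials = 10, 2000
theta = [0.5] * p  # true means, deliberately close to the shrinkage target
mle_err = js_err = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]  # one observation per trial
    mle_err += sq_err(x, theta)
    js_err += sq_err(james_stein(x), theta)
# The James-Stein average error should come out lower than the raw estimator's.
print(round(mle_err / trials, 2), round(js_err / trials, 2))
```

The striking part, and the reason the result was a "discovery", is that the shrinkage helps simultaneously for all coordinates, even if they are unrelated quantities.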

The study presented at the meeting utilized several data series (spatial/ZIP codes, household, product/medicine shipments, temporal, …) coming from various sources.  Using diverse data sources improves coverage, insight, and usage tracking; however, partial pictures of the market are still likely to be incomplete.  This raises the need to integrate/merge the data sources (a database approach would be to treat the data sources as individual tables and join them into a single table on attributes defining market segments).
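The join-on-market-segment idea can be sketched in a few lines.  The records below are hypothetical toy data (real sources would be full tables); an inner join on ZIP code produces the integrated "single table", and the unmatched keys expose exactly the incomplete coverage mentioned above.

```python
# Hypothetical toy records, keyed on ZIP code (a market-segment attribute).
shipments = {"07030": 120, "10001": 340}                     # product shipments
households = {"07030": 5200, "10001": 9100, "19103": 4000}   # household counts

# Inner join on ZIP: the integrated table contains only keys both sources cover.
merged = {z: (shipments[z], households[z])
          for z in shipments.keys() & households.keys()}

# ZIPs present in one source but not the other show where the picture is partial.
unmatched = households.keys() - shipments.keys()

print(sorted(merged.items()))  # [('07030', (120, 5200)), ('10001', (340, 9100))]
print(sorted(unmatched))       # ['19103']
```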

Missing data are filled in (and outliers filtered out) using maximum likelihood estimates (e.g., binomial) or using patterns (e.g., Bayesian/conditional-probability based).  Sample patterns: 1) tall parents have tall children; 2) children are 5% shorter/5% taller (depending on the country) than their parents.  Patterns convey more information than a single maximum likelihood mean (besides avoiding the problems with three or more variables).
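The contrast between the two fill-in strategies can be shown on a binomial rate.  This is a generic sketch (the numbers and the Beta(2, 2) prior are illustrative assumptions, not from the talk): the ML estimate uses the data alone, while the Bayesian posterior mean blends the data with a prior "pattern".

```python
def ml_estimate(successes, trials):
    """Maximum likelihood estimate of a binomial success rate."""
    return successes / trials

def bayes_estimate(successes, trials, a=2.0, b=2.0):
    """Posterior mean under a Beta(a, b) prior on the rate.

    The prior encodes a pattern (prior belief); the posterior mean
    shrinks the ML estimate toward the prior mean a / (a + b).
    """
    return (successes + a) / (trials + a + b)

print(ml_estimate(3, 10))              # 0.3
print(round(bayes_estimate(3, 10), 3)) # 0.357, pulled toward the prior mean 0.5
```

With sparse data the prior dominates; with lots of data the two estimates converge, which is why the pattern-based fill-in matters most exactly where the data are missing or thin.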

There are some 36,875 ZIP codes in the USA; about 5,000 of them matter (in terms of drug shipments).  In order to predict next-period sales, Bayesian smoothing was applied, which included a 'ZIP code effect/contribution' and the number of neighbors, in addition to more usual attributes.
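The flavor of such smoothing can be sketched as a shrinkage of each ZIP code's raw figure toward the mean of its neighbors.  The ZIP codes, sales figures, neighbor lists, and the 0.5 weight below are all made-up illustrations; the model from the talk also included the usual covariates.

```python
# Toy per-ZIP sales and a toy spatial neighborhood structure.
sales = {"07030": 12.0, "07302": 8.0, "07310": 30.0}
neighbors = {"07030": ["07302", "07310"],
             "07302": ["07030"],
             "07310": ["07030"]}

def smooth(zip_code, weight=0.5):
    """Pull a ZIP's raw sales toward the mean of its neighbors.

    weight controls how much the 'ZIP code effect' is trusted versus
    the neighborhood; a Bayesian model would estimate this from data.
    """
    nb = neighbors[zip_code]
    nb_mean = sum(sales[z] for z in nb) / len(nb)
    return weight * sales[zip_code] + (1 - weight) * nb_mean

print(smooth("07310"))  # 21.0: the raw 30.0 is pulled toward neighbor 07030's 12.0
```

The effect is that an isolated extreme ZIP, which is often just noise, gets moderated by its neighborhood, stabilizing the forecast.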

The forecast sales were color-coded and visualized on a map of the sales territory.  The Bayesian approach's results 'looked better' than a simple projection by quintiles.
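For reference, the quintile projection it was compared against is just a rank-and-cut: sort the values and split them into five equal groups, each of which would get one map color.  The figures below are arbitrary illustrations.

```python
# Toy forecast values, one per territory.
forecasts = [5, 42, 17, 8, 99, 23, 61, 3, 14, 37]

# Indices of the forecasts in ascending order of value.
ranked = sorted(range(len(forecasts)), key=lambda i: forecasts[i])

# Assign quintile 1 (lowest fifth) .. 5 (highest fifth) by rank.
quintile = [0] * len(forecasts)
for rank, i in enumerate(ranked):
    quintile[i] = rank * 5 // len(forecasts) + 1

print(quintile)  # [1, 4, 3, 2, 5, 3, 5, 1, 2, 4]
```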

The talk ended with a presentation of WinBUGS, 'a flexible program for Bayesian analysis using MCMC (Markov chain Monte Carlo) methods'.  It supports a wide range of models: linear, nonlinear, random/stochastic, generalized linear mixed models, ranking, Bayesian inference, Monte Carlo simulation, and conjugate analysis, where the prior and posterior distributions are of a similar type, 'in the same family'.  (Incidentally, the exact form of the prior distribution assumed in Bayesian inference is observed to matter little to the posterior distribution of the calculated results.)  WinBUGS has a powerful modeling language for defining models and operations on them.  More information about WinBUGS is available on the web via Google.
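The MCMC machinery that WinBUGS automates can be illustrated with a hand-rolled Metropolis sampler.  This is a generic textbook sketch, not WinBUGS code: the target is the posterior of a binomial success rate with a uniform prior after 7 successes in 10 trials, which is exactly a Beta(8, 4) posterior, so the sampler's answer can be checked against the known mean 8/12.

```python
import math
import random

def log_post(p, successes=7, trials=10):
    """Log posterior of a binomial rate under a uniform prior (unnormalized)."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

random.seed(0)
p, samples = 0.5, []
for step in range(20000):
    prop = p + random.gauss(0.0, 0.1)          # random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(p):
        p = prop                               # Metropolis accept/reject
    if step >= 2000:                           # discard burn-in
        samples.append(p)

post_mean = sum(samples) / len(samples)
print(round(post_mean, 2))  # close to the exact Beta(8, 4) mean 8/12, about 0.667
```

The appeal of WinBUGS, per the talk, is that the user only writes the model in its modeling language; constructing and tuning a sampler like this is done for you.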

Finally, the talk reminded us of the role the availability of number-crunching power plays in what is practically computable.  There have been two milestones so far in this respect: the appearance of computers during WWII, and the PC revolution (an explosion next to the Big Bang in magnitude).  I don't remember how statistics was done Before Computers.  It could resemble the old, pre-PC textbooks I read: three to five columns of data, up to three dozen rows, mean, standard deviation, ranking methods, closed-form distribution formulas, and elegant theorems.  By contrast, we now deal with terabyte databases, tens of thousands of variables, time-consuming non-parametric heuristics, and virtually no means of validating the results.