|Have news to share? Email your news announcement or press release to the webmaster.|
|Notes from November 12, 2003|
A Genetic Jacknife
Method: 3-in-1 Tool for Variable Selection
Dr. Bruce Ratner, Senior VP, President of DM STAT-1 Consulting.
| The talk presented a compelling
introduction to Genetic Programming (GP) on example of author's own
In one process GenIQ selects the variables, functional form of the model (using Data Mining) and fits the model parameters/coefficients (builds the model). While traditional modeling fits coefficients to a pre-conceived functional model, the GP (randomly) searches also for a best expression/structure - tree build out of functions, variables and constants. It does it by maintaining a population of models (expression trees) and randomly subjecting it to an iterative process mimicking natural evolution (reproduction, mutation, survival of the fittest). Reproduction (mating) in GenIQ involves exchange of sub-trees ('genes') between parents; mutation introduces random changes to individual models. Constants are mutated randomly or re-regressed(?) when the model structure changes. Fittest (in terms of approaching the goal - predicting response) models are rewarded by increased chance of participation in the next round.
GenIQ supports quite rich function/operator set (with representatives of arithmetic, mathematical, trigonometric, Boolean/logic, conditional if-then-else) what allows it to produce quite non-trivial models (with procedural components in them).
Iteration in GenIQ process ranks the model population according to a measure of fitness (fitness value), e.g. R square of response variability explained. The fitness values of all models get normalized and are used as probabilities of each model getting randomly drawn and subjected to the next round. The size of population (the number of drawings in each round) and probabilities of mutation and mating/crossover are selectable by user.
The actual fitness value used by GenIQ is proprietary, (any measure of closeness/similarity between actual response and model's output will work), but we learned it seeks to first maximize response (improvement) in the upper deciles (or another selectable percentile sizes) of most fit models (in order to speed up the improvement of models?).
Data mining delivers data to evaluate and rank models. Models themselves are exported in GenIQ into an interpretable textual form that drive evaluation. Variables selected are those that participate in the fittest model.
GP has some appealing features: it is automatic, deals with large data sets and seems to uncover insights into structure of the data. On the other hand the process is iterative, solutions sub-optimal and many very different looking models may produce very similar responses. Also not every body is comfortable with that much randomness. The same model can be generated over and over and there is no guarantee that the optimal model is produced in any amount of iterations. Unfortunately, a systematic enumeration of all (non trivial) models, even only those with up to, say two operators and all combinations of variables and constants may produce many more models than GenIQ uses as an initial population. There is also a standard question of how/if GP can deal with locally best models ('multi-labeled correlations' from September 03 talk). I'm sure this is not the last we heard about this interesting and useful technique.