The talk
presented Structural Risk Minimization (SRM): risk/error as a function
of model complexity, consisting of two components: modeling/fit error,
which decreases with model complexity, and a confidence-interval term,
which increases with model complexity. The total risk has a minimum, and
the model at that minimum is the best one.
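The trade-off can be sketched numerically. The curve shapes below are invented purely to illustrate the qualitative behavior described in the talk (falling fit error, rising confidence term, interior minimum); they are not KXEN's actual formulas:

```python
import numpy as np

# Illustrative SRM trade-off: fit error falls with model complexity h,
# the confidence-interval term rises, so the total bound has a minimum.
# Both curves are made-up shapes, chosen only to show the behavior.
h = np.arange(1, 51)            # model complexity (e.g. number of terms)
fit_error = 1.0 / h             # decreasing empirical/fit error
confidence = 0.02 * h           # increasing confidence-interval term
total_risk = fit_error + confidence

best_h = int(h[np.argmin(total_risk)])
print(best_h)                   # complexity with the minimal total risk
```

With these particular toy curves the minimum sits at an interior complexity, neither the simplest nor the most complex model, which is exactly the point of the risk decomposition.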
As the most popular implementation of SRM, the speaker pointed to Support
Vector Machines (SVM), which apply kernel operators to machine learning.
However, KXEN doesn't use kernel operators because the results obtained
are 'too difficult to explain to the clients'. Instead, the concept of
Vapnik-Chervonenkis (VC) dimension is used, but we didn't learn much
about it because it would take too long to explain during our meeting
(Ph.D. dissertations are still being written on it). Fortunately, there
is a simpler concept of 'fractal dimension', contributed by Chaos
Theory, that appears to have basic properties similar to those of VC
dimension.
Let us imagine the xy plane (a data set restricted to just two
columns/attributes) and the line x=y. If the data points are confined to
this line, they have fractal dimension 1. If the data points are
somewhat frizzed around it, the fractal dimension will be somewhat
higher than 1. If the data points are uniformly distributed over the xy
plane, the fractal dimension will be 2. Now suppose we add/reconsider
the z attribute/axis and find that the line is now given by x=y=z. Even
if the xy data was confined to the line x=y, there can be some frizzing
in the z direction, so the fractal dimension of the xyz data would be
higher than it was on the xy plane. Data with fractal dimension 1
require just one variable to describe/model, even though in our example
it will be neither x nor y (rather a coordinate along the line itself).
This example points to a practical use of fractal dimension: calculate
it including all variables existing in the data set, and the result is
an upper bound on the number of variables needed to construct the model.
What is left is to find the best set of variables. (Of course, there are
some pitfalls: the fractal dimension is local, i.e. in general it varies
from point to point; independently, the best variable set is also local
and may change even if the dimension doesn't (change much); and the
process is quite computation-intensive.)
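A crude box-counting estimate makes the idea concrete. The function below is my own sketch (not KXEN's algorithm): it counts occupied grid cells at several scales and takes the slope of log-count against log-scale. Points confined to the line x=y come out near dimension 1; points uniform on the square come out near 2:

```python
import numpy as np

def box_counting_dimension(points, scales=(4, 8, 16, 32, 64)):
    """Crude box-counting estimate of fractal dimension: count occupied
    grid cells at several scales, fit the slope of log N(s) vs log s.
    Illustration only; real estimators are more careful."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    pts = (points - lo) / (hi - lo + 1e-12)          # normalize to [0, 1]
    counts = []
    for s in scales:
        cells = np.unique(np.floor(pts * s).astype(int), axis=0)
        counts.append(len(cells))
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
t = rng.random(20000)
line = np.column_stack([t, t])           # data confined to x = y
plane = rng.random((20000, 2))           # data uniform on the square

dim_line = box_counting_dimension(line)
dim_plane = box_counting_dimension(plane)
print(round(dim_line, 1), round(dim_plane, 1))   # near 1 and near 2
```

The same routine run on all columns of a data set would give the kind of upper bound on the number of needed variables described above, with the same caveats (locality, computational cost).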
Other properties of the VC dimension and SRM we learned about are:
* It measures the complexity (number of variables?) of a set of mappings,
in a way employing the Ockham's razor principle (among models that fit,
prefer the simplest?). It can be linked to generalization results and
indicates generalization capacity.
* Additional variables improve modeling quality at low cost: no harm from
random or correlated variables, no overfitting, no need for exploratory
analysis.
* The modeling process can be automated, which is the primary
objective/benefit of SRM, rather than improved prediction /learning
/generalization. It also makes generalizations simple to manage.
* The SRM process is robust in several ways/aspects: regression (resistant
to outliers), statistical (free of distribution assumptions, not harmed
by skewed distributions), training (works with small training sets(?) and
missing values), engineering (gives an answer, doesn't crash), deployment
(not stymied by previously unseen values, tells you how good a model you
have, and the model degrades slowly so one can tell when it goes bad).
* It implements smart segmentation/classification of data, like a Neural
Network (NN). It employs binning (discretization of a continuous variable
into a finite set of intervals), e.g. in piecewise models.
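Binning itself is a one-liner. A minimal sketch (the values and edges below are my own toy choices) of discretizing a continuous variable into a finite set of intervals:

```python
import numpy as np

# Discretize a continuous variable into 4 equal-width bins over [0, 1].
# Values and bin edges are arbitrary toy choices for illustration.
x = np.array([0.05, 0.31, 0.47, 0.62, 0.88])
edges = np.linspace(0.0, 1.0, 5)        # edges of 4 equal-width bins
bins = np.digitize(x, edges[1:-1])      # bin index for each value
print(bins)
```

Each continuous value is replaced by the index of the interval it falls into, which is exactly what a piecewise model then conditions on.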
What is also good to remember is that:
* When you have a hammer, everything looks like a nail,
* Simplicity is hard. When the solution is simple, God is answering
(Albert Einstein),
* Estimating a distribution is harder than estimating a function.
The absence of details on the promised Support Vectors and VC dimension
was compensated by an example of a first-order (piecewise) regression of
a sine function on the binned (-Pi, Pi) interval. Overall, the talk shed
light on yet another aspect/technique of modeling, just when I thought
we had heard it all in our meetings.
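That closing example can be reconstructed in a few lines. This is my own hedged sketch of the idea, not the speaker's code; the bin count (8) and sampling density are my choices:

```python
import numpy as np

# Piecewise first-order regression of sin(x) over binned (-pi, pi):
# fit a separate least-squares line inside each bin, then stitch them.
# Bin count and sample size are arbitrary illustrative choices.
x = np.linspace(-np.pi, np.pi, 400)
y = np.sin(x)
edges = np.linspace(-np.pi, np.pi, 9)    # 8 equal-width bins
y_hat = np.empty_like(y)
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x <= hi)
    a, b = np.polyfit(x[mask], y[mask], 1)   # local line a*x + b
    y_hat[mask] = a * x[mask] + b

max_err = float(np.abs(y - y_hat).max())
print(round(max_err, 3))                 # max error of the piecewise fit
```

Even with only 8 bins the piecewise-linear model tracks the sine closely, which is presumably why it made a good demonstration of binning plus simple local models.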
