Feed-Forward Neural Networks (FFNNs) are universal approximators and can be adapted to solve any representable problem (whether predictive modelling or classification) to an arbitrary degree of accuracy, limited in practice by the amount and quality of training data. One problem in using FFNNs is overspecialization: the network memorizes a limited set of training data and then performs poorly on non-training data. The time and computational effort needed to produce a well-trained FFNN can also be significant and can seem arbitrary. In addition, the architecture and size of the network to use is often not obvious. All of these problems can be addressed using algorithmic-information-theoretic notions pioneered by Kolmogorov, together with the realization that any one FFNN can be well trained with little effort but will always carry an individual bias from its architecture, size, and initialization prior to training. Fortunately, one can quickly create a very robust (well-generalizing) collection of FFNNs of the appropriate size and architecture using an algorithmic information measurement called the Minimum Description Length (MDL). I will give you the process, then define the various parts, such as MDL and the training methodologies.
This is a very quick process. The least-squares computation is the most intensive step, and even that is lightweight and requires no iteration. Although each individual FFNN may be biased as a result of its size, architecture, and initialization, the committees (pools) formed from them are robust to these biases; the resulting collections of FFNNs are effectively unbiased.
Nguyen-Widrow initialization of the hidden layer weights/parameters spreads the hidden nodes over the input data space (or, for deeper layers, the output space of the previous layer) so that the whole space is uniformly covered. In general, that is the best you can do without using the training data. Don’t do additional training or adjustment of the hidden layer, since that only causes overspecialization and makes the resulting network perform poorly on non-training data later.
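Here is a minimal sketch of that initialization in Python/NumPy, assuming a single tanh hidden layer and inputs already scaled to the range [-1, 1]; the function name and signature are mine, not taken from any particular library.

```python
import numpy as np

def nguyen_widrow_init(n_in, n_hidden, rng=None):
    """Nguyen-Widrow initialization for one tanh hidden layer.

    Assumes the inputs have been scaled to [-1, 1].
    Returns (W, b): W has shape (n_hidden, n_in), b has shape (n_hidden,).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Scale factor that spreads the hidden units over the input space.
    beta = 0.7 * n_hidden ** (1.0 / n_in)
    # Start from small random weights, then rescale each unit's weight
    # vector to have norm beta.
    W = rng.uniform(-0.5, 0.5, size=(n_hidden, n_in))
    W *= beta / np.linalg.norm(W, axis=1, keepdims=True)
    # Spread the biases uniformly in [-beta, beta] so the active regions
    # of the units tile the input space rather than pile up at the origin.
    b = rng.uniform(-beta, beta, size=n_hidden)
    return W, b
```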
To set the parameters/weight values of the output layer, apply the training data to the hidden layer to get its outputs. Then find the weights of a weighted sum of those hidden-node outputs that minimize the least-mean-square (least-squares) error against the training target values. Those minimizing weights are the output-layer weights/parameters.
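A minimal sketch of that least-squares fit, continuing from the tanh hidden layer of the previous sketch; the name fit_output_layer and the bias-column handling are my assumptions.

```python
import numpy as np

def fit_output_layer(W, b, X, Y):
    """Solve for the output-layer weights by linear least squares.

    W, b -- hidden-layer weights/biases (e.g. from nguyen_widrow_init)
    X    -- training inputs, shape (n_samples, n_in)
    Y    -- training targets, shape (n_samples, n_out)
    Returns V with shape (n_hidden + 1, n_out); the extra row is an
    output bias, handled by appending a constant 1 to the hidden outputs.
    """
    H = np.tanh(X @ W.T + b)                       # hidden-layer outputs
    H1 = np.hstack([H, np.ones((H.shape[0], 1))])  # add bias column
    V, *_ = np.linalg.lstsq(H1, Y, rcond=None)     # minimizes ||H1 @ V - Y||^2
    return V
```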
The Minimum Description Length (MDL) criterion is an algorithmic information-theoretic measure (from Kolmogorov information theory) of the amount of information (in bits) used in the algorithm represented by an FFNN. It is a weighted sum of two parts: 1) the information in the residual error that remains after using the FFNN; and 2) the information in the algorithm itself, i.e., the bits in the parameters/weights of the FFNN.
MDL = w * BRe + (1 - w) * BPnn, where w ≈ 0.5 but may be varied from 0.0 to 1.0
BRe = (number of samples) * log2(1 + mean squared error)
BPnn = (number of parameters) * log2(1 + mean squared parameter value)
Note that small parameters contribute little to network performance; that is the reason to use the mean squared parameter value.
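The three formulas above translate directly into a small function; this is a sketch only, and the name mdl and the argument layout are mine.

```python
import numpy as np

def mdl(errors, params, w=0.5):
    """Minimum Description Length of one trained FFNN, in bits.

    errors -- residual errors on the training data (targets minus predictions)
    params -- all weights/parameters of the network, flattened together
    w      -- trade-off weight, nominally 0.5
    """
    errors = np.asarray(errors).ravel()
    params = np.asarray(params).ravel()
    b_re = errors.size * np.log2(1.0 + np.mean(errors ** 2))    # bits in the residual error
    b_pnn = params.size * np.log2(1.0 + np.mean(params ** 2))   # bits in the parameters
    return w * b_re + (1.0 - w) * b_pnn
```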
Citations are available on request.
In short, you create a small set of FFNNs of the ideal size, to avoid overspecialization, and then combine their outputs as an average. Each FFNN is trained quickly on the available training data by using the Nguyen-Widrow initialization to set the hidden-layer weights and least squares to set the output-layer weights. The optimal size of the FFNN’s hidden layer(s) is found by a search that uses the calculated MDL as the fitness criterion to be minimized. This method is quick and robust against overspecialization.
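Putting the pieces together, here is a hypothetical end-to-end sketch that reuses the nguyen_widrow_init, fit_output_layer, and mdl sketches above; the search range, the committee size of ten, and the function names are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def train_committee(X, Y, sizes=range(2, 21), n_members=10, w=0.5, rng=None):
    """Pick the hidden-layer size by minimizing MDL, then build a committee."""
    rng = np.random.default_rng() if rng is None else rng

    def build(n_hidden):
        W, b = nguyen_widrow_init(X.shape[1], n_hidden, rng)
        V = fit_output_layer(W, b, X, Y)
        return W, b, V

    def predict(net, X_in):
        W, b, V = net
        H = np.tanh(X_in @ W.T + b)
        return np.hstack([H, np.ones((H.shape[0], 1))]) @ V

    # 1) Search for the hidden-layer size that minimizes MDL.
    #    (One could average the MDL over several initializations per size
    #    to reduce the noise from the random Nguyen-Widrow draw.)
    def score(n_hidden):
        net = build(n_hidden)
        resid = Y - predict(net, X)
        params = np.concatenate([p.ravel() for p in net])
        return mdl(resid, params, w)
    best_size = min(sizes, key=score)

    # 2) Build a committee (pool) of networks of that size; each member
    #    differs only in its random Nguyen-Widrow initialization.
    committee = [build(best_size) for _ in range(n_members)]

    # 3) The committee's prediction is the average of the members' outputs.
    def committee_predict(X_new):
        return np.mean([predict(net, X_new) for net in committee], axis=0)

    return committee_predict, best_size
```

With training arrays X_train and Y_train, usage would look like: predict_fn, size = train_committee(X_train, Y_train), followed by predict_fn(X_new) for new inputs.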
For further information, feedback, or questions, please contact the author at jak@KassebaumEngineering.com.