# ANNiML

Artificial Neural Networks in ML

David Gianazza -- DSNA/DTI R&D -- POM team

## Presentation

ANNiML is a neural network implementation written in Objective Caml.

The code is mainly inspired by the following book [Bishop96]: "Neural Networks for Pattern Recognition", Christopher M. Bishop, Oxford University Press, 1996. ISBN 0198538642.

ANNiML can be used for regression (interpolation of an unknown function) or classification (assigning an input vector to a class). The network is trained on patterns provided as input to the program.

A neural network can be seen as a function `F` of an input vector `x`. It computes an output `y = F(x,w_0,w_1,w_2,...,w_p)`, where the `w_i` are the weights assigned to the connections between the units. The biases of the network are treated as additional connections from bias units with a constant output value `1`.
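
The following minimal Python sketch illustrates this view of a network as a function of `x` and the weights, for a single hidden layer. The function names and the weight layout (one row of weights per unit, with the bias weight stored last) are assumptions made for the illustration; they do not describe how ANNiML stores its weights.

```python
import math

def logistic(a):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hidden, w_out):
    """Compute y = F(x, w) for a network with one hidden layer.

    w_hidden[j] holds the weights into hidden unit j and w_out[k] the
    weights into output unit k. The last entry of each row is the weight
    of the connection from the bias unit, whose output is always 1.
    (Hypothetical layout, chosen for this sketch only.)
    """
    x_ext = list(x) + [1.0]  # append the bias unit's constant output
    hidden = [logistic(sum(w * v for w, v in zip(row, x_ext)))
              for row in w_hidden]
    h_ext = hidden + [1.0]   # bias unit feeding the output layer
    # Identity output function, as used for regression.
    return [sum(w * v for w, v in zip(row, h_ext)) for row in w_out]

# 1 input, 2 hidden units, 1 output; every weight is zero except the output bias.
y = forward([0.7], [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0, 0.5]])
```

With all connection weights at zero, the output reduces to the output bias weight, which shows how a bias acts as one more weight on a constant input.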

The weights must be tuned by training the neural network on a set of patterns `(x,t)`, where `t` is the target vector, so as to minimize an error depending on the difference between the output `y` and the target `t`.
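
As an illustration of such an error, here is a small Python sketch of a sum-of-squares error over a set of patterns. The factor 1/2 follows the usual textbook convention; whether ANNiML scales its error the same way is an assumption, and the function name is hypothetical.

```python
def sum_of_squares(outputs, targets):
    """Half the sum of squared differences between outputs y and targets t.

    `outputs` and `targets` are lists of vectors, one per pattern.
    The 1/2 factor is a common convention, assumed here.
    """
    return 0.5 * sum((y - t) ** 2
                     for y_vec, t_vec in zip(outputs, targets)
                     for y, t in zip(y_vec, t_vec))

error = sum_of_squares([[1.0, 2.0]], [[0.0, 0.0]])
```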

To do that, we need to

1. compute the gradient of the error (its partial derivatives with respect to the weights), when the minimization method uses this gradient,
2. find the weight vector `w*` that minimizes the error on the training set.

Currently, the error gradient is computed by backpropagation of the error through the network, and the following optimization methods can be used:

• simple gradient descent with constant step `eta`,
• gradient descent with step `eta` and momentum `mu`,
• BFGS quasi-Newton minimization.

An important issue in neural networks is overfitting, where the network fits the training data perfectly but is unable to generalize to fresh inputs. One should be careful to have a sufficiently large training set (much larger than the number of weights), and to choose the most adequate training method. When using gradient methods, the one with a momentum term should perform better in avoiding overfitting problems. BFGS with weight decay (or another kind of regularization) would be better than plain BFGS, but weight decay has not been implemented yet.
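
To make the two gradient methods concrete, here is a minimal Python sketch of the textbook update rule for gradient descent with momentum, applied to a toy quadratic error. The exact update used inside ANNiML is not given in this page, so the formula and the name `momentum_step` are assumptions; only the default values of `eta` and `mu` are taken from the options described below.

```python
def momentum_step(w, grad, velocity, eta=0.2, mu=0.1):
    """One update: v <- mu * v - eta * grad, then w <- w + v.

    With mu = 0 this reduces to simple gradient descent with constant step eta.
    """
    new_v = [mu * v - eta * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v

# Toy error E(w) = sum(w_i^2), whose gradient is 2 * w (illustration only).
w, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(200):
    grad = [2.0 * wi for wi in w]
    w, v = momentum_step(w, grad, v)
```

On this toy problem the weights converge to the minimum at the origin; on a real network the gradient would come from backpropagation instead of a closed-form expression.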

The training methods iteratively search the weight space and stop when the error is small enough (below an absolute tolerance), when it can no longer be improved (relative tolerance), or when a maximum number of iterations is reached. For minimization methods with a constant step (gradient descent), it may be better not to use the relative tolerance (set `reltol` to 0, for example with `-reltol 0.`), as a step may accidentally lead to a point where the relative improvement of the objective function is less than `reltol`, stopping the training too early.
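
The three stopping conditions can be sketched as follows in Python. This is an illustration under assumptions: the precise form of the relative test in ANNiML is not documented here, so the comparison below (improvement relative to the previous error) and the name `should_stop` are hypothetical.

```python
def should_stop(err, prev_err, iteration,
                abstol=1e-7, reltol=1e-7, max_iter=1000):
    """Return the name of the triggered criterion, or None to keep training."""
    if err < abstol:
        return "abstol"      # error is small enough
    if iteration >= max_iter:
        return "max"         # iteration budget exhausted
    # Assumed form of the relative test; the strict inequality means
    # that reltol = 0 disables this criterion entirely.
    if prev_err is not None and abs(prev_err - err) < reltol * abs(prev_err):
        return "reltol"      # error no longer improves
    return None
```

With `reltol=0` the third test can never fire, which matches the advice above for constant-step methods.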

The stopping criterion is also an important issue in avoiding overfitting problems: the training should stop early enough to avoid overfitting the training data, to the detriment of the generalization to fresh input data.

As only local optimization methods are used, the results highly depend on the initial weights, randomly chosen before the training starts. Several runs with different initial weights should be performed.
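
The effect of the initial weights can be seen on a toy problem. The Python sketch below runs plain gradient descent on a one-weight error function with two local minima, starting from several random seeds, and keeps the best run. The error function, the step size, and all names are invented for the illustration and are unrelated to ANNiML's code.

```python
import random

def error(w):
    """Toy one-weight error with two local minima (illustration only)."""
    return (w * w - 1.0) ** 2 + 0.3 * w

def error_gradient(w):
    return 4.0 * w * (w * w - 1.0) + 0.3

def train(seed, eta=0.05, max_iter=500):
    """Gradient descent from a random initial weight drawn with `seed`."""
    w = random.Random(seed).uniform(-2.0, 2.0)
    for _ in range(max_iter):
        w -= eta * error_gradient(w)
    return error(w), w

# Several runs with different initial weights; keep the lowest final error.
runs = [train(seed) for seed in range(5)]
best_error, best_w = min(runs)
```

Each run ends in whichever local minimum is closest to its starting point, so only the comparison across runs reveals the better one.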

## Availability

ANNiML is not in the public domain yet. The source code is under CVS.

## Installation

Retrieve the MATH and ANNiML modules under CVS:

```
> cvs checkout MATH
> cvs checkout ANNiML
```

Compile the MATH libraries:

```
> cd MATH
> make
```

Go to the ANNiML directory and compile:

```
> cd ../ANNiML
> make
```

## Usage

`anniml [layers] -train <fpatterns> [other_options]`
or
`anniml -run <input_vector> [other_options]`
or
`anniml -test <fpatterns> [other_options]`
or
`anniml -predict <finputs> [other_options]`

## Options

### Main options

` -train <fpatterns> ` train the network on a patterns file.

` -test <fpatterns> ` test the trained network on a patterns file.

` -run <input_vector> ` run the trained network on a new input vector passed as argument.

` -predict <finputs> ` make predictions on a new input data file. If the `-p` option is not used, results are saved by default in a new file with a `.pred` extension.

### Options for filenames

` -dir <dirname> ` working directory.

` -n <net_file> ` file describing the network's topology. Default is the patterns file basename with a '.net' extension.

` -w <wts_file> ` weights file. Default is the patterns file basename with a '.wts' extension.

` -p <fpredict> ` save the results of the `-predict` option in the given file instead of the default one (the input file's basename with a `.pred` extension).

### Select the network's functions

` -out <string> ` output function name: `ident`, `logistic`, `tanh`, or `softmax`.

` -act <string> ` activation function name: `ident`, `logistic`, or `tanh`.

` -err <string> ` error function name: `q` (quadratic, i.e. sum of squares) or `ce` (cross-entropy).

### Options for patterns

` -c <columns_file> ` file listing the columns to select in the patterns file.

` -norm ` normalize the patterns: `x_i := (x_i - avg(x_i)) / sigma(x_i)`.
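
For reference, the normalization formula can be sketched in Python as below. This is not ANNiML's implementation: the function name is hypothetical, and the use of the population standard deviation for `sigma` (rather than the sample one) is an assumption.

```python
import statistics

def normalize_columns(rows):
    """Apply x_i := (x_i - avg(x_i)) / sigma(x_i) to each column of `rows`."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    # Population standard deviation: an assumption for this sketch.
    sigmas = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s if s > 0 else 0.0   # constant columns map to 0
             for v, m, s in zip(row, means, sigmas)]
            for row in rows]

inputs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = normalize_columns(inputs)
```

After normalization, each column has zero mean and unit standard deviation, so inputs with very different scales contribute comparably.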

### Stop criteria

` -max <int> ` maximum number of iterations.

` -abstol <float> ` absolute tolerance used in the stop condition (default 1E-07).

` -reltol <float> ` relative tolerance used in the stop condition (default 1E-07).

### Choose the training method

` -bfgs ` BFGS quasi-Newton optimization.

` -g <float> ` basic gradient descent with constant step `eta` (default 0.20).

` -gm <float> <float> ` gradient descent with constant step `eta` (default 0.20) and momentum `mu` (taken between 0 and 1, default 0.10).

` -batch ` batch learning for gradient descent (chunk size = nb_patterns).

` -online ` on-line learning for gradient descent (chunk size = 1).

` -chunk <int> ` size of the patterns blocks used to update the gradient. Possible options for gradient descent methods are: `-batch`, `-online`, or `-chunk <int>`.
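
The three modes differ only in how many patterns are accumulated before each weight update. The Python sketch below shows the corresponding partition of a pattern list; the helper name is hypothetical, and the handling of a final, shorter block is an assumption, since this page does not say what ANNiML does in that case.

```python
def chunks(patterns, chunk_size):
    """Yield successive blocks of `chunk_size` patterns.

    The last block may be shorter (assumed behaviour for this sketch).
    """
    for start in range(0, len(patterns), chunk_size):
        yield patterns[start:start + chunk_size]

patterns = list(range(10))                      # stand-ins for 10 patterns
batch = list(chunks(patterns, len(patterns)))   # like -batch: one block
online = list(chunks(patterns, 1))              # like -online: one pattern per block
blocks = list(chunks(patterns, 4))              # like -chunk 4
```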

### Printing options

` -fv ` frequency of the verbose output on stdout (default 10).

` -prc ` print classification results.

### Other options

` -rand ` seed for the random generator (default 0).

` -help ` display this list of options.

` --help ` display this list of options.

## Patterns

Each line of the patterns file must contain first an input vector `x`, and second a target vector `t`. Make sure to use a network topology that is consistent with your patterns file (the dimension of `x` must be equal to the number of input units, and the dimension of the target vector `t` must be equal to the number of output units).

The `-c` option lets you select the columns that you actually want to use in your patterns file.
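
The consistency requirement can be illustrated with a short Python sketch that splits one line into `x` and `t`. It assumes whitespace-separated numeric columns, which is a guess about the file format (check the files in `examples/` for the actual layout); the function name is hypothetical.

```python
def parse_pattern(line, n_inputs, n_outputs):
    """Split one pattern line into (x, t).

    Assumes whitespace-separated floats: n_inputs values for x,
    followed by n_outputs values for t (hypothetical format).
    """
    values = [float(tok) for tok in line.split()]
    if len(values) != n_inputs + n_outputs:
        raise ValueError(
            f"expected {n_inputs + n_outputs} columns, got {len(values)}")
    return values[:n_inputs], values[n_inputs:]

# A pattern for a network with 1 input unit and 1 output unit.
x, t = parse_pattern("0.25 0.9", 1, 1)
```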

When performing classification, the dimension of your target vector must be equal to the number of classes. For example, if you have three classes, then

• `t= (1,0,0)` will mean that the vector `x` belongs to class 0,
• `t= (0,1,0)` will mean that the vector `x` belongs to class 1,
• `t= (0,0,1)` will mean that the vector `x` belongs to class 2.

In this case, the network's output will be a vector `y=(y_0,y_1,y_2)` where `y_i` can be seen as the probability of belonging to class `i`.
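
This target encoding, and the reading of the output as class probabilities, can be sketched in Python as follows. Both helper names are hypothetical, and picking the most probable class is one natural decision rule, not necessarily the one applied by the `-prc` option.

```python
def one_hot(class_index, n_classes):
    """Target vector t with 1 at the class position and 0 elsewhere."""
    return [1.0 if i == class_index else 0.0 for i in range(n_classes)]

def predicted_class(y):
    """Index of the largest output, i.e. the most probable class."""
    return max(range(len(y)), key=lambda i: y[i])

t = one_hot(1, 3)                         # target for class 1 among 3 classes
label = predicted_class([0.1, 0.2, 0.7])  # most probable class for this output
```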

As already said at the beginning of this page, you must have a sufficiently large number of patterns (compared to the number of weights in the network) if you want to avoid overfitting (see [Bishop96]).

It is recommended to use only a part of your data to train the neural network, and to test the trained network on the rest of the data. Overfitting can be observed when the network fits very well on the training data, and badly on the test data.
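
A simple way to produce the two sets is a random split, sketched below in Python. The 70/30 proportion, the fixed seed, and the function name are arbitrary choices for the illustration; ANNiML itself does not provide this step.

```python
import random

def split_patterns(patterns, train_fraction=0.7, seed=0):
    """Shuffle the patterns and split them into (training set, test set)."""
    shuffled = list(patterns)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_patterns(list(range(10)))
```

Each part can then be written to its own patterns file and passed to `-train` and `-test` respectively.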

Ideally, cross-validation would be even better: only a part of the training data is actually used to train the network, and the rest (the validation set) is used to evaluate its performance and stop the training before overfitting arises. Several different splits into training and validation sets should be tried, as well as several random initial values for the weights, before selecting the network with the best average performance. Cross-validation is not implemented yet.

Be careful to normalize the inputs of the neural network before training it. The `-norm` option does this by subtracting the average value from each input and dividing it by the standard deviation. However, it operates only on the patterns file given on the command line, so your training set and your test set may not be normalized in the same way. It is better to pre-process your initial data set and normalize the inputs before splitting it into a training set and a test set.

A few examples of patterns files can be found in the `examples/` directory. See module `Ann_patterns` for the functions that read, write, and normalize patterns.

## Functions

When using `anniml`, the user must choose the error function being minimized during the training, and also the transfer functions of the different units.

These functions are described in module `Ann_func`.

The default options are set to perform regression, with:

• sum of squares error function
• identity transfer function for the output layer
• logistic activation function for the hidden units. Another possible choice for the activation functions is the hyperbolic tangent (tanh).

In order to perform classification with the network, you should use the cross-entropy as error function, and the softmax function for the output layer. You can use either logistic or tanh activation functions for the hidden layers.
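
For readers unfamiliar with the classification setup, here is a Python sketch of the standard softmax output function and of the cross-entropy error for a single pattern. These are the textbook definitions; they are not copied from module `Ann_func`, whose exact code may differ.

```python
import math

def softmax(a):
    """Map the output activations to positive values that sum to 1."""
    m = max(a)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, t):
    """Cross-entropy error -sum_k t_k * ln(y_k) for one pattern."""
    return -sum(tk * math.log(yk) for yk, tk in zip(y, t) if tk > 0)

y = softmax([2.0, 1.0, 0.1])
err = cross_entropy(y, [1.0, 0.0, 0.0])   # error when the true class is 0
```

Because the softmax outputs sum to 1, they can be read as class probabilities, and the cross-entropy penalizes a low probability assigned to the true class.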

## Examples

### Non-linear regression

As an illustration, here is a regression on noisy data produced using the following function: `y(x)= 0.5 + 0.4 sin(2*pi*x)`, to which Gaussian noise was added (standard deviation: 0.05).
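
Data of this kind could be generated with a short Python script such as the one below. It is only a sketch: the sampling of `x` uniformly in [0, 1), the number of points, the seed, and the two-column text layout are assumptions, and this is not the script that produced the files in `examples/`.

```python
import math
import random

def make_sinus_patterns(n, sigma=0.05, seed=0):
    """Return n pairs (x, t) with t = 0.5 + 0.4 sin(2 pi x) + Gaussian noise."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n):
        x = rng.random()   # assumed sampling: uniform in [0, 1)
        t = 0.5 + 0.4 * math.sin(2 * math.pi * x) + rng.gauss(0.0, sigma)
        patterns.append((x, t))
    return patterns

patterns = make_sinus_patterns(50)
# One pattern per line: input x first, then target t (assumed layout).
lines = [f"{x:.6f} {t:.6f}" for x, t in patterns]
```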

`> anniml 1 5 1 -train sinus_train.pat -dir examples -g 0.6 -max 20000 -reltol 0.`
Train a network with one hidden layer (1 input unit, 5 hidden units, and 1 unit in the output layer) on the pattern file 'sinus_train.pat', using simple gradient descent with step 0.6, with 20000 iterations at most, the default absolute tolerance 1E-7, and the relative tolerance set to 0. The resulting weights are saved in 'examples/sinus_train.wts'. The network's topology (fully connected) is saved in 'examples/sinus_train.net'.

`> anniml -test sinus_test.pat -dir examples -n sinus_train.net -w sinus_train.wts`
Test the trained network on the pattern file 'sinus_test.pat'. The test sample should be different from the one used to train the net.

`> anniml -run 0.7 -dir examples -n sinus_train.net -w sinus_train.wts`
Compute the neural network's output for an input x=0.7.

`> anniml -predict sinus_plot.new -dir examples -n sinus_train.net -w sinus_train.wts`
Read new inputs from file 'sinus_plot.new', compute the network's outputs, and save the inputs and outputs in file 'sinus_plot.pred'.

```
> cd examples
> gnuplot
```
Go to the examples directory and launch gnuplot.

`gnuplot> plot 'sinus_train.pat', 'sinus_plot.pred' w l, 0.5+0.4*sin(2*pi*x) w l`
Plot the noisy training data, the fitted curve, and the true function.

### Another example, using BFGS

`> anniml 1 5 1 -train sinus_train.pat -dir examples -w sinus_bfgs.wts -bfgs -max 1000`
Train the network, using BFGS to tune the weights. Save the resulting weights in 'examples/sinus_bfgs.wts'.

`> anniml -predict sinus_plot.new -p sinus_bfgs.pred -dir examples -n sinus_train.net -w sinus_bfgs.wts`
Read new inputs from 'sinus_plot.new', compute the network's outputs and save them in 'sinus_bfgs.pred'.

You can plot 'sinus_bfgs.pred' and compare with the results obtained with the simple gradient descent.

### Example of classification

`> anniml 2 5 2 -train sinclass_train.pat -dir examples -err ce -out softmax -act tanh -bfgs -max 1000 -rand 2 -prc `
Train the network for classification, with a cross-entropy error function and a softmax transfer function for the output units. For a change, we chose the hyperbolic tangent activation function for the hidden units.

`> anniml -test sinclass_test.pat -dir examples -n sinclass_train.net -w sinclass_train.wts  -err ce -out softmax -act tanh -prc`
Test the trained network on the pattern file 'sinclass_test.pat'.

## Reference manual

Here are the ANNiML modules (see also file `anniml.ml`):

| Module | Description |
|---|---|
| `Ann_backprop` | Forward and backward propagation through the neural network. |
| `Ann_config` | Parsing the program parameters from the command line. |
| `Ann_func` | Transfer functions and error functions. |
| `Ann_patterns` | Patterns for neural networks. |
| `Ann_random` | Random generation of the network's weights. |
| `Ann_results` | Compute and print a few indicators of the neural network's performance. |
| `Ann_topology` | Neural networks topology: units, layers, and connections. |
| `Ann_weights` | Read weights from files, and write weights to files. |
| `Annet` | ANNiML main functions. |

This documentation was produced with `ocamldoc`:

`ocamldoc -html -t ANNiML -intro Readme -d ../../doc/ANNiML/ -hide Pervasives *.mli`