macopt-a nippy wee optimizer
I used the conjugate gradient
code in the Numerical Recipes book for several years
to train successful neural networks. But this code is actually
very poorly written. It's not just
a question of their use of global variables and other ugliness.
The optimizers can be very clumsy when applied to real functions.
- Step sizes in line minimization
Bear in mind that the step size is a number that multiplies
the gradient vector to give a movement in parameter space.
Bear in mind the possibility that the gradient may be `big'
(eg proportional to a number of data points, N, with N=1000),
whereas the typical desired movement in parameter space may be
`small', of order 1, say.
In the NR algorithm they use a general purpose bracketing procedure
to do the line minimization. The initial step
size is always `1', even though this may be a very inappropriate
step size. What is worse,
if this initial step size is too big then the routine mnbrak
makes the next step equal to `-1.6': NB, this is in the wrong direction!
We know that we are expecting the minimum to lie in the positive direction.
And it is an even bigger step than the first one, so it is likely to
give even worse numerical problems.
So this is a waste of computer time, and creates an initial bracketing
interval that is far too big and lies 3/5 in the negative direction.
Imagine, for the sake of discussion, that the true minimum is at a step
size of about 0.000001 on each line search. Then the Numerical Recipes
algorithm will waste a lot of time creating silly guesses -- about
log_2 1000000 of them, in fact. So about 10-20 unnecessary function evaluations
are done on every line search, whereas if the initial step size were
0.001, then only a few function evaluations would be needed to locate
the minimum.
In macopt I rectify this in two ways. First the initial step size of
any line minimization is inversely proportional to the gradient.
And second, the constant of proportionality is kept as a static
variable held in a special structure and adapted during the optimization.
-
Another criticism of the NR code is that they do not use gradient
information in the line search. They mention that it is possible to
use gradient information, but their example code does not use it.
In fact, with gradient information, it is easier to find the line
minimum, because you can bracket the minimum with only two gradient
evaluations.
In macopt I go the whole hog and make no use of the function value at
all. This makes for a simpler program and means that one can minimize
functions whose gradient is easier to calculate than the real thing.
(There are examples of such functions in my work.)
-
Finally, it is not necessary to locate the line minimum as accurately
as linmin does.
Their general purpose line minimizer gives huge precision, whereas
for practical purposes a rough minimization is adequate.
In macopt I just bracket the minimum and then guess where it is
by linear interpolation.
(To put it another way,
the line search
looks for a zero crossing of the inner product of grad(f) and
x
where x is the line search direction.
I have written my own conjugate gradient algorithm that attempts to
improve on the NR code. My algorithm is called macopt.
It has the following properties:
-
The optimized parameters, as in the numerical recipes code, are
all contained in one double precision vector, e.g. x[1]..x[N].
(Note the offset of 1.)
-
an adaptive step size is used for the initial step of the line
search. The initial step is inversely proportional to the
gradient, so that the actual movement in parameter
space is of a roughly constant size (which is adaptive :-)
- gradients are used in the line search
- once the minimum has been bracketed, the line search
terminates immediately, and a new line search commences
using interpolation to estimate the location of the minimum
and the value of the gradient there. Alternatively
an extra gradient computation can be made here to be safe.
- typically, when the routine has adapted its step size, two
gradient evaluations per line minimization are performed.
- macopt makes sanity checks to confirm that the line search
direction and the gradient are consistent. If they
are not then it resets its g and h vectors to the gradient.
- macopt has a structure containing arguments and pointers that
it uses to keep track of vectors and gradients.
The user can control some of these which define:
-
the maximum number of iterations, and the max number
of steps in a line minimization.
-
the tolerance for the termination condition.
The tolerance can be expressed in various ways, one
of which is `covariant' (I think that is the right
jargon).
- `verbose' can set a diagnostic level from 0 to 4.
at level 0 the program is silent except when reporting
errors.
-
macopt is quite greedy with memory. If your parameter
vector has n dimensions, then macopt needs about 8n
doubles. i.e. it creates 8 vectors for storing
gradients and stuff in. But hey, that's not as bad as
variable metrics!
I find that this algorithm sometimes is ten times faster than the NR
code. The Xerion group at University of Toronto have also written
their own conjugate gradient optimizers, and they have
adaptive conjugate gradient optimizers that only need 1.6 gradient
evaluations per line search.
| original C version |
| older C version |
Other versions
Matlab and Octave: (Mon Jun 28 2004)
Iain Murray has made matlab and octave wrappers for macopt.
| matlab wrapper |
octave wrapper |
Notes from Iain: 1. For instructions, cd macopt_oct; cat README.
The command make will download macopt.tar.gz.
2. Then run octave-2.1.57 and help macoptII.
3. I haven't wrapped the covariant version, and I haven't exposed every
option to the end user.
Java:
Macopt has been put into
JAVA
in the
Phylogenetic Analysis Library
| PAL mirror site |
local copy of tar file, pal-1.4.tar.gz
|
Instructions for the New C version
Most of the original instructions, given below,
are accurate. The only difference is that I organize my directories
differently (for ease of maintaining executables on multiple platforms).
When you tar xvf, you will get a directory called newansi.
This directory must contain a directory called bin$ARC,
for example if $ARC is i386,
it should be called bini386.
This directory should inherit its makefile from newansi/_Makefile thus:
newansi/bini386/makefile -> ../_Makefile
The above example link is created for you by the tar file.
All the .o files and executables are put into bini386.
To execute an executable, give the path to it (eg bini386/test_macII)
or modify your path to include ./bini386.
Apologies for this minor complication, hope it works OK for you!
Instructions for original C version
There is a demonstration program
called test_mac which uses macopt to minimize a quadratic function.
When you get this tar file, uncompress it, tar xvf it, (Note this creates
a directory called ansi), and cd ansi, then type
make test_macII
Hopefully you will get an executable test_macII which, when executed,
minimizes the function 1/2 xAx - bx, where A and b are given by the
user interactively. [NB, you must give a positive definite symmetric matrix,
e.g.
2 1
1 2 .]
The program also makes use of the maccheckgrad function, to check
(visually) that the function that claims to return a gradient
really does so. In one column, the gradient is given, and in the
other the numerical first difference of the objective function.
The anonymous ftp route for the tar file, if you need it, is:
ftp www.inference.org.uk; cd pub/www/mackay/c; binary.
You may wish to change the following variables.
-
In the makefile:
-
Optimization at compile time.
Cut out the -O2 -funroll-loops flags to disable optimization.
This will make things run slower but will make them easier to
debug with many debuggers.
-
In the macopt_arg structure (default values set by
macopt_defaults). [See macopt.h for further information, and see
macopt_defaults in macopt.c.]
- end_if_small_step
-
Defines whether the end of the optimization occurs
when a small step is made, or when the gradient is small.
Some users find things work well if you
make the termination condition based on the gradient
rather than the step size.
- tol
-
Defines `small' in the above senses. I recommend using
end_if_small_step=1, and setting tol using your knowledge of
your parameter space. What is a `really small' change in a
typical parameter?
Note that tol only defines when the optimization ends. It
does not have any effect on what happens during the
line searches.
- lastx and lastx_default
-
defines typical distance in parameter space at which the line
minimum is expected; both these should be set. the default is
consulted if something goes badly wrong and a reset is
demanded. It is not essential to give these good values, as the
algorithm is adaptive, but the idea is that you should specify
a typical expected step size here. Err on the small side if
unsure.
- itmax
-
Maximum number of line minimizations performed.
- rich
-
Whether a gradient is evaluated at the beginning of every line
minimization. If you can get away with this being 0, then
the optimizer runs faster.
- verbose
- if verbose = 1 then there is one report for each
line minimization. if verbose = 2 then there is
an additional report for each step of the line
minimization. if verbose = 3 then extra
debugging routines kick in.
- Failure to compile.
-
If it complains about srand being redefined, don't worry.
We don't need srand.
-
If it complains about the return type of main(), tell it
to be less pedantic.
-
Warning! maclinmin overran- inner product at 0 = 0.09674
-
This sort of error message is a problem.
The inner product at the beginning of a line search should be
negative, otherwise the whole method doesn't work at all.
If you get this error message a lot, either
there is a bug in your
gradient computing software (always make use of the check_grad
routine to see if all is well with your gradient)
or macopt is being too ambitious and needs to
be a bit more conservative: set `rich' to 1.
- We are using your macopt package on an application
that we would like to release under the GNU Lesser General Pulic License
(LGPL).
Do you offer a license that will permit us to distribute your code under
the LGPL?
Yes, as of Thu 15/8/02, the macopt releases in both C and C++ on my website
are distributed under the LGPL; the associated example test files
are distributed under the GPL.
Note however that three files in the tar file belong to other people:
rand.h is copyright by Radford Neal. (It is not actually used by macopt,
so this doesn't matter.)
nrutil.c and
nrutil.h are derived from the Numerical Recipes in C software
library. These minor routines
are used by macopt to handle memory allocation and deallocation.
David MacKay's work is largely supported by donations.
If you find macopt useful, please feel free to make a donation.
-
is there now any need for mynr.h in the release - nothing in
it seems to get used anywhere ?
-
You are right, mynr.h is not needed. Sorry about this untidiness.
-
I didn't initially realise how inherently macopt depends
on gradients being easily computed. Is this really that
often the case ?
-
Yes, macopt is built on the assumption that a gradient evaluation
costs only about twice as much as a function evaluation.
This is the case in all the statistical models I work
with (neural nets, etc.)
If in fact function evaluations are relatively cheap and gradients
are very expensive, then you might want to change to a different
line minimizer. [A friend told me that lnsrch in the new edition of numerical
recipes is a good line minimizer.]
However, Barak Pearlmutter points out that
Gradients can always be calculated with at worst 5 times
more effort, and typically just 2 times (even if there is an
iterate-to-fixedpoint in the function evaluation itself, which complicates
things a little.) [references]
Thus it makes sense to always make a gradient-finding routine.
References
concerning optimization and the ease of computing derivatives:
@PHDTHESIS{Speelpenning1980a,
author = "Bert Speelpenning",
month = "January",
year = 1980,
title = "Compiling Fast Partial Derivatives of Functions Given by Algorithms",
address = "Urbana-Champaign, IL",
school = "Department of Computer Science, University of Illinois at
Urbana-Champaign",
key = "Speelpenning1980a",
abstract = "This is the author's doctoral thesis. It starts by comparing
previous work in the area of symbolic differentiation of
algorithms. Specifically it considers the work of Warner in 1975,
Joss in 1976 and Kedem in 1977. The conclusions reached in this
discussion are that the work by Joss is the best in terms of
improvement. The author proceeds to describe how Joss' work can
be improved in terms of speed, accuracy and space. A package,
Jake, is described which is a compiler that takes a Fortran 66
input definition of a function. This input is limited in that
only one subroutine can be specified, and that certain Fortran 66
statements are disallowed. Jake is instructed on how to perform
its task by directives within the subroutine. Timing results are
provided on the performance of the code produced by Jake over
those where finite differencing is used.",
keywords = "point algorithm; precompiler; numerical results."
}
@BOOK{Griewank2000a,
author = "Andreas Griewank",
year = 2000,
title = "Evaluating Derivatives: Principles and Techniques of Algorithmic
Differentiation",
series = "Frontiers in Appl. Math.",
number = 19,
publisher = "SIAM",
address = "Philadelphia, PA",
key = "Griewank2000a",
isbn = "0--89871--451--6"
}
David MacKay's:
home page,
publications and code.