## Bayesian methods for neural networks - FAQ.

compiled by David J.C. MacKay

For a review paper on Bayesian methods for neural networks, please see my publications page, in particular the papers Bayesian Interpolation' and A Practical Bayesian Framework for Backpropagation Networks' and Probable Networks and Plausible Predictions'. Most of these FAQs are from people who have read those papers and wanted clarification or further information.

If the answer to your question is not here in the FAQ, please proceed to the Bayesian Methods automated FAQ service; for other MacKay group FAQs, click here.

• You may also be interested in Radford's description of the philosophy of Bayesian inference and the giant comp.ai.neural-nets FAQ for which Radford Neal has written a page on Bayesian neural net learning.
• Anyone interested in non-linear data modelling with neural networks is encouraged to look into Gaussian processes.
• Carl Rasmussen's Gaussian Process web site.

### Give me one reason why I should be interested in Bayesian methods for neural networks

Sure, here are two!

## Contents

### Bayesian Interpolation

#### Why the posterior probability *has* to be of the form exp(-M) instead of anything else

I've got a bit stuck on why the posterior *has* to be of the form exp(-M) instead of anything else.
 where M = E(data) + E(weights) = total "cost function"
ie. in
P(w|D,alpha,beta,H) = f(M) / Z
why are we constrained to use exp for f: couldn't we use something else and thereby derive a completely different probabilistic interpretation? Is it that we also want the addition of an arbitrary constant to have no effect? [ie. we want Posterior_with_M = Posterior_with_M+C, and exp is the only function that'll do the trick].
OK, there's two answers to this one:
• if we want to model the noise as independent from point to point, then there is only one way to map the error function sum ( residual^2 ) onto a probabilistic model, and that is through "Exp". There are papers that prove this theorem by Tishby and people, I think.
• However there is another way to relate the error functions to posterior probabilities, which is a bit of an advanced topic. In this alternative view we don't assume independence of the residuals. Here is a sketch: Start from the probabilistic model
      P(w|D,alpha,beta,H) = exp(-M)/Z

and integrate over alpha and beta to represent the fact that we don't know them. Then the posterior probability P(w|D,H) has the form of a product of student distributions in which the w-dependent bits are just E_D and E_W. The reason for getting a different form from exp is that we are no longer assuming independence of the residuals--if we see how big some residuals are, it gives us an idea how bog other residuals are going to be. When you compute the derivative you get something almost equivalent to the standard derivative of standard learning. For more on this see the paper of Weigend and Buntine and my paper Hyperparameters: optimize, or integrate out? alpha.ps.gz, abstract.

#### Relationship between the likelihood function and a noise model

I don't understand the interpretation of the (regression) likelihood: if we're saying the likelihood = exp(-E(data)) / Z, I see that the likelihood of a given set of weights is at a maximum if its outputs equal the targets and dies off as a gaussian as those outputs differ from the targets, but I don't understand how to picture this in terms of "noise".
Sounds like you are very close here! You have clearly got the idea of the likelihood as a function of w being a function that measures how good w is, and that is exactly right. But let's step back a moment. Let's assume that there is a true function f(x;w), and that the targets t really are obtained by a little man adding noise nu which is Gaussian.
 t = f(x;w) + nu

We could spell this formula out longhand as a statement about a probability distribution,
 P ( t | w , H ) = Normal ( mean = f(x,w) , variance = sigma^2_nu )

And the assumption that all the noise terms nu are independent means that the probability of all the target values D = {t} is :
 P ( {t} | w , H ) = exp - 1/2 ( sum of residuals^2 ) / Z_D

Then we can ask the question how probable are weights w?', and get the unique answer (given our assumptions, and leaving out alpha and beta for brevity) by Bayes theorem -
 P( w | D , H ) = P ( D | w , H ) P ( w | H )
---------------------------
P ( D | H )

Notice the first term that measures how good w is according to the data is P ( D | w , H ), which is just the probability distribution of the assumed noise, P ( {t} | w , H ).

#### Validity of updating alpha and beta when w is not at a minimum of M(w)

From: udah222@bay.cc.kcl.ac.uk  Dirk Husmeier

I have read your introductory papers on the Bayesian framework for training multilayer perceptrons ("Bayesian Interpolation", "A Practical Bayesian Framework for Backpropagation Networks", Neural Computation 4 (1992)) with great interest, but have a question concerning the practical implementation. The scale parameters alpha and beta can be calculated via the number of free parameters, gamma= k-alpha*Trace(A^-1), according to 2*alpha*Ew(w_mp)=gamma, 2*beta*Ed(w_mp)=N-gamma (equations (4.8),(4.10) in "Bayesian Interpolation"). This assumes a quadratic approximation for M, equation (4.5), which certainly holds locally in the neighbourhood of w_mp. However, the parameter configuration w_mp is not known a priori and actually shall be found by minimizing M(w)= alpha*Ew+beta*Ed. It does not become clear how the scale parameters are chosen during the training process. Are they calculated using modified equations (4.8) and (4.10), replacing Ew(w_mp) and Ed(w_mp) by their current values Ew(w) and Ed(w) ? However, then grad(M) != 0, so the quadratic approximation, equation (4.5), does not hold any more. Can one really assume that the algorithm based on the calculation of Ew(w) and Ed(w) converges in some self-consistent way and finally leads to the minimum of M(w)?
Good question. There are several answers to this question.
• If the algorithm,
do {
partial optimization of w ;
re-estimation of alpha, beta ;
(using either expensive
or cheap-and-cheerful formulae)
} lots of times

does converge to some state {w,alpha}, then that state is a self-consistent solution, i.e. a pair such that w minimizes M(w|alpha) and alpha maximizes the evidence subject to the approximations and assumptions. So, if it works, it is OK. And empirically I find it works OK. I have never encountered instability or other problems with this algorithm. But there are no cast-iron guarantees in this field.

In the special case of linear models, it is possible to justify as a safe and convergent strategy the updating of alpha by the given formula before w has converged. This update procedure can be viewed as an E-M algorithm. See Ensemble Learning and Evidence Maximization'-- nips.ps.gz, abstract, and the reference to Neal and Hinton therein, for the concepts needed to confirm this assertion.

• But why should we expect it to work? Here is a possible answer. Assume that M(w) is locally quadratic. Then consider what happens if we apply the reestimation formulae for alpha and beta with w not quite at the minimum w_MP. By the local-quadratic assumption, the value of gamma will be the same as if it had been evaluated at w_MP, because A is locally constant. What about Ew and Ed? Because we are not at w_MP, the values of these quantities will not be exactly equal to their values Ew(w_mp), Ed(w_mp) there. But Ew and Ed are typically slowly varying quantities with w; once the optimization is getting somewhere, at least. Of course, it depends what optimizer you use, what sequence of w's you will explore. But thinking of conjugate gradients, for example, and concentrating on the w produced at the end of each line minimization, I think it is true to say that in a typical neural network, these quantities will change slowly. So if you are close to an optimum wmp, the values of alpha and beta found using the algorithm will not be greatly different from those that would be found at the optimum. Note the fact that grad M is not 0 is not relevant.
• Finally though, I should emphasise that the above algorithm should be used with caution, because the above arguments do not apply if the partial optimization of w is so incomplete that we are miles away from a good solution. For example, I know of people who have implemented the algorithm with a poor quality optimizer of M(w); after a few iterations of the optimizer, they applied the re-estimation formulae. The result is that the weights have moved so little from zero that alpha is reestimated to be huge, and the weights all decay to zero. Nothing is learnt. It is important therefore that the re-estimation formulae are only used after a substantial period of optimization using a quality optimizer. More about quality optimizers
DJCM Thu Jan 19 11:34:09 GMT 1995

#### Where to find background information on linear algebra?

How to derive w_mp=A^(-1)*B*w_ml?

Since I began to get interested in Bayesian probabilities (from a ftp draft of the future book of E.T. Jaynes), I have lacked many mathematical basis, and the books I "randomly" opened were anti-bayesian to the point they were quite useless. Do you know 1 or 2 good "technical" books on the mathematical basis of Bayesian probabilities, including the manipulation of Gaussian distributions (you often mention Gull, but I prefer to ask you before searching for them) ? And where can I find explanations on Hessian matrixes, and the tools to understand formulas like
d                      -1 dA                            2
-- log det A = Trace (A   --), and your paragraph on chi expectation being
da                        da                                    N or N-1 ?

Thanks, Eric Dedieu
I think that you will be able to get good information on matrices and Gaussians from any book on ordinary statistics and on linear algebra; it doesn't need to be Bayesian. Duda and Hart is a good book, that is Bayesian, and does a lot of Gaussian stuff. You will need to get a book on linear algebra and read up on eigenvectors and eigenvalues, traces and determinants. The mathematics (eg of chi^2 ~= N) is *not* hard. You just need a little confidence, and you should be able to confirm this result starting from the definition of the variance of a Gaussian.

For hints on how to understand

d                      -1 dA
-- log det A = Trace (A   --),
da                        da

take a look at this section.

#### What is Cross validation?

what is the cross validation method for network training in supervised neural networks?
There are typically a load of tricky parameters to set in a standard neural network, assuming you are not using Bayesian methods already. (For example, number of hidden units, weight decay rates, choice of input units, choice of noise model.) One popular way to set these parameters is to leave out a bit of the training data, calling it "validation data"; and try varying the tricky parameter(s) and see how the generalization error on the validation set varies. Because of the inherent noise in real data sets it is a good idea to repeat the cross--validation experiment multiple times, using different divisions of the data into train and validation data. There are many ways of making this systematic. Leave one out cross validation, Jackknife, bootstrapping, are a few names of methods.
Finally, one should test on an unseen "test set" that is completely fresh data not previously used for cross-validation.
That's cross-validation, and two things to emphasize about it are (1) it uses a noisy performance measure (the validation error); (2) it is hard to simultaneously control multiple tricky parameters with cross-validation because it is hard to search in a high-dimensional space for the optimum of a noisy function.
In contrast, in Bayesian complexity control methods (1) we use the evidence which is not noisy a functions of the parameters; (2) we can find the GRADIENT of the evidence' with respect to the parameters, so allowing search in high-dimensional complexity control spaces.

#### Relation between Bayes and GCV

Is GCV the same as the Bayesian approach to optimizing alpha?
Cross-validation is undeniably a sensible pragmatic procedure for model comparison and for setting regularisation constants. But it is not the same as evaluating the evidence. Here is my argument for why the evidence and cross validation are fundamentally different: The evidence can be decomposed as follows if we want: (subsuming alpha, beta into H, and using my notation -- data points are t_1..t_N)
 P( t_1...t_N | H ) = P( t_1 | H ) P( t_2 | t_1, H ) P( t_3 | t_1,t_2, H )
* ... P( t_N | t_1..t_N-1, H )

-- i.e. the product of all the predictive probabilities for each of the data points, using the model trained' on the previous data points. Now by arbitrary reordering of the points, I can expand the evidence in N! different such ways. The individual terms in the expansion will vary in value as I do this, but the overall product will stay the same. One way of representing this is to make a plot of the log predictive probability for each data point:
   | - log P( t_i | t_1...t_i-1, H )
|*
| *
|  **
|    * *
|     * * *
|        *   *
|          **
|             **   **
|               ***   *     *
|                    * *****  ****  **
|                            *    **  *
|
---+----------------------------------------------->  i
N

Now, as we vary the arbitrary labelling of the points, the *'s will wander up and down, but the log evidence, being (minus) the INTEGRAL under this curve, will stay the same. In the above I have drawn the typical behaviour for the terms as a function of i: the late data points are more predictable than the early ones (as the model learns), so the predictive error bars get smaller and the predictive probability gets bigger, typically. Anyway, it's definitely not flat, except in the most trivial case in which there is nothing to model. Only once the parameters have become fully determined will the average height of this curve level out to a constant value determined by the noise level. OK, how does this relate to CV? Well, CV, as I understand it, involves looking at the average value, over many data orderings, of
 LEAVE-ONE-OUT-ERROR =  - log P( t_N | t_1...t_N-1, H )

which is precisely the value of the *last* point on this graph. And looking at the graph, I see no way that examining the average value of the last point in the graph can tell you what the integral under the whole graph is!

Pragmatically, I think that what a *user* often cares about is what the predictive performance of his model is probably going to be for the next data point N+1:

 PRAGMATIC-PERFORMANCE = - log P( t_N+1 | t_1...t_N, H );

and it seems reasonable that LEAVE-ONE-OUT-ERROR is going to be a good predictor of this, for N>>1. Furthermore, the evidence is not necessarily going to be so well correlated with - log P( t_N+1 | t_1...t_N, H ), since the evidence is the integral, which may include a big transient tail at the i< Having established that CV and Bayes are different, I have three reasons for preferring Bayesian methods:
Philosophical
I believe that probability theory contains everything that we need in order to describe coherent inference. Any procedure that does not map onto probabilities is therefore incoherent in some way. Indeed, it is possible to construct little examples such that cross-validation gives a silly answer. This is not to say that I think crossvalidation gives silly answers in practice. I expect it usually works well; but my personal preference is to work on a theory that I know is fundamentally right. I like Bayesian methods, because I know what my assumptions are, and I know what my approximations are, and I obtain error bars with a well-defined meaning, and I can marginalise over nuisance variables in order to obtain predictive distributions. It is all well-defined, mechanical and beautiful.
Practical
Although Wahba says that crossvalidation can be used when there are multiple regularisers, I am sceptical that this would really be practical for a non-linear model with 27 regularisation constants. I think that her methods are specialised for the linear case, and that it becomes hard to do an optimisation in {alpha} in the case of a non-linear model. Whereas the Bayesian calculations, which yield the gradient of the evidence with respect to {alpha}, take the same amount of time independent of the number of alphas.
Generalisability
Why does GCV examine the sum-squared error? Why not some other norm? How should one choose between different norms? If I believe that the noise in a problem is not Gaussian but follows the distribution
 P(r) = exp ( - beta r^p ) / Z(beta, p)

where p is in [1,2], how should I optimise the model, and how should I do cross validation? How should I determine beta and p? What about the case of a classification model?

I have never seen these questions addressed in cross validation terms. Of course, a Bayesian has no problem addressing these questions.

However, having said this, I should emphasise that I think from a point of view of model criticism, it is best to do *both* Bayesian model comparison, and cross validation. Comparisons between these different methods can give useful insights into defects in the model (see my paper A practical Bayesian framework for backprop nets' for an example).

Also, if the evidence and cross validation are strongly in disagreement, I would predict that cross validation would be the better method for predicting generalisation error.

#### How to measure performance

I measure my models' performance using test error (sum squared error). Is there a more robust performance measure?
Yes. It is popular to use the test error (sum squared error) as the default performance measure, but in fact this may be a misleading criterion. In many applications there will be an opportunity not to simply make a scalar prediction, but rather to make a prediction with error bars, or maybe an even more complicated predictive distribution. It is then reasonable to compare models in terms of their predictive performance as measured by the log predictive probability of the test data or validation data. Under the log predictive error, as contrasted with the test error, the penalty for making a wild prediction is much less if that wild prediction is accompanied by appropriately large error bars. The log predictive error (LPE), assuming that for each example $m$ the model gives a prediction Normal( y^{(m)}, {\sigma^{(m)}_y}^2 ) is:
	{LPE} &=& \sum_m - \log P(t^{(m)}|D,\H) \\
& = & \sum_m \left[ \frac{1}{2}
\left( t^{(m)} - y^{(m)} \right)^2/ {\sigma^{(m)}_y}^2
+ \log (\sqrt{2 \pi} \sigma^{(m)}_y )
\right]

Also, going back to scalar predictions: consider using more robust error measures than the mean-square error. After all, is it usually the case that if an error of size 2.0 costs 3 pounds more than an error of size 1.0, then an error of size 3.0 costs 5 pounds more than the size 2.0 error, and an error of size 4.0 costs 7 pounds more than the size 3.0 error? Often a more robust measure such as the mean absolute deviation, or the mean squared error of the smallest 90\% of the residuals, may be a more realistic loss function.

But how do I compute \sigma^{(m)}_y?

#### Simple example of predictive distribution

Please write out a simple example of equation 8 from your Hyperparameters...' paper,
 P(D_2|D,alpha,H) = \int d^k w  P(D_2|w,H) P(w|D,alpha,H) .  (8)

I will use this notation for Gaussian distributions:
 Normal ( x ; mean , variance )

In the simplest case, w is a scalar and there are two data points. One training point, D = t_1, and one test point, D_2 = t_2, where
 P ( t | w )  = Normal ( t ; w , sigma^2 ).

Let the prior on w be
 P ( w | alpha ) = Normal ( w ; 0 , 1 / alpha ).

The posterior probability of w given t_1 is:
 P(w|t_1,alpha) \propto P(t_1|w)P(w|alpha)
= Normal ( w ; w_mp , sig_w1^2 )

where
 w_mp = t_1 * ( 1/sigma^2 ) / ( 1/sigma^2 + alpha )

and
 sig_w1^2 = 1 / ( 1/sigma^2 + alpha )

Let the new datum be t_2.

Equation 8 becomes:

 P ( t_2 | t_1 , alpha ) =
\int dw Normal ( t_2 ; w , sigma^2 )  Normal ( w ; w_mp , sig_w1^2 )

in words, t_2 has a probability distribution that is given by convolving the Normal posterior distribution of w with the Normal distribution of the noise. The result is:
 P( t_2 | t_1 alpha ) = Normal ( t_2 ; w_mp , ( sigma^2 + sig_w1^2 ) )


#### Model comparison in the case of Multiple hyperparameters

How do you set the prior on hyperparameters when you compare two models that have different numbers of hyperparameters? (In principle, this prior has an effect on the evidence.)
Often it is possible to define a sequence of models with increasing numbers of parameters or hyperparameters, such that the limit of an infinite number of parameters is well-behaved. In these cases, it is not important to compute the evidence as a function of the number of parameters. And indeed there is no overfitting problem as the number of parameters is increased.

However, this is not always the case. If we take Takeuchi and MacKay's interpolation model and increase the number of hyperparameters, the predictive properties do change as the number of hyperparameters increases. My intuition is that the predictive properties of the model change only in a small and unimportant way, but all the same, the question you ask is a valid one. How should the prior on {alpha} be set?

My answer would be that it is up to the user to define a prior that corresponds to the user's prior beliefs. In the case of the Takeuchi and MacKay model, the prior on the hyperparameters defines the simplicity and complexity properties that we would expect typical functions to have. Consider the simple case of an interpolation model H_2 with alpha(x) = alpha_1, x<0; alpha(x) = alpha_2, x>0. We might wish to compare this model with the simpler model H_1, alpha(x) = alpha_1 for all x.

[I note in passing that for practical purposes, it would not be unwise to always use H_2, because the extra hyperparameters of H_2 do not significantly increase its ability to overfit the data. The Occam factors penalizing the extra hyperparameters are correspondingly weak. The posterior error bars on alpha do not become ever smaller as the number of data increases.]

How should the prior on alpha_1, alpha_2 be set? A particular value for alpha_1 defines the characteristic power of the signal y(x) for x<0, and a prior on alpha_1 therefore expresses our prior uncertainty about how much power the signal y(x) might have in x<0. I think the correct way to set the prior on alpha_1 and alpha_2 is therefore to consider how uncertain one is about the power one expects the signal to have in x<0 and in x>0. In a typical problem I expect a reasonable value for the prior would say that the power is uncertain within a range of about two orders of magnitude--- log_10 alpha_1^max / alpha_1^min ~= 2, and log_10 alpha_2^max / alpha_2^min ~= 2. In some applications, the prior uncertainty might be smaller. I think it would be unusual for the prior uncertainty to be much larger. In this way, a reasonable prior can be assigned to {alpha}, and model comparison can be performed.

#### Non-invariance of MAP parameters with respect to non-linear transformations

Let us assume we want to determine a hyperparameter, say beta for the noise-level, by maximising P(beta|D)~ P(D|beta)P(beta), where D denotes a set of training data. We choose a non-informative (improper) prior so as to render the likelihood P(D|beta) data-translative, so consider u=log(beta) and p(u)~const, which, since du/d_beta= 1/beta, should be equivalent to using the original likelihood P(D|beta) with a modified prior, P(beta)= abs(du/d_beta)*P(u)~ 1/beta.

The transformation beta->u has no effect on the location of the evidence maximum, i.e., beta* maximises P(D|beta) iff u* := log(beta*) maximises p(D|u).

However, since P(beta) != P(u), the maximum for the posterior is shifted, i.e., if u** maximises P(u|D) and beta** maximises P(beta|D), then u** != log(beta**).

Hence the a posteriori most probable value of the hyperparameter seems to depend on the chosen coordinate representation.

??????? Where is the catch ?

There is no catch. The a posteriori most probable value of any parameters does depend on the chosen coordinate representation. Quite correct. This is why MAP parameters do not have any fundamental status.

Some people (including many Bayesians!) have the impression that Bayesian inference is all about finding MAP parameters (maximum a posteriori) but this is a mistaken view. Ideal Bayesian inferences are independent of chosen representation, and so do not depend on maximization. The only reason I ever maximize a posterior probability is this: *if* there is a representation in which a probability distribution is *Gaussian* (or well approximated by a Gaussian) then maximization over that probability distribution in that representation is useful, *not* because the particular parameters at the maximum have any special status, but because predictive distributions can be simply expressed (mean, variance) in terms of the maximum of the Gaussian and its curvature. Back to top

#### Follow up discussion of Non-invariance of MAP parameters

> From: Herman Bruyninckx
> Subject: Non-invariance of MAP...
>
> In Bayesian methods for neural networks - FAQ, you make the
> remark that Some people (including many Bayesians!) have the impression
> that Bayesian inference is all about finding MAP parameters (maximum a
> posteriori) but this is a mistaken view.'' This answer leaves me burning
> then *is* Bayesian inference all about? I mean, what decision processes are
> allowed' (in the sense of being invariant)?

David J.C. MacKay  responds:

Decision theory very rarely involves MAP.

The optimal decision is the one that minimizes the
expected cost.
The expected cost is found by MARGINALIZING over
the posterior distribution.

Only in toy problems will you find that the decision
from marginalizing is the same as MAP.

> Subject: Re: Non-invariance of MAP...
>
>
> > Decision theory very rarely involves MAP.
> >
> > The optimal decision is the one that minimizes the
> > expected cost.
> And what about other decision approaches, such as minmax? I guess they
> happen to be non-invariant much of the time, but they do have some
> intuitive appeal, don't they?

David J.C. MacKay  responds:

minmax can be obtained as a special case of the
general statement
'The optimal decision is the one that minimizes the
expected cost.'

Let the quantity 'u' that is minimaxed be related to the
'effective cost' by

c = exp ( beta u )

and consider what happens as you increase beta. (And
do the normal minimization of the expectation of c.)
In the limit of infinite beta, you end up choosing the minimax
action.

> Thanks *a lot* for this insight! In order to give credit to the proper
> source, allow me to ask this question: Is this a classic' in decision
> theory (if so, I apologize for not knowing it!), or is it impossible to
> find it in the literature because it is an original idea of yours?

I am afraid I don't know whether this is written down anywhere,
but I would be surprised if I am the first person to have noticed it.

David


#### Maximization of joint posterior probability of hyperparameters and parameters

Let D be a given set of target values D={y(1),...,y(T)} again, and define Ed= 0.5*sum(y(t)-f(x(t)))^2, where f(x) is the function implemented in the neural network. The noise-level sigma is treated as a hyperparameter, beta:= 1/(sigma^2), which according to your paper Neur. Comp. 4 (1992), eq.(4.10), is given by the number of well determined parameters, gamma, and the number of training data, T, according to 2*beta*Ed= T-gamma. This result is derived by maximising the evidence P(D|alpha,beta) with respect to beta, where P(D|alpha,beta) itself is obtained by integrating out w (the network weights): P(D|beta,alpha)= int[P(D|w,beta,alpha)*P(w|beta,alpha)]dw = int[P(D|w,beta)*P(w|alpha)]dw.

However, one could proceed in another way. Rather than integrate out w, adapt w and beta simultaneously, i.e., maximise P(w,beta|D,alpha)~ P(D|w,beta,alpha)*P(w|beta,alpha)= P(D|w,beta)*P(w|alpha) with respect to w and beta.

This is equivalent to maximising log[P(D|w,beta)]= log[{beta/(2*PI))^(T/2)}*exp(-beta*Ed)]= -beta*Ed+ (T/2)*log(beta)+ const, so that from d/d_beta(log[P(D|w,beta)]= -Ed+ T/(2*beta) = 0 we arrive at 2*beta*Ed= T (without the additive constant -gamma).

What is the reason for this deviation ?

It is not a deviation. It is a correct answer to a different question. This is analagous to the distinction between the sigmaN and sigmaN-1 buttons on your calculator.

Consider a data set {x} modelled as coming from a Gaussian distribution of mean mu and s.d. sigma.

If you maximize the joint likelihood function over mu and sigma you get (xbar, sigmaN). [Get a piece of paper and do it!]

If you integrate over mu and ask what is the most probable value of sigma, instead, you'll get a maximum at sigmaN-1.

To understand this it helps to make a contour plot of the likelihood function. The crucial concept is that it has a skew peak.

See this paper: ensemble.ps.gz. abstract.  Developments in Probabilistic Modelling with Neural Networks - Ensemble Learning '. And also Hyperparameters: optimize, or integrate out? alpha.ps.gz, abstract.

#### Terminology: meaning of likelihood'

What is the difference between probability and likelihood?
Whereas many people don't distinguish between these terms, they actually have different meanings and I recommend keeping the distinction. Here is the explanation, as given by Radford Neal, when criticising me for talking about "the likelihood of the data", which, I agree, is incorrect usage:

Your use of the term "likelihood" conflicts with what I believe to be the standard definition (although there are plenty of other people making the same "mistake" :-).

When you talk about "the likelihood of the data given the parameters w", you might just as well have said "the probability of the data given the parameters w" - you seem to view "likelihood" as a synonym for "probability", to be used when talking about the data rather than about parameters.

Amongst non-Bayesians, "likelihood" means something entirely different from "probability". If I recall correctly, the term was invented by Fischer. Since he had rejected Bayesian arguments, he wasn't allowed to talk about the "probability" of a parameter, so he invented the term "likelihood" instead. Likelihood is a function of the _parameters_ not the data, though its value is conditional on the data. Thus the "maximum likelihood" method consists of choosing the parameter set with largest likelihood - the phrase makes no sense if you view the likelihood as a function of the data, since you don't get to change the data to maximize anything.

So, the quantity P(data|params) is the likelihood of the parameters. If you *want* to say "likelihood of the data", just say "probability of the data" instead.

#### Contrasting Bayesian methods with frequentist methods

Do you know of any papers on significance testing where the correct Bayesian probabilities are compared with frequentist black magic (t-tests, etc.)? Yes, several such papers I think. And books. My favourites are Jaynes, Loredo, Berger. Berger is esp good because he used to be a nonBayesian.
@INCOLLECTION{Jaynes.intervals,
KEY            ="Jaynes",
AUTHOR         ="E. T. Jaynes",
TITLE          ="{B}ayesian Intervals versus Confidence Intervals",
BOOKTITLE      ="{E.T. Jaynes}. Papers on Probability,
Statistics and Statistical Physics",
EDITOR         ="R. D. Rosenkrantz",
PUBLISHER      ="Kluwer",
YEAR           ="1983",
PAGES          ="151"}
% reprinted in paperback 1989,
%  I just read utterly the best Jaynes essay ever. It is SO good; so even
% handed and confrontational; rubbing the noses of the opposition in the
% examples he gives, using the opponents of Galileo as analogy -- some of
% his opponents refused to look through his telescope to see Jupiter's
% moons, because they already knew'. It's a very pragmatic argument he uses,
% not philosophical -- just look at the results of the two approaches
% and see where they give different answers, then magnify those differences

 @INPROCEEDINGS{Loredo,
KEY            ="Loredo",
AUTHOR         ="T. J.  Loredo",
TITLE          ="From {L}aplace to Supernova {SN} {1987A}: {B}ayesian Inferencein Astrophysics",
BOOKTITLE      ="Maximum Entropy and {B}ayesian Methods, {D}artmouth, {U.S.A.}, 1989",
EDITOR         ="P. Fougere",
PUBLISHER      ="Kluwer",
YEAR           ="1990",
PAGES          ="81--142"}

% Nice quote: [from savage originally] Indeed to many {B}ayesians, belief
%       in the LP is the big difference between {B}ayesians and frequentists,
%       not the desire to involve prior information'
@BOOK{Berger,
KEY            ="Berger",
AUTHOR         ="J. Berger",
TITLE          ="Statistical Decision theory and {B}ayesian
Analysis",
PUBLISHER      ="Springer",
YEAR           ="1985"}

--- you might also like this
@Book{Sivia:96,
author =       "D. S. Sivia",
title =        "Data Analysis. A {B}ayesian Tutorial",
publisher =    "Oxford University Press",
year =         1996,
annote="ISBN 0-19-851889-7"
}

good value. Sivia was a Gull student in the generation before me.

#### About Cox axioms and 'closed hypothesis space'

My question concerns what you mean by 'closed hypothesis space'. I am familiar with Cox, 1946, but where in his paper does 'the closed hypothesis space' assumption enter? On page 82, your thesis, is written "... all properties of our hypothesis space have been articulated; thus we are forced to make explicit all our assumptions". I am still confused about the meaning of this 'closed hypothesis space'. An integral part of science is that we change our hypothesis space. We examine data , having hypotheses A and B in mind. We evaluate the posterior probabilities of A and B. But we notice (somehow) that in fact neither A nor B fits the data well [maybe by doing some non-Bayesian test]. So then we scratch our heads, saying "what is wrong?". Then (somehow) we invent an extra hypothesis or two, C, D. We re-evaluate the posterior probabilities in the new hypothesis space, and find C and D to be much more probable than A and B.

I think the above process is the way it works, and the question then is (a) can the above process be a Bayesian process? (b) if not, how does it relate to Cox?

I think that the step at which we become disatisfied with hypotheses A aand B is rarely Bayesian. It often involves an ALTERNATIVE-FREE test. Whereas Bayesian model criticism always depends on having an alternative model.

So what is wrong with cox axioms ? Why does the above science story not yield to the all-encompassing Cox? The reason, I reckon, is that Cox's second axiom says that P(A) is related to P(not-A). This is the closed-hypothesis-space assumption. And in fact, INFERENCE IS OPEN-ENDED (as Skilling says).

#### You say that the expectation of chi-squared is N+/-sqrt(N). Isn't it just N?

The expectation is N, yes. The standard deviation is sqrt[N]. So it should say "the expectation and standard deviation of chi-squared are N+/-sqrt(N)".

### Bayesian Methods for Backprop Networks

#### How do you calculate error bars on neural network outputs?

I'm interested in trying to use your formalism to find error bars. I understand, from Denker et al., how the variance (sigma^2) of the output distribution allows one to place a sort of error bar on the output confidence, and Denker et al. give a simple formula for computing this variance given the diagonal element of the Hessian. In NC 4(3):458, you show an example of error bars (actually ellipses) that you say were computed by using *all* elements of the Hessian, not merely the diagonal ones. However, I can't find the point at which you describe how to compute sigma^2 using the entire Hessian, and my grip on calculus is rusty enough that I can't seem to derive it myself. I was wondering if you might be able to provide a simple pointer to either the relevant equation or to some reference that would be more explicit in deriving the variance.
Yes, this is something I should have spelt out. I do spell it out a bit more in my recent review paper, see publications. In brief, here is the formula:
evaluate the Hessian A, and invert it -> A^-1

evaluate the sensitivity matrix G_o,i === dy_o / dw_i
[sensitivity of output y_o to weight w_i]
for all outputs o and weights i. Then the
Variance-Covariance matrix for the outputs is

G^T A^-1 G

where ^T means transpose, to make the product work right.

This is also written down on p.597 of Neural Computation -- not very useful for people wanting to understand the paper in the preceding issue! :-)

Should I ever get negative values for G^T A^{-1} G? The above approximation for the error bars rests on a gaussian approximation which assumes that the Hessian A is positive definite. (If your A is not positive definite, then I recommend forcing it to be so.) If A is pos def then G^T A^-1 G also must be positive definite. This means for example that all the diagonal elements of this matrix should be positive. To get the error bars on a single prediction, just take the square root of the corresponding diagonal element. To get error ellipsoids for two correlated predictions, plot contours of the function (delta y)(Y^{-1})(delta y), where Y = G^T A^{-1} G, and delta y is the deviation from the most probable outputs.

Another way of representing error bars is to draw typical samples from the posterior distribution P(w) propto exp( -1/2 (delta w)(A^{-1})(delta w) ). This gives entire typical functions y(x;w).

NB Error bars in classifiers are a whole different story. See my thesis or my Neural Comp papers of 1992. And see The FAQ on Classifier error bars.

#### But how do you evaluate the Hessian in the first place?

There are several ways. The first two ways I used are
Second differences
Maybe a better name for this is first differences on the gradient.
Here is how it works.
• You want to evaluate A = grad grad M at a point w*.
Repeat for each i
• Compute the gradient g^(i) = grad M at the point w = w* + epsilon e_i, where e_i is the unit vector (0,0,0,1,0,0...) which has a 1 in the ith position.
• Compute the gradient g^(-i) = grad M at w = w* - epsilon e_i.
• Approximate A_ij = [ g^(i)_j - g^(-i)_j ] / 2 epsilon
Done
This computation takes a lot of time because it requires k (or 2 k) gradient calculations, where k is the number of weights.

Sum of outer products
By linearizing the output of the network on each example in the data set, we can obtain
       A \simeq \beta \sum_n g^(n) g^(n)^T + \alpha I

where g^(n) = d y / d w, for example n. (This is a different g from the ones above.) Here terms of order d^2 y / d w^2 are omitted, which means that the net is treated as being locally linear.

This calculation scales OK with problem size. It only costs the same amount of backpropagation as does a single pass through the training set. But accumulating the entries of the matrix A, i.e., adding up the terms

g g^T,
may take a lot of time.
Maybe the method of choice is
Pearlmutter's R-backprop
@Article        {Pearlmutter,
author  =       "B. A. Pearlmutter",
title   =       "Fast Exact Multiplication by the {H}essian",
journal =       "Neural Computation",
year    =       1994,
volume  =       6,
number  =       1,
pages   =       "147--160",
annote    =       "Also available by ftp archive.cis.ohio-state.edu:
/pub/neuroprose/pearlmutter.hessian.ps.Z"
}


#### How to do Error bars when dealing with classification problems

I want to compare some different models for predicting the probability of an adverse outcome in pregnancy. I understand methods to compare point estimates of risk produced by the various models. What I do not know is how best to compare the error bars on estimates produced by the various models.

Since risk prediction falls under the auspices of "classification" rather than "regression", the difference between predicted and actual outcome is no longer normally distributed and I'm not sure how I would compare the error bars produced by different models.

Are you familiar with a practical method to compare the error bars on probability predictions produced by different models in a classification setting?

This is addressed in chapter 5 of my PhD thesis.

In a nutshell, the question is, how should we include the concept of error bars when we are dealing with a classification problem, e.g. a binary (0/1) problem.

The crucial point is to get away from the idea that what you want, when making a 0/1 prediction, is a probability "with error bars".

If I am attempting to model the probability that it's in class 1, and I know I have got a lousy model somewhere, then that means the P( ) that I spit out ought to be close to 0.5,0.5 (in the two class case). Thus the effect of "error bars" should be to "moderate" your predictions. If you have poorly determined parameters (big error bars on the parameters), your predictions will be close to 0.5, 0.5, and when the performance is assessed using the log predictive probability

   	sum_examples  log P(actual class)

you will be penalised by log(0.5) per prediction. On ther other hand if you make a poor assessment of the error bars on your parameters (e.g. if you ignore them totally) then you will make overconfident predictions; you will be automatically penalized, because when you say "I'm 0.9999 sure that that is a 1," when actually it is a 0, you will get badly burned to the tune of log 0.0001. - a big negative number.

For the case of binary classifiers this is all discussed in my paper "The evidence framework applied to classification networks". You'll note that I compare alternative models in terms of their performance on a test set in two ways: (a) the traditional way, which is to evaluate sum_examples log P(actual class| best fit parameters) and (b) the Bayesian way, which is to make an attempt at marginalization, to get P(actual class|Data ) = \int d params P(actual class| parameters) P(params | Data) and then evaluate sum_examples log P(actual class|Data) .

I call the classifier output P(class| best fit parameters) the "most probable outputs" and P(class|Data ) the "moderated outputs" -- a name which I regret introducing, since "marginalized outputs" would have done fine.

I haven't seen anyone make use of my techniques in real problems, except for my collaborator in Cambridge, Ichikawa.

An alternative way of implementing the Bayesian marginalization would be with a monte carlo approach like Radford Neal's. For multi-class problems this is the only practical approach to marginalization that I know of. (My paper only gives a gaussian approximation method for binary classification.)

Many thanks for your reply. I understand and am no longer confused 8) At the risk of asking another dumb question, does model averaging give you the same "moderation" of predictions as marginalization over network parameters? ...or does it only approximate that marginalization?

Model averaging and marginalizing over parameters both give you a "moderation" effect, yes. Both are forms of marginalization and are strongly recommended. NB, they are not the same. The ideal approach is to do both.

#### How to get classification predictions to default to a value other than 0.5?

Is there an easy way to change the smoothing formula that you briefly discuss in your PhD for the continous outputs of the bayesian learning neural network (0/1 classification) to allow smoothing of the predicted probabilities towards 0.35 instead of standard 0.5 ?

0.35 is the natural prior probability of my positive cases in my data set and I would think that smoothing the probability estimates towards 0.35 is a better choice in this case.

Great question - This is a question that has bothered me for 9 years! The bottom line is that I think this exposes a problem with the use of neural nets that have standard sigmoid or softmax outputs. It would be better to use probabilistic models that have a nonuniform measure built into them (as does the Dirichlet distribution).

The best idea I have for patching up the sigmoid model is to treat the network as a likeihood-ratio machine, and multiply by your priors (and renormalize) after the marginalization.

When training your network to produce likelihood ratios, you need to factor out the prior probability if possible. One way to do that would be to boost all the data points by a factor inverse to the prior probability when computing the gradient.

Since alpha and beta must be initially defined before they can be reestimated, have you any experience with how sensitive the final solution is to these initial values. For some real commercial data which I have been using - I have found that "incorrect" initial values for alpha and beta tend to give solutions that diverge dramatically when alpha and beta are reestimated. Is there a well known reason for why this happens? Have you any experience (or know of others) where this sort of phenomena has manifested itself? Have you any advice on how the initial values of alpha and beta should be chosen?
The initial value for alpha and beta does matter, yes. My method was to start out with beta larger than expected and alpha smaller, so that the learnt network starts by overfitting. Then the reestimation of a and b kicks in, and beta gets reduced, alpha increases, and the net finds a sensible soln.

If you have difficulties with this methodology I recommend using Radford Neal's mcmc software, which uses correct Bayesian sampling of the hyperparameters and parameters, and often works a little bit better, probably because it is more robust.

#### Problems reproducing results

At the moment, I'm trying to reproduce your experiment with the two-joint robot arm as described in "A practical bayesian framework for backpropagation networks" (Neural Computation 4(3)). The problem I cope with is, that there is not much of an increase in the error on a test set for complex models. I've experimented a little bit with the learning rate, a momentum term, the size of the training set and its noise, the range of the possible angles, and tried several stop criteria, but it does not really make a difference. Could you please mail me some information about these subjects in your experiments? It would help me a great deal.
Your message raises an amusing point: if you use a bad optimiser, then you don't have such bad overfitting problems! If you are using ordinary gradient descent with momentum, then I can believe that it will be hard to reach the true minimum (because gradient descent is not an efficient optimiser). I expect this is why you fail to get overfitting. I used a better optimiser (variable metric methods, Numerical Recipes), and it gave overfitting, as you can see in my paper. So, one might conclude 'if you have a bad optimiser, you don't need Bayesian methods!' -- however, my hope is that by using a good optimiser _and_ Bayesian methods, we can do even better.

#### Cheap generation of predictive distributions

I was recently reading your preprint 'Probable networks and plausible predictions ....', and noted your suggestions in Section 9.2. It seemed to me that a better alternative to your step (ii) would be to add noise, not to the already noisy t(m), but to the corresponding 'fitted values', given by
           y*(m)  =   y(x(m); w*).

If you use your definition of t1(m) then it seems that you are incorporating too much noise. The whole procedure seems to be a bootstrap-type approach, of which there are various variations for regression problems in the statistical literature, both parametric and nonparametric bootstraps, and even so-called Bayesian bootstraps. Your version is parametric in that you are adding Gaussian noise. In the nonparametric case you would create a set of residuals, defined by
          r*(m)   =  t(m) - y*(m),

sample from these at random, with replacement, to generate a bootstrap sample of residuals, r1(m), and define t1 by
           t1(m) = y*(m)  +  r1(m) .

I apologise in case this is all old-hat to you by now.
I understand your question, but I think the answer is no, the addition of noise that I describe is a correct procedure for generating samples from the posterior distribution of the parameters (subject to the Gaussian assumption, etc., of course). The procedure should not be viewed as a bootstrap procedure. It is not intended as such. It is intended as a Monte Carlo procedure for obtaining samples from the Bayesian posterior distribution. Maybe I can back up my assertion with the aid of a toy example. Assume that our model for an unknown parameter y is
P(y) = Normal ( 0 , alpha )

[NB here I use BUGS' convention, Normal (mean,1/variance)]

And assume that we observe t_n = Normal( 0 , beta ), n = 1 ... N. Then the posterior distribution of y is
P(y|Data,alpha) = Normal ( gamma t_bar , alpha + N beta ).

[t_bar = sum_n t_n / N, and gamma = N beta / ( alpha + N beta ) ]

Let us now consider the strategy I describe and see if it does generate samples having this probability distribtuion. We add noise to t_n, obtaining a new data error function
E_D1 = sum_n ( t_n + n_n - y )^2 / 2

where n_n ~ Normal ( 0 , beta ). We also perturb the regularizer, by changing the mean towards which y is drawn from 0 to mu, with mu ~ Normal ( 0, alpha ). The total objective function that we then minimize is
M(y) = beta E_D1 + alpha E_W1

where E_W1 = (y - mu)^2 / 2.

What is the y that minimizes this? Answer,

y_min1 = gamma t_bar + ( N beta n_bar + alpha mu ) / ( N beta + alpha )

OK, now what is the added term here?
Answer, it is a Gaussian random variable with variance
1/(N beta + alpha) .

Thus this procedure generates samples from precisely the posterior distribution.

#### Problems with small eigenvalues

While trying to evaluate the evidence for various models (MLP's and some polynomials) using the methods detailed in your Neural Computation papers, we have ran into problems with small (and negative) eigenvalues of the hessian matrix (A), which make evaluation of the determinant awkward (as I'm sure you are aware off). To overcome this we can ignore eigenvalues below a certain (arbitrary) threshold, but this seems, well, inelegant. Are there any practical methods (or tips) for determining the cut-off threshold or for keeping the eigenvalues positive? For example if the number of effective (good) parameters is say 20, then use only the largest 20 eigenvalues to determine the determinant? Or is there some relation between the eigenvalues and the size of the re-estimated alpha(s) ?
Good question. For practical purposes, how should one evaluate the trace and determinant of the Hessian? There are several comments I can make:
• It is not essential to evaluate the eigenvalues of a matrix to get its trace and determinant. The trace is simply the sum of the diagonal elements, so once the matrix has been produced, it can just be read out. Similarly the determinant is often produced as a byproduct of the process of inverting the matrix. So eigenvalues can be useful for expository purposes, but are not necessarily a good thing in practical implementation.
• Notice that the formulae for finding the optimal hyperparameters alpha and beta only involve the trace and not the determinant. They are thus insensitive to the details of the values of eigenvalues that are close to zero.
• We know by definition that if we are a local optimum of M(w) then the curvature grad grad M(w) must be positive definite, i.e. all its eigenvalues must be positive. Otherwise it isn't an optimum. If grad grad M(w) can be decomposed into two terms alpha C plus beta B, say, then for a general model we do not have any guarantee that both terms C and B are positive definite. Usually C is, but in principle B may not be. My theoretical work assumes that B is positive semi-definite. So, what to do when this assumption is wrong? I adopted the following ad hoc procedure.
I did indeed put in a cut-off which corrected' negative eigenvalues of B = Grad Grad E_D to zero. Maybe this procedure is rather naughty. I justified it at the time by assuming that it was the fault of my numerical algorithm that negative eigenvalues were appearing, in which case setting them to zero is OK. But if the Hessian really does have negative eigenvalues then I shouldn't have changed them in this way! Anyway, that is what I did. I am afraid I did not study this aspect in any detail, so I don't know how sensitive my results are to the details of this cut-off. Thodberg has made a more thorough study of the dependence of evidence calculations on this cut-off.
• There is a practical tip to make negative eigenvalues rear their head less. This is another ad hoccery, but it is recommended anyway. When we evaluate B = Grad Grad E_D, it is possible to separate the formulae into separate terms, some of which are guaranteed positive semi-definite, and some of which are not. One idea that several of us in the Hessian' community make use of is to deliberately ignore all the terms that are not guaranteed positive semi-definite.
• A final general comment. In my experience, evaluating the trace of the Hessian and using it to control hyperparameters is a good and successful idea in real applications. Evaluating the determinant and using it to evaluate the evidence for alternative models has not been so successful, though Thodberg's work is an exception to this observation. So I think that using the derivative of the evidence as a guide in a hyperparameter space is to be highly recommended, but the absolute value of the evidence, although I like it a lot in principle, in computational practice it doesn't seem to be so reliable a guide.

#### Architectures other than traditional two layer net

how to describe groups about more than one hidden lever and how to do if connecion from input units directly to output units
If your model has multiple hidden layers, my default strategy would be to put each extra set of weights between hidden layers in a regularization class by itself. Similarly direct connections from input to output could be lumped in a single class. Thus for, say, a net with a 10:10:10:1 architecture (10 inputs, 10 hidden 1, 10 hidden 2, 1 output), with direct connections from input to output, I think the deafult would be to define 5 regularization classes, each with its own independent weight decay parameter. These would be: input-to-h1; input-to-output; bias-to-h1; h1-to-h2(including biases of h2); h2-to-output(including biases of outputs).
But one could justify more complex assignments of weights to a larger number of classes, and couplings between the weight decay parameters too. For example, ARD' would put the input-h1 weights into 10 distinct classes. And one could couple the regularization constants for the input-output connections to be linked to the input-h1 regularization constants.

#### Where does the formula for the entropy of a Gaussian distribution come from?

The formula being S = k/2 (1+log(2 pi) ) + 1/2 log ( m^2 det A^{-1} )
The way to understand and prove these relationships is to think of the matrix in its diagonal basis. Any symmetric matrix can be diagonalized into the form
l1 0  0  0 ...
0  l2 0  0 ...
0  0  l3 0 ...
.   .  .  . .

where l1, l2 and l3 are its first three eigenvalues. This diagonalization involves an orthogonal transformation, i.e. a rotation, and such transformations preserve both traces and determinants; and entropies too. So whenever you have an integral of the form
\int d^k w  exp ( -1/2 w A w ) ...

you can (for convenience of thought) pretend that A is in fact a diagonal matrix, in which case, the integral has the separable form:
\int d^k w  exp ( -1/2 \sum_i l_i w_i^2 )

Such separable integrals are straightforward. All that remains is to compute the entropy of a one-dimensional gaussian, and that is left as an exercise for the reader.

How to derive w_mp=A^(-1)*B*w_ml?

Write out the expression for M(w). Differentiate with respect to w. Set derivative equal to zero to find the minimum.

#### Simple Examples of the Bayesian Occam's razor

I presented your Bayesian framework in my neural network class at MIT last week, to pretty much universal acclaim. The students particularly liked the classification example with the butterfly-shaped probability contours. There was one question that arose, however, that I didn't think I handled very well, so I thought I'd see how you handle it. Consider your arithmetic series example. Suppose that my model was the set {odd positive series, even positive series}, with p=.5 for each of the two alternatives. This model would of course have high predictive probability in your example (where the data is D=(2,4,6,8...)). More generally, the "model" consisting of the data itself always has high predictive probability. In an MDL framework, I wouldn't choose the data itself as a model because it would have a long code length. In the Bayesian framework, I would assume that that long code length would translate over into a low prior probability. But that seems somewhat adhoc; on what basis do I choose the low prior? The model seems simple enough; why should I give it a low prior? And indeed, we're now considering putting low prior probabilities on "simple" models; just the opposite of what the Occam factor is trying to do. We're fighting the Occam factor. This reasoning seems wrong, but I'm unable to pin down just how. It seems clear that the "model" {odd positive series, even positive series} should not be considered a "simple" model, but I'm not clear on just how this would cash out in the Bayesian formalism without some sort of adhockery. What might you suggest?
If I understand your question right, I think you are asking about the model that Ed Jaynes termed the "Sure thing" model, namely the model that conjectures a priori that the data will be {d_1....d_N}, which just happens to be exactly what the data actually are. The evidence for this model is unbeatably large, and Mr. Sure Thing is only going to lose if he receives a small prior.
I'll give my general asnwer to this first, then come back to the specific 2,4,6,8 example.
1) Inference *is* subjective. I have no problem with this. It is subjective in that your conclusions depend on the assumptions you make. But once all assumptions are on the table, then inference is well-defined, objective, agreed by all who conditionally accept the assumptions listed.
2) You get to choose the priors. Choose whatever priors you are happy with. There is rarely a way of asserting that a particular prior is right.
3) OK, but why should Sure THing be given a small prior? Well, I would view Sure Thing (ST) as belonging to a set of many equivalent hypotheses each of which says "data will be {d_1....d_N}", with different values of d1,...dN. It seems reasoonable to assign all of these hypotheses equal prior. Then the question is, how much prior do we want to assign to all of these ST-like hypotheses, and how much do we want to assign to all the real' hypotheses? If we choose to assign equal prior probability to all sets of hypotheses, e.g. 1/3 to arithmetic, 1/3 to cubic, and 1/3 to ST-like, then the prior assigned to ST itself is now very very small. The Occam factor appears in the prior, as it were.
4) Usually, ST is not present in our hypothesis space, and so ST is merely a philosophical item, with the above discussion saying how I would handle it, in principle, if it ever turned up. But sometimes, it really does rear its ugly head: for example, if I give a model enough hyperparameters of an inappropriately powerful sort, and if I set the hyperparameters by maximizing the evidence, then it is possible that the hyperparameters may find a value equivalent to ST. (Steve Gull calls this point in hyperparameter space the electric monastery, after the Electric Monk of Douglas Adams' book, whose function is to believe deeply in what it is told; the electric monastery is a potential problem in some image reconstruction models). If ST really is a problem in this way then there are two conclusions: (i) when dealing with models that contain electric monasteries, don't set the hyperparameters by maximizing the evidence; (ii) put a proper prior on the hyperparameters and find some way of integrating over them.
The above does not handle your specific example though. You suggested:
H_{odd/even}: the sequence will be the odd numbers in order, or the even
numbers in order.

Is this the same as ST?
My feeling is that this is a valid hypothesis that deserves to be the winner given D=2,4,6,8. I don't think it is the same as ST. That is not to say that I would assign it equal prior with the hypothesis H_a (arithmetic series). My subjective opinion is that I might give H_{odd/even} a slightly smaller prior than H_a:
Prior elucidation by Playing a guessing game:
I tell you the first n numbers in a sequence; please give me a predictive distribution for the next number. From your predictions, I can deduce what your prior is. (note this game never involves a correct answer being revealed; it is just a predictive distribution statement game)
For example, if I tell you (n=1):
		D = { 2 , ... }

what is your predictive distribution? I think it is reasonable to put quite a lot of predictive probability on "4", because of the existence of H_{odd/even}, and also the similar hypothesis:
H_multiple: the sequence is a list of multiples of x, starting with x.

However, I don't think I would put as much as 1/2 of my predicitve distbn on "4"; from which I can conclude that less than half of my prior probability is placed on H_{odd/even} and H_multiple.

#### How to regularize models in general

I'm working on...(details of model omitted)... Do you know of any regularization techniques (hopefully fairly easy to implement at first - I plan to extend the work later) for such "single layer networks"? I have read your work on Bayesian methods for neural networks and have attempted to implement your "quick and dirty" method of regularization bu I cannot seem to get it to work.
Yes, I think it would be a very good idea to regularize your networks. I would recommend using the plain sum-of-squares (or weight-decay') regularizer. [You could generalize this to sum w^p for some power p in (1, 2) if there is good reason to expect that a Gaussian prior is too tight-tailed. But I'd start with p=2.] I would strongly recommend setting up your program in a way that allows each class of weights in the model to have its own weight decay rate (though maybe to start with you will set them equal). It is plausible that your local experts' w's should have a different typical size from your gating networks' w's. And maybe even each local expert would like its own personal weight decay rate. This then leaves the question, how to set the constant alpha? (And the other constant beta that multiplies the data error term)
• beta: If you made your data with a certain additive noise level, set beta to that known value 1/sigma^2, for simplicity.
• alpha: If the cheap'n'cheerful reestimation technique is not working well (plausible) then it is possible to introspect and figure out reasonable values for the hyperparameters alpha, a priori. More about that in a moment.
• While I think of it, I should double check that your input variables are nicely behaved. Is it the case that each of your input variables has the same characteristic vertical scale? Is each of them roughly a zero-mean variable? If not, then the recommendations that follow for alpha may not apply. For example, if one input variable goes from 0 to 100 and another goes from 0 to 0.1 then it isn't a good idea for their weights to have the same decay rate. And if each input goes from 0 to 1, then that means that every input is partly playing the role of a bias unit, and the posterior probability of the biases and the weights is going to be more strongly correlated than it need be. It's better for the inputs to have ranges centred on zero (at least, that is my prejudice).
• OK, back to the alphas.
How can we set alphas to goood values a priori? It turns out that by viewing the alphas as defining a prior on the functions produced by a net, and by using our prior knowledge about the characteristic properties expected of those functions, that we can pin down their values quite well a priori. For the case of a two layer net, this is illustrated in the following draft preprint: gen.ps. In the case where each net is just single layer, the story will be even simpler. Just consider what typical functions you get as a function of (a) sigma_bias and (b) sigma_weights. Think about the characteristic gradient. And set the alphas such that the characteristic gradient, and the characteristic range of action of the function, are what you want. Have a look at the preprint and see if the above suggestion makes sense. If your inputs are +/- 1 and your output is typically in the range +/- 1 too, then probably the typical gradient that you want is of order 1, or maybe of order 10, depending how exciting the problem is; and the typical range of action is of order 1. This will motivate setting alpha_bias and alpha_weights both to something of order 1.
• Finally, a strategy I have often found useful is, early on in learning, to set all alphas to smaller values than these expected final values. This allows overfitting to happen early on in the optimization. Then I crank the alphas to their proper value (either set a priori, or else using the Bayesian formulae), and the overfitting gets washed away. This strategy is useful because if the alphas are strong at the beginning it may be that learning never starts at all.
Relationship to MDL
It seems to me that the definition $-log(evidence)$ is identical to what Rissanen calls stochastic complexity (see the chapter on stochastic complexity in his book), which is motivated by the desire to form a compact code for the data. The MDL principle, which is essentially a two-part code approximation to the SC is derived as an approximation, which is shown to be asymptotically optimal. Do you agree with this or am I missing something? Note also that the SC formulation does not need to make any assumptions about precision of parameters. The $(k/2)\log n$ factor comes out automatically.
Yes, there is a direct relationship between description lengths and Bayesian probabilities. --- indeed, this is how I first got into all this Occam's razor work. I read Steve Gull's throw away statement that Bayes and MDL were equivalent and thought it would be worth sitting down and proving it. --- proving that the Occam factor in the evidence is the same as the extra description length penalty for and extra parameter, for example. I subsequently found out that in fact this was the original motivation for MDL. MDL was first stated as a data modelling concept by Wallace and Boulton in 1968 or so, predating the Rissanen papers. Wallace and Boulton conceived of MDL simply as a way of describing and implementing Bayesian inference. So for them, the relationship between the two is trivial. I have got the impression that Rissanen disagrees with this point of view, but I don't understand why. Let me expand on it a little. Any way of assigning messages with length l(x) to events x implicitly defines a probabilistic model over x.
	p(x) = 2^{-l(x)}

---assuming that l(x) is a length in bits. If these probabilities don't add up to one, then you have a suboptimal coding system. If they do add up to one, then you have a valid code which is optimal if the true probability distribution of x (whatever we mean by that) is in fact p(x). Optimal in the sense that the expected message length is minimized. Now, you said Note also that the SC formulation does not need to make any assumptions about precision of parameters. Well, I maybe haven't read the SC formulation. But if a message coding scheme uses a parameter block followed by a data block then the parameter precision has to be taken into account. The optimal precision relates to the occam factor, i.e. to the posterior error bars, in the case of a gaussian posterior on the parameter. But there are indeed many other ways to send a message describing the data. One is not obliged to send parameters to some precision or other. Instead one can for example use on-line predictive coding, in which one chops the data x into a sequence x=(x1,x2,x3,...xN), and sends x1 using the prior predictive distribution p(x1|H) to define the message lengths, then x2 using p(x2|x1,H), etc. This can be done nicely by using the arithmetic coding algorithm invented by Rissanen. Another coding scheme in the MDL literature is Hinton's 'bits back' coding technique. See my review paper to appear in Network (ps.gz, abstract) for further discussion of this. You also said that The $(k/2)\log n$ factor comes out automatically. Yes, it does, in Rissanen's work, but NB, it is an assymptotic result. k/2 log N is not the right penalty in the general case of a well-regularised model with finite amounts of data. In such cases the evidence (and the minimal message length) is not related to k at all. The relevant quantity might be the quantity gamma, see my chapter 'Bayesian interpolation' in my thesis. I think the k/2 log N formula is often very misleading. I think the way forward is to use models with infinite numbers of parameters controlled by interesting priors. See Radford Neal's papers.
Relationship to Risk'
What is the relation between Bayesian inference for models and the risk' in Probability Theory (eg Structural Risk Minimization')?
The relationship to the Effective VC dimension' is discussed in the last two pages of
@INPROCEEDINGS{MacKay.nips4,
KEY		="MacKay",
AUTHOR		="D. J. C.  MacKay",
TITLE		="Bayesian Model Comparison and Backprop Nets",
BOOKTITLE	="Advances in Neural Information Processing Systems 4",
EDITOR		="J. E. Moody and S. J. Hanson and R. P. Lippmann",
PUBLISHER	="Morgan Kaufmann",
YEAR		="1992",
PAGES		="839-846"}


#### Criticisms of the evidence

The standard Bayesian philosophy seems always to assume that there is a true' parameter out there and that the prior is non-zero at that value. What bothers me is that in cases where this is not necessarily the case, we may expect the evidence and the generalization error criterion to yield very different model selection criteria. This can be seen, for example, in recent results by Marion and Saad from Edinburgh University who solved a linear model exactly and found a mismatch between prediction error and evidence.

My conclusion would be in general that evidence is appropriate only for large sample sizes (where for example the Gaussian assumption may be valid) and assuming the true model is on the list of models considered. For unrealizable cases, the problem is (I think) that minimizing the evidence is not necessarily the same as minimizing the generalization error EVEN asymptotically. They are just two different (albeit related) criteria.

It is certainly easy to construct cases where the evidence is not well correlated with generalization error. The evidence is just a relative measure of how probable a model is, which doesn't have any fundamental relationship to what the model's generalization error is. And yes, agreed, assymptotically the two are not related either. In fact I'd say even if the model space does include the true' model that I can construct cases where the minimum gen error is not associated with the evidence maximum. All that is needed to construct such a case is for the true model to be a little atypical in some way. [Another simple example where the evidence and gen error are unrelated is this: as the hyperparameters alpha and beta are taken to infinity, the log evidence often plunges down linearly with log alpha; whereas the generalization error obviously levels out to a constant value.]

But I would like to emphasise that there is much more to Bayesian methods than using the evidence as some sort of estimator of generalization error. This concentration on generalization error is very narrow minded. Bayesian methods can be used to optimize hyperparameters (by maximizing the evidence implicitly); and other model properties. Thus if one imagines that a noise level might vary with x, then one construct a model of sigma(x) and learn it under the guidance of the appropriate evidence function. One can use the evidence to optimize the radius of a radial basis function model. One can use multiple hyperparameters to represent a spatially varying smoothness and can infer these using the evidence. All these model improvements are likely to improve generalization error, but that's just a by-product of finding a more probable model.

The evidence is a crucial guide for one's search in model space. The evidence is a far richer guide than some mere generalization error criterion, which gives you only a single scalar, and a noisy one at that. With the evidence we can evaluate its derivatives with respect to multiple hyperparameters simultaneously.

I should also emphasise that the above uses of the evidence are very much alive and well at the small N end of things. I strongly disagree with the assertion that the evidence is only a useful creature in the infinite data limit. I now regularly use neural net models that have more parameters than there are data points. The evidence method is excellent at controlling the hyperparameters in these cases.

I think the way forward is not to concentrate on comparison of two discrete models (by the evidence or whatever) but rather to define huge single models spaces that contain alternative models as special cases.

#### What is training data', test set', and validation set'?

I have notes from one of your presentations and you refer to a VALIDATION SET.

From my understanding the initial data set was split into three parts,

a training set, used during training a test set, used during testing and a validation set.

Could you explain the purpose for the validation set.

It is common to use three sets (train, val, test) for the following reason:

At the end of an investigation, we want to have an idea how well our method would predict on truly unseen data. This is the role of the test set. If someone doesn't cheat, they will set aside the test set, never look at the performance on it, then they will finally test their method and publish the results.

What many people do is just divide the data into a training set and a test set, which they look at frequently; they tweak their algorithm until they get good performance on the test set, and then they report this performance. I hope it is obvious that this introduces a bias, such that the reported performance is probably optimistic.

OK - that has motivated splitting off a test set that we never look at until the very end. So what about the validation set? Well, in many learning methods there are hyperparameters and other control parameters that you can tweak, and what you would like to do is adjust them so as to get the best perfroamnce on the test set. But of course that would be cheating. So what we often do is divide the data that we *are* allowed to look at into a training set and a "pretend" test set, called a validation set.

#### Why is P(alpha|w,D)=P(alpha|w)?

on P. 6 of interpolation models with multiple hyperparameters', top, you state
P(\alpha,|w,D) = P(\alpha,|w) 
so \alpha only depends on D via w's dependence on D?
YES! You can see this in the equations or by drawing the graphical model for the world, which looks like this:
alpha -> w -> Data

(The w come from a prior defined by w; the data come from a noise model with mean defined by w.)
Analogy:

lightning storms influence probability of electricity failure

if electriicty failure then your beer (in fridge) goes warm; if no elec failure then your beer is cold.

Now you observe the data "elec failure". You might infer "hmm, maybe lightning storm happened".

If Joe then runs in and says, "hey, look, the beer is warm, it *must* have been a lightning storm!" you would slap him about the head and say, "come off it, we already know the power has gone". Observing D doesn't tell you anything more about alpha if you already know w.

Yes, you are welcome to use the data I used in my papers. I also strongly recommend that you use the well-defined learning database "DELVE" at Toronto.

First, A note about tar files. The Mosaic browser is stupid about tar files, I find. It gets the file, then doesn't know what to do with it, so it throws it away. If you use Mosaic, you probably have to find another way to transfer the tar files that follow. If you want, you can anonymous ftp to this machine (wol.ra.phy.cam.ac.uk) then cd pub/mackay/www to get to the same location.

Robot arm data
The training set, in all cases, was the first 200 examples in the files. The two test sets were the following 200, 200. The data is included in the old bigback simulator, in the tar file, and are available by themselves here: dat/Cinputs5 and dat/Ctargets5.

#### Question about robot arm test sets

I have recently been looking at neural nets and have used the robot arm data sets given at your web site.

My question surrounds the fact that I get quite varied results between the two test sets. In Neil's recent book on Bayesian neural networks he compares performance of his models on the robot data set against the results given in your 1992 Neural Computation paper: A practical bayesian framework..

Can you let me know which test set was used and give me any clues as to why I get better results on the second series?

imagine that you *know* that the true function is y=0 everywhere, and you evaluate your performance on a variety of independent test sets, each containing 200 data points.

The data has noise level 1.0.

What is the *probability distribution* of the test error Sum ( t-y )^2 ?

Answer, a Chi-squared distbn with mean 200 and variance 200, i.e. s.d. 14.

Thus it is perfectly normal and to be expected that the value of the test error will be different by more than 10% between any two given test sets.

The test set used by Radford and me was...... I can't remember, but you can tell by looking in my thesis and comparing with your own results.

#### Classification data

tar file. This fake data is maybe not the most interesting data for testing a classifier. It comes from Gaussians, so it is really rather easy (locally) for a neural net. Anyway, note the following division into train, test and test:
Data_files:      inputs targets
Number_data                     11000
Train from line 1 to 150, 300, 600, whatever you want up to 1000
Test_set_1                      1001    6000
Test_set_2                      6001    11000

Prediction competition data
tar file. NB, don't cheat when you try this problem! Don't look at the answers before you have completely finished your modelling.

Above I have provided this data in the form of the original competition. Another format that may be of interest is the training + validation files that I used for my neural networks. Here, the data has been substantially preprocessed. It is not clear how much my winning the competition owed to the preprocessing, and how much to the use of Bayesian methods! (Of course the Bayesian methods helped refine the preprocessing, so it is not completely separable. Any work that makes use of this representation of the data should acknowledge that the preprocessing and glitch removal were both based on examination of Bayesian models.)
The tar file contains these files:

aug4.i
2672 lines of training inputs. Each line contains 25 inputs. These lines are not in temporal order. They are reordered at random such that lines 1-666 can be used as a training set and lines 667-2672 as validation set. Also, glitches have been cut out, so not all times in the original training data are present in this file.
all.3.t
2672 lines of training data targets for output variable number 3, corresponding line-for-line to aug4.i.
all.1.t, all.2.t
Similar files for the other two output variables.
tr.aug4
The training inputs, complete, and in temporal order, in the same 25 input representation. This is the input set used to make plots of a neural net's predictions on the training data. This file is much the same as aug4.i except that it contains the inputs in the correct temporal order and with all time points included.
te.aug4
The test inputs, complete, and in temporal order, in the same 25 input representation. This is the input set used to make plots of the neural nets' predictions on the competition test data.
test1.t, test2.t, test3.t The target values corresponding to the test inputs. (Don't cheat!) 1282 lines.
omitlist omitlist.p randbits24 aug.p write_targets.p
These files detail the omitted lines (for glitches, etc) and the random numbers used to make the division into train and test. aug.p is the main program.
NB, please eyeball all these files before using them to make sure that everything makes sense. Don't trust my typing abilities. Make sanity checks!

I am writing an optimizer to optimize something whose gradient I can compute. Recommendations?
For optimization concepts, the Numerical Recipes book of Press et al is highly recommended, but for actual implementation, their code is NOT recommended. For illustration of the practical differences between optimizers, see Tim Jervis's thesis and technical report. (tim@sparktechnology.com)
When coding up a general purpose optimizer, I'd recommend taking a look at the optimizers in the Xerion package from the university of Toronto. (xerion@ai.toronto.edu) [This is a good neural nets package.] They have implemented a variety of conjugate gradient optimizers that work efficiently. There are also various optimizers in the backprob package available on this server, but they are not as well-proven as those of Xerion. I have written my own optimizer based on the numerical recipes code, called macopt. This optimizer only makes use of gradient information and doesn't actually need a routine that returns the function value. If anyone wants this, it's free to use under the GNU general public license. Macopt, when all is well, does conjugate gradients with two gradient evaluations per line search. It's been used happily by quite a few people now, and there is a C++ version too. More information about macopt.

#### Software for Bayesian neural nets

Recommendations?
Only use software that includes (1) regularization; (2) error bars on predictions; (3) some sort of complexity control (eg hyperparameter optimization using Bayesian methods, such as A.R.D.).
• Neural nets
• The Aston group (Bishop, Williams et al) produced netlab', a package for matlab, not to be confused with the nn toolbox in matlab.
• David MacKay: bigback (the first Bayesian neural net software; still in use)
• The software of Phil Goodman, Reno (goodman@unr.edu) has ARD: nevprop
• Gaussian processes (as a replacement for supervised neural networks)
• See our local Gaussian process web site. - packages exist by Chris Williams, Carl Rasmussen, Radford Neal, Gibbs and MacKay. Radford Neal's package has the greatest capabilities.

#### What does the 'seed' do in the bigback simulator?

I think this question is already answered in the manual. When a neural network is trained, its weights have to be set to initial values. It is common practice to set the weights to RANDOM initial values. This is done using a random number generator, and that generator must be given a seed. Every time you use the same seed you get the same initial weights (to allow easy reproduction of experiments). You can use any seed you want.

### A question about the work of Donald M. MacKay on Information Theory.

I've heard from various sources that Donald MacKay was your father. Is this true? I have run across several references to some ideas that he (Donald MacKay) had on information theory (originally published as "On the Quantal Aspects of Scientific Information" in _Philosophical Magazine_ in March 1950) that he claimed were complementary to Shannon's theory. Given that you are well versed in information theory, I was wondering if you were familiar with his ideas and, if so, what is your perspective on them?

I am familiar with a small fraction of my father's ideas and I think many of them were really neat, some of them were a bit simple in retrospect, and some of them (eg his views on science, religion, free will) were unfortunate.

If you want to find out what he thought about measures of information and the status of shannon information as "a" measure of information, then an ideal place to look is his book information mechanism and meaning', published by MIT press, but probably out of print, so try a library. It's a good book, though it is dated in parts.

It is a book on information theory written when there was still confusion around, and when the Shannon information had not become *the* conventional view of information'. Just to give you an idea of alternative measures of information (which do make perfectly good sense) --- there is the number-of-independent-degrees-of-freedom in a signal. There is the signal to noise ratio of the signal. He suggests names metrical information content; selective information content' for the alternative measures of information.

These names are no longer in use because it turned out people stopped being confused without the help of these names. (Except maybe philosophers who are permanently confused.)

While I am writing about him, let me mention his idea that I think is really good. He had a radical view of how the brain works which I think is spot on. He viewed it not as a feedforward device from retina to V1 to V2 etc., but as being fundamentally a generative modelling machine, in which (for example) V1 contains an internal model of the world, which allows it to anticipate what is coming in from the retina, i.e. to predict it. It sends these predictions *back* from V1 to LGN; and the LGN has the role of comparator, taking the difference between what comes in and what was already anticipated, and just sending up some representation of that difference information. This is the elementary event of perception at that level. Something not anticipated by the internal model is detected and communicated up the feedforward pathway. This causes the internal model to adjust itself in such a way as to cancel out the feedforward signal, so that there is no longer a mismatch. Notice two things. This control circuit doesn't need a high quality feedforward bit. If the ff signal just gives an indication of mismatch then the internal model can hunt around under its guidance for a hypothesis that eliminates the mismatch signal. But what *is* needed is a good quality feedback bit; we need a good quality prediction circuit from V1 back to LGN. And FACT: V1 to LGN projects 10 times as many fibres as LGN projects feedforward to V1. This fact is desperately neglected by the orthodox feedforwardist school, who have a pathetic view of the feedback circuit ("it just provides global inhibition all over the LGN", I hav heard said. What hogwash! As if evolution would use ten times as many fibres to just perform global inhibition!) This idea of my father's is present in many of his papers, all of which are rather innaccessible; for example Towards an information flow model of visual processing'.

Incidentally, the `Helmholtz machine' of Hinton et al works on the principles outlined above (roughly!).