Energies, free energies, and errors

Often it is more convenient to work with log-probabilities $L = \ln p$ than with probabilities. Firstly, this ensures non-negativity of probabilities $p = e^L \ge 0$ for arbitrary $L$. (For $p = 0$ the log-probability becomes $L = -\infty$.) Thus, when working with log-probabilities one can skip the non-negativity constraint which would be necessary when working with probabilities. Secondly, the multiplication of probabilities for independent events, yielding their joint probability, becomes a sum when written in terms of $L$. Indeed, from $p(A,B) = p(A \,\mbox{\sc and}\, B) = p(A)\,p(B)$ it follows for $L(A,B) = \ln p(A,B)$ that $L(A,B) = L(A \, \mbox{\sc and} \,B) = L(A) + L(B)$. Especially in the limit where an infinite number of events is combined by AND, this would result in an infinite product for $p$ but yields an integral for $L$, which is typically easier to treat.
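The second point also has a practical numerical side, which the following minimal sketch illustrates. The function name `joint_log_prob` and the choice of 2000 events with probability 0.5 each are assumptions made only for this example: the direct product of the probabilities underflows in double precision, while the sum of log-probabilities stays finite.

```python
import math

def joint_log_prob(log_probs):
    """Log-probability of independent events combined by AND: the sum of their L."""
    return sum(log_probs)

# Hypothetical example: 2000 independent events, each with probability 0.5.
probs = [0.5] * 2000

# Multiplying the probabilities directly underflows to zero in double precision.
direct_product = math.prod(probs)

# Summing log-probabilities gives a finite number, here 2000 * ln(0.5).
log_joint = joint_log_prob(math.log(p) for p in probs)

print(direct_product, log_joint)
```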

Besides the requirement of being non-negative, probabilities have to be normalized, e.g., $\int \!dx\, p(x) = 1$. When dealing with a large set of elementary events, normalization is numerically a nontrivial task. It is then convenient to work as far as possible with unnormalized probabilities $Z(x)$, from which normalized probabilities are obtained as $p(x) = Z(x)/Z$ with partition sum $Z = \int\! dx\, Z(x)$. As for probabilities, it is often advantageous to work with the logarithm of unnormalized probabilities or, to get positive numbers (for $Z(x) < 1$), with the negative logarithm $E(x) = -(1/\beta)\ln Z(x)$, in physics also known as energy. (For the role of $\beta$ see below.) Similarly, $F = -(1/\beta) \ln Z$ is known as free energy.

In defining the energy we have introduced a parameter $\beta$. Varying $\beta$ generates an exponential family of densities which is frequently used in practice by (simulated or deterministic) annealing techniques for minimizing free energies [114,156,199,43,1,203,243,68,244,245]. In physics $\beta$ is known as inverse temperature and plays the role of a Lagrange multiplier in the maximum entropy approach to statistical physics. Inverse temperature $\beta$ can also be seen as an external field coupling to the energy. Indeed, the free energy is a generating function for the cumulants of the energy, meaning that cumulants of $E$ can be obtained by taking derivatives of $F$ (more precisely, of $\beta F = -\ln Z$) with respect to $\beta$ [62,9,13,163]. For a detailed discussion of the relations between probability, log-probability, energy, free energy, partition sums, generating functions, and also bit numbers and information see [133].
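The generating-function property can be checked numerically. The sketch below assumes a finite set of states, so the partition sum is an ordinary sum, and it uses the standard relations $-\partial_\beta \ln Z = \langle E \rangle$ and $\partial_\beta^2 \ln Z = \langle E^2 \rangle - \langle E \rangle^2$. All function names and the example energies are hypothetical, and the derivatives are approximated by central finite differences.

```python
import math

def log_partition(energies, beta):
    """ln Z = ln sum_x exp(-beta E(x)), evaluated with the log-sum-exp trick."""
    a = [-beta * e for e in energies]
    m = max(a)
    return m + math.log(sum(math.exp(v - m) for v in a))

def energy_mean_and_variance(energies, beta):
    """First two cumulants of E computed directly from p(x) = exp(-beta E(x)) / Z."""
    log_z = log_partition(energies, beta)
    p = [math.exp(-beta * e - log_z) for e in energies]
    mean = sum(pi * e for pi, e in zip(p, energies))
    var = sum(pi * (e - mean) ** 2 for pi, e in zip(p, energies))
    return mean, var

def cumulants_from_log_z(energies, beta, eps=1e-4):
    """Same cumulants from finite-difference derivatives of ln Z with respect to beta."""
    plus = log_partition(energies, beta + eps)
    zero = log_partition(energies, beta)
    minus = log_partition(energies, beta - eps)
    first = -(plus - minus) / (2.0 * eps)            # -d ln Z / d beta
    second = (plus - 2.0 * zero + minus) / eps ** 2  # d^2 ln Z / d beta^2
    return first, second

example_energies = [0.0, 1.0, 3.0]  # hypothetical energies E(x)
print(energy_mean_and_variance(example_energies, 0.7))
print(cumulants_from_log_z(example_energies, 0.7))
```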

The posterior $p({h}\vert f)$, for example, can thus be written as

$\begin{displaymath} p({h}\vert f) = e^{L({h}\vert f)} = \frac{Z({h}\vert f)}{Z({H}\vert f)} = \frac{e^{-\beta E({h}\vert f)}}{Z({H}\vert f)} = e^{-\beta \left( E({h}\vert f)-F({H}\vert f) \right)} = e^{-\beta E({h}\vert f)+c({H}\vert f)} , \end{displaymath}$

(3)

with (posterior) log-probability

$\begin{displaymath} L({h}\vert f) = \ln p({h}\vert f), \end{displaymath}$

(4)

unnormalized (posterior) probabilities or partition sums

$\begin{displaymath} Z({h}\vert f) ,\qquad Z({H}\vert f) = \int\! d{h} \, Z({h}\vert f) , \end{displaymath}$

(5)

(posterior) energy

$\begin{displaymath} E({h}\vert f) = -\frac{1}{\beta} \ln Z({h}\vert f) \end{displaymath}$

(6)

and (posterior) free energy

$\begin{displaymath} F({H}\vert f) = -\frac{1}{\beta} \ln Z({H}\vert f) \end{displaymath}$

(7)

$\begin{displaymath} F({H}\vert f) = -\frac{1}{\beta} \ln \int\! d{h}\, e^{-\beta E({h}\vert f)} , \end{displaymath}$

(8)

yielding

$\begin{displaymath} Z({h}\vert f) = e^{-\beta E({h}\vert f)} , \end{displaymath}$

(9)

$\begin{displaymath} Z({H}\vert f) = \int\! d{h} \, e^{-\beta E({h}\vert f)} , \end{displaymath}$

(10)

where $\int\! d{h}$ represents a (functional) integral, for example over variables (functions) ${h}(x,y) = p(y\vert x,{h})$, and

$\begin{displaymath} c({H}\vert f) = -\ln Z({H}\vert f) = \beta F({H}\vert f) . \end{displaymath}$

(11)

Note that we did not include the $\beta$-dependency of the functions $Z$, $F$, $c$ in the notation.
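The relations between energy, partition sum, free energy, and normalized probability can be made concrete with a minimal sketch. It assumes a finite set of states ${h}$, so that the integral over ${h}$ becomes a sum; the function names `free_energy` and `normalized_probs` are hypothetical. The free energy is evaluated with the log-sum-exp trick, which is one common way (not the only one) to avoid numerical underflow for large energies.

```python
import math

def free_energy(energies, beta=1.0):
    """F = -(1/beta) ln sum_h exp(-beta E(h)), computed via log-sum-exp."""
    a = [-beta * e for e in energies]
    m = max(a)
    log_z = m + math.log(sum(math.exp(v - m) for v in a))
    return -log_z / beta

def normalized_probs(energies, beta=1.0):
    """p(h) = exp(-beta (E(h) - F)), which is normalized by construction."""
    f = free_energy(energies, beta)
    return [math.exp(-beta * (e - f)) for e in energies]

# Hypothetical energies; the second list would underflow if exp(-E) were formed directly.
print(normalized_probs([0.0, 1.0, 2.0]))
print(normalized_probs([1000.0, 1001.0]))
```

Working with $E - F$ inside the exponent means that only energy differences enter, so a constant shift of all energies leaves the normalized probabilities unchanged.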

For the sake of clarity, we have chosen to use the common notation for conditional probabilities also for energies and the other quantities derived from them. The same conventions will also be used for other probabilities, so we will write for example for likelihoods

$\begin{displaymath} p(y\vert x,{h}) = e^{-\beta^\prime \left( E(y\vert x,{h}) - F(Y\vert x,{h}) \right) } , \end{displaymath}$

(12)

for $y\in Y$ . Inverse temperatures may be different for prior and likelihood. Thus, we may choose $\beta^\prime \ne \beta$ in Eq. (12) and Eq. (3).

In Section 2.3 we will discuss the maximum a posteriori approximation, where an optimal ${h}$ is found by maximizing the posterior $p({h}\vert f)$. Since maximizing the posterior means minimizing the posterior energy $E({h}\vert f)$, the latter plays the role of an error functional for ${h}$ to be minimized. This is technically similar to the minimization of a regularized error functional as it appears in regularization theory or empirical risk minimization, which is discussed in Section 2.5.
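The equivalence of maximizing the posterior and minimizing the energy can be illustrated with a small sketch, again assuming a finite set of candidate states. The energies and the function names `posterior` and `map_index` are hypothetical. For any $\beta > 0$ the most probable state is the one with the lowest energy, and for large $\beta$ the posterior concentrates on it.

```python
import math

def posterior(energies, beta):
    """Normalized p(h|f) proportional to exp(-beta E(h|f)) on a finite set of states."""
    e_min = min(energies)  # shifting by the minimum does not change the normalized result
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

def map_index(energies):
    """Index of the state with minimal energy, i.e. the maximum a posteriori state."""
    return min(range(len(energies)), key=lambda i: energies[i])

energies = [2.3, 0.4, 1.1, 0.9]  # hypothetical posterior energies E(h|f)
for beta in (0.5, 1.0, 10.0):
    p = posterior(energies, beta)
    most_probable = max(range(len(p)), key=lambda i: p[i])
    print(beta, most_probable, map_index(energies))
```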

Let us have a closer look at the integral over model states ${h}$. The variables ${h}$ represent the parameters describing the data generating probabilities or likelihoods $p(y\vert x,{h})$. In this paper we will mainly be interested in ``nonparametric'' approaches where the $(x,y)$-dependent numbers $p(y\vert x,{h})$ themselves are considered to be the primary degrees of freedom which ``parameterize'' the model states ${h}$. Then, the integral over ${h}$ is an integral over a set of real variables indexed by $x$, $y$, under additional non-negativity and normalization conditions.

$\begin{displaymath} \int\!d{h} \rightarrow \int\! \left(\prod_{x,y}dp(y\vert x,{h})\right) . \end{displaymath}$

(13)
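As a minimal sketch of such a constrained integral, consider the simplest case of a single $x$ and a binary $y \in \{0,1\}$. Normalization then leaves one free variable $q = p(y\!=\!1\vert x,{h}) \in [0,1]$, and the integral over ${h}$ reduces to $\int_0^1 dq$. The example further assumes a flat prior, $\beta = 1$, and hypothetical data with $k$ outcomes $y=1$ among $n$ observations, so that $Z({h}\vert f) = q^k (1-q)^{n-k}$ and the partition sum is known in closed form, $k!\,(n-k)!/(n+1)!$. The function names are illustrative only.

```python
import math

def log_unnormalized_posterior(q, k, n):
    """ln Z(h|f) for a flat prior and k outcomes y=1 among n observations."""
    return k * math.log(q) + (n - k) * math.log(1.0 - q)

def partition_sum(k, n, grid_points=20000):
    """Midpoint-rule approximation of Z(H|f), the integral over q from 0 to 1."""
    dq = 1.0 / grid_points
    total = 0.0
    for i in range(grid_points):
        q = (i + 0.5) * dq  # midpoints stay strictly inside (0, 1)
        total += math.exp(log_unnormalized_posterior(q, k, n)) * dq
    return total

def exact_partition_sum(k, n):
    """Closed form of the same integral: k! (n-k)! / (n+1)!."""
    return math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)

print(partition_sum(3, 10), exact_partition_sum(3, 10))
```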

Mathematical difficulties arise for the case of continuous $x$, $y$, where $p({h}\vert f)$ represents a stochastic process and the integral over ${h}$ becomes a functional integral over (non-negative and normalized) functions $p(y\vert x,{h})$. For Gaussian processes such a continuum limit can be defined [51,77,228,144,150], while the construction of continuum limits for non-Gaussian processes is highly non-trivial (see for instance [48,37,103,249,189,233,234,34,206] for perturbative approaches or [77] for a non-perturbative $\phi^4$-theory). In this paper we will take the numerical point of view where all functions are considered to be finally discretized, so the ${h}$-integral is well-defined (``lattice regularization'' [41,204,163]).

Joerg_Lemm 2001-01-21