Often it is more convenient to work with log-probabilities $L = \ln p$ than with probabilities. Firstly, this ensures non-negativity of probabilities $p = e^L \ge 0$ for arbitrary $L$. (For $p = 0$ the log-probability becomes $L = -\infty$.)
Thus, when working with log-probabilities
one can skip the non-negativity constraint
which would be necessary when working with probabilities.
Secondly, the multiplication of probabilities for independent events, yielding their joint probability, becomes a sum when written in terms of $L$. Indeed, from $p(A,B) = p(A)\,p(B)$ it follows for $L = \ln p$ that $L(A,B) = \ln p(A,B) = L(A) + L(B)$.
Especially in the limit where an infinite number of events
is combined by AND,
this would result in an infinite product for $p$ but yields an integral for $L$, which is typically easier to treat.
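For instance (a standard i.i.d. illustration; the data notation $n$, $(x_i, y_i)$ is introduced here only for this example): for $n$ independent training data the likelihood factorizes, so its logarithm becomes a sum,

$$
L(y_1,\ldots,y_n|x_1,\ldots,x_n,h)
= \ln \prod_{i=1}^n p(y_i|x_i,h)
= \sum_{i=1}^n L(y_i|x_i,h),
$$

and in a continuum limit the sum over events turns into an integral over $L$.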
Besides the requirement of being non-negative,
probabilities have to be normalized,
e.g., $\int\! dy\; p(y|x,h) = 1$.
When dealing with a large set of elementary events
normalization is numerically a nontrivial task.
It is then convenient to work as far as possible with unnormalized probabilities $Z(y|x,h)$, from which normalized probabilities are obtained as $p(y|x,h) = Z(y|x,h)/Z_Y(x,h)$, with partition sum $Z_Y(x,h) = \int\! dy\; Z(y|x,h)$.
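The numerical side of this normalization can be made concrete with a small sketch. The following Python snippet (an illustration assuming a finite set of elementary events; the function name and the input array of unnormalized log-probabilities are hypothetical) normalizes in log-space using the standard log-sum-exp shift, avoiding the underflow that direct exponentiation of large energies would cause:

```python
import numpy as np

def normalize_log_probs(log_Z):
    """Turn unnormalized log-probabilities ln Z(y) into normalized p(y).

    Subtracting the maximum before exponentiating is the usual
    log-sum-exp trick: it leaves the normalized result unchanged
    but keeps the exponentials in a representable range.
    """
    shifted = log_Z - np.max(log_Z)   # ln Z(y) - max_y ln Z(y)
    Z_y = np.exp(shifted)             # unnormalized, but overflow-safe
    return Z_y / Z_y.sum()            # p(y) = Z(y) / sum_y Z(y)

# Example: log-weights so negative that exp(log_Z) would underflow to 0.
log_Z = np.array([-1000.0, -1001.0, -1002.0])
print(normalize_log_probs(log_Z))     # approx. [0.665, 0.245, 0.090]
```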
Like for probabilities, it is also often advantageous
to work with the logarithm of unnormalized probabilities,
or to get positive numbers (for $Z < 1$) with the negative logarithm $E = -(1/\beta)\,\ln Z$, in physics also known as energy. (For the role of $\beta$ see below.) Similarly, $F = -(1/\beta)\,\ln Z_Y$ is known as free energy.
Defining the energy we have introduced a parameter $\beta$. Varying the parameter $\beta$
generates an exponential family of densities
which is frequently used in practice
by (simulated or deterministic) annealing techniques
for minimizing free energies
[114,156,199,43,1,203,243,68,244,245].
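As an illustration of such annealing schemes (a generic simulated-annealing sketch, not the specific algorithms of the cited references; the energy function and the schedule below are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(h):
    """Hypothetical one-dimensional multimodal energy E(h)."""
    return 0.1 * h**2 + np.sin(3.0 * h)

def simulated_annealing(h0, betas, step=0.5):
    """Metropolis sampling from p(h) ~ exp(-beta*E(h)) while beta grows.

    Raising beta (lowering the temperature) concentrates the
    exponential family exp(-beta*E) around the minima of E.
    """
    h = h0
    for beta in betas:
        h_new = h + step * rng.normal()
        # Accept with the Metropolis probability min(1, e^{-beta*dE}).
        if rng.random() < np.exp(-beta * (energy(h_new) - energy(h))):
            h = h_new
    return h

betas = np.geomspace(0.1, 100.0, 5000)   # slowly increasing inverse temperature
print(simulated_annealing(h0=5.0, betas=betas))
```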
In physics $\beta$ is known as inverse temperature
and plays the role of a Lagrange multiplier
in the maximum entropy approach to statistical physics.
Inverse temperature $\beta$ can also be seen as an external field coupling to the energy.
Indeed, the free energy $F$ is a generating function for the cumulants of the energy, meaning that cumulants of $E$ can be obtained by taking derivatives of $F$ with respect to $\beta$ [62,9,13,163].
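Explicitly (a standard identity, written here for a generic partition sum $Z(\beta) = \int\! dh\; e^{-\beta E(h)}$ with $\beta F = -\ln Z$):

$$
\langle E \rangle = \frac{\partial(\beta F)}{\partial\beta} = -\frac{\partial \ln Z}{\partial\beta},
\qquad
\langle E^2 \rangle - \langle E \rangle^2
= -\frac{\partial^2(\beta F)}{\partial\beta^2}
= \frac{\partial^2 \ln Z}{\partial\beta^2},
$$

and, in general, the $n$-th cumulant of $E$ is given by $(-1)^n\,\partial^n \ln Z/\partial\beta^n$.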
For a detailed discussion
of the relations between probability, log-probability, energy, free energy,
partition sums, generating functions,
and also bit numbers and information see [133].
The posterior $p(h|f)$, for example, can so be written as

$$
\begin{array}{rcll}
p(h|f) &=& e^{L(h|f)} & (4)\\[2pt]
&=& e^{L(f|h)+L(h)-L(f)} & (5)\\[2pt]
&=& Z(h|f)/Z_H(f) & (6)\\[2pt]
&=& e^{-\beta E(h|f)}/Z_H(f) & (7)\\[2pt]
&=& e^{-\beta\left( E(h|f) - F(f)\right)} & (8)\\[2pt]
&=& Z(f|h)\,Z(h)/Z_H(f) & (9)\\[2pt]
&=& e^{-\beta\left( E(f|h) + E(h)\right)}/Z_H(f) & (10)\\[2pt]
&=& e^{-\beta\left( E(f|h) + E(h) - F(f)\right)} & (11)
\end{array}
$$

with unnormalized posterior $Z(h|f) = Z(f|h)\,Z(h) = e^{-\beta E(h|f)}$, so that the posterior energy decomposes into likelihood and prior energies, $E(h|f) = E(f|h) + E(h)$, partition sum $Z_H(f) = \int\! dh\; Z(h|f)$, and free energy $F(f) = -(1/\beta)\,\ln Z_H(f)$.
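A small numerical check of this chain of identities on a toy model (hypothetical discrete model states and probabilities, with $\beta = 1$; not data from the paper):

```python
import numpy as np

beta = 1.0
# Hypothetical likelihood p(f|h) and prior p(h) over 3 discrete model states h.
p_f_given_h = np.array([0.7, 0.2, 0.1])
p_h = np.array([0.5, 0.3, 0.2])

# Posterior via Bayes' theorem directly.
posterior_bayes = p_f_given_h * p_h / np.sum(p_f_given_h * p_h)

# Posterior via energies: E(f|h) = -(1/beta) ln p(f|h), E(h) = -(1/beta) ln p(h).
E = -(1.0 / beta) * (np.log(p_f_given_h) + np.log(p_h))   # E(f|h) + E(h)
Z_H = np.sum(np.exp(-beta * E))                           # partition sum
F = -(1.0 / beta) * np.log(Z_H)                           # free energy
posterior_energy = np.exp(-beta * (E - F))                # Eq. (11)

print(np.allclose(posterior_bayes, posterior_energy))     # True
```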
For the sake of clarity,
we have chosen to use the common notation for conditional probabilities
also for energies and the other quantities derived from them.
The same conventions will also be used for other probabilities, so we will write, for example, for likelihoods

$$
p(y|x,h) = e^{L(y|x,h)} = e^{-\beta E(y|x,h)}/Z_Y(x,h)
= e^{-\beta\left( E(y|x,h) - F(x,h)\right)},
\qquad (12)
$$

with free energy $F(x,h) = -(1/\beta)\,\ln Z_Y(x,h)$.
In Section 2.3 we will discuss the maximum a posteriori approximation, where an optimal $h$ is found by maximizing the posterior $p(h|f)$. Since maximizing the posterior means minimizing the posterior energy $E(h|f)$, the latter plays the role of an error functional for $h$ to be minimized.
This is technically similar to the
minimization of a regularized error functional
as it appears in regularization theory or
empirical risk minimization,
and which is discussed in
Section 2.5.
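To make this connection concrete (a minimal sketch, assuming a Gaussian likelihood and a Gaussian prior, so that the posterior energy becomes the familiar regularized least-squares functional; the data and the regularization weight `lam` are hypothetical):

```python
import numpy as np

# Hypothetical data: y = X h + noise, with a Gaussian prior on h.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
h_true = np.array([1.0, -2.0, 0.5])
y = X @ h_true + 0.1 * rng.normal(size=50)
lam = 0.1   # ratio of noise variance to prior variance

# Posterior energy E(h|f) = E(f|h) + E(h)
#                        ~ ||y - X h||^2 / 2 + lam * ||h||^2 / 2.
# Its minimizer (the MAP solution) solves (X^T X + lam I) h = X^T y,
# i.e., ridge regression.
h_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(h_map)   # close to h_true
```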
Let us have a closer look at the integral over model states $h$. The variables $h$ represent the parameters describing the data-generating probabilities or likelihoods $p(y|x,h)$.
In this paper we will mainly be interested in
``nonparametric'' approaches where
the $(x,y)$-dependent numbers $p(y|x,h)$ themselves are considered to be the primary degrees of freedom which ``parameterize'' the model states $h$.
Then, the integral over $h$ is an integral over a set of real variables $p(y|x,h)$, indexed by $x$ and $y$, under additional non-negativity and normalization conditions:
$$
\int\! dh = \int \Big( \prod_{x,y} dp(y|x,h) \Big).
\qquad (13)
$$
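As an illustration of this ``nonparametric'' parameterization (a sketch under the assumption of finitely many discretized $x$ and $y$ values; the softmax/log parameterization is one common way, not necessarily the one used in this paper, to build the non-negativity and normalization constraints in):

```python
import numpy as np

n_x, n_y = 4, 5   # discretized input and output values (hypothetical sizes)

# Primary degrees of freedom: one unconstrained real number L[x, y]
# per (x, y) pair, playing the role of an unnormalized log-probability.
L = np.random.default_rng(2).normal(size=(n_x, n_y))

# Model state h as the table p(y|x): exponentiation enforces p >= 0,
# division by the partition sum Z_Y(x) enforces normalization over y.
Z = np.exp(L)
p = Z / Z.sum(axis=1, keepdims=True)

assert np.all(p >= 0) and np.allclose(p.sum(axis=1), 1.0)
```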