The principle of preferring simple models over complex ones, and of seeking an optimal trade-off between fitting the data and model complexity, is often referred to as Occam's razor (William of Occam, 1285-1349). Regularization terms, penalizing for example non-smooth (``complex'') functions, can be seen as an implementation of Occam's razor.
The related phenomenon appearing in practical learning is called over-fitting [224,96,24]. Indeed, when studying the generalization behavior of trained models on a test set different from the training set, one often finds that there is an optimal model complexity. Owing to their higher flexibility, complex models can achieve better performance on the training data than simpler models; on a test set independent of the training set, however, they can perform worse.
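This behavior can be sketched numerically. The following illustration is not taken from the text; the data-generating function, noise level, sample sizes, and polynomial degrees are all assumptions chosen for the sketch. Polynomials of increasing degree are fitted by least squares to noisy samples; training error can only decrease as the model class grows, while the error on an independent test set typically deteriorates for the most flexible fit.

```python
import numpy as np

# Assumed toy setup: noisy samples of a smooth function, fitted by
# polynomials of increasing degree (nested model classes).
rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2.0 * x) + 0.3 * rng.normal(size=n)  # assumed "true" function
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def train_test_mse(deg):
    # Least-squares polynomial fit of the given degree on the training set.
    coeffs = np.polyfit(x_train, y_train, deg)
    err_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    err_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return err_train, err_test

# Training error is non-increasing in the degree (nested classes);
# the test error is what exhibits an optimal intermediate complexity.
errors = {deg: train_test_mse(deg) for deg in (1, 3, 9)}
```

Plotting both error curves against the degree would show the familiar picture: monotonically decreasing training error and a U-shaped test error with a minimum at intermediate complexity.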
Notice, however, that the Bayesian interpretation of regularization terms as (a priori) information about Nature and the Frequentist interpretation as additional cost terms in the loss function are not equivalent. Complexity priors reflect the case where Nature is known to be simple, while complexity costs express the wish for simple models without the assumption of a simple Nature. Thus, while the practical procedure of minimizing an error functional with regularization terms appears to be identical for empirical risk minimization and a Bayesian Maximum A Posteriori approximation, the underlying interpretation of this procedure is different.
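The coincidence of the two procedures can be checked numerically in a standard Gaussian setting (the setup below is an assumed illustration, not from the text): for Gaussian noise with variance sigma2 and a Gaussian prior with variance tau2 on the weights, the minimizer of the L2-regularized squared loss with lambda = sigma2/tau2 is exactly the MAP estimate, even though one reading treats the penalty as a cost and the other as prior knowledge.

```python
import numpy as np

# Assumed linear-Gaussian toy model: y = X w + noise,
# Gaussian likelihood (variance sigma2), Gaussian prior on w (variance tau2).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 1.0
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=50)

# Regularized empirical risk minimizer (ridge closed form) with
# lambda chosen as sigma2 / tau2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Gradient of the negative log posterior,
#   -log p(w|y) = ||y - X w||^2 / (2 sigma2) + ||w||^2 / (2 tau2) + const,
# evaluated at the ridge solution: it vanishes, so the MAP estimate
# and the regularized risk minimizer are the same point.
grad = -(X.T @ (y - X @ w_ridge)) / sigma2 + w_ridge / tau2
```

The identity is purely algebraic; what differs, as the text stresses, is the interpretation attached to the penalty term.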
In particular, because the Theorem in Section 2.3 holds only for log-loss, the case of loss functions differing from log-loss requires, from a Bayesian point of view, an explicit distinction between model states and actions.
Even in saddle point approximation, this would result in a two-step procedure, where in a first step the hypothesis with maximal posterior probability is determined, while the second step minimizes the risk for actions under that hypothesis [132].
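The two-step procedure can be made concrete with a small discrete example; the hypothesis space, data, and loss table below are assumed for the sketch and do not come from the text. Step one selects the maximum-posterior hypothesis; step two chooses the action minimizing the expected loss under that single hypothesis, rather than under the full posterior.

```python
import numpy as np

# Assumed toy setup: three hypotheses about the success probability of a
# Bernoulli outcome, a uniform prior, and an asymmetric loss over two actions.
hypotheses = np.array([0.2, 0.5, 0.8])   # p(success) under each hypothesis
prior = np.array([1 / 3, 1 / 3, 1 / 3])

data = [1, 1, 0, 1, 1]                   # observed Bernoulli outcomes
k = sum(data)
lik = hypotheses ** k * (1 - hypotheses) ** (len(data) - k)
posterior = prior * lik / np.sum(prior * lik)

# Step 1: hypothesis with maximal posterior probability (MAP).
h_star = hypotheses[np.argmax(posterior)]

# Step 2: minimize the risk for the action under that hypothesis.
# loss[a, o]: cost of action a when the outcome is o (0 = failure, 1 = success).
loss = np.array([[0.0, 1.0],    # action 0: costs 1 on success
                 [3.0, 0.0]])   # action 1: costs 3 on failure
risk = loss[:, 0] * (1 - h_star) + loss[:, 1] * h_star
best_action = int(np.argmin(risk))
```

Note that the risk in step two is computed under the single MAP hypothesis only; averaging the loss over the full posterior instead would in general select a different action.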