The principle of preferring simple models over complex ones, and of seeking an optimal trade-off between fitting the data and model complexity, is often referred to as Occam's razor (William of Occam, 1285-1349). Regularization terms, penalizing for example non-smooth (``complex'') functions, can be seen as an implementation of Occam's razor.
The related phenomenon appearing in practical learning is called over-fitting [224,96,24]. Indeed, when studying the generalization behavior of trained models on a test set different from the training set, one often finds that there is an optimal model complexity. Owing to their higher flexibility, complex models can achieve better performance on the training data than simpler models; on a test set independent of the training set, however, they can perform worse.
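This behavior can be sketched numerically. The following illustration is not taken from the text; the data-generating function, noise level, sample sizes, and polynomial degrees are all assumptions chosen for the sketch. Polynomials of increasing degree are fitted by least squares to noisy samples; training error can only decrease as the model class grows, while the error on an independent test set typically deteriorates for the most flexible fit.

```python
import numpy as np

# Assumed toy setup: noisy samples of a smooth function, fitted by
# polynomials of increasing degree (nested model classes).
rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(2.0 * x) + 0.3 * rng.normal(size=n)  # assumed "true" function
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def train_test_mse(deg):
    # Least-squares polynomial fit of the given degree on the training set.
    coeffs = np.polyfit(x_train, y_train, deg)
    err_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    err_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return err_train, err_test

# Training error is non-increasing in the degree (nested classes);
# the test error is what exhibits an optimal intermediate complexity.
errors = {deg: train_test_mse(deg) for deg in (1, 3, 9)}
```

Plotting both error curves against the degree would show the familiar picture: monotonically decreasing training error and a U-shaped test error with a minimum at intermediate complexity.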
Notice, however, that the Bayesian interpretation of regularization terms as (a priori) information about Nature and the Frequentist interpretation as additional cost terms in the loss function are not equivalent. Complexity priors reflect the case where Nature is known to be simple, while complexity costs express the wish for simple models without the assumption of a simple Nature. Thus, while the practical procedure of minimizing an error functional with regularization terms appears to be identical for empirical risk minimization and a Bayesian Maximum A Posteriori approximation, the underlying interpretation of this procedure is different.
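The coincidence of the two procedures can be checked numerically in a standard Gaussian setting (the setup below is an assumed illustration, not from the text): for Gaussian noise with variance sigma2 and a Gaussian prior with variance tau2 on the weights, the minimizer of the L2-regularized squared loss with lambda = sigma2/tau2 is exactly the MAP estimate, even though one reading treats the penalty as a cost and the other as prior knowledge.

```python
import numpy as np

# Assumed linear-Gaussian toy model: y = X w + noise,
# Gaussian likelihood (variance sigma2), Gaussian prior on w (variance tau2).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 1.0
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=50)

# Regularized empirical risk minimizer (ridge closed form) with
# lambda chosen as sigma2 / tau2.
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Gradient of the negative log posterior,
#   -log p(w|y) = ||y - X w||^2 / (2 sigma2) + ||w||^2 / (2 tau2) + const,
# evaluated at the ridge solution: it vanishes, so the MAP estimate
# and the regularized risk minimizer are the same point.
grad = -(X.T @ (y - X @ w_ridge)) / sigma2 + w_ridge / tau2
```

The identity is purely algebraic; what differs, as the text stresses, is the interpretation attached to the penalty term.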
In particular, because the Theorem in Section 2.3 holds only for log-loss, the case of loss functions differing from log-loss requires, from a Bayesian point of view, an explicit distinction between model states and actions.
Even in saddle point approximation, this would result in a two-step procedure, where in a first step the hypothesis with maximal posterior probability is determined, while the second step minimizes the risk for actions under that hypothesis [132].
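The two-step procedure can be made concrete with a small discrete example; the hypothesis space, data, and loss table below are assumed for the sketch and do not come from the text. Step one selects the maximum-posterior hypothesis; step two chooses the action minimizing the expected loss under that single hypothesis, rather than under the full posterior.

```python
import numpy as np

# Assumed toy setup: three hypotheses about the success probability of a
# Bernoulli outcome, a uniform prior, and an asymmetric loss over two actions.
hypotheses = np.array([0.2, 0.5, 0.8])   # p(success) under each hypothesis
prior = np.array([1 / 3, 1 / 3, 1 / 3])

data = [1, 1, 0, 1, 1]                   # observed Bernoulli outcomes
k = sum(data)
lik = hypotheses ** k * (1 - hypotheses) ** (len(data) - k)
posterior = prior * lik / np.sum(prior * lik)

# Step 1: hypothesis with maximal posterior probability (MAP).
h_star = hypotheses[np.argmax(posterior)]

# Step 2: minimize the risk for the action under that hypothesis.
# loss[a, o]: cost of action a when the outcome is o (0 = failure, 1 = success).
loss = np.array([[0.0, 1.0],    # action 0: costs 1 on success
                 [3.0, 0.0]])   # action 1: costs 3 on failure
risk = loss[:, 0] * (1 - h_star) + loss[:, 1] * h_star
best_action = int(np.argmin(risk))
```

Note that the risk in step two is computed under the single MAP hypothesis only; averaging the loss over the full posterior instead would in general select a different action.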