Statistical Decision Theory

In this section we develop a small amount of theory that provides a framework for developing models.

We first consider the case of a quantitative output. Let $X \in \mathbb{R}^p$ denote a real-valued random input vector and $Y \in \mathbb{R}$ a real-valued random output variable, with joint distribution $\Pr(X, Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a loss function $L(Y, f(X))$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$. This leads us to a criterion for choosing $f$:

$$\mathrm{EPE}(f) = \mathrm{E}\,(Y - f(X))^2 = \int [y - f(x)]^2 \Pr(dx, dy),$$
the expected prediction error. In most cases we already know $X$, but we don't know $Y$. To handle this, we can factor the joint probability into a conditional probability by conditioning on $X$, which changes the equation to

$$\mathrm{EPE}(f) = \mathrm{E}_X\,\mathrm{E}_{Y|X}\!\left([Y - f(X)]^2 \mid X\right),$$
and we see that it suffices to minimize EPE pointwise:

$$f(x) = \operatorname*{argmin}_{c}\, \mathrm{E}_{Y|X}\!\left([Y - c]^2 \mid X = x\right).$$
The solution is

$$f(x) = \mathrm{E}(Y \mid X = x),$$
the conditional expectation, also known as the regression function. Thus the best prediction of $Y$ at any point $X = x$ is the conditional mean, when best is measured by average squared error.
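
To make this concrete, here is a small numerical check (a sketch only, with a made-up conditional distribution and assuming NumPy is available): among all constant predictions $c$ for $Y$ at a fixed point $x$, the average squared error is smallest near the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conditional distribution: Y | X = x ~ Normal(2.0, 1.0),
# sampled at one fixed point x (the values are made up for illustration).
y_samples = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Average squared error E[(Y - c)^2 | X = x] for a grid of constant predictions c.
candidates = np.linspace(0.0, 4.0, 401)
sq_errors = [np.mean((y_samples - c) ** 2) for c in candidates]

best_c = candidates[np.argmin(sq_errors)]
print(f"minimizer of squared error: {best_c:.3f}")   # close to 2.0
print(f"conditional mean estimate:  {y_samples.mean():.3f}")
```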

The nearest-neighbor methods attempt to implement this recipe directly using the training data. At each point $x$ we have

$$\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right),$$

where $N_k(x)$ is the neighborhood containing the $k$ points in the training sample closest to $x$.

There are two approximations here:

  • expectation is approximated by averaging over sample data;
  • conditioning at a point is relaxed to conditioning on some region "close" to the target point.

For a large training sample size $N$, as $N, k \to \infty$ such that $k/N \to 0$, we have $\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$.
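
A minimal sketch of this recipe (the function name and data below are illustrative, not from the text): estimate $\mathrm{E}(Y \mid X = x)$ by averaging the responses of the $k$ training points closest to $x$.

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k=15):
    """Average the responses of the k training points nearest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    neighbors = np.argsort(dists)[:k]              # indices of the k closest points
    return y_train[neighbors].mean()               # locally constant estimate of E(Y | X = x0)

# Illustrative data: Y = sin(X) + noise, with X one-dimensional.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 2 * np.pi, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=500)

x0 = np.array([np.pi / 2])
print(knn_regress(x0, X_train, y_train, k=15))     # should be near sin(pi/2) = 1
```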

Now let's consider how the linear model fits this framework. We simply assume $f(x) \approx x^T\beta$. Plugging this functional form into EPE, we can solve for $\beta$ theoretically:

$$\mathrm{EPE}(\beta) = \mathrm{E}\left(Y - X^T\beta\right)^2.$$
Since $X^T\beta$ is a scalar, $X^T\beta = \beta^T X$, so $\mathrm{EPE}(\beta) = \mathrm{E}\left(Y^2\right) - 2\,\beta^T \mathrm{E}(XY) + \beta^T \mathrm{E}(XX^T)\,\beta$.

Therefore, $\frac{\partial\,\mathrm{EPE}}{\partial \beta} = -2\,\mathrm{E}(XY) + 2\,\mathrm{E}(XX^T)\,\beta$. Setting this derivative to zero, we get

$$\beta = \left[\mathrm{E}(XX^T)\right]^{-1}\mathrm{E}(XY).$$
Least squares replaces the expectations above by averages over the training data.
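
A small sketch of that replacement (illustrative data, assuming NumPy): the population moments $\mathrm{E}(XX^T)$ and $\mathrm{E}(XY)$ become sample averages, which gives the familiar least squares solution $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data drawn from a truly linear model y = X beta + noise.
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column plus p inputs
beta_true = np.array([0.5, 2.0, -1.0, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# beta_hat = [E(X X^T)]^{-1} E(X Y) with the expectations replaced by sample
# averages; the 1/n factors cancel, leaving (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true
```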

So both $k$-nearest neighbors and least squares end up approximating conditional expectations by averages. The difference lies in the model assumptions:

  • Least squares assumes $f(x)$ is well approximated by a globally linear function.
  • $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.

If we replace the squared error loss with the $L_1$ loss, $\mathrm{E}\,|Y - f(X)|$, the solution is the conditional median,

$$\hat{f}(x) = \mathrm{median}(Y \mid X = x),$$

which is a different measure of location, and its estimates are more robust than those for the conditional mean. However, the $L_1$ criteria have discontinuities in their derivatives, which have hindered their widespread use.
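
As with the squared-error check earlier, a small numerical sketch (a skewed toy distribution, assuming NumPy) shows that the average absolute error $\mathrm{E}\,|Y - c|$ is minimized near the median rather than the mean.

```python
import numpy as np

rng = np.random.default_rng(3)

# A skewed conditional distribution for Y at a fixed x (lognormal, made up
# for illustration), so that the mean and the median differ noticeably.
y_samples = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

candidates = np.linspace(0.1, 3.0, 581)
abs_errors = [np.mean(np.abs(y_samples - c)) for c in candidates]

best_c = candidates[np.argmin(abs_errors)]
print(f"minimizer of absolute error: {best_c:.3f}")
print(f"sample median:               {np.median(y_samples):.3f}")  # close to best_c
print(f"sample mean:                 {y_samples.mean():.3f}")      # noticeably larger
```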

If our output is a categorical variable $G$ taking values in $\mathcal{G}$, a set of $K$ classes, we can use the zero-one loss function, which charges every misclassification a single unit. The expected prediction error is

$$\mathrm{EPE} = \mathrm{E}\left[L\!\left(G, \hat{G}(X)\right)\right],$$
where again the expectation is taken with respect to the joint distribution $\Pr(G, X)$. Again we condition, and can write EPE as

$$\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L\!\left[\mathcal{G}_k, \hat{G}(X)\right]\Pr\!\left(\mathcal{G}_k \mid X\right),$$
and again it suffices to minimize EPE pointwise:

$$\hat{G}(x) = \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L\!\left(\mathcal{G}_k, g\right)\Pr\!\left(\mathcal{G}_k \mid X = x\right).$$
With the 0-1 loss function this simplifies to

$$\hat{G}(x) = \operatorname*{argmin}_{g \in \mathcal{G}} \left[1 - \Pr(g \mid X = x)\right],$$
or simply

$$\hat{G}(x) = \mathcal{G}_k \ \text{ if } \ \Pr\!\left(\mathcal{G}_k \mid X = x\right) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x).$$
This solution is known as the Bayes classifier: we classify to the most probable class, using the conditional distribution $\Pr(G \mid X)$.
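
A minimal sketch of the Bayes classifier on a toy two-class problem (the class names, densities, and priors are made up for illustration): because the class-conditional densities and priors are known here, $\Pr(\mathcal{G}_k \mid X = x)$ can be computed exactly via Bayes' rule, and we classify to the most probable class.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Toy two-class problem with known class-conditional densities and priors.
priors = {"orange": 0.5, "blue": 0.5}
params = {"orange": (-1.0, 1.0), "blue": (1.5, 1.0)}   # (mean, std) per class

def bayes_classify(x):
    """Classify to the class maximizing Pr(G = k | X = x)."""
    # Unnormalized posteriors Pr(x | k) * Pr(k); the shared normalizer Pr(x)
    # does not affect the argmax, so we skip dividing by it.
    posteriors = {k: gaussian_pdf(x, *params[k]) * priors[k] for k in priors}
    return max(posteriors, key=posteriors.get)

print(bayes_classify(-0.5))   # "orange": closer to the orange class mean
print(bayes_classify(2.0))    # "blue"
```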

1. Call it the $L_1$ criteria.
