Statistical Models, Supervised Learning and Function Approximation
Our goal is to find a useful approximation $\hat{f}(x)$ to the function $f(x)$ that underlies the predictive relationship between the inputs and outputs. In the second section of this chapter, we saw that squared error loss leads us to the regression function $f(x) = \mathrm{E}(Y \mid X = x)$ for a quantitative response. The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation. But we have seen that they can fail in at least two ways:
- if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and this can result in large errors (a small simulation after this list illustrates the point);
- if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.
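To make the first failure mode concrete, here is a minimal sketch (the sample size, dimensions, and uniform design are assumptions for illustration, not from the text): it draws $N$ points uniformly in $[-1, 1]^p$ and reports how far the nearest neighbor of the target point $x = 0$ lies as $p$ grows.

```python
# Assumed setup for illustration: N points uniform in [-1, 1]^p.
# As p grows, even the *nearest* training point drifts away from the target x = 0.
import numpy as np

rng = np.random.default_rng(0)
N = 1000

for p in (1, 2, 5, 10, 20):
    X = rng.uniform(-1.0, 1.0, size=(N, p))   # training inputs
    dists = np.linalg.norm(X, axis=1)         # distance of each point to the target x = 0
    print(f"p = {p:2d}: distance from x = 0 to its nearest neighbor = {dists.min():.3f}")
```

Even with 1000 training points, the nearest neighbor moves far from the target as the dimension increases, so "nearest" no longer means "local."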
A Statistical Model for the Joint Distribution $\Pr(X, Y)$
Suppose in fact that our data arose from a statistical model $Y = f(X) + \varepsilon$, where the random error $\varepsilon$ has $\mathrm{E}(\varepsilon) = 0$ and is independent of $X$. Note that for this model, $f(x) = \mathrm{E}(Y \mid X = x)$, and in fact the conditional distribution $\Pr(Y \mid X)$ depends on $X$ only through the conditional mean $f(x)$.
The additive error model is a useful approximation to the truth. For most systems the input-output pairs $(X, Y)$ will not have a deterministic relationship $Y = f(X)$; generally there are other unmeasured variables that also contribute to $Y$, including measurement error. The additive model assumes that we can capture all of these departures from a deterministic relationship via the error $\varepsilon$.
The assumption that the errors are independent and identically distributed is not strictly necessary, but with such a model it becomes natural to use least squares as a data criterion for model estimation. In general, the conditional distribution $\Pr(Y \mid X)$ can depend on $X$ in complicated ways, but the additive error model precludes these.
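As a quick check of the conditional-mean interpretation above, the following sketch (with an assumed $f(x) = \sin(x)$ and Gaussian noise chosen purely for illustration) simulates data from $Y = f(X) + \varepsilon$ and averages the responses whose inputs fall near a target point $x_0$; the local average approximates $f(x_0)$.

```python
# Assumed f and noise level, for illustration only: data from the additive
# error model Y = f(X) + eps with E(eps) = 0.  Averaging the responses whose
# inputs fall near x0 recovers f(x0), i.e. the regression function is the
# conditional mean.
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                                   # assumed "true" function
N = 20000
X = rng.uniform(0.0, 2 * np.pi, size=N)
eps = rng.normal(0.0, 0.3, size=N)           # independent, mean-zero error
Y = f(X) + eps

x0 = 1.0
window = np.abs(X - x0) < 0.05               # crude neighborhood of x0
print("local average of Y near x0:", Y[window].mean())   # ~ f(x0)
print("f(x0):                     ", f(x0))               # sin(1.0) ~ 0.841
```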
Additive error models are typically not used for a qualitative response $G$; in that case the target function is the conditional density $\Pr(G \mid X)$, and this is modeled directly.
Supervised Learning
Supervised learning attempts to learn $f$ by example through a "teacher." The learning algorithm has the property that it can modify its input/output relationship $\hat{f}$ in response to the differences $y_i - \hat{f}(x_i)$ between the original and generated outputs.
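The sketch below is one hypothetical way to picture this learning-by-example loop (the linear learner and the gradient-style update are assumptions for illustration, not the text's algorithm): the learner repeatedly adjusts its input/output relationship $\hat{f}$ using the differences $y_i - \hat{f}(x_i)$.

```python
# Assumed illustration of "learning by example": a linear learner
# f_hat(x) = x @ beta repeatedly nudges beta to shrink the differences
# y_i - f_hat(x_i) between original and generated outputs.
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])       # assumed ground truth
y = X @ beta_true + rng.normal(0.0, 0.1, size=N)

beta = np.zeros(p)                           # learner starts knowing nothing
lr = 0.1
for _ in range(200):
    residual = y - X @ beta                  # teacher signal: original minus generated output
    beta += lr * X.T @ residual / N          # modify the input/output relationship
print("learned beta:", np.round(beta, 3))    # close to beta_true
```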
Function Approximation
The goal of function approximation is to obtain a useful approximation to $f(x)$ for all $x$ in some region of $\mathbb{R}^p$, given the representations in the training set $\mathcal{T}$.
Many of the approximations we will encounter have an associated set of parameters $\theta$ that can be modified to suit the data at hand. For example, the linear model $f(x) = x^T\beta$ has $\theta = \beta$. Another class of useful approximators can be expressed as linear basis expansions $f_\theta(x) = \sum_{k=1}^{K} h_k(x)\,\theta_k$, where the $h_k$ are a suitable set of functions or transformations of the input vector $x$. Examples include polynomial terms such as $x_1^2$ or $x_1 x_2$, trigonometric terms such as $\cos(x_1)$, and the sigmoid transformation common to neural network models, $h_k(x) = 1/(1 + \exp(-x^T\beta_k))$.
We can use least squares to estimate the parameters $\theta$ in $f_\theta$ as we did for the linear model, by minimizing the residual sum-of-squares $\mathrm{RSS}(\theta) = \sum_{i=1}^{N} \bigl(y_i - f_\theta(x_i)\bigr)^2$ as a function of $\theta$.
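A minimal sketch tying the last two paragraphs together (the particular basis functions and data-generating process are assumed for illustration): it builds a small set of $h_k(x)$, forms $f_\theta(x) = \sum_k h_k(x)\theta_k$, and picks $\theta$ by minimizing $\mathrm{RSS}(\theta)$.

```python
# Assumed basis and data, for illustration: a linear basis expansion
# f_theta(x) = sum_k h_k(x) * theta_k fitted by minimizing
# RSS(theta) = sum_i (y_i - f_theta(x_i))^2.
import numpy as np

rng = np.random.default_rng(3)
N = 300
x = rng.uniform(-3.0, 3.0, size=N)
y = np.sin(x) + rng.normal(0.0, 0.2, size=N)           # assumed data-generating process

def basis(x):
    """Stack basis functions h_k(x): constant, x, x^2, and a sigmoid unit."""
    return np.column_stack([
        np.ones_like(x),                                # h_1(x) = 1
        x,                                              # h_2(x) = x
        x ** 2,                                         # h_3(x) = x^2
        1.0 / (1.0 + np.exp(-x)),                       # h_4(x): sigmoid transformation
    ])

H = basis(x)                                            # N x K matrix of h_k(x_i)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)           # least-squares estimate of theta
print("theta:", np.round(theta, 3))
print("RSS:  ", np.sum((y - H @ theta) ** 2))
```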
While least squares is generally very convenient, it is not the only criterion used, and in some cases it would not make much sense. A more general principle for estimation is maximum likelihood estimation.
Suppose we have a random sample $y_i$, $i = 1, \ldots, N$, from a density $\Pr_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is $L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i)$. The principle of maximum likelihood assumes that the most reasonable values of $\theta$ are those for which the probability of the observed sample is largest. Least squares for the additive error model $Y = f_\theta(X) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood $\Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2)$. Although the additional assumption of normality seems more restrictive, the results are the same: the log-likelihood of the data is $L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - f_\theta(x_i)\bigr)^2$, and the only term involving $\theta$ is the last, which is $\mathrm{RSS}(\theta)$ up to a scalar negative multiplier.
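The equivalence can be checked numerically with a sketch like the one below (the one-parameter model $f_\theta(x) = \theta x$ and the grid search are assumptions for illustration): the $\theta$ that maximizes the Gaussian log-likelihood matches the least-squares estimate.

```python
# Assumed one-parameter model f_theta(x) = theta * x with Gaussian errors:
# L(theta) = -N/2 log(2*pi) - N log(sigma) - RSS(theta)/(2 sigma^2),
# so maximizing L over theta is the same as minimizing RSS(theta).
import numpy as np

rng = np.random.default_rng(4)
N, sigma = 100, 0.5
x = rng.uniform(-1.0, 1.0, size=N)
theta_true = 2.0
y = theta_true * x + rng.normal(0.0, sigma, size=N)

def log_likelihood(theta):
    rss = np.sum((y - theta * x) ** 2)
    return -N / 2 * np.log(2 * np.pi) - N * np.log(sigma) - rss / (2 * sigma ** 2)

grid = np.linspace(0.0, 4.0, 401)
theta_ml = grid[np.argmax([log_likelihood(t) for t in grid])]
theta_ls = np.sum(x * y) / np.sum(x * x)                # least-squares solution
print("max-likelihood theta:", round(float(theta_ml), 3))
print("least-squares theta: ", round(float(theta_ls), 3))  # agrees up to grid resolution
```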
Suppose we have a model $\Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x)$, $k = 1, \ldots, K$, for the conditional probability of each class given $X$, indexed by the parameter vector $\theta$. Then the log-likelihood (also referred to as the cross-entropy) is $L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i)$.
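As a concrete instance (the softmax parameterization of $p_{k,\theta}(x)$ is an assumed choice, not mandated by the text), the sketch below evaluates $L(\theta) = \sum_i \log p_{g_i,\theta}(x_i)$ on a small sample.

```python
# Assumed softmax model for p_{k,theta}(x) = Pr(G = k | X = x), used only to
# show how the log-likelihood / cross-entropy L(theta) is computed.
import numpy as np

rng = np.random.default_rng(5)
N, p, K = 6, 2, 3
X = rng.normal(size=(N, p))                      # inputs x_i
g = rng.integers(0, K, size=N)                   # observed class labels g_i in {0, ..., K-1}
Theta = rng.normal(size=(p, K))                  # parameters (one column per class)

def class_probs(X, Theta):
    """Softmax conditional class probabilities p_{k,theta}(x)."""
    scores = X @ Theta
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

P = class_probs(X, Theta)
L = np.sum(np.log(P[np.arange(N), g]))           # log-likelihood of the observed classes
print("cross-entropy log-likelihood L(theta) =", round(float(L), 3))
```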