Two Simple Approaches to Prediction
This section develops two simple but powerful prediction methods: the linear model fit by least squares and the $k$-nearest-neighbor prediction rule.

The linear model makes strong assumptions about structure and yields stable but possibly inaccurate predictions. The $k$-nearest-neighbor rule makes very mild structural assumptions: its predictions are often accurate but can be unstable.
Linear Models and Least Squares
The linear model has been a mainstay of statistics for over 30 years, and remains one of our most important tools.
Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output $Y$ via the model

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j.$$

The term $\hat{\beta}_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in $X$, include $\hat{\beta}_0$ in the vector of coefficients $\hat{\beta}$, and rewrite the model as an inner product

$$\hat{Y} = X^T \hat{\beta},$$

where $X^T$ denotes vector or matrix transpose. Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a $K$-vector, in which case $\hat{\beta}$ would be a $p \times K$ matrix of coefficients. In the $(p+1)$-dimensional input-output space, $(X, \hat{Y})$ represents a hyperplane.
How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$$\operatorname{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2.$$

$\operatorname{RSS}(\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. We can rewrite this formula in matrix notation as

$$\operatorname{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta),$$

where $\mathbf{X}$ is an $N \times p$ matrix with each row an input vector, and $\mathbf{y}$ is the $N$-vector of outputs. Differentiating w.r.t. $\beta$ we get the normal equations

$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0.$$

If $\mathbf{X}^T \mathbf{X}$ is nonsingular, then the unique solution is given by¹

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$

and the fitted value at the $i$th input $x_i$ is $\hat{y}_i = x_i^T \hat{\beta}$.
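To make the recipe concrete, here is a minimal sketch in Python/NumPy (the toy data and variable names are illustrative, not from the text): it absorbs the intercept into $\hat{\beta}$ by appending a column of ones and then solves the normal equations directly.

```python
import numpy as np

# Hypothetical toy training data: N = 5 observations, p = 2 raw inputs.
X_raw = np.array([[0.5, 1.2],
                  [1.0, 0.7],
                  [1.5, 2.1],
                  [2.0, 1.8],
                  [2.5, 3.0]])
y = np.array([1.1, 1.4, 2.9, 3.1, 4.2])

# Include the constant variable 1 in X so the intercept is absorbed into beta.
X = np.column_stack([np.ones(len(y)), X_raw])

# Solve the normal equations X^T X beta = X^T y.
# Solving the system is preferred over forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values: y_hat_i = x_i^T beta_hat.
y_hat = X @ beta_hat
print(beta_hat, y_hat)
```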
Nearest-Neighbor Methods
The $k$-nearest-neighbor fit for $\hat{Y}$ is defined as follows:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i,$$

where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample. In words, we find the $k$ observations with $x_i$ closest to $x$ in input space, and average their responses.
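The rule is short enough to sketch directly; the following Python snippet (with illustrative toy data and Euclidean distance assumed as the metric) finds the $k$ closest training points and averages their responses.

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k=3):
    """k-nearest-neighbor fit: average the responses of the k training
    points closest to x0 in Euclidean distance."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return y_train[nearest].mean()                # average their responses

# Hypothetical toy data for illustration.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([0.0, 1.0, 1.0, 2.0])
print(knn_predict(np.array([0.9, 0.1]), X_train, y_train, k=2))
```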
From Least Squares to Nearest Neighbors
The linear decision boundary from least squares is smooth and stable to fit; in other words, it has low variance. But it relies heavily on the assumption that a linear boundary is appropriate, so it may suffer from bias.

On the other hand, the $k$-nearest-neighbor rule does not rely on stringent structural assumptions, so it has low bias. But any part of its decision boundary depends on only a handful of training points, which makes it unstable: high variance.
Each method has its own situations for which it works best. Linear regression is more suitable for Scenario 1², and the $k$-nearest-neighbor rule is more suitable for Scenario 2³.
A large subset of the most popular techniques in use today are variants of these two simple procedures. The following list describes some ways in which these simple procedures have been enhanced:
- Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by $k$-nearest neighbors (see the sketch after this list).
- In high-dimensional spaces the distance kernels are modified to emphasize some variables rather than others.
- Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
- Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.
- Projection pursuit and neural network models consist of sums of non-linearly transformed linear models.
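As one illustration of the first point above, here is a minimal sketch of a kernel-weighted average that replaces the 0/1 neighborhood weights with weights that decay smoothly with distance; the Gaussian kernel and the bandwidth value are illustrative choices, not specified in the text.

```python
import numpy as np

def kernel_predict(x0, X_train, y_train, bandwidth=1.0):
    """Kernel-weighted average: every training point contributes, with a
    weight that decreases smoothly to zero with its distance from x0.
    The Gaussian kernel and bandwidth here are assumptions for illustration."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    weights = np.exp(-0.5 * (dists / bandwidth) ** 2)  # smooth, not 0/1
    return np.sum(weights * y_train) / np.sum(weights)

# Reusing the toy data from the k-NN sketch above.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_train = np.array([0.0, 1.0, 1.0, 2.0])
print(kernel_predict(np.array([0.9, 0.1]), X_train, y_train, bandwidth=0.5))
```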
1. Setting the derivative $\partial \operatorname{RSS}(\beta) / \partial \beta = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$ and solving for $\beta$ gives the solution; see a matrix-calculus reference for the differentiation rules. ↩
2. Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means. ↩
3. Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian. ↩