Local Methods in High Dimensions
As mentioned in the sections above, the larger the training sample, the closer the nearest-neighbor average gets to the theoretical conditional expectation.
However, this approach breaks down in high dimensions, a phenomenon commonly referred to as the curse of dimensionality.
First consider the nearest-neighbor procedure for inputs uniformly distributed in a $p$-dimensional unit hypercube. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction $r$ of the observations. The expected edge length of that neighborhood is $$e_p(r) = r^{1/p},$$ and we find, for example, that $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$. That means if $r$ is held constant and the dimension grows, the required edge length approaches 1: the data space is very sparse. To cover just 1% of the data in 10 dimensions we already need an edge of 0.63 of the full range, so such a neighborhood is no longer "local" in any meaningful sense.
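As a quick numerical check, here is a minimal Python sketch of the edge-length formula (the helper name `edge_length` is my own, not from the text):

```python
# Edge length e_p(r) = r**(1/p) of the hypercube needed to capture a
# fraction r of points uniformly distributed in the p-dimensional unit cube.
def edge_length(r, p):
    return r ** (1.0 / p)

for p in (1, 2, 3, 10):
    print(f"p={p:2d}  e_p(0.01)={edge_length(0.01, p):.2f}  "
          f"e_p(0.10)={edge_length(0.10, p):.2f}")
# In 10 dimensions: e_10(0.01) ~ 0.63 and e_10(0.10) ~ 0.80.
```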
The other consequence of sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider $N$ data points uniformly distributed in a $p$-dimensional unit ball centered at the origin. The median distance from the origin to the closest data point is $$d(p, N) = \left(1 - \tfrac{1}{2}^{1/N}\right)^{1/p}.$$ For $N = 500$ and $p = 10$, $d(p, N) \approx 0.52$, more than halfway to the boundary. The nearest-neighbor method predicts a point from its neighbors, but when the closest training point is this far away it hardly deserves to be called a "neighbor".
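To make the median-distance formula concrete, here is an illustrative Python sketch (the helper name and the Monte Carlo setup are my own); it evaluates $d(p, N)$ and checks it by simulation, sampling uniformly in the ball as a random direction times a radius $U^{1/p}$:

```python
import numpy as np

# Median distance from the origin to the closest of N points drawn
# uniformly from the p-dimensional unit ball:
#     d(p, N) = (1 - 0.5**(1/N)) ** (1/p)
def median_closest_distance(p, N):
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(median_closest_distance(10, 500))   # ~0.52, more than halfway out

# Monte Carlo check: uniform points in the ball are a uniform direction
# (a normalized Gaussian vector) scaled by a radius U**(1/p).
rng = np.random.default_rng(0)
p, N, reps = 10, 500, 500
dirs = rng.standard_normal((reps, N, p))
dirs /= np.linalg.norm(dirs, axis=2, keepdims=True)
radii = rng.uniform(size=(reps, N, 1)) ** (1.0 / p)
closest = np.linalg.norm(dirs * radii, axis=2).min(axis=1)
print(np.median(closest))                 # should land near 0.52
```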
Another way to see the problem is that the sampling density is proportional to $N^{1/p}$. If the dimension is $p = 1$ and the sample size is $N_1 = 100$, the density is $100$; with $p = 10$ the same $N = 100$ gives a density of only $100^{1/10} \approx 1.6$. To keep the density equal to $100$, the sample size we need in 10 dimensions is $N_{10} = 100^{10}$. In $p$-dimensional space, we need sample size $N_p = N_1^{\,p}$.
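A tiny Python sketch of this arithmetic (illustrative only):

```python
# Sample size needed in p dimensions to match the density of N1 = 100 points
# in one dimension: density ~ N**(1/p), so N_p = N1**p.
N1 = 100
for p in (1, 2, 3, 10):
    print(f"p={p:2d}  density of 100 points = {100 ** (1.0 / p):6.2f}  "
          f"required N for density 100 = {float(N1 ** p):.3g}")
# p = 10 already requires 100**10 = 1e20 observations.
```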
Let's assume we have $N = 1000$ training examples $x_i$ generated uniformly on $[-1, 1]^p$. Assume the true function is $$f(X) = e^{-8\|X\|^2},$$ without any measurement error. We use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$, and denote the training set by $\mathcal{T}$. We can compute the expected (mean squared) error at $x_0$: $$\operatorname{MSE}(x_0) = E_{\mathcal{T}}\bigl[f(x_0) - \hat{y}_0\bigr]^2 = E_{\mathcal{T}}\bigl[\hat{y}_0 - E_{\mathcal{T}}(\hat{y}_0)\bigr]^2 + \bigl[E_{\mathcal{T}}(\hat{y}_0) - f(x_0)\bigr]^2 = \operatorname{Var}_{\mathcal{T}}(\hat{y}_0) + \operatorname{Bias}^2(\hat{y}_0).$$ Since the expectation of the cross term is 0, the decomposition above holds. Unless the nearest neighbor is exactly at 0, $\hat{y}_0$ will be smaller than $f(0) = 1$, so the estimate is biased downward. As the dimension grows, the distance to the nearest neighbor becomes much greater and the estimate tends toward 0 more often, so the bias dominates.
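The following Python sketch simulates this bias-variance behavior (the function name and the choice of dimensions to print are my own, not from the text):

```python
import numpy as np

# Bias-variance of the 1-nearest-neighbor estimate of f(x0) at x0 = 0 for
# f(X) = exp(-8 * ||X||^2), with N = 1000 training points uniform on [-1, 1]^p.
def one_nn_mse_at_origin(p, N=1000, reps=200, seed=0):
    rng = np.random.default_rng(seed)
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1.0, 1.0, size=(N, p))
        dist2 = np.sum(X**2, axis=1)            # squared distances to x0 = 0
        nearest = X[np.argmin(dist2)]
        preds[r] = np.exp(-8.0 * np.sum(nearest**2))
    f0 = 1.0                                     # f(0) = exp(0) = 1
    bias_sq = (preds.mean() - f0) ** 2
    var = preds.var()
    return bias_sq, var, bias_sq + var

for p in (1, 2, 5, 10):
    b2, v, mse = one_nn_mse_at_origin(p)
    print(f"p={p:2d}  bias^2={b2:.3f}  var={v:.3f}  mse={mse:.3f}")
# As p grows the nearest neighbor drifts away from 0, the prediction
# collapses toward 0, and the squared bias dominates the MSE.
```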
Suppose, on the other hand, that the relationship between $X$ and $Y$ is linear, $$Y = X^T\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$ and we fit this model by least squares to the training data.
For an arbitrary test point $x_0$, the fit gives $\hat{y}_0 = x_0^T\hat{\beta}$. Therefore: $$\hat{y}_0 = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\,\varepsilon_i,$$
where $\ell_i(x_0)$ is the $i$th element of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} x_0$.
Since under this model the least squares estimate is unbiased, we find that: $$\begin{aligned} \operatorname{EPE}(x_0) &= E_{y_0|x_0} E_{\mathcal{T}}(y_0 - \hat{y}_0)^2 \\ &= \operatorname{Var}(y_0|x_0) + E_{\mathcal{T}}\bigl[\hat{y}_0 - E_{\mathcal{T}}\hat{y}_0\bigr]^2 + \bigl[E_{\mathcal{T}}\hat{y}_0 - x_0^T\beta\bigr]^2 \\ &= \operatorname{Var}(y_0|x_0) + \operatorname{Var}_{\mathcal{T}}(\hat{y}_0) + \operatorname{Bias}^2(\hat{y}_0) \\ &= \sigma^2 + E_{\mathcal{T}}\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1} x_0\,\sigma^2 + 0^2. \end{aligned}$$
Here we have incurred an additional variance $\sigma^2$ in the prediction error, since our target is not deterministic. There is no bias, and the variance depends on $x_0$. If $N$ is large and $\mathcal{T}$ were selected at random, and assuming $E(X) = 0$, then $\mathbf{X}^T\mathbf{X} \to N\operatorname{Cov}(X)$ and $$E_{x_0}\operatorname{EPE}(x_0) \approx E_{x_0}\, x_0^T \operatorname{Cov}(X)^{-1} x_0\,\sigma^2/N + \sigma^2 = \operatorname{trace}\bigl[\operatorname{Cov}(X)^{-1}\operatorname{Cov}(x_0)\bigr]\sigma^2/N + \sigma^2 = \sigma^2 (p/N) + \sigma^2.$$ Here we see that the expected EPE increases linearly as a function of $p$, with slope $\sigma^2/N$. If $N$ is large and/or $\sigma^2$ is small, this growth in variance is negligible (0 in the deterministic case). By imposing some heavy restrictions on the class of models being fitted, we have avoided the curse of dimensionality.
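To see the linear growth in $p$ numerically, here is an illustrative Python sketch under the stated assumptions ($E(X) = 0$; I additionally take $\operatorname{Cov}(X) = I$ and $\beta = \mathbf{1}$ for concreteness; the function name is my own):

```python
import numpy as np

# Simulated expected prediction error for least squares when the truth is
# linear, Y = X^T beta + eps, eps ~ N(0, sigma^2).
# Theory for large N: E[EPE] ~ sigma^2 * p / N + sigma^2.
def simulate_epe(p, N=500, sigma=1.0, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.ones(p)
    errs = []
    for _ in range(reps):
        X = rng.standard_normal((N, p))                  # E(X) = 0, Cov(X) = I
        y = X @ beta + sigma * rng.standard_normal(N)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        x0 = rng.standard_normal(p)                      # random test point
        y0 = x0 @ beta + sigma * rng.standard_normal()
        errs.append((y0 - x0 @ beta_hat) ** 2)
    return np.mean(errs)

N, sigma = 500, 1.0
for p in (1, 5, 10, 50):
    print(f"p={p:3d}  simulated EPE={simulate_epe(p, N, sigma):.3f}  "
          f"theory={sigma**2 * p / N + sigma**2:.3f}")
# Expected EPE grows roughly linearly in p with slope sigma^2 / N, no bias term.
```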