A different curse of dimensionality

High-dimensional space is not only empty, it is also really painful to think about!

I went through three years of my CS studies without ever forming an actual concept of the curse of dimensionality. I might have heard of it here or there, but it never stayed in my mind for long. It wasn’t until I did the machine learning course, taught by my current boss and eventually even TA’d by me, that I had a grasp of what the curse of dimensionality might really mean. The good part is, the best way to deal with it is to simply ignore that it exists and happily continue sciencing! A few semesters later I did a course on data mining, where people seemed to spend a lot more time worrying about high-dimensional data and the problems that come with it. At this point it became clear to me that this really must be an issue. Then finally, after finishing my masters and doing about two years of my PhD studies, I went to the PhD defense of a student from the data mining chair. Here it hit me! When these people talked about “high-dimensional data” they meant 8D and above. I was highly amused, since my machine, while I was sitting in the talk, was computing a k-means clustering in 128D, something that seemed to work without any issues.

Over time I developed a slightly more subtle feeling for the curse of dimensionality; I guess it is more about how empty space becomes and how weirdly certain things behave. However, I never gave it too much thought because, to be fair, my clustering was still working.

Fast-forward to now: my colleague and I are working on an exciting new project and we had to discuss points on the surface of a hypersphere in n-dimensional space. He kept insisting that my intuitions about 3D spheres do not generalize to n-dimensional hyperspheres. Oh boy, he was right! Initially his comments didn’t really change my thinking, not because I didn’t believe him (he is usually right), but because I just couldn’t imagine what he was talking about. He suggested reading this post, which to be honest didn’t do a lot for me; I would rather recommend this video. I know it starts really simple, but if you stick with it you might really get a better feeling for hyperspheres! After all this reading, thinking and discussing, the curse of dimensionality became something far bigger than a theoretical problem. To me, the fact that it is super hard to think in n-dimensional space is an additional, possibly even worse, curse of dimensionality!

Let’s consider a little example. Two days ago I had to reason about distances in n-dimensional space. Imagine a unit hypersphere with many points on its surface. My task is to pick a point on the surface and push every other point on the surface further away from this point than a given margin r. In other words, you place another hypersphere with radius r centered at the chosen point, and you push any point inside it out until it lies outside of this hypersphere. Here is an awesome illustration I made for this problem in 2D. (Why did I ever give my drawing tablet to my father?)

[Figure 2d_spheres: a unit circle with points on its surface and a smaller circle of radius r drawn around the chosen point]

The green points are fine, but the red points within the smaller circle have to be pushed out. “Why are you doing this?” you ask. None of your business! I’m just telling you how messed up this whole thing is. Our initial intuition was that with a margin of 1 you have to push points super far over the surface of the sphere. If you consider the illustration above and set r to 1, you will indeed see that you have to push them quite a bit over the surface of the unit circle. But how far? Does this change with n dimensions?
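As an aside, the push itself is easy to write down. Here is a minimal sketch (my formulation, not something from our discussion; it assumes the margin r is measured as straight-line chord distance, that r <= 2, and that the point is not exactly at or opposite the chosen center):

import numpy as np

def push_out(c, p, r):
    # move p along the unit sphere's surface, away from c, until its
    # straight-line distance to c is exactly r (assumes unit vectors
    # c and p with p != +-c, and r <= 2)
    theta = 2 * np.arcsin(r / 2)   # chord length r corresponds to angle theta
    u = p - np.dot(p, c) * c       # component of p orthogonal to c
    u /= np.linalg.norm(u)
    # walk along the great circle through c and p to angle theta from c
    return np.cos(theta) * c + np.sin(theta) * u

Interestingly, the angle 2*arcsin(r/2) does not depend on the dimension at all. What does change with n is how many points end up inside the small hypersphere in the first place, and that is exactly what the rest of this post is about.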

Before I tell you how I found my answer: eventually I had to agree with my colleague, my hypersphere intuition was garbage. However, as it turns out, his was as well. I thought about this for a long time and figured that the part of the unit hypersphere’s surface inside the other hypersphere becomes smaller and smaller with growing dimensions (thinking always is better on the toilet!). BUT, I had no way to be sure. I googled how to compute the hypersphere intersection, but at the time couldn’t find anything obvious. So what can one do? My idea: put many points uniformly on the surface, thus approximating the surface of the sphere, and then check which percentage of the points lie within the other hypersphere. Should be straightforward, but then you get to the point where you want to evenly sample points on an n-dimensional hypersphere. Not trivial, but also not impossible! Naively one could just sample points between -1 and 1 in each dimension and normalize the n-dimensional vector. Sadly this is wrong: in high-dimensional space most space tends to sit in the corners of a hypercube, so you would sample more points there, and through the normalization you would project “blobs of points” onto your sphere. Shit… But there is an obvious solution. You just reject every sample that is outside of the unit hypersphere and then normalize. Easy!

import numpy as np

def sample_nd_sphere(n, size):
    # rejection sampling: draw uniformly from the cube [-1, 1]^n,
    # keep only draws inside the unit ball, then project them onto
    # the sphere's surface by normalizing
    samples = np.zeros((size, n))
    i = 0
    while i < size:
        sample = np.random.uniform(-1, 1, size=n)
        sample_norm = np.linalg.norm(sample)
        if sample_norm <= 1:
            samples[i] = sample / sample_norm
            i += 1
    return samples

While this is correct, the curse of dimensionality strikes again, and this time it strikes really hard. Getting 100 samples in 3 dimensions? Easy, 8 ms. In 12? Borderline, but okay within 2 seconds. In 13 dimensions this code already took more than 5 seconds, and I was aiming for 128 dimensions and far more than 100 samples. Urgh…!
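The reason is easy to quantify: the acceptance rate is the volume of the unit ball divided by the volume of the enclosing cube, and this ratio collapses with growing n. A quick back-of-the-envelope check (my addition, using the standard formula for the volume of the unit n-ball):

import math

def acceptance_rate(n):
    # volume of the unit n-ball divided by the volume of the cube [-1, 1]^n
    ball = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    return ball / 2.0 ** n

for n in (3, 13, 20, 128):
    print(n, acceptance_rate(n))
# roughly: 3 -> 0.52, 13 -> 1.1e-4, 20 -> 2.5e-8, 128 -> 1.5e-96

So at 13 dimensions only about one in nine thousand draws survives, and at 128 dimensions you would essentially never see an accepted sample.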

So what does Wolfram Alpha say, someone must have done this, right!? Apparently even in 3D it’s easy to mess up! But hey, in 1972 someone developed a “simple” way to pick points in 4D. Luckily Wolfram Alpha also discusses a way to easily sample in n dimensions. It is actually embarrassingly simple: you just sample every dimension from a normal distribution and normalize the vector to the desired radius. This doesn’t feel obvious to me, but the key property is that the multivariate standard normal distribution is rotationally symmetric (its density depends only on the norm of the vector), so after normalization every direction on the sphere is equally likely. So we get this:

import numpy as np

def sample_nd_sphere(n, size):
    # every coordinate is standard normal, so the joint distribution is
    # rotationally symmetric; normalizing each row therefore gives
    # uniformly distributed points on the unit sphere's surface
    samples = np.random.normal(size=(size, n))
    norms = np.linalg.norm(samples, axis=1)
    return samples / norms[:, None]

Much faster! A million 128-dimensional samples in just over 6 seconds.
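If you want to reproduce the timing (the exact numbers will of course depend on your machine), something along these lines works:

import time

start = time.perf_counter()
points = sample_nd_sphere(128, 1_000_000)
print(f"{time.perf_counter() - start:.1f} s")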

So now the intersection part is easy:

import numpy as np

def percentage_inside_nd_sphere_range(radius, samples, dims):
    res = []
    for n in dims:
        points = sample_nd_sphere(n, samples)
        inside = 0
        total = 0
        # place a hypersphere only on every tenth point to speed things up
        for j in range(0, samples, 10):
            dists = np.linalg.norm(points - points[j], axis=1)
            inside += np.count_nonzero(dists < radius) - 1  # exclude the point itself
            total += samples - 1
        res.append(100.0 * inside / total)
    return res

I take a point, place a hypersphere with radius r around it, count the points within that hypersphere, and finally average the statistics over all tested points. To generate my data I used 2000 points and only placed a hypersphere on every tenth point to speed up the computations a little (that is what the step of 10 in the loop above is for). If you run this for different radii and for different dimensionalities, you get a pretty cool plot!
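The plot below came from a sweep along these lines (a sketch on my part; apart from r = 1 and r = sqrt(2), the exact radii are a guess):

import numpy as np

dims = [1, 2, 4, 8, 16, 32, 64, 128]
radii = [0.5, 1.0, np.sqrt(2), 1.8]  # radii are my guess, apart from 1 and sqrt(2)
curves = {r: percentage_inside_nd_sphere_range(r, 2000, dims) for r in radii}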

[Figure sphere intersection.png: percentage of points inside the intersecting hypersphere versus the log of the dimensionality, one line per radius]

On the vertical axis you see the percentage of points in the intersection, which in this case is a proxy for the percentage of the surface inside the hypersphere, and on the horizontal axis the log of the dimensionality (up to 128). Every line represents a specific radius of the hypersphere and thus, in my case, the margin I want to achieve. As can be seen, a “1D sphere” behaves a bit oddly, but it is not really a sphere anyway. If you look at r=1, you can indeed see that with increasing dimensionality the surface inside the hypersphere intersection becomes very small. In fact, many radii show this behavior, until there is a certain tipping point where exactly the opposite is true: for larger radii the hypersphere will soon almost enclose the complete unit hypersphere!
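As it turns out, there is a closed form for all of this after all: the piece of surface inside the second hypersphere is a hyperspherical cap, and its area fraction can be written with the regularized incomplete beta function. A minimal sketch, assuming scipy is available (this is not the code the plot was made with):

import numpy as np
from scipy.special import betainc

def cap_fraction(r, n):
    # exact fraction of the unit sphere's surface in R^n that lies
    # within straight-line distance r of a fixed surface point
    phi = 2 * np.arcsin(r / 2)  # chord length r corresponds to cap angle phi
    frac = 0.5 * betainc((n - 1) / 2, 0.5, np.sin(phi) ** 2)
    return frac if phi <= np.pi / 2 else 1 - frac

print(cap_fraction(np.sqrt(2), 128))  # exactly 0.5, in any dimension
print(cap_fraction(1.0, 128))         # practically zero

It agrees with the Monte Carlo estimates, and it makes the tipping point exact: at r = sqrt(2) the cap angle is 90 degrees, which cuts off exactly half of the sphere in every dimension.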

My colleague’s initial intuition was the opposite with respect to small radii and increasing dimensionality. However, he does get credit for pointing out the tipping point. For some reason it was completely obvious to him that this happens at r=sqrt(2). The plot confirms it: even though the line is not perfectly straight due to sampling, it does in fact stick very well to 50%. He didn’t want to discuss the details, and to be honest, at this point I don’t feel like proving it properly either. The plot clearly agrees with him on this matter! He also thinks that using bright colors on a black background makes me look like a script kiddie x)
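Still, a quick numerical check hints at why sqrt(2) is special: two random points on a high-dimensional unit sphere are almost always nearly orthogonal, and two orthogonal unit vectors are exactly sqrt(2) apart. With the sampler from above:

import numpy as np

pts = sample_nd_sphere(128, 1000)
d = np.linalg.norm(pts[:500] - pts[500:], axis=1)  # 500 random pairs
print(d.mean(), d.std())  # mean close to sqrt(2) ~ 1.414, with a small spread

(If you do want the one-line argument: for unit vectors x and y the squared distance is 2 - 2*dot(x, y), so a point is within sqrt(2) exactly when the inner product is positive, which by symmetry is true for half of the sphere.)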

So what have I learned? The curse of dimensionality now has one more interpretation for me. I get serious headaches when thinking about this for too long, my colleague has a bad taste in colors, and writing a blog post about a trivial thing like this takes way longer than expected, albeit not as long as it took my colleague and me to discuss hyperspheres in high-dimensional space in the past week. A rather spiky and smooth issue, as it turns out!
