Approximating Correlation

There are several ways to approximate correlation. In this post I outline how kernel approximation can be used for this, and how it relates to correlation measures.

Rough outline:

Realise that cosine similarity is the same as correlation when the vectors are centered
Use a kernel approximation method (Nystroem)

Cosine Similarity

The link to cosine similarity is best described in this post.

The important aspect is that

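$$ \rho_{xy} = \frac{1}{n} \sum_{i=1}^{n} z_{x,i} z_{y,i} $$

where $z_{x,i}$ is the z-score of the $i$th observation of variable $x$ (and likewise $z_{y,i}$ for $y$). Cosine similarity, $\frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$, reduces to exactly this expression when $x$ and $y$ are centered, because then $\lVert x \rVert = \sqrt{n}\,\sigma_x$ and $\lVert y \rVert = \sqrt{n}\,\sigma_y$. So, assuming the vectors were “pre-centered”, cosine similarity and correlation give the same result.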
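The equivalence can be checked numerically. The following is a minimal sketch on synthetic data; the helper names `cosine_similarity` and `zscore` and the random vectors are assumptions made for illustration, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)  # synthetic, loosely correlated with x


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


def zscore(a):
    """Z-score using the population standard deviation (ddof=0)."""
    return (a - a.mean()) / a.std()


# Cosine similarity of the centered vectors
cos_centered = cosine_similarity(x - x.mean(), y - y.mean())

# Mean of the product of z-scores, as in the formula above
rho_z = np.mean(zscore(x) * zscore(y))

# Reference value from NumPy
rho_np = np.corrcoef(x, y)[0, 1]

print(cos_centered, rho_z, rho_np)
```

All three printed values should agree up to floating-point error. Note that the $\frac{1}{n}$ form of the formula goes with the population standard deviation, which is why `zscore` uses NumPy's default `ddof=0`.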

To approximate it quickly, one approach is:

from sklearn import preprocessing
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import pairwise_kernels
import numpy as np

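# X_train is assumed to be an (n_samples, n_features) array defined elsewhere
X_scaled = preprocessing.scale(X_train)  # z-score each column
X_approx_cor = pairwise_kernels(X_scaled.T, metric='cosine')  # cosine similarity between columns
np.corrcoef(X_train.T)  # matches X_approx_cor up to floating-point error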
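The snippet above imports `Nystroem` but still computes the full cosine kernel. As a rough sketch of where the approximation could come in, one option is to fit `Nystroem` with `kernel="cosine"` on the transposed, scaled data and rebuild the matrix from the resulting feature map. The function name `approx_correlation`, the synthetic `X_demo`, and the `n_components` values are assumptions for illustration, and this is not necessarily how the original post goes on to use it.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.kernel_approximation import Nystroem


def approx_correlation(X, n_components, random_state=0):
    """Low-rank Nystroem approximation of the column correlation matrix of X.

    Hypothetical helper: X has shape (n_samples, n_features).
    """
    X_scaled = preprocessing.scale(X)  # z-score each column
    feature_map = Nystroem(
        kernel="cosine", n_components=n_components, random_state=random_state
    )
    # Each row passed to Nystroem is one variable (a column of X)
    Z = feature_map.fit_transform(X_scaled.T)  # shape (n_features, n_components)
    return Z @ Z.T  # approximates the cosine kernel between columns


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))  # synthetic stand-in for X_train

exact = np.corrcoef(X_demo.T)
full = approx_correlation(X_demo, n_components=8)      # as many components as variables
low_rank = approx_correlation(X_demo, n_components=4)  # cheaper, lossy

print(np.abs(full - exact).max())
print(np.abs(low_rank - exact).max())
```

With `n_components` equal to the number of variables, the reconstruction should match `np.corrcoef` up to numerical error. Fewer components trade accuracy for speed and memory, which only pays off when there are many variables.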