There are, of course, several ways to approximate correlation. In this post I outline how kernel approximation can be used for this and how it relates to correlation measures.

Rough outline:

  • Realise that cosine similarity is the same as correlation when the vectors are centered
  • Use a kernel approximation method (Nystroem)

Cosine Similarity

The link to cosine similarity is best described in this post.

The important aspect is that

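$$ \rho_{xy} = \frac{1}{n} \sum_{i=1}^{n} z_{x,i} z_{y,i}$$

where \(z_{x,i}\) and \(z_{y,i}\) are the z-scores of the \(i\)th observation of \(x\) and \(y\), computed with the population standard deviation. Cosine similarity divides the dot product of two vectors by the product of their norms, so if the vectors are centered first it reduces to exactly this quantity: for “pre-centered” vectors, cosine similarity and correlation are equivalent.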
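As a quick sanity check of the identity above, the following sketch compares three quantities on synthetic data: the mean product of z-scores, the cosine similarity of the centered vectors, and NumPy's Pearson correlation. The vectors `x` and `y` and the random seed are made up for illustration only.

```python
import numpy as np

# Synthetic data, purely illustrative
rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)

# z-scores with the population standard deviation (ddof=0), matching the 1/n formula
z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()
rho_from_z = np.mean(z_x * z_y)

# Cosine similarity of the centered vectors
xc, yc = x - x.mean(), y - y.mean()
cosine_centered = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Reference value from NumPy
pearson = np.corrcoef(x, y)[0, 1]

print(rho_from_z, cosine_centered, pearson)
```

All three printed values should agree up to floating-point error.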

To approximate it quickly, the approach could be:

from sklearn import preprocessing
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import pairwise_kernels
import numpy as np

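# X_train is assumed to be an array of shape (n_samples, n_features)
X_scaled = preprocessing.scale(X_train)  # center and scale each column
# cosine similarity between the columns (features) of the scaled data
X_approx_cor = pairwise_kernels(X_scaled.T, metric='cosine')
np.corrcoef(X_train.T)  # matches X_approx_cor up to floating-point error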
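The snippet above computes the cosine kernel exactly. One possible way to bring in the Nystroem step from the outline is sketched below; it is a minimal illustration rather than a definitive recipe. It assumes that `Nystroem` is fitted on the transposed, scaled data so that each feature is treated as a sample, and that the product of the resulting feature map with its own transpose is used as the approximate correlation matrix. The synthetic `X_train`, the choice of `n_components=10`, and the names `Z`, `exact_cor` and `max_abs_error` are assumptions made for the example.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.kernel_approximation import Nystroem

# Synthetic stand-in for X_train (an assumption): 200 samples, 30 features
# driven by 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
X_train = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(200, 30))

X_scaled = preprocessing.scale(X_train)

# Transpose so each feature becomes a row; the kernel is then feature-by-feature.
# Only n_components landmark features are compared against all the others.
nystroem = Nystroem(kernel="cosine", n_components=10, random_state=0)
Z = nystroem.fit_transform(X_scaled.T)  # shape: (n_features, n_components)

# Z @ Z.T is the low-rank Nystroem approximation of the cosine kernel matrix,
# which for centered columns corresponds to the correlation matrix.
X_approx_cor = Z @ Z.T

# Compare against the exact correlation matrix
exact_cor = np.corrcoef(X_train.T)
max_abs_error = np.abs(X_approx_cor - exact_cor).max()
print(max_abs_error)
```

The quality of the approximation depends on how many landmark features are sampled: with `n_components` equal to the number of features the kernel matrix is recovered up to numerical error, and smaller values trade accuracy for speed.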