Total variance explained > 1 #7

Open
tsgouvea opened this issue Nov 5, 2018 · 3 comments

Comments

@tsgouvea

tsgouvea commented Nov 5, 2018

Forgive me if this is a known counterintuitive point that has been deemed irrelevant, but I noticed that the total variance explained by all components is greater than one. That's true in my dataset with missing values, but also in the complete example below.

import numpy as np
from ppca import PPCA

x = np.random.randn(50, 20)  # 50 samples, 20 dimensions, no missing values
m = PPCA()
m.fit(data=x)

print(m.var_exp)
[0.11460246 0.21691676 0.30977113 0.40169889 0.4885789  0.56032857
 0.62697946 0.68458968 0.73693932 0.78439966 0.82519526 0.86416853
 0.89399395 0.9215888  0.94654215 0.9696493  0.98877802 1.00211534
 1.01186025 1.02040816]

It seems to be related to the fact that the sum of all eigenvalues is greater than the number of dimensions in the original dataset. Since the sum of the eigenvalues should equal the trace of the correlation matrix, which is exactly the number of dimensions, I would not expect that to be the case.

print(np.cumsum(m.eig_vals)/20.)
[0.11460246 0.21691676 0.30977113 0.40169889 0.4885789  0.56032857
 0.62697946 0.68458968 0.73693932 0.78439966 0.82519526 0.86416853
 0.89399395 0.9215888  0.94654215 0.9696493  0.98877802 1.00211534
 1.01186025 1.02040816]

print(m.eig_vals.sum())
20.40816326530614
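For what it's worth, the overshoot is exactly a factor of N/(N - 1): with N = 50 samples and d = 20 dimensions, d * N / (N - 1) = 20 * 50 / 49 ≈ 20.408, which matches the eigenvalue sum above. A minimal check, independent of ppca:

import numpy as np

N, d = 50, 20
x = np.random.randn(N, d)

# The trace of the correlation matrix is always the number of dimensions.
print(np.trace(np.corrcoef(x.T)))  # 20.0

# The reported eigenvalue sum equals that trace inflated by N / (N - 1),
# which hints at a degrees-of-freedom mismatch somewhere in the calculation.
print(d * N / (N - 1))  # ≈ 20.4082, matching eig_vals.sum() above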
@burtonrj

burtonrj commented Aug 2, 2020

Correct me if I'm wrong, but is the above because var_exp is the variance of each component rather than the percentage of explained variance?

@ichitaka

ichitaka commented Aug 2, 2020

This is what I see in the README.md:

variance_explained = ppca.var_exp

That leads me to think that var_exp should be the variance explained.

@flcong

flcong commented Nov 27, 2023

Not really. According to the code, the reason is an inconsistent use of degrees of freedom between the covariance calculation (the numerator) and the variance calculation (the denominator).

When calculating the variance explained by each PC, np.cov is used with default parameters. Based on the documentation of np.cov, by default the divisor is the sample size less 1 (ddof=1).

vals, vecs = np.linalg.eig(np.cov(np.dot(data, C).T))

However, when calculating the total variance, np.nanvar is used with default parameters. Based on numpy's documentation, in this case the divisor is the sample size itself, i.e. no degrees-of-freedom adjustment (ddof=0).

pca-magic/ppca/_ppca.py

Lines 127 to 129 in 7eef1c0

var = np.nanvar(data, axis=1)
total_var = var.sum()
self.var_exp = self.eig_vals.cumsum() / total_var
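A quick way to see the two conventions side by side (a minimal sketch, independent of the library's code):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 20))  # 50 samples, 20 variables

# np.cov divides by N - 1 by default (ddof=1), so the trace of the
# covariance matrix is the sum of the *sample* variances.
cov_trace = np.cov(x.T).trace()

# np.nanvar (like np.var) divides by N by default (ddof=0).
var_sum = np.nanvar(x, axis=0).sum()

print(cov_trace / var_sum)  # 50 / 49 ≈ 1.0204, the excess seen above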

As a result, the denominator is slightly smaller than the convention used in the numerator implies, which is what pushes the cumulative explained-variance ratio above 1. A simple fix is to multiply var_exp by (N - 1)/N, where N is the sample size.
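For instance, a hypothetical, untested variant of the quoted lines that achieves the same correction by making the two divisors agree (passing ddof=1 to np.nanvar is equivalent to the (N - 1)/N rescaling when no values are missing):

var = np.nanvar(data, axis=1, ddof=1)  # match np.cov's N - 1 divisor
total_var = var.sum()
self.var_exp = self.eig_vals.cumsum() / total_var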
