Total variance explained > 1 #7

Open
tsgouvea opened this issue Nov 5, 2018 · 3 comments

Comments

@tsgouvea

tsgouvea commented Nov 5, 2018

Forgive me if this is a known counterintuitive point that has been deemed irrelevant, but I noticed that the total variance explained by all components is greater than one. That's true in my dataset with missing values, but also in the complete example below.

import numpy as np
from ppca import PPCA

x = np.random.randn(50, 20)  # 50 samples, 20 dimensions, no missing values
m = PPCA()
m.fit(data=x)

print(m.var_exp)
[0.11460246 0.21691676 0.30977113 0.40169889 0.4885789  0.56032857
 0.62697946 0.68458968 0.73693932 0.78439966 0.82519526 0.86416853
 0.89399395 0.9215888  0.94654215 0.9696493  0.98877802 1.00211534
 1.01186025 1.02040816]

It seems to be related to the fact that the sum of all eigenvalues is greater than the number of dimensions in the original dataset. Since the sum of the eigenvalues should equal the trace of the correlation matrix, which is exactly the number of dimensions, I would not expect that to be the case.

print(np.cumsum(m.eig_vals)/20.)
[0.11460246 0.21691676 0.30977113 0.40169889 0.4885789  0.56032857
 0.62697946 0.68458968 0.73693932 0.78439966 0.82519526 0.86416853
 0.89399395 0.9215888  0.94654215 0.9696493  0.98877802 1.00211534
 1.01186025 1.02040816]

print(m.eig_vals.sum())
20.40816326530614
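For what it's worth, the overshoot is exactly a factor of N/(N - 1): with N = 50 samples and d = 20 dimensions, d * N / (N - 1) = 20 * 50 / 49 ≈ 20.408, which matches the eigenvalue sum above. A minimal check, independent of ppca:

import numpy as np

N, d = 50, 20
x = np.random.randn(N, d)

# The trace of the correlation matrix is always the number of dimensions.
print(np.trace(np.corrcoef(x.T)))  # 20.0

# The reported eigenvalue sum equals that trace inflated by N / (N - 1),
# which hints at a degrees-of-freedom mismatch somewhere in the calculation.
print(d * N / (N - 1))  # ≈ 20.4082, matching eig_vals.sum() above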
@burtonrj

burtonrj commented Aug 2, 2020

Correct me if I'm wrong, but is the above because var_exp is the variance of each component rather than the percentage of explained variance?

@ichitaka

ichitaka commented Aug 2, 2020

This is what I see in the README.md:

variance_explained = ppca.var_exp

That leads me to think that var_exp should be the variance explained.

@flcong

flcong commented Nov 27, 2023

Not really. According to the code, the reason is an inconsistent use of degrees of freedom between the covariance calculation (the numerator) and the variance calculation (the denominator).

When calculating the variance explained by each PC, np.cov is used with default parameters. Based on the documentation of np.cov, by default the divisor is the sample size less 1 (ddof=1).

vals, vecs = np.linalg.eig(np.cov(np.dot(data, C).T))

However, when calculating the total variance, np.nanvar is used with default parameters. Based on numpy's documentation, in this case the divisor is the sample size itself, i.e. no degrees-of-freedom adjustment (ddof=0).

pca-magic/ppca/_ppca.py

Lines 127 to 129 in 7eef1c0

var = np.nanvar(data, axis=1)
total_var = var.sum()
self.var_exp = self.eig_vals.cumsum() / total_var
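A quick way to see the two conventions side by side (a minimal sketch, independent of the library's code):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 20))  # 50 samples, 20 variables

# np.cov divides by N - 1 by default (ddof=1), so the trace of the
# covariance matrix is the sum of the *sample* variances.
cov_trace = np.cov(x.T).trace()

# np.nanvar (like np.var) divides by N by default (ddof=0).
var_sum = np.nanvar(x, axis=0).sum()

print(cov_trace / var_sum)  # 50 / 49 ≈ 1.0204, the excess seen above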

As a result, the denominator is slightly smaller than the convention used in the numerator implies, which is what pushes the cumulative explained-variance ratio above 1. A simple fix is to multiply var_exp by (N - 1)/N, where N is the sample size.
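For instance, a hypothetical, untested variant of the quoted lines that achieves the same correction by making the two divisors agree (passing ddof=1 to np.nanvar is equivalent to the (N - 1)/N rescaling when no values are missing):

var = np.nanvar(data, axis=1, ddof=1)  # match np.cov's N - 1 divisor
total_var = var.sum()
self.var_exp = self.eig_vals.cumsum() / total_var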
