Feature request: handle user-defined time-varying universes (and better error checks with temporary nans in user-provided returns)
#137
Comments
@pcgm-team The first step in an issue report is always to make a minimal reproducible example, which means providing enough information that someone else can reproduce your problem. The second is to compare the expected behavior to the actual behavior in detail (i.e. if you think your 500x500 correlation matrix should have a 15-factor decomposition, state why, and what the answer should be). To address your question: N=15 factors is quite a lot. Really, nobody has any business decomposing a correlation matrix beyond perhaps N=3 factors for stock prices. You are probably on the edge of violating the linear algebra / numerical requirements and finally hit the issue: the correlation matrix has to be symmetric and positive semidefinite, have rank >= N, and therefore have N eigenvalues that are nonzero (ideally, not anywhere near zero when represented on a computer). For further questions about the linear algebra requirements or numerical stability of factor analysis, seek other sources. |
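The requirements listed in the comment above (symmetric, positive semidefinite, at least N eigenvalues clearly above zero) can be checked numerically before attempting a factor decomposition. The following is a minimal sketch using NumPy; the helper name `check_factor_requirements` and the tolerance `tol` are hypothetical and are not part of Cvxportfolio.

```python
import numpy as np


def check_factor_requirements(corr, num_factors, tol=1e-8):
    """Return a list of reasons why `corr` cannot support `num_factors` factors.

    Hypothetical helper for illustration; `tol` is an assumed threshold.
    """
    corr = np.asarray(corr, dtype=float)
    problems = []
    if not np.all(np.isfinite(corr)):
        # With NaN or inf entries the eigenvalues are meaningless, so stop here.
        problems.append("matrix contains NaN or infinite entries")
        return problems
    if not np.allclose(corr, corr.T, atol=tol):
        problems.append("matrix is not symmetric")
        return problems
    # eigvalsh returns real eigenvalues in ascending order for symmetric input.
    eigvals = np.linalg.eigvalsh(corr)
    if eigvals[0] < -tol:
        problems.append("matrix is not positive semidefinite")
    if np.sum(eigvals > tol) < num_factors:
        problems.append("fewer than num_factors eigenvalues are clearly positive")
    return problems


good = np.array([[1.0, 0.5], [0.5, 1.0]])   # eigenvalues 0.5 and 1.5
rank_one = np.array([[1.0, 1.0], [1.0, 1.0]])  # eigenvalues 0 and 2
with_nan = np.array([[1.0, np.nan], [np.nan, 1.0]])

print(check_factor_requirements(good, 2))
print(check_factor_requirements(rank_one, 2))
print(check_factor_requirements(with_nan, 1))
```

An empty list means none of the listed requirements is violated at the chosen tolerance; it does not by itself say that the requested number of factors is a sensible modelling choice.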
Thanks @joseortiz3 and yes @pcgm-team can you please provide more context; what were the inputs to |
I've been looking closer at the code and it appears that |
Mmmh, it's different; this code
works, giving just a CVXPY warning (complex are cast to real). Please then also post Cvxportfolio and CVXPY versions you are using... The CVXPY warning is like this
|
I believe the error actually has to do with the way that the backtester and forecaster handle stocks with NaN in the period. I've updated my cvxpy and cvxportfolio to the latest versions and now the error message has changed to (on same data/row):
I do have reason to believe this is an issue with the project itself, since NaN handling is stated as a core functionality of the project. The error persists when changing I'm currently unable to simply recreate (or publicly share the data) - I will continue trying. I've confirmed that the returns data itself has no columns (or rows) with all NaN. But all my function calls are as follows:
I've also confirmed that setting It runs fine until it hits a particular date, and prior to that date- there are NaN that are handled, which is what makes it hard for me to recreate. The full error call is here:
|
OK, thanks for providing more context. Here's what's going on at time of failure
What can you do:
What I'll do:
|
By digging into the code, I've found the source of the issue. It has to do with stocks having a NaN period after being non-NaN for a while. The issue arose in my data specifically whenever AMD was re-added to the SP500. My data has NaN whenever the symbol is not included in the SP500, and is non-NaN when the symbol is in the SP500. When I print the result of:
The output is:
So specifically, it's the stocks which had no history prior to when AMD goes to NaN in my data. Ie. AMD is while the other stocks are for example: |
Ok, that's actually very interesting, thanks for sharing your findings. So, it seems
import numpy as np
import pandas as pd
past_returns = pd.DataFrame([
[-0.01, np.nan],
[np.nan, 0.01],
[np.nan, -0.01],
[np.nan, 0.01],
[0.01, np.nan],
], columns=['AMD', 'AAPL'])
print(past_returns.cov())
Which prints
So, I guess this is not an issue with Cvxportfolio (although that is a breaking mode I hadn't anticipated, and should be much easier to diagnose than it was now), but with the data you're providing to it. It seems to me that you're using |
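To make the failure mode in the comment above concrete, here is a small self-contained sketch with the same toy data. It counts, for each pair of columns, the number of rows where both are observed, and shows that a pair with zero shared rows gets a NaN covariance. The names `observed` and `overlap` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same toy data as in the comment: the two columns never overlap in time.
past_returns = pd.DataFrame(
    [
        [-0.01, np.nan],
        [np.nan, 0.01],
        [np.nan, -0.01],
        [np.nan, 0.01],
        [0.01, np.nan],
    ],
    columns=["AMD", "AAPL"],
)

# pandas computes each covariance entry from the rows where both columns
# are observed; with no such rows the entry is NaN.
cov = past_returns.cov()

# Pairwise count of jointly observed rows (diagonal: per-column counts).
observed = past_returns.notna().astype(int)
overlap = observed.T @ observed

print(cov)
print(overlap)
```

The diagonal of `cov` is finite because each column has at least two observations of its own, while the off-diagonal entries are NaN because `overlap` is zero for that pair.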
It is true that the time-varying investable universe feature should solve the issue for my specific case (and improve the actual performance of the outputs for mine and many uses). However, presumably you would want to support cases when the data actually goes NaN for whatever reason and you don't have a detailed investable universe for each timestamp- for one example if a stock is delisted and relisted under the same name (and most retail users haven't paid for detailed information on investable universes or may not be restricting to something like sp500). There should be a way to add a check, throw the symbol out of consideration, and print a warning. For example a parameter could be added that is like What do you think? |
What should happen to missing values corresponding to lack of data for a stock in a certain time period isn't well-defined. In some cases, the value is missing because the stock was delisted and its shares are now worthless, or halted and its shares are frozen for hours or possibly years. In other cases (i.e. this one), the stock is missing from the index temporarily. In others, the stock identifier (e.g. ticker) changed because of a merger, buyout, etc. and generally there is an unrecorded cash payout or payment in shares of another company. Is it not sort of "mission creep" for cvxportfolio to handle one or more of these scenarios?
A minimal "fix" would be for cvxportfolio to set a default friendly behavior, which is simply setting missing prices to .ffill(), returns to zero, and volume to zero, and optionally artificially "selling off" that asset as cash. Such functionality should be optional I think, since some users will prefer that an error is thrown. The user, on the other hand, can simply remove that asset from the universe of data being fed to cvxportfolio, or replace the missing values how they see fit (e.g. returns = returns.fillna(0), price = price.ffill(), etc). Prompting the user "your data has missing values for {ticker}. Unclear what you want me to do with these - please fix on your end" is reasonable.
In general, asset data is very sparse in (asset identifier x date) space over long enough time periods, since stocks are being listed and delisted all the time. The two possibilities are to use multiple rectangular dataframes (row: date, column: asset identifier) for needed quantities (this is how cvxportfolio does it, which does match the paper nicely), or instead to use multi-index dataframes or series (row: [identifier x date], columns: price, volume, etc). The former gets sparser the longer the data interval. The latter is dense, no matter how the universe changes or how long the time period is.
This is to say, cvxportfolio is already built for universes that don't change too much, because it uses multiple rectangular dataframes instead of a single multi-indexed dataframe or multiple multi-indexed series for quantities like returns, price, etc. |
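The two layouts contrasted above can be illustrated with a short pandas sketch. The tickers `AAA` and `BBB`, the dates, and the values are made up for illustration; this only shows the data layouts and is not how Cvxportfolio stores data internally beyond what the comment states.

```python
import numpy as np
import pandas as pd

# Rectangular ("wide") layout: one row per date, one column per asset.
# Cells are NaN whenever an asset has no data on that date.
dates = pd.date_range("2020-01-01", periods=3, freq="D", name="date")
wide = pd.DataFrame(
    {"AAA": [0.01, np.nan, np.nan], "BBB": [np.nan, 0.02, -0.01]},
    index=dates,
)

# Multi-index ("long") layout: one row per (asset, date) pair that
# actually has data, so no NaN placeholders are stored.
long = (
    wide.reset_index()
    .melt(id_vars="date", var_name="asset", value_name="return")
    .dropna(subset=["return"])
    .set_index(["asset", "date"])["return"]
)

print(wide)
print(long)
print("NaN cells in wide layout:", int(wide.isna().sum().sum()))
print("rows in long layout:", len(long))
```

In this toy case half of the wide table is NaN, while the long series stores only the three observed values; the gap grows as the universe changes over a longer history.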
I agree, but cvxportfolio already handles it exceptionally well by default at this point in my opinion (the base handling you proposed is already implemented), and the remaining fixes (both the user-provided time-varying universe and the check to avoid the covariance error) seem not overly complicated. It's true my proposed check would need to be done at every timestamp, but it should be very cheap to simply count the number of concurrent values in the past_returns dataframe and make sure they are above a threshold. I think expecting it to run on normal data is a different bar than expecting super-proper behavior in all cases, which I agree would bloat the mission. |
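The check proposed above (count concurrent values in `past_returns` and compare against a threshold) could look roughly like the sketch below. The function name `pairs_below_overlap`, the `min_overlap` parameter, and the tickers `OLD` and `NEW` are hypothetical; this is not an existing Cvxportfolio function, and the toy data only mimics the situation described in the thread (an asset that leaves and re-enters, and another asset whose history lies mostly inside that gap).

```python
import numpy as np
import pandas as pd


def pairs_below_overlap(past_returns, min_overlap):
    """List (asset_a, asset_b, count) for pairs with too few shared rows.

    Hypothetical helper: `count` is the number of rows in which both
    assets have a non-NaN return.
    """
    observed = past_returns.notna().astype(int)
    overlap = observed.T @ observed  # pairwise counts of shared rows
    names = list(overlap.index)
    return [
        (a, b, int(overlap.loc[a, b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if overlap.loc[a, b] < min_overlap
    ]


nan = np.nan
past_returns = pd.DataFrame(
    {
        "OLD": [0.01, -0.01, 0.02, 0.00, 0.01, -0.02],  # full history
        "AMD": [0.02, 0.01, nan, nan, nan, -0.01],      # leaves, then returns
        "NEW": [nan, nan, 0.01, -0.01, 0.02, 0.01],     # starts during the gap
    }
)

flagged = pairs_below_overlap(past_returns, min_overlap=2)
print(flagged)
```

What to do with a flagged pair (warn, drop one of the two assets, or raise) is a separate design decision that the thread leaves open; the sketch only reports the pairs.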
Thanks both. There's a lot to unpack there, and it's true that we can't have Cvxportfolio handle every scenario, like asset names that are re-used for a different asset. Identifiers should be unique, otherwise all forecasts based on historical data are meaningless. For users with enough resources, handling their own naming to ensure uniqueness (e.g., use CUSIPs, ...) should be enough. If one uses the free Yahoo Finance data that is not an issue: delisted stocks disappear from the history, and the new name only has history for itself. Cvxportfolio already uses a lot of heuristics to clean the data (however, currently only in the Yahoo Finance interface), like forward filling missing prices. If you don't want the name to be removed, you should clean the returns before providing them (e.g., Regarding checks for co-occurrence of names in order to compute the covariance, I'm afraid that is what you incurred in; that check I still believe, however, that it should not happen. Each column of the returns' dataframe should only have |
I think this is excellent. I also now agree that the only time the error could reasonably come up is the time-varying-universe use case, since everything else could/should be handled with ffill() on prices and proper symbol naming; so working around it internally doesn't make sense. Thanks @enzbus. |
This is a simple implementation of optional time-varying investable universes in the market data servers. It was discussed in GH issue #137.
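As a rough illustration of what a user-defined time-varying universe means at the data level, the sketch below keeps the returns free of NaN placeholders and stores index membership separately as a boolean table. All names here (`in_universe`, `universe_at`, `history_for`, the tickers `AAA` and `BBB`) are assumptions made for the example; this is plain pandas and does not reproduce the Cvxportfolio implementation or its parameters.

```python
import pandas as pd

dates = pd.date_range("2020-01-01", periods=4, freq="D")

# Returns are kept complete; membership is NOT encoded as NaN.
returns = pd.DataFrame(
    {"AAA": [0.01, -0.02, 0.005, 0.0], "BBB": [0.0, 0.01, 0.02, -0.01]},
    index=dates,
)

# Hypothetical membership table: True where the asset is investable.
in_universe = pd.DataFrame(
    {"AAA": [True, True, False, True], "BBB": [True, True, True, True]},
    index=dates,
)


def universe_at(in_universe, t):
    """Assets that are investable at time t."""
    row = in_universe.loc[t]
    return list(row[row].index)


def history_for(returns, in_universe, t):
    """Past returns (strictly before t) of the assets investable at t."""
    return returns.loc[returns.index < t, universe_at(in_universe, t)]


print(universe_at(in_universe, dates[2]))
print(history_for(returns, in_universe, dates[3]))
```

Because membership is separate from the return values, an asset that leaves and re-enters the universe keeps its full history, so the pairwise overlap used for covariance estimation is not reduced by the gap.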
I get the error:
ValueError: Parameter value must be real.
with callback to:
I checked that there are many non-nans in every row (500+), my num_factors is set to 15, the datatype of all values is correct, etc.
I assume this might have something to do with the matrix not being positive semi-definite or some other numerical problem. It only occurs for a specific date in the data. I'm not sure how to fix it, and the error message is very unclear.