
Confidence Interval for categorical outcome #862

Open
ellpri opened this issue Mar 19, 2024 · 3 comments


ellpri commented Mar 19, 2024

Hi @kbattocchi, Keith, I am building a CausalForest where I have a treatment variable which is multi-categorical [0,1,2,3,5], and the outcome is binary [0,1], where 1 indicates a severe outcome.

econml_causalForest = CausalForestDML(
    model_y=RandomForestRegressor(random_state=42),
    model_t=RandomForestClassifier(min_samples_leaf=10, random_state=42),
    discrete_treatment=True, cv=3, random_state=123,
)
econml_causalForest.fit(Y=y_train, T=T_train, X=X_train, W=None)
print(f'econml_ATE_forest: {econml_causalForest.ate(X_test, T0=0, T1=5)}')

print(econml_causalForest.summary())
print(econml_causalForest.ate_inference(X))

I got the following results:

 Doubly Robust ATE on Training Data Results          
==============================================================
         point_estimate stderr zstat  pvalue ci_lower ci_upper
--------------------------------------------------------------
ATE|T0_1          0.128   0.02  6.402    0.0    0.089    0.167
ATE|T0_2          0.143  0.019  7.596    0.0    0.106     0.18
ATE|T0_3          0.164   0.02   8.35    0.0    0.126    0.203
ATE|T0_5          0.313   0.02 15.827    0.0    0.274    0.352


econml_ATE_forest: 0.27076799164408494
               Uncertainty of Mean Point Estimate              
===============================================================
mean_point stderr_mean zstat pvalue ci_mean_lower ci_mean_upper
---------------------------------------------------------------
     0.109       1.059 0.103  0.918        -1.968         2.185
      Distribution of Point Estimate     
=========================================
std_point pct_point_lower pct_point_upper
-----------------------------------------
    0.946          -0.263           0.233
     Total Variance of Point Estimate     
==========================================
stderr_point ci_point_lower ci_point_upper
------------------------------------------
       1.421         -0.374          0.377
------------------------------------------

Which results should I take into consideration, the Doubly Robust one or the DML one? The two ATE estimates are different. And how should I interpret the ATE and CI?

kbattocchi (Collaborator) commented:

If you just care about the ATE on the training set, then use the doubly robust ATE (which you can get programmatically from the ate_ attribute). The ate() method is more flexible, allowing you to also compute the ATE for other populations X, but it is not doubly-robust.

In terms of interpretation, a value of 0.313 means that increasing the probability of assigning an individual to treatment 5 instead of treatment 0 by p will increase the likelihood of a severe outcome by 0.313p. (This estimate is linear in the treatment probability, which may not be completely realistic for a discrete outcome: for some values of X we may see small variations in treatment that correspond to large variations in outcome, which would extrapolate to more than a 100% change in severity probability given a 100% change in treatment from one level to another, which is impossible.)
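To make the linear-in-treatment-probability reading concrete, here is a minimal sketch. Only the 0.313 ATE comes from the results above; the baseline severity probabilities (0.10 and 0.80) and the `predicted_severity` helper are invented for illustration and are not part of the EconML API:

```python
# Hypothetical illustration of the linear-in-treatment-probability reading.
ate = 0.313  # estimated ATE of treatment 5 vs. treatment 0 (from the table above)

def predicted_severity(p_baseline, delta_p, ate):
    """Predicted severity probability when the probability of assigning
    treatment 5 (instead of treatment 0) increases by delta_p."""
    return p_baseline + ate * delta_p

# Moving an individual fully from treatment 0 to treatment 5 (delta_p = 1):
print(predicted_severity(0.10, 1.0, ate))  # about 0.413

# The caveat: with a high baseline, the linear model extrapolates past 1,
# which is not a valid probability.
print(predicted_severity(0.80, 1.0, ate))  # about 1.113
```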

@ellpri
Copy link
Author

ellpri commented Mar 25, 2024

@kbattocchi Hi Keith, thanks for the reply. I am working with accident data. The treatment variable here is relative velocity: '5' indicates more than 80 km/h and '0' is 20 km/h. The target variable is injury severity. I expected a result saying that if relative velocity changes from 0 to 5, it increases the injury severity probability by x. But your interpretation is a little different.

  1. So is the use case not applicable here? In general, I want to analyse the parameters from the accident database and their influence on injury severity, which is a categorical variable.
  2. As you mentioned, the ATE here is linear, so should I use a treatment featurizer?

kbattocchi (Collaborator) commented:

@ellpri I think my answer is consistent with what you're looking for: changing 100% from '0' to '5' means changing the severity probability by 100% of 0.313, i.e. increasing it by 0.313. I only added the caveat because the linearity of the model is not necessarily completely realistic for discrete outcomes. We estimate the effect conditional on X by regressing the unexpected variation in outcome given X and W on the unexpected variation in treatment given X and W. Empirically, it is possible that for some X there was a big unexpected change in Y (say, a severe injury occurred when we thought that was only 10% likely given X) but only a small unexpected change in T (the relative velocity was 5, and we thought that was 95% likely given X). In that case it looks like a very small change in T leads to a big change in Y, which would extrapolate to a more-than-100% change in outcome given a change in treatment from '0' to '5'. Despite this, empirically DML seems to generally perform well with discrete outcomes, even though theoretically something like a "double machine learning for logistic regression" setup might be more appropriate.
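The caveat above can be made concrete with a tiny residual-on-residual sketch. All numbers are invented for illustration; `y_resid` and `t_resid` stand in for hypothetical first-stage residuals and are not produced by any EconML call here:

```python
# Toy version of the final stage of DML: regress outcome residuals on
# treatment residuals. For the first observation, the outcome residual is
# large (a severe injury judged only 10% likely given X) while the
# treatment residual is small (treatment '5' was judged 95% likely given X).
y_resid = [0.90, -0.10, -0.10]   # Y - E[Y | X, W]
t_resid = [0.05, -0.02, -0.03]   # T - E[T | X, W], for the treatment-5 indicator

# OLS slope through the origin = estimated local effect of treatment 5
num = sum(t * y for t, y in zip(t_resid, y_resid))
den = sum(t * t for t in t_resid)
slope = num / den
print(slope)  # well above 1: an implied change in severity probability > 100%
```

Because the treatment residuals are tiny relative to the outcome residuals, the fitted slope far exceeds 1, which is exactly the impossible extrapolation described above.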

The treatment featurizer won't affect this - you're already fitting a CATE model that is flexible in X (because you are using CausalForestDML), so featurizing X won't buy you anything - the linearity that I'm talking about is linearity in the treatment (probability). But discrete models are linear in the treatment without loss of generality.
