
BUG: Unable to Generate SHAP values for a dataframe containing text data trained on lstm model #3668

Open
2 of 4 tasks
Caster12 opened this issue May 21, 2024 · 2 comments
Labels
bug: Indicates an unexpected problem or unintended behaviour
deep explainer: Relating to DeepExplainer, tensorflow or pytorch

Comments

@Caster12

Issue Description

Hi, I am trying to generate SHAP values for my LSTM model. I pass a dataframe with a single column called 'text', which contains the text records, to the explainer. However, the explainer raises the following error: TypeError: cannot use a string pattern on a bytes-like object. The code I tried is attached below:

Minimal Reproducible Example

import shap

def model_predict(data):
    # Tokenize each record and map tokens to vocabulary indexes
    X = [vocab(tokenizer(str(text))) for text in data['text'].tolist()]
    # Bring all samples to exactly max_words tokens (pad with 0 or truncate)
    X = [tokens + [0] * (max_words - len(tokens)) if len(tokens) < max_words else tokens[:max_words] for tokens in X]
    return model.predict_proba(X)

masker = shap.maskers.Text(tokenizer=r"\W+")
exp = shap.Explainer(model=model_predict, masker=masker)
spv = exp(input_df)
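The pad-or-truncate step inside model_predict can be sketched as a standalone helper (pad_or_truncate is a hypothetical name introduced here for illustration, not part of the report):

```python
def pad_or_truncate(tokens, max_words, pad_id=0):
    """Bring a token-id list to exactly max_words entries, mirroring
    the list comprehension in model_predict above."""
    if len(tokens) < max_words:
        # Right-pad short samples with the pad id (0 here)
        return tokens + [pad_id] * (max_words - len(tokens))
    # Truncate long samples to the first max_words tokens
    return tokens[:max_words]

print(pad_or_truncate([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```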

Traceback

File c:\.venv\lib\site-packages\shap\explainers\_partition.py:146, in Partition.explain_row(self, max_evals, main_effects, error_bounds, batch_size, outputs, silent, fixed_context, *row_args)
    143     raise ValueError("Unknown fixed_context value passed (must be 0, 1 or None): %s" %fixed_context)
    145 # build a masked version of the model for the current input sample
--> 146 fm = MaskedModel(self.model, self.masker, self.link, self.linearize_link, *row_args)
    148 # make sure we have the base value and current value outputs
    149 M = len(fm)

File c:\.venv\lib\site-packages\shap\utils\_masked_model.py:30, in MaskedModel.__init__(self, model, masker, link, linearize_link, *args)
     28 # if the masker supports it, save what positions vary from the background
     29 if callable(getattr(self.masker, "invariants", None)):
---> 30     self._variants = ~self.masker.invariants(*args)
     31     self._variants_column_sums = self._variants.sum(0)
     32     self._variants_row_inds = [
     33         self._variants[:,i] for i in range(self._variants.shape[1])
     34     ]

File c:\.venv\lib\site-packages\shap\maskers\_text.py:301, in Text.invariants(self, s)
    298 def invariants(self, s):
    299     """ The names of the features for each mask position for the given input string.
    300     """
--> 301     self._update_s_cache(s)
    303     invariants = np.zeros(len(self._tokenized_s), dtype=bool)
    304     if self.keep_prefix > 0:

File c:\.venv\lib\site-packages\shap\maskers\_text.py:280, in Text._update_s_cache(self, s)
    278 if self._s != s:
    279     self._s = s
--> 280     tokens, token_ids = self.token_segments(s)
    281     self._tokenized_s = np.array(token_ids)
    282     self._segments_s = np.array(tokens)

File c:\.venv\lib\site-packages\shap\maskers\_text.py:183, in Text.token_segments(self, s)
    181     return parts, token_data["input_ids"]
    182 except (NotImplementedError, TypeError): # catch lack of support for return_offsets_mapping
--> 183     token_ids = self.tokenizer(s)['input_ids']
    184     if hasattr(self.tokenizer, "convert_ids_to_tokens"):
    185         tokens = self.tokenizer.convert_ids_to_tokens(token_ids)

File c:\.venv\lib\site-packages\shap\maskers\_text.py:360, in SimpleTokenizer.__call__(self, s, return_offsets_mapping)
    358 offset_ranges = []
    359 input_ids = []
--> 360 for m in re.finditer(self.split_pattern, s):
    361     start, end = m.span(0)
    362     offset_ranges.append((pos, start))

File C:\tools\python3.10\latest\lib\re.py:247, in finditer(pattern, string, flags)
    242 def finditer(pattern, string, flags=0):
    243     """Return an iterator over all non-overlapping matches in the
    244     string.  For each match, the iterator returns a Match object.
    245 
    246     Empty matches are included in the result."""
--> 247     return _compile(pattern, flags).finditer(string)

TypeError: cannot use a string pattern on a bytes-like object
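The final frame of the traceback can be reproduced in isolation: Python's re module raises exactly this TypeError when a str pattern is applied to bytes, which suggests a row is reaching the masker as bytes rather than str (a minimal sketch, independent of shap):

```python
import re

split_pattern = r"\W+"  # the same regex passed to shap.maskers.Text above

# A str input tokenizes fine:
print([m.group(0) for m in re.finditer(split_pattern, "hello world")])  # [' ']

# A bytes input reproduces the reported error:
try:
    list(re.finditer(split_pattern, b"hello world"))
except TypeError as err:
    print(err)  # cannot use a string pattern on a bytes-like object
```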

Expected Behavior

No response

Bug report checklist

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest release of shap.
  • I have confirmed this bug exists on the master branch of shap.
  • I'd be interested in making a PR to fix this bug

Installed Versions

Name: shap
Version: 0.42.1
Summary: A unified approach to explain the output of any machine learning model.
Home-page:
Author:
Author-email: Scott Lundberg [email protected]
License: MIT License

@Caster12 Caster12 added the bug Indicates an unexpected problem or unintended behaviour label May 21, 2024
@CloseChoice
Collaborator

This is a longstanding problem we have with LSTMs. Currently there is no solution in sight.

@Caster12
Author

Caster12 commented May 23, 2024

Is this problem only for LSTM models, or for NLP use cases in general? I see the problem is in the Text masker, which expects a list of strings for NLP.
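One untested workaround along those lines (to_text is a hypothetical helper, and whether it resolves the LSTM case is unconfirmed): coerce every record to a Python str before it reaches the Text masker, e.g. by passing a cleaned list instead of the raw dataframe:

```python
def to_text(value):
    """Coerce a single record to str, decoding bytes first (assumed UTF-8)."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)

# e.g. spv = exp([to_text(t) for t in input_df['text']])  -- untested against shap
records = [b"first record", "second record"]
texts = [to_text(t) for t in records]
print(texts)  # ['first record', 'second record']
```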

@CloseChoice CloseChoice added the deep explainer Relating to DeepExplainer, tensorflow or pytorch label Jun 18, 2024