
BUG: Unable to Generate SHAP values for a dataframe containing text data trained on lstm model #3668

Open
2 of 4 tasks
Caster12 opened this issue May 21, 2024 · 2 comments
Labels
bug: Indicates an unexpected problem or unintended behaviour
deep explainer: Relating to DeepExplainer, tensorflow or pytorch

Comments

@Caster12

Issue Description

Hi, I am trying to generate SHAP values for my LSTM model. I pass a dataframe with a single column called 'text', which contains the text records, to the explainer. However, the explainer raises the following error: TypeError: cannot use a string pattern on a bytes-like object. The code I tried is attached below:

Minimal Reproducible Example

import shap

def model_predict(data):
    # Tokenize each record and map tokens to vocabulary indexes
    X = [vocab(tokenizer(str(text))) for text in data['text'].tolist()]
    # Bring all samples to exactly max_words tokens (pad with 0 or truncate)
    X = [tokens + [0] * (max_words - len(tokens)) if len(tokens) < max_words else tokens[:max_words] for tokens in X]
    return model.predict_proba(X)

masker = shap.maskers.Text(tokenizer=r"\W+")
exp = shap.Explainer(model=model_predict, masker=masker)
spv = exp(input_df)
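The pad-or-truncate step inside model_predict can be sketched as a standalone helper (pad_or_truncate is a hypothetical name introduced here for illustration, not part of the report):

```python
def pad_or_truncate(tokens, max_words, pad_id=0):
    """Bring a token-id list to exactly max_words entries, mirroring
    the list comprehension in model_predict above."""
    if len(tokens) < max_words:
        # Right-pad short samples with the pad id (0 here)
        return tokens + [pad_id] * (max_words - len(tokens))
    # Truncate long samples to the first max_words tokens
    return tokens[:max_words]

print(pad_or_truncate([1, 2, 3], 5))           # [1, 2, 3, 0, 0]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 5))  # [1, 2, 3, 4, 5]
```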

Traceback

File c:\.venv\lib\site-packages\shap\explainers\_partition.py:146, in Partition.explain_row(self, max_evals, main_effects, error_bounds, batch_size, outputs, silent, fixed_context, *row_args)
    143     raise ValueError("Unknown fixed_context value passed (must be 0, 1 or None): %s" %fixed_context)
    145 # build a masked version of the model for the current input sample
--> 146 fm = MaskedModel(self.model, self.masker, self.link, self.linearize_link, *row_args)
    148 # make sure we have the base value and current value outputs
    149 M = len(fm)

File c:\.venv\lib\site-packages\shap\utils\_masked_model.py:30, in MaskedModel.__init__(self, model, masker, link, linearize_link, *args)
     28 # if the masker supports it, save what positions vary from the background
     29 if callable(getattr(self.masker, "invariants", None)):
---> 30     self._variants = ~self.masker.invariants(*args)
     31     self._variants_column_sums = self._variants.sum(0)
     32     self._variants_row_inds = [
     33         self._variants[:,i] for i in range(self._variants.shape[1])
     34     ]

File c:\.venv\lib\site-packages\shap\maskers\_text.py:301, in Text.invariants(self, s)
    298 def invariants(self, s):
    299     """ The names of the features for each mask position for the given input string.
    300     """
--> 301     self._update_s_cache(s)
    303     invariants = np.zeros(len(self._tokenized_s), dtype=bool)
    304     if self.keep_prefix > 0:

File c:\.venv\lib\site-packages\shap\maskers\_text.py:280, in Text._update_s_cache(self, s)
    278 if self._s != s:
    279     self._s = s
--> 280     tokens, token_ids = self.token_segments(s)
    281     self._tokenized_s = np.array(token_ids)
    282     self._segments_s = np.array(tokens)

File c:\.venv\lib\site-packages\shap\maskers\_text.py:183, in Text.token_segments(self, s)
    181     return parts, token_data["input_ids"]
    182 except (NotImplementedError, TypeError): # catch lack of support for return_offsets_mapping
--> 183     token_ids = self.tokenizer(s)['input_ids']
    184     if hasattr(self.tokenizer, "convert_ids_to_tokens"):
    185         tokens = self.tokenizer.convert_ids_to_tokens(token_ids)

File c:\.venv\lib\site-packages\shap\maskers\_text.py:360, in SimpleTokenizer.__call__(self, s, return_offsets_mapping)
    358 offset_ranges = []
    359 input_ids = []
--> 360 for m in re.finditer(self.split_pattern, s):
    361     start, end = m.span(0)
    362     offset_ranges.append((pos, start))

File C:\tools\python3.10\latest\lib\re.py:247, in finditer(pattern, string, flags)
    242 def finditer(pattern, string, flags=0):
    243     """Return an iterator over all non-overlapping matches in the
    244     string.  For each match, the iterator returns a Match object.
    245 
    246     Empty matches are included in the result."""
--> 247     return _compile(pattern, flags).finditer(string)

TypeError: cannot use a string pattern on a bytes-like object
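The final frame of the traceback can be reproduced in isolation: Python's re module raises exactly this TypeError when a str pattern is applied to bytes, which suggests a row is reaching the masker as bytes rather than str (a minimal sketch, independent of shap):

```python
import re

split_pattern = r"\W+"  # the same regex passed to shap.maskers.Text above

# A str input tokenizes fine:
print([m.group(0) for m in re.finditer(split_pattern, "hello world")])  # [' ']

# A bytes input reproduces the reported error:
try:
    list(re.finditer(split_pattern, b"hello world"))
except TypeError as err:
    print(err)  # cannot use a string pattern on a bytes-like object
```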

Expected Behavior

No response

Bug report checklist

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest release of shap.
  • I have confirmed this bug exists on the master branch of shap.
  • I'd be interested in making a PR to fix this bug

Installed Versions

Name: shap
Version: 0.42.1
Summary: A unified approach to explain the output of any machine learning model.
Home-page:
Author:
Author-email: Scott Lundberg [email protected]
License: MIT License

@Caster12 Caster12 added the bug Indicates an unexpected problem or unintended behaviour label May 21, 2024
@CloseChoice
Collaborator

This is a longstanding problem we have with LSTMs. Currently there is no solution in sight.

@Caster12
Author

Caster12 commented May 23, 2024

Is this problem only for LSTM models, or for NLP use cases in general? I see the problem is in the Text masker, which expects a list of strings for NLP.
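One untested workaround along those lines (to_text is a hypothetical helper, and whether it resolves the LSTM case is unconfirmed): coerce every record to a Python str before it reaches the Text masker, e.g. by passing a cleaned list instead of the raw dataframe:

```python
def to_text(value):
    """Coerce a single record to str, decoding bytes first (assumed UTF-8)."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return str(value)

# e.g. spv = exp([to_text(t) for t in input_df['text']])  -- untested against shap
records = [b"first record", "second record"]
texts = [to_text(t) for t in records]
print(texts)  # ['first record', 'second record']
```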

@CloseChoice CloseChoice added the deep explainer Relating to DeepExplainer, tensorflow or pytorch label Jun 18, 2024