Implement HuggingFace Language Modeling Estimators #2336

f4str · 2023-11-27T21:43:57Z

Is your feature request related to a problem? Please describe.
The next step of integrating HuggingFace into ART is to add support for the language modeling estimators. This involves creating ART estimator wrappers for the HuggingFace text models. These estimators should support

Describe the solution you'd like
A new module, art.estimators.language_modeling, will be created to hold all of the new HuggingFace language modeling estimators.

A new estimator will be created for each language modeling task (e.g., masked LM, sequence classification, next sentence prediction, etc.). Each estimator will be named accordingly (e.g., HuggingFaceMaskedLM, HuggingFaceSequenceClassificationLM, HuggingFaceNextSequencePredictionLM, etc.). A separate estimator per task is needed because the expected input and output are unique to each task.

Each estimator will take in a HuggingFace model and the corresponding tokenizer. In this approach, the model and tokenizer will be coupled in the same wrapper. This is the simplest approach since the tokenizer is specific to the text model and is not very useful standalone for ART's use cases.
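To make the coupled design concrete, here is a minimal sketch of a wrapper that owns both the model and its tokenizer, so callers only ever pass raw text. Everything in it is an assumption for illustration: `ToyTokenizer`, `ToySequenceClassifier`, and `CoupledSequenceClassificationEstimator` are hypothetical stand-ins, not HuggingFace classes or the final ART API.

```python
from typing import Dict, List, Sequence


class ToyTokenizer:
    """Hypothetical stand-in for a model-specific tokenizer (not a HuggingFace class)."""

    def __init__(self, vocab: Sequence[str]) -> None:
        # Id 0 is reserved for words outside the vocabulary.
        self.vocab: Dict[str, int] = {word: i + 1 for i, word in enumerate(vocab)}

    def encode(self, text: str) -> List[int]:
        return [self.vocab.get(word, 0) for word in text.lower().split()]


class ToySequenceClassifier:
    """Hypothetical stand-in for a text model: returns two scores per input."""

    def __call__(self, token_ids: List[int]) -> List[float]:
        known = sum(1 for token_id in token_ids if token_id != 0)
        unknown = len(token_ids) - known
        return [float(known), float(unknown)]


class CoupledSequenceClassificationEstimator:
    """Illustrative wrapper that holds the model and its tokenizer together."""

    def __init__(self, model: ToySequenceClassifier, tokenizer: ToyTokenizer) -> None:
        self.model = model
        self.tokenizer = tokenizer

    def predict(self, texts: List[str]) -> List[List[float]]:
        # Tokenization happens inside the wrapper, so the tokenizer can
        # never be mismatched with the model it was built for.
        return [self.model(self.tokenizer.encode(text)) for text in texts]


estimator = CoupledSequenceClassificationEstimator(
    model=ToySequenceClassifier(),
    tokenizer=ToyTokenizer(["art", "wraps", "text", "models"]),
)
scores = estimator.predict(["ART wraps text models", "unseen words here"])
```

A per-task estimator would differ mainly in what `predict` accepts and returns, which is the reason given above for having one class per task.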

Describe alternatives you've considered
The tokenizer can be made its own standalone module that is passed in to the ART wrapper. However, the tokenizer by itself is not very useful since it is dependent on the model (BERT, GPT-2, T5, etc.) and adds unnecessary complexity to creating the language model. If needed, the tokenizer can always be decoupled from the model and made standalone at a later point.

Additional context
The names of the module and estimators are not finalized and are open to suggestions.


OrsonTyphanel93 · 2023-11-28T10:35:14Z

Hi @f4str, incorporating NLP into ART isn't a bad idea! I hope the goal will be to backdoor or poison these LLMs in order to better understand their potential vulnerabilities and flaws. If the goal is simply to add HuggingFace models that are based on pre-trained models that are themselves vulnerable .....

In short, one technical point to bear in mind: a standalone tokenizer is less useful for ART's use cases (I think), because it is specific to a particular HuggingFace model and adds unnecessary complexity.
Decoupling the tokenizer may also introduce unnecessary complexity into the estimator creation process.

Possible improvements: a mechanism for dynamically selecting the appropriate tokenizer based on the specified model, and automatic model loading to streamline the model preparation process. Integrating these with ART's tuning capabilities would make it possible to optimize future HuggingFace models and tasks, which change on an almost monthly or quarterly basis, without disrupting ART's existing structure.
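As a rough illustration of the dynamic selection idea, the sketch below looks up a tokenizer from the model name through a small registry. The registry, the model-name prefixes, and both tokenizer functions are hypothetical and exist only for this example. In the transformers library itself, `AutoTokenizer.from_pretrained` already resolves a tokenizer from a checkpoint name, so a real implementation would probably delegate to it.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer that splits on whitespace."""
    return text.split()


def character_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer that splits into single characters."""
    return list(text)


# Hypothetical registry mapping a model-name prefix to its tokenizer.
TOKENIZER_REGISTRY: Dict[str, Tokenizer] = {
    "word-model": whitespace_tokenizer,
    "char-model": character_tokenizer,
}


def select_tokenizer(model_name: str) -> Tokenizer:
    """Return the tokenizer registered for the given model name."""
    for prefix, tokenizer in TOKENIZER_REGISTRY.items():
        if model_name.startswith(prefix):
            return tokenizer
    raise ValueError(f"No tokenizer registered for model '{model_name}'")


tokens = select_tokenizer("word-model-base")("hello art")
```

With a lookup like this the user names only the model, and the matching tokenizer is chosen for them.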

Thanks! :)

beat-buesser assigned f4str Nov 30, 2023

beat-buesser added the enhancement New feature or request label Nov 30, 2023

f4str linked a pull request Dec 14, 2023 that will close this issue

Hugging Face Language Models Implementation #2350

Open

14 tasks
