Hugging Face Language Models Implementation #2350

f4str · 2023-12-14T02:41:16Z

Description

This PR adds language models under a new art.estimators.language_modeling submodule. Currently, only Hugging Face language models with a PyTorch back-end are implemented. The implementation is HuggingFaceLanguageModel, a generic estimator that can run basic functionality on any Hugging Face model.

The new estimator takes a Hugging Face model and tokenizer and, for now, acts as a basic ART wrapper until attacks and defenses for language models are implemented. The estimator currently supports only the following tasks:

Tokenization to be fed into the model
Encoding strings to tokens
Decoding tokens to strings
Running inference on the model using a string input (with auto tokenization)
Running text generation on the model using a string input (with auto tokenization)

Inference on the estimator simply returns the output dictionary produced by running inference on the Hugging Face model. The estimator does not yet support training or loss gradients, as these are more complex features that will be added later. Once this PR is merged, additional issues will be created to implement training and loss gradients, each in a separate PR.
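For readers who want a feel for the wrapper pattern described above, here is a minimal, dependency-free sketch. It is not the code in this PR and does not use ART or transformers: `WhitespaceTokenizer`, `ToyModel`, `LanguageModelWrapper`, and all method names are hypothetical stand-ins chosen for illustration. It only shows the five supported tasks, inference passing the model's output dictionary through unchanged, and training and loss gradients left unimplemented.

```python
from typing import Dict, List

VOCAB = ["<unk>", "the", "cat", "sat", "down"]


class WhitespaceTokenizer:
    """Hypothetical stand-in for a tokenizer (not a Hugging Face class)."""

    def __init__(self, vocab: List[str]) -> None:
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def tokenize(self, text: str) -> List[str]:
        return text.split()

    def encode(self, text: str) -> List[int]:
        # Unknown words map to id 0, which is reserved for "<unk>".
        return [self.token_to_id.get(tok, 0) for tok in self.tokenize(text)]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)


class ToyModel:
    """Hypothetical stand-in for a model; returns an output dictionary."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def __call__(self, ids: List[int]) -> Dict[str, int]:
        last = ids[-1] if ids else 0
        # Cycle through ids 1..vocab_size-1 so "<unk>" (id 0) is never predicted.
        next_id = last % (self.vocab_size - 1) + 1
        return {"next_token_id": next_id, "num_tokens": len(ids)}


class LanguageModelWrapper:
    """Illustrative wrapper; names are assumptions, not ART's actual API."""

    def __init__(self, model: ToyModel, tokenizer: WhitespaceTokenizer) -> None:
        self.model = model
        self.tokenizer = tokenizer

    def tokenize(self, text: str) -> List[str]:
        return self.tokenizer.tokenize(text)

    def encode(self, text: str) -> List[int]:
        return self.tokenizer.encode(text)

    def decode(self, ids: List[int]) -> str:
        return self.tokenizer.decode(ids)

    def predict(self, text: str) -> Dict[str, int]:
        # Auto-tokenize, then return the model's output dictionary unchanged.
        return self.model(self.encode(text))

    def generate(self, text: str, max_new_tokens: int = 3) -> str:
        # Greedy loop: append the predicted token id, then decode everything.
        ids = self.encode(text)
        for _ in range(max_new_tokens):
            ids.append(self.model(ids)["next_token_id"])
        return self.decode(ids)

    def fit(self, *args, **kwargs) -> None:
        raise NotImplementedError("Training is not supported in this sketch.")

    def loss_gradient(self, *args, **kwargs) -> None:
        raise NotImplementedError("Loss gradients are not supported in this sketch.")


wrapper = LanguageModelWrapper(ToyModel(len(VOCAB)), WhitespaceTokenizer(VOCAB))
print(wrapper.predict("the cat"))                     # {'next_token_id': 3, 'num_tokens': 2}
print(wrapper.generate("the cat", max_new_tokens=2))  # the cat sat down
```

The real estimator wraps an actual Hugging Face model and tokenizer; see the demo notebook for its true interface.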

A demo notebook in notebooks/hugging_face_language_model.ipynb was created to illustrate the usage.

Fixes #2336

Type of change

Please check all relevant options.

Improvement (non-breaking)
Bug fix (non-breaking)
New feature (non-breaking)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

Testing

Please describe the tests that you ran to verify your changes. Consider listing any relevant details of your test configuration.

Unit tests for the HuggingFaceLanguageModel estimator.

Test Configuration:

OS
Python version
ART version or commit number
TensorFlow / Keras / PyTorch / MXNet version

Checklist

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
My changes have been tested using both CPU and GPU devices

codecov-commenter · 2023-12-14T02:44:59Z

Codecov Report

Attention: 119 lines in your changes are missing coverage. Please review.

Comparison is base (403623c) 73.14% compared to head (38f4429) 77.69%.

Additional details and impacted files

@@              Coverage Diff               @@
##           dev_1.18.0    #2350      +/-   ##
==============================================
+ Coverage       73.14%   77.69%   +4.55%     
==============================================
  Files             327      330       +3     
  Lines           30205    30386     +181     
  Branches         5589     5634      +45     
==============================================
+ Hits            22094    23609    +1515     
+ Misses           6807     5429    -1378     
- Partials         1304     1348      +44

Files	Coverage Δ
art/estimators/__init__.py	`100.00% <100.00%> (ø)`
art/estimators/language_modeling/__init__.py	`100.00% <100.00%> (ø)`
art/estimators/language_modeling/language_model.py	`100.00% <100.00%> (ø)`
art/estimators/language_modeling/hugging_face.py	`23.22% <23.22%> (ø)`

... and 94 files with indirect coverage changes
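As a quick sanity check on the coverage diff, the reported percentages appear to equal hits divided by lines, truncated (not rounded) to two decimals. This is inferred from the numbers in the table, not from Codecov documentation, and the helper names below are hypothetical.

```python
def coverage_hundredths(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated via integer division."""
    return hits * 10000 // lines


def as_percent(hundredths: int) -> str:
    """Format hundredths of a percent as a two-decimal percentage string."""
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


# Figures taken from the coverage diff: (hits, misses, partials, lines).
base = {"hits": 22094, "misses": 6807, "partials": 1304, "lines": 30205}
head = {"hits": 23609, "misses": 5429, "partials": 1348, "lines": 30386}

# Hits, misses, and partials should add up to the total line count.
for report in (base, head):
    assert report["hits"] + report["misses"] + report["partials"] == report["lines"]

base_cov = coverage_hundredths(base["hits"], base["lines"])
head_cov = coverage_hundredths(head["hits"], head["lines"])

print(as_percent(base_cov))             # 73.14%
print(as_percent(head_cov))             # 77.69%
print(as_percent(head_cov - base_cov))  # 4.55%
```

Rounding instead of truncating would give 73.15% and 77.70%, which is why integer division is used here.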

Signed-off-by: Farhan Ahmed <[email protected]>

f4str changed the base branch from main to dev_1.17.0 December 14, 2023 02:41

github-advanced-security bot found potential problems Dec 14, 2023

f4str marked this pull request as ready for review December 14, 2023 04:59

beat-buesser self-requested a review December 14, 2023 12:24

beat-buesser self-assigned this Dec 14, 2023

f4str added 9 commits January 18, 2024 15:53

create new art language model module

f0e539d

Signed-off-by: Farhan Ahmed <[email protected]>

create template classes

866e4c4

Signed-off-by: Farhan Ahmed <[email protected]>

added basic language model functionality

8b2c0d4

Signed-off-by: Farhan Ahmed <[email protected]>

implement most features for langauge model

8163283

Signed-off-by: Farhan Ahmed <[email protected]>

update docstrings

88f279b

Signed-off-by: Farhan Ahmed <[email protected]>

update subclassing

4b16695

Signed-off-by: Farhan Ahmed <[email protected]>

finish language model implementation

e612ccb

Signed-off-by: Farhan Ahmed <[email protected]>

added language modeling unit tests

f59fe6b

Signed-off-by: Farhan Ahmed <[email protected]>

add hugging face language model demo notebook

38f4429

Signed-off-by: Farhan Ahmed <[email protected]>

f4str force-pushed the hf-language-models branch from 8d7b586 to 38f4429 Compare January 18, 2024 23:53

f4str changed the base branch from dev_1.17.0 to dev_1.18.0 January 18, 2024 23:53
