Hugging Face Language Models Implementation #2350

Status: Open. 9 commits to be merged into base branch dev_1.18.0.

Conversation

f4str
Collaborator

@f4str f4str commented Dec 14, 2023

Description

This PR adds language models under a new art.estimators.language_modeling submodule. Currently, only Hugging Face language models with a PyTorch back-end are supported. They are exposed through HuggingFaceLanguageModel, a generic estimator that can run basic functionality on any Hugging Face model.

The estimator takes a Hugging Face model and tokenizer and, for now, acts as a basic ART wrapper until attacks and defenses for language models are implemented. It currently supports only the following tasks:

  • Tokenizing strings into inputs that can be fed to the model
  • Encoding strings to tokens
  • Decoding tokens to strings
  • Running inference on the model using a string input (with auto tokenization)
  • Running text generation on the model using a string input (with auto tokenization)

Inference on the estimator simply returns the output dictionary produced by running inference on the underlying Hugging Face model. The estimator does not yet support training or loss gradients, as these are more complex features. Once this PR is merged, separate issues will be opened to implement training and loss gradients in follow-up PRs.
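
To make the wrapper pattern concrete, here is a minimal, dependency-free sketch of the five tasks listed above. It is not the code in this PR: the actual method names and signatures of HuggingFaceLanguageModel are not shown in this description, so `LanguageModelWrapper`, `ToyTokenizer`, `ToyModel`, and all of their methods are hypothetical stand-ins. The toy tokenizer and model only mimic the general shape of a Hugging Face tokenizer (callable, with `encode` and `decode`) and model (callable, with `generate`).

```python
from typing import Any, Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a Hugging Face tokenizer."""

    def __init__(self, vocab: List[str]) -> None:
        self.id_to_token = list(vocab)
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}

    def encode(self, text: str) -> List[int]:
        # Unknown words map to id 0, which this sketch reserves for "<unk>".
        return [self.token_to_id.get(tok, 0) for tok in text.split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)

    def __call__(self, text: str) -> Dict[str, List[int]]:
        ids = self.encode(text)
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}


class ToyModel:
    """Hypothetical model: the 'next token' is always (last id + 1), wrapping around."""

    def __init__(self, vocab_size: int) -> None:
        self.vocab_size = vocab_size

    def __call__(self, input_ids: List[int], attention_mask: List[int]) -> Dict[str, Any]:
        last = input_ids[-1] if input_ids else -1
        return {"next_token_id": (last + 1) % self.vocab_size}

    def generate(self, input_ids: List[int], max_new_tokens: int = 3) -> List[int]:
        out = list(input_ids)
        for _ in range(max_new_tokens):
            out.append(self(out, [1] * len(out))["next_token_id"])
        return out


class LanguageModelWrapper:
    """Hypothetical thin wrapper exposing the five tasks from the description."""

    def __init__(self, model: Any, tokenizer: Any) -> None:
        self.model = model
        self.tokenizer = tokenizer

    def tokenize(self, text: str) -> Dict[str, Any]:
        return self.tokenizer(text)

    def encode(self, text: str) -> List[int]:
        return self.tokenizer.encode(text)

    def decode(self, ids: List[int]) -> str:
        return self.tokenizer.decode(ids)

    def predict(self, text: str) -> Dict[str, Any]:
        # Auto-tokenize, then return the model's output dictionary unchanged.
        return self.model(**self.tokenize(text))

    def generate(self, text: str, **kwargs: Any) -> str:
        # Auto-tokenize, generate token ids, and decode them back to a string.
        return self.decode(self.model.generate(self.encode(text), **kwargs))


vocab = ["<unk>", "the", "cat", "sat", "down"]
wrapper = LanguageModelWrapper(ToyModel(len(vocab)), ToyTokenizer(vocab))
print(wrapper.predict("the cat"))                     # {'next_token_id': 3}
print(wrapper.generate("the cat", max_new_tokens=2))  # the cat sat down
```

Note that `predict` passes the model's output dictionary through untouched, mirroring the behaviour described above, and that the sketch has no training or gradient methods, matching the stated scope of this PR.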

A demo notebook, notebooks/hugging_face_language_model.ipynb, illustrates the usage.

Fixes #2336

Type of change

Please check all relevant options.

  • Improvement (non-breaking)
  • Bug fix (non-breaking)
  • New feature (non-breaking)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Testing

Please describe the tests that you ran to verify your changes. Consider listing any relevant details of your test configuration.

  • Unit tests for the HuggingFaceLanguageModel estimator.

Test Configuration:

  • OS
  • Python version
  • ART version or commit number
  • TensorFlow / Keras / PyTorch / MXNet version

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • My changes have been tested using both CPU and GPU devices

@f4str f4str changed the base branch from main to dev_1.17.0 December 14, 2023 02:41
@codecov-commenter

codecov-commenter commented Dec 14, 2023

Codecov Report

Attention: 119 lines in your changes are missing coverage. Please review.

Comparison is base (403623c) 73.14% compared to head (38f4429) 77.69%.

Additional details and impacted files

@@              Coverage Diff               @@
##           dev_1.18.0    #2350      +/-   ##
==============================================
+ Coverage       73.14%   77.69%   +4.55%     
==============================================
  Files             327      330       +3     
  Lines           30205    30386     +181     
  Branches         5589     5634      +45     
==============================================
+ Hits            22094    23609    +1515     
+ Misses           6807     5429    -1378     
- Partials         1304     1348      +44     
Files                                               Coverage Δ
art/estimators/__init__.py                          100.00% <100.00%> (ø)
art/estimators/language_modeling/__init__.py        100.00% <100.00%> (ø)
art/estimators/language_modeling/language_model.py  100.00% <100.00%> (ø)
art/estimators/language_modeling/hugging_face.py    23.22% <23.22%> (ø)

... and 94 files with indirect coverage changes

@f4str f4str marked this pull request as ready for review December 14, 2023 04:59
@beat-buesser beat-buesser self-requested a review December 14, 2023 12:24
@beat-buesser beat-buesser self-assigned this Dec 14, 2023
@f4str f4str changed the base branch from dev_1.17.0 to dev_1.18.0 January 18, 2024 23:53
Development

Successfully merging this pull request may close these issues.

Implement HuggingFace Language Modeling Estimators
3 participants