
Updates run_lora_clm.py with enhanced dataset support #955

Merged (9 commits) on Jun 11, 2024

Conversation

@dmsuehir (Contributor) commented on May 6, 2024

What does this PR do?

This PR updates the examples/language-modeling/run_lora_clm.py script to flexibly support more named datasets from the Hugging Face Hub. Previously, this example script only supported two named datasets for non-SQL prompts: tatsu-lab/alpaca and timdettmers/openassistant-guanaco (specifying any other dataset raised an error saying the dataset is unsupported). The tatsu-lab/alpaca dataset has standard columns for instruction, input, and output, which are formed into a prompt string before tokenization. timdettmers/openassistant-guanaco has a single text column.
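
For context, the non-SQL prompt templates follow the standard Alpaca format. Below is a sketch of what gets built before tokenization; the exact strings in run_lora_clm.py may differ slightly:

  # Alpaca-style prompt templates (a sketch; the exact wording in
  # run_lora_clm.py may differ). "prompt_with_input" is used when an example
  # has a non-empty "input" column, "prompt_without_input" otherwise.
  PROMPT_DICT = {
      "prompt_with_input": (
          "Below is an instruction that describes a task, paired with an input "
          "that provides further context. Write a response that appropriately "
          "completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
      ),
      "prompt_without_input": (
          "Below is an instruction that describes a task. Write a response that "
          "appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Response:\n"
      ),
  }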

To support more datasets, I added arguments that let the user specify the column names for "instruction", "input", and "output" (or, when using SQL prompts, "question", "context", and "answer"). If the user provides custom column names, those columns are renamed to the standard names (instruction/input/output or question/context/answer) so that the rest of the script works as expected. Also, for non-SQL prompt datasets, the "input" is optional (which is why there are both "prompt_with_input" and "prompt_without_input" templates). The existing code handled the case where the "input" column was blank, but could not handle the "input" column not existing, so I updated the code to also handle the "input" column being absent entirely. A sketch of this logic follows.
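
As a rough illustration of the new behavior (a minimal sketch, not the exact code from the PR; the dataset and rename values mirror the dolly-15k example further below, and only --input_column_name / --output_column_name are flag names confirmed by this PR):

  from datasets import load_dataset

  # Hypothetical values for illustration; the PR exposes these through
  # command-line args such as --input_column_name and --output_column_name.
  dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
  rename_map = {"context": "input", "response": "output"}

  # Rename the user-specified columns to the standard names the rest of the
  # script expects.
  for old_name, new_name in rename_map.items():
      if old_name in dataset.column_names:
          dataset = dataset.rename_column(old_name, new_name)

  # Choose a template per example: "input" may be blank or absent entirely.
  def build_prompt(example):
      if example.get("input"):
          return PROMPT_DICT["prompt_with_input"].format_map(example)
      return PROMPT_DICT["prompt_without_input"].format_map(example)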

I left the code specific to timdettmers/openassistant-guanaco alone, since that is a more special case: it has only the one text column and does not get preprocessed with the prompt template.

I tested this update with several different HF datasets: databricks/databricks-dolly-15k, ruggsea/stanford-encyclopedia-of-philosophy_instruct, b-mc2/sql-create-context, flytech/python-codes-25k, gbharti/finance-alpaca, medalpaca/medical_meadow_medical_flashcards.

As an example, databricks/databricks-dolly-15k is a dataset that has different column names than the original code expects (it has "instruction", "context", and "response" instead of "instruction", "input", and "output"). To use run_lora_clm.py with this dataset, you can include args for --input_column_name "context" and --output_column_name "response".

  python run_lora_clm.py \
  --model_name_or_path huggyllama/llama-7b \
  --dataset_name databricks/databricks-dolly-15k \
  --dataset_concatenation True \
  --per_device_train_batch_size 16 \
  --evaluation_strategy "no" \
  --save_strategy "no" \
  --learning_rate 2e-4 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "constant" \
  --max_grad_norm 0.3 \
  --gradient_accumulation_steps 1 \
  --num_train_epochs 3 \
  --output_dir /tmp/output \
  --overwrite_output_dir \
  --validation_split_percentage 20 \
  --use_fast_tokenizer False \
  --lora_rank 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --lora_target_modules q_proj v_proj \
  --do_train True \
  --do_eval True \
  --use_habana \
  --use_lazy_mode \
  --adam_epsilon 1e-08 \
  --bf16 \
  --throughput_warmup_steps 3 \
  --input_column_name "context" \
  --output_column_name "response" \
  --max_steps 50 \
  --max_eval_samples 50

The same command can also be run without the --input_column_name arg, which exercises the case where a dataset has no "input" column at all.

Fixes #629

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@yeonsily (Collaborator) commented:

@dmsuehir Can you please add this case to CI and make sure it doesn't break anything?

@dmsuehir (Contributor, Author) commented:

@yeonsily I have added a test. Please let me know if there are any other updates needed to my PR. Thanks!

@yeonsily (Collaborator) commented:

@dmsuehir Thank you! What setup did you test for the baseline number?

@dmsuehir (Contributor, Author) commented:

> @dmsuehir Thank you! What setup did you test for the baseline number?

I used a single card on a Gaudi2 machine from the SDP cloud, running the v1.15.1 Gaudi software/driver, as part of the Kubernetes cluster. Memory was allocated at 120Gi, with hugepages-2Mi at 4400Mi. The base container I used was vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0.

@dmsuehir (Contributor, Author) commented on Jun 5, 2024:

@yeonsily Is there anything else needed for this PR?

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@regisss (Collaborator) left a comment:

Nice PR!

@regisss merged commit 1825d15 into huggingface:main on Jun 11, 2024
6 of 7 checks passed
imangohari1 pushed a commit to imangohari1/optimum-habana that referenced this pull request Jun 13, 2024
Successfully merging this pull request may close these issues:

  • run_lora_clm.py support for other datasets (#629)