Updates run_lora_clm.py with enhanced dataset support #955
What does this PR do?
This PR updates the `examples/language-modeling/run_lora_clm.py` script with the flexibility to support more named datasets from the Hugging Face Hub. Previously, this example script only supported two named datasets for non-SQL prompts: `tatsu-lab/alpaca` and `timdettmers/openassistant-guanaco` (specifying other datasets would raise an error saying that the dataset is unsupported). The `tatsu-lab/alpaca` dataset has standard columns for `instruction`, `input`, and `output` that are formed into a prompt string before tokenization. `timdettmers/openassistant-guanaco` has a single `text` column.

To support more datasets, I added arguments that allow the user to specify column names for "instruction", "input", and "output" (or, when using SQL prompts, "question", "context", and "answer"). If the user provides custom column names, those columns are renamed to the standard names (instruction/input/output or question/context/answer) so that the rest of the code in the script works as expected. Also, for non-SQL prompt datasets, the "input" is optional (which is why there are "prompt_with_input" and "prompt_without_input" templates). The existing code handled the case where the "input" column was blank, but could not handle the "input" column being absent entirely, so I updated the code to also handle the "input" column not existing at all.
I left the code specific to `timdettmers/openassistant-guanaco` alone, since that is a more specialized use case: it has only the one `text` column and is not preprocessed with the prompt template.

I tested this update with several different HF datasets: `databricks/databricks-dolly-15k`, `ruggsea/stanford-encyclopedia-of-philosophy_instruct`, `b-mc2/sql-create-context`, `flytech/python-codes-25k`, `gbharti/finance-alpaca`, and `medalpaca/medical_meadow_medical_flashcards`.
As an example, `databricks/databricks-dolly-15k` is a dataset that has different column names than the original code expects (it has "instruction", "context", and "response" instead of "instruction", "input", and "output"). To use `run_lora_clm.py` with this dataset, you can include the args `--input_column_name "context"` and `--output_column_name "response"`.

This can also be run without the `--input_column_name` arg, which then tests the use case where a dataset does not have an "input" column at all.

Fixes #629
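For reference, an invocation with the dolly-15k column mapping might look like the sketch below. Only `--input_column_name` and `--output_column_name` come from this PR; the model name and the remaining flags are illustrative assumptions based on common Hugging Face example-script conventions, not verified against the script.

```shell
# Illustrative sketch: the two *_column_name flags are from this PR;
# every other flag and value here is an assumption.
python run_lora_clm.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset_name databricks/databricks-dolly-15k \
    --input_column_name "context" \
    --output_column_name "response" \
    --do_train \
    --output_dir ./lora_output
```

Omitting `--input_column_name` from a run like this exercises the no-"input"-column code path.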
Before submitting