ValueError Reason #30570
Comments
Hi @ananegru, thanks for opening this issue! The reason this error is being raised is that the tiiuae/falcon-7b tokenizer doesn't have a CLS token. To get this example to work for this model, I'd suggest setting the tokenizer's cls token to an equivalent token that can represent this, or adapting the script to filter out these problem examples.
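A minimal sketch of the first suggestion, assuming you reuse Falcon's EOS token as the CLS stand-in (that specific choice is an assumption, not part of the advice above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

# Falcon's tokenizer defines no CLS token, so tokenizer.cls_token_id is None.
if tokenizer.cls_token is None:
    # Reusing the EOS token here is an assumption; any existing special
    # token that can stand in for CLS would work the same way.
    tokenizer.cls_token = tokenizer.eos_token
```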
Yes, this script is rather outdated - with modern models, it is more common to do QA by just providing the text and asking an instruct model directly! We can probably work around this by just setting the value to 0 (and emitting an 'empty' answer at the start of the sequence) when no CLS token is present, since the CLS token location is only used to create 'dummy' spans for impossible answers. I'll make a PR now!
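A sketch of that fallback logic (illustrative only, not the actual PR diff; the real change lives in run_qa.py's prepare_train_features):

```python
def safe_cls_index(tokenizer, input_ids):
    """Return the CLS token position, or 0 when the tokenizer has no CLS token.

    Position 0 is then used as the 'dummy' span target for impossible answers.
    """
    if tokenizer.cls_token_id is not None and tokenizer.cls_token_id in input_ids:
        return input_ids.index(tokenizer.cls_token_id)
    return 0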
@ananegru a PR is ready. Can you try it out and let us know if it works for you? You can install the PR branch with this command:

pip install git+https://github.com/huggingface/transformers.git@fix_qa_example
Let us know if you can train Falcon after running it!
Hi, thanks a lot for the help! However, when I try running the command you sent above I get the following error:

Collecting git+https://github.com/huggingface/transformers.git@fix_qa_example
× git checkout -q fix_qa_example did not run successfully.
note: This error originates from a subprocess, and is likely not a problem with pip.
@ananegru I think this is because the branch was deleted after merge. The commit is now on main - you can get this by installing from source.
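For example, one common way to install from source (assuming pip and git are available):

```
pip install git+https://github.com/huggingface/transformers.git
```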
Unfortunately, the same ValueError regarding the CLS token is persisting, even after running the command you sent above. Is there anything else that might be done to fix it?
@ananegru Is this occurring when using the Falcon checkpoint? Unfortunately, it's not possible to use this model here - the QA script is designed for MLM-style encoder models, whereas Falcon is a CLM.
Ah yes, it's occurring at the checkpoint. Okay, so I just can't use this script at all for fine-tuning Falcon? Do you perhaps know of any other scripts to fine-tune a CLM for QA?
There isn't a script currently in the library; however, since there's a FalconForQuestionAnswering head, this should probably be supported. cc @Rocketknight1
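For reference, that head can be loaded through the auto class (a sketch; whether the example script supports it is the open question here):

```python
from transformers import AutoModelForQuestionAnswering

# Loads FalconForQuestionAnswering; the QA head on top is freshly
# initialized and would still need fine-tuning before use.
model = AutoModelForQuestionAnswering.from_pretrained("tiiuae/falcon-7b")
```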
I'll investigate! But also @ananegru, is there a reason you specifically want a CLM for this kind of span-extraction task? The most common approaches for question answering in 2024 are:

1. Extractive QA: fine-tuning an encoder model to predict answer spans in the context (what this script does).
2. Generative QA: providing the context and question to an instruct/chat model and letting it answer directly.
The second option is harder to fine-tune for, but the base accuracy will be very high if you use a state-of-the-art chat model like LLaMA-3, DBRX, Mixtral or Command-R.
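For illustration, minimal sketches of both approaches using the pipeline API (the model names are examples, not recommendations from this thread):

```python
from transformers import pipeline

# Option 1: extractive QA with an encoder fine-tuned for span extraction.
extractive = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)
result = extractive(
    question="What is Falcon?",
    context="Falcon is a causal language model released by TII.",
)
print(result["answer"])

# Option 2: generative QA, asking a chat model directly
# (assumes a recent transformers version with chat support in pipelines).
generative = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
)
messages = [{
    "role": "user",
    "content": "Context: Falcon is a causal language model released by TII.\n"
               "Question: What is Falcon? Answer using only the context.",
}]
out = generative(messages, max_new_tokens=32)
print(out[0]["generated_text"])  # full chat, ending with the model's reply
```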
System Info
transformers version: 4.41.0.dev0

Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I'm using the run_qa.py script from the HF Transformers question-answering examples to fine-tune the large language model Falcon on the SQuAD dataset:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering
The following parameters are being applied, taken from the same repository linked above:
```
python run_qa.py \
  --model_name_or_path tiiuae/falcon-7b \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /home/anegru/Test_Folder/Unqover/unqover/fine_tuning_output
```
And I'm running into the following ValueError that I would like some help with solving:
```
Traceback (most recent call last):
  File "/gpfs/home3/anegru/Test_Folder/Unqover/unqover/fine_tuning/fine_tune.py", line 725, in <module>
    main()
  File "/gpfs/home3/anegru/Test_Folder/Unqover/unqover/fine_tuning/fine_tune.py", line 491, in main
    train_dataset = train_dataset.map(
  File "/home/anegru/anaconda3/envs/py39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/anegru/anaconda3/envs/py39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/anegru/anaconda3/envs/py39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/anegru/anaconda3/envs/py39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/anegru/anaconda3/envs/py39/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/gpfs/home3/anegru/Test_Folder/Unqover/unqover/fine_tuning/fine_tune.py", line 438, in prepare_train_features
    cls_index = input_ids.index(tokenizer.cls_token_id)
ValueError: None is not in list
```
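For anyone hitting the same thing, the failing line can be reproduced in isolation (a sketch, assuming only transformers is installed):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
print(tokenizer.cls_token_id)  # None: Falcon's tokenizer defines no CLS token

input_ids = tokenizer("some text")["input_ids"]
# list.index(None) is what raises "ValueError: None is not in list"
cls_index = input_ids.index(tokenizer.cls_token_id)
```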
Expected behavior
The script trains Falcon on the SQuAD dataset and saves the fine-tuned model to the output folder.