
Adding Language specific validation sets to deepspeed #1

Open
hadyelsahar opened this issue Sep 8, 2021 · 4 comments

hadyelsahar commented Sep 8, 2021

The idea of this issue is to modify the Megatron-DeepSpeed repository code that we use for training all models, in order to track the progress of validation loss on several validation sets separately. This would allow us to track training progress independently for each language.

Currently, the validation loss is calculated on a single validation set that includes the same language combination as the training data (see here: 13B-param model training on TensorBoard).

[screenshot: aggregate validation loss curve on TensorBoard]
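The goal above can be sketched as follows. This is a minimal, hypothetical illustration (the names `valid_sets`, `compute_loss`, and the metric-key format are placeholders, not Megatron-DeepSpeed APIs): instead of one aggregate validation loss, compute a loss per language and give each its own metric key, so each language gets its own curve in TensorBoard.

```python
def per_language_losses(valid_sets, compute_loss):
    """valid_sets: dict mapping language name -> iterable of batches.
    compute_loss: callable(batch) -> float loss for that batch.
    Returns one averaged loss per language, keyed for separate logging."""
    losses = {}
    for lang, batches in valid_sets.items():
        total, count = 0.0, 0
        for batch in batches:
            total += compute_loss(batch)
            count += 1
        # e.g. "lm loss validation/en" becomes its own scalar curve
        losses[f"lm loss validation/{lang}"] = total / max(count, 1)
    return losses
```

Each resulting key would then be logged as a separate TensorBoard scalar, rather than averaging everything into one number.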

Useful pointers

  • How datasets are loaded in model pre-training here
  • Dataset loader for GPT here
  • Validation step execution here

Progress

@hadyelsahar hadyelsahar changed the title Adding Language specific validation set to deepspeed Adding Language specific validation sets to deepspeed Sep 8, 2021

sbmaruf commented Sep 8, 2021

I can review/implement this part.

@hadyelsahar hadyelsahar self-assigned this Sep 8, 2021
@lintangsutawika

My current understanding is that in training.py, the train, validation, and test datasets are loaded from the function build_train_valid_test_data_iterators.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L123-L136

Evaluation is then done here, both for valid_data_iterator and test_data_iterator.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L152-L166

We could set valid_data_iterator to be a collection of per-language data loaders and call evaluate_and_print_results iteratively for each language.

for each_language_data_loader in valid_data_iterator:
    evaluate_and_print_results(
        prefix, forward_step_func,
        each_language_data_loader,
        model,
        eval_metric
    )

Some modification to evaluate_and_print_results will be required so that we save each validation metric for each language.
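That modification could look roughly like the following sketch. All names here (`evaluate_per_language`, the `evaluate` callable, the dict-of-iterators shape) are hypothetical stand-ins, not the actual Megatron-DeepSpeed signatures: the point is only that each language's loss gets its own prefix, so the metrics stay separate downstream.

```python
def evaluate_per_language(valid_iterators, evaluate):
    """valid_iterators: dict of language name -> data iterator.
    evaluate: callable(data_iterator) -> float validation loss."""
    results = {}
    for lang, it in valid_iterators.items():
        loss = evaluate(it)
        # Distinct prefix per language so each loss is reported
        # (and plotted) as its own metric
        prefix = f"validation loss ({lang})"
        results[prefix] = loss
        print(f"{prefix}: {loss:.4f}")
    return results
```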

@hadyelsahar

Currently the code base yields a single validation/test set. There is no support for arguments specifying multiple validation datasets.

My ad-hoc solution is to add an extra argument:

  --extra-valid-data-path [EXTRA_VALID_DATA_PATH ...]
Path to extra validation datasets to be monitored during training. Accepted formats:
1) a single data path,
2) multiple datasets forming a single validation set, in the form: data1-weight data1-path data2-weight data2-path,
3) multiple validation sets, each in the form of (2), separated by commas: data1-weight data1-path data2-weight data2-path, data3-weight data3-path data4-weight data4-path ...

The idea here is to allow mixing different validation sets on the fly

python pretrain_gpt2.py … --extra-valid-data-path 0.5 en_data, 0.5 fr_data, 0.33 rare1_data 0.33 rare2_data 0.33 rare3_data
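One way to parse that argument value is sketched below. This is a hypothetical helper (`parse_extra_valid_data_path` is not from the linked PR), assuming the tokens arrive as an argparse `nargs='*'` list where a trailing comma on a path token closes one validation-set group:

```python
def parse_extra_valid_data_path(tokens):
    """tokens: list like ['0.5', 'en_data,', '0.5', 'fr_data,',
    '0.33', 'rare1_data', '0.33', 'rare2_data', ...].
    Returns a list of validation sets, each a list of (weight, path)."""
    groups, current = [], []
    for tok in tokens:
        ends_group = tok.endswith(",")
        current.append(tok.rstrip(","))
        if ends_group:
            # A trailing comma closes the current validation-set group
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    # Pair up alternating weight/path tokens within each group
    return [
        [(float(g[i]), g[i + 1]) for i in range(0, len(g), 2)]
        for g in groups
    ]
```

With the command-line example above, this yields three validation sets: one for en_data, one for fr_data, and one mixing the three rare-language datasets with weight 0.33 each.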

any thoughts about a better design?

@hadyelsahar

work in progress PR sent here: bigscience-workshop/Megatron-DeepSpeed#97

haileyschoelkopf referenced this issue in haileyschoelkopf/multilingual-modeling May 9, 2022