
[BERT/PyTorch] How can we use create_datasets_from_start.sh for BERT pretraining #1359

Open
Druva24 opened this issue Oct 9, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@Druva24
Druva24 commented Oct 9, 2023

Related to Model/Framework(s)

BERT/PyTorch

In the README.md of https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT it is mentioned that running create_datasets_from_start.sh will generate the pretraining dataset for BERT. However, whenever I try to run the given shell script, it results in an error: `download_wikipedia: command not found`. I believe this is happening because lddl has moved to a different repo: https://github.com/NVIDIA/LDDL. If that is the reason, what steps do I need to take in order to generate a pretraining dataset? Also, do we need sudo privileges to run lddl on a Slurm cluster? We are using Slurm and I don't have sudo privileges; if lddl requires them, is there an alternative?

@Druva24 Druva24 added the bug Something isn't working label Oct 9, 2023
@sanjeebtiwary
The script likely assumes that certain dependencies, including the `download_wikipedia` command, are already available in your environment. That command comes from lddl, which has since moved to the separate NVIDIA/LDDL repository you linked, so the script's own setup steps probably no longer install it.
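A quick way to check, and a possible fix that needs no sudo, is a user-level pip install. This is a sketch only: the git URL and the `download_wikipedia` entry-point name are assumptions based on the NVIDIA/LDDL repository linked above, not verified against the lddl docs.

```shell
# Check whether the lddl CLI that create_datasets_from_start.sh calls is on PATH;
# if not, suggest a user-level install (no sudo required, so it should also work
# on a Slurm login node). The git URL below is an assumption based on the repo
# linked in this issue.
if command -v download_wikipedia >/dev/null 2>&1; then
  echo "download_wikipedia: found"
else
  echo "download_wikipedia: missing"
  echo "try: pip install --user git+https://github.com/NVIDIA/LDDL.git"
fi
```

`pip install --user` puts console scripts under `~/.local/bin`, so that directory may also need to be added to `PATH` before the script can find the command.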
