
Large Dataset Fills up RAM (Enable "streaming" parameter to load_dataset) #331

ddemillard opened this issue Aug 17, 2023 · 0 comments

Hi, thank you for creating such a great wrapper around Hugging Face. It makes getting started very easy and is simple to use.

I am running into a problem while training a text classification model ("roberta-base") on a fairly large dataset (over 20 million text paragraphs; the CSV file is about 1 GB on disk). My workstation has a pretty hefty 256 GB of RAM, so I can generally load most datasets into memory, and in the time I have been working with this library that has not been an issue. But when I run

happy_tc.train()

The RAM usage blows up during the "Preprocessing dataset..." stage and eventually runs out and the kernel crashes.
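For reference, my setup is roughly the following (the file path and num_labels are placeholders rather than my real values; the CSV has "text" and "label" columns):

from happytransformer import HappyTextClassification

# Placeholder sketch of the training call that triggers the RAM blow-up.
happy_tc = HappyTextClassification(model_type="ROBERTA", model_name="roberta-base", num_labels=2)
happy_tc.train("train.csv")  # memory climbs during "Preprocessing dataset..." until the kernel crashes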

I'm not entirely sure why this happens: the dataset is only 1 GB on disk, so even with a substantial 100x blow-up during preprocessing, everything should still fit in memory.

Regardless, I think the problem is ultimately that Hugging Face's load_dataset loads everything into memory by default, whereas passing streaming=True avoids this: https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt
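For illustration, something along these lines (the "csv" builder and file path are just placeholders for however the library loads the training file internally):

from datasets import load_dataset

# Streaming load: returns an IterableDataset that yields rows lazily instead of
# materializing the whole file, so memory use stays roughly constant.
streamed = load_dataset("csv", data_files="train.csv", split="train", streaming=True)

# Preview a few rows without pulling the full dataset into RAM.
for example in streamed.take(3):
    print(example)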

Is there a way to pass this parameter with the current configuration? If not, would you consider exposing it via the training args?

Otherwise, do you have any other ideas about what might be causing this problem?

Thanks
