
Large Dataset Fills up RAM (Enable "streaming" parameter to load_dataset) #331

ddemillard opened this issue Aug 17, 2023 · 0 comments

Hi, thank you for creating such a great wrapper around Hugging Face. It makes getting started very easy and is simple to use.

I am running into a problem while training a text classification model ("roberta-base") on a fairly large dataset (over 20 million text paragraphs; the CSV file is about 1 GB on disk). My workstation has a pretty hefty 256 GB of RAM, so I can generally load most datasets into memory, and in the time I have been working with this library that has not been an issue. But when I run

happy_tc.train()

The RAM usage blows up during the "Preprocessing dataset..." stage and eventually runs out and the kernel crashes.
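For reference, my setup is roughly the following (the file path and num_labels are placeholders rather than my real values; the CSV has "text" and "label" columns):

from happytransformer import HappyTextClassification

# Placeholder sketch of the training call that triggers the RAM blow-up.
happy_tc = HappyTextClassification(model_type="ROBERTA", model_name="roberta-base", num_labels=2)
happy_tc.train("train.csv")  # memory climbs during "Preprocessing dataset..." until the kernel crashes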

I'm not entirely sure why this happens: the dataset is only 1 GB on disk, so even with a substantial 100x blow-up during preprocessing, everything should still fit in memory.

Regardless, I think the problem is ultimately that Hugging Face's load_dataset loads everything into memory by default, whereas passing streaming=True avoids this: https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt
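For illustration, something along these lines (the "csv" builder and file path are just placeholders for however the library loads the training file internally):

from datasets import load_dataset

# Streaming load: returns an IterableDataset that yields rows lazily instead of
# materializing the whole file, so memory use stays roughly constant.
streamed = load_dataset("csv", data_files="train.csv", split="train", streaming=True)

# Preview a few rows without pulling the full dataset into RAM.
for example in streamed.take(3):
    print(example)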

Is there a way to pass this parameter with the current configuration? If not, would you consider exposing it via the training args?

Otherwise, do you have any other ideas about what might be causing this problem?

Thanks
