massive read only cache, missing something obvious? #1507
Comments
Have you considered using native multiprocessing.shared_memory?
Joblib is mostly targeted at simpler use cases of embarrassingly parallel jobs, and requirements such as shared resources diverge from this initial goal. That said, we could consider more generic APIs if this kind of feature becomes more and more requested.
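The suggestion above could be sketched as follows: copy the cache into a `multiprocessing.shared_memory` block once in the parent, then have each worker attach a zero-copy NumPy view by name. The function names and array contents here are illustrative, not part of any joblib API.

```python
import numpy as np
from multiprocessing import shared_memory


def create_shared_cache(data: np.ndarray) -> shared_memory.SharedMemory:
    """Copy `data` into a shared memory block once, in the parent process."""
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data  # one-time copy into shared memory
    return shm  # keep a reference so the block stays alive


def attach_cache(name: str, shape, dtype):
    """Attach to the existing block from a worker; no copy, no unpickling."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


if __name__ == "__main__":
    cache = np.arange(1_000_000, dtype=np.float64)
    shm = create_shared_cache(cache)
    try:
        # A worker would receive only (shm.name, cache.shape, cache.dtype),
        # which are cheap to pickle, instead of the multi-GB cache itself.
        worker_shm, view = attach_cache(shm.name, cache.shape, cache.dtype)
        assert view[123] == 123.0  # reads go straight to shared memory
        worker_shm.close()
    finally:
        shm.close()
        shm.unlink()  # free the block once the last process is done
```

The trade-off is manual lifetime management: the parent must `unlink()` the block when all workers are finished, and every process that attaches must `close()` its handle.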
I have workers that require access to a read-only cache several GB in size to do various things. When I left the cache as a global variable, joblib was very slow, so I started loading it from pickle on each spawn.
This improved performance dramatically, but it still loads the data on each spawn!
Admittedly, it is probably coming from OS-level cached I/O in memory (so disk reads are mostly skipped), but it still has to unpickle, there is some I/O overhead, and memory usage is multiplied across workers.
Is there a way to directly access a shared read-only object without the serialization/deserialization?
I thought this would be a first-class use case, and maybe it is so obvious that it doesn't get much documentation.
I tried passing around a memory object, but that didn't work, and the documentation doesn't mention this as a use case.
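For the read-only-cache pattern described above, one relevant joblib feature is its memmapping of large NumPy arrays passed to workers (controlled by the `max_nbytes` and `mmap_mode` parameters of `Parallel`): the array is dumped to disk once and workers open a read-only memory map instead of each receiving a pickled copy. A minimal sketch, with illustrative sizes and threshold:

```python
import numpy as np
from joblib import Parallel, delayed


def lookup(cache, idx):
    # With memmapping active, `cache` arrives in the worker as a
    # read-only np.memmap backed by a shared file, not a fresh copy.
    return float(cache[idx])


if __name__ == "__main__":
    cache = np.arange(10_000_000, dtype=np.float64)  # ~80 MB
    # Arrays larger than max_nbytes are memmapped read-only for workers.
    results = Parallel(n_jobs=2, max_nbytes="1M", mmap_mode="r")(
        delayed(lookup)(cache, i) for i in (0, 42, 99)
    )
    print(results)  # [0.0, 42.0, 99.0]
```

This avoids per-worker unpickling and duplicated resident memory for plain NumPy arrays, though it does not help for arbitrary Python objects, which still go through pickle.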