Enhancement: Provide option to modify cache folder for entity linker knowledge base downloads #415

Open
davidshumway opened this issue Jan 26, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@davidshumway

CACHE_ROOT = Path(os.getenv("SCISPACY_CACHE", str(Path.home() / ".scispacy")))

For Google Colab users, Path.home() is /root/, which is wiped whenever the runtime is cleared. Since runtimes are cleared fairly often, the KBs have to be re-downloaded each time. Perhaps there is a way to alter what pathlib's Path.home() returns? Another option is to let the user specify a cache folder, which Colab users could point at their Google Drive (fwiw, Python within Colab sees it as just a regular folder), making the download permanent.

@davidshumway davidshumway changed the title Enhancement: Provide option to modify cache folder entity linker knowledge base downloads Enhancement: Provide option to modify cache folder for entity linker knowledge base downloads Jan 26, 2022
@dakinggg
Collaborator

I think you actually can do this, although admittedly I have not tried it. Can you try setting the SCISPACY_CACHE environment variable to whatever folder you want to use, before importing the library? It is used on this line:

CACHE_ROOT = Path(os.getenv("SCISPACY_CACHE", str(Path.home() / ".scispacy")))
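
A minimal sketch of that suggestion, untested against scispacy itself. The Drive path is only an example, and `resolve_cache_root` is a hypothetical helper that mirrors the quoted line to show which folder would be picked up; it is not part of the library.

```python
import os
from pathlib import Path


def resolve_cache_root() -> Path:
    # Hypothetical helper: same expression as the CACHE_ROOT line quoted above.
    return Path(os.getenv("SCISPACY_CACHE", str(Path.home() / ".scispacy")))


# Set the variable first, in a fresh runtime, before scispacy is imported.
# The path is an assumed example of a mounted Google Drive folder.
os.environ["SCISPACY_CACHE"] = "/content/gdrive/MyDrive/test/.scispacy"

cache_root = resolve_cache_root()
print(cache_root)

# Only import scispacy after this point, so that it sees the variable.
```

The ordering is the point here: if the library reads the variable when it is imported, setting it afterwards would have no effect in that session.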

@davidshumway
Author

Makes sense.

So it seems to pretty much be working with a bit of a workaround.

The files are initially cached to /root/.scispacy/datasets/.

After caching, move the cache folder to a permanent folder on Google Drive:

!mv /root/.scispacy/ /content/gdrive/MyDrive/test/
!ls /content/gdrive/MyDrive/test/.scispacy/
>>> datasets

To update the environment variable, as described:

import os
os.environ['SCISPACY_CACHE'] = '/content/gdrive/MyDrive/test/.scispacy/'

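However, this alone is not enough: the cached files are not found and get downloaded again, presumably because the library was already imported in the session and CACHE_ROOT had already been set. For the new environment variable to take effect, it's necessary to restart the runtime: Runtime->Restart runtime.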

Now when running the entity linker, it will see the permanently cached files.
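
For reference, the move-then-set steps above could be rolled into one notebook cell along these lines. This is only a sketch: `persist_cache` is a hypothetical helper, not part of scispacy, and it assumes the cell runs right after a runtime restart, before scispacy is imported.

```python
import os
import shutil
from pathlib import Path


def persist_cache(src: Path, dst: Path) -> Path:
    """Move an existing cache folder to dst (once) and point SCISPACY_CACHE at it.

    Hypothetical helper for illustration only.
    """
    if not dst.exists() and src.exists():
        # First run: relocate the freshly downloaded cache to permanent storage.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
    # Must happen before scispacy is imported in this session.
    os.environ["SCISPACY_CACHE"] = str(dst)
    return dst
```

In Colab this might be called as `persist_cache(Path("/root/.scispacy"), Path("/content/gdrive/MyDrive/test/.scispacy"))`, with Drive already mounted.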

So is an enhancement necessary? It'd definitely be easier and more foolproof to simply add a parameter such as cache_folder to the nlp.add_pipe() method. For example:

nlp.add_pipe(
  "scispacy_linker",
  config={
    "resolve_abbreviations": True,
    "linker_name": "umls",
    "cache_folder": "/content/gdrive/MyDrive/test/"})

which would then be used to look for a subfolder .scispacy, i.e. /content/gdrive/MyDrive/test/.scispacy/ in this case.
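
To make the proposal concrete, here is a rough sketch of how such a parameter might be resolved. Everything here is hypothetical: `cache_folder` does not exist in scispacy today, and `resolve_cache_dir` is an illustrative name, not a library function.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_cache_dir(cache_folder: Optional[str] = None) -> Path:
    """Hypothetical resolution order for the proposed cache_folder option."""
    if cache_folder is not None:
        # Proposed behaviour: look for a .scispacy subfolder in the given folder.
        return Path(cache_folder) / ".scispacy"
    # Otherwise fall back to the existing environment variable / home default.
    return Path(os.getenv("SCISPACY_CACHE", str(Path.home() / ".scispacy")))


print(resolve_cache_dir("/content/gdrive/MyDrive/test/"))
```

With something like this, the folder would be chosen when the pipe is added rather than at import time, so no runtime restart would be needed.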

@dakinggg dakinggg added the enhancement New feature or request label Feb 2, 2022