Training scispacy pipelines requires recreating the vocab file #440

Open · Labels: bug (Something isn't working)

Hammad-NobleAI opened this issue Jul 14, 2022 · 6 comments

@Hammad-NobleAI commented Jul 14, 2022

I'm attempting to use your "en_core_sci_lg" pipeline to extract chemical entities from documents, and then use those entities as the basis for training spaCy's EntityLinker (as shown in this document). Here are the relevant portions of my code:

import spacy
import scispacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_sci_lg")

# ... prepare training data as spaCy specifies, in the form
# [(text, {"links": {(span.start, span.end): {qid: probability}}}), ...]

def create_kb(vocab):
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=200)
    for qid, desc in desc_dict.items():   # desc_dict maps each QID to a text description
        desc_doc = nlp(desc)
        desc_enc = desc_doc.vector   # use the 200-dim doc vector as the entity embedding
        kb.add_entity(entity=qid, entity_vector=desc_enc, freq=342)
    return kb

entity_linker = nlp.add_pipe("entity_linker", config={"incl_prior": False}, last=True)
entity_linker.set_kb(create_kb)

import random
from spacy.util import minibatch, compounding

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "entity_linker"]
with nlp.disable_pipes(*other_pipes):   # train only the entity_linker
    optimizer = nlp.begin_training()   ## ERROR HERE
    for itn in range(500):   # 500 iterations takes about a minute to train on this small dataset
        random.shuffle(TRAIN_DOCS)
        batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001))   # increasing batch size
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,
                annotations,
                drop=0.2,   # prevent overfitting
                losses=losses,
                sgd=optimizer,
            )
        if itn % 50 == 0:
            print(itn, "Losses", losses)   # print the training loss
print(itn, "Losses", losses)

When execution reaches the line commented ## ERROR HERE near the end of the code block, I get the following error:

RegistryError: [E893] Could not find function 'replace_tokenizer' in function registry 'callbacks'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy.copy_from_base_model.v1, spacy.models_and_pipes_with_nvtx_range.v1, spacy.models_with_nvtx_range.v1

I'm running on macOS 12.4, M1 Pro, 16 GB unified memory, with scispacy==0.5.0 and spacy==3.2.4. Are scispacy models compatible with this workflow, or is that something that hasn't been, or won't be, implemented? Thanks in advance!

@dakinggg (Collaborator)

Can you try adding a from scispacy.base_project_code import * to the top of your file?
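
As a minimal sketch, assuming the snippet above is otherwise unchanged, the import just needs to run once, before the call that fails, so that scispacy's registered callbacks (such as replace_tokenizer) are available:

from scispacy.base_project_code import *  # registers scispacy's custom callbacks, e.g. replace_tokenizer
import spacy
import scispacy

nlp = spacy.load("en_core_sci_lg")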

@Hammad-NobleAI (Author)

Thanks for getting back to me. I tried that, and it seems to have gotten past that issue, but it has led to this:

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1249, in Language.begin_training(self, get_examples, sgd)
   1242 def begin_training(
   1243     self,
   1244     get_examples: Optional[Callable[[], Iterable[Example]]] = None,
   1245     *,
   1246     sgd: Optional[Optimizer] = None,
   1247 ) -> Optimizer:
   1248     warnings.warn(Warnings.W089, DeprecationWarning)
-> 1249     return self.initialize(get_examples, sgd=sgd)

File ~/.pyenv/versions/3.10.5/envs/el-demo/lib/python3.10/site-packages/spacy/language.py:1286, in Language.initialize(self, get_examples, sgd)
   1284     before_init(self)
   1285 try:
-> 1286     init_vocab(
   1287         self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"]
   1288     )
...
     23 if require_exists and not location.exists():
---> 24     raise ValueError(f"Can't read file: {location}")
     25 return location

ValueError: Can't read file: project_data/vocab_lg.jsonl

@dakinggg (Collaborator) commented Jul 18, 2022

Ok, I think you are working from an outdated example, because the begin_training function is deprecated (https://spacy.io/api/language#initialize). If you want to write your own training loop, you will probably need to look deeper into how spacy does it in the train CLI. That said, you should use their config system and CLI for training as much as possible; check out the project.yml and the configs at https://github.com/explosion/projects/tree/v3/tutorials/nel_emerson. Finally, I think this is really a question about spacy rather than scispacy, since I suspect you would get similar errors running your script with en_core_web_md, so further questions are probably better directed to the spacy folks. Feel free to reopen if it ends up being scispacy specific.
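
For reference, a rough sketch of how the deprecated calls map onto the spaCy v3 API, assuming the TRAIN_DOCS format from the snippet above. For the entity linker the annotation dicts may also need an "entities" key, and this does not by itself address the vocab file error, so treat it as illustrative rather than a working recipe:

import random
from spacy.training import Example
from spacy.util import minibatch, compounding

def get_examples():
    # spaCy v3 trains on Example objects instead of (text, annotations) tuples
    for text, annotations in TRAIN_DOCS:
        yield Example.from_dict(nlp.make_doc(text), annotations)

optimizer = nlp.initialize(get_examples)  # v3 replacement for nlp.begin_training()
for itn in range(500):
    random.shuffle(TRAIN_DOCS)
    losses = {}
    for batch in minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001)):
        examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in batch]
        nlp.update(examples, drop=0.2, losses=losses, sgd=optimizer)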

dakinggg reopened this on Jul 18, 2022
@dakinggg (Collaborator)

Edit: it looks like the base spacy models don't have this issue, so it is something more specific to scispacy. I think it might still be a question for the spacy folks, but first you should try using the config system and CLI.

@dakinggg (Collaborator)

If it turns out you do just need that vocab file to continue, you can probably recreate it from the en_core_sci_lg model somehow, but you can definitely also just create it the same way that we do. See the convert-lg command in our project.yml.
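
If you do go the recreation route, here is a hedged sketch of dumping lexeme attributes from the packaged model into JSONL. The field names are assumptions based on spaCy's legacy JSONL lexeme format, so check the convert-lg command for the attributes scispacy actually writes:

import json
import spacy

nlp = spacy.load("en_core_sci_lg")
with open("project_data/vocab_lg.jsonl", "w") as f:
    for lex in nlp.vocab:  # iterate over the lexemes cached in the model's vocab
        # "orth"/"prob" follow spaCy's legacy JSONL lexeme format (an assumption; verify against convert-lg)
        f.write(json.dumps({"orth": lex.text, "prob": lex.prob}) + "\n")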

dakinggg added the bug label on Sep 7, 2022
@dakinggg (Collaborator)

See #450 for a workaround.

dakinggg changed the title from "Training custom EL through Spacy's default approach" to "Training scispacy pipelines requires recreating the vocab file" on Sep 16, 2022