Cannot train Arabic models with a custom tokenizer #13248
A third question:
As reported in #7146 (comment), in the discussion Arabic language support, I have now obtained a significant improvement in the scores when training a reduced pipeline (which excludes the parser) with a modified version of my custom tokenizer, tentatively written in Cython; see: https://github.com/gtoffoli/commons-language/tree/master/nlp/spacy_custom/ar.
However, the problem with parser training persists, so I am still unable to train the full pipeline. Could somebody help me fix it?
I decided to use Cython, although I have no experience with it. I enclose the training output below: compared to the training done with the native spaCy tokenizer (see the discussion Arabic language support), the overall score increased from 0.66 to 0.83 (+0.17), and all partial scores improved to varying degrees.
This issue was initially about a possible bug in the training pipeline, related to the parser (see below). I now believe it is more appropriate to ask some preliminary questions first:
__call__ method?
Some context information
In the discussion Arabic language support, in the comment "I'm willing to prototype a spaCy language model for Arabic (SMA)", I reported on the choice of a training set and on the unsatisfactory training results obtained with the native spaCy tokenizer. I then reported on the integration and adaptation of an alternative tokenizer whose output, according to the printout of the debug data command, aligns better with the tokens in the training set (after a minor modification of the training set itself).
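To make "alignment with the tokens in the training set" concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how spaCy's debug data command computes its figures: the function names `token_spans` and `aligned_fraction` are hypothetical, and the Arabic tokens are made-up examples, not taken from my training set. It counts how many tokenizer-produced tokens cover exactly the same characters as a gold token.

```python
def token_spans(tokens):
    """Return the (start, end) character span of each token in the
    concatenation of all tokens (whitespace is ignored)."""
    spans = []
    offset = 0
    for tok in tokens:
        spans.append((offset, offset + len(tok)))
        offset += len(tok)
    return spans


def aligned_fraction(predicted, gold):
    """Fraction of predicted tokens whose character span matches
    a gold token span exactly (hypothetical helper, for illustration)."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("the two tokenizations do not cover the same text")
    gold_spans = set(token_spans(gold))
    pred_spans = token_spans(predicted)
    if not pred_spans:
        return 1.0
    matched = sum(1 for span in pred_spans if span in gold_spans)
    return matched / len(pred_spans)


# Assumed example: the gold annotation splits the proclitic "و" from the verb,
# while a whitespace-style tokenizer keeps them together.
gold = ["و", "قال", "الرجل"]
whitespace_style = ["وقال", "الرجل"]
clitic_aware = ["و", "قال", "الرجل"]

print(aligned_fraction(whitespace_style, gold))
print(aligned_fraction(clitic_aware, gold))
```

A tokenizer that splits clitics the same way as the treebank gets a higher fraction, which is the kind of improvement I mean above.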
With the subsequent comment, in the same discussion, I reported on
Below is an excerpt of the traceback related to the exception (point 1). The full traceback is in the discussion I refer to.
My Environment