Cannot train Arabic models with a custom tokenizer #13248

Open
gtoffoli opened this issue Jan 18, 2024 · 3 comments
Labels
feat / tokenizer Feature: Tokenizer lang / ar Arabic language data and models

Comments

@gtoffoli
Contributor

gtoffoli commented Jan 18, 2024

This issue was initially about a possible bug in the training pipeline, related to the parser (see below). However, I now believe it is more appropriate to start with some preliminary questions:

  • Is it possible to create a completely custom tokenizer that, instead of defining custom rules and overriding a few methods, just redefines the main __call__ method? (See the minimal sketch below.)
  • In that case, where can I find documentation on how the tokenizer should use the Vocab API to populate the vocabulary while tokenizing?
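
For concreteness, this is a minimal sketch of what I have in mind, along the lines of the custom-tokenizer examples in the spaCy documentation; the whitespace split is only a placeholder for real Arabic segmentation:

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Fully custom tokenizer: only __call__ is defined."""

    def __init__(self, vocab):
        # The shared Vocab; building a Doc on it interns the token strings,
        # so the tokens themselves need no explicit StringStore bookkeeping.
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()          # placeholder for real Arabic segmentation
        spaces = [True] * len(words)  # whether each token is followed by a space
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("ar")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
print([t.text for t in nlp("مرحبا بالعالم")])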

Some context information

In the comment I'm willing to prototype a spaCy language model for Arabic (SMA) of the discussion Arabic language support, I reported on the choice of a training set and on the unsatisfactory training results obtained with the native spaCy tokenizer. I then reported on the integration/adaptation of an alternative tokenizer whose output, according to the printout of the debug data command, aligns better with the tokens in the training set (after a minor modification of the training set itself).

In a subsequent comment in the same discussion, I reported on:

  1. an exception raised by a parser-related module of the spaCy training code when executing the train command with the same data and configuration as debug data;
  2. the very poor results (low overall score) obtained with a reduced configuration that excludes the parser.

Below is an excerpt of the traceback related to the exception (point 1). The full traceback can be found in the discussion referenced above.

⚠ Aborting and saving the final best model. Encountered exception:
KeyError("[E900] Could not run the full pipeline for evaluation. If you
specified frozen components, make sure they were already initialized and
trained. Full pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']")
Traceback (most recent call last):
  File "C:\language310\lib\site-packages\spacy\training\loop.py", line 298, in evaluate
    scores = nlp.evaluate(dev_corpus(nlp))
  File "C:\language310\lib\site-packages\spacy\language.py", line 1459, in evaluate
    for eg, doc in zip(examples, docs):
  File "C:\language310\lib\site-packages\spacy\language.py", line 1618, in pipe
    for doc in docs:
  File "C:\language310\lib\site-packages\spacy\util.py", line 1685, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy\pipeline\transition_parser.pyx", line 255, in pipe
  File "C:\language310\lib\site-packages\spacy\util.py", line 1704, in raise_error
    raise e
  File "spacy\pipeline\transition_parser.pyx", line 252, in spacy.pipeline.transition_parser.Parser.pipe
  File "spacy\pipeline\transition_parser.pyx", line 345, in spacy.pipeline.transition_parser.Parser.set_annotations
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 176, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\pipeline\_parser_internals\nonproj.pyx", line 181, in spacy.pipeline._parser_internals.nonproj.deprojectivize
  File "spacy\strings.pyx", line 160, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '8206900633647566924'. This usually refers to an issue with the `Vocab` or `StringStore`."

The above exception was the direct cause of the following exception:
(omitted)

My Environment

  • Operating System: Windows 11
  • Python Version Used: 3.10
  • spaCy Version Used: 3.7
@svlandeg added the lang / ar (Arabic language data and models) and feat / tokenizer (Feature: Tokenizer) labels on Jan 25, 2024
@gtoffoli
Contributor Author

A third question:

  • ... or should I use Cython and try to mimic the standard Tokenizer class, in order to interact more directly with the vocabulary data structures?

@gtoffoli
Contributor Author

gtoffoli commented Feb 7, 2024

As reported in #7146 (comment) of the discussion Arabic language support, I have now obtained a significant improvement in the scores by training a reduced pipeline (which excludes the parser) with a modified version of my custom tokenizer, tentatively written in Cython; see: https://github.com/gtoffoli/commons-language/tree/master/nlp/spacy_custom/ar.
This is an excerpt of the printout:

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  SCORE
---  ------  ------------  -----------  -------------  -------------  -------  -------  ---------  ---------  ------
  0       0          0.00       263.59         263.59         292.87    16.57    33.42      18.74      25.89    0.23
  0     200       3262.11     36358.51       36859.45       49349.43    70.98    83.54      71.36      54.52    0.68
...
 14    8800       7540.39      4534.11        4685.17        3129.70    84.87    90.86      85.10      85.51    0.86
 14    9000       8161.44      4675.95        4845.82        3460.23    84.86    90.76      85.03      85.59    0.86
✔ Saved pipeline to output directory
output\model-last

However, the problem related to parser training persists, so I am still unable to train the full pipeline. Could somebody help me fix it?
More generally, I need suggestions on how to proceed with integrating the custom tokenizer into spaCy.
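
For reference, this is how I would expect the integration to work for config-driven training: register a tokenizer factory in the file passed via --code, and point config.cfg at it. The sketch below is not my actual code; the registry name custom_ar_tokenizer.v1, the class name and the whitespace splitting are placeholders.

# functions.py
import spacy
from spacy.tokens import Doc

class CustomArabicTokenizer:
    """Placeholder tokenizer; __call__ must return a Doc built on the shared vocab."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # to be replaced by the real segmentation logic
        return Doc(self.vocab, words=words, spaces=[True] * len(words))

@spacy.registry.tokenizers("custom_ar_tokenizer.v1")
def create_custom_tokenizer():
    def create_tokenizer(nlp):
        return CustomArabicTokenizer(nlp.vocab)
    return create_tokenizer

config.cfg would then reference it with @tokenizers = "custom_ar_tokenizer.v1" under [nlp.tokenizer].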

@gtoffoli
Contributor Author

gtoffoli commented Feb 9, 2024

I decided to use Cython, although I have no experience with it.
Then, after a lot of googling, I did not come to fully understand the parser problem, but I found useful clues and managed to get around it with a crude patch.
The bug seems related to the fact that, in circumstances occurring more frequently in certain languages, the parser code adds to the dependency annotation some labels that are not present in the training set, in order to "projectivize" otherwise unmanageable dependency trees; however, it forgets to enter these labels in the vocabulary (the StringStore). See, for reference, #7282 and https://support.prodi.gy/t/e018-when-fine-tuning-parser/4650.
For now, the only thing that worked for me was to change the min_action_freq parameter in the parser section of config.cfg, assigning it the value 1 (min_action_freq = 1), as shown in the fragment below.
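
This is a sketch of the relevant fragment of config.cfg, assuming the default v3 parser factory; only min_action_freq differs from the value generated by spacy init config (30, if I remember correctly):

[components.parser]
factory = "parser"
# With the default of 30, labels of rare actions are backed off to a generic
# "dep"; setting it to 1 keeps all labels seen in the training data, which
# avoided the E018/E900 failure in my case.
min_action_freq = 1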

Below is the printout of the train command: compared to the training done with the native spaCy tokenizer (see the discussion Arabic language support), the overall score increased from 0.66 to 0.83 (+0.17), and all partial scores improved to varying degrees.

python -m spacy train config.cfg --code ./functions.py --output ./output --paths.train ./ar_padt-ud-train.spacy --paths.dev ./ar_padt-ud-dev.spacy
ℹ Saving to output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer', 'trainable_lemmatizer', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       263.59         263.59         292.87       579.49    16.57    33.42      18.74      25.89    14.80     4.49     0.17    0.20
  0     200       7090.62     36676.64       37140.17       49966.76     59734.23    70.06    83.29      70.49      52.76    62.21    52.22    31.19    0.64
  0     400      12506.15     21152.08       21428.89       35933.35     42154.89    76.25    87.07      76.56      64.43    67.80    58.52    59.02    0.71
  0     600      13606.38     17619.55       17874.69       27713.22     37639.12    78.65    88.43      78.92      70.97    69.27    60.54    64.39    0.75
  1     800      14572.25     14635.05       14903.07       22296.53     35976.71    80.09    88.77      80.39      75.03    69.67    61.62    67.23    0.76
  1    1000      16195.95     14686.61       14950.42       20141.24     36449.27    81.04    89.37      81.33      77.81    71.95    63.65    75.21    0.78
  1    1200      15651.39     13545.52       13750.79       17281.56     33520.08    81.54    89.72      81.90      79.10    72.17    64.41    66.77    0.79
  2    1400      15020.55     11283.96       11481.52       13772.12     31561.40    82.08    89.72      82.37      79.99    72.47    65.06    64.02    0.79
  2    1600      16437.48     11455.01       11663.24       13211.00     32402.13    82.13    90.01      82.40      81.02    73.59    65.87    75.84    0.80
  2    1800      17595.95     12163.83       12406.75       13397.03     32890.70    83.16    90.43      83.43      82.23    73.79    65.82    61.82    0.81
  3    2000      15506.24      9640.61        9816.51       10129.93     29419.00    83.06    90.37      83.31      82.21    73.72    66.11    54.67    0.81
  3    2200      17806.32     10608.22       10800.75       10703.91     31175.57    83.47    90.28      83.70      82.66    74.04    66.33    56.30    0.81
  3    2400      18095.58     10403.45       10628.51       10472.92     30771.19    83.68    90.33      83.91      83.18    74.21    67.04    58.34    0.81
  4    2600      17475.93      9238.98        9470.86        8978.11     29532.16    83.74    90.51      83.96      83.30    73.89    66.45    55.04    0.81
  4    2800      17691.83      8871.94        9025.45        8373.97     28670.14    83.95    90.72      84.19      83.78    74.84    67.47    64.33    0.82
  4    3000      18221.40      9058.19        9230.65        8602.03     28444.75    83.93    90.57      84.14      84.13    74.24    67.06    55.07    0.82
  5    3200      18954.61      8563.13        8774.97        7757.85     30104.06    84.09    90.65      84.31      83.84    74.27    67.12    55.43    0.82
  5    3400      19013.75      8424.62        8602.31        7607.48     28075.96    84.16    90.71      84.39      84.37    74.72    67.68    58.85    0.82
  5    3600      18708.03      8160.17        8316.78        7379.20     26281.13    84.29    90.64      84.50      84.34    74.65    67.66    58.19    0.82
  6    3800      18836.48      7550.73        7700.54        6759.15     27041.53    84.22    90.68      84.44      84.28    74.73    67.58    58.01    0.82
  6    4000      19369.11      7489.13        7681.20        6373.26     26804.98    84.29    90.67      84.49      84.42    74.51    67.41    55.31    0.82
  6    4200      20735.11      8339.28        8508.93        7176.00     27021.86    84.52    90.77      84.72      84.20    74.99    67.73    58.71    0.82
  7    4400      19234.16      6814.80        6989.01        5826.03     25340.43    84.58    90.80      84.80      84.53    74.91    67.76    57.98    0.82
  7    4600      19522.48      6755.72        6909.24        5606.62     25162.38    84.50    90.75      84.72      84.82    74.89    67.81    57.58    0.82
  7    4800      21583.23      7663.31        7850.63        6424.30     26639.64    84.62    90.85      84.81      84.96    74.93    67.94    56.50    0.82
  8    5000      19851.74      6527.48        6686.77        5153.68     25241.51    84.69    90.92      84.91      85.02    75.23    68.26    60.48    0.82
  8    5200      21314.75      6738.47        6886.67        5426.31     25201.63    84.38    90.65      84.64      84.91    75.19    68.03    57.75    0.82
  8    5400      22795.03      7113.36        7283.54        6061.03     25818.28    84.88    90.99      85.10      85.26    75.20    68.00    58.68    0.83
  9    5600      21136.55      6380.74        6530.58        5050.83     24897.23    84.88    91.01      85.10      85.20    74.90    67.92    56.95    0.82
  9    5800      21765.27      6235.19        6368.10        4876.71     24305.18    84.78    90.84      85.00      85.15    74.81    67.83    58.87    0.82
  9    6000      23302.27      6804.31        6982.51        5352.58     25129.89    84.12    90.29      84.36      85.11    74.95    67.99    60.22    0.82
 10    6200      21450.76      6064.13        6195.06        4605.89     23083.23    84.66    90.76      84.90      85.24    75.69    68.60    59.12    0.83
 10    6400      23464.79      6042.41        6205.93        4687.37     24596.00    84.86    90.91      85.08      85.36    75.70    68.67    59.56    0.83
 10    6600      23860.85      6253.02        6403.01        4964.86     24355.72    84.90    90.85      85.14      85.11    75.09    67.77    57.60    0.82
 11    6800      21873.04      5503.54        5635.16        4272.72     22186.62    84.79    90.79      84.98      85.24    75.32    68.02    56.34    0.83
 11    7000      24376.50      5840.32        5981.43        4420.84     23436.49    84.67    90.57      84.87      85.30    75.01    68.24    57.53    0.82
 11    7200      25574.41      6027.68        6218.32        4768.59     25033.99    85.08    90.93      85.34      85.37    75.45    68.48    56.75    0.83
 12    7400      24154.31      5612.50        5751.63        4241.89     22830.90    85.05    90.81      85.24      85.31    75.32    68.31    56.95    0.83
 12    7600      25775.95      5752.17        5892.24        4301.90     23344.18    84.91    90.74      85.13      85.31    75.54    68.65    61.42    0.83
 12    7800      25384.32      5511.17        5634.93        3967.66     23542.20    85.09    90.95      85.39      85.48    75.44    68.48    59.02    0.83
 13    8000      24808.20      5270.60        5397.93        4106.40     22249.33    85.16    90.92      85.40      85.29    75.10    68.27    61.08    0.83
 13    8200      25455.79      5224.74        5372.79        3757.57     22682.60    85.16    91.00      85.36      85.60    75.74    68.87    59.95    0.83
 13    8400      28854.76      5809.43        5956.61        4528.57     23880.97    85.29    91.09      85.48      85.52    75.55    68.63    59.38    0.83
 14    8600      24888.92      4971.59        5099.44        3654.07     21161.32    84.91    90.78      85.11      85.33    75.22    68.53    58.57    0.83
 14    8800      26581.91      4930.29        5059.93        3622.15     21528.82    84.77    90.77      84.94      85.19    75.38    68.56    58.52    0.83
 14    9000      28893.20      5379.39        5519.86        4138.56     23021.09    85.24    90.94      85.40      85.49    75.65    68.83    59.37    0.83
 15    9200      28123.25      5195.78        5341.69        3837.01     22810.12    85.03    90.88      85.24      85.51    75.74    68.80    62.05    0.83
 15    9400      27938.20      4776.08        4907.61        3526.56     21771.94    85.06    90.89      85.28      85.57    75.53    68.31    60.14    0.83
 15    9600      28987.51      5006.48        5153.82        3878.02     22215.08    85.15    90.84      85.30      85.67    75.88    68.96    59.76    0.83
 16    9800      27973.68      4872.55        5006.78        3581.61     20801.21    85.12    90.75      85.29      85.30    75.52    68.63    61.04    0.83
 16   10000      30470.21      4858.28        5011.76        3536.29     21887.46    85.26    90.99      85.49      85.47    75.36    68.60    55.96    0.83
 16   10200      29581.08      4816.37        4927.27        3477.24     21443.73    85.18    90.99      85.43      85.59    75.47    68.56    56.63    0.83
 17   10400      29462.07      4745.42        4881.74        3489.94     21086.93    85.12    90.94      85.34      85.61    75.89    68.95    59.22    0.83
 17   10600      29006.41      4435.96        4585.45        3114.73     20335.59    85.08    90.87      85.28      85.43    75.22    68.38    59.73    0.83
 17   10800      32378.89      4948.64        5073.06        3616.69     21873.19    85.16    90.84      85.31      85.45    75.54    68.71    58.53    0.83
 18   11000      31757.19      4681.69        4807.56        3437.19     21288.35    85.09    90.87      85.32      85.63    75.25    68.27    55.55    0.83
 18   11200      29111.47      3980.86        4082.18        2811.51     19214.91    85.09    90.97      85.29      85.51    75.19    68.04    58.55    0.83
✔ Saved pipeline to output directory
output\model-last
