AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce --match_threshold (not recommend) #22

Open
ForeverNightmare opened this issue May 19, 2023 · 4 comments

Comments


ForeverNightmare commented May 19, 2023

Hi,
I'm training my model with your framework and got this error:

Number of documents with category indicative terms found for each category is: {0: 9014, 1: 0, 2: 0, 3: 551, 4: 1478, 5: 20642, 6: 0, 7: 7429, 8: 8676, 9: 4814, 10: 1368, 11: 23, 12: 418}
Traceback (most recent call last):
  File "src/train.py", line 66, in <module>
    main()
  File "src/train.py", line 57, in main
    trainer.mcp(top_pred_num=args.top_pred_num, match_threshold=args.match_threshold, epochs=args.mcp_epochs)
  File "/home/xuanw/HL/LOTClass-master/src/trainer.py", line 451, in mcp
    self.prepare_mcp(top_pred_num, match_threshold)
  File "/home/xuanw/HL/LOTClass-master/src/trainer.py", line 392, in prepare_mcp
    assert category_doc_num[i] > 10, f"Too few ({category_doc_num[i]}) documents with category indicative terms found for category {i}; "
AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce --match_threshold (not recommend)

But when I run the sh file again directly (with the dataset dir in the sh file replaced with mine), it runs successfully without any error. Will the result I get be correct, or does the previous error affect this result and make it wrong?

Owner

yumeng5 commented May 19, 2023

Hi,

The error is pretty much explained by the printouts -- for several categories (1, 2, 6) there are 0 documents with category indicative terms (as indicated by the dictionary printed out). So you probably need to add more documents likely to pertain to these categories to the corpus; otherwise, there is no way of training the classifier to detect these categories (and of course, the resulting classifier won't be accurate).
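
As a minimal sketch of the check behind this error, the snippet below applies the same `> 10` condition visible in the traceback to the printed dictionary. The names `MIN_DOCS` and `failing_categories` are hypothetical and introduced only for illustration; they are not part of the LOTClass code.

```python
# Counts copied from the dictionary printed in the original report.
category_doc_num = {0: 9014, 1: 0, 2: 0, 3: 551, 4: 1478, 5: 20642, 6: 0,
                    7: 7429, 8: 8676, 9: 4814, 10: 1368, 11: 23, 12: 418}

# Assumption: mirrors the `category_doc_num[i] > 10` assertion in the traceback.
MIN_DOCS = 10


def failing_categories(counts, min_docs=MIN_DOCS):
    """Return the category ids whose matched-document count is not above min_docs."""
    return sorted(cat for cat, num in counts.items() if num <= min_docs)


print(failing_categories(category_doc_num))  # [1, 2, 6]
```

Category 11 passes the check with only 23 matched documents, so it may also be worth a look even though it does not trigger the assertion.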

Thanks,
Yu

Author

ForeverNightmare commented May 20, 2023

Hi @yumeng5 ,

Thanks for your reply! My training dataset includes about 230,000 documents, and each of my 12 labels has many instances in it, so I'm confused about how "Too few (0) documents with category indicative terms found for category 1" can happen. For label 6, there are 2839 instances in the dataset, but the number of documents with category indicative terms found for 6 is 0. For label 10, there are 808 instances, but the number of documents with category indicative terms found for 10 is 1368, which is more than 808. Label 5 has 6482 instances, but 20642 is shown. Based on your understanding of your work, would you mind speculating on what caused this result?

Owner

yumeng5 commented May 21, 2023

The number of documents found with category indicative terms is derived based on the category vocabulary constructed in the first step and is not directly related to the actual number of instances in that category -- does the category vocabulary make sense for those categories without enough matching documents (e.g., label 1, 2, 6)?

I'd suggest trying different label names (more common and distinctive terms tend to work better) and checking the category vocabulary accordingly.
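
One rough way to compare candidate label names before rerunning is to count how many documents mention each name. The sketch below is only a proxy under that assumption: it does not reproduce how the framework builds its category vocabulary, and `label_name_doc_freq` and the toy corpus are hypothetical, not taken from LOTClass or from this issue.

```python
import re


def label_name_doc_freq(docs, label_names):
    """Count, for each label name, how many documents contain it as a whole word."""
    freq = {name: 0 for name in label_names}
    for doc in docs:
        # Lowercased alphanumeric tokens; a set so each document counts once.
        tokens = set(re.findall(r"[a-z0-9]+", doc.lower()))
        for name in label_names:
            if name.lower() in tokens:
                freq[name] += 1
    return freq


# Toy corpus for illustration only; not data from this issue.
docs = [
    "The team won the football match.",
    "Stocks fell as the market closed.",
    "A new football season starts soon.",
]
print(label_name_doc_freq(docs, ["football", "market", "politics"]))
# {'football': 2, 'market': 1, 'politics': 0}
```

A label name that appears in very few documents (like "politics" here) is a candidate for replacement with a more common term.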

Thanks,
Yu

@ForeverNightmare
Author

@yumeng5 Thanks for your suggestions! I have now started training on a new dataset and ran into a new issue. I set the parameters like this:
MCP_EPOCH=20
SELF_TRAIN_EPOCH=10

But the output shows that self-training was only executed 2 times:
100%|██████████| 226/226 [01:41<00:00, 2.22it/s]lr: 9.929e-07
Average training loss: 0.10797090083360672
Test acc: 0.7305699586868286
lr: 8.905e-07
Average training loss: 0.11300306767225266
Test acc: 0.7253885865211487
Saving final model to datasets/movies/final_model.pt

What might cause this? I didn't set the early stop parameter in the .sh file, so it should be false.
