Lists implementation for fulltext model #429
base: master
Conversation
…o add the rest of the war packaging components
… but pretty sure that they are proper figures) - making sure that CrossrefClient does not prevent JVM from exiting
…ction - extracting standalone figures (for which we didn't detect captions,…
update the links for INIST and TEI in the documentation
…ken, when a \n is encountered kermitt2#180
[wip] build docker from local source
Backslashes in URLs were being passed through verbatim into JSON for the reference annotations, resulting in invalid JSON output. This was because the JSON was being built via string concatenation, without any escaping. This switches to using Jackson instead, to ensure the JSON is valid and properly escaped.
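A minimal sketch of the escaping problem described above (the class and method names here are illustrative, not GROBID's actual code; the PR's fix switches to Jackson rather than a hand-rolled escaper):

```java
public class JsonEscapeDemo {

    // Naive string concatenation: a backslash in the URL passes through
    // verbatim, so the emitted JSON contains an invalid escape sequence.
    static String naiveJson(String url) {
        return "{\"url\": \"" + url + "\"}";
    }

    // Minimal escaping of the two characters that most often break
    // hand-built JSON strings; a real serializer such as Jackson also
    // handles control characters and other escapes.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '"') sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    static String safeJson(String url) {
        return "{\"url\": \"" + escape(url) + "\"}";
    }

    public static void main(String[] args) {
        String url = "http://example.com/path\\with\\backslashes";
        System.out.println(naiveJson(url)); // invalid JSON: contains \w, \b escapes
        System.out.println(safeJson(url));  // valid JSON: backslashes doubled
    }
}
```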
Fix BibDataSetContextExtractor to quote replacement text
Fix JSON generation for reference annotations
Add workaround for Java version to Troubleshooting
make it build in IntelliJ
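For context on the "quote replacement text" commit above: in Java, `Matcher.replaceAll` treats `\` and `$` in the replacement string specially (`$N` is a capture-group reference), so replacement text taken from document content must be made literal with `Matcher.quoteReplacement`. A minimal illustration, not GROBID's actual code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteReplacementDemo {

    // Replace the "REF" marker with arbitrary text, quoting the text so
    // that '$' and '\' inside it are taken literally.
    static String quoted(String input, String replacement) {
        return Pattern.compile("REF")
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Replacement text coming from a document may contain '$'.
        String replacement = "costs $5";

        // Unquoted, "$5" is parsed as a reference to capture group 5,
        // which does not exist, and replaceAll throws at runtime.
        try {
            Pattern.compile("REF").matcher("see REF").replaceAll(replacement);
        } catch (RuntimeException e) {
            System.out.println("unquoted replacement threw: " + e);
        }

        // Quoted, the text is inserted verbatim.
        System.out.println(quoted("see REF", replacement)); // see costs $5
    }
}
```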
Hi @kermitt2, can I do something more to help with the lists implementation?
Hello @Vitaliy-1! I am sorry for the time I am taking to react to your PR. I think I excluded lists at some point because the recognition was not reliable and was messing up the final TEI body serialization. In the meantime, the serialization of the body has improved. Apart from that, you also fixed the TEI serialization for lists, so there would be nothing special to do beyond checking whether the accuracy is acceptable.
I made some tests with
So for the moment there is still not enough training data to include lists during training in the master release. As you saw, it was already implemented a while ago but not included. The bottleneck is adding more training data: the list-item label is very unbalanced, so it requires significantly more training data before it can be included.
I will do additional tests with some more non-public training data for the fulltext model; it will give an idea of how much training data is necessary for an acceptable accuracy on list items.
Thanks, @kermitt2! Let me know about the acceptable amount of training data for lists.
Hi @Vitaliy-1! I had to relaunch my evaluation on the larger training set twice because of an unexpected reboot, sorry. Actually, in the second training I obtained slightly better results than the previous ones. My extended training set has ~70 additional annotated document bodies; I made a train/eval split at 80%:
So list recognition is precise here, but recall is low, which is quite usual when training data is lacking. To get a more accurate picture, I would need to do a 10-fold training and average the scores, but I don't have enough free CPU available right now. This shows that the addition of the labels is technically OK, but we would really need more public training data to use this in practice.
Hi @kermitt2, thanks for checking how it looks on a bigger training set. I'll look into how much additional training data we can provide. Do you already have a set with annotated lists? If a lack of free CPUs is the only issue, that is something I can ask to be provided.
Do you mean in the existing training data? There are a few documents with lists. Or do you mean some XML full text, free to reuse, with lists? The CPU would be just for getting a more accurate evaluation; it's not an issue.
Yes, I mean training data. And yes, it's quite time-consuming to produce it :) I'll look at how much additional annotated data with lists we can provide.
Hi @kermitt2, can you explain the mechanism for measuring accuracy, precision, and recall for models?
Basically it uses the usual format for sequence labeling:
for comparing the expected labels with those produced by the model, the format becomes
this goes through the
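As a rough sketch of the comparison step (a hypothetical token-level scorer, not GROBID's actual evaluation code, which also reports field-level scores): each row carries a token with its expected label and, in evaluation mode, the predicted label in the last column; precision and recall for a label such as `<item>` are then computed over those two columns.

```java
import java.util.List;

public class LabelEval {

    // rows: {token, expectedLabel, predictedLabel}, mirroring the last
    // two columns of the sequence-labeling evaluation format.
    static double[] precisionRecall(List<String[]> rows, String label) {
        int tp = 0, fp = 0, fn = 0;
        for (String[] r : rows) {
            boolean expected = r[1].equals(label);
            boolean predicted = r[2].equals(label);
            if (expected && predicted) tp++;
            else if (predicted) fp++;
            else if (expected) fn++;
        }
        double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall    = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"First",  "<item>",      "<item>"},      // true positive
            new String[] {"item",   "<item>",      "<paragraph>"}, // false negative
            new String[] {"Some",   "<paragraph>", "<paragraph>"},
            new String[] {"text",   "<paragraph>", "<paragraph>"}
        );
        double[] pr = precisionRecall(rows, "<item>");
        // High precision, low recall: the pattern discussed above.
        System.out.printf("precision=%.2f recall=%.2f%n", pr[0], pr[1]);
    }
}
```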
Hi, just wondering what the plan is with this PR?
Nowadays, lots of datasets for LLMs are published on Hugging Face.
Hi @kermitt2,
I've noticed that list items are excluded from being labeled by the fulltext model. Are you interested in having them implemented? Or maybe you are considering putting them into a separate model, like figures?
In this PR, model.wapiti is trained with the default corpus.