Lists implementation for fulltext model #429
base: master
Conversation
…o add the rest of the war packaging components
… but pretty sure that they are proper figures) - making sure that CrossrefClient does not prevent JVM from exiting
…ction - extracting standalone figures (for which we didn't detect captions,…
update the links for INIST and TEI in the documentation
…ken, when a \n is encountered kermitt2#180
[wip] build docker from local source
Backslashes in URLs were being passed through verbatim into JSON for the reference annotations, resulting in invalid JSON output. This was because the JSON was being built via string concatenation, without any escaping. This switches to using Jackson instead, to ensure the JSON is valid and properly escaped.
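A minimal sketch of the escaping problem described above (the class and method names here are illustrative, not GROBID's actual code; the PR's fix switches to Jackson rather than a hand-rolled escaper):

```java
public class JsonEscapeDemo {

    // Naive string concatenation: a backslash in the URL passes through
    // verbatim, so the emitted JSON contains an invalid escape sequence.
    static String naiveJson(String url) {
        return "{\"url\": \"" + url + "\"}";
    }

    // Minimal escaping of the two characters that most often break
    // hand-built JSON strings; a real serializer such as Jackson also
    // handles control characters and other escapes.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '"') sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    static String safeJson(String url) {
        return "{\"url\": \"" + escape(url) + "\"}";
    }

    public static void main(String[] args) {
        String url = "http://example.com/path\\with\\backslashes";
        System.out.println(naiveJson(url)); // invalid JSON: contains \w, \b escapes
        System.out.println(safeJson(url));  // valid JSON: backslashes doubled
    }
}
```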
Fix BibDataSetContextExtractor to quote replacement text
Fix JSON generation for reference annotations
Add workaround for Java version to Troubleshooting
make it build in IntelliJ
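For context on the "quote replacement text" commit above: in Java, `Matcher.replaceAll` treats `\` and `$` in the replacement string specially (`$N` is a capture-group reference), so replacement text taken from document content must be made literal with `Matcher.quoteReplacement`. A minimal illustration, not GROBID's actual code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteReplacementDemo {

    // Replace the "REF" marker with arbitrary text, quoting the text so
    // that '$' and '\' inside it are taken literally.
    static String quoted(String input, String replacement) {
        return Pattern.compile("REF")
                .matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Replacement text coming from a document may contain '$'.
        String replacement = "costs $5";

        // Unquoted, "$5" is parsed as a reference to capture group 5,
        // which does not exist, and replaceAll throws at runtime.
        try {
            Pattern.compile("REF").matcher("see REF").replaceAll(replacement);
        } catch (RuntimeException e) {
            System.out.println("unquoted replacement threw: " + e);
        }

        // Quoted, the text is inserted verbatim.
        System.out.println(quoted("see REF", replacement)); // see costs $5
    }
}
```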
Hi @kermitt2, can I do something more to help with the lists implementation?
Hello @Vitaliy-1! I am sorry for the time I am taking to react to your PR. I think I excluded lists at some point because the recognition was not reliable and was messing up the final TEI body serialization. In the meantime, the serialization of the body has improved. Apart from that, you also fixed the TEI serialization for lists, so there would be nothing special to do beyond checking whether the accuracy is acceptable.
I made some tests with
So for the moment there is still not enough training data to include lists during training in the master release. As you saw, it was already implemented a while ago but not included. The bottleneck is adding more training data: the list-item label is very unbalanced, so it requires significantly more training data before it can be included.
I will do additional tests with some more non-public training data for the fulltext model; it will give an idea of how much training data is necessary for an acceptable accuracy on list items.
Thanks, @kermitt2! Let me know about the acceptable amount of training data for lists.
Hi @Vitaliy-1! I had to relaunch my evaluation on the larger training set twice because of an unexpected reboot, sorry. Actually, in the second training I obtained slightly better results than the previous ones. My extended training set has ~70 additional annotated document bodies; I made a train/eval split at 80%:
So list recognition is precise here, but recall is low, which is quite usual when training data is lacking. To get a more accurate picture, I would need to do a 10-fold training and average the scores, but I don't have enough free CPU available right now. This shows that the addition of the labels is technically OK, but we would really need more public training data to use this in practice.
Hi @kermitt2, thanks for checking how it looks on a bigger training set. I'll look into how much additional training data we can provide. Do you already have a set with annotated lists? If a lack of free CPUs is the only issue, that is something I can ask to be provided.
Do you mean in the existing training data? There are a few documents with lists. Or do you mean some XML full text, free to reuse, with lists? The CPU would be just for getting a more accurate evaluation; it's not an issue.
Yes, I mean training data. And yes, it's quite time-consuming to produce it :) I'll look at how much additional annotated data with lists we can provide.
Hi @kermitt2, can you explain the mechanism for measuring accuracy, precision, and recall for models?
Basically it uses the usual format for sequence labeling:
for comparing the expected labels with those produced by the model, the format becomes
this goes through the
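As a rough sketch of the comparison step (a hypothetical token-level scorer, not GROBID's actual evaluation code, which also reports field-level scores): each row carries a token with its expected label and, in evaluation mode, the predicted label in the last column; precision and recall for a label such as `<item>` are then computed over those two columns.

```java
import java.util.List;

public class LabelEval {

    // rows: {token, expectedLabel, predictedLabel}, mirroring the last
    // two columns of the sequence-labeling evaluation format.
    static double[] precisionRecall(List<String[]> rows, String label) {
        int tp = 0, fp = 0, fn = 0;
        for (String[] r : rows) {
            boolean expected = r[1].equals(label);
            boolean predicted = r[2].equals(label);
            if (expected && predicted) tp++;
            else if (predicted) fp++;
            else if (expected) fn++;
        }
        double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall    = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"First",  "<item>",      "<item>"},      // true positive
            new String[] {"item",   "<item>",      "<paragraph>"}, // false negative
            new String[] {"Some",   "<paragraph>", "<paragraph>"},
            new String[] {"text",   "<paragraph>", "<paragraph>"}
        );
        double[] pr = precisionRecall(rows, "<item>");
        // High precision, low recall: the pattern discussed above.
        System.out.printf("precision=%.2f recall=%.2f%n", pr[0], pr[1]);
    }
}
```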
Hi, just wondering what the plan is with this PR?
Nowadays, lots of datasets for LLMs are published on Hugging Face.
Hi @kermitt2,
I've noticed that list items are excluded from being labeled by the fulltext model. Are you interested in having them implemented? Or maybe you are considering putting them into a separate model, like figures?
In this PR, model.wapiti is trained with the default corpus.