Add UD POS tagging Results #620

Open
wants to merge 1 commit into
base: master

Conversation

ExplorerFreda
Contributor

Added the UD POS tagging results from Substructure Substitution: Structured Data Augmentation for NLP (Shi et al., Findings of ACL 2021).

@LifeIsStrange
Contributor

LifeIsStrange commented Sep 1, 2022

@ExplorerFreda THANK YOU FOR THIS!!

This does not need to be in this PR, but note that the Penn Treebank recently got a new SOTA: https://paperswithcode.com/paper/sequence-alignment-ensemble-with-a-single
(that hadn't happened since 2018..)
Although, as a reminder, progress on POS tagging accuracy is blocked by the state of the datasets, and as always when it matters in AI research, no one seems to care.. not even Google.
NLP datasets, including the Penn Treebank, contain a high percentage of errors. The thing is, we update models but we do not update the datasets; the Penn Treebank has stayed the same for decades instead of being versioned.
This inept (yet normalized) tragedy is well explained and quantified in https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf

The huge (multiple percentage points) error rate in mainstream datasets, and their lack of evolution, is also demonstrated at https://labelerrors.com/
This has huge consequences, including this unintended one for research:

https://labelerrors.com/about#:~:text=Surprisingly%2C%20we%20find%20lower%20capacity%20models,5%25%20of%20accurately%20labeled%20test%20data.
Surprisingly, we find lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on the ImageNet validation set with corrected labels: ResNet-18 outperforms ResNet-50 if we randomly remove just 6% of accurately labeled test data. On the CIFAR-10 test set with corrected labels: VGG-11 outperforms VGG-19 if we randomly remove just 5% of accurately labeled test data.

  1. Errors in datasets seem to significantly lower/hide the accuracy gains possible via large neural networks. (How much untapped potential does this finding imply?)
  2. Many key NLP/vision tasks are already above 90% accuracy, so the errors they inherit from the dataset labels can often account for more than 50% of all remaining possible accuracy gains (a back-of-the-envelope sketch follows this list). Yet no one works on this, and neural network research is hitting a wall/AI winter, one reason being that better ideas might show worse results or no improvement because of said errors.
  3. There are cascading/compounding effects: since POS tagging is used in key downstream tasks such as dependency parsing, it imposes a lower bound on the accuracy of dependency parsing and most NLP tasks.
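
To make point 2 concrete, here is the promised back-of-the-envelope sketch. The 97% tagger accuracy and 1.5% label-error rate are illustrative assumptions of mine, not figures from the linked papers: if a tagger scores 97% against noisy gold labels and ~1.5% of those labels are wrong, then wrong labels alone can account for up to about half of the measured error.

```python
# Illustrative arithmetic only; the numbers below are assumptions,
# not figures taken from the linked papers.
tagger_accuracy = 0.97      # accuracy measured against the (noisy) gold labels
label_error_rate = 0.015    # assumed fraction of gold labels that are wrong

measured_error = 1 - tagger_accuracy                    # 3% apparent error
share_from_labels = label_error_rate / measured_error   # upper bound
print(f"Up to {share_from_labels:.0%} of the measured error could be due to wrong gold labels")
```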

A common example of an abandonware dataset is WordNet, unlike its open-source successor https://github.com/globalwordnet/english-wordnet, which is much more complete and accurate and evolves as language evolves.
All/most NLP datasets should be forked into an open-source organization and receive funding for improving accuracy. So much money is being directed at pointless or unrealistic goals.. Why don't enterprises fund the evolution of dataset accuracy? Why don't they realize this is the most impactful roadblock towards improving the state of the art and breaking the current NLP AI plateau/semi-winter? My personal answer is that, as often, no one really cares; people just pretend to care, and most actions in AI are virtue signaling, marketing PR, and hype-driven short-termism or unrealism.
It's time for action.

As the paper shows, 84% of errors made by POS taggers are due to Penn Treebank errors.
Also, POS taggers' per-sentence accuracy is extremely bad and blocks any serious NLU ambitions:

It is perhaps more realistic to look at the rate of getting whole sentences right, since a single bad mistake in a sentence can greatly throw off the usefulness of a tagger to downstream tasks such as dependency parsing. Current good taggers have sentence accuracies around 55–57%, which is a much more modest score.

Therefore, improving POS tagging sentence accuracy via a simple, relatively low-cost, paid expert correction of the Penn Treebank would lead to an explosion of possibilities for NLU research and products, as it is the most salient bottleneck.
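
For intuition on why ~97% token accuracy turns into the quoted 55–57% sentence accuracy, here is a rough sketch (assuming, purely for illustration, independent per-token errors and an average sentence length of about 21 tokens; neither assumption holds exactly in practice):

```python
# Rough approximation: treats per-token errors as independent (they are not)
# and assumes an average sentence length of ~21 tokens.
token_accuracy = 0.973
avg_sentence_length = 21

sentence_accuracy = token_accuracy ** avg_sentence_length
print(f"Approximate whole-sentence accuracy: {sentence_accuracy:.1%}")  # ~56%
```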

@ExplorerFreda So my question is: is the UD English POS tag dataset versioned/improved over time, unlike the Penn Treebank?
@sebastianruder Could you raise this problem of non-evolving dataset accuracy inside Google and other powerful circles?
The two papers I linked show how urgent and impactful the problem is, and how actionable the solution is.
