Part of speech identification seems buggy #775

Open
1 task done
daobrien opened this issue Feb 19, 2024 · 2 comments

Comments

@daobrien

Check for existing issues

  • Completed

Environment

Fedora Linux 38
Installed from RPM
vale version 3.0.7

Describe the bug / provide steps to reproduce it

I'm trying to write a rule to identify compound adjectives, which should be hyphenated. For example, in the phrase "the upper left corner", "upper left" should be written as "upper-left".

The rule currently appears as follows:

extends: sequence
message: "Use '%[1]s-%[2]s %[3]s', because '%[1]s-%[2]s' is an adjective."
level: error
tokens:
  - tag: JJ
    pattern: upper|lower
  - tag: JJ
    pattern: left|right
  - tag: NN

We've used several test cases and cannot get consistent results:

The status icons are in the lower left corner.

The status icons are in the upper left corner.

The status icons are in the upper right corner.

The status icons are in the lower right corner.

Vale only catches the last test case.
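One way to narrow down which token's tag causes the misses is a relaxed variant of the rule that drops the `tag` constraint on the first two tokens and matches them by pattern alone. This is an untested sketch; it assumes sequence tokens may specify `pattern` without `tag`, and that the tagger labels the final word as a singular or plural noun.

```yaml
extends: sequence
message: "Use '%[1]s-%[2]s %[3]s', because '%[1]s-%[2]s' is an adjective."
level: error
tokens:
  # Assumption: no tag here, so the match no longer depends on
  # the tagger labelling these words as JJ.
  - pattern: upper|lower
  - pattern: left|right
  # Still require a following noun (singular or plural).
  - tag: NN|NNS
```

If this variant flags all four sentences, the original rule is most likely failing because "upper", "lower", or "left" is tagged as something other than JJ in those contexts.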

We used Vale Studio to test the parts of speech, but the results are inconsistent:

[Image: screenshot of the Vale Studio part-of-speech results]

This blocks further development of this rule for us. Would really appreciate any help.
Thanks.

@jdkato
Member

jdkato commented Feb 25, 2024

Unfortunately, there's no straightforward solution here.

I'd argue that "buggy" is the wrong word here; the results are objectively good. For comparison, NLTK (a very widely used NLP library) gives exactly the same results when using its default tagger.

And when you consider the other constraints Vale has (~20MB binary, offline, no NLP installation dependencies, etc.), the results are very good.

That said, the fact that I had to write my own NLP library to even get this far is obviously not ideal. I've tried a number of ideas to incorporate third-party libraries but it complicates the installation / setup process pretty significantly.

For example, two of the best available libraries just aren't that practical for many of Vale's use cases:

  • CoreNLP (~482 MB download, would require a local Java server).
  • spaCy (~436 MB download, would require a local Python server).

I'm not sure what the solution here is yet, but it's definitely something that I've put a lot of time into trying to improve.
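For this particular case, one possible workaround is to avoid part-of-speech tags altogether and list the four phrases explicitly in a `substitution` rule. The following is a minimal sketch rather than a maintainer-recommended fix; the message wording and the choice of `substitution` are assumptions for illustration.

```yaml
extends: substitution
# The first %s is the suggested replacement, the second is the matched text.
message: "Use '%s' instead of '%s'."
level: error
ignorecase: true
swap:
  upper left: upper-left
  upper right: upper-right
  lower left: lower-left
  lower right: lower-right
```

The trade-off is that this also flags noun uses such as "at the upper left", where no hyphen is wanted, so it swaps missed matches for possible false positives.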

@daobrien
Author

Thanks for your explanation of what's going on. Maybe s/buggy/imperfect/; obviously, getting software perfect is really hard. I can pass all this on to the team that helps me with our Vale setup, but relying on local servers is probably not something they'll get excited about.

Feel free to update the status of this to whatever you deem appropriate.
David
