Part of speech identification seems buggy #775

daobrien · 2024-02-19T12:45:20Z

Check for existing issues

Completed

Environment

Fedora Linux 38
Installed from RPM
vale version 3.0.7

Describe the bug / provide steps to reproduce it

Trying to write a rule to identify compound adjectives, which should be hyphenated. E.g., in the phrase "the upper left corner", "upper left" should be written as "upper-left".

The rule currently appears as follows:

extends: sequence
message: "Use '%[1]s-%[2]s %[3]s', because '%[1]s-%[2]s' is an adjective."
level: error
tokens:
  - tag: JJ
    pattern: upper|lower
  - tag: JJ
    pattern: left|right
  - tag: NN

We've used several test cases and cannot get consistent results:

The status icons are in the lower left corner.

The status icons are in the upper left corner.

The status icons are in the upper right corner.

The status icons are in the lower right corner.

Vale only catches the last test case.

We used Vale Studio to test the parts of speech, but the results are inconsistent:

This blocks further development of this rule for us. Would really appreciate any help.
Thanks.

jdkato · 2024-02-25T03:36:37Z

Unfortunately, there's no straightforward solution here.

I'd argue that "buggy" is the wrong word here; the results are actually objectively good. For comparison, NLTK (a very widely used NLP library) gives exactly the same results when using its default tagger.

And when you consider the other constraints Vale has (~20MB binary, offline, no NLP installation dependencies, etc.), the results are very good.

That said, the fact that I had to write my own NLP library to even get this far is obviously not ideal. I've tried a number of ideas to incorporate third-party libraries but it complicates the installation / setup process pretty significantly.

For example, two of the best available libraries just aren't that practical for many of Vale's use cases:

CoreNLP (~482 MB download, would require a local Java server).
spaCy (~436 MB download, would require a local Python server).

I'm not sure what the solution here is yet, but it's definitely something that I've put a lot of time into trying to improve.

daobrien · 2024-02-26T01:32:06Z

Thanks for your explanation of what's going on. Maybe s/buggy/imperfect/; obviously, getting software perfect is really hard. I can pass all this on to the team who help me with our Vale setup, but relying on local servers is probably not something they'll get excited about.

Feel free to update the status of this to whatever you deem appropriate.
David
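
Since the sequence rule above depends on the tagger labelling both words as JJ, one possible interim approach is a rule that matches the phrases literally and skips part-of-speech tags altogether. The following is only a sketch, not something proposed by anyone in this thread: it assumes a plain `existence` rule is acceptable, the message wording is made up for illustration, and the level is lowered to `warning` because the pattern cannot tell adjective uses from noun uses.

```yaml
# Hypothetical tagger-independent variant of the rule in this issue.
# It matches the two-word phrase as text, so it does not rely on POS tags.
extends: existence
message: "Hyphenate '%s' when it modifies a noun (for example, 'upper-left corner')."
level: warning
ignorecase: true
tokens:
  # Any of: upper left, upper right, lower left, lower right
  - '(?:upper|lower) (?:left|right)'
```

The trade-off is precision: this would flag all four test sentences above, but it would also flag noun uses such as "the icon in the upper left", which the tag-based rule was designed to leave alone.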

daobrien added the Type: Bug label Feb 19, 2024

jdkato added Type: Enhancement and removed Type: Bug labels Feb 25, 2024
