support custom abbreviation #30

Open
wants to merge 1 commit into
base: develop

@krambox krambox commented May 7, 2024

Currently, the abbreviation lists can only be adjusted by editing the files in the data directory. To make SoMaJo usable for domain-specific texts that have their own abbreviations, this PR extends the constructor so that custom abbreviations can be passed in without forking SoMaJo.

@tsproisl
Owner

Thanks, this is something that has been requested a couple of times!

Before I merge it into develop, could you please address the following minor issues?

  • Add a space after the commas
  • Change the default value of custom_abbreviations to None (to avoid mutable default arguments)
  • Check the indentation level in TesttCustomAbbreviation
  • Fix the typo in TesttCustomAbbreviation
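
For context on the None default: a mutable default such as [] is created once, when the function is defined, and is then shared by every call that does not pass its own value. The following is a minimal sketch of the pitfall and the usual fix. The function names collect_buggy and collect_fixed are hypothetical and only illustrate the pattern; they are not part of SoMaJo.

```python
def collect_buggy(abbreviation, custom_abbreviations=[]):
    # The default list is created once at definition time,
    # so every call without an explicit list mutates the same object.
    custom_abbreviations.append(abbreviation)
    return custom_abbreviations


def collect_fixed(abbreviation, custom_abbreviations=None):
    # None is used as a sentinel; a fresh list is created on each call.
    if custom_abbreviations is None:
        custom_abbreviations = []
    custom_abbreviations.append(abbreviation)
    return custom_abbreviations
```

With collect_buggy, entries from earlier calls leak into later ones; with collect_fixed, each call starts from an empty list. The same reasoning would apply to a constructor argument like custom_abbreviations.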

TODOs (intended as reminders to myself) until it can be merged into master and released:

  • Update the docstrings
  • When merging the custom abbreviations with the default list, check for duplicates and sort all abbreviations by length (it’s probably best to pass the custom abbreviations to utils.read_abbreviation_file() as an additional argument and initialize the abbreviations set with them, respecting to_lower)
  • Add an argument custom_single_token_abbreviations for abbreviations that should not be split (corresponding to the single_token_abbreviations_*.txt files)
  • Add the functionality to the command-line interface, e.g. via options that let the user provide custom abbreviation files
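
The merging and command-line items above could be sketched roughly as follows. This is only an illustration under stated assumptions, not SoMaJo's actual code: merge_abbreviations, read_custom_abbreviations, build_parser and the --custom-abbreviations option are hypothetical names, the "longest first" sort order is an assumption, and so is the file format (one abbreviation per line, "#" for comments).

```python
import argparse


def merge_abbreviations(default_abbreviations, custom_abbreviations=None, to_lower=False):
    """Merge default and custom abbreviations, drop duplicates, sort by length."""
    # Start the set with the custom entries; the set removes duplicates.
    merged = set(custom_abbreviations or [])
    merged.update(default_abbreviations)
    if to_lower:
        merged = {abbr.lower() for abbr in merged}
    # Assumption: longest first; ties are broken alphabetically
    # so that the result is deterministic.
    return sorted(merged, key=lambda abbr: (-len(abbr), abbr))


def read_custom_abbreviations(path):
    """Read one abbreviation per line, skipping blank lines and '#' comments."""
    with open(path, encoding="utf-8") as fh:
        stripped = (line.strip() for line in fh)
        return [line for line in stripped if line and not line.startswith("#")]


def build_parser():
    """Hypothetical CLI sketch with an option for a custom abbreviation file."""
    parser = argparse.ArgumentParser(description="Tokenizer CLI sketch")
    parser.add_argument(
        "--custom-abbreviations",
        metavar="FILE",
        default=None,
        help="file with additional abbreviations, one per line",
    )
    return parser
```

A caller could read the file named by --custom-abbreviations with read_custom_abbreviations() and pass the result to merge_abbreviations() together with the default list. An analogous option and argument could cover the single-token abbreviations.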
