Add new GREL function to normalize characters #6608

Open
zyadtaha opened this issue May 16, 2024 · 1 comment
Labels
grel The default expression language, GREL, could be improved in many ways! Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@zyadtaha
Contributor

zyadtaha commented May 16, 2024

It would be easier if OpenRefine allowed searching for words that contain diacritical marks or extended Western characters.
For example, if a data set contains the name Björn Borg, typing Bjorn Borg into a Text filter returns no results.

Proposed solution

Transform the cell's value to a normalized form. This could be done by providing a new built-in GREL function, perhaps called normalize(), that does the following:

  • remove diacritics
  • normalize extended western characters to their ASCII representation

For example:
"gödel".normalize() -> godel
"Villazón".normalize() -> Villazon

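A minimal sketch of the first bullet, written in plain Python 3 rather than GREL. It is not the proposed normalize() implementation and not what FingerprintKeyer does; the function name strip_diacritics is hypothetical. It assumes that "remove diacritics" means decomposing each character and dropping the combining marks:

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    # NFD splits a precomposed letter such as "ö" into "o" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so keep the rest.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Recompose whatever is left into its canonical composed form.
    return unicodedata.normalize("NFC", without_marks)


print(strip_diacritics("gödel"))
print(strip_diacritics("Villazón"))
print(strip_diacritics("Björn Borg"))
```

Letters that have no canonical decomposition, such as "ø", pass through this sketch unchanged. Covering them is what the second bullet asks for, and it would need an explicit mapping table.
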
Alternatives considered

Installing Jython 2.7 + unidecode library (like here)

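# Jython expression; assumes unidecode is installed under the path below
import sys
# make the locally installed site-packages importable
sys.path.append(r'E:\jython2.7.1rc1\Lib\site-packages')
from unidecode import unidecode
# transliterate the cell value to ASCII
return unidecode(value)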

Additional context

Look at the normalize() function in FingerprintKeyer class here

@zyadtaha zyadtaha added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators grel The default expression language, GREL, could be improved in many ways! labels May 16, 2024
@tfmorris tfmorris removed the Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators label May 16, 2024
@tfmorris
Member

A normalize() function which parallels the existing fingerprint() function would be useful to allow access to the algorithm that fingerprint uses internally, but we probably want to include some other algorithms as well, such as the four Unicode normalization forms. The java.text.Normalizer class provides access to this functionality.
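
For reference, the four Unicode normalization forms can be compared with a short sketch. It uses Python's unicodedata module as a stand-in for java.text.Normalizer, so it illustrates the forms themselves rather than the Java API; the helper name normalize_all and the sample string are assumptions made for the example:

```python
import unicodedata

# The four Unicode normalization forms.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Hypothetical helper: return the text in each normalization form."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# "e" + combining acute accent, a space, then the "fi" ligature (U+FB01).
sample = "e\u0301 \ufb01"

for form, result in normalize_all(sample).items():
    # Show the code points so that composed and decomposed results are distinguishable.
    print(form, [hex(ord(ch)) for ch in result])
```

The canonical forms (NFC, NFD) only compose or decompose the accent, while the compatibility forms (NFKC, NFKD) also replace the ligature with the plain letters "f" and "i".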

Separately, we probably also want to enable access to locale-sensitive string comparisons of various strengths, as provided by java.text.Collator. The associated CollationKeys aren't really usable by themselves except for collating and for testing whether two strings are equivalent at a given strength (e.g. for Western European languages, case is a tertiary difference and diacritics are a secondary difference, so a primary-strength comparison ignores both). Because this needs a locale to be specified, it probably needs a separate function, since normalize() is locale-independent.
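
The idea of comparison strengths can be illustrated with a rough, locale-independent approximation. Python's standard library has no equivalent of java.text.Collator, so this sketch swaps in Unicode decomposition and case folding instead. It does not apply any locale-specific tailoring, and the names comparison_key and equivalent are hypothetical:

```python
import unicodedata

STRENGTHS = ("primary", "secondary", "tertiary")


def comparison_key(text, strength):
    """Hypothetical helper: build a key that ignores differences below `strength`."""
    if strength not in STRENGTHS:
        raise ValueError("unknown strength: %r" % (strength,))
    key = unicodedata.normalize("NFD", text)
    if strength == "primary":
        # Primary: ignore diacritics (dropped here) and case (folded below).
        key = "".join(ch for ch in key if not unicodedata.combining(ch))
    if strength in ("primary", "secondary"):
        # Secondary: keep diacritics but ignore case.
        key = key.casefold()
    # Tertiary: only canonical equivalence is ignored.
    return unicodedata.normalize("NFC", key)


def equivalent(a, b, strength="tertiary"):
    """True if the two strings compare equal at the given strength."""
    return comparison_key(a, strength) == comparison_key(b, strength)


print(equivalent("Björn", "bjorn", "primary"))
print(equivalent("Björn", "bjorn", "secondary"))
print(equivalent("Björn", "björn", "secondary"))
```

A real Collator also orders strings and applies per-locale rules, which this approximation does not attempt.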
