Add new GREL function to normalize characters #6608

zyadtaha · 2024-05-16T15:34:18Z

It would be easier if OpenRefine allowed searching for words that contain diacritical marks or extended Western characters.
For example, if a data set contains the name Björn Borg, typing Bjorn Borg into a Text filter returns no results.

Proposed solution

Transform the cell's value to its normalized form. This could be done by providing a new built-in GREL function, perhaps called normalize(), that does the following:

remove diacritics
normalize extended western characters to their ASCII representation

For example:
"gödel".normalize() -> godel
"Villazón".normalize() -> Villazon
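
As a rough illustration of the two steps listed above, here is a minimal sketch in plain Python 3 (not GREL and not Jython). It assumes that diacritics can be removed by decomposing to NFD and dropping combining marks, and it uses a small hypothetical table, EXTRA_MAP, for letters that do not decompose. Neither the function body nor the table is taken from OpenRefine's code.

```python
import unicodedata

# Hypothetical fallback table for letters with no canonical decomposition.
# Illustrative only; this is not the table OpenRefine uses.
EXTRA_MAP = {"ß": "ss", "æ": "ae", "ø": "o", "đ": "d", "ł": "l"}

def normalize(text):
    # Decompose to NFD so base letters and combining marks become separate code points.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (general category Mn), which removes the diacritics.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Map any remaining letters listed in the fallback table to ASCII.
    return "".join(EXTRA_MAP.get(ch, ch) for ch in stripped)
```

With this sketch, normalize("Björn Borg") would produce a value that a plain Text filter for Bjorn Borg can match.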

Alternatives considered

Installing Jython 2.7 + unidecode library (like here)

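import sys
# Make the locally installed site-packages (which contains unidecode) visible to Jython
sys.path.append(r'E:\jython2.7.1rc1\Lib\site-packages')
from unidecode import unidecode
# Transliterate the cell value to its closest ASCII representation
return unidecode(value)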

Additional context

See the normalize() function in the FingerprintKeyer class here


tfmorris · 2024-05-16T20:50:43Z

A normalize() function which parallels the existing fingerprint() function would be useful to allow access to the algorithm that fingerprint uses internally, but we probably want to include some other algorithms as well, such as the four Unicode normalization forms. The java.text.Normalizer class provides access to this functionality.
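
To make the four Unicode normalization forms concrete, here is a small sketch using Python's unicodedata module, which offers the same four forms (NFC, NFD, NFKC, NFKD) that java.text.Normalizer does. The helper name normalization_forms is hypothetical and only for illustration; it says nothing about how a GREL function would expose the choice of form.

```python
import unicodedata

def normalization_forms(text):
    # Return the input in each of the four Unicode normalization forms.
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

# Input: precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01).
forms = normalization_forms("\u00e9\ufb01")
for name, value in forms.items():
    # Show the code points so the composed and decomposed forms can be told apart.
    print(name, [hex(ord(ch)) for ch in value])
```

The canonical forms (NFC, NFD) only compose or decompose the accented letter and leave the ligature alone, while the compatibility forms (NFKC, NFKD) also replace the ligature with the two letters f and i. None of the four forms removes diacritics on its own.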

Separately, we probably also want to enable access to locale-sensitive string comparisons of various strengths as provided by java.text.Collator. The associated CollationKeys aren't really usable directly by themselves except for collating and testing whether two strings are equivalent at a given strength (e.g. for Western European languages, tertiary differences are typically case and secondary differences are diacritics, so comparing at secondary strength ignores case and comparing at primary strength also ignores diacritics). Because this needs to have a locale specified, it probably needs a separate function, since normalize() is locale-independent.

tfmorris removed the "Status: Pending Review" label on May 16, 2024
