Error on edge case of mecab ipadic hiragana conversion #80

Open
qip opened this issue Mar 29, 2021 · 1 comment

qip commented Mar 29, 2021

const result = await kuroshiro.convert("ユニ・チャーム、シリーズ最軽量の「超快適マスク SMART COLOR」", { to: "hiragana" });

The code above throws a TypeError when kuroshiro is used with the mecab analyzer and the ipadic (or ipadic-neologd) dictionary, because toRawHiragana() ends up being called with undefined:

/home/user/mecab/node_modules/kuroshiro/lib/util.js:7
function _toConsumableArray(arr) { if (Array.isArray(arr)) { for (var i = 0, arr2 = Array(arr.length); i < arr.length; i++) { arr2[i] = arr[i]; } return arr2; } else { return Array.from(arr); } }
                                                                                                                                                                                     ^

TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))
    at Function.from (<anonymous>)
    at _toConsumableArray (/home/user/mecab/node_modules/kuroshiro/lib/util.js:7:182)
    at toRawHiragana (/home/user/mecab/node_modules/kuroshiro/lib/util.js:142:22)
    at Kuroshiro._callee2$ (/home/user/mecab/node_modules/kuroshiro/lib/core.js:341:108)
    at tryCatch (/home/user/mecab/node_modules/regenerator-runtime/runtime.js:62:40)
    at Generator.invoke [as _invoke] (/home/user/mecab/node_modules/regenerator-runtime/runtime.js:296:22)
    at Generator.prototype.<computed> [as next] (/home/user/mecab/node_modules/regenerator-runtime/runtime.js:114:21)
    at step (/home/user/mecab/node_modules/kuroshiro/lib/core.js:19:191)
    at /home/user/mecab/node_modules/kuroshiro/lib/core.js:19:361

qip commented Mar 29, 2021

After digging into it a bit, this turns out to be an interaction between kuroshiro, the mecab analyzer, and ipadic.
ユニ・チャーム is already katakana and doesn't need to be converted, but kuroshiro sends it to the analyzer anyway, and ipadic returns ユニチャーム (without the ・) as its reading (check the ipadic CSVs for more examples):

$ echo "ユニ・チャーム" | mecab
ユニ・チャーム  名詞,固有名詞,組織,*,*,*,ユニ・チャーム,ユニチャーム,ユニチャーム
EOS

As a result, after analyzer.parse() and patchToken(), the token ends up like this:

[
  {
    surface_form: 'ユニ・チャーム',
    pos: '名詞',
    pos_detail_1: '固有名詞',
    pos_detail_2: '組織',
    pos_detail_3: '*',
    conjugated_type: '*',
    conjugated_form: '*',
    basic_form: 'ユニ・チャーム',
    reading: 'ユニチャーム',
    pronunciation: 'ユニチャーム'
  }
]

Meanwhile, core.js processes hiragana and katakana tokens character by character, like this:

for (let c2 = 0; c2 < tokens[i].surface_form.length; c2++) {
    notations.push([tokens[i].surface_form[c2], 2, toRawHiragana(tokens[i].reading[c2]), (tokens[i].pronunciation && tokens[i].pronunciation[c2]) || tokens[i].reading[c2]]);
}

The problem is that this token's reading (6 characters) is shorter than its surface_form (7 characters). The loop indexes reading with surface_form's indices, so at the last index tokens[i].reading[c2] is undefined, which toRawHiragana() doesn't handle.
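The mismatch can be reproduced without kuroshiro or mecab installed. This is only a minimal sketch: the token values are copied from the dump above, and the loop imitates the indexing pattern of the core.js snippet rather than calling the library itself. The `pairs` array is a name made up for the illustration.

```javascript
// Token values copied from the analyzer output shown in this issue.
const token = {
    surface_form: "ユニ・チャーム",
    reading: "ユニチャーム",
};

// Same indexing pattern as the core.js loop: surface_form's indices
// are used to read characters out of reading.
const pairs = [];
for (let c2 = 0; c2 < token.surface_form.length; c2++) {
    pairs.push([token.surface_form[c2], token.reading[c2]]);
}

// surface_form is one character longer than reading.
console.log(token.surface_form.length, token.reading.length);

// The last pair has no reading character at all.
console.log(pairs[pairs.length - 1]);
```

Besides the final undefined, every pair from the ・ onwards is shifted by one, since ・ has no counterpart in the reading.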

A quick and dirty fix is to make toRawHiragana() check its input first:

const toRawHiragana = function (str) {
    if (!str) return '';
    return [...str].map((ch) => {
        if (ch > "\u30a0" && ch < "\u30f7") {
            return String.fromCharCode(ch.charCodeAt(0) + KATAKANA_HIRAGANA_SHIFT);
        }
        return ch;
    }).join("");
};
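To try the guard above in isolation, here is a standalone sketch. It assumes `KATAKANA_HIRAGANA_SHIFT` is the code point offset between katakana ァ and hiragana ぁ; the constant is redefined here only so the snippet runs on its own, and the `surface`, `reading` and `converted` names are made up for the example.

```javascript
// Assumption: offset between hiragana "ぁ" (U+3041) and katakana "ァ" (U+30A1),
// defined locally so this sketch does not depend on kuroshiro's util.js.
const KATAKANA_HIRAGANA_SHIFT = "\u3041".charCodeAt(0) - "\u30a1".charCodeAt(0);

const toRawHiragana = (str) => {
    // Guard: undefined or empty input yields an empty string instead of throwing.
    if (!str) return "";
    return [...str].map((ch) => {
        if (ch > "\u30a0" && ch < "\u30f7") {
            return String.fromCharCode(ch.charCodeAt(0) + KATAKANA_HIRAGANA_SHIFT);
        }
        return ch;
    }).join("");
};

const surface = "ユニ・チャーム";
const reading = "ユニチャーム";

// Pair each surface character with the converted reading character at the same index,
// the way the core.js loop does.
const converted = [...surface].map((ch, i) => [ch, toRawHiragana(reading[i])]);
console.log(converted);
```

With the guard in place the loop no longer throws, but the per-character pairs are still misaligned after the ・ (for example, ・ is paired with ち), so this only avoids the crash rather than fixing the length mismatch itself.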
