Improve PyThaiNLP performance #685

wannaphong · 2022-07-03T06:33:05Z

PyThaiNLP wants you help us to improve the performance. You can fork this git, coding new code to improve the performance, and send your pull request to PyThaiNLP.

These are some lists for you.

Delete unused code
Port model to onnx Porting model to ONNX model #639
Reduce the use of external packages that are not standard Python packages.
Use Cython for any code but still keep pure python for does not installed cython or cannot install cython

BLKSerene · 2022-08-21T18:10:48Z

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

wannaphong · 2022-08-22T09:25:44Z

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

I am agree.

BLKSerene · 2022-08-23T04:05:05Z

When working on #691, I found a problem relating to the version checking behavior of the corpus downloader.

The downloader checks both the name and the version here, so later if the corpus is found in the local database and a re-download is not forced, the versions should always match.

I'm not sure what the expected behavior is here. If a corpus with the same name but different version is found in local database (and force = True is not specified), should the downloader suggest the user to use force = True or just silently re-download and rewrite the latest version of the corpus?

wannaphong · 2022-08-23T04:32:34Z

Yes, It is not force download that you can see https://github.com/PyThaiNLP/pythainlp/runs/7957249794?check_suite_focus=true#step:5:5219 and https://github.com/PyThaiNLP/pythainlp/blob/dev/tests/test_corpus.py#L86.

BLKSerene · 2022-08-23T05:14:57Z

@wannaphong I mean that since both the name and the version of the corpus is checked, the else block would never be reached, so the user will never be notified that a newer version of the requested corpus is available.

That is, in cases that the corpus is found in the local database, but the version do not match, the latest version of the corpus would be silently downloaded and rewritten. If this is indeed the expected behavior, the else block would be redundant. If the user should be notified in these cases, the checking logic should be modified.

wannaphong · 2022-08-23T06:04:03Z

OK. I'm agree.

BLKSerene · 2022-08-23T06:34:56Z

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

wannaphong · 2022-08-23T13:22:19Z

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

I'm agree. The user should get notified when newer versions are available.

BLKSerene · 2022-08-25T08:18:35Z

@wannaphong Should be fixed in #692, please review the PR.

wannaphong · 2022-10-08T15:49:13Z

I'm doing reduce import time. #719

wannaphong added this to the Future milestone Jul 17, 2022

wannaphong added the Hacktoberfest for Hacktoberfest event label Sep 28, 2022

wannaphong pinned this issue Oct 8, 2022

bact added the refactoring a technical improvement which does not add any new features or change existing features. label Oct 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve PyThaiNLP performance #685

Improve PyThaiNLP performance #685

wannaphong commented Jul 3, 2022 •

edited

BLKSerene commented Aug 21, 2022

wannaphong commented Aug 22, 2022

BLKSerene commented Aug 23, 2022

wannaphong commented Aug 23, 2022

BLKSerene commented Aug 23, 2022 •

edited

wannaphong commented Aug 23, 2022 •

edited

BLKSerene commented Aug 23, 2022

wannaphong commented Aug 23, 2022

BLKSerene commented Aug 25, 2022

wannaphong commented Oct 8, 2022

Improve PyThaiNLP performance #685

Improve PyThaiNLP performance #685

Comments

wannaphong commented Jul 3, 2022 • edited

BLKSerene commented Aug 21, 2022

wannaphong commented Aug 22, 2022

BLKSerene commented Aug 23, 2022

wannaphong commented Aug 23, 2022

BLKSerene commented Aug 23, 2022 • edited

wannaphong commented Aug 23, 2022 • edited

BLKSerene commented Aug 23, 2022

wannaphong commented Aug 23, 2022

BLKSerene commented Aug 25, 2022

wannaphong commented Oct 8, 2022

wannaphong commented Jul 3, 2022 •

edited

BLKSerene commented Aug 23, 2022 •

edited

wannaphong commented Aug 23, 2022 •

edited