Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve PyThaiNLP performance #685

Open
2 of 4 tasks
wannaphong opened this issue Jul 3, 2022 · 10 comments
Open
2 of 4 tasks

Improve PyThaiNLP performance #685

wannaphong opened this issue Jul 3, 2022 · 10 comments
Labels
Hacktoberfest for Hacktoberfest event refactoring a technical improvement which does not add any new features or change existing features.
Milestone

Comments

@wannaphong
Copy link
Member

wannaphong commented Jul 3, 2022

PyThaiNLP wants you help us to improve the performance. You can fork this git, coding new code to improve the performance, and send your pull request to PyThaiNLP.

These are some lists for you.

  • Delete unused code
  • Port model to onnx Porting model to ONNX model #639
  • Reduce the use of external packages that are not standard Python packages.
  • Use Cython for any code but still keep pure python for does not installed cython or cannot install cython
@wannaphong wannaphong added this to the Future milestone Jul 17, 2022
@BLKSerene
Copy link
Contributor

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

@wannaphong
Copy link
Member Author

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

I am agree.

@BLKSerene
Copy link
Contributor

When working on #691, I found a problem relating to the version checking behavior of the corpus downloader.

The downloader checks both the name and the version here, so later if the corpus is found in the local database and a re-download is not forced, the versions should always match.

I'm not sure what the expected behavior is here. If a corpus with the same name but different version is found in local database (and force = True is not specified), should the downloader suggest the user to use force = True or just silently re-download and rewrite the latest version of the corpus?

@wannaphong
Copy link
Member Author

@BLKSerene
Copy link
Contributor

BLKSerene commented Aug 23, 2022

@wannaphong I mean that since both the name and the version of the corpus is checked, the else block would never be reached, so the user will never be notified that a newer version of the requested corpus is available.

That is, in cases that the corpus is found in the local database, but the version do not match, the latest version of the corpus would be silently downloaded and rewritten. If this is indeed the expected behavior, the else block would be redundant. If the user should be notified in these cases, the checking logic should be modified.

@wannaphong
Copy link
Member Author

wannaphong commented Aug 23, 2022

OK. I'm agree.

@BLKSerene
Copy link
Contributor

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

@wannaphong
Copy link
Member Author

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

I'm agree. The user should get notified when newer versions are available.

@BLKSerene
Copy link
Contributor

@wannaphong Should be fixed in #692, please review the PR.

@wannaphong wannaphong added the Hacktoberfest for Hacktoberfest event label Sep 28, 2022
@wannaphong
Copy link
Member Author

I'm doing reduce import time. #719

@wannaphong wannaphong pinned this issue Oct 8, 2022
@bact bact added the refactoring a technical improvement which does not add any new features or change existing features. label Oct 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Hacktoberfest for Hacktoberfest event refactoring a technical improvement which does not add any new features or change existing features.
Projects
None yet
Development

No branches or pull requests

3 participants