
Improve update cache process #353

Open
navarroaxel opened this issue May 19, 2021 · 10 comments
Labels: enhancement, help wanted

Comments

@navarroaxel
Member

navarroaxel commented May 19, 2021

Actual behavior

Updating this client’s cache takes ~41 seconds.
Loading the pages' cache takes < 1 second.
Building the cache for the --search argument takes the rest of the time.

Explaining the build index process

The process reads every page and builds a dictionary {key: count}, where the key is a word and the count is how many times that word appears on the given page.
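
A minimal sketch of that per-page counting step, assuming the pages are Markdown files under a cache directory (the paths, tokenization, and helper names here are illustrative, not the client's actual code):

```js
const fs = require('fs');
const path = require('path');

// Build {word: count} for a single page (illustrative tokenization).
const buildPageIndex = (pagePath) => {
  const text = fs.readFileSync(pagePath, 'utf8').toLowerCase();
  const counts = {};
  for (const word of text.split(/\W+/).filter(Boolean)) {
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
};

// Walk every page in a (flat, hypothetical) pages directory and collect
// the per-page dictionaries into one corpus object.
const buildCorpus = (pagesDir) => {
  const corpus = {};
  for (const file of fs.readdirSync(pagesDir)) {
    if (file.endsWith('.md')) {
      corpus[path.basename(file, '.md')] = buildPageIndex(path.join(pagesDir, file));
    }
  }
  return corpus;
};
```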

Options to improve this process

  1. Create the dictionary only when --search is executed and the search-corpus.json file doesn't exist yet; remove the file when the cache is updated (see the sketch after this list).
  2. Split the work across workers (child processes, worker threads, or similar), either in chunks or one worker per platform.
  3. Use a background process to create the search-corpus.json.
  4. Use the solution proposed in Feature Request: Wait 30 minutes to update cache again #350 (don't update the dictionary on every miss).
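
A rough sketch of option 1, assuming search-corpus.json sits next to the pages cache and that buildCorpus is a hypothetical function producing the dictionary described above (the path and helper names are made up for illustration):

```js
const fs = require('fs');

// Hypothetical location of the index file, for illustration only.
const CORPUS_FILE = '/path/to/cache/search-corpus.json';

const getCorpus = (buildCorpus) => {
  // Reuse the existing index if it has already been built ...
  if (fs.existsSync(CORPUS_FILE)) {
    return JSON.parse(fs.readFileSync(CORPUS_FILE, 'utf8'));
  }
  // ... otherwise build it lazily on the first --search and persist it.
  const corpus = buildCorpus();
  fs.writeFileSync(CORPUS_FILE, JSON.stringify(corpus));
  return corpus;
};

// On cache update, just drop the file so the next search rebuilds it.
const invalidateCorpus = () => {
  if (fs.existsSync(CORPUS_FILE)) {
    fs.unlinkSync(CORPUS_FILE);
  }
};
```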

Environment

  • Operating system: Arch Linux
  • Node.js version: any (tested on v12 and v16)
@sbrl
Member

sbrl commented May 19, 2021

It's also possible that there are some performance gains to be had in the search index calculation process, but I haven't looked too closely at it to know.

@agnivade
Member

I'd prefer 2 or 3. Or even take a deeper look into the search index build process and see if there are any low-hanging fruit there, as @sbrl says. The search index was a one-off feature added by a contributor and wasn't really touched after that, so I won't be surprised if there are some simple gains to reap.

@vladimyr
Collaborator

Honestly, building a search index on the end-user machine wastes resources. Each time the cache is updated (tarball downloaded and extracted) the index needs to be rebuilt, so why don't we simply prebuild it and ship it together with the pages in the first place?

@agnivade
Member

I like that. I'd also prefer to have a default build and a minimal build with no search index for users who don't use search at all. And for this one, we can dynamically build the index if the user decides to do a search at some point.

@vladimyr
Collaborator

And for this one, we can dynamically build the index if the user decides to do a search at some point.

Or simply download it at page search time.

@navarroaxel
Member Author

I don't think so; each client should use the data structure that best fits its language and platform. Also, that sounds like a hard dependency that I don't see as compatible with this project.

@vladimyr
Collaborator

I don't think so; each client should use the data structure that best fits its language and platform.

Um, I don't see how prebuilding the search index for the Node client using GH Actions changes that?

Also, that sounds like a hard dependency that I don't see compatible with this project.

What kind of dependency are we talking about?

@navarroaxel
Member Author

Um, I don't see how prebuilding the search index for the Node client using GH Actions changes that?

Do you want to add that process to the https://github.com/tldr-pages/tldr repository when a PR is merged to main? What if the Go client wants its own format, or the Python one does?

What kind of dependency are we talking about?

What if we want changes, small or big, in this index file? Should https://github.com/tldr-pages/tldr use a versioning system for every update of the index file? That sounds like an overly specific and tight dependency between repositories.

@vladimyr
Collaborator

Do you want to add that process to the https://github.com/tldr-pages/tldr repository when a PR is merged to main? What if the Go client wants its own format, or the Python one does?

Yes. They are free to add their own workflows or keep building the index locally.

What if we want changes, small or big, in this index file? Should https://github.com/tldr-pages/tldr use a versioning system for every update of the index file? That sounds like an overly specific and tight dependency between repositories.

You can't really tweak the search index in the way you describe. It is built by the selected library and meant to be consumed by a matching client; there is no room for tweaking. Regarding coupling, we already assume the pages tree layout, which is a far stronger assumption than this one IMO.

@sbrl
Member

sbrl commented May 27, 2021

Search indexes can get big, and they're usually stored in a format that's fairly specific to the language and use case in order to improve performance, so I'm unsure of the value/wisdom of generating the index in the main tldr repo.
