
Improve update cache process #353

Open
navarroaxel opened this issue May 19, 2021 · 10 comments
Labels: enhancement, help wanted

Comments

@navarroaxel
Member

navarroaxel commented May 19, 2021

Actual behavior

Updating this client’s cache takes ~41 seconds.
Loading the pages' cache takes < 1 second.
Building the cache for the --search argument takes the rest of the time.

Explaining the build index process

The process reads every page and builds a dictionary {key: count}, where the key is a word and the count is how many times that word appears on the given page.
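
A minimal sketch of that per-page counting step, assuming the pages are Markdown files under a cache directory (the paths, tokenization, and helper names here are illustrative, not the client's actual code):

```js
const fs = require('fs');
const path = require('path');

// Build {word: count} for a single page (illustrative tokenization).
const buildPageIndex = (pagePath) => {
  const text = fs.readFileSync(pagePath, 'utf8').toLowerCase();
  const counts = {};
  for (const word of text.split(/\W+/).filter(Boolean)) {
    counts[word] = (counts[word] || 0) + 1;
  }
  return counts;
};

// Walk every page in a (flat, hypothetical) pages directory and collect
// the per-page dictionaries into one corpus object.
const buildCorpus = (pagesDir) => {
  const corpus = {};
  for (const file of fs.readdirSync(pagesDir)) {
    if (file.endsWith('.md')) {
      corpus[path.basename(file, '.md')] = buildPageIndex(path.join(pagesDir, file));
    }
  }
  return corpus;
};
```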

Options to improve this process

  1. Create the dictionary only when --search is executed and the search-corpus.json file doesn't exist yet; remove the file when the cache is updated (see the sketch after this list).
  2. Split the work across workers (child processes, worker threads, or similar), either in chunks or one worker per platform.
  3. Use a background process to create the search-corpus.json.
  4. Use the solution proposed in Feature Request: Wait 30 minutes to update cache again #350 (don't update the dictionary on every miss).
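
A rough sketch of option 1, assuming search-corpus.json sits next to the pages cache and that buildCorpus is a hypothetical function producing the dictionary described above (the path and helper names are made up for illustration):

```js
const fs = require('fs');

// Hypothetical location of the index file, for illustration only.
const CORPUS_FILE = '/path/to/cache/search-corpus.json';

const getCorpus = (buildCorpus) => {
  // Reuse the existing index if it has already been built ...
  if (fs.existsSync(CORPUS_FILE)) {
    return JSON.parse(fs.readFileSync(CORPUS_FILE, 'utf8'));
  }
  // ... otherwise build it lazily on the first --search and persist it.
  const corpus = buildCorpus();
  fs.writeFileSync(CORPUS_FILE, JSON.stringify(corpus));
  return corpus;
};

// On cache update, just drop the file so the next search rebuilds it.
const invalidateCorpus = () => {
  if (fs.existsSync(CORPUS_FILE)) {
    fs.unlinkSync(CORPUS_FILE);
  }
};
```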

Environment

  • Operating system: Arch Linux
  • Node.js version: any (tested on v12 and v16)
@sbrl
Member

sbrl commented May 19, 2021

It's also possible that there are some performance gains to be had in the search index calculation process, but I haven't looked too closely at it to know.

@agnivade
Member

I'd prefer 2 or 3. Or even take a deeper look into the search index build process and see if there are any low-hanging fruit there, as @sbrl says. The search index was a one-off feature added by a contributor and wasn't really touched after that, so I won't be surprised if there are some simple gains to reap.

@vladimyr
Collaborator

Honestly, building a search index on the end-user machine wastes resources. Each time the cache is updated (tarball downloaded and extracted) the index needs to be rebuilt, so why don't we simply prebuild it and ship it together with the pages in the first place?

@agnivade
Member

I like that. I'd also prefer to have a default build and a minimal build with no search index for users who don't use search at all. And for this one, we can dynamically build the index if the user decides to do a search at some point.

@vladimyr
Collaborator

And for this one, we can dynamically build the index if the user decides to do a search at some point.

Or simply download it at page search time.

@navarroaxel
Member Author

I don't think so; each client should use the data structure that best fits its language and platform. Also, that sounds like a hard dependency that I don't see as compatible with this project.

@vladimyr
Collaborator

I don't think so; each client should use the data structure that best fits its language and platform.

Um, I don't see how prebuilding the search index for the Node client using GH Actions changes that?

Also, that sounds like a hard dependency that I don't see compatible with this project.

What kind of dependency are we talking about?

@navarroaxel
Member Author

Um, I don't see how prebuilding the search index for the Node client using GH Actions changes that?

Do you want to add that process to the https://github.com/tldr-pages/tldr repository when a PR is merged to main? What if the Go client wants its own format, or the Python one does?

What kind of dependency are we talking about?

What if we want changes, small or big, in this index file? Should https://github.com/tldr-pages/tldr use a versioning system for every update of the index file? That sounds like an overly specific and tight dependency between repositories.

@vladimyr
Collaborator

Do you want to add that process to the https://github.com/tldr-pages/tldr repository when a PR is merged to main? What if the Go client wants its own format, or the Python one does?

Yes. They are free to add their own workflows or keep building the index locally.

What if we want changes, small or big, in this index file? Should https://github.com/tldr-pages/tldr use a versioning system for every update of the index file? That sounds like an overly specific and tight dependency between repositories.

You can't really tweak the search index in the way you describe. It is built by the selected library and meant to be consumed by a matching client; there is no room for tweaking. Regarding coupling, we already assume the pages tree layout, which is a far stronger assumption than this one IMO.

@sbrl
Member

sbrl commented May 27, 2021

Search indexes can get big, and they're usually stored in a format that's fairly specific to the language and use case in order to improve performance, so I'm unsure of the value/wisdom of generating the index in the main tldr repo.
