Improve JSON Loading Performance #16410

Open
apainintheneck opened this issue Dec 30, 2023 · 45 comments
Labels: discussion (Input solicited from others), features (New features), in progress (Maintainers are working on this)

@apainintheneck
Contributor

apainintheneck commented Dec 30, 2023


Provide a detailed description of the proposed feature

We should try to improve the loading times for JSON data. This is especially relevant because the JSON API has large files that get read into memory whenever a formula or cask is loaded from the API.

$ cd $(brew --cache)
$ ls -FlaSh api/*.json | choose -5:-1
26M Dec 29 21:59 api/formula.jws.json
6.4M Dec 29 21:59 api/cask.jws.json
2.3K Dec 29 21:59 api/cask_tap_migrations.jws.json
2.0K Dec 29 21:59 api/formula_tap_migrations.jws.json

Off the top of my head, I see a few ways that we could potentially improve performance.

  1. Move to a faster JSON library

After @Bo98's excellent work, we're living in the future and everybody's running Ruby 3.1. A side effect of this is that we could potentially include native libraries and ship them with the Ruby 3.1 bottle. This should allow us to explore using libraries like oj which have better JSON parsing performance.
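For illustration, a rough way to compare the two parsers on the cached API file (this assumes the oj gem is installed; Oj.load and JSON.parse are the standard entry points, and the numbers are only indicative):

require "benchmark"
require "json"
require "oj"

# Read the cached formula API file, as in the listing above.
path = File.join(`brew --cache`.strip, "api", "formula.jws.json")
data = File.read(path)

puts "json: #{Benchmark.realtime { JSON.parse(data) }.round(3)}s"
puts "oj:   #{Benchmark.realtime { Oj.load(data, mode: :strict) }.round(3)}s"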

  2. Reduce the size of the JSON files

Smaller files are easier to parse, and they mean that fewer objects get allocated during parsing. Currently, we don't purge nil values from hashes before creating the JSON files. This leads to a large number of extra hash keys, all of which are strings that have to be allocated. It also increases the size of the files, which slows down parsing and means that more data has to be shipped across the network.

$ grep -Eo ':null[{,]'  api/formula.jws.json | wc -l
   93785
$ grep -Eo ':null[{,]'  api/cask.jws.json | wc -l
   46265

We're allocating an extra 140k hash keys by not eliminating null values from hashes.

Note: I looked into this a few months ago to see if it would make a big difference from a network file size point of view but it only seemed to change the file size by 100s of kilobytes after getting compressed so I don't expect to see a large benefit there.
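For illustration (this isn't the actual brew generate-*-api code), dropping nil values before generating the JSON could look something like this:

require "json"

# Illustrative helper: recursively drop nil-valued keys so their string keys
# never need to be allocated (or shipped) at all.
def drop_nils(value)
  case value
  when Hash  then value.reject { |_, v| v.nil? }.transform_values { |v| drop_nils(v) }
  when Array then value.map { |v| drop_nils(v) }
  else value
  end
end

# Made-up, minimal formula hash just to show the effect.
formula = { "name" => "a2ps", "oldname" => nil, "caveats" => nil,
            "versions" => { "stable" => "4.15.6", "head" => nil } }
puts JSON.generate(drop_nils(formula))
# => {"name":"a2ps","versions":{"stable":"4.15.6"}}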

  3. Stop parsing the API JSON files twice

Currently we essentially parse both of the API JSON files twice. The first pass validates the JSON web signature and the second parses the content. Beyond removing duplicate work, combining these two passes would also mean not allocating additional large strings; the extracted payload string ends up being around the same size as the JSON file itself, since the signature header takes up only a small part of the file.

It would also potentially make the JSON a bit smaller since we'd be able to avoid escaping all of the quotes in the stringified JSON.

JSON API payload string:

{"payload":"[\n  {\"name\":\"a2ps\",\"full_name\":\"a2ps\",\"tap\":\"homebrew/core\",...

Escaped quotes:

$ grep -oF '\"'  api/cask.jws.json | head -n 5
\"
\"
\"
\"
\"
$ grep -oF '\"'  api/formula.jws.json | wc -l
 1970788
$ grep -oF '\"'  api/cask.jws.json | wc -l
  537182

What is the motivation for the feature?

Currently there is a fixed cost whenever the API JSON gets loaded into memory that affects all commands that load formulae or casks from the API. On my M1 MacBook it takes around 3/10ths of a second to load the formulae API JSON by itself.

How will the feature be relevant to at least 90% of Homebrew users?

It will make most commands snappier because most users use the JSON API. This is an especially big deal with the work to move brew desc to use the JSON API in #16237.

What alternatives to the feature have been considered?

Doing nothing...

apainintheneck added the features (New features) and discussion (Input solicited from others) labels on Dec 30, 2023
@Bo98
Member

Bo98 commented Dec 30, 2023

The problem is a bit twofold. First is the parser speed, as you mention, but second is the GC impact. The end result of the JSON parse doesn't just add a few thousand objects, it has a footprint of hundreds of thousands of objects. This makes subsequent major GC sweeps quite slow. We can tweak the GC a bit to reduce the frequency of sweeps but ultimately having that large of an object footprint is not ideal.

  1. Move to a faster JSON library

I've looked at oj a bit but don't have numbers yet. It will indeed likely improve parse speed. Not sure about the object problem though.

We do have the problem of native gems here. Realistically it's difficult to enforce native gem installs as there are a million things that can go wrong, as we saw when trying to enable Bootsnap for all and as I continue to sometimes see from an occasional discussion post (e.g. some people have OOM issues running rm during a compile for some reason?).

The thing with alternative parsers, however, particularly since the default JSON gem isn't that bad, is that they can be optional: we can bundle one with Portable Ruby, which 99% of users use, and use it if available, falling back to the default if not. This is trivial to do.
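As a minimal sketch of that optional-dependency pattern (constant and file names are made up, not existing brew code):

# Prefer oj when a bundled copy is present; otherwise fall back to stdlib JSON.
begin
  require "oj"
  JSON_PARSER = ->(str) { Oj.load(str, mode: :strict) }
rescue LoadError
  require "json"
  JSON_PARSER = ->(str) { JSON.parse(str) }
end

hash = JSON_PARSER.call('{"name": "a2ps"}')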

  3. Stop parsing the API JSON files twice

There is definitely a difference in speed between the two passes. The first pass is much cheaper than the second in both time taken and objects created. With that said I do acknowledge it's not the most ideal.

The payload being a string was intentional given we were following the JWS standard which defines the payload to be the string. Technically we could roll our own format and not follow the standard that closely. I rolled out the feature very quickly (developed largely in one day) so it was useful at the time to point to the standard in PRs as proof of something that is used and tested in real world applications rather than being something I pulled out of a hat.

It can definitely be compressed to remove spaces however. At the time it was awkward due to Jekyll but this has been made easier since now we control the build more and use v4.

A bigger change is potentially even something not-JSON. In particular something that doesn't require parsing the whole file by allowing us to seek to the relevant section for a particular formula. Such a format may solve most problems as we're parsing only a part of the data which has the potential to be cheaper in both time taken and in the number of objects created. This isn't necessarily easy, and there are several other factors such as whether it's possible to do so with reasonable speed in pure Ruby or if a native gem is necessary. But it's an option at least that I've reflected on a bit.

@apainintheneck
Contributor Author

The problem is a bit twofold. First is the parser speed, as you mention, but second is the GC impact. The end result of the JSON parse doesn't just add a few thousand objects, it has a footprint of hundreds of thousands of objects. This makes subsequent major GC sweeps quite slow. We can tweak the GC a bit to reduce the frequency of sweeps but ultimately having that large of an object footprint is not ideal.

I hadn't realized this but in hindsight it seems obvious. That reminds me that it'd be nice to have memory profiling get added to brew prof at some point.

The thing with alternative parsers, however, particularly since the default JSON gem isn't that bad, is that they can be optional: we can bundle one with Portable Ruby, which 99% of users use, and use it if available, falling back to the default if not. This is trivial to do.

That makes sense to me though I'm not sure how complex that bundling would be. At least oj seems to be compatible with the default json library.

There is definitely a difference in speed between the two passes. The first pass is much cheaper than the second in both time taken and objects created. With that said I do acknowledge it's not the most ideal.

Very true. This shows up clearly in the profiling output.

(screenshot of profiling output)

A bigger change is potentially even something not-JSON. In particular something that doesn't require parsing the whole file by allowing us to seek to the relevant section for a particular formula. Such a format may solve most problems as we're parsing only a part of the data which has the potential to be cheaper in both time taken and in the number of objects created. This isn't necessarily easy, and there are several other factors such as whether it's possible to do so with reasonable speed in pure Ruby or if a native gem is necessary. But it's an option at least that I've reflected on a bit.

Do you have an existing serialization format in mind here? I can't think of any off-the-shelf format that fits those characteristics though I agree that it would obviously be nice to have the ability to not have to load the entire file into memory to just load one formula or cask.

@Bo98
Member

Bo98 commented Dec 30, 2023

I hadn't realized this but in hindsight it seems obvious. That reminds me that it'd be nice to have memory profiling get added to brew prof at some point.

I did try this at some point but never sent a PR with it. Might take a look again.

The garbage collector bar in --stackprof output does point to the time taken by the GC and you can tell the difference between a run with and without JSON parsing.

That makes sense to me though I'm not sure how complex that bundling would be. At least oj seems to be compatible with the default json library.

Ruby does have its own gem bundling system that we could piggyback off of by adding a line to https://github.com/ruby/ruby/blob/master/gems/bundled_gems.

There's also the vendor_ruby directory system that's left entirely up to us what we use it for.

Do you have an existing serialization format in mind here? I can't think of any off-the-shelf format that fits those characteristics though I agree that it would obviously be nice to have the ability to not have to load the entire file into memory to just load one formula or cask.

Not currently because I've not researched it much - I was talking conceptually and in theory we could have a TOC listing all formula names and their offsets that can be very quickly read followed by the actual payload. I know stuff like protobuf has lazy properties but that's not to suggest that's necessarily the perfect fit or not.
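To make the idea concrete, a purely hypothetical layout (a sketch, not a proposal) could be a small table of contents of byte offsets followed by one JSON blob per formula, so reading one formula only parses its own slice:

require "json"

# Hypothetical reader: an 8-byte header gives the TOC size, the TOC maps each
# formula name to [offset, length], and each formula's JSON lives at that offset.
def read_one(path, name)
  File.open(path, "rb") do |f|
    toc_size = f.read(8).to_i
    toc = JSON.parse(f.read(toc_size)) # tiny compared to the full payload
    offset, length = toc.fetch(name)
    f.seek(offset)
    JSON.parse(f.read(length))         # only this formula's slice is parsed
  end
end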

It's an option, but there's maybe not too much appetite for change this soon.

@MikeMcQuaid
Member

2. Reduce the size of the JSON files

We should do this regardless. There's a lot of keys that are literally never used by us but yet are downloaded by everyone.

We're allocating an extra 140k hash keys by not eliminating null values from hashes.

This seems like a good idea.

3. Stop parsing the API JSON files twice

If possible: this seems like a good/nice idea.

We do have the problem of native gems here.

We can't/shouldn't really use native gems for anything except developer commands.

The payload being a string was intentional given we were following the JWS standard which defines the payload to be the string.

I wonder about making the payload e.g. Gzipped data?

It's an option, but there's maybe not too much appetite for change this soon.

I don't really have an appetite for here until we try the above ideas and see them insufficient.

I still think the biggest impact will come from reducing the number of keys we actually include in the JSON payload that we use.

@Bo98
Member

Bo98 commented Dec 31, 2023

I wonder about making the payload e.g. Gzipped data?

It can be, though the JSON parse of the actual payload is the slower bit and this won't really help there - only the initial parse.

@apainintheneck
Contributor Author

We do have the problem of native gems here.

We can't/shouldn't really use native gems for anything except developer commands.

I assumed that the reason we don't include native gems outside of dev commands was to reduce install time when they need to be updated and avoid problems where native gems refuse to build on some systems. If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right? I agree that it should probably not be the first option though since it adds more complexity.

@apainintheneck
Contributor Author

Thoughts on Reducing JSON Size

Are we okay with having two different JSON representations of casks/formulae? One would be for loading packages internally and the other for brew info --json=v2 that will also get shown on formulae.brew.sh. This would require some adjustment to the brew generate-*-api commands.

We could probably remove all top-level hash keys that point to blank data. This would be more automatic. At a glance, it looks like most hash keys that we would expect not to be used would be blank anyway like pinned, outdated, etc. The load_formula_from_api method would have to be updated to expect more missing hash keys.

Maybe we should store the package data as a hash instead of an array. Internally, we turn the array of package info hashes into a hash of package name to package info every time we load it into memory. I doubt that is slow, but it would require less fiddling.
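For reference, the conversion we repeat on every load is roughly the following (a sketch; the file name and exact shape are illustrative):

require "json"

# Today: an array of formula hashes that we re-key by name on every load.
formulae = JSON.parse(File.read("formula.json"))
formulae_by_name = formulae.to_h { |f| [f["name"], f] }

# Shipping the data already keyed by name would make this step unnecessary.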

@MikeMcQuaid
Member

If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right?

It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).

Are we okay with having two different JSON representations of casks/formulae? One would be for loading packages internally and the other for brew info --json=v2 that will also get shown on formulae.brew.sh.

Not just OK but: this seems actively desirable.

We could probably remove all top-level hash keys that point to blank data. This would be more automatic. At a glance, it looks like most hash keys that we would expect not to be used would be blank anyway like pinned, outdated, etc. The load_formula_from_api method would have to be updated to expect more missing hash keys.

Seems like a great idea 👍🏻

Maybe we should store the package data as a hash instead of an array. Internally, we turn the array of package info hashes into a hash of package name to package info every time we load it into memory. I doubt that is slow, but it would require less fiddling.

Also a good idea 👍🏻

@apainintheneck
Contributor Author

apainintheneck commented Jan 1, 2024

If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right?

It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).

That's what I hadn't considered. Good point!


To add a new slimmer JSON representation we'd need to...

  • Update the code used to load casks and formulae from JSON to allow for missing hash keys.
  • Update formulae.brew.sh API generation to ignore these new files.
  • Create the new JSON file in the brew generate-*-api commands.
  • Update formulae.brew.sh API generation to use these new files.

@Bo98
Member

Bo98 commented Jan 1, 2024

It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).

It doesn't have to. The update cadences for oj (and bootsnap if we bundle that) are more frequent than Ruby's, but very rarely are those updates critical. They can easily wait for the next release. If it's a security release then the release notes will say so and we can react, but there haven't been any in either as far back as I can see in their changelogs. So realistically: it does not need to increase how often we update, just like how Ruby upstream don't ship a new Ruby every time one of their own bundled gems has a new release, unless it's a security update. And the idea here was to share the same model Ruby upstream have with bundled gems.

We already don't automatically ship new Portable Rubies when a dependency updates, and arguably things like OpenSSL are more important than any gem update (though we only use OpenSSL on a very basic level so most vulnerabilities have largely not been a concern).

(On a related note to this: locally I have put together autobumping for the Portable Ruby repository so expect to see a PR for that this month. That will mean the whole repository is 100% automated except for updates between minor versions e.g. 3.1 -> 3.2, and hitting the release button)

@MikeMcQuaid
Member

  • Update the code used to load casks and formulae from JSON to allow for missing hash keys.

    • Note: We should keep the current tests for loading from the public JSON representation for backwards compatibility.
  • Update formulae.brew.sh API generation to ignore these new files.

  • Create the new JSON file in the brew generate-*-api commands.

  • Update formulae.brew.sh API generation to use these new files.

Makes sense to me 👍🏻

It doesn't have to. The update cadences for oj (and bootsnap if we bundle that) are more frequent than Ruby's, but very rarely are those updates critical. They can easily wait for the next release. If it's a security release then the release notes will say so and we can react, but there haven't been any in either as far back as I can see in their changelogs. So realistically: it does not need to increase how often we update, just like how Ruby upstream don't ship a new Ruby every time one of their own bundled gems has a new release, unless it's a security update. And the idea here was to share the same model Ruby upstream have with bundled gems.

Ok, that's fair 👍🏻

(On a related note to this: locally I have put together autobumping for the Portable Ruby repository so expect to see a PR for that this month. That will mean the whole repository is 100% automated except for updates between minor versions e.g. 3.1 -> 3.2, and hitting the release button)

🎉 looking forward to it.

Would be nice for this to be a trivial, automated process.

@carlocab
Member

carlocab commented Jan 3, 2024

I don't have much to add here but:

  1. I suspect most of the gains will likely come from being able to selectively parse specific formulae/casks from the API JSON (rather than the speed of the JSON parser itself).
  2. I'm still excited by work here in any direction -- feel free to tag me for 👀 on any PRs.

@Bo98
Member

Bo98 commented Jan 3, 2024

From quick tests now, while oj is consistently faster it's not as significant as I hoped (~0.05s). I think most of the overhead is potentially the number of objects created which didn't really change between the two parsers. It'll be interesting to see the numbers for the slimmed JSON suggestion as that could reduce the object count (for the keys since the values will be a shared nil object).

Vendoring bootsnap in Portable Ruby might still be nice though. Not a deal breaker (~0.12s) but we already put the effort to integrate it in brew to use it so we might as well try to lean more on it rather than something only a couple of maintainers plus a limited amount of our CI uses. Maybe need to figure out smarter cache cleaning though.

@Bo98
Member

Bo98 commented Jan 3, 2024

To demonstrate the object thing. A brew info --json=v2 hello run looks something like this:

120943 objects prior to formula JSON load
673778 objects after formula JSON load
699929 objects prior to cask JSON load
857648 objects after cask JSON load
979820 objects at end of brew run

(exact values may vary from run to run)

So JSON data accounts for about 72.5% of all Ruby objects created in a brew run, even when we use <1% of it.

Forcing a GC (several times to make sure) frees up about 8800 slots after a formula JSON load and 14500 after a cask JSON load.

Disabling the GC on its own (for demonstration purposes) saves ~0.1s, though creating the objects themselves probably has overhead too. Again not night and day but these things quickly add up and pair everything together (even oj) and we're looking at <0.5s from 0.75s.

There's not much to take from this comment - just following up on earlier theories, and it will be interesting to check back and see what areas improved from trying the various ideas (e.g. JSON slimming) to give an indication of the type of things that are helping more than others. For example, I suspect the whitespace stripping is probably not significant based on the data so far, but we should do it anyway as it's good practice.
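For anyone wanting to reproduce numbers like these, a rough sketch of sampling live object counts (not necessarily how the figures above were produced; the cache path comes from brew --cache as earlier in the thread):

require "json"

def live_objects
  GC.start # settle the heap so the count reflects objects that are actually retained
  counts = ObjectSpace.count_objects
  counts[:TOTAL] - counts[:FREE]
end

before = live_objects
formulae = JSON.parse(File.read(File.join(`brew --cache`.strip, "api", "formula.jws.json")))
puts "objects retained by the JSON load: #{live_objects - before}"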

@RandomDSdevel
Contributor

Why not split the JSON up into one file per formula or cask? (This may be something to consider further later on down the line, though, depending on how invasive of a refactor or not that might be right now.)

@Bo98
Member

Bo98 commented Jan 3, 2024

It's a bit of a trade-off. It would certainly be slower to download but if you mean instead a single download that is split afterwards then maybe, though the extra write IO cost may be notable.

@RandomDSdevel
Contributor

but if you mean instead a single download that is split afterwards then maybe

(Nods.) One obvious option would be a tarball.

@apainintheneck
Contributor Author

apainintheneck commented Jan 4, 2024

Yeah, that should definitely be an option. I took a quick stab at seeing how much smaller the formula JSON would get by just removing blank values and unused keys and it made a big difference. This, of course, hasn't been tested yet though so it's possible that some of these are still needed.

IGNORABLE_KEYS = %w[
  oldname
  installed
  linked_keg
  pinned
  outdated
  keg_only
  disabled
  deprecated
].freeze

original = Homebrew::API::Formula.all_formulae
File.write "original.json", JSON.generate(original)

slim = original.transform_values do |hash|
  hash = hash
    .reject { |key, _| IGNORABLE_KEYS.include?(key) }
    .reject { |_, value| value.blank? }

  files = hash.dig("bottle", "stable", "files").to_h do |os, file_hash|
    [os, file_hash.slice("cellar", "sha256")] # Ignore unused "url"
  end

  hash["bottle"]["stable"]["files"] = files unless files.empty?
  hash
end
File.write "slim.json", JSON.generate(slim)
24M original.json
13M slim.json

I was surprised at how much space was taken up by the bottle URLs, which aren't used at all; that was around 6 megabytes by itself.

{
  "stable": {
    "rebuild": 0,
    "root_url": "https://ghcr.io/v2/homebrew/core",
    "files": {
      "arm64_sonoma": {
        "cellar": "/opt/homebrew/Cellar",
        "url": "https://ghcr.io/v2/homebrew/core/a2ps/blobs/sha256:f6fec638555f04cdcb741cb9d930f95842f603e8d37c6ca5312b25c7a9e585cd",
        "sha256": "f6fec638555f04cdcb741cb9d930f95842f603e8d37c6ca5312b25c7a9e585cd"
      },

Edit: We could also potentially save another half a megabyte by only including the tap info at the top-level of the JSON instead of inside each individual formula definition.

Edit 2: I found that recursively removing blank elements from the formula hashes didn't result in a significantly smaller file.
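Given the example above, the url looks derivable from root_url, the formula name and the sha256, so it could plausibly be rebuilt at load time instead of being shipped. A sketch only (not actual loader code; some names, e.g. versioned formulae containing "@", may need extra mapping):

# Rebuild the bottle URL from fields already present in the slimmed JSON,
# following the pattern visible in the a2ps example above.
def bottle_url(root_url, formula_name, sha256)
  "#{root_url}/#{formula_name}/blobs/sha256:#{sha256}"
end

bottle_url("https://ghcr.io/v2/homebrew/core", "a2ps",
           "f6fec638555f04cdcb741cb9d930f95842f603e8d37c6ca5312b25c7a9e585cd")
# => "https://ghcr.io/v2/homebrew/core/a2ps/blobs/sha256:f6fec6385..."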

@Bo98
Copy link
Member

Bo98 commented Jan 4, 2024

Assuming 7 bottles per formula, that's 7 * ~7000 ≈ 50k string objects (probably more from removing the key too), so yeah, it might be noticeable on the Ruby side too.

@MikeMcQuaid
Member

Edit 2: I found that recursively removing blank elements from the formula hashes didn't result in a significantly smaller file.

It may not slim the file but it will likely reduce the number of required allocations.

@apainintheneck
Contributor Author

From quick tests now, while oj is consistently faster it's not as significant as I hoped (~0.05s). I think most of the overhead is potentially the number of objects created which didn't really change between the two parsers. It'll be interesting to see the numbers for the slimmed JSON suggestion as that could reduce the object count (for the keys since the values will be a shared nil object).

I was afraid of that.

Vendoring bootsnap in Portable Ruby might still be nice though. Not a deal breaker (~0.12s) but we already put the effort to integrate it in brew to use it so we might as well try to lean more on it rather than something only a couple of maintainers plus a limited amount of our CI uses. Maybe need to figure out smarter cache cleaning though.

I think that sounds like a good idea.

@Bo98
Member

Bo98 commented Jan 9, 2024

Again not night and day but these things quickly add up and pair everything together (even oj) and we're looking at <0.5s from 0.75s.

Looks like that's about the absolute limit. Just for fun, I experimented for a couple of hours to see what a best-case scenario could look like: https://github.com/Bo98/lazyjson-ruby (intentionally doesn't include Hash#except/delete etc)

What we have now: ~988,000 objects, ~0.72s (screenshot)
Proof-of-concept: ~160,000 objects, ~0.53s (screenshot)
(on a very basic brew deps hello - so nothing using variations etc which would need adjusting to work on the PoC)

Also note that brew usage without any JSON parsing is ~145k objects, so you can subtract roughly that from the above two numbers (and perhaps even more since I didn't force a GC sweep with the above numbers).

So if we can get anything close to those numbers from what we're doing here we will have done well.

@apainintheneck
Contributor Author

Very impressive! I don't think we'll get close to that since I'm not aware of any Ruby gems that offer lazy loading for JSON (though to be honest I haven't looked that hard). Of course, we could use your repo but I doubt that's in the cards either.

@Bo98
Member

Bo98 commented Jan 9, 2024

Yeah, there's very likely none as it's sort of a "fake" binding given it doesn't return native Ruby Hashes and Arrays but instead custom ones that don't expose children to Ruby. The idea was to make the GC only see one object (+ any values you extract from it) rather than hundreds of thousands. The custom classes mean it isn't a drop-in replacement for all use cases of JSON parsing in the wild (but code adapted to work with it would remain compatible with existing JSON parsing). The lazy part refers to the Ruby exposure rather than the parsing itself, which is not lazy (but really fast given it uses SIMD instruction sets like NEON, AVX and SSE). Lazy parsing is possible but only really for streaming and not random access, and is close to being a micro-optimisation when we're in the 2-3% range.

Exploring all this was a nice break from the norm too as I hadn't really tried working with Ruby extensions much before.

apainintheneck added a commit to apainintheneck/homebrew-dev-utils that referenced this issue Jan 10, 2024
This allows me to sanity check the changes to the API JSON
from a loading perspective.

Related to: Homebrew/brew#16410
@apainintheneck
Contributor Author

Since these changes are much more involved than what was originally proposed, we'll need parallel download and parsing logic for the old and new formats to be added to the api module, along with extra workarounds in the Formulary and Cask::CaskLoader modules.

The JSON format itself will need to be versioned as well which should be easy enough to add.

{
  "payload" => ...,
  "signature" => ...,
  "version" => 2,
}

Then, we can parse it based on that version number.
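A rough sketch of that dispatch (the helper names are hypothetical, not existing Homebrew::API code):

require "json"

def load_api_file(json_text)
  data = JSON.parse(json_text)
  case data["version"]
  when 2
    parse_v2(data["payload"], data["signature"]) # hypothetical new-format reader
  else
    parse_v1(data) # the original JWS layout has no "version" key
  end
end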

This will be awfully difficult to test beforehand so maybe we should roll it out behind an environment variable until we're confident in it. Essentially, generate the old and new formats in formulae.brew.sh and give them different names and then the environment variable decides which one you try to download.

Since we expect users to move to a newer version of Homebrew pretty regularly, we'd probably be able to remove the first version of the JSON API code as a follow-up after a month or two.

@apainintheneck
Contributor Author

The main difference with this new approach is that it means rewriting a lot of the API code, which isn't necessary if we just want to minify the JSON hashes.

@Bo98
Member

Bo98 commented Jan 19, 2024

Essentially, generate the old and new formats in formulae.brew.sh and give them different names and then the environment variable decides which one you try to download.

Yeah a clean break would make this very doable and I agree what you describe here would be the approach I would take - generate both old and new on formulae.brew.sh.

Regardless of what we do here, this will be the first time we properly do something like this, so it's very uncertain how many issues it may cause.

We did make a minor compatibility break once before. #15168 wasn't in the original API JSON (shipped in 4.0.12 instead) and I accidentally broke compatibility with pre-4.0.12 JSONs in 4.1.0: #15676. I believe this was the cause of reports like https://github.com/orgs/Homebrew/discussions/4701 and https://github.com/orgs/Homebrew/discussions/4750. Though using an old JSON for so long is also a concern - seems to be some permission issue going on somewhere.

The reverse (new JSON incompatible with old brew) has never been tested and the API minifying approach alone would be the first example of doing that given it needs the other PRs that ignore nil-values on the loader/reader side. So the minifying approach isn't necessarily risk-free - it's a different, unexplored type of risk. Frozen CI images (common in Circle CI etc) without the API pre-cached may be a good example of such a concern, along with multi-user systems. It may work better, it may work worse - I don't really know.

I've leaned more on the clean break approach because we can add any necessary compatibility code we need while the minifying approach is impossible to handle compatibility for should it turn out to be necessary. It is however of course more effort.

@apainintheneck
Contributor Author

Regardless of what we do here, this will be the first time we properly do something like this, so it's very uncertain how many issues it may cause.

Would we need to wait for the next major release here? It's an internal API change but it is breaking. At the very least, it should probably wait for a minor release.

The reverse (new JSON incompatible with old brew) has never been tested and the API minifying approach alone would be the first example of doing that given it needs the other PRs that ignore nil-values on the loader/reader side. So the minifying approach isn't necessarily risk-free - it's a different, unexplored type of risk. Frozen CI images (common in Circle CI etc) without the API pre-cached may be a good example of such a concern, along with multi-user systems. It may work better, it may work worse - I don't really know.

Yeah, I could very well be underestimating the complexity of the minimized approach. Frozen CI images never even crossed my mind, for example.

I've leaned more on the clean break approach because we can add any necessary compatibility code we need while the minifying approach is impossible to handle compatibility for should it turn out to be necessary. It is however of course more effort.

This is why I was advocating for the other approach. I'm just lazy and this looks like a whole lot more work. I also think that any additional performance benefits will be marginal at best.

@MikeMcQuaid
Member

I think that this is pretty well sorted out by #16460. The cask version needs to be added as well though.

Agreed 👍🏻

This means that we can combine the tap migrations with the formula or cask file it corresponds to.

Yeh, would be good to migrate these two (and leave "room" in the format to potentially add more tap-specific data if needed in future).

It doesn't make sense to combine them in my opinion because we often don't want to load casks and formulas at the same time. This is especially true for Linux users.

Yup, that's fair.

This probably only really makes sense on macOS but we could tarball/gzip the cask and formula files and ship them together. Not only will this reduce network requests but it might also reduce overall file size.

Yeh, worth seeing if this makes things any smaller/faster. If it doesn't meaningfully: can leave it.

Since these changes are much more involved than what was originally proposed, we'll need parallel download and parsing logic for the old and new formats to be added to the api module, along with extra workarounds in the Formulary and Cask::CaskLoader modules.

Yes, this is non-ideal but hopefully this logic can be separated enough that it's just a matter of deleting a few ifs and method definitions. Could add the usual # odeprecated pattern here.

This will be awfully difficult to test beforehand so maybe we should roll it out behind an environment variable until we're confident in it. Essentially, generate the old and new formats in formulae.brew.sh and give them different names and then the environment variable decides which one you try to download.

Good idea, agreed 👍🏻

Would we need to wait for the next major release here? It's an internal API change but it is breaking. At the very least, it should probably wait for a minor release.

Could do this before we remove the generation of the old format.

I'm just lazy and this looks like a whole lot more work.

Lazy is a virtue! This sort of scope creep can be bad, I appreciate you pushing back on it ❤️

The JSON format itself will need to be versioned as well which should be easy enough to add.

This will be v3, FYI.

I think this also makes this a good candidate to be a v3-internal rather than v3 proper; it lets us test out a new format for our own internal use before we can consider moving the formulae.brew.sh APIs to use this format with all data output eventually, too.

@apainintheneck
Contributor Author

Yes, this is non-ideal but hopefully this logic can be separated enough that it's just a matter of deleting a few ifs and method definitions. Could add the usual # odeprecated pattern here.

Yeah, I see 10+ functions that will need different logic for the different API versions. We could add that branching logic into each method or just have different mixins that we use based on whether the v3 flag is set.
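For illustration, the mixin variant could look roughly like this (module names and the flag are made up, not existing brew code):

# Hypothetical sketch: pick the loader mixin once instead of branching inside
# every method that reads the API hash.
module APIHashLoaderV2
  def bottle_hash(json)
    json.fetch("bottle") # v2: key is always present
  end
end

module APIHashLoaderV3
  def bottle_hash(json)
    json["bottle"] || {} # v3: blank keys are omitted entirely
  end
end

class FormulaAPILoader
  include(ENV["HOMEBREW_INTERNAL_JSON_V3"] ? APIHashLoaderV3 : APIHashLoaderV2) # flag name is illustrative
end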

This will be v3, FYI.

I think this also makes this a good candidate to be a v3-internal rather than v3 proper; it lets us test out a new format for our own internal use before we can consider moving the formulae.brew.sh APIs to use this format with all data output eventually, too.

I wasn't planning on changing the public API. We could decide to do that later on depending on how things go though.

@MikeMcQuaid
Member

Yeah, I see 10+ functions that will need different logic for the different API versions. We could add that branching logic into each method or just have different mixins that we use based on whether the v3 flag is set.

Another option is separate classes using duck-typing.

I wasn't planning on changing the public API. We could decide to do that later on depending on how things go though.

Yeh, I don't think we need that in the short-term but I think the current v2 (at least for formulae) API is so terrible that it is worth learning from this "v3-internal" and using there.

@apainintheneck
Contributor Author

TODO: Check to see if we can improve the JSON format for listing dependencies.

CC: @MikeMcQuaid @Bo98

@apainintheneck
Contributor Author

For the improved JSON format for formula dependencies, what do you think of the following alternative?

{"dependencies"=>                                                                                      
  {"autoconf"=>{:tags=>[:build]},                                                                      
   "automake"=>{:tags=>[:build]},                                                                      
   "[email protected]"=>{:tags=>[:build]},                                                                   
   "sphinx-doc"=>{:tags=>[:build]},                                                                    
   "xz"=>{:tags=>[:build]},                                                                            
   "gnu-sed"=>{:tags=>[:build]},                                                                       
   "m4"=>{:tags=>[:build], :uses_from_macos=>nil},                                                     
   "ncurses"=>{:uses_from_macos=>nil}},                                                                
 "head_dependencies"=>                                                                                 
  {"autoconf"=>{:tags=>[:build]},                                                                      
   "automake"=>{:tags=>[:build]},                                                                      
   "[email protected]"=>{:tags=>[:build]},                                                                   
   "sphinx-doc"=>{:tags=>[:build]},                                                                    
   "xz"=>{:tags=>[:build]},
   "gnu-sed"=>{:tags=>[:build]},
   "m4"=>{:tags=>[:build], :uses_from_macos=>nil},
   "ncurses"=>{:uses_from_macos=>nil}}}
{
  "dependencies": {                                      
    "bdw-gc": null,                                      
    "gmp": null,                                         
    "libevent": null,                                    
    "libyaml": null,                                     
    "llvm": null,                                        
    "openssl@3": null,                                   
    "pcre2": null,                                       
    "pkg-config": null                                   
  },                                                     
  "head_dependencies": {                          
    "bdw-gc": null,                               
    "gmp": null,                                  
    "libevent": null,                             
    "libyaml": null,
    "llvm": null,
    "openssl@3": null,
    "pcre2": null,
    "pkg-config": null,
    "libffi": {
      "uses_from_macos": null
    }
  }
}

@MikeMcQuaid
Member

For the improved JSON format for formula dependencies, what do you think of the following alternative?

@apainintheneck Looks a lot nicer!

I'm thinking parsing may be a bit nicer/easier/cleaner if dependencies are either e.g. "bdw-gc": true or "bdw-gc": {}.

Similarly, :uses_from_macos=>true (or :uses_from_macos=>{} if there's potential hash values) feels a bit nicer to me, too.

@apainintheneck
Contributor Author

Okay, how about something like this? The first result is what we currently have and the second is the new format. It surprisingly ends up being even a bit smaller too.

-rw-r--r--   1 kevinrobell  admin   1.0M Mar 18 22:21 new.json
-rw-r--r--   1 kevinrobell  admin   1.4M Mar 18 22:22 old.json
brew(main):002:0> :ruby.f.dependencies_hash
=> 
{"build_dependencies"=>["autoconf", "pkg-config", "rust"],
 "dependencies"=>["libyaml", "openssl@3"],
 "test_dependencies"=>[],
 "recommended_dependencies"=>[],
 "optional_dependencies"=>[],
 "uses_from_macos"=>["gperf", "libffi", "libxcrypt", "zlib"],
 "uses_from_macos_bounds"=>[{}, {}, {}, {}]}
brew(main):003:0> :ruby.f.to_internal_api_hash.slice("dependencies", "head_dependencies")
=> 
{"dependencies"=>
  [{:name=>"libyaml"},
   {:name=>"openssl@3"},
   {:name=>"autoconf", :tags=>[:build]},
   {:name=>"pkg-config", :tags=>[:build]},
   {:name=>"rust", :tags=>[:build]},
   {:name=>"gperf", :uses_from_macos=>true},
   {:name=>"libffi", :uses_from_macos=>true},
   {:name=>"libxcrypt", :uses_from_macos=>true},
   {:name=>"zlib", :uses_from_macos=>true}],
 "head_dependencies"=>
  [{:name=>"libyaml"},
   {:name=>"openssl@3"},
   {:name=>"autoconf", :tags=>[:build]},
   {:name=>"pkg-config", :tags=>[:build]},
   {:name=>"rust", :tags=>[:build]},
   {:name=>"gperf", :uses_from_macos=>true},
   {:name=>"libffi", :uses_from_macos=>true},
   {:name=>"libxcrypt", :uses_from_macos=>true},
   {:name=>"zlib", :uses_from_macos=>true}]}

@apainintheneck
Contributor Author

Actually, I was comparing it against the current public json v2 instead of the current internal json v3 which is much slimmer. This would increase the size significantly.

-rw-r--r--   1 kevinrobell  admin   1.0M Mar 18 22:21 new.json
-rw-r--r--   1 kevinrobell  admin   481K Mar 18 22:28 old.json

I'd prefer to find a format that doesn't require adding to the file size.

@MikeMcQuaid
Member

What do dependencies look like in the current internal JSON v3? Sorry, I'm a bit muddled at this point.

No idea whether it actually ends up adding to the size but I wonder if having dependencies as a single array (for both stable and head) would make more sense. Each array item could either be a String for a dependency in both cases or a T::Hash[String, Array] of dependency "tags" which could also include uses_from_macos and head that aren't currently strictly tags but look a lot like them in the API (and we don't allow arbitrary tags anyway).
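To make that concrete (illustrative only, not an agreed format), a single mixed array might look like the following, where a bare string means "no tags" and a hash maps a name to its tags:

"dependencies": [
  "libyaml",
  "openssl@3",
  { "autoconf": ["build"] },
  { "gperf": ["uses_from_macos"] },
  { "libffi": ["uses_from_macos", "head"] }
]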

@apainintheneck
Contributor Author

What do dependencies look like in the current internal JSON v3? Sorry, I'm a bit muddled at this point.

No idea whether it actually ends up adding to the size but I wonder if having dependencies as a single array (for both stable and head) would make more sense. Each array item could either be a String for a dependency in both cases or a T::Hash[String, Array] of dependency "tags" which could also include uses_from_macos and head that aren't currently strictly tags but look a lot like them in the API (and we don't allow arbitrary tags anyway).

Sorry I forgot to reply to this. It's the same as what we have now but without fields that have blank values. So it would roughly look like this if you convert the above example.

{"build_dependencies"=>["autoconf", "pkg-config", "rust"],
 "dependencies"=>["libyaml", "openssl@3"],
 "uses_from_macos"=>["gperf", "libffi", "libxcrypt", "zlib"],
 "uses_from_macos_bounds"=>[{}, {}, {}, {}]}

I don't remember how I calculated the format size before so I redid it by just generating the entire JSON file for the core tap. Unfortunately I ended up getting basically the same results. old.json is the current JSON v3 format and new.json is the new JSON v3 format.

brew(main):001:0> ENV["HOMEBREW_NO_INSTALL_FROM_API"] = "1"
=> "1"
brew(main):002:0> Formula.generating_hash!
=> true
brew(main):003:0> File.write("new.json", JSON.generate(CoreTap.instance.to_internal_api_hash))
$ ll *.json
-rw-r--r--  1 kevinrobell  admin    12M Mar 26 00:14 new.json
-rw-r--r--  1 kevinrobell  admin    11M Mar 26 00:11 old.json
$ du *.json
23736   new.json
22160   old.json

@MikeMcQuaid
Member

old.json is the current JSON v3 format and new.json is the new JSON v3 format.

Ok, that seems near enough to not worry about.

I think it's good to minimise the size but readability and being DRY are more important than that, IMO (but I'm open to disagreement there).

@apainintheneck
Contributor Author

I think I'll go with this one: #16410 (comment)

It's a bit smaller than the other new format. I'll also keep the empty values as null since that should make unpacking things a bit simpler.

@Bo98 any thoughts here before I open a PR?

