WIP: Slim down formula API JSON #16433

apainintheneck · 2024-01-05T05:44:06Z

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same change?
Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your changes? Here's an example.
Have you successfully run brew style with your changes locally?
Have you successfully run brew typecheck with your changes locally?
Have you successfully run brew tests with your changes locally?

Related to #16410

The idea here is to create a file with the slimmed down formula JSON that omits certain fields and removes nil values, empty hashes and empty arrays. It looks like the resulting file is around half the size.

apainintheneck · 2024-01-05T05:45:07Z

Library/Homebrew/dev-cmd/generate-formula-api.rb

@@ -74,6 +78,7 @@ def generate_formula_api
 raise
 end

+ File.write("formula.json", JSON.generate(api_json))


I'm not sure where this file should go.

Location seems fine. Filename could be something like formula_internal.json or formula_slim.json or something?

Maybe they should both go in the same directory? internal/formula.json and internal/cask.json

MikeMcQuaid

Great work @apainintheneck. A few ideas here; I don't expect them all to block the PR but I think this work is exposing the legacy mess that is our existing JSON format 😅

MikeMcQuaid · 2024-01-05T10:13:05Z

Library/Homebrew/api/formula.rb

@@ -101,6 +102,71 @@ def write_names_and_aliases(regenerate: false)
 end
 end
 end
+
+ REQUIRED_FIELDS = %w[


I know this blows up the scope a bit but: I'd instead be tempted to actually have a new Formula#to_hash (and Cask equivalent) or pass a versioned argument there and generate an entirely new hash format with e.g. less nesting/repetition that better meets our needs.

May be a good time to actually have some sort of stricter versioning or e.g. specifying a minimum Homebrew version in the JSON itself here, too.

That would be ideal but I'm a bit worried about backwards compatibility. I'm not sure how many people depend on the output of brew info --json=v2 though. I guess that could just be json v3 and we could just keep the old format at json v2.

Also, changing the format entirely would add a lot of complexity to the load from API methods which rely on the current format. This is especially important for formula that get removed since we'll have to load them from the original JSON definition.

I guess we could have multiple loaders and just have a key like json_version that specifies which one we should try and load it with.

Here's a sample of the current slimmed-down JSON representation. Let me know if there's anything that looks like it can be removed and/or simplified.

https://gist.github.com/apainintheneck/882bb75d29394f7bdd5f8b5efe482038

"versions" -> "head" and "versions" -> "bottle" can probably be trimmed. "full_name" and "tap" too probably

"versions" -> "head" is quite an interesting one. In theory we might support:

head do url "..." version "whatever" end

In practice: does it actually work? Not that it's a concern for Homebrew/core, but maybe something we could make error on a Ruby DSL level if it behaves weirdly. If it just makes it behave like the old devel spec then it's probably ok, even if weird.

Yeah in my mind a theoretical tap agnostic format would have been:

{ "tap": { ..., "migrations": [...] }, "formulae": [...], "casks": [...] }

Which is basically the same thing you suggest.

The tap_migrations JSON being a separate download is a hack indicative of this problem. All of this is backwards incompatible changes however. We're already going to break backwards compatibility here in terms of old brew versions (given 984dcf8), but this would be a compatibility break in terms of old cached or mirrored JSON data (though we can do what we did when moving to JWS if we want to go this direction).

Though a compatibility layer is also possible without too significant effort.

If we go this direction at some point there would also be room to revive the dependency redo code shelved for a theoretical v2 (which might(?) also help our goal of reducing overall object footprint as it deduplicates some things, but it also adds more subkeys so not necessarily): Bo98@25aec4b

I'm not sure I understand your comment about 984dcf8. Those changes should be backwards compatible.

I like your idea about the tap agnostic format. I think it should be possible to implement that in a backwards compatible way though. Basically, we could add additional logic to build the Homebrew::API::Formula.all_formulae hash so that it stays the same as before minus the missing hash keys despite the changes to the formula.jws.json file. Whether that's worth the effort is another question entirely. This would mean that the JSON files saved during the installation should hold the same amount of information as before.

What was the goal of the dependency redo code you mention here?

So we'd have two different branches of parsing logic for the old and new formula.jws.json files but the same resulting Homebrew::API::Formula.all_formulae hash if that makes sense.

I'm not sure I understand your comment about 984dcf8. Those changes should be backwards compatible.

I mean if we switch to generate JSONs that strip those previously not-nilable values then older brew versions can't read newer JSON (typically wasn't a concern for git taps, but API downloads can be detached from update.sh).

What was the goal of the dependency redo code you mention here?

Idea at the time was to make a formula that does:

depends_on "abc" => [:build, :test] depends_on "def" => [:build, :test] depends_on "ghi" => [:build, :test] head do depends_on "jkl" => [:build, :test] end

which currently produces

"build_dependencies": [ "abc", "def", "ghi" ], "test_dependencies": [ "abc", "def", "ghi" ], "head_dependencies": [ "build_dependencies": [ "abc", "def", "ghi", "jkl" ], "test_dependencies": [ "abc", "def", "ghi", "jkl" ], ]

be:

"deps": [ { "name": "abc", "tags": ["build", "test"] }, { "name": "def", "tags": ["build", "test"] }, { "name": "ghi", "tags": ["build", "test"] }, { "name": "jkl", "tags": ["build", "test"], "specs": ["head"] } ]

but thinking more about more how the Ruby JSON parser works while it might make it physically smaller, it won't necessarily make it more performant so still very much a "to investigate" for the future.

So we'd have two different branches of parsing logic for the old and new formula.jws.json files but the same resulting Homebrew::API::Formula.all_formulae hash if that makes sense.

Yeah that was what I was thinking yeah

I guess that could just be json v3 and we could just keep the old format at json v2.

Yes, sorry, this was my thinking 👍🏻. Might even be something like vHomebrewCore or something as well as a separate v3 which could have more information in a nicer structure.

That's a good example of something we can likely remove because it doesn't fit the core repo use case.

Yeh, this is my thinking: we remove anything that we won't use/need in core.

For example, the following fields all seem to default to 0 so we could just omit them in that case and add 0 manually when loading things.
...
Also at a glance the versions hash always seems to include "bottle": true for core formulas anyway. Not sure if we need to include that.

These seem like good things to omit.

but this would be a compatibility break in terms of old cached or mirrored JSON data (though we can do what we did when moving to JWS if we want to go this direction).

I think we should aim to just force a new API download when we release a new enough Homebrew version. I can see an argument for a couple of point releases where we have branching logic here but it's a small enough download that needs refetched/updated often enough I don't think it's worth keeping around for long at all beyond testing/gradual rollout.

Whether that's worth the effort is another question entirely.

It is not 😁

MikeMcQuaid · 2024-01-05T10:14:16Z

Library/Homebrew/dev-cmd/generate-formula-api.rb

@@ -74,6 +78,7 @@ def generate_formula_api
 raise
 end

+ File.write("formula.json", JSON.generate(api_json))


Location seems fine. Filename could be something like formula_internal.json or formula_slim.json or something?

MikeMcQuaid · 2024-01-05T10:14:55Z

Library/Homebrew/utils/collection.rb

+# frozen_string_literal: true
+
+module Utils
+ module Collection


Is expectation that this will be used for Casks too?

Obviously no rush yet but: good to have tests for this.

MikeMcQuaid · 2024-01-05T10:15:23Z

Library/Homebrew/api/formula.rb

+ private_constant :REQUIRED_FIELDS
+
+ sig { params(formula_hash: Hash).returns(Hash) }
+ def slim_hash(formula_hash)


Again: don't need this yet but probably want to have tests ensuring that formulae are loaded as expected from this JSON.

apainintheneck · 2024-01-09T07:53:24Z

Neither the recursive compacting of empty values or removing a few more keys really seems to move the needle much at all and adds a lot more complexity. Just doing the following gets us most of the intended gains.

def api_hash
  hash = to_hash_with_variations
  if stable_bottle = hash.dig("bottle", "stable")
    stable_bottle["files"] = stable_bottle["files"].transform_values do |file|
      # The "url" is not used here and takes up a lot of space.
      file.slice("cellar", "sha256")
    end
  end

  hash.reject { |_, value| value.blank? }.to_h
end

This gets us down from 24MB to 13MB and we could get down to 12MB which is essentially where we're at here in this PR but I hardly think it's worth the effort. The big difference maker is removing the bottle URL from the hash. If this isn't something we're overly attached to, we could just remove it entirely. It doesn't seem to get used internally anywhere.

I think the simplest thing to do would be call this the Formula#api_hash method and leave the current Formula#to_hash method since it doesn't include variations. Any further clean up around the #to_hash methods could happen in another PR.

MikeMcQuaid · 2024-01-09T08:49:07Z

This gets us down from 24MB to 13MB and we could get down to 12MB which is essentially where we're at here in this PR but I hardly think it's worth the effort.

24MB to 13MB is a huge difference, nice work!

The big difference maker is removing the bottle URL from the hash. If this isn't something we're overly attached to, we could just remove it entirely. It doesn't seem to get used internally anywhere.

I'm not opposed to removing it 👍🏻

I think the simplest thing to do would be call this the Formula#api_hash method and leave the current Formula#to_hash method since it doesn't include variations. Any further clean up around the #to_hash methods could happen in another PR.

Works for me 👍🏻

muescha · 2024-01-09T09:18:59Z

in the generated smaller json I see many sha256 defined, specific always a list for many os. which of the other os I would need if I can download a pre-filtered jsons like slim-formula-{os}.json:

slim-formula-arm64_sonoma.json
slim-formula-arm64_monterey.json

muescha · 2024-01-09T10:25:23Z

played a little - tried hjson and yaml

and leave only the first item in files (*-stripfiles)
files here: https://gist.github.com/muescha/f28a82767513e615ce4638b179c1fb10?permalink_comment_id=4823317#gistcomment-4823317

apainintheneck · 2024-01-10T07:01:28Z

@muescha That's another interesting idea. As far as I can tell, the only reason to keep information about multiple bottles is to keep backwards compatibility with the current brew info json=v2 output. That being said it's obviously not necessary for installs or other things on any individual machine. Of course, we'd have to generate multiple API JSON files and change how we request that data but it's not that far out there if we really wanted to optimize things further

I took a stab at it and it looks like the size drops around another 5MB from the current reduced size. Down from 13MB to 7.4MB.

apainintheneck · 2024-01-10T07:51:15Z

Closing this because the PR has been superseded by this other one: #16460

Bo98 · 2024-01-10T12:48:32Z

the only reason to keep information about multiple bottles is to keep backwards compatibility with the current brew info json=v2 output

And brew fetch --bottle_tag=

Slim down json size

51fe879

apainintheneck commented Jan 5, 2024

View reviewed changes

MikeMcQuaid reviewed Jan 5, 2024

View reviewed changes

This was referenced Jan 10, 2024

formulary: allow for more keyless fields when loading from API #16459

Merged

formula: add slimmer api json v3 representation #16460

Closed

apainintheneck closed this Jan 10, 2024

apainintheneck mentioned this pull request Jan 27, 2024

Next gen api formula json v3 #16541

Merged

7 tasks

github-actions bot added the outdated PR was locked due to age label Feb 10, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Slim down formula API JSON #16433

WIP: Slim down formula API JSON #16433

apainintheneck commented Jan 5, 2024

apainintheneck Jan 5, 2024

MikeMcQuaid Jan 5, 2024

apainintheneck Jan 6, 2024

MikeMcQuaid left a comment

MikeMcQuaid Jan 5, 2024

apainintheneck Jan 6, 2024

apainintheneck Jan 6, 2024

apainintheneck Jan 7, 2024

Bo98 Jan 7, 2024 •

edited

Bo98 Jan 8, 2024 •

edited

apainintheneck Jan 8, 2024

apainintheneck Jan 8, 2024

Bo98 Jan 8, 2024 •

edited

MikeMcQuaid Jan 8, 2024

MikeMcQuaid Jan 5, 2024

MikeMcQuaid Jan 5, 2024

MikeMcQuaid Jan 5, 2024

apainintheneck commented Jan 9, 2024

MikeMcQuaid commented Jan 9, 2024

muescha commented Jan 9, 2024

muescha commented Jan 9, 2024

apainintheneck commented Jan 10, 2024

apainintheneck commented Jan 10, 2024

Bo98 commented Jan 10, 2024

WIP: Slim down formula API JSON #16433

WIP: Slim down formula API JSON #16433

Conversation

apainintheneck commented Jan 5, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MikeMcQuaid left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Bo98 Jan 7, 2024 • edited

Choose a reason for hiding this comment

Bo98 Jan 8, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Bo98 Jan 8, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

apainintheneck commented Jan 9, 2024

MikeMcQuaid commented Jan 9, 2024

muescha commented Jan 9, 2024

muescha commented Jan 9, 2024

apainintheneck commented Jan 10, 2024

apainintheneck commented Jan 10, 2024

Bo98 commented Jan 10, 2024

Bo98 Jan 7, 2024 •

edited

Bo98 Jan 8, 2024 •

edited

Bo98 Jan 8, 2024 •

edited