Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Slim down formula API JSON #16433

Conversation

apainintheneck
Copy link
Contributor

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

Related to #16410

The idea here is to create a file with the slimmed down formula JSON that omits certain fields and removes nil values, empty hashes and empty arrays. It looks like the resulting file is around half the size.

@@ -74,6 +78,7 @@ def generate_formula_api
raise
end

File.write("formula.json", JSON.generate(api_json))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where this file should go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Location seems fine. Filename could be something like formula_internal.json or formula_slim.json or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe they should both go in the same directory? internal/formula.json and internal/cask.json

Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @apainintheneck. A few ideas here; I don't expect them all to block the PR but I think this work is exposing the legacy mess that is our existing JSON format 😅

@@ -101,6 +102,71 @@ def write_names_and_aliases(regenerate: false)
end
end
end

REQUIRED_FIELDS = %w[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this blows up the scope a bit but: I'd instead be tempted to actually have a new Formula#to_hash (and Cask equivalent) or pass a versioned argument there and generate an entirely new hash format with e.g. less nesting/repetition that better meets our needs.

May be a good time to actually have some sort of stricter versioning or e.g. specifying a minimum Homebrew version in the JSON itself here, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be ideal but I'm a bit worried about backwards compatibility. I'm not sure how many people depend on the output of brew info --json=v2 though. I guess that could just be json v3 and we could just keep the old format at json v2.

Also, changing the format entirely would add a lot of complexity to the load from API methods which rely on the current format. This is especially important for formula that get removed since we'll have to load them from the original JSON definition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could have multiple loaders and just have a key like json_version that specifies which one we should try and load it with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a sample of the current slimmed-down JSON representation. Let me know if there's anything that looks like it can be removed and/or simplified.

https://gist.github.com/apainintheneck/882bb75d29394f7bdd5f8b5efe482038

Copy link
Member

@Bo98 Bo98 Jan 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"versions" -> "head" and "versions" -> "bottle" can probably be trimmed. "full_name" and "tap" too probably

"versions" -> "head" is quite an interesting one. In theory we might support:

head do
  url "..."
  version "whatever"
end

In practice: does it actually work? Not that it's a concern for Homebrew/core, but maybe something we could make error on a Ruby DSL level if it behaves weirdly. If it just makes it behave like the old devel spec then it's probably ok, even if weird.

Copy link
Member

@Bo98 Bo98 Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah in my mind a theoretical tap agnostic format would have been:

{
  "tap": {
    ...,
    "migrations": [...]
  },
  "formulae": [...],
  "casks": [...]
}

Which is basically the same thing you suggest.

The tap_migrations JSON being a separate download is a hack indicative of this problem. All of this is backwards incompatible changes however. We're already going to break backwards compatibility here in terms of old brew versions (given 984dcf8), but this would be a compatibility break in terms of old cached or mirrored JSON data (though we can do what we did when moving to JWS if we want to go this direction).

Though a compatibility layer is also possible without too significant effort.

If we go this direction at some point there would also be room to revive the dependency redo code shelved for a theoretical v2 (which might(?) also help our goal of reducing overall object footprint as it deduplicates some things, but it also adds more subkeys so not necessarily): Bo98@25aec4b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand your comment about 984dcf8. Those changes should be backwards compatible.

I like your idea about the tap agnostic format. I think it should be possible to implement that in a backwards compatible way though. Basically, we could add additional logic to build the Homebrew::API::Formula.all_formulae hash so that it stays the same as before minus the missing hash keys despite the changes to the formula.jws.json file. Whether that's worth the effort is another question entirely. This would mean that the JSON files saved during the installation should hold the same amount of information as before.

What was the goal of the dependency redo code you mention here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we'd have two different branches of parsing logic for the old and new formula.jws.json files but the same resulting Homebrew::API::Formula.all_formulae hash if that makes sense.

Copy link
Member

@Bo98 Bo98 Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand your comment about 984dcf8. Those changes should be backwards compatible.

I mean if we switch to generate JSONs that strip those previously not-nilable values then older brew versions can't read newer JSON (typically wasn't a concern for git taps, but API downloads can be detached from update.sh).

What was the goal of the dependency redo code you mention here?

Idea at the time was to make a formula that does:

depends_on "abc" => [:build, :test]
depends_on "def" => [:build, :test]
depends_on "ghi" => [:build, :test]

head do
  depends_on "jkl" => [:build, :test]
end

which currently produces

"build_dependencies": [
  "abc",
   "def",
   "ghi"
],
"test_dependencies": [
  "abc",
   "def",
   "ghi"
],
"head_dependencies": [
  "build_dependencies": [
    "abc",
     "def",
     "ghi",
     "jkl"
  ],
  "test_dependencies": [
    "abc",
     "def",
     "ghi",
     "jkl"
  ],
]

be:

"deps": [
  {
    "name": "abc",
    "tags": ["build", "test"]
  },
  {
    "name": "def",
    "tags": ["build", "test"]
  },
  {
    "name": "ghi",
    "tags": ["build", "test"]
  },
  {
    "name": "jkl",
    "tags": ["build", "test"],
    "specs": ["head"]
  }
]

but thinking more about more how the Ruby JSON parser works while it might make it physically smaller, it won't necessarily make it more performant so still very much a "to investigate" for the future.

So we'd have two different branches of parsing logic for the old and new formula.jws.json files but the same resulting Homebrew::API::Formula.all_formulae hash if that makes sense.

Yeah that was what I was thinking yeah

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that could just be json v3 and we could just keep the old format at json v2.

Yes, sorry, this was my thinking 👍🏻. Might even be something like vHomebrewCore or something as well as a separate v3 which could have more information in a nicer structure.

That's a good example of something we can likely remove because it doesn't fit the core repo use case.

Yeh, this is my thinking: we remove anything that we won't use/need in core.

For example, the following fields all seem to default to 0 so we could just omit them in that case and add 0 manually when loading things.
...
Also at a glance the versions hash always seems to include "bottle": true for core formulas anyway. Not sure if we need to include that.

These seem like good things to omit.

but this would be a compatibility break in terms of old cached or mirrored JSON data (though we can do what we did when moving to JWS if we want to go this direction).

I think we should aim to just force a new API download when we release a new enough Homebrew version. I can see an argument for a couple of point releases where we have branching logic here but it's a small enough download that needs refetched/updated often enough I don't think it's worth keeping around for long at all beyond testing/gradual rollout.

Whether that's worth the effort is another question entirely.

It is not 😁

@@ -74,6 +78,7 @@ def generate_formula_api
raise
end

File.write("formula.json", JSON.generate(api_json))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Location seems fine. Filename could be something like formula_internal.json or formula_slim.json or something?

# frozen_string_literal: true

module Utils
module Collection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is expectation that this will be used for Casks too?

Obviously no rush yet but: good to have tests for this.

private_constant :REQUIRED_FIELDS

sig { params(formula_hash: Hash).returns(Hash) }
def slim_hash(formula_hash)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: don't need this yet but probably want to have tests ensuring that formulae are loaded as expected from this JSON.

@apainintheneck
Copy link
Contributor Author

Neither the recursive compacting of empty values or removing a few more keys really seems to move the needle much at all and adds a lot more complexity. Just doing the following gets us most of the intended gains.

def api_hash
  hash = to_hash_with_variations
  if stable_bottle = hash.dig("bottle", "stable")
    stable_bottle["files"] = stable_bottle["files"].transform_values do |file|
      # The "url" is not used here and takes up a lot of space.
      file.slice("cellar", "sha256")
    end
  end

  hash.reject { |_, value| value.blank? }.to_h
end

This gets us down from 24MB to 13MB and we could get down to 12MB which is essentially where we're at here in this PR but I hardly think it's worth the effort. The big difference maker is removing the bottle URL from the hash. If this isn't something we're overly attached to, we could just remove it entirely. It doesn't seem to get used internally anywhere.

I think the simplest thing to do would be call this the Formula#api_hash method and leave the current Formula#to_hash method since it doesn't include variations. Any further clean up around the #to_hash methods could happen in another PR.

@MikeMcQuaid
Copy link
Member

This gets us down from 24MB to 13MB and we could get down to 12MB which is essentially where we're at here in this PR but I hardly think it's worth the effort.

24MB to 13MB is a huge difference, nice work!

The big difference maker is removing the bottle URL from the hash. If this isn't something we're overly attached to, we could just remove it entirely. It doesn't seem to get used internally anywhere.

I'm not opposed to removing it 👍🏻

I think the simplest thing to do would be call this the Formula#api_hash method and leave the current Formula#to_hash method since it doesn't include variations. Any further clean up around the #to_hash methods could happen in another PR.

Works for me 👍🏻

@muescha
Copy link
Contributor

muescha commented Jan 9, 2024

in the generated smaller json I see many sha256 defined, specific always a list for many os. which of the other os I would need if I can download a pre-filtered jsons like slim-formula-{os}.json:

  • slim-formula-arm64_sonoma.json
  • slim-formula-arm64_monterey.json

@muescha
Copy link
Contributor

muescha commented Jan 9, 2024

played a little - tried hjson and yaml

@apainintheneck
Copy link
Contributor Author

@muescha That's another interesting idea. As far as I can tell, the only reason to keep information about multiple bottles is to keep backwards compatibility with the current brew info json=v2 output. That being said it's obviously not necessary for installs or other things on any individual machine. Of course, we'd have to generate multiple API JSON files and change how we request that data but it's not that far out there if we really wanted to optimize things further

I took a stab at it and it looks like the size drops around another 5MB from the current reduced size. Down from 13MB to 7.4MB.

@apainintheneck
Copy link
Contributor Author

Closing this because the PR has been superseded by this other one: #16460

@Bo98
Copy link
Member

Bo98 commented Jan 10, 2024

the only reason to keep information about multiple bottles is to keep backwards compatibility with the current brew info json=v2 output

And brew fetch --bottle_tag=

@github-actions github-actions bot added the outdated PR was locked due to age label Feb 10, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
outdated PR was locked due to age
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants