Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

formula: add slimmer api json v3 representation #16460

Closed

Conversation

apainintheneck
Copy link
Contributor

@apainintheneck apainintheneck commented Jan 10, 2024

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew typecheck with your changes locally?
  • Have you successfully run brew tests with your changes locally?

Alternative approach to #16433 based on discussion there and in the original issue #16410.

This is a second go at this after the recursive compacting approach in the earlier PR ended up not yielding big gains beyond what's in this PR. IMO this is the simplest approach to reducing the JSON size by omitting the bottle url field that we don't use anywhere and also removing blank fields at the top-level of the resulting hash.

I decided to memoize the variations generation since that takes a while but shouldn't change once a formula is loaded. This reduces time in the brew generate-formula-api command.

TODO:

@apainintheneck apainintheneck changed the title formula: add slimmer api json representation formula: add slimmer api json representation v2 Jan 10, 2024
Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes a lot of sense. I like the minimal approach here!

@apainintheneck apainintheneck force-pushed the slim-formula-json-v2 branch 2 times, most recently from 2494eee to 4c00010 Compare January 13, 2024 08:25
Comment on lines -374 to +409
allow(described_class).to receive(:loader_for).and_return(described_class::FormulaAPILoader.new(formula_name))
allow(Homebrew::EnvConfig).to receive(:no_install_from_api?).and_return(false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding the ruby_source_path to the hash above and by mocking things so that it thinks it should use the API we no longer have to mock the Formulary.loader_for method. This means that the tests should better represent how the code is used in real life. It doesn't seem to make the tests noticeably slower either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 🎉

@apainintheneck
Copy link
Contributor Author

As far as I can tell there should be no harm in merging this in as is. The brew generate-formula-api command will generate more files and take a little longer but those files will get ignored when the data is selectively archived and stored for later steps.

- name: Archive data
  run: tar czvf data-core.tar.gz _data/formula/ _data/formula_canonical.json api/formula/ api/formula_tap_migrations.json formula/

From: https://github.com/Homebrew/formulae.brew.sh/blame/master/.github/workflows/scheduled.yml

I'm not entirely sure what changes need to be made in that repo after this ships to start shipping smaller JSON files to all users though.

@MikeMcQuaid
Copy link
Member

@apainintheneck Happy to give final review and merge when no longer draft.

@apainintheneck
Copy link
Contributor Author

I just wanted to open a WIP PR over on the formulae.brew.sh repo before merging this in to see if there were any changes I wanted to make (there was one).

@apainintheneck
Copy link
Contributor Author

Note that even with the memoization added to Formula#to_hash_with_variations I expect brew generate-formula-api to be 10-15 seconds slower than before. Of course, this is on an older machine but you get the idea.

Before

$ time brew generate-formula-api
________________________________________________________
Executed in  113.40 secs    fish           external
   usr time   96.75 secs  190.00 micros   96.75 secs
   sys time   13.92 secs  616.00 micros   13.92 secs

After

$ time brew generate-formula-api
________________________________________________________
Executed in  125.60 secs    fish           external
   usr time  103.79 secs  194.00 micros  103.79 secs
   sys time   15.89 secs  625.00 micros   15.89 secs

@apainintheneck apainintheneck marked this pull request as ready for review January 17, 2024 07:02
@apainintheneck
Copy link
Contributor Author

This should probably not get merged in until the site generation is working again for formulae.brew.sh.

@apainintheneck apainintheneck marked this pull request as draft January 17, 2024 07:05
@MikeMcQuaid
Copy link
Member

Note that even with the memoization added to Formula#to_hash_with_variations I expect brew generate-formula-api to be 10-15 seconds slower than before. Of course, this is on an older machine but you get the idea.

Seems reasonable, thanks @apainintheneck!

@MikeMcQuaid
Copy link
Member

@apainintheneck @Bo98 Just a thought here, sorry it's here so late: is it worth trying to crap tap_migrations data in here somewhere/somehow so we don't need to redownload? Similarly, I also wonder if putting core and cask in the same gzipped file makes it smaller/bigger than two separate downloads are.

@Bo98
Copy link
Member

Bo98 commented Jan 17, 2024

Yes I was saying in the other issue that if we're doing a complete compatibility break here then now's the time to rethink things and start combining the files.

I also think redoing the list from an array to a "name" => data format will also help, basically incorporating at least part of this code into the format itself:

cache["aliases"] = {}
cache["renames"] = {}
cache["formulae"] = json_formulae.to_h do |json_formula|
json_formula["aliases"].each do |alias_name|
cache["aliases"][alias_name] = json_formula["name"]
end
(json_formula["oldnames"] || [json_formula["oldname"]].compact).each do |oldname|
cache["renames"][oldname] = json_formula["name"]
end
[json_formula["name"], json_formula.except("name")]
end

cache["renames"] = {}
cache["casks"] = json_casks.to_h do |json_cask|
token = json_cask["token"]
json_cask.fetch("old_tokens", []).each do |old_token|
cache["renames"][old_token] = token
end
[token, json_cask.except("token")]
end

@apainintheneck
Copy link
Contributor Author

Yeah, I was planning on making the smallest possible change here that would benefit parsing performance. The changes here only affect file size not data layout. All of the other suggestions would change it's layout and possibly the download strategy which feels like something outside of the scope of this PR.

If you all still think that's the way to go, we need to decide on the new download strategy and file layouts before we can continue. Beyond that we'll have to update the API code so that it can manage both the old and new file formats temporarily. The slim cask JSON representation will need to be added in too.

@MikeMcQuaid
Copy link
Member

@apainintheneck I think it's worth getting this into a stage/format that we can incrementally move to a better structure in future with minimal changes here and take this chance to reduce the number of downloads as well as the size. Feel free to push back on any of that, though, if you feel it blows up the scope too much.

Comment on lines 58 to 65
internal_formula_v3_hash = {
"aliases" => tap.alias_table,
"formulae" => {},
"renames" => tap.formula_renames,
"tap_migrations" => tap.tap_migrations,
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to match the formula cache in the api/formula.rb file.

cache["aliases"] = {}
cache["renames"] = {}
cache["formulae"] = json_formulae.to_h do |json_formula|

@apainintheneck
Copy link
Contributor Author

I think this is ready for final review. The main change here is that we're generating this new internal_formula_v3.json file which will end up getting generated automatically and hosted by formulae.brew.sh. Of course, it's a private API so it probably doesn't matter if it's available to download for a bit.

I'll make a follow-up PR to start signing those files in the formulae.brew.sh repo along with deleting the unsigned internal json files after I make a PR for the cask equivalent to this PR. Those should be trivial changes though.

Once we are generating both of the internal representation files, it will be much simpler to test using them to load packages locally. We can proceed with the environment variable approach of loading only for certain internal users and then testing it on larger groups of users. The final change to enable it for all users could wait for a minor release.

@apainintheneck apainintheneck marked this pull request as ready for review January 20, 2024 05:32
@Bo98
Copy link
Member

Bo98 commented Jan 20, 2024

Given we've now lifted up renames and aliases, do we need oldnames (and the deprecated oldname) and aliases in the formulae hashes themselves anymore or can we strip them too?

@apainintheneck apainintheneck changed the title formula: add slimmer api json representation v2 formula: add slimmer api json v3 representation Jan 20, 2024
@Bo98
Copy link
Member

Bo98 commented Jan 20, 2024

Would also be good to lift up tap info like tap name and tap_git_head.

@apainintheneck
Copy link
Contributor Author

Given we've now lifted up renames and aliases, do we need oldnames (and the deprecated oldname) and aliases in the formulae hashes themselves anymore or can we strip them too?

You're right I don't think those are really needed anymore. We can just skip the manual overrides in Formulary.load_formula_from_api and things should work correctly as long as we source the information correctly in the core tap.

To Skip

@oldnames_array = json_formula["oldnames"] || [json_formula["oldname"]].compact
def oldnames
self.class.instance_variable_get(:@oldnames_array)
end
@aliases_array = json_formula.fetch("aliases", [])
def aliases
self.class.instance_variable_get(:@aliases_array)
end

To Update

# a table mapping alias to formula name
# @private
def alias_table
return @alias_table if @alias_table
@alias_table = {}
alias_files.each do |alias_file|
@alias_table[alias_file_to_name(alias_file)] = formula_file_to_name(alias_file.resolved_path)
end
@alias_table
end


I had this idea that we were caching the formula source file but it looks like we only do that for casks and that's why I got confused. If we're caching the json source, then we do need all of this information to be there when installing since we won't have the whole API JSON file at uninstall time. Of course, we could just add that info manually too.

source: JSON.pretty_generate(json_cask),

@apainintheneck
Copy link
Contributor Author

Would also be good to lift up tap info like tap name and tap_git_head.

I don't think tap name gets used at all though but git head wouldn't be bad to add. Then, we can just reference it in CoreTap if we need to access tap_git_head.

To Skip

@tap_git_head_string = json_formula["tap_git_head"]
def tap_git_head
self.class.instance_variable_get(:@tap_git_head_string)
end

@apainintheneck
Copy link
Contributor Author

Of course, the changes to formula loading will happen in a follow-up PR. I'll just omit more fields from the JSON based on this.

@apainintheneck
Copy link
Contributor Author

Just to be double sure, we always store the Ruby source file in something like $(brew --cellar)/foo/1.0/.brew/foo.rb so we don't have to worry about backwards compatibility with loading old cached JSON representations of installed formulae. Is that right?

It seems like that's only going to be a concern for casks.

Comment on lines 2451 to 2456
# Remove keys that default to zero.
%w[revision version_scheme].each do |key|
next unless api_hash[key].zero?

api_hash.delete(key)
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@revision = self.class.revision || 0
@version_scheme = self.class.version_scheme || 0

@apainintheneck
Copy link
Contributor Author

apainintheneck commented Jan 20, 2024

Okay, I omitted a few more fields from the hash that were already included at the top-level. I also removed a few values that default to zero if they are in fact zero.

This is a second go at this after the approach in the earlier PR
ended up not yielding big gains. IMO this is the simplest approach
to reducing the JSON size by omitting the bottle url field that we
don't use anywhere and also removing blank fields at the top-level
of the resulting hash.
This tests loading the slimmed down hash. In the process I found
out that we were mocking all of the calls to the FormulaAPILoader
which seemed unnecessary so I changed that so that it uses the
Formulary loaders directly. This better simulates how this works
outside of tests.
This value shouldn't be excluded from the hash if it is false
because nil ends up meaning true for backwards compatibility here.
It seemed to make more sense once I started working on this over on
the `formulae.brew.sh` repo to have this match the output directory
of the `formula_tap_migrations.json` file which behaves similarly.
…ture

We want the internal JSON used to send the package list over the internet
instead of using the tap Git repo to match the internal data representation
we have for the core formula tap. That means including additional information
like aliases, renames and tap migrations along with storing the formulae
as a hash of name to formula info. This not only reduces extra data
transformation when loading the data but also reduces network calls since
tap migrations are now included in the same file.

The amount of additional data included in the file is very small so this
shouldn't drastically affect the file size either.
Removed aliases, oldname, oldnames, tap and tap_git_head because
that info will now be included at the top level of the formula
API JSON file.

Removed the revision and version_scheme hash keys when the value
is zero because that is the default internally already.
Copy link
Member

@Bo98 Bo98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good

Comment on lines 59 to 60
"tap_name" => tap.name,
"tap_git_head" => tap.git_head,
Copy link
Member

@Bo98 Bo98 Jan 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"tap_name" => tap.name,
"tap_git_head" => tap.git_head,
"tap" => {
"name" => tap.name,
"git_head" => tap.git_head,
},

maybe?

Sounds like it could be a nice metadata section should we do anything else in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on that one a bit. Technically all of this information relates to a tap though so it seems kind of like unnecessary nesting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just name the file formula_tap.json and remove the tap prefix from all of the hash keys?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can see that argument that's fair

Copy link
Member

@Bo98 Bo98 Jan 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just name the file formula_tap.json

If anything it could be in reflection of the tap name so we have in theory homebrew-core.json, homebrew-cask.json and (purely as an example) a homebrew-cask-fonts.json would theoretically slot in nicely. Then there's probably no need to store the name in the JSON itself.

The reason I'm talking about generalising it like that is when structured this way, we can maybe even integrate it better by having an AbstractAPITap or something that the core tap classes extend from and merge Homebrew::API::Formula and Homebrew::API::Cask into that (e.g. a generic JSON read would look for #{@name}.json). Reducing the formulae and cask API code differences to just what's already in Formulary and CaskLoader and generalising the rest would result in quite a nice cleanup, less code to maintain and less logical branches we have to test. This is maybe a longer term ambition given we will have to deal with backwards compatibility initially which will probably require to keep things separate for a bit but still something worth mentioning to think about when structuring the format and picking names now as we don't want to have to change them again.

Downloadable is an example of prior work to combine formula and cask code more and it'd be great to see that general idea continue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anything it could be in reflection of the tap name so we have in theory homebrew-core.json, homebrew-cask.json and (purely as an example) a homebrew-cask-fonts.json would theoretically slot in nicely. Then there's probably no need to store the name in the JSON itself.

I like this naming pattern. I don't think we even use the tap name when initializing formulas from the API so it might not be necessary to include anyway.

The reason I'm talking about generalising it like that is when structured this way, we can maybe even integrate it better by having an AbstractAPITap or something that the core tap classes extend from and merge Homebrew::API::Formula and Homebrew::API::Cask into that (e.g. a generic JSON read would look for #{@name}.json). Reducing the formulae and cask API code differences to just what's already in Formulary and CaskLoader and generalising the rest would result in quite a nice cleanup, less code to maintain and less logical branches we have to test. This is maybe a longer term ambition given we will have to deal with backwards compatibility initially which will probably require to keep things separate for a bit but still something worth mentioning to think about when structuring the format and picking names now as we don't want to have to change them again.

Downloadable is an example of prior work to combine formula and cask code more and it'd be great to see that general idea continue.

Long-term this sounds like a good idea.

Copy link
Member

@Bo98 Bo98 Jan 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we even use the tap name when initializing formulas from the API so it might not be necessary to include anyway.

Probably not but if we move it to live in the tap class in the future then it will probably be easier that way rather than saying "homebrew-core" = "formula" and "homebrew-cask" = "cask", which seems to be unnecessary extra logic when we can just name it in a way that avoids that.

The tap name being a part of the file path is also exactly how Ruby formulae/casks work too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handled in bc029cf

@Bo98
Copy link
Member

Bo98 commented Jan 20, 2024

Just to be double sure, we always store the Ruby source file in something like $(brew --cellar)/foo/1.0/.brew/foo.rb so we don't have to worry about backwards compatibility with loading old cached JSON representations of installed formulae.

Yes, that's correct.

This makes it clearer that the file represents the entire tap
and means that we can omit the tap name from the JSON itself.
@MikeMcQuaid
Copy link
Member

I don't think tap name gets used at all

Let's not have anything (or as little as possible) in the internal API that we don't use. YAGNI and all that.

Just to be double sure, we always store the Ruby source file in something like $(brew --cellar)/foo/1.0/.brew/foo.rb so we don't have to worry about backwards compatibility with loading old cached JSON representations of installed formulae.

Yes, at least for anything other than truly ancient kegs.

Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the filename change (unless I've missed something) is blocking. The rest are all optional and you can self-merge whenever you feel comfortable. Thanks!

rescue
onoe "Error while generating data for formula '#{name}'."
raise
end

homebrew_core_tap_json = JSON.generate(homebrew_core_tap_hash)
File.write("api/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
File.write("api/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?
File.write("api/internal_core_v3.json", homebrew_core_tap_json) unless args.dry_run?

or something, unless I've missed something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an idea from @Bo98 that I liked. See this thread starting here #16460 (comment)

Copy link
Member

@Bo98 Bo98 Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might make sense to lean on path components so it's /api/internal/v3/homebrew-core.json (which downloads to just homebrew-core.json in the cache). The version being a "directory" like that is very common practice in remote APIs (and we can't use headers like GitHub etc use instead).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second hardest problem is CS strikes again. I'm fine with that proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me:

Suggested change
File.write("api/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?
File.write("api/internal/v3/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self that this requires creating some new directories too (think mkdir -p).


if (stable_bottle = api_hash.dig("bottle", "stable"))
stable_bottle["files"] = stable_bottle["files"].transform_values do |file|
# The "url" is not used here and takes up a lot of space.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also probably remove the root_url given it's the same for every bottle in the tap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that could probably just be at the top-level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's not used at all.

if Homebrew::EnvConfig.bottle_domain == HOMEBREW_BOTTLE_DEFAULT_DOMAIN
root_url HOMEBREW_BOTTLE_DEFAULT_DOMAIN
else
root_url Homebrew::EnvConfig.bottle_domain
end

Copy link
Member

@Bo98 Bo98 Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it was removed for security reasons prior to the introduction of JWS. Those reasons aren't really relevant anymore since JWS fixed the root issue but equally the API should be able to behave like brew bottle and omit root_url if it is the default (which is always for homebrew-core).

Even that code that's there now on the Formulary feels unnecessary - surely we can skip calling root_url like how we don't call it for core tap Ruby formulae.

"tap_git_head",
)

# Remove keys that default to zero.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could maybe also do this for keys that are null (if you haven't already) e.g. urls.stable.using for most formulae

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were diminishing returns with trying to remove nested blank fields in the previous PR but I can just double-check that right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing more nested fields saves around another megabyte. We're down to 11MB.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Comment on lines -374 to +409
allow(described_class).to receive(:loader_for).and_return(described_class::FormulaAPILoader.new(formula_name))
allow(Homebrew::EnvConfig).to receive(:no_install_from_api?).and_return(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice 🎉

"oldname",
"oldnames",
"tap",
"tap_git_head",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Could remove options, nothing in homebrew/core has them.
  • Can remove linked_keg, installed, pinned, outdated as it's a reflection of the machine that generated this output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe those will all be omitted automatically by the api_has.reject { |_, value| value.blank? } clause later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linked_keg, installed, pinned, outdated will not be removed depending on the values on the generated machine (which seems undesirable). Furthermore, I think we should be explicit about removing all fields that aren't being used in the API loader.

I believe those will all be omitted automatically by the api_has.reject { |_, value| value.blank? } clause later on.

A mild concern (growing a little through this PR) is that it's very hard to read this method and figure out what the output will be, compared to Formula#to_hash. I wonder if it's worth having another to_hash method where fields are omitted there rather than removing them from to_hash_with_variations here to be a bit more explicit and, more importantly over time, that adding a new field to to_hash doesn't add a new field here unless added explicitly.

Alternatively, I could see having something like a shared constant used by both the loader and here to make sure we never keep or access a Hash key that we're not using in both places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it time for v3 of this PR? 😆

I prefer the idea of explicitly building the hash over the shared constant. That would require a new #to_api_hash_with_variations along with building the hash explicitly in #to_api_hash. As a consequence of that the runtime of the brew generate-*-api methods will probably be double what it is currently since we're doing twice the work. Building the variations is especially slow. If that sounds good, I'll close this PR and continue with that idea. Hopefully we won't need a v4.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linked_keg, installed, pinned, outdated will not be removed depending on the values on the generated machine (which seems undesirable).

But we know which machines we're running these commands on so that shouldn't be a problem, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it time for v3 of this PR? 😆

@apainintheneck I'm sorry, I do really appreciate the work here. I'm also happy for you to push to this PR any changes or merge this as-is and iterate on it? Whatever you feel is best. I don't want you to get demotivated here ❤️

As a consequence of that the runtime of the brew generate-*-api methods will probably be double what it is currently since we're doing twice the work. Building the variations is especially slow.

I wonder if we can move to having more caching/memoization in the methods that are called that are slow to better handle that?

Regardless: I personally think generation speed is something we can iterate on and improve later if we can.

But we know which machines we're running these commands on so that shouldn't be a problem, right?

Unfortunately not. They are run on GitHub Actions workers which GitHub messes around with the contents of on a regular basis. It'd also be nicer for e.g. testing if I can generate the same output on my local machine as is generated on the server.

Again, not blocking but this is another v2 API generation shortcoming that would be nice to address here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apainintheneck I'm sorry, I do really appreciate the work here. I'm also happy for you to push to this PR any changes or merge this as-is and iterate on it? Whatever you feel is best. I don't want you to get demotivated here ❤️

Nah, I just think it's funny.

I wonder if we can move to having more caching/memoization in the methods that are called that are slow to better handle that?

Regardless: I personally think generation speed is something we can iterate on and improve later if we can.

I don't think there are any obvious performance improvements to be made here. The slow thing is reloading a formula multiple times which will still be necessary. I've memoized the variations generation in this PR but we can't really do that if we have an entirely separate method for api generation.

Unfortunately not. They are run on GitHub Actions workers which GitHub messes around with the contents of on a regular basis. It'd also be nicer for e.g. testing if I can generate the same output on my local machine as is generated on the server.

Again, not blocking but this is another v2 API generation shortcoming that would be nice to address here.

I didn't know that. Interesting...

- remove unused url options
- remove more default values
- remove bottle root url which isn't needed
- remove full_name which is generated
Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments but I'd be fine with merging as-is and iterating once merged whenever you prefer that. Great work!

rescue
onoe "Error while generating data for formula '#{name}'."
raise
end

homebrew_core_tap_json = JSON.generate(homebrew_core_tap_hash)
File.write("api/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me:

Suggested change
File.write("api/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?
File.write("api/internal/v3/homebrew-core.json", homebrew_core_tap_json) unless args.dry_run?

"oldname",
"oldnames",
"tap",
"tap_git_head",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linked_keg, installed, pinned, outdated will not be removed depending on the values on the generated machine (which seems undesirable). Furthermore, I think we should be explicit about removing all fields that aren't being used in the API loader.

I believe those will all be omitted automatically by the api_has.reject { |_, value| value.blank? } clause later on.

A mild concern (growing a little through this PR) is that it's very hard to read this method and figure out what the output will be, compared to Formula#to_hash. I wonder if it's worth having another to_hash method where fields are omitted there rather than removing them from to_hash_with_variations here to be a bit more explicit and, more importantly over time, that adding a new field to to_hash doesn't add a new field here unless added explicitly.

Alternatively, I could see having something like a shared constant used by both the loader and here to make sure we never keep or access a Hash key that we're not using in both places.

"tap_git_head",
)

# Remove keys that default to zero.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

expect(formula).to be_a(Formula)

# Use a hash with missing fields to load the second formula.
api_json = formula.to_api_hash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wonder if it's worth having a full comparison in the test rather than looking at individual keys like this. Will definitely make it easier to review PRs like this by being able to eyeball a full sample in the test and also to see when fields are added/removed unintentionally here and see what fields are nil, unneeded, unused, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, real tests will have to wait until we're actually building this on CI. I'm guessing that some other tweaks will be necessary to get things working afterwards anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apainintheneck Yeh, sorry, I guess what I meant here is that the best assertion here would be that expect(formula.to_api_hash).to eql { ... } with some actual hash where we can compare the output.

@apainintheneck
Copy link
Contributor Author

Closing this out in favor of the more explicit approach detailed in this thread: #16460 (comment)

@github-actions github-actions bot added the outdated PR was locked due to age label Feb 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
install from api Relates to API installs outdated PR was locked due to age
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants