Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add integer 32 & 64 types #800

Closed
wants to merge 3 commits into from
Closed

Conversation

stloyd
Copy link
Member

@stloyd stloyd commented Nov 19, 2023

Change Log

Added

  • Add integer 32 & 64 types

Fixed

Changed

Removed

Deprecated

Security


Description

Closes #771

Copy link
Contributor

github-actions bot commented Nov 19, 2023

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 34.748mb +0.02%  | 1.154s +0.29%    | ±0.39% -25.13%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.605mb +0.08%   | 300.840ms -0.91% | ±0.17% -92.61%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.772mb +0.13%   | 1.388s -0.66%    | ±0.18% -81.60%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 239.476mb +0.00% | 1.575s -0.61%    | ±0.50% +91.90%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.558mb +0.01%   | 24.261ms +0.16%  | ±1.62% -44.37%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.558mb +0.01%   | 416.016ms +2.28% | ±1.43% +233.90% |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.249mb +0.01% | 62.523ms -3.33% | ±1.27% -42.01% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 94.728mb +0.01%  | 485.995ms +9.52% | ±1.47% +212.00% |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.711mb +0.01%  | 71.594ms +1.43%  | ±0.37% -71.09%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 105.311mb +0.01% | 53.975ms -2.95%  | ±0.80% +28.03%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 320.783mb +0.00% | 1.457s +1.31%    | ±0.52% +5.30%   |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.589mb +0.02%  | 40.720ms +0.24%  | ±0.21% -82.79%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev           |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.296mb +0.01%  | 2.242ms -6.30%   | ±3.13% -10.20%   |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.086mb +0.01%  | 181.941ms +0.19% | ±2.67% +200.48%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.612mb +0.01%  | 18.019ms -0.85%  | ±0.22% -92.99%   |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 75.434mb +0.01%  | 1.728ms +2.65%   | ±3.89% +95.73%   |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 75.434mb +0.01%  | 1.695ms -2.02%   | ±2.53% +13.07%   |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 74.648mb +0.01%  | 2.620ms +0.59%   | ±1.46% +10.56%   |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.177mb +0.01%  | 14.116ms +0.94%  | ±1.20% +41.02%   |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.176mb +0.01%  | 14.504ms +4.31%  | ±1.15% -30.73%   |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.079mb +0.01%  | 1.794μs -5.28%   | ±2.67% +5.66%    |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.079mb +0.01%  | 0.400μs 0.00%    | ±0.00% 0.00%     |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 86.638mb +0.01%  | 12.345ms -6.78%  | ±2.30% -32.14%   |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 115.995mb +0.01% | 62.161ms -6.90%  | ±2.13% +238.89%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 75.697mb +0.01%  | 1.923ms +1.27%   | ±0.41% -81.11%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 77.965mb +0.01%  | 33.076ms -0.98%  | ±0.96% +2590.90% |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 77.798mb +0.01%  | 4.524ms +2.71%   | ±1.06% +4.02%    |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.222mb +0.01%  | 38.607ms +1.37%  | ±0.23% -64.07%   |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.223mb +0.01%  | 38.801ms -1.83%  | ±0.36% -69.05%   |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.222mb +0.01%  | 38.816ms +0.70%  | ±0.36% +4.71%    |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.522mb +0.01%  | 7.420ms +2.40%   | ±0.25% -4.26%    |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.079mb +0.01%  | 28.819ms -0.04%  | ±0.67% -32.68%   |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.079mb +0.01%  | 13.342μs -0.58%  | ±2.23% +139.02%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.079mb +0.01%  | 15.400μs -1.78%  | ±0.00% -100.00%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.087mb +0.01%  | 185.643ms +1.41% | ±1.20% +176.53%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 115.842mb +0.01% | 772.500ms +1.47% | ±0.87% -32.64%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 59.560mb +0.01%  | 391.210ms +0.26% | ±1.21% +153.42%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.682mb +0.05%  | 79.806ms +0.23%  | ±0.39% -64.43%   |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+

@stloyd stloyd marked this pull request as ready for review November 19, 2023 13:09
@norberttech
Copy link
Member

int32/int64 should be also represented by schema definition

@github-actions github-actions bot added size: M and removed size: S labels Nov 19, 2023
Schema\Definition::integer('id', $nullable = false),
Schema\Definition::string('name', $nullable = true),
Schema\Definition::boolean('active', $nullable = false),
Schema\Definition::integer('id', ScalarType::integer32()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to decide here if we prefer to use \Definition::integer() or \Definition::integer32() - my problem with passing Type as a second parameter is that we need to validate it, which pretty much defeats the purpose of strongly typed arguments.
Alternatively, we can just drop all static factory methods (string/integer/boolean) and just use the constructor, however, that looks a bit less user-friendly.

Additionally, I would put type into all Definition types, String, Boolean, Integer, we can also unify the metadata name to be something like:

METADATA_PHP_TYPE which would always hold the instance of Type.

Copy link
Member Author

@stloyd stloyd Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure either how to decide about default that's why it's detecting inside what's max, but that's "magic", so I would go with BC break and force to choose either 32 or 64 on the user side but internally use 64 (similarly as it's now ie in parquet schema converter), with that I would also add check and throw exception when 64 is used when available is max 32.

So it would be: Entry::integer64()/Entry::integer32(), Definition::integer64() etc

Other than that I would save the helpers as DX is good there. Type for scalar is saved too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it over night, and I can't figure out any good use case for the end users to decide if they want to use int64 vs int63, yeah, one requires less space but with current storage prices I'm not sure how relevant that is.
Originally I was planning to use that in order to know how to map Flow Int into Parquet Int and maybe we just keep it this way. Let me think more about it, I'm closing this PR for now, will get back to it

@github-actions github-actions bot added size: L and removed size: M labels Nov 20, 2023
@stloyd stloyd deleted the feature/int64 branch November 20, 2023 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce int32 and int64 types in order to easier transform flow entries into parquet/avro types
2 participants