
feat: Read Parquet data file with projection #245

Merged · 6 commits · Apr 1, 2024

Conversation

@viirya (Member) commented Mar 10, 2024

We can now read a Parquet file through TableScan as a stream of Arrow RecordBatches. However, it reads all columns, without any column projection. This patch makes TableScanBuilder.select propagate the selected columns to TableScan so that the projection is applied to the scan operation.
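The intended flow can be sketched as a minimal, self-contained model. The names TableScanBuilder and TableScan mirror the PR, but the bodies below are illustrative stand-ins, not the real iceberg-rust implementation:

```rust
// Hypothetical sketch: the builder records the selected column names and
// hands them to the scan, which later applies the projection when reading.

#[derive(Default)]
struct TableScanBuilder {
    column_names: Vec<String>,
}

struct TableScan {
    column_names: Vec<String>,
}

impl TableScanBuilder {
    fn select(mut self, columns: impl IntoIterator<Item = &'static str>) -> Self {
        self.column_names = columns.into_iter().map(String::from).collect();
        self
    }

    fn build(self) -> TableScan {
        // The selected columns propagate into the scan unchanged.
        TableScan {
            column_names: self.column_names,
        }
    }
}

fn main() {
    let scan = TableScanBuilder::default().select(["id", "name"]).build();
    assert_eq!(scan.column_names, vec!["id", "name"]);
    println!("projected columns: {:?}", scan.column_names);
}
```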

close #244

Comment on lines -388 to -399
arrow_schema::Field::new("col", arrow_schema::DataType::Int64, true)
.with_metadata(HashMap::from([(
PARQUET_FIELD_ID_META_KEY.to_string(),
"0".to_string(),
)])),
@viirya (Member, Author) commented Mar 10, 2024
I took a look at the table metadata. The written Parquet schema doesn't match the table schema. Since we need to project the correct columns in the Parquet file, I changed this.

@viirya viirya force-pushed the read_with_projection branch 2 times, most recently from af7d6d1 to abc1c2a Compare March 10, 2024 01:01
@viirya (Member, Author) commented Mar 20, 2024

@liurenjie1024 Thanks for providing some references to #251, #252.

I took a look at the Python reading projection in https://github.com/apache/iceberg-python/blob/6c8ea0effac0942ad4e880e5eef627473a354040/pyiceberg/io/pyarrow.py#L939. I'm wondering if we actually need #251 and #252 for pruning columns here.

For the Arrow Parquet reader, it only requires us to identify the columns to read through ProjectionMask, which can be obtained using the field ids of the columns selected in TableScan.

In the Python implementation, #251 is needed because it calls the scanner API, which needs the pruned schema. For us, I don't see where we need the pruned schema.

I updated the code to leverage ProjectionMask using field ids and fixed the previous approach, which didn't look correct. Please take another look. Thanks.
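The field-id mapping described above can be modeled without the parquet crate: walk the file's leaf columns in order, keep the indices whose field id is in the selected set, and use those indices as the projection mask. In the real reader those indices would feed parquet's ProjectionMask; here Leaf and projection_indices are illustrative stand-ins:

```rust
use std::collections::HashSet;

// A toy leaf column carrying its Iceberg field id (assumed names/ids).
struct Leaf {
    field_id: i32,
    name: &'static str,
}

/// Return the leaf indices to read, in file order.
fn projection_indices(leaves: &[Leaf], selected_ids: &HashSet<i32>) -> Vec<usize> {
    leaves
        .iter()
        .enumerate()
        .filter(|(_, leaf)| selected_ids.contains(&leaf.field_id))
        .map(|(idx, _)| idx)
        .collect()
}

fn main() {
    let leaves = [
        Leaf { field_id: 1, name: "id" },
        Leaf { field_id: 2, name: "name" },
        Leaf { field_id: 3, name: "ts" },
    ];
    let selected: HashSet<i32> = [1, 3].into_iter().collect();
    // Leaves 0 ("id") and 2 ("ts") survive the projection.
    assert_eq!(projection_indices(&leaves, &selected), vec![0, 2]);
    let _ = leaves[0].name; // field kept for readability of the sketch
}
```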

@liurenjie1024 (Collaborator)

> @liurenjie1024 Thanks for providing some references to #251, #252.
>
> I took a look at the Python reading projection in https://github.com/apache/iceberg-python/blob/6c8ea0effac0942ad4e880e5eef627473a354040/pyiceberg/io/pyarrow.py#L939. I'm wondering if we actually need #251 and #252 for pruning columns here.
>
> For the Arrow Parquet reader, it only requires us to identify the columns to read through ProjectionMask, which can be obtained using the field ids of the columns selected in TableScan.
>
> In the Python implementation, #251 is needed because it calls the scanner API, which needs the pruned schema. For us, I don't see where we need the pruned schema.
>
> I updated the code to leverage ProjectionMask using field ids and fixed the previous approach, which didn't look correct. Please take another look. Thanks.

Cool, I'll take a look later. Maybe Java's version is similar to this one.

@liurenjie1024 (Collaborator) commented Mar 26, 2024

Hi @viirya, sorry for the late reply; it took me some time to fully understand projection in Iceberg, and I've written up a summary here.

Thanks for the idea of using ProjectionMask; it helps a lot in pruning unnecessary columns. However, I feel we are still missing something, and I'd like to continue the discussion in #244 before merging this. What do you think?

@viirya (Member, Author) commented Mar 26, 2024

@liurenjie1024 Thanks. Let me read through your summary first, and then I'll explain what I've done in this PR over in #244.

@liurenjie1024 (Collaborator) left a comment

@viirya Thanks for this PR. Per the discussion in #244, we only need to handle primitive columns in this initial version, and we need to add some checks to ensure that it's correct.

@@ -49,10 +54,17 @@ impl ArrowReaderBuilder {
self
}

/// Sets the desired column projection with a list of field ids.
pub fn with_field_ids(mut self, field_ids: Vec<usize>) -> Self {
@liurenjie1024 (Collaborator)

Suggested change:
- pub fn with_field_ids(mut self, field_ids: Vec<usize>) -> Self {
+ pub fn with_field_ids(mut self, field_ids: impl IntoIterator<Item = usize>) -> Self {
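The motivation for the suggested generic signature can be shown with a minimal stand-in for ArrowReaderBuilder: `impl IntoIterator<Item = usize>` lets callers pass a Vec, an array, or a lazy iterator without collecting first, while still accepting everything the `Vec<usize>` version did:

```rust
// Builder is a hypothetical stand-in; only the signature matters here.
struct Builder {
    field_ids: Vec<usize>,
}

impl Builder {
    fn with_field_ids(mut self, field_ids: impl IntoIterator<Item = usize>) -> Self {
        self.field_ids = field_ids.into_iter().collect();
        self
    }
}

fn main() {
    let b = Builder { field_ids: vec![] };
    // All three calls compile with the generic signature; only the first
    // would compile if the parameter were `Vec<usize>`.
    let b = b.with_field_ids(vec![1, 2]);
    let b = b.with_field_ids([3, 4]);
    let b = b.with_field_ids((5..7).map(|i| i as usize));
    assert_eq!(b.field_ids, vec![5, 6]);
}
```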

),
));
}
column_map.insert(basic_info.id(), idx);
@liurenjie1024 (Collaborator)

We also need a check that their types match. How about converting the Parquet schema to an Arrow schema and using filter_leaves to do this match check? That way we only need to deal with the Iceberg schema and the Arrow schema.

@viirya (Member, Author)

I changed the code to use filter_leaves. However, compared to what I did with the Parquet schema, the usage doesn't look great.

Because the filter closure of filter_leaves cannot propagate errors, we cannot properly surface errors that happen while matching the fields.

This could be improved by proposing a change to the filter_leaves API, but in this version we may have to tolerate it if we want to use filter_leaves.
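The limitation described above — a predicate closure that returns bool and so cannot return an error — has a common workaround: capture an error slot outside the closure and check it after the traversal. The sketch below uses a plain closure-driven filter as a stand-in for the real filter_leaves API:

```rust
// Stand-in for a filter_leaves-style API: the predicate returns bool only.
fn select_leaves<F>(count: usize, mut predicate: F) -> Vec<usize>
where
    F: FnMut(usize) -> bool,
{
    (0..count).filter(|&i| predicate(i)).collect()
}

fn main() {
    let mut first_error: Option<String> = None;
    let selected = select_leaves(4, |i| {
        if i == 2 {
            // We cannot `return Err(...)` from here; record the error instead.
            if first_error.is_none() {
                first_error = Some(format!("type mismatch at leaf {i}"));
            }
            false
        } else {
            true
        }
    });
    // Propagate the captured error after the closure has run.
    assert_eq!(first_error.as_deref(), Some("type mismatch at leaf 2"));
    assert_eq!(selected, vec![0, 1, 3]);
}
```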

@liurenjie1024 (Collaborator)

Yes, I agree; maybe we should modify filter_leaves's API to return an error.

@@ -187,6 +190,22 @@ impl TableScan {
let mut arrow_reader_builder =
ArrowReaderBuilder::new(self.file_io.clone(), self.schema.clone());

let mut field_ids = vec![];
for column_name in &self.column_names {
let field_id = self.schema.field_id_by_name(column_name).ok_or_else(|| {
@liurenjie1024 (Collaborator)

As discussed in #244, we need two checks here to ensure the projection is valid:

  1. The field is a direct child of the schema, i.e. not a nested field. We can check this via Schema::as_struct / field_by_id / is_some.
  2. The field is a primitive type.
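The two checks above can be sketched against a toy schema. FieldType, Schema, and check_projectable here are hypothetical stand-ins for the iceberg-rust schema types, shown only to make the validation logic concrete:

```rust
use std::collections::HashMap;

#[derive(PartialEq)]
enum FieldType {
    Primitive,
    Struct,
}

struct Schema {
    // Only the direct children of the schema's top-level struct.
    top_level: HashMap<&'static str, FieldType>,
}

fn check_projectable(schema: &Schema, column: &str) -> Result<(), String> {
    match schema.top_level.get(column) {
        // Check 1: not a direct child of the schema (nested or missing).
        None => Err(format!("Column {column} is not a direct child of schema")),
        // Check 2: a direct child, but not a primitive type.
        Some(t) if *t != FieldType::Primitive => {
            Err(format!("Column {column} is not a primitive type"))
        }
        Some(_) => Ok(()),
    }
}

fn main() {
    let schema = Schema {
        top_level: HashMap::from([
            ("id", FieldType::Primitive),
            ("address", FieldType::Struct),
        ]),
    };
    assert!(check_projectable(&schema, "id").is_ok());
    assert!(check_projectable(&schema, "address").is_err()); // non-primitive
    assert!(check_projectable(&schema, "address.zip").is_err()); // nested
}
```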

@viirya (Member, Author)

Added these checks.

@viirya (Member, Author) commented Mar 28, 2024

Thank you @liurenjie1024 for the review. I will address these comments soon.

@liurenjie1024 (Collaborator) left a comment

Thanks @viirya for this PR, looks great! It would be better to report the error as FeatureNotSupport to make it more user-friendly. What do you think?

Comment on lines 210 to 213
ErrorKind::DataInvalid,
format!(
"Column {} is not a direct child of schema but a nested field. Schema: {}",
column_name, self.schema
@liurenjie1024 (Collaborator)

Suggested change:
- ErrorKind::DataInvalid,
- format!(
-     "Column {} is not a direct child of schema but a nested field. Schema: {}",
-     column_name, self.schema
+ ErrorKind::FeatureNotSupported,
+ format!(
+     "Column {} is not a direct child of schema but a nested field, which is not supported now. Schema: {}",
+     column_name, self.schema
@viirya (Member, Author) commented Mar 31, 2024

Okay. I used FeatureUnsupported.

Comment on lines 220 to 223
ErrorKind::DataInvalid,
format!(
"Column {} is not a primitive type. Schema: {}",
column_name, self.schema
@liurenjie1024 (Collaborator)

Ditto, returning a feature not supported error would be more user friendly.

@viirya (Member, Author) commented Mar 31, 2024

> Thanks @viirya for this PR, looks great! It would be better to report the error as FeatureNotSupport to make it more user-friendly. What do you think?

Thank you @liurenjie1024. Yeah, I think FeatureUnsupported (which I think is what you meant) is better. I changed the error type.

@liurenjie1024 (Collaborator)

> > Thanks @viirya for this PR, looks great! It would be better to report the error as FeatureNotSupport to make it more user-friendly. What do you think?
>
> Thank you @liurenjie1024. Yeah, I think FeatureUnsupported (which I think is what you meant) is better. I changed the error type.

Cool, thanks!

@liurenjie1024 liurenjie1024 merged commit 6e5a871 into apache:main Apr 1, 2024
7 checks passed