Concurrent table scans #373

Draft · sdd wants to merge 4 commits into main from scan-concurrent
Conversation

@sdd (Contributor) commented May 13, 2024

This is a bit of an experiment to see how things could look if we tried to:

  • process Manifest Lists and Manifest Files concurrently rather than sequentially
  • process the stream of file scan tasks concurrently, streaming record batches from multiple files at the same time.

I'd like to add some unit tests to confirm that this behaves as expected, beyond the existing tests that we have for TableScan, and to add an integration / performance test that can quantify any performance improvements (or regressions 😅) that we get from these changes.

Let me know what you all think.

@marvinlanhenke (Contributor) left a comment

Thanks @sdd for doing this draft.

Just in case you missed it, we had a similar discussion in #124 about a possible approach.
Left some comments (mostly questions).

In general, I think we should wait for the runtime and the Evaluators (Manifest, Expression, etc.) to land, add more tests in scan.rs, and then refactor into the async/multi-threaded version. Curious for more comments on this, though.

```rust
partition_spec_id,
&context,
)?;
spawn(async move {
```
@marvinlanhenke (Contributor):

Do we need to spawn here, or is the try_for_each_concurrent in run(...) already enough?

@sdd (Contributor, Author):

I went with using an mpsc channel to avoid having two nested try_for_each_concurrent calls. On reflection, the inner try_for_each_concurrent is probably unnecessary, since there's only one async operation needed for each manifest file, despite that manifest file producing n DataFileTasks. So I can probably ditch the channel too. The channel in reader.rs is also overkill, as I've only got a single try_for_each_concurrent there.

I just wanted to get the ball rolling on this while we're waiting for the filtering code to get signed off; it's fun writing this kind of stuff 😁
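For readers following along, here is a minimal sketch of the channel-plus-spawned-task pattern under discussion, assuming the tokio, futures, and tokio-stream crates. The names (file_scan_task_stream, the FileScanTask stand-in, the channel capacity of 32) are illustrative, not the actual iceberg-rust API:

```rust
// Hypothetical sketch: a spawned task walks the manifest files with bounded
// concurrency and sends each resulting scan task into an mpsc channel; the
// receiver end backs the stream handed to the caller.
use futures::stream::{self, TryStreamExt};
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

const CONCURRENCY_LIMIT_MANIFEST_FILES: usize = 4;

#[derive(Debug)]
struct FileScanTask(String); // stand-in for the real FileScanTask

fn file_scan_task_stream(
    manifest_files: Vec<String>, // stand-in for entries from the ManifestList
) -> impl futures::Stream<Item = FileScanTask> + Unpin {
    let (tx, rx) = mpsc::channel(32);

    tokio::spawn(async move {
        // A single try_for_each_concurrent bounds how many manifest files
        // are processed at once; entries within each file stay sequential.
        let _ = stream::iter(manifest_files.into_iter().map(Ok::<_, ()>))
            .try_for_each_concurrent(CONCURRENCY_LIMIT_MANIFEST_FILES, |mf| {
                let tx = tx.clone();
                async move {
                    // One async send per produced task; a send error just
                    // means the receiver was dropped, so we stop early.
                    tx.send(FileScanTask(mf)).await.map_err(|_| ())
                }
            })
            .await;
    });

    ReceiverStream::new(rx)
}
```

A side benefit of the bounded channel in this sketch: it provides natural backpressure, and dropping the receiver makes subsequent sends fail, which winds down the producer task cleanly.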

@sdd (Contributor, Author):

I reverted reader.rs to be essentially the same as before, without any concurrent processing of batches from within the same file. I removed the nested try_for_each_concurrent from scan.rs but kept the mpsc channel.

```rust
}

#[derive(Debug)]
struct ConcurrentFileScanStreamContext {
```
@marvinlanhenke (Contributor):

I think once we have a runtime and the async approach is approved, we can get rid of the FileScanStreamContext and merge the struct into ConcurrentFileScanStreamContext.

@sdd (Contributor, Author):

Yeah, completely agree. Kept them separate for now as it made it easier to have the old non-concurrent version in the code base at the same time.

```rust
CONCURRENCY_LIMIT_MANIFEST_FILES,
Self::process_manifest_file,
)
.await
```
@marvinlanhenke (Contributor):

Will this yield the first FileScanTask when it's available, or do we have to iterate over the complete stream before we can return any result?

@sdd (Contributor, Author):

Yes, everything proceeds fully concurrently: whichever manifest is ready first will start yielding tasks to the stream first.
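To make "yields as available" concrete, here is a hypothetical consumer of the file_scan_task_stream helper sketched earlier (illustrative only, not the PR's actual API):

```rust
use futures::StreamExt;

#[tokio::main]
async fn main() {
    // Stand-ins for entries read from the snapshot's ManifestList.
    let manifests: Vec<String> = vec!["m1.avro".into(), "m2.avro".into()];
    let mut tasks = file_scan_task_stream(manifests);

    // Each next().await resolves as soon as any in-flight manifest file has
    // produced a task; we never wait for the whole manifest list to finish.
    while let Some(task) = tasks.next().await {
        println!("{task:?}");
    }
}
```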

@marvinlanhenke (Contributor):

Thanks for explaining.

```rust
CONCURRENCY_LIMIT_MANIFEST_ENTRIES,
Self::process_manifest_entry,
)
.await
```
@marvinlanhenke (Contributor):

We had a discussion on this in #124 (in case you missed it) about not going overboard with the task spawning.

@sdd (Contributor, Author) commented May 14, 2024

I've updated this to ditch the concurrency when processing ManifestEntry items within a single Manifest, producing them asynchronously but sequentially instead. I've kept the limited concurrency when processing ManifestFiles within the scan's snapshot's ManifestList.

I've kept the approach of using an mpsc channel with a spawned task, with that task using try_for_each_concurrent to achieve the concurrency. This is because without the channel and spawned task, we'd need to use an async closure, which is unstable Rust. With the spawned task we only need an async block, which is stable Rust.
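A minimal sketch of that stable-Rust constraint, assuming tokio and futures; this is a simplified illustration, not the PR's code. try_for_each_concurrent takes a closure returning a future: a plain closure whose body is an async move block compiles on stable, while an `async |n| { ... }` closure does not:

```rust
use futures::stream::{self, TryStreamExt};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(8);

    let producer = tokio::spawn(async move {
        stream::iter((0..10u32).map(Ok::<_, ()>))
            // NOT `async |n| { ... }` (unstable); instead a normal closure
            // that returns an async block, which works on stable Rust.
            .try_for_each_concurrent(4, |n| {
                let tx = tx.clone();
                async move { tx.send(n * n).await.map_err(|_| ()) }
            })
            .await
    });

    // Results arrive on the channel as each item completes.
    while let Some(v) = rx.recv().await {
        println!("got {v}");
    }
    let _ = producer.await;
}
```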

@sdd force-pushed the scan-concurrent branch 3 times, most recently from 6cb340c to 3947796 on May 22, 2024.