
feat: Add ability to bundle all records from one micro-batch into PutRecords #86

Open · wants to merge 6 commits into base: master
Conversation

@leslieyanyan commented Aug 7, 2020

The current Kinesis sink sends only one record to the Kinesis stream at a time, so the write speed is very slow.
With the changes in this PR, we can bundle all records from one micro-batch into PutRecords calls. We've tested the changes in our production environment, and write speed and efficiency improved significantly when enabling kinesis.executor.sink.bundle.records.
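
For context on the approach, here is a minimal, hypothetical sketch of the bundling idea using the plain AWS SDK PutRecords API. It is not the code from this PR; the object name, record layout, and 500-record group size are illustrative assumptions.

import java.nio.ByteBuffer
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry}

// Sketch only: each group of rows goes out in a single PutRecords API call
// instead of one call per record.
object PutRecordsBundleSketch {
  private val MaxRecordsPerCall = 500   // PutRecords accepts at most 500 records per request
  private val client = AmazonKinesisClientBuilder.defaultClient()

  def putBatch(streamName: String, rows: Iterator[(String, Array[Byte])]): Unit =
    rows.grouped(MaxRecordsPerCall).foreach { group =>
      val entries = group.map { case (partitionKey, data) =>
        new PutRecordsRequestEntry()
          .withPartitionKey(partitionKey)
          .withData(ByteBuffer.wrap(data))
      }
      client.putRecords(
        new PutRecordsRequest()
          .withStreamName(streamName)
          .withRecords(entries.asJava))
    }
}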

@itsvikramagr (Contributor) commented:
@leslieyanyan - thanks for the PR.

The Kinesis sink was indeed very slow. @abhishekd0907 has taken a shot at reducing the latency in PR #81. We are using KPL underneath, which takes care of aggregation and sends multiple records in the same API call. Can you try the latest master code and see if the new change is making a difference?
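
For reference, a rough sketch of the KPL path described above, with hypothetical values: records are only queued by addUserRecord, and the KPL daemon aggregates them into batched API calls in the background. The option names in the comments refer to the executor settings documented in the README; the object and method names are placeholders.

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.producer.{KinesisProducer, KinesisProducerConfiguration, UserRecordResult}
import com.google.common.util.concurrent.ListenableFuture

// Sketch only: records added here are queued locally; the KPL background
// process aggregates and batches them before calling the Kinesis API.
object KplSketch {
  private val config = new KinesisProducerConfiguration()
    .setAggregationEnabled(true)      // cf. kinesis.executor.aggregationEnabled
    .setRecordMaxBufferedTime(1000)   // cf. kinesis.executor.recordMaxBufferedTime
    .setMaxConnections(1)             // cf. kinesis.executor.maxConnections

  private val producer = new KinesisProducer(config)

  def queueRecord(stream: String, partitionKey: String,
                  data: Array[Byte]): ListenableFuture[UserRecordResult] =
    producer.addUserRecord(stream, partitionKey, ByteBuffer.wrap(data))
}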

@leslieyanyan (Author) commented:

@itsvikramagr Thank you for the suggestion.

  1. We tried to run jobs with the latest master, but the speed was still very slow.
  2. The current implementation initializes a FutureCallback for every record (https://github.com/qubole/kinesis-sql/blob/master/src/main/scala/org/apache/spark/sql/kinesis/KinesisWriteTask.scala#L66). If my understanding is correct, it only sends one record at a time. The bundleExecute method added in this PR is closer to the official example (https://docs.aws.amazon.com/streams/latest/dev/kinesis-kpl-writing.html), which sends multiple records in the same API call.


private def bundleExecute(iterator: Iterator[InternalRow]): Unit = {
  // group the micro-batch's rows into chunks of up to 490 records each
  val groupedIterator: iterator.GroupedIterator[InternalRow] = iterator.grouped(490)
Contributor: what is 490 here? Should it be configurable?

Contributor: +1
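
As a small aside on the 490 question: Iterator.grouped(n) yields chunks of at most n elements, so 490 presumably keeps each chunk under the PutRecords limit of 500 records per request (an inference, not something stated in the PR). A tiny illustration:

// grouped(n) yields Seq chunks of at most n elements; the last chunk may be smaller.
val chunkSizes = (1 to 1000).iterator.grouped(490).map(_.size).toList
// chunkSizes == List(490, 490, 20)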

README.md (Outdated)
@@ -149,7 +149,8 @@ Refering $SPARK_HOME to the Spark installation directory.
 | kinesis.executor.recordMaxBufferedTime | 1000 (millis) | Specify the maximum buffered time of a record |
 | kinesis.executor.maxConnections | 1 | Specify the maximum connections to Kinesis |
 | kinesis.executor.aggregationEnabled | true | Specify if records should be aggregated before sending them to Kinesis |
-| kniesis.executor.flushwaittimemillis | 100 | Wait time while flushing records to Kinesis on Task End |
+| kinesis.executor.flushwaittimemillis | 100 | Wait time while flushing records to Kinesis on Task End |
+| kinesis.executor.sink.bundle.records | false | Bundle all records from one micro-batch into PutRecords |
Contributor: we have added "kinesis.executor.recordTtl" - can we add details about this config here

Contributor: +1
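
For orientation, a hedged example of how these executor options are passed to the sink. The option names come from the README table above; the stream name, endpoint, checkpoint path, and the rate source with its data column are placeholders, and the exact schema the sink expects is documented in the repo's README rather than here.

import org.apache.spark.sql.SparkSession

// Placeholder values throughout; kinesis.executor.sink.bundle.records is the option proposed in this PR.
val spark = SparkSession.builder().appName("kinesis-sink-example").getOrCreate()
val df = spark.readStream.format("rate").load()

val query = df.selectExpr("CAST(value AS STRING) AS data")
  .writeStream
  .format("kinesis")
  .option("streamName", "my-output-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("kinesis.executor.sink.bundle.records", "true")
  .option("checkpointLocation", "/tmp/kinesis-sink-checkpoint")
  .start()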

}
Futures.addCallback(future, kinesisCallBack)
producer.flushSync()
Contributor:
@leslieyanyan @itsvikramagr
The slowness is on account of this call to producer.flushSync(). Please refer to my comment here: #81 (review)

The new code in this PR shows improved performance because the sendBundledData() method doesn't have this producer.flushSync() call.

We'll need to separately evaluate how much of a performance impact we get from using GroupedIterator instead of a normal iterator.
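
To make the flushSync point concrete, here is a hypothetical contrast of the two write patterns using the KPL producer API; it is not the exact code from master or from this PR, and the object and method names are placeholders.

import java.nio.ByteBuffer
import com.amazonaws.services.kinesis.producer.KinesisProducer

object FlushContrastSketch {
  // Per-record flush: each row blocks until the KPL drains its buffer,
  // which defeats the KPL's aggregation and batching.
  def writePerRecord(producer: KinesisProducer, stream: String,
                     rows: Iterator[(String, Array[Byte])]): Unit =
    rows.foreach { case (key, data) =>
      producer.addUserRecord(stream, key, ByteBuffer.wrap(data))
      producer.flushSync()
    }

  // Batched flush: queue the whole micro-batch, then flush once at the end,
  // letting the KPL aggregate records into fewer API calls.
  def writeBatched(producer: KinesisProducer, stream: String,
                   rows: Iterator[(String, Array[Byte])]): Unit = {
    rows.foreach { case (key, data) =>
      producer.addUserRecord(stream, key, ByteBuffer.wrap(data))
    }
    producer.flushSync()
  }
}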
