
Consumer requiring re-sharding of stream before starts consuming data #95

Open
marciodebarros opened this issue Nov 18, 2020 · 1 comment

Comments

@marciodebarros

Hello, I am implementing a consumer application with the kinesis-sql library on Spark 3.0 and am running into the following issue:

  • I start my consumer while there is no data available in the stream.
  • I start the producer to push data to Kinesis and get a 200 response code from Kinesis. The producer application completes normally without any errors.
  • I wait a few minutes, but no data shows up in the consumer.
  • I log into the AWS console and change the number of shards in the stream (from 1 to 2, or from 2 to 1).
  • The consumer is still running at this point.
  • I start a new session of the producer and push more data, again with a 200 response code.
  • Only now, after a few seconds, do I start receiving the data in my consumer.
  • My Spark stream is configured with TRIM_HORIZON.
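
For reference, a minimal sketch of the kind of consumer setup described above, using the option names documented in the qubole/kinesis-sql README; the stream name, endpoint, and credential handling are placeholder assumptions, not the reporter's actual configuration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kinesis-consumer-example")
  .getOrCreate()

// Option names follow the qubole/kinesis-sql README; values are placeholders.
val kinesisDF = spark.readStream
  .format("kinesis")
  .option("streamName", "my-stream")                                // placeholder
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com") // placeholder region
  .option("awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  .option("awsSecretKey", sys.env("AWS_SECRET_KEY"))
  .option("startingposition", "TRIM_HORIZON") // read from the oldest available record
  .load()

// Kinesis records arrive as binary in the `data` column; cast to string to inspect.
val query = kinesisDF
  .selectExpr("CAST(data AS STRING) AS payload")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
```

If a configuration like this still shows no data until a re-shard, comparing it line by line against the reporter's job may help narrow down the difference.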

I checked my configuration against the available examples and could not find anything different. Has anyone run into a similar issue, or does anyone have suggestions on what else I could try to troubleshoot this?

Thank you in advance.

—MD.

@roncemer
Contributor

I sent an email to the maintainer of this repo and didn't get a response. For now, I've forked the project at https://github.com/roncemer/kinesis-spark-connector and updated it to build for Spark 3.2.1. Under Spark 3.2.1, it appears to work correctly. I let it run overnight with no new records being posted to the Kinesis data stream; when I started posting records again, they arrived in Spark and were processed.

I opened pull request https://github.com/qubole/kinesis-sql/pull/113/files from my forked repo back to the original qubole/kinesis-sql repo. Be sure to read the full description under the Comments tab as well.

I could use the help of anyone who might be interested in this project. qubole/kinesis-sql appears to have been abandoned for about two years, and the main maintainer doesn't respond to emails. If anyone can get in touch with someone who has control of the repo, please ask them to make me a committer. Barring that, I will have to publish the jar from my forked repo as a Maven artifact. In any case, if I end up maintaining this project, whether the original or the forked repo, I will need volunteers to help handle the builds each time a new version of Spark is released.

In the meantime, feel free to build your own jar from my forked repo and run it under Spark 3.2.1. Also, let me know whether it works under 3.3.x, or if you run into any other problems.
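
For anyone who wants to try the fork, the build might look roughly like the following; this is a sketch that assumes the fork keeps the upstream Maven layout (check the repo for the actual build files), and the jar name in the last line is illustrative:

```shell
# Clone the fork and build the connector jar.
# The exact build command depends on the repo's build files (pom.xml vs build.sbt);
# mvn is shown on the assumption the fork keeps the upstream Maven build.
git clone https://github.com/roncemer/kinesis-spark-connector.git
cd kinesis-spark-connector
mvn install -DskipTests   # produces the jar under target/

# Then put the built jar on your application's classpath
# (the jar name below is illustrative -- use the actual artifact from target/):
spark-submit --jars target/spark-sql-kinesis_2.12-*.jar my_app.jar
```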

Thanks!
Ron Cemer
