
startingPosition AT_TIMESTAMP iteratorOption NumberFormatException #93

Open
adamlbailey opened this issue Oct 1, 2020 · 9 comments

@adamlbailey

Excellent addition for reading from the stream at specific positions, per #78.

However, I'm having trouble using the "AT_TIMESTAMP" option to read from shards at specific timestamps.

The shardInfo object I'm using is below:

{
  "metadata": {
    "streamName": "QSR-data-stream-production",
    "batchId": "1"
  },
  "shardId-000000000002": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "2020-09-30T19:58:46.480-00:00"
  },
  "shardId-000000000003": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "2020-09-30T19:58:46.480-00:00"
  },
  "shardId-000000000004": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "2020-09-30T19:58:46.480-00:00"
  },
  "shardId-000000000005": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "2020-09-30T19:58:46.480-00:00"
  }
}

Here is the exception:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 9, ip-10-0-9-81.us-west-2.compute.internal, executor 5): java.lang.NumberFormatException: For input string: "2020-09-30T19:58:46.480-00:00"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Long.parseLong(Long.java:589)
	at java.lang.Long.parseLong(Long.java:631)
	at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
	at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
	at org.apache.spark.sql.kinesis.KinesisReader.getShardIterator(KinesisReader.scala:120)
	at org.apache.spark.sql.kinesis.KinesisSourceRDD$$anon$1.getShardIterator(KinesisSourceRDD.scala:146)
	at org.apache.spark.sql.kinesis.KinesisSourceRDD$$anon$1.getNext(KinesisSourceRDD.scala:197)
	at org.apache.spark.sql.kinesis.KinesisSourceRDD$$anon$1.getNext(KinesisSourceRDD.scala:138)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
@adamlbailey
Author

Update:

Using an epoch-milliseconds timestamp instead of an ISO-formatted string solves the problem. It would be nice to support ISO strings as well, for readability and for compatibility with the options afforded by other Kinesis libraries. I'll consider opening a PR to address this soon.

{
  "metadata": {
    "streamName": "QSR-data-stream-production",
    "batchId": "1"
  },
  "shardId-000000000002": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  },
  "shardId-000000000003": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  },
  "shardId-000000000004": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  },
  "shardId-000000000005": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition": "1601495926480"
  }
}
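
For anyone else hitting this: since the connector appears to parse iteratorPosition as a Long (per the stack trace above), an ISO timestamp has to be converted to epoch milliseconds first. A minimal sketch of such a conversion using only the Python standard library (the helper name is mine, not part of the connector):

```python
from datetime import datetime, timezone

def iso_to_epoch_millis(iso_string: str) -> str:
    """Convert an ISO-8601 timestamp to the epoch-milliseconds string
    that the connector's iteratorPosition field expects."""
    dt = datetime.fromisoformat(iso_string)
    if dt.tzinfo is None:  # treat naive timestamps as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return str(round(dt.timestamp() * 1000))

print(iso_to_epoch_millis("2020-09-30T19:58:46.480+00:00"))  # prints 1601495926480
```

Note that `datetime.fromisoformat` requires Python 3.7+ and, before 3.11, only accepts the strict `YYYY-MM-DDTHH:MM:SS[.ffffff][±HH:MM]` forms.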

@itsvikramagr
Contributor

@adamlbailey - Looking forward to the PR.

@gopi-t2s

Hi @adamlbailey,

I am trying to read data from Kinesis using AT_TIMESTAMP as the starting position.
Here is the piece of code I'm using to achieve this:

pos = json.dumps({"at_timestamp": "02/26/2021 3:07:13 PDT"})
kinesisDF = (spark.readStream.format("kinesis")
    .option("streamName", name)
    .option("endpointUrl", URL)
    .option("awsAccessKeyId", key)
    .option("awsSecretKey", sKey)
    .option("startingposition", pos)
    .load())

Here is the error message I receive:

pyspark.sql.utils.IllegalArgumentException: 'org.json4s.package$MappingException: Expected object but got JString(02/26/2021 3:07:13 PDT)'

I'm new to this Kinesis connector, and I know the way I'm passing the value for the starting position is wrong. Could you help me pass an AT_TIMESTAMP value for the startingposition option?

Thanks in Advance!

@adamlbailey
Author

Hi @gopi-t2s, I'm somewhat removed from this work now, but if memory serves:

You'll want to construct an object like the one in my previous comment. Practically, I did this by writing a helper that described the stream, so I could list each shard with the right timestamp Long value.
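A rough sketch of what such a helper might look like (the function name, stream name, and shard IDs here are illustrative; a real version would fetch the shard IDs from the stream, e.g. via the Kinesis ListShards API):

```python
import json

def build_starting_position(stream_name, shard_ids, epoch_millis, batch_id="1"):
    """Build the per-shard startingPosition JSON, pointing every shard
    at the same AT_TIMESTAMP position (epoch milliseconds)."""
    info = {"metadata": {"streamName": stream_name, "batchId": batch_id}}
    for shard_id in shard_ids:
        info[shard_id] = {
            "iteratorType": "AT_TIMESTAMP",
            "iteratorPosition": str(epoch_millis),
        }
    return json.dumps(info)

# Shard IDs listed by hand here; a real helper would call something like
# boto3's kinesis.list_shards(StreamName=...) and read the ShardId fields.
pos = build_starting_position(
    "my-stream",
    ["shardId-000000000000", "shardId-000000000001"],
    1601495926480,
)
```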

@gopi-t2s

Thanks @adamlbailey for your inputs.

@chadlagore
Contributor

I ran into this as well, @gopi-t2s. Were you able to make it work? I was unsure whether PySpark was going to be supported for this.

@gopi-t2s

No @chadlagore, I am still looking for a way to attain this.

@nikitira

I got it working with the example above from @adamlbailey.

Minimal example in PySpark:

import json
from datetime import datetime

now_ts = str(int(datetime.now().timestamp() * 1000))  # timestamp in epoch-milliseconds format, e.g. "1601495926480"
from_timestamp = {
  "metadata": {
    "streamName": "my-stream",
    "batchId": "1"
  },
  "shardId-000000000000": {
    "iteratorType": "AT_TIMESTAMP",
    "iteratorPosition":  now_ts
  }
}

starting_position = json.dumps(from_timestamp)

my_stream = (spark
    .readStream
    .format('kinesis')
    .option('streamName', "my-stream")
    .option('endpointUrl', KINESIS_ENDPOINT)
    .option('region', KINESIS_REGION)
    .option('startingposition', starting_position)
    .load())

Hope this helps @chadlagore @gopi-t2s

@gopi-t2s

Thank you @nikitira, I will try this.


No branches or pull requests

5 participants