Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SetCursor: context deadline exceeded #390

Open
abunjevac opened this issue Mar 7, 2022 · 13 comments
Open

SetCursor: context deadline exceeded #390

abunjevac opened this issue Mar 7, 2022 · 13 comments

Comments

@abunjevac
Copy link

Every now and then SetCursor API fails with "context deadline exceeded". Could it be something related to a paused cursors stream?
Liftbridge v1.7.1 on kubernetes.

liftbridge-0:

time="2022-03-07 11:00:29" level=debug msg="fsm: Paused stream __cursors"
time="2022-03-07 11:00:34" level=debug msg="Server becoming follower for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0], epoch: 28930"
time="2022-03-07 11:00:34" level=debug msg="Truncating log for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0] to 1557767"
time="2022-03-07 11:00:34" level=debug msg="Replicating partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0] from leader cluster-liftbridge-1"
time="2022-03-07 11:00:34" level=debug msg="fsm: Resumed partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]"

liftbridge-1:

time="2022-03-07 11:00:29" level=debug msg="fsm: Paused stream __cursors"
time="2022-03-07 11:00:31" level=debug msg="api: PublishAsync [stream=ems.helb.sensors, partition=0]"
time="2022-03-07 11:00:34" level=debug msg="api: PublishAsync [stream=ems.helb.sensors, partition=0]"
time="2022-03-07 11:00:34" level=error msg="api: Failed to publish message: context deadline exceeded"
time="2022-03-07 11:00:34" level=debug msg="api: SetCursor [stream=ems.helb.sensors, partition=0, cursorId=sensors, offset=125103]"
time="2022-03-07 11:00:34" level=debug msg="api: Publish [stream=__cursors, partition=0]"
time="2022-03-07 11:00:34" level=debug msg="Server becoming leader for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0], epoch: 28930"
time="2022-03-07 11:00:34" level=debug msg="fsm: Resumed partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]"
time="2022-03-07 11:00:34" level=debug msg="api: SetCursor [stream=ems.helb.sensors, partition=0, cursorId=sensors, offset=125104]"
time="2022-03-07 11:00:34" level=debug msg="api: Publish [stream=__cursors, partition=0]"

liftbridge-2:

time="2022-03-07 11:00:34" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"
time="2022-03-07 11:00:34" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"
time="2022-03-07 11:00:34" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"
time="2022-03-07 11:00:34" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"
time="2022-03-07 11:00:34" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"
time="2022-03-07 11:00:34" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"
...
@tylertreat
Copy link
Member

It could be related to pausing. Any chance you could try disabling auto pausing for cursor partitions?

cursors.stream.auto.pause.time: 0

https://liftbridge.io/docs/configuration.html#cursors-configuration-settings

I'd be curious if the issue persists with that disabled.

@abunjevac
Copy link
Author

I disabled auto pausing, but I'm not sure if this has any effect. I still see these in the logs:

time="2022-03-08 18:33:17" level=info msg="Partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0] has not received a message in over 1m0s, auto pausing partition"
time="2022-03-08 18:33:17" level=debug msg="fsm: Paused stream __cursors"
time="2022-03-08 18:33:17" level=error msg="Error sending replication request for partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0]: nats: no responders available for request"

There are no SetCursor errors so far, but I suspect these will appear sometime. I will update the issue if this happens.

@tylertreat
Copy link
Member

The server requires a restart to take hold of the changed config, no hot reloading yet. That might be why auto pausing didn't disable?

@abunjevac
Copy link
Author

So scaling down to 0 and up to 3 won't work either?

@tylertreat
Copy link
Member

I would think that would work since the server would be restarting. But I'm not sure why it doesn't appear to be taking effect for you based on the logs you posted.

@abunjevac
Copy link
Author

abunjevac commented Mar 8, 2022

Here is the config straight from the pod (nats connection string omitted):

$ kubectl -n sys exec liftbridge-0 -- cat /etc/liftbridge.yaml | grep -v "nats"
listen: 0.0.0.0:9292
logging.level: debug


clustering.min.insync.replicas: 2
clustering.replication.max.bytes: 67108864
cursors.stream.partitions: 1
cursors.stream.auto.pause.time: 0

I did scale down/up thing and I still get this:

time="2022-03-08 22:21:16" level=info msg="Partition [subject=liftbridge-default.cursors, stream=__cursors, partition=0] has not received a message in over 1m0s, auto pausing partition"

Is there any chance this setting is behaving just like a cursors.stream.partitions? That one can not be changed once it is set.

@abunjevac
Copy link
Author

I did some local testing and debugging. The setting cursors.stream.auto.pause.time is indeed used only during the cursors stream creation in cursorManager.Initialize().

@tylertreat
Copy link
Member

You're right. For some reason I was thinking that config could be changed after the cursors stream was created. I can probably include that as part of the forthcoming consumer groups release.

@tylertreat
Copy link
Member

The cursors.stream.auto.pause.time setting issue is addressed here: #391

This will be included in the release forthcoming today.

@tylertreat
Copy link
Member

@abunjevac Are you able to test this with Liftbridge 1.8.0? This supports modifying cursors.stream.auto.pause.time after the cursors stream has already been created.

@abunjevac
Copy link
Author

Yes, I've been monitoring this for the last few days. Everything is fine now. The log is nice and clean. You can close this issue. Thanks.

@tylertreat
Copy link
Member

I'm inclined to keep the issue open since resuming streams should really not cause a deadline exceeded error. I'm thinking there may be a race condition between the cursor partition resume operation and the cursor publish that Liftbridge does internally.

Basically, disabling cursors.stream.auto.pause.time is a bandaid IMO.

I'd like to investigate this further.

@abunjevac
Copy link
Author

Sure. That makes sense. I'll be happy to test with pause enabled when you release a patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants