Non-stop ConditionalCheckFailedException error #30

Open · jney opened this issue Jun 27, 2019 · 17 comments

@jney commented Jun 27, 2019

Hello there, I'm now always getting the following error:

shard error (shardId-000000000005) in checkpointer.release: error releasing checkpoint: ConditionalCheckFailedException: The conditional request failed
    status code: 400, request id: RR9U1G5VFUCV6AHGT5DFJIPNJVVV4KQNSO5AEMVJF66Q9ASUAAJG

Any idea what's happening and how to fix it?

By the way, I noticed ticket #27.

Thank you

@Sytten (Contributor) commented Jun 27, 2019

I also get those errors when new consumers start

@cb-anthony-merrill

Has anyone had any luck with this? We are hitting it as well; otherwise, we're starting our own deep-dive investigation.

@garethlewin (Contributor)

I have seen this but have never been able to reproduce it successfully. If you find reliable reproduction steps, that would be really helpful.

Typically we see it so rarely that just restarting the process gets past it, but I am very aware that this is not ideal.

@cb-anthony-merrill

Hey @garethlewin, a question for you about checkpoints.go: there is a comment in the checkpointRecord struct saying that OwnerID, LastUpdateRFC, and FinishedRFC are not used for decision making, yet a few lines below it we are using OwnerID:

if record.OwnerID != nil && record.LastUpdate > cutoff {

Maybe this is related? Or should it just be replaced with OwnerName, or does the original comment no longer apply?

@garethlewin (Contributor)

I believe that is a mistake in checkpoints.go: OwnerID should be in the block of used variables and OwnerName should be in the "debug only" section. I will verify this and update the code. From a quick check, OwnerName is never checked but OwnerID is.
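Roughly, the intended split is something like this (a sketch only, not the actual declaration in checkpoints.go; the field types and the RFC field semantics here are assumptions, and other fields are omitted):

```go
package sketch

// checkpointRecord sketch: which fields drive decisions vs. which exist
// only for human inspection, per the discussion above.
type checkpointRecord struct {
	// Used for decision making (see the check quoted above):
	OwnerID    *string // owning client; nil when no one owns the checkpoint
	LastUpdate int64   // compared against a cutoff to detect stale owners

	// Debug only, not used for decisions:
	OwnerName     *string
	LastUpdateRFC string // LastUpdate rendered as a human-readable timestamp
	FinishedRFC   string
}
```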

@ckatsaras-godaddy

Just checking in to see if a solution was ever found for this issue. I'm also experiencing it. @jney, @Sytten, @garethlewin, @cb-anthony-merrill

@garethlewin (Contributor)

No.

I still have never managed to get a reproducible situation, just a "sometimes it happens."

I've read through the code and debugged it multiple times and I cannot find a reason for this. I do not want to point the finger at AWS/DDB, so my assumption is that there is a bug in kinsumer; I just can't find it.

If you can reproduce or add any kind of diagnostic information beyond what has been shared, I'd love to fix this.

@ckatsaras-godaddy

Ah, fair enough @garethlewin! Thanks for the quick response! In terms of reproduction, what I've noticed is that it happens whenever I boot up new containers in ECS and the older containers attempt to release(). I wonder if the intervals I configured for WithLeaderActionFrequency and WithShardCheckFrequency are part of the problem (both are set to 1 second, which may be far too aggressive).
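For reference, a minimal sketch of that configuration, assuming kinsumer's builder-style Config (NewConfig and the time.Duration arguments are assumptions on my part; only the two With* option names come from the discussion above):

```go
package main

import (
	"time"

	"github.com/twitchscience/kinsumer"
)

func main() {
	// Mirrors the 1-second intervals described above.
	config := kinsumer.NewConfig().
		WithLeaderActionFrequency(1 * time.Second). // how often the leader rebalances shard ownership
		WithShardCheckFrequency(1 * time.Second)    // how often each client re-reads its shard assignment
	_ = config
}
```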

@ckatsaras-godaddy

Also, I think this is an issue when you have more containers than shards for a stream (e.g. 3 containers consuming from 2 shards). In the case where all 3 containers deregister together, 1 of those containers will not be the owner of a shard (which is to be expected) and thus will fail the DynamoDB update in release(), since it never owned any shard. Is this correct @garethlewin?

Sadly, I tested the theory above just now and the error still showed up once when I brought up 2 new containers and deregistered 2 old ones. The odd thing is, kinsumer still seems to consume from my stream properly, so I'm not actually sure whether this is an issue I should be too worried about.

@ckatsaras-godaddy

So after quite a bit of testing with ECS, I think I can provide a bit more context on what is happening @garethlewin.

My test starts with 2 containers running in ECS, consuming from a stream with 2 shards. Given this configuration, the checkpoints table has the following owner IDs:

Shard     OwnerID          ...
Shard1    Container 1 ID   ...
Shard2    Container 2 ID   ...

Whenever I deploy to ECS, 2 completely new containers are started, and for a brief period there are 4 containers running together. During that window, Container 3 and Container 4 are sometimes assigned ownership of the shards, which makes the checkpoints table look like this:

Shard     OwnerID          ...
Shard1    Container 3 ID   ...
Shard2    Container 4 ID   ...

So, when ECS scales down to the desired 2 containers and deregisters Container 1 and Container 2, they fail to remove their ownership from the checkpoints table because they are no longer the owners.

It's important to note that I can't seem to simulate this locally running multiple instances of my service but I can easily reproduce it in ECS. Given this information, do you have any idea why Kinsumer would be promoting the two new containers to be the owners of the shards prematurely?

Any help is really appreciated 😃

@ckatsaras-godaddy

Sorry for all the messages @garethlewin, but I think I have some additional information that may be useful as well.

I'm wondering if sorting the clients by OwnerID in the getClients function (https://github.com/twitchscience/kinsumer/blob/master/clients.go#L123, https://github.com/twitchscience/kinsumer/blob/master/clients.go#L36-L38) is potentially problematic and aggravating this issue as well.

It looks like getClients is used within refreshShards, and the order in which these clients are returned has a direct impact on whether or not a client is "promoted": https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L137-L157

So, with that being said, if the list is returned in a "new" order, it potentially leads to new clients being promoted and older clients no longer having their ID associated with a checkpoint when they attempt to "release". It seems a bit odd to be sorting on a "random" UUID, since it will lead to different outcomes each time (if I'm understanding correctly).

Let me know what you think of this and if there's any other information I can provide 😄

@garethlewin (Contributor)

No worries about the comments, but I no longer maintain this package, so I might not be super fast or responsive, but I'll do my best.

I haven't read (or written) this code in several years so I need to knock the rust off, but sorting by OwnerID is the primary idea that I built this entire library around.

By having all clients sort by the same field, we get a consistent understanding across all the clients of which shard is allocated to which client: we all agree that shard 0 goes to the first client, and so on. When new clients are added to the system, it is correct that a reallocation of shards will happen, but everyone is supposed to take a break and then move to the newly allocated shards.
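To illustrate what I mean (this is not kinsumer's actual code; the type and function names are made up for the example), deterministic assignment by sorted client ID looks roughly like this:

```go
package main

import (
	"fmt"
	"sort"
)

// client and assignShards are hypothetical names for illustration only;
// kinsumer's real logic lives in clients.go and kinsumer.go.
type client struct {
	ID string // registration ID (a random UUID in kinsumer)
}

// assignShards gives every client the same view of ownership: sort the
// clients by ID, then hand shard i to client i (modulo the client count).
// Because every client sorts the same data the same way, they all agree
// on the mapping without any extra coordination.
func assignShards(clients []client, shardIDs []string) map[string]string {
	sort.Slice(clients, func(i, j int) bool { return clients[i].ID < clients[j].ID })
	owners := make(map[string]string, len(shardIDs))
	for i, shard := range shardIDs {
		owners[shard] = clients[i%len(clients)].ID
	}
	return owners
}

func main() {
	clients := []client{{ID: "c-3"}, {ID: "c-1"}, {ID: "c-2"}}
	fmt.Println(assignShards(clients, []string{"shard-0", "shard-1"}))
}
```

Note that a new client whose ID happens to sort before the existing ones shifts every assignment, which is exactly the reallocation described above.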

Your ECS vs. local example is worrying to me. I wonder if there is a timestamp issue here. The older workers are supposed to give up consuming their shards, and it seems like you are finding that they are not giving them up.

Here is where everyone is supposed to check what shards they should consume:

https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L421

When we capture a shard, we are not supposed to take it until the existing reader has said they are done with it (rough sketch of that pattern at the end of this comment):

https://github.com/twitchscience/kinsumer/blob/master/shard_consumer.go#L75

I wonder if there is a race condition between releasing a shard and shutting down.
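At its core, the capture step is a DynamoDB conditional write along these lines (rough sketch only; the table name, attribute names, and condition here are illustrative, not kinsumer's actual schema):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// captureShard attempts to take ownership of a shard, but only if no one
// currently owns it or the previous owner has marked itself finished.
func captureShard(db *dynamodb.DynamoDB, shardID, myID string) error {
	_, err := db.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String("example_checkpoints"),
		Key: map[string]*dynamodb.AttributeValue{
			"Shard": {S: aws.String(shardID)},
		},
		UpdateExpression:    aws.String("SET OwnerID = :me"),
		ConditionExpression: aws.String("attribute_not_exists(OwnerID) OR attribute_exists(Finished)"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":me": {S: aws.String(myID)},
		},
	})
	// If the condition fails, DynamoDB returns ConditionalCheckFailedException,
	// the same error reported at the top of this issue (there it comes from
	// release, not capture).
	return err
}

func main() {
	db := dynamodb.New(session.Must(session.NewSession()))
	_ = captureShard(db, "shardId-000000000005", "my-client-id")
}
```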

@ckatsaras-godaddy

Thanks for all the details @garethlewin!
Okay, so it seems like sorting is to be expected here to "mix things up" and keep a healthy rotation. Yeah, it seems like there is some sort of issue where we are basically "prematurely" assigning new owners to shards before the older consumers can "release" their hold on them.
To be honest, I've done quite a bit of testing, and since the error happens on shutdown anyway, I simply call kinsumer.Stop() to ensure we clear the client table, etc. Ideally I'd love to have no errors on shutdown, but it doesn't seem to actually affect the functionality, as it's basically just saying

"I couldn't find that I was the Owner of a checkpoint, so I can't clear my ownership"

Given that you no longer maintain the project, I really appreciate you going out of your way to help me 😄 !

@garethlewin (Contributor)

No, the sorting isn't to mix things up. It's so that all the clients have consensus over who owns a shard, and who doesn't.

You should DEFINITELY call kinsumer.Stop() on shutdown or else your clients will hold onto the shards for longer than is needed.

@ckatsaras-godaddy

Oh, I see! Sorry for the confusion!

Just to clarify, I wonder if it's possible that release() is written to "fail" on any failure to update, when in reality there are scenarios where it isn't actually a failure.

Like in the scenario above, if there are 4 consumers and only 2 shards, we are guaranteed to have at least 2 consumers that don't own a checkpoint. If I'm understanding correctly, there's nothing stopping the 2 consumers that don't have ownership from exiting, and in that case it shouldn't be treated as an error that there is no ownership for them to clear.

I wonder if the improved logic for release() would be (see the sketch at the end of this comment):

  1. Check if ownership exists
  2. Clear ownership IF applicable

What do you think of that @garethlewin ? 😄
https://github.com/twitchscience/kinsumer/blob/master/checkpoints.go#L203-L236
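Something like this rough sketch of a more tolerant release(), written directly against the AWS SDK for Go (the table name, attribute names, and condition are illustrative, not kinsumer's actual schema):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

// releaseShard clears ownership of a shard only if we still own it, and
// treats "we were not the owner" as a no-op instead of an error.
func releaseShard(db *dynamodb.DynamoDB, shardID, myID string) error {
	_, err := db.UpdateItem(&dynamodb.UpdateItemInput{
		TableName: aws.String("example_checkpoints"),
		Key: map[string]*dynamodb.AttributeValue{
			"Shard": {S: aws.String(shardID)},
		},
		UpdateExpression:    aws.String("REMOVE OwnerID"),
		ConditionExpression: aws.String("OwnerID = :me"),
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":me": {S: aws.String(myID)},
		},
	})
	if awsErr, ok := err.(awserr.Error); ok &&
		awsErr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
		// Someone else owns (or already cleared) this shard; nothing to release.
		return nil
	}
	return err
}

func main() {
	db := dynamodb.New(session.Must(session.NewSession()))
	_ = releaseShard(db, "shardId-000000000005", "my-client-id")
}
```

This only illustrates the ownership-condition part of the idea; the rest of what kinsumer's release() records would stay as-is.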

@garethlewin (Contributor)

I think you might be on to something. There is an error for "No shard assigned", but I am not sure what happens in that case if the consumer had a shard before.

@ckatsaras-godaddy

That's good to hear!

For the "No shard assigned" case, I think this section here https://github.com/twitchscience/kinsumer/blob/master/kinsumer.go#L205-L207 covers the scenario where we have more consumers than shards which is why we wouldn't hit that particular error.

I think I'll look into creating a simple check for ownership and putting it in a PR for review. Once again, since you're not longer the maintainer of the project, don't feel you have to review, etc 😄

As I was saying earlier, now that we have the knowledge above about when the scenario occurs and how to safely clean up in this scenario, it's more about just removing the error so that others in the future aren't lead in the wrong direction
