ECS Deployment Circuit Breaker Support #185

Open
bfox1793 opened this issue Oct 13, 2021 · 11 comments

@bfox1793

Is there currently support for tracking when a deployment fails and the ECS deployment circuit breaker rolls it back? I just tried forcefully failing a deployment, and while ECS correctly triggered the circuit breaker and rolled back, ecs deploy considered it a successful deployment.

Not sure if this is a bug or working as intended.

@bfox1793
Author

Digging around a bit more, it seems like ecs-deploy considers the deployment successful once the rollback removes the attempted deployment from ECS. Not sure where that lives in the code, but a possible fix might be logic that says "if AWS returns a 'deployment does not exist / is deleted' response, consider the deployment failed", e.g.:
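
Something along these lines might work (a hypothetical sketch using boto3, not based on the actual ecs-deploy internals; the function name and arguments are made up):

```python
# Hypothetical sketch: after starting a deployment, poll the service and
# treat the disappearance of the new deployment (or a FAILED rollout
# state) as a failed deployment.
import boto3

ecs = boto3.client("ecs")

def deployment_failed(cluster, service, deployment_id):
    response = ecs.describe_services(cluster=cluster, services=[service])
    deployments = response["services"][0]["deployments"]
    current = next((d for d in deployments if d["id"] == deployment_id), None)
    if current is None:
        # ECS removed the deployment, i.e. the circuit breaker rolled it back
        return True
    # rolloutState may be absent on older services; FAILED means rolled back
    return current.get("rolloutState") == "FAILED"
```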

@bfox1793
Author

@fabfuel any thoughts on the above? I can provide more context if it's helpful.

@fabfuel
Owner

fabfuel commented Oct 26, 2021

Hi Brett, sorry for the late response.

In March I added support for the ECS Circuit Breaker, including parsing of the rollout state. This means ecs-deploy should recognize if the deployment was rolled back.
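
Simplified, the check looks at the `rolloutState` that ECS reports for the new (PRIMARY) deployment, something like this (an illustration, not the exact ecs-deploy source):

```python
# Illustration of the rollout-state check (not the exact ecs-deploy code):
# the PRIMARY deployment carries the rollout state of the current rollout.
def rollout_state(service):
    """Return 'IN_PROGRESS', 'COMPLETED' or 'FAILED' for the new deployment."""
    for deployment in service["deployments"]:
        if deployment["status"] == "PRIMARY":
            return deployment.get("rolloutState")
    return None
```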

Could you share some more details, especially about the timing? How long did you instruct ecs-deploy to wait for the deployment to finish?

Best
Fabian

@bfox1793
Author

bfox1793 commented Nov 1, 2021

Hi Fabian!

I have the timeout set to 600, but it's not hitting the timeout. Here's what I'm seeing:

  • Deployment starts
  • Tasks fail to start up
  • ECS marks the deployment as failed and triggers a rollback
  • The ecs deploy CLI reports "Deployment successful" and de-registers the active revision
  • The ECS console shows the old task still running, but its revision is now deregistered

Here's a screenshot of my console output:

[screenshot: ecs deploy console output]

Below are two snippets from the ECS console, one showing the failed deployment and the other showing the active running tasks:

[screenshot: failed deployment]

[screenshot: active running tasks]

Happy to provide any additional information if it's helpful! The built-in rollback functionality seems to work as intended though (when the circuit breaker is disabled), so that's good news!

Brett

@fabfuel
Owner

fabfuel commented Nov 14, 2021

Hi Brett, thanks for all the details! I'll try to reproduce it myself and keep you updated.
Do you remember what the issue was that stopped the container from spinning up properly? Was it a configuration error (on the ECS/Docker level) or an issue with the application (inside the container)?
ECS distinguishes (or at least used to, I need to check) between how/why a container failed to start and how the circuit breaker kicks in.

Best
Fabian

@bfox1793
Author

Hey Fabian!

Appreciate you digging into this! The issue was that the container was missing a required env var, which caused it to fail on start-up. However, the container doesn't exit properly, so eventually it just times out and never passes the ALB health checks. This causes the deployment to keep recycling containers until the circuit breaker is tripped.

For added color, when I tested this with the circuit breaker turned on, ecs deploy would eventually time out (since the deploy was never successful). If I turned on the --rollback flag, it would successfully roll back. If I didn't, the deploy would continue indefinitely, spinning up new containers and recycling them after the ALB health check fails.
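
For reference, "circuit breaker turned on" here means the ECS-side service setting (as opposed to ecs-deploy's own --rollback flag), which corresponds to roughly this boto3 call; cluster and service names are placeholders:

```python
# Enable the ECS deployment circuit breaker with automatic rollback on
# the service (cluster/service names are placeholders).
import boto3

ecs = boto3.client("ecs")
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    deploymentConfiguration={
        "deploymentCircuitBreaker": {
            "enable": True,
            "rollback": True,  # roll back to the last steady state on failure
        }
    },
)
```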

Let me know if any other info would be helpful!

Brett

@bfox1793
Author

bfox1793 commented Jan 3, 2022

Hi @fabfuel! I hope you had a great New Year!

I was wondering if there were any updates on the above? We'd like to enable circuit breaker support, but we're hitting a bit of a snag because of the issue described above.

Thanks!

Brett

@fabfuel
Owner

fabfuel commented Jan 5, 2022

Hi Brett,

thanks a lot, I also wish you a happy and healthy new year!

I still need to dig into this a bit more and double-check whether it might be related to this ECS behavior:
aws/containers-roadmap#1206
There are cases where the ECS Circuit Breaker does not correctly recognize a failing container.

How ecs-deploy behaves in my test case in conjunction with the ECS circuit breaker is described (with a screenshot) here; it should recognize the rollback and throw an error: #161 (comment)

In your case, what happened to the container due to the missing env var? Did the app crash or exit with exit code 1 or something similar?

Thanks for the details, and sorry for moving slowly, busy times 😅

Best
Fabian

@mohamed-haidara-cko

Hi there,

I'm facing a similar issue where ecs deploy ... marks the deployment as successful even though it wasn't. I also have the circuit breaker enabled with rollback.

The container fails due to a permission issue just after entering the RUNNING state. The new deployment is rolled back after some tasks fail to start.

Is there anything I can do to move this forward? Happy to submit a PR.


In what seems to be an edge case (same configuration as above), I also got this error:

[screenshot: Python error, 2023-12-27]

@fabfuel
Owner

fabfuel commented Dec 28, 2023

Hi @mohamed-haidara-cko,

thanks for the details. I will look into the Python error.

Could you share your task definition (JSON)?

Thanks
Fabian

@mohamed-haidara-cko

Hi @fabfuel

Completely missed your message. Unfortunately, this specific task definition has been deleted. Is there anything else I can do to help? I'll try to replicate the issue on our side as well.
