
Pods in error state #52

Open
yuriipolishchuk opened this issue May 10, 2022 · 34 comments
Labels: bug (Something isn't working), seleniferous (sidecar container)

@yuriipolishchuk

After upgrading the Chrome image to version 101 (this may be a coincidence), we see that some Chrome pods are in an error state, with entries like these in the log:

seleniferous {"level":"warning","msg":"session chrome-101-0-b1520d9a-82d5-403d-95ac-b66643885aae delete request failed: delete request failed: Delete \"http:///wd/hub/session/chrome-101-0-b1520d9a-82d5-403d-95ac-b66643
seleniferous {"level":"info","msg":"stopping seleniferous: failed to perform delete request","time":"2022-05-10T11:50:34Z"}
seleniferous {"level":"fatal","msg":"failed to stop%!(EXTRA context.deadlineExceededError=context deadline exceeded)","time":"2022-05-10T11:50:49Z"}

I've already tried to increase gracefulTerminationPeriod but that didn't help.

Here are the timeout values we use:

browser-wait-timeout: 20s
session-wait-timeout: 30s
session-idle-timeout: 1m0s
graceful-shutdown-timeout: 20s
@SebastianPereiro

@yuriipolishchuk Hi
Have you found a way to fix this error?

@alcounit
Owner

@yuriipolishchuk can you please share the full log from the seleniferous container?

@SebastianPereiro

@alcounit I can share mine
chrome94.log

@alcounit

@SebastianPereiro thanks, will check

@alcounit alcounit added bug Something isn't working seleniferous sidecar container labels Feb 15, 2023
@alcounit alcounit self-assigned this Feb 15, 2023
alcounit added a commit to alcounit/seleniferous that referenced this issue Feb 15, 2023
@alcounit

@SebastianPereiro, @yuriipolishchuk should be fixed now, update seleniferous image version to alcounit/seleniferous:v1.0.4

@SebastianPereiro

@alcounit I'm doing the tests, so far so good - I haven't seen a single failed POD today.
I'll do some more stress tests and keep you posted.
Thank you very much for such a quick response!

@SebastianPereiro

Hi @alcounit! We ran a much bigger stress test today, and unfortunately got more of the same errors.
I'm attaching two sample logs:
selenosis-chrome-91-0-f1057a7c-214d-41f9-881d-4c390ca87978-1676797786597419213.log
selenosis-chrome-105-0-6e5adda6-b83a-4079-839c-c4b60ae60209-1676798322725726600.log

@SebastianPereiro

SebastianPereiro commented Feb 19, 2023

Maybe the problem is on the browser side? If so, can we simply change the browser pod's restart policy from "restartPolicy: Never" to "restartPolicy: Always" and ignore it?

@alcounit

alcounit commented Feb 19, 2023

@SebastianPereiro what are the timeouts for the seleniferous container?

@SebastianPereiro

@alcounit I didn't change them, so they are default.

@alcounit

I know how to fix this; it will be a quick fix. But in your logs I see that the session timed out and the delete request failed. Is something wrong with the browser?
seleniferous {"level":"info","msg":"proxy session","request":"DELETE /wd/hub/session/chrome-91-0-f1057a7c-214d-41f9-881d-4c390ca87978","request_by":"","request_id":"d3975f60-d8b5-441e-883f-eba0da2dcb23","time":"2023-02-19T08:50:21Z"}
seleniferous {"level":"warning","msg":"session chrome-91-0-f1057a7c-214d-41f9-881d-4c390ca87978 delete request failed: delete request failed: Delete \"http://127.0.0.1:4445/wd/hub/session/chrome-91-0-f1057a7c-214d-41f9-881d-4c390ca87978\": context deadline exceeded","request":"POST /wd/hub/session/chrome-91-0-f1057a7c-214d-41f9-881d-4c390ca87978/execute/sync","request_by":"selenosis-6f7479ddb5-58kwr","request_id":"194cb85d-cd13-4599-8425-6bbcadc51f86","time":"2023-02-19T08:50:51Z"}
seleniferous {"level":"info","msg":"stopping seleniferous: session deleted","time":"2023-02-19T08:50:51Z"}

@SebastianPereiro

@alcounit You're right; I assume something is wrong with our QA team's tests.
Maybe some of the tests hang random browser containers...

@alcounit

@SebastianPereiro try new seleniferous image alcounit/seleniferous:v1.0.5

@SebastianPereiro

@alcounit Thank you. I've updated the selenosis config and will run more tests soon.

@SebastianPereiro

@alcounit I've just got a fresh error (looks the same). Attaching logs...
selenosis-chrome-93-0-ed7fdb76-bfbd-4686-a439-c01c94769315-1677149444657623189.log

@alcounit

@SebastianPereiro I don't see any error in the log you provided; can you give more details on where you see an error?

@SebastianPereiro

@alcounit Sorry, my bad (I attached the wrong log). Doing more tests...

@SebastianPereiro

Actually it wasn't the wrong log. I got a new log from a similar pod:
selenosis-chrome-90-0-1eb493e5-3871-4f67-b512-f8a610712fe6-1677657742320591833.log
Here is how it looks in k9s:
[screenshot: k9s pod list]
Seleniferous exited normally, but the browser container has been alive for 19 hours.

@SebastianPereiro

@alcounit Found another strange situation during a prolonged test:

  1. The browser limit was set to 5000.
  2. At the moment I saw the problem, we had ~10-20 browsers open.
  3. There were more than 60 selenosis pods.
  4. The logs of all selenosis containers were full of:
selenosis-956f48d4b-gz6qn time="2023-03-01T11:49:05Z" level=info msg=session request="POST /wd/hub/session" request_id=d071d5f1-1690-4f2d-8db6-a587fbe175f8 time_elapsed=0.00s
selenosis-956f48d4b-gz6qn time="2023-03-01T11:49:05Z" level=info msg="starting browser from image: us-east1-docker.pkg.dev/hidden/selenoid/firefox:97.0" request="POST /wd/hub/session" request_id=d071d5f1-1690-4f2d-8db6-a587fbe175f8 time_elapsed=0.00s
selenosis-956f48d4b-gz6qn time="2023-03-01T11:49:06Z" level=error msg="failed to start browser: failed to create pod pods \"firefox-97-0-31b0115b-ab02-4d20-916e-00686939e3be\" is forbidden: exceeded quota: selenosis-pod-limit, requested: pods=1, used: pods=5044, limited: pods=5044" request="POST /wd/hub/session" request_id=d071d5f1-1690-4f2d-8db6-a587fbe175f8 time_elapsed=0.26s

Could it be related to stuck browser containers?
What can you suggest?

@alcounit

alcounit commented Mar 2, 2023

> Actually it wasn't the wrong log. I got a new log from a similar pod: selenosis-chrome-90-0-1eb493e5-3871-4f67-b512-f8a610712fe6-1677657742320591833.log. Here is how it looks in k9s: [screenshot]. Seleniferous exited normally, but the browser container is still alive for 19 hours.

This is odd.

> @alcounit Found another strange situation during a prolonged test:
>
>   1. the browser limit was set to 5000
>   2. At the moment I saw the problem, we had ~10-20 browsers open
>   3. more than 60 selenosis PODs
>   4. the logs of all selenosis containers were full of:
> selenosis-956f48d4b-gz6qn time="2023-03-01T11:49:05Z" level=info msg=session request="POST /wd/hub/session" request_id=d071d5f1-1690-4f2d-8db6-a587fbe175f8 time_elapsed=0.00s
> selenosis-956f48d4b-gz6qn time="2023-03-01T11:49:05Z" level=info msg="starting browser from image: us-east1-docker.pkg.dev/hidden/selenoid/firefox:97.0" request="POST /wd/hub/session" request_id=d071d5f1-1690-4f2d-8db6-a587fbe175f8 time_elapsed=0.00s
> selenosis-956f48d4b-gz6qn time="2023-03-01T11:49:06Z" level=error msg="failed to start browser: failed to create pod pods \"firefox-97-0-31b0115b-ab02-4d20-916e-00686939e3be\" is forbidden: exceeded quota: selenosis-pod-limit, requested: pods=1, used: pods=5044, limited: pods=5044" request="POST /wd/hub/session" request_id=d071d5f1-1690-4f2d-8db6-a587fbe175f8 time_elapsed=0.26s
>
> Could it be related to stuck browser containers? What can you suggest?

If pods are not deleted, you will quickly reach the quota limit. We need to understand why pods are not deleted after the Kubernetes API call.
Thanks for the updates.

@SebastianPereiro

@alcounit I also see these errors from time to time:
time="2023-03-02T08:22:22Z" level=error msg="Failed to start browser: failed to create pod Operation cannot be fulfilled on resourcequotas \"selenosis-pod-limit\": the object has been modified; please apply your changes to the latest version and try again" request="POST /wd/hub/session" request_id=6424f050-ae4c-4957-a3ef-bd1fe5c7872b time_elapsed=0.64s
What does this mean?

@SebastianPereiro

@alcounit Looks like there is some sort of session-stacking bug. We can see it in Selenosis RAM usage:
[screenshots: Selenosis RAM/CPU usage graphs]
This bug occurs when we have 500+ browser sessions. Selenosis pods start to consume lots of RAM/CPU (our limit is set to 1 GB).
The pod logs are full of "failed to start browser: failed to create pod pods "firefox-90-0-5ebe9298-b4fc-410f-bc11-9bb16f8f8b60" is forbidden: exceeded quota: selenosis-pod-limit, requested: pods=1, used: pods=2065".
At the same time, the real number of pods can be less than 100.

@SebastianPereiro

Temporary fix: scale the Selenosis deployment to 0 pods, then scale it back up to the HPA-recommended value.
This helps, but only for a short time; if there are a lot of browsers, the pod-limit errors come back quickly.

@alcounit

alcounit commented Mar 7, 2023

@SebastianPereiro sorry for the delays. I'll try to fix the browser pod deletion issue and will post a develop image shortly so you can test it.
Selenosis has a memory issue in some cases; I have plans to rewrite the whole project.

@SebastianPereiro

@alcounit No problem. Thank you!
P.S. As a workaround for stale browser pods, I recompiled Selenosis with "RestartPolicyAlways".

@SebastianPereiro

Let me know if I can provide more diagnostic information.

@alcounit

@SebastianPereiro can you please try this image alcounit/seleniferous:develop_v1.0.6
Changes: alcounit/seleniferous@a71a453

@SebastianPereiro

@alcounit Thank you. I tried it, but browsers failed to start. I got the following errors.
From the seleniferous logs:

repository: "alcounit/seleniferous"
tag: "develop_v1.0.6"
browser Starting ChromeDriver 105.0.5195.19 (b9c217c128c16f53d12f9a02933fcfdec1bf49af-refs/branch-heads/5195@{#176}) on port 4444
browser All remote connections are allowed. Use an allowlist instead!
browser Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
browser ChromeDriver was started successfully.
seleniferous exec /seleniferous: exec format error
Stream closed EOF for devops/chrome-105-0-760376d1-4d43-4e4a-a0d2-f3c11c757818 (seleniferous)
Stream closed EOF for devops/chrome-105-0-760376d1-4d43-4e4a-a0d2-f3c11c757818 (browser)

Selenosis logs:

selenosis:v1.0.5
time="2023-03-23T08:05:34Z" level=info msg=session request="POST /wd/hub/session" request_id=0f09ae27-18a6-41d6-82e2-90112970f462 time_elapsed=0.00s
time="2023-03-23T08:05:34Z" level=info msg="starting browser from image: us-east1-docker.pkg.dev/non-production-80fc/selenoid/chrome:107.0" request="POST /wd/hub/session" request_id=0f09ae27-18a6-41d6-82e2-90112970f462 time_elapsed=0.00s
time="2023-03-23T08:05:34Z" level=error msg="session failed: Post "http://chrome-98-0-69ecc584-3bc6-4626-a348-a95e3de6b2d1.seleniferous:4445/wd/hub/session\": dial tcp 10.53.1.17:4445: connect: connection refused" request="POST /wd/hub/session" request_id=2a8c9fe4-e4f7-4769-a437-5d6427f472e7 time_elapsed=9.50s
time="2023-03-23T08:05:34Z" level=info msg=session request="POST /wd/hub/session" request_id=1f9f9e3c-4569-44e0-85a5-83b2e94dd264 time_elapsed=0.00s
time="2023-03-23T08:05:34Z" level=info msg="starting browser from image: us-east1-docker.pkg.dev/ /chrome:101.0" request="POST /wd/hub/session" request_id=1f9f9e3c-4569-44e0-85a5-83b2e94dd264 time_elapsed=0.00s
time="2023-03-23T08:05:35Z" level=info msg=session request="POST /wd/hub/session" request_id=f88c1959-cfeb-43a4-a8ff-78d6a390a0cb time_elapsed=0.00s
time="2023-03-23T08:05:35Z" level=info msg="starting browser from image: us-east1-docker.pkg.dev/ /chrome:103.0" request="POST /wd/hub/session" request_id=f88c1959-cfeb-43a4-a8ff-78d6a390a0cb time_elapsed=0.00s
time="2023-03-23T08:05:43Z" level=error msg="session failed: Post "http://chrome-107-0-ac970928-0456-483f-b8eb-c195bdc6f6a7.seleniferous:4445/wd/hub/session\": dial tcp 10.53.1.18:4445: connect: connection refused" request="POST /wd/hub/session" request_id=0f09ae27-18a6-41d6-82e2-90112970f462 time_elapsed=9.19s
time="2023-03-23T08:05:43Z" level=error msg="session failed: Post "http://chrome-101-0-2c04a3e5-1bb9-4c75-a59e-bab61b2c9228.seleniferous:4445/wd/hub/session\": dial tcp 10.53.1.19:4445: connect: connection refused" request="POST /wd/hub/session" request_id=1f9f9e3c-4569-44e0-85a5-83b2e94dd264 time_elapsed=8.73s
time="2023-03-23T08:05:44Z" level=error msg="session failed: Post "http://chrome-103-0-8e797f76-1683-4701-a13d-86945c9fbd7e.seleniferous:4445/wd/hub/session\": dial tcp 10.53.1.20:4445: connect: connection refused" request="POST /wd/hub/session" request_id=f88c1959-cfeb-43a4-a8ff-78d6a390a0cb time_elapsed=8.75s

@alcounit

My bad, the image was built for the arm64 architecture; try alcounit/seleniferous:develop_v1.0.6.1

@SebastianPereiro

@alcounit Many thanks! Testing...

@SebastianPereiro

OK, I ran my tests, and some pods still end up in an 'unfinished' state (seleniferous is completed, but the browser is still running).
[screenshot: pod list]
Attaching one of the logs.
devops-chrome-105-0-2a722778-7506-4b86-9ebf-cc14e633bd1e-1679926652109504881.log

@alcounit

alcounit commented Mar 30, 2023

This looks odd; the pod delete call failed:

seleniferous {"error":"Delete "https://10.53.144.1:443/api/v1/namespaces/devops/pods/chrome-105-0-2a722778-7506-4b86-9ebf-cc14e633bd1e\": dial tcp 10.53.144.1:443: i/o timeout","level":"info","msg":"deleting pod chrome-105-0-2a722778-7506-4b86-9ebf-cc14e633bd1e","time":"2023-03-27T14:14:45Z"}

@SebastianPereiro

Could you please increase the timeout value and add more retries for pod deletion?

@alcounit

> Could you please increase the timeout value and add more retries for pod deletion?

@SebastianPereiro will do, sorry for the delays
