Docker container fails to restart #2381

Open
Infinoid opened this issue Oct 6, 2021 · 1 comment

Comments

@Infinoid

Infinoid commented Oct 6, 2021

Summary

The cyberark/conjur docker container does not restart gracefully. It leaves a stale pidfile behind, and then refuses to start.

Steps to Reproduce

  1. Follow the quickstart setup instructions.
  2. Restart the server container: docker restart conjur_server

(Restarting the docker host machine is also sufficient to reproduce the problem.)

Expected Results

The Conjur server serves requests again after the restart.

Actual Results (including error logs, if applicable)

The server did not restart properly. Clients get "connection refused" when attempting to contact it.

The output of docker logs conjur_server contains an error message saying that a server is already running and pointing at the existing PID file.

This log contains output from both the first run and the second (failed) one:

authn-local is listening at /run/authn-local/.socket
=> Booting Puma
=> Rails 5.2.6 application starting in production 
=> Run `rails server -h` for more startup options
[19] Puma starting in cluster mode...
[19] * Puma version: 5.3.2 (ruby 2.5.8-p224) ("Sweetnighter")
[19] *  Min threads: 5
[19] *  Max threads: 5
[19] *  Environment: development
[19] *   Master PID: 19
[19] *      Workers: 2
[19] *     Restarts: (✔) hot (✖) phased
[19] * Preloading application
[19] * Listening on http://0.0.0.0:80
[19] Use Ctrl-C to stop
CONJ00038I OpenSSL FIPS mode set to true
Loaded configuration:
- trusted_proxies from defaults
- authenticators from defaults
[19] - Worker 0 (PID: 24) booted in 0.0s, phase: 0
Loaded configuration:
- trusted_proxies from defaults
- authenticators from defaults
[19] - Worker 1 (PID: 28) booted in 0.0s, phase: 0
error: SIGTERM
A server is already running. Check /opt/conjur-server/tmp/pids/server.pid.
=> Booting Puma
=> Rails 5.2.6 application starting in production 
=> Run `rails server -h` for more startup options
Exiting
authn-local is listening at /run/authn-local/.socket

Search the above for server.pid.

Reproducible

I don't know whether it is 100% reproducible, but it occurs at least 50% of the time for me. It has been happening for the past year or more, whenever system updates on the docker host machine require a reboot.

Version/Tag number

Latest. Currently failing on docker image sha256:3f552a4b683b064e45265ba875f6fcc797170a8a3f93ff90e81e5f9df337682e, tagged as 1.13.1.

Environment setup

This happens in the environment set up by following the quickstart instructions without any modifications.

Docker version 20.10.7, build 20.10.7-0ubuntu1~20.04.2

With minor changes to the docker-compose.yml file (just adding "docker://" prefixes), I also see the same problem with podman-compose.

podman version 3.0.1

Additional Information

Once the stale pidfile is present, the server will NEVER restart until it is removed. It can be removed as follows:

docker exec conjur_server rm /opt/conjur-server/tmp/pids/server.pid; docker restart conjur_server

When the server is in the bad state, docker top conjur_server shows fewer processes running.

Good:

USER   PID   PPID   %CPU    ELAPSED          TTY   TIME   COMMAND
root   1     0      0.000   1m4.601003137s   ?     0s     ruby /usr/local/bin/conjurctl server
root   10    1      0.000   1m1.601150286s   ?     0s     sh -c rails server -p '80' -b '0.0.0.0'
root   13    1      1.623   1m1.601969553s   ?     1s     ruby /var/lib/ruby/bin/rake authn_local:run
root   16    1      3.247   1m1.602483189s   ?     2s     ruby /var/lib/ruby/bin/rake expiration:watch
root   19    10     3.247   1m1.603541028s   ?     2s     puma 5.3.2 (tcp://0.0.0.0:80) [Conjur API Server]
root   24    19     0.000   59.603698347s    ?     0s     puma: cluster worker 0: 19 [Conjur API Server]
root   28    19     0.000   59.603843814s    ?     0s     puma: cluster worker 1: 19 [Conjur API Server]

Bad:

USER   PID   PPID   %CPU    ELAPSED           TTY   TIME   COMMAND
root   1     0      0.000   6m45.406072271s   ?     0s     ruby /usr/local/bin/conjurctl server 
root   13    1      0.248   6m43.406187601s   ?     1s     ruby /var/lib/ruby/bin/rake authn_local:run 
root   16    1      0.496   6m43.406293512s   ?     2s     ruby /var/lib/ruby/bin/rake expiration:watch 
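
The difference between the two listings can be checked mechanically: the good state has Puma processes in the COMMAND column and the bad state has none. A minimal sketch, assuming the container name from the quickstart; `has_puma` and `conjur_api_is_up` are hypothetical helper names, not part of Conjur or Docker:

```shell
#!/bin/sh
# Reads a process listing on stdin (for example the output of `docker top`)
# and succeeds if any line shows a Puma process in its command.
has_puma() {
    grep -Eq '(^|[[:space:]])puma[: ]'
}

# Hypothetical wrapper: succeeds if the given container (default: conjur_server)
# currently has a Puma process running.
conjur_api_is_up() {
    docker top "${1:-conjur_server}" | has_puma
}
```

Running `conjur_api_is_up || echo "conjur_server is in the bad state"` on the docker host would then flag a container where only conjurctl and the rake tasks are left.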

I think that the docker init script should clean up stale PID files. Alternatively, the server process could check whether the PID recorded in the file belongs to a running process other than itself.
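
The liveness check suggested here could look roughly like the following in a shell wrapper. This is only a sketch under stated assumptions, not how Conjur or Rails handles the pidfile: `remove_stale_pidfile` is a hypothetical name, and the check relies on `kill -0`, which sends no signal and only reports whether the PID can be signalled.

```shell
#!/bin/sh
# Hypothetical helper: remove a pidfile only when it looks stale.
# Returns 0 if the file is absent or was removed,
# 1 if a different, live process still owns the recorded PID.
remove_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 0            # nothing to clean up

    pid=$(tr -d '[:space:]' < "$pidfile")

    case $pid in
        ''|*[!0-9]*)                         # empty or non-numeric content
            rm -f "$pidfile"
            return 0
            ;;
    esac

    # Keep the file only if the PID is not our own and can be signalled.
    if [ "$pid" -ne "$$" ] && kill -0 "$pid" 2>/dev/null; then
        return 1
    fi

    rm -f "$pidfile"
}
```

Calling `remove_stale_pidfile /opt/conjur-server/tmp/pids/server.pid` before starting the server would clear the file left over from the previous run. One caveat: PIDs start from low numbers again in a restarted container, so an unrelated process may hold the recorded PID, in which case this check would wrongly keep the file.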

@doodlesbykumbi
Contributor

Thanks for posting this issue, @Infinoid. I was able to reproduce it. I happened to have the v1.11.6 image and noticed that it deals gracefully with those stale PID files, whereas v1.13.1 does not. I looked at the diff between v1.11.6 and v1.13.1, but nothing seems out of the ordinary.

Right now it's not clear what's causing this behavior, and so this will likely require further investigation.

A quick fix is to comment out the command on the conjur service in your docker-compose.yml and specify an entrypoint that cleans up the PID file:

    # command: server
    entrypoint: ["sh", "-c", "rm -f /opt/conjur-server/tmp/pids/server.pid; conjurctl server"]
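
The same workaround can also be written as a small entrypoint script instead of an inline `sh -c` string. The sketch below assumes the pidfile path from the log above; `start_server` and the `PIDFILE` variable are hypothetical names introduced for illustration. It uses `exec` so that the server command replaces the shell rather than running as its child.

```shell
#!/bin/sh
# Hypothetical entrypoint helper: delete any leftover pidfile, then hand
# control to the command given as arguments.
start_server() {
    rm -f "${PIDFILE:-/opt/conjur-server/tmp/pids/server.pid}"
    # exec replaces this shell with the command, so signals sent to the
    # process (such as SIGTERM on container stop) reach the command directly.
    exec "$@"
}

# In a real entrypoint script the last line would be:
#   start_server conjurctl server
```

Like the inline version, this removes the pidfile unconditionally, which is reasonable only because nothing from the previous run survives a container restart.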
