Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full Docker support #195

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

Conversation

acrois
Copy link

@acrois acrois commented Jul 30, 2021

Based on the initial work done by Fusl (credits preserved), the Dockerfile has been updated and improved as well as all its usage documented in the README. It strives to address all of the usability concerns mentioned amongst several other issues on this topic.

Please review the usage to begin running Docker grab-site on your machine

This new image is based on Alpine 3.13 and Python 3.7. The advantage with this new addition is that no previous support exists, and this breaks no existing installations only broadens them. Maintaining this will be easy and can be integrated into Travis CI. If you wish to run multiple grab-site instances, you may want to mount a different volume to your host machine for each container.

One limitation, which is unavoidable unless we make some slight changes to the existing installation process, is that external dependencies (Python libs defined in setup.py) are not able to be cached so any changes to the code will cause the dependencies to be downloaded fresh on a subsequent build.

Ideally, we would download these dependencies as a separate build step and then copy the code over and run the setup. It may be as simple as building the image and then running pip freeze > requirements.txt and then pip install but further testing would be required to determine what paths setup.py standalone vs pip install would use and didn't seem important enough to warrant the complexity of figuring it out.

There is no Docker Registry usage. If this is desired by the community in the future, it can be added simply by additional steps using Travis CI if we can integrate it and have it automatically deploy to GitHub Packages and/or Docker Hub from successful Travis CI build.

Finally, I will provide support to maintain and improve this image and its usage/deployment in the future.

@brandongalbraith
Copy link

brandongalbraith commented Jul 30, 2021

@acrois THANK YOU for your work on this 👏👏👏

@ivan
Copy link
Contributor

ivan commented Aug 1, 2021

Since I am somewhat occupied at the moment: would a grab-site user kindly test this and all of the new README instructions? Feel free to post feedback or code review.

Copy link
Contributor

@ivan ivan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Could you please clarify the build step so that I can continue testing?

Dockerfile Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@acrois
Copy link
Author

acrois commented Aug 6, 2021

In order to simplify the data storage and improve portability, I have made several improvements to the design of the Docker environment and taking advantage of some default behaviors, can completely remove the need to specify the "--dir" flag at runtime by default.

I was working on a side project, related to grab-site and I discovered a pain when I was trying to automate running it in a cluster.

I ended up isolating all grab-site related processes to their own network, for increased security.

grab-site stores data in whatever current working directory you are in, so I simply added grab-site's install dir to the PATH so that references to scripts need "./" to refer to them. This allows us to simplify specifying one argument for persistence -v ./data:/data:rw.

Quick start usage as well as changes regarding this change can be found in the updated README.

Let me know if this is reproducible for you. I've run it on both Windows and Debian Docker hosts.

I still plan on updating more information on what each parameter in the docker run does but it won't be functionally changing at all.

@ivan
Copy link
Contributor

ivan commented Aug 28, 2021

Sorry for the delay, I had a lot of things going on.

I followed the updated instructions (though I substituted grab-network with gs-network) on NixOS and got:

at@dev:/home/at/code/archiving/grab-site:docker-pr-2# docker run --net=gs-network --rm -d -e GRAB_SITE_HOST=gs-server -v ./data:/data:rw grab-site:latest grab-site https://example.com/
docker: Error response from daemon: create ./data: "./data" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.
See 'docker run --help'.

at@dev:/home/at/code/archiving/grab-site:docker-pr-2# docker --version
Docker version 20.10.7, build v20.10.7

Did I do something wrong? Do I need to make some directory exist first or create some volume?

In general, if we are to support Docker, I would really like to have instructions that would work for a first-time Docker user, with everything made explicit. Otherwise people will generally just give up.

@ivan
Copy link
Contributor

ivan commented Aug 28, 2021

Ah, I see # Ensure that ./data exists and container has write privileges on volume, so perhaps it should say

mkdir -p data

and then -v $PWD/data:/data:rw in the docker run args.

@ivan
Copy link
Contributor

ivan commented Aug 28, 2021

When I start a grab-site container as per the Docker instructions, my containers appear to die before they do anything. journalctl -b -u docker shows

Aug 28 01:42:45 dev f4baf8e0cffb[188335]: + exec grab-site https://example.com/
Aug 28 01:42:46 dev f4baf8e0cffb[188335]: Traceback (most recent call last):
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/app/grab-site", line 4, in <module>
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     main.main()
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 1137, in __call__
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     return self.main(*args, **kwargs)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 1062, in main
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     rv = self.invoke(ctx)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     return ctx.invoke(self.callback, **ctx.params)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/home/grab-site/.local/lib/python3.7/site-packages/click/core.py", line 763, in invoke
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     return __callback(*args, **kwargs)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/app/libgrabsite/main.py", line 293, in main
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     os.makedirs(working_dir)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:   File "/usr/local/lib/python3.7/os.py", line 223, in makedirs
Aug 28 01:42:46 dev f4baf8e0cffb[188335]:     mkdir(name, mode)
Aug 28 01:42:46 dev f4baf8e0cffb[188335]: PermissionError: [Errno 13] Permission denied: '/data/example.com-2021-08-28-d9c5308a'

We probably need instructions for "Ensure that [...] container has write privileges on volume" as well if we are to support this.

chmod 0777 data appears to work but I would like to have the proper solution.

@ivan
Copy link
Contributor

ivan commented Aug 28, 2021

regarding the permissions, I am told

"Fusl's Dockerfile simply ran grab-site as root inside the container, by the way. Not ideal, but since it's isolated to that container, probably fine? I'm not a Docker expert though.

The Dockerfile in the PR creates a user and group inside the container, which is cleaner, but then that uid/gid needs to have write access in the mountpoint as well."

@Nothing4You
Copy link

to my knowledge there's basically 2 approaches for this.

  1. launch as root, have an init script that performs initial setup stuff like chowning work directrories and afterwards dropping to a non-root user
  2. set a static uid the container runs as in the Dockerfile

you can also combine these as for example https://www.linuxserver.io/ does, the init script checks if it's running as root and it has env vars for dropping privilege, then it chowns the working directory and drops to the final user.

effectively there's also a third option that can be used regardless of the previous configuration, however, depending on which of the previous configurations are chosen this can be tricky to implement.
with the first option the user just passes uids of the user to run as env var, but you can also launch containers directly with specified uid/gid.

if you however specify VOLUMEs in the Dockerfile they are chowned (except for bind mounts) on creation to the user that owned the base folder at that time.
while this works just fine for bind mounts like -v ./data:/data:rw as you can just chown them manually if desired this will not work for volume mounts.
with volumes you typically don't have an easy way to change permissions/ownership in them.

with bind mounts you also won't get the files and folders initialized inside the volume if there are any, this only works with volumes.

also to note is that some applications may run into issues if they run as a uid that does not map to a user (i.e. no matching entry in /etc/passwd).

@acrois
Copy link
Author

acrois commented Sep 11, 2021

@ivan @Nothing4You

Thank you for the review, I have been a little busy so I just wanted to let you know I'm working on addressing the comments soon. I'll tag again when I've got some answers.

Dockerfile Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
@JustAnotherArchivist
Copy link
Contributor

One more comment: the paths in --input-file, --import-ignores, --dir, --finished-warc-dir, and --wpull-args obviously refer to paths in the container. I don't think there is a way to avoid that, but it should be documented how to work around it by using -v to map these to files in the container and then specifying the container paths in the command.

acrois and others added 4 commits August 14, 2022 16:00
Save space on build by specifying `--no-cache-dir` to `pip install` in `Dockerfile`

Co-authored-by: JustAnotherArchivist <[email protected]>
…k, update documentation for single container usage vs Docker GRAB_SITE_HOST usage, set default GRAB_SITE_HOST build arg to gs-server instead of application default 127.0.0.1 with documented reasoning and how to change it
…ntainer usage vs Docker GRAB_SITE_HOST usage, set default GRAB_SITE_HOST build arg to gs-server instead of application default 127.0.0.1 with documented reasoning and how to change it
…Alpine version, remove need to chmod entrypoint, remove cargo dependency, move user creation above dependency installation, set uid and gid to 10000 for consistency
…quick start for universal docker run volume mount syntax
…d how to pass configuration for grab-site crawls
@raspher
Copy link

raspher commented Nov 18, 2022

Please provide information about selinux bind mounting to avoid spam of unneeded issues (just use Z instead of rw please)

https://stackoverflow.com/questions/24288616/permission-denied-on-accessing-host-directory-in-docker

@ImportTaste
Copy link

Is this still being worked on?

@acrois
Copy link
Author

acrois commented May 21, 2023

@ImportTaste It should be complete, I just pulled the latest changes from here into my fork/branch so your build will be against the latest version of the code.

Please let me know if you follow the usage instructions and it does not work so I can improve it.


TL;DR: I'd call this the best starting point for anyone trying to run grab-site and gs-server on Docker. It might not be 100% complete but I've been running it ever since I submitted the PR on several VMs (AWS, Azure, GCP, VirtualBox, Proxmox), so it works on my machines! 😄


It would be great if this could be merged and then a separate issue can be created with an accurate report of how to reproduce these different environments. That way the foundational support isn't just hanging out forever on my fork.

That being said, I have recently (last 9mo) documented setting up user/permissions between the host and container on a bone standard Debian install with an additional user. I believe that could address the SELinux user host<->container mismatch and allow it to run on those systems with whatever flag set (rw vs Z) but I don't have the time to try and figure out how to reproduce everyone else's configuration between selinux, nix/os, and other setups from scratch.

I'd love to do it, I just need some help. So, instead of making assumptions, I'd appreciate a script and target OS that I can use to help me set up everyone's environment(s) locally in order to reproduce this last remaining issue to prepare a proper solution that is compatible with all configurations.

@xaiki
Copy link

xaiki commented Dec 6, 2023

this seems to require pip install --no-binary lxml to work.

@acrois
Copy link
Author

acrois commented Jan 9, 2024

pip install --no-binary lxml

https://github.com/acrois/grab-site/blob/feature/dockerfile/Dockerfile#L47-L48

I know that pip is not ideal but this is supposed to cover the app dependency scenario already. If it is necessary it should be included as part of the list of dependencies, perhaps here in setup.py:

https://github.com/acrois/grab-site/blob/feature/dockerfile/setup.py#L12-L20


image

I added a build script that will take care of release needs.

Please send me a reproduction using a command that leverages the new images.

docker pull ghcr.io/acrois/grab-site:sha-3ee0d2f

Perhaps we need a test case added that complements the build.yaml workflow?

Yes or yes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants