Add Dockerfile to simplify installation #93

Open: notslang wants to merge 2 commits into master from patch-1

Conversation

@notslang notslang commented Sep 5, 2016

I'm deploying a couple of instances of grab-site to a CoreOS cluster, so I made a Dockerfile... Hopefully this is a bit easier to use than pip/virtualenv. This uses the larger python:3.4-slim image (rather than python:3.4-alpine) because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

This PR still needs docs, so it's a work-in-progress right now.

After starting the container you can use the regular grab-site command via docker exec <container-name> grab-site <args and site url>

@ivan
Contributor

ivan commented Sep 5, 2016

I haven't used Docker, so bear with me...

  1. Why COPY to /app/ if you still subsequently do a pip3 install .? If you pip3 install ., then grab-site, gs-server, etc should be installed somewhere, right?

  2. Can you make the script in .travis.yml test that this Dockerfile works? (Probably after all the existing stuff.)

Thanks for working on this!

@notslang
Author

notslang commented Sep 6, 2016

No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container, and the COPY directive copies the code from your working directory into the container's file system. Once the code is in the container (at /app), we run pip3 install to get all the deps and set everything up.

This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS, including CoreOS where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb.

As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/

@ivan
Contributor

ivan commented Sep 6, 2016

pip3 install . should install grab-site in addition to the dependencies, though. pip3 install puts things in /usr/local/bin while pip3 install --user puts things in ~/.local/bin, unless there's some extra configuration doing something else. Would it make sense to use the installed grab-site scripts in one of those paths rather than duplicate some pip functionality with the COPY lines?

@ivan
Contributor

ivan commented Sep 6, 2016

Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc?

@notslang
Author

notslang commented Sep 6, 2016

pip3 install . is being run within the context of the Docker container (not the host OS) so you need to COPY the files into the container for pip to work.

@ivan
Contributor

ivan commented Sep 6, 2016

Oh, that explains it :-)

@notslang
Author

notslang commented Sep 6, 2016

There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet.

@igorbrigadir

> Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.

I haven't tried running grab-site, but it seems like installing py-lmdb works on python:3.4-alpine with this:

FROM python:3.4-alpine
RUN apk add --update build-base libffi-dev
RUN pip install lmdb

Dockerfile Outdated
RUN pip3 install ./
RUN apt-get purge -y build-essential
RUN apt-get autoremove -y
RUN apt-get clean
Contributor

Someone tells me that each RUN creates a new layer, so the purge/autoremove/clean would not reduce the size of the final Docker image. What do you think about combining the RUNs on lines 6-12 into one RUN command?

https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/minimize-the-number-of-layers
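
A rough sketch of what the combined form could look like. Only `pip3 install ./` and the three cleanup commands come from the diff above; the `apt-get update`/`install` step, the `COPY` destination, and the `Dockerfile.combined` file name are assumptions made for illustration. It is written as a shell heredoc so it can be pasted into a terminal without touching the real Dockerfile:

```shell
# Write an illustrative Dockerfile into a scratch directory.
# Dockerfile.combined is a hypothetical name, not a file from this PR.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile.combined" <<'EOF'
FROM python:3.4-slim
COPY . /app
WORKDIR /app
# Install the compiler, build, and clean up in a single RUN so the
# build tools are never committed to a layer of their own.
RUN apt-get update \
 && apt-get install -y build-essential \
 && pip3 install ./ \
 && apt-get purge -y build-essential \
 && apt-get autoremove -y \
 && apt-get clean
EOF

# Count RUN instructions; each RUN line produces its own layer.
run_count=$(grep -c '^RUN ' "$workdir/Dockerfile.combined")
echo "RUN instructions: $run_count"
```

Because the purge and cleanup happen in the same `RUN` as the install, the files they delete are gone before the layer is committed, which is what lets the final image shrink.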

"Someone" is me, in case additional clarification of this comment is needed :)

Another possibility is to use FROM python:3.4 instead of FROM python:3.4-slim; the non-slim variant is based off of buildpack-deps which has a lot of compilers / tools / libraries installed. The resulting total image size would be bigger, but the advantage is that the buildpack-deps portion would be shared with every other image based off of that, so in the usual case where you have several images, the total space usage would be lower.

Author

Good point! I combined the commands & got the image size down to 235.4MB. I'm kinda surprised that images don't get flattened, but moby/moby#332 offers a lengthy discussion on it.

As for basing it on python:3.4, that would reduce the build time & total size of images on the system, but only if a significant number of the other images on the system are based on it too, which I don't think we can assume. It's probably better to just optimize for the smallest resulting image size.

@notslang notslang force-pushed the patch-1 branch 2 times, most recently from fb56712 to ce4d178 Compare September 13, 2016 05:41
@@ -34,6 +34,7 @@ Note: grab-site currently **does not work with Python 3.5**; please use Python 3
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
Contributor

This comment is a lie, sorry. I've been updating this TOC manually and probably don't want the Tips for specific websites expanded.

Author

ok, removed

@notslang
Author

You're right about it working on Alpine - I was just missing libffi-dev. Now it's down to 112.4 MB (37 MB when compressed). Also, I added instructions to the README, so I'm going to remove the "[wip]" from this.

@notslang notslang changed the title [wip] add Dockerfile to simplify installation Add Dockerfile to simplify installation Sep 13, 2016
@notslang notslang force-pushed the patch-1 branch 2 times, most recently from ee5a78e to 3393d82 Compare September 13, 2016 06:02
README.md Outdated
Start the grab-site server. You can set the port, volume, and name to whatever you want:

```bash
docker run --detach -p 29000:29000 -v /home/ludios/download/grab-site-data:/data --name warcfactory slang800/grab-site
```
Contributor

How about just ~/grabs instead of /home/ludios/download/grab-site-data?

Author

docker requires an absolute path for mounts... I suppose I could do $(pwd)/grabs, if that's obvious to most users.

Contributor

~ will be made absolute by the shell, no?

$ echo ~/
/home/at/
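
To illustrate the point about `~`: the shell performs tilde expansion before the program sees its arguments, as long as the `~` is unquoted and starts the word. A small sketch, where `/home/example` is a made-up home directory and `echo` stands in for `docker run -v`:

```shell
# Assumption: /home/example is a placeholder home directory for illustration.
HOME=/home/example

# An unquoted ~ at the start of a word is expanded by the shell,
# so the program receives an absolute path.
unquoted=$(echo ~/grabs:/data)

# A quoted ~ is passed through literally and is not an absolute path.
quoted=$(echo "~/grabs:/data")

# $(pwd) is the alternative mentioned above; it also yields an absolute path.
from_pwd="$(pwd)/grabs:/data"

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "from pwd: $from_pwd"
```

So `-v ~/grabs:/data` should reach Docker as an absolute path, while `-v "~/grabs:/data"` would not.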

@ivan
Contributor

ivan commented Sep 13, 2016

Thanks for the fixes.

I am currently somewhat busy and under-dockered. Can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?)

@igorbrigadir

I had Docker already, but it's worth linking to the installation instructions: https://docs.docker.com/engine/installation/

I tried it with:
Ubuntu: 12.04.5 LTS, x86_64, 3.8.0-44-generic
Docker: Docker version 1.7.1, build 786b29d

Ran (with sudo for the docker commands because I skipped this step: https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):

sudo docker pull slang800/grab-site
sudo docker run --detach -p 29000:29000 -v ~/grab-site-data:/data --name warcfactory slang800/grab-site
Web UI worked on http://localhost:29000/
sudo docker exec warcfactory grab-site --no-offsite-links http://xkcd.com/

Crawl finished successfully!

@ivan
Contributor

ivan commented Sep 19, 2016

I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child). The reason that you sometimes need a terminal attached to a grab-site process is to 1) see which URL is currently being grabbed (this information is not reported to the dashboard, only finished responses) and 2) look at segfaults and websocket connection problems that don't get reported to the dashboard either.

Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that docker exec tmux attach works. If this does work, the documentation should also be updated.

@ivan
Contributor

ivan commented Sep 19, 2016

Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well. grab-site processes are designed to stay running even if gs-server crashes or is taken down for an upgrade. Maybe gs-server (and each grab-site) should run in its own container instead.

@notslang
Author

> Maybe gs-server (and each grab-site) should run in its own container instead.

Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.

> Also, running gs-server as PID 1 seems undesirable because if it were killed, it would kill all the grab-site processes as well.

Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where gs-server dies? If so, that would be a decent temporary fix.

> I tried this out, but couldn't find a way to attach a terminal to a docker exec -d process (or a docker exec process that has been ctrl-c'ed - note the ctrl-c is not passed to the child).

You could run docker exec without detaching, but this whole setup could be simplified by splitting up the processes into their own containers... Then you'd be able to use docker logs and pass signals in a sane manner.

@semente

semente commented Nov 7, 2018

hey people! what is the status of this PR? I could give a hand.

@ivan
Contributor

ivan commented Nov 7, 2018

For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix.

So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stay interested in maintaining and testing it, I might take a PR in the future.

@notslang
Author

> what is the status of this PR?

@semente I've been using it pretty often for my own projects, and it works fine, but I haven't rebased it since 2016. I'll try rebasing and pushing a new image to the Docker hub.

> For now, I would like someone else to be the Dockerized grab-site upstream

Ok, I'll keep an image updated over here: https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site

@gabefair

@notslang Thank you for all this work. Can you confirm that your fork still works fine? I am curious if you ran into any issues or discovered anything of note.

@818S

818S commented May 20, 2021

https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site

It says updated 3 years ago, any plans to update it?

Or any plans to officially ship a Dockerfile for this?

@818S 818S mentioned this pull request May 20, 2021
@brandongalbraith

FYI this third party grab-site Dockerfile currently works as of this comment being posted: https://github.com/Nold360/docker-grab-site.

https://registry.hub.docker.com/r/nold360/grab-site/

8 participants