
Add xgboost and stan install and test scripts #331

Draft · wants to merge 8 commits into base: master
Conversation

@RobJY (Contributor) commented Jan 11, 2022

Here's code to install and test GPU-supported xgboost and Stan. Please let me know if I should move or add anything.

I'm not sure if I've put install_stan.R in the right place, since it needs to be run by users to install Stan. Is there a better place for it?

I got the xgboost test script from https://rdrr.io/cran/xgboost/src/demo/gpu_accelerated.R. Please let me know if you think there's something better we could use.

@RobJY changed the title from "Add xgboost stan install and test scripts" to "Add xgboost and stan install and test scripts" on Jan 13, 2022
@cboettig (Member) commented:

These look good to me! @noamross any thoughts here?

@noamross (Collaborator) commented:

Only that I'm not sure these have been updated since @RobJY found the issue in NVIDIA/nvidia-docker#342; these should likely change to install something other than nvidia-cuda-toolkit, right @RobJY?

@noamross (Collaborator) commented:

Also, these are interactive test scripts, not scripts that return a value or a specific error for failure. Wouldn't it be better to have something that can integrate into a smoke test?

@RobJY (Contributor, Author) commented Jan 31, 2022

That's right @noamross, I'll have to update the xgboost install script when I've got a fix working.

Review threads (resolved): scripts/install_xgboost.sh, scripts/install_stan.R (outdated), scripts/install_xgboost.sh
@RobJY (Contributor, Author) commented Feb 3, 2022

I've pushed an update based on a fix that I'll be deploying to our systems tomorrow. Let me explain what I think the issue is and a possible solution.

The issue we've been having seems to be that, because the CUDA libraries are so actively developed, it's easy for a library to be updated after the CUDA driver has been installed, at which point the library versions on the host and in the container no longer match. Keeping nvidia-cuda-toolkit would probably be fine, but it pulls in so many packages that it's more likely to run into this problem. For this reason, the updated code installs a minimal set of libraries to help mitigate the issue.
Updating the driver would require scheduling downtime for a machine reboot, so it's better for us to lock the library versions and use apt-mark hold to try to keep them from being updated. This has worked in my test container.
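
Roughly, the "lock the library versions" step looks like this (a minimal sketch; the package names below are examples from the 470 driver series, not the exact set in the script):

    # Sketch only: hold a minimal set of NVIDIA user-space packages so unattended
    # upgrades can't move them past the installed driver (example package names).
    sudo apt-mark hold libnvidia-compute-470 libnvidia-decode-470 nvidia-container-toolkit
    # Confirm what is currently held:
    apt-mark showhold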

It seems like using this script for Rocker builds has the opposite problem: we will be building images with the current versions of the libraries, but we have no way of knowing what version of the driver users will have on their host machines. It seems very likely that this will often result in this library-mismatch error. For this reason I removed the versions from the apt call, and now that I think of it, we probably don't want to hold the installed versions either; I'll update that too. Maybe the best way to deal with this is a document that walks users through a set of steps to try if they hit the library-mismatch error. That may be as simple as updating their driver and then updating the libraries in the container, roughly as sketched below.
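
As a rough outline (not a tested procedure, just the shape such a document might take):

    # Hypothetical recovery outline for the Driver/library version mismatch error.
    # On the host: bring the NVIDIA driver and container runtime up to date, then reboot.
    sudo apt-get update && sudo apt-get upgrade -y
    sudo reboot
    # After the reboot: restart (or recreate) the container so the CUDA libraries it
    # sees match the freshly loaded kernel driver.
    docker restart <container_id>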

Is there a better way to handle this?

@RobJY (Contributor, Author) commented Feb 7, 2022

> Also, these are interactive test scripts, not scripts that return a value or a specific error for failure. Wouldn't it be better to have something that can integrate into a smoke test?

I was thinking we'd collect standard out from running this and get the return code; if the code isn't 0, we would just echo the standard out. I see now I should do that inside the script itself rather than creating more work outside it.
Thanks for pointing this out, @noamross!
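
Something like this is what I have in mind (a sketch; test_xgboost_gpu.R is a placeholder name, not the actual script in this PR):

    #!/usr/bin/env bash
    # Run the test, capture its output, and only echo it if the exit code is non-zero.
    out=$(Rscript test_xgboost_gpu.R 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "$out"
    fi
    exit "$status"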

I've added the bash scripts that do what I described above. Let me know if you'd like any other changes.

Review comment on the diff: "Shouldn't we not need to install here?"
@RobJY (Contributor, Author) commented Mar 11, 2022

I just took another look at this to see if I could move it along. I could add some text about possible fixes a user could try if they encounter the Failed to initialize NVML: Driver/library version mismatch error message, but I think we first need to learn a little more about the issue on a long-running server like the one we have on premises, with the fixes we've proposed there: we will not peg versions of GPU packages, and we will install a new version of the NVIDIA driver with each update. This should cut down on the number of these mismatch errors, but I'm not sure it will completely eliminate them. I'll update this PR when we've gone through another local update and deploy.

@eitsupi marked this pull request as draft on May 4, 2022
@RobJY (Contributor, Author) commented Jul 14, 2022

We've gone through another deploy and hit the same 'version mismatch' issue. To correct it, I modified the script (attached as a text file) that we run on our host machine so that it installs the CUDA toolkit at version 11.1.

Curiously, nvidia-smi shows CUDA version = 11.7 on both the host and in the container after running the script, whereas it previously failed in the container with the 'version mismatch' message. I've read that this version report can be incorrect, but I wanted to check. We don't have nvcc installed on the host, so I can't check the version there, and I'd rather not install it now and risk causing versions of other packages to change. We also don't have a /usr/local/cuda/version.txt file on either the host or in the container. In /usr/local we only have the directories cuda-11.1 and cuda, and cuda is just a symlink to cuda-11.1. For this reason I think we're actually running toolkit version 11.1 as intended, and this technique could be used by Rocker users to update their host machines to match the toolkit version in the container.
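
For reference, the checks described above amount to roughly the following (standard CUDA install paths assumed; nvcc only if present):

    # What each tool reports:
    nvidia-smi | head -n 4            # "CUDA Version" here is what the driver supports
    ls -l /usr/local/ | grep cuda     # cuda -> cuda-11.1 symlink shows the installed toolkit
    command -v nvcc >/dev/null && nvcc --version || echo "nvcc not installed"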

The block of code at the bottom of the script that installs version 11.1 of the toolkit comes from this NVIDIA page, with the only modification being the change from cuda to cuda-11.1 in the last line. We could point Rocker users who encounter this 'version mismatch' problem to that link with the same modification, so that they install the 11.1 version of the toolkit on their host machine; hopefully that corrects the error for them as it did for us.
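
Concretely, the only change is to the final install line (a sketch of what's described above; the exact metapackage name depends on NVIDIA's repository):

    # Last line of NVIDIA's instructions, unmodified:
    #   sudo apt-get -y install cuda
    # Modified in the attached script to pin the toolkit at 11.1:
    sudo apt-get -y install cuda-11.1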

How does this sound?

Attachment: install_container_toolkit.txt

@cboettig (Member) commented:

@RobJY thanks for the follow-up. The driver mismatch error is definitely an annoying one, and not one that I've completely wrapped my head around either. A few comments/observations so far:

> Curiously, nvidia-smi shows CUDA version = 11.7 on both the host and in the container after running the script.

This is as expected. nvidia-smi shows the CUDA driver version, i.e. the host version, not the version of the toolkit software. The driver version must be >= the toolkit version, according to NVIDIA.
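
A quick way to see the two different versions being compared (a sketch; nvcc is only available where the toolkit is installed):

    # Driver side (host): the driver version and the maximum CUDA version it supports.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    nvidia-smi | grep "CUDA Version"
    # Toolkit side (container): the CUDA toolkit actually installed.
    nvcc --version | grep release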

Updating the driver on the host while the container is still running is one way to get Failed to initialize NVML: Driver/library version mismatch, but I also see this error when I don't think I've done any updates on the host. Have you observed that? Either way, simply doing docker restart <container_id> resolves the error, though it's obviously not ideal.

@RobJY (Contributor, Author) commented Jul 15, 2022

Thank you @cboettig for the explanation of the CUDA version from nvidia-smi! That makes sense.

You're right about suggesting a machine reboot; that should be the first thing we ask users to try. I did see that it was successful for many users on the forums. Unfortunately, it didn't work for me.

Let me explain what I think the issue was for our servers and please let me know if you think I've got any of it wrong.
At the top of my attached script we install the updated driver and container toolkit. I think this installs package versions greater than those used by CUDA 11.1. To counteract this, I install the 11.1 toolkit on the host machine at the bottom of the script, which downgrades the versions installed earlier (roughly sketched after the list below).
I have two pieces of evidence that this is what's happening:

  1. I no longer get the package mismatch error.
  2. It fails if I don't use the --allow-downgrades flag in the apt call at the top of the script, with a message about not being able to satisfy dependencies when installing CUDA toolkit 11.1.
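
Roughly, the shape of the attached script is this (a sketch only; the driver package name is a placeholder and exact versions are omitted):

    # Top of script: updated driver and container toolkit; --allow-downgrades is needed
    # so the later CUDA 11.1 install can roll user-space libraries back.
    sudo apt-get install -y --allow-downgrades nvidia-driver-470 nvidia-container-toolkit
    # Bottom of script: NVIDIA's install instructions with "cuda" changed to "cuda-11.1".
    sudo apt-get -y install cuda-11.1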

Is there a better way to achieve this?

@cboettig (Member) commented:

Thanks @RobJY, and apologies, I missed that the script was not part of the commit but just attached to the comment. Very interesting! If it works, it works, but honestly this just deepens the puzzle for me: this script is all being run outside the container, on the host machine, yeah?

I don't think you should need the toolkit installed on the host machine at all, though? Maybe I'm just confused. All software above the kernel level (i.e. not including the drivers) should be handled in the containerized layer, or does that assumption go out the window when we're using nvidia-docker?

I've also had to mess with apt-pinning to get the NVIDIA container runtime to play nicely with upgraded drivers, as described here: pop-os/pop#1708 (comment). Maybe this is related to, or an alternative version of, your script above?
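
For comparison, the pinning from that thread is roughly the following (a sketch; "510" is an example driver series and should match the driver actually installed):

    # Pin all NVIDIA user-space packages to one driver series so apt upgrades
    # can't mix series and break the container runtime.
    printf '%s\n' \
        'Package: *nvidia*' \
        'Pin: version 510.*' \
        'Pin-Priority: 1001' \
      | sudo tee /etc/apt/preferences.d/nvidia-pin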

But I still see containers that have been up a while start to throw the error Failed to initialize NVML: Unknown Error whenever I try GPU operations (e.g. nvidia-smi) until I restart them, even when (I think) I haven't updated the drivers, so I don't really know why that is. In these cases the host can still run nvidia-smi no problem; it's just the container that is stuck. Is that the situation you were seeing as well?

@RobJY (Contributor, Author) commented Jul 15, 2022

Apologies about the attached script; I did have it hidden at the bottom of the post. Yes, we're running the attached script on the host machine.

Yeah, I think you're right that the nvidia-docker install loads some packages on the host, and those end up at a higher version than those in the container. This makes sense to me because, without any pinning, it's going to install the newest versions available. I also tried pinning versions earlier, but NVIDIA doesn't seem to distribute older versions of packages for very long, and when I went back to use the pinned script, the older versions of the packages were unavailable.

I haven't seen the Failed to initialize NVML: Unknown Error message before. I wonder if something running on the GPU got the card(s) into a bad state. I was just experimenting with some ML code that can make use of multiple cards; we have two cards of one type and two of another type on the machine. When I tried running the code with one or two cards it was fine, but with three or four it crashed, in such a way that we weren't able to use those cards until the machine was rebooted. I saw behavior like this at a previous job where we were doing GPU development: when a card crashes, it seems to take a reboot to make it usable again. I guess it makes sense that it can't report the nature of the error in that case, just that the card is not responsive.

@cboettig (Member) commented:

Thanks! I now recall a bit more; I think the error I get is related to cgroup2 versions in Docker, discussed here: NVIDIA/nvidia-container-toolkit#251

It seems that if I run with explicit --device binding, my long-running containers don't lose the GPU state and stay up, e.g. like so:

docker run --name rstudio \
    --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 \
    --env-file private_env.sh \
    --memory 50g \
    --network cirrus-net \
    --gpus 0 \
    rocker/ml-verse:4.2.1-cuda11.1

Feels like that shouldn't be necessary, but it needs some software to be fixed upstream first...

@RobJY (Contributor, Author) commented Jul 15, 2022

That's interesting. It looks like a nice clean solution to me. It probably protects you from other potential issues as well.
