Steps:
- Get an H100 on Brev. It has to be the FluidStack one that you can reboot. Yes, it only has 100 GB of disk space.
- Get the NVIDIA CUDA repo keyring
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
- Just blow away the preinstalled drivers and install better ones (the 550 series, plus the container toolkit)
sudo apt-get purge 'nvidia-.*'
sudo apt-get install cuda-drivers-550 nvidia-container-toolkit -y
sudo reboot
- Configure Docker to use the NVIDIA runtime and regenerate the CDI spec
sudo nvidia-ctk runtime configure --runtime=docker
sudo rm -f /etc/cdi/nvidia.yaml
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
sudo systemctl restart docker
- You may be good to go!
- But if you get errors like
RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.
or
Could not load library libcuda.so. Error: libcuda.so: cannot open shared object file: No such file or directory
then you need to link libcuda.so inside your Docker container. (I’ve seen this on some machines but not all.)
- To do that, start a shell in your container with
cog -p 5000 run bash
or docker run ...
and then run
ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so
python -m cog.server.http
(or whatever command you actually want to run)
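
If you end up doing the libcuda.so link on more than one machine, the step can be wrapped so it is safe to re-run. This is only a rough sketch, assuming a POSIX shell inside the container. `link_libcuda` and its optional directory argument are names made up for this example, not part of cog or the NVIDIA toolkit.

```shell
#!/bin/sh
# Sketch: create the unversioned libcuda.so symlink only when it is missing.
# link_libcuda is a hypothetical helper; the default directory is the one
# used in the ln -s command above.
link_libcuda() {
    lib_dir="${1:-/usr/lib/x86_64-linux-gnu}"

    # Nothing to do if libcuda.so already resolves to a real file.
    if [ -e "$lib_dir/libcuda.so" ]; then
        echo "libcuda.so already present in $lib_dir"
        return 0
    fi

    # Without libcuda.so.1 there is nothing to link against.
    if [ ! -e "$lib_dir/libcuda.so.1" ]; then
        echo "libcuda.so.1 not found in $lib_dir" >&2
        return 1
    fi

    # -f replaces a dangling libcuda.so symlink if one was left behind.
    ln -sf "$lib_dir/libcuda.so.1" "$lib_dir/libcuda.so"
    echo "linked $lib_dir/libcuda.so -> $lib_dir/libcuda.so.1"
}

# Example use inside the container:
#   link_libcuda && python -m cog.server.http
```

If the function reports that libcuda.so.1 is missing, a symlink will not help. In that case the driver libraries are probably not being mounted into the container at all, so recheck the Docker configuration step.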
liveblog, preserved for posterity: