A Swarm of GPU Sparks

About GPU-enabled Docker Swarms

recraft@pm.me https://recraft.me (Recraft)https://github.com/recraft
2019-02-09

Can we set up a Docker Swarm cluster with nodes that can run Spark?

Yes, here is a nice write-up stepping through how to run a Docker Swarm-based cluster of Spark nodes.

Note however that this guide does not provide a GPU-enabled Docker Swarm-based cluster running Spark.

GPU-enabled Swarm

So how can we add GPU capabilities to the nodes?

Some IaaS providers offer GPU-enabled environments. For example, on the Amazon cloud, provision a host from a deep learning base AMI for Ubuntu, like so:


docker-machine create --driver amazonec2 --amazonec2-ami=ami-086062166ec8340ac \
  --amazonec2-open-port 8000 --amazonec2-region eu-west-1 aws-sandbox

This host then needs nvidia-docker2 installed; this guide describes the procedure, summarized like so:


docker-machine ssh aws-sandbox
sudo apt-get -y install nvidia-docker2
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

We then need to configure the nodes to utilize the GPU like so:


export GPU_ID=`nvidia-smi -a | grep UUID | awk '{print substr($4,0,12)}'`
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<EOF | sudo tee --append /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}
EOF
sudo sed -i '1iswarm-resource = "DOCKER_RESOURCE_GPU"' /etc/nvidia-container-runtime/config.toml
sudo systemctl daemon-reload
sudo systemctl restart docker

More nodes can be bootstrapped and added in the same manner to the swarm.
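The swarm itself is formed in the standard way; a minimal sketch, where the manager IP and the join token (printed by the init command) are placeholders:

```shell
# On the first GPU node: initialize the swarm as manager
docker swarm init --advertise-addr <manager_ip>

# On each additional node (after repeating the nvidia-docker2 and
# daemon configuration steps above), join with the printed token:
docker swarm join --token <worker_token> <manager_ip>:2377

# Back on the manager: verify that all nodes are registered
docker node ls
```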

Deploying services and stacks

GPU-aware TensorFlow services can now be launched on the swarm. For example, a set of ten Spark worker replicas could be started like so:


docker service create --generic-resource "gpu=1" --replicas 10 \
  --name sparkWorker <image_name> \
  sh -c "service ssh start && \
  /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<spark_master_ip>:7077"

It should now also be possible to define a stack to run on the swarm using Docker Compose, by providing the relevant constraints in a .yml file, similar to what is done here:


https://github.com/infsaulo/docker-spark/blob/master/docker-compose.yml
https://github.com/aocenas/spark-docker-swarm/blob/master/provision.sh
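For reference, a minimal sketch of such a stack file, assuming Compose file format 3.5 or later (the image name and master address are placeholders, not taken from the linked repositories):

```yaml
version: "3.5"
services:
  spark-worker:
    image: <image_name>
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<spark_master_ip>:7077
    deploy:
      replicas: 10
      resources:
        reservations:
          generic_resources:
            # reserve one GPU per replica, matching the gpu
            # generic resource advertised by the nodes above
            - discrete_resource_spec:
                kind: "gpu"
                value: 1
```

Such a stack could then be deployed with docker stack deploy -c docker-compose.yml spark.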

Prefer Open Science Grid over Amazon?

It may be that your local computing grid infrastructure supports running GPU-enabled Docker or Singularity containers.

Try, for example, the opensciencegrid/tensorflow-gpu image from Docker Hub with your local grid computing IaaS provider and see how it goes.

The Open Science Grid provides information on GPU-enabled options, including how to configure GPU images and an example of creating custom GPU-enabled Dockerfiles, available on GitHub.
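On the Open Science Grid, a GPU job is typically described in an HTCondor submit file that requests a GPU and names the container image; a hedged sketch, where the image tag, the executable script, and the output file names are assumptions:

```
universe = vanilla
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:latest"
request_gpus = 1
request_cpus = 1
request_memory = 2 GB
executable = run_tensorflow.sh
output = job.out
error = job.err
log = job.log
queue
```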