Saving money on GPUs in Azure

Azure provides low-priority VMs that we can use to save money when utilising GPUs for machine learning. All the major cloud providers offer their spare compute at a discount: Google Cloud has Preemptible VMs, AWS has Spot Instances and Azure has low-priority VMs. On Azure, low-priority VMs can only be provisioned through Azure Batch or virtual machine scale sets, and we can use the latter to host a Jupyter environment. For the GPU instance used below, the low-priority pricing is 23c/hour vs the normal $1.147/hour…a nice saving
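To put those prices in perspective, here's a rough sketch of the monthly difference. The 8 hours/day, 22 days/month usage pattern is just an assumption for illustration:

```shell
# Rough monthly comparison at the prices above
# (hours/day and days/month are illustrative assumptions)
HOURS=$((8 * 22))   # 176 hours of GPU time a month
awk -v h="$HOURS" 'BEGIN { printf "normal: $%.2f  low priority: $%.2f\n", h * 1.147, h * 0.23 }'
# prints: normal: $201.87  low priority: $40.48
```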

Prerequisites

I’m assuming you have:

  • an Azure subscription
  • the Azure CLI
  • a local SSH key (~/.ssh/id_rsa.pub)

Some variables I’ll use in scripts

LOCATION="australiaeast"
RG_NAME="mlvm-rg"
STORAGE_NAME="pontiml"
VM_NAME="mlvm"  # name for the scale set, also used as its DNS label

Storage

Firstly, any of these cheaper VMs can be reclaimed by the provider at any time, so it’s worth always persisting your work to durable storage. In this case, we’ll use Azure Files

az group create -n "${RG_NAME}" -l "${LOCATION}" --tags environment=ml

az storage account create -n "${STORAGE_NAME}" -g "${RG_NAME}" -l "${LOCATION}"

# Grab the storage key - also needed for cloud-config later
STORAGE_KEY=$(az storage account keys list --account-name "${STORAGE_NAME}" --resource-group "${RG_NAME}" --query "[0].value" -o tsv)

az storage share create -n "machinelearning" --account-name "${STORAGE_NAME}" --account-key "${STORAGE_KEY}"

Virtual Machine

We’re actually creating a scale set, which can run any number of VMs behind a load balancer. For our purposes we only need one instance, and we’re picking the Standard_NC6 (a single K80 GPU instance)

az vmss create \
    --name "${VM_NAME}" \
    -g "${RG_NAME}" \
    -l "${LOCATION}" \
    --image microsoft-ads:linux-data-science-vm-ubuntu:linuxdsvmubuntu:latest \
    --vm-sku Standard_NC6 \
    --instance-count 1 \
    --admin-username $USER \
    --ssh-key-value ~/.ssh/id_rsa.pub \
    --storage-sku Standard_LRS \
    --priority Low \
    --public-ip-address-dns-name "${VM_NAME}" \
    --custom-data cloud-init.yml

I’m using cloud-init to attach my file share and start the Docker container with Jupyter automatically. This means no setup should be required on your part once the machine has finished its configuration, which is important for repeatability and for easily setting up and tearing down these machines. Make sure to replace the storage account name, storage key, file share name and Jupyter password with your own values. That file is:

#cloud-config
mounts:
    - [ //<storage-account>.file.core.windows.net/<file-share-name>, /afs, cifs, "vers=3.0,username=<storage-account>,password=<storage-key>,dir_mode=0777,file_mode=0777,serverino" ]
runcmd:
    - docker pull tensorflow/tensorflow:latest-gpu-py3
    - docker run --runtime=nvidia -dit -p 8888:8888 -v /afs:/afs -v $HOME/developer:/notebooks -v /tmp:/tmp -e "PASSWORD=<jupyter-password>" --restart unless-stopped --name tf tensorflow/tensorflow:latest-gpu-py3
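Rather than hand-editing the placeholders each time, you can keep a template and substitute the values from the variables set earlier. A minimal sketch, assuming the YAML above is saved as cloud-init.tpl.yml (a hypothetical filename) with the angle-bracket placeholders left in:

```shell
# Fill in the cloud-config placeholders from our shell variables
# (cloud-init.tpl.yml is a hypothetical template file containing the YAML above)
sed -e "s|<storage-account>|${STORAGE_NAME}|g" \
    -e "s|<storage-key>|${STORAGE_KEY}|g" \
    -e "s|<file-share-name>|machinelearning|g" \
    cloud-init.tpl.yml > cloud-init.yml
```

Using | as the sed delimiter avoids clashing with the base64 characters in the storage key.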

This will create a single-instance VM behind a scale set, along with a vnet, public IP and load balancer. In a scale set, the load balancer assigns a NAT port for SSH’ing into each machine; it’s normally 50000 for the first instance. If that port doesn’t work, you can run

az vmss list-instance-connection-info -g ${RG_NAME} -n ${VM_NAME}

to get the connection info for your instances. With this, we can connect to the machine and tunnel the jupyter runtime into our localhost by running:

ssh -p 50000 -L 8888:localhost:8888 -L 6006:localhost:6006 ${USER}@${VM_NAME}.${LOCATION}.cloudapp.azure.com
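If you connect often, an entry in ~/.ssh/config saves retyping the port and tunnels. A sketch, where the host alias mlvm and the hostname placeholder are assumptions you should replace with your own values:

```
# ~/.ssh/config
Host mlvm
    HostName <vm-name>.australiaeast.cloudapp.azure.com
    Port 50000
    LocalForward 8888 localhost:8888
    LocalForward 6006 localhost:6006
```

With this in place the command above becomes just `ssh mlvm`.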

Accessing localhost:8888 will give us access to the Jupyter notebook. When persisting files from inside the container, remember the volumes you mapped with docker run: anything written to other paths will be lost when the container is removed

To save more money!

Once you’re done using the machine, you can deallocate it and you won’t be charged for compute resources while it’s in this state. Important: this is different from just stopping the machine, which keeps incurring compute charges. To deallocate your scale set (and subsequently start it again):

az vmss deallocate -n "${VM_NAME}" -g "${RG_NAME}"

az vmss start -n "${VM_NAME}" -g "${RG_NAME}"