
Accelerate HostPath with Fluid

Scenario: training ResNet50

  • Device: V100 x8
  • NFS Server: 38037492dc-pol25.cn-shanghai.nas.aliyuncs.com

Configuration

Hardware Configuration

  • Cluster: Alibaba Cloud Kubernetes, v1.16.9-aliyun.1
  • ECS Instance: ecs.gn6v-c10g1.20xlarge (CPU: 82 cores)
  • Distributed Storage: NAS

Software Configuration

Software version: 0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6

Prerequisites

Known Constraints

  • Mounting data via hostPath is not the recommended approach: it relies on mount points being maintained outside of Kubernetes, which is unreliable and can lead to data inconsistency.

Data Preparation

  1. Download the dataset
$ wget http://imagenet-tar.oss-cn-shanghai.aliyuncs.com/imagenet.tar.gz
  2. Extract the dataset
$ tar -I pigz -xvf imagenet.tar.gz
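
Extracting with tar -I pigz assumes pigz is installed on the machine; if it is not, install it first (a sketch, the package manager depends on your distribution):

$ sudo yum install -y pigz       # CentOS / Alibaba Cloud Linux
$ sudo apt-get install -y pigz   # Debian / Ubuntu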

NFS dawnbench

Deploy Dataset

  1. Mount the dataset on the NFS Server
  2. Mount NFS to a host path
$ sudo mount -t nfs -o vers=3,nolock,proto=tcp,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport <YOUR_NFS_SERVER>:<YOUR_PATH_TO_DATASET> /mnt/nfs-imagenet
  3. Check if NFS has been successfully mounted
$ mount | grep nfs
<YOUR_NFS_SERVER>:<YOUR_PATH_TO_DATASET> on /mnt/nfs-imagenet type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.28,mountvers=3,mountport=2049,mountproto=tcp,local_lock=all,addr=192.168.1.28)

NOTE:
Replace <YOUR_NFS_SERVER> and <YOUR_PATH_TO_DATASET> in the commands above with your NFS server address and the path to the dataset on it.
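
The mount command above also assumes that the mount directory exists and that an NFS client is installed on the host; if not, prepare them first (a sketch, package names differ by distribution):

$ sudo mkdir -p /mnt/nfs-imagenet
$ sudo yum install -y nfs-utils        # CentOS / Alibaba Cloud Linux
$ sudo apt-get install -y nfs-common   # Debian / Ubuntu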

Dawnbench

Single machine with eight GPUs

arena submit mpijob \
--name horovod-v2-nfs-hostpath-1x8-093000 \
--gpus=8 \
--workers=1 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data-dir /mnt/nfs-imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 1 8
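
To follow a submitted job, arena's status and log commands can be used (a sketch, assuming the job name above):

$ arena get horovod-v2-nfs-hostpath-1x8-093000
$ arena logs horovod-v2-nfs-hostpath-1x8-093000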

Four machines with eight GPUs

arena submit mpi \
--name horovod-v2-nfs-hostpath-4x8-092921 \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data-dir /mnt/nfs-imagenet:/data \
-e DATA_DIR=/data/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
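
When a benchmark run has finished and its logs have been collected, the job can be removed to free the GPUs for the next experiment (a sketch, assuming the job names used above):

$ arena delete horovod-v2-nfs-hostpath-4x8-092921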

Accelerate hostPath with Fluid

Deploy Dataset

  1. Follow the previous steps to complete NFS mounting
  2. Deploy Fluid to accelerate hostPath mounted NFS
$ cat <<EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
spec:
  mounts:
    - mountPoint: local:///mnt/nfs-imagenet
      name: imagenet
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: aliyun.accelerator/nvidia_name
              operator: In
              values:
                - Tesla-V100-SXM2-16GB
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet
spec:
  replicas: 4
  data:
    replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /alluxio/ram
        quota: 50Gi
        high: "0.99"
        low: "0.8"
EOF

NOTE:

  • mounts.mountPoint specifies the host path to be mounted, using the local:// prefix (e.g. local:///mnt/nfs-imagenet).
  • Keep spec.replicas consistent with the number of dawnbench workers: 1 for a single machine with eight GPUs, 4 for four machines with eight GPUs.
  • nodeSelectorTerms restricts the dataset to machines with V100 GPUs; adjust it to match your environment.
$ kubectl create -f dataset.yaml
  3. Check the deployment
$ kubectl get pv,pvc
NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM              STORAGECLASS   REASON   AGE
persistentvolume/imagenet   100Gi      RWX            Retain           Bound    default/imagenet                            3h28m

NAME                             STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/imagenet   Bound    imagenet   100Gi      RWX                           3h28m
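
Besides the PV/PVC, you can also check the Dataset and AlluxioRuntime objects directly to confirm that the cache workers are ready (a quick sketch):

$ kubectl get dataset imagenet
$ kubectl get alluxioruntime imagenet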

Dawnbench

Single machine with eight GPUs

arena submit mpi \
--name horovod-v2-nfs-fluid-1x8-093009 \
--gpus=8 \
--workers=1 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 1 8

Four machines with eight GPUs

arena submit mpi \
--name horovod-v2-nfs-fluid-4x8-092910 \
--gpus=8 \
--workers=4 \
--working-dir=/horovod-demo/tensorflow-demo/ \
--data imagenet:/data \
-e DATA_DIR=/data/imagenet/imagenet \
-e num_batch=1000 \
-e datasets_num_private_threads=8 \
--image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
./launch-example.sh 4 8
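
The cold and warm results below differ only in whether the data is already cached by Alluxio. If you prefer to warm the cache explicitly instead of relying on a first cold run, Fluid provides a DataLoad resource that preloads a Dataset; a minimal sketch, assuming the imagenet Dataset created above (the resource name imagenet-warmup is arbitrary):

$ cat <<EOF > dataload.yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-warmup
spec:
  dataset:
    name: imagenet
    namespace: default
EOF
$ kubectl create -f dataload.yaml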

Experiment Results

horovod-1x8

|                                        | nfs-hostpath | fluid (cold) | fluid (warm) |
| -------------------------------------- | ------------ | ------------ | ------------ |
| Training time                          | 4h20m36s     | 4h21m56s     | 4h2m16s      |
| Speed at step 1000 (images/second)     | 2426.4       | 2467.2       | 8959.7       |
| Speed at the last step (images/second) | 8218.1       | 8219.8       | 8275.8       |
| Steps                                  | 56300        | 56300        | 56300        |
| Accuracy @ 5                           | 0.9280       | 0.9288       | 0.9291       |

horovod-4x8

|                                        | nfs-hostpath | fluid (cold) | fluid (warm) |
| -------------------------------------- | ------------ | ------------ | ------------ |
| Training time                          | 2h9m21s      | 1h40m15s     | 1h29m55s     |
| Speed at step 1000 (images/second)     | 3219.2       | 11067.2      | 21951.3      |
| Speed at the last step (images/second) | 15855.7      | 20964.4      | 21869.8      |
| Steps                                  | 14070        | 14070        | 14070        |
| Accuracy @ 5                           | 0.9227       | 0.9232       | 0.9228       |

Analysis

From the test results, Fluid brings no obvious improvement in the 1x8 case, but the improvement in the 4x8 case is significant. With warm data, training time is shortened by (129 - 89) / 129 = 31%; with cold data, it is shortened by (129 - 100) / 129 = 22% (times in minutes). This is because NFS bandwidth becomes the bottleneck under 4x8, while Fluid, built on Alluxio, serves the data from a distributed cache with P2P reads among the cache workers.