Version: v1.0

Accelerate PVC with Fluid

Scenario: training ResNet50

Device: V100 x8
NFS Server: 38037492dc-pol25.cn-shanghai.nas.aliyuncs.com

Configuration

Hardware Configuration

Cluster	Alibaba Cloud Kubernetes v1.16.9-aliyun.1
ECS Instance	ECS specification: ecs.gn6v-c10g1.20xlarge; CPU: 82 cores
Distributed Storage	NAS

Software Configuration

Software version: 0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6

Prerequisites

Fluid (version >= 0.3.0)
Arena (version >= 0.4.0)
Horovod (version = 0.18.1)
Benchmark

Data Preparation

Download

$ wget http://imagenet-tar.oss-cn-shanghai.aliyuncs.com/imagenet.tar.gz

Unpack

$ tar -I pigz -xvf imagenet.tar.gz
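
The unpack command above relies on pigz for parallel decompression. As a minimal sketch, assuming pigz may not be installed on every machine and that GNU tar is in use, the helpers below (hypothetical names, not part of the original guide) fall back to plain gzip:

```shell
# Hypothetical helper: choose a decompressor for `tar -I`.
# Prefers pigz (parallel gzip) and falls back to plain gzip.
pick_decompressor() {
    if command -v pigz >/dev/null 2>&1; then
        echo pigz
    else
        echo gzip
    fi
}

# Unpack an archive with whichever decompressor is available.
# Assumes GNU tar, where -I means --use-compress-program.
unpack_archive() {
    tar -I "$(pick_decompressor)" -xf "$1"
}
```

Calling `unpack_archive imagenet.tar.gz` would then behave like the command above, only slower when pigz is missing.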

NFS dawnbench

Deploy Dataset

Export Dataset on Your NFS Server
Create Volume using Kubernetes

$ cat <<EOF > nfs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-imagenet
spec:
  capacity:
    storage: 150Gi
  volumeMode: Filesystem
  accessModes:
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs
  mountOptions:
  - vers=3
  - nolock
  - proto=tcp
  - rsize=1048576
  - wsize=1048576
  - hard
  - timeo=600
  - retrans=2
  - noresvport
  - nfsvers=4.1
  nfs:
    path: <YOUR_PATH_TO_DATASET>
    server: <YOUR_NFS_SERVER>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-imagenet
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 150Gi
  storageClassName: nfs
EOF

NOTE:

Replace <YOUR_PATH_TO_DATASET> and <YOUR_NFS_SERVER> with the path to your dataset and the address of your own NFS server.

$ kubectl create -f nfs.yaml

Check Volume

$ kubectl get pv,pvc
NAME                            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                  STORAGECLASS   REASON   AGE
persistentvolume/nfs-imagenet   150Gi      ROX            Retain           Bound    default/nfs-imagenet   nfs                     45s

NAME                                 STATUS   VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/nfs-imagenet   Bound    nfs-imagenet   150Gi      ROX            nfs            45s
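
Binding can take a few seconds after the volume is created. The following is a minimal sketch of polling for it before a job is submitted; the function name and its default retry values are illustrative assumptions, while `.status.phase` is the standard PVC status field.

```shell
# Hypothetical helper: poll a PVC until its phase is Bound.
# Usage: wait_for_pvc_bound <pvc-name> [max-attempts] [sleep-seconds]
wait_for_pvc_bound() {
    pvc="$1"
    attempts="${2:-30}"
    interval="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # .status.phase of a PVC is Pending, Bound, or Lost
        phase="$(kubectl get pvc "$pvc" -o jsonpath='{.status.phase}' 2>/dev/null)"
        if [ "$phase" = "Bound" ]; then
            echo "PVC $pvc is Bound"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "PVC $pvc did not reach Bound (last phase: ${phase:-unknown})" >&2
    return 1
}
```

For example, `wait_for_pvc_bound nfs-imagenet` returns once the claim shown above reports Bound, or fails after the retry budget is used up.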

Dawnbench

Single machine with eight GPUs

arena submit mpi \
    --name horovod-resnet50-v2-1x8-nfs \
    --gpus=8 \
    --workers=1 \
    --working-dir=/horovod-demo/tensorflow-demo/ \
    --data nfs-imagenet:/data \
    -e DATA_DIR=/data/imagenet \
    -e num_batch=1000 \
    -e datasets_num_private_threads=8 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
    ./launch-example.sh 1 8

Four machines with eight GPUs

arena submit mpi \
    --name horovod-resnet50-v2-4x8-nfs \
    --gpus=8 \
    --workers=4 \
    --working-dir=/horovod-demo/tensorflow-demo/ \
    --data nfs-imagenet:/data \
    -e DATA_DIR=/data/imagenet \
    -e num_batch=1000 \
    -e datasets_num_private_threads=8 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
    ./launch-example.sh 4 8

NOTE:
If the NFS volume cannot be deleted, it is because Arena leaves a launcher pod behind after training finishes, so Kubernetes still considers the volume to be in use.

Run the following command to force deletion of the volume:
$ kubectl patch pvc nfs-imagenet -p '{"metadata":{"finalizers": []}}' --type=merge
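
A gentler alternative to clearing the finalizers may be to delete the finished jobs first. The sketch below assumes that `arena delete` also removes the job's launcher pod, which should be verified on your cluster; `release_pvc` is a hypothetical helper name.

```shell
# Hypothetical helper: delete finished Arena jobs so that their launcher
# pods stop referencing the PVC, then delete the PVC normally.
# Usage: release_pvc <pvc-name> <job-name>...
release_pvc() {
    pvc="$1"
    shift
    for job in "$@"; do
        # Assumption: deleting the job also removes its launcher pod
        arena delete "$job"
    done
    kubectl delete pvc "$pvc"
}
```

With the job names used in this guide, the call would be `release_pvc nfs-imagenet horovod-resnet50-v2-1x8-nfs horovod-resnet50-v2-4x8-nfs`. If the claim still cannot be removed, the force-delete command above remains the fallback.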

Accelerate PVC with Fluid

Deploy Dataset

Follow Previous Steps to Create NFS Volume
Deploy Fluid to Accelerate NFS Volume

$ cat <<EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fluid-imagenet
spec:
  mounts:
  - mountPoint: pvc://nfs-imagenet
    name: nfs-imagenet
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: aliyun.accelerator/nvidia_name
              operator: In
              values:
                - Tesla-V100-SXM2-16GB
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fluid-imagenet
spec:
  replicas: 4
  data:
    replicas: 1
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /var/lib/docker/alluxio
        quota: 150Gi
        high: "0.99"
        low: "0.8"
EOF

NOTE:

Keep spec.replicas consistent with the number of machines you are going to use for training.

nodeSelectorTerms restricts scheduling to machines with V100 GPUs.
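
Both notes can be checked before the Dataset is created. This is a minimal sketch, assuming the nodes carry the same label as in dataset.yaml; the function names are hypothetical.

```shell
# Hypothetical helper: count nodes carrying the GPU label used in
# the Dataset's nodeAffinity.
count_v100_nodes() {
    kubectl get nodes \
        -l aliyun.accelerator/nvidia_name=Tesla-V100-SXM2-16GB \
        --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Succeeds only if at least <replicas> labelled nodes exist.
# Usage: check_replicas <replicas>
check_replicas() {
    wanted="$1"
    have="$(count_v100_nodes)"
    if [ "$have" -ge "$wanted" ]; then
        echo "ok: $have labelled nodes for $wanted replicas"
    else
        echo "only $have labelled nodes for $wanted replicas" >&2
        return 1
    fi
}
```

For the runtime defined above, `check_replicas 4` should succeed on a cluster with at least four labelled V100 machines.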

$ kubectl create -f dataset.yaml

Check Volume

$ kubectl get pv,pvc
NAME                              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS   REASON   AGE
persistentvolume/fluid-imagenet   100Gi      RWX            Retain           Bound    default/fluid-imagenet                           1s
persistentvolume/nfs-imagenet     150Gi      ROX            Retain           Bound    default/nfs-imagenet     nfs                     16m

NAME                                   STATUS   VOLUME           CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/fluid-imagenet   Bound    fluid-imagenet   100Gi      RWX                           0s
persistentvolumeclaim/nfs-imagenet     Bound    nfs-imagenet     150Gi      ROX            nfs            16m

Dawnbench

Single machine with eight GPUs

arena submit mpi \
    --name horovod-resnet50-v2-1x8-fluid \
    --gpus=8 \
    --workers=1 \
    --working-dir=/horovod-demo/tensorflow-demo/ \
    --data fluid-imagenet:/data \
    -e DATA_DIR=/data/nfs-imagenet/imagenet \
    -e num_batch=1000 \
    -e datasets_num_private_threads=8 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
    ./launch-example.sh 1 8

Four machines with eight GPUs

arena submit mpi \
    --name horovod-resnet50-v2-4x8-fluid \
    --gpus=8 \
    --workers=4 \
    --working-dir=/horovod-demo/tensorflow-demo/ \
    --data fluid-imagenet:/data \
    -e DATA_DIR=/data/nfs-imagenet/imagenet \
    -e num_batch=1000 \
    -e datasets_num_private_threads=8 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/horovod-benchmark-dawnbench-v2:0.18.1-tf1.14.0-torch1.2.0-mxnet1.5.0-py3.6 \
    ./launch-example.sh 4 8

Experiment Results

horovod-1x8

	NFS	Fluid (cold)	Fluid (warm)
Training time	3h49m10s	3h50m40s	3h34m15s
Speed at step 1000 (images/second)	2400.8	2378.4	9327.6
Speed at the last step (images/second)	8696.8	8692.8	9301.6
Steps	56300	56300	56300
Accuracy @ 5	0.9282	0.9286	0.9285

horovod-4x8

	NFS	Fluid (cold)	Fluid (warm)
Training time	2h15m59s	1h43m43s	1h32m22s
Speed at step 1000 (images/second)	3136	8889.6	20859.5
Speed at the last step (images/second)	15024	20506.3	21329
Steps	14070	14070	14070
Accuracy @ 5	0.9228	0.9204	0.9243

Analysis

The test results show that Fluid has little effect in the 1x8 scenario but a clear effect in the 4x8 scenario. With warm data, training time is shortened by (135-92)/135 ≈ 32%; with cold data, by (135-103)/135 ≈ 24% (times in minutes). This is because NFS bandwidth becomes the bottleneck under 4x8, while Fluid, which is based on Alluxio, provides distributed caching with P2P data reading.
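
The reductions quoted above can be recomputed from the exact training times in the result tables. This is a small sketch using awk; the function names are hypothetical, and the durations are copied from the tables.

```shell
# Convert a duration such as 2h15m59s to seconds.
to_seconds() {
    echo "$1" | awk -F'[hms]' '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Percentage by which <new> is shorter than <baseline>, one decimal place.
# Usage: reduction_pct <baseline> <new>
reduction_pct() {
    base="$(to_seconds "$1")"
    new="$(to_seconds "$2")"
    awk -v b="$base" -v n="$new" 'BEGIN { printf "%.1f\n", (b - n) * 100 / b }'
}

reduction_pct 2h15m59s 1h32m22s   # 4x8, NFS vs Fluid (warm)
reduction_pct 2h15m59s 1h43m43s   # 4x8, NFS vs Fluid (cold)
reduction_pct 3h49m10s 3h34m15s   # 1x8, NFS vs Fluid (warm)
```

Working in seconds gives 32.1% and 23.7% for the 4x8 warm and cold cases, and 6.5% for the 1x8 warm case. These differ slightly from figures computed on whole minutes.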
