Version: v1.0

Automatic Cleanup Data Operation

Background

Fluid's universal data operations cover tasks such as data prefetch, data migration, elastic scaling, cache cleaning, metadata backup, and recovery. Similar to the automatic cleanup mechanism for Kubernetes Jobs, Fluid provides automatic cleanup of data operations, using a Time-to-Live (TTL) mechanism to limit the lifetime of data operations that have finished execution. This document briefly demonstrates how to use this feature.

Prerequisites

Before you start, refer to the Installation Guide to install Fluid on your Kubernetes cluster. Make sure that all the components required by Fluid are running, as in the following output:

$ kubectl get pod -n fluid-system
alluxioruntime-controller-5b64fdbbb-84pc6   1/1     Running   0          8h
csi-nodeplugin-fluid-fwgjh                  2/2     Running   0          8h
csi-nodeplugin-fluid-ll8bq                  2/2     Running   0          8h
dataset-controller-5b7848dbbb-n44dj         1/1     Running   0          8h

Automatic cleanup data operation

1. Set up demo dataset

Check the Dataset and AlluxioRuntime objects to be created

$ cat <<EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  mounts:
    - mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/
      name: hbase
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
EOF

Create the Dataset and AlluxioRuntime

$ kubectl create -f dataset.yaml

Wait for the Dataset and AlluxioRuntime to be ready. You can check their status by running:

$ kubectl get datasets hbase

The Dataset and AlluxioRuntime are both ready when the output looks like this:

NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hbase   1.21GiB          0.00B    2.00GiB          0.0%                Bound   75s

2. Set up data operation

Here, a DataLoad is used to demonstrate the automatic cleanup of data operations.

Check the DataLoad object to be created

$ cat <<EOF > dataload.yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hbase-dataload
spec:
  dataset:
    name: hbase
    namespace: default
  ttlSecondsAfterFinished: 300
EOF

Here, the spec.ttlSecondsAfterFinished field specifies how many seconds after the data operation reaches the Complete or Failed phase it is cleaned up.
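
As a rough illustration of what the 300-second value means, the sketch below computes the earliest moment at which a finished data operation becomes eligible for cleanup. It assumes the controller simply adds the TTL to the recorded finish time; the function name cleanup_eligible_at and the sample timestamp are hypothetical and are not part of Fluid.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Fluid: given the finish time of a data
# operation (Unix epoch seconds) and its ttlSecondsAfterFinished value,
# print the epoch second at which the operation becomes eligible for cleanup.
cleanup_eligible_at() {
  local finished_at=$1
  local ttl_seconds=$2
  echo $(( finished_at + ttl_seconds ))
}

# Example: an operation that finished at epoch 1700000000 with a TTL of
# 300 seconds becomes eligible for cleanup at epoch 1700000300.
cleanup_eligible_at 1700000000 300
```

The actual deletion may happen slightly later than this moment, since it depends on when the controller next processes the object.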

Create the DataLoad

$ kubectl apply -f dataload.yaml

Watch the DataLoad status

$ kubectl get dataload -w
NAME             DATASET   PHASE       AGE     DURATION
hbase-dataload   hbase     Executing   7s      Unfinished
hbase-dataload   hbase     Complete    29s     7s
hbase-dataload   hbase     Complete    5m29s   7s

$ kubectl get dataload hbase-dataload
Error from server (NotFound): dataloads.data.fluid.io "hbase-dataload" not found

As the output shows, hbase-dataload is automatically cleaned up 300 seconds after it completes.
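
If you would rather script this check than watch the output by hand, one possible approach is to block on kubectl wait until the object is gone. The sketch below is only an illustration: the function name wait_for_cleanup, the KUBECTL override variable, and the 360-second timeout are assumptions, and the timeout should be chosen to cover the TTL plus any remaining run time of the operation.

```shell
#!/usr/bin/env bash
# Sketch: block until a DataLoad has been removed by the TTL mechanism.
# KUBECTL is an assumed override hook (for example, to point at a stub
# when testing); it defaults to the real kubectl binary.
wait_for_cleanup() {
  local name=$1
  local timeout_seconds=$2
  "${KUBECTL:-kubectl}" wait --for=delete "dataload/${name}" \
    --timeout="${timeout_seconds}s"
}

# Usage (requires access to the cluster):
#   wait_for_cleanup hbase-dataload 360
```

The function returns a non-zero exit status if the DataLoad still exists when the timeout expires, which makes it usable as a step in a CI script.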

Caution: Time skew

Because the TTL-after-finished controller (the Fluid dataset-controller) uses timestamps stored in the data operation to determine whether the TTL has expired, this feature is sensitive to time skew in your cluster, which might lead to premature or delayed cleanup of data operation objects. Be cautious when setting a non-zero TTL.
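
To make the effect concrete, the sketch below models a controller whose clock is offset from the clock that recorded the finish time. It is a simplified model, assuming cleanup fires once the controller's own clock reaches the finish time plus the TTL; the function name observed_cleanup_delay and the skew values are illustrative only and do not come from Fluid.

```shell
#!/usr/bin/env bash
# Hypothetical model, not Fluid code: print how many real seconds elapse
# between completion and cleanup when the controller's clock is offset by
# skew_seconds from the clock that stamped the finish time.
#   Positive skew: controller clock runs ahead, so cleanup happens early.
#   Negative skew: controller clock runs behind, so cleanup happens late.
observed_cleanup_delay() {
  local ttl_seconds=$1
  local skew_seconds=$2
  local delay=$(( ttl_seconds - skew_seconds ))
  # The delay cannot be negative: in this model the object is cleaned up
  # as soon as the controller sees it.
  if (( delay < 0 )); then
    delay=0
  fi
  echo "$delay"
}

observed_cleanup_delay 300 0     # no skew: prints 300
observed_cleanup_delay 300 60    # controller 60s ahead: prints 240
observed_cleanup_delay 300 -60   # controller 60s behind: prints 360
```

In this model, a skew larger than the TTL removes the object as soon as it finishes, which is why short TTL values are the most exposed to clock differences.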
