Automatic Cleanup Data Operation
Background
Fluid's general-purpose data operations cover tasks such as data prefetching, data migration, elastic scaling, cache cleaning, metadata backup, and recovery. Similar to the automatic cleanup mechanism for Kubernetes Jobs, Fluid can automatically clean up data operations: a Time-to-Live (TTL) setting limits the lifetime of data operations that have finished executing. This document briefly demonstrates how to use this feature.
Prerequisites
Before you start, refer to the Installation Guide to install Fluid on your Kubernetes cluster. Make sure that all the components required by Fluid are running, for example:
$ kubectl get pod -n fluid-system
alluxioruntime-controller-5b64fdbbb-84pc6 1/1 Running 0 8h
csi-nodeplugin-fluid-fwgjh 2/2 Running 0 8h
csi-nodeplugin-fluid-ll8bq 2/2 Running 0 8h
dataset-controller-5b7848dbbb-n44dj 1/1 Running 0 8h
Automatic cleanup data operation
1. Set up demo dataset
Check the Dataset and AlluxioRuntime objects to be created
$ cat <<EOF > dataset.yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  mounts:
    - mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/
      name: hbase
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: hbase
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"
EOF
Create the Dataset and AlluxioRuntime
$ kubectl create -f dataset.yaml
Wait for the Dataset and AlluxioRuntime to be ready. You can check their status by running:
$ kubectl get datasets hbase
The Dataset and AlluxioRuntime are ready when the output looks like this, with the PHASE column showing Bound:
NAME    UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
hbase   1.21GiB          0.00B    2.00GiB          0.0%                Bound   75s
2. Set up data operation
We use a DataLoad here to demonstrate the automatic cleanup of data operations.
Check the DataLoad object to be created
$ cat <<EOF > dataload.yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: hbase-dataload
spec:
  dataset:
    name: hbase
    namespace: default
  ttlSecondsAfterFinished: 300
EOF
Here, the spec.ttlSecondsAfterFinished field specifies how many seconds after the data operation reaches the Complete or Failed phase it is cleaned up.
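As a rough illustration of the timing, the sketch below computes the earliest moment at which a finished operation becomes eligible for cleanup, that is, its finish time plus the TTL. `cleanup_eligible_at` is a hypothetical helper written for this example, the finish time is supplied by hand rather than read from the cluster, and GNU `date` is assumed.

```shell
# Print the earliest UTC time at which a finished data operation becomes
# eligible for cleanup. Assumes GNU date (the -d option).
# $1: finish time in RFC 3339 form (supplied by hand in this sketch)
# $2: value of spec.ttlSecondsAfterFinished, in seconds
cleanup_eligible_at() {
  cea_finish_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((cea_finish_epoch + $2))" +%Y-%m-%dT%H:%M:%SZ
}

# An operation that finished at midnight UTC with a 300-second TTL:
cleanup_eligible_at "2024-01-01T00:00:00Z" 300   # prints 2024-01-01T00:05:00Z
```

The printed time is only a lower bound: the object is removed when the controller next processes it, so the actual deletion may happen slightly later.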
Create the DataLoad
$ kubectl apply -f dataload.yaml
Watch the DataLoad status
$ kubectl get dataload -w
NAME             DATASET   PHASE       AGE     DURATION
hbase-dataload   hbase     Executing   7s      Unfinished
hbase-dataload   hbase     Complete    29s     7s
hbase-dataload   hbase     Complete    5m29s   7s
$ kubectl get dataload hbase-dataload
Error from server (NotFound): dataloads.data.fluid.io "hbase-dataload" not found
As the output shows, about 300 seconds after hbase-dataload finishes executing, the DataLoad is automatically cleaned up.
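To confirm the cleanup from a script instead of watching the output by hand, one option is a small polling helper. The sketch below is illustrative only: `wait_until_gone` is a hypothetical helper, not part of Fluid or kubectl, and it treats any failure of the wrapped command (including a connectivity error) as "gone".

```shell
# Poll a command until it starts failing, or give up after a timeout.
# $1: timeout in seconds
# $2: polling interval in seconds (use a value of 1 or more)
# remaining arguments: the command to run on each poll
wait_until_gone() {
  wug_timeout=$1
  wug_interval=$2
  shift 2
  wug_elapsed=0
  # Keep polling while the command still succeeds (the object still exists).
  while "$@" >/dev/null 2>&1; do
    if [ "$wug_elapsed" -ge "$wug_timeout" ]; then
      return 1   # timed out: the object is still present
    fi
    sleep "$wug_interval"
    wug_elapsed=$((wug_elapsed + wug_interval))
  done
  return 0       # the command failed, so the object is presumed deleted
}

# Example (requires access to the cluster):
#   wait_until_gone 600 10 kubectl get dataload hbase-dataload
```

A 600-second timeout leaves some margin over the 300-second TTL used in this demo. kubectl also offers `kubectl wait --for=delete dataload/hbase-dataload --timeout=600s`, which may be simpler when no custom polling logic is needed.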
Caution: Time skew
Because the TTL-after-finished controller (the Fluid dataset-controller) uses timestamps stored in the data operation to determine whether the TTL has expired, this feature is sensitive to time skew in your cluster, which may cause data operation objects to be cleaned up too early or too late. Be cautious when setting a non-zero TTL.
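The effect of skew can be illustrated with a simplified model of the expiry check. This is a sketch, not Fluid's actual implementation: `ttl_state` is a hypothetical function, times are plain epoch seconds, and the comparison "controller clock is at or past finish time plus TTL" is an assumption about how such a check typically works.

```shell
# Report whether the TTL has expired from the controller's point of view.
# $1: finish time recorded on the object (epoch seconds)
# $2: ttlSecondsAfterFinished
# $3: the controller's current clock reading (epoch seconds)
ttl_state() {
  if [ "$3" -ge $(( $1 + $2 )) ]; then
    echo expired
  else
    echo pending
  fi
}

finish=1000
ttl=300

# 250 s of real time after finish: the TTL has not elapsed yet.
ttl_state "$finish" "$ttl" 1250             # accurate clock -> pending
ttl_state "$finish" "$ttl" $((1250 + 60))   # clock 60 s fast -> expired (premature cleanup)

# 320 s of real time after finish: the TTL has elapsed.
ttl_state "$finish" "$ttl" 1320             # accurate clock -> expired
ttl_state "$finish" "$ttl" $((1320 - 60))   # clock 60 s slow -> pending (delayed cleanup)
```

In this model a clock error matters only when it is comparable to the TTL margin, so very short TTLs are the most exposed to skew.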