· 1 min read
Rong Gu

Our paper "High-level Data Abstraction and Elastic Data Caching for Data-intensive AI Applications on Cloud-native Platforms" has been accepted by IEEE TPDS 2023!

For more information, please refer to:

Rong Gu, Zhihao Xu, Yang Che, et al. High-level Data Abstraction and Elastic Data Caching for Data-intensive AI Applications on Cloud-native Platforms. IEEE TPDS, pp. 2946-2964, Vol 34(11), 2023.

· 1 min read
Rong Gu

Our paper "Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs" has been accepted by IEEE ICDE 2022!

For more information, please refer to:

Rong Gu, Kai Zhang, Zhihao Xu, et al. Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs. IEEE ICDE, pp. 2183-2196, May, 2022. (Conference Version)

· 10 min read
Rong Gu

Introduction: Data-intensive applications in cloud-native environments face heterogeneous data source access, slow I/O caused by the separation of storage and compute, and inefficient scheduling that is unaware of the application scenario. To address these problems, PASALab of Nanjing University, Alibaba, and Alluxio jointly launched the open source project Fluid in June 2020.

Fluid is an efficient support platform for data-intensive applications in cloud-native environments. Since its open source release, the project has attracted attention from experts and engineers in many related fields, and the community has kept evolving with their feedback. Fluid 0.5 has now been officially released. This version improves Fluid in three main areas:

  • Richer dataset operations, with support for online elastic scaling of the cache and for metadata backup and recovery.

  • Deployment support for more diverse environments, to meet users' individual deployment and configuration requirements.

  • A new data cache engine implementation, giving users on public clouds more engines to choose from.

Fluid open source project address: https://github.com/fluid-cloudnative/fluid

The requirements for these three main features came from production feedback from many community users. Fluid v0.5 also includes bug fixes and documentation updates. You are welcome to try Fluid v0.5!

Fluid v0.5 download link: https://github.com/fluid-cloudnative/fluid/releases

The following is a further introduction to the release features of this new version.

Richer dataset operations

In this version, Fluid focuses on enriching the operations available on its core abstraction, the dataset, so that data-intensive applications can make better use of the elasticity and observability that cloud-native platforms provide, and so that users have more flexibility in managing datasets.

1. Online elastic scaling of the dataset cache

This is a feature community users have long been waiting for. Before Fluid v0.5, adjusting the cache capacity of a dataset required uninstalling the cache engine and redeploying it from scratch. This was time-consuming and also meant losing all cached data, which is costly. The new version therefore supports elastic scaling of the dataset cache. Without stopping the service, users can increase the cache capacity of a dataset on the fly to accelerate data access (scale out), or reduce the cache capacity of a dataset that is rarely used (scale in). This enables finer-grained elastic resource allocation and improves resource utilization. Fluid's built-in controller selects suitable nodes for scaling according to its policy; for example, when scaling it uses the tasks running on a node and the node's cache ratio as filter conditions.

To scale the cache capacity of a dataset, the user only needs to run the following command:

kubectl scale alluxioruntimes.data.fluid.io {datasetName} --replicas={num}

Here {datasetName} is the name of the dataset, and --replicas specifies the number of cache nodes.
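As a concrete illustration, the sketch below fills in the two placeholders of the command above from Python. The dataset name hbase and the replica count 3 are assumed values chosen only for illustration, and the helper build_scale_command is hypothetical, not part of Fluid.

```python
import shlex
from typing import List


def build_scale_command(dataset_name: str, replicas: int) -> List[str]:
    """Return the argument list for the kubectl scale command shown above."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "kubectl", "scale", "alluxioruntimes.data.fluid.io",
        dataset_name, f"--replicas={replicas}",
    ]


# Hypothetical values: a dataset named "hbase" scaled to 3 cache nodes.
cmd = build_scale_command("hbase", 3)
print(shlex.join(cmd))
```

Against a real cluster, the same argument list could be passed to subprocess.run(cmd, check=True); that step is left out here because it needs a working kubectl context.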

The following video demonstrates manual scaling of a dataset and its effect:

fly-demo

For more details on manually scaling datasets, refer to the documentation.

2. Backup and recovery of metadata

This feature makes Fluid's dataset metadata management more flexible. Fluid v0.4 already supported loading a dataset's metadata (for example, the file system inode tree) locally and recording key statistics of the dataset (for example, the data volume and the number of files). However, once the user destroyed the local dataset, all of this metadata was lost, and it had to be fetched again from the underlying storage system when the dataset was rebuilt.

Therefore, Fluid v0.5 adds a Kubernetes custom resource, DataBackup, which gives users a declarative API for controlling data backup behavior. A simple example of a DataBackup custom resource is as follows:

apiVersion: data.fluid.io/v1alpha1
kind: DataBackup
metadata:
  name: hbase-backup
spec:
  dataset: hbase
  backupPath: pvc://<pvcName>/subpath1/subpath2/

When you create the dataset again, add a field that specifies the location of the backup file:


apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: hbase
spec:
  dataRestoreLocation:
    path: pvc://pvc-local/subpath1/
  mounts:
    - mountPoint: https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.2.6/

Fluid then first loads the metadata and dataset statistics from the backup file, which greatly speeds up metadata loading.

3. Improved dataset observability

Fluid v0.5 also further improves dataset observability, in two parts:

1) Integration with Prometheus

This feature collects dataset availability and performance metrics and visualizes them through Grafana. It is currently implemented for AlluxioRuntime. Users can easily see metrics such as the current cache nodes, cache space, cached percentage, remote reads, and short-circuit reads. Configuration is very simple, so the dataset monitoring system works out of the box.

prometheus

2) New dataset cache hit ratio metrics

This feature reports how many of the accesses to a dataset in the last minute hit the distributed cache. On the one hand, the metric helps users analyze performance bottlenecks in their data-intensive applications and quantify the effect of Fluid within the overall application workflow. On the other hand, it helps users balance application performance against cache resource usage and make reasonable scaling decisions.

In Fluid v0.5, these metrics are added to the status of the Dataset CRD, under Dataset.Status.CacheStates, and specifically include:

  • Cache Hit Ratio: the percentage of accesses in the past minute that hit the distributed cache.

  • Local Hit Ratio: the percentage of accesses in the past minute that hit the local cache.

  • Remote Hit Ratio: the percentage of accesses in the past minute that hit a remote cache.

Note: in a distributed cache there are two kinds of cache hit. A local cache hit means the accessing client reads the cached data directly on the same node. A remote cache hit means the accessing client reads cached data on another node over the network.
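To make the relationship between the three ratios concrete, here is a minimal sketch that derives them from access counts collected over a one-minute window. The counter names (local_hits, remote_hits, misses), the sample counts, and the assumption that the overall cache hit ratio is the sum of the local and remote ratios are illustrative only; this is not Fluid's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class HitRatios:
    cache: float   # percentage of accesses served by the distributed cache
    local: float   # percentage served by a cache on the same node
    remote: float  # percentage served by a cache on another node


def hit_ratios(local_hits: int, remote_hits: int, misses: int) -> HitRatios:
    """Compute hit percentages for one observation window."""
    total = local_hits + remote_hits + misses
    if total == 0:
        # No accesses in the window: report zeros instead of dividing by zero.
        return HitRatios(0.0, 0.0, 0.0)
    local = 100.0 * local_hits / total
    remote = 100.0 * remote_hits / total
    # Assumption: every cache hit is either local or remote.
    return HitRatios(cache=local + remote, local=local, remote=remote)


# Made-up counts for a one-minute window of 1000 accesses.
r = hit_ratios(local_hits=600, remote_hits=262, misses=138)
print(f"cache {r.cache:.1f}%  local {r.local:.1f}%  remote {r.remote:.1f}%")
```

A high cache hit ratio with a low local hit ratio would suggest, under these assumptions, that most reads are crossing the network to other cache nodes.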

In Fluid v0.5, users can view the cache hit ratio with the following command:


kubectl get dataset <dataset-name> -o wide

NAME ... CACHE HIT RATIO AGE

<dataset-name> ... 86.2% 16m

Deployment support for diverse environment configurations

Since the release of Fluid 0.4, we have added support for deploying Fluid in a wider variety of environments, based on the problems and requirements that community users reported from their actual deployments.

1. Support for FUSE global mode

In Fluid, the remote files defined in a Dataset resource object are schedulable, which means you can manage where the cache of remote files is placed in the Kubernetes cluster just as you manage pods. Compute pods access the data files through the FUSE client. In previous versions of Fluid, FUSE clients were always scheduled onto the nodes where the cache was located, and users could not freely control FUSE scheduling.

Fluid v0.5 adds a global deployment mode for FUSE. In this mode, FUSE is deployed to all nodes by default. Users can also influence where FUSE is scheduled by specifying a nodeSelector for it. At the same time, the cache is deployed on nodes with a large number of compute pods.

2. Support for HDFS user-level configuration

Many community users use the distributed cache system Alluxio as the cache engine for Fluid datasets. When a dataset is persisted in an HDFS file system, the Alluxio cluster needs to obtain the HDFS configuration in advance in order to access the underlying HDFS.

Fluid v0.5 supports this scenario with native Kubernetes resources. Users first create the relevant configuration files (for example, hdfs-site.xml and core-site.xml) in the Kubernetes environment as a ConfigMap, and then reference that ConfigMap in the AlluxioRuntime resource object they create.

An example of the AlluxioRuntime resource object is as follows:

apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: my-hdfs
spec:
  ...
  hadoopConfig: <configmap-name>
  ...

The Alluxio cluster created this way can then access the data in the HDFS cluster normally. For more information, refer to the sample documentation.

A new data cache engine implementation

The default distributed cache runtime used by Fluid is AlluxioRuntime. To support users' cache system needs in different environments, an earlier version of Fluid turned the distributed cache runtime integration framework into a pluggable architecture. In Fluid v0.5, community contributors from Alibaba Cloud used this framework to develop JindoRuntime, a new execution engine implementation for managing and caching Fluid dataset data. Through JindoRuntime, users can use JindoFS in cache mode to access and cache remote files in Fluid. JindoRuntime is simple to deploy and use on Fluid, is compatible with native Kubernetes environments, and works out of the box.

Summary

Fluid v0.5 enriches and strengthens Fluid's features and user experience. First, Fluid v0.5 adds further dataset operations:

  • Online elastic scaling of datasets, both out and in, for more flexible and fine-grained control of cluster resource allocation.

  • The new DataBackup CRD backs up and restores dataset file metadata and related information, helping the dataset caching system restart quickly.

  • New cache hit ratio metrics help users better quantify and analyze the acceleration that Fluid provides.

Second, Fluid supports more environment modes and configurations, to meet the deployment requirements of more real-world scenarios.

Finally, Fluid adds JindoRuntime, a distributed cache runtime based on JindoFS, which gives users a choice of cache engines across diverse deployment environments.

We will continue to follow and adopt community suggestions to support the long-term development of the Fluid project, and we look forward to hearing more of your feedback. If you have any questions or suggestions, you are welcome to join the Fluid user group or discuss with us on GitHub.

Acknowledgments

Thanks to the community members who contributed to this version, including Wang Tao from Alibaba Cloud, Xie Yuandong from Tencent Cloud, Qiu Lingwei from China Telecom, and Xu Zhihao, Hou Haojun, Chen Guowang, Chen Yuquan, and other students from PASALab of Nanjing University.

About the author

Dr. Gu Rong is an associate researcher in the computer science department of Nanjing University, a co-founder of the Fluid open source project, and a PMC member of the Alluxio open source project. His research focuses on big data processing systems. He has published more than 30 papers in leading journals and conferences such as TPDS, ICDE, JPDC, IPDPS, and ICPP, and has led General Program and Youth Program projects of the National Natural Science Foundation of China as well as several specially funded projects of the China Postdoctoral Science Foundation. His research results have been applied at Alibaba, Baidu, ByteDance, Sinopec, Huatai Securities, and other companies, and in the open source projects Apache Spark and Alluxio. He won the first prize of the Jiangsu Science and Technology Award in 2018 and the Youth Science and Technology Award of the Jiangsu Computer Society in 2019. He serves as a member of the System Software Special Committee and a corresponding member of the Big Data Special Committee of the China Computer Federation, and as Secretary General of the Big Data Special Committee of the Jiangsu Computer Society.

· 7 min read
Rong Gu

Introduction: Data-intensive applications such as big data and AI face high data access latency, difficult joint analysis, and complex multi-dimensional management when compute and storage are separated in cloud-native environments. To address these problems, PASALab of Nanjing University, Alibaba, and Alluxio jointly launched the open source project Fluid in September 2020.

Fluid 0.4 has now been officially released, adding four important features:

  • A DataLoad custom resource that provides easy-to-use, customizable data preheating.

  • Stronger support for datasets with massive numbers of small files, extending the AI scenarios that Fluid supports.

  • An HDFS-compatible file system interface, supporting data access from Spark and other frameworks.

  • Mixed deployment of multiple datasets on a single node, to suit the shared clusters used in production environments.

Fluid project address: https://github.com/fluid-cloudnative/fluid

As with Fluid 0.3, the requirements for these features came from production feedback from many community users. Fluid v0.4 also includes bug fixes and documentation updates. You are welcome to try Fluid v0.4! Thanks to the community members who contributed to this version. In upcoming iterations we will continue to follow and adopt community suggestions to advance the Fluid project, and we look forward to hearing more of your feedback!

The following is a further introduction to the release features of this new version.

Support for active data preheating

Data preheating is a common optimization in AI model training. It means pulling the data an application needs from the remote storage system into the local compute cluster before the application runs, so that the data is ready when the application later uses it. Data preheating reads data in a sequential, regular, parallel pattern, which avoids the unnecessary communication overhead caused by random reads when a data-intensive application consumes data directly from a remote storage system.

Therefore, Fluid 0.4 introduces a new Kubernetes custom resource, DataLoad, which gives users a declarative, Kubernetes-native API for controlling data preheating behavior. A simple example of a DataLoad custom resource is as follows:

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-dataload
spec:
  dataset:
    name: imagenet
    namespace: default

In addition, with a small amount of extra configuration, DataLoad also supports customizations such as loading specific subdirectories, controlling the number of cache replicas, and synchronizing metadata. For more details on using DataLoad, refer to the sample document on GitHub.

The following demo video shows how DataLoad is used and the optimization it brings:

04-demo

Enhanced support for datasets with massive numbers of small files

Fluid is an efficient support platform for data-intensive applications in cloud-native environments, so we closely follow how well its dataset support works in different scenarios. Before Fluid 0.4, Fluid already provided a series of dataset capabilities such as abstraction, management, acceleration, and observability. However, according to feedback from community members, these capabilities were still quite basic for workloads with massive numbers of small files.

Given how common datasets with massive numbers of small files are in real production environments, especially in AI applications, we studied the problems they cause in depth and proposed solutions such as asynchronous metadata loading and querying and streaming data processing. These are all integrated into Fluid 0.4 to strengthen its support for such datasets.

The following performance comparison shows Fluid before and after these optimizations in a scenario with 4 million small files, using the Alluxio runtime:

| | Fluid 0.3 | Fluid 0.4 |
| --- | --- | --- |
| Dataset initialization | 60 min | 22 min |
| 8-thread parallel data reading | 407 min | 29 min |
| Deep learning model training | 6.5 hours | 45 min |
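The speedups implied by these figures can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the comparison above, with 6.5 hours converted to 390 minutes; it adds no new measurements, and the helper name speedup is introduced here for illustration.

```python
# Durations in minutes, taken from the comparison above (6.5 hours = 390 min).
results = {
    "dataset initialization": (60, 22),
    "8-thread parallel data reading": (407, 29),
    "deep learning model training": (6.5 * 60, 45),
}


def speedup(before_min: float, after_min: float) -> float:
    """Ratio of the Fluid 0.3 duration to the Fluid 0.4 duration."""
    return before_min / after_min


for name, (v03, v04) in results.items():
    print(f"{name}: {speedup(v03, v04):.1f}x faster")
```

With these inputs the ratios come out to roughly 2.7x for initialization, 14x for parallel reading, and 8.7x for model training.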

Storing and managing massive numbers of small files is a hard problem that many storage systems run into. In subsequent versions we will continue to pay attention to this scenario and the problems it brings.

Data access support for big data computing frameworks such as Spark

Besides AI applications, Fluid 0.4 also supports running big data applications such as Spark. By exposing the Hadoop Compatible File System (HCFS) interface of the Alluxio distributed cache engine to users, data analysis applications written with Hadoop MapReduce, Apache Spark, and other big data computing frameworks can run directly on Fluid without any changes to application code, and benefit from the distributed cache acceleration that Fluid provides.

For more details on accessing data through the HCFS interface, refer to the sample documentation on GitHub.

Mixed deployment of multiple datasets on a single node

In real production environments, users run multiple training tasks on the GPU nodes of a Kubernetes cluster, and these tasks use multiple datasets. Before Fluid 0.4, a single node could not host multiple datasets at the same time. As a result, if multiple users wanted to access their datasets on the same node at the same time, one user's dataset could not be created.

Fluid 0.4 adds the ability to deploy multiple datasets together on a single node. As long as the node has sufficient resources, deployment conflicts between datasets from different users no longer occur, which makes Fluid a better fit for real production environments. Mixed deployment also makes effective use of idle resources, raises the resource utilization of each node in the cluster, and further improves the cost-effectiveness of Fluid.

For a brief introduction to mixed deployment of multiple datasets on a single node, refer to the sample document on GitHub.

Acknowledgments

  • Xu Zhihao (PASALab, Nanjing University), for contributions to small-file scenario support and the data preheating feature

  • Xiefeng (Unisound), for developing the mixed deployment of multiple datasets on a single node and verifying it in real scenarios

  • Qiu Lingwei (China Telecom), for contributions to splitting the Fluid architecture: he separated the runtime controller from the dataset controller, so that the two components can evolve in parallel in the future

Summary

Fluid 0.4 continues to address the problems and requirements that community users reported from real production environments, broadening the scenarios Fluid applies to and improving the user experience:

  • First, better support for datasets with massive numbers of small files lets Fluid handle a wider range of usage scenarios;

  • Second, the new DataLoad custom resource gives users a simple data preheating solution;

  • Third, data access support for big data applications such as Spark lets Fluid serve different types of data-intensive applications;

  • Finally, mixed deployment of multiple datasets makes Fluid a better fit for real production environments.

If you have any questions or suggestions, you are welcome to join the DingTalk group to take part in the discussion.

About the author

Dr. Gu Rong is an associate researcher in the computer science department of Nanjing University, a co-founder of the Fluid open source project, and a PMC member of the Alluxio open source project. He has published more than 20 papers in leading journals and conferences such as TPDS, ICDE, JPDC, IPDPS, and ICPP. He has led General Program and Youth Program projects of the National Natural Science Foundation of China and several specially funded projects of the China Postdoctoral Science Foundation. His research results have been applied at Alibaba, Baidu, ByteDance, Sinopec, and Huatai Securities, and in the open source projects Apache Spark and Alluxio. He won the first prize of the Jiangsu Science and Technology Award in 2018 and the Youth Science and Technology Award of the Jiangsu Computer Society in 2019. He serves as a member of the System Software Special Committee and a corresponding member of the Big Data Special Committee of the China Computer Federation, and as Secretary General of the Big Data Special Committee of the Jiangsu Computer Society.