Deep learning (DL) models have different data-access requirements from the typical workloads that existing storage systems were built for. Additionally, the datasets required to train these models can run into petabytes. A storage system customized for this use case is therefore essential, & Nvidia has developed AIStore, an open-source solution to tackle this problem. In this post, we will go through the paper describing the architecture of AIStore, along with its custom storage format known as WebDataset.
Challenges with existing storage systems
Typical workload of a DL application consists of following steps:
- Shuffle the dataset into a random order
- Iterate through the shuffled dataset sequentially
- Read a batch of data
- Deliver the decoded data to the DL job running on a GPU
- Check for accuracy & repeat if required
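The steps above can be sketched as a minimal epoch loop. This is an illustrative sketch, not AIStore code; the dataset & the uppercase "decode" step are made-up stand-ins:

```python
import random

def run_epoch(samples, batch_size, rng):
    """One epoch: shuffle once, then iterate sequentially in batches."""
    order = list(range(len(samples)))
    rng.shuffle(order)                         # shuffle the dataset into a random order
    for start in range(0, len(order), batch_size):
        batch = [samples[i] for i in order[start:start + batch_size]]
        decoded = [s.upper() for s in batch]   # stand-in for decoding raw bytes
        yield decoded                          # would be handed to the GPU job

rng = random.Random(0)
samples = [f"sample-{i}" for i in range(10)]
batches = list(run_epoch(samples, batch_size=4, rng=rng))
```

Running this again (after checking accuracy) simply means calling `run_epoch` once more, which reshuffles & replays the whole dataset.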
One full traversal of the dataset is referred to as an epoch. To facilitate such a workload, a datastore needs to provide a storage format, a client library to read the data & a server for serving shards of data. Frameworks available to fulfill such a workload fall into these 4 categories:
- Parallel POSIX based filesystems
- Object stores
- GFS or S3
- HDFS
The file system interface provided by TensorFlow supports all 4 of these storage types.
A storage system for DL workloads has a certain set of requirements:
- Support for standard protocols & file formats
- Support for migrating existing deep-learning models & datasets
- Fast access to stored data
- Easy to setup & configure
- Compatible with the Python ETL ecosystem
- Integration with Kubernetes for easy deployment
- No performance overhead from a JVM runtime
These requirements are where existing solutions fall short, & this is what led Nvidia to build the open-source AIStore.
Introducing AIStore
In this section we will take a high-level look at both AIStore (AIS) & the WebDataset storage format.
AIS provides a scalable namespace over a large number of storage disks, where data is streamed between storage & compute nodes. AIS provides an S3-like interface supporting GET & PUT operations. It achieves high I/O throughput by leveraging HTTP redirects: a client accesses a data object by connecting directly to the storage server holding that object. AIS provides data durability guarantees using standard storage techniques such as erasure coding, N-way mirroring etc. It also integrates seamlessly with object stores such as S3. Written from the ground up in Go, it supports multiple deployment options such as Kubernetes.
A WebDataset is a POSIX tar file in which the files making up each sample are stored next to each other. AIStore provides a Python library so that existing infrastructure can access this sequential storage. The WebDataset library can read from multiple sources such as local files, web servers etc.
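Because the format is plain tar, nothing beyond standard tooling is needed to produce or consume a shard. A minimal sketch using only Python's `tarfile` module (the file names, `.jpg`/`.cls` suffixes & byte contents are illustrative, not the WebDataset library's API):

```python
import io
import tarfile

# Write a WebDataset-style shard: the files of each sample share a basename
# ("key") & sit next to each other in the archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for key, (img, label) in {"000": (b"jpeg-bytes-0", b"7"),
                              "001": (b"jpeg-bytes-1", b"3")}.items():
        for suffix, data in ((".jpg", img), (".cls", label)):
            info = tarfile.TarInfo(name=key + suffix)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# Read it back in a single sequential pass, grouping members by key.
buf.seek(0)
samples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        key, _, suffix = member.name.partition(".")
        samples.setdefault(key, {})[suffix] = tar.extractfile(member).read()
```

Storing a sample's files adjacently is what makes the read path purely sequential: one pass over the shard recovers every sample without any seeking.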
Diving deeper
In this section we will look into the challenges that a storage system suffers from & how AIStore tackles them. One well-known challenge is the small-file problem: a DL workload comprises millions of small files, making it even more challenging for the storage engine supporting the workload. One common solution to this problem is to combine multiple small files into a larger file. AIS uses dSort (an extension to AIS) to create merged shards in parallel on the storage nodes for optimal data transfer, & relies on GNU tar for aggregating smaller DL samples into bigger shards.
Another challenge is scalable storage access. Typical storage systems expose one endpoint to fetch metadata for stored files & then separate data endpoints to read the data. AIStore improves upon this by using HTTP redirects: when a client requests data, the AIS gateway redirects the request to the storage node holding that data. This avoids sending multiple calls to read the data. After the redirection, clients interact with the data nodes directly & no data transfer takes place through the AIS gateways.
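The redirect flow can be simulated end to end with the standard library. Both servers below are toy stand-ins, not AIS components: the "gateway" answers with a 307 pointing at the "storage node", & the client's follow-up request fetches the bytes from the node directly:

```python
import http.server
import threading
import urllib.request

class StorageNode(http.server.BaseHTTPRequestHandler):
    def do_GET(self):                      # serves the object bytes directly
        body = b"object-bytes"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args): pass     # keep test output quiet

class Gateway(http.server.BaseHTTPRequestHandler):
    def do_GET(self):                      # knows which node owns the object
        self.send_response(307)
        self.send_header("Location", f"http://127.0.0.1:{node_port}{self.path}")
        self.end_headers()
    def log_message(self, *args): pass

node = http.server.HTTPServer(("127.0.0.1", 0), StorageNode)
node_port = node.server_address[1]
gw = http.server.HTTPServer(("127.0.0.1", 0), Gateway)
for srv in (node, gw):
    threading.Thread(target=srv.serve_forever, daemon=True).start()

# urllib follows the 307, so the payload comes straight from the storage
# node & never flows through the gateway.
with urllib.request.urlopen(
        f"http://127.0.0.1:{gw.server_address[1]}/bucket/obj") as resp:
    data = resp.read()

node.shutdown(); gw.shutdown()
```

The gateway only ever handles the small redirect response, which is why this pattern keeps the gateways from becoming a bandwidth bottleneck.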
To summarize, key ideas provided by AIS are:
- Sharding small samples into larger objects for efficient sequential reads
- High-scale storage access
- Ease of deployment
- Direct access to storage nodes through HTTP redirect
With AIS, a typical data processing pipeline looks as below:
Conclusion
Benchmarks show that AIS performs on par with HDFS, which is a great result considering how mature the HDFS ecosystem is. AIS tries to solve a specific set of problems encountered while dealing with deep-learning workloads. Such specialized storage systems will become more common as we see an influx of AI applications. I plan on covering more papers about such storage systems in the future. Happy learning!