The Problem: Transferring data into cloud storage fast, at scale
In an era of pervasive data generation, collection, and analysis, everyone recognizes the value and scale that Cloud storage provides. Users are looking to Cloud storage to cost-effectively, securely, and reliably:
- Stage data for processing in the Cloud
- Provide secure data redundancy
- Archive infrequently accessed data
But how do we get data in and out of the Cloud fast, particularly for modern data sets with hundreds of millions of files or petabytes of data?
The Solution: File and File System Parallel Uploading
Back in 2011, we benchmarked parallel uploading to speed up transfers to Amazon S3. We found parallelizing the transfers of parts of individual files, as well as transferring entire files concurrently, maximized bandwidth usage into Amazon S3.
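The pattern we benchmarked can be sketched in a few lines of Python: split each file into fixed-size parts and upload the parts concurrently, while a second pool transfers whole files in parallel. This is a minimal illustration, not DataMan's implementation; `upload_part` here is a hypothetical stand-in for a real multipart-upload call (such as S3's), and simply reads the byte range it would send.

```python
import os
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 8 * 1024 * 1024  # 8 MiB per part (assumed part size)

def upload_part(path, offset, length):
    """Hypothetical stand-in for a multipart-upload part call.
    It reads the byte range that would be sent and returns its size."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return len(data)

def parallel_upload_file(path, part_pool):
    """Split one file into parts and upload the parts concurrently."""
    size = os.path.getsize(path)
    futures = [part_pool.submit(upload_part, path, off,
                                min(PART_SIZE, size - off))
               for off in range(0, size, PART_SIZE)]
    return sum(f.result() for f in futures)

def parallel_upload_tree(paths, file_workers=4, part_workers=8):
    """Transfer whole files concurrently; each file is itself
    uploaded in parallel parts, so both levels of parallelism
    contribute to bandwidth utilization."""
    with ThreadPoolExecutor(part_workers) as parts, \
         ThreadPoolExecutor(file_workers) as files:
        futures = [files.submit(parallel_upload_file, p, parts)
                   for p in paths]
        return {p: f.result() for p, f in zip(paths, futures)}
```

Using separate pools for files and parts keeps the two levels of concurrency independently tunable, which matters when file sizes vary widely.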
Figure 1. Simple DataMan UI for moving data between file systems and the cloud
Cycle’s DataMan™ data workflow software has done this out of the box since 2013. The intelligent parallelism built into DataMan™ lets it handle data workflows at massive scale into and out of the Cloud: a billion data blobs, petabytes (PB) of data, and distributed file systems, among other production use cases.
Figure 2. Fast, Scheduled Transfers in a Data Workflow in DataMan
Using the simple DataMan GUI, you can easily configure and schedule transfers as a data workflow: a one-time copy or regular, scheduled operations. You do not need to manage the complexity of large directory structures, varying file sizes, file compression, or metadata generation.
The Details: Here’s how we do it
DataMan handles all of that, efficiently and scalably dividing up large files and file systems for you. To maximize performance, DataMan™ uses parallel uploads – via multiple threads or multiple instances of DataMan™ running at the same time – to saturate the available bandwidth, as shown below:
Figure 3. The Fast, Parallel Transfer Pipeline into Cloud within DataMan
The figure above illustrates the parallel pipeline that moves data from filesystems, for example, to cloud blob storage. Fast parallel uploading occurs at the file, blob, and filesystem levels to maximize concurrent transfer and bandwidth utilization.
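One minimal way to sketch such a pipeline, under assumptions of our own: a walker thread enumerates the filesystem and feeds paths into a bounded queue, while a pool of worker threads drains the queue and uploads concurrently. The `upload_blob` function is a hypothetical placeholder for a real blob-storage call, and the names and worker counts are illustrative, not DataMan internals.

```python
import os
import queue
import threading

def walk_files(root, q, n_workers):
    """Enumerate the filesystem and feed paths into the shared queue."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            q.put(os.path.join(dirpath, name))
    for _ in range(n_workers):
        q.put(None)  # one sentinel per worker signals end of input

def upload_blob(path):
    """Hypothetical placeholder for a blob-storage upload call;
    here it just returns the file size that would be transferred."""
    return os.path.getsize(path)

def run_pipeline(root, n_workers=4):
    """Walker thread produces paths; worker threads upload concurrently,
    so directory traversal overlaps with the transfers themselves."""
    q = queue.Queue(maxsize=64)  # bounded so the walker can't run far ahead
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            path = q.get()
            if path is None:
                return
            size = upload_blob(path)
            with lock:
                results[path] = size

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    walk_files(root, q, n_workers)
    for w in workers:
        w.join()
    return results
```

Overlapping traversal with transfer in this way is what keeps the pipeline busy on directory trees with hundreds of millions of files, where enumeration alone can take significant time.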
These parallel upload features power real, enterprise-level environments. Some users have transferred petabytes of data to the Cloud and are actively managing it; others have used DataMan to transfer and synchronize upwards of hundreds of millions of files in a single directory structure. The diagram below illustrates a real client use case for high-speed archiving of genomic data into Amazon Glacier, which uploads 30 TB of data per week:
Figure 4. Production Client Use Case for 1PB of data into Glacier
You can also sign up for an online demo of DataMan here.