DataMan Feature: Fast Parallel Uploads for Big Data into the Cloud

By Rob Futrick, Cycle Computing CTO

The Problem: Transferring data into cloud storage fast, at scale

In this day of pervasive data generation, collection, and analysis, everyone recognizes the value and scale that Cloud storage provides. Users are looking to Cloud storage to cost-effectively, securely, and reliably:

  • Stage data for processing in the Cloud
  • Provide secure data redundancy
  • Archive infrequently accessed data

But how do we get data in and out of the Cloud fast, particularly for modern data sets, ones with hundreds of millions of files or petabytes of data?

The Solution: File and File System Parallel Uploading

Back in 2011, we benchmarked parallel uploading to speed up transfers to Amazon S3. We found that parallelizing the transfer of parts of individual files, as well as transferring entire files concurrently, maximized bandwidth usage into Amazon S3.
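To illustrate the first idea, here is a minimal sketch (not DataMan's actual implementation) of splitting one file into parts and pushing those parts concurrently. The `upload_part` argument is a hypothetical caller-supplied function standing in for a per-part storage call, such as an S3 multipart UploadPart request:

```python
import concurrent.futures
import os

def part_ranges(size, part_size):
    """Yield (part_number, offset, length) tuples covering `size` bytes."""
    number = 1
    for offset in range(0, size, part_size):
        yield number, offset, min(part_size, size - offset)
        number += 1

def parallel_upload(path, upload_part, part_size=8 * 1024 * 1024, workers=8):
    """Upload one file by sending its parts concurrently.

    `upload_part(part_number, data)` is a hypothetical, caller-supplied
    function that sends one chunk to storage. Results are returned in
    part order, as needed to complete a multipart upload.
    """
    size = os.path.getsize(path)

    def send(job):
        number, offset, length = job
        # Each worker opens its own handle so reads don't contend on one seek position.
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(length)
        return number, upload_part(number, data)

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send, part_ranges(size, part_size)))
    return sorted(results)
```

Because uploads are network-bound, threads (rather than processes) are enough to keep many part transfers in flight at once.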



Figure 1. Simple DataMan UI for moving data between file systems and the cloud

Cycle’s DataMan™ data workflow software has done this out of the box since 2013. The intelligent parallelism built into DataMan™ enables it to handle data workflows at truly massive scale into and out of the Cloud: a billion data blobs, petabytes (PB) of data, and distributed file systems, among other production use cases.


Figure 2. Fast, Scheduled Transfers in a Data Workflow in DataMan

Using the simple DataMan GUI, you can easily configure and schedule transfers as a data workflow: a one-time copy or regular, scheduled operations. You do not need to manage the complexity of large or deeply nested directory structures, varying file sizes, file compression, or metadata generation.

The Details: Here’s how we do it

DataMan handles all of that, efficiently and scalably dividing up large files and file systems for you. To maximize bandwidth utilization, DataMan™ uses parallel uploads – via multiple threads or multiple instances of DataMan™ running at the same time – as shown below:


Figure 3. The Fast, Parallel Transfer Pipeline into Cloud within DataMan

The figure above illustrates the parallel pipeline that moves data from file systems, for example, to cloud blob storage. Fast parallel uploading occurs at the file, blob, and filesystem levels to maximize concurrent transfers and bandwidth utilization.
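The filesystem-level concurrency can be sketched in the same spirit: many whole files in flight at once, each handled by its own worker. This is an illustrative sketch only; `upload_file` is a hypothetical single-file uploader (which could itself split large files into parallel parts, as in the earlier sketch):

```python
import concurrent.futures

def upload_tree(files, upload_file, workers=16):
    """Transfer many files concurrently.

    `upload_file(path)` is a hypothetical, caller-supplied single-file
    uploader. Each worker thread pushes one whole file while other
    threads do the same, keeping the network pipe full.
    Returns a {path: result} mapping for every file attempted.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(upload_file, path): path for path in files}
        for fut in concurrent.futures.as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Layering the two levels – concurrent files, each with concurrent parts – is what keeps bandwidth saturated regardless of whether a transfer contains a few huge files or millions of small ones.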

These parallel upload features power real, enterprise-level environments. Some users have transferred petabytes of data to the Cloud and are actively managing it there; others have used DataMan to transfer and synchronize upwards of hundreds of millions of files in a single directory structure. The diagram below illustrates a real client use case for high-speed archiving of genomic data into Glacier, uploading 30 TB of data per week:


Figure 4. Production Client Use Case for 1PB of data into Glacier

If you’re interested in trying DataMan, please connect with Cycle here, or visit the DataMan home page to learn more.

You can also sign up for an online demo of DataMan here.
