Gentlemen, start your uploads! They're free now, but how fast can we do them? Lately we’ve been helping clients solve big scientific problems with Big Data (Next-Generation Sequencing analysis is one example), which means transferring large files into and out of the cloud as efficiently as possible. We’re optimizing two costs here: money and time.
Lucky for us, Amazon Web Services continues to drive down the costs of data transfer. We were excited to see that all data transfer into AWS will be free as of July 1st! They’re also reducing the cost to transfer data out of AWS. Less money, more science, yes!
We still need to optimize for time, however. The scalability of the Elastic Compute Cloud (EC2) means we can throw as many cores at a scientific problem as we can afford in a very short time. But what if our input or result data is so large that the time to transfer it far outweighs the time to analyze it?
Our previous work has shown that file transfers often fail to fill the pipe to capacity, limited instead by disk I/O and other factors. We can therefore speed up transfers by using multiple threads to keep the pipe full.
As shown above, this work involved moving data directly to a file system using rsync. But since that time, we’ve begun to rely upon the Simple Storage Service (S3) as both a staging area and long-term storage solution for input and result data. S3’s availability and scalability are far superior to even striped Elastic Block Store volumes running on an instance. Besides, using S3 instead of an instance serving an EBS file system means we have a smaller support footprint. So how can we move data quickly into and out of S3 using multiple streams?
S3 supports multiple stream data transfers differently depending on the direction of the transfer. If you move data into S3, you can make use of the multi-part upload feature. This enables you to break up a large file into chunks and upload them separately. When you’ve uploaded each chunk, you issue another call to S3 to stitch the chunks together. So we can take a large file, chunk it up, and transfer the chunks simultaneously over many threads to fill our upload pipe.
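The chunk-and-stitch flow above can be sketched in Python. This is a minimal illustration, not the tool we benchmarked: it assumes boto3 (the modern AWS SDK for Python), and the bucket, key, part size, and thread count are all placeholder values.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024 * 1024  # illustrative; S3 parts must be >= 5 MB (except the last)

def part_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Split a file size into (part_number, offset, length) tuples."""
    ranges = []
    offset, part = 0, 1
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        ranges.append((part, offset, length))
        offset += length
        part += 1
    return ranges

def multipart_upload(path, bucket, key, threads=8):
    import boto3  # assumed dependency, not the library used in the original tool
    s3 = boto3.client("s3")
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)

    def upload_part(args):
        part, offset, length = args
        with open(path, "rb") as f:  # each thread opens its own handle
            f.seek(offset)
            data = f.read(length)
        resp = s3.upload_part(Bucket=bucket, Key=key,
                              UploadId=upload["UploadId"],
                              PartNumber=part, Body=data)
        return {"PartNumber": part, "ETag": resp["ETag"]}

    # Upload all parts concurrently to fill the pipe.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = list(pool.map(upload_part, part_ranges(os.path.getsize(path))))

    # The final call stitches the chunks together on the S3 side.
    s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})
```

Note that each part carries its own part number, so parts can finish out of order and S3 will still assemble them correctly.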
When downloading S3 objects, you can make use of the ranged GET feature of S3. The Hypertext Transfer Protocol (HTTP) defines a Range header you can add to GET requests, which lets you specify a byte range and transfer only that part of the resource. Because S3 supports the Range header, we can create a list of chunks – specified by byte ranges – and download them simultaneously. After all chunks are downloaded, we can concatenate them and checksum the resulting file to verify integrity.
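The download side can be sketched the same way. Again this is an illustration under assumptions (boto3, placeholder chunk size and thread count); Range header values are inclusive byte ranges, e.g. `bytes=0-29` for the first 30 bytes:

```python
from concurrent.futures import ThreadPoolExecutor

def byte_ranges(size, chunk_size):
    """HTTP Range header values covering `size` bytes (ranges are inclusive)."""
    return ["bytes=%d-%d" % (start, min(start + chunk_size, size) - 1)
            for start in range(0, size, chunk_size)]

def parallel_download(bucket, key, dest, chunk_size=64 * 1024 * 1024, threads=8):
    import boto3  # assumed dependency
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    def fetch(rng):
        # Each ranged GET returns only the requested slice of the object.
        return s3.get_object(Bucket=bucket, Key=key, Range=rng)["Body"].read()

    with ThreadPoolExecutor(max_workers=threads) as pool:
        chunks = pool.map(fetch, byte_ranges(size, chunk_size))  # results stay in order

    # Concatenate the chunks, then checksum `dest` to verify integrity.
    with open(dest, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
```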
Test Environment and Results
We tested a new transfer tool by repeatedly uploading and downloading a 2.2 GB NGS file using different numbers of threads. Tests were run on an m1.large EC2 instance. The time to upload the entire file and the time to download it from S3 (including time to reassemble and verify the checksum) were recorded and averaged. As you can see in this first graph, the total bandwidth used in both directions increased with an increasing number of threads.
By using many threads, the time to transfer decreased dramatically. This graph shows the percent improvement by number of threads. In the best case, uploads to S3 completed nearly 2.5 times faster!
Even with a multi-part transfer tool, you still run into the following issues:
- You need to track the transfers and retry any that fail
- You can be too efficient with your transfers and overload your data pipe
- Some Internet proxies silently drop the Range header from requests, resulting in many copies of the entire S3 object being downloaded
- The time to concatenate downloaded chunks can outweigh some of the transfer gains
- Many S3 tools assume the ETag header of the object contains the MD5 checksum of that object (this is not the case with multipart-uploaded objects!), causing integrity checks to fail
Coming soon: A JetS3t app update
We’ll be addressing these challenges in the coming weeks. Unfortunately, the most commonly used S3 transfer tools have limited support for parallelization. We are big fans of s3cmd, but it supports neither multiple threads nor multipart transfers. JetS3t is another option. Its Synchronize application supports multiple threads, but it’s focused on moving lots of smaller files in parallel rather than one big file in chunks. Because the JetS3t library has support for multi-threaded transfers, and because it’s open source, we used it as the basis for our own multipart data transfer tool. It’s this tool that generated the results you see above.
We’ll be releasing an application to do this automatically in the next week or so, to enable anyone to easily do parallel uploads.