Time has always been money, but this is especially true when you rent your compute. Shortening a cluster's setup time not only shortens time-to-results, it also saves money. We recently worked with a customer to cut about 20 minutes from the initial boot time of their compute instances by changing how they staged in their data.
This customer does cancer research, so the jobs have a lot of reference data. The reference data was provided by a collaborator and stored in Amazon S3, but in a different region from the compute instances used by the researchers. Initially, each instance pulled the data from S3 using Cycle Computing’s pogo tool for cloud data transfer. pogo itself performs well, but because the traffic crossed the public internet, the transfers were slow and sometimes overwhelmed the AWS NAT Gateway.
We looked at two options: mirroring the data to the desired region and using EBS snapshots. We chose snapshots because the reference data was static, so the effort of creating a snapshot was amortized over many computation runs. A snapshot takes a little time to create and copy, but that cost is paid once, and each compute instance then starts faster than it would if it downloaded the data from S3 at every boot. When execute nodes start, CycleCloud creates a volume from that snapshot and attaches it to the instance. This gives each execute node a local copy of the reference data in a fraction of the time.
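CycleCloud handles the volume creation and attachment for you, but the underlying steps may be easier to picture as code. The following is a minimal sketch, not the customer's actual setup: it assumes a boto3-style EC2 client is passed in for each region, and the function names, the default device name, and the IDs in the usage note are all hypothetical.

```python
"""Sketch: stage static reference data through an EBS snapshot.

Each function takes an EC2 client, for example one created with
boto3.client("ec2", region_name=...). All function names, the default
device name, and any IDs passed in are illustrative assumptions.
"""


def snapshot_reference_volume(ec2_src, volume_id, description):
    """Snapshot the volume that already holds the reference data."""
    snap = ec2_src.create_snapshot(VolumeId=volume_id, Description=description)
    # Block until the snapshot is complete; large volumes can take a while.
    ec2_src.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
    return snap["SnapshotId"]


def copy_snapshot_to_region(ec2_dst, source_region, snapshot_id, description):
    """Copy the snapshot into the compute region.

    Note that the copy is requested on the *destination* region's client.
    """
    copy = ec2_dst.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=description,
    )
    ec2_dst.get_waiter("snapshot_completed").wait(SnapshotIds=[copy["SnapshotId"]])
    return copy["SnapshotId"]


def attach_reference_volume(ec2_dst, snapshot_id, availability_zone,
                            instance_id, device="/dev/sdf"):
    """Create a volume from the snapshot and attach it to one instance."""
    vol = ec2_dst.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,  # must match the instance's zone
    )
    ec2_dst.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2_dst.attach_volume(
        VolumeId=vol["VolumeId"], InstanceId=instance_id, Device=device
    )
    return vol["VolumeId"]
```

The first two functions run once per version of the reference data; only the last one runs per instance, which is where the startup savings come from. With boto3 installed, the clients could be created as `boto3.client("ec2", region_name="us-east-1")` for the source and `boto3.client("ec2", region_name="us-west-2")` for the destination, where both region names are placeholders.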
Note that disk snapshots are not the only way to avoid downloading reference data each time. Snapshots work well when the reference data is large (tens of gigabytes or larger) and static. A shared filesystem can also be used to share data between instances. This can be considerably less work when the reference data changes frequently, since you don’t need to create a new snapshot each time the data changes. For light file access at small scale in a single region, one option is to use CycleCloud to launch a cloud instance that acts as a file server. At larger scales, our partners at Avere offer a scalable cloud NAS product that can deliver extra performance as requirements grow. For multiple-region computation, mirroring data to all regions is an alternative to creating snapshots.
The right answer depends on your application and architecture needs. Cycle Computing’s Solutions Architects have the experience to help you find the right approach for optimizing your cloud computing.
Sidebar: creating EBS Snapshots from CycleCloud volumes
When creating a snapshot for reference data, we suggest making the source volume the same size as the destination volume. In other words, if you plan to have your reference data on a 200 gigabyte volume when attached to execute nodes, create the snapshot from a 200 gigabyte volume. CycleCloud uses Linux’s Logical Volume Manager (LVM) technology to create and manage disks, and the LVM metadata includes the volume size. Creating the snapshot from a smaller volume (e.g. 100 gigabytes) will result in your 200 gigabyte volume appearing to the operating system as a 100 gigabyte volume.
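A size mismatch like this is easy to catch before any execute node starts. The sketch below is one possible pre-flight check, assuming a boto3-style EC2 client; the function name `check_snapshot_size` and the IDs in the usage note are hypothetical. EC2 reports a snapshot's source volume size in the `VolumeSize` field, measured in GiB.

```python
def check_snapshot_size(ec2, snapshot_id, planned_size_gib):
    """Raise ValueError if the snapshot's source volume size differs
    from the size planned for the execute-node volume.

    `ec2` is an EC2 client, e.g. boto3.client("ec2"). This helper is an
    illustrative assumption, not part of CycleCloud.
    """
    resp = ec2.describe_snapshots(SnapshotIds=[snapshot_id])
    # VolumeSize is the size of the volume the snapshot was taken from, in GiB.
    source_size_gib = resp["Snapshots"][0]["VolumeSize"]
    if source_size_gib != planned_size_gib:
        raise ValueError(
            f"snapshot {snapshot_id} was taken from a {source_size_gib} GiB "
            f"volume, but the planned volume is {planned_size_gib} GiB"
        )
    return source_size_gib
```

For example, `check_snapshot_size(ec2, "snap-0123", 200)` would raise if the hypothetical snapshot `snap-0123` had been taken from a 100 GiB volume, flagging the problem before the cluster configuration is deployed.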