While public clouds have gained a reputation as strong performers for batch and throughput-based workloads, we often still hear that clouds don’t work for “real” or “at scale” high performance computing applications. That isn’t necessarily true, however, as Microsoft Azure has continued its rollout of InfiniBand-enabled virtual machines. InfiniBand is the most common interconnect among TOP500 supercomputers, and Microsoft has deployed the fast and stable iteration known as FDR InfiniBand. Best of all, these exceptionally high levels of interconnect performance are now available to everyone on Azure’s new H-series and N-series virtual machines.
To see how well Azure’s InfiniBand works, we benchmarked LAMMPS, an open source molecular dynamics simulation package developed by Sandia National Laboratories. LAMMPS is widely used across government, academia, and industry, and is frequently a computational tool of choice for some of the most advanced science and engineering teams. LAMMPS relies heavily on MPI to achieve sustained high performance on real-world workloads, and can scale to many hundreds of thousands of CPU cores.
Armed with H16r virtual machines, we ran the Lennard-Jones (“LJ”) liquid benchmark under two scenarios: “weak scaling,” in which every core simulated 32,000 atoms no matter how many cores were used, and “strong scaling,” which used a fixed problem size of 512,000 atoms with an increasing number of cores. Both scenarios simulated 1,000 time steps. We performed no “data dumps” (i.e., intermediate output to disk) in order to isolate solver performance, and ran 30 test jobs per data point to obtain statistically meaningful averages.
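To make the difference between the two scenarios concrete, here is a small sketch (not from the original post; the core counts are illustrative) of how the total and per-core atom counts behave in each case:

```shell
#!/bin/bash
# Weak scaling: 32,000 atoms per core, so the total problem grows with cores.
# Strong scaling: a fixed 512,000-atom problem, so per-core work shrinks.
# Both atom counts come from the article; the core counts below are assumed.
for cores in 16 64 256 1024; do
    weak_total=$((32000 * cores))        # total atoms in the weak-scaling run
    strong_per_core=$((512000 / cores))  # atoms per core in the strong-scaling run
    echo "$cores cores: weak total=$weak_total atoms, strong per-core=$strong_per_core atoms"
done
```

Note that at 16 cores the two scenarios coincide (16 × 32,000 = 512,000 atoms), which makes the smallest run a natural baseline for both curves.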
In summary, the results were impressive on both scaling approaches. For the weak scaling, we were able to increase the problem size by 64x while only having a 1.7x increase in wallclock time. For the strong scaling, we saw a 17x decrease in wallclock time as we scaled the problem from 16 to 1,024 CPU cores. If you want to see all of our testing parameters and more about Azure H-series, please see details at the bottom of this post.
As described in the Wikipedia article on scalability, in an ideal weak scaling scenario, wallclock time would remain constant as processing capability and problem size increase commensurately. In practice, however, we may see an increase in runtime as the problem domain grows, due to overhead from network and inter-process communication within a given application. The chart below shows how the elapsed wallclock time increased with the number of cores. The results show that LAMMPS indeed performs quite well in a weak scaling scenario on Azure’s H-series virtual machines: a 64x increase in the number of total atoms simulated, but only a 1.7x increase in wallclock time.
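One common way to summarize this is weak-scaling efficiency, the baseline wallclock time divided by the wallclock time at scale. A minimal sketch (not from the original post), using the 1.7x increase reported above:

```shell
#!/bin/bash
# Weak-scaling efficiency: e = t(baseline) / t(N cores).
# The 1.7x wallclock increase at 64x the problem size is from the article.
t_base=1.0   # normalized wallclock at the smallest core count
t_big=1.7    # normalized wallclock at the largest core count
awk -v a="$t_base" -v b="$t_big" \
    'BEGIN { printf "weak-scaling efficiency: %.0f%%\n", 100 * a / b }'
# prints "weak-scaling efficiency: 59%"
```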
In an ideal strong scaling scenario, wallclock time would decrease as processing capability increases while the problem size remains constant (e.g., 10 CPUs needing only 10% of the time to complete a simulation as compared to 1 CPU). In practice, however, perfectly linear strong scaling is difficult and uncommon due to high demands on MPI communication, high-speed networks, cache coherence, and Amdahl’s Law. Still, Azure’s H-series VMs produced encouraging results for our strong scaling benchmark, yielding a 17x decrease in wallclock time as we scaled the problem from 16 to 1,024 CPU cores.
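The corresponding strong-scaling figure of merit is parallel efficiency: speedup divided by the increase in core count. A quick sketch (not from the original post), using the 17x speedup from 16 to 1,024 cores reported above:

```shell
#!/bin/bash
# Strong-scaling parallel efficiency: e = speedup / (core-count ratio).
# The 17x speedup from 16 to 1,024 cores is from the article.
speedup=17
core_ratio=$((1024 / 16))   # 64x more cores
awk -v s="$speedup" -v c="$core_ratio" \
    'BEGIN { printf "strong-scaling parallel efficiency: %.0f%%\n", 100 * s / c }'
# prints "strong-scaling parallel efficiency: 27%"
```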
Using InfiniBand with CycleCloud
CycleCloud requires no special configuration to use Azure’s InfiniBand instances. For these tests, we used our standard Open Grid Scheduler cluster type with autoscaling enabled. The InfiniBand-enabled CentOS image provided by Azure includes the Intel MPI libraries; however, you will want to install the Intel MKL packages to optimize performance. You can use CycleCloud’s Cluster-Init technology to install the software and the license file on your instances.
Your Open Grid Scheduler submit file should set the Intel MPI environment variables:
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
export I_MPI_DYNAMIC_CONNECTION=0
export I_MPI_DEBUG=3
# THIS IS A MANDATORY ENVIRONMENT VARIABLE AND MUST BE SET BEFORE RUNNING ANY JOB
# Setting the variable to shm:dapl gives best performance for some applications
# If your application doesn't take advantage of shared memory and MPI together, then set only dapl
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER=ofa-v2-ib0
Each Azure H16r virtual machine uses two Intel Xeon E5-2667 v3 processors based on the “Haswell” architecture, which operate at 3.2 GHz for non-AVX instructions with all cores under load. Each processor features 8 physical CPU cores, and Hyper-Threading is disabled by default on Azure. Each H16r VM also features 112 GB of DDR4 RAM and 2 terabytes (TB) of local SSD storage.
Azure’s FDR InfiniBand network is based on technology from Mellanox and operates at 56 gigabits per second (approximately 54 gigabits per second under real-world conditions due to its 64b/66b encoding scheme), with port-to-port latency of approximately 170 nanoseconds.
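As a quick sanity check on that effective-bandwidth figure (a sketch added here, not from the original post): with 64b/66b encoding, 64 of every 66 bits on the wire carry payload, so the 56 Gbit/s signaling rate yields roughly 54 Gbit/s of usable bandwidth.

```shell
#!/bin/bash
# Effective FDR bandwidth after 64b/66b encoding overhead:
# 56 Gbit/s signaling rate * 64/66 payload fraction.
awk 'BEGIN { printf "effective bandwidth: %.1f Gbit/s\n", 56 * 64 / 66 }'
# prints "effective bandwidth: 54.3 Gbit/s"
```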
Jobs were submitted with Open Grid Scheduler version 2011.11. All jobs were launched from a common NFS filer hosted in the same region as the virtual cluster.
Each data point in this article represents the average of 30 test jobs, run to obtain statistically meaningful results.
For this study, we used a “pure MPI” stable build of LAMMPS from the November 2016 release. We used one MPI process per core, for a total of 16 MPI processes per VM.
Interested in running your MPI workloads on Microsoft Azure? Contact us for more information.