We’ve seen how disruptive the cloud has been for compute workflows because of its elasticity and vast scale, but IEEE Big Data 2016 revealed that the cloud has been disruptive for Big Data workloads for other reasons.
I attended the conference to present work done by NASA in collaboration with Cycle Computing entitled “Using Cloud Bursting to Count Trees and Shrubs in Sub-Saharan Africa”. The work describes large-scale analysis of satellite imagery using our event-based workflows. The session agenda favored downstream analysis, but the audience showed interest in data provenance and metadata awareness in the primary batch workflow.
George Percivall from the Open Geospatial Consortium detailed how we’re entering the era of data that is “born connected”. Users, services, and applications will derive value from linking various PB-scale data repositories. Whether they hold climate data, geological data, satellite imagery, or surface sensor readings, these data repos will be created with rich metadata and served in the cloud. Mr. Percivall and his organization advocate for owners of these large datasets to share them in the cloud with consistent standards so that they’re accessible and consumable.
A complementary and very thought-provoking presentation by Brian Wilson from JPL about SciSpark, JPL’s Spark variant, showed us what a companion analysis architecture for these massive linked datasets is likely to look like. There has been consolidation around Spark-based analysis architectures in the last several years. Mr. Wilson described how Spark can be used optimally for just about any kind of big data analysis, given a careful design of the backend filesystem. He pointed out that for dense, well-structured data, HDFS is extremely efficient and scalable. For sparse data, however, backing Spark with a Cassandra database, together with some variable decomposition techniques, has clear benefits. And for very large-scale datasets, Cassandra can be replaced with cloud-native blob storage, with some trade-off between speed and scale.
These presentations, including those in our own session, present a remarkable future of applications powered by incredibly rich planetary data. We’re inspired to push forward and integrate our event-based and batch tools with flexible and reliable Spark architectures. Each session reinforced the theme that the accessibility of large datasets makes the cloud disruptive, not only because of its scale but also as a data collaboration platform.