Elastically scalable big data clusters that respond to varying workload demands while efficiently utilizing and sharing cloud resources are attainable with Hadoop on OpenStack. Achieving this requires separating cluster compute from cluster storage so that compute can scale independently of data. In this session we discuss how OpenStack Swift can serve as the basis for an elastically scalable Hadoop cluster on OpenStack and detail the challenges of using Swift as the primary data store for big data. We describe the cluster storage design and the enhancements to the Hadoop Swift file system implementation that are necessary to achieve performance at big data scale.
We present successful approaches to a number of these challenges:
- Storage architecture design addressing object, block, and transient storage
- Hadoop SwiftFS enhancements to handle tens of thousands to millions of objects
- Vendor-specific support for Swift API implementations (Ceph)
- Tool ecosystem interoperability
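
To make the integration point concrete, the sketch below shows how a Hadoop job might address Swift through the hadoop-openstack SwiftNativeFileSystem; the service name ("demo"), auth endpoint, credentials, and container are placeholders, and the exact configuration keys depend on the Hadoop and SwiftFS versions deployed.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch: listing objects in a Swift container through Hadoop's
 * Swift filesystem support. Service name, credentials, endpoint, and
 * container are all assumptions, not values from the session.
 */
public class SwiftFsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Bind the swift:// scheme to the Hadoop Swift filesystem client.
        conf.set("fs.swift.impl",
                 "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem");

        // Keystone endpoint and credentials for a service definition named "demo".
        conf.set("fs.swift.service.demo.auth.url",
                 "http://keystone.example.com:5000/v2.0/tokens");
        conf.set("fs.swift.service.demo.tenant",   "hadoop");
        conf.set("fs.swift.service.demo.username", "hadoop-user");
        conf.set("fs.swift.service.demo.password", "secret");

        // Paths take the form swift://<container>.<service>/<object-prefix>.
        Path data = new Path("swift://mycontainer.demo/input/");
        FileSystem fs = FileSystem.get(URI.create(data.toString()), conf);

        // List the objects that a MapReduce job would see as its input.
        for (FileStatus status : fs.listStatus(data)) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}
```

Listings like this are exactly where container sizes of tens of thousands to millions of objects stress the stock SwiftFS client, which motivates the enhancements covered in the session.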