During the day when you are reading this, more data will be produced than the amount of information contained in all printed material in the world. The International Data Corporation (IDC) estimated that data would grow by a factor of 300 between 2005 and 2020, rising from 130 exabytes to 20,000 exabytes. This Data Deluge revolutionizes both business, which now capitalizes on the value hidden in large data collections, and the process of scientific discovery, which moves towards a new paradigm: Data Science. Consequently, applications need to scale and distribute their processing in order to handle overwhelming volumes, high acquisition velocities or great varieties of data. These challenges are associated with what is called “the Big Data phenomenon”.
One factor that accelerated the Big Data revolution, and which emerged alongside it, is cloud computing.
The large, multi-site infrastructure of clouds, which enables co-locating computation and data, together with their on-demand scaling, provides an attractive option for supporting Big Data scenarios. Clouds bring to life the illusion of a (more-or-less) infinitely scalable infrastructure managed through a fully outsourced service, allowing users to avoid the overhead of buying and managing complex distributed hardware. Users can thus focus directly on extracting value, renting and scaling their services for better resource utilization, according to the application’s processing needs and geographical distribution layout.
Typical cloud Big Data scenarios (e.g., MapReduce, workflows) require partitioning and distributing the processing across as many resources as possible, and potentially across multiple data centers. The need to distribute the processing geographically arises for multiple reasons, ranging from the size of the data (exceeding the capacity of a single site) to the distant locations of the data sources or the nature of the analysis itself (crossing multiple service instances). Therefore, the major feature of such data-intensive computation on clouds is scalability, which translates into managing data in a highly distributed fashion. Whether the processing is performed within a single site or across multiple data centers, the input needs to be shared across the parallel compute instances, which in turn need to share their (partial) results.
To a great extent, the most difficult and compelling challenge is to achieve high performance for large-scale data management, and thereby enable acceptable execution times for the overall Big Data processing.
The cloud technologies now in operation are relatively new and have not yet reached their full potential: many capabilities are still far from being fully exploited. This particularly impacts data management, which remains far from meeting the increasingly demanding performance requirements of applications. High cost, low I/O throughput and high latency are among the major issues. Clouds primarily provide data storage services that are optimized for high availability and durability, while performance is not the primary goal. Some data functionalities, such as data sharing or geographical replication, are supported only as a “side effect”, while many others are missing: geographically distributed transfers, cost optimizations, differentiated quality of service, and customizable trade-offs between cost and performance. All this suggests that data-intensive applications are often costly (time- and money-wise) or hard to structure because of difficulties and inefficiencies in cloud data management. In this landscape, providing diversified and efficient cloud data management services is a key milestone for Big Data applications.