DataCloud is a Horizon 2020 project that develops methods for the entire lifecycle of Big Data pipelines on diverse infrastructures for efficient processing and monitoring.
About
The EU-funded H2020 project DataCloud introduces a groundbreaking paradigm for managing the complete lifecycle of Big Data pipelines, covering discovery, design, simulation, provisioning, deployment and adaptation across the computing continuum. It allows Big Data pipelines to interconnect end-to-end industrial operations, from collecting and preprocessing data to realising a business target, using heterogeneous infrastructures as part of a common compute continuum.
DataCloud develops novel methods to support the complete lifecycle of Big Data pipeline processing, enabling their discovery, definition, model-based analysis and optimization, simulation, deployment, adaptive run-time and monitoring on top of decentralized heterogeneous infrastructures across the computing continuum.
The Challenge
In DataCloud, we focus on the cloud-edge continuum and aim to provide benefits by limiting data transfer. We therefore found it important to host the deployed applications close to the source data, which led us to EGI cloud providers in Italy, Spain and Portugal. In addition, the free cloud resources available for our evaluation testing were an important benefit, as they saved us valuable resources compared with using AWS or Azure.
The Solution
In DataCloud, we used the EGI Cloud Compute and the cloud-based EGI Online Storage to distribute computational tasks across a scalable compute platform and to store intermediate results from the user jobs.
In addition, we used both the VMOps Dashboard and the Infrastructure Manager Dashboard to configure our resources and perform our tests. The EGI Check-In was used for authorized access to both portals and to the underlying distributed compute infrastructure. DataCloud used the EGI Applications Database to configure and deploy underlying services that we needed as prerequisites.
We then set up a federated Kubernetes cluster using EGI resources from multiple cloud providers across Europe. For the federated cluster, we used Submariner to enable direct networking between Pods and Services in different Kubernetes clusters. This provides the compute continuum that the DataCloud Toolbox uses to deploy and manage the Big Data pipelines.
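To illustrate how a Service in one cluster can become reachable from another in such a federation, here is a minimal sketch in Python. It assumes Submariner's service discovery, which follows the Kubernetes Multi-Cluster Services API: a Service is exported through a ServiceExport object and is then resolvable under the clusterset.local domain. The service name "pipeline-step" and the namespace "datacloud-demo" are hypothetical placeholders, not taken from the actual DataCloud setup.

```python
import json

# Domain suffix defined by the Kubernetes Multi-Cluster Services API.
CLUSTERSET_DOMAIN = "svc.clusterset.local"


def service_export(name: str, namespace: str) -> dict:
    """Build a ServiceExport manifest that marks an existing Service
    (same name and namespace) for discovery from other clusters."""
    return {
        "apiVersion": "multicluster.x-k8s.io/v1alpha1",
        "kind": "ServiceExport",
        "metadata": {"name": name, "namespace": namespace},
    }


def clusterset_dns_name(name: str, namespace: str) -> str:
    """Return the DNS name under which an exported Service is resolvable
    from the other clusters in the cluster set."""
    return f"{name}.{namespace}.{CLUSTERSET_DOMAIN}"


if __name__ == "__main__":
    # Hypothetical names, used for illustration only.
    manifest = service_export("pipeline-step", "datacloud-demo")
    print(json.dumps(manifest, indent=2))
    print(clusterset_dns_name("pipeline-step", "datacloud-demo"))
```

The printed JSON manifest could be applied with kubectl in the cluster that hosts the Service. Pods in the other joined clusters would then address the Service by the printed DNS name.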
For our testing, we used the vo.access.egi.eu virtual organization. We tested our setup with cloud providers mainly in France, Italy and Spain, but also with resources in other countries across Europe, such as Slovakia, Poland and Portugal.
Through EGI-ACE, the DataCloud team got access to European Cloud resources that were used to create distributed cloud-edge continuum testbeds for the execution of realistic scenarios from five use cases across different domains.