Pangeo is a collaborative, open-source project revolutionising big data analysis in geoscience




What is Pangeo?
The exponential growth of data across scientific fields presents substantial opportunities and challenges. In geoscience, Earth observation satellites and other sources generate vast amounts of data daily, offering unprecedented opportunities to gain insights into Earth’s systems, from climate change and ecosystem dynamics to natural disasters, resource management, and climate adaptation strategies. However, the volume and complexity of these data streams present significant challenges for researchers: managing, processing, and analysing such datasets requires sophisticated computational tools and scalable methodologies. When these techniques are open, reproducible, and adaptable, they accelerate reuse and foster cross-disciplinary innovation. These challenges are not unique to geoscience but resonate across multiple scientific domains as the demand for data-driven methodologies grows.
Recognising these challenges, EGI has played a crucial role in supporting the deployment of Pangeo@EOSC, providing the underlying computing and storage infrastructure that powers this innovative platform. Through projects like EGI-ACE and C-SCALE, EGI has facilitated seamless integration of the Pangeo ecosystem within the European Open Science Cloud (EOSC), ensuring that European researchers gain equitable access to scalable computational resources for big data geoscience. By leveraging EGI Cloud Compute, EGI Online Storage, and EGI Check-in for secure access, Pangeo@EOSC has been able to provide researchers with a powerful, federated environment for analysing and visualising large-scale datasets. This partnership is a key step in enabling open, reproducible, and collaborative research, fostering scientific advancements across multiple disciplines.
Pangeo is a collaborative, open-source project revolutionising big data analysis in geoscience. It provides tools and infrastructure for scientists to access, analyse, and visualise massive datasets. Pangeo deployments were initially US-centric; the Pangeo@EOSC project is bringing this powerful platform to Europe via the European Open Science Cloud (EOSC). This initiative not only improves data accessibility and streamlines research workflows for European geoscientists but also fosters open science practices. Furthermore, Pangeo@EOSC’s flexible design has potential applications beyond geoscience, offering a valuable resource for other data-intensive fields like bioimaging, astrophysics, and AI/machine learning, and promoting interdisciplinary collaboration and innovation.
Pangeo Deployment on EOSC
Pangeo@EOSC significantly advances open science by addressing key challenges researchers face in implementing FAIR principles. Integrating core Pangeo technologies like Jupyter notebooks, Dask, and Xarray within the European Open Science Cloud (EOSC) provides a much-needed, accessible platform for data-driven geoscience research. This addresses the previous lack of such resources in Europe, where researchers were hampered by geographically limited access to existing Pangeo deployments. Pangeo@EOSC fosters collaboration and empowers researchers to more easily share, access, and reuse data, directly supporting the core tenets of open science and accelerating scientific discovery.
In collaboration with EGI, the Pangeo Europe Community deployed a DaskHub comprising a Dask Gateway and JupyterHub, supported by a Kubernetes cluster within EOSC via the EGI Federation infrastructure. This deployment leverages EGI Check-in for user registration, enabling authenticated access to the Pangeo JupyterHub portal and underlying compute infrastructure. Additionally, it utilises EGI Cloud Compute and EGI Online Storage to efficiently distribute computational tasks across a scalable platform, with intermediate results stored for further analysis.

Figure: Deployment of the Pangeo platform with Dask Gateway on EOSC.
These resources enable users to tackle realistic, large-scale data analysis problems and to work with tools like Xarray and Dask in a real-world context. Because the platform is shared, users do not need to set up infrastructure or install packages individually. The collaborative environment also fosters interaction among researchers, Research Software Engineers, and Open Science practitioners, ultimately advancing a culture of knowledge sharing and Open Science.
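As an illustration of how a user might work with the DaskHub described above, the following is a minimal sketch of requesting a Dask cluster through Dask Gateway from a notebook session. It is not taken from the Pangeo@EOSC configuration: the helper name start_dask_cluster and the default of four workers are hypothetical, and the sketch assumes that the hub preconfigures the gateway address so that Gateway() can be created without arguments.

```python
def start_dask_cluster(gateway=None, n_workers=4):
    """Request a Dask cluster via Dask Gateway and return (cluster, client).

    `gateway` is any object exposing `new_cluster()`. When it is omitted,
    a `dask_gateway.Gateway` is created; this assumes the JupyterHub
    environment already knows the gateway address (an assumption, not a
    documented property of Pangeo@EOSC).
    """
    if gateway is None:
        # Imported lazily so the helper can be exercised without the package.
        from dask_gateway import Gateway
        gateway = Gateway()

    cluster = gateway.new_cluster()  # ask the gateway to start a new cluster
    cluster.scale(n_workers)         # request a fixed number of workers
    client = cluster.get_client()    # client that submits work to this cluster
    return cluster, client
```

In a notebook on such a hub, one would typically call cluster, client = start_dask_cluster() before running Xarray or Dask computations, and call cluster.shutdown() afterwards to release the resources.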
Pangeo Community of Practice
The Pangeo@EOSC initiative is not only a technical deployment but also a vibrant community of practice fostering collaboration and knowledge sharing among researchers, educators, engineers, and developers. This collaborative environment drives innovation, leading to the development of new tools like the xDGGS package for geospatial data processing, crucial for initiatives like Destination Earth. Beyond tool development, Pangeo@EOSC emphasises best practices for open science, training researchers and partnering with resources like the Environmental Data Science Book (EDS Book) to promote the creation and publication of FAIR, reusable Jupyter Notebooks. By championing these practices and providing a space for collaboration, Pangeo@EOSC strengthens the European research ecosystem and empowers researchers to build upon each other's work, accelerating scientific progress.
Pangeo Training Infrastructure as a Service (PTIaaS) revolutionises Pangeo training by providing a dedicated, instantly usable, and scalable environment. Instead of participants struggling with setup, PTIaaS offers a pre-configured Pangeo JupyterHub instance tailored for training, complete with necessary resources, software, and sample data. This allows instructors to focus on teaching and participants to immediately engage with hands-on exercises and collaborative learning. This on-demand and customisable infrastructure is freely available via a simple request form, removing the technical barriers to Pangeo education and ensuring a consistent, reproducible learning experience for all.
Next Steps
Pangeo Europe and Pangeo@EOSC have successfully established a collaborative ecosystem for large-scale data analysis, advancing open science. Building on this success and its expanding cross-disciplinary reach, particularly in bioimaging and cosmology, the next steps involve several key areas. Pangeo will prioritise leadership in federated computing and green computing practices, including optimising workflows and supporting GPU acceleration for computationally intensive research. Further integration with EOSC resources and a focus on interoperability are crucial, along with tailored training programmes to broaden user adoption. Showcasing domain-specific use cases in diverse fields like climate resilience and bioimaging will demonstrate practical value. Finally, building strategic partnerships across sectors and investing in FAIR data practices will ensure long-term sustainability and maximise the impact of the Pangeo ecosystem.