We interviewed Maarten Litmaath, WLCG Operations Coordination co-chair, who gave us insights into the development of the Worldwide LHC Computing Grid.
Starting around the beginning of this century, many projects were trying to create technologies that would help make “the grid” happen. It quickly became clear there would never be just a single grid, but rather a collection of partially overlapping grids: sites could be members of several at the same time, and virtual organisations might be able to use several at the same time, provided those grids were sufficiently compatible. For the LHC Computing Grid (LCG), it was clear we needed middleware and services that could work globally, and we thus needed to ensure sufficient compatibility between European and US contributions in particular. As CERN had masterminded the European DataGrid (EDG) project for the needs of the LCG in particular, the first versions of the LCG would largely be based on EDG components that were deemed necessary and ready at the time. Underlying those was the Globus Toolkit, packaged and enhanced through the Virtual Data Toolkit (VDT) developed as part of the NSF Middleware Initiative project. The (HT)Condor middleware has also played a big part since the very early days. Furthermore, some components were taken from the Data TransAtlantic Grid (DataTAG) project that had been specifically created to help ensure transatlantic compatibility.
The very first LCG release was called LCG-0 and became available on 28 February 2003, more than 20 years ago now! It was mainly supposed to serve in testing the deployment procedures themselves but was actually used for production work as well. Sites could only set up Computing Elements, Worker Nodes, simple Storage Elements based on GridFTP servers, and User Interface hosts. More complex services for the LHC experiments were only running at CERN. Half a year later, the first LCG-1 release became available and contained more EDG components, in particular for workload management, which could also be deployed at other participating sites. Another half year later, in early 2004, the first LCG-2 release became available, with many improvements, some of which were backward-incompatible. The final LCG release, 2.7.0, became available in early 2006.
Meanwhile, to distinguish the naming of the middleware releases from that of the infrastructure, the latter had been rebranded WLCG, the Worldwide LHC Computing Grid, on which other middleware stacks were also deployed, particularly in the US. During those early years, not only did the middleware steadily improve, but a number of crucial auxiliary services were also put into production, and those evolved as well. Examples include APEL, GGUS (with first- and second-level support to follow), GOCDB, security coordination, operations support and what is now the Operations Portal. Furthermore, the number of sites participating in the WLCG steadily rose, as did the resources at each site: the grid thus saw a steady, exponential increase in activity from the early days onward, and that trend has continued to this day. To help make and keep the grid sufficiently stable, regular (typically hourly) testing and monitoring of the services provided by the sites have played crucial roles since the very early days.
To help prepare the WLCG for dealing with the LHC data deluge, a number of Data/Service Challenges were orchestrated from 2004 through 2009 at ever-increasing levels that turned out to be more than sufficient in the end. However, we could not be entirely sure of that until the data-taking had started in the autumn of 2009!
As of April 2004, the EDG project was succeeded by three 2-year phases of the EGEE project, whose acronym came to mean Enabling Grids for E-sciencE, initially just in Europe, but ultimately around the world, except for the US, where our partner institutes operated under the umbrella of the Open Science Grid project, as they still do today. It was quite remarkable to see the EGEE project work with many grid projects around the world, a number of which were also funded by the EU, not only to foster collaboration, but also as a means of knowledge transfer to developing nations and regions. CERN also led the EGEE project, which focused in particular on creating the operational infrastructure that ultimately became EGI in May 2010.
However, it was clear from the start that further improvements of the grid middleware and adaptations, e.g. to new operating systems and other changes in the landscape, were vital as well. The new middleware generation was named gLite, conveying the intention for the grid (‘g’) software to be made more “lite-weight” than it had been so far.
As of spring 2005, the LCG releases steadily switched from EDG components to gLite counterparts as the latter became mature in gLite releases 1.0 through 1.5. Next, the LCG and EGEE middleware certification, testing and release teams were merged to allow a unified release to be made comprising the combined functionalities of gLite 1 and LCG-2: gLite 3.0, which became available in May 2006 for Scientific Linux (SL) 3. Two further major releases would follow, each with many tens of subsequent updates: gLite 3.1, whose components became available for SL 4 as of June 2007, and gLite 3.2, whose components became available for SL 5 as of March 2009. The history of the LCG and gLite releases has been captured in the LCG Deployment Releases Museum.
Though the needs of the LHC experiments were still driving the developments, in EGEE it was vital that any multi-institutional, potentially international e-science collaboration should be able to make use of grid middleware components to share data and/or computing resources. The idea was that if the middleware could work at the scale of the LHC experiments, then it should certainly work at the much smaller scales of other e-sciences at the time. While we did not manage to introduce high-level data or workflow management frameworks for other communities as part of gLite, we did make the building blocks of grid sites and central grid services mature enough for LHC data storage and processing to rely on them from the very start of data taking in 2009.
The LCG-0 deployment exercise concerned CERN and 10 other sites, mostly future Tier-1 centres. The same set of sites was also the first to deploy LCG-1, steadily followed by other early adopters.
By early 2004 there were 25 sites. By May 2005, 72 sites were already reporting to APEL; by November 2005, the number of sites running LCG-2 or gLite 1.x had grown to 144, and by the end of 2009, the number of EGEE sites running gLite 3.x had grown to 148.
For the network connections between all the sites participating in the WLCG, a conservative model was originally implemented: CERN was connected to each of the 11 initial Tier-1 centres through dedicated 10 Gbps links, while each Tier-2 site was ideally connected to a nearby Tier-1 site through arrangements negotiated between those sites and their network providers.
The concern was that networks might not evolve as quickly as computing and storage resources and should, therefore, be used as little and as efficiently as possible.
Fortunately, as WLCG requirements were seen to be overtaken by those from industry and society, network evolution accelerated to the extent that by early 2008, several Tier-1 centres already had direct connections between them as well, providing redundancy and extra capacity. The Tier-2 sites also profited from steadily improving conditions. Those trends have continued to this day.
At the end of the EGEE project, it was clear that grid middleware would need to continue evolving with the IT landscape as well as community requirements, while the infrastructure organisation and operations set up by EGEE had to evolve to their next levels, becoming based on National Grid Initiatives that would be coordinated through a new, long-lived European organisation: EGI.
By that time, most of the sites that make up the WLCG today were already established. The ones in EGI ran gLite 3.1 services on SL 4 and gLite 3.2 services on SL 5. The European Middleware Initiative (EMI) project was created to take the middleware to the next level and, in particular, make it available on SL 6. The project ran from May 2010 to May 2013. It produced three releases, comprising not only newer versions of gLite products but also NorduGrid ARC components that we still very much use today, UNICORE products for HPC communities, and finally dCache as an independent product, which had previously also been made available through gLite releases.
After EMI had finished, the middleware products that remained relevant for EGI sites continued to be maintained for many more years by the institutes at which those products had been developed. In recent years, some of those products began to be phased out in favour of replacements that can be considered better suited for the years to come. Examples include the CREAM CE and the DPM SE, both of which were very successful in their time.
The next big middleware project was INDIGO-DataCloud, which ran from April 2015 through September 2017 and was so forward-looking that its outcomes are still finding new and growing application areas today! Two examples are: 1) the steady move from X.509 and VOMS to federated identities and JSON Web Tokens for AAA, and 2) the steady increase in the use of containers, not only for service deployment but also for the encapsulation of grid job payloads, as sketched below. Supporting the advances in AAA is the EGI Check-in service, on which WLCG users and site admins will also come to depend much more in the coming years. In the meantime, some of the high-level data and job workflow management products initially developed for LHC experiments have become available to other communities, also thanks to EGI funding to help make that happen: CVMFS, DIRAC and Rucio. We see nice examples of collaboration activities between EGI and WLCG going in both directions!
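As an illustrative sketch only (the container image path, payload script and arguments below are hypothetical placeholders, not actual production values), a pilot job might encapsulate its payload in a container along these lines:

```python
import subprocess

# Hypothetical example: run a grid job payload inside a container image
# distributed via CVMFS; the image path and payload are placeholders.
image = "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/library/almalinux:9"
payload = ["python3", "analysis.py", "--input", "events.root"]

# 'apptainer exec <image> <command>' runs the command inside the container,
# decoupling the payload environment from the worker node operating system.
subprocess.run(["apptainer", "exec", image] + payload, check=True)
```

The point of such encapsulation is that the payload sees a reproducible environment, regardless of the operating system installed on the worker node.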
A distributed computing environment allows a research project to profit from multiple existing computing infrastructures and funding streams, ranging from regional to international. Each participating institute will see itself on the map for the projects in which it participates and will be able to show directly how its own funding is helping those projects, which may well increase buy-in from relevant parties. Obviously, there is also a price to pay: a distributed infrastructure is harder to set up and keep running than a single HPC facility, in which a project might get a large allocation of computing resources, presuming that the project requirements are a good match for costly HPC resources rather than HTC commodity resources. The WLCG project will probably need to make more use of HPC resources in the coming years, because some countries may prefer investing in supercomputers for various reasons, and there may then be less money for the much less fancy HTC resources that are a natural fit for WLCG.

A site participating in a distributed computing environment does not actually need to have its computing capacity on its premises. These days, a site may be better off renting such capacity in the cloud instead. At CERN and other WLCG sites, it was found that deploying resources on-site is still less expensive, but the expectation is for that to change in the coming years. Mind that one needs to take into account not just the hardware management and the IaaS on top, but also all the higher-level software layers (PaaS, SaaS, etc.) that have to be designed, integrated with auxiliary services, configured, operated, maintained and evolved: all that still has to be done by IT staff at each institute, though we can collaborate on some of those matters through shared projects and forums.

Also, mind that we have to take a different attitude to the crucial parts of the data produced by a research project: for such data, we had better not rely on commercial cloud providers, but rather on partnerships with well-established, publicly funded, research-driven institutes acting as the long-term custodians of such data. Such institutes can be seen as a new form of public libraries that host publicly funded data instead of books. In WLCG, that role is attributed to CERN and the O(10) Tier-1 centres spread over the globe.
To allow the LHC to be exploited to its full potential, it has been and will continue to be upgraded every few years until it has operated at a maximally feasible performance level during a final run that should see us into the early 2040s. To keep up with the LHC and thus be able to take data at ever-increasing rates, the experiments must keep improving their detectors, data acquisition systems, data processing software and, ultimately, the capacity of their WLCG resources.
The most recent LHC upgrade took place in Long Shutdown 2, from 2019 through early 2022, during which the ALICE and LHCb experiments implemented radical changes from their detectors all the way up to their data analysis software. The ATLAS and CMS experiments are preparing for their very extensive upgrades to happen in Long Shutdown 3, from 2026 through 2028. As the resources of these two experiments amount to about 70% of the WLCG, experts in those experiments, as well as in WLCG, have been looking into what can be done to try to ensure the capacity of the WLCG will still be sufficient by the time LHC Run 4 is expected to start in 2029.
The experiments will need to evolve their data handling to make use of more ingenious data formats that take less space and can be processed faster than what is feasible today, for example by taking advantage of object stores like Ceph and access protocols like S3, which are becoming ever more popular in industry and therefore at sites. They will also need to make increased use of alternative computing architectures like ARM and GPUs for faster and/or less expensive computing resources, also taking into account their energy consumption and carbon footprint. For each such architecture, not only must the software be made to run, but extensive physics validation campaigns are also required before the architecture can be trusted.
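To give a flavour of what S3-style access to a site object store could look like, here is a minimal sketch; the endpoint, credentials, bucket and object names are hypothetical, and boto3 is merely one of several possible S3 clients:

```python
import boto3

# Hypothetical endpoint and credentials for a Ceph RADOS Gateway (or any
# other service) exposing the S3 protocol; real values would be provided
# by the site through its own channels.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-site.org",
    aws_access_key_id="EXAMPLE_KEY_ID",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# Fetch a (hypothetical) data object and read its contents.
response = s3.get_object(Bucket="experiment-data", Key="run4/events-0001.root")
data = response["Body"].read()
print(f"Fetched {len(data)} bytes")
```

The attraction of such protocols is precisely that the same client code works against industry clouds and on-site object stores alike.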
Moreover, sites must be able to assess the performance of such architectures, pledge them to their experiments and see them correctly accounted for in APEL. Here, we will profit even more from the new, versatile HEPScore benchmark framework recently adopted by both WLCG and EGI, but a lot remains to be done in the coming years. More use will be made of HPC centres and cloud providers, with the provisos mentioned earlier in this article. Such resources must also be connected to LHCONE, the global overlay network for sites serving WLCG and related projects. The use of advanced IPv6 features is expected to become important, and IPv4 support will start getting phased out when it is no longer needed on a global scale for legacy workflows. The middleware will need to continue evolving, and the use of containers is expected to keep increasing. The transition from X.509 certificates and VOMS proxies toward federated identities and WLCG tokens is foreseen to be completed well before the start of LHC Run 4, by which time users should no longer need certificates, while tokens will be handled transparently by the middleware that the users interact with.
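To illustrate what such tokens are made of, the sketch below decodes the payload of a JSON Web Token using only the Python standard library. It is purely for inspection: which claims are present and how they must be validated is defined by the relevant token profiles, not by this example, and real middleware must of course verify the signature before trusting any claim.

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Return the (unverified) payload of a JWT, whose structure is
    header.payload.signature with base64url-encoded segments."""
    payload_b64 = token.split(".")[1]
    # Restore the base64url padding that JWTs strip off.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# 'token' would be obtained from a token issuer; standard claims include
# the issuer ("iss"), subject ("sub"), expiry ("exp") and scopes ("scope").
# claims = decode_jwt_payload(token)
# print(claims.get("iss"), claims.get("sub"), claims.get("exp"), claims.get("scope"))
```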
As has been the case throughout the lifetime of the WLCG project, these evolutions are also made possible thanks to many partners and projects, and they will keep benefiting not just the LHC experiments, but also many other communities!