
Workshop: The Trillion Parameter Consortium (TPC) – Accelerating the use of Generative AI for Science and Engineering
- 16 May 2024 (9:00-13:00)
- Venue: ISC in Hamburg, Germany
Abstract
Realizing the promise of large-scale ‘foundation’ models for scientific discovery, supporting activities ranging from self-driving laboratories to hypothesis generation, will require computation at unprecedented scale to train large AI models, coupled with the enormous, inherently multidisciplinary task of preparing diverse scientific data for use in model training. Simply put, only a relatively small number of organizations have the resources to build models at state-of-the-art scales (e.g., trillions of parameters, trained on tens of trillions of tokens).
These trends are motivating the formation of multi-institutional teams whose efforts can be accelerated by sharing strategies for model architecture, evaluation, and training, as well as by collaboratively building and sharing high-quality training data sets. This workshop will highlight such collaborations, which are being catalyzed by the international Trillion Parameter Consortium (TPC), along with progress in various aspects of generative AI for science and engineering, with presentations from academics, national laboratories, HPC centers, industry, institutes, and leaders from funding agencies.
This workshop will also provide a forum for continuing to develop a shared vision and goals for accelerating the use of generative AI for science and engineering, providing in particular an opportunity for participation by the European AI, HPC, and disciplinary science research communities. The workshop will introduce the structure and strategies of the TPC and give an overview of high-priority areas in which new collaborators can contribute to, and benefit from, the consortium.
Program
08:00-09:00 Welcome Coffee
09:00-11:00 Extreme-Scale AI+HPC for Science and the Trillion Parameter Consortium (TPC)
- Welcome: The Trillion Parameter Consortium (Charlie Catlett, Argonne National Laboratory / University of Chicago) [slides]
- Keynote: Building Frontier AI Systems for Science and the Path to Zettascale (Rick Stevens, Argonne National Laboratory / University of Chicago) [slides]
- Large-Scale AI Model Projects in Japan (Jens Domke, RIKEN) [slides]
- EuroHPC: AI in Europe (Alexandra Kourfali, EuroHPC JU)
- Panel: Government Initiatives at the Intersection of HPC and AI (Moderator: Fabrizio Gagliardi, BSC)
- Alexandra Kourfali (EU), Satoshi Matsuoka (Japan), Rick Stevens (US)
11:00-11:30 Coffee Break
11:30-13:00 Exemplars of Open Collaborations in Large AI Models for Science and Growing the Community
- It’s Time to Open Up Language Models (Noah Smith, Allen Institute for Artificial Intelligence / University of Washington) [slides]
- AI in Weather and Climate Prediction (Torsten Hoefler, ETHZ) [slides]
- Leveraging the Trillion Parameter Consortium to Create the Next Generation of AI Scientists (Valerie Taylor, Argonne National Laboratory / University of Chicago) [slides]
13:00-14:00 Lunch for All Workshop Participants

Left: Panel discussion on AI/HPC funding initiatives featuring Alexandra Kourfali (EU), Rick Stevens (US), Fabrizio Gagliardi (moderator), Satoshi Matsuoka (JP)
Center: Group picture
Right: Valerie Taylor (Argonne/UChicago).
Workshop Aims and Expected Outcomes
Today the science community has a growing number of projects aiming to harness generative AI, ranging from models trained for individual disciplines (e.g., UniverseTBD for astronomy or MAIRA-1 for radiological image analysis) to models aspiring to multi-disciplinary use (e.g., OLMo, AuroraGPT). There are also new data sources being released, such as Dolma, and many efforts to identify scientific data (literature, databases, time series data, etc.) and transform those data for use in training AI models. There has been an explosion of new activities, each developing methods and tools to grapple with common challenges, from evaluation (for trustworthiness, safety, bias, performance, etc.) to training and data preparation workflows. Concurrently, the enormous cost of computation for model training limits the number of groups that can realistically build and train large-scale models. These trends offer incredible opportunities to collaborate, both to accelerate progress toward optimal tools and methods and to reduce duplication of effort, such as in the preparation of new scientific data sources. Additionally, many new challenges are coming to the fore with generative AI, including new ways of thinking about data sharing and attribution, about licensing artifacts, and about the importance of responsible development of safe, ethical AI models. An overarching need today in AI is openness, which includes open data, open-source code for tools and workflows, and careful thought as to how, when, and whether to open the models themselves. Such openness is critical to progress in every area related to generative AI.
The enormity of these challenges, and of the resources needed for data preparation, pre-training new models, and responsibly preparing them for downstream applications, has meant that progress is largely concentrated in industry, where there is limited, and in some cases no, visibility into the artifacts (models, data sets) or the processes used to create them. This underscores the need for collaboration in the open science community, which is central to the motivation behind creating the international Trillion Parameter Consortium (TPC). The workshop aspires to stimulate new thinking, attract scientists to some of the emerging challenges associated with generative AI, and potentially catalyze the formation of new topical collaborative working groups that can be supported by TPC.
Impact
The trends discussed above put HPC at the center of the community’s quest to use generative AI for science; this workshop brings together HPC and AI experts to discuss current trends and breakthroughs and to explore potential collaborations. The international Trillion Parameter Consortium (TPC) was formed in 2023 for this very purpose: to provide the community with a venue for identifying and pursuing collaborations that will accelerate progress and enable the broadest possible community to benefit from the limited HPC resources available to train large AI models.
A fundamental goal of this workshop is to attract early- and mid-career scientists, inclusive of the breadth of the ISC community, from HPC providers to computer and computational scientists to disciplinary scientists, and encompassing those involved in career development and training programs.
For more information about the Trillion Parameter Consortium (TPC) please visit TPC.dev. If you are interested in participating in the TPC community, please contact us here.
TPC will convene scientists from around the world for tutorials, plenary sessions, and working group breakouts at the European TPC Kick-Off workshop hosted by the Barcelona Supercomputing Center, 19-21 June 2024.
Co-Organizers
- Fabrizio Gagliardi, Barcelona Supercomputing Center
- Kimmo Koski, CSC – IT Center for Science
- Per Öster, CSC – IT Center for Science
- Alfonso Valencia, Barcelona Supercomputing Center
- Mateo Valero, Barcelona Supercomputing Center
- Charlie Catlett, Argonne National Laboratory and the University of Chicago
- Ian Foster, Argonne National Laboratory and the University of Chicago
- Satoshi Matsuoka, RIKEN
- Irina Rish, Université de Montréal and MILA
- Noah Smith, Allen Institute for Artificial Intelligence and the University of Washington
- Rick Stevens, Argonne National Laboratory and the University of Chicago
- Valerie Taylor, Argonne National Laboratory and the University of Chicago
Abstracts and Speaker Biographies
Building Frontier AI Systems for Science and the Path to Zettascale (Rick Stevens, Argonne National Laboratory / University of Chicago)
Abstract: The successful development of transformative applications of AI for science, medicine, and energy research will have a profound impact on the world. The rate of development of AI capabilities continues to accelerate, and the scientific community is becoming increasingly agile in using AI, leading us to anticipate significant changes in how science and engineering goals will be pursued in the future. Frontier AI (the leading edge of AI systems) enables small teams to conduct increasingly complex investigations, accelerating tasks such as generating hypotheses, writing code, or automating entire scientific campaigns. However, certain challenges remain resistant to AI acceleration, such as human-to-human communication, large-scale systems integration, and assessing creative contributions. Taken together, these developments signify a shift toward more capital-intensive science, as productivity gains from AI will drive resource allocations to groups that can effectively leverage AI into scientific outputs, while others will lag. In addition, with AI becoming the major driver of innovation in high-performance computing, we also expect major shifts in the computing marketplace over the next decade, with a growing performance gap between systems designed for traditional scientific computing and those optimized for large-scale AI such as large language models. In part as a response to these trends, but also in recognition of the role of government-supported research in shaping the future research landscape, the U.S. Department of Energy has created the FASST (Frontier AI for Science, Security and Technology) initiative. FASST is a decadal research and infrastructure development initiative aimed at accelerating the creation and deployment of frontier AI systems for science, energy research, and national security. I will review the goals of FASST and how we imagine it transforming research at the national laboratories. Along with FASST, I’ll discuss the goals of the recently established Trillion Parameter Consortium (TPC), whose aim is to foster a community-wide effort to accelerate the creation of large-scale generative AI for science. Additionally, I’ll introduce the AuroraGPT project, an international collaboration to build a series of multilingual, multimodal foundation models for science that are pretrained on deep domain knowledge to enable them to play key roles in future scientific enterprises.
Speaker: Rick Stevens is a Professor of Computer Science at the University of Chicago and the Associate Laboratory Director of the Computing, Environment and Life Sciences (CELS) Directorate and Argonne Distinguished Fellow at Argonne National Laboratory. His research spans the computational and computer sciences from high-performance computing architecture to the development of advanced tools and methods. Recently, he has focused on developing AI methods for a variety of scientific and biomedical problems, and also has significant responsibility in delivering on the U.S. national initiative for Exascale computing and developing the DOE’s Frontiers in Artificial Intelligence for Science, Security, and Technology (FASST) initiative.
EuroHPC: AI in Europe (Alexandra Kourfali, EuroHPC JU)
Abstract: Large AI models are central to Europe’s digital sovereignty. It is imperative not only to embrace, but also to lead in, the research, development, and utilization of trillion-parameter AI models. Central to this strategy is the symbiotic relationship between these massive AI models and the hardware infrastructure they depend on. Supercomputers with integrated AI chips form the foundation upon which these models operate. In this talk, we will delve into EU efforts to boost innovation across the AI computing stack, discuss the supercomputing requirements to support massive AI models, and finally examine the convergence of AI and hardware in our processor initiatives and how this will lead to a solid roadmap towards unlocking the full potential of AI and accomplishing digital sovereignty in Europe.
Speaker: Alexandra Kourfali is a Programme Manager of Research and Innovation at the EuroHPC Joint Undertaking, focused on the Technology pillar. She received her MSc degree in Computer Engineering from the University of Thessaly, Greece, and her Ph.D. in Computer Engineering from Ghent University, Belgium, in 2019. Previously she held academic appointments at Ghent University, Stuttgart University, the Barcelona Supercomputing Center, and the European Space Agency, and non-academic appointments at Thales, Belgium. She has been an expert reviewer for the European Commission and for IEEE and ACM journals, and has co-authored patents, journal articles, and conference papers. She has served on the program and organizing committees of several IEEE conferences. Her interests include high-performance computing, reconfigurable computing, hardware reliability, and computer architectures with an emphasis on RISC-V. She is a member of HiPEAC and IEEE.
Updates on Efforts to Pre-train LLMs in Japan (Jens Domke, RIKEN)
Abstract: There are multiple efforts to pre-train LLMs in Japan, both from scratch on multilingual data and also continually from open models like Llama and Mistral. Training from scratch has the benefit of being able to control the entire training pipeline, which is crucial when studying the effects of the training data on downstream performance and safety. On the other hand, continual pre-training allows one to leverage all the data that was used to train the best open models, which is more effective if the objective is to train the best model under a limited budget.
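To make the distinction concrete, the following is a minimal sketch (not taken from the talk) contrasting from-scratch pre-training with continual pre-training of an open model, assuming the Hugging Face transformers library; the checkpoint name is an illustrative placeholder, not a statement about which models the Japanese efforts use.

```python
# Minimal sketch: from-scratch vs. continual pre-training of a causal LM.
# Assumes the Hugging Face transformers library; the checkpoint name is illustrative.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # illustrative open checkpoint

# From scratch: same architecture, randomly initialized weights.
# Gives full control over the training pipeline and data mixture.
config = AutoConfig.from_pretrained(base)
scratch_model = AutoModelForCausalLM.from_config(config)

# Continual pre-training: start from the released weights and keep training
# on new data (e.g., additional multilingual corpora) under a limited budget.
continual_model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Either model is then optimized with the ordinary causal-LM objective, e.g.,
# via transformers.Trainer with DataCollatorForLanguageModeling(tokenizer, mlm=False).
```

Note that loading a multi-billion-parameter checkpoint as above requires substantial memory; in practice such training runs are distributed across many accelerators.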
Speaker: Jens Domke is the Team Leader of the Supercomputing Performance Research Team at the RIKEN Center for Computational Science (R-CCS), Japan. He received his doctoral degree from the Technische Universität Dresden, Germany, in 2017 for his work on HPC routing algorithms and interconnects. Jens started his career in HPC in 2008, after he and a team of five students from TU Dresden and Indiana University won the Student Cluster Competition at SC08. Since then, he has published dozens of peer-reviewed journal and conference articles. Jens contributed the DFSSSP and Nue routing algorithms to the InfiniBand subnet manager and built the first large-scale HyperX prototype at the Tokyo Institute of Technology. His research interests include system co-design; performance evaluation, extrapolation, and modelling; interconnect networks; and the optimization of parallel applications and architectures.
Panel: Government Initiatives at the Intersection of HPC and AI (Fabrizio Gagliardi, BSC, Moderator)
Panelists include Alexandra Kourfali (EU), Satoshi Matsuoka (Japan), and Rick Stevens (US).
Panelist: Professor Satoshi Matsuoka has been the director of the RIKEN Center for Computational Science (R-CCS), the Tier-1 national HPC center for Japan, since April 2018. R-CCS develops and hosts Japan’s flagship ‘Fugaku’ supercomputer, which was ranked the fastest supercomputer in the world in 2020 and 2021, and supports cutting-edge HPC research, including investigating post-Moore era computing, especially the future FugakuNEXT supercomputer. At the Tokyo Institute of Technology, where he holds a professor position pursuing research in HPC, scalable big data, and AI, he led the TSUBAME series of supercomputers, which received wide international acclaim. His longtime contributions were commended with the Medal of Honor with Purple Ribbon by His Majesty Emperor Naruhito of Japan in 2022. He is a Fellow of the ACM, ISC, IPSJ, and JSSST and has won numerous awards, including ACM Gordon Bell Prizes, the IEEE-CS Sidney Fernbach Award, and the IEEE-CS Seymour Cray Computer Engineering Award.
Moderator: Fabrizio Gagliardi is the Senior Strategy Advisor to the Barcelona Supercomputing Center Director’s Office. He was a senior scientist at CERN, the European particle physics laboratory, from 1975 to 2005. From 2000 to 2003 he was PI of the EU DataGrid project, and from 2003 to 2005 he was initiator and PI of EGEE 1 and 2. During this time he co-founded the Global Grid Forum (later the Open Grid Forum). Gagliardi was the founder and director of the first Grid School in 2003. From 2005 to 2013 he was the director for external research in LATAM and EMEA at Microsoft, joining the Barcelona Supercomputing Center in 2013 as Senior Advisor. He established the annual ACM HPC and AI summer school in Barcelona in 2019 and has initiated schools including the EU-ASEAN HPC school in Bangkok (2021 and 2022). He was a visiting professor at the Gran Sasso Science Institute from 2013 to 2015, and from 2009 to 2015 he founded and chaired the ACM Europe Council. He received the ACM Presidential Award in 2013 and 2018.
It’s Time to Open Up Language Models (Noah Smith, Allen Institute for Artificial Intelligence / University of Washington)
Abstract: Neural language models with billions of parameters and trained on trillions of words are powering the fastest-growing computing applications in history and generating discussion and debate around the world. Yet most scientists cannot study or improve those state-of-the-art models because the organizations deploying them keep their data and machine learning processes secret. I believe that the path to models that are usable by all, at low cost, customizable for areas of critical need like the sciences, and whose capabilities and limitations are made transparent and understandable, is radically open development, with academic and not-for-profit researchers empowered to do reproducible science. Projects like Falcon, Llama, MPT, and Pythia provide glimmers of hope. In this talk, I’ll share the story of the work our team is doing to radically open up the science of language modeling. So far, we’ve released Dolma, a three-trillion-token open dataset curated for training language models, and used it to pretrain OLMo v1, also publicly released. We’ve also built and released Tülu, a series of open instruction-tuned models. All of these come with open-source code and extensive documentation, including new tools for evaluation. Together these artifacts make it possible to explore new scientific questions and democratize control of the future of this fascinating and important technology. The work I’ll present was carried out primarily by a large team at the Allen Institute for Artificial Intelligence in Seattle, with collaboration from the Paul G. Allen School at the University of Washington and various kinds of support and coordination from many organizations, including the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, AMD, CSC – IT Center for Science (Finland), Databricks, and Together.ai.
Speaker: Noah Smith is a computer scientist working at the junction of natural language processing (NLP), machine learning (ML), and computational social science. He recently wrote Language Models: A Guide for the Perplexed, a general-audience tutorial, and he co-directs the OLMo open language modeling effort with Hanna Hajishirzi. His research spans core problems in NLP, general-purpose ML methods for NLP, methodology in NLP, and a wide range of applications. You can watch videos of some of his talks, read his papers, and learn about his research groups, Noah’s ARK and AllenNLP. Smith is most proud of his mentoring accomplishments: as of 2024, he has graduated 28 Ph.D. students and mentored 15 postdocs, with 25 alumni now in faculty positions around the world. Twenty of his undergraduate and master’s mentees have gone on to Ph.D. programs. His group’s alumni have started companies and are technological leaders both inside and outside the tech industry.
AI in Weather and Climate Prediction (Torsten Hoefler, ETHZ)
Abstract: Machine learning presents a great opportunity for climate simulation and research. We will discuss some ideas from the Earth Virtualization Engines summit in Berlin and several research results, ranging from ensemble prediction and bias correction of simulation output to extreme compression of high-resolution data, together with a vision towards affordable km-scale ensemble simulations. We will also discuss programming framework research to improve simulation performance. Specifically, our ensemble spread prediction and bias correction network, applied to global data, achieves a relative improvement in ensemble forecast skill (CRPS) of over 14%. Furthermore, we demonstrate on select case studies that the improvement is larger for extreme weather events. We also show that our post-processing can use fewer trajectories to achieve results comparable to the full ensemble. Our ML-based compression method achieves data reduction of 300x to more than 3,000x and outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE and MAE. It can faithfully preserve important large-scale atmospheric structures and does not introduce artifacts. When using the resulting neural network as a 790x compressed data loader to train the WeatherBench forecasting model, its RMSE increases by less than 2%. This three-orders-of-magnitude compression democratizes access to high-resolution climate data and enables numerous new research directions. We will close by discussing ongoing research directions and opportunities for using machine learning for ensemble simulations and for combining several machine learning techniques. All these methods will enable km-scale global climate simulations.
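For readers unfamiliar with the skill score cited above, the following is a minimal sketch (not taken from the talk) of the standard empirical CRPS for a single ensemble forecast; the ensemble values are synthetic and purely illustrative.

```python
import numpy as np

def ensemble_crps(members: np.ndarray, obs: float) -> float:
    """Empirical continuous ranked probability score (CRPS) of an ensemble
    forecast against one verifying observation. Lower is better."""
    members = np.asarray(members, dtype=float)
    # Mean absolute difference between members and the observation
    term1 = np.mean(np.abs(members - obs))
    # Half the mean absolute difference between all member pairs (ensemble spread)
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Illustrative 10-member ensemble for one grid point (e.g., 2 m temperature in degrees C)
rng = np.random.default_rng(0)
ensemble = rng.normal(loc=15.0, scale=1.5, size=10)
print(ensemble_crps(ensemble, obs=14.2))
```

A relative CRPS improvement of 14% can be read as the post-processed ensemble scoring roughly 14% lower (better) on this metric than the baseline, averaged over the evaluation data.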
Speaker: Torsten Hoefler is a Professor of Computer Science at ETH Zurich, a member of Academia Europaea, and a Fellow of the ACM and IEEE. Following a “Performance as a Science” vision, he combines mathematical models of architectures and applications to design optimized computing systems. Before joining ETH Zurich, he led the performance modeling and simulation efforts for the first sustained Petascale supercomputer, Blue Waters, at the University of Illinois at Urbana-Champaign. He is also a key contributor to the Message Passing Interface (MPI) standard where he chaired the “Collective Operations and Topologies” working group. Torsten won best paper awards at ACM/IEEE Supercomputing in 2010, 2013, 2014, 2019, 2022, 2023, and at other international conferences. He has published numerous peer-reviewed scientific articles and authored chapters of the MPI-2.2 and MPI-3.0 standards. For his work, Torsten received the IEEE CS Sidney Fernbach Memorial Award in 2022, the ACM Gordon Bell Prize in 2019, the ISC Jack Dongarra award, the IEEE TCSC Award of Excellence (MCR), ETH Zurich’s Latsis Prize, the SIAM SIAG/Supercomputing Junior Scientist Prize, the IEEE TCSC Young Achievers in Scalable Computing Award, and the BenchCouncil Rising Star Award. Following his Ph.D., he received the 2014 Young Alumni Award and the 2022 Distinguished Alumni Award of his alma mater, Indiana University. Torsten was elected to the first steering committee of ACM’s SIGHPC in 2013 and he was re-elected for every term since then. He was the first European to receive many of those honors; he also received both an ERC Starting and Consolidator grant. His research interests revolve around the central topic of performance-centric system design and include scalable networks, parallel programming techniques, and performance modeling for large-scale simulations and artificial intelligence systems. Additional information about Torsten can be found on his homepage at htor.inf.ethz.ch.
Leveraging the Trillion Parameter Consortium to Create the Next Generation of AI Scientists (Valerie Taylor, Argonne National Laboratory / University of Chicago)
Abstract: In the last two years, AI models at unprecedented scales have rapidly emerged, transforming science and engineering while catalyzing new interdisciplinary collaborations. This rapid advancement underscores the urgent need to equip the global workforce with AI expertise, highlighting the dual challenge and opportunity of training, upskilling, and diversifying an AI-ready workforce. International collaboration is crucial, offering a pathway to significantly enhance the disciplinary and cultural diversity in AI data and models. The diversification of perspectives — across culture, gender, beliefs, race, and other facets — is essential for evolving AI’s role in human-to-technology interactions. I will discuss exciting opportunities to advance large-scale AI systems with strategies to engage and deeply involve students, preparing them to form the next generation of AI and HPC leaders.
Speaker: Valerie Taylor is the Director of the Mathematics and Computer Science Division and a Distinguished Fellow at Argonne National Laboratory. Her research is in high-performance computing, with a focus on performance analysis, modeling, and tuning of parallel scientific applications, and on energy-efficient computing. Prior to joining Argonne, she was the Senior Associate Dean of Academic Affairs in the College of Engineering and a Regents Professor and the Royce E. Wisenbaker Professor in the Department of Computer Science and Engineering at Texas A&M University. She is also the President and CEO of the Center for Minorities and People with Disabilities in IT (CMD-IT). Valerie is an IEEE Fellow, ACM Fellow, and AAAS Fellow.
Workshop Moderator (Charlie Catlett, Argonne National Laboratory and The University of Chicago)
Charlie Catlett is a Senior Computer Scientist at the U.S. Department of Energy’s Argonne National Laboratory, and a Visiting Scientist at the University of Chicago’s Mansueto Institute for Urban Innovation. His research focuses on building cyberinfrastructure to embed edge-AI in urban, environmental, and emergency sensing and response settings. He was founding chair of Grid Forum / Global Grid Forum from 1999-2005 and director of NSF’s TeraGrid initiative from 2004-2007. Charlie was part of the team that established the National Center for Supercomputing Applications (NCSA) in 1985, leading efforts there including the deployment and operation of the NSFNET backbone network, an early component of the Internet, and serving as Chief Technology Officer prior to joining Argonne and UChicago in 2000. He was one of GovTech magazine’s “25 Doers, Dreamers & Drivers” of 2016 and in 2019 received the Argonne Board of Governors Distinguished Performer award. Charlie is a Computer Engineering graduate of the University of Illinois at Urbana-Champaign.

