The workshop agenda comprises thematic plenary sessions and parallel breakout sessions organized around multiple themes, each featuring lightning talks, panels, and group discussions. Breakout sessions aim to facilitate the formation and deepening of collaborations, explore opportunities to coordinate efforts, and strengthen ties within the European and global AI, HPC, and science communities. Each theme is organized by individuals active in the corresponding TPC working groups, with participants from around the globe who have been collaborating virtually since 2023.
Each breakout topic is organized into one or two 90-minute sessions comprising lightning talks and discussions. These breakouts are designed to (a) exchange insights on progress in responsibly building large-scale AI models for science, and (b) explore, initiate, and coordinate specific, outcome-oriented collaborations (with clear targets and timelines) in these areas.
MAPE: Model Architecture and Performance Evaluation
Leaders: Rio Yokota (TiTech); Irina Rish (UdeM/MILA); Aitor González-Agirre (BSC)
Architectures for LLMs are continuously evolving: variants of transformers, their mixture-of-experts extensions, and state-space models are proposed on a weekly basis. Frameworks such as Megatron-LM and DeepSpeed, and their various forks, each cover a different subset of architectures, distributed parallelism schemes, and compute/memory/I/O optimizations. Determining the optimal architecture for training a trillion-parameter model on scientific data, and the best framework to accomplish this on today’s exascale platforms, is critical for the creation of a new class of AI models for a broad range of scientific applications.
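To give a sense of why framework and parallelism choices matter at this scale, the back-of-the-envelope sketch below estimates per-GPU memory for mixed-precision Adam training under tensor, pipeline, and optimizer-state sharding. The 16-bytes-per-parameter accounting and the function name are illustrative assumptions, not from any session material; exact figures vary by framework and precision recipe.

```python
def train_memory_gb(n_params, tp, pp, dp_shard=1, bytes_per_param=16):
    """Approximate per-GPU memory for weights, gradients, and Adam states.

    bytes_per_param=16 assumes bf16 weights and gradients plus fp32
    master weights and Adam moments (a common mixed-precision
    accounting); activations and buffers are ignored.
    """
    # Tensor (tp) and pipeline (pp) parallelism split the parameters;
    # dp_shard > 1 models ZeRO/FSDP-style sharding across data-parallel ranks.
    per_gpu_params = n_params / (tp * pp * dp_shard)
    return per_gpu_params * bytes_per_param / 1e9

# A 1T-parameter model with TP=8, PP=16 and no optimizer sharding needs
# roughly 125 GB per GPU -- more than an 80 GB accelerator holds, which is
# why optimizer-state sharding or larger parallel degrees become necessary.
print(round(train_memory_gb(1e12, tp=8, pp=16), 1))
print(round(train_memory_gb(1e12, tp=8, pp=16, dp_shard=8), 3))
```

Sweeping such estimates over (TP, PP, sharding) configurations is one simple way to pre-screen which framework features a trillion-parameter run actually requires.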
| Emani, Murali (ANL) | Toward a Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators |
| Ganti, Raghu (IBM Research) | Maximizing hardware utilization for training and inference: Lessons learned from PyTorch |
| González-Agirre, Aitor (BSC) | Challenges of large-scale pretraining: The tradeoffs of 3D parallelism |
| Goon, Garrett (HPE) | Experiences with Fully Sharded Data Parallelism |
| Van Essen, Brian (LLNL) | FLASK: Developing foundation models for molecular design and synthesis prediction |
| Vassilieva, Natalia (Cerebras) | Training Recipes and Scaling Strategies for LLMs on Cerebras Wafer-Scale Clusters |
| Wells, Azton (ANL) | Keeping up with the Joneses: Instruct and alignment tuning with open source datasets in preparation for AuroraGPT |
| Zhang, Minjia (UIllinois) | Towards Efficient System and Algorithm Design for Large-Scale Scientific Discovery |
DTW: Data, Training Workflows, LLM strategies (chaining, composition of experts, the “LLM OS”…)
Leaders: Ian Foster (Argonne/UChicago); Neeraj Kumar (PNNL); Ravi Madduri (Argonne)
In the era of exponential data growth, this session addresses critical challenges and innovative strategies in harnessing the vast datasets needed to train large language models (LLMs) in domains such as chemistry/materials, biology, climate science, and high energy physics. The session will discuss the complexities of developing data-focused infrastructure, including streamlined curation pipelines, refined data curation practices, and the application of pre-training methodologies. It will explore how incorporating domain-specific knowledge into these processes can significantly enhance the performance and applicability of LLMs in scientific research, emphasizing the critical role of targeted data selection and preparation in advancing AI capabilities. Through discussions, lightning talks, and collaborative dialogues, participants will explore cutting-edge approaches to optimize data utility and model effectiveness, setting a foundation for breakthroughs in scientific discovery.
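As a minimal illustration of the curation-pipeline theme, the toy pass below combines exact deduplication (by content hash) with a length filter. Real pipelines add near-deduplication (e.g. MinHash), quality classifiers, and domain-specific filters; the function name and thresholds here are assumptions for the sketch.

```python
import hashlib

def curate(records, min_chars=200):
    """Toy curation pass: exact dedup via SHA-256 of the normalized text,
    plus a minimum-length filter. Order of surviving records is preserved."""
    seen, kept = set(), []
    for text in records:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen or len(text) < min_chars:
            continue  # drop exact duplicates and too-short documents
        seen.add(digest)
        kept.append(text)
    return kept

docs = ["short", "x" * 300, "x" * 300, "y" * 250]
print(len(curate(docs)))  # the short doc and one duplicate are removed -> 2
```

Even this trivial pass shows the trade-off the session targets: every filter discards tokens that a trillion-parameter run may badly need, so thresholds must be tuned per scientific domain.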
| Ansari, Mehrad (U Toronto/Acceleration Consortium) | Inverse Design of Materials with Chemist AI Agents |
| Bhattacharya, Suparna (HPE) | Foundation Model as an OS |
| Da Dalt, Severino (BSC) | Multilinguality experiments |
| Foster, Ian (ANL/UChicago) | Science agents |
| Grobelnik, Marko (Jozef Stefan Institute) | Extracting WorldModels from LLMs |
| Kulebi, Baybars (BSC) | LLMOps for Evaluation workflows |
| Kumar, Neeraj (PNNL) | CACTUS Agent or LLMs benchmark/evaluation |
| Ma, Po-Lun (PNNL) | Data challenges from climate science |
| Madduri, Ravi (ANL) | Fine Tuning LLMs using Federated Learning – Early results |
| Palomar, Jorge (BSC) | Pre-training data processing: strategies and challenges |
| von Werra, Leandro (Hugging Face) | Scaling LLM beyond Chinchilla optimal |
| Wahib, Mohamed (RIKEN) | Adaptive Patching for Vision Transformers |
| Yokota, Rio (RIKEN) | Continual pre-training |
SST: Skills, Safety, and Trust Evaluation
Leaders: Bo Li (UChicago); Franck Cappello (Argonne); Javier Aula-Blasco (BSC)
One of the main thrusts behind the rapid evolution of LLMs is the availability of benchmarks that assess their skills and trustworthiness. Not only do these benchmarks enable rigorous evaluation of LLMs’ skills and trustworthiness against accepted metrics, but they also foster competition among LLM developers. While several frameworks/benchmarks have emerged as de facto standards for evaluating general-purpose LLMs (the EleutherAI LM Evaluation Harness and HELM for skills, DecodingTrust for trustworthiness), very few relate specifically to science. In this segment, we will discuss the challenges of developing benchmarks to evaluate the skills, trustworthiness, and safety of large foundation models for science, and related efforts in the multilingual (Spanish and other European languages, in addition to English) and AuroraGPT project contexts. The segment will feature lightning talks covering the following topics: multilingual capabilities, skill evaluation, trust/safety evaluation, uncertainty quantification, robustness/consistency evaluation, contamination avoidance and detection, multiple-choice and open questions, domain-specific tests, and question format.
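One concrete robustness/consistency check from this list is whether a model's multiple-choice answer survives reordering of the options. The sketch below scores that invariance; the `predict(question, options) -> index` interface is a hypothetical stand-in for a real LLM call, not an API from any of the frameworks named above.

```python
import itertools

def consistency_score(predict, question, options):
    """Fraction of option orderings on which the model selects the same
    underlying option. 1.0 means fully order-invariant answers."""
    picks = []
    for perm in itertools.permutations(range(len(options))):
        shuffled = [options[i] for i in perm]
        idx = predict(question, shuffled)
        picks.append(perm[idx])  # map the pick back to the original option id
    top = max(set(picks), key=picks.count)
    return picks.count(top) / len(picks)

# A position-biased "model" that always answers option A is maximally
# inconsistent under shuffling (score 1/3 for three options):
always_a = lambda q, opts: 0
print(consistency_score(always_a, "2+2=?", ["3", "4", "5"]))
```

The same permutation trick extends to paraphrased question stems, giving a cheap consistency signal that complements raw accuracy.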
| Aula-Blasco, Javier (BSC) | An Evaluation of LLM Evaluation Epistemology |
| Bhattacharya, Suparna (HPE) | Less is More: Data-Efficient Methods for LLM Evaluation |
| Cappello, Franck (ANL) | Evaluating Foundation Model as Scientific Assistant |
| Hernandez-Orallo, Jose (UPV) | Characterising General-Purpose Artificial Intelligence: Evaluating Capabilities and Generality |
| Wahib, Mohamed (RIKEN) | Attribution in LLMs |
BCD: Bioinformatics / Treatment Designs
Leaders: Arvind Ramanathan (Argonne); Mohamed Wahib (RIKEN); Miguel Vazquez (BSC)
This track will focus on the development of foundation models / large language models for biology, centered on multi-omics datasets and their implications for how LLMs are trained and evaluated. Given the shared interests and the broader implications for how LLMs can alter the scope of biological research, the goal of this session is to catalyze discussions and build collaborations along three directions: (1) how to build shared datasets that provide a rich repertoire of downstream evaluation tasks for foundation models; (2) shared strategies for model sharing and scoping across diverse biological applications; and (3) approaches for incorporating robust bias and trust/safety strategies into the context of biological data. We will work extensively on developing benchmark data and evaluation suites, along with surveying how models are being developed across the community today.
| Ciudad, Alvaro (Nostrum Biodiscovery) | Scaling protein language models |
| Dallago, Christian (NVIDIA) | Protein language models and their effectiveness |
| Ferruz, Noelia (IBMB/CSIC) | Large language models for protein design |
| Hie, Brian (Stanford) | Evo: Generative models for biology |
| Sugita, Yuji (RIKEN) | Machine learning and data assimilation in biomolecular simulations |
| Ugarte La Torre, Diego (RIKEN) | Equivariant diffusion models for backmapping coarse-grained macromolecules |
| Vazquez, Miguel (BSC) | ExTRI2: Finding transcription regulation interactions in literature and their possible uses |
GTAC: Growing and Training a diverse, global AI For Science Community
Leaders: Valerie Taylor (Argonne/UChicago); Fabrizio Gagliardi (BSC); Claudio Domenico Arlandini (CINECA)
In the last two years, AI models at unprecedented scales have rapidly emerged, transforming science and engineering while catalyzing new interdisciplinary collaborations. This rapid advancement underscores the urgent need to equip the global workforce with AI expertise, highlighting the dual challenge and opportunity of training, upskilling, and diversifying an AI-ready workforce. International collaboration is crucial, offering a pathway to significantly enhance the disciplinary and cultural diversity in AI data and models. The diversification of perspectives — across culture, gender, beliefs, race, and other facets — is essential for evolving AI’s role in human-to-technology interactions. This session will explore how the Trillion Parameter Consortium, an emerging international community, can design its governance and operational strategies to foster global collaboration and diversity in AI research and development. Participants will discuss specific strategies for guiding the consortium’s leadership and collaboration opportunities, aiming to shape a future where AI technology—and TPC leadership and participants—reflect the diversity of the global community they serve.
| Del Castillo Negrete, Carlos (TACC) |
| Haga, Jason (AIST) |
| Morselli, Laura (CINECA) |
| Oliveira, Rui (INESC TEC) |
| Taylor, Valerie (ANL) |
SOFT: Early Experiences in Using LLMs for Scientific Software Use Cases
Leaders: Anshu Dubey (Argonne), Pete Beckman (Northwestern), Valerie Taylor (Argonne/UChicago)
It is well known that generative AI performs poorly at scientific code generation because of the sparsity of training data and a lack of understanding of how to pose questions. We need a systematic study of what it takes to generate reliable code from generative AI. Additionally, we need two kinds of knowledge synthesis: one to improve the training of the models, and one to train software developers and engineers to harness the power of generative AI. Several groups have experimented with various aspects of code generation and translation. In this session we bring together these early experimenters to present their experiences and insights as a way of seeding further discussion about using LLMs for code generation and translation. This breakout will also provide updates on discussions from the 2024 International Post-Exascale (InPEx) workshop.
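Any systematic study of code-generation reliability needs a metric; one widely used choice (not specific to this session) is pass@k over unit tests, with the unbiased estimator introduced in the Codex paper. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn without replacement from n generated
    programs passes the unit tests, given that c of the n passed."""
    if n - c < k:
        return 1.0  # too few failures left to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 passing programs out of 10 generations, pass@1 equals c/n = 0.3:
print(pass_at_k(n=10, c=3, k=1))
```

Generating n >> k samples per problem and averaging this estimator across a benchmark gives a stable reliability number that different groups' experiments can be compared on.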
| Durillo Barrionuevo, Juan (LRZ) | LLMs at the Leibniz Supercomputing Centre: Initial steps in automatic software generation for power users |
| Jitsev, Jenia (JSC) | Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models |
| Marrón Vida, Diego (BSC) | LLMs for Automatic Code Parallelization |
| Taylor, Valerie (Argonne) | Using LLMs for Software Generation: Two Case Studies |
| Van Essen, Brian (LLNL) | Using LLMs for Code Translation |
LLM-HC: Large Language Models for Healthcare
Leaders: Silvia Crivelli (LBL); Dario Garcia-Gasulla (BSC); Antoine Bosselut (EPFL)
This track will focus on the critical challenges of developing foundation models for healthcare and medicine, and potential solutions to them. Such models carry heavy requirements regarding privacy preservation, bias quantification, factuality, and trustworthiness, all critical if these systems are to be released openly and used with safety guarantees. To start with, the complex nature of healthcare data drives foundation models in this field towards multi-modality (which brings the added computational complexity of processing high-resolution images) to ensure a holistic view of patient health. To become truly open models while training on potentially private and sensitive data, healthcare foundation models need safety mechanisms that guard against data leaks, typically through the integration of anonymization processes and the generation and use of synthetic data. A further particularity of healthcare foundation models is their heightened need for factuality and consistency, given the risk that hallucinations pose to human health; here the vast amount of scientific literature in the medical field can be exploited.
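To make the anonymization point concrete, here is a deliberately minimal de-identification pass over a clinical note. The patterns and placeholder tokens are illustrative assumptions only; production pipelines use validated de-identification tools, far richer pattern sets, and human review.

```python
import re

# Toy PHI patterns: dates, medical record numbers, US-style phone numbers.
# Real de-identification covers names, addresses, and much more.
PATTERNS = {
    "[DATE]": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[MRN]": re.compile(r"\bMRN[:\s]*\d+\b"),
    "[PHONE]": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(note):
    """Replace each matched identifier with its placeholder token."""
    for token, pattern in PATTERNS.items():
        note = pattern.sub(token, note)
    return note

print(redact("Seen on 03/14/2024, MRN: 88721, call 555-123-4567."))
# -> Seen on [DATE], [MRN], call [PHONE].
```

The gap between this sketch and a releasable pipeline (recall guarantees, re-identification risk, synthetic-data substitution) is precisely the kind of challenge the track addresses.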
| Bosselut, Antoine (EPFL) | MEDITRON: Open Medical Foundation Models Adapted for Clinical Practice |
| Garcia Gasulla, Dario (BSC) | Selecting and evaluating LLMs: Lessons from Aloe |
| Lu, Zhiyong (NIH) | Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine |
| Oermann, Eric (NYU) | Engineering Large Language Models within Health Systems |
| Zamora-Resendiz, Rafael (LBNL) | Scientific Assessment of Foundational Representations in Clinical Large Language Models |
HARD: AI Hardware Acceleration Strategies at Scale
Leaders: Tobias Becker (Groq); Dawei Huang (SambaNova); Natalia Vassilieva (Cerebras); Christian Dallago (NVIDIA)
The explosion in the size of training and inference workloads caused by the race to larger language models has created unprecedented computational demands. GPUs have been rapidly increasing in size and complexity in an effort to service the largest of these workloads. Despite this heroic effort, the cost, availability, and system architecture of GPU accelerators are slowing the race to scale AI workloads faster and further. This session will explore alternative hardware/software approaches designed specifically to tackle the challenge of scaling these workloads. It convenes experts from Groq, SambaNova, Cerebras, NVIDIA, and others to offer fresh views on how systems can be designed for training and inference of the largest next-generation models. Mapping workloads onto such innovative systems is as important a topic as the architectures themselves, so we will pay particular attention to the programming models for these systems.
| Becker, Tobias (Groq) | Fast LLM Inference at Scale with Groq LPUs |
| Ganti, Raghu (IBM Research) | The IBM Artificial Intelligence Unit (AIU) |
| Thakker, Urmish (SambaNova) | Enabling extremely fast inference and training performance using dataflow and SN40L |
| Sarasua, Ignacio (NVIDIA) | Tackling AI at scale: a major stake for NVIDIA |
| Vassilieva, Natalia (Cerebras) | The Benefits of Cerebras Wafer-Scale Clusters and Its Co-Designed Software Stack for AI at Scale |

