The workshop agenda comprises thematic plenary sessions and parallel workshop sessions organized around multiple themes, each featuring lightning talks, panels, and group discussions. Breakout sessions aim to form and deepen collaborations, explore opportunities to coordinate efforts, and strengthen ties among the European AI, HPC, and science communities. Each theme is organized by individuals active in the corresponding TPC working groups, with participants from around the globe who have been collaborating virtually since 2023. Breakout sessions will include lightning talks on cooperative efforts, lessons learned to date, and discussions.

Each breakout topic is organized in one or two 90-minute sessions comprising lightning talks and discussions. These breakouts are designed to (a) exchange insights on progress in areas related to responsibly building large-scale AI models for science, in order to (b) explore, initiate, and coordinate specific, outcome-oriented collaborations (with clear targets and timelines) in these areas.


MAPE: Model Architecture and Performance Evaluation

Leaders: Rio Yokota (TiTech); Irina Rish (UdeM/MILA); Aitor González-Agirre (BSC)

Architectures for LLMs are continuously evolving. Variants of transformers, their mixture-of-experts-based extensions, and state-space models are being proposed on a weekly basis. Frameworks such as Megatron-LM and DeepSpeed, and their various forks, each cover a different subset of architectures, distributed parallelism, and compute/memory/IO optimizations. Determining the optimal architecture for training a trillion-parameter model on scientific data, and the best framework for doing so on today’s exascale platforms, is critical to creating a new class of AI models for a broad range of scientific applications.
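As a rough illustration of the sizing questions this session takes up, the sketch below estimates the parameter count of a dense transformer and its per-GPU weight memory under tensor and pipeline parallelism. All concrete numbers (hidden size, layer count, vocabulary, parallelism degrees) are illustrative assumptions, not settings from Megatron-LM, DeepSpeed, or any specific system.

```python
# Back-of-the-envelope sizing for a dense transformer under 3D parallelism.
# The 12*h^2 per-layer estimate is the standard approximation: ~4*h^2 for
# attention (Q, K, V, and output projections) plus ~8*h^2 for an MLP with
# 4x expansion. Every concrete number below is an illustrative assumption.

def transformer_params(hidden: int, layers: int, vocab: int) -> int:
    """Rough dense-transformer parameter count: 12*h^2 per layer plus
    the embedding table (vocab * hidden)."""
    return layers * 12 * hidden ** 2 + vocab * hidden

def weight_bytes_per_gpu(total_params: int, tensor_par: int,
                         pipeline_par: int, bytes_per_param: int = 2) -> float:
    """Weight memory per GPU when weights are sharded across the tensor-
    and pipeline-parallel groups (data parallelism replicates weights)."""
    return total_params * bytes_per_param / (tensor_par * pipeline_par)

if __name__ == "__main__":
    # Hypothetical ~1T-parameter configuration.
    p = transformer_params(hidden=25600, layers=128, vocab=128000)
    print(f"params: {p / 1e12:.2f}T")
    gb = weight_bytes_per_gpu(p, tensor_par=8, pipeline_par=16) / 2**30
    print(f"fp16 weights per GPU: {gb:.1f} GiB")
```

Optimizer state and activations, not weights, usually dominate training memory; a fuller estimate would add those terms and the data-parallel sharding (e.g. ZeRO/FSDP) that the listed frameworks provide.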

Model Architecture and Performance Evaluation Lightning Talks
Emani, Murali (ANL) – Toward a Holistic Performance Evaluation of Large Language Models Across Diverse AI Accelerators
Ganti, Raghu (IBM Research) – Maximizing hardware utilization for training and inference: Lessons learned from PyTorch
González-Agirre, Aitor (BSC) – Challenges of large-scale pretraining: The tradeoffs of 3D parallelism
Goon, Garrett (HPE) – Experiences with Fully Sharded Data Parallelism
Van Essen, Brian (LLNL) – FLASK: Developing foundation models for molecular design and synthesis prediction
Vassilieva, Natalia (Cerebras) – Training Recipes and Scaling Strategies for LLMs on Cerebras Wafer-Scale Clusters
Wells, Azton (ANL) – Keeping up with the Joneses: Instruct and alignment tuning with open source datasets in preparation for AuroraGPT
Zhang, Minjia (UIllinois) – Towards Efficient System and Algorithm Design for Large-Scale Scientific Discovery

DTW: Data, Training Workflows, LLM strategies (chaining, composition of experts, the “LLM OS”…)

Leaders: Ian Foster (Argonne/UChicago); Neeraj Kumar (PNNL); Ravi Madduri (Argonne)

In the era of exponential data growth, this session addresses critical challenges and innovative strategies in harnessing the vast datasets needed for training large language models (LLMs) in domains such as chemistry/materials, biology, climate science, and high energy physics. The session will discuss the complexities of developing a data-focused infrastructure, including streamlined data-curation pipelines, refined curation practices, and pre-training methodologies. It will explore how incorporating domain-specific knowledge into these processes can significantly enhance the performance and applicability of LLMs in scientific research, emphasizing the critical role of targeted data selection and preparation in advancing AI capabilities. Through discussions, lightning talks, and collaborative dialogues, participants will explore cutting-edge approaches to optimize data utility and model effectiveness, setting a foundation for breakthroughs in scientific discovery.
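As a minimal illustration of one stage in such a data-curation pipeline, the sketch below combines a length filter with exact deduplication by content hash. The threshold and single-stage design are illustrative assumptions; real pipelines add near-duplicate detection, quality classifiers, and domain-specific filtering.

```python
# A minimal sketch of one curation stage: drop documents that are too
# short, and drop exact duplicates by SHA-256 content hash. The 200-char
# threshold is an illustrative assumption, not a recommended setting.
import hashlib

def curate(docs, min_chars=200):
    """Yield documents passing a length filter, skipping exact duplicates."""
    seen = set()
    for text in docs:
        if len(text) < min_chars:
            continue  # too short to be useful training text
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        yield text
```

Hashing keeps memory proportional to the number of unique documents rather than their total size, which is why exact dedup is usually the cheapest stage to run first.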

Data, Training Workflows, LLM strategies Lightning Talks
Ansari, Mehrad (U Toronto/Acceleration Consortium) – Inverse Design of Materials with Chemist AI Agents
Bhattacharya, Suparna (HPE) – Foundation Model as an OS
Da Dalt, Severino (BSC) – Multilinguality experiments
Foster, Ian (ANL/UChicago) – Science agents
Grobelnik, Marko (Jozef Stefan Institute) – Extracting World Models from LLMs
Kulebi, Baybars (BSC) – LLMOps for Evaluation workflows
Kumar, Neeraj (PNNL) – CACTUS Agent or LLMs benchmark/evaluation
Ma, Po-Lun (PNNL) – Data challenges from climate science
Madduri, Ravi (ANL) – Fine Tuning LLMs using Federated Learning – Early results
Palomar, Jorge (BSC) – Pre-training data processing: strategies and challenges
von Werra, Leandro (Hugging Face) – Scaling LLM beyond Chinchilla optimal
Wahib, Mohamed (RIKEN) – Adaptive Patching for Vision Transformers
Yokota, Rio (RIKEN) – Continual pre-training

SST: Skills, Safety, and Trust Evaluation

Leaders: Bo Li (UChicago); Franck Cappello (Argonne); Javier Aula-Blasco (BSC)

One of the main thrusts behind the rapid evolution of LLMs is the availability of benchmarks that assess the skills and trustworthiness of LLMs. Not only do they enable rigorous evaluation of LLMs’ skills and trustworthiness against accepted metrics, but they also foster competition among LLM developers. While several frameworks and benchmarks have emerged as de facto standards for evaluating general-purpose LLMs (EleutherAI’s LM Evaluation Harness and HELM for skills, DecodingTrust for trustworthiness), very few are specific to science. In this segment, we will discuss the challenges of developing benchmarks to evaluate the skills, trustworthiness, and safety of large foundation models for science, and the related efforts in the multilingual (Spanish and European, in addition to English) and AuroraGPT project contexts. The segment will feature lightning talks covering the following topics: multilingual capabilities, skill evaluation, trust/safety evaluation, uncertainty quantification, robustness/consistency evaluation, contamination avoidance and detection, MCQ and open questions, domain-specific tests, and question format.
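The skill-evaluation loop such benchmarks implement can be sketched minimally as multiple-choice (MCQ) accuracy scoring. The `ask_model` callable below is a placeholder assumption standing in for any LLM interface; real harnesses additionally handle few-shot prompting, answer extraction, and the contamination checks this segment discusses.

```python
# A minimal sketch of MCQ skill scoring: compare the model's chosen
# option index against a gold key and report accuracy. `ask_model` is a
# hypothetical callable, not an API from any real evaluation framework.

def mcq_accuracy(items, ask_model):
    """items: list of (question, options, gold_index) tuples.
    ask_model(question, options) -> index of the chosen option."""
    correct = 0
    for question, options, gold in items:
        if ask_model(question, options) == gold:
            correct += 1
    return correct / len(items) if items else 0.0
```

Even this toy loop surfaces design questions the session covers, such as whether to score a free-text answer by extraction or by comparing option log-likelihoods, and how to report uncertainty on small test sets.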

Skills, Safety, and Trust Evaluation Lightning Talks
Aula-Blasco, Javier (BSC) – An Evaluation of LLM Evaluation Epistemology
Bhattacharya, Suparna (HPE) – Less is More: Data-Efficient Methods for LLM Evaluation
Cappello, Franck (ANL) – Evaluating Foundation Model as Scientific Assistant
Hernandez-Orallo, Jose (UPV) – Characterising General-Purpose Artificial Intelligence: Evaluating Capabilities and Generality
Wahib, Mohamed (RIKEN) – Attribution in LLMs

BCD: Bioinformatics / Treatment Designs

Leaders: Arvind Ramanathan (Argonne); Mohamed Wahib (RIKEN); Miguel Vazquez (BSC)

This track will focus on the development of foundation models / large language models for biology, centered on multi-omics datasets and their implications for how LLMs are trained and evaluated. Given the shared interests and the broader implications for how LLMs can potentially alter the scope of biological research, the goal of this session is to catalyze discussions and build collaborations along three directions: (1) how to build shared datasets that create a rich repertoire of downstream evaluation tasks for foundation models; (2) discuss and develop shared strategies for model sharing and scoping across diverse biological applications; and (3) evaluate approaches for incorporating robust strategies that address bias and trust/safety in the context of biological data. We will work extensively on developing benchmark data and evaluation suites, along with ideas on how models are being developed across the community today.

Bioinformatics / Treatment Designs Lightning Talks
Cuidad, Alvaro (Nostrum Biodiscovery) – Scaling protein language models
Dallago, Christian (NVIDIA) – Protein language models and their effectiveness
Ferruz, Noelia (IBMB/CSIC) – Large language models for protein design
Hie, Brian (Stanford) – Evo: Generative models for biology
Sugita, Yuji (RIKEN) – Machine learning and data assimilation in biomolecular simulations
Ugarte La Torre, Diego (RIKEN) – Equivariant diffusion models for backmapping coarse-grained macromolecules
Vazquez, Miguel (BSC) – ExTRI2: Finding transcription regulation interactions in literature and their possible uses

GTAC: Growing and Training a diverse, global AI For Science Community

Leaders: Valerie Taylor (Argonne/UChicago); Fabrizio Gagliardi (BSC); Claudio Domenico Arlandini (CINECA)

In the last two years, AI models at unprecedented scales have rapidly emerged, transforming science and engineering while catalyzing new interdisciplinary collaborations. This rapid advancement underscores the urgent need to equip the global workforce with AI expertise, highlighting the dual challenge and opportunity of training, upskilling, and diversifying an AI-ready workforce. International collaboration is crucial, offering a pathway to significantly enhance the disciplinary and cultural diversity in AI data and models. The diversification of perspectives — across culture, gender, beliefs, race, and other facets — is essential for evolving AI’s role in human-to-technology interactions. This session will explore how the Trillion Parameter Consortium, an emerging international community, can design its governance and operational strategies to foster global collaboration and diversity in AI research and development. Participants will discuss specific strategies for guiding the consortium’s leadership and collaboration opportunities, aiming to shape a future where AI technology—and TPC leadership and participants—reflect the diversity of the global community they serve.

Growing and Training a diverse, global AI For Science Community Lightning Talks
Del Castillo Negrete, Carlos (TACC)
Haga, Jason (AIST)
Morselli, Laura (CINECA)
Oliveira, Rui (INESC TEC)
Taylor, Valerie (ANL)

SOFT: Early Experiences in Using LLMs for Scientific Software Use Cases

Leaders: Anshu Dubey (Argonne), Pete Beckman (Northwestern), Valerie Taylor (Argonne/UChicago)

It is well known that generative AI performs poorly at scientific code generation, owing to the sparsity of training data and a lack of understanding of how to pose questions. We need a systematic study of what it takes to generate reliable code with generative AI. We also need two kinds of knowledge synthesis: one to improve the training of the models, and one to train software developers and engineers to harness the power of generative AI. Several groups have experimented with various aspects of code generation and translation. In this session we bring these early experimenters together to present their experiences and insights, seeding further discussion of using LLMs for code generation and translation. This breakout will also provide updates on discussions from the 2024 International Post-Exascale (InPEx) workshop.
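One ingredient of such a systematic study is an automatic check of whether generated code is reliable: execute the candidate and run reference tests against it. The sketch below is a minimal pass/fail harness; the convention that the task asks for a function named `solve` is an illustrative assumption, not any group's actual methodology, and production harnesses would sandbox execution.

```python
# A minimal sketch of checking LLM-generated code: exec the candidate
# source, look up the function it is assumed to define (`solve` is a
# hypothetical task convention), and run reference (input, output) tests.
# Real studies sandbox this step; exec-ing untrusted code is unsafe.

def passes_tests(candidate_src: str, tests) -> bool:
    """Return True iff the candidate defines `solve` and it satisfies
    every (args, expected) pair in `tests`."""
    namespace = {}
    try:
        exec(candidate_src, namespace)
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        return False  # code that fails to parse, define, or run counts as a failure
```

Run over many sampled generations per task, the fraction of passing candidates gives pass-rate metrics of the kind commonly reported for code-generation benchmarks.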

Early Experiences in Using LLMs for Scientific Software Use Cases Lightning Talks
Durillo Barrionuevo, Juan (LRZ) – LLMs at the Leibniz Supercomputing Centre: Initial steps in automatic software generation for power users
Jitsev, Jenia (JSC) – Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Marrón Vida, Diego (BSC) – LLMs for Automatic Code Parallelization
Taylor, Valerie (Argonne) – Using LLMs for Software Generation: Two Case Studies
Van Essen, Brian (LLNL) – Using LLMs for Code Translation

LLM-HC: Large Language Models for Healthcare

Leaders: Silvia Crivelli (LBL); Dario Garcia-Gasulla (BSC); Antoine Bosselut (EPFL)

This track will focus on the critical challenges faced when developing foundation models for healthcare and medicine, and on potential solutions. These models carry heavy requirements for privacy preservation, bias quantification, factuality, and trustworthiness, all critical if such systems are to be released openly and used with safety guarantees. To start with, the complex nature of healthcare data drives foundation models in this field toward multi-modality (which brings the added computational cost of processing high-resolution images) to ensure a holistic view of patient health. To become truly open models while training on potentially private and sensitive data, healthcare foundation models need safety mechanisms that guarantee the absence of data leaks, typically through the integration of anonymization processes and the generation and use of synthetic data. A further particularity of healthcare foundation models is their special need for factuality and consistency, given the risk that hallucinations pose to human health; here the vast scientific literature available in the medical field can be exploited.
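As a minimal illustration of the anonymization step mentioned above, the sketch below masks a few obvious identifiers with regular expressions. The patterns are illustrative assumptions; production de-identification combines far more rules with learned named-entity recognition and human review before any text reaches a training corpus.

```python
# A minimal sketch of rule-based de-identification: replace obvious
# identifiers with placeholder tokens. These three patterns (email,
# US-style SSN, bare 10-digit phone number) are illustrative only.
import re

PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{10}\b"), "[PHONE]"),
]

def deidentify(text: str) -> str:
    """Apply each masking pattern in turn and return the redacted text."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Regex masking alone misses names, addresses, and rare identifiers, which is one reason synthetic-data generation is discussed alongside it as a complementary safeguard.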

Large Language Models for Healthcare Lightning Talks
Bosselut, Antoine (EPFL) – MEDITRON: Open Medical Foundation Models Adapted for Clinical Practice
Garcia Gasulla, Dario (BSC) – Selecting and evaluating LLMs: Lessons from Aloe
Lu, Zhiyong (NIH) – Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine
Oermann, Eric (NYU) – Engineering Large Language Models within Health Systems
Zamora-Resendiz, Rafael (LBNL) – Scientific Assessment of Foundational Representations in Clinical Large Language Models

HARD: AI Hardware Acceleration Strategies at Scale

Leaders: Tobias Becker (Groq); Dawei Huang (SambaNova); Natalia Vassilieva (Cerebras); Christian Dallago (NVIDIA)

The explosion in the size of training and inference workloads caused by the race to larger language models has created unprecedented computational demands. GPUs have been rapidly increasing in size and complexity in an effort to service the largest of these workloads. Despite this heroic effort, the cost, availability, and system architecture of GPU accelerators are slowing the race to scale AI workloads faster and further. This session will explore alternative hardware/software approaches designed specifically to tackle the challenge of scaling these workloads. It convenes experts from Groq, SambaNova, Cerebras, NVIDIA, and others to offer fresh views of how systems can be designed for the training and inference of the largest next-generation models. Mapping workloads onto such innovative systems is as important a topic as their architecture, so we will pay particular attention to the programming models for these systems.

AI Hardware Acceleration Strategies at Scale Lightning Talks
Becker, Tobias (Groq) – Fast LLM Inference at Scale with Groq LPUs
Ganti, Raghu (IBM Research) – The IBM Artificial Intelligence Unit (AIU)
Thakker, Urmish (SambaNova) – Enabling extremely fast inference and training performance using dataflow and SN40L
Sarasua, Ignacio (NVIDIA) – Tackling AI at scale: a major stake for NVIDIA
Vassilieva, Natalia (Cerebras) – The Benefits of Cerebras Wafer-Scale Clusters and Its Co-Designed Software Stack for AI at Scale