TPC Workshop:
Synthetic Data, Trust, and Scale to Advance
Reliable AI Models for Science with HPC and
Multilingual Insights

  • Friday, 13 June 2025 from 2pm to 6pm
  • Venue: ISC in Hamburg, Germany (June 10-13, 2025)

Abstract

This workshop will explore the technical, ethical, and accessibility challenges of using synthetic data for training and evaluating AI models in scientific and engineering applications. It emphasizes building trustworthy and reliable systems while addressing barriers to equitable access, ensuring synthetic data workflows can benefit a global, multilingual community of researchers across diverse domains. A special focus will be on the role of HPC in generating, processing, and validating synthetic data at scale. This workshop also discusses the use of synthetic data in the European context and the role of the Trillion Parameter Consortium (TPC) in Europe.

Images from the TPC workshop at ISC’24: Left: Panel discussion on AI/HPC funding initiatives featuring Alexandra Kourfali (EU), Rick Stevens (US), Fabrizio Gagliardi (moderator), Satoshi Matsuoka (JP); Center: Group picture; Right: Valerie Taylor (Argonne/UChicago).


Co-Organizers

  • Javier Aula-Blasco (Barcelona Supercomputing Center)
  • Charlie Catlett (Argonne National Laboratory and University of Chicago)
  • Fabrizio Gagliardi (Barcelona Supercomputing Center)
  • Laura Morselli (CINECA)
  • Per Öster (CSC – IT Center for Science)
  • Miguel Vazquez (Barcelona Supercomputing Center)

Agenda

14:00 – 14:10: Welcome and introduction. Javier Aula-Blasco (Barcelona Supercomputing Center).
14:10 – 15:00: Invited keynote (30-35 mins) + Moderated audience discussion (15-20 mins).
Mahmoud Ibrahim (Maastricht University). Moderated by Javier Aula-Blasco.
15:00 – 16:00: Themed roundtable: Using synthetic data and the role of TPC in Europe.
Moderated by Charlie Catlett (Argonne National Laboratory and University of Chicago), Fabrizio Gagliardi (Barcelona Supercomputing Center) and Per Öster (CSC – IT Center for Science).
16:00 – 16:30: Break.
16:30 – 17:50: Themed session: Synthetic data methodologies + Open Discussion: Next steps.
Lucía Tormo-Bañuelos (Barcelona Supercomputing Center), Jason Haga (National Institute of Advanced Industrial Science and Technology) and Robert Underwood (Argonne National Laboratory). Moderated by Laura Morselli (CINECA) and Miguel Vazquez (Barcelona Supercomputing Center).
17:50 – 18:00: Closing remarks and adjourn. Javier Aula-Blasco.


Workshop Topics

1. Synthetic Data for Trustworthy AI

  • Ensuring synthetic data aligns with real-world distributions to improve the reliability and robustness of scientific AI models.
  • Techniques to mitigate biases and representational gaps in synthetic datasets, particularly for underrepresented languages and domains.
  • Leveraging HPC capabilities to efficiently generate large-scale synthetic datasets that capture domain-specific complexity and multilingual diversity.
  • Evaluating safety, robustness, and interpretability of models trained or evaluated with synthetic data, utilizing HPC-powered validation pipelines.

2. Accessibility and Democratization

  • Open-source tools and frameworks for synthetic data generation to foster global collaboration, including HPC-enabled platforms for generating and sharing multilingual datasets.
  • Sharing synthetic datasets that include a diverse set of languages to support research communities worldwide.
  • Case studies of democratization efforts leveraging multilingual synthetic data and HPC resources to advance AI research in under-resourced regions.
  • Leveraging HPC systems to address data scarcity in disciplines like bioinformatics, climate modeling, and materials science.

3. Ethics and Transparency in Synthetic Data

  • Standards for ethical synthetic data generation, including transparency, accountability, and consideration of linguistic diversity.
  • Addressing concerns about over-reliance on synthetic data and potential risks of bias amplification, particularly across languages.
  • The role of HPC environments in ensuring reproducibility and traceability of synthetic data workflows.

4. Integrating Synthetic and Real-World Data

  • Best practices for combining synthetic and real-world datasets to enhance model performance and applicability in scientific domains.
  • Frameworks for benchmarking and validating models trained on synthetic data using HPC-driven simulations.

5. Collaboration and Community Building

  • Creating open, collaborative platforms to share synthetic data workflows and insights across languages and disciplines.
  • Strategies for building a global network of resources and expertise to support researchers in synthetic data and AI development, with emphasis on multilingual inclusivity and access to HPC resources.

Workshop Aims and Expected Outcomes

As described in the abstract, this workshop addresses the technical, ethical, and accessibility challenges of using synthetic data to train and evaluate AI models for science and engineering, with particular attention to trustworthiness, equitable access, multilingual representation, and the role of HPC in generating, processing, and validating synthetic data at scale. It builds upon the ISC-HPC 2024 workshop, The Trillion Parameter Consortium (TPC) – Accelerating the use of Generative AI for Science and Engineering, which was attended by some 80 people (capped by room size). Members of the program committee also led a similarly structured TPC-organized workshop at SC’24 in Atlanta, USA, which was attended by nearly 200 people.

The intended outcomes of this workshop include:

  • Establish a shared understanding of methodologies for generating and using synthetic data in HPC contexts, focusing on reliability, trust, and multilingual representation.
  • Identify or propose open methodologies, datasets, tools, and good practices tailored to (multilingual and multimodal) synthetic data generation and evaluation in HPC environments.
  • Foster a diverse, inclusive community of researchers and practitioners committed to improving synthetic data practices for large-scale AI models in science and engineering.
  • Facilitate cross-disciplinary partnerships among researchers, institutions, and HPC centers to tackle challenges in multilingual AI and synthetic data at scale.
  • Highlight gaps in current approaches and propose areas for further research, particularly in leveraging HPC for synthetic data and multilingual AI applications.