Part of the TPC Seminar Series


Speaker: Valentin Reis, Software Engineer at Groq Inc.
Date: Wednesday, July 10, 2024
Time: 10:00 A.M. to 11:15 A.M. (Central Time)
Location: Virtual

Abstract:

The race to model size continues. At the time of writing, some organizations are training 400B+ parameter Large Language Models (LLMs). The larger these training and inference workloads become, the more demand they create for GPU memory capacity. This has resulted in accelerators that leverage complex technology as part of their memory hierarchy. Groq addresses this scaling challenge from the opposite direction.

We’ll explain how we co-designed a compilation-based software stack and a class of accelerators called Language Processing Units (LPUs). The core architectural decision underlying LPUs is determinism – systems of clock-synchronized accelerators are programmed via detailed instruction scheduling. This includes networking, which is also scheduled statically. As a result, LPU-based systems have fewer networking constraints, which makes their SRAM-based design practicable without a memory hierarchy. This redesign results in high utilization and low end-to-end system latency. Moreover, determinism enables static reasoning about program performance – the key to kernel-free compilation.
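To give a flavor of what "static reasoning about program performance" means, here is a minimal toy sketch (not Groq's actual compiler or ISA; all instruction names and latencies are invented for illustration). When every functional unit has a fully deterministic latency, a compiler can assign each instruction an exact issue cycle and know the program's total runtime before it ever executes:

```python
# Toy illustration of compile-time (static) instruction scheduling.
# Hypothetical program: (name, dependencies, latency in cycles).
# This simplified model ignores resource conflicts such as issue-port
# contention; it only tracks data-dependency timing.
PROGRAM = [
    ("load_a", [],                   4),
    ("load_b", [],                   4),
    ("matmul", ["load_a", "load_b"], 8),
    ("bias",   ["matmul"],           1),
    ("store",  ["bias"],             4),
]

def schedule(program):
    """Assign each instruction the earliest cycle at which all of its
    inputs are ready; return {name: (issue_cycle, done_cycle)}."""
    done = {}
    table = {}
    for name, deps, latency in program:
        issue = max((done[d] for d in deps), default=0)
        done[name] = issue + latency
        table[name] = (issue, done[name])
    return table

if __name__ == "__main__":
    for name, (issue, finish) in schedule(PROGRAM).items():
        print(f"{name:8s} issue@{issue:2d} done@{finish:2d}")
    # With deterministic latencies, the 'done' cycle of the final
    # instruction is the program's exact runtime -- no profiling needed.
```

In this sketch, `matmul` issues at cycle 4 (when both loads finish) and the whole program completes at a cycle the scheduler knows exactly; a real deterministic system extends this idea across chips and network links.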

We’ll talk about the challenges of breaking models apart over networks of LPUs, and outline how this HW/SW system architecture keeps enabling breakthrough LLM inference latency at all model sizes.

Biography:

Valentin Reis is a software engineer at Groq, Inc., where he works on the software stack that powers LPU accelerators. He was previously a postdoctoral appointee in the Mathematics and Computer Science division at Argonne National Laboratory.