Part of the TPC Seminar Series

Speaker: Kyle Lo, Research Scientist at the Allen Institute for AI in Seattle
Date: Wednesday, May 22, 2024
Time: 10:00 A.M. to 11:15 A.M. (Central Time)
Location: Virtual
Abstract:
Language models have become critical for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In this talk, I’ll present Dolma, including its design principles, details about its construction, and a summary of its contents. I’ll provide analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, I’ll discuss our ongoing work on updates to Dolma and open investigations pertaining to the science of pretraining data.
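For attendees who want to explore the corpus before the talk, here is a minimal sketch of streaming documents from Dolma. It assumes the corpus is hosted as "allenai/dolma" on the Hugging Face Hub and that each record carries a "text" field; neither detail appears in this announcement, so consult the dataset card for the actual loading interface.

```python
# Minimal sketch: stream a few documents from the Dolma corpus.
# Assumes the dataset is published as "allenai/dolma" on the Hugging Face
# Hub and that records expose a "text" field (both are assumptions here).
from datasets import load_dataset

# Stream rather than download: at roughly three trillion tokens, the
# corpus is far too large to materialize locally.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

# Peek at the first few documents from the mixed sources
# (web content, papers, code, books, social media, encyclopedias).
for i, doc in enumerate(dolma):
    print(doc["text"][:200])  # assumed field name; check the dataset card
    if i >= 2:
        break
```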
Biography:
Kyle Lo is a research scientist at the Allen Institute for AI in Seattle working on natural language processing, machine learning, and human-AI interaction, with an emphasis on language model development and adapting language models to specialized (scientific) texts. He is currently a tech lead on the OLMo and Semantic Scholar projects, focusing particularly on dataset curation. Kyle's work on language model adaptation, scientific text processing, long-document summarization, and AI-powered reading assistance has won paper awards at top conferences including ACL, CHI, EMNLP, and EACL. His work on open language models and datasets has been featured in Nature, Science, TechCrunch, MIT Technology Review, GeekWire, and other outlets. In 2020, Kyle co-led a White House OSTP initiative to curate and release the largest collection of COVID-19 research to date in support of computational use cases such as automated literature review. Kyle graduated with a degree in Statistics from the University of Washington. He enjoys board games, boba tea, D&D, and relaxing with his cat Belphegor.

