Synthetic Data Powering Pretraining
Abstract
Invited lecture for UC Berkeley EE 194/290 on how synthetic data is reshaping modern LLM pretraining. The talk covers the data wall, scaling pressures, practical synthetic-data pipelines, and current evidence on where synthetic pretraining helps most (and where it still fails).
Summary
This lecture argues that web-only pretraining is approaching its limits (the data wall), but that pretraining itself is not over. The key lever is synthetic data designed as a training curriculum.
Topics covered
- Why inference economics and overtraining (training models well past the compute-optimal token count) create pressure for larger and more novel token pools.
- How synthetic data is being used in large-scale pretraining today, including rephrasing, capability seeding, and verification loops.
- Evidence from recent open and industry studies on code and reasoning data mixed directly into pretraining.
- Open problems: quality control, evaluation infrastructure, and agent-trajectory data at scale.
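To make the first topic concrete, here is a rough back-of-the-envelope sketch of why cheaper inference pushes toward more training tokens. It is not taken from the lecture. It assumes the widely used approximations of about 6 * N * D FLOPs for training and about 2 * N FLOPs per generated token for inference, where N is the parameter count and D is the number of training tokens. The function names and all numbers are hypothetical.

```python
# Illustrative arithmetic only. Assumed approximations (not from the lecture):
#   training FLOPs  ~ 6 * N * D
#   inference FLOPs ~ 2 * N per generated token
# N = parameter count, D = training tokens. All values below are hypothetical.


def training_tokens_for_budget(train_flops: float, params: float) -> float:
    """Tokens that can be trained on for a fixed training compute budget."""
    return train_flops / (6.0 * params)


def inference_flops(params: float, served_tokens: float) -> float:
    """Approximate compute spent serving a given number of generated tokens."""
    return 2.0 * params * served_tokens


if __name__ == "__main__":
    budget = 1e24          # hypothetical training budget in FLOPs
    served = 1e13          # hypothetical lifetime tokens served
    for params in (70e9, 7e9):
        tokens = training_tokens_for_budget(budget, params)
        serve_cost = inference_flops(params, served)
        print(
            f"params={params:.0e}  train_tokens={tokens:.2e}  "
            f"inference_flops={serve_cost:.2e}"
        )
```

Under these assumptions, shrinking the model tenfold at a fixed training budget cuts serving cost tenfold but requires ten times as many training tokens, which is the pressure on token pools the first topic refers to.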