# Synthetic Pretraining Tracker

Tracking synthetic data usage in open-weight LLM pretraining (Jan 2024+).

> https://eric-tramel.github.io/synthetic-pretraining-tracker/
> https://eric-tramel.github.io/synthetic-pretraining-tracker/data/models.json
> https://eric-tramel.github.io/synthetic-pretraining-tracker/data/models.yaml
> https://eric-tramel.github.io/synthetic-pretraining-tracker/data/authors.yaml
> https://github.com/eric-tramel/synthetic-pretraining-tracker

## Fields

Each model entry contains:

- `name`: Model name
- `url`: HuggingFace model card URL
- `report`: Technical report URL (PDF, blog, or paper)
- `org`: Organization
- `date`: Release date (YYYY-MM)
- `arch`: Architecture (`dense`, `moe`, or `hybrid`)
- `params`: Total parameters in billions
- `active`: Active parameters in billions (same as `params` for dense models)
- `tokens`: Total pretraining tokens in billions (`null` if undisclosed)
- `tokens_cite`: Direct quote and source for the token count
- `synth_tokens`: Synthetic pretraining tokens in billions (`null` if unknown)
- `synth_cite`: Direct quote and source for the synthetic token count
- `synth_note`: Brief context on synthetic data status
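To illustrate the schema, here is a hypothetical entry in the shape of `data/models.yaml`. All values below (model name, org, numbers, quotes) are placeholders, not real tracker data:

```yaml
# Hypothetical example entry — every value here is illustrative.
- name: ExampleLM-9B
  url: https://huggingface.co/example-org/ExampleLM-9B
  report: https://example.org/examplelm-tech-report
  org: Example Org
  date: 2024-06
  arch: dense
  params: 9.0
  active: 9.0          # equals params for dense architectures
  tokens: 8000         # billions; null if undisclosed
  tokens_cite: "Trained on 8T tokens (technical report, Sec. 2)"
  synth_tokens: null   # null when the synthetic share is unknown
  synth_cite: null
  synth_note: "Report mentions synthetic data but gives no token count."
```

Fields with unknown values use `null` rather than being omitted, so every entry carries the full key set.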