# Synthetic Pretraining Tracker

Tracking synthetic data usage in open-weight LLM pretraining (Jan 2024+).

> https://eric-tramel.github.io/synthetic-pretraining-tracker/
> https://eric-tramel.github.io/synthetic-pretraining-tracker/data/models.json
> https://eric-tramel.github.io/synthetic-pretraining-tracker/data/models.yaml
> https://eric-tramel.github.io/synthetic-pretraining-tracker/data/authors.yaml
> https://github.com/eric-tramel/synthetic-pretraining-tracker

## Fields

Each model entry contains:

- `name`: Model name
- `url`: HuggingFace model card URL
- `report`: Technical report URL (PDF, blog, or paper)
- `org`: Organization
- `date`: Release date (YYYY-MM)
- `arch`: Architecture (`dense`, `moe`, or `hybrid`)
- `params`: Total parameters in billions
- `active`: Active parameters in billions (same as `params` for dense models)
- `tokens`: Total pretraining tokens in billions (`null` if undisclosed)
- `tokens_cite`: Direct quote and source for the token count
- `synth_tokens`: Synthetic pretraining tokens in billions (`null` if unknown)
- `synth_cite`: Direct quote and source for the synthetic token count
- `synth_note`: Brief context on synthetic data status
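To illustrate the schema, here is a hypothetical entry in the shape of `data/models.yaml`. All values below (model name, org, numbers, quotes) are placeholders, not real tracker data:

```yaml
# Hypothetical example entry — every value here is illustrative.
- name: ExampleLM-9B
  url: https://huggingface.co/example-org/ExampleLM-9B
  report: https://example.org/examplelm-tech-report
  org: Example Org
  date: 2024-06
  arch: dense
  params: 9.0
  active: 9.0          # equals params for dense architectures
  tokens: 8000         # billions; null if undisclosed
  tokens_cite: "Trained on 8T tokens (technical report, Sec. 2)"
  synth_tokens: null   # null when the synthetic share is unknown
  synth_cite: null
  synth_note: "Report mentions synthetic data but gives no token count."
```

Fields with unknown values use `null` rather than being omitted, so every entry carries the full key set.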