Tracker: Synthetic Data in Pretraining

Data exports: JSON · YAML · llms.txt

Flagship open-weight models released since Jan 2024.

Last updated: February 13, 2026

Architecture types: Dense · MoE · Hybrid (MoE+Mamba)
Each entry lists: Model, Report, Org, Date, Arch, Total Params, Active Params, Train Tokens, TPP, Synth Tokens, STPP, and Synth %, with supporting excerpts or notes from the cited report where available.
TPP = total training tokens / total parameters.
STPP = synthetic training tokens / total parameters.
Synth % = synthetic training tokens / total training tokens.
A short worked sketch of these ratios follows the table.
DBRX (Databricks, 2024-03; report: blog) | MoE | 132B total | 36B active
  Train tokens: 12T | TPP: 90.9
    It was pre-trained on 12T tokens of text and code data.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Grok-1 (xAI, 2024-03; report: github) | MoE | 314B total | 79B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Phi-3 medium 14B (Microsoft, 2024-04; report: PDF) | Dense | 14B total | 14B active
  Train tokens: 4.8T | TPP: 342.9
    We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Mixtral 8x22B (Mistral, 2024-04; report: none) | MoE | 141B total | 39B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Snowflake Arctic (Snowflake, 2024-04; report: blog) | MoE | 480B total | 17B active
  Train tokens: 3.5T | TPP: 7.3
    Snowflake Arctic was pretrained on 3.5 trillion tokens of data from publicly available sources.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Command-R+ (Cohere, 2024-04; report: none) | Dense | 104B total | 104B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

DeepSeek-V2 (DeepSeek, 2024-05; report: PDF) | MoE | 236B total | 21B active
  Train tokens: 8.1T | TPP: 34.3
    We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Yi-1.5 34B (01.AI, 2024-05; report: PDF) | Dense | 34B total | 34B active
  Train tokens: 3.1T | TPP: 91.2
    We construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Gemma 2 27B (Google, 2024-06; report: PDF) | Dense | 27B total | 27B active
  Train tokens: 13T | TPP: 481.5
    We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2B on 2 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Llama 3.1 405B (Meta, 2024-07; report: PDF) | Dense | 405B total | 405B active
  Train tokens: 15.6T | TPP: 38.5
    We pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens.
  Synth tokens: 0 | STPP: 0 | Synth %: 0%
    We found that annealing on small amounts of high-quality code and mathematical data...can boost the performance. We do not use any synthetic data produced by other LLMs for pretraining.

Mistral Large 2 (Mistral, 2024-07; report: none) | Dense | 123B total | 123B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Jamba 1.5 Large (AI21, 2024-08; report: paper) | Hybrid | 398B total | 94B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Qwen 2.5 72B (Alibaba, 2024-09; report: PDF) | Dense | 72B total | 72B active
  Train tokens: 18T | TPP: 250.0
    All models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Granite 3.0 8B (IBM, 2024-10; report: paper) | Dense | 8B total | 8B active
  Train tokens: 12T | TPP: 1500.0
    Trained from scratch following a two-stage training strategy. In the first stage, it is trained on 10 trillion tokens sourced from diverse domains. During the second stage, it is further trained on 2 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

DeepSeek-V3 (DeepSeek, 2024-12; report: PDF) | MoE | 671B total | 37B active
  Train tokens: 14.8T | TPP: 22.1
    We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Phi-4 14B (Microsoft, 2024-12; report: PDF) | Dense | 14B total | 14B active
  Train tokens: 10T | TPP: 714.3
    The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules.
  Synth tokens: 5.5T | STPP: 392.9 | Synth %: 55.0%
    Synthetic data constitutes the bulk of the training data for phi-4. Data mixture: Synthetic 40%, Web rewrites 15%, Filtered web data 15%, Code data 20%, Acquired sources 10%. (The 55% figure counts synthetic data plus web rewrites: 40% + 15% of ~10T ≈ 5.5T.)
Falcon 3 10B (TII, 2024-12; report: blog) | Dense | 10B total | 10B active
  Train tokens: 16T | TPP: 1600.0
    We conducted a single large-scale pretraining run on the 7B model...leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data. [Then] upscaled the 7B model to a 10B parameters model...continuing pre-training with 2 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

MiniMax-01 (MiniMax, 2025-01; report: PDF) | MoE | 456B total | 45.9B active
  Train tokens: 12T | TPP: 26.3
    Trained on ~12 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

OLMo 2 32B (Allen AI, 2025-02; report: blog) | Dense | 32B total | 32B active
  Train tokens: 6T | TPP: 187.5
    OLMo 2 32B is trained for 1.5 epochs, up to 6T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Gemma 3 27B (Google, 2025-03; report: PDF) | Dense | 27B total | 27B active
  Train tokens: 14T | TPP: 518.5
    We pre-train our models on a slightly larger token budget than Gemma 2, i.e., we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Llama 4 Scout (Meta, 2025-04; report: blog) | MoE | 109B total | 17B active
  Train tokens: 40T | TPP: 367.0
    Pre-training data: ~40T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Llama 4 Maverick (Meta, 2025-04; report: blog) | MoE | 400B total | 17B active
  Train tokens: 22T | TPP: 55.0
    Pre-training data: ~22T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 0.6B (Alibaba, 2025-05; report: PDF) | Dense | 0.6B total | 0.6B active
  Train tokens: 36T | TPP: 60000.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Qwen 3 8B (Alibaba, 2025-05; report: PDF) | Dense | 8B total | 8B active
  Train tokens: 36T | TPP: 4500.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Qwen 3 14B (Alibaba, 2025-05; report: PDF) | Dense | 14B total | 14B active
  Train tokens: 36T | TPP: 2571.4
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 32B (Alibaba, 2025-05; report: PDF) | Dense | 32B total | 32B active
  Train tokens: 36T | TPP: 1125.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 30B-A3B (Alibaba, 2025-05; report: PDF) | MoE | 30B total | 3B active
  Train tokens: 36T | TPP: 1200.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 235B-A22B (Alibaba, 2025-05; report: PDF) | MoE | 235B total | 22B active
  Train tokens: 36T | TPP: 153.2
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Kimi K2 (Moonshot, 2025-07; report: PDF) | MoE | 1T total | 32B active
  Train tokens: 15.5T | TPP: 14.9
    The Kimi K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning four primary domains: Web Text, Code, Mathematics, and Knowledge.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

GLM-4.5 (Zhipu AI, 2025-07; report: PDF) | MoE | 355B total | 32B active
  Train tokens: 23T | TPP: 64.8
    Multi-stage training on 23T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

OLMo 3 32B (Allen AI, 2025-11; report: blog) | Dense | 32B total | 32B active
  Train tokens: 6T | TPP: 187.5
    Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Trinity Large (Arcee, 2025-11; report: paper) | MoE | 400B total | 13B active
  Train tokens: 17T | TPP: 42.5
    Trinity Large was pre-trained on 17 trillion tokens of data curated by DatologyAI, split across three phases of 10T, 4T, and 3T tokens.
  Synth tokens: 8T | STPP: 20.0 | Synth %: 47.1%
    Over 8 trillion tokens of synthetic data were generated for this dataset across web, code, math, reasoning, and multilingual domains. The 8T synthetic tokens include 6.5T synthetic web tokens, 1T multilingual tokens, and 800B synthetic code tokens.
Monad (Pleias, 2025-11; report: blog) | Dense | 0.056B total | 0.056B active
  Train tokens: 200B | TPP: 3571.4
    Monad is a 56 million parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
  Synth tokens: 200B | STPP: 3571.4 | Synth %: 100.0%
    Fully synthetic training makes it relatively straightforward to expand language support. Trained exclusively on SYNTH, a fully open generalist synthetic dataset built from 50,000 Wikipedia vital articles as seeds.

Baguettotron (Pleias, 2025-11; report: blog) | Dense | 0.321B total | 0.321B active
  Train tokens: 200B | TPP: 623.1
    Baguettotron is a 321 million parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
  Synth tokens: 200B | STPP: 623.1 | Synth %: 100.0%
    Fully synthetic training makes it relatively straightforward to expand language support. Trained exclusively on SYNTH, a fully open generalist synthetic dataset built from 50,000 Wikipedia vital articles as seeds.
Trinity Nano (Arcee, 2025-12; report: blog) | MoE | 6B total | 1B active
  Train tokens: 10T | TPP: 1666.7
    Trained on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T in phase 1, 1.8T in phase 2, 1.2T in phase 3.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Trinity Mini (Arcee, 2025-12; report: blog) | MoE | 26B total | 3B active
  Train tokens: 10T | TPP: 384.6
    Trained on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T in phase 1, 1.8T in phase 2, 1.2T in phase 3.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Nemotron-3 Nano (NVIDIA, 2025-12; report: PDF) | Hybrid | 30B total | 3.5B active
  Train tokens: 25T | TPP: 833.3
    Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2.
  Synth tokens: 2.5T | STPP: 83.3 | Synth %: 10.0%
    They curated or generated over 2.5T new tokens from Common Crawl data. Applying five prompts to Medium-High-Quality data from 110 Common Crawl snapshots produced 2.1T new tokens. For all synthetic rephrasing, they used Qwen3-30B-A3B.
Kimi K2.5 (Moonshot, 2026-01; report: blog) | MoE | 1T total | 32B active
  Train tokens: 30.5T | TPP: 29.3
    Built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base (15.5T base tokens plus ~15T of continual pretraining).
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
GLM-5 (Zhipu AI, 2026-02; report: HF) | MoE | 744B total | 40B active
  Train tokens: 28.5T | TPP: 38.3
    Pre-training data grows from 23T to 28.5T tokens. GLM-5 employs a MoE architecture, scaling from GLM-4.5's 355B params to 744B, with 256 experts, 8 activated per token.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
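
For readers who want to check the derived columns, here is a minimal sketch of how TPP, STPP, and Synth % can be recomputed from the listed parameter and token counts. The function name and the two embedded sample rows are illustrative only; they are not part of the tracker's data exports.

```python
# Minimal sketch: recompute the tracker's ratio columns from raw counts.
# Helper name and sample entries are illustrative, not the tracker's schema.

def ratios(total_params_b: float, train_tokens_t: float, synth_tokens_t: float | None):
    """Return (TPP, STPP, Synth %) given params in billions and tokens in trillions."""
    train_tokens_b = train_tokens_t * 1000       # trillions -> billions
    tpp = train_tokens_b / total_params_b        # total training tokens / total parameters
    if synth_tokens_t is None:                   # synthetic count undisclosed or TBD
        return tpp, None, None
    synth_tokens_b = synth_tokens_t * 1000
    stpp = synth_tokens_b / total_params_b       # synthetic training tokens / total parameters
    synth_pct = 100 * synth_tokens_t / train_tokens_t
    return tpp, stpp, synth_pct

# Two rows from the table as a sanity check:
# Phi-4 14B: 14B params, 10T train, 5.5T synthetic -> TPP 714.3, STPP 392.9, Synth % 55.0
print(ratios(14, 10, 5.5))
# Trinity Large: 400B params, 17T train, 8T synthetic -> TPP 42.5, STPP 20.0, Synth % 47.1
print(ratios(400, 17, 8))
```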