Tracker: Synthetic Data in Pretraining

Data exports: JSON · YAML · llms.txt

Flagship open-weight models released since Jan 2024.

Last updated: February 13, 2026

Architecture types: Dense · MoE · Hybrid (MoE+Mamba)
Each entry lists: Model, Report, Org, Date, Arch, Total Params, Active Params, Train Tokens, TPP, Synth Tokens, STPP, and Synth %, with supporting excerpts or notes from the cited report where available.
TPP = total training tokens / total parameters.
STPP = synthetic training tokens / total parameters.
Synth % = synthetic training tokens / total training tokens.
A short worked sketch of these ratios follows the table.
DBRX (Databricks, 2024-03; report: blog) | MoE | 132B total | 36B active
  Train tokens: 12T | TPP: 90.9
    It was pre-trained on 12T tokens of text and code data.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Grok-1 (xAI, 2024-03; report: github) | MoE | 314B total | 79B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Phi-3 medium 14B (Microsoft, 2024-04; report: PDF) | Dense | 14B total | 14B active
  Train tokens: 4.8T | TPP: 342.9
    We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Mixtral 8x22B (Mistral, 2024-04; report: none) | MoE | 141B total | 39B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Snowflake Arctic (Snowflake, 2024-04; report: blog) | MoE | 480B total | 17B active
  Train tokens: 3.5T | TPP: 7.3
    Snowflake Arctic was pretrained on 3.5 trillion tokens of data from publicly available sources.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Command-R+ (Cohere, 2024-04; report: none) | Dense | 104B total | 104B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

DeepSeek-V2 (DeepSeek, 2024-05; report: PDF) | MoE | 236B total | 21B active
  Train tokens: 8.1T | TPP: 34.3
    We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Yi-1.5 34B (01.AI, 2024-05; report: PDF) | Dense | 34B total | 34B active
  Train tokens: 3.1T | TPP: 91.2
    We construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Gemma 2 27B (Google, 2024-06; report: PDF) | Dense | 27B total | 27B active
  Train tokens: 13T | TPP: 481.5
    We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2B on 2 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Llama 3.1 405B (Meta, 2024-07; report: PDF) | Dense | 405B total | 405B active
  Train tokens: 15.6T | TPP: 38.5
    We pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens.
  Synth tokens: 0 | STPP: 0 | Synth %: 0%
    We found that annealing on small amounts of high-quality code and mathematical data...can boost the performance. We do not use any synthetic data produced by other LLMs for pretraining.

Mistral Large 2 (Mistral, 2024-07; report: none) | Dense | 123B total | 123B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Jamba 1.5 Large (AI21, 2024-08; report: paper) | Hybrid | 398B total | 94B active
  Train tokens: undisclosed | TPP: undisclosed
  Synth tokens: undisclosed | STPP: undisclosed | Synth %: undisclosed

Qwen 2.5 72B (Alibaba, 2024-09; report: PDF) | Dense | 72B total | 72B active
  Train tokens: 18T | TPP: 250.0
    All models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Granite 3.0 8B (IBM, 2024-10; report: paper) | Dense | 8B total | 8B active
  Train tokens: 12T | TPP: 1500.0
    Trained from scratch following a two-stage training strategy. In the first stage, it is trained on 10 trillion tokens sourced from diverse domains. During the second stage, it is further trained on 2 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

DeepSeek-V3 (DeepSeek, 2024-12; report: PDF) | MoE | 671B total | 37B active
  Train tokens: 14.8T | TPP: 22.1
    We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Phi-4 14B (Microsoft, 2024-12; report: PDF) | Dense | 14B total | 14B active
  Train tokens: 10T | TPP: 714.3
    The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules.
  Synth tokens: 5.5T | STPP: 392.9 | Synth %: 55.0%
    Synthetic data constitutes the bulk of the training data for phi-4. Data mixture: Synthetic 40%, Web rewrites 15%, Filtered web data 15%, Code data 20%, Acquired sources 10%. (The 55% figure counts synthetic data plus web rewrites: 40% + 15% of ~10T ≈ 5.5T.)
Falcon 3 10B (TII, 2024-12; report: blog) | Dense | 10B total | 10B active
  Train tokens: 16T | TPP: 1600.0
    We conducted a single large-scale pretraining run on the 7B model...leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data. [Then] upscaled the 7B model to a 10B parameters model...continuing pre-training with 2 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

MiniMax-01 (MiniMax, 2025-01; report: PDF) | MoE | 456B total | 45.9B active
  Train tokens: 12T | TPP: 26.3
    Trained on ~12 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

OLMo 2 32B (Allen AI, 2025-02; report: blog) | Dense | 32B total | 32B active
  Train tokens: 6T | TPP: 187.5
    OLMo 2 32B is trained for 1.5 epochs, up to 6T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Gemma 3 27B (Google, 2025-03; report: PDF) | Dense | 27B total | 27B active
  Train tokens: 14T | TPP: 518.5
    We pre-train our models on a slightly larger token budget than Gemma 2, i.e., we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Llama 4 Scout (Meta, 2025-04; report: blog) | MoE | 109B total | 17B active
  Train tokens: 40T | TPP: 367.0
    Pre-training data: ~40T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Llama 4 Maverick (Meta, 2025-04; report: blog) | MoE | 400B total | 17B active
  Train tokens: 22T | TPP: 55.0
    Pre-training data: ~22T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 0.6B (Alibaba, 2025-05; report: PDF) | Dense | 0.6B total | 0.6B active
  Train tokens: 36T | TPP: 60000.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Qwen 3 8B (Alibaba, 2025-05; report: PDF) | Dense | 8B total | 8B active
  Train tokens: 36T | TPP: 4500.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Qwen 3 14B (Alibaba, 2025-05; report: PDF) | Dense | 14B total | 14B active
  Train tokens: 36T | TPP: 2571.4
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 32B (Alibaba, 2025-05; report: PDF) | Dense | 32B total | 32B active
  Train tokens: 36T | TPP: 1125.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 30B-A3B (Alibaba, 2025-05; report: PDF) | MoE | 30B total | 3B active
  Train tokens: 36T | TPP: 1200.0
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Qwen 3 235B-A22B (Alibaba, 2025-05; report: PDF) | MoE | 235B total | 22B active
  Train tokens: 36T | TPP: 153.2
    All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Kimi K2 (Moonshot, 2025-07; report: PDF) | MoE | 1T total | 32B active
  Train tokens: 15.5T | TPP: 14.9
    The Kimi K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning four primary domains: Web Text, Code, Mathematics, and Knowledge.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

GLM-4.5 (Zhipu AI, 2025-07; report: PDF) | MoE | 355B total | 32B active
  Train tokens: 23T | TPP: 64.8
    Multi-stage training on 23T tokens.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

OLMo 3 32B (Allen AI, 2025-11; report: blog) | Dense | 32B total | 32B active
  Train tokens: 6T | TPP: 187.5
    Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Trinity Large (Arcee, 2025-11; report: paper) | MoE | 400B total | 13B active
  Train tokens: 17T | TPP: 42.5
    Trinity Large was pre-trained on 17 trillion tokens of data curated by DatologyAI, split across three phases of 10T, 4T, and 3T tokens.
  Synth tokens: 8T | STPP: 20.0 | Synth %: 47.1%
    Over 8 trillion tokens of synthetic data were generated for this dataset across web, code, math, reasoning, and multilingual domains. The 8T synthetic tokens include 6.5T synthetic web tokens, 1T multilingual tokens, and 800B synthetic code tokens.
Monad (Pleias, 2025-11; report: blog) | Dense | 0.056B total | 0.056B active
  Train tokens: 200B | TPP: 3571.4
    Monad is a 56 million parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
  Synth tokens: 200B | STPP: 3571.4 | Synth %: 100.0%
    Fully synthetic training makes it relatively straightforward to expand language support. Trained exclusively on SYNTH, a fully open generalist synthetic dataset built from 50,000 Wikipedia vital articles as seeds.

Baguettotron (Pleias, 2025-11; report: blog) | Dense | 0.321B total | 0.321B active
  Train tokens: 200B | TPP: 623.1
    Baguettotron is a 321 million parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
  Synth tokens: 200B | STPP: 623.1 | Synth %: 100.0%
    Fully synthetic training makes it relatively straightforward to expand language support. Trained exclusively on SYNTH, a fully open generalist synthetic dataset built from 50,000 Wikipedia vital articles as seeds.
Trinity Nano (Arcee, 2025-12; report: blog) | MoE | 6B total | 1B active
  Train tokens: 10T | TPP: 1666.7
    Trained on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T in phase 1, 1.8T in phase 2, 1.2T in phase 3.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD

Trinity Mini (Arcee, 2025-12; report: blog) | MoE | 26B total | 3B active
  Train tokens: 10T | TPP: 384.6
    Trained on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T in phase 1, 1.8T in phase 2, 1.2T in phase 3.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
Nemotron-3 Nano (NVIDIA, 2025-12; report: PDF) | Hybrid | 30B total | 3.5B active
  Train tokens: 25T | TPP: 833.3
    Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2.
  Synth tokens: 2.5T | STPP: 83.3 | Synth %: 10.0%
    They curated or generated over 2.5T new tokens from Common Crawl data. Applying five prompts to Medium-High-Quality data from 110 Common Crawl snapshots produced 2.1T new tokens. For all synthetic rephrasing, they used Qwen3-30B-A3B.
Kimi K2.5 (Moonshot, 2026-01; report: blog) | MoE | 1T total | 32B active
  Train tokens: 30.5T | TPP: 29.3
    Built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base (15.5T base tokens plus ~15T of continual pretraining).
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
GLM-5 (Zhipu AI, 2026-02; report: HF) | MoE | 744B total | 40B active
  Train tokens: 28.5T | TPP: 38.3
    Pre-training data grows from 23T to 28.5T tokens. GLM-5 employs a MoE architecture, scaling from GLM-4.5's 355B params to 744B, with 256 experts, 8 activated per token.
  Synth tokens: TBD | STPP: TBD | Synth %: TBD
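
For readers who want to check the derived columns, here is a minimal sketch of how TPP, STPP, and Synth % can be recomputed from the listed parameter and token counts. The function name and the two embedded sample rows are illustrative only; they are not part of the tracker's data exports.

```python
# Minimal sketch: recompute the tracker's ratio columns from raw counts.
# Helper name and sample entries are illustrative, not the tracker's schema.

def ratios(total_params_b: float, train_tokens_t: float, synth_tokens_t: float | None):
    """Return (TPP, STPP, Synth %) given params in billions and tokens in trillions."""
    train_tokens_b = train_tokens_t * 1000       # trillions -> billions
    tpp = train_tokens_b / total_params_b        # total training tokens / total parameters
    if synth_tokens_t is None:                   # synthetic count undisclosed or TBD
        return tpp, None, None
    synth_tokens_b = synth_tokens_t * 1000
    stpp = synth_tokens_b / total_params_b       # synthetic training tokens / total parameters
    synth_pct = 100 * synth_tokens_t / train_tokens_t
    return tpp, stpp, synth_pct

# Two rows from the table as a sanity check:
# Phi-4 14B: 14B params, 10T train, 5.5T synthetic -> TPP 714.3, STPP 392.9, Synth % 55.0
print(ratios(14, 10, 5.5))
# Trinity Large: 400B params, 17T train, 8T synthetic -> TPP 42.5, STPP 20.0, Synth % 47.1
print(ratios(400, 17, 8))
```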