Tracker: Synthetic Data in Pretraining

See an issue or want to add a model? Contribute on GitHub
Data exports: JSON · YAML · llms.txt

Flagship open-weight models released since Jan 2024.

Last updated: March 15, 2026

Architecture key: Dense · MoE · Hybrid (MoE+Mamba)

Columns: Model · Report · Org · Date · Arch · Total Params · Active Params · Train Tokens · TPP (total training tokens / total parameters) · Synth Tokens · STPP (synthetic training tokens / total parameters) · Synth %

Each entry below lists its values in that column order; the relevant excerpt from the model's report or blog is quoted after the Train Tokens and Synth Tokens figures, and values that are unknown or not yet tracked appear as "undisclosed" or "TBD". A worked example of the TPP, STPP, and Synth % calculations follows the table.
DBRX blog Databricks 2024-03 MoE 132B 36B 12T
It was pre-trained on 12T tokens of text and code data.
90.9 TBD TBD TBD
Grok-1 github xAI 2024-03 MoE 314B 79B undisclosed undisclosed undisclosed undisclosed undisclosed
Phi-3 medium 14B PDF Microsoft 2024-04 Dense 14B 14B 4.8T
We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium...
342.9 TBD TBD TBD
Mixtral 8x22B none Mistral 2024-04 MoE 141B 39B undisclosed undisclosed undisclosed undisclosed undisclosed
Snowflake Arctic blog Snowflake 2024-04 MoE 480B 17B 3.5T
Snowflake Arctic was pretrained on 3.5 trillion tokens of data from publicly available sources.
7.3 TBD TBD TBD
Command-R+ none Cohere 2024-04 Dense 104B 104B undisclosed undisclosed undisclosed undisclosed undisclosed
DeepSeek-V2 PDF DeepSeek 2024-05 MoE 236B 21B 8.1T
We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens.
34.3 TBD TBD TBD
Yi-1.5 34B PDF 01.AI 2024-05 Dense 34B 34B 3.1T
We construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline.
91.2 TBD TBD TBD
Gemma 2 27B PDF Google 2024-06 Dense 27B 27B 13T
We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2B on 2 trillion tokens.
481.5 TBD TBD TBD
Llama 3.1 405B PDF Meta 2024-07 Dense 405B 405B 15.6T
We pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens.
38.5 0
We found that annealing on small amounts of high-quality code and mathematical data...can boost the performance. We do not use any synthetic data produced by other LLMs for pretraining.
0 0%
Mistral Large 2 none Mistral 2024-07 Dense 123B 123B undisclosed undisclosed undisclosed undisclosed undisclosed
Jamba 1.5 Large paper AI21 2024-08 Hybrid 398B 94B undisclosed undisclosed undisclosed undisclosed undisclosed
Qwen 2.5 72B PDF Alibaba 2024-09 Dense 72B 72B 18T
In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens.
250.0 TBD TBD TBD
Granite 3.0 8B paper IBM 2024-10 Dense 8B 8B 12T
Trained from scratch following a two-stage training strategy. In the first stage, it is trained on 10 trillion tokens sourced from diverse domains. During the second stage, it is further trained on 2 trillion tokens.
1500.0 TBD TBD TBD
DeepSeek-V3 PDF DeepSeek 2024-12 MoE 671B 37B 14.8T
We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens.
22.1 TBD TBD TBD
Phi-4 14B PDF Microsoft 2024-12 Dense 14B 14B 10T
The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules.
714.3 5.5T
Data mixture (fraction of training · unique tokens · epochs): Web 15%, 1.3T, 1.2 · Web rewrites 15%, 290B, 5.2 · Synthetic 40%, 290B, 13.8 · Code data 20%, 820B, 2.4 · Acquired sources 10%, 580B, 1.7. The 5.5T synthetic figure corresponds to the Synthetic (40%) plus Web rewrites (15%) fractions of the ~10T-token run (0.55 × 10T = 5.5T).
392.9 55.0%
Falcon 3 10B blog TII 2024-12 Dense 10B 10B 16T
We conducted a single large-scale pretraining run on the 7B model, using 1024 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data. Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, we upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data.
1600.0 TBD TBD TBD
MiniMax-01 PDF MiniMax 2025-01 MoE 456B 45.9B 11.4T
Our training approach involved large-scale pre-training on 11.4 trillion tokens, followed by a three-stage process to extend the context window up to 1 million tokens.
25.0 TBD TBD TBD
OLMo 2 32B blog Allen AI 2025-02 Dense 32B 32B 6T
OLMo 2 32B is trained for 1.5 epochs, up to 6T tokens.
187.5 TBD TBD TBD
Gemma 3 27B PDF Google 2025-03 Dense 27B 27B 14T
We pre-train our models on a slightly larger token budget than Gemma 2, i.e., we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B.
518.5 TBD TBD TBD
Llama 4 Scout blog Meta 2025-04 MoE 109B 17B 40T
Pre-training data: ~40T tokens
367.0 TBD TBD TBD
Llama 4 Maverick blog Meta 2025-04 MoE 400B 17B 22T
Pre-training data: ~22T tokens
55.0 TBD TBD TBD
Qwen 3 0.6B PDF Alibaba 2025-05 Dense 0.6B 0.6B 36T
All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
60000.0 TBD TBD TBD
Qwen 3 8B PDF Alibaba 2025-05 Dense 8B 8B 36T
All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
4500.0 TBD TBD TBD
Qwen 3 14B PDF Alibaba 2025-05 Dense 14B 14B 36T
All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
2571.4 TBD TBD TBD
Qwen 3 32B PDF Alibaba 2025-05 Dense 32B 32B 36T
All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
1125.0 TBD TBD TBD
Qwen 3 30B-A3B PDF Alibaba 2025-05 MoE 30B 3B 36T
All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
1200.0 TBD TBD TBD
Qwen 3 235B-A22B PDF Alibaba 2025-05 MoE 235B 22B 36T
All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.
153.2 TBD TBD TBD
Kimi K2 PDF Moonshot 2025-07 MoE 1T 32B 15.5T
The Kimi K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning four primary domains: Web Text, Code, Mathematics, and Knowledge.
14.9 TBD TBD TBD
GLM-4.5 PDF Zhipu AI 2025-07 MoE 355B 32B 23T
Multi-stage training on 23T tokens.
64.8 TBD TBD TBD
OLMo 3 32B blog Allen AI 2025-11 Dense 32B 32B 6T
Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix.
187.5 TBD TBD TBD
Trinity Large paper Arcee 2025-11 MoE 400B 13B 17T
Trinity Large was trained on 17T tokens of data curated by DatologyAI, split across three phases of 10T, 4T, and 3T tokens.
42.5 8T
Notably, over 8 trillion tokens of synthetic data were generated for this dataset across web, code, math, reasoning, and multilingual domains, using a breadth of state-of-the-art rephrasing approaches.
20.0 47.1%
Monad blog Pleias 2025-11 Dense 0.1B 0.1B 200B
Monad is a 56-million-parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
3571.4 200B
While the final training data is fully synthetic, it relied on seeds collected from three data sources...
3571.4 100.0%
Baguettotron blog Pleias 2025-11 Dense 0.3B 0.3B 200B
Baguettotron is a 321-million-parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.
623.1 200B
While the final training data is fully synthetic, it relied on seeds collected from three data sources...
623.1 100.0%
Trinity Nano blog Arcee 2025-12 MoE 6B 1B 10T
Trinity Nano and Mini train on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T tokens in phase 1, 1.8T tokens in phase 2, and 1.2T tokens in phase 3.
1666.7 TBD TBD TBD
Trinity Mini blog Arcee 2025-12 MoE 26B 3B 10T
Trinity Nano and Mini train on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T tokens in phase 1, 1.8T tokens in phase 2, and 1.2T tokens in phase 3.
384.6 TBD TBD TBD
Nemotron-3 Nano PDF NVIDIA 2025-12 Hybrid 30B 3.5B 25T
Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2.
833.3 2.5T
They curated or generated over 2.5T new tokens from Common Crawl data, applying five prompts to Medium-High-Quality data from 110 Common Crawl snapshots to produce 2.1T new tokens. All synthetic rephrasing used Qwen3-30B-A3B.
83.3 10.0%
Kimi K2.5 blog Moonshot 2026-01 MoE 1T 32B 30.5T
Kimi K2.5 builds on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens.
29.3 TBD TBD TBD
GLM-5 HF Zhipu AI 2026-02 MoE 744B 40B 28.5T
Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens.
38.3 TBD TBD TBD