Flagship open-weight models released since Jan 2024.
Last updated: March 15, 2026
| Model | Report | Org | Date | Arch | Total Params | Active Params | Train Tokens | TPP (Train Tokens / Total Params) | Synth Tokens | STPP (Synth Tokens / Total Params) | Synth % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DBRX | blog | Databricks | 2024-03 | MoE | 132B | 36B | 12T (It was pre-trained on 12T tokens of text and code data.) | 90.9 | TBD | TBD | TBD |
| Grok-1 | github | xAI | 2024-03 | MoE | 314B | 79B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Phi-3 medium 14B | paper | Microsoft | 2024-04 | Dense | 14B | 14B | 4.8T (We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium...) | 342.9 | TBD | TBD | TBD |
| Mixtral 8x22B | none | Mistral | 2024-04 | MoE | 141B | 39B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Snowflake Arctic | blog | Snowflake | 2024-04 | MoE | 480B | 17B | 3.5T (Snowflake Arctic was pretrained on 3.5 trillion tokens of data from publicly available sources.) | 7.3 | TBD | TBD | TBD |
| Command-R+ | none | Cohere | 2024-04 | Dense | 104B | 104B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| DeepSeek-V2 | paper | DeepSeek | 2024-05 | MoE | 236B | 21B | 8.1T (We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens.) | 34.3 | TBD | TBD | TBD |
| Yi-1.5 34B | | 01.AI | 2024-05 | Dense | 34B | 34B | 3.1T (We construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline.) | 91.2 | TBD | TBD | TBD |
| Gemma 2 27B | paper | Google | 2024-06 | Dense | 27B | 27B | 13T (We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2B on 2 trillion tokens.) | 481.5 | TBD | TBD | TBD |
| Llama 3.1 405B | paper | Meta | 2024-07 | Dense | 405B | 405B | 15.6T (We pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens.) | 38.5 | 0 (We found that annealing on small amounts of high-quality code and mathematical data...can boost the performance. We do not use any synthetic data produced by other LLMs for pretraining.) | 0 | 0% |
| Mistral Large 2 | none | Mistral | 2024-07 | Dense | 123B | 123B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Jamba 1.5 Large | paper | AI21 | 2024-08 | Hybrid | 398B | 94B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Qwen 2.5 72B | paper | Alibaba | 2024-09 | Dense | 72B | 72B | 18T (In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens.) | 250.0 | TBD | TBD | TBD |
| Granite 3.0 8B | paper | IBM | 2024-10 | Dense | 8B | 8B | 12T (Trained from scratch following a two-stage training strategy. In the first stage, it is trained on 10 trillion tokens sourced from diverse domains. During the second stage, it is further trained on 2 trillion tokens.) | 1500.0 | TBD | TBD | TBD |
| DeepSeek-V3 | paper | DeepSeek | 2024-12 | MoE | 671B | 37B | 14.8T (We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens.) | 22.1 | TBD | TBD | TBD |
| Phi-4 14B | paper | Microsoft | 2024-12 | Dense | 14B | 14B | 10T (The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules.) | 714.3 | 5.5T (synthetic 40% plus web rewrites 15% of ~10T; mixture by fraction of training, unique tokens, epochs: Web 15%, 1.3T, 1.2; Web rewrites 15%, 290B, 5.2; Synthetic 40%, 290B, 13.8; Code data 20%, 820B, 2.4; Acquired sources 10%, 580B, 1.7) | 392.9 | 55% |
| Falcon 3 10B | blog | TII | 2024-12 | Dense | 10B | 10B | 16T (We conducted a single large-scale pretraining run on the 7B model, using 1024 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data. Depth up-scaling for improved reasoning: building on recent studies on the effects of model depth, we upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data.) | 1600.0 | TBD | TBD | TBD |
| MiniMax-01 | paper | MiniMax | 2025-01 | MoE | 456B | 45.9B | 11.4T (Our training approach involved large-scale pre-training on 11.4 trillion tokens, followed by a three-stage process to extend the context window up to 1 million tokens.) | 25.0 | TBD | TBD | TBD |
| OLMo 2 32B | blog | Allen AI | 2025-02 | Dense | 32B | 32B | 6T (OLMo 2 32B is trained for 1.5 epochs, up to 6T tokens.) | 187.5 | TBD | TBD | TBD |
| Gemma 3 27B | paper | Google | 2025-03 | Dense | 27B | 27B | 14T (We pre-train our models on a slightly larger token budget than Gemma 2, i.e., we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B.) | 518.5 | TBD | TBD | TBD |
| Llama 4 Scout | blog | Meta | 2025-04 | MoE | 109B | 17B | 40T (Pre-training data: ~40T tokens) | 367.0 | TBD | TBD | TBD |
| Llama 4 Maverick | blog | Meta | 2025-04 | MoE | 400B | 17B | 22T (Pre-training data: ~22T tokens) | 55.0 | TBD | TBD | TBD |
| Qwen 3 0.6B | paper | Alibaba | 2025-05 | Dense | 0.6B | 0.6B | 36T (All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.) | 60000.0 | TBD | TBD | TBD |
| Qwen 3 8B | paper | Alibaba | 2025-05 | Dense | 8B | 8B | 36T (All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.) | 4500.0 | TBD | TBD | TBD |
| Qwen 3 14B | paper | Alibaba | 2025-05 | Dense | 14B | 14B | 36T (All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.) | 2571.4 | TBD | TBD | TBD |
| Qwen 3 32B | paper | Alibaba | 2025-05 | Dense | 32B | 32B | 36T (All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.) | 1125.0 | TBD | TBD | TBD |
| Qwen 3 30B-A3B | paper | Alibaba | 2025-05 | MoE | 30B | 3B | 36T (All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.) | 1200.0 | TBD | TBD | TBD |
| Qwen 3 235B-A22B | paper | Alibaba | 2025-05 | MoE | 235B | 22B | 36T (All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.) | 153.2 | TBD | TBD | TBD |
| Kimi K2 | paper | Moonshot | 2025-07 | MoE | 1.04T | 32B | 15.5T (The Kimi K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning four primary domains: Web Text, Code, Mathematics, and Knowledge.) | 14.9 | TBD | TBD | TBD |
| GLM-4.5 | paper | Zhipu AI | 2025-07 | MoE | 355B | 32B | 23T (Multi-stage training on 23T tokens.) | 64.8 | TBD | TBD | TBD |
| OLMo 3 32B | blog | Allen AI | 2025-11 | Dense | 32B | 32B | 6T (Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix.) | 187.5 | TBD | TBD | TBD |
| Trinity Large | paper | Arcee | 2025-11 | MoE | 400B | 13B | 17T (Trinity Large was trained on 17T tokens of data curated by DatologyAI, split across three phases of 10T, 4T, and 3T tokens.) | 42.5 | 8T (Notably, over 8 trillion tokens of synthetic data were generated for this dataset across web, code, math, reasoning, and multilingual domains, using a breadth of state-of-the-art rephrasing approaches.) | 20.0 | 47.1% |
| Monad | blog | Pleias | 2025-11 | Dense | 0.1B | 0.1B | 200B (Monad is a 56 million parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.) | 3571.4 | 200B (While the final training data is fully synthetic, it relied on seeds collected from three data sources:) | 3571.4 | 100% |
| Baguettotron | blog | Pleias | 2025-11 | Dense | 0.3B | 0.3B | 200B (Baguettotron is a 321 million parameter generalist Small Reasoning Model, trained on 200 billion tokens from SYNTH, a fully open generalist dataset.) | 623.1 | 200B (While the final training data is fully synthetic, it relied on seeds collected from three data sources:) | 623.1 | 100% |
| Trinity Nano | blog | Arcee | 2025-12 | MoE | 6B | 1B | 10T (Trinity Nano and Mini train on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T tokens in phase 1, 1.8T tokens in phase 2, and 1.2T tokens in phase 3.) | 1666.7 | TBD | TBD | TBD |
| Trinity Mini | blog | Arcee | 2025-12 | MoE | 26B | 3B | 10T (Trinity Nano and Mini train on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T tokens in phase 1, 1.8T tokens in phase 2, and 1.2T tokens in phase 3.) | 384.6 | TBD | TBD | TBD |
| Nemotron-3 Nano | | NVIDIA | 2025-12 | Hybrid | 30B | 3.5B | 25T (Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2.) | 833.3 | 2.5T (They curated or generated over 2.5T new tokens from Common Crawl data, applying five prompts to medium-high-quality data from 110 Common Crawl snapshots, resulting in 2.1T new tokens. For all synthetic rephrasing, they used Qwen3-30B-A3B.) | 83.3 | 10% |
| Kimi K2.5 | blog | Moonshot | 2026-01 | MoE | 1.04T | 32B | 30.5T (15.5T from Kimi K2 plus continued pretraining: Kimi K2.5 builds on Kimi K2 with continued pretraining over approximately 15T mixed visual and text tokens.) | 29.3 | TBD | TBD | TBD |
| GLM-5 | HF | Zhipu AI | 2026-02 | MoE | 744B | 40B | 28.5T (Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens.) | 38.3 | TBD | TBD | TBD |
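
For reference, the derived columns are simple ratios of the base columns: TPP = train tokens / total params, STPP = synthetic train tokens / total params, and Synth % = synthetic tokens / train tokens. The sketch below recomputes them in Python; function and argument names are illustrative (not from any model report), and the check values come from the DBRX and Phi-4 rows above.

```python
# Minimal sketch: recompute the derived columns (TPP, STPP, Synth %) from the
# base columns of the table. All counts are in billions of tokens/parameters.
# Names are illustrative, not taken from any model report.

def derived_columns(total_params_b: float, train_tokens_b: float,
                    synth_tokens_b: float | None = None):
    """Return (TPP, STPP, Synth %) given counts in billions.

    TPP     = train tokens / total parameters
    STPP    = synthetic train tokens / total parameters
    Synth % = synthetic train tokens / train tokens
    """
    tpp = train_tokens_b / total_params_b
    if synth_tokens_b is None:               # synthetic counts are often undisclosed / TBD
        return round(tpp, 1), None, None
    stpp = synth_tokens_b / total_params_b
    synth_pct = 100 * synth_tokens_b / train_tokens_b
    return round(tpp, 1), round(stpp, 1), round(synth_pct, 1)

# DBRX: 132B total params, 12T train tokens -> TPP = 90.9
print(derived_columns(total_params_b=132, train_tokens_b=12_000))
# Phi-4: 14B total params, 10T train tokens, 5.5T synthetic
# -> TPP = 714.3, STPP = 392.9, Synth % = 55.0
print(derived_columns(total_params_b=14, train_tokens_b=10_000, synth_tokens_b=5_500))
```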