Flagship open-weight models released since Jan 2024.
Last updated: February 13, 2026
| Model | Report | Org | Date | Arch | Total Params | Active Params | Train Tokens | TPP (Total Training Tokens / Total Parameters) | Synth Tokens | STPP (Synthetic Training Tokens / Total Parameters) | Synth % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DBRX | blog | Databricks | 2024-03 | MoE | 132B | 36B | 12T ("It was pre-trained on 12T tokens of text and code data.") | 90.9 | TBD | TBD | TBD |
| Grok-1 | github | xAI | 2024-03 | MoE | 314B | 79B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Phi-3 medium 14B | paper | Microsoft | 2024-04 | Dense | 14B | 14B | 4.8T ("We also provide some initial parameter-scaling results with a 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium.") | 342.9 | TBD | TBD | TBD |
| Mixtral 8x22B | none | Mistral | 2024-04 | MoE | 141B | 39B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Snowflake Arctic | blog | Snowflake | 2024-04 | MoE | 480B | 17B | 3.5T ("Snowflake Arctic was pretrained on 3.5 trillion tokens of data from publicly available sources.") | 7.3 | TBD | TBD | TBD |
| Command-R+ | none | Cohere | 2024-04 | Dense | 104B | 104B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| DeepSeek-V2 | paper | DeepSeek | 2024-05 | MoE | 236B | 21B | 8.1T ("We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens.") | 34.3 | TBD | TBD | TBD |
| Yi-1.5 34B | paper | 01.AI | 2024-05 | Dense | 34B | 34B | 3.1T ("We construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline.") | 91.2 | TBD | TBD | TBD |
| Gemma 2 27B | paper | Google DeepMind | 2024-06 | Dense | 27B | 27B | 13T ("We train Gemma 2 27B on 13 trillion tokens of primarily-English data, the 9B model on 8 trillion tokens, and the 2B on 2 trillion tokens.") | 481.5 | TBD | TBD | TBD |
| Llama 3.1 405B | paper | Meta | 2024-07 | Dense | 405B | 405B | 15.6T ("We pre-trained a flagship model with 405B trainable parameters on 15.6T text tokens.") | 38.5 | 0 ("We found that annealing on small amounts of high-quality code and mathematical data...can boost the performance. We do not use any synthetic data produced by other LLMs for pretraining.") | 0 | 0% |
| Mistral Large 2 | none | Mistral | 2024-07 | Dense | 123B | 123B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Jamba 1.5 Large | paper | AI21 | 2024-08 | Hybrid | 398B | 94B | undisclosed | undisclosed | undisclosed | undisclosed | undisclosed |
| Qwen 2.5 72B | paper | Alibaba | 2024-09 | Dense | 72B | 72B | 18T ("All models are pretrained on our latest large-scale dataset, encompassing up to 18 trillion tokens.") | 250.0 | TBD | TBD | TBD |
| Granite 3.0 8B | paper | IBM | 2024-10 | Dense | 8B | 8B | 12T ("Trained from scratch following a two-stage training strategy. In the first stage, it is trained on 10 trillion tokens sourced from diverse domains. During the second stage, it is further trained on 2 trillion tokens.") | 1500.0 | TBD | TBD | TBD |
| DeepSeek-V3 | paper | DeepSeek | 2024-12 | MoE | 671B | 37B | 14.8T ("We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens.") | 22.1 | TBD | TBD | TBD |
| Phi-4 14B | paper | Microsoft | 2024-12 | Dense | 14B | 14B | 10T ("The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules.") | 714.3 | 5.5T ("Synthetic data constitutes the bulk of the training data for phi-4. Data mixture: Synthetic 40%, Web rewrites 15%, Filtered web data 15%, Code data 20%, Acquired sources 10%.") | 392.9 | 55% |
| Falcon 3 10B | blog | TII | 2024-12 | Dense | 10B | 10B | 16T ("We conducted a single large-scale pretraining run on the 7B model...leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data. [Then] upscaled the 7B model to a 10B parameters model...continuing pre-training with 2 trillion tokens.") | 1600.0 | TBD | TBD | TBD |
| MiniMax-01 | paper | MiniMax | 2025-01 | MoE | 456B | 45.9B | 12T ("Trained on ~12 trillion tokens.") | 26.3 | TBD | TBD | TBD |
| OLMo 2 32B | blog | Allen AI | 2025-03 | Dense | 32B | 32B | 6T ("OLMo 2 32B is trained for 1.5 epochs, up to 6T tokens.") | 187.5 | TBD | TBD | TBD |
| Gemma 3 27B | paper | Google DeepMind | 2025-03 | Dense | 27B | 27B | 14T ("We pre-train our models on a slightly larger token budget than Gemma 2, i.e., we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B.") | 518.5 | TBD | TBD | TBD |
| Llama 4 Scout | blog | Meta | 2025-04 | MoE | 109B | 17B | 40T ("Pre-training data: ~40T tokens") | 367.0 | TBD | TBD | TBD |
| Llama 4 Maverick | blog | Meta | 2025-04 | MoE | 400B | 17B | 22T ("Pre-training data: ~22T tokens") | 55.0 | TBD | TBD | TBD |
| Qwen 3 0.6B | paper | Alibaba | 2025-05 | Dense | 0.6B | 0.6B | 36T ("All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.") | 60000.0 | TBD | TBD | TBD |
| Qwen 3 8B | paper | Alibaba | 2025-05 | Dense | 8B | 8B | 36T ("All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.") | 4500.0 | TBD | TBD | TBD |
| Qwen 3 14B | paper | Alibaba | 2025-05 | Dense | 14B | 14B | 36T ("All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.") | 2571.4 | TBD | TBD | TBD |
| Qwen 3 32B | paper | Alibaba | 2025-05 | Dense | 32B | 32B | 36T ("All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.") | 1125.0 | TBD | TBD | TBD |
| Qwen 3 30B-A3B | paper | Alibaba | 2025-05 | MoE | 30B | 3B | 36T ("All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.") | 1200.0 | TBD | TBD | TBD |
| Qwen 3 235B-A22B | paper | Alibaba | 2025-05 | MoE | 235B | 22B | 36T ("All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens.") | 153.2 | TBD | TBD | TBD |
| Kimi K2 | paper | Moonshot | 2025-07 | MoE | 1.04T | 32B | 15.5T ("The Kimi K2 pre-training corpus comprises 15.5 trillion tokens of curated, high-quality data spanning four primary domains: Web Text, Code, Mathematics, and Knowledge.") | 14.9 | TBD | TBD | TBD |
| GLM-4.5 | paper | Zhipu AI | 2025-07 | MoE | 355B | 32B | 23T ("Multi-stage training on 23T tokens.") | 64.8 | TBD | TBD | TBD |
| OLMo 3 32B | blog | Allen AI | 2025-11 | Dense | 32B | 32B | 6T ("Dolma 3 Mix, a 5.9-trillion-token (~6T) pretraining mix.") | 187.5 | TBD | TBD | TBD |
| Trinity Large | paper | Arcee | 2025-11 | MoE | 400B | 13B | 17T ("Trinity Large was pre-trained on 17 trillion tokens of data curated by DatologyAI, split across three phases of 10T, 4T, and 3T tokens.") | 42.5 | 8T ("Over 8 trillion tokens of synthetic data were generated for this dataset across web, code, math, reasoning, and multilingual domains. The 8T synthetic tokens include 6.5T synthetic web tokens, 1T multilingual tokens, and 800B synthetic code tokens.") | 20.0 | 47.1% |
| Monad | blog | Pleias | 2025-11 | Dense | 56M | 56M | 200B ("Monad is a 56 million parameters generalist Small Reasoning Model, trained on 200 billions tokens from SYNTH, a fully open generalist dataset.") | 3571.4 | 200B ("Full synthetic training makes relatively straightforward to expand language support. Trained exclusively on SYNTH, a fully open generalist synthetic dataset built from 50,000 Wikipedia vital articles as seeds.") | 3571.4 | 100% |
| Baguettotron | blog | Pleias | 2025-11 | Dense | 321M | 321M | 200B ("Baguettotron is a 321 million parameters generalist Small Reasoning Model, trained on 200 billions tokens from SYNTH, a fully open generalist dataset.") | 623.1 | 200B ("Full synthetic training makes relatively straightforward to expand language support. Trained exclusively on SYNTH, a fully open generalist synthetic dataset built from 50,000 Wikipedia vital articles as seeds.") | 623.1 | 100% |
| Trinity Nano | blog | Arcee | 2025-12 | MoE | 6B | 1B | 10T ("Trained on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T in phase 1, 1.8T in phase 2, 1.2T in phase 3.") | 1666.7 | TBD | TBD | TBD |
| Trinity Mini | blog | Arcee | 2025-12 | MoE | 26B | 3B | 10T ("Trained on 10T tokens, organized into three phases with progressively higher quality and STEM concentration: 7T in phase 1, 1.8T in phase 2, 1.2T in phase 3.") | 384.6 | TBD | TBD | TBD |
| Nemotron-3 Nano | paper | NVIDIA | 2025-12 | Hybrid | 30B | 3.5B | 25T ("Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2.") | 833.3 | 2.5T ("They curated or generated over 2.5T new tokens from Common Crawl data. Applied five prompts to Medium-High-Quality data from 110 Common Crawl snapshots, resulting in 2.1T new tokens. For all synthetic rephrasing, they used Qwen3-30B-A3B.") | 83.3 | 10% |
| Kimi K2.5 | blog | Moonshot | 2026-01 | MoE | 1.04T | 32B | 30.5T ("Built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base.") | 29.3 | TBD | TBD | TBD |
| GLM-5 | HF | Zhipu AI | 2026-02 | MoE | 744B | 40B | 28.5T ("Pre-training data growing from 23T to 28.5T tokens. GLM-5 employs a MoE architecture, scaling from GLM-4.5's 355B params to 744B, with 256 experts, 8 activated per token.") | 38.3 | TBD | TBD | TBD |
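
The derived columns are simple ratios: TPP = Train Tokens / Total Params, STPP = Synth Tokens / Total Params, and Synth % = Synth Tokens / Train Tokens (so Llama 3.1's zero synthetic tokens give 0%, and Phi-4's 5.5T of 10T give 55%). The sketch below reproduces the Phi-4 row's values; the `derive` helper and its in-billions encoding are illustrative conventions of this note, not part of any published tooling.

```python
# Minimal sketch of the table's derived columns, assuming counts in billions:
#   TPP     = train_tokens / total_params
#   STPP    = synth_tokens / total_params
#   Synth % = synth_tokens / train_tokens
# The `derive` helper is hypothetical, used here only to show the arithmetic.

def derive(train_tokens_b: float, total_params_b: float, synth_tokens_b: float | None):
    """Return (TPP, STPP, synth_pct); pass synth_tokens_b=None for 'TBD' rows."""
    tpp = train_tokens_b / total_params_b
    if synth_tokens_b is None:  # synthetic share undisclosed
        return tpp, None, None
    stpp = synth_tokens_b / total_params_b
    synth_pct = 100.0 * synth_tokens_b / train_tokens_b
    return tpp, stpp, synth_pct

# Worked example: the Phi-4 14B row (10T train, 5.5T synthetic, 14B params).
tpp, stpp, pct = derive(train_tokens_b=10_000, total_params_b=14, synth_tokens_b=5_500)
print(f"TPP {tpp:.1f}, STPP {stpp:.1f}, Synth {pct:.0f}%")
# -> TPP 714.3, STPP 392.9, Synth 55%
```

The same arithmetic exposes the table's internal consistency: Kimi K2's TPP of 14.9 only holds with a total parameter count of 1.04T rather than a rounded 1T, which is why that row carries the more precise figure.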