List of notable large language models since 2017
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.
This page lists notable large language models.
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. For model families, only the training cost of the largest model is listed.
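As a rough illustration of how the training-cost figures relate to the parameter and token counts in the table below, the sketch that follows converts total training FLOPs into petaFLOP-days using the common C ≈ 6·N·D approximation for dense transformers. The 6·N·D rule is an assumption of this example, not a figure taken from the table, so the results only approximately match the cited values.

```python
# Minimal sketch: estimating training cost in petaFLOP-days.
# Assumes the common rule-of-thumb C ≈ 6 * N * D (FLOPs ≈ 6 × parameters × tokens)
# for dense transformers; this is an illustrative approximation, not a value from the table.

PETAFLOP_DAY = 1e15 * 86_400  # 1 petaFLOP/sec sustained for one day = 8.64e19 FLOP

def approx_petaflop_days(params_billion: float, tokens_billion: float) -> float:
    """Estimate training compute (in petaFLOP-days) from model size and corpus size."""
    total_flop = 6 * (params_billion * 1e9) * (tokens_billion * 1e9)
    return total_flop / PETAFLOP_DAY

# GPT-3: 175 B parameters, 300 B tokens -> ~3,650 petaFLOP-days (table lists 3640)
print(round(approx_petaflop_days(175, 300)))
# Llama 3.1 405B: 405 B parameters, 15,600 B tokens -> ~439,000 petaFLOP-days (table lists 440,000)
print(round(approx_petaflop_days(405, 15_600)))
```

The small discrepancies reflect rounding and the fact that the listed figures come from the cited sources rather than from this approximation.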
Name | Release date[a] | Developer | Number of parameters (billion) [b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
---|---|---|---|---|---|---|---|
GPT-1 | June 2018 | OpenAI | 0.117 | | 1[1] | MIT[2] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT | October 2018 | Google | 0.340[3] | 3.3 billion words[3] | 9[4] | Apache 2.0[5] | An early and influential language model.[6] Encoder-only and thus not built to be prompted or generative.[7] Training took 4 days on 64 TPUv2 chips.[8]
T5 | October 2019 | Google | 11[9] | 34 billion tokens[9] | | Apache 2.0[10] | Base model for many Google projects, such as Imagen.[11]
XLNet | June 2019 | Google | 0.340[12] | 33 billion words | 330 | Apache 2.0[13] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[14]
GPT-2 | February 2019 | OpenAI | 1.5[15] | 40GB[16] (~10 billion tokens)[17] | 28[18] | MIT[19] | Trained on 32 TPUv3 chips for 1 week.[18] |
GPT-3 | May 2020 | OpenAI | 175[20] | 300 billion tokens[17] | 3640[21] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[22] |
GPT-Neo | March 2021 | EleutherAI | 2.7[23] | 825 GiB[24] | | MIT[25] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[25]
GPT-J | June 2021 | EleutherAI | 6[26] | 825 GiB[24] | 200[27] | Apache 2.0 | GPT-3-style language model |
Megatron-Turing NLG | October 2021[28] | Microsoft and Nvidia | 530[29] | 338.6 billion tokens[29] | 38,000[30] | Restricted web access | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.[30]
Ernie 3.0 Titan | December 2021 | Baidu | 260[31] | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model.
Claude[32] | December 2021 | Anthropic | 52[33] | 400 billion tokens[33] | | beta | Fine-tuned for desirable behavior in conversations.[34]
GLaM (Generalist Language Model) | December 2021 | Google | 1200[35] | 1.6 trillion tokens[35] | 5600[35] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference than GPT-3.
Gopher | December 2021 | DeepMind | 280[36] | 300 billion tokens[37] | 5833[38] | Proprietary | Later developed into the Chinchilla model. |
LaMDA (Language Model for Dialogue Applications) | January 2022 | Google | 137[39] | 1.56T words,[39] 168 billion tokens[37] | 4110[40] | Proprietary | Specialized for response generation in conversations.
GPT-NeoX | February 2022 | EleutherAI | 20[41] | 825 GiB[24] | 740[27] | Apache 2.0 | Based on the Megatron architecture.
Chinchilla | March 2022 | DeepMind | 70[42] | 1.4 trillion tokens[42][37] | 6805[38] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
PaLM (Pathways Language Model) | April 2022 | Google | 540[43] | 768 billion tokens[42] | 29,250[38] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[38] As of October 2024, it is the largest dense Transformer published.
OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[44] | 180 billion tokens[45] | 310[27] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[46] |
YaLM 100B | June 2022 | Yandex | 100[47] | 1.7 TB[47] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM.
Minerva | June 2022 | Google | 540[48] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[48] | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[49] Initialized from PaLM models, then fine-tuned on mathematical and scientific data.
BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[50] | 350 billion tokens (1.6 TB)[51] | | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages).
Galactica | November 2022 | Meta | 120 | 106 billion tokens[52] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities.
AlexaTM (Teacher Models) | November 2022 | Amazon | 20[53] | 1.3 trillion[54] | | Proprietary[55] | Bidirectional sequence-to-sequence architecture.
LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[56] | 1.4 trillion[56] | 6300[57] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.[56] |
GPT-4 | March 2023 | OpenAI | Unknown[f] (rumored to be 1760)[59] | Unknown | Unknown | Proprietary | Available for ChatGPT Plus users and used in several products.
Chameleon | June 2024 | Meta AI | 34[60] | 4.4 trillion | |||
Cerebras-GPT | March 2023 | Cerebras | 13[61] | | 270[27] | Apache 2.0 | Trained with the Chinchilla formula.
Falcon | March 2023 | Technology Innovation Institute | 40[62] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[63] plus some "curated corpora".[64] | 2800[57] | Apache 2.0[65] | |
BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[66] | | Proprietary | Trained on financial data from proprietary sources, for financial tasks.
PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[67] | | Proprietary |
OpenAssistant[68] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data.
Jurassic-2[69] | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual[70]
PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[71] | 3.6 trillion tokens[71] | 85,000[57] | Proprietary | Was used in the Bard chatbot.[72]
Llama 2 | July 2023 | Meta AI | 70[73] | 2 trillion tokens[73] | 21,000 | Llama 2 license | 1.7 million A100-hours.[74] |
Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot.[75] |
Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[76] |
Mistral 7B | September 2023 | Mistral AI | 7.3[77] | Unknown | | Apache 2.0 |
Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[78] |
Grok-1[79] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[80] |
Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[81] |
Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[82] Mixture of experts model, with 12.9 billion parameters activated per token.[83] |
Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [84] |
DeepSeek LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens[85]: table 2 | 12,000 | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for the 67B model, 1e23 FLOPs for the 7B model.[85]: figure 5
Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[86] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[86] |
Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.[87] |
Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | ||
Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[88] | |
Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models, Haiku, Sonnet, and Opus.[89] |
Nova | October 2024 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | Includes three models, Nova-Instant, Nova-Air, and Nova-Pro. |
DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD.
Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained only on CPUs, on the Fugaku supercomputer.[90]
Phi-3 | April 2024 | Microsoft | 14[91] | 4.8T tokens | | MIT | Microsoft markets them as "small language models".[92]
Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
Qwen2 | June 2024 | Alibaba Cloud | 72[93] | 3T Tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B. |
DeepSeek V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | 1.4M hours on H800.[94] |
Nemotron-4 | June 2024 | Nvidia | 340 | 9T Tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[95][96] |
Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | 405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[97][98] |
DeepSeek V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | DeepSeek License | 2.788M hours on H800 GPUs.[99] |
Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models, Nova Micro, Nova Lite, and Nova Pro.[100]
DeepSeek R1 | January 2025 | DeepSeek | 671 | Unknown | Unknown | MIT | No separate pretraining; trained with reinforcement learning on top of V3-Base.[101][102]
Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | Unknown | Qwen License | [103] |
MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens[104] | Unknown | Minimax Model license | [105][104] |