List of notable large language models since 2017
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.
This page lists notable large language models.
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. For model families released in multiple sizes, only the training cost of the largest model is listed.
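As a worked illustration of this conversion (not part of the original article), the short Python sketch below turns a raw FLOP count into petaFLOP-days and estimates total training compute with the commonly used rule of thumb of roughly 6 FLOP per parameter per training token; that 6·N·D estimate is an assumption introduced here, not a figure from the table. Applied to GPT-3's entry (175 billion parameters, 300 billion tokens), it gives about 3,600 petaFLOP-days, consistent with the 3640 listed below.

```python
# Minimal sketch (illustrative, not from the article): converting training
# compute into the petaFLOP-day unit used in the table.

PETAFLOP_DAY_IN_FLOP = 1e15 * 86_400  # 1 petaFLOP/sec sustained for one day = 8.64E19 FLOP

def petaflop_days(total_flop: float) -> float:
    """Convert a raw FLOP count into petaFLOP-days."""
    return total_flop / PETAFLOP_DAY_IN_FLOP

def approx_training_flop(parameters: float, tokens: float) -> float:
    """Rough dense-transformer estimate (assumption): ~6 FLOP per parameter per token."""
    return 6 * parameters * tokens

# Example: GPT-3, with 175e9 parameters and 300e9 training tokens per the table.
flop = approx_training_flop(175e9, 300e9)   # ~3.15e23 FLOP
print(round(petaflop_days(flop)))           # ~3646 petaFLOP-days, close to the listed 3640
```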
Name | Release date[a] | Developer | Number of parameters (billion) [b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
---|---|---|---|---|---|---|---|
GPT-1 | June 2018 | OpenAI | 0.117 | | 1[1] | MIT[2] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT | October 2018 | Google | 0.340[3] | 3.3 billion words[3] | 9[4] | Apache 2.0[5] | An early and influential language model.[6] Encoder-only and thus not built to be prompted or generative.[7] Training took 4 days on 64 TPUv2 chips.[8]
T5 | October 2019 | Google | 11[9] | 34 billion tokens[9] | | Apache 2.0[10] | Base model for many Google projects, such as Imagen.[11]
XLNet | June 2019 | Google | 0.340[12] | 33 billion words | 330 | Apache 2.0[13] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[14]
GPT-2 | February 2019 | OpenAI | 1.5[15] | 40GB[16] (~10 billion tokens)[17] | 28[18] | MIT[19] | Trained on 32 TPUv3 chips for 1 week.[18] |
GPT-3 | May 2020 | OpenAI | 175[20] | 300 billion tokens[17] | 3640[21] | proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[22] |
GPT-Neo | March 2021 | EleutherAI | 2.7[23] | 825 GiB[24] | | MIT[25] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[25]
GPT-J | June 2021 | EleutherAI | 6[26] | 825 GiB[24] | 200[27] | Apache 2.0 | GPT-3-style language model |
Megatron-Turing NLG | October 2021[28] | Microsoft and Nvidia | 530[29] | 338.6 billion tokens[29] | 38000[30] | Restricted web access | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.[30] |
Ernie 3.0 Titan | December 2021 | Baidu | 260[31] | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model.
Claude[32] | December 2021 | Anthropic | 52[33] | 400 billion tokens[33] | | beta | Fine-tuned for desirable behavior in conversations.[34]
GLaM (Generalist Language Model) | December 2021 | Google | 1200[35] | 1.6 trillion tokens[35] | 5600[35] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher | December 2021 | DeepMind | 280[36] | 300 billion tokens[37] | 5833[38] | Proprietary | Later developed into the Chinchilla model. |
LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137[39] | 1.56T words,[39] 168 billion tokens[37] | 4110[40] | Proprietary | Specialized for response generation in conversations.
GPT-NeoX | February 2022 | EleutherAI | 20[41] | 825 GiB[24] | 740[27] | Apache 2.0 | Based on the Megatron architecture.
Chinchilla | March 2022 | DeepMind | 70[42] | 1.4 trillion tokens[42][37] | 6805[38] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
PaLM (Pathways Language Model) | April 2022 | Google | 540[43] | 768 billion tokens[42] | 29,250[38] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[38] As of October 2024, it is the largest dense Transformer published.
OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[44] | 180 billion tokens[45] | 310[27] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[46] |
YaLM 100B | June 2022 | Yandex | 100[47] | 1.7 TB[47] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM.
Minerva | June 2022 | Google | 540[48] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[48] | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[49] Initialized from PaLM models, then fine-tuned on mathematical and scientific data.
BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[50] | 350 billion tokens (1.6 TB)[51] | | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages).
Galactica | November 2022 | Meta | 120 | 106 billion tokens[52] | unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
AlexaTM (Teacher Models) | November 2022 | Amazon | 20[53] | 1.3 trillion[54] | | Proprietary[55] | Bidirectional sequence-to-sequence architecture.
Neuro-sama | December 2022 | Independent | Unknown | Unknown | | Privately owned | A language model designed for live-streaming on Twitch.
LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[56] | 1.4 trillion[56] | 6300[57] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.[56] |
GPT-4 | March 2023 | OpenAI | Unknown[f] (According to rumors: 1760)[59] | Unknown | Unknown | proprietary | Available for ChatGPT Plus users and used in several products. |
Chameleon | June 2024 | Meta AI | 34[60] | 4.4 trillion | |||
Cerebras-GPT | March 2023 | Cerebras | 13[61] | | 270[27] | Apache 2.0 | Trained with the Chinchilla formula.
Falcon | March 2023 | Technology Innovation Institute | 40[62] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[63] plus some "curated corpora".[64] | 2800[57] | Apache 2.0[65] | |
BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets[66] | | Proprietary | Trained on financial data from proprietary sources, for financial tasks.
PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[67] | | Proprietary |
OpenAssistant[68] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data.
Jurassic-2[69] | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual[70]
PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[71] | 3.6 trillion tokens[71] | 85,000[57] | Proprietary | Used in the Bard chatbot.[72]
Llama 2 | July 2023 | Meta AI | 70[73] | 2 trillion tokens[73] | 21,000 | Llama 2 license | 1.7 million A100-hours.[74] |
Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot.[75] |
Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[76] |
Mistral 7B | September 2023 | Mistral AI | 7.3[77] | Unknown | | Apache 2.0 |
Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[78] |
Grok-1[79] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[80] |
Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[81] |
Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[82] Mixture of experts model, with 12.9 billion parameters activated per token.[83] |
Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [84] |
Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[85] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[85] |
Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.[86] |
Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | ||
Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[87] | |
Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models, Haiku, Sonnet, and Opus.[88] |
Nova | October 2024 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | Includes three models, Nova-Instant, Nova-Air, and Nova-Pro. |
DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost of 10 million USD.
Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained using only CPUs, on the Fugaku supercomputer.[89]
Phi-3 | April 2024 | Microsoft | 14[90] | 4.8T tokens | | MIT | Microsoft markets them as "small language models".[91]
Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
Qwen2 | June 2024 | Alibaba Cloud | 72[92] | 3T tokens | | | Multiple sizes, the smallest being 0.5B.
Nemotron-4 | June 2024 | Nvidia | 340 | 9T Tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[93][94] |
Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | 405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[95][96] |
DeepSeek V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | | DeepSeek License | Trained for 2.788 million hours on H800 GPUs.[97]
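The training-cost figures above can also be sanity-checked against the notes column. The sketch below (illustrative only, not from the article) reproduces two of them: Llama 3.1's roughly 440,000 petaFLOP-days from the 3.8E25 FLOP reported in its notes, and Chinchilla's 1.4 trillion training tokens from the roughly 20-tokens-per-parameter rule of thumb associated with the Chinchilla scaling law; that 20:1 ratio is background knowledge stated here as an approximation, not a value taken from the table.

```python
# Illustrative cross-check of two table entries (not from the article).

PETAFLOP_DAY_IN_FLOP = 8.64e19  # 1 petaFLOP/sec x 86,400 seconds

# Llama 3.1 405B: its notes report ~3.8e25 FLOP of training compute.
llama31_flop = 3.8e25
print(f"Llama 3.1: {llama31_flop / PETAFLOP_DAY_IN_FLOP:,.0f} petaFLOP-days")
# -> about 439,815, matching the listed 440,000.

# Chinchilla 70B: the scaling law it is cited for implies roughly
# 20 training tokens per parameter for compute-optimal training (approximation).
chinchilla_params = 70e9
print(f"Chinchilla: {20 * chinchilla_params / 1e12:.1f} trillion tokens")
# -> 1.4 trillion, matching the listed corpus size.
```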