List of large language models

From Wikipedia, the free encyclopedia

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

This page lists notable large language models.

List


For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Where a model family has multiple sizes, only the training cost of the largest model is listed.
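As a worked example, here is a minimal Python sketch of this unit conversion, using the 3,640 petaFLOP-days listed for GPT-3 in the table below:

```python
# Convert a training cost given in petaFLOP-days to total FLOP.
# 1 petaFLOP-day = 1e15 FLOP/s sustained for 86,400 s = 8.64e19 FLOP.
PFLOP_PER_SEC = 1e15
SECONDS_PER_DAY = 86_400
FLOP_PER_PETAFLOP_DAY = PFLOP_PER_SEC * SECONDS_PER_DAY  # 8.64e19

def petaflop_days_to_flop(petaflop_days: float) -> float:
    """Total training compute in FLOP for a cost given in petaFLOP-days."""
    return petaflop_days * FLOP_PER_PETAFLOP_DAY

# GPT-3 is listed at 3,640 petaFLOP-days, i.e. roughly 3.1e23 FLOP.
print(f"{petaflop_days_to_flop(3640):.2e}")  # -> 3.14e+23
```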

Name | Release date[a] | Developer | Number of parameters (billions)[b] | Corpus size | Training cost (petaFLOP-days) | License[c] | Notes
GPT-1 | June 2018 | OpenAI | 0.117 | | 1[1] | MIT[2] | First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.
BERT | October 2018 | Google | 0.340[3] | 3.3 billion words[3] | 9[4] | Apache 2.0[5] | An early and influential language model.[6] Encoder-only and thus not built to be prompted or generative.[7] Training took 4 days on 64 TPUv2 chips.[8]
T5 | October 2019 | Google | 11[9] | 34 billion tokens[9] | | Apache 2.0[10] | Base model for many Google projects, such as Imagen.[11]
XLNet | June 2019 | Google | 0.340[12] | 33 billion words | 330 | Apache 2.0[13] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[14]
GPT-2 | February 2019 | OpenAI | 1.5[15] | 40 GB[16] (~10 billion tokens)[17] | 28[18] | MIT[19] | Trained on 32 TPUv3 chips for 1 week.[18]
GPT-3 | May 2020 | OpenAI | 175[20] | 300 billion tokens[17] | 3640[21] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[22]
GPT-Neo | March 2021 | EleutherAI | 2.7[23] | 825 GiB[24] | | MIT[25] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[25]
GPT-J | June 2021 | EleutherAI | 6[26] | 825 GiB[24] | 200[27] | Apache 2.0 | GPT-3-style language model.
Megatron-Turing NLG | October 2021[28] | Microsoft and Nvidia | 530[29] | 338.6 billion tokens[29] | 38,000[30] | Restricted web access | Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene supercomputer, for over 3 million GPU-hours.[30]
Ernie 3.0 Titan | December 2021 | Baidu | 260[31] | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model.
Claude[32] | December 2021 | Anthropic | 52[33] | 400 billion tokens[33] | | beta | Fine-tuned for desirable behavior in conversations.[34]
GLaM (Generalist Language Model) | December 2021 | Google | 1200[35] | 1.6 trillion tokens[35] | 5600[35] | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher | December 2021 | DeepMind | 280[36] | 300 billion tokens[37] | 5833[38] | Proprietary | Later developed into the Chinchilla model.
LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137[39] | 1.56T words,[39] 168 billion tokens[37] | 4110[40] | Proprietary | Specialized for response generation in conversations.
GPT-NeoX | February 2022 | EleutherAI | 20[41] | 825 GiB[24] | 740[27] | Apache 2.0 | Based on the Megatron architecture.
Chinchilla | March 2022 | DeepMind | 70[42] | 1.4 trillion tokens[42][37] | 6805[38] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.
PaLM (Pathways Language Model) | April 2022 | Google | 540[43] | 768 billion tokens[42] | 29,250[38] | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips.[38] As of October 2024, it is the largest dense Transformer published.
OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[44] | 180 billion tokens[45] | 310[27] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.[46]
YaLM 100B | June 2022 | Yandex | 100[47] | 1.7 TB[47] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM.
Minerva | June 2022 | Google | 540[48] | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[48] | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning".[49] Initialized from PaLM models, then finetuned on mathematical and scientific data.
BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[50] | 350 billion tokens (1.6 TB)[51] | | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages).
Galactica | November 2022 | Meta | 120 | 106 billion tokens[52] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities.
AlexaTM (Teacher Models) | November 2022 | Amazon | 20[53] | 1.3 trillion[54] | | Proprietary[55] | Bidirectional sequence-to-sequence architecture.
LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[56] | 1.4 trillion[56] | 6300[57] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters.[56]
GPT-4 | March 2023 | OpenAI | Unknown[f] (according to rumors: 1760)[59] | Unknown | Unknown; estimated 230,000 | Proprietary | Available for ChatGPT Plus users and used in several products.
Chameleon | June 2024 | Meta AI | 34[60] | 4.4 trillion | | | 
Cerebras-GPT | March 2023 | Cerebras | 13[61] | | 270[27] | Apache 2.0 | Trained with the Chinchilla formula.
Falcon | March 2023 | Technology Innovation Institute | 40[62] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[63] plus some "curated corpora"[64] | 2800[57] | Apache 2.0[65] | 
BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[66] | | Proprietary | Trained on financial data from proprietary sources, for financial tasks.
PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[67] | | Proprietary | 
OpenAssistant[68] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data.
Jurassic-2[69] | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual.[70]
PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[71] | 3.6 trillion tokens[71] | 85,000[57] | Proprietary | Was used in the Bard chatbot.[72]
Llama 2 | July 2023 | Meta AI | 70[73] | 2 trillion tokens[73] | 21,000 | Llama 2 license | 1.7 million A100-hours.[74]
Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot.[75]
Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[76]
Mistral 7B | September 2023 | Mistral AI | 7.3[77] | Unknown | | Apache 2.0 | 
Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[78]
Grok 1[79] | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in the Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).[80]
Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[81]
Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[82] Mixture-of-experts model, with 12.9 billion parameters activated per token.[83]
Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [84]
DeepSeek-LLM | November 29, 2023 | DeepSeek | 67 | 2T tokens[85]:table 2 | 12,000 | DeepSeek License | Trained on English and Chinese text. 1e24 FLOPs for the 67B model; 1e23 FLOPs for the 7B model.[85]:figure 5
Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[86] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[86]
Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a mixture-of-experts (MoE) architecture. Context window above 1 million tokens.[87]
Gemini Ultra | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | | 
Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[88] | 
Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus.[89]
Nova | October 2024 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | Previously comprised three models: Nova-Instant, Nova-Air, and Nova-Pro. The company later shifted to Sonus AI.
Sonus[90] | January 2025 | Rubik's AI | Unknown | Unknown | Unknown | Proprietary | 
DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD.
Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained on CPUs only, on the Fugaku supercomputer.[91]
Phi-3 | April 2024 | Microsoft | 14[92] | 4.8T tokens | | MIT | Microsoft markets them as "small language models".[93]
Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | 
Qwen2 | June 2024 | Alibaba Cloud | 72[94] | 3T tokens | Unknown | Qwen License | Multiple sizes, the smallest being 0.5B.
DeepSeek-V2 | June 2024 | DeepSeek | 236 | 8.1T tokens | 28,000 | DeepSeek License | 1.4M hours on H800 GPUs.[95]
Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[96][97]
Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million hours on H100-80GB GPUs, at 3.8E25 FLOPs.[98][99]
DeepSeek-V3 | December 2024 | DeepSeek | 671 | 14.8T tokens | 56,000 | MIT | 2.788M hours on H800 GPUs.[100] Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.[101]
Amazon Nova | December 2024 | Amazon | Unknown | Unknown | Unknown | Proprietary | Includes three models: Nova Micro, Nova Lite, and Nova Pro.[102]
DeepSeek-R1 | January 2025 | DeepSeek | 671 | Not applicable | Unknown | MIT | No pretraining. Reinforcement-learned upon V3-Base.[103][104]
Qwen2.5 | January 2025 | Alibaba | 72 | 18T tokens | Unknown | Qwen License | 7 dense models, with parameter counts from 0.5B to 72B. They also released 2 MoE variants.[105]
MiniMax-Text-01 | January 2025 | Minimax | 456 | 4.7T tokens[106] | Unknown | Minimax Model license | [107][106]
Gemini 2.0 | February 2025 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Three models released: Flash, Flash-Lite and Pro.[108][109][110]
Mistral Large | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Upgraded over time. The latest version is 24.11.[111]
Pixtral | November 2024 | Mistral AI | 123 | Unknown | Unknown | Mistral Research License | Multimodal. There is also a 12B version which is under the Apache 2.0 license.[111]
Grok 3 | February 2025 | xAI | Unknown | Unknown | Unknown; estimated 5,800,000 | Proprietary | Training cost estimated at 5E26 FLOP.[112]
Llama 4 | April 5, 2025 | Meta AI | 400 | 40T tokens | | Llama 4 license | [113][114]
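Training compute for dense transformers is also commonly approximated as 6 × parameters × training tokens. The following minimal Python sketch, assuming that rule of thumb (it is an estimate, not how the table's figures were derived), cross-checks the Llama 3.1 row, whose parameter count, token count, compute and cost are all listed above:

```python
# Rough cross-check of the table's columns using the common
# "compute ~ 6 * parameters * training tokens" approximation for
# dense transformers (an estimate only, not the source of the table values).
FLOP_PER_PETAFLOP_DAY = 8.64e19

def approx_training_flop(params: float, tokens: float) -> float:
    """Approximate training compute in FLOP for a dense model."""
    return 6 * params * tokens

# Llama 3.1 row: 405 billion parameters, 15.6 trillion tokens.
flop = approx_training_flop(405e9, 15.6e12)
print(f"{flop:.2e} FLOP")  # ~3.79e+25, close to the cited 3.8E25 FLOPs
print(f"{flop / FLOP_PER_PETAFLOP_DAY:,.0f} petaFLOP-days")  # ~439,000, close to the listed 440,000
```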

See also

Notes

  a. This is the date that documentation describing the model's architecture was first released.
  b. In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
  c. This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
  d. The smaller models, including 66B, are publicly available, while the 175B model is available on request.
  e. Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
  f. As stated in the technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."[58]

References
