GPU microarchitecture by Nvidia From Wikipedia, the free encyclopedia
Ampere is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia as the successor to both the Volta and Turing architectures. It was officially announced on May 14, 2020 and is named after French mathematician and physicist André-Marie Ampère.[1][2]
Launched | May 14, 2020 |
---|---|
Designed by | Nvidia |
Manufactured by | |
Fabrication process | TSMC N7 (professional), Samsung 8N (consumer) |
Codename(s) | GA10x |
Product Series | |
Desktop | |
Professional/workstation | |
Server/datacenter | |
Specifications | |
L1 cache | 192 KB per SM (professional), 128 KB per SM (consumer) |
L2 cache | 2 MB to 6 MB |
Memory support | |
PCIe support | PCIe 4.0 |
Supported Graphics APIs | |
DirectX | DirectX 12 Ultimate (Feature Level 12_2) |
Direct3D | Direct3D 12.0 |
Shader Model | Shader Model 6.8 |
OpenCL | OpenCL 3.0 |
OpenGL | OpenGL 4.6 |
CUDA | Compute Capability 8.6 |
Vulkan | Vulkan 1.3 |
Media Engine | |
Encode codecs | |
Decode codecs | |
Color bit-depth | |
Encoder(s) supported | NVENC |
Display outputs | |
History | |
Predecessor | Turing (consumer), Volta (professional) |
Successor | Ada Lovelace (consumer), Hopper (datacenter) |
Support status | Supported |
Nvidia announced the Ampere architecture GeForce 30 series consumer GPUs at a GeForce Special Event on September 1, 2020.[3][4] Nvidia announced the A100 80 GB GPU at SC20 on November 16, 2020.[5] Mobile RTX graphics cards and the RTX 3060 based on the Ampere architecture were revealed on January 12, 2021.[6]
Nvidia announced Ampere's successor, Hopper, at GTC 2022. "Ampere Next Next" (Blackwell) had already been announced at GPU Technology Conference 2021 for a 2024 release.
Architectural improvements of the Ampere architecture include the following:
Comparison of Compute Capability: GP100 vs GV100 vs GA100[12]
GPU features | Nvidia Tesla P100 | Nvidia Tesla V100 | Nvidia A100 |
---|---|---|---|
GPU codename | GP100 | GV100 | GA100 |
GPU architecture | Pascal | Volta | Ampere |
Compute capability | 6.0 | 7.0 | 8.0 |
Threads / warp | 32 | 32 | 32 |
Max warps / SM | 64 | 64 | 64 |
Max threads / SM | 2048 | 2048 | 2048 |
Max thread blocks / SM | 32 | 32 | 32 |
Max 32-bit registers / SM | 65536 | 65536 | 65536 |
Max registers / block | 65536 | 65536 | 65536 |
Max registers / thread | 255 | 255 | 255 |
Max thread block size | 1024 | 1024 | 1024 |
FP32 cores / SM | 64 | 64 | 64 |
Ratio of SM registers to FP32 cores | 1024 | 1024 | 1024 |
Shared Memory Size / SM | 64 KB | Configurable up to 96 KB | Configurable up to 164 KB |
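Several rows of the table follow from the others by simple division. The sketch below is a minimal illustration using only values copied from the table; the dictionary layout and the function name `derived_limits` are hypothetical, and the "registers per thread at full occupancy" figure is a derived quantity that the table itself does not list.

```python
# Per-SM limits copied from the comparison table (identical for all three GPUs).
LIMITS = {
    "GP100": {"threads_per_warp": 32, "max_threads_per_sm": 2048,
              "registers_per_sm": 65536, "fp32_cores_per_sm": 64},
    "GV100": {"threads_per_warp": 32, "max_threads_per_sm": 2048,
              "registers_per_sm": 65536, "fp32_cores_per_sm": 64},
    "GA100": {"threads_per_warp": 32, "max_threads_per_sm": 2048,
              "registers_per_sm": 65536, "fp32_cores_per_sm": 64},
}


def derived_limits(limits):
    """Recompute the table rows that are ratios of other rows."""
    return {
        # Max warps / SM = max threads / SM divided by threads per warp.
        "max_warps_per_sm": limits["max_threads_per_sm"] // limits["threads_per_warp"],
        # Ratio of SM registers to FP32 cores.
        "registers_per_fp32_core": limits["registers_per_sm"] // limits["fp32_cores_per_sm"],
        # Not in the table: registers left per thread when all 2048 threads are resident.
        "registers_per_thread_at_full_occupancy":
            limits["registers_per_sm"] // limits["max_threads_per_sm"],
    }


for gpu, limits in LIMITS.items():
    print(gpu, derived_limits(limits))
```

Because the per-SM limits are the same across the three generations, the derived values are identical too; the only row in the table that changes is the shared memory size.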
Comparison of Precision Support Matrix[13][14]
GPU | CUDA core FP16 | CUDA core FP32 | CUDA core FP64 | CUDA core INT1 | CUDA core INT4 | CUDA core INT8 | CUDA core TF32 | CUDA core BF16 | Tensor core FP16 | Tensor core FP32 | Tensor core FP64 | Tensor core INT1 | Tensor core INT4 | Tensor core INT8 | Tensor core TF32 | Tensor core BF16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Nvidia Tesla P4 | No | Yes | Yes | No | No | Yes | No | No | No | No | No | No | No | No | No | No |
Nvidia P100 | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | No | No | No | No |
Nvidia Volta | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No | No | No | No | No | No | No |
Nvidia Turing | Yes | Yes | Yes | No | No | No | No | No | Yes | No | No | Yes | Yes | Yes | No | No |
Nvidia A100 | Yes | Yes | Yes | No | No | Yes | No | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
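To give a feel for what the TF32 and BF16 columns mean numerically, here is a rough sketch that reduces a 32-bit float to the mantissa width of each format. It relies on background facts that the table does not state: TF32 keeps 8 exponent bits and 10 explicit mantissa bits, and BF16 keeps 8 exponent bits and 7 explicit mantissa bits. The helper names are hypothetical, and the sketch truncates toward zero for simplicity, whereas real hardware conversions may round instead.

```python
import struct


def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Truncate a float32 value toward zero, keeping `kept_bits` of its 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw float32 bit pattern
    drop = 23 - kept_bits                                # low mantissa bits to clear
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_tf32(x: float) -> float:
    # Assumption: TF32 = 1 sign bit, 8 exponent bits, 10 mantissa bits.
    return truncate_mantissa(x, 10)


def to_bf16(x: float) -> float:
    # Assumption: BF16 = 1 sign bit, 8 exponent bits, 7 mantissa bits.
    return truncate_mantissa(x, 7)


if __name__ == "__main__":
    for value in (1.0 + 2**-10, 1.0 + 2**-11, 3.14159265):
        print(value, to_tf32(value), to_bf16(value))
```

Both formats keep the full FP32 exponent range, so only the precision of the stored value changes, not the range of magnitudes it can represent.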
Comparison of Decode Performance
Concurrent streams | H.264 decode (1080p30) | H.265 (HEVC) decode (1080p30) | VP9 decode (1080p30) |
---|---|---|---|
V100 | 16 | 22 | 22 |
A100 | 75 | 157 | 108 |
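The generational gain per codec can be read off the table as a simple ratio. The snippet below is a small sketch using only the stream counts listed above; the dictionary name `STREAMS` and the function `speedup` are illustrative.

```python
# Concurrent 1080p30 decode streams, copied from the table.
STREAMS = {
    "V100": {"H.264": 16, "H.265 (HEVC)": 22, "VP9": 22},
    "A100": {"H.264": 75, "H.265 (HEVC)": 157, "VP9": 108},
}


def speedup(codec: str) -> float:
    """Ratio of A100 to V100 concurrent decode streams for one codec."""
    return STREAMS["A100"][codec] / STREAMS["V100"][codec]


for codec in STREAMS["A100"]:
    print(f"{codec}: {speedup(codec):.2f}x")
```

On these figures the A100 handles roughly 4.7 times as many H.264 streams, 7.1 times as many HEVC streams, and 4.9 times as many VP9 streams as the V100.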
Die | GA100[15] | GA102[16] | GA103[17] | GA104[18] | GA106[19] | GA107[20] | GA10B[21] | GA10F |
---|---|---|---|---|---|---|---|---|
Die size | 826 mm² | 628 mm² | 496 mm² | 392 mm² | 276 mm² | 200 mm² | 448 mm² | ? |
Transistors | 54.2B | 28.3B | 22B | 17.4B | 12B | 8.7B | 21B | ? |
Transistor density | 65.6 MTr/mm² | 45.1 MTr/mm² | 44.4 MTr/mm² | 44.4 MTr/mm² | 43.5 MTr/mm² | 43.5 MTr/mm² | 46.9 MTr/mm² | ? |
Graphics processing clusters | 8 | 7 | 6 | 6 | 3 | 2 | 2 | 1 |
Streaming multiprocessors | 128 | 84 | 60 | 48 | 30 | 20 | 16 | 12 |
CUDA cores | 12288 | 10752 | 7680 | 6144 | 3840 | 2560 | 2048 | 1536 |
Texture mapping units | 512 | 336 | 240 | 192 | 120 | 80 | 64 | 48 |
Render output units | 192 | 112 | 96 | 96 | 48 | 32 | 32 | 16 |
Tensor cores | 512 | 336 | 240 | 192 | 120 | 80 | 64 | 48 |
RT cores | N/A | 84 | 60 | 48 | 30 | 20 | 8 | 12 |
L1 cache (total) | 24 MB | 10.5 MB | 7.5 MB | 6 MB | 3 MB | 2.5 MB | 3 MB | 1.5 MB |
L1 cache (per SM) | 192 KB | 128 KB | 128 KB | 128 KB | 128 KB | 128 KB | 192 KB | 128 KB |
L2 cache | 40 MB | 6 MB | 4 MB | 4 MB | 3 MB | 2 MB | 4 MB | ? |
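The transistor density row is the transistor count divided by the die area. The following sketch recomputes it from the two rows above it; the names `DIES` and `density_mtr_per_mm2` are illustrative, and GA10F is omitted because the table gives no figures for it.

```python
# (transistor count, die area in mm^2), copied from the table.
DIES = {
    "GA100": (54.2e9, 826),
    "GA102": (28.3e9, 628),
    "GA103": (22e9, 496),
    "GA104": (17.4e9, 392),
    "GA106": (12e9, 276),
    "GA107": (8.7e9, 200),
    "GA10B": (21e9, 448),
}


def density_mtr_per_mm2(die: str) -> float:
    """Transistor density in millions of transistors per square millimetre."""
    transistors, area_mm2 = DIES[die]
    return transistors / area_mm2 / 1e6


for die in DIES:
    print(f"{die}: {density_mtr_per_mm2(die):.1f} MTr/mm2")
```

The clear gap between GA100 (about 65.6 MTr/mm²) and the GA10x dies (about 43 to 47 MTr/mm²) lines up with the two fabrication processes named in the infobox, TSMC N7 for the professional die and Samsung 8N for the consumer dies.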
The Ampere-based A100 accelerator was announced and released on May 14, 2020.[9] The A100 features 19.5 teraflops of FP32 performance, 6912 FP32/INT32 CUDA cores, 3456 FP64 CUDA cores, 40 GB of graphics memory, and 1.6 TB/s of graphics memory bandwidth.[22] The A100 was initially available only in the third-generation DGX server, the DGX A100, which contains eight A100s.[9] The DGX A100 also includes 15 TB of PCIe Gen 4 NVMe storage,[22] two 64-core AMD Rome 7742 CPUs, 1 TB of RAM, and a Mellanox-powered HDR InfiniBand interconnect. The initial price for the DGX A100 was $199,000.[9]
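The quoted 19.5 teraflops can be approximately reproduced from the core count and clock. The sketch below assumes the usual convention that each core retires one fused multiply-add (counted as 2 floating-point operations) per cycle, and it takes the 1410 MHz boost clock from the DGX accelerator comparison table; the function name `peak_tflops` is hypothetical.

```python
FLOPS_PER_FMA = 2  # assumption: one fused multiply-add counts as two operations


def peak_tflops(cores: int, boost_clock_mhz: float) -> float:
    """Theoretical peak throughput in TFLOPS for `cores` units at the boost clock."""
    return cores * FLOPS_PER_FMA * boost_clock_mhz * 1e6 / 1e12


fp32 = peak_tflops(6912, 1410)  # A100 FP32 CUDA cores
fp64 = peak_tflops(3456, 1410)  # A100 FP64 CUDA cores
print(f"FP32: {fp32:.2f} TFLOPS, FP64: {fp64:.2f} TFLOPS")
```

This is a theoretical ceiling; sustained throughput depends on the workload and on whether the boost clock is held.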
Comparison of accelerators used in DGX:[23][24][25]
Model | Architecture | Socket | FP32 CUDA cores | FP64 cores (excl. tensor) | Mixed INT32/FP32 cores | INT32 cores | Boost clock | Memory clock | Memory bus width | Memory bandwidth | VRAM | Single precision (FP32) | Double precision (FP64) | INT8 (non-tensor) | INT8 dense tensor | INT32 | FP4 dense tensor | FP16 | FP16 dense tensor | bfloat16 dense tensor | TensorFloat-32 (TF32) dense tensor | FP64 dense tensor | Interconnect (NVLink) | GPU | L1 Cache | L2 Cache | TDP | Die size | Transistor count | Process | Launched |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B200 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/sec | 192 GB HBM3e | N/A | N/A | N/A | 4.5 POPS | N/A | 9 PFLOPS | N/A | 2.25 PFLOPS | 2.25 PFLOPS | 1.2 PFLOPS | 40 TFLOPS | 1.8 TB/sec | GB100 | N/A | N/A | 1000 W | N/A | 208 B | TSMC 4NP | Q4 2024 (expected) |
B100 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/sec | 192 GB HBM3e | N/A | N/A | N/A | 3.5 POPS | N/A | 7 PFLOPS | N/A | 1.98 PFLOPS | 1.98 PFLOPS | 989 TFLOPS | 30 TFLOPS | 1.8 TB/sec | GB100 | N/A | N/A | 700 W | N/A | 208 B | TSMC 4NP | |
H200 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 6.3 Gbit/s HBM3e | 6144-bit | 4.8 TB/sec | 141 GB HBM3e | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/sec | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 1000 W | 814 mm² | 80 B | TSMC 4N | Q3 2023 |
H100 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 5.2 Gbit/s HBM3 | 5120-bit | 3.35 TB/sec | 80 GB HBM3 | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/sec | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 700 W | 814 mm² | 80 B | TSMC 4N | Q3 2022 |
A100 80GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 3.2 Gbit/s HBM2e | 5120-bit | 2.04 TB/sec | 80 GB HBM2e | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/sec | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q2 2020 |
A100 40GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 2.4 Gbit/s HBM2 | 5120-bit | 1.52 TB/sec | 40 GB HBM2 | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/sec | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | |
V100 32GB | Volta | SXM3 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/sec | 32 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/sec | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 350 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017 |
V100 16GB | Volta | SXM2 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/sec | 16 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/sec | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 300 W | 815 mm² | 21.1 B | TSMC 12FFN | |
P100 | Pascal | SXM/SXM2 | N/A | 1792 | 3584 | N/A | 1480 MHz | 1.4 Gbit/s HBM2 | 4096-bit | 720 GB/sec | 16 GB HBM2 | 10.6 TFLOPS | 5.3 TFLOPS | N/A | N/A | N/A | N/A | 21.2 TFLOPS | N/A | N/A | N/A | N/A | 160 GB/sec | GP100 | 1344 KB (24 KB × 56) | 4096 KB | 300 W | 610 mm² | 15.3 B | TSMC 16FF+ | Q2 2016 |
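The memory bandwidth column can be approximated from the memory clock and bus width columns: the per-pin data rate times the bus width gives bits per second, and dividing by 8 gives bytes. The sketch below is a rough check using a few rows from the table; the function name `bandwidth_gb_per_s` and the choice of rows are illustrative, and the listed figures are rounded, so the match is approximate rather than exact.

```python
def bandwidth_gb_per_s(data_rate_gbit_s: float, bus_width_bits: int) -> float:
    """Theoretical memory bandwidth in GB/s from per-pin data rate and bus width."""
    return data_rate_gbit_s * bus_width_bits / 8  # 8 bits per byte


# (memory clock in Gbit/s, bus width in bits), copied from the table.
ROWS = {
    "B200": (8.0, 8192),
    "H100": (5.2, 5120),
    "A100 40GB": (2.4, 5120),
    "V100": (1.75, 4096),
    "P100": (1.4, 4096),
}

for model, (rate, width) in ROWS.items():
    print(f"{model}: {bandwidth_gb_per_s(rate, width):.0f} GB/s")
```

For these rows the computed values come out at about 8192, 3328, 1536, 896, and 717 GB/s, each close to the rounded bandwidth the table lists for that model.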
Type | GA10B | GA107 | GA106 | GA104 | GA103 | GA102 | GA100 |
---|---|---|---|---|---|---|---|
GeForce MX series | — | GeForce MX570 (mobile) | — | — | — | — | — |
GeForce 20 series | — | GeForce RTX 2050 (mobile) | — | — | — | — | — |
GeForce 30 series | — | GeForce RTX 3050 Laptop, GeForce RTX 3050, GeForce RTX 3050 Ti Laptop | GeForce RTX 3050, GeForce RTX 3060 Laptop, GeForce RTX 3060 | GeForce RTX 3060, GeForce RTX 3060 Ti, GeForce RTX 3070 Laptop, GeForce RTX 3070, GeForce RTX 3070 Ti Laptop, GeForce RTX 3070 Ti, GeForce RTX 3080 Laptop | GeForce RTX 3060 Ti, GeForce RTX 3080 Ti Laptop | GeForce RTX 3070 Ti, GeForce RTX 3080, GeForce RTX 3080 Ti, GeForce RTX 3090, GeForce RTX 3090 Ti | — |
Nvidia Workstation GPUs | — | RTX A1000 (mobile) | RTX A2000 (mobile), RTX A2000 | RTX A3000 (mobile), RTX A4000 (mobile), RTX A4000, RTX A5000 (mobile) | RTX A5500 (mobile) | RTX A4500, RTX A5000, RTX A5500, RTX A6000 | — |
Nvidia Data Center GPUs | — | Nvidia A2, Nvidia A16 | — | — | — | Nvidia A10, Nvidia A40 | Nvidia A30, Nvidia A100 |
Tegra SoCs | AGX Orin, Orin NX, Orin Nano | — | — | — | — | — | — |