Volta (microarchitecture)

GPU microarchitecture by Nvidia From Wikipedia, the free encyclopedia

Volta is the codename, but not the trademark,[1] for a GPU microarchitecture developed by Nvidia, succeeding Pascal. It was first announced on a roadmap in March 2013,[2] although the first product was not announced until May 2017.[3] The architecture is named after the 18th–19th-century Italian chemist and physicist Alessandro Volta. It was Nvidia's first architecture to feature Tensor Cores, specially designed cores that offer higher deep learning performance than regular CUDA cores.[4] The architecture is produced with TSMC's 12 nm FinFET process. The Ampere microarchitecture is the successor to Volta.

Nvidia Volta
Release date: December 7, 2017
Codename: Volta
Fabrication process: TSMC 12 nm (FinFET)
Cards
Enthusiast
  • Tesla V100
  • Tesla V100S PCIe
  • Titan V
  • Titan V CEO Edition
  • Quadro GV100
History
Predecessor: Pascal
Variant: Turing (consumer, professional)
Successor: Ampere (consumer, professional)
Support status: Supported
Painting of Alessandro Volta, eponym of architecture

The first product to use the architecture was the data center Tesla V100, available for example as part of the Nvidia DGX-1 system.[3] Volta has also been used in the Quadro GV100 and Titan V. There were no mainstream GeForce graphics cards based on Volta.

After two USPTO proceedings,[5][6] Nvidia lost the Volta trademark application in the field of artificial intelligence on July 3, 2023. The Volta trademark[7] remains owned by Volta Robots, a company specializing in AI and vision algorithms for robots and unmanned vehicles.

Details

Architectural improvements of the Volta architecture include the following:

  • CUDA Compute Capability 7.0
    • concurrent execution of integer and floating point operations
  • TSMC's 12 nm FinFET process,[8] allowing 21.1 billion transistors.[9]
  • High Bandwidth Memory 2 (HBM2)[8][10]
  • NVLink 2.0: a high-bandwidth bus between the CPU and GPU, and between multiple GPUs. It allows much higher transfer speeds than PCI Express and is estimated to provide 25 Gbit/s per lane.[11] NVLink is disabled on the Titan V.
  • Tensor cores: A tensor core is a unit that multiplies two 4×4 FP16 matrices, and then adds a third FP16 or FP32 matrix to the result by using fused multiply–add operations, and obtains an FP32 result that could be optionally demoted to an FP16 result.[12] Tensor cores are intended to speed up the training of neural networks.[12] Volta's Tensor cores are first generation while Ampere has third generation Tensor cores.[13][14]
  • PureVideo Feature Set I hardware video decoding
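
The Tensor core operation described in the list above (D = A × B + C on 4×4 matrices, with FP16 inputs and FP32 accumulation) can be sketched in plain Python. This is only a numerical model, not a description of the hardware: the names `to_fp16`, `to_fp32` and `tensor_core_fma` are made up for this example, and rounding the accumulator to FP32 after every addition is an assumption about intermediate precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tensor_core_fma(a, b, c, fp16_output=False):
    """Model D = A x B + C for 4x4 matrices given as nested lists.

    A and B are rounded to FP16 first; the sum is accumulated in FP32
    (assumption: one rounding per addition). The result stays FP32 unless
    fp16_output is set, in which case it is demoted to FP16.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exact in double precision.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = to_fp16(acc) if fp16_output else acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
d = tensor_core_fma(identity, b, ones)
print(d[0])  # first row of I x B + 1
```

Each output element takes 4 multiply-adds, so one 4×4×4 operation amounts to 64 fused multiply-adds.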

Comparison of Compute Capability: GP100 vs GV100 vs GA100[15]

| GPU features | Nvidia Tesla P100 | Nvidia Tesla V100 | Nvidia A100 |
| GPU codename | GP100 | GV100 | GA100 |
| GPU architecture | Nvidia Pascal | Nvidia Volta | Nvidia Ampere |
| Compute capability | 6.0 | 7.0 | 8.0 |
| Threads / warp | 32 | 32 | 32 |
| Max warps / SM | 64 | 64 | 64 |
| Max threads / SM | 2048 | 2048 | 2048 |
| Max thread blocks / SM | 32 | 32 | 32 |
| Max 32-bit registers / SM | 65536 | 65536 | 65536 |
| Max registers / block | 65536 | 65536 | 65536 |
| Max registers / thread | 255 | 255 | 255 |
| Max thread block size | 1024 | 1024 | 1024 |
| FP32 cores / SM | 64 | 64 | 64 |
| Ratio of SM registers to FP32 cores | 1024 | 1024 | 1024 |
| Shared memory size / SM | 64 KB | Configurable up to 96 KB | Configurable up to 164 KB |
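
Several rows of this table follow from others by simple arithmetic. The sketch below recomputes them from the V100 column; the function name `sm_derived_limits` and the dictionary keys are illustrative only, and the last value (registers available per thread when an SM is fully occupied) is a derived figure that does not appear in the table.

```python
# Per-SM limits copied from the Tesla V100 (GV100) column of the table.
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
REGISTERS_PER_SM = 65536
FP32_CORES_PER_SM = 64


def sm_derived_limits(warp_size, max_warps, registers, fp32_cores):
    """Derive per-SM figures from the basic limits."""
    max_threads = warp_size * max_warps
    return {
        "max_threads_per_sm": max_threads,
        "registers_per_fp32_core": registers // fp32_cores,
        # Derived here, not listed in the table.
        "registers_per_thread_at_full_occupancy": registers // max_threads,
    }


limits = sm_derived_limits(
    WARP_SIZE, MAX_WARPS_PER_SM, REGISTERS_PER_SM, FP32_CORES_PER_SM
)
print(limits)
```

A kernel that uses more than 32 registers per thread therefore cannot keep all 2048 threads of an SM resident at once.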

Comparison of Precision Support Matrix[16][17]

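Supported CUDA core precisions:

| GPU | FP16 | FP32 | FP64 | INT1 | INT4 | INT8 | TF32 | BF16 |
| Nvidia Tesla P4 | No | Yes | Yes | No | No | Yes | No | No |
| Nvidia P100 | Yes | Yes | Yes | No | No | No | No | No |
| Nvidia Volta | Yes | Yes | Yes | No | No | Yes | No | No |
| Nvidia Turing | Yes | Yes | Yes | No | No | No | No | No |
| Nvidia A100 | Yes | Yes | Yes | No | No | Yes | No | Yes |

Supported Tensor core precisions:

| GPU | FP16 | FP32 | FP64 | INT1 | INT4 | INT8 | TF32 | BF16 |
| Nvidia Tesla P4 | No | No | No | No | No | No | No | No |
| Nvidia P100 | No | No | No | No | No | No | No | No |
| Nvidia Volta | Yes | No | No | No | No | No | No | No |
| Nvidia Turing | Yes | No | No | Yes | Yes | Yes | No | No |
| Nvidia A100 | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |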

Legend:

  • FPnn: floating point with nn bits
  • INTn: integer with n bits
  • INT1: binary
  • TF32: TensorFloat32
  • BF16: bfloat16

Comparison of Decode Performance

| Concurrent streams | H.264 decode (1080p30) | H.265 (HEVC) decode (1080p30) | VP9 decode (1080p30) |
| V100 | 16 | 22 | 22 |
| A100 | 75 | 157 | 108 |

Products

Volta was announced as the GPU microarchitecture of the Xavier generation of Tegra SoCs, which targets self-driving cars.[18][19]

At Nvidia's annual GPU Technology Conference keynote on May 10, 2017, Nvidia officially announced the Volta microarchitecture along with the Tesla V100.[3] The Volta GV100 GPU is built on a 12 nm process and uses HBM2 memory with 900 GB/s of bandwidth.[20]

Nvidia officially announced the Nvidia TITAN V on December 7, 2017.[21][22]

Nvidia officially announced the Quadro GV100 on March 27, 2018.[23]

A ditto mark (") indicates a cell that is merged with the row above in the original table.

| Model | Launch | Code name(s) | Fab (nm) | Transistors (billion) | Die size (mm²) | Bus interface | Core config: CUDA core[c] | Core config: Tensor core[d] | SM count[a] | Graphics Processing Clusters[b] | L2 cache size (MiB) | Base core clock (MHz) | Boost clock (MHz) | Memory clock (MT/s) | Pixel fillrate (GP/s) | Texture fillrate (GT/s) | Memory size (GiB) | Memory bandwidth (GB/s) | Memory bus type | Memory bus width (bit) | Single precision GFLOPS (boost) | Double precision GFLOPS (boost) | Half precision GFLOPS (boost) | TDP (W) | NVLink support | Launch price, MSRP (USD) |
| Nvidia Titan V[24] | December 7, 2017 | GV100-400-A1 | TSMC 12 nm | 21.1 | 815 | PCIe 3.0 ×16 | 5120:320:96 | 640 | 80 | 6 | 4.5 | 1200 | 1455 | 1700 | 139.7 | 465.6 | 12 | 652.8 | HBM2 | 3072 | 12288 (14899) | 6144 (7450) | 24576 (29798) | 250 | No | $2,999 |
| Nvidia Quadro GV100[25] | March 27, 2018 | GV100 | " | " | " | " | 5120:320:128 | " | " | " | 6 | 1132 | 1628 | 1696 | 208.4 | 521 | 32 | 868.4 | " | 4096 | 11592 (16671) | 5796 (8335) | 23183 (33341) | " | Yes | $8,999 |
| Nvidia Titan V CEO Edition[26][27] | June 21, 2018 | " | " | " | " | " | " | " | " | " | " | 1200 | 1455 | 1700 | 186.2 | 465.6 | " | 870.4 | " | " | 12288 (14899) | 6144 (7450) | 24576 (29798) | " | " | N/A |
  1. One Streaming Multiprocessor encompasses 64 CUDA cores and 4 TMUs.
  2. One Graphics Processing Cluster encompasses fourteen Streaming Multiprocessors.
  3. A Tensor core is a mixed-precision FPU specifically designed for matrix arithmetic.
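
Most of the throughput figures in the product table can be reproduced from the core counts and clocks. The sketch below does this for the Titan V at boost clock. It rests on several assumptions that the table does not state outright: the core config is read as CUDA cores : TMUs : ROPs, one fused multiply-add counts as two floating-point operations, and double and half precision run at one half and twice the single-precision rate (which is what the listed figures imply). The function name `derived_specs` is illustrative.

```python
def derived_specs(cuda_cores, tmus, rops, clock_mhz, mem_mts, bus_width_bits):
    """Recompute throughput figures from unit counts and clocks."""
    single = 2 * cuda_cores * clock_mhz / 1000  # GFLOPS, 2 FLOPs per FMA
    return {
        "single_gflops": single,
        "double_gflops": single / 2,   # assumed 1:2 FP64 rate
        "half_gflops": single * 2,     # assumed 2:1 FP16 rate
        "pixel_gpixels": rops * clock_mhz / 1000,
        "texture_gtexels": tmus * clock_mhz / 1000,
        "bandwidth_gbytes": bus_width_bits * mem_mts / 8 / 1000,
    }


# Footnote 1: one SM holds 64 CUDA cores and 4 TMUs; the Titan V has 80 SMs.
SM_COUNT = 80
cuda_cores = SM_COUNT * 64
tmus = SM_COUNT * 4

titan_v_boost = derived_specs(
    cuda_cores, tmus, rops=96, clock_mhz=1455, mem_mts=1700, bus_width_bits=3072
)
for name, value in titan_v_boost.items():
    print(f"{name}: {value:.1f}")
```

Rounded, these values match the Titan V row: 14899, 7450 and 29798 GFLOPS, 139.7 GP/s, 465.6 GT/s and 652.8 GB/s.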

Application

Volta GPUs are used for GPGPU compute in the Summit and Sierra supercomputers.[28][29] They connect to the POWER9 CPUs via NVLink 2.0, which supports cache coherency and thereby improves GPGPU performance.[30][11][31]
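
The NVLink 2.0 rate of 25 Gbit/s per lane quoted in the Details section can be related to the 300 GB/sec interconnect figure listed for the V100. The sketch below assumes 8 lanes per direction in each link and 6 links per V100; neither number appears in this article, so treat both as assumptions. The function name `nvlink_bandwidth_gbytes` is illustrative.

```python
def nvlink_bandwidth_gbytes(lane_rate_gbit, lanes_per_direction, links):
    """Aggregate NVLink bandwidth in GB/s from the per-lane signalling rate."""
    per_link_one_way = lane_rate_gbit * lanes_per_direction / 8  # bits to bytes
    return {
        "per_link_one_way": per_link_one_way,
        "per_link_bidirectional": 2 * per_link_one_way,
        "total_bidirectional": 2 * per_link_one_way * links,
    }


# Assumed topology: 8 lanes per direction per link, 6 links on a V100.
v100_nvlink = nvlink_bandwidth_gbytes(25, lanes_per_direction=8, links=6)
print(v100_nvlink)
```

Under these assumptions each link carries 25 GB/s per direction, and the six links together give 300 GB/s of bidirectional bandwidth.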

V100 accelerator and DGX V100

Comparison of accelerators used in DGX:[32][33][34]

| Model | Architecture | Socket | FP32 CUDA cores | FP64 cores (excl. tensor) | Mixed INT32/FP32 cores | INT32 cores | Boost clock | Memory clock | Memory bus width | Memory bandwidth | VRAM | Single precision (FP32) | Double precision (FP64) | INT8 (non-tensor) | INT8 dense tensor | INT32 | FP4 dense tensor | FP16 | FP16 dense tensor | bfloat16 dense tensor | TensorFloat-32 (TF32) dense tensor | FP64 dense tensor | Interconnect (NVLink) | GPU | L1 Cache | L2 Cache | TDP | Die size | Transistor count | Process | Launched |
| P100 | Pascal | SXM/SXM2 | N/A | 1792 | 3584 | N/A | 1480 MHz | 1.4 Gbit/s HBM2 | 4096-bit | 720 GB/sec | 16 GB HBM2 | 10.6 TFLOPS | 5.3 TFLOPS | N/A | N/A | N/A | N/A | 21.2 TFLOPS | N/A | N/A | N/A | N/A | 160 GB/sec | GP100 | 1344 KB (24 KB × 56) | 4096 KB | 300 W | 610 mm² | 15.3 B | TSMC 16FF+ | Q2 2016 |
| V100 16GB | Volta | SXM2 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/sec | 16 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/sec | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 300 W | 815 mm² | 21.1 B | TSMC 12FFN | Q3 2017 |
| V100 32GB | Volta | SXM3 | 5120 | 2560 | N/A | 5120 | 1530 MHz | 1.75 Gbit/s HBM2 | 4096-bit | 900 GB/sec | 32 GB HBM2 | 15.7 TFLOPS | 7.8 TFLOPS | 62 TOPS | N/A | 15.7 TOPS | N/A | 31.4 TFLOPS | 125 TFLOPS | N/A | N/A | N/A | 300 GB/sec | GV100 | 10240 KB (128 KB × 80) | 6144 KB | 350 W | 815 mm² | 21.1 B | TSMC 12FFN | |
| A100 40GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 2.4 Gbit/s HBM2 | 5120-bit | 1.52 TB/sec | 40 GB HBM2 | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/sec | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | Q1 2020 |
| A100 80GB | Ampere | SXM4 | 6912 | 3456 | 6912 | N/A | 1410 MHz | 3.2 Gbit/s HBM2e | 5120-bit | 1.52 TB/sec | 80 GB HBM2e | 19.5 TFLOPS | 9.7 TFLOPS | N/A | 624 TOPS | 19.5 TOPS | N/A | 78 TFLOPS | 312 TFLOPS | 312 TFLOPS | 156 TFLOPS | 19.5 TFLOPS | 600 GB/sec | GA100 | 20736 KB (192 KB × 108) | 40960 KB | 400 W | 826 mm² | 54.2 B | TSMC N7 | |
| H100 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 5.2 Gbit/s HBM3 | 5120-bit | 3.35 TB/sec | 80 GB HBM3 | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/sec | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 700 W | 814 mm² | 80 B | TSMC 4N | Q3 2022 |
| H200 | Hopper | SXM5 | 16896 | 4608 | 16896 | N/A | 1980 MHz | 6.3 Gbit/s HBM3e | 6144-bit | 4.8 TB/sec | 141 GB HBM3e | 67 TFLOPS | 34 TFLOPS | N/A | 1.98 POPS | N/A | N/A | N/A | 990 TFLOPS | 990 TFLOPS | 495 TFLOPS | 67 TFLOPS | 900 GB/sec | GH100 | 25344 KB (192 KB × 132) | 51200 KB | 1000 W | 814 mm² | 80 B | TSMC 4N | Q3 2023 |
| B100 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/sec | 192 GB HBM3e | N/A | N/A | N/A | 3.5 POPS | N/A | 7 PFLOPS | N/A | 1.98 PFLOPS | 1.98 PFLOPS | 989 TFLOPS | 30 TFLOPS | 1.8 TB/sec | GB100 | N/A | N/A | 700 W | N/A | 208 B | TSMC 4NP | Q4 2024 (expected) |
| B200 | Blackwell | SXM6 | N/A | N/A | N/A | N/A | N/A | 8 Gbit/s HBM3e | 8192-bit | 8 TB/sec | 192 GB HBM3e | N/A | N/A | N/A | 4.5 POPS | N/A | 9 PFLOPS | N/A | 2.25 PFLOPS | 2.25 PFLOPS | 1.2 PFLOPS | 40 TFLOPS | 1.8 TB/sec | GB100 | N/A | N/A | 1000 W | N/A | 208 B | TSMC 4NP | |
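
Two V100 entries in this comparison, the 125 TFLOPS FP16 dense tensor rate and the 900 GB/sec memory bandwidth, can be approximated from figures given elsewhere in the article. The sketch below assumes the V100 has the same 640 Tensor cores listed for the GV100-based Titan V, that each Tensor core completes 64 fused multiply-adds per clock (one 4×4×4 matrix operation), and that each multiply-add counts as two operations. The function names are illustrative.

```python
def tensor_tflops(tensor_cores, clock_mhz, fma_per_clock=64):
    """Peak dense tensor throughput in TFLOPS (2 FLOPs per multiply-add)."""
    return tensor_cores * fma_per_clock * 2 * clock_mhz * 1e6 / 1e12


def hbm_bandwidth_gbytes(bus_width_bits, pin_rate_gbit):
    """Peak memory bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * pin_rate_gbit / 8


# V100: 640 Tensor cores (assumed), 1530 MHz boost, 4096-bit HBM2 at 1.75 Gbit/s.
v100_tensor = tensor_tflops(640, 1530)
v100_bandwidth = hbm_bandwidth_gbytes(4096, 1.75)
print(f"{v100_tensor:.1f} TFLOPS, {v100_bandwidth:.0f} GB/s")
```

The results, about 125.3 TFLOPS and 896 GB/s, agree with the listed 125 TFLOPS and 900 GB/sec after rounding.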
