X86 SIMD instruction listings
The x86 instruction set has been extended several times with SIMD (single instruction, multiple data) instruction set extensions. These extensions, starting with the MMX instruction set introduced with the Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.
Summary of SIMD extensions
The main SIMD instruction set extensions that have been introduced for x86 are MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, F16C, AVX2, FMA3/FMA4, AVX-512 and AMX, each covered in the sections below.
MMX instructions and extended variants thereof
These instructions are, unless otherwise noted, available in the following forms:
- MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
- SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
- AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
- AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
- AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another). AVX-512 also adds broadcast functionality for many of its instructions - this is used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below provide indications of whether opmasks and broadcasts are supported for each instruction, and if so, what lane-widths they are using.
For many of the instruction mnemonics, (V) is used to indicate that the instruction mnemonic exists in forms with and without a leading V - the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without a VEX/EVEX prefix.
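As a concrete illustration of the forms listed above, the sketch below uses C with Intel intrinsics (function names are illustrative; compile with e.g. -msse2 -mavx2) to show the same 16-bit lane-wise addition at two vector widths - PADDW on a 128-bit XMM register and VPADDW on a 256-bit YMM register.

```c
#include <immintrin.h>

/* Eight 16-bit lanes: compiles to PADDW (or VPADDW xmm under AVX). */
__m128i add16_sse2(__m128i a, __m128i b) {
    return _mm_add_epi16(a, b);
}

/* Sixteen 16-bit lanes: the AVX2 extended variant, VPADDW ymm. */
__m256i add16_avx2(__m256i a, __m256i b) {
    return _mm256_add_epi16(a, b);
}
```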
Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof
- For code that may potentially mix use of legacy-SSE instructions with AVX instructions, it is strongly recommended to execute a VZEROUPPER or VZEROALL instruction after executing AVX instructions but before executing SSE instructions. If this is not done, any subsequent legacy-SSE code may be subject to severe performance degradation.[5]
- On some early AVX implementations (e.g. Sandy Bridge[6]), encoding the VZEROUPPER and VZEROALL instructions with VEX.W=1 will result in #UD - for this reason, it is recommended to encode these instructions with VEX.W=0.
- The 64-bit move instruction forms that are encoded by using a REX.W prefix with the 0F 6E and 0F 7E opcodes are listed with different mnemonics in Intel and AMD documentation - MOVQ in Intel documentation[7] and MOVD in AMD documentation.[8] This is a documentation difference only - the operation performed by these opcodes is the same for Intel and AMD. (This documentation difference applies only to the MMX/SSE forms of these opcodes - for VEX/EVEX-encoded forms, both Intel and AMD use the mnemonic VMOVQ.)
- On all Intel,[9] AMD[10] and Zhaoxin[11] processors that support AVX, the 128-bit forms of VMOVDQA (encoded with a VEX prefix and VEX.L=0) are, when used with a memory argument addressing WB (write-back cacheable) memory, architecturally guaranteed to perform the 128-bit memory access atomically - this applies to both load and store. (Intel and AMD provide somewhat wider guarantees covering more 128-bit instruction variants, but Zhaoxin provides the guarantee for cacheable VMOVDQA only.) On processors that support SSE but not AVX, the 128-bit forms of SSE load/store instructions such as MOVAPS/MOVAPD/MOVDQA are not guaranteed to execute atomically - examples of processors where such instructions have been observed to execute non-atomically include Intel Core Duo and AMD K10.[12]
- For the VPACK* and VPUNPCK* instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes, where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
- For the MMX packed shift instructions PSLL* and PSR* with a shift-argument taken from a vector source (mm or m64), the shift-amount is considered to be a single 64-bit scalar value - the same shift-amount is used for all lanes of the destination vector. This shift-amount is unsigned and is not masked - all bits are considered (e.g. a shift-amount of 0x80000000_00000000 can be specified and will have the same effect as a shift-amount of 64). For all SSE2/AVX/AVX-512 extended variants of these instructions, the shift-amount vector argument is considered to be a 128-bit (xmm or m128) argument - the bottom 64 bits are used as the shift-amount. Packed shift instructions that can take a variable per-lane shift-amount were introduced in AVX2 for 32/64-bit lanes and AVX512BW for 16-bit lanes (the VPSLLV*, VPSRLV* and VPSRAV* instructions) - see the sketch after this list.
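A minimal C intrinsics sketch of the two shift flavors described above, assuming SSE2 and AVX2 support (function names are illustrative):

```c
#include <immintrin.h>

/* Uniform shift (SSE2): the count operand is a 128-bit vector whose bottom
   64 bits hold one unmasked scalar shift amount - every 32-bit lane of v is
   shifted by that same amount, and amounts >= 32 zero the lane. */
__m128i shift_all_lanes(__m128i v, __m128i count) {
    return _mm_sll_epi32(v, count);       /* PSLLD xmm, xmm */
}

/* Per-lane shift (AVX2): VPSLLVD takes an independent count for each lane. */
__m256i shift_per_lane(__m256i v, __m256i counts) {
    return _mm256_sllv_epi32(v, counts);  /* VPSLLVD ymm, ymm, ymm */
}
```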
MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof
- For the VPSHUFD, VPSHUFB, VPHADD*, VPHSUB* and VPALIGNR instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
- For AVX2 and AVX-512 with vectors wider than 128 bits, the VPSHUFB instruction is restricted to byte-shuffles within each 128-bit lane. Instructions that can shuffle across 128-bit lanes include e.g. AVX2's VPERMD (shuffle of 32-bit lanes across a 256-bit YMM register) and AVX512_VBMI's VPERMB (full byte shuffle across a 64-byte ZMM register); the sketch below illustrates the difference.
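A minimal C intrinsics sketch of the in-lane/cross-lane distinction, assuming AVX2 support (function names are illustrative):

```c
#include <immintrin.h>

/* VPSHUFB at 256 bits shuffles bytes only within each 128-bit half:
   index bytes in the low half select from the low 16 source bytes,
   index bytes in the high half from the high 16 source bytes. */
__m256i shuffle_in_lane(__m256i src, __m256i idx) {
    return _mm256_shuffle_epi8(src, idx);          /* VPSHUFB ymm */
}

/* VPERMD moves whole 32-bit elements across the full 256-bit register. */
__m256i shuffle_cross_lane(__m256i src, __m256i idx) {
    return _mm256_permutevar8x32_epi32(src, idx);  /* VPERMD ymm */
}
```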
SSE instructions and extended variants thereof
Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof
For the instructions in the below table, the following considerations apply unless otherwise noted:
- Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
- FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
- The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
- For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)
From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical, but some processors with SSE2 implement integer, FP32 and FP64 execution units as three different execution clusters, where forwarding of results from one cluster to another may come with performance penalties - such penalties can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instructions under SSE2 - PXOR, XORPS, and XORPD - intended for use on integer, FP32, and FP64 data, respectively; see the sketch below.)
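A sketch of the three forms in C intrinsics (SSE2; function names are illustrative) - all three compute the same bitwise result and differ only in the execution domain they hint at:

```c
#include <immintrin.h>

__m128i xor_int(__m128i a, __m128i b) { return _mm_xor_si128(a, b); } /* PXOR  */
__m128  xor_f32(__m128 a, __m128 b)   { return _mm_xor_ps(a, b); }    /* XORPS */
__m128d xor_f64(__m128d a, __m128d b) { return _mm_xor_pd(a, b); }    /* XORPD */
```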
- The VEX-prefix-encoded variants of the scalar instructions listed in this table should be encoded with VEX.L=0. Setting VEX.L=1 for any of these instructions is allowed but will result in what the Intel SDM describes as "unpredictable behavior across different processor generations". This also applies to the VEX-encoded variants of V(U)COMISS and V(U)COMISD. (This behavior does not apply to scalar instructions outside this table, such as VMOVD/VMOVQ, where VEX.L=1 results in an #UD exception.)
- The SSE2 MOVSD (MOVe Scalar Double-precision) and CMPSD (CoMPare Scalar Double-precision) instructions have the same names as the older i386 MOVSD (MOVe String Doubleword) and CMPSD (CoMPare String Doubleword) instructions, but their operations are completely unrelated. At the assembly language level, they can be distinguished by their use of XMM register operands.
- For the VUNPCK*, VSHUFPS and VSHUFPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction (except that for VSHUFPD, each 128-bit lane will use a different 2-bit part of the instruction's imm8 argument).
- The CVTPI2PS and CVTPI2PD instructions take their input data as a vector of two 32-bit signed integers from either memory or an MMX register. They will cause an x87→MMX transition even if the source operand is a memory operand. For vector int→FP conversions that can accept an xmm/ymm/zmm register or vectors wider than 64 bits as input arguments, SSE2 provides the following irregularly-assigned instructions (see table below): CVTDQ2PS (0F 5B /r) and CVTDQ2PD (F3 0F E6 /r).
- The CVT(T)PS2PI and CVT(T)PD2PI instructions write their results to an MMX register as a vector of two 32-bit signed integers. For vector FP→int conversions that can write results to xmm/ymm/zmm registers, SSE2 provides the following irregularly-assigned instructions (see table below): CVTPS2DQ (66 0F 5B /r), CVTTPS2DQ (F3 0F 5B /r), CVTPD2DQ (F2 0F E6 /r) and CVTTPD2DQ (66 0F E6 /r).
- This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4E/4F /r - for its new VRSQRT14* reciprocal square root approximation instructions. The main difference between the AVX-512 VRSQRT14* instructions and the older SSE/AVX (V)RSQRT* instructions is that the AVX-512 VRSQRT14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[13]
- This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4C/4D /r - for its new VRCP14* reciprocal approximation instructions. The main difference between the AVX-512 VRCP14* instructions and the older SSE/AVX (V)RCP* instructions is that the AVX-512 VRCP14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[13]
- XORPS/VXORPS with both source operands being the same register is commonly used as a register-zeroing idiom, and is recognized by most x86 CPUs as an instruction that does not depend on its source arguments. Under AVX or AVX-512, it is recommended to use a 128-bit form of VXORPS for this purpose - this will, on some CPUs, result in fewer micro-ops than wider forms while still achieving register-zeroing of the whole 256- or 512-bit vector register.
- For the floating-point minimum-value and maximum-value instructions (V)MIN* and (V)MAX*, if the two input operands are both zero or at least one of the input operands is NaN, then the second input operand is returned. This matches the behavior of common C programming-language expressions such as ((op1)>(op2)?(op1):(op2)) for maximum-value and ((op1)<(op2)?(op1):(op2)) for minimum-value; a sketch follows this list.
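A scalar sketch of this MIN/MAX rule in C intrinsics (SSE; the function name is illustrative):

```c
#include <immintrin.h>

/* MAXSS returns its SECOND operand when both inputs are zero or when either
   input is NaN - the same result as (a > b) ? a : b, which also yields b in
   those cases. */
float max_like_maxss(float a, float b) {
    return _mm_cvtss_f32(_mm_max_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
```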
Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof
These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:
- The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 encoding) - under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
- The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.
- For the (V)PUNPCK*, (V)PACKUSDW, (V)PBLENDW, (V)PSLLDQ and (V)PSRLDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
- The load performed by (V)MOVNTDQA is weakly-ordered. It may be reordered with respect to other loads, stores and even LOCKs - to impose ordering with respect to other loads/stores, MFENCE or serialization is needed (see the sketch after this list). If (V)MOVNTDQA is used with uncached memory, it may fetch a cache-line-sized block of data around the data actually requested - subsequent (V)MOVNTDQA instructions may return data from blocks fetched in this manner as long as they are not separated by an MFENCE or serialization.
- The VPEXTRD and VPINSRD instructions in non-64-bit mode are documented as being permitted to be encoded with VEX.W=1 on Intel[14] but not AMD[15] CPUs (although exceptions to this exist, e.g. Bulldozer permits such encodings[16] while Sandy Bridge does not[17]). In 64-bit mode, these instructions require VEX.W=0 on both Intel and AMD processors - encodings with VEX.W=1 are interpreted as VPEXTRQ/VPINSRQ.
Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof
SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.
- For the VPSHUFLW, VPSHUFHW, VHADDP*, VHSUBP*, VDPPS and VDPPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction. (The sketch below illustrates this for horizontal addition.)
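A C intrinsics sketch (AVX; the function name is illustrative) of the lane-split behavior for VHADDPS:

```c
#include <immintrin.h>

/* VHADDPS at 256 bits is two independent 128-bit horizontal adds:
   low half  = {a0+a1, a2+a3, b0+b1, b2+b3}
   high half = {a4+a5, a6+a7, b4+b5, b6+b7}
   - no sums cross the 128-bit boundary. */
__m256 hadd256(__m256 a, __m256 b) {
    return _mm256_hadd_ps(a, b);  /* VHADDPS ymm, ymm, ymm */
}
```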
AVX
AVX was first supported by Intel with Sandy Bridge and by AMD with Bulldozer. It adds vector operations on 256-bit registers.
F16C
Half-precision floating-point conversion.
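F16C provides the VCVTPH2PS and VCVTPS2PH conversion instructions. A minimal round-trip sketch in C intrinsics (compile with e.g. -mf16c; the function name is illustrative):

```c
#include <immintrin.h>

/* Four FP32 values packed to FP16 halves and converted back. */
__m128 roundtrip_f16(__m128 x) {
    __m128i half = _mm_cvtps_ph(x, _MM_FROUND_TO_NEAREST_INT); /* VCVTPS2PH */
    return _mm_cvtph_ps(half);                                 /* VCVTPH2PS */
}
```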
AVX2
Introduced in Intel's Haswell microarchitecture and AMD's Excavator.
Expansion of most vector integer SSE and AVX instructions to 256 bits.
FMA3 and FMA4 instructions
Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write the result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four input operands - a destination operand and three source operands.
FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and was dropped from AMD Zen onwards. The FMA3/FMA4 extensions are not considered an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes, of the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of x and y outside the given ranges will result in something that is not an FMA3 instruction.)
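The nibble split can be expressed directly in code. The sketch below (plain C; names are illustrative and not from any official table) classifies an FMA3 opcode byte per the rules above:

```c
#include <stdint.h>
#include <stdio.h>

/* Top nibble selects the operand ordering. */
static const char *fma3_ordering(uint8_t opcode) {
    switch (opcode >> 4) {
        case 0x9: return "132";
        case 0xA: return "213";
        case 0xB: return "231";
        default:  return NULL;  /* top nibble outside 9/A/B: not FMA3 */
    }
}

/* Bottom nibble must be in 6..F to select one of the 10 operations. */
static int fma3_is_fma(uint8_t opcode) {
    return fma3_ordering(opcode) != NULL && (opcode & 0x0F) >= 0x6;
}

int main(void) {
    uint8_t op = 0x99;  /* VFMADD132SS/SD (scalar fused multiply-add) */
    if (fma3_is_fma(op))
        printf("opcode %02X: ordering %s\n", op, fma3_ordering(op));
    return 0;
}
```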
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:
- vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
- vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
- vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1
For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or a memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcasts, opmasks and rounding-controls.
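At the C intrinsics level the 132/213/231 orderings are not exposed - the compiler picks whichever form best fits register allocation. A minimal sketch (FMA3; compile with e.g. -mfma; the function name is illustrative):

```c
#include <immintrin.h>

/* Low lane: a*b + c, computed with a single rounding step. */
__m128d fused_madd(__m128d a, __m128d b, __m128d c) {
    return _mm_fmadd_sd(a, b, c);
}
```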
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions - these all take the form EVEX.66.MAP6.W0 xy /r, with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,[20] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions - these all take the form EVEX.NP.MAP6.W0 xy /r, with the opcode byte again working in the same way as for the FP32/FP64 variants.
(For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, of the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.
For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:
- vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and requires a W=0 encoding.
- vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and requires a W=1 encoding.
- vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.
Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:
- Vector register lanes are counted from 0 upwards in a little-endian manner – the lane that contains the first byte of the vector is considered to be even-numbered.
AVX-512
AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.[21] Most of the added instructions may also be used with the 256- and 128-bit registers.
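A minimal C intrinsics sketch of merge-masking under AVX512F (compile with e.g. -mavx512f; the function name is illustrative):

```c
#include <immintrin.h>

/* VPADDD zmm{k}: lanes whose mask bit is 0 keep the value from 'src'
   (merge-masking) instead of receiving the sum of a and b. */
__m512i masked_add(__m512i src, __mmask16 k, __m512i a, __m512i b) {
    return _mm512_mask_add_epi32(src, k, a, b);
}
```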
AMX
Intel AMX adds eight new tile registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile registers, and a set of instructions to perform matrix multiplications on these registers.
- For TILEZERO, the tile register to clear is specified by bits 5:3 of the instruction's ModR/M byte. Bits 7:6 must be set to 11b, and bits 2:0 must be set to 000b.
- For the TILELOADD, TILELOADDT1 and TILESTORED instructions, the memory argument must use a memory addressing mode with the SIB byte. Under this addressing mode, the base register and displacement are used to specify the starting address of the first row of the tile to load/store from/to memory - the scale and index are used to specify a per-row stride. These instructions are all interruptible - an interrupt or memory exception taken in the middle of these instructions will cause progress-tracking information to be written to TILECFG.start_row, so that the instruction may continue on a partially-loaded/stored tile after the interruption. (A usage sketch follows this list.)
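A hedged end-to-end sketch in C intrinsics (function names and layout handling are illustrative; requires -mamx-tile -mamx-int8, an AMX-capable CPU, and on Linux a prior arch_prctl(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) request):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* 64-byte TILECFG image as described in the Intel SDM: byte 0 = palette,
   byte 1 = start_row, bytes-per-row at offset 16, row counts at offset 48. */
struct tilecfg {
    uint8_t  palette, start_row, reserved[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
};

/* Multiply-accumulate of one int8 tile pair into a 16x16 int32 tile.
   Note: TDPBSSD expects the second source pre-interleaved in groups of four
   int8 values per dword (the "VNNI" layout) - glossed over here. */
void tile_madd(int32_t c[16][16], const int8_t a[16][64], const int8_t b[16][64]) {
    struct tilecfg cfg;
    memset(&cfg, 0, sizeof cfg);
    cfg.palette = 1;                      /* palette 1: 16 rows x 64 bytes max */
    for (int t = 0; t < 3; t++) { cfg.rows[t] = 16; cfg.colsb[t] = 64; }

    _tile_loadconfig(&cfg);               /* LDTILECFG */
    _tile_zero(0);                        /* TILEZERO tmm0 (accumulator)  */
    _tile_loadd(1, a, 64);                /* TILELOADD tmm1, stride 64    */
    _tile_loadd(2, b, 64);                /* TILELOADD tmm2, stride 64    */
    _tile_dpbssd(0, 1, 2);                /* TDPBSSD  tmm0 += tmm1 . tmm2 */
    _tile_stored(0, c, 64);               /* TILESTORED                   */
    _tile_release();                      /* clear tile state             */
}
```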