From Wikipedia, the free encyclopedia
The instructions listed here have at some point been present as documented instructions in one or more x86 processors, but the processor series containing them have since been discontinued or superseded, with no known plans to reintroduce the instructions.
The following instructions were introduced in the Intel 80386, but later discontinued:
Instruction | Opcode | Description | Eventual fate |
---|---|---|---|
XBTS r, r/m | 0F A6 /r | Extract Bit String | Discontinued from revision B1 of the 80386 onwards. Opcodes briefly reused for CMPXCHG on early (A-stepping) Intel 486 CPUs. Opcodes later reused for VIA PadLock. |
IBTS r/m, r | 0F A7 /r | Insert Bit String | |
MOV r32,TRx | 0F 24 /r | Move from test register | Present in Intel 386 and 486 − not present in Intel Pentium or any later Intel CPUs (except they're present in the i486-derived Quark X1000). Present in all Cyrix CPUs. |
MOV TRx,r32 | 0F 26 /r | Move to test register |
These instructions are only present in the x86 operation mode of early Intel Itanium processors with hardware support for x86. This support was added in "Merced" and removed in "Montecito", replaced with software emulation.
Instruction | Opcode | Description |
---|---|---|
JMPE r/m16, JMPE r/m32 | 0F 00 /6 | Jump To Intel Itanium Instruction Set.[1] |
JMPE disp16/32 | 0F B8 rel16/32 | |
These instructions were introduced in 6th generation Intel Core "Skylake" CPUs. The last CPU generation to support them was the 9th generation Core "Coffee Lake" CPUs.
Intel MPX adds 4 new registers, BND0 to BND3, that each contains a pair of addresses. MPX also defines a bounds-table as a 2-level directory/table data structure in memory that contains sets of upper/lower bounds.
Instruction | Opcode[a] | Description |
---|---|---|
BNDMK b, m |
F3 0F 1B /r [b] |
Make lower and upper bound from memory address expression.
The lower bound is given by the base component of the address, the upper bound by the 1's complement of the address as a whole. |
BNDCL b, r/m |
F3 0F 1A /r |
Check address against lower bound.
|
BNDCU b, r/m |
F2 0F 1A /r |
Check address against upper bound in 1's-complement form |
BNDCN b, r/m |
F2 0F 1B /r |
Check address against upper bound. |
BNDMOV b, b/m |
66 0F 1A /r |
Move a pair of memory bounds to/from memory or between bounds-registers. |
BNDMOV b/m, b |
66 0F 1B /r | |
BNDLDX b,mib |
NP 0F 1A /r [c] |
Load bounds from the bounds-table, using address translation using an sib-addressing expression mib.[d] |
BNDSTX mib,b |
NP 0F 1B /r [c] |
Store bounds into the bounds-table, using address translation using an sib-addressing expression mib.[d] |
BND |
F2 |
Instruction prefix used with certain branch instructions[e] to indicate that they should not clear the bounds registers. |
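The bound-making and bound-checking rows above can be illustrated with a small C model. This is only a sketch under stated assumptions: the type `bnd_reg` and the function names are hypothetical, the address expression is reduced to a base plus a single offset, and a failed check is reported as a boolean, whereas the real instructions signal a bound-range exception.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one bounds register. The upper bound is kept
   in 1's-complement (bitwise-inverted) form, as BNDMK produces it. */
typedef struct {
    uint64_t lower;
    uint64_t upper_inv;
} bnd_reg;

/* BNDMK model: lower bound = base component of the address,
   upper bound = 1's complement of the address as a whole. */
bnd_reg bndmk(uint64_t base, uint64_t offset)
{
    bnd_reg b = { base, ~(base + offset) };
    return b;
}

/* BNDCL model: true if the address lies below the lower bound. */
bool bndcl_violates(bnd_reg b, uint64_t addr)
{
    return addr < b.lower;
}

/* BNDCU model: true if the address lies above the upper bound.
   The stored bound is inverted back before the comparison. */
bool bndcu_violates(bnd_reg b, uint64_t addr)
{
    return addr > ~b.upper_inv;
}
```

With `bndmk(0x1000, 0xFF)`, addresses from 0x1000 to 0x10FF pass both checks in this model, while 0x0FFF fails the lower check and 0x1100 fails the upper check.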
[a] For the MPX instructions, the 67h address-size prefix is mandatory in 16-bit mode and prohibited in 32-bit mode. In 64-bit mode, the 67h prefix is ignored for the MPX instructions − address size is always 64-bit. These behaviors are unique to the MPX instructions.
[e] The branch instructions that accept the BND prefix are the near forms of JMP (opcodes E9 and FF /4), CALL (opcodes E8 and FF /2), RET (opcodes C2 and C3), and the short/near forms of the Jcc instructions (opcodes 70..7F and 0F 80..8F). If the BNDPRESERVE config bit is not set, then executing any of these branch instructions without the BND prefix will clear all four bounds registers. (Other branch instructions − such as far jumps, short jumps (EB), LOOP, IRET etc. − do not clear the bounds registers regardless of whether an F2h prefix is present or not.)
The Hardware Lock Elision feature of Intel TSX is marked in the Intel SDM as removed from 2019 onwards.[2] This feature took the form of two instruction prefixes, XACQUIRE and XRELEASE, that could be attached to memory atomics/stores to elide the memory locking that they represent.
Instruction prefix | Opcode | Description |
---|---|---|
XACQUIRE |
F2 |
Instruction prefix to indicate start of hardware lock elision, used with memory atomic instructions only (for other instructions, the F2 prefix may have other meanings). When used with such instructions, may start a transaction instead of performing the memory atomic operation. |
XRELEASE |
F3 |
Instruction prefix to indicate end of hardware lock elision, used with memory atomic/store instructions only (for other instructions, the F3 prefix may have other meanings). When used with such instructions during hardware lock elision, will end the associated transaction instead of performing the store/atomic. |
The VP2INTERSECT instructions (an AVX-512 subset) were introduced in Tiger Lake (11th generation mobile Core processors), but were never officially supported on any other Intel processors - they are now considered deprecated[3] and are listed in the Intel SDM as removed from 2023 onwards.[2]
As of July 2024, the VP2INTERSECT instructions have been re-introduced on AMD Zen 5 processors.[4]
Instruction | Opcode | Description |
---|---|---|
VP2INTERSECTD k1+1, xmm2, xmm3/m128/m32bcst VP2INTERSECTD k1+1, ymm2, ymm3/m256/m32bcst VP2INTERSECTD k1+1, zmm2, zmm3/m512/m32bcst |
EVEX.NDS.F2.0F38.W0 68 /r |
Store, in an even/odd pair of mask registers, the indicators of the locations of value matches between 32-bit lanes in the two vector source arguments. |
VP2INTERSECTQ k1+1, xmm2, xmm3/m128/m64bcst VP2INTERSECTQ k1+1, ymm2, ymm3/m256/m64bcst VP2INTERSECTQ k1+1, zmm2, zmm3/m512/m64bcst |
EVEX.NDS.F2.0F38.W1 68 /r |
Store, in an even/odd pair of mask registers, the indicators of the locations of value matches between 64-bit lanes in the two vector source arguments. |
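As an illustration of the match-indicator semantics described in the table, the following is a minimal scalar sketch of the 512-bit VP2INTERSECTD form in C. The function name and parameter names are hypothetical, and the sketch assumes that the even mask register describes lanes of the first source and the odd mask register describes lanes of the second source.

```c
#include <stdint.h>

/* Scalar model of VP2INTERSECTD on 16 lanes of 32 bits each.
   Bit i of *k_even is set if a[i] equals any lane of b;
   bit j of *k_odd is set if b[j] equals any lane of a.
   (Assumed mapping of masks to sources; names are illustrative.) */
void vp2intersectd_model(const uint32_t a[16], const uint32_t b[16],
                         uint16_t *k_even, uint16_t *k_odd)
{
    uint16_t ka = 0, kb = 0;
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            if (a[i] == b[j]) {
                ka |= (uint16_t)(1u << i);
                kb |= (uint16_t)(1u << j);
            }
        }
    }
    *k_even = ka;
    *k_odd = kb;
}
```

The all-pairs comparison is what makes the instruction useful for set intersection: a single instruction performs 16 x 16 comparisons that would otherwise need a loop like the one above.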
The first generation Xeon Phi processors, codenamed "Knights Corner" (KNC), supported a large number of instructions that are not seen in any later x86 processor. An instruction reference is available[5] − the instructions/opcodes unique to KNC are the ones with VEX and MVEX prefixes (except for the KMOV, KNOT and KORTEST instructions − these are kept with the same opcodes and function in AVX-512, but with an added "W" appended to their instruction names).
Most of these KNC-unique instructions are similar but not identical to instructions in AVX-512 − later Xeon Phi processors replaced these instructions with AVX-512.
Early versions of AVX-512 avoided the instruction encodings used by KNC's MVEX prefix; however, with the introduction of Intel APX (Advanced Performance Extensions) in 2023, some of the old KNC MVEX instruction encodings have been reused for new APX encodings. For example, both KNC and APX accept the instruction encoding 62 F1 79 48 6F 04 C1 as valid, but assign different meanings to it:
KNC: VMOVDQA32 zmm0, k0, xmmword ptr [rcx+rax*8]{uint8} (vector load with data conversion)
APX: VMOVDQA32 zmm0, [rcx+r16*8] (vector load with one of the new APX extended-GPRs used as scaled index)
Some of the AVX-512 instructions in the Xeon Phi "Knights Landing" and later models belong to the AVX-512 subsets "AVX512ER", "AVX512_4FMAPS", "AVX512PF" and "AVX512_4VNNIW", all of which are unique to the Xeon Phi series of processors. The ER and PF subsets were introduced in "Knights Landing" − the 4FMAPS and 4VNNIW instructions were later added in "Knights Mill".
The ER and 4FMAPS instructions are floating-point arithmetic instructions that all follow a given pattern where:
Operation | AVX-512 subset | Basic opcode | FP32 packed (W=0) | FP32 scalar (W=0) | FP64 packed (W=1) | FP64 scalar (W=1) | RC/SAE |
---|---|---|---|---|---|---|---|
Xeon Phi specific instructions (ER, 4FMAPS) | | | | | | | |
Reciprocal approximation with an accuracy of [a] | ER | EVEX.66.0F38 (CA/CB) /r | VRCP28PS z,z,z/m512 | VRCP28SS x,x,x/m32 | VRCP28PD z,z,z/m512 | VRCP28SD x,x,x/m64 | SAE |
Reciprocal square root approximation with an accuracy of [a] | ER | EVEX.66.0F38 (CC/CD) /r | VRSQRT28PS z,z,z/m512 | VRSQRT28SS x,x,x/m32 | VRSQRT28PD z,z,z/m512 | VRSQRT28SD x,x,x/m64 | SAE |
Exponential approximation with relative error[a] | ER | EVEX.66.0F38 C8 /r | VEXP2PS z,z/m512 | No | VEXP2PD z,z/m512 | No | SAE |
Fused-multiply-add, 4 iterations | 4FMAPS | EVEX.F2.0F38 (9A/9B) /r | V4FMADDPS z,z+3,m128 | V4FMADDSS x,x+3,m128 | No | No | |
Fused negate-multiply-add, 4 iterations | 4FMAPS | EVEX.F2.0F38 (AA/AB) /r | V4FNMADDPS z,z+3,m128 | V4FNMADDSS x,x+3,m128 | No | No | |
The AVX512PF instructions are a set of 16 prefetch instructions. These instructions all use VSIB encoding, where a memory addressing mode using the SIB byte is required, and where the index part of the SIB byte is taken to index into the AVX512 vector register file rather than the GPR register file. The selected AVX512 vector register is then interpreted as a vector of indexes, causing the standard x86 base+index+displacement address calculation to be performed for each vector lane, causing one associated memory operation (prefetches in case of the AVX512PF instructions) to be performed for each active lane. The instruction encodings all follow a pattern where:
Operation | Basic opcode | 32-bit indexes (opcode C6), FP32 prefetch (W=0) | 32-bit indexes (opcode C6), FP64 prefetch (W=1) | 64-bit indexes (opcode C7), FP32 prefetch (W=0) | 64-bit indexes (opcode C7), FP64 prefetch (W=1) |
---|---|---|---|---|---|
Prefetch into L1 cache (T0 hint) | EVEX.66.0F38 (C6/C7) /1 /vsib | VGATHERPF0DPS vm32z {k1} | VGATHERPF0DPD vm32y {k1} | VGATHERPF0QPS vm64z {k1} | VGATHERPF0QPD vm64y {k1} |
Prefetch into L2 cache (T1 hint) | EVEX.66.0F38 (C6/C7) /2 /vsib | VGATHERPF1DPS vm32z {k1} | VGATHERPF1DPD vm32y {k1} | VGATHERPF1QPS vm64z {k1} | VGATHERPF1QPD vm64y {k1} |
Prefetch into L1 cache (T0 hint) with intent to write | EVEX.66.0F38 (C6/C7) /5 /vsib | VSCATTERPF0DPS vm32z {k1} | VSCATTERPF0DPD vm32y {k1} | VSCATTERPF0QPS vm64z {k1} | VSCATTERPF0QPD vm64y {k1} |
Prefetch into L2 cache (T1 hint) with intent to write | EVEX.66.0F38 (C6/C7) /6 /vsib | VSCATTERPF1DPS vm32z {k1} | VSCATTERPF1DPD vm32y {k1} | VSCATTERPF1QPS vm64z {k1} | VSCATTERPF1QPD vm64y {k1} |
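The per-lane address calculation that VSIB addressing implies can be sketched in C as follows. This is an illustrative model only: the function name, the parameter names and the representation of the mask register as a plain bitmask are assumptions, and the sketch assumes that 32-bit indexes are sign-extended before scaling.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes the addresses touched by a VSIB memory operand with
   32-bit indexes: one base+index*scale+disp calculation per active lane.
   'mask' models the k1 mask register (bit i set = lane i active).
   Supports up to 32 lanes. Returns the number of addresses written. */
size_t vsib_addresses(uint64_t base, const int32_t *index, size_t lanes,
                      unsigned scale, int32_t disp, uint32_t mask,
                      uint64_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < lanes && i < 32; i++) {
        if (((mask >> i) & 1u) == 0)
            continue;  /* inactive lane: no memory operation */
        int64_t scaled = (int64_t)index[i] * (int64_t)scale;
        out[n++] = base + (uint64_t)scaled + (uint64_t)(int64_t)disp;
    }
    return n;
}
```

For the AVX512PF instructions, each address produced this way would be the target of one prefetch; for gather or scatter instructions the same calculation would feed a load or store instead.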
The AVX512_4VNNIW instructions read a 128-bit data item from memory, containing 4 two-component vectors (each component being signed 16-bit). Then, for each of 4 consecutive AVX-512 registers, they will, for each 32-bit lane, interpret the lane as a two-component vector (signed 16-bit) and perform a dot-product with the corresponding two-component vector that was read from memory (the first two-component vector from memory is used for the first AVX-512 source register, and so on). These results are then accumulated into a destination vector register.
Instruction | Opcode | Description |
---|---|---|
VP4DPWSSD zmm1{k1}{z}, zmm2+3, m128 |
EVEX.512.F2.0F38.W0 52 /r |
Dot-product of signed words with dword accumulation, 4 iterations |
VP4DPWSSDS zmm1{k1}{z}, zmm2+3, m128 |
EVEX.512.F2.0F38.W0 53 /r |
Dot-product of signed words with dword accumulation and saturation, 4 iterations |
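The four-iteration dot-product described above can be modelled in scalar C as a minimal sketch. The function and parameter names are hypothetical, masking and the saturating variant are left out, and accumulation is assumed to wrap around on overflow, as is usual for the non-saturating form.

```c
#include <stdint.h>

#define VNNI_LANES 16  /* 32-bit lanes in one 512-bit register */

/* Reinterpret the low 16 bits of x as a signed 16-bit value. */
static int16_t as_s16(uint32_t x)
{
    return (int16_t)(uint16_t)(x & 0xFFFFu);
}

/* Model of VP4DPWSSD without masking.
   src[r] models the r-th of the 4 consecutive source registers;
   mem[r] models the r-th two-component vector of the 128-bit memory item. */
void vp4dpwssd_model(int32_t dst[VNNI_LANES],
                     uint32_t src[4][VNNI_LANES],
                     const uint32_t mem[4])
{
    for (int r = 0; r < 4; r++) {
        int16_t m_lo = as_s16(mem[r]);
        int16_t m_hi = as_s16(mem[r] >> 16);
        for (int lane = 0; lane < VNNI_LANES; lane++) {
            int16_t s_lo = as_s16(src[r][lane]);
            int16_t s_hi = as_s16(src[r][lane] >> 16);
            /* 64-bit intermediate avoids signed overflow in C */
            int64_t dot = (int64_t)s_lo * m_lo + (int64_t)s_hi * m_hi;
            dst[lane] = (int32_t)((uint32_t)dst[lane] + (uint32_t)dot);
        }
    }
}
```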
Xeon Phi processors (from Knights Landing onwards) also featured the PREFETCHWT1 m8 instruction (opcode 0F 0D /2, prefetch into L2 cache with intent to write) − these were the only Intel CPUs to officially support this instruction, but it continues to be supported on some non-Intel processors (e.g. Zhaoxin YongFeng).
A handful of instructions to support System Management Mode were introduced in the Am386SXLV and Am386DXLV processors.[7][8] They were also present in the later Am486SXLV/DXLV and Elan SC300/310 processors.[9]
The SMM functionality of these processors was implemented using Intel ICE microcode without a valid license, resulting in a lawsuit that AMD lost in late 1994.[10] As a result of this loss, the ICE microcode was removed from all later AMD CPUs, and the SMM instructions removed with it.
Instruction | Opcode | Description |
---|---|---|
SMI | F1 | Call SMM interrupt handler (only if DR7 bit 12 is set; not available on Am486SXLV/DXLV[11]) |
UMOV r/m8, r8 | 0F 10 /r | Move data between registers and main system memory |
UMOV r/m, r16/32 | 0F 11 /r | |
UMOV r8, r/m8 | 0F 12 /r | |
UMOV r16/32, r/m | 0F 13 /r | |
RES3 | 0F 07 | Return from SMM interrupt handler (Am386SXLV/DXLV only). Takes a pointer in ES:EDI to a processor save state to resume from − this save state has a format nearly identical to that of the undocumented Intel 386 LOADALL instruction.[12] |
RES4 | 0F 07 | Return from SMM interrupt handler (Am486SXLV/DXLV only). Similar to RES3 , but with a different save state format.[13] |
These SMM instructions were also present on the IBM 386SLC and its derivatives (albeit with the LOADALL-like SMM return opcode 0F 07 named ICERET),[12][14][11] as well as on the UMC U5S processor.[15]
The 3DNow! instruction set extension was introduced in the AMD K6-2, mainly adding support for floating-point SIMD instructions using the MMX registers (two FP32 components in a 64-bit vector register). The instructions were mainly promoted by AMD, but were supported on some non-AMD CPUs as well. The processors supporting 3DNow! were:
Instruction | Opcode | Instruction description | |
---|---|---|---|
PFADD mm1,mm2/m64 | 0F 0F /r 9E | Packed floating-point addition:dst <- dst + src | |
PFSUB mm1,mm2/m64 | 0F 0F /r 9A | Packed floating-point subtraction:dst <- dst − src | |
PFSUBR mm1,mm2/m64 | 0F 0F /r AA | Packed floating-point reverse subtraction:dst <- src − dst | |
PFMUL mm1,mm2/m64 | 0F 0F /r B4 | Packed floating-point multiplication:dst <- dst * src | |
PFMAX mm1,mm2/m64 | 0F 0F /r A4 | Packed floating-point maximum:dst <- (dst > src) ? dst : src | |
PFMIN mm1,mm2/m64 | 0F 0F /r 94 | Packed floating-point minimum:dst <- (dst < src) ? dst : src | |
PFCMPEQ mm1,mm2/m64 | 0F 0F /r B0 | Packed floating-point comparison, equal:dst <- (dst == src) ? 0xFFFFFFFF : 0 | |
PFCMPGE mm1,mm2/m64 | 0F 0F /r 90 | Packed floating-point comparison, greater than or equal:dst <- (dst >= src) ? 0xFFFFFFFF : 0 | |
PFCMPGT mm1,mm2/m64 | 0F 0F /r A0 | Packed floating-point comparison, greater than:dst <- (dst > src) ? 0xFFFFFFFF : 0 | |
PF2ID mm1,mm2/m64 | 0F 0F /r 1D | Converts packed floating-point operand to packed 32-bit signed integer, with round-to-zero | |
PI2FD mm1,mm2/m64 | 0F 0F /r 0D | Packed 32-bit signed integer to floating-point conversion, with round-to-zero | |
PFRCP mm1,mm2/m64 | 0F 0F /r 96 | Floating-point reciprocal approximation (at least 14 bit precision):temp <- approx(1.0/src[31:0]) |
PFRSQRT mm1,mm2/m64 | 0F 0F /r 97 | Floating-point reciprocal square root approximation (at least 15 bit precision):temp <- approx(1.0/sqrt(src[31:0])) | |
PFRCPIT1 mm1,mm2/m64 | 0F 0F /r A6 | Packed floating-point reciprocal, first iteration step | |
PFRSQIT1 mm1,mm2/m64 | 0F 0F /r A7 | Packed floating-point reciprocal square root, first iteration step | |
PFRCPIT2 mm1,mm2/m64 | 0F 0F /r B6 | Packed floating-point reciprocal/reciprocal square root, second iteration step | |
PFACC mm1,mm2/m64 | 0F 0F /r AE | Floating-point accumulate (horizontal add):dst[31:0] <- dst[31:0] + dst[63:32] | |
PMULHRW mm1,mm2/m64,[b] PMULHRWA mm1,mm2/m64 | 0F 0F /r B7 | Multiply signed packed 16-bit integers with rounding and store the high 16 bits:dst <- ((dst * src) + 0x8000) >> 16 | |
PAVGUSB mm1,mm2/m64 | 0F 0F /r BF | Average of unsigned packed 8-bit integers:dst <- (src+dst+1) >> 1 | |
FEMMS | 0F 0E | Faster Enter/Exit of the MMX or x87 floating-point state[c] |
The 3DNow! specification[16] does not directly specify the operation performed by the PFRCPIT1, PFRSQIT1 and PFRCPIT2 instructions − instead, it imposes requirements on the results of using these instructions together in specific ways:[a]
If the bottom 32 bits of mm0 contain an FP32 value, then:
the sequence PFRCP mm1,mm0; PFRCPIT1 mm0,mm1; PFRCPIT2 mm0,mm1 must fill both 32-bit lanes of mm0 with the reciprocal of that value;
the sequence PFRSQRT mm1,mm0; MOVQ mm2,mm1; PFMUL mm1,mm1; PFRSQIT1 mm1,mm0; PFRCPIT2 mm1,mm2 must fill both 32-bit lanes of mm1 with the reciprocal square root of that value.
[a] On some processors, the PFRCPIT1, PFRSQIT1 and PFRCPIT2 instructions would perform various parts of a Newton-Raphson iteration to improve the precision of a low-precision initial result from PFRCP/PFRSQRT.[17] On other processors, the PFRCP and PFRSQRT instructions would instead compute their results with full 24-bit precision − this made it possible to turn the PFRCPIT1, PFRSQIT1 and PFRCPIT2 instructions into pure data movement instructions, performing the same operation as MOVQ.[18]
[b] The 3DNow! PMULHRW instruction has the same mnemonic as the Cyrix EMMI PMULHRW instruction; however, its opcode and function differ (the EMMI instruction right-shifts its multiply-result by 15 bits, while the 3DNow! instruction right-shifts by 16 bits). Some assemblers/disassemblers, such as NASM, resolve this ambiguity by using the mnemonic PMULHRWA for the 3DNow! instruction and PMULHRWC for the EMMI instruction.
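The per-lane formulas given in the table for the two 3DNow! integer instructions, PMULHRW and PAVGUSB, can be written out in C as a minimal sketch. The function names are hypothetical, and each function models a single lane rather than the whole 64-bit register.

```c
#include <stdint.h>

/* One 16-bit lane of the 3DNow! PMULHRW:
   dst <- ((dst * src) + 0x8000) >> 16, keeping the high 16 bits.
   The shift is done on an unsigned value so the result does not depend
   on how the compiler right-shifts negative numbers. */
int16_t pmulhrw_3dnow_lane(int16_t dst, int16_t src)
{
    int32_t rounded = (int32_t)dst * (int32_t)src + 0x8000;
    return (int16_t)(uint16_t)((uint32_t)rounded >> 16);
}

/* One 8-bit lane of PAVGUSB: dst <- (src + dst + 1) >> 1.
   The sum is computed in a wider type so the carry is not lost. */
uint8_t pavgusb_lane(uint8_t dst, uint8_t src)
{
    return (uint8_t)(((unsigned)src + (unsigned)dst + 1u) >> 1);
}
```

For instance, `pmulhrw_3dnow_lane(0x4000, 0x4000)` yields 0x1000, and `pavgusb_lane(1, 2)` rounds the average 1.5 up to 2.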
3DNow! also introduced a couple of prefetch instructions: PREFETCH m8 (opcode 0F 0D /0) and PREFETCHW m8 (opcode 0F 0D /1). These instructions, unlike the rest of 3DNow!, are not discontinued but continue to be supported on modern AMD CPUs. The PREFETCHW instruction is also supported on Intel CPUs starting with 65 nm Pentium 4,[19] albeit executed as NOP until Broadwell.
Instruction | Opcode | Instruction description |
---|---|---|
PF2IW mm1,mm2/m64 |
0F 0F /r 1C |
Packed 32-bit floating-point to 16-bit signed integer conversion, with round-to-zero[a] |
PI2FW mm1,mm2/m64 |
0F 0F /r 0C |
Packed 16-bit signed integer to 32-bit floating-point conversion[a] |
PSWAPD mm1,mm2/m64 |
0F 0F /r BB [b] |
Packed Swap Doubleword:dst[31:0] <- src[63:32] |
PFNACC mm1,mm2/m64 |
0F 0F /r 8A |
Packed Floating-Point Negative Accumulate:dst[31:0] <- dst[31:0] − dst[63:32] |
PFPNACC mm1,mm2/m64 |
0F 0F /r 8E |
Packed Floating-Point Positive-Negative Accumulate:dst[31:0] <- dst[31:0] − dst[63:32] |
[a] The PF2IW and PI2FW instructions also existed as undocumented instructions on the original K6-2. The undocumented variant of PF2IW in K6-2 would set the top 16 bits of each 32-bit result lane to all-0s, while the documented variant in later processors would sign-extend the 16-bit result to 32 bits.[20][21]
[b] The PSWAPD instruction uses the same opcode as the older undocumented K6-2 PSWAPW instruction.[21]
Instruction | Opcode | Instruction description |
---|---|---|
PFRCPV mm1,mm2/m64 | 0F 0F /r 86 | Packed Floating-point Reciprocal Approximation |
PFRSQRTV mm1,mm2/m64 | 0F 0F /r 87 | Packed Floating-point Reciprocal Square Root Approximation |
SSE5 was a proposed SSE extension by AMD, using a new "DREX" instruction encoding to add support for new 3-operand and 4-operand instructions to SSE.[22] The bundle did not include the full set of Intel's SSE4 instructions, making it a competitor to SSE4 rather than a successor.
AMD chose not to implement SSE5 as originally proposed − it was instead reworked into FMA4 and XOP,[23] which provided similar functionality but with a quite different instruction encoding − using the VEX prefix for the FMA4 instructions and the new VEX-like XOP prefix for most of the remaining instructions.
XOP was introduced with the Bulldozer processor core and removed again from the Zen microarchitecture onward.
It is a revision of most of the SSE5 instruction set.
The XOP instructions mostly make use of the XOP prefix, which is a 3-byte prefix with the following layout:
Byte | Byte 0 | Byte 1 | Byte 1 | Byte 1 | Byte 1 | Byte 2 | Byte 2 | Byte 2 | Byte 2 |
---|---|---|---|---|---|---|---|---|---|
Bits | 7:0 | 7 | 6 | 5 | 4:0 | 7 | 6:3 | 2 | 1:0 |
Usage | 8Fh | R̅ | X̅ | B̅ | mmmmm | W | v̅v̅v̅v̅ | L | pp |
where the mmmmm field selects the opcode map. Some map values overlap with encodings of the POP instruction, making them unusable for XOP; of the remaining ones, only maps 8, 9 and 0Ah were ever used: map 8 for instructions that take an 8-bit immediate, map 9 for instructions that don't take an immediate, and map 0Ah for instructions that take a 32-bit immediate.
The XOP instructions encoded with the XOP prefix are as follows:
Instruction description | Instruction mnemonics | Opcode | W=1 swap allowed |
L=1 (256b) allowed | |
---|---|---|---|---|---|
Extract fractional portion of floating-point value. | Packed FP32 | VFRCZPS ymm1,ymm2/m256 |
XOP.9 80 /r |
No | Yes |
Packed FP64 | VFRCZPD ymm1,ymm2/m256 |
XOP.9 81 /r |
No | Yes | |
Scalar FP32 | VFRCZSS xmm1,xmm2/m32 |
XOP.9 82 /r |
No | No | |
Scalar FP64 | VFRCZSD xmm1,xmm2/m64 |
XOP.9 83 /r |
No | No | |
Vector per-bit-lane conditional move.
|
VPCMOV ymm1,ymm2,ymm3/m256,ymm4 |
XOP.8 A2 /r /is4 |
Yes | Yes | |
Vector integer compare.
For each vector-register lane, compare src1 to src2, then set destination to all-1s if the comparison passes, all-0s if it fails. The imm8 argument specifies comparison function to perform:
|
Signed 8-bit lanes | VPCOMB xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 CC /r ib |
No | No |
Signed 16-bit lanes | VPCOMW xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 CD /r ib | |||
Signed 32-bit lanes | VPCOMD xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 CE /r ib | |||
Signed 64-bit lanes | VPCOMQ xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 CF /r ib | |||
Unsigned 8-bit lanes | VPCOMUB xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 EC /r ib | |||
Unsigned 16-bit lanes | VPCOMUW xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 ED /r ib | |||
Unsigned 32-bit lanes | VPCOMUD xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 EE /r ib | |||
Unsigned 64-bit lanes | VPCOMUQ xmm1,xmm2,xmm3/m128,imm8 [a] |
XOP.8 EF /r ib | |||
Vector Integer Horizontal Add.
For each N-bit lane, split the lane into a series of M-bit lanes, add the M-bit lanes together, then store the result into the destination as an N-bit zero/sign-extended value. |
2x8bit -> 16bit, signed | VPHADDBW xmm1,xmm2/m128 |
XOP.9 C1 /r |
No | No |
4x8bit -> 32bit, signed | VPHADDBD xmm1,xmm2/m128 |
XOP.9 C2 /r | |||
8x8bit -> 64bit, signed | VPHADDBQ xmm1,xmm2/m128 |
XOP.9 C3 /r | |||
2x16bit -> 32bit, signed | VPHADDWD xmm1,xmm2/m128 |
XOP.9 C6 /r | |||
4x16bit -> 64bit, signed | VPHADDWQ xmm1,xmm2/m128 |
XOP.9 C7 /r | |||
2x32bit -> 64bit, signed | VPHADDDQ xmm1,xmm2/m128 |
XOP.9 CB /r | |||
2x8bit -> 16bit, unsigned | VPHADDUBW xmm1,xmm2/m128 |
XOP.9 D1 /r | |||
4x8bit -> 32bit, unsigned | VPHADDUBD xmm1,xmm2/m128 |
XOP.9 D2 /r | |||
8x8bit -> 64bit, unsigned | VPHADDUBQ xmm1,xmm2/m128 |
XOP.9 D3 /r | |||
2x16bit -> 32bit, unsigned | VPHADDUWD xmm1,xmm2/m128 |
XOP.9 D6 /r | |||
4x16bit -> 64bit, unsigned | VPHADDUWQ xmm1,xmm2/m128 |
XOP.9 D7 /r | |||
2x32bit -> 64bit, unsigned | VPHADDUDQ xmm1,xmm2/m128 |
XOP.9 DB /r | |||
Vector Integer Horizontal Subtract.
For each N-bit lane, split the lane into two signed sub-lanes of N/2 bits each, then subtract the upper lane from the lower lane, then store the result as a signed N-bit result. |
2x8bit -> 16bit | VPHSUBBW xmm1,xmm2/m128 |
XOP.9 E1 /r |
No | No |
2x16bit -> 32bit | VPHSUBWD xmm1,xmm2/m128 |
XOP.9 E2 /r | |||
2x32bit -> 64bit | VPHSUBDQ xmm1,xmm2/m128 |
XOP.9 E3 /r | |||
Vector Signed Integer Multiply-Add.
For each N-bit lane, perform For src1 and src2, the factors to multiply may be taken as signed values from the low half of each lane, high half of each lane or the lane in full (picked in the same way for src1 and src2) − the addend and the result use the full lane. |
16-bit, full-lane | VPMACSWW xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 95 /r /is4 |
No | No |
32-bit, low-half | VPMACSWD xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 96 /r /is4 | |||
64-bit, low-half | VPMACSDQL xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 97 /r /is4 | |||
32-bit, full-lane | VPMACSDD xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 9E /r /is4 | |||
64-bit, high-half | VPMACSDQH xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 9F /r /is4 | |||
16-bit, full-lane, saturating | VPMACSSWW xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 85 /r /is4 | |||
32-bit, low-half, saturating | VPMACSSWD xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 86 /r /is4 | |||
64-bit, low-half, saturating | VPMACSSDQL xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 87 /r /is4 | |||
32-bit, full-lane, saturating | VPMACSSDD xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 8E /r /is4 | |||
64-bit, high-half, saturating | VPMACSSDQH xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 8F /r /is4 | |||
Packed multiply, add and accumulate signed word to signed doubleword.
For each 32-bit lane, treat src1 and src2 as 2-component vectors of signed 16-bit values, then compute their dot-product, then add src3 as a 32-bit value. |
with saturation | VPMADCSSWD xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 A6 /r /is4 |
No | No |
without saturation | VPMADCSWD xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 B6 /r /is4 | |||
Packed Permute Bytes.
For
|
VPPERM xmm1,xmm2,xmm3/m128,xmm4 |
XOP.8 A3 /r /is4 |
Yes | No | |
Packed left-rotate.
Rotation amount is given in the last source argument. It may be provided as an immediate or a vector register − in the latter case, the rotation amount is provided on a per-lane basis. |
8-bit lanes | VPROTB xmm1,xmm2/m128,xmm3 |
XOP.9 90 /r |
Yes | No |
VPROTB xmm1,xmm2/m128,imm8 |
XOP.8 C0 /r ib |
No | |||
16-bit lanes | VPROTW xmm1,xmm2/m128,xmm3 |
XOP.9 91 /r |
Yes | ||
VPROTW xmm1,xmm2/m128,imm8 |
XOP.8 C1 /r ib |
No | |||
32-bit lanes | VPROTD xmm1,xmm2/m128,xmm3 |
XOP.9 92 /r |
Yes | ||
VPROTD xmm1,xmm2/m128,imm8 |
XOP.8 C2 /r ib |
No | |||
64-bit lanes | VPROTQ xmm1,xmm2/m128,xmm3 |
XOP.9 93 /r |
Yes | ||
VPROTQ xmm1,xmm2/m128,imm8 |
XOP.8 C3 /r ib |
No | |||
Packed shift, with signed shift-amounts.
Shift-amount is provided on a per-vector-lane basis, and is taken from the bottom 8 bits of each lane of the last source argument. The shift-amount is considered signed − a positive value will cause left-shift, while a negative value causes right-shift. |
8-bit, signed | VPSHAB xmm1,xmm2/m128,xmm3 |
XOP.9 98 /r |
Yes | No |
16-bit, signed | VPSHAW xmm1,xmm2/m128,xmm3 |
XOP.9 99 /r | |||
32-bit, signed | VPSHAD xmm1,xmm2/m128,xmm3 |
XOP.9 9A /r | |||
64-bit, signed | VPSHAQ xmm1,xmm2/m128,xmm3 |
XOP.9 9B /r | |||
8-bit, unsigned | VPSHLB xmm1,xmm2/m128,xmm3 |
XOP.9 94 /r | |||
16-bit, unsigned | VPSHLW xmm1,xmm2/m128,xmm3 |
XOP.9 95 /r | |||
32-bit, unsigned | VPSHLD xmm1,xmm2/m128,xmm3 |
XOP.9 96 /r | |||
64-bit, unsigned | VPSHLQ xmm1,xmm2/m128,xmm3 |
XOP.9 97 /r |
[a] For each VPCOM* instruction, a series of alias mnemonics are available for the instruction, one for each of the eight comparison functions encodable in the imm8 argument. These alias mnemonics specify the comparison to perform after the "VPCOM" part of the mnemonic. For example:
VPCOMEQB xmm1,xmm2,xmm3 is an alias for VPCOMB xmm1,xmm2,xmm3,4
VPCOMFALSEUQ xmm1,xmm2,[ebx] is an alias for VPCOMUQ xmm1,xmm2,[ebx],6
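The signed shift-amount convention described in the table for the packed shift instructions (positive count shifts left, negative count shifts right) can be sketched per 8-bit lane in C. The function names are hypothetical. The handling of counts whose magnitude is 8 or more is an assumption of this sketch (the lane is shifted out completely), not something stated above.

```c
#include <stdint.h>

/* One 8-bit lane of a VPSHLB-style logical shift with a signed count.
   Assumption: counts of magnitude >= 8 shift every bit out. */
uint8_t vpshlb_lane(uint8_t value, int8_t count)
{
    if (count >= 0)
        return count < 8 ? (uint8_t)(value << count) : 0;
    int right = -(int)count;
    return right < 8 ? (uint8_t)(value >> right) : 0;
}

/* One 8-bit lane of a VPSHAB-style arithmetic shift with a signed count.
   Right shifts replicate the sign bit; large right counts are clamped
   to 7 (assumption), which leaves only copies of the sign bit. */
int8_t vpshab_lane(int8_t value, int8_t count)
{
    if (count >= 0)
        return count < 8 ? (int8_t)(uint8_t)((uint8_t)value << count) : 0;
    int right = -(int)count;
    if (right > 7)
        right = 7;
    int v = value;
    /* Portable arithmetic right shift: invert, shift, invert back. */
    return (int8_t)(v < 0 ? ~(~v >> right) : v >> right);
}
```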
XOP also included two vector instructions that used the VEX prefix instead of the XOP prefix:
Instruction description | Instruction mnemonics | Opcode | W=1 swap allowed |
L=1 (256b) allowed |
---|---|---|---|---|
Permute two-source double-precision floating-point values. | VPERMIL2PD ymm1,ymm2,ymm3/m256,ymm4,imm4 |
VEX.NP.0F3A 49 /r /is4 |
Yes | Yes |
Permute two-source single-precision floating-point values. | VPERMIL2PS ymm1,ymm2,ymm3/m256,ymm4,imm4 |
VEX.NP.0F3A 48 /r /is4 |
Yes | Yes |
The instructions VPERMIL2PD
and VPERMIL2PS
were originally defined by Intel in early drafts of the AVX specification[24] − they were removed in later drafts[25][26] and were never implemented in any Intel processor. They were, however, implemented by AMD, who designated them as being a part of the XOP instruction set extension. (Like the other parts of XOP, they've been removed in AMD Zen.)
Supported in AMD processors starting with the Bulldozer architecture, removed in Zen. Not supported by any Intel chip as of 2023.
Fused multiply-add with four operands. FMA4 was realized in hardware before FMA3.
Instruction | Opcode | Meaning | Notes |
---|---|---|---|
VFMADDPD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 69 /r /is4 | Fused Multiply-Add of Packed Double-Precision Floating-Point Values | |
VFMADDPS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 68 /r /is4 | Fused Multiply-Add of Packed Single-Precision Floating-Point Values | |
VFMADDSD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 6B /r /is4 | Fused Multiply-Add of Scalar Double-Precision Floating-Point Values | |
VFMADDSS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 6A /r /is4 | Fused Multiply-Add of Scalar Single-Precision Floating-Point Values | |
VFMADDSUBPD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 5D /r /is4 | Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Floating-Point Values | |
VFMADDSUBPS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 5C /r /is4 | Fused Multiply-Alternating Add/Subtract of Packed Single-Precision Floating-Point Values | |
VFMSUBADDPD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 5F /r /is4 | Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Floating-Point Values | |
VFMSUBADDPS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 5E /r /is4 | Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Floating-Point Values | |
VFMSUBPD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 6D /r /is4 | Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values | |
VFMSUBPS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 6C /r /is4 | Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values | |
VFMSUBSD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 6F /r /is4 | Fused Multiply-Subtract of Scalar Double-Precision Floating-Point Values | |
VFMSUBSS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 6E /r /is4 | Fused Multiply-Subtract of Scalar Single-Precision Floating-Point Values | |
VFNMADDPD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 79 /r /is4 | Fused Negative Multiply-Add of Packed Double-Precision Floating-Point Values | |
VFNMADDPS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 78 /r /is4 | Fused Negative Multiply-Add of Packed Single-Precision Floating-Point Values | |
VFNMADDSD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 7B /r /is4 | Fused Negative Multiply-Add of Scalar Double-Precision Floating-Point Values | |
VFNMADDSS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 7A /r /is4 | Fused Negative Multiply-Add of Scalar Single-Precision Floating-Point Values | |
VFNMSUBPD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 7D /r /is4 | Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point Values | |
VFNMSUBPS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 7C /r /is4 | Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Values | |
VFNMSUBSD xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 7F /r /is4 | Fused Negative Multiply-Subtract of Scalar Double-Precision Floating-Point Values | |
VFNMSUBSS xmm0, xmm1, xmm2, xmm3 | C4E3 WvvvvL01 7E /r /is4 | Fused Negative Multiply-Subtract of Scalar Single-Precision Floating-Point Values |
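To make the opcode pattern in the table more concrete, the sketch below assembles the bytes for the register-only form of VFMADDPD in C. It is illustrative only and rests on several assumptions that are not spelled out in the table: the function name is hypothetical, only registers xmm0 to xmm7 are handled, W and L are fixed to 0, the vvvv field is taken to hold the second operand in inverted form, ModRM is taken to hold the first operand in its reg field and the third operand in its r/m field, and the fourth operand is taken to sit in the upper four bits of the trailing /is4 byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Builds "C4 E3 WvvvvL01 69 /r /is4" for
   VFMADDPD xmm<dst>, xmm<src1>, xmm<src2>, xmm<src3>
   with W=0, L=0 and all register numbers in 0..7 (assumed layout). */
size_t encode_vfmaddpd_regs(uint8_t out[6], unsigned dst, unsigned src1,
                            unsigned src2, unsigned src3)
{
    out[0] = 0xC4;                                    /* 3-byte VEX prefix */
    out[1] = 0xE3;                                    /* no R/X/B extension, map 0F3A */
    out[2] = (uint8_t)(((~src1 & 0xFu) << 3) | 0x1u); /* W=0, inverted vvvv, L=0, pp=01 */
    out[3] = 0x69;                                    /* VFMADDPD opcode */
    out[4] = (uint8_t)(0xC0u | ((dst & 7u) << 3) | (src2 & 7u)); /* ModRM, mod=11 */
    out[5] = (uint8_t)((src3 & 0xFu) << 4);           /* /is4: register in bits 7:4 */
    return 6;
}
```

Under these assumptions, the operands xmm0, xmm1, xmm2, xmm3 produce the byte sequence C4 E3 71 69 C2 30. Checking the output against an assembler that supports FMA4 is advisable before relying on it.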
AMD introduced TBM together with BMI1 in its Piledriver[27] line of processors; later AMD Jaguar and Zen-based processors do not support TBM.[28] No Intel processors (as of 2023) support TBM.
The TBM instructions are all encoded using the XOP prefix. They are all available in 32-bit and 64-bit forms, selected with the XOP.W bit (0=32bit, 1=64bit). (XOP.W is ignored outside 64-bit mode.) Like all instructions encoded with VEX/XOP prefixes, they are unavailable in Real Mode and Virtual-8086 mode.
Instruction | Opcode | Description[29] | Equivalent C expression[30] |
---|---|---|---|
BEXTR reg,r/m,imm32 |
XOP.A 10 /r imm32 |
Bit field extract (immediate form)[a]
The imm32 is interpreted as follows:
|
(src >> start) & ((1 << len) − 1) |
BLCFILL reg,r/m |
XOP.9 01 /1 |
Fill from lowest clear bit | x & (x + 1) |
BLCI reg,r/m |
XOP.9 02 /6 |
Isolate lowest clear bit | x | ~(x + 1) |
BLCIC reg,r/m |
XOP.9 01 /5 |
Isolate lowest clear bit and complement | ~x & (x + 1) |
BLCMSK reg,r/m |
XOP.9 02 /1 |
Mask from lowest clear bit | x ^ (x + 1) |
BLCS reg,r/m |
XOP.9 01 /3 |
Set lowest clear bit | x | (x + 1) |
BLSFILL reg,r/m |
XOP.9 01 /2 |
Fill from lowest set bit | x | (x − 1) |
BLSIC reg,r/m |
XOP.9 01 /6 |
Isolate lowest set bit and complement | ~x | (x − 1) |
T1MSKC reg,r/m |
XOP.9 01 /7 |
Inverse mask from trailing ones | ~x | (x + 1) |
TZMSK reg,r/m |
XOP.9 01 /4 |
Mask from trailing zeros | ~x & (x − 1) |
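The C expressions in the table translate directly into small helper functions, which can serve as portable replacements on processors without TBM. The sketch below covers the 32-bit forms; the `tbm_` function names are hypothetical, and the immediate form of BEXTR is modelled with separate `start` and `len` parameters rather than by decoding the imm32 operand. The guards for `start` and `len` of 32 or more are there to keep the C shifts well defined and are an assumption about the intended result.

```c
#include <stdint.h>

/* BEXTR: (src >> start) & ((1 << len) - 1), with guards so that
   the shifts stay within the range that C defines. */
uint32_t tbm_bextr(uint32_t src, unsigned start, unsigned len)
{
    if (start >= 32)
        return 0;
    uint32_t mask = len >= 32 ? UINT32_MAX : ((UINT32_C(1) << len) - 1);
    return (src >> start) & mask;
}

uint32_t tbm_blcfill(uint32_t x) { return x & (x + 1); }   /* fill from lowest clear bit */
uint32_t tbm_blci(uint32_t x)    { return x | ~(x + 1); }  /* isolate lowest clear bit */
uint32_t tbm_blcic(uint32_t x)   { return ~x & (x + 1); }  /* ... and complement */
uint32_t tbm_blcmsk(uint32_t x)  { return x ^ (x + 1); }   /* mask from lowest clear bit */
uint32_t tbm_blcs(uint32_t x)    { return x | (x + 1); }   /* set lowest clear bit */
uint32_t tbm_blsfill(uint32_t x) { return x | (x - 1); }   /* fill from lowest set bit */
uint32_t tbm_blsic(uint32_t x)   { return ~x | (x - 1); }  /* isolate lowest set bit and complement */
uint32_t tbm_t1mskc(uint32_t x)  { return ~x | (x + 1); }  /* inverse mask from trailing ones */
uint32_t tbm_tzmsk(uint32_t x)   { return ~x & (x - 1); }  /* mask from trailing zeros */
```

For example, with x = 0xA7 (binary 1010 0111) the lowest clear bit is bit 3, so `tbm_blcs(0xA7)` gives 0xAF and `tbm_blcmsk(0xA7)` gives 0x0F.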
The AMD Lightweight Profiling (LWP) feature was introduced in AMD Bulldozer and removed in AMD Zen. On all supported CPUs, the latest available microcode updates have disabled LWP due to Spectre mitigations.[31]
These instructions are available in Ring 3, but not available in Real Mode and Virtual-8086 mode. All of them use the XOP prefix.
Instruction | Opcode | Description |
---|---|---|
LLWPCB r32/64 |
XOP.9 12 /0 |
Load LWPCB (Lightweight Profiling Control Block) address.[a]
Loading an address of 0 disables LWP. Loading a nonzero address will cause the CPU to perform validation of the specified LWPCB, then enable LWP if the validation passed. If LWP was already enabled, state for the previous LWPCB is flushed to memory. |
SLWPCB r32/64 |
XOP.9 12 /1 |
Store LWPCB address[a] to register, and flush LWP state to memory.
If LWP is not enabled, the stored address is 0. |
LWPINS r32/64, r/m32, imm32 |
XOP.A 12 /0 imm32 |
Insert user event record with EventID=255 in LWP ring buffer. The arguments are inserted into the event record as follows:
The |
LWPVAL r32/64, r/m32, imm32 |
XOP.A 12 /1 imm32 |
Decrement the event counter associated with the programmed value sample event. If the resulting counter value ends up negative, insert an event record with EventID=1 in LWP ring buffer. (The instruction arguments are inserted in this record in the same way as for LWPINS .)
Executes as NOP if LWP is not enabled or if the event counter is not enabled. If no event record is inserted, then the second argument (which may be a memory argument) is not accessed. |
[a] The LWPCB address used by LLWPCB and SLWPCB is an effective-address, specified relative to the DS: segment base address. LLWPCB converts this effective-address to a linear-address by adding the DS base address to it, and SLWPCB converts it back by subtracting the DS base address. Changing the DS base address while LWP is enabled will thereby cause SLWPCB to return a different address than what was specified to LLWPCB, and may also cause XSAVE to fail to save LWP state properly.
These instructions are specific to the NEC V20/V30 CPUs and their successors, and do not appear in any non-NEC CPUs. Many of their opcodes have been reassigned to other instructions in later non-NEC CPUs.
Instruction | Opcode | Description | Available on |
---|---|---|---|
TEST1 r/m8, CL TEST1 r/m16, CL |
0F 10 /0 0F 11 /0 |
Test one bit.
First argument specifies an 8/16-bit register or memory location. Second argument specifies which bit to test. |
All V-series[32] except V30MZ[33] |
TEST1 r/m8, imm8 TEST1 r/m16, imm8 |
0F 18 /0 ib 0F 19 /0 ib | ||
CLR1 r/m8, CL CLR1 r/m16, CL |
0F 12 /0 0F 13 /0 |
Clear one bit. | |
CLR1 r/m8, imm8 CLR1 r/m16, imm8 |
0F 1A /0 ib 0F 1B /0 ib | ||
SET1 r/m8, CL SET1 r/m16, CL |
0F 14 /0 0F 15 /0 |
Set one bit. | |
SET1 r/m8, imm8 SET1 r/m16, imm8 |
0F 1C /0 ib 0F 1D /0 ib | ||
NOT1 r/m8, CL NOT1 r/m16, CL |
0F 16 /0 0F 17 /0 |
Invert one bit. | |
NOT1 r/m8, imm8 NOT1 r/m16, imm8 |
0F 1E /0 ib 0F 1F /0 ib | ||
ADD4S |
0F 20 |
Add Nibble Strings.
Performs a string addition of integers in packed BCD format (2 BCD digits per byte). DS:SI points to a source integer, ES:DI to a destination integer, and CL provides the number of digits to add. The operation is then: destination <- destination + source | |
SUB4S |
0F 22 |
Subtract Nibble Strings.
destination <- destination − source | |
CMP4S |
0F 26 |
Compare Nibble Strings. | |
ROL4 r/m8 |
0F 28 /0 |
Rotate Left Nibble.
Concatenates its 8-bit argument with the bottom 4 bits of AL to form a 12-bit bitvector, then left-rotates this bitvector by 4 bits, then writes this bitvector back to its argument and the bottom 4 bits of AL. | |
ROR4 r/m8 |
0F 2A /0 |
Rotate Right Nibble. Similar to ROL4 , except performs a right-rotate by 4 bits. | |
EXT r8,r8 |
0F 33 /r |
Bitfield extract.
Perform a bitfield read from memory. DS:SI (DS0:IX in NEC nomenclature) points to memory location to read from, first argument specifies bit-offset to read from, and second argument specifies the number of bits to read minus 1. The result is placed in AX. After the bitfield read, SI and the first argument are updated to point just beyond the just-read bitfield. | |
EXT r8,imm8 |
0F 3B /0 ib | ||
INS r8,r8 |
0F 31 /r |
Bitfield Insert.
Perform a bitfield write to memory. ES:DI (DS1:IY in NEC nomenclature) points to memory location to write to, AX contains data to write, first argument specifies bit-offset to write to, and second argument specifies the number of bits to write minus 1. After the bitfield write, DI and the first argument are updated to point just beyond the just-written bitfield. | |
INS r8,imm8 |
0F 39 /0 ib | ||
REPC |
64 |
Repeat if carry. Instruction prefix for use with CMPS /SCAS . | |
REPNC |
65 |
Repeat if not carry. Instruction prefix for use with CMPS /SCAS . | |
FPO2 |
66 /r 67 /r |
"Floating Point Operation 2": extra escape opcodes for floating-point coprocessor, in addition to the standard D8-DF ones used for x87.
The FPO2 escape opcodes are used by the NEC 72291 floating-point coprocessor - this coprocessor also uses the standard | |
BRKEM imm8 |
0F FF ib |
Break to 8080 emulation mode.
Jump to an address picked from the IVT (Interrupt Vector Table) using the imm8 argument, similar to the 8086 |
V20, V30, V40, V50[32] |
BRKXA imm8 |
0F E0 ib |
Break to Extended Address Mode.
Jump to an address picked from the IVT using the imm8 argument. Enables a simple memory paging mechanism after reading the IVT but before executing the jump. The paging mechanism uses an on-chip page table with 16Kbyte pages and no access rights checking.[35] |
V33, V53[32] |
RETXA imm8 |
0F F0 ib |
Return from Extended Address Mode.
Jump to an address picked from the IVT using the imm8 argument. Disables paging after reading the IVT but before executing the jump. | |
MOVSPA |
0F 25 |
Transfer both SS and SP of old register bank after the bank has been switched by an interrupt or BRKCS instruction. |
V25, V35,[36] V55[37] |
BRKCS r16 |
0F 2D /0 |
Perform software interrupt with context switch to register bank specified by low 3 bits of r16. | |
RETRBI |
0F 91 |
Return from register bank context switch interrupt. | |
FINT |
0F 92 |
Finish Interrupt. | |
TSKSW r16 |
0F 94 /7 |
Perform task switch to register bank indicated by low 3 bits of r16. | |
MOVSPB r16 |
0F 95 /7 |
Transfer SS and SP of current register bank to register bank indicated by low 3 bits of r16. | |
BTCLR imm8,imm8,cb |
0F 9C ib ib rel8 |
Bit Test and Clear.
The first argument specifies a V25/V35 Special Function Register to test a bit in. The second argument specifies a bit position in that register. The third argument specifies a short branch offset. If the bit was set to 1, then it is cleared and a short branch is taken, else the branch is not taken. | |
STOP |
0F 9E |
CPU Halt.
Differs from the conventional 8086 | |
BRKS imm8 |
F1 ib |
Break and Enable Software Guard.
Jump to an address picked from the IVT using the imm8 argument, and then continue execution with "Software Guard" enabled. The "Software Guard" is an 8-bit substitution cipher that, during instruction fetch/decode, translates opcode bytes using a 256-entry lookup table stored in an on-chip mask ROM. |
V25, V35 "Software Guard"[38] |
BRKN imm8 |
63 ib |
Break and Enable Native Mode. Similar to BRKS, except that it disables "Software Guard" rather than enabling it. | |
MOV r/m,DS3 |
8C /6 |
Move to/from the DS2 and DS3 extended segment registers.
The DS2 and DS3 registers (which are specific to the NEC V55) act similarly to regular x86 real-mode segment registers, except that they are left-shifted by 8 bits rather than 4, enabling access to 16 MB of memory. Block transfer instructions, such as MOVBKW, can access the 16 MB memory space by being prefixed with both DS2 and DS3 simultaneously.[39] |
V55[37] |
MOV r/m,DS2 |
8C /7 | ||
MOV DS3,r/m |
8E /6 | ||
MOV DS2,r/m |
8E /7 | ||
PUSH DS3 |
0F 76 [40] | ||
POP DS3 |
0F 77 | ||
PUSH DS2 |
0F 7E | ||
POP DS2 |
0F 7F | ||
MOV DS3,r16,m32 |
0F 36 /r |
Instructions to load both extended segment register and general-purpose register at once, similar to 8086's LDS and LES instructions | |
MOV DS2,r16,m32 |
0F 3E /r | ||
DS2: |
63 |
Segment-override prefixes for the DS2 and DS3 extended segments. | |
DS3: |
D6 | ||
IRAM: |
F1 |
Register File Override Prefix. Will cause memory operands to index into register file rather than general memory. | |
BSCH r/m8 BSCH r/m16 |
0F 3C /0 0F 3D /0 |
Count Trailing Zeroes and store result in CL. Sets ZF=1 for all-0s input. | |
RSTWDT imm8,imm8 |
0F 96 ib ib |
Watchdog Timer Manipulation Instruction. | |
BTCLRL imm8,imm8,cb |
0F 9D ib ib rel8 |
Bit test and clear for second bank of special purpose registers (similar to BTCLR ). | |
QHOUT imm16 |
0F E0 iw |
Queue manipulation instructions. | |
QOUT imm16 |
0F E1 iw | ||
QTIN imm16 |
0F E2 iw | ||
IDLE |
0F 9F |
Put CPU in idle mode. | V55SC[41] |
ALBIT |
0F 9A |
Dedicated fax instructions. | V55PI[37] |
COLTRP |
0F 9B | ||
MHENC |
0F 93 | ||
MRENC |
0F 97 | ||
SCHEOL |
0F 78 | ||
GETBIT |
0F 79 | ||
MHDEC |
0F 7C | ||
MRDEC |
0F 7D | ||
CNVTRP |
0F 7A | ||
(no mnemonic) | 63 |
Designated opcode for termination of the x86 emulation mode on the NEC V60.[42] | V60, V70 |
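Some of the data operations described in the table above are easier to follow as code. The following Python sketch models ROL4, ROR4, ADD4S and the DS2/DS3 address calculation as read from the descriptions above; it is not taken from NEC documentation. The function names are made up for illustration. Several details are assumptions: the upper four bits of AL are left unchanged, the BCD strings are stored least-significant byte first, an odd digit count is rounded up to a whole byte, flag results are not modelled, and the extended address wraps at 24 bits.

```python
def rol4(al, operand):
    """ROL4 sketch: rotate the 12-bit value operand:AL[3:0] left by 4 bits.

    Returns (new_al, new_operand). The upper nibble of AL is kept (assumption).
    """
    vec = ((operand & 0xFF) << 4) | (al & 0x0F)
    vec = ((vec << 4) | (vec >> 8)) & 0xFFF
    return (al & 0xF0) | (vec & 0x0F), vec >> 4


def ror4(al, operand):
    """ROR4 sketch: same 12-bit value, rotated right by 4 bits."""
    vec = ((operand & 0xFF) << 4) | (al & 0x0F)
    vec = ((vec >> 4) | (vec << 8)) & 0xFFF
    return (al & 0xF0) | (vec & 0x0F), vec >> 4


def add4s(dst, src, digits):
    """ADD4S sketch: dst <- dst + src on packed BCD strings (2 digits per byte).

    dst is a bytearray modified in place; byte 0 holds the two least
    significant digits (assumption). Inputs must be valid BCD.
    Returns the final decimal carry (0 or 1).
    """
    carry = 0
    for i in range((digits + 1) // 2):
        lo = (dst[i] & 0x0F) + (src[i] & 0x0F) + carry
        carry, lo = divmod(lo, 10)
        hi = (dst[i] >> 4) + (src[i] >> 4) + carry
        carry, hi = divmod(hi, 10)
        dst[i] = (hi << 4) | lo
    return carry


def extended_linear(segment, offset):
    """DS2/DS3 sketch: segment shifted left by 8 bits instead of 4."""
    return (((segment & 0xFFFF) << 8) + (offset & 0xFFFF)) & 0xFFFFFF
```

With AL = `0xA3` and an operand of `0x12`, `rol4` moves the operand's high nibble into AL and AL's low nibble into the operand, giving AL = `0xA1` and operand = `0x23`. For comparison, a conventional real-mode address for segment `0x1234` and offset `0x0010` is `0x12350`, while `extended_linear(0x1234, 0x0010)` gives `0x123410`.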
These instructions are present in Cyrix CPUs as well as NatSemi/AMD Geode CPUs derived from Cyrix microarchitectures (Geode GX and LX, but not NX). They are also present in Cyrix manufacturing partner CPUs from IBM, ST and TI, as well as the VIA Cyrix III ("Joshua" core only, not "Samuel") and a few SoCs such as STPC ATLAS and ZFMicro ZFx86.[43] Many of these opcodes have been reassigned to other instructions in later non-Cyrix CPUs.
Instruction | Opcode | Description | Available on |
---|---|---|---|
SVDC m80,sreg |
0F 78 /r |
Save segment register and descriptor to memory as a 10-byte data structure.
The first 8 bytes are the descriptor, the last two bytes are the selector.[44] |
System Management Mode instructions.[a]
Not present on stepping A of Cx486SLC and Cx486DLC.[45] Present on Cx486SLC/e[46] and all later Cyrix CPUs. Present on all Cyrix-derived Geode CPUs. |
RSDC sreg,m80 [b] |
0F 79 /r |
Restore segment register and descriptor from memory | |
SVLDT m80 |
0F 7A /0 |
Save LDTR and descriptor | |
RSLDT m80 |
0F 7B /0 |
Restore LDTR and descriptor | |
SVTS m80 |
0F 7C /0 |
Save TSR and descriptor | |
RSTS m80 |
0F 7D /0 |
Restore TSR and descriptor | |
SMINT [c] |
0F 7E |
System management software interrupt.
Uses Uses |
Cyrix 486S[11] and later processors - not available on older Cyrix 486SLC/DLC/SRx2/DRx2 processors.
Not available on any Ti486 processors. |
0F 38 | |||
RDSHR r/m32 |
0F 36 /0 [d] |
Read SMM Header Pointer Register | Cyrix 6x86MX[48] and MII
VIA Cyrix III[51] |
WRSHR r/m32 |
0F 37 /0 [d] |
Write SMM Header Pointer Register | |
BB0_RESET |
0F 3A |
Reset BLT Buffer Pointer 0 to base | Cyrix MediaGX and MediaGXm[52]
NatSemi Geode GXm, GXLV, GX1 |
BB1_RESET |
0F 3B |
Reset BLT Buffer Pointer 1 to base | |
CPU_WRITE |
0F 3C |
Write to CPU internal special register (EBX=register-index, EAX=data) | |
CPU_READ |
0F 3D |
Read from CPU internal special register (EBX=register-index, EAX=data) | |
DMINT |
0F 39 |
Debug Management Mode Interrupt | NatSemi Geode GX2
AMD Geode GX, LX[47] |
RDM |
0F 3A |
Return from Debug Management Mode |
[b] RSDC with CS as a destination register is only supported on NatSemi Geode GX2 and AMD Geode GX/LX[47]; on other processors, it causes #UD.
[c] The mnemonic SMINTOLD is used for the 0F 7E encoding.
[d] For the RDSHR and WRSHR instructions, Cyrix's documentation[48] specifies that the instruction accepts a ModR/M byte but does not specify the encoding of the ModR/M byte's reg field. NASM v0.98.31 and later uses /0 for these instructions,[49] while sandpile.org's opcode tables[50] indicate that the reg field is ignored for these instructions.
These instructions were introduced in the Cyrix 6x86MX and MII processors, and were also present in the MediaGXm and Geode GX1[53] processors. (In later non-Cyrix processors, all of their opcodes have been used for SSE or SSE2 instructions.)
These instructions are integer SIMD instructions acting on 64-bit vectors in MMX registers or memory. Each instruction takes two explicit operands, where the first one is an MMX register operand and the second one is either a memory operand or a second MMX register. In addition, several of the instructions take an implied operand, which is an MMX register implied from the first operand as follows:
First explicit operand | mm0 | mm1 | mm2 | mm3 | mm4 | mm5 | mm6 | mm7 |
---|---|---|---|---|---|---|---|---|
Implied operand | mm1 | mm0 | mm3 | mm2 | mm5 | mm4 | mm7 | mm6 |
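The pairing in this table amounts to flipping the lowest bit of the register number. A minimal Python sketch of that observation follows; the function name is made up for illustration and the rule is read off the table, not taken from Cyrix documentation.

```python
def implied_mmx_register(first_operand):
    """Return the register number of the implied operand for mm<first_operand>."""
    if not 0 <= first_operand <= 7:
        raise ValueError("MMX register number must be in the range 0..7")
    # mm0<->mm1, mm2<->mm3, mm4<->mm5, mm6<->mm7: toggle bit 0.
    return first_operand ^ 1
```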
In the instruction descriptions in the below table, arg1
and arg2
refer to the two explicit operands of the instruction, and imp
to the implied operand.
Instruction | Opcode | Description | |
---|---|---|---|
PAVEB mm,mm/m64 | 0F 50 /r | Packed average bytes:[a] arg1 <- (arg1+arg2) >> 1 | |
PADDSIW mm,mm/m64 | 0F 51 /r | Packed add signed words with saturation, using implied destination: imp <- saturate_s16(arg1+arg2) | |
PMAGW mm,mm/m64 | 0F 52 /r | Packed signed word magnitude maximum value: if (abs(arg2) > abs(arg1)) then arg1 <- arg2 | |
PDISTIB mm,m64 [b] | 0F 54 /r | Packed unsigned byte distance and accumulate to implied destination, with saturation: imp <- saturate_u8(imp + (abs(arg1-arg2))) | |
PSUBSIW mm,mm/m64 | 0F 55 /r | Packed subtract signed words with saturation, using implied destination: imp <- saturate_s16(arg1-arg2) | |
PMULHRW mm,mm/m64,[c] PMULHRWC mm,mm/m64 | 0F 59 /r | Packed signed word multiply high with rounding: arg1 <- (arg1*arg2+0x4000)>>15 | |
PMULHRIW mm,mm/m64 | 0F 5D /r | Packed signed word multiply high with rounding and implied destination: imp <- (arg1*arg2+0x4000)>>15 | |
PMACHRIW mm,m64 [b] | 0F 5E /r | Packed signed word multiply high with rounding and accumulation to implied destination: imp <- imp + ((arg1*arg2+0x4000)>>15) | |
PMVZB mm,m64 [b] | 0F 58 /r | if (imp == 0) then arg1 <- arg2 |
Packed conditional load from memory to MMX register.
Condition is evaluated on a per-byte-lane basis, by comparing byte lanes in the implied source to zero (with signed compare) − if the comparison passes, then the corresponding destination lane is loaded from memory, otherwise it keeps its original value. |
PMVNZB mm,m64 [b] | 0F 5A /r | if (imp != 0) then arg1 <- arg2 | |
PMVLZB mm,m64 [b] | 0F 5B /r | if (imp < 0) then arg1 <- arg2 | |
PMVGEZB mm,m64 [b] | 0F 5C /r | if (imp >= 0) then arg1 <- arg2 |
[a] It is unclear whether the PAVEB instruction treats the bytes as signed or unsigned.[54]
[c] The EMMI PMULHRW instruction has the same mnemonic as the 3DNow! PMULHRW instruction, however its opcode and function differ (the EMMI instruction right-shifts its multiply-result by 15 bits, while the 3DNow! instruction right-shifts by 16 bits). Some assemblers/disassemblers, such as NASM, resolve this ambiguity by using the mnemonic PMULHRWA for the 3DNow! instruction and PMULHRWC for the EMMI instruction.
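The per-lane operations in the EMMI table can be sketched in Python as follows. This is an illustration of the formulas in the table, not a verified model of Cyrix hardware. Vectors are represented as plain lists of lane values (four signed words or eight bytes), each function returns the new value of the destination named in the table, and the function names simply mirror the mnemonics. Wrapping the PMULHRW result to a signed 16-bit lane is an assumption.

```python
def saturate(value, lo, hi):
    """Clamp value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def wrap_s16(value):
    """Truncate to 16 bits and reinterpret as a signed word (assumption)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def paddsiw(arg1, arg2):
    """imp <- saturate_s16(arg1 + arg2), per signed word lane."""
    return [saturate(a + b, -32768, 32767) for a, b in zip(arg1, arg2)]


def pmulhrw(arg1, arg2):
    """arg1 <- (arg1 * arg2 + 0x4000) >> 15, per signed word lane."""
    return [wrap_s16((a * b + 0x4000) >> 15) for a, b in zip(arg1, arg2)]


def pdistib(imp, arg1, arg2):
    """imp <- saturate_u8(imp + abs(arg1 - arg2)), per unsigned byte lane."""
    return [saturate(i + abs(a - b), 0, 255) for i, a, b in zip(imp, arg1, arg2)]


def pmvzb(imp, arg1, arg2):
    """if (imp == 0) then arg1 <- arg2, per byte lane; otherwise arg1 is kept."""
    return [b if i == 0 else a for i, a, b in zip(imp, arg1, arg2)]
```

Read as a fixed-point operation, the shift by 15 in `pmulhrw` treats each word as a fraction with 15 fractional bits: `0x4000` represents 0.5, and `pmulhrw([0x4000], [0x4000])` yields `0x2000`, which represents 0.25.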
All VIA C3 processors support the VIA AIS (Alternate Instruction Set). The x86 instructions present in these processors to support AIS are:
Instruction | Opcode | Description |
---|---|---|
JMPAI EAX | 0F 3F [55] | Near Jump to address in EAX, and enter Alternate Instruction mode. |
AI uop32 |
8D 84 00 imm32 [55] | Alternate instruction wrapper opcode ("Samuel"/"Ezra" variants of C3 - repurposes the instruction encoding for LEA EAX,[EAX+EAX+disp32] )
32-bit immediate is treated as a 32-bit instruction of the RISC-like Alternate Instruction Set. An instruction set reference is available.[56] |
62 80 imm32 [57] | Alternate instruction wrapper opcode ("Nehemiah" variants of C3 - repurposes the instruction encoding for BOUND EAX,[EAX+disp32] ) |
These instructions are not present in VIA C7 or any later VIA processor.
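The two wrapper encodings listed above differ only in their leading bytes. As a rough illustration, the Python sketch below builds the byte sequence for a given 32-bit alternate instruction word. The function and dictionary names are made up, and the immediate is assumed to be stored little-endian, as is usual for x86 immediates and displacements. The value `0x11223344` in the usage note is a placeholder, not a real AIS instruction.

```python
import struct

# Leading bytes of the wrapper, keyed by C3 core family (names are illustrative).
WRAPPER_PREFIX = {
    "samuel_ezra": bytes([0x8D, 0x84, 0x00]),  # LEA EAX,[EAX+EAX+disp32] encoding
    "nehemiah": bytes([0x62, 0x80]),           # BOUND EAX,[EAX+disp32] encoding
}


def wrap_alternate_instruction(uop32, variant):
    """Return the x86 byte sequence wrapping one 32-bit alternate instruction."""
    if not 0 <= uop32 <= 0xFFFFFFFF:
        raise ValueError("uop32 must fit in 32 bits")
    # "<I" packs an unsigned 32-bit integer in little-endian byte order.
    return WRAPPER_PREFIX[variant] + struct.pack("<I", uop32)
```

Calling `wrap_alternate_instruction(0x11223344, "samuel_ezra")` gives the seven bytes `8D 84 00 44 33 22 11`, while the `"nehemiah"` variant gives the six bytes `62 80 44 33 22 11`.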
The C&T F8680 PC/Chip is a system-on-a-chip featuring an 80186-compatible CPU core, with a few additional instructions to support the F8680-specific "SuperState R"[58] supervisor/system-management feature. Some of the added instructions for "SuperState R" are:[59]
Instruction | Opcode | Description |
---|---|---|
LFEAT AX | FE F8 | Load datum into F8680 "CREG" configuration register (AH=register-index, AL=datum)[60] |
STFEAT AL,imm8 | FE F0 ib | Read F8680 status register into AL (imm8=register-index) |
C&T also developed a 386-compatible processor known as the Super386. This processor supports, in addition to the basic Intel 386 instruction set, a number of instructions to support the Super386-specific "SuperState V" system-management feature. The added instructions for "SuperState V" are:[7]
Instruction | Opcode | Description |
---|---|---|
SCALL r/m | 0F 18 /0 | Call SMM interrupt handler[61][62] |
SRET | 0F 19 | Return from SMM interrupt handler |
SRESUME | 0F 1A | Return from SMM with interrupts disabled for one instruction |
SVECTOR | 0F 1B | Exit from SMM and issue a shutdown cycle |
EPIC | 0F 1E | Load one of the six interrupt or I/O traps |
RARF1 | 0F 3C | Read from bank 1 of the register file (includes visible and invisible CPU registers) |
RARF2 | 0F 3D | Read from bank 2 of the register file |
RARF3 | 0F 3E | Read from bank 3 of the register file |
LTLB | 0F F0 | Load TLB with page table entry |
RCT | 0F F1 | Read cache tag |
WCT | 0F F2 | Write cache tag |
RCD | 0F F3 | Read cache data |
WCD | 0F F4 | Write cache data |
RTLBPA | 0F F5 | Read TLB data (physical address) |
RTLBLA | 0F F6 | Read TLB tag (linear address) |
LCFG | 0F F7 | Load configuration register |
SCFG | 0F F8 | Store configuration register |
RGPR | 0F F9 | Read general-purpose register or any bank of register file |
RARF0 | 0F FA | Read from bank 0 of the register file |
RARFE | 0F FB | Read from extra bank of the register file |
WGPR | 0F FD | Write general-purpose register or any bank of register file |
WARFE | 0F FE | Write extra bank of the register file |
The M6117 series of embedded microcontrollers features an Intel 386SX-compatible CPU core derived from the V.M. Technology (VMT) VM386SX+ processor. The VM386SX+ adds a few processor-specific instructions to the Intel 386 instruction set. The ones documented for the DM&P M6117D are:[63]
Instruction | Opcode | Description |
---|---|---|
BRKPM | F1 | System management interrupt − enters "hyper state mode" |
RETPM | D6 E6 | Return from "hyper state mode" |
LDUSR UGRS,EAX | D6 CA 03 A0 | Set page address of SMI entry point |
(mnemonic not listed) | D6 C8 03 A0 | Read page address of SMI entry point |
MOV PWRCR,EAX | D6 FA 03 02 | Write to power control register |
Several 80387-class floating-point coprocessors provided extra instructions in addition to the standard 80387 ones − none of these are supported in later processors:
Instruction | Opcode | Description | Available on |
---|---|---|---|
FRSTPM |
DB F4 [64]
or
|
FPU Reset Protected Mode.
Instruction to signal to the FPU that the main CPU is exiting protected mode, similar to how the FSETPM instruction is used to signal to the FPU that the CPU is entering protected mode. Different sources provide different encodings for this instruction. |
Intel 287XL |
FNSTDW AX |
DF E1 |
Store FPU Device Word to AX | Intel 387SL[12][65] |
FNSTSG AX |
DF E2 |
Store FPU Signature Register to AX[a] | |
FSBP0 |
DB E8 |
Select Coprocessor Register Bank 0 | IIT 2c87, 3c87[12][67] |
FSBP1 |
DB EB |
Select Coprocessor Register Bank 1 | |
FSBP2 |
DB EA |
Select Coprocessor Register Bank 2 | |
FSBP3 |
DB E9 [68] |
Select Coprocessor Register Bank 3 (undocumented) | |
F4X4 ,
|
DB F1 |
Multiply 4-component vector with 4x4 matrix. For proper operation, the matrix must be preloaded into Coprocessor Register banks 1 and 2 (unique to IIT FPUs), and the vector must be loaded into Coprocessor Register Bank 0. Example code is available.[67][69] | |
FTSTP |
D9 E6 |
Equivalent to FTST followed by a stack pop. |
Cyrix EMC87, 83s87, 83d87, 387+[69][12][70] |
FRINT2 |
DB FC |
Round st(0) to integer, with round-to-nearest ties-away-from-zero rounding.[70] | |
FRICHOP |
DD FC |
Round st(0) to integer, with round-to-zero rounding. | |
FRINEAR |
DF FC |
Round st(0) to integer, with round-to-nearest-even rounding.[70] |
[a] The FNSTSG AX instruction can be executed not just on the Intel 387SL FPU but on the Intel 387SX as well; executing the instruction immediately after an FNINIT will cause the instruction to return 0000h on 387SX, but a nonzero signature value on the 387SL.[66]
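The three Cyrix rounding instructions differ only in how they treat the fractional part. The Python sketch below illustrates the three rounding rules named in the table; the function names mirror the mnemonics for convenience. It works on finite Python floats (64-bit doubles) only, whereas the coprocessors operate on x87 extended-precision values, so it shows the rounding rules and nothing else about the hardware.

```python
import math


def frint2(x):
    """Round to nearest integer, with ties rounded away from zero."""
    whole = math.floor(abs(x))
    if abs(x) - whole >= 0.5:
        whole += 1
    return math.copysign(whole, x)


def frichop(x):
    """Round toward zero (truncate the fractional part)."""
    return float(math.trunc(x))


def frinear(x):
    """Round to nearest integer, with ties rounded to the even neighbour."""
    # Python's built-in round() uses round-half-to-even for floats.
    return float(round(x))
```

The difference shows up only on exact ties: for an input of 2.5, `frint2` gives 3.0, `frinear` gives 2.0, and `frichop` gives 2.0 because it always discards the fraction.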