User:Punpcklbw/sandbox

MMX instructions and extended variants thereof


These instructions are, unless otherwise noted, available in the following forms:

  • MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
  • SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
  • AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
  • AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
  • AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another), and adds broadcast functionality for many of its instructions - used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below indicate whether opmasks and broadcasts are supported for each instruction and, if so, which lane widths they use.

For many of the instruction mnemonics, (V) is used to indicate that the instruction mnemonic exists in forms with and without a leading V - the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without a VEX/EVEX prefix.
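
As a hedged illustration of these vector widths and of the AVX-512 opmask/broadcast features, the sketch below shows the same packed 32-bit addition ((V)PADDD, listed in the first table) expressed through compiler intrinsics; the helper function names are illustrative only, and building it requires SSE2/AVX2/AVX-512F compiler support:

#include <immintrin.h>

// The same packed 32-bit integer add at each vector width.
__m128i add128(__m128i a, __m128i b) { return _mm_add_epi32(a, b); }    // SSE2: PADDD xmm
__m256i add256(__m256i a, __m256i b) { return _mm256_add_epi32(a, b); } // AVX2: VPADDD ymm
__m512i add512(__m512i a, __m512i b) { return _mm512_add_epi32(a, b); } // AVX-512F: VPADDD zmm

// AVX-512 opmask: only lanes whose bit is set in k are written;
// the remaining lanes keep the corresponding value from src (merge-masking).
__m512i add512_masked(__m512i src, __mmask16 k, __m512i a, __m512i b)
{
    return _mm512_mask_add_epi32(src, k, a, b);   // VPADDD zmm {k}
}

// AVX-512 broadcast: a single 32-bit value replicated to all lanes; compilers
// typically encode this as VPADDD with an embedded {1to16} memory broadcast.
__m512i add512_bcast(__m512i a, const int *scalar)
{
    return _mm512_add_epi32(a, _mm512_set1_epi32(*scalar));
}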

Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof

Description Instruction mnemonics Basic opcode MMX
(no prefix)
SSE2
(66h prefix)
AVX
(VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported subset lane bcst
Empty MMX technology state. (MMX)

Mark all the FP/MMX registers as Empty, so that they can be freely used by later x87 code.

EMMS (MMX) 0F 77 EMMS No VZEROUPPER(L=0)
VZEROALL(L=1)

[a][b]
No
Zero out upper bits of vector registers YMM0 to YMM15 (AVX) VZEROUPPER (AVX)
Zero out all bits of vector registers YMM0 to YMM15 (AVX) VZEROALL (AVX)
Move scalar value from GPR (general-purpose register) or memory to vector register, with zero-fill 32-bit (V)MOVD mm, r/m32 0F 6E /r Yes Yes Yes (L=0,W=0) Yes (L=0,W=0) F No No
64-bit
(x86-64)
(V)MOVQ mm, r/m64,
MOVD mm, r/m64[c]
Yes
(REX.W)
Yes
(REX.W)[d]
Yes (L=0,W=1) Yes (L=0,W=1) F No No
Move scalar value from vector register to GPR or memory 32-bit (V)MOVD r/m32, mm 0F 7E /r Yes Yes Yes (L=0,W=0) Yes (L=0,W=0) F No No
64-bit
(x86-64)
(V)MOVQ r/m64, mm,
MOVD r/m64, mm[c]
Yes
(REX.W)
Yes
(REX.W)[d]
Yes (L=0,W=1) Yes (L=0,W=1) F No No
Vector move between vector register and either memory or another vector register.

For move to/from memory, the memory address is required to be aligned for (V)MOVDQA variants but not for MOVQ.

128-bit VEX-encoded form of VMOVDQA with memory argument will, if the memory is cacheable, perform its memory access atomically.[e]

MOVQ mm/m64, mm(MMX)
(V)MOVDQA xmm/m128,xmm
0F 7F /r MOVQ MOVDQA VMOVDQA[f] VMOVDQA32​(W0) F 32 No
VMOVDQA64​(W1) F 64 No
MOVQ mm, mm/m64(MMX)
(V)MOVDQA xmm,xmm/m128
0F 6F /r MOVQ MOVDQA VMOVDQA[f] VMOVDQA32​(W0) F 32 No
VMOVDQA64​(W1) F 64 No
Pack 32-bit signed integers to 16-bit, with saturation (V)PACKSSDW mm, mm/m64[g] 0F 6B /r Yes Yes Yes Yes (W=0) BW 16 32
Pack 16-bit signed integers to 8-bit, with saturation (V)PACKSSWB mm, mm/m64[g] 0F 63 /r Yes Yes Yes Yes BW 8 No
Pack 16-bit unsigned integers to 8-bit, with saturation (V)PACKUSWB mm, mm/m64[g] 0F 67 /r Yes Yes Yes Yes BW 8 No
Unpack and interleave packed integers from the high halves of two input vectors 8-bit (V)PUNPCKHBW mm, mm/m64[g] 0F 68 /r Yes Yes Yes Yes BW 8 No
16-bit (V)PUNPCKHWD mm, mm/m64[g] 0F 69 /r Yes Yes Yes Yes BW 16 No
32-bit (V)PUNPCKHDQ mm, mm/m64[g] 0F 6A /r Yes Yes Yes Yes (W=0) F 32 32
Unpack and interleave packed integers from the low halves of two input vectors 8-bit (V)PUNPCKLBW mm, mm/m32[g][h] 0F 60 /r Yes Yes Yes Yes BW 8 No
16-bit (V)PUNPCKLWD mm, mm/m32[g][h] 0F 61 /r Yes Yes Yes Yes BW 16 No
32-bit (V)PUNPCKLDQ mm, mm/m32[g][h] 0F 62 /r Yes Yes Yes Yes (W=0) F 32 32
Add packed integers 8-bit (V)PADDB mm, mm/m64 0F FC /r Yes Yes Yes Yes BW 8 No
16-bit (V)PADDW mm, mm/m64 0F FD /r Yes Yes Yes Yes BW 16 No
32-bit (V)PADDD mm, mm/m64 0F FE /r Yes Yes Yes Yes (W=0) F 32 32
Add packed signed integers with saturation 8-bit (V)PADDSB mm, mm/m64 0F EC /r Yes Yes Yes Yes BW 8 No
16-bit (V)PADDSW mm, mm/m64 0F ED /r Yes Yes Yes Yes BW 16 No
Add packed unsigned integers with saturation 8-bit (V)PADDUSB mm, mm/m64 0F DC /r Yes Yes Yes Yes BW 8 No
16-bit (V)PADDUSW mm, mm/m64 0F DD /r Yes Yes Yes Yes BW 16 No
Subtract packed integers 8-bit (V)PSUBB mm, mm/m64 0F F8 /r Yes Yes Yes Yes BW 8 No
16-bit (V)PSUBW mm, mm/m64 0F F9 /r Yes Yes Yes Yes BW 16 No
32-bit (V)PSUBD mm, mm/m64 0F FA /r Yes Yes Yes Yes (W=0) F 32 32
Subtract packed signed integers with saturation 8-bit (V)PSUBSB mm, mm/m64 0F E8 /r Yes Yes Yes Yes BW 8 No
16-bit (V)PSUBSW mm, mm/m64 0F E9 /r Yes Yes Yes Yes BW 16 No
Subtract packed unsigned integers with saturation 8-bit (V)PSUBUSB mm, mm/m64 0F D8 /r Yes Yes Yes Yes BW 8 No
16-bit (V)PSUBUSW mm, mm/m64 0F D9 /r Yes Yes Yes Yes BW 16 No
Compare packed integers for equality 8-bit (V)PCMPEQB mm, mm/m64 0F 74 /r Yes Yes Yes Yes[i] BW 8 No
16-bit (V)PCMPEQW mm, mm/m64 0F 75 /r Yes Yes Yes Yes[i] BW 16 No
32-bit (V)PCMPEQD mm, mm/m64 0F 76 /r Yes Yes Yes Yes (W=0)[i] F 32 32
Compare packed integers for signed greater-than 8-bit (V)PCMPGTB mm, mm/m64 0F 64 /r Yes Yes Yes Yes[i] BW 8 No
16-bit (V)PCMPGTW mm, mm/m64 0F 65 /r Yes Yes Yes Yes[i] BW 16 No
32-bit (V)PCMPGTD mm, mm/m64 0F 66 /r Yes Yes Yes Yes (W=0)[i] F 32 32
Multiply packed 16-bit signed integers, add results pairwise into 32-bit integers (V)PMADDWD mm, mm/m64 0F F5 /r Yes Yes Yes Yes[j] BW 32 No
Multiply packed 16-bit signed integers, store high 16 bits of results (V)PMULHW mm, mm/m64 0F E5 /r Yes Yes Yes Yes BW 16 No
Multiply packed 16-bit integers, store low 16 bits of results (V)PMULLW mm, mm/m64 0F D5 /r Yes Yes Yes Yes BW 16 No
Vector bitwise AND (V)PAND mm, mm/m64 0F DB /r Yes Yes Yes VPANDD​(W0) F 32 32
VPANDQ​(W1) F 64 64
Vector bitwise AND-NOT (V)PANDN mm, mm/m64 0F DF /r Yes Yes Yes VPANDND​(W0) F 32 32
VPANDNQ​(W1) F 64 64
Vector bitwise OR (V)POR mm, mm/m64 0F EB /r Yes Yes Yes VPORD(W0) F 32 32
VPORQ(W1) F 64 64
Vector bitwise XOR (V)PXOR mm, mm/m64 0F EF /r Yes Yes Yes VPXORD(W0) F 32 32
VPXORQ(W1) F 64 64
left-shift of packed integers, with common shift-amount 16-bit (V)PSLLW mm, imm8 0F 71 /6 ib Yes Yes Yes Yes BW 16 No
(V)PSLLW mm, mm/m64[k] 0F F1 /r Yes Yes Yes Yes BW 16 No
32-bit (V)PSLLD mm, imm8 0F 72 /6 ib Yes Yes Yes Yes (W=0) F 32 32
(V)PSLLD mm, mm/m64[k] 0F F2 /r Yes Yes Yes Yes (W=0) F 32 No
64-bit (V)PSLLQ mm, imm8 0F 73 /6 ib Yes Yes Yes Yes (W=1) F 64 64
(V)PSLLQ mm, mm/m64[k] 0F F3 /r Yes Yes Yes Yes (W=1) F 64 No
Right-shift of packed signed integers, with common shift-amount 16-bit (V)PSRAW mm, imm8 0F 71 /4 ib Yes Yes Yes Yes BW 16 No
(V)PSRAW mm, mm/m64[k] 0F E1 /r Yes Yes Yes Yes BW 16 No
32-bit (V)PSRAD mm, imm8 0F 72 /4 ib Yes Yes Yes Yes (W=0) F 32 32
(V)PSRAD mm, mm/m64[k] 0F E2 /r Yes Yes Yes Yes (W=0) F 32 No
Right-shift of packed unsigned integers, with common shift-amount 16-bit (V)PSRLW mm, imm8 0F 71 /2 ib Yes Yes Yes Yes BW 16 No
(V)PSRLW mm, mm/m64[k] 0F D1 /r Yes Yes Yes Yes BW 16 No
32-bit (V)PSRLD mm, imm8 0F 72 /2 ib Yes Yes Yes Yes (W=0) F 32 32
(V)PSRLD mm, mm/m64[k] 0F D2 /r Yes Yes Yes Yes (W=0) F 32 No
64-bit (V)PSRLQ mm, imm8 0F 73 /2 ib Yes Yes Yes Yes (W=1) F 64 64
(V)PSRLQ mm, mm/m64[k] 0F D3 /r Yes Yes Yes Yes (W=1) F 64 No
  1. ^ For code that may potentially mix use of legacy-SSE instructions with AVX instructions, it is strongly recommended to execute a VZEROUPPER or VZEROALL instruction after executing AVX instructions but before executing SSE instructions. If this is not done, any subsequent legacy-SSE code may be subject to severe performance degradation.[1]
  2. ^ On some early AVX implementations (e.g. Sandy Bridge[2]) encoding the VZEROUPPER and VZEROALL instructions with VEX.W=1 will result in #UD - for this reason, it is recommended to encode these instructions with VEX.W=0.
  3. ^ a b The 64-bit move instruction forms that are encoded by using a REX.W prefix with the 0F 6E and 0F 7E opcodes are listed with different mnemonics in Intel and AMD documentation — MOVQ in Intel documentation[3] and MOVD in AMD documentation.[4]
    This is a documentation difference only — the operation performed by these opcodes is the same for Intel and AMD.
    This documentation difference applies only to the MMX/SSE forms of these opcodes — for VEX/EVEX-encoded forms, both Intel and AMD use the mnemonic VMOVQ.
  4. ^ a b The REX.W-encoded variants of MOVQ are available in 64-bit "long mode" only. For SSE2 and later, MOVQ to and from xmm/ymm/zmm registers can also be encoded with F3 0F 7E /r and 66 0F D6 /r respectively - these encodings are shorter and available outside 64-bit mode.
  5. ^ On all Intel,[5] AMD[6] and Zhaoxin[7] processors that support AVX, the 128-bit forms of VMOVDQA (encoded with a VEX prefix and VEX.L=0) are, when used with a memory argument addressing WB (write-back cacheable) memory, architecturally guaranteed to perform the 128-bit memory access atomically - this applies to both load and store.

    (Intel and AMD provide somewhat wider guarantees covering more 128-bit instruction variants, but Zhaoxin provides the guarantee for cacheable VMOVDQA only.)

    On processors that support SSE but don't support AVX, the 128-bit forms of SSE load/store instructions such as MOVAPS/MOVAPD/MOVDQA are not guaranteed to execute atomically — examples of processors where such instructions have been observed to execute non-atomically include Intel Core Duo and AMD K10.[8]

  6. ^ a b VMOVDQA is available with a vector length of 256 bits under AVX, not requiring AVX2.
  7. ^ a b c d e f g h i For the VPACK* and VPUNPCK* instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  8. ^ a b c For the memory argument forms of (V)PUNPCKL* instructions, the memory argument is half-width only for the MMX variants of the instructions. For SSE/AVX/AVX-512 variants, the width of the memory argument is the full vector width even though only half of it is actually used.
  9. ^ a b c d e f The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
  10. ^ The (V)PMADDWD instruction will add multiplication results pairwise, but will not add the sum to an accumulator. AVX512_VNNI provides the instructions VPDPWSSD and VPDPWSSDS, which will add multiplication results pairwise, and then also add them to a per-32-bit-lane accumulator.
  11. ^ a b c d e f g h For the MMX packed shift instructions PSLL* and PSR* with a shift-argument taken from a vector source (mm or m64), the shift-amount is considered to be a single 64-bit scalar value - the same shift-amount is used for all lanes of the destination vector. This shift-amount is unsigned and is not masked - all bits are considered (e.g. a shift-amount of 0x80000000_00000000 can be specified and will have the same effect as a shift-amount of 64).

    For all SSE2/AVX/AVX512 extended variants of these instructions, the shift-amount vector argument is considered to be a 128-bit (xmm or m128) argument - the bottom 64 bits are used as the shift-amount.

    Packed shift-instructions that can take a variable per-lane shift-amount were introduced in AVX2 for 32/64-bit lanes and AVX512BW for 16-bit lanes (VPSLLV*, VPSRLV*, VPSRAV* instructions).
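
As a hedged illustration of the shift-amount semantics described in the last note above, the sketch below contrasts the three shift-count forms using compiler intrinsics (helper names are illustrative only):

#include <immintrin.h>

// (V)PSLLW xmm, imm8: an immediate count, the same for every 16-bit lane.
__m128i shl16_imm(__m128i v) { return _mm_slli_epi16(v, 3); }

// (V)PSLLW xmm, xmm/m128: the low 64 bits of 'count' hold one shift amount
// that is applied to every 16-bit lane; a count of 16 or more clears all lanes.
__m128i shl16_common(__m128i v, __m128i count) { return _mm_sll_epi16(v, count); }

// VPSLLVD (AVX2): a genuinely per-lane shift, one count per 32-bit lane.
__m256i shl32_per_lane(__m256i v, __m256i counts) { return _mm256_sllv_epi32(v, counts); }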

MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof

Description Instruction mnemonics Basic opcode MMX
(no prefix)
SSE2
(66h prefix)
AVX
(VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported subset lane bcst
Added with SSE and MMX+
Perform shuffle of four 16-bit integers in 64-bit vector (MMX)[a] PSHUFW mm,mm/m64,imm8(MMX) 0F 70 /r ib PSHUFW PSHUFD VPSHUFD VPSHUFD
(W=0)
F 32 32
Perform shuffle of four 32-bit integers in 128-bit vector (SSE2) (V)PSHUFD xmm,xmm/m128,imm8[b]
Insert integer into 16-bit vector register lane (V)PINSRW mm,r32/m16,imm8 0F C4 /r ib Yes Yes Yes (L=0,W=0[c]) Yes (L=0) BW No No
Extract integer from 16-bit vector register lane, with zero-extension (V)PEXTRW r32,mm,imm8[d] 0F C5 /r ib Yes Yes Yes (L=0,W=0[c]) Yes (L=0) BW No No
Create a bitmask made from the top bit of each byte in the source vector, and store to integer register (V)PMOVMSKB r32,mm 0F D7 /r Yes Yes Yes No[e]
Minimum-value of packed unsigned 8-bit integers (V)PMINUB mm,mm/m64 0F DA /r Yes Yes Yes Yes BW 8 No
Maximum-value of packed unsigned 8-bit integers (V)PMAXUB mm,mm/m64 0F DE /r Yes Yes Yes Yes BW 8 No
Minimum-value of packed signed 16-bit integers (V)PMINSW mm,mm/m64 0F EA /r Yes Yes Yes Yes BW 16 No
Maximum-value of packed signed 16-bit integers (V)PMAXSW mm,mm/m64 0F EE /r Yes Yes Yes Yes BW 16 No
Rounded average of packed unsigned integers. The per-lane operation is:
dst ← (src1 + src2 + 1)>>1
8-bit (V)PAVGB mm,mm/m64 0F E0 /r Yes Yes Yes Yes BW 8 No
16-bit (V)PAVGW mm,mm/m64 0F E3 /r Yes Yes Yes Yes BW 16 No
Multiply packed 16-bit unsigned integers, store high 16 bits of results (V)PMULHUW mm,mm/m64 0F E4 /r Yes Yes Yes Yes BW 16 No
Store vector register to memory using Non-Temporal Hint.

Memory operand required to be aligned for all (V)MOVNTDQ variants, but not for MOVNTQ.

MOVNTQ m64,mm(MMX)
(V)MOVNTDQ m128,xmm
0F E7 /r MOVNTQ MOVNTDQ VMOVNTDQ[f] VMOVNTDQ
(W=0)
F No No
Compute sum of absolute differences for eight 8-bit unsigned integers, storing the result as a 64-bit integer.

For vector widths wider than 64 bits (SSE/AVX/AVX-512), this calculation is done separately for each 64-bit lane of the vectors, producing a vector of 64-bit integers.

(V)PSADBW mm,mm/m64 0F F6 /r Yes Yes Yes Yes BW No No
Unaligned store vector register to memory using byte write-mask, with Non-Temporal Hint.

First argument provides data to store, second argument provides byte write-mask (top bit of each byte).[g] Address to store to is given by DS:DI/EDI/RDI (DS: segment overridable with segment-prefix).

MASKMOVQ mm,mm(MMX)
(V)MASKMOVDQU xmm,xmm
0F F7 /r MASKMOVQ MASKMOVDQU VMASKMOVDQU
(L=0)[h]
No[i]
Added with SSE2
Multiply packed 32-bit unsigned integers, store full 64-bit result.

The input integers are taken from the low 32 bits of each 64-bit vector lane.

(V)PMULUDQ mm,mm/m64 0F F4 /r Yes Yes Yes Yes (W=1) F 64 64
Add packed 64-bit integers (V)PADDQ mm, mm/m64 0F D4 /r Yes Yes Yes Yes (W=1) F 64 64
Subtract packed 64-bit integers (V)PSUBQ mm,mm/m64 0F FB /r Yes Yes Yes Yes (W=1) F 64 64
Added with SSSE3
Vector Byte Shuffle (V)PSHUFB mm,mm/m64[b] 0F38 00 /r Yes Yes[j] Yes Yes BW 8 No
Pairwise horizontal add of packed integers 16-bit (V)PHADDW mm,mm/m64[b] 0F38 01 /r Yes Yes Yes No
32-bit (V)PHADDD mm,mm/m64[b] 0F38 02 /r Yes Yes Yes No
Pairwise horizontal add of packed 16-bit signed integers, with saturation (V)PHADDSW mm,mm/m64[b] 0F38 03 /r Yes Yes Yes No
Multiply packed 8-bit signed and unsigned integers, add results pairwise into 16-bit signed integers with saturation. First operand is treated as unsigned, second operand as signed. (V)PMADDUBSW mm,mm/m64 0F38 04 /r Yes Yes Yes Yes BW 16 No
Pairwise horizontal subtract of packed integers.

The higher-order integer of each pair is subtracted from the lower-order integer.

16-bit (V)PHSUBW mm,mm/m64[b] 0F38 05 /r Yes Yes Yes No
32-bit (V)PHSUBD mm,mm/m64[b] 0F38 06 /r Yes Yes Yes No
Pairwise horizontal subtract of packed 16-bit signed integers, with saturation (V)PHSUBSW mm,mm/m64[b] 0F38 07 /r Yes Yes Yes No
Modify packed integers in first source argument based on the sign of packed signed integers in second source argument. The per-lane operation performed is:
if( src2 < 0 ) dst ← -src1
else if( src2 == 0 ) dst ← 0
else dst ← src1
8-bit (V)PSIGNB mm,mm/m64 0F38 08 /r Yes Yes Yes No
16-bit (V)PSIGNW mm,mm/m64 0F38 09 /r Yes Yes Yes No
32-bit (V)PSIGND mm,mm/m64 0F38 0A /r Yes Yes Yes No
Multiply packed 16-bit signed integers, then perform rounding and scaling to produce a 16-bit signed integer result.

The calculation performed per 16-bit lane is:
dst ← (src1*src2 + (1<<14)) >> 15

(V)PMULHRSW mm,mm/m64 0F38 0B /r Yes Yes Yes Yes BW 16 No
Absolute value of packed signed integers 8-bit (V)PABSB mm,mm/m64 0F38 1C /r Yes Yes Yes Yes BW 8 No
16-bit (V)PABSW mm,mm/m64 0F38 1D /r Yes Yes Yes Yes BW 16 No
32-bit (V)PABSD mm,mm/m64 0F38 1E /r PABSD PABSD VPABSD VPABSD(W0) F 32 32
64-bit VPABSQ xmm,xmm/m128(AVX-512) VPABSQ(W1) F 64 64
Packed Align Right.

Concatenate two input vectors into a double-size vector, then right-shift by the number of bytes specified by the imm8 argument. The shift-amount is not masked - if the shift-amount is greater than the input vector size, zeroes will be shifted in.

(V)PALIGNR mm,mm/m64,imm8[b] 0F3A 0F /r ib Yes Yes Yes Yes[k] BW 8 No
  1. ^ For shuffle of four 16-bit integers in a 64-bit section of a 128-bit XMM register, the SSE2 instructions PSHUFLW (opcode F2 0F 70 /r) or PSHUFHW (opcode F3 0F 70 /r) may be used.
  2. ^ a b c d e f g h i For the VPSHUFD, VPSHUFB, VPHADD*, VPHSUB* and VPALIGNR instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  3. ^ a b For the VEX-encoded forms of the VPINSRW and VPEXTRW instruction, the Intel SDM (as of rev 084) indicates that the instructions must be encoded with VEX.W=0, however neither Intel XED nor AMD APM indicate any such requirement.
  4. ^ The 0F C5 /r ib variant of PEXTRW allows register destination only. For SSE4.1 and later, a variant that allows a memory destination is available with the opcode 66 0F 3A 15 /r ib.
  5. ^ EVEX-prefixed opcode not available. Under AVX-512, a bitmask made from the top bit of each byte can instead be constructed with the VPMOVB2M instruction, with opcode EVEX.F3.0F38.W0 29 /r, which will store such a bitmask to an opmask register.
  6. ^ VMOVNTDQ is available with a vector length of 256 bits under AVX, not requiring AVX2.
  7. ^ For the MASKMOVQ and (V)MASKMOVDQU instructions, exception and trap behavior for disabled lanes is implementation-dependent. For example, a given implementation may signal a data breakpoint or a page fault for bytes that are zero-masked and not actually written.
  8. ^ For AVX, masked stores to memory are also available using the VMASKMOVPS instruction with opcode VEX.66.0F38 2E /r - unlike VMASKMOVDQU, this instruction allows 256-bit stores without temporal hints, although its mask is coarser - 4 bytes vs 1 byte per lane.
  9. ^ Opcode not available under AVX-512. Under AVX-512, unaligned masked stores to memory (albeit without temporal hints) can be done with the VMOVDQU(8|16|32|64) instructions with opcode EVEX.F2/F3.0F 7F /r, using an opmask register to provide a write mask.
  10. ^ For AVX2 and AVX-512 with vectors wider than 128 bits, the VPSHUFB instruction is restricted to byte-shuffle within each 128-bit lane. Instructions that can do shuffles across 128-bit lanes include e.g. AVX2's VPERMD (shuffle of 32-bit lanes across 256-bit YMM register) and AVX512_VBMI's VPERMB (full byte shuffle across 64-byte ZMM register).
  11. ^ For AVX-512, VPALIGNR is supported but will perform its operation within each 128-bit lane. For packed alignment shifts that can shift data across 128-bit lanes, AVX512F's VALIGND instruction may be used, although its shift-amount is specified in units of 32 bits rather than bytes.
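
As a hedged illustration of (V)PALIGNR and of its per-128-bit-lane behavior noted above, the sketch below forms a misaligned 16-byte window from two aligned loads using compiler intrinsics (the helper name is illustrative only):

#include <immintrin.h>

// Returns bytes p[4..19], given a 16-byte-aligned pointer p with at least
// 32 readable bytes. PALIGNR concatenates hi:lo and shifts right by 4 bytes.
// The byte count is an immediate, and the 256-bit VPALIGNR form performs this
// concatenation separately within each 128-bit lane rather than across the
// full register.
__m128i window4(const unsigned char *p)
{
    __m128i lo = _mm_load_si128((const __m128i *)p);        // p[0..15]
    __m128i hi = _mm_load_si128((const __m128i *)(p + 16)); // p[16..31]
    return _mm_alignr_epi8(hi, lo, 4);
}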

SSE instructions and extended variants thereof


Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof


For the instructions in the below table, the following considerations apply unless otherwise noted:

  • Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
  • FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
  • The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
  • For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)

From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical; however, some processors with SSE2 implement the integer, FP32 and FP64 execution units as three different execution clusters, where forwarding of results from one cluster to another may come with performance penalties that can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instructions under SSE2 - PXOR, XORPS, and XORPD - intended for use on integer, FP32, and FP64 data, respectively.)
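
As a hedged illustration, the sketch below shows the three functionally identical 128-bit XOR operations as exposed by compiler intrinsics; choosing the form that matches the data type keeps the operation in the matching execution domain (helper names are illustrative only):

#include <immintrin.h>

__m128i xor_int(__m128i a, __m128i b) { return _mm_xor_si128(a, b); } // PXOR  - integer domain
__m128  xor_f32(__m128 a, __m128 b)   { return _mm_xor_ps(a, b); }    // XORPS - FP32 domain
__m128d xor_f64(__m128d a, __m128d b) { return _mm_xor_pd(a, b); }    // XORPD - FP64 domain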

Instruction Description Basic opcode Single Precision (FP32) Double Precision (FP64)
AVX-512: RC/SAE
Packed (no prefix) Scalar (F3h prefix) Packed (66h prefix) Scalar (F2h prefix)
SSE instruction AVX
(VEX)
AVX-512
(EVEX)
SSE instruction AVX
(VEX)[a]
AVX-512
(EVEX)
SSE2 instruction AVX
(VEX)
AVX-512
(EVEX)
SSE2 instruction AVX
(VEX)[a]
AVX-512
(EVEX)
Unaligned load from memory or vector register 0F 10 /r MOVUPS x,x/m128 Yes Yes[b] MOVSS x,x/m32 Yes Yes MOVUPD x,x/m128 Yes Yes[b] MOVSD x,x/m64[c] Yes Yes No
Unaligned store to memory or vector register 0F 11 /r MOVUPS x/m128,x Yes Yes[b] MOVSS x/m32,x Yes Yes MOVUPD x/m128,x Yes Yes[b] MOVSD x/m64,x[c] Yes Yes No
Load 64 bits from memory or upper half of XMM register into the lower half of XMM register while keeping the upper half unchanged 0F 12 /r MOVHLPS x,x (L0)[d] (L0)[d] (MOVSLDUP)[e] MOVLPD x,m64 (L0)[d] (L0)[d] (MOVDDUP)[e] No
MOVLPS x,m64 (L0)[d] (L0)[d]
Store 64 bits to memory from lower half of XMM register 0F 13 /r MOVLPS m64,x (L0)[d] (L0)[d] No No No MOVLPD m64,x (L0)[d] (L0)[d] No No No No
Unpack and interleave low-order floating-point values 0F 14 /r UNPCKLPS x,x/m128 Yes[f] Yes[f] No No No UNPCKLPD x,x/m128 Yes[f] Yes[f] No No No No
Unpack and interleave high-order floating-point values 0F 15 /r UNPCKHPS x,x/m128 Yes[f] Yes[f] No No No UNPCKHPD x,x/m128 Yes[f] Yes[f] No No No No
Load 64 bits from memory or lower half of XMM register into the upper half of XMM register while keeping the lower half unchanged 0F 16 /r MOVLHPS x,x (L0)[d] (L0)[d] (MOVSHDUP)[e] MOVHPD x,m64 (L0)[d] (L0)[d] No No No No
MOVHPS x,m64 (L0)[d] (L0)[d]
Store 64 bits to memory from upper half of XMM register 0F 17 /r MOVHPS m64,x (L0)[d] (L0)[d] No No No MOVHPD m64,x (L0)[d] (L0)[d] No No No No
Aligned load from memory or vector register 0F 28 /r MOVAPS x,x/m128 Yes Yes[b] No No No MOVAPD x,x/m128 Yes Yes[b] No No No No
Aligned store to memory or vector register 0F 29 /r MOVAPS x/m128,x Yes Yes[b] No No No MOVAPD x/m128,x Yes Yes[b] No No No No
Integer to floating-point conversion using general-registers, MMX-registers or memory as source 0F 2A /r CVTPI2PS x,mm/m64[g] No No CVTSI2SS x,r/m32
CVTSI2SS x,r/m64
[h]
Yes Yes[i] CVTPI2PD x,mm/m64[g] No No CVTSI2SD x,r/m32
CVTSI2SD x,r/m64
[h]
Yes Yes[i] RC
Non-temporal store to memory from vector register.

The packed variants require aligned memory addresses even in VEX/EVEX-encoded forms.

0F 2B /r MOVNTPS m128,x Yes Yes[i] MOVNTSS m32,x
(AMD SSE4a)
No No MOVNTPD m128,x Yes Yes[i] MOVNTSD m64,x
(AMD SSE4a)
No No No
Floating-point to integer conversion with truncation, using general-purpose registers or MMX-registers as destination 0F 2C /r CVTTPS2PI mm,x/m64[j] No No CVTTSS2SI r32,x/m32
CVTTSS2SI r64,x/m32[k]
Yes Yes[i] CVTTPD2PI mm,x/m64[j] No No CVTTSD2SI r32,x/m64
CVTTSD2SI r64,x/m64[k]
Yes Yes[i] SAE
Floating-point to integer conversion, using general-purpose registers or MMX-registers as destination 0F 2D /r CVTPS2PI mm,x/m64[j] No No CVTSS2SI r32,x/m32
CVTSS2SI r64,x/m32[k]
Yes Yes[i] CVTPD2PI mm,x/m64[j] No No CVTSD2SI r32,x/m64
CVTSD2SI r64,x/m64[k]
Yes Yes[i] RC
Unordered compare floating-point values and set EFLAGS.

Compares the bottom lanes of xmm vector registers.

0F 2E /r UCOMISS x,x/m32 Yes[a] Yes[i] No No No UCOMISD x,x/m64 Yes[a] Yes[i] No No No SAE
Compare floating-point values and set EFLAGS.

Compares the bottom lanes of xmm vector registers.

0F 2F /r COMISS x,x/m32 Yes[a] Yes[i] No No No COMISD x,x/m64 Yes[a] Yes[i] No No No SAE
Extract packed floating-point sign mask 0F 50 /r MOVMSKPS r32,x Yes No[l] No No No MOVMSKPD r32,x Yes No[l] No No No
Floating-point Square Root 0F 51 /r SQRTPS x,x/m128 Yes Yes SQRTSS x,x/m32 Yes Yes SQRTPD x,x/m128 Yes Yes SQRTSD x,x/m64 Yes Yes RC
Reciprocal Square Root Approximation[m] 0F 52 /r RSQRTPS x,x/m128 Yes No[n] RSQRTSS x,x/m32 Yes No[n] No No No[n] No No No[n]
Reciprocal Approximation[m] 0F 53 /r RCPPS x,x/m128 Yes No[o] RCPSS x,x/m32 Yes No[o] No No No[o] No No No[o]
Vector bitwise AND 0F 54 /r ANDPS x,x/m128 Yes (DQ)[p] No No No ANDPD x,x/m128 Yes (DQ)[p] No No No No
Vector bitwise AND-NOT 0F 55 /r ANDNPS x,x/m128 Yes (DQ)[p] No No No ANDNPD x,x/m128 Yes (DQ)[p] No No No No
Vector bitwise OR 0F 56 /r ORPS x,x/m128 Yes (DQ)[p] No No No ORPD x,x/m128 Yes (DQ)[p] No No No No
Vector bitwise XOR[q] 0F 57 /r XORPS x,x/m128 Yes (DQ)[p] No No No XORPD x,x/m128 Yes (DQ)[p] No No No No
Floating-point Add 0F 58 /r ADDPS x,x/m128 Yes Yes ADDSS x,x/m32 Yes Yes ADDPD x,x/m128 Yes Yes ADDSD x,x/m64 Yes Yes RC
Floating-point Multiply 0F 59 /r MULPS x,x/m128 Yes Yes MULSS x,x/m32 Yes Yes MULPD x,x/m128 Yes Yes MULSD x,x/m64 Yes Yes RC
Convert between floating-point formats
(FP32→FP64, FP64→FP32)
0F 5A /r CVTPS2PD x,x/m64
(SSE2)
Yes Yes[r] CVTSS2SD x,x/m32
(SSE2)
Yes Yes[r] CVTPD2PS x,x/m128 Yes Yes[r] CVTSD2SS x,x/m64 Yes Yes[r] SAE,
RC[s]
Floating-point Subtract 0F 5C /r SUBPS x,x/m128 Yes Yes SUBSS x,x/m32 Yes Yes SUBPD x,x/m128 Yes Yes SUBSD x,x/m64 Yes Yes RC
Floating-point Minimum Value[t] 0F 5D /r MINPS x,x/m128 Yes Yes MINSS x,x/m32 Yes Yes MINPD x,x/m128 Yes Yes MINSD x,x/m64 Yes Yes SAE
Floating-point Divide 0F 5E /r DIVPS x,x/m128 Yes Yes DIVSS x,x/m32 Yes Yes DIVPD x,x/m128 Yes Yes DIVSD x,x/m64 Yes Yes RC
Floating-point Maximum Value[t] 0F 5F /r MAXPS x,x/m128 Yes Yes MAXSS x,x/m32 Yes Yes MAXPD x,x/m128 Yes Yes MAXSD x,x/m64 Yes Yes SAE
Floating-point compare. Result is written as all-0s/all-1s values (all-1s for comparison true) to vector registers for SSE/AVX, but opmask register for AVX-512. Comparison function is specified by imm8 argument.[u] 0F C2 /r ib CMPPS x,x/m128,imm8 Yes Yes CMPSS x,x/m32,imm8 Yes Yes CMPPD x,x/m128,imm8 Yes Yes CMPSD x,x/m64,imm8
[c]
Yes Yes SAE
Packed Interleaved Shuffle.

Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument.

0F C6 /r ib SHUFPS x,x/m128,imm8[f] Yes Yes No No No SHUFPD x,x/m128,imm8[f] Yes Yes No No No No
  1. ^ a b c d e f The VEX-prefix-encoded variants of the scalar instructions listed in this table should be encoded with VEX.L=0. Setting VEX.L=1 for any of these instructions is allowed but will result in what the Intel SDM describes as "unpredictable behavior across different processor generations". This also applies to VEX-encoded variants of V(U)COMISS and V(U)COMISD. (This behavior does not apply to scalar instructions outside this table, such as e.g. VMOVD/VMOVQ, where VEX.L=1 results in an #UD exception.)
  2. ^ a b c d e f g h EVEX-encoded variants of VMOVAPS, VMOVUPS, VMOVAPD and VMOVUPD support opmasks but do not support broadcast.
  3. ^ a b c The SSE2 MOVSD (MOVe Scalar Double-precision) and CMPSD (CoMPare Scalar Double-precision) instructions have the same names as the older i386 MOVSD (MOVe String Doubleword) and CMPSD (CoMPare String Doubleword) instructions, however their operations are completely unrelated.

    At the assembly language level, they can be distinguished by their use of XMM register operands.

  4. ^ a b c d e f g h i j k l m n o p q r s t For variants of VMOVLPS, VMOVHPS, VMOVLPD, VMOVHPD, VMOVLHPS, VMOVHLPS encoded with VEX or EVEX prefixes, the only supported vector length is 128 bits (VEX.L=0 or EVEX.L=0).

    For the EVEX-encoded variants, broadcasts and opmasks are not supported.

  5. ^ a b c The MOVSLDUP, MOVSHDUP and MOVDDUP instructions are not regularly-encoded scalar SSE1/2 instructions, but instead irregularly-assigned SSE3 vector instructions. For a description of these instructions, see table below.
  6. ^ a b c d e f g h i j For the VUNPCK*, VSHUFPS and VSHUFPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction (except that for VSHUFPD, each 128-bit lane will use a different 2-bit part of the instruction's imm8 argument).
  7. ^ a b The CVTPI2PS and CVTPI2PD instructions take their input data as a vector of two 32-bit signed integers from either memory or MMX register. They will cause an x87→MMX transition even if the source operand is a memory operand.

    For vector int→FP conversions that can accept an xmm/ymm/zmm register or vectors wider than 64 bits as input arguments, SSE2 provides the following irregularly-assigned instructions (see table below):

    • CVTDQ2PS (0F 5B /r)
    • CVTDQ2PD (F3 0F E6 /r)
    These exist in AVX/AVX-512 extended forms as well.
  8. ^ a b For the (V)CVTSI2SS and (V)CVTSI2SD instructions, variants with a 64-bit source argument are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.

    In 32-bit mode, their source argument is always 32-bit even if VEX.W or EVEX.W is set to 1.

  9. ^ a b c d e f g h i j k l EVEX-encoded variants of
    • VMOVNTPS, VMOVNTSS
    • VCOMISS, VCOMISD, VUCOMISS, VUCOMISD
    • VCVTSI2SS, VCVTSI2SD
    • VCVT(T)SS2SI, VCVT(T)SD2SI
    support neither opmasks nor broadcast.
  10. ^ a b c d The CVT(T)PS2PI and CVT(T)PD2PI instructions write their result to MMX register as a vector of two 32-bit signed integers.

    For vector FP→int conversions that can write results to xmm/ymm/zmm registers, SSE2 provides the following irregularly-assigned instructions (see table below):

    • CVTPS2DQ (66 0F 5B /r)
    • CVTTPS2DQ (F3 0F 5B /r)
    • CVTPD2DQ (F2 0F E6 /r)
    • CVTTPD2DQ (66 0F E6 /r)
    These exist in AVX/AVX-512 extended forms as well.
  11. ^ a b c d For the (V)CVT(T)SS2SI and (V)CVT(T)SD2SI instructions, variants with a 64-bit destination register are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.

    In 32-bit mode, their destination register is always 32-bit even if VEX.W or EVEX.W is set to 1.

  12. ^ a b This instruction cannot be EVEX-encoded. Under AVX512DQ, extracting packed floating-point sign-bits can instead be done with the VPMOVD2M and VPMOVQ2M instructions.
  13. ^ a b The (V)RCPSS, (V)RCPPS, (V)RSQRTSS and (V)RSQRTPS approximation instructions compute their result with a relative error of at most 1.5×2^−12. The exact calculation is implementation-specific and known to vary between different x86 CPUs.
  14. ^ a b c d This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4E/4F /r - for its new VRSQRT14* reciprocal square root approximation instructions.

    The main difference between the AVX-512 VRSQRT14* instructions and the older SSE/AVX (V)RSQRT* instructions is that the AVX-512 VRSQRT14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[9]

  15. ^ a b c d This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4C/4D /r - for its new VRCP14* reciprocal approximation instructions.

    The main difference between the AVX-512 VRCP14* instructions and the older SSE/AVX (V)RCP* instructions is that the AVX-512 VRCP14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[9]

  16. ^ a b c d e f g h The EVEX-encoded versions of the VANDPS, VANDPD, VANDNPS, VANDNPD, VORPS, VORPD, VXORPS, VXORPD instructions are not introduced as part of the AVX512F subset, but instead the AVX512DQ subset.
  17. ^ XORPS/VXORPS with both source operands being the same register is commonly used as a register-zeroing idiom, and is recognized by most x86 CPUs as an instruction that does not depend on its source arguments.
    Under AVX or AVX-512, it is recommended to use a 128-bit form of VXORPS for this purpose - this will, on some CPUs, result in fewer micro-ops than wider forms while still achieving register-zeroing of the whole 256 or 512 bit vector-register.
  18. ^ a b c d For EVEX-encoded variants of conversions between FP formats of different widths, the opmask lane width is determined by the result format: 64-bit for VCVTPS2PD and VCVTSS2SD and 32-bit for VCVTPD2PS and VCVTSD2SS.
  19. ^ Widening FP→FP conversions (CVTPS2PD, CVTSS2SD, VCVTPH2PD, VCVTSH2SD) support the SAE modifier. Narrowing conversions (CVTPD2PS, CVTSD2SS) support the RC modifier.
  20. ^ a b For the floating-point minimum-value and maximum-value instructions (V)MIN* and (V)MAX*, if the two input operands are both zero or at least one of the input operands is NaN, then the second input operand is returned. This matches the behavior of common C programming-language expressions such as ((op1)>(op2)?(op1):(op2)) for maximum-value and ((op1)<(op2)?(op1):(op2)) for minimum-value.
  21. ^ For the SIMD floating-point compares, the imm8 argument has the following format:
    Bits Usage
    1:0 Basic comparison predicate
    2 Invert comparison result
    3 Invert comparison result if unordered (VEX/EVEX only)
    4 Invert signalling behavior (VEX/EVEX only)

    The basic comparison predicates are:

    Value Meaning
    00b Equal (non-signalling)
    01b Less-than (signalling)
    10b Less-than-or-equal (signalling)
    11b Unordered (non-signalling)

    A signalling compare will raise an invalid-operation exception if any input is a NaN (QNaN or SNaN), whereas a non-signalling compare raises it only for SNaN inputs.
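
As a hedged worked example of the imm8 encoding above: predicate value 0x05 has bits[1:0] = 01 (less-than, signalling) and bit 2 = 1 (invert result), giving a "not less than" compare; Intel's intrinsics headers name this value _CMP_NLT_US. The sketch below uses compiler intrinsics (helper names are illustrative only):

#include <immintrin.h>

// VEX/EVEX-encoded compare with a 5-bit predicate (requires AVX):
__m128 not_less_than(__m128 a, __m128 b)
{
    return _mm_cmp_ps(a, b, _CMP_NLT_US);   // VCMPPS xmm, xmm, xmm, 0x05
}

// Legacy SSE encoding, limited to the 3-bit predicates (here 0x01, less-than):
__m128 less_than(__m128 a, __m128 b)
{
    return _mm_cmplt_ps(a, b);              // CMPPS xmm, xmm, 0x01
}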

Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof


These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:

  • The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 encoding) - under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
  • The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.
Description Instruction mnemonics Basic opcode SSE (66h prefix) AVX
(VEX.66 prefix)
AVX-512 (EVEX.66 prefix)
supported subset lane bcst
Added with SSE2
Unpack and interleave low-order 64-bit integers (V)PUNPCKLQDQ xmm,xmm/m128[a] 0F 6C /r Yes Yes Yes (W=1) F 64 64
Unpack and interleave high-order 64-bit integers (V)PUNPCKHQDQ xmm,xmm/m128[a] 0F 6D /r Yes Yes Yes (W=1) F 64 64
Right-shift 128-bit unsigned integer by specified number of bytes (V)PSRLDQ xmm,imm8[a] 0F 73 /3 ib Yes Yes Yes BW No No
Left-shift 128-bit integer by specified number of bytes (V)PSLLDQ xmm,imm8[a] 0F 73 /7 ib Yes Yes Yes BW No No
Move 64-bit scalar value from xmm register to xmm register or memory (V)MOVQ xmm/m64,xmm 0F D6 /r Yes Yes (L=0) Yes
(L=0,W=1)
F No No
Added with SSE4.1
Variable blend packed bytes.

For each byte lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding byte lane of XMM0.

PBLENDVB xmm,xmm/m128
PBLENDVB xmm,xmm/m128,XMM0[b]
0F38 10 /r Yes No[c] No[d]
Sign-extend packed integers into wider packed integers 8-bit → 16-bit (V)PMOVSXBW xmm,xmm/m64 0F38 20 /r Yes Yes Yes BW 16 No
8-bit → 32-bit (V)PMOVSXBD xmm,xmm/m32 0F38 21 /r Yes Yes Yes F 32 No
8-bit → 64-bit (V)PMOVSXBQ xmm,xmm/m16 0F38 22 /r Yes Yes Yes F 64 No
16-bit → 32-bit (V)PMOVSXWD xmm,xmm/m64 0F38 23 /r Yes Yes Yes F 32 No
16-bit → 64-bit (V)PMOVSXWQ xmm,xmm/m32 0F38 24 /r Yes Yes Yes F 64 No
32-bit → 64-bit (V)PMOVSXDQ xmm,xmm/m64 0F38 25 /r Yes Yes Yes (W=0) F 64 No
Multiply packed 32-bit signed integers, store full 64-bit result.

The input integers are taken from the low 32 bits of each 64-bit vector lane.

(V)PMULDQ xmm,xmm/m128 0F38 28 /r Yes Yes Yes (W=1) F 64 64
Compare packed 64-bit integers for equality (V)PCMPEQQ xmm,xmm/m128 0F38 29 /r Yes Yes Yes (W=1)[e] F 64 64
Aligned non-temporal vector load from memory.[f] (V)MOVNTDQA xmm,m128 0F38 2A /r Yes Yes Yes (W=0) F No No
Pack 32-bit unsigned integers to 16-bit, with saturation (V)PACKUSDW xmm, xmm/m128[a] 0F38 2B /r Yes Yes Yes (W=0) BW 16 32
Zero-extend packed integers into wider packed integers 8-bit → 16-bit (V)PMOVZXBW xmm,xmm/m64 0F38 30 /r Yes Yes Yes BW 16 No
8-bit → 32-bit (V)PMOVZXBD xmm,xmm/m32 0F38 31 /r Yes Yes Yes F 32 No
8-bit → 64-bit (V)PMOVZXBQ xmm,xmm/m16 0F38 32 /r Yes Yes Yes F 64 No
16-bit → 32-bit (V)PMOVZXWD xmm,xmm/m64 0F38 33 /r Yes Yes Yes F 32 No
16-bit → 64-bit (V)PMOVZXWQ xmm,xmm/m32 0F38 34 /r Yes Yes Yes F 64 No
32-bit → 64-bit (V)PMOVZXDQ xmm,xmm/m64 0F38 35 /r Yes Yes Yes (W=0) F 64 No
Packed minimum-value of signed integers 8-bit (V)PMINSB xmm,xmm/m128 0F38 38 /r Yes Yes Yes BW 8 No
32-bit (V)PMINSD xmm,xmm/m128 0F38 39 /r PMINSD VPMINSD VPMINSD(W0) F 32 32
64-bit VPMINSQ xmm,xmm/m128(AVX-512) VPMINSQ(W1) F 64 64
Packed minimum-value of unsigned integers 16-bit (V)PMINUW xmm,xmm/m128 0F38 3A /r Yes Yes Yes BW 16 No
32-bit (V)PMINUD xmm,xmm/m128
0F38 3B /r PMINUD VPMINUD VPMINUD(W0) F 32 32
64-bit VPMINUQ xmm,xmm/m128(AVX-512) VPMINUQ(W1) F 64 64
Packed maximum-value of signed integers 8-bit (V)PMAXSB xmm,xmm/m128 0F38 3C /r Yes Yes Yes BW 8 No
32-bit (V)PMAXSD xmm,xmm/m128 0F38 3D /r PMAXSD VPMAXSD VPMAXSD(W0) F 32 32
64-bit VPMAXSQ xmm,xmm/m128(AVX-512) VPMAXSQ(W1) F 64 64
Packed maximum-value of unsigned integers 16-bit (V)PMAXUW xmm,xmm/m128 0F38 3E /r Yes Yes Yes BW 16 No
32-bit (V)PMAXUD xmm,xmm/m128
0F38 3F /r PMAXUD VPMAXUD VPMAXUD(W0) F 32 32
64-bit VPMAXUQ xmm,xmm/m128(AVX-512) VPMAXUQ(W1) F 64 64
Multiply packed 32/64-bit integers, store low half of results (V)PMULLD xmm,xmm/m128
VPMULLQ xmm,xmm/m128(AVX-512)
0F38 40 /r PMULLD VPMULLD VPMULLD(W0) F 32 32
VPMULLQ(W1) DQ 64 64
Packed Horizontal Word Minimum

Find the smallest 16-bit integer in a packed vector of 16-bit unsigned integers, then return the integer and its index in the bottom two 16-bit lanes of the result vector.

(V)PHMINPOSUW xmm,xmm/m128 0F38 41 /r Yes Yes (L=0) No
Blend Packed Words.

For each 16-bit lane of the result, pick a 16-bit value from either the first or the second source argument depending on the corresponding bit of the imm8.

(V)PBLENDW xmm,xmm/m128,imm8[a] 0F3A 0E /r ib Yes Yes[g] No[h]
Extract integer from indexed lane of vector register, and store to GPR or memory.

Zero-extended if stored to GPR.

8-bit (V)PEXTRB r32/m8,xmm,imm8[i] 0F3A 14 /r ib Yes Yes (L=0) Yes (L=0) BW No No
16-bit (V)PEXTRW r32/m16,xmm,imm8[i] 0F3A 15 /r ib Yes Yes (L=0) Yes (L=0) BW No No
32-bit (V)PEXTRD r/m32,xmm,imm8 0F3A 16 /r ib Yes Yes
(L=0,W=0)[j]
Yes
(L=0,W=0)
DQ No No
64-bit
(x86-64)
(V)PEXTRQ r/m64,xmm,imm8 Yes
(REX.W)
Yes
(L=0,W=1)
Yes
(L=0,W=1)
DQ No No
Insert integer from general-purpose register into indexed lane of vector register 8-bit (V)PINSRB xmm,r32/m8,imm8[k] 0F3A 20 /r ib Yes Yes (L=0) Yes (L=0) BW No No
32-bit (V)PINSRD xmm,r32/m32,imm8 0F3A 22 /r ib Yes Yes
(L=0,W=0)[j]
Yes
(L=0,W=0)
DQ No No
64-bit
(x86-64)
(V)PINSRQ xmm,r64/m64,imm8 Yes
(REX.W)
Yes
(L=0,W=1)
Yes
(L=0,W=1)
DQ No No
Compute Multiple Packed Sums of Absolute Difference.

The 128-bit form of this instruction computes 8 sums of absolute differences from sequentially selected groups of four bytes in the first source argument and a selected group of four contiguous bytes in the second source operand, and writes the sums to sequential 16-bit lanes of destination register. If the two source arguments src1 and src2 are considered to be two 16-entry arrays of uint8 values and temp is considered to be an 8-entry array of uint16 values, then the operation of the instruction is:

for i = 0 to 7 do
    temp[i] := 0
    for j = 0 to 3 do
         a := src1[ i+(imm8[2]*4)+j ]
         b := src2[ (imm8[1:0]*4)+j ]
         temp[i] := temp[i] + abs(a-b)
    done
done
dst := temp

For wider forms of this instruction under AVX2 and AVX10.2, the operation is split into 128-bit lanes where each lane internally performs the same operation as the 128-bit variant of the instruction - except that odd-numbered lanes use bits 5:3 rather than bits 2:0 of the imm8.

(V)MPSADBW xmm,xmm/m128,imm8 0F3A 42 /r ib Yes Yes Yes (W=0) 10.2[l] 16 No
Added with SSE4.2
Compare packed 64-bit signed integers for greater-than (V)PCMPGTQ xmm, xmm/m128 0F38 37 /r Yes Yes Yes (W=1)[e] F 64 64
Packed Compare Explicit Length Strings, Return Mask (V)PCMPESTRM xmm,xmm/m128,imm8 0F3A 60 /r ib Yes[m] Yes (L=0) No
Packed Compare Explicit Length Strings, Return Index (V)PCMPESTRI xmm,xmm/m128,imm8 0F3A 61 /r ib Yes[m] Yes (L=0) No
Packed Compare Implicit Length Strings, Return Mask (V)PCMPISTRM xmm,xmm/m128,imm8 0F3A 62 /r ib Yes[m] Yes (L=0) No
Packed Compare Implicit Length Strings, Return Index (V)PCMPISTRI xmm,xmm/m128,imm8 0F3A 63 /r ib Yes[m] Yes (L=0) No
  1. ^ a b c d e f For the (V)PUNPCK*, (V)PACKUSDW, (V)PBLENDW, (V)PSLLDQ and (V)PSRLDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. ^ Assemblers may accept PBLENDVB with or without XMM0 as a third argument.
  3. ^ The PBLENDVB instruction with opcode 66 0F38 10 /r is not VEX-encodable. AVX does provide a VPBLENDVB instruction that is similar to PBLENDVB, however, it uses a different opcode and operand encoding - VEX.66.0F3A.W0 4C /r /is4.
  4. ^ Opcode not EVEX-encodable. Under AVX-512, variable blend of packed bytes may be done with the VPBLENDMB instruction (opcode EVEX.66.0F38.W0 66 /r).
  5. ^ a b The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
  6. ^ The load performed by (V)MOVNTDQA is weakly-ordered. It may be reordered with respect to other loads, stores and even LOCKs - to impose ordering with respect to other loads/stores, MFENCE or serialization is needed.

    If (V)MOVNTDQA is used with uncached memory, it may fetch a cache-line-sized block of data around the data actually requested - subsequent (V)MOVNTDQA instructions may return data from blocks fetched in this manner as long as they are not separated by an MFENCE or serialization.

  7. ^ For AVX, the VBLENDPS instruction (and, under AVX2, VPBLENDD) can be used to perform a blend with 32-bit lanes, allowing one imm8 mask to span a full 256-bit vector without repetition.
  8. ^ Opcode not EVEX-encodable. Under AVX-512, variable blend of packed words may be done with the VPBLENDMW instruction (opcode EVEX.66.0F38.W1 66 /r).
  9. ^ a b For (V)PEXTRB and (V)PEXTRW, if the destination argument is a register, then the extracted 8/16-bit value is zero-extended to 32/64 bits.
  10. ^ a b For the VPEXTRD and VPINSRD instructions in non-64-bit mode, the instructions are documented as being permitted to be encoded with VEX.W=1 on Intel[10] but not AMD[11] CPUs (although exceptions to this do exist, e.g. Bulldozer permits such encodings[12] while Sandy Bridge does not[13])
    In 64-bit mode, these instructions require VEX.W=0 on both Intel and AMD processors — encodings with VEX.W=1 are interpreted as VPEXTRQ/VPINSRQ.
  11. ^ In the case of a register source argument to (V)PINSRB, the argument is considered to be a 32-bit register of which the 8 bottom bits are used, not an 8-bit register proper. This means that it is not possible to specify AH/BH/CH/DH as a source argument to (V)PINSRB.
  12. ^ EVEX-encoded variants of the VMPSADBW instruction are only available if AVX10.2 is supported.
  13. ^ a b c d The SSE4.2 packed string compare PCMP*STR* instructions allow their 16-byte memory operands to be misaligned even when using legacy SSE encoding.
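
As a hedged illustration of the implicit-length string compares, the sketch below uses PCMPISTRI in its "equal any" aggregation to find the first byte of a chunk that belongs to a character set; the helper name and character set are illustrative only, and the chunk pointer must have 16 readable bytes (which, per note m above, need not be aligned):

#include <immintrin.h>

// Returns the index (0..15) of the first vowel in the 16-byte chunk, or 16 if
// none is found. Implicit-length mode stops at the first NUL byte in either
// operand, so bytes after a string terminator are ignored.
int first_vowel_index(const char *chunk)
{
    __m128i set  = _mm_loadu_si128((const __m128i *)"aeiouAEIOU\0\0\0\0\0");
    __m128i data = _mm_loadu_si128((const __m128i *)chunk);
    return _mm_cmpistri(set, data,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);    // PCMPISTRI xmm, xmm, imm8
}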

Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof


SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.

Description Instruction mnemonics Basic opcode   SSE   AVX
(VEX prefix)
AVX-512 (EVEX prefix)
supported subset lane bcst rc/sae
Added with SSE
Load MXCSR (Media eXtension Control and Status Register) from memory (V)LDMXCSR m32 NP 0F AE /2 Yes Yes
(L=0)
No
Store MXCSR to memory (V)STMXCSR m32 NP 0F AE /3 Yes Yes
(L=0)
No
Added with SSE2
Move a 64-bit data item from MMX register to bottom half of XMM register. Top half is zeroed out. MOVQ2DQ xmm,mm F3 0F D6 /r Yes No No
Move a 64-bit data item from bottom half of XMM register to MMX register. MOVDQ2Q mm,xmm F2 0F D6 /r Yes No No
Load a 64-bit integer from memory or XMM register to bottom 64 bits of XMM register, with zero-fill (V)MOVQ xmm,xmm/m64 F3 0F 7E /r Yes Yes (L=0) Yes (L=0,W=1) F No No No
Vector load from unaligned memory or vector register (V)MOVDQU xmm,xmm/m128 F3 0F 6F /r Yes Yes VMOVDQU64(W1) F 64 No No
VMOVDQU32(W0) F 32 No No
F2 0F 6F /r No No VMOVDQU16(W1) BW 16 No No
VMOVDQU8(W0) BW 8 No No
Vector store to unaligned memory or vector register (V)MOVDQU xmm/m128,xmm F3 0F 7F /r Yes Yes VMOVDQU64(W1) F 64 No No
VMOVDQU32(W0) F 32 No No
F2 0F 7F /r No No VMOVDQU16(W1) BW 16 No No
VMOVDQU8(W0) BW 8 No No
Shuffle the four top 16-bit lanes of source vector, then place result in top half of destination vector (V)PSHUFHW xmm,xmm/m128,imm8[a] F3 0F 70 /r ib Yes Yes[b] Yes BW 16 No No
Shuffle the four bottom 16-bit lanes of source vector, then place result in bottom half of destination vector (V)PSHUFLW xmm,xmm/m128,imm8[a] F2 0F 70 /r ib Yes Yes[b] Yes BW 16 No No
Convert packed signed 32-bit integers to FP32 (V)CVTDQ2PS xmm,xmm/m128 NP 0F 5B /r Yes Yes Yes (W=0) F 32 32 RC
Convert packed FP32 values to packed signed 32-bit integers (V)CVTPS2DQ xmm,xmm/m128 66 0F 5B /r Yes Yes Yes (W=0) F 32 32 RC
Convert packed FP32 values to packed signed 32-bit integers, with round-to-zero (V)CVTTPS2DQ xmm,xmm/m128 F3 0F 5B /r Yes Yes Yes (W=0) F 32 32 SAE
Convert packed FP64 values to packed signed 32-bit integers, with round-to-zero (V)CVTTPD2DQ xmm,xmm/m128 66 0F E6 /r Yes Yes Yes (W=1) F 32 64 SAE
Convert packed signed 32-bit integers to FP64 (V)CVTDQ2PD xmm,xmm/m64 F3 0F E6 /r Yes Yes Yes (W=0) F 64 32 RC[c]
Convert packed FP64 values to packed signed 32-bit integers (V)CVTPD2DQ xmm,xmm/m128 F2 0F E6 /r Yes Yes Yes (W=1) F 32 64 RC
Added with SSE3
Duplicate floating-point values from even-numbered lanes to next odd-numbered lanes up 32-bit (V)MOVSLDUP xmm,xmm/m128 F3 0F 12 /r Yes Yes Yes (W=0) F 32 No No
64-bit (V)MOVDDUP xmm,xmm/m64 F2 0F 12 /r Yes Yes Yes (W=1) F 64 No No
Duplicate FP32 values from odd-numbered lanes to next even-numbered lanes down (V)MOVSHDUP xmm,xmm/m128 F3 0F 16 /r Yes Yes Yes (W=0) F 32 No No
Packed pairwise horizontal addition of floating-point values 32-bit (V)HADDPS xmm,xmm/m128[a] F2 0F 7C /r Yes Yes No
64-bit (V)HADDPD xmm,xmm/m128[a] 66 0F 7C /r Yes Yes No
Packed pairwise horizontal subtraction of floating-point values 32-bit (V)HSUBPS xmm,xmm/m128[a] F2 0F 7D /r Yes Yes No
64-bit (V)HSUBPD xmm,xmm/m128[a] 66 0F 7D /r Yes Yes No
Packed floating-point add/subtract in alternating lanes. Even-numbered lanes (counting from 0) do subtract, odd-numbered lanes do add. 32-bit (V)ADDSUBPS xmm,xmm/m128 F2 0F D0 /r Yes Yes No
64-bit (V)ADDSUBPD xmm,xmm/m128 66 0F D0 /r Yes Yes No
Vector load from unaligned memory with looser semantics than (V)MOVDQU.

Unlike (V)MOVDQU, it may fetch data more than once or, for a misaligned access, fetch additional data up until the next 16/32-byte alignment boundaries below/above the actually-requested data.

(V)LDDQU xmm,m128 F2 0F F0 /r Yes Yes No
Added with SSE4.1
Vector logical test.

Sets ZF=1 if bitwise-AND between first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise

(V)PTEST xmm,xmm/m128 66 0F38 17 /r Yes Yes No[d]
Variable blend packed floating-point values.

For each lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding lane of XMM0.

32-bit BLENDVPS xmm,xmm/m128
BLENDVPS xmm,xmm/m128,XMM0[e]
66 0F38 14 /r Yes No[f] No
64-bit BLENDVPD xmm,xmm/m128
BLENDVPD xmm,xmm/m128,XMM0[e]
66 0F38 15 /r Yes No[f] No
Rounding of packed floating-point values to integer.

Rounding mode specified by imm8 argument.

32-bit (V)ROUNDPS xmm,xmm/m128,imm8 66 0F3A 08 /r ib Yes Yes No[g]
64-bit (V)ROUNDPD xmm,xmm/m128,imm8 66 0F3A 09 /r ib Yes Yes No[g]
Rounding of scalar floating-point value to integer. 32-bit (V)ROUNDSS xmm,xmm/m32,imm8 66 0F3A 0A /r ib Yes Yes No[g]
64-bit (V)ROUNDSD xmm,xmm/m64,imm8 66 0F3A 0B /r ib Yes Yes No[g]
Blend packed floating-point values. For each lane of the result, pick the value from either the first or the second argument depending on the corresponding imm8 bit. 32-bit (V)BLENDPS xmm,xmm/m128,imm8 66 0F3A 0C /r ib Yes Yes No
64-bit (V)BLENDPD xmm,xmm/m128,imm8 66 0F3A 0D /r ib Yes Yes No
Extract 32-bit lane of XMM register to general-purpose register or memory location.

Bits[1:0] of imm8 is used to select lane.

(V)EXTRACTPS r/m32,xmm,imm8 66 0F3A 17 /r ib Yes Yes (L=0) Yes (L=0) F No No No
Obtain 32-bit value from source XMM register or memory, and insert into the specified lane of destination XMM register.

If the source argument is an XMM register, then bits[7:6] of the imm8 is used to select which 32-bit lane to select source from, otherwise the specified 32-bit memory value is used. This 32-bit value is then inserted into the destination register lane specified by bits[5:4] of the imm8. After insertion, each 32-bit lane of the destination register may optionally be zeroed out - bits[3:0] of the imm8 provides a bitmap of which lanes to zero out.

(V)INSERTPS xmm,xmm/m32,imm8 66 0F3A 21 /r ib Yes Yes (L=0) Yes (L=0,W=0) F No No No
4-component dot-product of 32-bit floating-point values.

Bits [7:4] of the imm8 specify which lanes should participate in the dot-product, bits[3:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)

(V)DPPS xmm,xmm/m128,imm8[a] 66 0F3A 40 /r ib Yes Yes No
2-component dot-product of 64-bit floating-point values.

Bits [5:4] of the imm8 specify which lanes should participate in the dot-product, bits[1:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)

(V)DPPD xmm,xmm/m128,imm8[a] 66 0F3A 41 /r ib Yes Yes No
Added with SSE4a (AMD only)
64-bit bitfield insert, using the low 64 bits of XMM registers.

First argument is an XMM register to insert bitfield into, second argument is a source register containing the bitfield to insert (starting from bit 0).

For the 4-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bit-offset to insert bitfield at. For the 2-argument version, the length and offset are instead taken from bits [69:64] and [77:72] of the second argument, respectively.

INSERTQ xmm,xmm,imm8,imm8 F2 0F 78 /r ib ib Yes No No[h]
INSERTQ xmm,xmm F2 0F 79 /r Yes No No[h]
64-bit bitfield extract, from the lower 64 bits of an XMM register.

The first argument serves as both source that bitfield is extracted from and destination that bitfield is written to.

For the 3-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bitfield bit-offset. For the 2-argument version, the second argument is an XMM register that contains bitfield length at bits[5:0] and bit-offset at bits[13:8].

EXTRQ xmm,imm8,imm8 66 0F 78 /0 ib ib Yes No No[h]
EXTRQ xmm,xmm 66 0F 79 /r Yes No No[h]
  1. ^ a b c d e f g h For the VPSHUFLW, VPSHUFHW, VHADDP*, VHSUBP*, VDPPS and VDPPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. ^ a b Under AVX, the VPSHUFHW and VPSHUFLW instructions are only available in 128-bit forms - the 256-bit forms of these instructions require AVX2.
  3. ^ For the EVEX-encoded form of VCVTDQ2PD, EVEX embedded rounding controls are permitted but have no effect.
  4. ^ Opcode not EVEX-encodable. Performing a vector logical test under AVX-512 requires a sequence of at least 2 instructions, e.g. VPTESTMD followed by KORTESTW.
  5. ^ a b Assemblers may accept the BLENDVPS/BLENDVPD instructions with or without XMM0 as a third argument.
  6. ^ a b While AVX does provide VBLENDVPS/VBLENDVPD instructions that are similar in function to BLENDVPS/BLENDVPD, they use a different opcode and operand encoding - VEX.66.0F3A.W0 4A/4B /r /is4.
  7. ^ a b c d Opcode not EVEX-encodable as a rounding instruction. Under AVX-512, the EVEX.66.0F3A (08..0B) /r ib encodings are instead assigned to AVX512F's new VRNDSCALE* rounding instructions.
  8. ^ a b c d Under AVX-512, EVEX-encoding the INSERTQ/EXTRQ opcodes results in AVX-512 instructions completely unrelated to SSE4a, namely VCVT(T)P(S|D)2UQQ and VCVT(T)S(S|D)2USI.
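
The ZF/CF semantics of (V)PTEST described in the table above map directly onto the SSE4.1 test intrinsics. A minimal hedged sketch (helper names are illustrative only):

#include <immintrin.h>

// ZF result: 1 if (a AND b) is all zeros, i.e. a and b share no set bits.
int is_disjoint(__m128i a, __m128i b)
{
    return _mm_testz_si128(a, b);          // PTEST a, b -> ZF
}

// CF result: 1 if (mask AND NOT value) is all zeros, i.e. every bit set in
// 'mask' is also set in 'value'.
int contains_all_bits(__m128i value, __m128i mask)
{
    return _mm_testc_si128(value, mask);   // PTEST value, mask -> CF
}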

AVX/AVX2 instructions, and AVX-512 extended variants thereof


This covers instructions/opcodes that are new to AVX and AVX2.

AVX and AVX2 also include extended VEX-encoded forms of a large number of MMX/SSE instructions - please see tables above.

Some of the AVX/AVX2 instructions also exist in extended EVEX-encoded forms under AVX-512 as well.

Description Instruction mnemonics Basic opcode (VEX) AVX AVX-512 (EVEX-encoded)
supported subset lane bcst
Added with AVX
Broadcast floating-point data from memory or bottom of XMM-register to all lanes of XMM/YMM/ZMM-register. 32-bit VBROADCASTSS ymm,xmm/m32[a] VEX.66.0F38.W0 18 /r Yes Yes F 32 (32)[b]
64-bit VBROADCASTSD ymm,xmm/m64[a]
VBROADCASTF32X2 zmm,xmm/m64(AVX-512)
VEX.66.0F38 19 /r VBROADCASTSD
(L=1[c],W=0)
VBROADCASTF32X2(L≠0,W=0) DQ 32 (64)[b]
VBROADCASTSD(L≠0,W=1) F 64 (64)[b]
128-bit VBROADCASTF128 ymm,m128
VBROADCASTF32X4 zmm,m128(AVX-512)
VBROADCASTF64X2 zmm,m128(AVX-512)
VEX.66.0F38 1A /r VBROADCASTF128
(L=1,W=0)
VBROADCASTF32X4(L≠0,W=0) F 32 (128)[b]
VBROADCASTF64X2(L≠0,W=1) DQ 64 (128)[b]
Extract 128-bit vector-lane of floating-point data from wider vector-register VEXTRACTF128 xmm/m128,ymm,imm8
VEXTRACTF32X4 xmm/m128,zmm,imm8(AVX-512)
VEXTRACTF64X2 xmm/m128,zmm,imm8(AVX-512)
VEX.66.0F3A 19 /r ib VEXTRACTF128
(L=1,W=0)
VEXTRACTF32X4(L≠0,W=0) F 32 No
VEXTRACTF64X2(L≠0,W=1) DQ 64 No
Insert 128-bit vector of floating-point data into 128-bit lane of wider vector VINSERTF128 ymm,ymm,xmm/m128,imm8
VINSERTF32X4 zmm,zmm,xmm/m128,imm8(AVX-512)
VINSERTF64X2 zmm,zmm,xmm/m128,imm8(AVX-512)
VEX.66.0F3A 18 /r ib VINSERTF128
(L=1,W=0)
VINSERTF32X4(L≠0,W=0) F 32 No
VINSERTF64X2(L≠0,W=1) DQ 64 No
Concatenate the two source vectors into a vector of four 128-bit components, then use imm8 to index into that vector
  • Bits[1:0] of imm8 pick the element to use for the low 128 bits of the result
  • Bits[3:2] of imm8 pick the element to use for the high 128 bits of the result
VPERM2F128 ymm,ymm,ymm/m256,imm8 VEX.66.0F3A.W0 06 /r /ib (L=1) No
Perform shuffle of 32-bit sub-lanes within each 128-bit lane of vectors.

Variable-shuffle form uses bits[1:0] of each lane for selection.
imm8 form uses same shuffle in every 128-bit lane.

VPERMILPS ymm,ymm,ymm/m256 VEX.66.0F38.W0 0C /r Yes Yes F 32 32
VPERMILPS ymm,ymm/m256,imm8 VEX.66.0F3A.W0 04 /r ib Yes Yes F 32 32
Perform shuffle of 64-bit sub-lanes within each 128-bit lane of vectors.

Variable-shuffle form uses bit[1] of each lane for selection.
imm8 form uses two bits of the imm8 for each of the 128-bit lanes.

VPERMILPD ymm,ymm,ymm/m256 VEX.66.0F38.W0 0D /r Yes Yes F 64 64
VPERMILPD ymm,ymm/m256,imm8 VEX.66.0F3A.W0 05 /r ib Yes Yes F 64 64
Packed memory load/store of floating-point data with per-lane write masking.

First argument is destination, third argument is source. The second argument provides masks, in the top bit of each 32-bit lane.

32-bit VMASKMOVPS ymm,ymm,m256 VEX.66.0F38.W0 2C /r Yes No
VMASKMOVPS m256,ymm,ymm VEX.66.0F38.W0 2E /r Yes No
64-bit VMASKMOVPD ymm,ymm,m256 VEX.66.0F38.W0 2D /r Yes No
VMASKMOVPD m256,ymm,ymm VEX.66.0F38.W0 2F /r Yes No
Vector logical sign-bit test on packed floating-point values.

Sets ZF=1 if bitwise-AND between sign-bits of the first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between sign-bits of second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise.

32-bit VTESTPS ymm,ymm/m256 VEX.66.0F38.W0 0E /r Yes No
64-bit VTESTPD ymm,ymm/m256 VEX.66.0F38.W0 0F /r Yes No
Variable blend packed floating-point values.

For each lane of the result, pick the value from either the second or the third argument depending on the top bit of the corresponding lane of the fourth argument.

32-bit VBLENDVPS ymm,ymm,ymm/m256,ymm VEX.66.0F3A.W0 4A /r /is4 Yes No
64-bit VBLENDVPD ymm,ymm,ymm/m256,ymm VEX.66.0F3A.W0 4B /r /is4 Yes No
Variable blend packed bytes.

For each byte lane of the result, pick the value from either the second or the third argument depending on the top bit of the corresponding byte lane of the fourth argument.

VPBLENDVB xmm,xmm,xmm/m128,xmm[d] VEX.66.0F3A.W0 4C /r /is4 Yes No
Zero out upper bits of YMM/ZMM registers.

Zeroes out all bits except bits 127:0 of ymm0 to ymm15.

VZEROUPPER VEX.0F 77 (L=0) No
Zero out YMM/ZMM registers.

Zeroes out registers ymm0 to ymm15.

VZEROALL (L=1) No
Added with AVX2
Broadcast integer data from memory or bottom lane of XMM-register to all lanes of XMM/YMM/ZMM register 8-bit VPBROADCASTB ymm,xmm/m8 VEX.66.0F38.W0 78 /r Yes Yes[e] BW 8 (8)[b]
16-bit VPBROADCASTW ymm,xmm/m16 VEX.66.0F38.W0 79 /r Yes Yes[e] BW 16 (16)[b]
32-bit VPBROADCASTD ymm,xmm/m32 VEX.66.0F38.W0 58 /r Yes Yes[e] F 32 (32)[b]
64-bit VPBROADCASTQ ymm,xmm/m64
VBROADCASTI32X2 zmm,xmm/m64(AVX-512)
VEX.66.0F38 59 /r VPBROADCASTQ
(L=1,W=0)
VBROADCASTI32X2(W=0) DQ 32 (64)[b]
VPBROADCASTQ(W=1)[e] F 64 (64)[b]
128-bit VBROADCASTI128 ymm,m128
VBROADCASTI32X4 zmm,m128(AVX-512)
VBROADCASTI64X2 zmm,m128(AVX-512)
VEX.66.0F38 5A /r VBROADCASTI128
(L=1,W=0)
VBROADCASTI32X4(L≠0,W=0) F 32 (128)[b]
VBROADCASTI64X2(L≠0,W=1) DQ 64 (128)[b]
Extract 128-bit vector-lane of integer data from wider vector-register VEXTRACTI128 xmm/m128,ymm,imm8
VEXTRACTI32X4 xmm/m128,zmm,imm8(AVX-512)
VEXTRACTI64X2 xmm/m128,zmm,imm8(AVX-512)
VEX.66.0F3A 39 /r ib VEXTRACTI128
(L=1,W=0)
VEXTRACTI32X4(L≠0,W=0) F 32 No
VEXTRACTI64X2(L≠0,W=1) DQ 64 No
Insert 128-bit vector of integer data into lane of wider vector VINSERTI128 ymm,ymm,xmm/m128,imm8
VINSERTI32X4 ymm,ymm,xmm/m128,imm8(AVX-512)
VINSERTI64X2 ymm,ymm,xmm/m128,imm8(AVX-512)
VEX.66.0F3A 38 /r ib VINSERTI128
(L=1,W=0)
VINSERTI32X4(L≠0,W=0) F 32 No
VINSERTI64X2(L≠0,W=1) DQ 64 No
Concatenate the two source vectors into a vector of four 128-bit components, then use imm8 to index into that vector
  • Bits[1:0] of imm8 pick the element to use for the low 128 bits of the result
  • Bits[3:2] of imm8 pick the element to use for the high 128 bits of the result
VPERM2I128 ymm,ymm,ymm/m256,imm8 VEX.66.0F3A.W0 46 /r /ib (L=1) No
Perform shuffle of FP64 values in vector VPERMPD ymm,ymm/m256,imm8 VEX.66.0F3A.W1 01 /r ib (L=1)[f] Yes (L≠0) F 64 64
Perform shuffle of 64-bit integers in vector VPERMQ ymm,ymm/m256,imm8 VEX.66.0F3A.W1 00 /r ib (L=1)[f] Yes (L≠0) F 64 64
Perform variable shuffle of FP32 values in vector VPERMPS ymm,ymm,ymm/m256 VEX.66.0F38.W0 16 /r (L=1)[f] Yes (L≠0) F 32 32
Perform variable shuffle of 32-bit integers in vector VPERMD ymm,ymm,ymm/m256 VEX.66.0F38.W0 36 /r (L=1)[f] Yes (L≠0) F 32 32
Packed memory load/store of integer data with per-lane write masking.

First argument is destination, third argument is source. The second argument provides masks, in the top bit of each lane.

32-bit VPMASKMOVD ymm,ymm,m256 VEX.66.0F38.W0 8C /r Yes No
VPMASKMOVD m256,ymm,ymm VEX.66.0F38.W0 8E /r Yes No
64-bit VPMASKMOVQ ymm,ymm,m256 VEX.66.0F38.W1 8C /r Yes No
VPMASKMOVQ m256,ymm,ymm VEX.66.0F38.W1 8E /r Yes No
Blend packed 32-bit integer values.

For each 32-bit lane of result, pick value from second or third argument depending on the corresponding bit in the imm8 argument.

VPBLENDD ymm,ymm,ymm/m256,imm8 VEX.66.0F3A.W0 02 /r ib Yes No
Left-shift packed integers, with per-lane shift-amount 32-bit VPSLLVD ymm,ymm,ymm/m256 VEX.66.0F38.W0 47 /r Yes Yes F 32 32
64-bit VPSLLVQ ymm,ymm,ymm/m256 VEX.66.0F38.W1 47 /r Yes Yes F 64 64
Right-shift packed signed integers, with per-lane shift-amount 32-bit VPSRAVD ymm,ymm,ymm/m256 VEX.66.0F38 46 /r VPSRAVD
(W=0)
VPSRAVD(W=0) F 32 32
64-bit VPSRAVQ zmm,zmm,zmm/m512(AVX-512) VPSRAVQ(W=1) F 64 64
Right-shift packed unsigned integers, with per-lane shift-amount 32-bit VPSRLVD ymm,ymm,ymm/m256 VEX.66.0F38.W0 45 /r Yes Yes F 32 32
64-bit VPSRLVQ ymm,ymm,ymm/m256 VEX.66.0F38.W1 45 /r Yes Yes F 64 64
VGATHERDPD Yes
VGATHERQPD Yes
VGATHERDPS Yes
VGATHERQPS Yes
VPGATHERDD Yes
VPGATHERQD Yes
VPGATHERDQ Yes
VPGATHERQQ Yes
Added with F16C
VCVTPH2PS Yes
VCVTPS2PH Yes
  1. ^ a b VBROADCASTSS and VBROADCASTSD with a register source operand are not supported under AVX - support for xmm-register source operands for these instructions was added in AVX2.
  2. ^ a b c d e f g h i j k l The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not need the EVEX.b modifier.
  3. ^ The VBROADCASTSD instruction does not support broadcast of 64-bit data into 128-bit vector. For broadcast of 64-bit data into 128-bit vector, the SSE3 (V)MOVDDUP instruction can be used.
  4. ^ Under AVX, the VPBLENDVB instruction is only available with a 128-bit vector width (VEX.L=0). Support for 256-bit vector width was added in AVX2.
  5. ^ a b c d For AVX-512, variants of the VPBROADCAST(B/W/D/Q) instructions that can use a general-purpose register as source exist as well, with opcodes EVEX.66.0F38.W0 (7A..7C)
  6. ^ a b c d For VPERMPS, VPERMPD, VPERMD and VPERMQ, minimum supported vector width is 256 bits. For shuffles in a 128-bit vector, use VPERMILPS or VPERMILPD.
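As a usage illustration of the AVX2 broadcast and masked-move instructions above, the following minimal C sketch uses the corresponding compiler intrinsics from <immintrin.h>. The intrinsic names are compiler conventions assumed here rather than part of the instruction listing, and the exact instructions emitted depend on the compiler and options such as -mavx2.

    #include <immintrin.h>

    // Load only the first n (n <= 8) 32-bit elements of src, zeroing the rest
    // (VPMASKMOVD load form), then add a broadcast constant to every lane
    // (typically compiled to VPBROADCASTD followed by VPADDD).
    static __m256i load_partial_add(const int *src, int n, int addend)
    {
        __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), idx); // all-1s where idx < n
        __m256i v    = _mm256_maskload_epi32(src, mask);              // masked load
        return _mm256_add_epi32(v, _mm256_set1_epi32(addend));        // broadcast + add
    }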

AVX-512 foundation instructions (F, BW and DQ subsets)

[edit]

Regularly-encoded floating-point instructions

[edit]

These instructions all follow a given pattern where:

  • EVEX.W is used to specify floating-point format (0=FP32, 1=FP64)
  • The bottom opcode bit is used to select between packed and scalar operation (0: packed, 1:scalar)
  • For a given operation, all the scalar/packed variants belong to the same AVX-512 subset.
  • The instructions all support result masking by opmask registers.
  • Except for the AVX512ER and AVX512_4FMAPS extensions, all vector widths (128-bit, 256-bit and 512-bit) are supported.
  • Except for the AVX512_4FMAPS instructions, all variants support broadcast for memory operands.
Operation AVX-512
subset
Basic opcode FP32 instructions (EVEX.W=0) FP64 instructions (EVEX.W=1) RC/SAE
Packed Scalar Packed Scalar
AVX-512 foundation instructions (F, DQ)
Power-of-2 scaling of floating-point values F EVEX.66.0F38 (2C/2D) /r VSCALEFPS z,z,z/m512 VSCALEFSS x,x,x/m32 VSCALEFPD z,z,z/m512 VSCALEFSD x,x,x/m64 RC
Convert exponent of floating-point value to floating-point F EVEX.66.0F38 (42/43) /r VGETEXPPS z,z/m512 VGETEXPSS x,x,x/m32 VGETEXPPD z,z/m512 VGETEXPSD x,x,x/m64 SAE
Reciprocal approximation with an accuracy of 2^-14 F EVEX.66.0F38 (4C/4D) /r VRCP14PS z,z/m512 VRCP14SS x,x,x/m32 VRCP14PD z,z/m512 VRCP14SD x,x,x/m64
Reciprocal square root approximation with an accuracy of 2^-14 F EVEX.66.0F38 (4E/4F) /r VRSQRT14PS z,z/m512 VRSQRT14SS x,x,x/m32 VRSQRT14PD z,z/m512 VRSQRT14SD x,x,x/m64
Extract normalized mantissa from floating-point value
  • Bits[1:0] of imm8 argument specifies normalization interval
  • Bits[3:2] of imm8 argument specifies sign control
F EVEX.66.0F3A (26/27) /r ib VGETMANTPS z,z/m512,imm8 VGETMANTSS x,x,x/m32,imm8 VGETMANTPD z,z/m512,imm8 VGETMANTSD x,x,x/m64,imm8 SAE
Fix up special floating-point values F EVEX.66.0F3A (54/55) /r ib VFIXUPIMMPS z,z,z/m512,imm8 VFIXUPIMMSS x,x,x/m32,imm8 VFIXUPIMMPD z,z,z/m512,imm8 VFIXUPIMMSD x,x,x/m64,imm8 SAE
Range Restriction Calculation DQ EVEX.66.0F3A (50/51) /r ib VRANGEPS x,x,x/m128,imm8 VRANGESS x,x,x/m32,imm8 VRANGEPD x,x,x/m128,imm8 VRANGESD x,x,x/m64,imm8 SAE
Reduction Transformation DQ EVEX.66.0F3A (56/57) /r ib VREDUCEPS x,x/m128,imm8 VREDUCESS x,x,x/m32,imm8 VREDUCEPD x,x/m128,imm8 VREDUCESD x,x,x/m64,imm8 SAE
Floating-point classification test.

imm8 specifies a set of floating-point number classes to test for as a bitmap. Result is written to opmask register.

DQ EVEX.66.0F3A (66/67) /r ib VFPCLASSPS k,x/m128,imm8 VFPCLASSSS k,x/m32,imm8 VFPCLASSPD k,x/m128,imm8 VFPCLASSSD k,x/m64,imm8
Xeon Phi specific instructions (ER, 4FMAPS)
Reciprocal approximation with an accuracy of 2^-28 ER EVEX.66.0F38 (CA/CB) /r VRCP28PS z,z/m512 VRCP28SS x,x,x/m32 VRCP28PD z,z/m512 VRCP28SD x,x,x/m64 SAE
Reciprocal square root approximation with an accuracy of 2^-28 ER EVEX.66.0F38 (CC/CD) /r VRSQRT28PS z,z/m512 VRSQRT28SS x,x,x/m32 VRSQRT28PD z,z/m512 VRSQRT28SD x,x,x/m64 SAE
Exponential 2^x approximation with 2^-23 relative error ER EVEX.66.0F38 C8 /r VEXP2PS z,z/m512 No VEXP2PD z,z/m512 No SAE
Fused-multiply-add, 4 iterations 4FMAPS EVEX.F2.0F38 (9A/9B) /r V4FMADDPS z,z+3,m128 V4FMADDSS x,x+3,m128 No No
Fused negate-multiply-add, 4 iterations 4FMAPS EVEX.F2.0F38 (AA/AB) /r V4FNMADDPS z,z+3,m128 V4FNMADDSS x,x+3,m128 No No

Opmask instructions

[edit]

AVX-512 introduces, in addition to 512-bit vectors, a set of eight opmask registers, named k0,k1,k2...k7. These registers are 64 bits wide in implementations that support AVX512BW and 16 bits wide otherwise. They are mainly used to enable/disable operation on a per-lane basis for most of the AVX-512 vector instructions. They are usually set with vector-compare instructions or instructions that otherwise produce a 1-bit per-lane result as a natural part of their operation - however, AVX-512 defines a set of 55 new instructions to assist with manual manipulation of the opmask registers.
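A minimal C sketch of typical opmask use, assuming an AVX512F-capable compiler and the conventional <immintrin.h> intrinsics (the intrinsic names are a compiler convention, not part of this instruction listing): a vector compare produces an opmask, which then write-masks a subsequent arithmetic instruction.

    #include <immintrin.h>

    // Compare two vectors of 16 x int32 into an opmask register (VPCMPD),
    // then use that opmask to write-mask an addition: lanes where a != b
    // keep their original value from a, lanes where a == b receive a + c.
    static __m512i add_where_equal(__m512i a, __m512i b, __m512i c)
    {
        __mmask16 m = _mm512_cmpeq_epi32_mask(a, b);
        return _mm512_mask_add_epi32(a, m, a, c);
    }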

These instructions are, for the most part, defined in groups of 4 instructions, where the four instructions in a group are basically just 8-bit, 16-bit, 32-bit and 64-bit variants of the same basic operation (where only the low 8/16/32/64 bits of the registers participate in the given operation and, if a result is written back to a register, all bits except the bottom 8/16/32/64 bits are set to zero). The opmask instructions are all encoded with the VEX prefix (unlike all other AVX-512 instructions, which are encoded with the EVEX prefix).

In general, the 16-bit variants of the instructions are introduced by AVX512F (except KADDW and KTESTW), the 8-bit variants by the AVX512DQ extension, and the 32/64-bit variants by the AVX512BW extension.

Most of the instructions follow a very regular encoding pattern where the four instructions in a group have identical encodings except for the VEX.pp and VEX.W fields:

Description Basic opcode 8-bit instructions
(AVX512DQ)
encoded with
VEX.66.W0
16-bit instructions
(AVX512F)
encoded with
VEX.NP.W0
32-bit instructions
(AVX512BW)
encoded with
VEX.66.W1
64-bit instructions
(AVX512BW)
encoded with
VEX.NP.W1
Bitwise AND between two opmask-registers VEX.L1.0F 41 /r KANDB k,k,k KANDW k,k,k KANDD k,k,k KANDQ k,k,k
Bitwise AND-NOT between two opmask-registers VEX.L1.0F 42 /r KANDNB k,k,k KANDNW k,k,k KANDND k,k,k KANDNQ k,k,k
Bitwise NOT of opmask-register VEX.L0.0F 44 /r KNOTB k,k KNOTW k,k KNOTD k,k KNOTQ k,k
Bitwise OR of two opmask-registers VEX.L1.0F 45 /r KORB k,k,k KORW k,k,k KORD k,k,k KORQ k,k,k
Bitwise XNOR of two opmask-registers VEX.L1.0F 46 /r KXNORB k,k,k KXNORW k,k,k KXNORD k,k,k KXNORQ k,k,k
Bitwise XOR of two opmask-registers VEX.L1.0F 47 /r KXORB k,k,k KXORW k,k,k KXORD k,k,k KXORQ k,k,k
Integer addition of two opmask-registers VEX.L1.0F 4A /r KADDB k,k,k KADDW k,k,k[a] KADDD k,k,k KADDQ k,k,k
Load opmask-register from memory or opmask-register VEX.L0.0F 90 /r[b] KMOVB k,k/m8 KMOVW k,k/m16 KMOVD k,k/m32 KMOVQ k,k/m64
Store opmask-register to memory VEX.L0.0F 91 /r[b] KMOVB m8,k KMOVW m16,k KMOVD m32,k KMOVQ m64,k
Load opmask-register from general-purpose register VEX.L0.0F 92 /r[b] KMOVB k,r32 KMOVW k,r32 No[c] No[c]
Store opmask-register to general-purpose register with zero-extension VEX.L0.0F 93 /r[b] KMOVB r32,k KMOVW r32,k No[c] No[c]
Bitwise OR-and-test.

Performs bitwise-OR between two opmask-registers and set flags accordingly.
If the bitwise-OR resulted in all-0s, set ZF=1, else set ZF=0.
If the bitwise-OR resulted in all-1s, set CF=1, else set CF=0.

VEX.L0.0F 98 /r KORTESTB k,k KORTESTW k,k KORTESTD k,k KORTESTQ k,k
Bitwise test.

Performs bitwise-AND and ANDNOT between two opmask-registers and set flags accordingly.
If the bitwise-AND resulted in all-0s, set ZF=1, else set ZF=0.
If the bitwise AND between the inverted first operand and the second operand resulted in all-0s, set CF=1, else set CF=0.

VEX.L0.0F 99 /r KTESTB k,k KTESTW k,k[a] KTESTD k,k KTESTQ k,k
  1. ^ a b The 16-bit opmask instructions KADDW and KTESTW were introduced with AVX512DQ, not AVX512F.
  2. ^ a b c d On processors that support Intel APX, all forms of the KMOV* instructions (but not any other opmask instructions) can be EVEX-encoded.
  3. ^ a b c d The 32/64-bit KMOVD/KMOVQ instructions to move between opmask-registers and general-purpose registers do exist, but do not match the pattern of the opcodes in this table. See table below.
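To show how these opmask test instructions combine with the vector-test instructions (the VPTESTMD + KORTESTW sequence mentioned in an earlier footnote), here is a minimal sketch assuming AVX512F and the conventional <immintrin.h> intrinsics.

    #include <immintrin.h>

    // "Any lane of a AND b is non-zero" test: VPTESTMD produces an opmask,
    // KORTESTW sets ZF=1 if and only if the mask is all-0s.
    static int any_lane_set(__m512i a, __m512i b)
    {
        __mmask16 m = _mm512_test_epi32_mask(a, b);   // per-lane AND, nonzero -> 1
        return !_mm512_kortestz(m, m);                // 0 unless some mask bit is set
    }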

Not all of the opmask instructions fit the pattern above - the remaining ones are:

Description Instruction mnemonics Opcode Operation/result width AVX-512 subset
Opmask-register shift right immediate with zero-fill[a] KSHIFTRB k,k,imm8 VEX.L0.66.0F3A.W0 30 /r /ib 8 DQ
KSHIFTRW k,k,imm8 VEX.L0.66.0F3A.W1 30 /r /ib 16 F
KSHIFTRD k,k,imm8 VEX.L0.66.0F3A.W0 31 /r /ib 32 BW
KSHIFTRQ k,k,imm8 VEX.L0.66.0F3A.W1 31 /r /ib 64 BW
Opmask-register shift left immediate[a] KSHIFTLB k,k,imm8 VEX.L0.66.0F3A.W0 32 /r /ib 8 DQ
KSHIFTLW k,k,imm8 VEX.L0.66.0F3A.W1 32 /r /ib 16 F
KSHIFTLD k,k,imm8 VEX.L0.66.0F3A.W0 33 /r /ib 32 BW
KSHIFTLQ k,k,imm8 VEX.L0.66.0F3A.W1 33 /r /ib 64 BW
32/64-bit move between general-purpose registers and opmask registers KMOVD k,r32 VEX.L0.F2.0F.W0 92 /r[b] 32 BW
KMOVQ k,r64[c] VEX.L0.F2.0F.W1 92 /r[b] 64 BW
KMOVD r32,k VEX.L0.F2.0F.W0 93 /r[b] 32 BW
KMOVQ r64,k[c] VEX.L0.F2.0F.W1 93 /r[b] 64 BW
Concatenate two 8-bit opmasks into a 16-bit opmask KUNPCKBW k,k,k VEX.L1.66.0F.W0 4B /r 16 F
Concatenate two 16-bit opmasks into a 32-bit opmask KUNPCKWD k,k,k VEX.L1.0F.W0 4B /r 32 BW
Concatenate two 32-bit opmasks into a 64-bit opmask KUNPCKDQ k,k,k VEX.L1.0F.W1 4B /r 64 BW
  1. ^ a b For the KSHIFT* instructions, the imm8 shift-amount is not masked. Specifying a shift-amount greater than or equal to the operand size will produce an all-zeroes result.
  2. ^ a b c d On processors that support Intel APX, all forms of the KMOV* instructions (but not any other opmask instructions) can be EVEX-encoded.
  3. ^ a b KMOVQ instruction with 64-bit general-purpose register operand only available in x86-64 long mode. Instruction will execute as KMOVD in 32-bit mode.

Data conversion instructions

[edit]
Description Instruction mnemonics Opcode AVX-512
subset
lane-width broadcast
lane-width
rc/sae
Packed integer narrowing conversions
Convert packed integers to narrower integers, with unsigned saturation 16-bit → 8-bit VPMOVUSWB ymm/m256,zmm EVEX.F3.0F38.W0 10 /r BW 8 No
32-bit → 8-bit VPMOVUSDB xmm/m128,zmm EVEX.F3.0F38.W0 11 /r F 8 No
64-bit → 8-bit VPMOVUSQB xmm/m64,zmm EVEX.F3.0F38.W0 12 /r F 8 No
32-bit → 16-bit VPMOVUSDW ymm/m256,zmm EVEX.F3.0F38.W0 13 /r F 16 No
64-bit → 16-bit VPMOVUSQW xmm/m128,zmm EVEX.F3.0F38.W0 14 /r F 16 No
64-bit → 32-bit VPMOVUSQD ymm/m256,zmm EVEX.F3.0F38.W0 15 /r F 32 No
Convert packed integers to narrower integers, with signed saturation 16-bit → 8-bit VPMOVSWB ymm/m256,zmm EVEX.F3.0F38.W0 20 /r BW 8 No
32-bit → 8-bit VPMOVSDB xmm/m128,zmm EVEX.F3.0F38.W0 21 /r F 8 No
64-bit → 8-bit VPMOVSQB xmm/m64,zmm EVEX.F3.0F38.W0 22 /r F 8 No
32-bit → 16-bit VPMOVSDW ymm/m256,zmm EVEX.F3.0F38.W0 23 /r F 16 No
64-bit → 16-bit VPMOVSQW xmm/m128,zmm EVEX.F3.0F38.W0 24 /r F 16 No
64-bit → 32-bit VPMOVSQD ymm/m256,zmm EVEX.F3.0F38.W0 25 /r F 32 No
Convert packed integers to narrower integers, with truncation 16-bit → 8-bit VPMOVWB ymm/m256,zmm EVEX.F3.0F38.W0 30 /r BW 8 No
32-bit → 8-bit VPMOVDB xmm/m128,zmm EVEX.F3.0F38.W0 31 /r F 8 No
64-bit → 8-bit VPMOVQB xmm/m64,zmm EVEX.F3.0F38.W0 32 /r F 8 No
32-bit → 16-bit VPMOVDW ymm/m256,zmm EVEX.F3.0F38.W0 33 /r F 16 No
64-bit → 16-bit VPMOVQW xmm/m128,zmm EVEX.F3.0F38.W0 34 /r F 16 No
64-bit → 32-bit VPMOVQD ymm/m256,zmm EVEX.F3.0F38.W0 35 /r F 32 No
Packed conversions between floating-point and integer
Convert packed floating-point values to packed unsigned integers FP32 → uint32 VCVTPS2UDQ xmm,xmm/m128 EVEX.0F.W0 79 /r F 32 32 RC
FP64 → uint32 VCVTPD2UDQ xmm,xmm/m128 EVEX.0F.W1 79 /r F 32 64 RC
FP32 → uint64 VCVTPS2UQQ xmm,xmm/m64 EVEX.66.0F.W0 79 /r DQ 64 32 RC
FP64 → uint64 VCVTPD2UQQ xmm,xmm/m128 EVEX.66.0F.W1 79 /r DQ 64 64 RC
Convert packed floating-point values to packed signed integers FP32 → int64 VCVTPS2QQ xmm,xmm/m64 EVEX.66.0F.W0 7B /r DQ 64 32 RC
FP64 → int64 VCVTPD2QQ xmm,xmm/m128 EVEX.66.0F.W1 7B /r DQ 64 64 RC
Convert packed floating-point values to packed unsigned integers, with round-to-zero FP32 → uint32 VCVTTPS2UDQ xmm,xmm/m128 EVEX.0F.W0 78 /r F 32 32 SAE
FP64 → uint32 VCVTTPD2UDQ xmm,xmm/m128 EVEX.0F.W1 78 /r F 32 64 SAE
FP32 → uint64 VCVTTPS2UQQ xmm,xmm/m64 EVEX.66.0F.W0 78 /r DQ 64 32 SAE
FP64 → uint64 VCVTTPD2UQQ xmm,xmm/m128 EVEX.66.0F.W1 78 /r DQ 64 64 SAE
Convert packed floating-point values to packed signed integers, with round-to-zero FP32 → int64 VCVTTPS2QQ xmm,xmm/m64 EVEX.66.0F.W0 7A /r DQ 64 32 SAE
FP64 → int64 VCVTTPD2QQ xmm,xmm/m128 EVEX.66.0F.W1 7A /r DQ 64 64 SAE
Convert packed unsigned integers to floating-point uint32 → FP32 VCVTUDQ2PS xmm,xmm/m128 EVEX.F2.0F.W0 7A /r F 32 32 RC
uint32 → FP64 VCVTUDQ2PD xmm,xmm/m128 EVEX.F3.0F.W0 7A /r F 64 32 RC[a]
uint64 → FP32 VCVTUQQ2PS xmm,xmm/m128 EVEX.F2.0F.W1 7A /r DQ 32 64 RC
uint64 → FP64 VCVTUQQ2PD xmm,xmm/m128 EVEX.F3.0F.W1 7A /r DQ 64 64 RC
Convert packed signed integers to floating-point int64 → FP32 VCVTQQ2PS xmm,xmm/m128 EVEX.0F.W1 5B /r DQ 32 64 RC
int64 → FP64 VCVTQQ2PD xmm,xmm/m128 EVEX.F3.0F.W1 E6 /r DQ 64 64 RC
Scalar conversions between floating-point and unsigned integer
Convert scalar floating-point value to unsigned integer, and store integer in GPR. FP32 → uint32 VCVTSS2USI r32,xmm/m32 EVEX.F3.0F.W0 79 /r F No No RC
FP32 → uint64 VCVTSS2USI r64,xmm/m32[b] EVEX.F3.0F.W1 79 /r F No No RC
FP64 → uint32 VCVTSD2USI r32,xmm/m64 EVEX.F2.0F.W0 79 /r F No No RC
FP64 → uint64 VCVTSD2USI r64,xmm/m64[b] EVEX.F2.0F.W1 79 /r F No No RC
Convert scalar floating-point value to unsigned integer with round-to-zero, and store integer in GPR. FP32 → uint32 VCVTTSS2USI r32,xmm/m32 EVEX.F3.0F.W0 78 /r F No No SAE
FP32 → uint64 VCVTTSS2USI r64,xmm/m32[b] EVEX.F3.0F.W1 78 /r F No No SAE
FP64 → uint32 VCVTTSD2USI r32,xmm/m64 EVEX.F2.0F.W0 78 /r F No No SAE
FP64 → uint64 VCVTTSD2USI r64,xmm/m64[b] EVEX.F2.0F.W1 78 /r F No No SAE
Convert scalar unsigned integer to floating-point uint32 → FP32 VCVTUSI2SS xmm,xmm,r/m32 EVEX.F3.0F.W0 7B /r F No No RC
uint64 → FP32 VCVTUSI2SS xmm,xmm,r/m64[b] EVEX.F3.0F.W1 7B /r F No No RC
uint32 → FP64 VCVTUSI2SD xmm,xmm,r/m32 EVEX.F2.0F.W0 7B /r F No No RC[a]
uint64 → FP64 VCVTUSI2SD xmm,xmm,r/m64[b] EVEX.F2.0F.W1 7B /r F No No RC
  1. ^ a b For instructions that perform conversions from unsigned 32-bit integer to FP64 (VCVTUDQ2PD and the W=0 variant of VCVTUSI2SD), EVEX embedded rounding controls are permitted but have no effect.
  2. ^ a b c d e f Scalar conversions to/from 64-bit integer (VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI, VCVTUSI2SS, VCVTUSI2SD with EVEX.W=1 encoding) are only available in 64-bit "long mode". Otherwise, these instructions execute as if EVEX.W=0, resulting in 32-bit integer operation.
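For illustration, the packed conversion and narrowing instructions above map directly onto AVX-512 compiler intrinsics. A minimal sketch, assuming AVX512F and the conventional <immintrin.h> intrinsic names (a compiler convention, not part of the instruction listing):

    #include <immintrin.h>

    // Truncating conversion of 16 packed FP32 values to unsigned 32-bit
    // integers (VCVTTPS2UDQ).
    static __m512i fp32_to_u32_trunc(__m512 f)  { return _mm512_cvttps_epu32(f); }

    // Narrowing of 8 packed 64-bit integers to 32-bit with truncation (VPMOVQD).
    static __m256i q_to_d_trunc(__m512i q)      { return _mm512_cvtepi64_epi32(q); }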

Compare, test, blend, opmask-convert

[edit]

Vector-register instructions that use opmasks in ways other than just as a result writeback mask.

Description Instruction mnemonics Opcode AVX-512 subset lane-width bcst
Packed integer compare
Compare packed signed integers into opmask register 8-bit VPCMPB k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W0 3F /r ib BW 8 No
16-bit VPCMPW k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W1 3F /r ib BW 16 No
32-bit VPCMPD k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W0 1F /r ib F 32 32
64-bit VPCMPQ k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W1 1F /r ib F 64 64
Compare packed unsigned integers into opmask register 8-bit VPCMPUB k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W0 3E /r ib BW 8 No
16-bit VPCMPUW k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W1 3E /r ib BW 16 No
32-bit VPCMPUD k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W0 1E /r ib F 32 32
64-bit VPCMPUQ k,xmm,xmm/m128,imm8 EVEX.66.0F3A.W1 1E /r ib F 64 64
Packed integer test
Perform bitwise-AND on packed integer values, then write zero/nonzero status of each integer result element into the corresponding opmask register bit (zero→0, nonzero→1) 8-bit VPTESTMB k,xmm,xmm/m128 EVEX.66.0F38.W0 26 /r BW 8 No
16-bit VPTESTMW k,xmm,xmm/m128 EVEX.66.0F38.W1 26 /r BW 16 No
32-bit VPTESTMD k,xmm,xmm/m128 EVEX.66.0F38.W0 27 /r F 32 32
64-bit VPTESTMQ k,xmm,xmm/m128 EVEX.66.0F38.W1 27 /r F 64 64
Perform bitwise-AND on packed integer values, then write negated zero/nonzero status of each integer result element into the corresponding opmask register bit (zero→1, nonzero→0) 8-bit VPTESTNMB k,xmm,xmm/m128 EVEX.F3.0F38.W0 26 /r BW 8 No
16-bit VPTESTNMW k,xmm,xmm/m128 EVEX.F3.0F38.W1 26 /r BW 16 No
32-bit VPTESTNMD k,xmm,xmm/m128 EVEX.F3.0F38.W0 27 /r F 32 32
64-bit VPTESTNMQ k,xmm,xmm/m128 EVEX.F3.0F38.W1 27 /r F 64 64
Packed blend
Variable blend packed integer values.

For each lane of result, pick value from either second or third vector argument based on the opmask register.

8-bit VPBLENDMB xmm{k},xmm,xmm/m128[a] EVEX.66.0F38.W0 66 /r BW No No
16-bit VPBLENDMW xmm{k},xmm,xmm/m128[a] EVEX.66.0F38.W1 66 /r BW No No
32-bit VPBLENDMD xmm{k},xmm,xmm/m128[a] EVEX.66.0F38.W0 64 /r F No 32
64-bit VPBLENDMQ xmm{k},xmm,xmm/m128[a] EVEX.66.0F38.W1 64 /r F No 64
Variable blend packed floating-point values.

For each lane of result, pick value from either second or third vector argument based on the opmask register.

32-bit VBLENDMPS xmm{k},xmm,xmm/m128[a] EVEX.66.0F38.W0 65 /r F No 32
64-bit VBLENDMPD xmm{k},xmm,xmm/m128[a] EVEX.66.0F38.W1 65 /r F No 64
Conversions between vector register and opmask register
Convert opmask register to vector register, with each vector lane set to 0 or all-1s based on corresponding opmask bit. 8-bit VPMOVM2B xmm,k[b] EVEX.F3.0F38.W0 28 /r BW No No
16-bit VPMOVM2W xmm,k[b] EVEX.F3.0F38.W1 28 /r BW No No
32-bit VPMOVM2D xmm,k[b] EVEX.F3.0F38.W0 38 /r DQ No No
64-bit VPMOVM2Q xmm,k[b] EVEX.F3.0F38.W1 38 /r DQ No No
Convert vector register to opmask register, by picking the top bit of each vector register lane. 8-bit VPMOVB2M k,xmm[b] EVEX.F3.0F38.W0 29 /r BW No No
16-bit VPMOVW2M k,xmm[b] EVEX.F3.0F38.W1 29 /r BW No No
32-bit VPMOVD2M k,xmm[b] EVEX.F3.0F38.W0 39 /r DQ No No
64-bit VPMOVQ2M k,xmm[b] EVEX.F3.0F38.W1 39 /r DQ No No
  1. ^ a b c d e f For the AVX-512 V(P)BLENDM* instructions, result write masking is not available - the EVEX-prefix opmask register argument that is normally used for write-masking with most other AVX-512 instructions is instead used for source selection.
  2. ^ a b c d e f g h The VPMOVM2* and VPMOV*2M instructions do not support result masking by an EVEX-encoded opmask register and require EVEX.aaa=0. The opmask register operands of these instructions are instead specified through the ModR/M byte.
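As a usage illustration of the compare, blend and mask-conversion instructions above, the following minimal C sketch assumes AVX512F/AVX512DQ and the conventional <immintrin.h> intrinsics (intrinsic names are a compiler convention, not part of this listing):

    #include <immintrin.h>

    // Per-lane select of the smaller 32-bit integer from a and b
    // (VPCMPD into an opmask, then VPBLENDMD selects using that opmask).
    static __m512i min_epi32_via_blend(__m512i a, __m512i b)
    {
        __mmask16 m = _mm512_cmplt_epi32_mask(a, b);   // mask bit = 1 where a < b
        return _mm512_mask_blend_epi32(m, b, a);       // bit=1 selects the third operand (a)
    }

    // Expand an opmask back into a vector of all-0s / all-1s lanes (VPMOVM2D, AVX512DQ).
    static __m512i mask_to_vector(__mmask16 m) { return _mm512_movm_epi32(m); }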

Data movement instructions

[edit]
Description Instruction mnemonics Opcode AVX-512 subset lane-width broadcast lane-width rc/sae
Broadcast integer data from general-purpose register to all lanes of vector register 8-bit VPBROADCASTB xmm,r32 EVEX.66.0F38.W0 7A /r AVX512BW 8 No
16-bit VPBROADCASTW xmm,r32 EVEX.66.0F38.W0 7B /r AVX512BW 16 No
32-bit VPBROADCASTD xmm,r32 EVEX.66.0F38.W0 7C /r AVX512F 32 No
64-bit VPBROADCASTQ xmm,r64[a] EVEX.66.0F38.W1 7C /r AVX512F 64 No
VCOMPRESSPS
VCOMPRESSPD
VPCOMPRESSD
VPCOMPRESSQ
VEXPANDPS
VEXPANDPD
VPEXPANDD
VPEXPANDQ
VPERMW
VPERMT2W
VPERMI2PS
VPERMI2PD
VPERMI2D
VPERMI2Q
VPERMI2W
VPERMT2PS
VPERMT2PD
VPERMT2D
VPERMT2Q
VSHUFF32x4
VSHUFF64x2
VSHUFI32x4
VSHUFI64x2
VPSCATTERDD
VPSCATTERDQ
VPSCATTERQD
VPSCATTERQQ
VSCATTERDPS
VSCATTERDPD
VSCATTERQPS
VSCATTERQPD
  1. ^ VPBROADCASTQ with 64-bit register operand available only in 64-bit long mode. In 32-bit mode, the instruction will execute as if EVEX.W=0, resulting in 32-bit operation.
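For the compress/expand mnemonics listed above (e.g. VPCOMPRESSD), a minimal C sketch assuming AVX512F and the conventional <immintrin.h> intrinsics (names are a compiler convention, not part of this listing):

    #include <immintrin.h>

    // Keep only the lanes selected by an opmask, packing them to the front
    // (VPCOMPRESSD). The register form zeroes the remaining lanes; the memory
    // form stores only the selected elements contiguously.
    static __m512i keep_selected(__m512i v, __mmask16 keep, int *out)
    {
        _mm512_mask_compressstoreu_epi32(out, keep, v);   // store packed elements to memory
        return _mm512_maskz_compress_epi32(keep, v);      // packed in-register, rest zeroed
    }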

Other AVX-512 foundation instructions

[edit]
Description Instruction mnemonics Opcode AVX-512 subset lane-width broadcast lane-width rc/sae
VPTERNLOGD
VPTERNLOGQ
VALIGND
VALIGNQ
VRNDSCALEPS
VRNDSCALESS
VRNDSCALEPD
VRNDSCALESD
VDBPSADBW
VPMULLQ
VPROLD
VPROLVD
VPROLQ
VPROLVQ
VPRORD
VPRORVD
VPRORQ
VPRORVQ
VPSRAQ
VPSRAVQ
VPSLLVW
VPSRAVW
VPSRLVW
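As an illustration of VPTERNLOGD listed above, which evaluates an arbitrary 3-input boolean function per bit with the imm8 acting as the truth table, a minimal C sketch assuming AVX512F and the conventional <immintrin.h> intrinsics:

    #include <immintrin.h>

    // VPTERNLOGD with imm8 = 0x96 computes a XOR b XOR c per bit
    // (imm8 bit index is formed from the corresponding bits of a, b, c).
    static __m512i xor3(__m512i a, __m512i b, __m512i c)
    {
        return _mm512_ternarylogic_epi32(a, b, c, 0x96);
    }

    // imm8 = 0xCA is the bitwise select: result bit = a ? b : c.
    static __m512i bitselect(__m512i a, __m512i b, __m512i c)
    {
        return _mm512_ternarylogic_epi32(a, b, c, 0xCA);
    }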


Description Instruction mnemonics Opcode supported subset lane bcst
Broadcast vector of 256-bit floating-point data from memory to all 256-bit lanes of zmm-register VBROADCASTF32X8 zmm,m256
VBROADCASTF64X4 zmm,m256
EVEX.66.0F38 1B /r VBROADCASTF32X8(L=2,W=0) DQ 32 (256)[a]
VBROADCASTF64X4(L=2,W=1) F 64 (256)[a]
Extract 256-bit vector-lane of floating-point data from wider vector-register VEXTRACTF32X8 ymm/m256,zmm
VEXTRACTF64X4 ymm/m256,zmm
EVEX.66.0F3A 1B /r ib VEXTRACTF32X8(L=2,W=0) DQ 32 No
VEXTRACTF64X4(L=2,W=1) F 64 No
Insert 256-bit vector of floating-point data into lane of wider vector VINSERTF32X8 zmm,zmm,ymm/m256,imm8
VINSERTF64X4 zmm,zmm,ymm/m256,imm8
EVEX.66.0F3A 1A /r ib VINSERTF32X8(L=2,W=0) DQ 32 No
VINSERTF64X4(L=2,W=1) F 64 No
Broadcast vector of 256-bit integer data from memory to all 256-bit lanes of zmm-register VBROADCASTI32X8 zmm,m256
VBROADCASTI64X4 zmm,m256
EVEX.66.0F38 5B /r VBROADCASTI32X8(L=2,W=0) DQ 32 (256)[a]
VBROADCASTI64X4(L=2,W=1) F 64 (256)[a]
Extract 256-bit vector-lane of integer data from wider vector-register VEXTRACTI32X8 ymm/m256,zmm
VEXTRACTI64X4 ymm/m256,zmm
EVEX.66.0F3A 3B /r ib VEXTRACTI32X8(L=2,W=0) DQ 32 No
VEXTRACTI64X4(L=2,W=1) F 64 No
Insert 256-bit vector of integer data into lane of wider vector VINSERTI32X8 zmm,zmm,ymm/m256,imm8
VINSERTI64X4 zmm,zmm,ymm/m256,imm8
EVEX.66.0F3A 3A /r ib VINSERTI32X8(L=2,W=0) DQ 32 No
VINSERTI64X4(L=2,W=1) F 64 No
Perform variable shuffle of FP64 values in vector VPERMPD ymm,ymm,ymm/m256 EVEX.66.0F38.W1 16 /r Yes (L≠0)[b] F 64 64
Perform variable shuffle of 64-bit integers in vector VPERMQ ymm,ymm,ymm/m256 EVEX.66.0F38.W1 36 /r Yes (L≠0)[b] F 64 64
  1. ^ a b c d The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not need the EVEX.b modifier.
  2. ^ a b For VPERMPS, VPERMPD, VPERMD and VPERMQ, the minimum supported vector width is 256 bits. For shuffles in a 128-bit vector, use VPERMILPS or VPERMILPD.
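For illustration, the 256-bit extract/insert instructions above can be used to exchange the two halves of a 512-bit register. A minimal C sketch, assuming AVX512F and the conventional <immintrin.h> intrinsics (names are a compiler convention, not part of this listing):

    #include <immintrin.h>

    // Swap the two 256-bit halves of a zmm register of doubles using
    // VEXTRACTF64X4 and VINSERTF64X4.
    static __m512d swap_halves(__m512d v)
    {
        __m256d lo = _mm512_extractf64x4_pd(v, 0);           // bits 255:0
        __m256d hi = _mm512_extractf64x4_pd(v, 1);           // bits 511:256
        return _mm512_insertf64x4(_mm512_castpd256_pd512(hi), lo, 1);
    }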

Cryptographic instruction set extensions that use SIMD registers

[edit]
Description Instruction mnemonics Basic opcode   SSE   AVX
(VEX prefix)
AVX-512 (EVEX prefix)
supported subset lane bcst
Added with AES-NI and VAES
Perform one round of an AES encryption flow (V)AESENC xmm,xmm/m128[a] 66 0F38 DC /r Yes Yes[b] Yes[b] AVX512F + VAES No No
Perform the last round of an AES encryption flow (V)AESENCLAST xmm,xmm/m128[a] 66 0F38 DD /r
Perform one round of an AES decryption flow (V)AESDEC xmm,xmm/m128[a] 66 0F38 DE /r
Perform the last round of an AES decryption flow (V)AESDECLAST xmm,xmm/m128[a] 66 0F38 DF /r
Perform the AES InvMixColumn transform (V)AESIMC xmm,xmm/m128 66 0F38 DB /r Yes Yes (L=0) No
AES Round Key Generation Assist (V)AESKEYGENASSIST xmm,xmm/m128,imm8 66 0F3A DF /r ib
Added with PCLMULQDQ and VPCLMULQDQ
Carry-less multiply of two 64-bit integers, with full 128-bit result stored.

Bit[0] of the imm8 is used to select which 64-bit lane of the first source argument to use as input to the multiply. Bit[4] of the imm8 is used to select which 64-bit lane of the second source argument to use as input to the multiply.

(V)PCLMULQDQ xmm,xmm/m128,imm8 [a][c] 66 0F3A 44 /r ib Yes Yes[d] Yes[d] AVX512F + VPCLMULQDQ No No
Added with SHA-NI
Perform four rounds of SHA1 operation SHA1RNDS4 xmm,xmm/m128,imm8 NP 0F3A CC /r ib Yes No No
Calculate SHA1 State Variable E after four rounds SHA1NEXTE xmm,xmm/m128 NP 0F38 C8 /r
Perform an intermediate calculation for the next 128 bits of the SHA message SHA1MSG1 xmm,xmm/m128 NP 0F38 C9 /r
Perform a final calculation for the next 128 bits of the SHA message SHA1MSG2 xmm,xmm/m128 NP 0F38 CA /r
Perform two rounds of SHA256 operation.

Uses XMM0 as an implicit operand.

SHA256RNDS2 xmm,xmm/m128 NP 0F38 CB /r
Perform an intermediate calculation for the next 128 bits of the SHA256 message SHA256MSG1 xmm,xmm/m128 NP 0F38 CC /r
Perform a final calculation for the next 128 bits of the SHA256 message SHA256MSG2 xmm,xmm/m128 NP 0F38 CD /r
Added with GFNI
Galois Field Affine Transformation Inverse (V)GF2P8AFFINEINVQB xmm,xmm/m128,imm8 66 0F3A CF /r /ib Yes Yes Yes AVX512F + GFNI 8 64
Galois Field Affine Transformation (V)GF2P8AFFINEQB xmm,xmm/m128,imm8 66 0F3A CE /r /ib
Galois Field Multiply Bytes (V)GF2P8MULB xmm,xmm/m128 66 0F38 CF /r
Added with AES Key Locker
Load internal wrapping key ("IWKey") from the two xmm register operands and XMM0.

The two explicit operands (which must be register operands) specify a 256-bit encryption key. The implicit operand in XMM0 specifies a 128-bit integrity key. EAX contains flags controlling the operation of the instruction. After being loaded, the IWKey cannot be directly read from software, but is used for the key wrapping done by ENCODEKEY128/256 and checked by the Key Locker encode/decode instructions.

The LOADIWKEY instruction is privileged and can run in Ring 0 only.

LOADIWKEY xmm,xmm F3 0F38 DC /r Yes No No
Wrap a 128-bit AES key from XMM0 into a 384-bit key handle and output handle in XMM0-2.

Source argument specifies handle restrictions. Destination operand is populated with information about the source of the key and its attributes.

ENCODEKEY128 r32,r32 F3 0F38 FA /r
Wrap a 256-bit AES key from XMM1:XMM0 into a 512-bit key handle and output handle in XMM0-3.

Source argument specifies handle restrictions. Destination operand is populated with information about the source of the key and its attributes.

ENCODEKEY256 r32,r32 F3 0F38 FB /r
Encrypt xmm using 128-bit AES key indicated by handle at m384 and store result in xmm AESENC128KL xmm,m384 F3 0F38 DC /r
Decrypt xmm using 128-bit AES key indicated by handle at m384 and store result in xmm AESDEC128KL xmm,m384 F3 0F38 DD /r
Encrypt xmm using 256-bit AES key indicated by handle at m512 and store result in xmm AESENC256KL xmm,m512 F3 0F38 DE /r
Decrypt xmm using 256-bit AES key indicated by handle at m512 and store result in xmm AESDEC256KL xmm,m512 F3 0F38 DF /r
Encrypt XMM0-7 using 128-bit AES key indicated by handle at m384 and store each resultant block back to its corresponding register AESENCWIDE128KL m384 F3 0F38 D8 /0
Decrypt XMM0-7 using 128-bit AES key indicated by handle at m384 and store each resultant block back to its corresponding register AESDECWIDE128KL m384 F3 0F38 D8 /1
Encrypt XMM0-7 using 256-bit AES key indicated by handle at m512 and store each resultant block back to its corresponding register AESENCWIDE256KL m512 F3 0F38 D8 /2
Decrypt XMM0-7 using 256-bit AES key indicated by handle at m512 and store each resultant block back to its corresponding register AESDECWIDE256KL m512 F3 0F38 D8 /3
  1. ^ a b c d e For the VAESENC(LAST), VAESDEC(LAST) and VPCLMULQDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. ^ a b For the VAESENC(LAST) and VAESDEC(LAST) instructions, VEX-encoded variants with VEX.L=1 (to indicate 256-bit vectors), as well as EVEX encoded variants (of any length), are only available if the VAES extension is present.
  3. ^ For the PCLMULQDQ instruction, both Intel SDM and AMD APM list a series of four pseudo-instructions that are commonly recognized by compilers and assemblers:
    • PCLMULLQLQDQ xmm1,xmm2 : equal to PCLMULQDQ xmm1,xmm2,00h
    • PCLMULHQLQDQ xmm1,xmm2 : equal to PCLMULQDQ xmm1,xmm2,01h
    • PCLMULLQHQDQ xmm1,xmm2 : equal to PCLMULQDQ xmm1,xmm2,10h
    • PCLMULHQHQDQ xmm1,xmm2 : equal to PCLMULQDQ xmm1,xmm2,11h
  4. ^ a b For the VPCLMULQDQ instruction, VEX-encoded variants with VEX.L=1 (to indicate 256-bit vectors), as well as EVEX encodings, are only available if the VPCLMULQDQ extension is present.
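As a usage illustration of the AESENC/AESENCLAST rows above, a full AES-128 block encryption is simply an XOR with the first round key followed by nine AESENC rounds and one AESENCLAST round. A minimal C sketch, assuming AES-NI support and the conventional <wmmintrin.h>/<immintrin.h> intrinsics, with the 11 round keys already expanded (key expansion itself, e.g. via AESKEYGENASSIST, is omitted):

    #include <immintrin.h>

    // Encrypt one 128-bit block with AES-128, given 11 pre-expanded round keys.
    static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
    {
        block = _mm_xor_si128(block, rk[0]);          // initial AddRoundKey
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, rk[i]);   // rounds 1..9 (AESENC)
        return _mm_aesenclast_si128(block, rk[10]);   // final round (AESENCLAST)
    }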
  1. ^ Intel, Avoiding AVX-SSE Transition Penalties, see section 3.3. Archived on Sep 20, 2024.
  2. ^ Intel, 2nd Generation Intel Core Processor Family Desktop Specification Update, order no. 324643-037, apr 2016, see erratum BJ49 on page 36. Archived from the original on 6 Jul 2017.
  3. ^ Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual order no. 325462-086, December 2024, Volume 2B, MOVD/MOVQ instruction entry, page 1289. Archived on 30 Dec 2024.
  4. ^ AMD, AMD64 Architecture Programmer’s Manual Volumes 1–5, pub.no. 40332, rev 4.08, April 2024, see entries for MOVD instruction on pages 2159 and 3040. Archived on 19 Jan 2025.
  5. ^ Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual order no. 325462-086, December 2024, Volume 3B section 10.1.1, page 3368. Archived on 30 Dec 2024.
  6. ^ AMD, AMD64 Architecture Programmer’s Manual Volumes 1–5, pub.no. 40332, rev 4.08, April 2024, Volume 2, section 7.3.2 on page 650. Archived on 19 Jan 2025.
  7. ^ GCC Bugzilla, Bug 104688 - gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX, see comments 34 and 38 for a statement from Zhaoxin on VMOVDQA atomicity. Archived on 12 Dec 2024.
  8. ^ Stack Overflow, SSE instructions: which CPUs can do atomic 16B memory operations? Archived on 30 Sep 2024.
  9. ^ a b Intel, Reference Implementations for Intel Architecture Approximation Instructions VRCP14, VRSQRT14, VRCP28, VRSQRT28, and VEXP2, id #671685, Dec 28, 2015. Archived on Sep 18, 2023.

    C code "recip14.c" archived on 18 Sep 2023.

  10. ^ Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual order no. 325462-086, December 2024, Volume 2B, VPEXTRD entry on page 1511 and VPINSRD entry on page 1530. Archived on 30 Dec 2024.
  11. ^ AMD, AMD64 Architecture Programmer’s Manual Volumes 1–5, pub.no. 40332, rev 4.08, April 2024, see entry for VPEXTRD on page 2302 and VPINSRD on page 2329. Archived on 19 Jan 2025.
  12. ^ AMD, Revision Guide for AMD Family 15h Models 00h-0Fh Processors, pub.no. 38603 rev 3.24, sep 2014, see erratum 592 on page 37. Archived on 22 Jan 2025.
  13. ^ Intel, 2nd Generation Intel Core Processor Family Desktop Specification Update, order no. 324643-037, apr 2016, see erratum BJ72 on page 43. Archived from the original on 6 Jul 2017.