A good reference on this topic is Agner Fog's Optimizing software in C++, Chapter 13 (available here) and his blog.
Also see a recent article here.
You are eligible for Intel compiler reimbursement if you meet some criteria.
-x switch | Processor Dispatch Routine |
---|---|
(none) | __intel_new_proc_init |
-xsse2 | __intel_new_proc_init |
-xsse3 -xP | __intel_new_proc_init_P |
-xssse3 -xsse3_atom -xT | __intel_new_proc_init_T |
-xsse4.1 -xS | __intel_new_proc_init_S |
-xsse4.2 -xH | __intel_new_proc_init_H |
-xavx -xG | __intel_new_proc_init_G |
-xcore-avx-i | __intel_new_proc_init_I |
-xcore-avx2 | __intel_new_proc_init_E |
All of the __intel_new_proc_init_* routines call __intel_cpu_indicator_init to determine the capability of the CPU. If the CPU is not capable enough, they call their respective routines to print (via the irc__print function) an error message such as:

Fatal Error: This program was not built to run on the processor in your system. The allowed processors are: Intel(R) processors with Intel(R) AVX instructions support.

Otherwise, they enable the DAZ (Denormals Are Zero) and FTZ (Flush To Zero) flags in the MXCSR (MMX extension control/status register). Therefore, it is enough for us to analyze __intel_cpu_indicator_init in detail.
__intel_new_proc_init relies on the cpuid instruction, which takes the value in the EAX register as input and puts its output in the EAX, EBX, ECX, and EDX registers. __intel_cpu_indicator_init only calls cpuid with EAX=0 (to get the vendor ID string and the maximum standard cpuid level) and EAX=1 (to get the feature flags). A detailed explanation of cpuid can be found here.
The pseudo code of __intel_cpu_indicator_init is as follows.
```
Call cpuid with EAX=0 and store the results in local variables.
Call cpuid with EAX=1 and store the results in local variables.

if (vendor ID string is NOT "GenuineIntel") {
    set "maximum standard cpuid level" to 0;
}
if (maximum standard cpuid level is 0) {
    /* The maximum standard cpuid level is the largest value EAX can have
       when calling cpuid.  If it is greater than 0, calling cpuid with
       EAX=1 returns the feature flags; otherwise there is no need to
       continue. */
    return;
}
if (CPU family is 15) {   /* i.e. Pentium 4 and derivatives */
    if (is SSE3 capable)
        __intel_cpu_indicator = SSE3;
    return;
}
if (CPU family is not 6) {
    /* Pentium Pro and everything that comes after Pentium 4 (Core 2,
       Nehalem, Sandy Bridge, etc.) are of family 6. */
    return;
}
__intel_cpu_indicator = SSE3;
if (is SSSE3 capable) __intel_cpu_indicator = SSSE3; else return;
if (has MOVBE instruction) __intel_cpu_indicator = MOVBE; else return;
/* MOVBE (Move Data After Swapping Bytes) reverses the byte order of a
   value during a move operation.  MOVBE is currently only available on
   Intel Atom processors. */
if (is SSE4.1 capable) __intel_cpu_indicator = SSE4.1; else return;
if (has POPCNT instruction and is SSE4.2 capable)
    __intel_cpu_indicator = SSE4.2; else return;
/* POPCNT, introduced with SSE4.2 (Intel) and SSE4a (AMD), counts the
   number of 1 bits in a word. */
if (has PCLMULQDQ instruction and is AES capable)
    __intel_cpu_indicator = PCLMULQDQ; else return;
/* PCLMULQDQ is part of the AES New Instructions (AES-NI).  It performs
   carry-less multiplication.  See here for an application in
   cryptography. */
if (XGETBV instruction is enabled by the OS) {   /* OSXSAVE flag */
    Call XGETBV with ECX=0 and get the result in EAX and EDX.
    /* XGETBV reads an Extended Control Register; ECX=0 selects XCR0. */
    if (is AVX capable)
        if (EAX shows that both the XMM and YMM states are enabled)
            __intel_cpu_indicator = AVX;
}
if (has F16C instructions) __intel_cpu_indicator = F16C;
/* F16C is the set of 16-bit floating-point conversion instructions. */
(more checks... for AVX2)
return;
```
Note that __intel_cpu_indicator_init checks for the availability of AVX using the XGETBV instruction. This is also the recommended approach in the Intel Advanced Vector Extensions Programming Reference.
One example is __intel_ssse3_memcpy, which makes use of _data_cache_size_half and _largest_cache_size_half.
```c
/* In the following, "__attribute__ ((constructor))" means this routine
   ("my_intel_cpu_indicator_override" below, but one can pick any name)
   is executed before main().

   Chapter 13 of Agner Fog's "Optimizing software in C++" uses a similar
   approach, but on Linux the linker ld will complain:
       "multiple definitions of __intel_cpu_indicator_init"
   This error can be fixed with the "-z muldefs" command-line option,
   which instructs the linker to accept multiple definitions.  When
   doing so, make sure __intel_cpu_indicator_init is in the same source
   file as the main() function. */
#ifdef __INTEL_COMPILER
void __attribute__ ((constructor)) my_intel_cpu_indicator_override()
{
    extern unsigned int __intel_cpu_indicator;
    __intel_cpu_indicator = 1<<11;
}
#endif
```

Here 1<<11 means the CPU should be recognized as SSE3 capable; 1<<12 as SSSE3 capable; 1<<14 as MOVBE capable; 1<<13 as SSE4.1 capable; 1<<15 as SSE4.2 & POPCNT capable; 1<<16 as PCLMULQDQ & AES capable; 1<<17 as AVX capable; 1<<18 as F16C and RDRAND capable; and 1<<22 as AVX2, BMI (bit manipulation instructions), LZCNT, and FMA capable.
If your code is already compiled with the Intel compiler version 12.0 or 13.0, you can still override it using this Perl script.
Why would anyone want to override it? One use case is AMD processors: __intel_cpu_indicator will be 1, the baseline case, even if they are SSE3 capable. Setting it to 1<<11 enables the optimal execution path on SSE3-capable processors.
If one wants to use Intel's optimized memory-operation functions on AMD processors, one needs to further set the following variables manually, because Intel's runtime routine _irc_init_cache_tbl checks the CPU vendor string and cannot obtain cache size information through the cpuid instruction if the CPU is not GenuineIntel:
```c
#ifdef __INTEL_COMPILER
void __intel_init_mem_ops_method() __attribute__ ((weak));

void __attribute__ ((constructor)) my_intel_init_mem_ops_method()
{
    if (__intel_init_mem_ops_method) {
        extern unsigned int __intel_memcpy_mem_ops_method,
                            _data_cache_size, _data_cache_size_half,
                            __intel_memcpy_largest_cache_size,
                            _largest_cache_size_half,
                            __intel_memcpy_largest_cachelinesize;

        /* initialize _irc_cache_tbl */
        __intel_init_mem_ops_method();

        /* override the cache parameters */
        __intel_memcpy_mem_ops_method = 2;  /* 2 = SSE2 capable */

        /* for AMD Shanghai processors, 64 KB L1 data cache */
        _data_cache_size = 65536;
        _data_cache_size_half = _data_cache_size/2;

        /* for AMD Shanghai processors, 6 MB L3 cache */
        __intel_memcpy_largest_cache_size = 6291456;
        _largest_cache_size_half = __intel_memcpy_largest_cache_size/2;

        __intel_memcpy_largest_cachelinesize = 64;
    }
}
#endif
```

Make sure you know what you are doing. As mentioned earlier, __intel_cpu_indicator is used by Intel's optimized functions, and your program could end up with an "illegal instruction" error if the CPU does not support the instructions you specified. Even if your program does not call the tuned versions of the math functions, the -x switch can generate code that uses instructions your CPU does not support.
DAZ (Denormals Are Zero) and FTZ (Flush To Zero) are not part of the IEEE-754 standard, but they can speed up floating-point arithmetic:
DAZ and FTZ only affect SSE instructions but not the traditional x87 instructions.
If any optimization option is used, the Intel compiler will insert code to enable DAZ/FTZ, unless the -ftz- switch is also used.
Depending on the capability of the CPU, there are different ways to detect & enable DAZ/FTZ. They are just different ways to manipulate the DAZ/FTZ bits in the MXCSR (MMX Extension Control/Status Register). For details, see Chapter 7 and Code Example 9.4 in Intel Processor Identification and the CPUID Instruction or this link.
On a side note, gcc can enable DAZ/FTZ by the -ffast-math switch.
daxpy implementation function | SSE instruction set |
---|---|
mkl_blas_def_xdaxpy | The default, untuned version, which works for SSE capable processors. |
mkl_blas_p4n_xdaxpy | SSE2 version (Pentium 4 processors or better). |
mkl_blas_mc_xdaxpy | Supplemental SSE 3 version (Core/Merom processors or better). |
mkl_blas_mc3_xdaxpy | SSE4.2 version (Nehalem processors or better). |
mkl_blas_avx_xdaxpy | AVX version (Sandy Bridge processors or better). |
mkl_blas_avx2_xdaxpy | AVX2 version (Haswell processors or better). [Since MKL version 11.0] |
And the 32-bit libmkl_core.a:
daxpy implementation function | SSE instruction set |
---|---|
mkl_blas_def_xdaxpy | The default, untuned version, which works for SSE capable processors. |
mkl_blas_p4_xdaxpy | SSE2 version (Pentium 4 processors or better). |
mkl_blas_p4p_xdaxpy | SSE3 version (Pentium 4 Prescott processors or better). |
mkl_blas_p4m_xdaxpy | Supplemental SSE 3 version (Core/Merom processors or better). |
mkl_blas_p4m3_xdaxpy | SSE4.2 version (Nehalem processors or better). |
mkl_blas_avx_xdaxpy | AVX version (Sandy Bridge processors or better). |
If you instead choose dynamic linking, then these functions are always called mkl_blas_xdaxpy, and you can find them in libmkl_def.so, libmkl_p4n.so, libmkl_mc.so, libmkl_mc3.so, libmkl_avx.so, libmkl_avx2.so, etc.
Anyway, in x86_64 MKL version 10.2.5/10.3/11.0, one can find the following Processor Dispatch (TM) code:
MKL dynamic link library file | Function | Purpose |
---|---|---|
libmkl_core.so | mkl_serv_intel_cpu | Check for Intel processor |
libmkl_core.so | MKL_CPUisINTEL | Check for Intel processor |
libmkl_core.so | mkl_serv_cpuhasnhm | Check for SSE 4.2 (nhm=Nehalem) |
libmkl_core.so | mkl_serv_cpuhaspnr | Check for SSE 4.1 (pnr=Penryn) |
libmkl_core.so | MKL_CPUhasNHMWST | Check for AES (WST=Westmere) |
libmkl_core.so | mkl_serv_cpuisitbarcelona | Check for AMD Barcelona processor |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_getCPUisintel (or mkl_vml_serv_getCPUisintel) | Check for Intel processor |
libmkl_{intel|gf}_{lp|ilp}64.so | mkl_vml_serv_CPUisHSW | Check for AVX2 (HSW=Haswell) |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisGSSE (or mkl_vml_serv_CPUisGSSE) | Check for AVX |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE42 (or mkl_vml_serv_CPUisSSE42) | Check for SSE 4.2 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE41 (or mkl_vml_serv_CPUisSSE41) | Check for SSE 4.1 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE4 (or mkl_vml_serv_CPUisSSE4) | Check for Supplemental SSE 3 |
In x86_64 MKL version 10.2.2, one can find the following code:
MKL dynamic link library file | Function | Purpose |
---|---|---|
libmkl_core.so | mkl_serv_intel_cpu | Check for Intel processor |
libmkl_core.so | MKL_CPUisINTEL | Check for Intel processor |
libmkl_core.so | MKL_CPUhasNHMx | Check for SSE 4.2 (NHMx=Nehalem) |
libmkl_core.so | mkl_serv_cpuhasnhm | Check for SSE 4.2 (nhm=Nehalem) |
libmkl_core.so | mkl_serv_cpuhaspnr | Check for SSE 4.1 (pnr=Penryn) |
libmkl_core.so | MKL_CPUhasMNI | Check for Supplemental SSE 3 (MNI=Merom New Instructions) |
libmkl_core.so | MKL_CPUhasSSE3 | Check for SSE 3 |
libmkl_core.so | MKL_CPUhasAVX | Check for AVX |
libmkl_core.so | mkl_serv_cpuisitbarcelona | Check for AMD Barcelona processor |
libmkl_{intel|gnu|pgi}_thread.so | GetAPIC_ID | Get the APIC ID. This is used to determine the processor/core topology/enumeration. See here or here for more info. Why does MKL need to know this? Because when running the multi-threaded version of MKL, it will by default ignore the "extra" logical cores created by Hyper-Threading technology. |
libmkl_{intel|gnu|pgi}_thread.so | MaxCorePerPhysicalProc | Get number of cores per physical processor. This can help optimize cache usage. |
libmkl_{intel|gnu|pgi}_thread.so | MaxLogicalProcPerPhysicalProc | Get number of logic cores per physical processor. |
libmkl_{intel|gnu|pgi}_thread.so | GetCpuIdInfo | Check for Intel processor |
libmkl_{intel|gnu|pgi}_thread.so | CountProcNum_omp | Check for Intel processor and count the number of processors |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_getCPUisintel | Check for Intel processor |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisGSSE | Check for AVX |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE42 | Check for SSE 4.2 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE41 | Check for SSE 4.1 |
libmkl_{intel|gf}_{lp|ilp}64.so | _vmlserv_CPUisSSE4 | Check for Supplemental SSE 3 |
MKL internal parameter | Purpose |
---|---|
disable_fast_mm | Enable/disable fast memory management. One should instead use the environmental variable MKL_DISABLE_FAST_MM. Fast memory management is used only when MKL allocates certain sizes of memory chunks in certain BLAS functions (e.g. dgemm) |
__MKL_CPU_MicroArchitecture | The CPU microarchitecture. This is not as useful as MKL_DEBUG_CPU_TYPE. See the MKL_DEBUG_CPU_MA section here |
itisBarcelona | When __MKL_CPU_MicroArchitecture is 0, this parameter indicates whether the CPU is AMD Barcelona or not. |
mkl_cpu_type | SSE instruction set level. It has the same value as the environmental variable MKL_DEBUG_CPU_TYPE |
__HT | Intel Hyper-Threading technology is present or not |
__N_Logical_Cores __N_Physical_Cores __N_CPU_Packages __N_Cores_per_Packages | Processor topology. __N_Physical_Cores is used to determine the number of threads to be used. One should use the environmental variables OMP_NUM_THREADS or MKL_NUM_THREADS instead. |
MKL_cache_sizes | The levels of on-chip cache and their sizes in bytes. |
If one really needs to modify these internal parameters in the program, use this code snippet.
As of versions 4.0.0 and 4.0.1, Intel MPI has its own additional processor dispatch code to determine the algorithms for collective operations. First, Intel MPI's MPD (multi-purpose daemon) script (mpd or mpd.py) contains a function called pin_Topology, which executes the cpuinfo utility in the same directory with a single command-line argument, p. The MPD script then reads the output of this command and sets I_MPI_INFO_-prefixed environmental variables (e.g. I_MPI_INFO_STATE, I_MPI_INFO_C_NAME, I_MPI_INFO_CACHE1, etc.), which are then read by the Intel MPI run-time code, e.g. libmpi.so. Based on the values of these environmental variables, the Intel MPI run-time code sets an internal variable called I_MPI_Platform, which is used to determine the algorithms for collective operations (I_MPI_COLL_DEFAULT, I_MPI_COLL_DEFAULT_HTN, I_MPI_COLL_DEFAULT_NHM, I_MPI_COLL_DEFAULT_WSM).
There is an undocumented environmental variable called I_MPI_PLATFORM which allows users to override the default value for I_MPI_Platform. See here for more info.