Sayed Adel, Matti Picus, Ralf Gommers
While compilers are getting better at using hardware-specific routines to
optimize code, they sometimes do not produce optimal results. Also, we would
like to be able to copy binary optimized C-extension modules from one machine
to another with the same base architecture (x86, ARM, or PowerPC) but with
different capabilities without recompiling.
We have a mechanism in the ufunc machinery to build alternative loops
indexed by CPU feature name. At import (in InitOperators), the loop
function that matches the run-time CPU info is chosen from the candidates.This
NEP proposes a mechanism to build on that for many more features and
architectures. The steps proposed are to:
Establish a set of well-defined, architecture-agnostic, universal intrisics
which capture features available across architectures.
Capture these universal intrisics in a set of C macros and use the macros
to build code paths for sets of features from the baseline up to the maximum
set of features available on that architecture. Offer these as a limited
number of compiled alternative code paths.
At runtime, discover which CPU features are available, and choose from among
the possible code paths accordingly.
Traditionally NumPy has depended on compilers to generate optimal code
specifically for the target architecture.
However few users today compile NumPy locally for their machines. Most use the
binary packages which must provide run-time support for the lowest-common
denominator CPU architecture. Thus NumPy cannot take advantage of
more advanced features of their CPU processors, since they may not be available
on all users’ systems.
Traditionally, CPU features have been exposed through intrinsics which are
compiler-specific instructions that map directly to assembly instructions.
Recently there were discussions about the effectiveness of adding more
intrinsics (e.g., gh-11113 for AVX optimizations for floats). In the past,
architecture-specific code was added to NumPy for fast avx512 routines in
various ufuncs, using the mechanism described above to choose the best loop
for the architecture. However the code is not generic and does not generalize
to other architectures.
Recently, OpenCV moved to using universal intrinsics in the Hardware
Abstraction Layer (HAL) which provided a nice abstraction for common shared
Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar
mechanism for NumPy. There are three stages to using the mechanism:
Infrastructure is provided in the code for abstract intrinsics. The ufunc
machinery will be extended using sets of these abstract intrinsics, so that
a single ufunc will be expressed as a set of loops, going from a minimal to
a maximal set of possibly availabe intrinsics.
At compile time, compiler macros and CPU detection are used to turn the
abstract intrinsics into concrete intrinsic calls. Any intrinsics not
available on the platform, either because the CPU does not support them
(and so cannot be tested) or because the abstract intrinsic does not have a
parallel concrete intrinsic on the platform will not error, rather the
corresponding loop will not be produced and added to the set of
At runtime, the CPU detection code will further limit the set of loops
available, and the optimal one will be chosen for the ufunc.
The current NEP proposes only to use the runtime feature detection and optimal
loop selection mechanism for ufuncs. Future NEPS may propose other uses for the
The ufunc machinery already has the ability to select an optimal loop for
specifically available CPU features at runtime, currently used for avx2,
fma and avx512f loops (in the generated __umath_generated.c file);
universal intrinsics would extend the generated code to include more loop
The end user will be able to get a list of intrinsics available for their
platform and compiler. Optionally,
the user may be able to specify which of the loops available at runtime will be
used, perhaps via an environment variable to enable benchmarking the impact of
the different loops. There should be no direct impact to naive end users, the
results of all the loops should be identical to within a small number (1-3?)
ULPs. On the other hand, users with more powerful machines should notice a
significant performance boost.
The binaries released by this process will be larger since they include all
possible loops for the architecture. Some packagers may prefer to limit the
number of loops in order to limit the size of the binaries, we would hope they
would still support a wide range of families of architectures. Note this
problem already exists in the Intel MKL offering, where the binary package
includes an extensive set of alternative shared objects (DLLs) for various CPU
See “Detailed Description” below. A source build where the packager knows
details of the target machine could theoretically produce a smaller binary by
choosing to compile only the loops needed by the target via command line
Adding more code which use intrinsics will make the code harder to maintain.
Therefore, such code should only be added if it yields a significant
performance benefit. Assessing this performance benefit can be nontrivial.
To aid with this, the implementation for this NEP will add a way to select
which instruction sets can be used at runtime via environment variables.
(name TBD). This ablility is critical for CI code verification.
A new dictionary __cpu_features__ will be available to python. The keys are
the available features, the value is a boolean whether the feature is available
or not. Various new private
C functions will be used internally to query available features. These
might be exposed via specific c-extension modules for testing.
NumPy will always have a baseline C implementation for any code that may be
a candidate for SIMD vectorization. If a contributor wants to add SIMD
support for some architecture (typically the one of most interest to them),
this comment is the beginning of a tutorial on how to do so:
As of this moment, NumPy has a number of avx512f and avx2 and fma
SIMD loops for many ufuncs. These would likely be the first candidates
to be ported to universal intrinsics. The expectation is that the new
implementation may cause a regression in benchmarks, but not increase the
size of the binary. If the regression is not minimal, we may choose to keep
the X86-specific code for that platform and use the universal intrisic code
for other platforms.
Any new PRs to implement ufuncs using intrinsics will be expected to use the
universal intrinsics. If it can be demonstrated that the use of universal
intrinsics is too awkward or is not performant enough, platform specific code
may be accepted as well. In rare cases, a single-platform only PR may be
accepted, but it would have to be examined within the framework of preferring
a solution using universal intrinsics.
The subjective criteria for accepting new loops are:
correctness: the new code must not decrease accuracy by more than 1-3 ULPs
even at edge points in the algorithm.
code bloat: both source code size and especially binary size of the compiled
maintainability: how readable is the code
performance: benchmarks must show a significant performance boost
If a contributor wants to use a platform-specific SIMD instruction that is not
yet supported as a universal intrinsic, then:
It should be added as a universal intrinsic for all platforms
If it does not have an equivalent instruction on other platforms (e.g.
_mm512_mask_i32gather_ps in AVX512), then no universal intrinsic
should be added and a platform-specific ufunc or a short helper fuction
should be written instead. If such a helper function is used, it must be
wrapped with the feature macros, and a reasonable non-intrinsic fallback to
be used by default.
We expect (2) to be the exception. The contributor and maintainers should
consider whether that single-platform intrinsic is worth it compared to using
the best available universal intrinsic based implementation.
It would be nice if the universal intrinsics would be available to other
libraries like SciPy or Astropy that also build ufuncs, but that is not an
explicit goal of the first implementation of this NEP.
There should be no impact on backwards compatibility.
The CPU-specific are mapped to unversal intrinsics which are
similar for all x86 SIMD variants, ARM SIMD variants etc. For example, the
NumPy universal intrinsic npyv_load_u32 maps to:
vld1q_u32 for ARM based NEON
_mm256_loadu_si256 for x86 based AVX2
_mm512_loadu_si512 for x86 based AVX-512
Anyone writing a SIMD loop will use the npyv_load_u32 macro instead of the
architecture specific intrinsic. The code also supplies guard macros for
compilation and runtime, so that the proper loops can be chosen.
Two new build options are available to runtests.py and setup.py:
--cpu-baseline and --cpu-dispatch.
The absolute minimum required features to compile are defined by
--cpu-baseline. For instance, on x86_64 this defaults to SSE3. The
minimum features will be enabled if the compiler support it. The
set of additional intrinsics that can be detected and used as sets of
requirements to dispatch on are set by --cpu-dispatch. For instance, on
x86_64 this defaults to [SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
AVX512_CLX, AVX512_CNL, AVX512_ICL]. These features are all mapped to a
c-level boolean array npy__cpu_have, and a c-level convenience function
npy_cpu_have(int feature_id) queries this array, and the results are stored
in __cpu_features__ at runtime.
[SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP,
FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX,
AVX512_CLX, AVX512_CNL, AVX512_ICL]
When importing the ufuncs, the available compiled loops’ required features are
matched to the ones discovered. The loop with the best match is marked to be
called by the ufunc.
Pixman is the library used by Cairo and X to manipulate pixels. It uses
a technique like the one described here to fill a structure with function
pointers at runtime. These functions are similar to ufunc loops.
Eigen is a C++ template library for linear algebra: matrices, vectors,
numerical solvers, and related algorithms. It is a higher level-abstraction
than the intrinsics discussed here.
xsimd is a header-only C++ library for x86 and ARM that implements the
mathematical functions used in the algorithms of boost.SIMD.
Simd is a high-level image processing and machine learning library with
optimizations for different platforms.
OpenCV used to have the one-implementation-per-architecture design, but more
recently moved to a design that is quite similar to what is proposed in this
NEP. The top-level dispatch code includes a generic header that is
specialized at compile time by the CMakefile system.
VOLK is a GPL3 library used by gnuradio and others to abstract SIMD
intrinsics. They offer a set of high-level operations which have been
optimized for each architecture.
The C++ Standards Committee has proposed class templates for portable
SIMD programming via vector types, and namespaces for the templates.
gh-13421 improve runtime detection of CPU features
gh-13516: enable multi-platform SIMD compiler optimizations
The compile-time and runtime code infrastructure are supplied by the first PR.
The second adds a demonstration of use of the infrastructure for a loop. Once
the NEP is approved, more work is needed to write loops using the machnisms
provided by the NEP.
A proposed alternative in gh-13516 is to implement loops for each CPU
architecture separately by hand, without trying to abstract common patterns in
the SIMD intrinsics (e.g., have loops.avx512.c.src, loops.avx2.c.src,
loops.sse.c.src, loops.vsx.c.src, loops.neon.c.src, etc.). This is more
similar to what PIXMAX does. There’s a lot of duplication here though, and the
manual code duplication requires a champion who will be dedicated to
implementing and maintaining that platform’s loop code.
Most of the discussion took place on the PR gh-15228 to accecpt this NEP.
Discussion on the mailing list mentioned VOLK which was added to
the section on related work. The question of maintainability also was raised
both on the mailing list and in gh-15228 and resolved as follows:
If contributors want to leverage a specific SIMD instruction, will they be
expected to add software implementation of this instruction for all other
architectures too? (see the new-intrinsics part of the workflow).
On whom does the burden lie to verify the code and benchmarks for all
architectures? What happens if adding a universal ufunc in place of
architecture-specific code helps one architecture but harms performance
on another? (answered in the tradeoffs part of the workflow).
Each NEP must either be explicitly labeled as placed in the public domain (see
this NEP as an example) or licensed under the Open Publication License.
This document has been placed in the public domain. 1