NEP 38 — Using SIMD optimization instructions for performance#


Sayed Adel, Matti Picus, Ralf Gommers









While compilers are getting better at using hardware-specific routines to optimize code, they sometimes do not produce optimal results. Also, we would like to be able to copy binary optimized C-extension modules from one machine to another with the same base architecture (x86, ARM, or PowerPC) but with different capabilities without recompiling.

We have a mechanism in the ufunc machinery to build alternative loops indexed by CPU feature name. At import (in InitOperators), the loop function that matches the run-time CPU info is chosen from the candidates.This NEP proposes a mechanism to build on that for many more features and architectures. The steps proposed are to:

  • Establish a set of well-defined, architecture-agnostic, universal intrinsics which capture features available across architectures.

  • Capture these universal intrinsics in a set of C macros and use the macros to build code paths for sets of features from the baseline up to the maximum set of features available on that architecture. Offer these as a limited number of compiled alternative code paths.

  • At runtime, discover which CPU features are available, and choose from among the possible code paths accordingly.

Motivation and scope#

Traditionally NumPy has depended on compilers to generate optimal code specifically for the target architecture. However few users today compile NumPy locally for their machines. Most use the binary packages which must provide run-time support for the lowest-common denominator CPU architecture. Thus NumPy cannot take advantage of more advanced features of their CPU processors, since they may not be available on all users’ systems.

Traditionally, CPU features have been exposed through intrinsics which are compiler-specific instructions that map directly to assembly instructions. Recently there were discussions about the effectiveness of adding more intrinsics (e.g., gh-11113 for AVX optimizations for floats). In the past, architecture-specific code was added to NumPy for fast avx512 routines in various ufuncs, using the mechanism described above to choose the best loop for the architecture. However the code is not generic and does not generalize to other architectures.

Recently, OpenCV moved to using universal intrinsics in the Hardware Abstraction Layer (HAL) which provided a nice abstraction for common shared Single Instruction Multiple Data (SIMD) constructs. This NEP proposes a similar mechanism for NumPy. There are three stages to using the mechanism:

  • Infrastructure is provided in the code for abstract intrinsics. The ufunc machinery will be extended using sets of these abstract intrinsics, so that a single ufunc will be expressed as a set of loops, going from a minimal to a maximal set of possibly available intrinsics.

  • At compile time, compiler macros and CPU detection are used to turn the abstract intrinsics into concrete intrinsic calls. Any intrinsics not available on the platform, either because the CPU does not support them (and so cannot be tested) or because the abstract intrinsic does not have a parallel concrete intrinsic on the platform will not error, rather the corresponding loop will not be produced and added to the set of possibilities.

  • At runtime, the CPU detection code will further limit the set of loops available, and the optimal one will be chosen for the ufunc.

The current NEP proposes only to use the runtime feature detection and optimal loop selection mechanism for ufuncs. Future NEPS may propose other uses for the proposed solution.

The ufunc machinery already has the ability to select an optimal loop for specifically available CPU features at runtime, currently used for avx2, fma and avx512f loops (in the generated __umath_generated.c file); universal intrinsics would extend the generated code to include more loop variants.

Usage and impact#

The end user will be able to get a list of intrinsics available for their platform and compiler. Optionally, the user may be able to specify which of the loops available at runtime will be used, perhaps via an environment variable to enable benchmarking the impact of the different loops. There should be no direct impact to naive end users, the results of all the loops should be identical to within a small number (1-3?) ULPs. On the other hand, users with more powerful machines should notice a significant performance boost.

Binary releases - wheels on PyPI and conda packages#

The binaries released by this process will be larger since they include all possible loops for the architecture. Some packagers may prefer to limit the number of loops in order to limit the size of the binaries, we would hope they would still support a wide range of families of architectures. Note this problem already exists in the Intel MKL offering, where the binary package includes an extensive set of alternative shared objects (DLLs) for various CPU alternatives.

Source builds#

See “Detailed Description” below. A source build where the packager knows details of the target machine could theoretically produce a smaller binary by choosing to compile only the loops needed by the target via command line arguments.

How to run benchmarks to assess performance benefits#

Adding more code which use intrinsics will make the code harder to maintain. Therefore, such code should only be added if it yields a significant performance benefit. Assessing this performance benefit can be nontrivial. To aid with this, the implementation for this NEP will add a way to select which instruction sets can be used at runtime via environment variables. (name TBD). This ability is critical for CI code verification.


A new dictionary __cpu_features__ will be available to python. The keys are the available features, the value is a boolean whether the feature is available or not. Various new private C functions will be used internally to query available features. These might be exposed via specific c-extension modules for testing.

Workflow for adding a new CPU architecture-specific optimization#

NumPy will always have a baseline C implementation for any code that may be a candidate for SIMD vectorization. If a contributor wants to add SIMD support for some architecture (typically the one of most interest to them), this comment is the beginning of a tutorial on how to do so: numpy/numpy#13516

As of this moment, NumPy has a number of avx512f and avx2 and fma SIMD loops for many ufuncs. These would likely be the first candidates to be ported to universal intrinsics. The expectation is that the new implementation may cause a regression in benchmarks, but not increase the size of the binary. If the regression is not minimal, we may choose to keep the X86-specific code for that platform and use the universal intrinsic code for other platforms.

Any new PRs to implement ufuncs using intrinsics will be expected to use the universal intrinsics. If it can be demonstrated that the use of universal intrinsics is too awkward or is not performant enough, platform specific code may be accepted as well. In rare cases, a single-platform only PR may be accepted, but it would have to be examined within the framework of preferring a solution using universal intrinsics.

The subjective criteria for accepting new loops are:

  • correctness: the new code must not decrease accuracy by more than 1-3 ULPs even at edge points in the algorithm.

  • code bloat: both source code size and especially binary size of the compiled wheel.

  • maintainability: how readable is the code

  • performance: benchmarks must show a significant performance boost

Adding a new intrinsic#

If a contributor wants to use a platform-specific SIMD instruction that is not yet supported as a universal intrinsic, then:

  1. It should be added as a universal intrinsic for all platforms

  2. If it does not have an equivalent instruction on other platforms (e.g. _mm512_mask_i32gather_ps in AVX512), then no universal intrinsic should be added and a platform-specific ufunc or a short helper function should be written instead. If such a helper function is used, it must be wrapped with the feature macros, and a reasonable non-intrinsic fallback to be used by default.

We expect (2) to be the exception. The contributor and maintainers should consider whether that single-platform intrinsic is worth it compared to using the best available universal intrinsic based implementation.

Reuse by other projects#

It would be nice if the universal intrinsics would be available to other libraries like SciPy or Astropy that also build ufuncs, but that is not an explicit goal of the first implementation of this NEP.

Backward compatibility#

There should be no impact on backwards compatibility.

Detailed description#

The CPU-specific are mapped to universal intrinsics which are similar for all x86 SIMD variants, ARM SIMD variants etc. For example, the NumPy universal intrinsic npyv_load_u32 maps to:

  • vld1q_u32 for ARM based NEON

  • _mm256_loadu_si256 for x86 based AVX2

  • _mm512_loadu_si512 for x86 based AVX-512

Anyone writing a SIMD loop will use the npyv_load_u32 macro instead of the architecture specific intrinsic. The code also supplies guard macros for compilation and runtime, so that the proper loops can be chosen.

Two new build options are available to and --cpu-baseline and --cpu-dispatch. The absolute minimum required features to compile are defined by --cpu-baseline. For instance, on x86_64 this defaults to SSE3. The minimum features will be enabled if the compiler support it. The set of additional intrinsics that can be detected and used as sets of requirements to dispatch on are set by --cpu-dispatch. For instance, on x86_64 this defaults to [SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP, FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX, AVX512_CLX, AVX512_CNL, AVX512_ICL]. These features are all mapped to a c-level boolean array npy__cpu_have, and a c-level convenience function npy_cpu_have(int feature_id) queries this array, and the results are stored in __cpu_features__ at runtime.

When importing the ufuncs, the available compiled loops’ required features are matched to the ones discovered. The loop with the best match is marked to be called by the ufunc.


Current PRs:

The compile-time and runtime code infrastructure are supplied by the first PR. The second adds a demonstration of use of the infrastructure for a loop. Once the NEP is approved, more work is needed to write loops using the mechanisms provided by the NEP.


A proposed alternative in gh-13516 is to implement loops for each CPU architecture separately by hand, without trying to abstract common patterns in the SIMD intrinsics (e.g., have loops.avx512.c.src, loops.avx2.c.src, loops.sse.c.src, loops.vsx.c.src, loops.neon.c.src, etc.). This is more similar to what PIXMAX does. There’s a lot of duplication here though, and the manual code duplication requires a champion who will be dedicated to implementing and maintaining that platform’s loop code.


Most of the discussion took place on the PR gh-15228 to accept this NEP. Discussion on the mailing list mentioned VOLK which was added to the section on related work. The question of maintainability also was raised both on the mailing list and in gh-15228 and resolved as follows:

  • If contributors want to leverage a specific SIMD instruction, will they be expected to add software implementation of this instruction for all other architectures too? (see the new-intrinsics part of the workflow).

  • On whom does the burden lie to verify the code and benchmarks for all architectures? What happens if adding a universal ufunc in place of architecture-specific code helps one architecture but harms performance on another? (answered in the tradeoffs part of the workflow).

References and footnotes#