NEP 35 — Array Creation Dispatching With __array_function__

Author

Peter Andreas Entschev <pentschev@nvidia.com>

Status

Draft

Type

Standards Track

Created

2019-10-15

Updated

2020-08-17

Resolution

Abstract

We propose the introduction of a new keyword argument like= to all array creation functions to address one of the shortcomings of __array_function__, as described by NEP 18 1. The like= keyword argument will create an instance of the argument’s type, enabling direct creation of non-NumPy arrays. The target array type must implement the __array_function__ protocol.

Motivation and Scope

Many libraries implement the NumPy API, such as Dask for graph computing, CuPy for GPGPU computing, xarray for N-D labeled arrays, etc. Underneath, they have adopted the __array_function__ protocol which allows NumPy to understand and treat downstream objects as if they are the native numpy.ndarray object. Hence the community while using various libraries still benefits from a unified NumPy API. This not only brings great convenience for standardization but also removes the burden of learning a new API and rewriting code for every new object. In more technical terms, this mechanism of the protocol is called a “dispatcher”, which is the terminology we use from here onwards when referring to that.

x = dask.array.arange(5)    # Creates dask.array
np.diff(x)                  # Returns dask.array

Note above how we called Dask’s implementation of diff via the NumPy namespace by calling np.diff, and the same would apply if we had a CuPy array or any other array from a library that adopts __array_function__. This allows writing code that is agnostic to the implementation library, thus users can write their code once and still be able to use different array implementations according to their needs.

Obviously, having a protocol in-place is useful if the arrays are created elsewhere and let NumPy handle them. But still these arrays have to be started in their native library and brought back. Instead if it was possible to create these objects through NumPy API then there would be an almost complete experience, all using NumPy syntax. For example, say we have some CuPy array cp_arr, and want a similar CuPy array with identity matrix. We could still write the following:

x = cupy.identity(3)

Instead, the better way would be using to only use the NumPy API, this could now be achieved with:

x = np.identity(3, like=cp_arr)

As if by magic, x will also be a CuPy array, as NumPy was capable to infer that from the type of cp_arr. Note that this last step would not be possible without like=, as it would be impossible for the NumPy to know the user expects a CuPy array based only on the integer input.

The new like= keyword proposed is solely intended to identify the downstream library where to dispatch and the object is used only as reference, meaning that no modifications, copies or processing will be performed on that object.

We expect that this functionality will be mostly useful to library developers, allowing them to create new arrays for internal usage based on arrays passed by the user, preventing unnecessary creation of NumPy arrays that will ultimately lead to an additional conversion into a downstream array type.

Support for Python 2.7 has been dropped since NumPy 1.17, therefore we make use of the keyword-only argument standard described in PEP-3102 2 to implement like=, thus preventing it from being passed by position.

Usage and Impact

NumPy users who don’t use other arrays from downstream libraries can continue to use array creation routines without a like= argument. Using like=np.ndarray will work as if no array was passed via that argument. However, this will incur additional checks that will negatively impact performance.

To understand the intended use for like=, and before we move to more complex cases, consider the following illustrative example consisting only of NumPy and CuPy arrays:

import numpy as np
import cupy

def my_pad(arr, padding):
    padding = np.array(padding, like=arr)
    return np.concatenate((padding, arr, padding))

my_pad(np.arange(5), [-1, -1])    # Returns np.ndarray
my_pad(cupy.arange(5), [-1, -1])  # Returns cupy.core.core.ndarray

Note in the my_pad function above how arr is used as a reference to dictate what array type padding should have, before concatenating the arrays to produce the result. On the other hand, if like= wasn’t used, the NumPy case would still work, but CuPy wouldn’t allow this kind of automatic conversion, ultimately raising a TypeError: Only cupy arrays can be concatenated exception.

Now we should look at how a library like Dask could benefit from like=. Before we understand that, it’s important to understand a bit about Dask basics and ensures correctness with __array_function__. Note that Dask can perform computations on different sorts of objects, like dataframes, bags and arrays, here we will focus strictly on arrays, which are the objects we can use __array_function__ with.

Dask uses a graph computing model, meaning it breaks down a large problem in many smaller problems and merges their results to reach the final result. To break the problem down into smaller ones, Dask also breaks arrays into smaller arrays that it calls “chunks”. A Dask array can thus consist of one or more chunks and they may be of different types. However, in the context of __array_function__, Dask only allows chunks of the same type; for example, a Dask array can be formed of several NumPy arrays or several CuPy arrays, but not a mix of both.

To avoid mismatched types during computation, Dask keeps an attribute _meta as part of its array throughout computation: this attribute is used to both predict the output type at graph creation time, and to create any intermediary arrays that are necessary within some function’s computation. Going back to our previous example, we can use _meta information to identify what kind of array we would use for padding, as seen below:

import numpy as np
import cupy
import dask.array as da
from dask.array.utils import meta_from_array

def my_dask_pad(arr, padding):
    padding = np.array(padding, like=meta_from_array(arr))
    return np.concatenate((padding, arr, padding))

# Returns dask.array<concatenate, shape=(9,), dtype=int64, chunksize=(5,), chunktype=numpy.ndarray>
my_dask_pad(da.arange(5), [-1, -1])

# Returns dask.array<concatenate, shape=(9,), dtype=int64, chunksize=(5,), chunktype=cupy.ndarray>
my_dask_pad(da.from_array(cupy.arange(5)), [-1, -1])

Note how chunktype in the return value above changes from numpy.ndarray in the first my_dask_pad call to cupy.ndarray in the second. We have also renamed the function to my_dask_pad in this example with the intent to make it clear that this is how Dask would implement such functionality, should it need to do so, as it requires Dask’s internal tools that are not of much use elsewhere.

To enable proper identification of the array type we use Dask’s utility function meta_from_array, which was introduced as part of the work to support __array_function__, allowing Dask to handle _meta appropriately. Readers can think of meta_from_array as a special function that just returns the type of the underlying Dask array, for example:

np_arr = da.arange(5)
cp_arr = da.from_array(cupy.arange(5))

meta_from_array(np_arr)  # Returns a numpy.ndarray
meta_from_array(cp_arr)  # Returns a cupy.ndarray

Since the value returned by meta_from_array is a NumPy-like array, we can just pass that directly into the like= argument.

The meta_from_array function is primarily targeted at the library’s internal usage to ensure chunks are created with correct types. Without the like= argument, it would be impossible to ensure my_pad creates a padding array with a type matching that of the input array, which would cause a TypeError exception to be raised by CuPy, as discussed above would happen to the CuPy case alone. Combining Dask’s internal handling of meta arrays and the proposed like= argument, it now becomes possible to handle cases involving creation of non-NumPy arrays, which is likely the heaviest limitation Dask currently faces from the __array_function__ protocol.

Backward Compatibility

This proposal does not raise any backward compatibility issues within NumPy, given that it only introduces a new keyword argument to existing array creation functions with a default None value, thus not changing current behavior.

Detailed description

The introduction of the __array_function__ protocol allowed downstream library developers to use NumPy as a dispatching API. However, the protocol did not – and did not intend to – address the creation of arrays by downstream libraries, preventing those libraries from using such important functionality in that context.

The purpose of this NEP is to address that shortcoming in a simple and straighforward way: introduce a new like= keyword argument, similar to how the empty_like family of functions work. When array creation functions receive such an argument, they will trigger the __array_function__ protocol, and call the downstream library’s own array creation function implementation. The like= argument, as its own name suggests, shall be used solely for the purpose of identifying where to dispatch. In contrast to the way __array_function__ has been used so far (the first argument identifies the target downstream library), and to avoid breaking NumPy’s API with regards to array creation, the new like= keyword shall be used for the purpose of dispatching.

Downstream libraries will benefit from the like= argument without any changes to their API, given the argument is of exclusive implementation in NumPy. It will still be required that downstream libraries implement the __array_function__ protocol, as described by NEP 18 1, and appropriately introduce the argument to their calls to NumPy array creation functions, as exemplified in Usage and Impact.

Implementation

The implementation requires introducing a new like= keyword to all existing array creation functions of NumPy. As examples of functions that would add this new argument (but not limited to) we can cite those taking array-like objects such as array and asarray, functions that create arrays based on numerical inputs such as range and identity, as well as the empty family of functions, even though that may be redundant, since specializations for those already exist with the naming format empty_like. As of the writing of this NEP, a complete list of array creation functions can be found in 5.

This newly proposed keyword shall be removed by the __array_function__ mechanism from the keyword dictionary before dispatching. The purpose for this is twofold:

  1. The object will have no use in the downstream library’s implementation; and

  2. Simplifies adoption of array creation by those libraries already opting-in to implement the __array_function__ protocol, thus removing the requirement to explicitly opt-in for all array creation functions.

Downstream libraries thus shall _NOT_ include the like= keyword to their array creation APIs, which is a NumPy-exclusive keyword.

Function Dispatching

There are two different cases to dispatch: Python functions, and C functions. To permit __array_function__ dispatching, one possible implementation is to decorate Python functions with overrides.array_function_dispatch, but C functions have a different requirement, which we shall describe shortly.

The example below shows a suggestion on how the asarray could be decorated with overrides.array_function_dispatch:

def _asarray_decorator(a, dtype=None, order=None, *, like=None):
    return (like,)

@set_module('numpy')
@array_function_dispatch(_asarray_decorator)
def asarray(a, dtype=None, order=None, *, like=None):
    return array(a, dtype, copy=False, order=order)

Note in the example above that the implementation remains unchanged, the only difference is the decoration, which uses the new _asarray_decorator function to instruct the __array_function__ protocol to dispatch if like is not None.

We will now look at a C function example, and since asarray is anyway a specialization of array, we will use the latter as an example now. As array is a C function, currently all NumPy does regarding its Python source is to import the function and adjust its __module__ to numpy. The function will now be decorated with a specialization of overrides.array_function_from_dispatcher, which shall take care of adjusting the module too.

array_function_nodocs_from_c_func_and_dispatcher = functools.partial(
    overrides.array_function_from_dispatcher,
    module='numpy', docs_from_dispatcher=False, verify=False)

@array_function_nodocs_from_c_func_and_dispatcher(_multiarray_umath.array)
def array(a, dtype=None, *, copy=True, order='K', subok=False, ndmin=0,
          like=None):
    return (like,)

There are two downsides to the implementation above for C functions:

  1. It creates another Python function call; and

  2. To follow current implementation standards, documentation should be attached directly to the Python source code.

The first version of this proposal suggested the implementation above as one viable solution for NumPy functions implemented in C. However, due to the downsides pointed out above we have decided to discard any changes on the Python side and resolve those issues with a pure-C implementation. Please refer to [implementation] for details.

Alternatives

Recently a new protocol to replace __array_function__ entirely was proposed by NEP 37 6, which would require considerable rework by downstream libraries that adopt __array_function__ already, because of that we still believe the like= argument is beneficial for NumPy and downstream libraries. However, that proposal wouldn’t necessarily be considered a direct alternative to the present NEP, as it would replace NEP 18 entirely, upon which this builds. Discussion on details about this new proposal and why that would require rework by downstream libraries is beyond the scope of the present proposal.