NEP 25 — NA support via special dtypes
- Author: Nathaniel J. Smith <njs@pobox.com>
- Status: Deferred
- Type: Standards Track
- Created: 2011-07-08
Abstract
Context: this NEP was written as an additional alternative to NEP 12 (NEP 24 is another alternative); at the time of writing, NEP 12 had an implementation that was merged into the NumPy main branch.
To try and make more progress on the whole missing values/masked arrays/… debate, it seems useful to have a more technical discussion of the pieces which we can agree on. This is the second of those pieces; it attempts to nail down the details of how NAs can be implemented using special dtypes.
Rationale
An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person’s age as 0. But this is very error-prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn’t a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special “missing” value, written “NA” for “not available”.
There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus on the former (value-based) approach exclusively and leave the possible addition of the latter to future discussion. The core advantages of this approach are (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, (5) the dtype is already a place where “weird things can happen” – there are a wide variety of dtypes that don’t act like ordinary numbers (including structs, Python objects, fixed-length strings, …), so code that accepts arbitrary NumPy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than if we change the ndarray object itself.
The basic semantics of NA values are as follows. Like any other value, they must be supported by your array’s dtype – you can’t store a floating point number in an array with dtype=int32, and you can’t store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get “NA”, meaning “given that some of the entries could be anything, the answer could be anything as well”. This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and income, you can request NA-skipping behavior by adding a single argument to your function call.
For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things – NaN an invalid computation like 0/0, NA a value that is not available – and distinguishing between these things is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And anyway, we can’t use NaNs for integers, or strings, or booleans, so we need NA anyway, and once we have NA support for all these types, we might as well support it for floating point too for consistency.
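To make the distinction concrete, here is a runnable illustration with plain NumPy. The bit pattern 0x7FF00000000007A2 is the one R is commonly documented to use for NA_real_ (a NaN whose payload encodes 1954); it is cited here as background, not as part of this proposal.

import numpy as np

# Both values are NaN as far as IEEE 754 arithmetic is concerned, but
# their payloads differ, so code can distinguish them by bit pattern.
plain_nan = np.array([np.nan])
na_float = np.array([0x7FF00000000007A2], dtype=np.uint64).view(np.float64)

print(np.isnan(plain_nan)[0], np.isnan(na_float)[0])                 # True True
print(plain_nan.view(np.uint64)[0] == na_float.view(np.uint64)[0])   # False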
General strategy
NumPy already has a general mechanism for defining new dtypes and slotting them in so that they’re supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement NA-dtypes just using these existing interfaces. But we don’t want to do that, because defining all those new ufunc loops etc. from scratch would be a huge hassle, especially since the basic functionality needed is the same in all cases. So we need some generic functionality for NAs – but it would be better not to bake this in as a single set of special “NA types”, since users may well want to define new custom dtypes that have their own NA values, and have them integrate well with the rest of the NA machinery. Our strategy, therefore, is to avoid the mid-layer mistake by exposing some code for generic NA handling in different situations, which dtypes can selectively use or not as they choose.
Some example use cases:
- We want to define a dtype that acts exactly like an int32, except that the most negative value is treated as NA (a Python sketch of this sentinel convention follows this list).
- We want to define a parametrized dtype to represent categorical data, where the bit-pattern to be used for NA depends on the number of categories defined, so our code needs to play an active role in handling it rather than simply deferring to the standard machinery.
- We want to define a dtype that acts like a length-10 string and supports NAs. Since our string may hold arbitrary binary values, we want to actually allocate 11 bytes for it, with the first byte a flag indicating whether this string is NA and the rest containing the string content.
- We want to define a dtype that allows multiple different types of NA data, which print differently and can be distinguished by a new ufunc that we define called is_na_of_type(...), but which otherwise takes advantage of the generic NA machinery for most operations.
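As a concrete illustration of the first use case, here is a rough Python model of the sentinel convention. It runs with plain NumPy today; INT32_NA and isna are illustrative stand-ins for what the dtype machinery would provide, not proposed API.

import numpy as np

# The most negative int32 is reserved as the NA bit pattern, as in R.
INT32_NA = np.iinfo(np.int32).min

def isna(arr):
    # Stand-in for the proposed per-dtype isna function.
    return arr == INT32_NA

a = np.array([1, INT32_NA, 3], dtype=np.int32)
print(isna(a))    # [False  True False]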
dtype C-level API extensions
The PyArray_Descr struct gains the following new fields:
void * NA_value;
PyArray_Descr * NA_extends;
int NA_extends_offset;
The following new flag values are defined:
NPY_NA_AUTO_ARRFUNCS
NPY_NA_AUTO_CAST
NPY_NA_AUTO_UFUNC
NPY_NA_AUTO_UFUNC_CHECKED
NPY_NA_AUTO_ALL /* the above flags OR'ed together */
The PyArray_ArrFuncs struct gains the following new fields:
void (*isna)(void * src, void * dst, npy_intp n, void * arr);
void (*clearna)(void * data, npy_intp n, void * arr);
We add at least one new convenience macro:
#define NPY_NA_SUPPORTED(dtype) ((dtype)->f->isna != NULL)
The general idea is that anywhere we used to call a dtype-specific function pointer, the code will be modified to instead:
1. Check whether the relevant NPY_NA_AUTO_... bit is enabled, the NA_extends field is non-NULL, and the function pointer we wanted to call is NULL.
2. If these conditions are met, use isna to identify which entries in the array are NA, and handle them appropriately. Then look up whatever function we were going to call using this dtype on the NA_extends dtype instead, and use that to handle the non-NA elements.
For more specifics, see following sections.
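In rough Python terms, the delegation looks like this. This is a model of the C control flow only, not proposed API; the flag checks are elided, and the int32 sentinel from the earlier sketch stands in for NA_value.

import numpy as np

NA_VALUE = np.iinfo(np.int32).min       # stand-in for dtype->NA_value

def isna(arr):
    return arr == NA_VALUE              # stand-in for the isna arrfunc

def na_auto_apply(func, arr):
    # Find the NA entries, delegate to the NA_extends dtype's function
    # for everything else, and propagate NA through to the result.
    mask = isna(arr)
    out = np.empty_like(arr)
    out[mask] = NA_VALUE
    out[~mask] = func(arr[~mask])
    return out

a = np.array([1, NA_VALUE, 3], dtype=np.int32)
print(na_auto_apply(np.negative, a))    # [-1, <NA bit pattern>, -3]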
Note that if NA_extends
points to a parametrized dtype, then the dtype
object it points to must be fully specified. For example, if it is a string
dtype, it must have a non-zero elsize
field.
In order to handle the case where the NA information is stored in a field next to the “real” data, the NA_extends_offset field is set to a non-zero value; it must point to the location within each element of this dtype where some data of the NA_extends dtype is found. For example, if we are storing 10-byte strings with an NA indicator byte at the beginning, then we have:
elsize == 11
NA_extends_offset == 1
NA_extends->elsize == 10
When delegating to the NA_extends
dtype, we offset our data pointer by
NA_extends_offset
(while keeping our strides the same) so that it sees an
array of data of the expected type (plus some superfluous padding). This is
basically the same mechanism that record dtypes use, IIUC, so it should be
pretty well-tested.
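This layout can in fact be emulated with today's structured dtypes, which rely on the same intra-element offsets; the field names here are illustrative:

import numpy as np

# One 11-byte element: a 1-byte NA flag at offset 0, then 10 string
# bytes at offset 1 (the would-be NA_extends_offset).
na_str10 = np.dtype([('na_flag', 'u1'), ('data', 'S10')])
assert na_str10.itemsize == 11

a = np.zeros(3, dtype=na_str10)
a['data'] = [b'alpha', b'beta', b'gamma']
a['na_flag'][1] = 1                      # mark the middle element as NA
print(a['data'][a['na_flag'] == 0])      # [b'alpha' b'gamma']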
When delegating to a function that cannot handle “misbehaved” source data (see the PyArray_ArrFuncs documentation for details), we need to check for alignment issues before delegating (especially with a non-zero NA_extends_offset). If there’s a problem, then we need to “clean up” the source data first, using the usual mechanisms for handling misaligned data. (Of course, we should usually set up our dtypes so that there aren’t any alignment issues, but if someone screws that up, or decides that reduced memory usage is more important to them than fast inner loops, then we should still handle that gracefully, as we do now.)
The NA_value
and clearna
fields are used for various sorts of casting.
NA_value
is a bit-pattern to be used when, for example, assigning from
np.NA. clearna
can be a no-op if elsize
and NA_extends->elsize
are
the same, but if they aren’t then it should clear whatever auxiliary NA storage
this dtype uses, so that none of the specified array elements are NA.
Core dtype functions
The following functions are defined in PyArray_ArrFuncs
. The special
behavior described here is enabled by the NPY_NA_AUTO_ARRFUNCS bit in the dtype
flags, and only enabled if the given function field is not filled in.
- getitem: Calls isna. If isna returns true, returns np.NA. Otherwise, delegates to the NA_extends dtype.
- setitem: If the input object is np.NA, then runs memcpy(data, self->NA_value, arr->dtype->elsize). Otherwise, calls clearna and then delegates to the NA_extends dtype. (A Python sketch of getitem and setitem follows this list.)
- copyswapn, copyswap: FIXME: not sure whether there’s any special handling to use for these?
- compare: FIXME: how should this handle NAs? R’s sort function discards NAs, which doesn’t seem like a good option.
- argmax: FIXME: what is this used for? If it’s the underlying implementation for np.max, then it really needs some way to get a skipna argument. If not, then the appropriate semantics depend on what it’s supposed to accomplish…
- dotfunc: QUESTION: is it actually guaranteed that everything has the same dtype? FIXME: same issues as for argmax.
- scanfunc: This one’s ugly. We may have to explicitly override it in all of our special dtypes, because assuming that we want the option of, say, having the token “NA” represent an NA value in a text file, we need some way to check whether that’s there before delegating. But ungetc is only guaranteed to let us put back 1 character, and we need 2 (or maybe 3 if we actually check for “NA ”). The other option would be to read to the next delimiter, check whether we have an NA, and if not then delegate to fromstr instead of scanfunc, but according to the current API, each dtype might in principle use a totally different rule for defining “the next delimiter”. So… any ideas? (FIXME)
- fromstr: Easy: check for “NA ”; if present, assign NA_value, otherwise call clearna and delegate.
- nonzero: FIXME: again, what is this used for? (It seems redundant with using the casting machinery to cast to bool.) Probably it needs to be modified so that it can return NA, though…
- fill: Use isna to check whether either of the first two values is NA. If so, fill the rest of the array with NA_value. Otherwise, call clearna and then delegate.
- fillwithvalue: Guess this can just delegate?
- sort, argsort: These should probably arrange to sort NAs to a particular place in the array (either the front or the back; any opinions?).
- scalarkind: FIXME: I have no idea what this does.
- castdict, cancastscalarkindto, cancastto: See the section on casting below.
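Here is the promised sketch of getitem and setitem, modeled in Python over the int32-sentinel convention. Since np.NA does not exist yet, a placeholder object stands in for it.

import numpy as np

NA = object()                            # placeholder for the np.NA singleton
NA_VALUE = np.iinfo(np.int32).min        # stand-in for dtype->NA_value

def getitem(arr, i):
    # If isna says the element is NA, return the NA singleton; otherwise
    # delegate to the underlying (NA_extends) dtype's getitem.
    return NA if arr[i] == NA_VALUE else int(arr[i])

def setitem(arr, i, value):
    # np.NA -> copy in the NA bit pattern; otherwise clearna (a no-op
    # here, since there is no auxiliary NA storage) and delegate.
    arr[i] = NA_VALUE if value is NA else value

a = np.array([1, 2, 3], dtype=np.int32)
setitem(a, 1, NA)
print(getitem(a, 1) is NA, getitem(a, 0))    # True 1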
Casting
FIXME: this really needs attention from an expert on NumPy’s casting rules. But I can’t seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you’re casting from dtype A to dtype B, which dtype’s loops are used?), so I can’t go into details. But those details are tricky and they matter…
But the general idea is: if you have a dtype with NPY_NA_AUTO_CAST set, then the following conversions are automatically allowed:
- Casting from the underlying type to the NA-type: this is performed by the usual clearna + potentially-strided copy dance. Also, isna is called to check that none of the regular values have been accidentally converted into NA; if any have, then an error is raised. (The first two rules here are sketched in Python after this list.)
- Casting from the NA-type to the underlying type: allowed in principle, but if isna returns true for any of the values that are to be converted, then again, an error is raised. (If you want to get around this, use np.view(array_with_NAs, dtype=float).)
- Casting between the NA-type and other types that do not support NA: this is allowed if the underlying type is allowed to cast to the other type, and is performed by combining a cast to or from the underlying type (using the above rules) with a cast to or from the other type (using the underlying type’s rules).
- Casting between the NA-type and other types that do support NA: if the other type has NPY_NA_AUTO_CAST set, then we use the above rules plus the usual dance with isna on one array being converted to NA_value elements in the other. If only one of the arrays has NPY_NA_AUTO_CAST set, then it’s assumed that that dtype knows what it’s doing, and we don’t do any magic. (But this is one of the things that I’m not sure makes sense, as per my caveat above.)
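And the promised sketch of the first two rules, again over the int32 sentinel; raising ValueError is an assumption about how the error would surface.

import numpy as np

NA_VALUE = np.iinfo(np.int32).min

def cast_int64_to_na_int32(arr):
    # Underlying type -> NA-type: clearna + copy, then check that no
    # regular value accidentally landed on the NA bit pattern.
    out = arr.astype(np.int32)
    if (out == NA_VALUE).any():
        raise ValueError("regular value collides with the NA bit pattern")
    return out

def cast_na_int32_to_int64(arr):
    # NA-type -> underlying type: an error if any element is actually NA.
    if (arr == NA_VALUE).any():
        raise ValueError("cannot cast NA to a dtype without NA support")
    return arr.astype(np.int64)

print(cast_na_int32_to_int64(np.array([1, 2], dtype=np.int32)))   # [1 2]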
Ufuncs
All ufuncs gain an additional optional keyword argument, skipNA=
, which
defaults to False.
If skipNA == True
, then the ufunc machinery unconditionally calls
isna
for any dtype where NPY_NA_SUPPORTED(dtype) is true, and then acts as
if any values for which isna returns True were masked out in the where=
argument (see miniNEP 1 for the behavior of where=
). If a where=
argument is also given, then it acts as if the isna
values had been ANDed out
of the where=
mask, though it does not actually modify the mask. Unlike the
other changes below, this is performed unconditionally for any dtype which
has an isna
function defined; the NPY_NA_AUTO_UFUNC flag is not checked.
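skipNA= itself is only proposed here, but its semantics can be modeled with the where= argument that modern NumPy ufunc reductions accept (continuing with the int32 sentinel):

import numpy as np

NA_VALUE = np.iinfo(np.int32).min
a = np.array([1, NA_VALUE, 3], dtype=np.int32)

# skipNA=True would behave as if ~isna(a) were ANDed into the where= mask:
total = np.add.reduce(a, where=(a != NA_VALUE), initial=0)
print(total)                             # 4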
If NPY_NA_AUTO_UFUNC is set, then ufunc loop lookup is modified so that
whenever it checks for the existence of a loop on the current dtype, and does
not find one, then it also checks for a loop on the NA_extends
dtype. If
that loop is found, then it uses it in the normal way, with the exceptions that
(1) it is only called for values which are not NA according to isna
, (2) if
the output array has NPY_NA_AUTO_UFUNC set, then clearna
is called on it
before calling the ufunc loop, (3) pointer offsets are adjusted by
NA_extends_offset
before calling the ufunc loop. In addition, if
NPY_NA_AUTO_UFUNC_CHECKED is set, then after evaluating the ufunc loop we call
isna
on the output array, and if there are any NAs in the output which
were not in the input, then we raise an error. (The intention of this is to
catch cases where, say, we represent NA using the most-negative integer, and
then someone’s arithmetic overflows to create such a value by accident.)
FIXME: We should go into more detail here about how NPY_NA_AUTO_UFUNC works when there are multiple input arrays, of which potentially some have the flag set and some do not.
Printing
FIXME: There should be some sort of mechanism by which values which are NA are automatically repr’ed as NA, but I don’t really understand how NumPy printing works, so I’ll let someone else fill in this section.
Indexing
Scalar indexing like a[12]
goes via the getitem
function, so according
to the proposal as described above, if a dtype delegates getitem
, then
scalar indexing on NAs will return the object np.NA
. (If it doesn’t
delegate getitem
, of course, then it can return whatever it wants.)
This seems like the simplest approach, but an alternative would be to add a
special case to scalar indexing, where if an NPY_NA_AUTO_INDEX
flag were
set, then it would call isna
on the specified element. If this returned
false, it would call getitem
as usual; otherwise, it would return a 0-d
array containing the specified element. The problem with this is that it breaks
expressions like if a[i] is np.NA: ...
. (Of course, there is nothing nearly
so convenient as that for NaN values now, but then, NaN values don’t have their
own global singleton.) So for now we stick to scalar indexing just returning
np.NA
, but this can be revisited if anyone objects.
Python API for generic NA support
NumPy will gain a global singleton called numpy.NA
, similar to None, but with
semantics reflecting its status as a missing value. In particular, trying to
treat it as a boolean will raise an exception, and comparisons with it will
produce numpy.NA
instead of True or False. These basics are adopted from the
behavior of the NA value in the R project. To dig deeper into the ideas,
http://en.wikipedia.org/wiki/Ternary_logic#Kleene_logic provides a starting
point.
Most operations on np.NA
(e.g., __add__
, __mul__
) are overridden to
unconditionally return np.NA
.
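A minimal pure-Python sketch of what such a singleton could look like (the real implementation would be in C; NAType is an illustrative name, and only a few overridden methods are shown):

class NAType:
    """Sketch of a missing-value singleton with NA-propagating operators."""
    _instance = None

    def __new__(cls):
        # Keep it a singleton so "x is NA" tests work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"

    def __bool__(self):
        # Treating NA as a boolean is an error, as in R.
        raise TypeError("NA has no truth value")

    def __add__(self, other):
        return self
    # Comparisons and other arithmetic propagate NA the same way.
    __radd__ = __mul__ = __rmul__ = __eq__ = __ne__ = __add__

NA = NAType()
print(NA + 3, 3 * NA, NA == NA)          # NA NA NA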
The automagic dtype detection used for expressions like np.asarray([1, 2,
3])
, np.asarray([1.0, 2.0, 3.0])
will be extended to recognize the
np.NA
value, and use it to automatically switch to a built-in NA-enabled
dtype (which one being determined by the other elements in the array). A simple
np.asarray([np.NA])
will use an NA-enabled float64 dtype (which is
analogous to what you get from np.asarray([])
). Note that this means that
expressions like np.log(np.NA)
will work: first np.NA
will be coerced
to a 0-d NA-float array, and then np.log
will be called on that.
Python-level dtype objects gain the following new fields:
NA_supported
NA_value
NA_supported
is a boolean which simply exposes the value of the
NPY_NA_SUPPORTED
macro; it should be true if this dtype allows for NAs,
false otherwise. [FIXME: would it be better to just key this off the existence
of the isna
function? Even if a dtype decides to implement all other NA
handling itself, it still has to define isna
in order to make skipNA=
work correctly.]
NA_value
is a 0-d array of the given dtype, and its sole element contains
the same bit-pattern as the dtype’s underlying NA_value
field. This makes
it possible to determine the default bit-pattern for NA values for this type
(e.g., with np.view(mydtype.NA_value, dtype=int8)
).
We do not expose the NA_extends
and NA_extends_offset
values at the
Python level, at least for now; they’re considered an implementation detail
(and it’s easier to expose them later if they’re needed than to unexpose them if
they aren’t).
Two new ufuncs are defined: np.isNA
returns a logical array, with true
values wherever the dtype’s isna
function returned true. np.isnumber
is only defined for numeric dtypes, and returns True for all elements which are
not NA, and for which np.isfinite
would return True.
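These two ufuncs can be modeled for a hypothetical NA-float64 dtype by comparing bit patterns; as before, the R-style NA pattern is an assumption, not part of this proposal.

import numpy as np

R_NA_BITS = np.uint64(0x7FF00000000007A2)    # assumed NA bit pattern

def isNA(arr):
    # Model of np.isNA: true wherever the element is the NA bit pattern.
    return arr.view(np.uint64) == R_NA_BITS

def isnumber(arr):
    # Model of np.isnumber: not NA, and finite.
    return ~isNA(arr) & np.isfinite(arr)

bits = np.array([0x3FF0000000000000,          # 1.0
                 0x7FF00000000007A2,          # NA
                 0x7FF0000000000000],         # inf
                dtype=np.uint64)
a = bits.view(np.float64)
print(isNA(a))        # [False  True False]
print(isnumber(a))    # [ True False False]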
Builtin NA dtypes
The above describes the generic machinery for NA support in dtypes. It’s flexible enough to handle all sorts of situations, but we also want to define a few generally useful NA-supporting dtypes that are available by default.
For each built-in dtype, we define an associated NA-supporting dtype, as follows:
- floats: the associated dtype uses a specific NaN bit-pattern to indicate NA (chosen for R compatibility)
- complex: we do whatever R does (FIXME: look this up – two NA floats, probably?)
- signed integers: the most-negative signed value is used as NA (chosen for R compatibility; these sentinels are computed in the sketch after this list)
- unsigned integers: the most-positive value is used as NA (no R compatibility possible)
- strings: the first byte (or, in the case of unicode strings, the first 4 bytes) is used as a flag to indicate NA, and the rest of the data gives the actual string (no R compatibility possible)
- objects: two options (FIXME): either we don’t include an NA-ful version, or we use np.NA as the NA bit pattern
- boolean: we do whatever R does (FIXME: look this up – 0 == FALSE, 1 == TRUE, 2 == NA?)
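The integer sentinels above are easy to compute with plain NumPy (the float, complex, and boolean bit patterns depend on the R details flagged as FIXMEs):

import numpy as np

# Proposed NA bit patterns for the integer dtypes:
for dt in (np.int8, np.int16, np.int32, np.int64):
    print(np.dtype(dt).name, "NA =", np.iinfo(dt).min)   # most negative
for dt in (np.uint8, np.uint16, np.uint32, np.uint64):
    print(np.dtype(dt).name, "NA =", np.iinfo(dt).max)   # most positive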
Each of these dtypes is trivially defined using the above machinery, and is what the automagic type inference machinery automatically uses (for np.asarray([True, np.NA, False]), etc.).
They can also be accessed via a new function np.withNA
, which takes a
regular dtype (or an object that can be coerced to a dtype, like ‘float’) and
returns one of the above dtypes. Ideally withNA
should also take some
optional arguments that let you describe which values you want to count as NA,
etc., but I’ll leave that for a future draft (FIXME).
FIXME: If d is one of the above dtypes, then what should d.type return?
The NEP also contains a proposal for a somewhat elaborate
domain-specific language for describing NA dtypes. I’m not sure how great an
idea that is. (I have a bias against using strings as data structures, and find
the already existing strings confusing enough as it is – also, apparently the
NEP version of NumPy uses strings like ‘f8’ when printing dtypes, while my
NumPy uses object names like ‘float64’, so I’m not sure what’s going on there.
withNA(float64, arg1=value1)
seems like a more pleasant way to print a
dtype than “NA[f8,value1]”, at least to me.) But if people want it, then cool.
Type hierarchy
FIXME: how should we do subtype checks, etc., for NA dtypes? What does
issubdtype(withNA(float), float)
return? How about
issubdtype(withNA(float), np.floating)
?
Serialization
Copyright
This document has been placed in the public domain.