NEP 34 — Disallow inferring dtype=object from sequences#
- Author: Matti Picus
- Status: Final
- Type: Standards Track
- Created: 2019-10-10
- Resolution: https://mail.python.org/pipermail/numpy-discussion/2019-October/080200.html
Abstract#
When users create arrays from sequences-of-sequences, they sometimes fail to
match the lengths of the nested sequences, producing what are commonly called
“ragged arrays”. Here we will refer to them as ragged nested sequences.
Creating such arrays via np.array([<ragged_nested_sequence>]) with no dtype
keyword argument will today default to an object-dtype array. This NEP
proposes changing the behaviour to raise a ValueError instead.
Motivation and scope#
Users who specify lists-of-lists when creating a numpy.ndarray via np.array
may mistakenly pass in lists of different lengths. Currently we accept this
input and automatically create an array with dtype=object. This can be
confusing, since it is rarely what is desired. Changing the automatic dtype
detection to never return object for ragged nested sequences (defined as a
recursive sequence of sequences, where not all the sequences on the same
level have the same length) will force users who actually wish to create
object arrays to specify that explicitly. Note that lists, tuples, and
np.ndarrays are all sequences [0]. See for instance issue 5303.
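The definition of a ragged nested sequence above can be made concrete with a small pure-Python check. The helper name regular_shape is hypothetical and introduced only for illustration; this is a simplified sketch, not the way NumPy discovers dimensions internally. It assumes strings count as scalars and recognises only collections.abc.Sequence types, so np.ndarray inputs are not covered.

```python
from collections.abc import Sequence


def regular_shape(obj):
    """Return the shape of a regular nested sequence, or None if it is ragged.

    Hypothetical helper: a sequence is treated as ragged when its items
    do not all share the same (regular) shape, which also covers a mix
    of sequences and non-sequences on the same level.
    """
    # Strings are sequences too, but we treat them as scalar leaves here.
    if isinstance(obj, (str, bytes)) or not isinstance(obj, Sequence):
        return ()
    child_shapes = {regular_shape(item) for item in obj}
    if None in child_shapes or len(child_shapes) > 1:
        return None  # items on this level disagree
    inner = child_shapes.pop() if child_shapes else ()
    return (len(obj),) + inner


print(regular_shape([[1, 2], [3, 4]]))  # regular input
print(regular_shape([[1, 2], [1]]))     # ragged input
```

A result of None marks exactly the inputs for which this NEP asks users to state dtype=object explicitly.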
Usage and impact#
After this change, array creation with ragged nested sequences must explicitly define a dtype:
>>> np.array([[1, 2], [1]])
ValueError: cannot guess the desired dtype from the input
>>> np.array([[1, 2], [1]], dtype=object)
# succeeds, with no change from current behaviour
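To check how an installed NumPy treats a given input, a sketch along the following lines may help. The helper try_array is a hypothetical name, not part of NumPy. It assumes a NumPy release that already includes this change; depending on the release, the rejection may surface as a warning rather than an exception, so the helper escalates warnings to errors and treats both the same way.

```python
import warnings

import numpy as np


def try_array(seq):
    """Return ('ok', array) or ('rejected', message) for np.array(seq).

    Hypothetical helper: warnings are turned into exceptions so that a
    deprecation warning and a ValueError are both reported as a rejection.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            return "ok", np.array(seq)
        except (ValueError, Warning) as exc:
            return "rejected", str(exc)


status_regular, _ = try_array([[1, 2], [3, 4]])
status_ragged, _ = try_array([[1, 2], [1]])

# Stating the dtype explicitly keeps working for ragged input.
explicit = np.array([[1, 2], [1]], dtype=object)
```

The exact error or warning text differs between releases, so the sketch reports it rather than matching on it.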
The deprecation will affect any call that internally calls np.asarray. For
instance, the assert_equal family of functions calls np.asarray, so users
will have to change code like:
np.testing.assert_equal(a, [[1, 2], 3])
to:
np.testing.assert_equal(a, np.array([[1, 2], 3], dtype=object))
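A self-contained version of this comparison might look like the sketch below. It uses np.testing.assert_equal, the public location of the helper, and the way the array a is built here is an assumption made for illustration.

```python
import numpy as np

# Build the "actual" value as an explicit object array holding a list and an int.
a = np.empty(2, dtype=object)
a[0] = [1, 2]
a[1] = 3

# Passing the bare nested list [[1, 2], 3] would leave the dtype to be
# guessed during conversion; spelling out dtype=object avoids that.
expected = np.array([[1, 2], 3], dtype=object)
np.testing.assert_equal(a, expected)
```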
Detailed description#
Since it is sometimes hard to determine what shape is desired, one can set the shape of the object array explicitly:
>>> arr = np.empty(correct_shape, dtype=object)
>>> arr[...] = values
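As a concrete, hedged variant of this pattern, the sketch below fixes a one-dimensional shape up front and fills the slots one at a time, so no shape has to be inferred from the right-hand side. The helper name object_array_1d is hypothetical.

```python
import numpy as np


def object_array_1d(values):
    """Pack each top-level item of `values` into one slot of a 1-D object array."""
    arr = np.empty(len(values), dtype=object)  # shape is decided here, not inferred
    for i, item in enumerate(values):
        arr[i] = item  # store the item itself, whatever its length
    return arr


ragged = object_array_1d([[1, 2], [1]])
regular = object_array_1d([[1, 2], [3, 4]])
```

With the regular input, np.array([[1, 2], [3, 4]], dtype=object) would instead produce a two-dimensional array of shape (2, 2), so deciding the shape first keeps the result predictable regardless of whether the input happens to be ragged.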
We will also reject sequences that mix sequences and non-sequences on the same level. For instance, all of these will be rejected:
>>> arr = np.array([np.arange(10), [10]])
>>> arr = np.array([[range(3), range(3), range(3)], [range(3), 0, 0]])
Implementation#
The code to be changed is inside PyArray_GetArrayParamsFromObject and the
internal discover_dimensions function. The first implementation in PR 14794
caused a number of downstream library failures and was reverted before the
release of 1.18. Subsequently downstream libraries fixed the places they
were using ragged arrays. The reimplementation became PR 15119 which was
merged for the 1.19 release.
Backward compatibility#
Anyone depending on creating object arrays from ragged nested sequences will
need to modify their code. There will be a deprecation period during which the
current behaviour will emit a DeprecationWarning.
Alternatives#
- We could continue with the current situation.
- It was also suggested to add a kwarg depth to array creation, or perhaps to add another array creation API function ragged_array_object. The goal was to eliminate the ambiguity in creating an object array from array([[1, 2], [1]], dtype=object): should the returned array have a shape of (1,), or (2,)? This NEP does not deal with that issue, and only deprecates the use of array with no dtype=object for ragged nested sequences. Users of ragged nested sequences may face another deprecation cycle in the future. Rationale: we expect that there are very few users who intend to use ragged arrays like that; this was never intended as a use case of NumPy arrays. Users are likely better off with another library or just using a list of lists.
- It was also suggested to deprecate all automatic creation of object-dtype arrays, which would require adding an explicit dtype=object for something like np.array([Decimal(10), Decimal(10)]). This too is out of scope for the current NEP. Rationale: it is harder to assess the impact of this larger change, and we are not sure how many users it may affect.
Discussion#
Comments to issue 5303 indicate this is unintended behaviour as far back as 2014. Suggestions to change it have been made in the ensuing years, but none have stuck. The WIP implementation in PR 14794 seems to point to the viability of this approach.
References and footnotes#
Copyright#
This document has been placed in the public domain.