NEP 55 — Add a UTF-8 variable-width string DType to NumPy#
- Author:
Nathan Goldbaum <ngoldbaum@quansight.com>
- Author:
Warren Weckesser
- Author:
Marten van Kerkwijk
- Status:
Final
- Type:
Standards Track
- Created:
2023-06-29
- Updated:
2024-01-18
- Resolution:
Abstract#
We propose adding a new string data type to NumPy where each item in the array is an arbitrary length UTF-8 encoded string. This will enable performance, memory usage, and usability improvements for NumPy users, including:
- Memory savings for workflows that currently use fixed-width strings and store primarily ASCII data or a mix of short and long strings in a single NumPy array.
- Downstream libraries and users will be able to move away from object arrays currently used as a substitute for variable-length string arrays, unlocking performance improvements by avoiding passes over the data outside of NumPy and allowing use of fast GIL-releasing C casts and string ufuncs for string operations.
- A more intuitive user-facing API for working with arrays of Python strings, without a need to think about the in-memory array representation.
Motivation and scope#
First, we will describe how the current state of support for string or string-like data in NumPy arose. Next, we will summarize the last major previous discussion about this topic. Finally, we will describe the scope of the proposed changes to NumPy as well as changes that are explicitly out of scope of this proposal.
History of string support in NumPy#
Support in NumPy for textual data evolved organically in response to early user needs and then changes in the Python ecosystem.
Support for strings was added to NumPy to support users of the NumArray chararray type. Remnants of this are still visible in the NumPy API: string-related functionality lives in np.char, to support the np.char.chararray class. This class is not formally deprecated, but has had a comment in the module docstring suggesting the use of string dtypes instead since NumPy 1.4.
NumPy’s bytes_
DType was originally used to represent the Python 2 str
type before Python 3 support was added to NumPy. The bytes DType makes the most
sense when it is used to represent Python 2 strings or other null-terminated
byte sequences. However, ignoring trailing nulls means the bytes_
DType is
only suitable for fixed-width bytestreams that do not contain trailing nulls, so
it is a possibly problematic match for generic bytestreams where trailing nulls
need to round-trip through a NumPy string.
The unicode DType was added to support the Python 2 unicode type. It stores data in 32-bit UCS-4 codepoints (i.e., a UTF-32 encoding), which makes for a straightforward implementation, but is inefficient for storing text that can be represented well using a one-byte ASCII or Latin-1 encoding. This was not a problem in Python 2, where ASCII or mostly-ASCII text could use the str DType.
With the arrival of Python 3 support in NumPy, the string DTypes were largely left alone due to backward compatibility concerns, although the unicode DType became the default DType for str data and the old string DType was renamed the bytes_ DType. This change left NumPy in the sub-optimal situation of shipping a data type originally intended for null-terminated bytestrings as the data type for all Python bytes data, and a default string type with an in-memory representation that consumes four times as much memory as is needed for data that can be represented well by a one-byte ASCII or Latin-1 encoding.
Problems with fixed-width strings#
Both existing string DTypes represent fixed-width sequences, allowing storage of the string data in the array buffer. This avoids adding out-of-band storage to NumPy; however, it makes for an awkward user interface for many use cases. In particular, the maximum string size must be inferred by NumPy or estimated by the user before loading the data into a NumPy array or selecting an output DType for string operations. In the worst case, this requires an expensive pass over the full dataset to calculate the maximum length of an array element. It also wastes memory when array elements have varying lengths. Pathological cases where an array stores many short strings and a few very long strings are particularly bad for wasting memory.
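To make the memory cost of padding concrete, here is a small sketch of the pathological case just described, using fixed-width unicode inference:
>>> import numpy as np
>>> data = ["a" * 1000] + ["b"] * 999
>>> arr = np.array(data)  # one long string forces every element to 1000 code points
>>> arr.dtype
dtype('<U1000')
>>> arr.nbytes  # 1000 elements x 1000 code points x 4 bytes each
4000000
A variable-width DType needs storage roughly proportional to the actual string contents, which is orders of magnitude less in this example.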
Downstream usage of string data in NumPy arrays has proven out the need for a
variable-width string data type. In practice, many downstream libraries avoid
using fixed-width strings due to usability issues and instead employ object
arrays for storing strings. In particular, Pandas has explicitly deprecated
support for NumPy fixed-width strings, coerces NumPy fixed-width string arrays
to either object
string arrays or PyArrow
-backed string arrays, and in
the future will switch to only supporting string data via PyArrow
, which has
native support for UTF-8 encoded variable-width string arrays [1].
Previous discussions#
The project last publicly discussed this topic in depth in 2017, when Julian Taylor proposed a fixed-width text data type parameterized by an encoding [2]. This started a wide-ranging discussion about pain points for working with string data in NumPy and possible ways forward.
The discussion highlighted two use-cases that the current support for strings does a poor job of handling [3] [4] [5]:
- Loading or memory-mapping scientific datasets with unknown encoding,
- Working with “a NumPy array of Python strings” in a manner that allows transparent conversion between NumPy arrays and Python strings, including support for missing strings. The object DType partially satisfies this need, albeit with a cost of slow performance and no type checking.
As a result of this discussion, improving support for string data was added to the NumPy project roadmap [6], with an explicit call-out to add a DType better suited to memory-mapping bytes with any or no encoding, and a variable-width string DType that supports missing data to replace usages of object string arrays.
Proposed work#
This NEP proposes adding StringDType
, a DType that stores variable-width
heap-allocated strings in Numpy arrays, to replace downstream usages of the
object
DType for string data. This work will heavily leverage recent
improvements in NumPy to improve support for user-defined DTypes, so we will
also necessarily be working on the data type internals in NumPy. In particular,
we propose to:
- Add a new variable-length string DType to NumPy, targeting NumPy 2.0.
- Work out issues related to adding a DType implemented using the experimental DType API to NumPy itself.
- Add support for a user-provided missing data sentinel.
- Expose string ufuncs in a new np.strings namespace for functions and types related to string support, enabling a migration path for a future deprecation of np.char.
The following is out of scope for this work:
- Changing DType inference for string data.
- Adding a DType for memory-mapping text in unknown encodings or a DType that attempts to fix issues with the bytes_ DType.
- Fully agreeing on the semantics of a missing data sentinel or adding a missing data sentinel to NumPy itself.
- Implementing SIMD optimizations for string operations.
- An update to the npy and npz file formats to allow storage of arbitrary-length sidecar data.
While we’re explicitly ruling out implementing these items as part of this work, adding a new string DType helps set up future work that does implement some of these items.
If implemented, this NEP will make it easier to add a new fixed-width text DType in the future by moving string operations into a long-term supported namespace and improving the internal infrastructure in NumPy for handling strings. We are also proposing a memory layout that should be amenable to SIMD optimization in some cases, increasing the payoff for writing string operations as SIMD-optimized ufuncs in the future.
While we are not proposing adding a missing data sentinel to NumPy, we are
proposing adding support for an optional, user-provided missing data sentinel,
so this does move NumPy a little closer to officially supporting missing
data. We are attempting to avoid resolving the disagreement described in
NEP 26 and this proposal does not require or preclude adding a
missing data sentinel or bitflag-based missing data support to ndarray
in
the future.
Usage and impact#
The DType is intended as a drop-in replacement for object string arrays. This
means that we intend to support as many downstream usages of object string
arrays as possible, including all supported NumPy functionality. Pandas is the
obvious first user, and substantial work has already occurred to add support in
a fork of Pandas. scikit-learn
also uses object string arrays and will be
able to migrate to a DType with guarantees that the arrays contain only
strings. Both h5py [7] and PyTables [8] will be able to add first-class
support for variable-width UTF-8 encoded string datasets in HDF5. String data
are heavily used in machine-learning workflows and downstream machine learning
libraries will be able to leverage this new DType.
Users who wish to load string data into NumPy and leverage NumPy features like advanced indexing will have a natural choice that offers substantial memory savings over fixed-width unicode strings and better validation guarantees and overall integration with NumPy than object string arrays. Moving to a first-class string DType also removes the need to acquire the GIL during string operations, unlocking future optimizations that are impossible with object string arrays.
Performance#
Here we briefly describe preliminary performance measurements of the prototype
version of StringDType
we have implemented outside of NumPy using the
experimental DType API. All benchmarks in this section were performed on a Dell
XPS 13 9380 running Ubuntu 22.04 and Python 3.11.3 compiled using pyenv. NumPy,
Pandas, and the StringDType
prototype were all compiled with meson release
builds.
Currently, the StringDType prototype has comparable performance with object arrays and fixed-width string arrays. One exception is array creation from Python strings, where performance is somewhat slower than for object arrays and comparable to fixed-width unicode arrays:
In [1]: from stringdtype import StringDType
In [2]: import numpy as np
In [3]: data = [str(i) * 10 for i in range(100_000)]
In [4]: %timeit arr_object = np.array(data, dtype=object)
3.15 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit arr_stringdtype = np.array(data, dtype=StringDType())
8.8 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit arr_strdtype = np.array(data, dtype=str)
11.6 ms ± 57.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In this example, object DTypes are substantially faster because the objects in the data list can be directly interned in the array, while the fixed-width unicode DType and StringDType need to copy the string data, and StringDType additionally needs to convert the data to UTF-8 and perform heap allocations outside the array buffer. In the future, if Python moves to a UTF-8 internal representation for strings, the string loading performance of StringDType should improve.
String operations have similar performance:
In [7]: %timeit np.array([s.capitalize() for s in data], dtype=object)
31.6 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit np.char.capitalize(arr_stringdtype)
41.5 ms ± 84.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit np.char.capitalize(arr_strdtype)
47.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The poor performance here is a reflection of the slow iterator-based
implementation of operations in np.char
. When we finish rewriting these
operations as ufuncs, we will unlock substantial performance
improvements. Using the example of the add
ufunc, which we have implemented
for the StringDType
prototype:
In [10]: %timeit arr_object + arr_object
10.1 ms ± 400 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit arr_stringdtype + arr_stringdtype
3.64 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit np.char.add(arr_strdtype, arr_strdtype)
17.7 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As described below, we have already updated a fork of Pandas to use a prototype
version of StringDType
. This demonstrates the performance improvements
available when data are already loaded into a NumPy array and are passed to a
third-party library. Currently Pandas attempts to coerce all str
data to
object
DType by default, and has to check and sanitize existing object
arrays that are passed in. This requires a copy or pass over the data made
unnecessary by first-class support for variable-width strings in both NumPy and
Pandas:
In [13]: import pandas as pd
In [14]: %timeit pd.Series(arr_stringdtype)
18.8 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
If we force Pandas to use object string arrays, which was the default until very recently, we see the substantial performance penalty of a pass over the data outside of NumPy:
In [15]: %timeit pd.Series(arr_object, dtype='string[python]')
907 µs ± 67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Pandas switched to PyArrow-backed string arrays by default specifically to avoid this and other performance costs associated with object string arrays.
Backward compatibility#
We are not proposing a change to DType inference for Python strings and do not expect to see any impacts on existing usages of NumPy.
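As a quick illustration, DType inference continues to produce fixed-width unicode arrays, and the new DType is strictly opt-in (a small sketch):
>>> import numpy as np
>>> from numpy.dtypes import StringDType
>>> np.array(["hello", "world"]).dtype   # inference is unchanged
dtype('<U5')
>>> np.array(["hello", "world"], dtype=StringDType()).dtype   # opt-in only
numpy.dtypes.StringDType()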
Detailed description#
Here we provide a detailed description of the version of StringDType
we
would like to include in NumPy. This is mostly identical to the prototype, but
has a few differences that are impossible to implement in a DType that lives
outside of NumPy.
First, we describe the Python API for instantiating StringDType
instances. Next, we will describe the missing data handling support and support
for strict string type checking for array elements. We next discuss the cast and
ufunc implementations we will define and discuss our plan for a new
np.strings
namespace to directly expose string ufuncs in the Python
API. Finally, we provide an overview of the C API we would like to expose and
the details of the memory layout and heap allocation strategy we have chosen for
the initial implementation.
Python API for StringDType#
The new DType will be accessible via the np.dtypes
namespace:
>>> from numpy.dtypes import StringDType
>>> dt = StringDType()
>>> dt
numpy.dtypes.StringDType()
In addition, we propose reserving the character "T"
(short for text) for
usage with np.dtype
, so the above would be identical to:
>>> np.dtype("T")
numpy.dtypes.StringDType()
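The character code can then be used anywhere a dtype is accepted, for example during array creation (a small usage sketch):
>>> np.array(["short", "a much longer string"], dtype="T")
array(['short', 'a much longer string'], dtype=StringDType())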
StringDType
can be used out of the box to represent strings of arbitrary
length in a NumPy array:
>>> data = ["this is a very long string", "short string"]
>>> arr = np.array(data, dtype=StringDType())
>>> arr
array(['this is a very long string', 'short string'], dtype=StringDType())
Note that unlike fixed-width strings, StringDType is not parameterized by the maximum length of an array element; arbitrarily long or short strings can live in the same array without needing to reserve storage for padding bytes in the short strings.
The StringDType
class will be a synonym for the default StringDType
instance when the class is passed as a dtype
argument in the NumPy Python
API. We have already converted most of the API surface to work like this, but
there are still a few spots that have not yet been converted and it’s likely
third-party code has not been converted, so we will not emphasize this in the
docs. Emphasizing that StringDType
is a class and StringDType()
is an
instance is a more forward-looking API that the rest of the NumPy DType API can
move towards now that DType classes are importable from the np.dtypes
namespace, so we will include an explicit instantiation of a StringDType
object in the documentation even if it is not strictly necessary.
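As a small sketch of the equivalence described above:
>>> arr = np.array(["a", "b"], dtype=StringDType)   # the class is accepted as a dtype argument
>>> arr.dtype == StringDType()                      # and maps to the default instance
True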
We propose using the Python str builtin as the DType’s scalar type:
>>> StringDType.type
<class 'str'>
While this does create an API wart in that the mapping from builtin DType
classes to scalars in NumPy will no longer be one-to-one (the unicode
DType’s scalar type is str
), this avoids needing to define, optimize, or
maintain a str
subclass for this purpose or other hacks to maintain this
one-to-one mapping. To maintain backward compatibility, the DType detected for a list of Python strings will remain a fixed-width unicode string.
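Concretely, indexing a StringDType array yields ordinary Python strings (a small sketch):
>>> arr = np.array(["hello", "world"], dtype=StringDType())
>>> arr[0]
'hello'
>>> type(arr[0])
<class 'str'>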
As described below, StringDType
supports two parameters that can adjust the
runtime behavior of the DType. We will not attempt to support parameters for the
dtype via a character code. If users need an instance of the DType that does not
use the default parameters, they will need to instantiate an instance of the
DType using the DType class.
We will also extend the NPY_TYPES
enum in the C API with an NPY_VSTRING
entry (there is already an NPY_STRING
entry). This should not interfere with
legacy user-defined DTypes since the integer type numbers for these data types
begin at 256. In principle there is still room for hundreds more builtin
DTypes in the integer range available in the NPY_TYPES
enum.
In principle we do not need to reserve a character code and there is a desire to
move away from character codes. However, a substantial amount of downstream code
relies on checking DType character codes to discriminate between builtin NumPy
DTypes, and we think it would harm adoption to require users to refactor their
DType-handling code if they want to use StringDType
.
We also hope that in the future we might be able to add a new fixed-width text version of StringDType that can re-use the "T" character code with length or encoding modifiers. This will allow a migration to a more flexible text dtype for use with structured arrays and other use-cases where a fixed-width string is a better fit than a variable-width string.
Missing Data Support#
Missing data can be represented using a sentinel:
>>> dt = StringDType(na_object=np.nan)
>>> arr = np.array(["hello", np.nan, "world"], dtype=dt)
>>> arr
array(['hello', nan, 'world'], dtype=StringDType(na_object=nan))
>>> arr[1]
nan
>>> np.isnan(arr[1])
True
>>> np.isnan(arr)
array([False, True, False])
>>> np.empty(3, dtype=dt)
array(['', '', ''])
We only propose supporting user-provided sentinels. By default, empty arrays will be populated with empty strings:
>>> np.empty(3, dtype=StringDType())
array(['', '', ''], dtype=StringDType())
By only supporting user-provided missing data sentinels, we avoid resolving exactly how NumPy itself should support missing data and the correct semantics of the missing data object, leaving that up to users to decide. However, we do detect whether the user is providing a NaN-like missing data value, a string missing data value, or neither. We explain how we handle these cases below.
A cautious reader may be worried about the complexity of needing to handle three
different categories of missing data sentinel. The complexity here is reflective
of the flexibility of object arrays and the downstream usage patterns we’ve
found. Some users want comparisons with the sentinel to error, so they use
None
. Others want comparisons to succeed and have some kind of meaningful
ordering, so they use some arbitrary, hopefully unique string. Other users want
to use something that acts like NaN in comparisons and arithmetic or is
literally NaN so that NumPy operations that specifically look for exactly NaN
work and there isn’t a need to rewrite missing data handling outside of
NumPy. We believe it is possible to support all this, but it requires a bit of
hopefully manageable complexity.
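In code, the three categories are selected purely by the type of the na_object passed in (a sketch; the reprs shown are illustrative):
>>> StringDType(na_object=np.nan)      # NaN-like: propagates through operations
numpy.dtypes.StringDType(na_object=nan)
>>> StringDType(na_object="__nan__")   # string: used directly in operations
numpy.dtypes.StringDType(na_object='__nan__')
>>> StringDType(na_object=None)        # anything else: comparisons raise errors
numpy.dtypes.StringDType(na_object=None)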
NaN-like Sentinels#
A NaN-like sentinel returns itself as the result of arithmetic operations. This
includes the python nan
float and the Pandas missing data sentinel
pd.NA
. We choose to make NaN-like sentinels inherit these behaviors in
operations, so the result of addition is the sentinel:
>>> dt = StringDType(na_object=np.nan)
>>> arr = np.array(["hello", np.nan, "world"], dtype=dt)
>>> arr + arr
array(['hellohello', nan, 'worldworld'], dtype=StringDType(na_object=nan))
We also chose to make a NaN-like sentinel sort to the end of the array,
following the behavior of sorting an array containing nan
.
>>> np.sort(arr)
array(['hello', 'world', nan], dtype=StringDType(na_object=nan))
String Sentinels#
A string missing data value is an instance of str or a subtype of str.
Operations will use the sentinel value directly for missing entries. This is the
primary usage of this pattern we’ve found in downstream code, where a missing
data sentinel like "__nan__"
is passed to a low-level sorting or
partitioning algorithm.
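For example, a sketch of how a string sentinel participates directly in comparisons and sorting:
>>> dt = StringDType(na_object="__nan__")
>>> arr = np.array(["world", "__nan__", "hello"], dtype=dt)
>>> np.sort(arr)   # the sentinel sorts as an ordinary string
array(['__nan__', 'hello', 'world'], dtype=StringDType(na_object='__nan__'))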
Other Sentinels#
Any other python object will raise errors in operations or comparisons, just as
None
does as a missing data sentinel for object arrays currently:
>>> dt = StringDType(na_object=None)
>>> np.sort(np.array(["hello", None, "world"], dtype=dt))
ValueError: Cannot compare null that is not a string or NaN-like value
Since comparisons need to raise an error, and the NumPy comparison API has no way to signal value-based errors during a sort without holding the GIL, sorting arrays that use arbitrary missing data sentinels will hold the GIL. We may also attempt to relax this restriction by refactoring NumPy’s comparison and sorting implementation to allow value-based error propagation during a sort operation.
Implications for DType Inference#
If, in the future, we decide to break backward compatibility to make
StringDType
the default DType for str
data, the support for arbitrary
objects as missing data sentinels may seem to pose a problem for implementing
DType inference. However, given that initial support for this DType will require
using the DType directly and will not be able to rely on NumPy to infer the
DType, we do not think this will be a major problem for downstream users of the
missing data feature. To use StringDType
, they will need to update
their code to explicitly specify a DType when an array is created, so if NumPy
changes DType inference in the future, their code will not change behavior and
there will never be a need for missing data sentinels to participate in DType
inference.
Coercing non-strings#
By default, non-string data are coerced to strings:
>>> np.array([1, object(), 3.4], dtype=StringDType())
array(['1', '<object object at 0x7faa2497dde0>', '3.4'], dtype=StringDType())
If this behavior is not desired, an instance of the DType can be created that disables string coercion:
>>> np.array([1, object(), 3.4], dtype=StringDType(coerce=False))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: StringDType only allows string data when string coercion
is disabled
This allows strict data validation in the same pass over the data NumPy uses to create the array without a need for downstream libraries to implement their own string validation in a separate, expensive, pass over the input array-like. We have chosen not to make this the default behavior to follow NumPy fixed-width strings, which coerce non-strings.
Casts, ufunc support, and string manipulation functions#
A full set of round-trip casts to the builtin NumPy DTypes will be available. In
addition, we will add implementations for the comparison operators as well as an
add
loop that accepts two string arrays, multiply
loops that accept
string and integer arrays, an isnan
loop, and implementations for the
str_len
, isalpha
, isdecimal
, isdigit
, isnumeric
,
isspace
, find
, rfind
, count
, strip
, lstrip
, rstrip
,
and replace
string ufuncs that will be newly available in NumPy 2.0.
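For example, with the ufuncs planned for NumPy 2.0 (a sketch; the exact set of exposed functions may evolve):
>>> arr = np.array(["hello", "world"], dtype=StringDType())
>>> np.strings.str_len(arr)
array([5, 5])
>>> np.strings.find(arr, "o")
array([4, 1])
>>> arr + arr
array(['hellohello', 'worldworld'], dtype=StringDType())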
The isnan
ufunc will return True
for entries that are NaN-like sentinels
and False
otherwise. Comparisons will sort data in order of unicode code
point, as is currently implemented for the fixed-width unicode DType. In the
future NumPy or a downstream library may add locale-aware sorting, case folding,
and normalization for NumPy unicode string arrays, but we are not proposing
adding these features at this time.
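For example, code-point ordering places all ASCII uppercase letters before lowercase ones, and non-ASCII characters after both (a small sketch):
>>> np.sort(np.array(["b", "B", "a", "ä"], dtype=StringDType()))
array(['B', 'a', 'b', 'ä'], dtype=StringDType())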
Two StringDType
instances are considered equal if they are created with the
same na_object
and coerce
parameters. For ufuncs that accept more than
one string argument we also introduce the concept of “compatible”
StringDType
instances. We allow distinct DType instances to be used in ufunc
operations together if they have the same na_object
or if only one
or the other DType has an na_object
explicitly set. We do not consider
string coercion for determining whether instances are compatible, although if
the result of the operation is a string, the result will inherit the stricter
string coercion setting of the original operands.
This notion of “compatible” instances will be enforced in the
resolve_descriptors
function of binary ufuncs. This choice makes it easier
to work with non-default StringDType
instances, because Python strings are
coerced to the default StringDType
instance, so the following idiomatic
expression is allowed:
>>> arr = np.array(["hello", "world"], dtype=StringDType(na_object=None))
>>> arr + "!"
array(['hello!', 'world!'], dtype=StringDType(na_object=None))
If we only considered equality of StringDType
instances, this would
be an error, making for an awkward user experience. If the operands have
distinct na_object
settings, NumPy will raise an error because the choice
for the result DType is ambiguous:
>>> arr + np.array("!", dtype=StringDType(na_object=""))
TypeError: Cannot find common instance for incompatible dtype instances
np.strings namespace#
String operations will be available in a np.strings
namespace that will
be populated with string ufuncs:
>>> np.strings.upper(np.array(["hello", "world"], dtype=StringDType()))
array(['HELLO', 'WORLD'], dtype=StringDType())
>>> isinstance(np.strings.upper, np.ufunc)
True
We feel np.strings
is a more intuitive name than np.char
, and eventually
will replace np.char
once the minimum NumPy version supported by downstream
libraries per SPEC-0 is new
enough that they can safely switch to np.strings
without needing any logic
conditional on the NumPy version.
Serialization#
Since string data are stored outside the array buffer, serialization to the npy format would require a format revision to support storing variable-width sidecar data. Rather than doing this as part of this effort, we
do not plan on supporting serialization to the npy
or npz
format without
specifying allow_pickle=True
.
This is a continuation of the current situation with object string arrays,
which can only be saved to an npy
file using the allow_pickle=True
option.
In the future we may decide to add support for this, but care should be taken to not break parsers outside of NumPy that may not be maintained.
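As a sketch of the expected behavior, mirroring object string arrays today (the exact error is illustrative):
>>> arr = np.array(["hello", "world"], dtype=StringDType())
>>> np.save("strings.npy", arr)                      # raises ValueError
>>> np.save("strings.npy", arr, allow_pickle=True)   # falls back to pickling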
C API for StringDType#
The goal of the C API is to hide details of how string data are stored on the
heap from the user and provide a thread-safe interface for reading and writing
strings stored in StringDType
arrays. To accomplish this, we have decided to
split strings into two different representations: packed and unpacked. A
packed string lives directly in the array buffer and may contain either the
string data for a sufficiently short string or metadata for a heap allocation
where the characters of the string are stored. An unpacked string exposes the
size of the string in bytes and a char *
pointer to the string data.
To access the unpacked string data for a string stored in a numpy array, a user must call a function to load the packed string into an unpacked string or call another function to pack an unpacked string into an array. These operations require both a pointer to an array entry and a reference to an allocator struct. The allocator manages the bookkeeping needed to store the string data on the heap. Centralizing this bookkeeping in the allocator means we have the freedom to change the underlying allocation strategy. We also ensure thread safety by guarding access to the allocator with a mutex.
Below we describe this design in more detail, enumerating the types and functions we would like to add to the C API. In the next section we describe the memory layout and heap allocation strategy we plan to implement using this API.
The PyArray_StringDType and PyArray_StringDTypeObject structs#
We will publicly expose structs for the StringDType
metaclass and a struct
for the type of StringDType
instances. The former PyArray_StringDType
will be available in the C API in the same way as other PyArray_DTypeMeta
instances for writing ufunc and cast loops. In addition, we will make the
following struct public:
struct PyArray_StringDTypeObject {
PyArray_Descr base;
// The object representing a null value
PyObject *na_object;
// Flag indicating whether or not to coerce arbitrary objects to strings
char coerce;
// Flag indicating the na object is NaN-like
char has_nan_na;
// Flag indicating the na object is a string
char has_string_na;
// If nonzero, indicates that this instance is owned by an array already
char array_owned;
// The string data to use when a default string is needed
npy_static_string default_string;
// The name of the missing data object, if any
npy_static_string na_name;
// the allocator should only be directly accessed after
// acquiring the allocator_lock and the lock should
// be released immediately after the allocator is
// no longer needed
npy_string_allocator *allocator;
};
Making this definition public eases future integration with other dtypes.
String and Allocator Types#
Unpacked strings are represented in the C API with the npy_static_string
type, which will be publicly exposed with the following definition:
struct npy_static_string {
size_t size;
const char *buf;
};
Where size
is the size, in bytes, of the string and buf
is a const
pointer to the beginning of a UTF-8 encoded bytestream containing string
data. This is a read-only view onto the string; we will not expose a public
interface for modifying these strings. We do not append a trailing null
character to the byte stream, so users attempting to pass the buf
field to
an API expecting a C string must create a copy with a trailing null. In the
future we may decide to always write a trailing null byte if the need to copy
into a null-terminated buffer proves to be cost-prohibitive for downstream users
of the C API.
In addition, we will expose two opaque structs, npy_packed_static_string
and
npy_string_allocator
. Each entry in a StringDType NumPy array will store the contents of an npy_packed_static_string, a packed representation of a
string. The string data are stored either directly in the packed string or on
the heap, in an allocation managed by a separate npy_string_allocator
struct
attached to the descriptor instance associated with the array. The precise
layout of the packed string and the strategy used to allocate data on the heap
will not be publicly exposed and users should not depend on these details.
New C API Functions#
The C API functions we plan to expose fall into two categories: functions for acquiring and releasing the allocator lock and functions for loading and packing strings.
Acquiring and Releasing Allocators#
The main interface for acquiring and releasing the allocator is the following pair of static inline functions:
static inline npy_string_allocator *
NpyString_acquire_allocator(PyArray_StringDTypeObject *descr)
static inline void
NpyString_release_allocator(npy_string_allocator *allocator)
The first function acquires the allocator lock attached to the descriptor
instance and returns a pointer to the allocator associated with the
descriptor. The allocator can then be used by that thread to load existing
packed strings or pack new strings into the array. Once the operation requiring
the allocator is finished, the allocator lock must then be released. Use of the
allocator after calling NpyString_release_allocator
may lead to data races
or memory corruption.
There are also cases when it is convenient to simultaneously work with several
allocators. For example, the add
ufunc takes two string arrays and produces
a third string array. This means the ufunc loop needs three allocators to be
able to load the strings for each operand and pack the result into the output
array. This is also made more tricky by the fact that input and output operands
need not be distinct objects and operands can share allocators by virtue of
being the same array. In principle we could require users to acquire and release
locks inside of a ufunc loop, but that would add a large performance overhead
compared to acquiring all three allocators in the loop setup and releasing them
simultaneously after the end of the loop.
To handle these situations, we will also expose variants of both functions that
take an arbitrary number of descriptors and allocators
(NpyString_acquire_allocators
, and
NpyString_release_allocators
). Exposing these functions makes it
straightforward to write code that works simultaneously with more than one
allocator. The naive approach that simply calls NpyString_acquire_allocator
and NpyString_release_allocator
multiple times will cause undefined behavior
by attempting to acquire the same lock more than once in the same thread when
ufunc operands share descriptors. The multiple-descriptor variants check
for identical descriptors before trying to acquire locks, avoiding the undefined
behavior. To do the correct thing, the user will only need to choose the variant
to acquire or release allocators that accepts the same number of descriptors as
the number they need to work with.
Packing and Loading Strings#
Accessing strings is mediated by the following function:
int NpyString_load(
npy_string_allocator *allocator,
const npy_packed_static_string *packed_string,
npy_static_string *unpacked_string)
This function returns -1 on error, which can happen if there is a threading bug
or corruption preventing access to a heap allocation. On success it can either
return 1 or 0. If it returns 1, this indicates that the contents of the packed
string are the null string, and special logic for handling null strings can
happen in this case. If the function returns 0, this indicates the contents of
the packed_string
can be read from the unpacked_string
.
Packing strings can happen via one of these functions:
int NpyString_pack(
npy_string_allocator *allocator,
npy_packed_static_string *packed_string,
const char *buf, size_t size)
int NpyString_pack_null(
npy_string_allocator *allocator,
npy_packed_static_string *packed_string)
The first function packs the contents of the first size
elements of buf
into packed_string
. The second function packs the null string into
packed_string
. Both functions invalidate any previous heap allocation
associated with the packed string and old unpacked representations that are
still in scope are invalid after packing a string. Both functions return 0 on
success and -1 on failure, for example if malloc
fails.
Example C API Usage#
Loading a String#
Say we are writing a ufunc implementation for StringDType. If we are given a const char *buf pointer to the beginning of a StringDType array entry, and a PyArray_Descr * pointer to the array descriptor, we can access the underlying string data like so:
npy_string_allocator *allocator = NpyString_acquire_allocator(
(PyArray_StringDTypeObject *)descr);
npy_static_string sdata = {0, NULL};
npy_packed_static_string *packed_string = (npy_packed_static_string *)buf;
int is_null = 0;
is_null = NpyString_load(allocator, packed_string, &sdata);
if (is_null == -1) {
// failed to load string, set error
return -1;
}
else if (is_null) {
// handle missing string
// sdata.buf is NULL
// sdata.size is 0
}
else {
// sdata.buf is a pointer to the beginning of a string
// sdata.size is the size of the string in bytes
}
NpyString_release_allocator(allocator);
Packing a String#
This example shows how to pack a new string into an array:
char *str = "Hello world";
size_t size = 11;
npy_packed_static_string *packed_string = (npy_packed_static_string *)buf;
npy_string_allocator *allocator = NpyString_acquire_allocator(
(PyArray_StringDTypeObject *)descr);
// copy contents of str into packed_string
if (NpyString_pack(allocator, packed_string, str, size) == -1) {
// string packing failed, set error
return -1;
}
// packed_string contains a copy of "Hello world"
NpyString_release_allocator(allocator);
Cython Support and the Buffer Protocol#
It’s impossible for StringDType
to support the Python buffer protocol, so
Cython will not support idiomatic typed memoryview syntax for StringDType
arrays unless special support is added in Cython in the future. We have some
preliminary ideas for ways to either update the buffer protocol [9] or make
use of the Arrow C data interface [10] to expose NumPy arrays for DTypes that
don’t make sense in the buffer protocol, but those efforts will likely not come
to fruition in time for NumPy 2.0. This means adapting legacy Cython code that
uses arrays of fixed-width strings to work with StringDType
will be
non-trivial. Adapting code that worked with object string arrays should be
straightforward since object arrays aren’t supported by the buffer protocol
either and will likely have no types or have object
type in Cython.
We will add Cython nogil wrappers for the public C API functions added as part of this work to ease integration with downstream Cython code.
Memory Layout and Managing Heap Allocations#
Below we provide a detailed description of the memory layout we have chosen, but before diving in we want to observe that the C API described above does not publicly expose any of these details. All of the following is subject to future revision, improvement, and change because the precise memory layout of the string data is not publicly exposed.
Memory Layout and Small String Optimization#
Each array element is represented as a union, with the following definition on little-endian architectures:
typedef struct _npy_static_vstring_t {
size_t offset;
size_t size_and_flags;
} _npy_static_string_t;
typedef struct _short_string_buffer {
char buf[sizeof(_npy_static_string_t) - 1];
unsigned char size_and_flags;
} _short_string_buffer;
typedef union _npy_static_string_u {
_npy_static_string_t vstring;
_short_string_buffer direct_buffer;
} _npy_static_string_u;
The _npy_static_vstring_t
representation is most useful for representing
strings living on the heap directly or in an arena allocation, with the
offset
field either containing a size_t
representation of the address
directly, or an integer offset into an arena allocation. The
_short_string_buffer
representation is most useful for the small string
optimization, with the string data stored in the direct_buffer
field and the
size in the size_and_flags
field. In both cases the size_and_flags
field
stores both the size
of the string as well as bitflags. Small strings store
the size in the final four bits of the buffer, reserving the first four bits of
size_and_flags
for flags. Heap strings or strings in arena allocations use
the most significant byte for flags, reserving the leading bytes for the string
size. It’s worth pointing out that this choice limits the maximum string size allowed to be stored in an array, particularly on 32-bit systems, where only three bytes remain for the size and the limit is therefore 16 megabytes per string - small enough to worry about impacting real-world workflows.
On big-endian systems, the layout is reversed, with the size_and_flags
field
appearing first in the structs. This allows the implementation to always use the
most significant bits of the size_and_flags
field for flags. The
endian-dependent layouts of these structs are an implementation detail and are not publicly exposed in the API.
Whether a string is stored in the arena buffer or on the heap is signaled by setting the NPY_STRING_OUTSIDE_ARENA and NPY_STRING_LONG flags on the string data. Because the maximum size of a heap-allocated string is limited to the size of the largest 7-byte unsigned integer, these flag bits never collide with the size of a valid heap string.
See Memory Layout Examples for some visual examples of strings in each of these memory layouts.
Arena Allocator#
Strings longer than 15 bytes on 64 bit systems and 7 bytes on 32 bit systems are
stored on the heap outside of the array buffer. The bookkeeping for the
allocations is managed by an arena allocator attached to the StringDType
instance associated with an array. The allocator will be exposed publicly as an
opaque npy_string_allocator
struct. Internally, it has the following layout:
struct npy_string_allocator {
npy_string_malloc_func malloc;
npy_string_free_func free;
npy_string_realloc_func realloc;
npy_string_arena arena;
PyThread_type_lock *allocator_lock;
};
This allows us to group memory-allocation functions together and choose different allocation functions at runtime if we desire. Use of the allocator is guarded by a mutex, see below for more discussion about thread safety.
The memory allocations are handled by the npy_string_arena
struct member,
which has the following layout:
struct npy_string_arena {
size_t cursor;
size_t size;
char *buffer;
};
Where buffer
is a pointer to the beginning of a heap-allocated arena,
size
is the size of that allocation, and cursor
is the location in the
arena where the last arena allocation ended. The arena is filled using an
exponentially expanding buffer, with an expansion factor of 1.25.
Each string entry in the arena is prepended by a size, stored either in a
char
or a size_t
, depending on the length of the string. Strings with
lengths between 16 or 8 (depending on architecture) and 255 are stored with a
char
size. We refer to these as “medium” strings internally. This choice
reduces the overhead for storing smaller strings on the heap by 7 bytes per
medium-length string. Strings in the arena with lengths longer than 255 bytes
have the NPY_STRING_LONG
flag set.
If the contents of a packed string are freed and then assigned to a new string with the same size or smaller than the string that was originally stored in the packed string, the existing short string or arena allocation is re-used. There is one exception, however: when a string in the arena is overwritten with a short string, the arena metadata is lost and the arena allocation cannot be re-used.
If the string is enlarged, the existing space in the arena buffer cannot be used, so instead we resort to allocating space directly on the heap via malloc, and the NPY_STRING_OUTSIDE_ARENA and NPY_STRING_LONG flags are set. Note that NPY_STRING_LONG can be set even for strings with lengths less than 255 bytes in this case. The heap address overwrites the arena offset, so future string replacements will be stored on the heap or directly in the array buffer as a short string.
No matter where it is stored, once a string is initialized it is marked with the
NPY_STRING_INITIALIZED
flag. This lets us clearly distinguish between an
uninitialized empty string and a string that has been mutated into the empty
string.
The size of the allocation is stored in the arena to allow reuse of the arena allocation if a string is mutated. In principle we could disallow re-use of the arena buffer and not store the sizes in the arena. This may or may not save memory or be more performant depending on the exact usage pattern. For now we are erring on the side of avoiding unnecessary heap allocations when a string is mutated but in principle we could simplify the implementation by choosing to always store mutated arena strings as heap strings and ignore the arena allocation. See below for more detail on how we deal with the mutability of NumPy arrays in a multithreaded context.
Using a per-array arena allocator ensures that the string buffers for nearby array elements are usually nearby on the heap. We do not guarantee that neighboring array elements are contiguous on the heap to support the small string optimization, missing data, and allow mutation of array entries. See below for more discussion on how these topics affect the memory layout.
Mutation and Thread Safety#
Mutation introduces the possibility of data races and use-after-free errors when an array is accessed and mutated by multiple threads. Additionally, if we allocate mutated strings in the arena buffer and mandate contiguous storage where the old string is replaced by the new one, mutating a single string may trigger reallocating the arena buffer for the entire array. This is a pathological performance degradation compared with object string arrays or fixed-width strings.
One solution would be to disable mutation, but inevitably there will be downstream uses of object string arrays that mutate array elements that we would like to support.
Instead, we have opted to pair the npy_string_allocator
instance attached to
PyArray_StringDType
instances with a PyThread_type_lock
mutex. Any function in
the static string C API that allows manipulating heap-allocated data accepts an
allocator
argument. To use the C API correctly, a thread must acquire the
allocator mutex before any usage of the allocator
.
The PyThread_type_lock
mutex is relatively heavyweight and does not provide
more sophisticated locking primitives that allow multiple simultaneous
readers. As part of the GIL-removal project, CPython is adding new
synchronization primitives to the C API for projects like NumPy to make use
of. When this happens, we can update the locking strategy to allow multiple
simultaneous reading threads, along with other fixes for threading bugs in NumPy
that will be needed once the GIL is removed.
Freeing Strings#
Existing strings must be freed before discarding or re-using a packed string. The API is constructed to require this for all strings, even for short strings with no heap allocations. In all cases, all data in the packed string are zeroed out, except for the flags, which are preserved.
Memory Layout Examples#
We have created illustrative diagrams for the three possible string memory layouts. All diagrams assume a 64-bit little-endian architecture.
Short strings store string data directly in the array buffer. On little-endian architectures, the string data appear first, followed by a single byte that allows space for four flags and stores the size of the string as an unsigned integer in the final 4 bits. In this example, the string contents are “Hello world”, with a size of 11. The flags indicate this string is stored outside the arena and is initialized.
Arena strings store string data in a heap-allocated arena buffer that is managed
by the StringDType
instance attached to the array. In this example, the
string contents are “Numpy is a very cool library”, stored at offset 0x94C
in the arena allocation. Note that the size
is stored twice, once in the
size_and_flags
field, and once in the arena allocation. This facilitates
re-use of the arena allocation if a string is mutated. Also note that because
the length of the string is small enough to fit in an unsigned char
, this is
a “medium”-length string and the size requires only one byte in the arena
allocation. An arena string larger than 255 bytes would need 8 bytes in the
arena to store the size in a size_t
. The only flag set indicates this string
is initialized.
Heap strings store string data in a buffer returned by PyMem_RawMalloc and, instead of storing an offset into an arena buffer, directly store the heap address returned by malloc. In this example, the string contents are “Numpy is a very cool library” and are stored at heap address 0x4d3d3d3. The string has three flags set, indicating it is a “long” string (i.e., not a short string) stored outside the arena, and is initialized. Note that if this string were stored inside the arena, it would not have the long string flag set because it requires less than 256 bytes to store.
Empty Strings and Missing Data#
The layout we have chosen has the benefit that a newly created array buffer returned by calloc will be filled with empty strings by
construction, since a string with no flags set is an uninitialized zero-length
arena string. This is not the only valid representation of an empty string, since other
flags may be set to indicate that the empty string is associated with a
pre-existing short string or arena string.
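This property is visible from Python: allocating a new array produces empty strings with no per-string setup (a small sketch):
>>> np.zeros(3, dtype=StringDType())   # a zeroed buffer is all empty strings
array(['', '', ''], dtype=StringDType())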
Missing strings will have an identical representation, except they will always have the NPY_STRING_MISSING flag set in the flags field. Users will need to
check if a string is null before accessing an unpacked string buffer and we have
set up the C API in such a way as to force null-checking whenever a string is
unpacked. Both missing and empty strings can be detected based on data in the
packed string representation and do not require corresponding room in the arena
allocation or extra heap allocations.
Implementation#
We have an open pull request [12] adding StringDType that is ready to merge into NumPy.
We have created a development branch of Pandas that supports creating Pandas
data structures using StringDType
[13]. This illustrates the refactoring
necessary to support StringDType
in downstream libraries that make
substantial use of object string arrays.
If accepted, the bulk of the remaining work of this NEP is in updating documentation and polishing the NumPy 2.0 release. We have already done the following:
- Create an np.strings namespace and expose the string ufuncs directly in that namespace.
- Move the StringDType implementation from an external extension module into NumPy, refactoring NumPy where appropriate. This new DType will be added in one large pull request including documentation updates. Where possible, we will extract fixes and refactorings unrelated to StringDType into smaller pull requests before issuing the main pull request.
We will continue doing the following:
- Deal with remaining issues in NumPy related to new DTypes. In particular, we are already aware that remaining usages of copyswap in NumPy should be migrated to use a cast or an as-yet-to-be-added single-element copy DType API slot. We also need to ensure that DType classes can be used interchangeably with DType instances in the Python API everywhere it makes sense to do so and add useful errors in all other places DType instances can be passed in but DType classes don’t make sense to use.
Alternatives#
The main alternative is to maintain the status quo and offer object arrays as the solution for arrays of variable-length strings. While this would work, it means the memory usage and performance improvements described above, as well as future optimizations, will not be realized anytime soon, and NumPy will lose relevance to other ecosystems with better support for arrays of textual data.
We do not see the proposed DType as mutually exclusive to an improved
fixed-width binary string DType that can represent arbitrary binary data or text
in any encoding and adding such a DType in the future will be easier once
overall support for string data in NumPy has improved after adding
StringDType
.
Discussion#
References and footnotes#
Copyright#
This document has been placed in the public domain.