This document proposes a restructuring of the NumPy Documentation, both in form
and content, with the goal of making it more organized and discoverable for
beginners and experienced users.
See here for the front page of the latest docs.
The organization is quite confusing and illogical (e.g. user and developer docs
are mixed). We propose the following:
Reorganizing the docs into the four categories mentioned in [1], namely Tutorials, How-Tos, Reference Guide and Explanations (more about this below);
Creating dedicated sections for Tutorials and How-Tos, including orientation
on how to create new content;
Adding an Explanations section for key concepts and techniques that require
deeper descriptions, some of which will be rearranged from the Reference Guide.
The documentation is a fundamental part of any software project, especially
open source projects. In the case of NumPy, many beginners might feel demotivated
by the current structure of the documentation, since it is difficult to discover
what to learn (unless the user has a clear view of what to look for in the
Reference docs, which is not always the case).
Looking at the results of a “NumPy Tutorial” search on any search engine also
gives an idea of the demand for this kind of content. Having official high-level
documentation written using up-to-date content and techniques will certainly
mean more users (and developers/contributors) are involved in the NumPy community.
The restructuring will effectively demand a complete rewrite of links and some
of the current content. Input from the community will be useful for identifying
key links and pages that should not be broken.
As discussed in the article [1], there are four categories of doc content:
tutorials, how-tos, explanations, and reference guides.
We propose to use those categories (for writing and reviewing) whenever we
add a new documentation section.
The reasoning is that these categories make it clearer, both for
developers/documentation writers and for users, where each piece of
information should go, as well as the scope and tone of each document. For
example, if explanations are mixed with basic tutorials, beginners
might be overwhelmed and alienated. On the other hand, if the reference
guide contains basic how-tos, it might be difficult for experienced
users to quickly find the information they need.
Currently, there are many blogs and tutorials on the internet about NumPy or
using NumPy. One of the issues with this is that if users search for this
information they may end up in an outdated (unofficial) tutorial before
they find the current official documentation. This can be confusing,
especially for beginners. Having a better infrastructure for the
documentation also aims to solve this problem by giving users high-level,
up-to-date official documentation that can be easily updated.
NumPy has a quite complete reference guide. All functions are documented, most
have examples, and most are cross-linked well with See Also sections. Further
improving the reference guide is incremental work that can be done (and is being
done) by many people. There are, however, many explanations in the reference
guide. These can be moved to a dedicated Explanations section in the docs.
NumPy does not have many how-tos. The subclassing and array ducktyping section
may be an example of a how-to. Others that could be added are:
Parallelization (controlling BLAS multithreading with threadpoolctl, using
multiprocessing, random number generation, etc.)
Storing and loading data (.npy/.npz format, text formats, Zarr, HDF5, etc.)
Performance (memory layout, profiling, use with Numba, Cython, or Pythran)
Writing generic code that works with NumPy, Dask, CuPy, pydata/sparse, etc.
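As a rough illustration of the scale a how-to might have, the storing and loading item could open with something like the sketch below. It uses only numpy.savez and numpy.load; the array names and values are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# Hypothetical example arrays; any arrays would do.
temperatures = np.linspace(15.0, 25.0, num=5)
counts = np.arange(6, dtype=np.int64).reshape(2, 3)

# An in-memory buffer stands in for a real file so the sketch leaves
# nothing behind on disk. A file name such as "data.npz" works the same way.
buffer = io.BytesIO()
np.savez(buffer, temperatures=temperatures, counts=counts)
buffer.seek(0)

# np.load returns a dict-like archive for .npz data; each array is read
# when it is indexed by the keyword name used in np.savez.
with np.load(buffer) as archive:
    names = sorted(archive.files)
    restored_temperatures = archive["temperatures"]
    restored_counts = archive["counts"]
```

Shapes and dtypes survive the round trip, which is the main reason to prefer this format over plain text for intermediate results.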
There is a reasonable amount of content on fundamental NumPy concepts such as
indexing, vectorization, broadcasting, (g)ufuncs, and dtypes. This could be
organized better and clarified to ensure it’s really about explaining the concepts
and not mixed with tutorial or how-to-like content.
There are few explanations about anything other than those fundamental NumPy
concepts.
Some examples of concepts that could be expanded:
Copies vs. Views;
BLAS and other linear algebra libraries;
In addition, there are many explanations in the Reference Guide, which should be
moved to this new dedicated Explanations section.
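An explanation page usually benefits from a small demonstration of the concept it discusses. For the copies vs. views topic, a minimal sketch could look like the following; the variable names are illustrative only.

```python
import numpy as np

original = np.arange(10)

# Basic slicing returns a view: a new array object over the same memory.
view = original[2:6]

# Indexing with a list of integers (advanced indexing) returns a copy.
copy = original[[2, 3, 4, 5]]

view[0] = -1    # writes through to original[2]
copy[1] = -99   # leaves original[3] untouched

shares_with_view = np.shares_memory(original, view)
shares_with_copy = np.shares_memory(original, copy)
```

Here np.shares_memory reports True for the slice and False for the advanced-indexing result, and only the write through the view is visible in the original array.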
There’s a lot of scope for writing better tutorials. We have a new NumPy for
absolute beginners tutorial [3] (GSoD project of Anne Bonner). In addition we
need a number of tutorials addressing different levels of experience with Python
and NumPy. This could be done using engaging data sets, ideas or stories. For
example, curve fitting with polynomials and functions in numpy.linalg could
be done with the Keeling curve (decades worth of CO2 concentration in air
measurements) rather than with synthetic random data.
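The fitting step of such a tutorial might be sketched as below, using numpy.vander and numpy.linalg.lstsq. The numbers are made-up placeholders that follow an exact quadratic, not Keeling curve measurements; a real tutorial would load the measured record instead, and the variable names are assumptions for illustration.

```python
import numpy as np

# Placeholder series, NOT real CO2 data: an exact quadratic so that the
# recovered coefficients are easy to check by eye.
years_since_start = np.arange(8, dtype=float)
values = 2.0 + 3.0 * years_since_start + 0.5 * years_since_start**2

# Least-squares polynomial fit: one column per power of x (1, x, x**2).
degree = 2
design = np.vander(years_since_start, degree + 1, increasing=True)
coefficients, residuals, rank, singular_values = np.linalg.lstsq(
    design, values, rcond=None
)

# Evaluate the fitted polynomial at the sample points.
fitted = design @ coefficients
```

With increasing=True the coefficients come back in order of increasing power, so they can be read off against the formula that generated the placeholder values.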
Ideas for tutorials (these capture the types of things that make sense, they’re
not necessarily the exact topics we propose to implement):
Conway’s game of life with only NumPy (note: already in Nicolas Rougier’s book)
Using masked arrays to deal with missing data in time series measurements
Using Fourier transforms to analyze the Keeling curve data, and extrapolate it.
Geospatial data (e.g. lat/lon/time to create maps for every year via a stacked
array, like gridMet data)
Using text data and dtypes (e.g. use speeches from different people, shape
(n_speech, n_sentences, n_words))
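To give a sense of how compact a NumPy-only tutorial core can be, here is a minimal sketch of one Game of Life generation. It is not taken from Nicolas Rougier’s book; the function name and the choice of wrap-around edges are assumptions made for this illustration.

```python
import numpy as np


def life_step(grid):
    """Advance a 2-D boolean grid by one generation.

    Edges wrap around (toroidal board), which keeps the neighbor count
    a pure array operation with no special cases at the border.
    """
    cells = grid.astype(int)
    # Sum the eight shifted copies of the board to count live neighbors.
    neighbors = sum(
        np.roll(np.roll(cells, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has exactly 3 live neighbors,
    # or if it is alive now and has exactly 2.
    return (neighbors == 3) | (grid & (neighbors == 2))


# A "blinker": three live cells in a row, which oscillates with period 2.
blinker = np.zeros((5, 5), dtype=bool)
blinker[2, 1:4] = True

after_one = life_step(blinker)
after_two = life_step(after_one)
```

After one step the horizontal bar becomes a vertical bar, and after two steps the board returns to its starting state.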
The Preparing to Teach document [2] from the Software Carpentry Instructor
Training materials is a nice summary of how to write effective lesson plans (and
tutorials would be very similar). In addition to adding new tutorials, we also
propose a How to write a tutorial document, which would help users contribute
new high-quality content to the documentation.
Using interesting data in the NumPy docs requires giving all users access to
that data, either inside NumPy or in a separate package. The former is not the
best idea, since it’s hard to do without increasing the size of NumPy
significantly. Even for SciPy there has so far been no consensus on this (see
scipy PR 8707 on adding a new
So we’ll aim for a new (pure Python) package, named numpy-datasets or
scipy-datasets or something similar. That package can take some lessons from
how, e.g., scikit-learn ships data sets. Small data sets can be included in the
repo, large data sets can be accessed via a downloader class or function.
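A downloader function for large data sets could be sketched as follows. This is only an illustration using the standard library: the name fetch_dataset, its parameters, and the cache layout are hypothetical and do not describe an existing numpy-datasets or scipy-datasets API.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_dataset(url, filename, sha256, cache_dir):
    """Return a local path to the data at `url`, downloading at most once.

    Hypothetical helper for illustration: the file is stored as
    `cache_dir / filename` and reused while its checksum still matches.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename

    # Cache hit: the file exists and has the expected content.
    if target.exists():
        if hashlib.sha256(target.read_bytes()).hexdigest() == sha256:
            return target

    # Cache miss: download, verify, and only then write to the cache.
    with urllib.request.urlopen(url) as response:
        data = response.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != sha256:
        raise ValueError(
            f"checksum mismatch for {filename}: expected {sha256}, got {digest}"
        )
    target.write_bytes(data)
    return target
```

Pinning a checksum per file means documentation examples keep producing the same results even if the hosted data changes, and a corrupted download is never written to the cache.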
Some examples of documentation organization in other projects:
Documentation for Jupyter
Documentation for Python
Documentation for TensorFlow
These projects make the intended audience for each part of the documentation
more explicit, as well as previewing some of the content in each section.
Currently, the documentation for NumPy can be
confusing, especially for beginners. Our proposal is to reorganize the docs in
the following structure:
Absolute Beginners Tutorial
main Tutorials section
How Tos for common tasks with NumPy
Reference Guide (API Reference)
Building and extending the documentation
NumPy Enhancement Proposals
Besides rewriting the current documentation to some extent, it would be ideal
to have a technical infrastructure that would allow more contributions from the
community. For example, if Jupyter Notebooks could be submitted as-is as
tutorials or How-Tos, this might create more contributors and broaden the NumPy
community.
Similarly, if people could download some of the documentation in Notebook
format, this would certainly mean people would use less outdated material for
learning.
It would also be interesting if the new structure for the documentation makes
Discussion around this NEP can be found on the NumPy mailing list:
[1] What nobody tells you about documentation
[2] Preparing to Teach (from the Software Carpentry Instructor Training materials)
[3] NumPy for absolute beginners Tutorial by Anne Bonner
This document has been placed in the public domain.