NumPy advanced tutorial - super detailed

Article catalog

Numpy basic tutorial link

1, Inside the target of darray

NumPy's ndarray provides a way to interpret homogeneous data blocks (which can be continuous or spanning) as multidimensional array objects. As you've seen before, data types determine how data is interpreted, such as floating-point numbers, integers, Booleans, and so on.

Part of the reason why ndarray is so powerful is that all array objects are a striped view of data blocks. You may want to know why array view arr[::2,::-1] does not copy any data. In short, ndarray is not only a block of memory and a dtype, it also has span information, which enables the array to move in memory in various steps. To be more precise, the inside of ndarray consists of the following:

*A pointer to data (a piece of data in a memory or memory mapping file).
*A data type or dtype that describes a lattice of fixed size values in an array.
*A tuple representing the shape of an array.
*A span tuple (strip), where the integer refers to the number of bytes that need to be "spanned" in order to advance to the next element of the current dimension.

The figure below illustrates the internal structure of darray.


For example, an array of 10 × 5, whose shape is (10,5):

import numpy as np
np.ones((10, 5)).shape
(10, 5)

A typical (C-order, which will be explained in detail later) 3 × 4 × 5 float64 (8 bytes) array with a span of (160,40,8) - it is very useful to know the span. Generally, the larger the span is on an axis, the greater the cost of calculation along this axis is:

np.ones((3, 4, 5), dtype=np.float64).strides
(160, 40, 8)

Although NumPy users are rarely interested in array span information, they are an important factor in building non replicated array views. The span can even be negative, which makes the array move backward in memory, such as in slice obj[::-1] or obj[:,::-1].

1.1 Numpy dtype hierarchy

You may occasionally need to check whether an array contains integers, floating-point numbers, strings, or Python objects. Because there are many kinds of floating-point numbers (from float16 to float128), it is very tedious to judge whether dtype belongs to a large class. Fortunately, dtype has a superclass (such as np.integer and np.floating ), they can follow np.issubdtype Functions in combination:

ints = np.ones(10, dtype=np.uint16)
floats = np.ones(10, dtype=np.float32)
np.issubdtype(ints.dtype, np.integer)
True
np.issubdtype(floats.dtype, np.floating)
True

Call dtype's mro method to view all its parent classes:

np.float64.mro()

[numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object]

Then we get:

np.issubdtype(ints.dtype, np.number)

True

Most NumPy users do not need to know this knowledge at all, but it can be used occasionally. The figure below illustrates the dtype system and parent-child class relationships.

2, High order array operation

In addition to fancy index, slicing, Boolean conditional subset and other operations, there are many ways to operate arrays. Although the advanced functions in pandas can handle many heavy tasks in data analysis, sometimes you still need to write some data algorithms that cannot be found in the existing library.

2.1 reshaping array

In most cases, you can convert an array from one shape to another without copying any data. Just pass a tuple representing the new shape to the instance method reshape of the array. For example, suppose you have a one-dimensional array that we want to rearrange into a matrix

arr = np.arange(8)
arr

array([0, 1, 2, 3, 4, 5, 6, 7])
arr.reshape((4, 2))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])


Reshaping in C (row direction) and F (column direction)

Multidimensional arrays can also be reshaped:

arr.reshape((4, 2)).reshape((2, 4))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

One dimension of the shape as a parameter can be - 1, which means that the size of the dimension is inferred from the data itself:

arr = np.arange(15)
arr.reshape((5, -1))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In contrast to reshape's operation of converting a one-dimensional array to a multi-dimensional array, the operation is usually called flattening or diverging:

arr = np.arange(15).reshape((5, 3))
arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

If the values in the result are continuous in the original array, travel does not produce a copy of the source data. The behavior of the flatten method is similar to that of travel, except that it always returns a copy of the data:

Arrays can be reshaped or scattered in other order. This is a delicate problem for the novice NumPy, so we will focus on it in the next section.

2.2 C order and Fortran order

NumPy allows you to control the layout of data in memory more flexibly. By default, NumPy arrays are created in row priority. In terms of space, this means that for a two-dimensional array, the data items in each row are stored in adjacent memory locations. Another order is column precedence, which means that the data items in each column are stored in adjacent memory locations.

For some historical reasons, row and column precedence is also known as C and Fortran order, respectively. In FORTRAN 77, matrices are all column first.

Functions such as reshape and reval can accept an order parameter that represents the storage order of array data. Generally, it can be 'C' or 'F' (there are also 'a' and 'K' which are not commonly used, please refer to NumPy's documents for details). Figure A-3 illustrates this:

arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
arr.ravel('F')

array([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])

The reconstruction process of two-dimensional or higher-dimensional array is more puzzling. The key difference between C and Fortran order is the progression order of dimensions:

C / row precedence: first through higher dimensions (for example, axis 1 will be processed before axis 0).
Fortran / column precedence: after a higher dimension (for example, axis 0 will be processed before axis 1).

2.3 join and separate arrays

numpy.concatenate You can connect a sequence of arrays (such as tuples, lists, etc.) together by a specified axis:

arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])
np.concatenate([arr1, arr2], axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
np.concatenate([arr1, arr2], axis=1)

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

For common connection operations, NumPy provides some convenient methods (such as vstack and hstack). Therefore, the above operations can also be expressed as:

np.vstack((arr1, arr2))

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
np.hstack((arr1, arr2))

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In contrast, split is used to split an array into multiple arrays along a specified axis:

arr = np.random.randn(5, 2)
arr

array([[-0.27526867, -0.67921962],
       [-0.92972287, -0.35201398],
       [-0.2558084 , -0.28675778],
       [-1.26047769,  1.43530095],
       [ 0.69262824, -0.57397718]])
first, second, third = np.split(arr, [1, 3])
first

array([[-0.27526867, -0.67921962]])
second

array([[-0.92972287, -0.35201398],
       [-0.2558084 , -0.28675778]])
third

array([[-1.26047769,  1.43530095],
       [ 0.69262824, -0.57397718]])

Incoming to np.split The value [1,3] of indicates at which index the array is split.

The following table lists all the functions related to array join and split, some of which are specially provided for the convenience of common join operations.

2.31 stacking assistant: r_ And c_

There are two special objects in NumPy namespace -- r_ And c_ , which can make the stacking operation of arrays more concise:

arr = np.arange(6)
arr1 = arr.reshape((3, 2))
arr2 = np.random.randn(3, 2)
np.r_[arr1, arr2]

array([[ 0.412107  , -0.23394682],
       [-0.01930548, -0.99743316],
       [-0.85863788, -0.40004808]])
np.r_[arr1, arr2]

array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [ 0.412107  , -0.23394682],
       [-0.01930548, -0.99743316],
       [-0.85863788, -0.40004808]])
np.c_[np.r_[arr1, arr2], arr]

array([[ 0.        ,  1.        ,  0.        ],
       [ 2.        ,  3.        ,  1.        ],
       [ 4.        ,  5.        ,  2.        ],
       [ 0.412107  , -0.23394682,  3.        ],
       [-0.01930548, -0.99743316,  4.        ],
       [-0.85863788, -0.40004808,  5.        ]])

It can also convert slices to arrays:

np.c_[1:6, -10:-5]

array([[  1, -10],
       [  2,  -9],
       [  3,  -8],
       [  4,  -7],
       [  5,  -6]])

2.4 repeating elements: tile and repeat

Repeat and tile are the main tools to repeat arrays to produce larger arrays. Repeat repeats each element in the array a certain number of times, resulting in a larger array:

arr = np.arange(3)
arr.repeat(3)

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

By default, if you pass in an integer, each element repeats that many times. If you pass in a set of integers, each element can repeat a different number of times:

arr.repeat([2, 3, 4])

array([0, 0, 1, 1, 1, 2, 2, 2, 2])

For multidimensional arrays, you can also have their elements repeat along the specified axis:

arr = np.random.randn(2, 2)
arr

array([[ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152]])
arr.repeat(2, axis=0)

array([[ 0.88588819, -1.25990033],
       [ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152],
       [ 1.62085848,  0.92912152]])

Note that if the axis is not set, the array will be flattened, which may not be the result you want. In the same way, when you repeat a multi dimension, you can also pass in a set of integers, which will make each slice repeat for different times:

arr.repeat([2, 3], axis=0)

array([[ 0.88588819, -1.25990033],
       [ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152],
       [ 1.62085848,  0.92912152],
       [ 1.62085848,  0.92912152]])
arr.repeat([2, 3], axis=1)

array([[ 0.88588819,  0.88588819, -1.25990033, -1.25990033, -1.25990033],
       [ 1.62085848,  1.62085848,  0.92912152,  0.92912152,  0.92912152]])

tile's function is to stack copies of an array along a specified axis. You can visualize it as "tiling":

arr

array([[ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152]])
np.tile(arr, 2)

array([[ 0.88588819, -1.25990033,  0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152,  1.62085848,  0.92912152]])

The second parameter is the number of tiles. For scalars, tiles are laid horizontally, not vertically. It can be a tuple representing a lay out:

arr

array([[ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152]])
np.tile(arr, (2, 1))

array([[ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152],
       [ 0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152]])
np.tile(arr, (3, 2))

array([[ 0.88588819, -1.25990033,  0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152,  1.62085848,  0.92912152],
       [ 0.88588819, -1.25990033,  0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152,  1.62085848,  0.92912152],
       [ 0.88588819, -1.25990033,  0.88588819, -1.25990033],
       [ 1.62085848,  0.92912152,  1.62085848,  0.92912152]])

2.5 the equivalent method of magic index: take and put

As we mentioned before, one way to get and set array subsets is to use a fancy index through an integer array:

arr = np.arange(10) * 100
inds = [7, 1, 2, 6]
arr[inds]

array([700, 100, 200, 600])

There are other ways for ndarray to get a selection on a single axis:

arr.take(inds)

array([700, 100, 200, 600])

put function is used to replace the specified selection with a numeric value

arr.put(inds, 42)
arr

array([  0,  42,  42, 300, 400, 500,  42,  42, 800, 900])
arr.put(inds, [40, 41, 42, 43])
arr

array([  0,  41,  42, 300, 400, 500,  43,  40, 800, 900])

To use take on other axes, simply pass in the axis keyword:

inds = [2, 0, 2, 1]
arr = np.random.randn(2, 4)
arr

array([[ 1.46902272,  0.01935251, -0.6950622 ,  2.24996752],
       [ 1.03686395, -0.67560415, -0.83536367, -0.27154602]])
arr.take(inds, axis=1)

array([[-0.6950622 ,  1.46902272, -0.6950622 ,  0.01935251],
       [-0.83536367,  1.03686395, -0.83536367, -0.67560415]])

put does not accept the axis parameter, it only indexes on the flattened version of the array (one-dimensional, C-order). Therefore, it is better to use a fancy index when you need to set elements with other axial indexes.

3, Broadcast

broadcasting refers to the execution of arithmetic operations between arrays of different shapes. It's a very powerful feature, but it's also misleading, even for experienced veterans. The simplest broadcast occurs when a scalar value is combined with an array:

arr = np.arange(5)
arr

array([0, 1, 2, 3, 4])
arr * 4

array([ 0,  4,  8, 12, 16])

Here we say: in this multiplication operation, the scalar value 4 is broadcast to all other elements.

Let's take an example. We can offset each column of an array by subtracting the average value of the columns. The solution to this problem is very simple:

arr = np.random.randn(4, 3)
arr.mean(0)

array([-1.44187575,  0.13700902, -0.06376677])
demeaned = arr - arr.mean(0)
demeaned

array([[ 0.79875636, -1.16113748, -0.26786685],
       [-1.56813261,  2.14316652,  1.48499361],
       [ 0.32467283, -0.1763157 ,  1.12517759],
       [ 0.44470342, -0.80571333, -2.34230434]])

The figure below shows the process vividly. It will be a little more difficult to use the broadcast method to level the rows. Fortunately, as long as certain rules are followed, the values of low dimensions can be broadcast to any dimension of the array (such as subtracting the row average value from each column of the two-dimensional array).

So you get:

Although I am an experienced NumPy veteran, I often have to stop and draw a picture and think about the principles of broadcasting. Let's look at the last example. Suppose you want to subtract that average from each row. because arr.mean The length of (0) is 3, so it can broadcast in the 0 axis: because the ending dimension of arr is 3, they are compatible.

According to this principle, to subtract in the 1 axis (that is, subtract the average value of each row), the shape of the smaller array must be (4,1):

arr

array([[-0.64311939, -1.02412846, -0.33163362],
       [-3.01000836,  2.28017554,  1.42122684],
       [-1.11720292, -0.03930668,  1.06141082],
       [-0.99717233, -0.66870431, -2.40607111]])
row_means = arr.mean(1)
row_means.shape

(4,)
row_means.reshape((4, 1))

array([[-0.66629382],
       [ 0.23046467],
       [-0.03169959],
       [-1.35731592]])
demeaned = arr - row_means.reshape((4, 1))
demeaned

array([[ 0.02317443, -0.35783463,  0.3346602 ],
       [-3.24047303,  2.04971087,  1.19076216],
       [-1.08550333, -0.00760708,  1.09311041],
       [ 0.36014359,  0.68861161, -1.0487552 ]])

The following figure illustrates the process of this operation.


Figure A-6 shows another case, this time adding a two-dimensional array along the 0 axis to a three-dimensional array.

3.1 broadcast on other axes

The broadcast of high-dimensional array seems to be more difficult to understand, but in fact, it also follows the broadcast principle. If not, you will get the following error:

arr - arr.mean(1)

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-73-8b8ada26fac0> in <module>
----> 1 arr - arr.mean(1)



ValueError: operands could not be broadcast together with shapes (4,3) (4,) 

People often need to broadcast the array with lower dimension on other axis except 0 axis through arithmetic operation. According to the broadcast principle, the broadcast dimension of a smaller array must be 1. In the above example of row minus mean, this means to change the shape of row mean to (4,1) instead of (4,):

arr - arr.mean(1).reshape((4, 1))

array([[ 0.02317443, -0.35783463,  0.3346602 ],
       [-3.24047303,  2.04971087,  1.19076216],
       [-1.08550333, -0.00760708,  1.09311041],
       [ 0.36014359,  0.68861161, -1.0487552 ]])

In the case of 3D, broadcasting on any one dimension of 3D means reshaping the data into compatible shapes. Figure A-7 illustrates the shape requirements to be broadcast across the dimensions of a 3D array.

So there is a very common problem (especially in the general algorithm), that is to add a new axis with a length of 1 for broadcasting. While reshape is a solution, the insertion axis needs to construct a tuple representing the new shape. It's a depressing process. Therefore, NumPy arrays provide a special syntax for inserting axes through an indexing mechanism. The following code uses the special np.newaxis Property and the all slice to insert a new axis:

arr = np.zeros((4, 4))
arr_3d = arr[:, np.newaxis, :]
arr_3d.shape

(4, 1, 4)
arr_1d = np.random.normal(size=3)
arr_1d[:, np.newaxis]

array([[1.58331572],
       [0.20398254],
       [0.52159683]])
arr_1d[np.newaxis, :]

array([[1.58331572, 0.20398254, 0.52159683]])

Therefore, if we have a three-dimensional array and want to subtract the mean value from axis 2, we just need to write the following code:

arr = np.random.randn(3, 4, 5)
depth_means = arr.mean(2)
depth_means

array([[-0.48697362,  0.60461413, -0.73433923,  0.12402652],
       [-0.54290158, -0.87327633,  0.96338243,  0.07150238],
       [ 0.64337069, -0.5125288 , -0.42618718, -0.36957221]])
depth_means.shape

(3, 4)
demeaned = arr - depth_means[:, :, np.newaxis]
demeaned.mean(2).astype('int')

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

Some readers may wonder if there is a common way to do this without sacrificing performance? Actually, there are, but some indexing skills are needed:

def demean_axis(arr, axis=0):
    means = arr.mean(axis)

    # This generalizes things like [:, :, np.newaxis] to N dimensions
    indexer = [slice(None)] * arr.ndim
    indexer[axis] = np.newaxis
    return arr - means[indexer]

3.2 setting the value of the array through broadcast

The broadcast principle followed by arithmetic operation is also applicable to the operation of setting array value through index mechanism. For the simplest case, we can do this:

arr = np.zeros((4, 3))
arr[:] = 5
arr

array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])

However, if we want to use a one-dimensional array to set the columns of the target array, as long as the shape is compatible:

col = np.array([1.28, -0.42, 0.44, 1.6])
arr[:] = col[:, np.newaxis]
arr

array([[ 1.28,  1.28,  1.28],
       [-0.42, -0.42, -0.42],
       [ 0.44,  0.44,  0.44],
       [ 1.6 ,  1.6 ,  1.6 ]])
arr[:2] = [[-1.37], [0.509]]

4, High level ufunc usage

Although many NumPy users only use the fast element level operations provided by general functions, there are actually some advanced uses of general functions that enable us to write more concise code without looping.

4.1 ufunc instance method

Each binary ufunc of NumPy has some special methods for performing specific vectorization operations. I'll illustrate them with a few specific examples.

Reduce takes an array parameter and aggregates its values (indicating the axis) through a series of binary operations. For example, we can use np.add.reduce Sum the elements in the array:

arr = np.arange(10)
np.add.reduce(arr)

45
arr.sum()

45

The starting value depends on ufunc (in the case of add, 0). If the axis number is set, the reduction is performed along that axis. This allows you to get answers to some questions in a simpler way. In the following example, we use np.logical_and check that the values in each row of the array are ordered:

np.random.seed(12346)
arr = np.random.randn(5, 5)
arr[::2].sort(1)
arr[:, :-1] < arr[:, 1:]

array([[ True,  True,  True,  True],
       [False,  True, False, False],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True,  True]])
np.logical_and.reduce(arr[:, :-1] < arr[:, 1:], axis=1)

array([ True, False,  True, False,  True])

Attention, logical_and.reduce It is equivalent to all method.

The relationship between achieve and reduce is just like that between cum sum and sum. It produces an array of intermediate "cumulative" values of the same size as the original array:

arr = np.arange(15).reshape((3, 5))
np.add.accumulate(arr, axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]], dtype=int32)

outer is used to calculate the cross product of two arrays:

arr = np.arange(3).repeat([1, 2, 2])
arr

array([0, 1, 1, 2, 2])
np.multiply.outer(arr, np.arange(5))

array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8],
       [0, 2, 4, 6, 8]])

The dimension of output result of outer is the sum of two dimensions of input data:

x, y = np.random.randn(3, 4), np.random.randn(5)
result = np.subtract.outer(x, y)
result.shape

(3, 4, 5)

The last method, reduceat, is used to calculate "local reduction", which is actually a group by operation to aggregate the data slices. It accepts a set of box edges that indicate how values are split and aggregated:

arr = np.arange(10)
np.add.reduceat(arr, [0, 5, 8])

array([10, 18, 17], dtype=int32)

The final result is a polycondensation performed on arr[0:5], arr[5:8], and arr[8:] (here is the addition). Like other methods, an axis parameter can be passed in here:

arr = np.multiply.outer(np.arange(4), np.arange(5))
arr

array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12]])
np.add.reduceat(arr, [0, 2, 4], axis=1)

array([[ 0,  0,  0],
       [ 1,  5,  4],
       [ 2, 10,  8],
       [ 3, 15, 12]], dtype=int32)

The figure below summarizes some of the ufunc methods.

4.2 using python to write a new ufunc method

There are many ways to write your own NumPy ufuncs. The most common is to use the NumPy C API, but it goes beyond the scope of this book. In this section, we talk about pure Python ufunc.

numpy.frompyfunc Accepts a Python function and two parameters representing the number of input and output parameters, respectively. For example, the following is a simple function that can implement element level addition:

def add_elements(x, y):
    return x + y
add_them = np.frompyfunc(add_elements, 2, 1)
add_them(np.arange(8), np.arange(8))

array([0, 2, 4, 6, 8, 10, 12, 14], dtype=object)

Functions created with frompyfunc always return arrays of Python objects, which is inconvenient. Fortunately, there is another way numpy.vectorize . Although not as powerful as fromyfunction, it allows you to specify the output type:

add_them = np.vectorize(add_elements, otypes=[np.float64])
add_them(np.arange(8), np.arange(8))

array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14.])

Although these two functions provide a way to create ufunc type functions, they are very slow, because they need to perform a Python function call when calculating each element, which is much slower than the C-based ufunc provided by NumPy:

arr = np.random.randn(10000)
%timeit add_them(arr, arr)

1.84 ms ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.add(arr, arr)

5.04 µs ± 53.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

5, Structured and recorded arrays

You may have noticed that the ndarray we have discussed so far is a kind of homogeneous data container, that is to say, in the memory block it represents, each element occupies the same number of bytes (depending on dtype). On the surface, it doesn't seem to be able to represent heterogeneous or tabular data. Structured array is a special kind of ndarray, in which each element can be regarded as a structure in C language (struct, which is the origin of "structure") or a row with multiple named fields in SQL table:

dtype = [('x', np.float64), ('y', np.int32)]
sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)
sarr

array([(1.5       ,  6), (3.14159265, -2)],
      dtype=[('x', '<f8'), ('y', '<i4')])

There are many ways to define structured dtype s (see NumPy's online documentation). The most typical method is the tuple list, and the format of each tuple is (field)_ name,field_ data_ type). In this way, the elements of the array become tuple objects, and each element in the object can be accessed like a dictionary:

sarr[0]

(1.5, 6)
sarr[0]['y']

6

Field name saved dtype.names Property. When accessing a field of a structured array, the view of the data is returned, so data replication will not occur:

sarr['x']

array([1.5       , 3.14159265])

5.1 nested dtype and multidimensional fields

When defining a structured dtype, you can set another shape (an integer or a tuple):

dtype = [('x', np.int64, 3), ('y', np.int32)]
arr = np.zeros(4, dtype=dtype)
arr

array([([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0)],
      dtype=[('x', '<i8', (3,)), ('y', '<i4')])

In this case, the x field of each record represents an array with a length of 3:

arr[0]['x']

array([0, 0, 0], dtype=int64)

In this way, you can access arr ['x '] to get a two-dimensional array instead of the one-dimensional array in the previous example:

arr['x']

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=int64)

This allows you to store complex nested structures in blocks of memory in a single array. You can also nest dtype s to make more complex structures. Here is a simple example:

dtype = [('x', [('a', 'f8'), ('b', 'f4')]), ('y', np.int32)]
data = np.array([((1, 2), 5), ((3, 4), 6)], dtype=dtype)
data

array([((1., 2.), 5), ((3., 4.), 6)],
      dtype=[('x', [('a', '<f8'), ('b', '<f4')]), ('y', '<i4')])
data['x']

array([(1., 2.), (3., 4.)], dtype=[('a', '<f8'), ('b', '<f4')])
data['y']

array([5, 6])
data['x']['a']

array([1., 3.])

The data frame of pandas does not directly support this function, but its hierarchical indexing mechanism is similar to this one.

5.2 why to use structured array

Compared with panda's DataFrame, NumPy's structured array is a relatively low-level tool. It can interpret a single memory block as a tabular structure with any complex nested columns. Because each element in the array is represented as a fixed number of bytes in memory, structured array can provide very fast and efficient disk data reading and writing (including memory image), network transmission and other functions.

Another common use of structured arrays is to write data files as fixed length record byte streams, which is a common data serialization method in C and C++ code (found in many historical systems in the industry). You can use the np.fromfile Read data into memory. This usage is beyond the scope of this book. You can know this.

6, About NumPy sorting

Like Python's built-in list, the sort instance method of ndarray is in place sorting. In other words, rearranging the contents of an array does not result in a new array:|

arr = np.random.randn(6)
arr.sort()
arr

array([-1.47108206, -1.13286962, -1.01114869, -0.33176812, -0.08468875,
        0.87050269])

When sorting arrays in place, pay attention to that if the target array is just a view, the original array will be modified:

arr = np.random.randn(3, 5)
arr

array([[-0.34357617,  2.17140268,  0.12337075, -0.01893118,  0.17731791],
       [ 0.7423957 ,  0.85475634,  1.03797268, -0.32899594, -1.11807759],
       [-0.24152521, -2.0051193 ,  0.73788753, -1.06137462,  0.59545348]])
arr[:, 0].sort()
arr

array([[-0.34357617,  2.17140268,  0.12337075, -0.01893118,  0.17731791],
       [-0.24152521,  0.85475634,  1.03797268, -0.32899594, -1.11807759],
       [ 0.7423957 , -2.0051193 ,  0.73788753, -1.06137462,  0.59545348]])

contrary, numpy.sort A sorted copy of the original array is created. In addition, the parameters it accepts (such as kind) follow the ndarray.sort Same:

arr = np.random.randn(5)
arr

array([-0.26822958,  1.33885804, -0.18715572,  0.91108374, -0.32150045])
np.sort(arr)

array([-0.32150045, -0.26822958, -0.18715572,  0.91108374,  1.33885804])
arr

array([-0.26822958,  1.33885804, -0.18715572,  0.91108374, -0.32150045])

Both sorting methods can accept an axis parameter to sort each block of data separately along the specified axis:

arr = np.random.randn(3, 5)
arr

array([[ 1.00543901, -0.51683937,  1.19251887, -0.19893404,  0.39691349],
       [-1.76381537,  0.60709023, -0.22215536, -0.21707838, -1.21357483],
       [-0.87044607, -0.2305542 ,  1.04376344, -1.14410284, -0.36360302]])
arr.sort(axis=1)
arr

array([[-0.51683937, -0.19893404,  0.39691349,  1.00543901,  1.19251887],
       [-1.76381537, -1.21357483, -0.22215536, -0.21707838,  0.60709023],
       [-1.14410284, -0.87044607, -0.36360302, -0.2305542 ,  1.04376344]])

You may notice that neither of these sorting methods can be set to descending. In fact, it doesn't matter, because array slicing will generate views (that is, no copies will be generated, and no other calculation work is required). Many Python users are familiar with a list trick: values[::-1] can return a list in reverse order. The same is true for darray:

arr[:, ::-1]

array([[ 1.19251887,  1.00543901,  0.39691349, -0.19893404, -0.51683937],
       [ 0.60709023, -0.21707838, -0.22215536, -1.21357483, -1.76381537],
       [ 1.04376344, -0.2305542 , -0.36360302, -0.87044607, -1.14410284]])

6.1 indirect sorting: argsort and lexport

In the work of data analysis, data sets often need to be sorted according to one or more keys. For example, a data table about student information might need to be sorted by last name and first name (last name before first name). This is an example of indirect sorting. If you have read the chapter about pandas, you have seen many advanced examples. Given one or more keys, you can get an index array of integers (which I affectionately call an indexer), where the index value indicates the position of the data in the new order. argsort and numpy.lexsort It is the two main methods to realize this function. Here is a simple example:

values = np.array([5, 0, 1, 3, 2])
indexer = values.argsort()
indexer # Get the index array after sorting

array([1, 2, 4, 3, 0], dtype=int64)
values[indexer]

array([0, 1, 2, 3, 5])

For a more complex example, the following code sorts an array according to its first line:

arr = np.random.randn(3, 5)
arr[0] = values
arr

array([[ 5.        ,  0.        ,  1.        ,  3.        ,  2.        ],
       [ 0.23159352,  0.72798172, -1.3918432 ,  1.99558262, -0.29812485],
       [ 1.20366758, -0.01576758,  0.74394881,  0.86879898, -0.42864822]])
arr[:, arr[0].argsort()]

array([[ 0.        ,  1.        ,  2.        ,  3.        ,  5.        ],
       [ 0.72798172, -1.3918432 , -0.29812485,  1.99558262,  0.23159352],
       [-0.01576758,  0.74394881, -0.42864822,  0.86879898,  1.20366758]])

Lexport is similar to argsort except that it can perform an indirect sort (dictionary order) on multiple key arrays at once. Suppose we want to sort some data identified by last name and first name:

first_name = np.array(['Bob', 'Jane', 'Steve', 'Bill', 'Barbara'])
last_name = np.array(['Jones', 'Arnold', 'Arnold', 'Jones', 'Walters'])
sorter = np.lexsort((first_name, last_name))
sorter

array([1, 2, 3, 0, 4], dtype=int64)
arr = zip(last_name[sorter], first_name[sorter])
for i,j in arr:
    print(i, j)

Arnold Jane
Arnold Steve
Jones Bill
Jones Bob
Walters Barbara

It's easy to get dizzy when you start using lexport, because the order of key application starts from the last one passed in. It's not hard to see, last_name precedes first_name is applied.

6.2 other sorting algorithms

stable sorting algorithm can keep the relative position of equivalent elements. This is important for those indirect sorts where relative positions have practical significance:

values = np.array(['2:first', '2:second', '1:first', '1:second','1:third'])
key = np.array([2, 2, 1, 1, 1])
indexer = key.argsort(kind='mergesort')
indexer

array([2, 3, 4, 0, 1], dtype=int64)
values.take(indexer)

array(['1:first', '1:second', '1:third', '2:first', '2:second'],
      dtype='<U8')

mergesort is the only stable sort, which guarantees O(n log n) performance (spatial complexity), but its average performance is worse than the default quicksort. Table A-3 lists the available sorting algorithms and their related performance indicators. Most users don't need to know these things at all, but it's always good to know.

6.3 partial sorting of arrays

One of the purposes of sorting may be to determine the largest or smallest element in an array. NumPy has two optimization methods, numpy.partition and np.argpartition , arrays that can be divided in the k-th smallest element:

np.random.seed(12345)
arr = np.random.randn(20)
arr

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057,
        1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474,
        1.00718936, -1.29622111,  0.27499163,  0.22891288,  1.35291684,
        0.88642934, -2.00163731, -0.37184254,  1.66902531, -0.43856974])
np.partition(arr, 3)

array([-2.00163731, -1.29622111, -0.5557303 , -0.51943872, -0.37184254,
       -0.43856974, -0.20470766,  0.28174615,  0.76902257,  0.47894334,
        1.00718936,  0.09290788,  0.27499163,  0.22891288,  1.35291684,
        0.88642934,  1.39340583,  1.96578057,  1.66902531,  1.24643474])

When you call partition(arr, 3), the first three elements in the result are the smallest, with no specific order. numpy.argpartition And numpy.argsort Similarly, the index will be returned, and the data will be rearranged to the equivalent order:

indices = np.argpartition(arr, 3)
indices

array([16, 11,  3,  2, 17, 19,  0,  7,  8,  1, 10,  6, 12, 13, 14, 15,  5,
        4, 18,  9], dtype=int64)
arr[indices]

array([-2.00163731, -1.29622111, -0.5557303 , -0.51943872, -0.37184254,
       -0.43856974, -0.20470766,  0.28174615,  0.76902257,  0.47894334,
        1.00718936,  0.09290788,  0.27499163,  0.22891288,  1.35291684,
        0.88642934,  1.39340583,  1.96578057,  1.66902531,  1.24643474])

six point four numpy.searchsorted : find elements in sorted array

searchsorted is an array method that performs binary search on an ordered array. As long as the value is inserted into the position it returns, the array order can be maintained:

arr = np.array([0, 1, 7, 12, 15])
arr.searchsorted(9)

3

You can pass in a set of values to get a set of indexes:

arr.searchsorted([0, 8, 11, 16])

array([0, 3, 3, 5], dtype=int64)

As you can see from the above results, searchsorted returns 0 for element 0. This is because the default behavior is to return the left index of the equality group:

arr = np.array([0, 0, 0, 1, 1, 1, 1])
arr.searchsorted([0, 1])

array([0, 3], dtype=int64)
arr.searchsorted([0, 1], side='right')

array([3, 7], dtype=int64)

Let's look at another usage of searchsorted. Suppose we have a data array (the value is between 0 and 10000) and an array that represents the "boundary of the surface element". We want to use it to split the data array:

data = np.floor(np.random.uniform(0, 10000, size=50))
bins = np.array([0, 100, 1000, 5000, 10000])
data

array([2449., 7928., 4951., 9150., 9453., 5332., 2524., 7208., 3674.,
       4986., 2265., 3535., 6508., 3129., 7687., 7818., 8524., 9499.,
       1073., 9107., 3360., 8263., 8981.,  427., 1957., 2945., 6269.,
        862., 1429., 5158., 6893., 8566., 6473., 5816., 7111., 2524.,
       9001., 4422.,  205., 9596., 6522., 5132., 6823., 4895., 9264.,
       5158.,  721., 5675., 6152., 9415.])

Then, in order to get the number of the interval to which each data point belongs (where 1 represents face element [0100)), we can directly use searchsorted:

labels = bins.searchsorted(data)
labels

array([3, 4, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 3, 4, 4, 4, 4, 3, 4, 3, 4,
       4, 2, 3, 3, 4, 2, 3, 4, 4, 4, 4, 4, 4, 3, 4, 3, 2, 4, 4, 4, 4, 3,
       4, 4, 2, 4, 4, 4], dtype=int64)

Using this result through pandas's groupby, you can easily split the original dataset:

import pandas as pd
pd.Series(data).groupby(labels).mean()

2     553.750000
3    3132.375000
4    7482.733333
dtype: float64

7, Writing fast NumPy functions with Numba

Numba is an open source project that can use CPUs, GPUs or other hardware to create fast functions for NumPy like data. It uses the LLVM project to convert Python code to machine code.

To introduce Numba, consider a pure Python function that uses the for loop to evaluate the expression (x - y).mean():

import numpy as np
def mean_distance(x, y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
    return result / count

This function is slow:

x = np.random.randn(10000000)
y = np.random.randn(10000000)
%timeit mean_distance(x, y)

6.64 s ± 81.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (x - y).mean()

47.6 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

NumPy is 100 times faster than it. We can convert this function to a compiled Numba function using the numba.jit Function:

import numba as nb
numba_mean_distance = nb.jit(mean_distance)

It can also be written as a decorator:

@nb.jit
def mean_distance(x, y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
    return result / count

It's faster than vectorized NumPy:

 %timeit numba_mean_distance(x, y)

17.7 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Numba can't compile Python code, but it supports a part written in pure Python and can write numerical algorithms.

Numba can't compile Python code, but it supports a part written in pure Python and can write numerical algorithms.

Numba is a deep library that supports a variety of hardware, compilation modes, and user plug-ins. It can also compile part of the NumPy Python API without the for loop. Numba can also recognize structures that can be encoded as machines, but if the CPython API is called, it doesn't know how to compile. Numba's jit function has an option, nopython=True, which limits the code that can be converted to Python code, which can be compiled into LLVM, but there is no Python C API call. jit(nopython=True) has a short alias numba.njit .

In the previous example, we can also write as follows:

from numba import float64, njit
@njit(float64(float64[:], float64[:]))
def mean_distance(x, y):
    return (x - y).mean()

7.1 create custom with Numba numpy.ufunc object

numba.vectorize A compiled NumPy ufunc is created, which is similar to the built-in ufunc. Consider a numpy.add Python example:

from numba import vectorize

@vectorize
def nb_add(x, y):
    return x + y

Now there are:

x = np.arange(10)
nb_add(x, x)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int64)

8, High order array input and output

np.save and np.load Can be used to read and write arrays stored in binary format on a disk. There are also tools available for more complex scenes. memory map, in particular, enables you to handle data sets that cannot be stored in memory.

8.1 memory mapping file

Memory image file is a way to treat the very large binary data file on disk as an array in memory. NumPy implements a memmap object similar to ndarray, which allows large files to be read and written in small segments instead of reading the entire array into memory at one time. In addition, memmap also has the same method as ordinary arrays, so basically, as long as the algorithm can be used for ndarray, it can also be used for memmap.

To create a memory image, use the np.memmap And a file path, data type, shape and file mode are passed in:

mmap = np.memmap('mymmap.txt', dtype='float64', mode='w+',shape=(10000, 10000))
mmap

memmap([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

The memmap slice will return the view of the data on the disk:

section = mmap[:5]

If you assign data to these views: the data is first cached in memory (like Python's file object), you can call flush to write it to disk:

section[:] = np.random.randn(5, 10000)
mmap.flush()
mmap

memmap([[ 0.56711132, -0.12717231,  0.52758445, ..., -0.51964959,
         -0.55576608, -0.62599963],
        [-0.60118962, -0.2320138 ,  1.78400269, ...,  0.5460359 ,
         -0.81396878, -0.46026551],
        [ 0.14608177, -1.1583803 , -1.28189275, ...,  0.53420363,
          0.09238763, -1.28271782],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])
del mmap

As long as a memory image is out of scope, it will be recycled by the garbage collector, and any previous changes to it will be written to disk. When opening an existing memory image, you still need to indicate the data type and shape, because the file on the disk is only a piece of binary data, without any metadata:

mmap = np.memmap('mymmap.txt', dtype='float64', shape=(10000, 10000))
mmap

memmap([[ 0.56711132, -0.12717231,  0.52758445, ..., -0.51964959,
         -0.55576608, -0.62599963],
        [-0.60118962, -0.2320138 ,  1.78400269, ...,  0.5460359 ,
         -0.81396878, -0.46026551],
        [ 0.14608177, -1.1583803 , -1.28189275, ...,  0.53420363,
          0.09238763, -1.28271782],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

Memory images can use the structured or nested dtype s described earlier.

9, Performance skills

The performance of code using NumPy is generally very good, because array operations are generally much faster than pure Python loops. Here are a few things to note:

1. Convert Python loop and conditional logic to array operation and Boolean array operation.
2. Try to use broadcasting.
3. Avoid copying data and try to use array view (i.e. slice).
4. Using ufunc and other methods.

9.1 importance of continuous memory

In some application scenarios, the memory layout of arrays can greatly affect the calculation speed. This is because the performance difference is to some extent related to the CPU cache system. During the operation, it is generally the fastest to access continuous memory blocks (for example, summing the rows of arrays stored in C order), because the memory subsystem will cache the appropriate memory blocks into the super high-speed L1 or L2CPU Cache. In addition, NumPy's C language basic code (some) optimizes the continuous storage situation, so as to avoid some leapfrog memory access.

The memory layout of an array is continuous, that is, the elements are stored in memory in the order in which they appear in the array (that is, Fortran type (column first) or C type (row first)). By default, NumPy arrays are created in a C-Continuous fashion. Column first arrays (such as transposes of C-type continuous arrays) are also known as Fortran type continuous. You can view these information through the flags property of ndarray:

arr_c = np.ones((1000, 1000), order='C')
arr_f = np.ones((1000, 1000), order='F')
arr_c.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
arr_f.flags

  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
arr_f.flags.f_contiguous

True

In this example, sum the rows of two arrays. In theory, arr_c will be better than arr_f fast, because arr_ The rows of C are continuous in memory. We can use% timeit in IPython to confirm:

%timeit arr_f.sum(0)

918 µs ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit arr_c.sum(0)

814 µs ± 4.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you want to improve performance from NumPy, this is the place to start. If the memory order of the array does not meet your requirements, use copy and pass in 'C' or 'F' to solve the problem:

arr_f.copy('C').flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Note that when you construct a view of an array, the results are not necessarily continuous:

arr_c[:50].flags.contiguous
True
arr_c[:, :50].flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Tags: Python IPython C SQL

Posted on Tue, 16 Jun 2020 02:33:01 -0400 by badboy1245