NumPy Review
I wanted to review NumPy before continuing learning about Machine Learning. There were some code snippets in Hands On Machine Learning that I didn't understand well do to my lack of knowledge about numpy.
Numpy Review
Since NumPy is an integral part of Machine Learning / Deep Learning, I am going to review it here before retuning to do Machine Learning problems.
References
- Learn NumPy
- NumPy QuickStart Tutorial
- NumPy Tutorials
- NumPy Illustrated: The Visual Guide to NumPy
- NumPy: The Absolute Basics for Beginners
- NumPy Tutorial
- NumPy User Guide
- Python for Data Analysis
- Advanced Numpy - broadcasting rules, strides, and advanced indexing
What is NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
At the core of NumPy is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance. Differences between Python and NumPy sequences
- NumPy arrays have a fixed size at creation, unlike Python lists, which can grow dynamically
- The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory.
- NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data.
- A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays
Why is NumPy Fast?
Vectorization described the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just "behind the scenes" in optimized, pre-compiled C code. Vectorized code has many advantages, including:
- vectorized code is more concise and easier to read
- fewer lines of code generally mean fewer bugs
- the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematic constructs)
- vectorization results in more "Pythonic" code.
Broadcasting is the term used to describe the implicit element-by-element behavior of operations; generally speaking, in NumPy, all operations, not just arithmetic operations, but logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion, i.e., they broadcast.
NumPy QuickStart
The basics
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes.
- One Axis, Length 3: [1,2,1]
- Two Axes: [[1.,2.,0.],[0.,1.,1.]]
NumPy's array class is called ndarray. It is also known by the alias array. The most important attributes of an ndarray object are:
- ndarray.ndim
- the number of axes (dimensions) of the array
- ndarray.shape
- the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix of n rows and m columns, shape will be (n,m). The length of the shape tuple s equal to ndim
- ndarray.size
- The total number of elements in the array. This is equal to the product of the elements of shape
- ndarray.dtype
- an object describing the type of the elements in the array. NumPy provides types of its own (numpy.int32, numpy.float64).
- ndarray.itemsize
- The size in bytes of each element in the array
- ndarray.data
- The buffer containing the actual elements of the array
Array Creation
There are several types of ways to create arrays.
- np.array(sequence)
- array transforms sequences of sequences into two-dimensional arrays, and sequences of sequences of sequences into 3-D arrays, and so on...
- Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive concern
- np.zeros
- Creates an array of zeros
- np.ones
- Creates an array full of ones
- np.empty
- Creates an array whose initial content is random and depends on the state of memory
- By default, the dtype argument of the created array is float64, but this can be specified with the keyword argument dtype
- The arange function allows you to create sequences of numbers (start, stop, step)
- np.linspace(start, stop, # elements)
- When used with floating point arguments, it is better to use linspace that receives as an argument the number of elements that you want, instead of the step
Printing Arrays
When you print an array, NumPy uses the following layout:
- the last axis is printed from left to right
- the second-to-last is printed from top to bottom
- the rest are also printed from top to bottom, with each slice separated from the next by an empty line
- If the array is too large to be printed, NumPy automatically skips the central part of the array and only prints the corners
- To disable this behavior, you can change the printing options using set_printoptions: np.set_printoptions(threshold=sys.maxsize)
import numpy as np
def make_title(s):
return "\n" + s + "\n------------------------------------------"
print(make_title("Attributes"))
a = np.arange(15).reshape(3,5)
print("a =",a)
print("shape =",a.shape)
print("ndim =",a.ndim)
print("dtype =",a.dtype.name)
print("itemsize =",a.itemsize)
print("size =",a.size)
print("type =",type(a))
b = np.array([6,7,8])
print("b =",b)
print("type b =",type(b))
## Array Creation
print(make_title("Array Creation"))
a = np.array([2,3,4])
b = np.array([1.2,3.5,5.1])
print("dtypes (a, b) = (",a.dtype,",",b.dtype,")")
print(np.zeros((3,4)))
print(np.ones((2,3,4),dtype=np.int16))
print(np.empty((2,3)))
# Create Sequences
print(make_title("Create Sequences"))
print(np.arange(10,30,5))
print(np.arange(0,2,0.4))
from numpy import pi
print(np.linspace(0,2*pi,100))
Basic Operations
Arithmetic operations on arrays apply elementwise. A new array is created and filled with the result - even * (multiplication). The matrix product can be performed using the @ operator or the dot funhction or method. Operator such as += and *= act in place to modify an existing array rather than create a new one. When operating with arrays of different types, the type of the resulting array corresponds to the more general or precise one (a behavior known as upcasting).Many unary operations, such as computing the sum of all the elements in the array, are implemented as menthods of the ndarray class. By default, these operations apply to the array as though it were a list of numbers, regardless of its shape. However, by specifying the axis parameter, you can apply an operation along the specified axis of an array.
print(make_title("Element Wise Operators"))
a = np.array([20,30,40,50])
b = np.arange(0,4,1)
c = a - b
print("c =",c)
print("b**2 =",b**2)
print(10 * np.sin(a))
print(a < 35)
## Matrix Multiplication
print(make_title("Matrix Multiplication"))
A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[0,4]])
print(A @ B) # Matrix Product
print(A.dot(B)) # Matrix Product
## In Place Operators
print(make_title("In Place Operators"))
rg = np.random.default_rng(1) # Create instance of default random bnumber
a = np.ones((2,3),dtype=int)
b = rg.random((2,3))
print(a)
a*=3
print(a)
b += a
print(b)
## Unary Operators
print(make_title("Unary Operators"))
a = rg.random((2,3))
print(a)
print("sum =",a.sum())
print("min =",a.min())
print("max =",a.max())
b = np.arange(12).reshape((3,4))
print("Sum Columns =",b.sum(axis=0)) # Sum of Each Column
print("CumSum Rows =",b.cumsum(axis=1)) # Cumulative Sum aalong each row
Universal Functions
NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called "universal functions" (ufunc). Within NumPy, these functions operate elementwise on an array, producing an array as output
print(make_title("Universal Functions"))
B = np.arange(3)
print("B =",B)
print("e^B =",np.exp(B))
print("sqrt(B) =",np.sqrt(B))
C = np.array([2.,-1.,4.])
print(np.add(B,C))
Indexing, Slicing, and Iterating
One-dimensional arrays can be indexed, sliced, and iterated over, much like lists and other Python sequences. Multidimensional arrays can have one index per axis. These indices are given in a tuple separated by commas. When fewer indices are provided than the number of axes, the missing indices are considered complete slices :. The expression within brackets b[i] is treated as an i followed by as many instances of : as needed to represent the remaining axes. NumPy also allows you to write this using dots as b[i,...]. The dots (...) represent as many colons as needed to produce a complete indexing tuple. If x is an array with 5 axes, then:
- x[1,2,...] is equivalent to x[1,2,:,:,:]
- x[...,3] to x[:,:,:,:,3] and
- x[4, ..., 4, :] to x[4, :, :, 5, :]
Iterating over Multidimensional arrays is done with respect to its first axis. Hwoever, if you want to perform an operation on each element in the array, one can use the flat attribute which is an iterator over all the elements in the array.
print(make_title("Indexing, Slicing, and Iterating"))
a = np.arange(10)**3
print(a)
print("a[2] =",a[2])
print("a[2:5] =",a[2:5])
# equivalent to a[0:6:2] = 1000
# Set every 2nd element to 1000
a[:6:2] = 1000
print("a[:6:2]=1000 => a =",a)
print("reversed a (a[::-1]) =",a[::-1]) # Reversed A
for i in a:
print(i**(1/3.))
def f(x,y):
return 10 * x + y
print(make_title("Multidimensional Arrays"))
b = np.fromfunction(f, (5,4), dtype=int)
print("b =",b)
print("b[2,3] =",b[2,3])
print(b[0:5,1]) # Each row in the second column of b
print(b[:,1]) # Equivalent to the previous example
print(b[1:3,:]) # Every column in the second and third row of b
c = np.array([[[ 0, 1, 2], # a 3D array (two stacked 2D arrays)
[ 10, 12, 13]],
[[100, 101, 102],
[110, 112, 113]]])
c.shape
print(c[1, ...]) # same as c[1, :, :] or c[1]
print(c[..., 2]) # same as c[:, :, 2]
Shape Manipulation
Changing the Shape of An Array
An array has a shape given by the number of elements along each axis. The shape of an array can be changed with various three commands. The order of elements resulting frm ravel is normally "C-Style" - that is, the rightmost index "changes the fastest", so the element after a[0,0] is a[0,1]. f the array is reshaped to some other shape, again the array is treated as "C-style". The resize method modifies the array itself (whereas reshape returns its argument with a modified shape). If dimension is given as -1 in a reshaping operation, the other dimensions are automatically calculated.
print(make_title("Shape Manipualtion"))
a = np.floor(10 * rg.random((3,4)))
print(a)
print(a.shape)
print("The following commands all return a modified array, but do not change the original array:")
print("ravel() =",a.ravel()) # returns the array, flattened
print("Transpose =",a.T) # returns the array, transposed
print("reshape() =",a.reshape((4,3))) # returns the array with a modified shape
print(a)
a.resize((2,6))
print(a)
Stacking Together Different Arrays
Several arrays can be stacked together along different axes. The function column_stack stacks 1D arrats as columns into a 2D array. It is equivalent to hstack only for 2D arrays. In general, for arrays with more than two dimensions, hstack stacks along their second axis, vstack stacks along their first axes, and concatenate allows for an optional argument giving the number of the axis along which the concatenation should happen. In complex cases, r_ and c_ are useful for creating arrays by stacking numbers along one axis. They allow the use of range literals :. When used with arrays as arguments, r_ and c_ are similar to vstack and hstack in their default behavior, but allow for an optional argument giving the axis along which to concatenate.
a = np.floor(10 * rg.random((2,2)))
print(a)
b = np.floor(10 * rg.random((2,2)))
print(b)
print(np.vstack((a,b)))
print(np.hstack((a,b)))
from numpy import newaxis
print(np.column_stack((a,b))) # with 2D arrays
a = np.array([4.,2.])
b = np.array([3.,8.])
print(np.column_stack((a,b))) # returns a 2D array
print(np.hstack((a,b))) # the result is different
print(a[:,newaxis]) # view 'a' as a 2D column vector
print(np.r_[1:4,0,4])
Splitting One Array into Several Smaller Ones
Using hsplit, you can split an array along its horizontal axis, either by specifying the number of equally shaped arrays to return, or by specifying the columns after which the division should occur. vsplit splits along the vertical axis, and array_split allows one to specify along which axis to split
a = np.floor(10 * rg.random((2,12)))
print(a)
# Split `a` into 3
print(np.hsplit(a,3))
# Split `a` after the 3rd and 4th column
print(np.hsplit(a,(3,4)))
Copies and Views
When operating and manipulating arrays, their data is sometimes copied into a new array and sometimes not.
No copy at all
Simple assignments make no copy of objects or their data. Python passes mutable objects as references, so function calls make no copy.
print(make_title("No Copy at All"))
a = np.array([np.arange(4),np.arange(4,8),np.arange(8,12)])
b = a # No new obvject is created
print(b is a) # a and b are two names for the same ndarray object
def f(x):
print(id(x))
print(id(a)) # id is a unique identifier of an object
print(f(a))
View or shallow copy
Different array objects share the same data. The view method creates a new array object that looks at the same data. Slicing an array returns a view of it.
c = a.view()
print(c is a)
print(c.base is a) # c is a view of the data owned by a
print(c.flags.owndata)
c = c.reshape((2,6)) # a's shape doesn't change
print(a.shape)
c[0,4] = 1234 # a's data changes
print(a)
s = a[:,1:3]
s[:] = 10 # s[:] is a view of s. Note the difference between s = 10 and s[:] = 10
print(a)
Deep Copy
The copy method makes a complete copy of the array and its data. Sometimes copy should be called after slicing if the original array is not required anymore. For example, suppose a us a huge intermediate result and the final result b contains a small fraction of a, a deep copy should be made when constructing b with slicing. If b = a[:100] is used instead, a is referenced by b and will persist in memory even if del a is executed. `
a = np.arange(int(1e8))
b = a[:100].copy()
del a # The memory of ``a`` can be released
Less Basic
Broadcasting Rules
Broadcasting allows universal functions to deal in a meaningful way with inputs that do not have exactly the same shape.
- The first rule of broadcasting is that if all input arrays do not have the same number of dimensions, a "1" will be reatedly prepended to the shapes of all the smaller arrays until all the arrays have the same number of dimensions.
- The second rule of broadcasting ensures that arrays with a size of 1 along a particula dimension act as if theyt had the size of the array with the largest shape along that dimension. The value of the array element is assumed to be the same along that dimension of the "broadcast" array.
Advanced Indexing and Index Tricks
NumPy offers more indexing facilities than regular Python sequences. In addition to indexing by integers and slices, as we saw before, arrays can be indexed by arrays of integers and arrays of booleans.
Indexing with arrays of indices
When the indexed array a is multidimensional, a single array of indices refers to the first dimension of a. The example below # EXAMPLE shows this behavior by converting an image of labels into a color image using a pallette. We can also give indexes for more than one dimension. The arrays of indices for each dimension must have the same shape. In Python, arr[i,j] is the same as arr[(i,j)] - so we can put i and j in a tuple and then do the indexing with that. Another common use of indexing with arrays is th search of the maximum value of time-dependent series:
a = np.arange(12)**2 # the first 12 square numbers
i = np.array([1,1,3,8,5]) # an array of indices
print(a[i]) # the elements of `a` at the positions `i`
j = np.array([[3,4],[9,7]]) # a bidimensional array of indices
print(a[j]) # the same shape as `j`
# Example
palette = np.array([[0, 0, 0], # black
[255, 0, 0], # red
[0, 255, 0], # green
[0, 0, 255], # blue
[255, 255, 255]]) # white
image = np.array([[0, 1, 2, 0], # each value corresponds to a color in the palette
[0, 3, 4, 0]])
print(palette[image] ) # the (2, 4, 3) color image
a = np.arange(12).reshape(3,4)
print(a)
i = np.array([[0,1],[1,2]]) # indices for the first dim of `a`
j = np.array([[2,1],[3,3]]) # indices for the second dim
try:
print(a[i,j]) # i and just must have equal shape
except IndexError:
print("index 3 is out of bounds for axis 0 with size 3")
l = (i,j)
print(a[l])
time = np.linspace(20,145,5) # time scale
data = np.sin(np.arange(20)).reshape(5,4) # 4 time-dependent series
print(time)
print(data)
# index of the maxima for each series
ind = data.argmax(axis=0)
print(ind)
# times corresponding to the maxima
time_max = time[ind]
print(time_max)
data_max = data[ind,range(data.shape[1])] # => data[ind[0],0], data[ind[1], 1]...
print(data_max)
print(np.all(data_max == data.max(axis=0)))
# Use Indexing with arrays as a target to assign to
a = np.arange(5)
print(a)
a[[1,3,4]] = 0
print(a)
Indexing with Boolean Arrays
When we index with arrays of (integer) indices, we are proving the list of indices to pick. With boolean indices, the approach is different; we explicitly choose which items in the array we want and which ones we don't. The most natural way we can think of for boolean indexing is to use boolean arrays that have the same shape as the original array. The property can be very useful in assignments. The second way of indexing with booleans is more similar to integer indexing; for each dimension of the array we give a 1D boolean array selecting the slices we want. Npte that the length of the 1D boolean array must coincide with the length of the dimension (or axis) you want to slice.
a = np.arange(12).reshape((3,4))
b = a > 4
print(b) # `b` us a boolean with `a`'s shape
print(a[b]) # 1d array with the slected elements
a[b] = 0
print(a)
# Generating Mandelbrot Set with Boolean Indexing
import matplotlib.pyplot as plt
def mandlebrot(h,w,maxit=20,r=2):
"""
Returns an index of the Mandlebrot fractal of size (h,w)
"""
x = np.linspace(-2.5,1.5,4*h+1)
y = np.linspace(-1.5,1.5,3*w+1)
A, B = np.meshgrid(x,y)
C = A + B*1j
z = np.zeros_like(C)
divtime = maxit + np.zeros(z.shape, dtype=int)
for i in range(maxit):
z = z**2 + C
diverge = abs(z) > r # who is diverging
div_now = diverge & (divtime==maxit) # who is diverging now
divtime[div_now] = i # note when
z[diverge] = r # avoid diverging too much
return divtime
plt.clf()
plt.imshow(mandlebrot(400,400))
a = np.arange(12).reshape(3,4)
b1 = np.array([False, True, True]) # the first dim selection
b2 = np.array([True, False, True, False]) # the second dim selection
print(a[b1,:]) # Selecting rows
print(a[b1]) # Selecting rows, same as before
print(a[:,b2]) # Selecting columns
print(a[b1,b2]) # A weird thing to do
The ix_() function
The ix_ function can be used to combine different vectors so as to obtain the result for each n-uplet.
a = np.arange(2,6)
b = np.array([8,5,4])
c = np.array([5,4,6,8,3])
ax, bx, cx = np.ix_(a,b,c)
print(ax)
print(bx)
print(cx)
print((ax.shape, bx.shape, cx.shape))
result = ax + bx * cx
print(result)
print(result[3,2,4])
print(a[3] + b[2] * c[4])
Tricks and Tips
"Automatic" Reshaping
To change the dimension of an array, you can omit one of the sizes with -1 which will then be deduced automatically:
a = np.arange(30)
b = a.reshape((2,-1,3)) # -1 means "whatever is needed"
print(b.shape)
print(b)
Histograms
The NumPy histogram function applied to an array returns a pair of vectors: the histogram of the array and a vector of the bin edges. This is different than the matplotlib function.
# Build a vector of 10000 normal deviates with variance 0.5^2 and mean 2
mu, sigma = 2, 0.5
v = rg.normal(mu, sigma, 10000)
# Plot a normalized histogram with 50 bins
plt.hist(v, bins=50, density=True,label="Matplotlib Histogram") # matplotlib version (plot)
# Compute the histogram with numpy and then plot it
(n, bins) = np.histogram(v, bins=50, density=True) # NumPy version (no plot)
plt.plot(.5 * (bins[1:] + bins[:-1]), n,color="orange",label="Numpy Histogram Plot")
plt.legend()
plt.title("Numpy and Matplotlib Historgrams")
plt.show()
NumPy: the absolute basics for Beginners
NumPy (Numerical Python) is an open source Python library that's widely used in science and engineering. The NumPy library contains multidimensional array data structures, such as the homogeneous, N-dimensional ndarray, and a large library of functions that operate efficently on these data structures.
Import Numpy
import numpy as np
Wht use NumPy?
NumPy arrays can improve speed, reduce memory consumption, and offer a high-level syntax for performing a variety of common processing tasks. NumPy shines when ther are large quantities of "homogeneous" (same-type) data to be processed on the CPU.
What is an "array"?
In computer programming, an array is a structure for storing and retrieving data. In NumPy, this idea is generalized to an arbitrary number of dimensions, and so the fundamental array class is called ndarray: it represents an "N-dimensional array". Most NumPy arrays have some restrictions:
- all elements of the array must be of the same type
- once created, teh total size of the array can't change
- the shape must be "rectangular", not "jagged" - e.g., each row of a two-dimensional array must have the same number of columns
Array Fundamentals
One way to initialize an array is using a Python sequence, such as a list. Python slice notation can be used for indexing and the array is mutable. One major difference with Python lists is that slice indexing of a list copies elements into a new list, but slicing an array returns a view: an object that refers to the data in the original array. The original array can be mutated using the view. Two and higher dimensional arrays can be initialized from nested Python sequences. In NumPy, a dimension of an array is sometimes referred to as an "axis". Another difference between an array and a ist of lists is that an element of the array can be accessed by specifying the index along each axis within a single set of square brackets. Note: Scalar = 0-D array, Vector = 1-D array, Matrix = 2-D array, Tensor = N-D array where N ≥ 2
Array Attributes
- ndim - the number of dimensions of an array is contained in the ndim attribute
- shape - the shape of an array is a tuple of non-negative integers that specify the number of elements along each dimension
- size - the total number of elements in array is contained in the size attributes
- dtype - the data type of each element in an array
How to Create Basic Array
- np.zeros() - array of zeros
- np.ones() - array of ones
- np.empty() - random array
- np.arange(first number, last number, step size)
- np.linspace()
- The dtype attribute can be used to explicitly specify the type to use
Adding Removing and Sorting Elements
- np.sort() - how to sort array. You can specify the axis, kind, and order
- argsort an indirect sort along a specified axis
- lexsort an indirect stable sort on multiple keys
- searchsorted will find elements in a sorted array
- partition a partial sort
- np.concatenate() - concatenate ndarrays
Can you Reshape an Array?
Yes, with arr.reshape(). It will give you a new shaped array without changing teh data. The reshaped array needs to have the same number of elements as the original array. This method takes a few optional parameters:
- order
- C reshape in C-like index order
- F reshape in Fortran-like index order
How to Add a New Axis to an Array
You can use np.newaxis and np.expand_dims to increase the dimensions of your existing array. Using np.nexaxis will increase the dimension of your array by one dimension when used once. You con use np.newaxis to convert a column vector to a row vector and vice-versa. You can use np.expanddims to add an axis at index position 1 with np.expand_dims(a,axis=1).
Indexing and Slice
You can index and slice NumPy arrays in the same ways you can Python lists. If you want to select values from your array that fulfill certain conditions, it's straightforward with NumPy. You can use np.nonzero() to print the indices of elements that meet certain conditions.
print(make_title("Array Fundamentals"))
a = np.array([1, 2, 3, 4, 5, 6])
print(a[0] )# accessing element of array
a[0] = 10 # the array is mutable
print(a[:3]) # Python slice notation can be used for indexing
b = a[3:] # Creates view
print(b)
b[0] = 40 # affects a
print(a)
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(a[1,3]) # row 2, column 4
print(make_title("Array Attributes"))
print(a.ndim)
print(a.shape)
print(a.size)
print(a.dtype)
print(make_title("How To Make Array"))
print(np.zeros(2))
print(np.ones(2))
print(np.empty(2))
print(np.arange(4))
print(np.arange(2,9,2,dtype=np.float32))
print(make_title("Adding, Removing and Sorting Elements"))
c = np.hstack((np.arange(5,10,dtype=np.float32),np.arange(1,5,dtype=np.float32)))
print(c)
print(np.sort(c))
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(np.concatenate((a,b)))
x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6]])
print(np.concatenate((x,y),axis=0))
print(make_title("How to Add New Axis to an Array"))
a = np.arange(1,7)
print(a.shape)
a2 = a[np.newaxis, :]
print(a2.shape)
row_vector = a[np.newaxis, :]
col_vector = a[:, np.newaxis]
print(make_title("Indexing and Slicing"))
a = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(a[a < 5])
five_up = (a >= 5)
print(a[five_up])
divisible_by_2 = a[a%2 == 0]
print(divisible_by_2)
c = a[(a > 2) & (a < 11)] # Select elements that satisy two conditions using the `&` and `|` conditions
print(c)
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
b = np.nonzero(a < 5)
print(b) # The first array represnts the row indices where these values were found, and the second array represents the column indices where the values are found
Create an Array from Existing Data
You can use slicing, vstack, hstack, and hsplit to create arrays from existing data. You can use the view method to create a new array object that looks at the same data as the original array (a shallow copy). NumPy functions, as well as operations like indexing and slicing, will return views whenever possible. This saves memory and is faster (no copy of the data has to be made). However, modifying data in a view also modifies the original array.
Basic Array Operations
Basic operations are simple with numpy. If you want to find the sum of all elements in an array, use the sum() method. To add the rows or columns in a 2D array, you can specify the axis (axis is 0-indexed).
Broadcasting
There are times when you might want to carry out an operation between an array and a single number (also called an operation between a vector and a scalar) or between arrays of two different sizes. NumPy understands that the multiplication should happen with each cell. That concept is called broadcasting. Broadcasting is a mechanism that allows NumPy to perform operations on arrays of different shapes.
More Useful Array Operations
NumPy also performs aggregation functions. In addition to min, max, and sum, you can easily run mean to get the average, prod to get the result of multiplying elements together, std to get the standard deviation, and more. You can specify the axis for these aggregation functions.
Generating Random Numbers
The use of random number generation is an important part of the configuration and evaluation of many numerical and machine learning algorithms. Whether you need to randomly initialize weights in an artificial neural network, split data into random sets, or randomly shuffle your dataset, being able to generate random numbers (actually, repeatable pseudo-random numbers) is essential. With Generator.integers, you can generate random integers from low (remember that this is inclusive with NumPy) to high (exclusive). You can set endpoint=True to make the high number inclusive.
How to Get Unique Items and Counts
You can find unique elements in an array easily with np.unique. If you pass in n-dim arrays to this function, the array will be flattened, but you can specify an axis for this argument.
Transposing and Reshaping a Matrix
The T property (or transpose() method) allows you to transpose a NumPy array. Use reshape() to reshape a matrix.
Reverse an Array
NumPy's np.flip() function allows you to flip, or reverse, the contents of an array along an axis. If you don;t specify an axis, NumPy will reverse the contents along all the axes of your input array.
Reshaping and Flattening Multidimensional Arrays
There are two possible ways to flatten an array: .flatten() and .ravel(). The primary difference between the two is that the new array created using .ravel() is actually a reference to the parent array (a "view"). This means that any changes to the new array will affect the parent array as well.
How To Access the Doctring for More Information
Every object contains the reference to a string, which is known as the docstring. In most cases, this docstring contains a quick and concise summary of the object of how to use it. Python has a built-in help() function that can help you access this information.
How to Save and Load NumPy Objects
You will, at some point, want to save your arrays to disk and load them back without having to re-run the code. Fortunately, there are several ways to save and load objects with NumPy. The ndarray objects can be saved to and loaded from the disk files with loadtxt and savetxt functions that handle normal text files, load and save functions that handle NumPy binary files with a .npy file extension, and a savez function that handles NumPy files with a .npz file extension. The .npy and .npz files store data, shape, dtype, and other information required to reconstruct the ndarray in a way that allows the array to be correctly retrieved, even when the file is on another machine with different architecture.
print(make_title("Create an Array From Existing Data"))
a = np.arange(1,11)
arr1 = a[3:8] # Create a new array from a section of your array
a1 = np.array([[1, 1],
[2, 2]])
a2 = np.array([[3, 3],
[4, 4]])
print(np.vstack((a1,a2)))
print(np.hstack((a1,a2)))
x = np.arange(1, 25).reshape(2, 12)
print(np.hsplit(x,3))
print(make_title("Basic Array Operations"))
data = np.array([1,2])
ones = np.ones(2,dtype=int)
print(data+ones,data-ones,data*data,data/data,data.sum())
print(make_title("Generating Random Numbers"))
rng = np.random.default_rng() # the simplest way to generate random numbers
print(rng.random(3))
print(rng.integers(5, size=(2, 4)) )
c = np.concatenate((np.arange(0,10),np.arange(0,10)),axis=0)
print(c)
print(np.unique(c))
unqiue_values, unqiue_indices, unqiue_counts = np.unique(c,return_index=True,return_counts=True)
print((unqiue_values, unqiue_indices, unqiue_counts))
Array Creation
There are 6 general mechanisms for creating arrays:
- Conversion from other Python structures (tuples or lists)
- Instrinsic NumPy array creation functions
- Replication, joining, or mutating existing arrays
- Reading arrays from disk, either from standard or custom formats
- Creating arrays from raw bytes through the use of strings or buffers
- Use of special library functions
When you perform operations with two arrays with different dtype, NumPy will assign a new type that satisfies all of the array elements involved in the computation, here uint32 and int32 can both be represented as int64. The best practice for numpy.arange is to use integer start, end, and step values. The 2D array creation functions e.g. np.eye, np.diag, and np.vander define properties of special matrices represented as 2D arrays. np.eye( n, m ) defines a 2D identity matrix. np.diag can define either a square 2D array with given values along the diagonal or if given a 2D array returns a 1D array that is only the diagonal elements. The two array creation functions can be helpful when doing linear algebra. vander(x,n) defines a Vandermonde matrix as a 2D NumPy array. Each column of the Vandermonde matrix is a decreasing power of the input 1D array or list or tuple, x where the highest polynomial order is n-1. The ndarray creation functions np.ones, np.zeros, and random define arrays based upon the desired shape. np.zeros will create an array filled with 0 values with the specified shape. np.ones will create an array filled with one values. The random method of the result of default_rng will create an array filled with random values between 0 and 1. np.indices will cerate a set of arrays, one per dimension with each representing variation in that dimension. Once you have created arrays, you can replicate, join, or mutate those existing arrays to create new arrays. Ehn you assign an array or its elements to a new variable, you have to explicitly numpy.copy the array, otherwise, the variable is a view into the original array. There are a number of routines to join existing arrays - np.vstack, np.hstack and np.block. Reasing arrays from disk is the most common case for large array creation. The details depend on the format of the data on disk.
print(np.diag([1,2,3]))
print(np.diag([1,2,3],1))
print(np.vander(np.arange(1,5),2))
print(np.zeros((2,3,2)))
print(np.ones((2,3)))
from numpy.random import default_rng
print(default_rng(42).random((2,3))) # Here, the seed is set to 42 so you can reproduce these pseudorandom numbers
Indexing on ndarrays
ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj the selection. there are different kinds of indexing available depending on the obj: basic indexing, advanced indexing, and field access. Note that in Python, x[exp1, exp2, ..., expN] us equivalent to x[(exp1, exp2, ..., expN)]; the latter is just syntactic sugar for the former. Each index specified selects the array corresponding to the rest of the dimensions selected. The returned array is a view.NumPy uses C-order indexing. That means that the last index usually represents the most rapidly changing memory location.
Slicing and Striding
Basic slicing extends Python's basic concept of slicing to N dimensions. Basic slicing occurs when the obj is a slice object (constructed by start:stop:step notation outside of brackets), an integer, or a tuple of slice objects and integers. Ellipses and newaxis objects can be interspersed with these as well. Ellipses expands to the number of : bjects needed for the selection tuple to index all dimensions. In most cases, this means that the length of the expanded selection tuple is x.ndim. Each newaxis object in the selection tuple serves to expand the dimensions of the resulting selection by one unit-length dimension. The added dimension is the position of the newaxis object in the selection tuple.
Advanced Indexing
Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean. Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
If the index values are out of bounds then an IndexError is thrown. If the index arrays do not have the same shape, there is an attempt to broadcast them to the same shape. The broadcasting mechanism permits index arrays to be combined with scalars for other indexes.
Boolean Array Indexing
This advanced indexing occurs when obj is an array object of Boolean type, such as may be returned from comparison operators. A single Boolean index array is practically identical to x[obj.nonzero()] where, as described above, obj.nonzero() returns a tuple (of length obj.ndim) of integer index arrays showing the True elements of obj. A common use case for this is filtering for desired element values.
x = np.arange(10)
print(x[2])
print(x[-2])
print(x[1:7:2])
print(x[-2:10])
print(x[-3:3:-1])
print(x[5:])
x = np.array([[[1],[2],[3]],[[4],[5],[6]]])
print(x.shape)
print(x[1:2])
print(x[...,0])
## Advanced Indexing
x = np.arange(10,1,-1)
x[np.array([3,3,1,8])]
y = np.arange(35).reshape(5,7)
print(y[np.array([0, 2, 4]), np.array([0, 1, 2])])
print(y[np.array([0, 2, 4]), 1]) # Multindex broadcasting
## Boolean Indexing
x = np.array([[1., 2.], [np.nan, 3.], [np.nan, np.nan]])
print(x[~np.isnan(x)])
x = np.arange(35).reshape(5, 7)
b = x > 20
boolean_fil = b[:,-1]
print(x[boolean_fil]) # Boolean Filter Broadcasting
I/O with NumPy
NumPy provides several functions to create arrays from tabular data. We focus here on the genfromtxt function. In a nutshell, getfromtxt runs two main loops. the first loop converts each line of the file in a sequence of strings. The second loop converts each string to the appropriate data type. I have never seen data loaded using NumPy, so I am skipping this section for now.
Data Types
NumPy supports a much greater variety of numerical types than Python does. NumPuy numerical types are instances of numpy.dtypes objects, each having unique characteristics. To convert the type of an array, use the .astype() method. In addition to numerical types, NumPy also supports storing unicode strings, via the np.str_ dtype (U character code), null-terminated byte sequences via np.bytes_ (S character code), and arbitrary byte sequences, vis np.void (V character code). All of the above are fixed-width data types. They are parameterized by a width, in either bytes or unicode points, that a single data element in the array must fit inside. This means that storing an array of byte sequences or strings using this dtype requires knowing or calculating the sizes of the longest text or byte sequence in advance.
z = np.arange(10,dtype=np.float32)
y = z.copy().astype(np.int16)
print(z,z.dtype)
print(y,y.dtype)
print(type(np.float32))
y = np.array(["hello", "world!"])
print(y.dtype)
Broadcasting
The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations os that looping occurs in C instead od Python. It does this without making needless copies of data that usually leads to efficient algorithm implementations.
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (rightmost) dimension and works its way less. Two dimensions are compatible when they are equal or one of them is 1. If these conditions are not met, then a ValueError is thrown. Input arrays do not need to have the same number of dimensions. The resulting array will have the same number of dimensions as the input array with the greatest number of dimensions, where the size of each dimension is the largest size of the corresponding dimension among the input arrays.
A Practical Example: Vector Quantization
Broadcasting comes up quite often in real world problems. A typical example occurs in the vector quantization (VQ) algorithm used in information theory, classification, and other related areas. The basic operation in VQ finds the closest point of a set of points, called codes in VQ jargon, to a given point, called the observation.
from numpy import array, argmin, sqrt, sum
observation = array([111.0, 188.0])
codes = array([[102.0, 203.0],
[132.0, 193.0],
[45.0, 155.0],
[57.0, 173.0]])
diff = codes - observation # the broadcast happens here
dist = sqrt(sum(diff**2,axis=-1))
argmin(dist)
Copies and Views
When operating on NumPy arrays, it is possible to access the internal data buffer directly using a view without copying data around. This ensures good performance but can also cause unwanted problems if the user is not aware of how this works. The NumPy array is a data structure consisting of two parts: the contiguous data buffer with the actual data elements and the metadata that contains information about the data buffer, The metadata includes the data type, strides, and other important information that helps manipulate the ndarray. It is possible to access the array differently by just changing certain metadata like stride and dtype without changing the data buffer. This creates a new way of looking at the data and these new arrays are called views. When a new array is created by duplicating the data buffer as well as the metadata, it is called a copy. The base attribute of the ndarray makes it easy to tell if an array is a view or a copy.
import numpy as np
x = np.arange(9)
x
y = x.reshape(3, 3)
y
y.base # .reshape() creates a view
z = y[[2, 1]]
z
z.base is None # advanced indexing creates a copy
Linear Algebra in n-dimensional Arrays
The shape of this image is (768, 1024, 3) - since this is a color image and we have used imread function to read it, the data is organized in three 2D arrays, representing color channels (in this case, red green and blue).It is possible to use methods from linear algebra to proximate an existing set of data. Here, we wil use the SVD (Singular Value Decomposition) to try to rebuild an image that uses less singular value information than the original one, while still retaining some of its features. In order to extract information from a given matrix, we can use the SVD to obtain 3 arrays which can be multiplied to obtain the original matrix. From the theory of linear algrebra, given a matrix A , the following product can be computed:
where U and VT are square and Σ is the same size as A . Σ is a diagonal matrix that contains the singular values of A , organized form largest to smallest. These values are always non-negative and can be used as an indicator of the "importance" of some features represented by matrix A . According to colorimetry, it is possible to obtain a reasonable grayscale version of the color image by applying the formula
The linalg module includes a norm function, which computes the norm of a vector or matrix represented in a NumPy array. See the line graph below. Although there are 768 singular values in s, most of those (after the 150th entry or so) are pretty small. So it makes sense to use only the information related to the first (say, 50) singular values to build a more economical approximation to out image. The idea is to consider all but the first k singular values in Sigma, keeping U and Vt intact, and computing the product of these matrices as the approximation.
NumPy's broadcasting allows us to do what we did for the greyscale image to the image that has RGB values. The linear algebra functions in NumPy expect to see an array of (n, M, N) where the first axis n represents the number of MxN matrices in the stack.
import numpy as np
from scipy.datasets import face
img = face()
print(type(img))
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(img)
plt.show()
print(img.shape,img.ndim)
red_data = img[:,:,0] # Red pixel data
img_array = img / 255 # Since we are going to perform linear algebra operations on the data, it might be more interesting to have real numbers between 0 and 1 in each entry of the matrices
print(img_array.max(),img_array.min(),img_array.dtype)
red_array, green_array, blue_array = img_array[:,:,0], img_array[:,:,1], img_array[:,:,2]
from numpy import linalg
img_gray = img_array @ [0.2126, 0.7152, 0.0722]
print(img_gray.shape)
plt.imshow(img_gray,cmap="gray")
plt.show()
U, s, Vt = linalg.svd(img_gray)
print(U.shape,s.shape,Vt.shape)
print("Notice that s only has one dimension here - this is for economic purposes. You can rebuild the diagonal matrix using the function below")
def get_sigma_matrix(U,s,Vt):
Sigma = np.zeros((U.shape[1], Vt.shape[0]))
np.fill_diagonal(Sigma, s)
return Sigma
Sigma = get_sigma_matrix(U,s,Vt)
print("Should be small:",linalg.norm(img_gray - U @ Sigma @ Vt))
print("Is the reconstructed product close to the original matrix?: ",np.allclose(img_gray,U @ Sigma @ Vt))
plt.plot(s)
plt.show()
def get_approximation(U,Sigma,Vt,k):
return U @ Sigma[:, :k] @ Vt[:k, :]
plt.imshow(get_approximation(U,Sigma,Vt,10),cmap="gray")
plt.show()
img_array_transposed = np.transpose(img_array,axes=(2,0,1)) # Transpose array
U, s, Vt = linalg.svd(img_array_transposed)
print(U.shape,s.shape,Vt.shape)
Sigma = np.zeros((3, 768, 1024))
for j in range(3):
np.fill_diagonal(Sigma[j, :, :], s[j, :])
reconstructed = U @ Sigma @ Vt
reconstructed = np.clip(reconstructed, 0, 1)
plt.imshow(np.transpose(reconstructed, (1, 2, 0)))
plt.show()
k=50
approx_img = U @ Sigma[..., :k] @ Vt[..., :k, :]
plt.imshow(np.transpose(approx_img, (1, 2, 0)))
plt.show()
Saving and Sharing your NumPy Arrays
You can use savez to save zipped arrays to a file. You can use numPy's load to load them back into memory. You can use savetxt to save a csv file, and then you can use load_xy to reload them.
x = np.arange(10)
y = x ** 2
print(x)
print(y)
from pathlib import Path
import os
np.savez(os.path.join(Path.cwd(),"x_y-sqquared.npz"),x_axis=x,y_axis=y)
del x, y
load_xy = np.load(os.path.join(Path.cwd(),"x_y-sqquared.npz"))
x = load_xy["x_axis"]
y = load_xy["y_axis"]
print(x)
print(y)
array_out = np.block([x[:, np.newaxis], y[:, np.newaxis]])
np.savetxt(os.path.join(Path.cwd(),"x_y-squared.csv"), X=array_out, header="x, y", delimiter=",")
del x, y
load_xy = np.loadtxt("x_y-squared.csv", delimiter=",")
x = load_xy[:, 0]
y = load_xy[:, 1]
print(x)
print(y)
Masked Arrays
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not. When an element of the mask is False, the corresponding element of the associated array is valid and is said to be unmasked. When an element of the mask is True, the corresponding element of the associated array is said to be masked (invalid).
You can think of a MaskedArray as a combination of:
- Data, as a regular numpy.ndarray of any shape or datatype
- A boolean value with the same shape as the data
- A fill_value, a value that may be used to replace the invalid entries in order to return a standard numpy.ndarray
When can They Be Useful?
- When you want to preserve the values you makes for later processing, without copying the array
- When you have to handle many arrays, each with their own mask. If the mask is part if the array, you avoid bugs and the code is possibly more compact
- When you have different flags for missing or invalid values, and wish to preserve these flags without replacing them in the original dataset, but exclude them from computations
- If you can't avoid or eliminate missing values, but don't want to deal with NaN values in your operations