Numpy, Pandas, and Matplotlib Notes
Read a textbook about numpy, pandas, and matplotlib to get a better understanding of the libraries. Here are my notes from that.
Python for Data Analysis, 3E, by Wes McKinney
Introduction
- What is meant by "data"?
- structured data - which encompasses tabular or spreadsheet-like data in which each column may be a different type (numeric, string, date, ...)
- Multidimensional arrays
- Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user)
- Evenly or unevenly spaced time series
- Why Python?
- It has been found to be a suitable language not only for research and prototyping but also for building production systems
- Essential Python Libraries:
- Numpy
- Short for Numerical Python, it provides the data structures, algorithms, and library glue needed for most scientific applications involving numeric data in Python.
- contains:
- fast and efficient multidimensional array: ndarray
- fast array processing
- NumPy arrays are more efficient for storing and manipulating data than any other built-in Python data structure
- Pandas
- Provides high-level data structures and functions that make working with structured or tabular data intuitive and flexible
- The primary pandas objects are the DataFrame (a tabular, column-oriented data structure with both row and column labels) and the Series (a one-dimensional labeled array object)
- Blends NumPy array-computing ideas with the data manipulation capabilities found in relational databases
- The pandas name is derived from panel data and Python data analysis
- Matplotlib
- Most popular Python library for producing plots and other two-dimensional data visualizations
- Scipy
- Scipy is a collection of packages addressing a number of foundational problems in scientific computing
- Modules:
- scipy.integrate - Numerical integration routines and differential equation solvers
- scipy.linalg - Linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg
- scipy.optimize - Function optimizers and root-finding algorithms
- scipy.signal - Signal processing tools
- scipy.sparse - Sparse matrices and sparse linear algebra solvers
- scipy.special - Wrapper around SPECFUN, a FORTRAN library implementing many mathematical functions, such as the gamma function
- scipy.stats - Standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics
- Scikit Learn
- Has become the premier general purpose machine learning toolkit for Python programmers
- Includes submodules for such models as:
- Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
- Regression: Lasso, ridge regression, etc.
- Clustering: k-means, spectral clustering, etc.
- Dimensionality Reduction: PCA, feature selection, matrix factorization, etc.
- Model Selection: Grid Search, cross-validation, metrics
- Preprocessing: feature extraction and normalization
- statsmodels
- Statsmodels contains algorithms for classical (primarily frequentist) statistics and econometrics. This includes submodules such as:
- Regression models: linear regression, generalized linear models, robust linear models, linear mixed effects models
- Analysis of variance
- Time Series analysis
- Nonparametric Methods
- Visualization of Statistical Results
NumPy Basics: Arrays and Vectorized Computation
- Numpy = Numerical Python
- Numpy Features:
- ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
- Mathematical functions for fast operations on entire arrays of data without having to write loops
- Tools for reading/writing data to disk and working with memory-mapped files
- Linear algebra, random number generation, and Fourier transform capabilities
- NumPy is fast because its arrays hold homogeneously typed data in contiguous memory, so operations can run in compiled loops instead of interpreted Python
- With NumPy you can perform complex computations on entire arrays without the need for Python for loops
- A C API for connecting NumPy with libraries written in C, C++, and FORTRAN
- NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory
ndarray (N-Dimensional Array)
Allows you to perform mathematical operations on whole blocks of data using similar syntax to mathematical operations between built-in numbers
Becoming proficient in array-oriented programming and thinking is a key step along the way to becoming a scientific Python guru
The ndarray provides a way to interpret a block of homogeneously typed data (either contiguous or strided) as a multidimensional array object. The data type, dtype, determines how the data is interpreted.
An ndarray contains:
- A pointer to data
- The data type or dtype
- A tuple containing the array's shape
- A tuple of strides - integers indicating the number of bytes to "step" in order to advance one element along a dimension
Creating Arrays
- Easiest way to create an array is to use the np.array() function
- numpy.array tries to infer a good data type for the array that it creates. The type is stored in a special dtype metadata object
- You can also use np.zeros(), np.ones(), and np.empty() to create an array of a given shape
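A minimal sketch of the array-creation functions mentioned above (the values are made up for illustration):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])  # dtype is inferred (integer here)
zeros = np.zeros((2, 3))                # 2x3 array of 0.0 (float64 by default)
ones = np.ones(4)                       # 1-D array of four 1.0 values
empty = np.empty((2, 2))                # uninitialized memory; contents are arbitrary
```

Note that np.empty does not zero its memory, so its contents should not be relied on before being assigned.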
Data Types
- The data type or dtype is a special object containing the information (or metadata: data about the data) the ndarray needs to interpret a chunk of memory as a particular type of data
- Data types are a source of NumPy's flexibility for interacting with data coming from other systems. Numerical data types are named with a type name (like int or float) followed by a number representing the number of bits per element
Arithmetic Operations
- Arrays enable you to express batch operations on data without writing any loops. NumPy users call this vectorization. Any arithmetic operation between two equal-size arrays applies the operation element-wise.
- Arithmetic operations with scalars propagate the scalar argument to each element in the array
- Comparisons between two arrays of the same size yield Boolean arrays
- Evaluating operations between differently sized arrays is called broadcasting - SEE APPENDIX A
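The points above can be sketched with a short example (values are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

elementwise = a * b   # element-wise multiplication, no Python loop needed
scaled = a * 10       # the scalar propagates to every element
mask = a > b          # comparisons between equal-size arrays yield Boolean arrays
```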
Basic Indexing and Slicing
- 1-D numpy array indexing and slicing act like built in lists
- NOTE: numpy array slices are references to the original array - any modification to the sliced view will be reflected in the source array
- You can index N-D arrays with tuples
Boolean Indexing
- You can produce a boolean array by comparing an array with some value, then you can use the returned boolean array to index a similar array
- ~: negate array, reverse the boolean values in a boolean array
- Boolean arrays can be created by combining multiple boolean conditions
- Ex: mask = (names == "Bob") | (names == "Will")
- Selecting data from an array by Boolean indexing and assigning the result to a new variable always creates a copy of the data
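A minimal sketch of Boolean indexing, combining conditions and negating with ~ (the names and data here are hypothetical):

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob"])
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

mask = (names == "Bob") | (names == "Will")  # combine conditions with | and &
selected = data[mask]                        # rows where mask is True; always a copy
negated = data[~(names == "Bob")]            # ~ negates the Boolean array
```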
- Fancy Indexing
- Fancy Indexing is a term adopted by NumPy to describe indexing using integer arrays
- You can use an array of values to index a matrix to select rows of data in a different order
- You can use two arrays of values to select values from a matrix and create a 1-D array
- The values of each array become tuples to select individual elements
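A short sketch of the two fancy-indexing cases described above (the array contents are made up):

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

rows_reordered = arr[[3, 0, 2]]    # one integer array selects rows in a new order
elements = arr[[1, 3], [0, 2]]     # two arrays select elements (1, 0) and (3, 2) -> 1-D result
```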
- Transposing Arrays and Swapping Axes
- Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything
- Arrays have the transpose method and the special T attribute
- When doing matrix operations (like np.dot), you may do this a lot
- Pseudorandom Number Generation
- The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
- You can create your own explicit random number generator by using the np.random.default_rng function
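A minimal sketch of an explicit generator; seeding it makes results reproducible (the seed value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=12345)  # explicit, seedable generator
samples = rng.standard_normal((2, 3))    # 2x3 array of draws from N(0, 1)
uniform = rng.uniform(0, 1, size=5)      # five draws from Uniform(0, 1)
```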
- Universal Functions: Fast Element Wise Array Functions
- A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results
- Array Oriented programming with Arrays
- The practice of replacing explicit loops with array expressions is referred to by some people as vectorization
- Broadcasting is a powerful method for vectorizing computations
- Expressing Conditional Logic as Array Operators
- The typical use of np.where(cond, true_val, false_val) in data analysis is to produce a new array of values based on another array
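A one-line sketch of np.where as a vectorized if/else (the data is hypothetical):

```python
import numpy as np

arr = np.array([-1.5, 0.2, -0.3, 2.0])

# Replace negative values with 0 and keep the rest, without writing a loop
cleaned = np.where(arr < 0, 0, arr)
```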
- Mathematical and Statistical Methods
- You can use aggregations (also called reductions) like sum, mean, and std either by calling the array instance method or by using the top-level NumPy function
- Functions like mean and sum take an optional axis argument that computes the statistic over the given axis (axis=0 computes down the rows, giving one result per column; axis=1 computes across the columns, giving one result per row)
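A short sketch of both call styles and the axis argument (the values are made up):

```python
import numpy as np

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

total = arr.sum()             # instance method: reduce over all elements
also_total = np.sum(arr)      # equivalent top-level NumPy function
col_means = arr.mean(axis=0)  # down the rows: one mean per column
row_sums = arr.sum(axis=1)    # across the columns: one sum per row
```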
- Sorting
- NumPy arrays can be sorted in place using the sort() method
- This also takes the optional axis argument; axis=0 here sorts the values within each column
- Unique and Other Set Logic
- np.unique(ndarray) returns a sorted ndarray of unique values
- Numpy can save and load data to and from disk in some text and binary formats
- Linear Algebra
- numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant
- @ = matrix multiplication
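A minimal sketch of numpy.linalg and the @ operator (the matrix is made up for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])

inv = np.linalg.inv(A)   # matrix inverse
det = np.linalg.det(A)   # determinant: 1*4 - 2*3 = -2
product = A @ inv        # @ is matrix multiplication; A @ inv(A) is ~ the identity
```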
- Numpy Advanced
- You can reshape arrays using ndarray.reshape(tuple)
- The opposite operation of reshape, going from a higher dimension to one dimension, is typically known as flattening or raveling
- ndarray.flatten() is like the equivalent JS method and it always makes a copy of the data (ndarray.ravel() returns a view when possible)
- np.concatenate() takes a sequence of arrays and joins them in order along the input axis (axis=0 => rows)
- np.vstack and np.hstack accomplish similar things as before
- np.c_[*ndarrays] and np.r_[*ndarrays] are more concise ways to do column concatenation and row concatenation, respectively
- ndarray.repeat(int|int[], axis=int) can be used to repeat elements a certain number of times along the specified axis
- np.tile(ndarray, reps) can be thought of as stacking copies of an array along an axis
- Broadcasting governs how operations work between arrays of different shapes. It can be a powerful feature, but also one that causes confusion
- Broadcasting Rule: Two arrays are compatible for broadcasting if for each of the trailing dimensions (i.e., starting from the end) the axis lengths match or either of the lengths is 1. Broadcasting is then performed over the missing or length-1 dimensions
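The broadcasting rule above can be sketched with the common demeaning pattern (the data is hypothetical):

```python
import numpy as np

arr = np.arange(6).reshape(3, 2)  # shape (3, 2)
row_means = arr.mean(axis=1)      # shape (3,)

# (3, 2) against (3, 1): trailing lengths 2 and 1 are compatible, so the
# length-1 axis is broadcast across the columns
demeaned = arr - row_means.reshape(3, 1)
```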
Getting Started with Pandas
Introduction
- Series
- A Series is a one-dimensional array-like object containing a sequence of values (of similar types to numpy types) of the same type and an associated array of data labels, called its index.
- Values and indexes can be accessed with the Series.array and Series.index properties
- A series is like a fixed-length, ordered dictionary, as it is a mapping of index values to data values
- The pd.isna() and pd.notna() functions should be used to detect missing data. Series also has these as instance methods (Series.isna())
- Both the Series itself and the index attribute have a name property
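A minimal sketch of the Series points above (the values and labels are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

values = s.array       # the underlying values
labels = s.index       # the associated index of data labels
by_label = s["a"]      # dictionary-like access by index value

s.name = "numbers"     # both the Series and its index have a name property
s.index.name = "letter"

s2 = pd.Series([1.0, np.nan, 3.0])
missing = s2.isna()    # Boolean Series marking missing (NA) values
```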
- DataFrame
A DataFrame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dictionary of Series all sharing the same index
- Rows can be retrieved by position or name with the special DataFrame.iloc[] and DataFrame.loc[] attributes
- The del DataFrame["column"] expression can be used to remove a column
The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series's copy method
- DataFrame's index and columns attributes have their own name attributes set
- DataFrame's to_numpy() method returns the data contained in the DataFrame as a two-dimensional ndarray. If the columns have different types, the returned ndarray will have a type that accommodates all columns
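A short sketch of a DataFrame and the behaviors noted above (the data is hypothetical):

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Nevada"],
        "year": [2000, 2001, 2001],
        "pop": [1.5, 1.7, 2.4]}
frame = pd.DataFrame(data)

first_row = frame.iloc[0]        # retrieve a row by integer position
by_label = frame.loc[1, "pop"]   # row label 1, column "pop"
arr = frame.to_numpy()           # mixed column types -> object-dtype ndarray
```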
- Index Object
- Immutable
- Hold the axis labels and other metadata (like name)
- Is array like and set like (although it can contain duplicate values)
Essential Functionality
- Reindexing
- The reindex method rearranges the data according to a new index
- The method keyword argument determines how to fill missing indices when reindexing
- Can be used to reindex rows, columns, or both
- Drop
- Use the drop() method to drop columns or rows; this returns a new DataFrame object
- Indexing, Selection, Filtering
- Series indexing works like NumPy arrays except you can use the Series index values instead of only integers
- The preferred way to select index values is with the special loc operator
- You can use integers to index, like in NumPy, by indexing using the iloc operator
- You can make a selection on a data frame using the loc (labels) and iloc (integers) operators
- Boolean arrays can be used with loc but not iloc
- Prefer indexing with loc and iloc to avoid ambiguity
- A good rule of thumb is to avoid chained indexing when doing assignments
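A minimal sketch of loc vs. iloc and the chained-indexing advice (the labels and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["a", "b", "c"], columns=["x", "y", "z"])

by_label = df.loc["b", "y"]      # select a cell by row/column labels
by_position = df.iloc[1, 1]      # the same cell by integer positions
subset = df.loc[df["x"] > 3, ["y", "z"]]  # Boolean arrays work with loc, not iloc

# Prefer a single loc assignment over chained indexing like df[df["x"] > 3]["z"] = 0
df.loc[df["x"] > 3, "z"] = 0
```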
- Arithmetic Operations
- Misalignment causes cells to be filled with NaN
- You can use DataFrame.add(DataFrame2, fill_value=Value) (and equivalents for the other arithmetic operations) to specify another fill value that is not np.nan
- You can also use fill_value when reindexing
- Function Mapping
- NumPy ufuncs also work with pandas objects
- DataFrame's apply() method can be used to apply a function to one-dimensional arrays to each column or row
- You can specify the axis keyword argument to apply along rows instead of columns
- You can use applymap() to apply an operation to each cell (newer pandas versions rename this to DataFrame.map())
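A short sketch of apply along both axes and a per-cell mapping; the hasattr check is an assumption to cover the applymap-to-map rename in newer pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1.0, 4.0], [2.0, 6.0]]), columns=["a", "b"])

col_range = df.apply(lambda col: col.max() - col.min())          # per column (default)
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)  # per row

# Per-cell mapping; DataFrame.map replaced applymap in pandas 2.1
cell_map = df.map if hasattr(df, "map") else df.applymap
formatted = cell_map(lambda x: f"{x:.1f}")
```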
- Sorting
- You can use sort_index, sort_values, and rank to sort stuff
- Watch out for duplicate indices
Statistics
- Pandas is equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame
- Ex: sum, mean
- They have keyword arguments like axis and skipna
- describe produces multiple summary statistics in one shot
- For non-numerical data, describe produces alternative summary statistics
- Use the corr and cov methods to compute correlation and covariance between series
- You can also use these methods on a DataFrame to compute the correlation/covariance between one column and all other columns
Data Loading Storage and File Formats
Reading data and making it accessible, called data loading, is a necessary first step for using most of the tools in this book. The term parsing is also sometimes used to describe loading text data and interpreting it as tables and different data types.
- Reading data argument types:
- Indexing
- Treat one or more of the columns as the returned DataFrame's index, and whether to get column names from the file, from arguments you provide, or not at all
- Type Inference and Data Conversion
- Includes user-defined value conversions and custom lists of missing value markers
- Date and time parsing
- Includes a combining capability, such as combining date and time information spread over multiple columns into a single column in the result
- Iterating
- Support for iterating over chunks of very large files
- Unclean data issues
- Includes skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas
- Options when reading data:
- Give names to columns
- Specify no header
- Specify data separator
- Skip rows in data
- Specify index columns
- There are a lot of options when handling CSV, JSON, and HTML data - best to look back over the docs when the need to handle special cases comes up
- Pickle is good for storing Python data short term (packages may change)
- HDF stands for Hierarchical Data Format and is a good format for storing large quantities of scientific data
- It is good practice to call raise_for_status after using requests.get to check for HTTP errors
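A minimal sketch of the read_csv options listed above, using an in-memory string as a stand-in for a file on disk (the data, separator, and column names are made up):

```python
import io
import pandas as pd

# A small hypothetical file with a comment line, a "|" separator, and an id column
raw = "# exported data\nid|name|score\n1|alice|9.5\n2|bob|7.0\n"

df = pd.read_csv(io.StringIO(raw),
                 sep="|",         # non-default data separator
                 skiprows=1,      # skip the comment line
                 index_col="id")  # use a column as the row index
```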
Data Cleaning and Preparation
During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst's time.
- Pandas helps make this preparation easier
- We refer to missing data as NA - not available. In statistics applications, NA data may either be data that does not exist or data that exists but was not observed (through problems with data collection, for example)
- Saving a regex using re.compile is recommended if you intend to apply the expression to many strings - doing so will save you CPU cycles
- Pandas has extension types that provide specialized treatment of strings, integers, and Boolean data, which until recently had some rough edges when working with missing data
Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables containing the distinct values and storing the primary observations as integer keys referencing the dimension table.
This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. In this book we will use the terms categorical and categories. The integer values that reference the categories are called the category codes or simply codes.
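A short sketch of the categorical representation in pandas (the fruit values are hypothetical):

```python
import pandas as pd

values = pd.Series(["apple", "orange", "apple", "apple"])

cat = values.astype("category")
codes = cat.cat.codes              # integer codes referencing the categories
categories = cat.cat.categories    # the distinct values (the "dimension table")
```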
Data Wrangling: Join, Combine, and Reshape
- Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Groupby, MultiIndex, ...
- Data in pandas can be combined in a number of ways:
- pandas.merge
- Connect rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- You can specify inner/outer/left/right join, which keys to merge on
- DataFrame.join
- Join together many dataframes that have the same index but nonoverlapping columns
- pandas.concat
- Concatenate or "stack" objects together along an axis
- combine_first
- Splice together overlapping data to fill in missing values in one object with values from another
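A minimal sketch of merge and concat on small hypothetical tables:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key", how="inner")  # only keys in both: a, b
outer = pd.merge(left, right, on="key", how="outer")  # union of keys: a, b, c, d
stacked = pd.concat([left, left], axis=0)             # "stack" objects along rows
```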
Plotting and Visualization
- Plots in matplotlib reside within a Figure object. You can create a new figure with plt.figure, where you can also customize the size of the plot and add subplots
- You can use the rc method on plt to configure global parameters governing figure size, subplot spacing, colors, font sizes, grid styles, and so on.
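A short sketch of creating a figure with subplots and setting an rc parameter; the Agg backend line is an assumption so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption so this runs headless
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(8, 4))   # customize the figure size
ax1 = fig.add_subplot(1, 2, 1)     # left subplot in a 1x2 grid
ax2 = fig.add_subplot(1, 2, 2)     # right subplot

x = np.linspace(0, 2 * np.pi, 100)
ax1.plot(x, np.sin(x))
ax2.hist(np.random.default_rng(0).standard_normal(200), bins=20)

plt.rc("figure", figsize=(6, 4))   # configure a global default via rc
```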
Data Aggregation and Group Operations
Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis="index") or its columns (axis="columns"). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data. See Figure 10-1 for a mockup of a simple group aggregation.
- Use the groupby and apply functions to group data and apply functions to the grouped data
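The split-apply-combine steps above can be sketched in one line (the keys and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# Split by "key", apply mean to each group, combine into a result Series
means = df.groupby("key")["value"].mean()
```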