Pandas Review
I wanted to review pandas before continuing to learn about Machine Learning. There were some code snippets in Hands-On Machine Learning that I didn't understand well due to my lack of knowledge about pandas, mainly indexing.
References
Intro to Pandas
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
What Kind of Data Does Pandas Handle?
- A DataFrame is a 2-dimensional data structure that can store data of different types in columns.
- Each column in a DataFrame is a Series.
- A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
import pandas as pd
import os
from pathlib import Path
df = pd.DataFrame(
    {
        "Name": ["Alice", "Bart", "Carl"],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
df["Name"]
# OUT:
# 0 Alice
# 1 Bart
# 2 Carl
ser = pd.Series([22,35,58],name="Age")
titanic = pd.read_csv(os.path.join(Path.cwd(),"..","titanic.csv"))
How Do I Read and Write Tabular Data?
- pandas supports many different file formats or data sources out of the box (csv, excel, sql, ...), each of them with the prefix read_*.
- You can use the head() and tail() methods to ensure that the data was loaded properly (see the example after the info() notes below).
- You can check to see how pandas interpreted each of the column data types by requesting the df.dtypes attribute.
- dtypes are an attribute of a DataFrame and Series.
- The info() method provides technical information about a DataFrame:
titanic.info()
- The type is DataFrame
- There are 891 entries (891 rows)
- The table has 12 columns; some have a numeric data type, others have the string object data type. Some columns contain null values, namely those with fewer than 891 non-null entries
- The approximate amount of RAM used to hold the data frame is provided as well
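A quick sanity check on the freshly loaded titanic DataFrame, a minimal sketch using only the methods mentioned above:
titanic.head(8)    # first 8 rows
titanic.tail(10)   # last 10 rows
titanic.dtypes     # the data type pandas inferred for each column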
How Do I Select a Subset of a DataFrame?
- To select a single column, use square brackets [] with the name of the column of interest.
- Each column in a DataFrame is a Series. Since a single column is selected, the returned object is a pandas Series.
- shape is an attribute of both DataFrame and Series, containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional, so only the number of rows is returned (see the shape example after the selection snippet below).
- To select multiple columns, use a list of column names within the selection brackets:
titanic[["Age", "Sex"]]
Selecting Specific Rows of DataFrame
- To select rows based on a conditional expression, use a condition inside the selection brackets:
titanic[titanic["Age"] > 35]
- The output of the conditional expression df["Age"] > 35 returns a pandas Series of Boolean values with the same number of rows as the original DataFrame. Such a Series of Boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.
Selecting Specific Rows and Columns
- The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
- When using column names, row labels or a conditional expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
- When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].
- When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
row_9_to_25_col_2_to_5 = titanic.iloc[9:25, 2:5]
titanic.iloc[0:3,3]="anonymous"
How Do I Create Plots in Pandas?
With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data. The plot method works on both Series and DataFrames; a Series plot is shown below. Apart from the default line plot produced by the plot function, a number of alternatives are available for plotting data (see the box plot below).
import os
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
air_quality = pd.read_csv(os.path.join(Path.cwd(),"..","air_quality_no2.csv"),index_col=0,parse_dates=True)
air_quality.plot()
plt.show()
air_quality["station_paris"].plot()
plt.title("Air Quality, Station Paris")
plt.show()
print([method_name for method_name in dir(air_quality.plot) if not method_name.startswith("_")])
air_quality.plot.box()
How to Create New Columns Derived From Existing Columns?
- To create a new column, use the [] brackets with the new column name at the left side of the assignment.
- The calculation is done element-wise. This means all values in a given column are calculated at once and you do not need to iterate each one of the rows.
air_quality["londom_mg_per_cubic"] = air_quality["station_london"]
How to Calculate Summary Statistics
Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.
titanic["Age"].mean()
# The statistic applied to multiple columns of a DataFrame is calculated for each numeric column
titanic[["Age","Fare"]].median()
titanic.describe()
titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)
Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The groupby method is used to support this type of operation. This fits the more general split-apply-combine pattern:
titanic[["Sex","Age"]].groupby("Sex").mean()
- Split the data into groups
- Apply a function to each group independently
- Combine the results into a data structure
- The [df|series].value_counts() method counts the number of records for each category in a column.
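For example, assuming the standard Pclass column of the titanic dataset:
titanic["Pclass"].value_counts()   # number of passengers per cabin class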
How To Reshape the Layout of Tables
The df.sort_values(by=column_name_or_list, ascending=bool) method sorts the rows of a table according to the given column(s). The index follows the row order.
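A short sketch with the titanic columns used earlier:
titanic.sort_values(by="Age").head()                                # youngest passengers first
titanic.sort_values(by=["Pclass", "Age"], ascending=False).head()   # sort on two columns, descending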
The pivot() function performs pure reshaping of the data: a single value is required for each index/column combination.
# the reshaping examples assume the long-format dataset from the pandas tutorial (air_quality_long.csv)
air_quality = pd.read_csv(os.path.join(Path.cwd(), "..", "air_quality_long.csv"), index_col="date.utc", parse_dates=True)
air_quality.describe()
no2 = air_quality[air_quality["parameter"] == "no2"]
no2_subset = no2.sort_index().groupby(["location"]).head(2)
no2_subset.pivot(columns="location", values="value")
air_quality.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,
)
In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated, pivot_table() can be used, providing an aggregation function that specifies how to combine these values.
How to combine data from multiple tables
The pd.concat([...DataFrames]) function performs concatenation operations of multiple tables along one of the axes (row-wise or column-wise). By default, concatenation is along axis 0, so the resulting table combines the rows of the input tables.
Many pandas methods accept an axis argument that specifies along which axis to apply the operation. A DataFrame has two axes: the first runs vertically downwards across rows (axis 0), and the second runs horizontally across columns (axis 1). Most operations, such as concatenation or summary statistics, are applied across rows (axis 0) by default, but can be applied across columns as well.
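A minimal, self-contained sketch of concatenation along both axes (the tables are made up for illustration):
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["c", "d"], "x": [3, 4]})

pd.concat([left, right])           # axis 0 (default): stack the rows
pd.concat([left, right], axis=1)   # axis 1: place the tables side by side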
How to handle time series data with ease
df["datetime"] = pd.to_datetime(df["datetime"])
By applying the to_datetime function, pandas interprets the strings and converts them to datetime objects. In pandas these objects, which are similar to datetime.datetime from the standard library, are called pandas.Timestamp.
Using pandas.Timestamp for datetimes enables us to calculate with date information and make them comparable. Hence, we can use this to get the length of our time series.
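A small hypothetical example; the datetime strings are made up for illustration:
df = pd.DataFrame({"datetime": ["2019-05-07 01:00", "2019-05-07 02:00", "2019-05-08 01:00"]})
df["datetime"] = pd.to_datetime(df["datetime"])
df["datetime"].max() - df["datetime"].min()   # a Timedelta: the length of the time series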
How to manipulate textual data
Similar to datetime objects having a dt accessor, a number of specialized string methods are available when using the str accessor: df["Names"].str.lower()
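A hypothetical sketch of both accessors; the column names and values are invented for illustration:
df = pd.DataFrame({
    "Names": ["Braund, Mr. Owen", "Allen, Miss. Elisabeth"],
    "datetime": pd.to_datetime(["2019-05-07", "2019-06-08"]),
})
df["Names"].str.lower()                 # element-wise lowercase
df["Names"].str.split(",").str.get(0)   # the part before the comma
df["datetime"].dt.month                 # month component via the dt accessor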
10 minutes to pandas
Pandas provides two types of classes for handling data:
- Series: a one-dimensional labeled array holding data of any type such as integers, strings, Python objects, etc.
- DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns
- When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame.
- For NumPy data types, np.nan represents missing data. It is by default not included in computations.
- Reindexing allows you to change/add/delete the index on the specified axis. This returns a copy of the data.
- DataFrame.dropna() drops any rows that have missing data. DataFrame.fillna(value=...) fills missing data. DataFrame.isna() gets the boolean mask where values are NaN.
- DataFrame.agg() and DataFrame.transform() apply a user-defined function that reduces or broadcasts its result, respectively.
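A small sketch of the missing-data helpers listed above, on a made-up frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})
df.isna()            # boolean mask of missing values
df.dropna()          # drop rows containing any NaN
df.fillna(value=0)   # replace NaN with a constant
df.mean()            # NaN is skipped in computations by default
df.to_numpy()        # a NumPy array whose dtype can hold all columns (float64 here)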
Merge
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra in the case of join / merge-type operations.
- pd.concat(): join pandas objects together row-wise
- pd.merge() enables SQL style joins along specific columns
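A minimal SQL-style join on made-up tables:
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [9.5, 3.0, 7.2]})
pd.merge(customers, orders, on="id", how="left")   # SQL-style LEFT JOIN on the id column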
Grouping
By "group by" we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
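A self-contained sketch of the three steps on made-up data:
df = pd.DataFrame({"team": ["A", "A", "B", "B"], "points": [3, 1, 2, 4]})
df.groupby("team")["points"].sum()   # split by team, apply sum, combine into a Series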
Time Series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
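A minimal sketch of frequency conversion on made-up secondly data:
import numpy as np
import pandas as pd

rng = pd.date_range("2024-01-01", periods=600, freq="s")   # 10 minutes of secondly timestamps
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.resample("5min").mean()                                 # downsample into 5-minute bins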
Categoricals
pandas can include categorical data in a DataFrame
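A brief sketch on a made-up column:
df = pd.DataFrame({"grade": ["a", "b", "b", "a", "e"]})
df["grade"] = df["grade"].astype("category")
df["grade"].dtype            # category dtype
df["grade"].cat.categories   # the distinct categories: ['a', 'b', 'e']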
Intro to Data Structures
Fundamentally, data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.
Series
Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index. It acts very similarly to an ndarray, but operations such as slicing will also slice the index. A pandas Series also has a single dtype. The array attribute gives you the underlying data. A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels. The Series also has a name attribute.
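For example, adding two Series with partly overlapping labels (values made up):
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"], name="demo")
s1 + s2   # aligned on labels: 'a' and 'd' have no match and become NaN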
DataFrame
DataFrame is a 2-dimensional data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Along with the data, you can optionally pass index (row labels) and columns arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. If axis labels are not passed, they will be constructed from the input data based on common sense rules.
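A small sketch of building a DataFrame from a dict of Series and constraining it with the index and columns arguments (names made up):
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
pd.DataFrame(d)                          # index is the union of the Series indexes
pd.DataFrame(d, index=["d", "b", "a"])   # keep only these rows, in this order
pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])   # unknown column -> NaN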
Essential Basic Functionality
- Use .array or .to_numpy() to get the values in a Series or DataFrame, not .values.
- Use the idxmin() and idxmax() methods on Series and DataFrames to compute the index labels with the minimum and maximum corresponding values.
- The value_counts() Series method computes a histogram of a 1D array of values.
- Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which takes an optional axis argument.
- The aggregation API allows you to express possibly multiple aggregation operations in a single concise way, e.g. DataFrame.agg(func, axis=0).
- The transform() method returns an object that is indexed the same (same size) as the original. This API allows you to provide multiple operations at the same time rather than one-by-one.
- reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis.
- The align() method is the fastest way to simultaneously align two objects.
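A compact sketch of several of these methods on made-up data:
df = pd.DataFrame({"x": [1.0, 4.0, 2.0], "y": [9.0, 3.0, 6.0]}, index=["r1", "r2", "r3"])
df.idxmax()                                   # index label of each column's maximum
df.apply(lambda col: col.max() - col.min())   # a function applied column-wise (axis=0)
df.agg(["min", "max"])                        # several aggregations at once
df.transform(lambda col: col / col.sum())     # result has the same shape as the input
df.reindex(index=["r3", "r1", "r0"])          # conform to new labels; unknown label -> NaN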
IO Tools
The pandas I/O API is a set of top-level reader functions accessed like pandas.read_csv() that generally return a pandas object. The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a list of pandas DataFrames. Use to_latex() to output DataFrames to LaTeX.
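A minimal round trip through one reader/writer pair (the file name is made up):
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df.to_csv("example.csv", index=False)   # writer methods are named to_*
pd.read_csv("example.csv")              # reader functions are named read_*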
Indexing and Selecting Data
- .loc is primarily label based, but may also be used with a boolean array.
- .iloc is primarily integer position based, but may also be used with a boolean array.
pandas provides a suite of methods for purely label-based indexing. This is a strict inclusion-based protocol: when slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position. pandas also provides a suite of methods for purely integer-based indexing. The semantics closely follow Python and NumPy slicing: indexing is 0-based, and when slicing, the start bound is included while the upper bound is excluded. You can use the sample() method to return a random row or number of rows.
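A sketch of the inclusive/exclusive difference and sample(), with made-up labels:
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
s.loc["a":"c"]   # label slice: both 'a' and 'c' are included
s.iloc[0:2]      # position slice: stop position 2 is excluded
s.sample(n=2)    # two random rows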
Merge, join, concatenate and compare
pandas provides various methods for combining and comparing Series or DataFrame:
- concat(): Merge multiple Series or DataFrame objects along a shared index or column
- DataFrame.join(): Merge multiple DataFrame objects along the columns
- DataFrame.combine_first(): Update missing values with non-missing values in the same location
- merge(): Combine two Series or DataFrame objects with SQL-style joining
- merge_ordered(): Combine two Series or DataFrame objects along an ordered axis
- merge_asof(): Combine two Series or DataFrame objects by near instead of exact matching keys
- Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects
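A brief sketch of compare() on two nearly identical frames (made-up data; compare() requires pandas >= 1.1):
df1 = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df2 = df1.copy()
df2.loc[1, "b"] = "q"
df1.compare(df2)   # shows only the cells that differ between self and other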