Pandas Review

I wanted to review pandas before continuing to learn about machine learning. There were some code snippets in Hands-On Machine Learning that I didn't understand well due to my lack of knowledge about pandas - mainly indexing.

References

Intro to Pandas

When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

What Kind of Data Does Pandas Handle?

  • A DataFrame is a 2-dimensional data structure that can store data of different types in columns.
  • Each column in a DataFrame is a Series.
  • A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.

DataFrame

Series

import pandas as pd
import os
from pathlib import Path

df = pd.DataFrame(
{
	"Name": ["Alice","Bart","Carl"],
	"Age": [22,35,58],
	"Sex": ["male","male","female"]
}
)
df["Name"]
# OUT:
# 0 Alice
# 1 Bart
# 2 Carl
ser = pd.Series([22,35,58],name="Age")
titanic = pd.read_csv(os.path.join(Path.cwd(),"..","titanic.csv"))

How Do I Read and Write Tabular Data?

  • pandas supports many different file formats or data sources out of the box (csv, excel, sql, ...), each of them with the prefix read_*.
  • You can use the head() and tail() methods to ensure that the data was loaded properly.
  • You can check to see how pandas interpreted each of the column data types by requesting the df.dtypes attribute.
    • dtypes are an attribute of a DataFrame and Series.
  • The method info provides technical information about a DataFrame:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

  • The type is DataFrame
  • There are 891 entries (891 rows)
  • The table has 12 columns, some with a numeric data type, others with a string (object) data type. Columns with fewer than 891 non-null values contain missing data.
  • The approximate amount of RAM used to hold the data frame is provided as well

How Do I Select a Subset of a DataFrame?

Selecting Specific Columns of DataFrame

  • To select a single column, use square brackets [] with the name of the column of interest
  • Each column in a DataFrame is a Series. As a single column is selected, the return object is a pandas Series.
  • The DataFrame.shape is an attribute of a pandas DataFrame and Series containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.
  • To select multiple columns, use a list of column names within the selected brackets:
titanic[["Age", "Sex"]]

     Age     Sex
0    22.0    male
1    38.0  female
2    26.0  female
3    35.0  female
4    35.0    male
..    ...     ...
886  27.0    male
887  19.0  female
888   NaN  female
889  26.0    male
890  32.0    male

[891 rows x 2 columns]

Selecting Specific Rows of DataFrame

  • To select rows based on a conditional expression, use a condition inside the selection brackets:
titanic[titanic["Age"] > 35]

     PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch    Ticket     Fare Cabin Embarked
1              2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0  PC 17599  71.2833   C85        C
6              7         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0     17463  51.8625   E46        S
11            12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0    113783  26.5500  C103        S
13            14         0       3                        Andersson, Mr. Anders Johan    male  39.0      1      5    347082  31.2750   NaN        S
15            16         1       2                    Hewlett, Mrs. (Mary D Kingcome)  female  55.0      0      0    248706  16.0000   NaN        S
..           ...       ...     ...                                                ...     ...   ...    ...    ...       ...      ...   ...      ...
865          866         1       2                           Bystrom, Mrs. (Karolina)  female  42.0      0      0    236852  13.0000   NaN        S
871          872         1       1   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)  female  47.0      1      1     11751  52.5542   D35        S
873          874         0       3                        Vander Cruyssen, Mr. Victor    male  47.0      0      0    345765   9.0000   NaN        S
879          880         1       1      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0      1     11767  83.1583   C50        C
885          886         0       3               Rice, Mrs. William (Margaret Norton)  female  39.0      0      5    382652  29.1250   NaN        Q

[217 rows x 12 columns]

  • The output of the conditional expression df["Age"] > 35 returns a pandas Series of Boolean values with the same number of rows as the original DataFrame. Such a Series of Boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.
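A minimal sketch of this boolean-mask filtering, using a small hypothetical frame in place of the Titanic data:

```python
import pandas as pd

# Hypothetical frame standing in for the Titanic DataFrame
df = pd.DataFrame({"Age": [22, 38, 26, 35, 54]})

mask = df["Age"] > 35            # Boolean Series, same length as df
filtered = df[mask]              # keeps only the rows where the mask is True

print(mask.tolist())             # [False, True, False, False, True]
print(filtered["Age"].tolist())  # [38, 54]
```

Note that the surviving rows keep their original index labels (1 and 4 here), which is why filtered results are not renumbered from zero.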

Selecting Specific Rows and Columns

  • The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
  • When using the column names, row labels or a conditional expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies that you want to select all rows or columns.
  • When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].
  • When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
row_9_to_25_col_2_to_5 = titanic.iloc[9:25,2:5]
titanic.iloc[0:3,3]="anonymous"

How Do I Create Plots in Pandas?

With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data. The plot method works on Series and DataFrames. You can see a series plot below. Apart from the default line plot when using the plot function, a number of alternatives are available to plot data (see box plot below).

import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
air_quality = pd.read_csv(os.path.join(Path.cwd(),"..","air_quality_no2.csv"),index_col=0,parse_dates=True)
air_quality.plot()
plt.show()
air_quality["station_paris"].plot()
plt.title("Air Quality, Station Paris")
plt.show()
print([method_name for method_name in dir(df.plot) if not method_name.startswith("_")])
air_quality.plot.box()
[line plot of all air_quality columns]

[line plot: "Air Quality, Station Paris"]

['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter']

[box plot of all air_quality columns]

How to Create New Columns Derived From Existing Columns?

  • To create a new column, use the [] brackets with the new column name at the left side of the assignment.
    • The calculation is done element-wise. This means all values in a given column are calculated at once and you do not need to iterate each one of the rows.
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
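The element-wise behaviour can be sketched on a small hypothetical frame; the column names and the 1.882 conversion factor below are illustrative, not tied to a real dataset:

```python
import pandas as pd

# Hypothetical NO2 measurements in µg/m³
aq = pd.DataFrame({"station_london": [20.0, 19.5, 26.0]})

# Element-wise arithmetic: every row is computed at once, no explicit loop
aq["london_mg_per_cubic"] = aq["station_london"] * 1.882

print(aq["london_mg_per_cubic"].tolist())
```

Any vectorized expression works the same way, e.g. the ratio of two columns or a comparison producing a boolean column.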

How to Calculate Summary Statistics

Calculating Summary Statistics

Different statistics are available and can be applied to columns with numerical data. Operations in general exclude missing data and operate across rows by default.

titanic["Age"].mean()
# The statistic applied to multiple columns of a DataFrame is calculated for each numeric column
titanic[["Age","Fare"]].median() 
titanic.describe()
titanic.agg(
	{
		"Age": ["min","max","median","skew"],
		"Fare": ["min","max","median","mean"]
	}
)

              Age        Fare
min      0.420000    0.000000
max     80.000000  512.329200
median  28.000000   14.454200
skew     0.389108         NaN
mean          NaN   32.204208

Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The groupby method supports this type of operation. It fits in the more general split-apply-combine pattern:

titanic[["Sex","Age"]].groupby("Sex").mean()
  • Split the data into groups
  • Apply a function to each group independently
  • Combine the results into a data structure
  • The Series.value_counts() method counts the number of records for each category in a column.
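The split-apply-combine steps and value_counts() can be sketched on a small hypothetical table standing in for the Titanic data:

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female"],
    "Age": [22.0, 35.0, 58.0, 26.0, 29.0],
})

# Split by Sex, apply mean to Age in each group, combine into one Series
mean_age = df.groupby("Sex")["Age"].mean()
print(mean_age)

# Count the number of records per category
counts = df["Sex"].value_counts()
print(counts)
```

The result of groupby().mean() is indexed by the group keys, so mean_age["female"] and mean_age["male"] read off the per-category statistics directly.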

How To Reshape the Layout of Tables

The df.sort_values(by=column_name_or_list, ascending=boolean) method allows you to sort rows in a table according to the given columns. The index will follow the row order.
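A short sketch of sort_values on a hypothetical frame, showing that the original index labels travel with the sorted rows:

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 2], "Fare": [7.25, 71.28, 13.0]})

# Sort by a single column, descending
by_fare = df.sort_values(by="Fare", ascending=False)
print(by_fare.index.tolist())    # [1, 2, 0] - the index follows the rows

# Sort by several columns at once
by_both = df.sort_values(by=["Pclass", "Fare"])
print(by_both["Pclass"].tolist())  # [1, 2, 3]
```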

Pivot

The pivot() function purely reshapes the data: a single value for each index/column combination is required.

# assumes the long-format air quality dataset with "location", "parameter" and "value" columns
no2 = air_quality[air_quality["parameter"] == "no2"]
no2_subset = no2.sort_index().groupby(["location"]).head(2)
no2_subset.pivot(columns="location", values="value")
air_quality.pivot_table(
   values="value",
   index="location",
   columns="parameter",
   aggfunc="mean",
   margins=True,
)

Pivot Table

In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated, pivot_table() can be used, providing an aggregation function on how to combine these values.

How to combine data from multiple tables

Concatenating Objects

The pd.concat([...DataFrames]) function performs concatenation operations of multiple tables along one of the axes (row-wise or column-wise). By default, concatenation is along axis 0, so the resulting table combines the rows of the input tables.

A number of pandas methods accept an axis argument and can be applied along either axis. A DataFrame has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Most operations like concatenation or summary statistics are by default across rows (axis 0), but can be applied across columns as well.
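A minimal sketch of concatenation along each axis, with small hypothetical frames:

```python
import pandas as pd

top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3, 4]})

# axis=0 (the default): stack the rows of the input tables
rows = pd.concat([top, bottom])
print(rows["x"].tolist())        # [1, 2, 3, 4]

# axis=1: place tables side by side, aligning on the row index
left = pd.DataFrame({"x": [1, 2]})
right = pd.DataFrame({"y": [3, 4]})
cols = pd.concat([left, right], axis=1)
print(cols.columns.tolist())     # ['x', 'y']
```

Note that the row-wise result keeps the original index labels of each input; pass ignore_index=True if a fresh 0..n-1 index is wanted.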

Join Tables Using a Common Identifier
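This section of my notes was empty, so here is a hedged sketch of joining two hypothetical tables on a common identifier column with pd.merge() (the station/city data below is made up for illustration):

```python
import pandas as pd

# Two hypothetical tables sharing a "station" key column
measurements = pd.DataFrame({
    "station": ["BETR801", "FR04014", "BETR801"],
    "no2": [22.5, 27.4, 26.5],
})
coordinates = pd.DataFrame({
    "station": ["BETR801", "FR04014"],
    "city": ["Antwerpen", "Paris"],
})

# SQL-style left join on the shared identifier column
merged = pd.merge(measurements, coordinates, how="left", on="station")
print(merged)
```

With how="left", every row of the left table is kept and the matching city is looked up per station, so the result has the same number of rows as measurements.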

How to handle time series data with ease

df["datetime"] = pd.to_datetime(df["datetime"])

By applying the to_datetime function, pandas interprets the strings and converts them to datetime objects. In pandas these objects, which are similar to datetime.datetime from the standard library, are called pandas.Timestamp.
Using pandas.Timestamp for datetimes enables us to calculate with date information and makes timestamps comparable. Hence, we can use this to get the length of a time series.
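A small sketch with hypothetical timestamp strings, converting them and computing the span of the series by subtraction:

```python
import pandas as pd

# Hypothetical timestamp strings as they might appear in a CSV
df = pd.DataFrame({"datetime": ["2019-05-07 02:00", "2019-05-07 03:00", "2019-05-09 02:00"]})

df["datetime"] = pd.to_datetime(df["datetime"])
print(df["datetime"].dtype)      # datetime64[ns]

# Timestamps are comparable, so the length of the series is a subtraction
span = df["datetime"].max() - df["datetime"].min()
print(span)                      # 2 days 00:00:00
```

The subtraction yields a pandas.Timedelta, which itself supports arithmetic and comparison.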

How to manipulate textual data

Similar to datetime objects having a dt accessor, a number of specialized string methods are available when using the str accessor: df["Names"].str.lower()
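A few str-accessor methods on a hypothetical Series of names:

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen", "Cumings, Mrs. John"])

# The .str accessor applies string methods element-wise
print(names.str.lower().tolist())
print(names.str.split(",").str.get(0).tolist())   # surname before the comma
print(names.str.contains("Mrs").tolist())         # boolean mask, usable for filtering
```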

10 minutes to pandas

Pandas provides two types of classes for handling data:

  1. Series: a one-dimensional labeled array holding data of any type such as integers, strings, Python objects, etc.
  2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns

When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. For NumPy data types, np.nan represents missing data; it is by default not included in computations. Reindexing allows you to change / add / delete the index on the specified axis; this returns a copy of the data. DataFrame.dropna() drops any rows that have missing data. DataFrame.fillna(value=) fills missing data. DataFrame.isna() gets the boolean mask where values are NaN. DataFrame.agg() and DataFrame.transform() apply a user-defined function that reduces or broadcasts its result, respectively.
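The missing-data and agg/transform methods can be sketched on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

print(df.isna().sum().tolist())   # NaN count per column: [1, 1]
print(df.dropna().shape)          # only row 0 has no missing data: (1, 2)

filled = df.fillna(value=0.0)     # replace missing values with a constant
print(filled["a"].tolist())       # [1.0, 0.0, 3.0]

# agg reduces (one value per column, NaN skipped by default);
# transform broadcasts back to the original shape
print(df.agg("sum").tolist())     # [4.0, 9.0]
doubled = df.transform(lambda x: x * 2)
```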

Merge

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra in the case of join / merge-type operations.

  • pd.concat(): join pandas objects together row-wise
  • pd.merge() enables SQL style joins along specific columns

Grouping

  • By "group by" we are referring to a process involving one or more of the following steps:

  • splitting the data into groups based on some criteria

  • applying a function to each group independently

  • combining the results into a data structure

Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
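A minimal resampling sketch, downsampling hypothetical minutely readings to 5-minute means:

```python
import pandas as pd

# Hypothetical 1-minute readings
idx = pd.date_range("2024-01-01", periods=10, freq="min")
ticks = pd.Series(range(10), index=idx)

# Downsample: aggregate 1-minute data into 5-minute buckets
five_min = ticks.resample("5min").mean()
print(five_min.tolist())   # [2.0, 7.0]
```

resample() is a groupby-like operation on the time index, so any aggregation (sum, max, a custom function) can stand in for mean.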

Categoricals

pandas can include categorical data in a DataFrame
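A short sketch of categorical data, using hypothetical grades; ordering the categories makes sorting and comparison meaningful:

```python
import pandas as pd

df = pd.DataFrame({"grade": ["a", "b", "b", "a", "e"]})

# Convert a string column to a categorical dtype
df["grade"] = df["grade"].astype("category")
print(df["grade"].cat.categories.tolist())   # ['a', 'b', 'e']

# Give the categories an explicit order (worst to best, say)
df["grade"] = df["grade"].cat.set_categories(["e", "b", "a"], ordered=True)
print(df.sort_values("grade")["grade"].tolist())  # ['e', 'b', 'b', 'a', 'a']
```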

Intro to Data Structures

Fundamentally, data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you.

Series

Series is a one-dimensional labeled array capable of holding any data type. The axis labels are collectively referred to as the index. It acts very similar to ndarray, but operations such as slicing will also slice the index. A pandas Series also has a single dtype. The array attribute gives you the underlying data. A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels. The Series also has a name attribute.
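Label-based alignment can be sketched with two hypothetical Series whose indexes only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Operations align on labels, not position; unmatched labels produce NaN
total = s1 + s2
print(total)   # a: NaN, b: 12.0, c: 23.0, d: NaN
```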

DataFrame is a 2-dimensional data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Along with the data, you can optionally pass index (row labels) and columns arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. If axis labels are not passed, they will be constructed from the input data based on common sense rules.

Basics of Indexing / Slicing

Essential Basic Functionality

Use .array or .to_numpy() to get the values in a Series or DataFrame, not .values. Use the idxmin() and idxmax() methods on Series and DataFrames to compute the index labels with the minimum and maximum corresponding values. The value_counts() Series method computes a histogram of a 1D array of values. Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which takes an optional axis argument. The aggregation API, DataFrame.agg(), allows you to express possibly multiple aggregation operations in a single concise way. The transform() method returns an object that is indexed the same (same size) as the original; this API also allows you to provide multiple operations at the same time rather than one-by-one. reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. The align() method is the fastest way to simultaneously align two objects.

IO Tools

The pandas I/O API is a set of top-level reader functions, accessed like pandas.read_csv(), that generally return a pandas object. The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a list of pandas DataFrames. Use to_latex() to output DataFrames to LaTeX.

Indexing and Selecting Data

  • .loc is primarily label based, but may also be used with a boolean array.
  • .iloc is primarily integer position based, but may also be used with a boolean array.

pandas provides a suite of methods for purely label-based indexing. This is a strict inclusion-based protocol: when slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position. pandas also provides a suite of methods for purely integer-based indexing, whose semantics closely follow Python and NumPy slicing: indexing is 0-based and, when slicing, the start bound is included while the upper bound is excluded. You can use the sample() method to return a random row or number of rows.
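The two inclusion rules can be sketched side by side on a hypothetical Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label-based slicing with .loc includes BOTH endpoints
print(s.loc["a":"c"].tolist())   # [10, 20, 30]

# Position-based slicing with .iloc excludes the upper bound, like Python lists
print(s.iloc[0:2].tolist())      # [10, 20]
```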

Merge, join, concatenate and compare

pandas provides various methods for combining and comparing Series or DataFrame:

  • concat() Merge multiple Series or DataFrame objects along a shared index or column
  • DataFrame.join() Merge multiple DataFrame objects along the columns
  • DataFrame.combine_first() Update missing values with non-missing values in the same location
  • merge() Combine two Series or DataFrame with SQL-style joining
  • merge_ordered() Combine two Series or DataFrame objects along an ordered axis
  • merge_asof() Combine two Series or DataFrame objects by near instead of exact matching keys
  • Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects