End-to-End Machine Learning Project Exercises Answers
I'm not sure these answers are the best - I had a hard time reviewing feature engineering, but I think I have a good handle on it now. I trained a lot of different models on the California Housing dataset because I was getting frustrated by how poorly the SVM model originally performed.
Using this chapter's dataset
from sklearn.datasets import fetch_openml
import pandas as pd
import os
housing = pd.read_csv(os.path.join(os.getcwd(),'..','housing.csv'))
housing.head()
housing.describe()
housing.info()
housing["ocean_proximity"].value_counts()
corr = housing.iloc[:,:-1].corr()
corr.style.background_gradient(cmap="coolwarm")
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer, KNNImputer
np.random.seed(42)
randint = np.random.randint(100)
target = housing["median_house_value"].copy()
data = housing.drop(columns="median_house_value").copy()
X_train, X_test, y_train, y_test = train_test_split(data,target,test_size=0.2,random_state=randint)
def transform_to_numpy(X):
    """
    I prefer to work with NumPy arrays rather than DataFrames. Working with DataFrames can cause issues
    """
    if not isinstance(X, np.ndarray):
        X = X.to_numpy()
    return X
# Custom Class for Imputing
class ImputeCategoricalMode(BaseEstimator, TransformerMixin):
    """
    This custom transformer imputes missing values in categorical columns.
    You can pass in a mode value to tell the class which value to impute with;
    fit() also learns the mode of each categorical column from the data.
    """
    mode = np.nan

    def __init__(self, mode=np.nan):
        # np.nan != np.nan is always True, so test with pd.isna instead
        if not pd.isna(mode):
            self.mode = mode

    def fit(self, X, y=None):
        modes = []
        for col in X.select_dtypes(include="object"):
            mode = X[col].value_counts().idxmax()
            modes.append(mode)
        if len(modes) == 1:
            self.mode = modes[0]
        else:
            self.mode = modes
        return self

    def transform(self, X, y=None):
        i = 0
        for col in X.select_dtypes(include="object"):
            # fillna returns a new Series, so assign the result back to the column
            if isinstance(self.mode, list):
                X[col] = X[col].fillna(value=self.mode[i])
            else:
                X[col] = X[col].fillna(value=self.mode)
            i += 1
        return X
class HandleMultiCollinearity(BaseEstimator, TransformerMixin):
    """
    Expected column indices in the numeric array:
    total_rooms: 1
    total_bedrooms: 2
    population: 3
    households: 4
    median_income: 5
    """
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X[:, 1] = (X[:, 1] + X[:, 2]) / X[:, 4]  # rooms (including bedrooms) per household
        X[:, 2] = X[:, 3]
        X[:, 3] = X[:, 3] / X[:, 4]  # population per household
        X = np.delete(X, 4, 1)  # drop households - the raw count adds little once the ratios exist
        return X
lat_lng_cluster = KMeans(n_clusters=12)
lat_lng_cluster.fit(X_train.loc[:,["longitude","latitude"]])
predictions = lat_lng_cluster.predict(X_train.loc[:,["longitude","latitude"]])
lat_lng_ohe = OneHotEncoder(sparse_output=False)
lat_lng_ohe.fit(predictions.reshape(-1,1))
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1,layout="constrained")
s = ax.scatter(X_train["longitude"],X_train["latitude"],alpha=0.1,cmap=plt.cm.hsv,c=predictions)
fig.colorbar(s,cmap=plt.cm.hsv)
ax.set_title("Clustering Based on Latitude/Longitude")
plt.show()
class LatLngPreprocessor(BaseEstimator, TransformerMixin):
    """
    Expects to see only longitude and latitude in the X array.
    """
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        predictions = lat_lng_cluster.predict(X)
        X = lat_lng_ohe.transform(predictions.reshape(-1, 1))
        return X
cat_pipeline = Pipeline(
steps=[
("impute",ImputeCategoricalMode(mode="<1H OCEAN")),
("ohe",OneHotEncoder(sparse_output=False))
]
)
num_pipeline = Pipeline(
steps=[
("impute",KNNImputer(n_neighbors=5,weights="distance")),
("colin",HandleMultiCollinearity()),
("scale",StandardScaler())
]
)
lat_lng_pipeline = Pipeline(
steps=[
("impute",KNNImputer(n_neighbors=5,weights="distance")),
("compute",LatLngPreprocessor())
]
)
regressor = ColumnTransformer(
transformers=[
("lat_lng",lat_lng_pipeline,["longitude","latitude"]),
("cat",cat_pipeline,["ocean_proximity"]),
("scale",num_pipeline,make_column_selector(dtype_include=np.float64))
],
remainder="passthrough"
)
out = regressor.fit_transform(X_train)
dfOut = pd.DataFrame(out)
dfOut.std()
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor, LassoLars
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
verbose=2
n_jobs=-1
cv=5
refit=True
max_iter_linear = 100000
lin_reg = LinearRegression(n_jobs=-1)
lin_reg.fit(out,y_train)
"""
- Lasso = Linear model trained with L1 prior as regularizer. Technically, the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio = 1.0.
- Alpha is the constant that multiplies the regularization term. Alpha must be a non-negative float. alpha=0 is equivalent to an ordinary LinearRegression.
- selection='random' often leads to significantly faster convergence
"""
lasso_param_grid = [
{"alpha": [1, 10, 100]}
]
lasso = GridSearchCV(Lasso(max_iter=max_iter_linear,selection='random'),param_grid=lasso_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
lasso.fit(out,y_train)
"""
- Ridge is linear least squares with L2 regularization
- Alpha is the constant that multiplies the L2 regularization term
"""
ridge_param_grid = [
{"alpha": [1, 10, 100]}
]
ridge = GridSearchCV(Ridge(max_iter=max_iter_linear),param_grid=ridge_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
ridge.fit(out,y_train)
"""
- ElasticNet is linear regression with combined L1 and L2 priors as regularizers
- Alpha is a constant that multiplies the penalty terms
- l1_ratio should be between 0 and 1: 1 = L1 penalty only, 0 = L2 penalty only
- selection='random' often leads to significantly faster convergence
"""
elastic_param_grid = [
{"alpha": [0.5, 1, 10, 100], "l1_ratio": [0.1, 0.25, 0.5, 0.75, 0.9]}
]
elastic = GridSearchCV(ElasticNet(selection="random"),param_grid=elastic_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
elastic.fit(out,y_train)
"""
- Epsilon-Support Vector Regression. The free parameters in the model are C and epsilon.
- C is the regularization parameter. The strength of the regularization is inversely proportional to C.
- Epsilon specifies the epsilon-tube within which no penalty is associated in the training loss function.
"""
svm_param_grid = [
{ 'C': [ 1000000 ], 'epsilon': [10, 1, 0.1]},
]
svm = GridSearchCV(SVR(max_iter=-1,kernel="linear"),param_grid=svm_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
svm.fit(out,y_train)
"""
- Lasso Model implemented using Least Angle Regression (LARS)
- Linear model trained with L1 prior as regularizer
"""
lars_lass_param_grid = [
{ "alpha": [1, 10, 100], "eps": np.linspace(np.finfo(float).eps,np.finfo(float).eps*10,3) }
]
lars_lass_reg = GridSearchCV(LassoLars(max_iter=max_iter_linear),param_grid=lars_lass_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
lars_lass_reg.fit(out,y_train)
"""
- Linear model fitted by minimizing a regularized empirical loss with SGD
- SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (i.e. the learning rate)
- The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both
- This implementation works with data represented as dense numpy arrays of floating point values for the features.
"""
sgd_reg_param_grid = [
{"loss": ["sqaured_error", "huber"], "penalty": ["l2", "elasticnet"], "alpha": [0.000001, 0.00001, 0.001], "early_stopping": [True, False]}
]
sgd_reg = GridSearchCV(SGDRegressor(),param_grid=sgd_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
sgd_reg.fit(out,y_train)
"""
- Regression based on k-nearest neighbors.
- The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set
"""
k_neigh_param_grid = [
{"n_neighbors": [3, 5, 10, 50], "weights": ["uniform","distance"], "algorithm": ["auto", "ball_tree", "kd_tree"]}
]
k_neigh = GridSearchCV(KNeighborsRegressor(n_jobs=-1),param_grid=k_neigh_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
k_neigh.fit(out,y_train)
"""
- Regression based on neighbors within a fixed radius.
- The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.
"""
r_neigh_param_grid = [
{"radius": [0.1, 1, 10, 100], "weights": ["uniform","distance"], "algorithm": ["auto", "ball_tree", "kd_tree"]}
]
r_neigh = GridSearchCV(RadiusNeighborsRegressor(n_jobs=1),param_grid=r_neigh_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
"""
- Gaussian process regression (GPR).
"""
gp_rep = GaussianProcessRegressor(normalize_y=True)
r_neigh.fit(out,y_train)
gp_rep.fit(out,y_train)
"""
- PLS Regression. PLS regression is also known as PLS2 or PLS1, depending on the number of targets.
- n_components = Number of components to keep.
- scale = Whether to scale X and Y
"""
pls_reg_param_grid = [
{"n_components": [2, 10, 15], "scale": [True, False]}
]
pls_reg = GridSearchCV(PLSRegression(max_iter=max_iter_linear),param_grid=pls_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
pls_reg.fit(out,y_train)
"""
- Decision Tree Regressor
- Decision Trees are non-parametric supervised learning models used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation
- criterion is the function used to measure the quality of a split
- min_samples_leaf is the minimum number of samples required to be at a leaf node
- This may have the effect of smoothing the model, especially in regression
- max_depth is the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure and until all leaves contain less than min_samples_split samples
"""
dt_reg_param_grid = [
{"max_depth": [5, 10, 50], "min_samples_leaf": [1, 10, 100], "criterion": ["squared_error", "absolute_error", "poisson"], "splitter": ["best", "random"]}
]
dt_reg = GridSearchCV(DecisionTreeRegressor(),param_grid=dt_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
dt_reg.fit(out,y_train)
"""
- A Random Forest Regressor
- A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting
- Trees in the forest use the best-split strategy, i.e. splitter="best" is passed to the underlying DecisionTreeRegressor
"""
rand_forest_param_grid = [
{ "min_samples_leaf": [1,5] }
]
rand_forest = GridSearchCV(RandomForestRegressor(verbose=2,n_estimators=100,criterion="squared_error",max_depth=100),param_grid=rand_forest_param_grid,verbose=3,n_jobs=n_jobs,cv=cv,refit=refit)
rand_forest.fit(out,y_train)
"""
- Gradient Boosting for Regression
- This estimator builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage, a regression tree is fit on the negative gradient of the given loss function
- Loss function is the loss function to be optimized
- "subsample": The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
"""
gb_reg_param_grid = [
{ "max_depth": [10,100]}
]
gb_reg = GridSearchCV(GradientBoostingRegressor(loss="squared_error",verbose=3,subsample=1),param_grid=gb_reg_param_grid,verbose=3,n_jobs=n_jobs,cv=cv,refit=refit)
gb_reg.fit(out,y_train)
predictors = [
{ "pred": lin_reg, "name": "Linear Regression", "color": 'r'},
{ "pred": lasso, "name": "Lasso Regression", "color": 'g'},
{ "pred": ridge, "name": "Ridge Regression", "color": 'b'},
{ "pred": elastic, "name": "ElasticNet", "color": 'y'},
{ "pred": svm, "name": "SVM", "color": 'pink'},
{ "pred": lars_lass_reg, "name": "LARS Lasso", "color": 'brown'},
{ "pred": sgd_reg, "name": "SGD Regressor", "color": 'magenta'},
{ "pred": k_neigh, "name": "KNeighbors Regressor", "color": '#6495ED'},
{ "pred": r_neigh, "name": "RadiusNeighbors Regressor", "color": '#556B2F'},
{ "pred": gp_rep, "name": "GradientProcess Regressor", "color": '#5F9EA0'},
{ "pred": pls_reg, "name": "PLSRegression", "color": '#2F4F4F'},
{ "pred": dt_reg, "name": "Decison Tree Regressor", "color": '#66CDAA'},
{ "pred": rand_forest, "name": "Random Forest Regressor", "color": '#DAA520'},
{ "pred": gb_reg, "name": "Gradient Boosting Regressor", "color": '#A0522D'},
]
import matplotlib.pyplot as plt
fig, ax = plt.subplots(4,4,layout="constrained",figsize=(16,16))
ax[0,0].plot(y_test,y_test,c="k")
ax[0,0].set_title("Real Data")
iter = 1
for pred in predictors:
    if iter >= 12:
        Ax = ax[3, iter - 12]
    elif iter >= 8:
        Ax = ax[2, iter - 8]
    elif iter >= 4:
        Ax = ax[1, iter - 4]
    else:
        Ax = ax[0, iter]
    # transform (not fit_transform) so the test set is prepared with statistics learned on the training set
    y_pred = pred["pred"].predict(regressor.transform(X_test))
    Ax.plot(y_test, y_test, c="k", label="Real Data")
    Ax.scatter(y_test, y_pred, c=pred["color"], marker="+", label=pred["name"])
    Ax.legend()
    Ax.set_title(pred["name"])
    iter += 1
ax[3,3].remove()
plt.show()
svm.best_estimator_
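To put a number on the visual comparison above, here is a minimal sketch that reports the test-set RMSE for every fitted predictor, reusing the preprocessing already fitted on the training set:
from sklearn.metrics import mean_squared_error

X_test_prepared = regressor.transform(X_test)  # reuse the preprocessing fitted on the training set
for pred in predictors:
    y_pred = pred["pred"].predict(X_test_prepared)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'{pred["name"]}: RMSE = {rmse:,.0f}')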
Notes on the Data / Problems Below
- Total bedrooms is missing some values that need to be imputed - probably best to impute them using an iterative or KNN imputer
- As can be seen from the correlation matrix, the population, total rooms, and total bedrooms are heavily correlated. Multicollinearity is particularly undesirable because it destabilizes the coefficient estimates of linear regression models. If using a linear regression model, you should try to remove this multicollinearity (see the VIF sketch after this list).
- I am not going to use an ordinal encoder for the ocean_proximity attribute since I am fairly sure that one-hot encoding will work better and there are not many classes for that feature
- For the latitude and longitude, I am looking into ways to properly handle that data
- Some ways to handle latitude and longitude:
- Choose a model that does not require normalization (not an option here, since the exercise requires an SVM)
- Perform reverse geocoding
- Convert the coordinates into zones using clustering - this is the approach I take below
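As a quick way to quantify the multicollinearity noted above, here is a minimal sketch using variance inflation factors; it assumes statsmodels is available, and the column list simply mirrors the correlated attributes mentioned in these notes.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

num_cols = ["total_rooms", "total_bedrooms", "population", "households", "median_income"]
num_data = sm.add_constant(X_train[num_cols].dropna())
for i, col in enumerate(num_cols, start=1):  # index 0 is the added constant
    vif = variance_inflation_factor(num_data.values, i)
    print(f"{col}: VIF = {vif:.1f}")  # values well above ~10 indicate strong multicollinearity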
Using DBSCAN For Latitude and Longitude Clustering
- DBSCAN Reference
- I decided to cluster the data using different eps and min_samples values until I got something that looked right - see the clustering chart (a sketch of this is shown below)
- I will then perform one-hot encoding (since I only have 19 unique clusters)
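The pipeline above ended up using KMeans, but here is a minimal sketch of the DBSCAN approach described in these notes; the eps and min_samples values are placeholders, not the ones actually tuned.
from sklearn.cluster import DBSCAN

coords = X_train[["longitude", "latitude"]].to_numpy()
db = DBSCAN(eps=0.1, min_samples=50).fit(coords)  # placeholder hyperparameters
labels = db.labels_  # -1 marks noise points
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))

# One-hot encode the cluster labels (noise gets its own column here)
zone_ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
zones = zone_ohe.fit_transform(labels.reshape(-1, 1))
Note that DBSCAN has no predict method for new samples, which is one reason KMeans is more convenient inside the LatLngPreprocessor above.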
Question 1
Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don’t worry about what these hyperparameters mean for now. How does the best SVR predictor perform?
Support Vector Machines are a set of supervised learning methods for classification, regression, and outlier detection. Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data.
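Here is a sketch of the grid search the question asks for, covering both a linear and an RBF kernel; the specific C and gamma values are illustrative guesses, and out / y_train are the prepared arrays from earlier in this notebook.
from sklearn.metrics import mean_squared_error

svr_param_grid = [
    {"kernel": ["linear"], "C": [1000, 10000, 100000]},
    {"kernel": ["rbf"], "C": [1000, 10000, 100000], "gamma": ["scale", 0.1, 1.0]},
]
svr_search = GridSearchCV(SVR(), param_grid=svr_param_grid, cv=5, n_jobs=-1,
                          scoring="neg_root_mean_squared_error", verbose=2)
svr_search.fit(out, y_train)
print(svr_search.best_params_)
print("CV RMSE:", -svr_search.best_score_)

# Evaluate the best SVR on the held-out test set (transform, not fit_transform)
y_pred = svr_search.best_estimator_.predict(regressor.transform(X_test))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))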
Question 2
Try replacing GridSearchCV with RandomizedSearchCV.
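A minimal sketch of swapping GridSearchCV for RandomizedSearchCV on the same SVR; the distributions and n_iter are illustrative choices, not tuned values.
from scipy.stats import loguniform, uniform

svr_distributions = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e2, 1e6),        # sample C on a log scale
    "gamma": loguniform(1e-3, 1e0),   # ignored when kernel="linear"
    "epsilon": uniform(0.1, 10),
}
svr_rnd_search = RandomizedSearchCV(SVR(), param_distributions=svr_distributions, n_iter=20,
                                    cv=5, n_jobs=-1, verbose=2, random_state=42,
                                    scoring="neg_root_mean_squared_error")
svr_rnd_search.fit(out, y_train)
print(svr_rnd_search.best_params_, -svr_rnd_search.best_score_)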
Question 3
Try adding a transformer in the preparation pipeline to select only the most important attributes.
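One way to do this, sketched below, is a small custom transformer that keeps the k features with the highest random forest feature importances; TopFeatureSelector and k=8 are a name and value I am introducing for illustration.
class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k=8):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        # indices of the k largest importances
        self.top_k_idx_ = np.argsort(self.feature_importances)[-self.k:]
        return self

    def transform(self, X, y=None):
        return X[:, self.top_k_idx_]

importances = rand_forest.best_estimator_.feature_importances_
select_pipeline = Pipeline(steps=[
    ("prep", regressor),                              # the ColumnTransformer defined earlier
    ("select", TopFeatureSelector(importances, k=8)),
])
X_train_top = select_pipeline.fit_transform(X_train)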
Question 4
Try creating a single pipeline that does the full data preparation plus the final prediction.
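A sketch of a single pipeline chaining the preparation ColumnTransformer (named regressor above) with a final estimator; using the best SVR from the earlier grid search as the predictor is an assumption about which model to put at the end.
full_pipeline = Pipeline(steps=[
    ("prep", regressor),               # data preparation
    ("svr", svm.best_estimator_),      # best SVR found by the earlier grid search
])
full_pipeline.fit(X_train, y_train)
some_houses = X_test.iloc[:5]
print(full_pipeline.predict(some_houses))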
Question 5
Automatically explore some preparation options using GridSearchCV.
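A sketch of tuning preparation options together with model options by reaching into the pipeline with the step__param syntax; full_pipeline is the one sketched under Question 4, and the specific options searched (the KNN imputer's settings and the SVR's C) are illustrative.
prep_param_grid = [{
    # ColumnTransformer "prep" -> numeric pipeline "scale" -> KNNImputer "impute"
    "prep__scale__impute__n_neighbors": [3, 5, 10],
    "prep__scale__impute__weights": ["uniform", "distance"],
    "svr__C": [10000, 100000],
}]
prep_search = GridSearchCV(full_pipeline, param_grid=prep_param_grid, cv=5, n_jobs=-1,
                           verbose=2, scoring="neg_root_mean_squared_error")
prep_search.fit(X_train, y_train)
print(prep_search.best_params_)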