End-to-End Machine Learning Project Exercises Answers

I don't think these answers are the best. I had a hard time reviewing feature engineering, but I think I have learned it pretty well now. I trained a lot of different models on the California Housing dataset because I was getting frustrated by how bad the SVM model originally was.


Using this chapter's dataset

from sklearn.datasets import fetch_openml
import pandas as pd 
import os 
housing = pd.read_csv(os.path.join(os.getcwd(),'..','housing.csv'))

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY


          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081
std        2.003532      2.135952           12.585558   2181.615252
min     -124.350000     32.540000            1.000000      2.000000
25%     -121.800000     33.930000           18.000000   1447.750000
50%     -118.490000     34.260000           29.000000   2127.000000
75%     -118.010000     37.710000           37.000000   3148.000000
max     -114.310000     41.950000           52.000000  39320.000000

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000
mean       537.870553   1425.476744    499.539680       3.870671
std        421.385070   1132.462122    382.329753       1.899822
min          1.000000      3.000000      1.000000       0.499900
25%        296.000000    787.000000    280.000000       2.563400
50%        435.000000   1166.000000    409.000000       3.534800
75%        647.000000   1725.000000    605.000000       4.743250
max       6445.000000  35682.000000   6082.000000      15.000100

       median_house_value
count        20640.000000
mean        206855.816909
std         115395.615874
min          14999.000000
25%         119600.000000
50%         179700.000000
75%         264725.000000
max         500001.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB



<1H OCEAN 9136





Name: count, dtype: int64

corr = housing.iloc[:,:-1].corr()

<pandas.io.formats.style.Styler at 0x21208a5cd10>

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer, KNNImputer

randint = np.random.randint(100)

target = housing["median_house_value"].copy()
data = housing.drop(columns="median_house_value").copy()

X_train, X_test, y_train, y_test = train_test_split(data,target,test_size=0.2,random_state=randint) 

def transform_to_numpy(X):
    """
    I prefer to work with NumPy arrays rather than DataFrames. Working with DataFrames can cause issues.
    """
    if not isinstance(X,np.ndarray):
        X = X.to_numpy()
    return X

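# Custom class for imputing categorical columns
class ImputeCategoricalMode(BaseEstimator,TransformerMixin):
    """
    This custom transformer imputes missing values in categorical (object) columns.
    Pass `mode` to set the fill value explicitly; otherwise fit() learns the most frequent value of each column.
    """
    def __init__(self,mode=None):
        # `mode != np.nan` is always True (NaN never equals itself), so None marks "not set"
        self.mode = mode
    def fit(self, X, y=None):
        if self.mode is not None:
            self.mode_ = self.mode  # keep the value supplied by the caller
            return self
        modes = []
        for col in X.select_dtypes(include="object"):
            modes.append(X[col].value_counts().idxmax())  # most frequent category
        # A single column gets a scalar fill value; several columns get one value each
        self.mode_ = modes[0] if len(modes)==1 else modes
        return self
    def transform(self, X, y=None):
        X = X.copy()  # do not modify the caller's DataFrame
        for i, col in enumerate(X.select_dtypes(include="object")):
            fill = self.mode_[i] if isinstance(self.mode_,list) else self.mode_
            X[col] = X[col].fillna(fill)
        return X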

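class HandleMultiCollinearity(BaseEstimator,TransformerMixin):
    """
    Expected column indices in X:
    total_rooms: 1
    total_bedrooms: 2
    population: 3
    households: 4
    median_income: 5
    """
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X = X.copy()  # avoid modifying the input array in place
        # Parentheses are needed here: without them only total_bedrooms is divided by households
        X[:,1] = (X[:,1]+X[:,2]) / X[:,4] # Rooms Per Household
        X[:, 2] = X[:, 3]  # keep the raw population in the slot that held total_bedrooms
        X[:, 3] = X[:, 3] / X[:, 4]  # Population per household 
        X = np.delete(X,4,1) # Delete the number of households - this doesn't tell you much without data I think
        return X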

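lat_lng_cluster = KMeans(n_clusters=12)
# KMeans must be fitted before predict(); fit on the training coordinates only
lat_lng_cluster.fit(X_train.loc[:,["longitude","latitude"]])
predictions = lat_lng_cluster.predict(X_train.loc[:,["longitude","latitude"]])
lat_lng_ohe = OneHotEncoder(sparse_output=False)
# Fit the encoder on the cluster labels so LatLngPreprocessor can call transform() later
lat_lng_ohe.fit(predictions.reshape(-1,1))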
import matplotlib.pyplot as plt 
fig, ax = plt.subplots(1,1,layout="constrained")
s = ax.scatter(X_train["longitude"],X_train["latitude"],alpha=0.1,cmap=plt.cm.hsv,c=predictions)
ax.set_title("Clustering Based on Latitude/Longitude")

class LatLngPreprocessor(BaseEstimator,TransformerMixin):
    """
    Expects to see only longitude and latitude in the X array.
    """
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        # Map each coordinate pair to its cluster, then one-hot encode the cluster label
        predictions = lat_lng_cluster.predict(X)
        X = lat_lng_ohe.transform(predictions.reshape(-1,1))
        return X

cat_pipeline = Pipeline(
        ("impute",ImputeCategoricalMode(mode="<1H OCEAN")),

num_pipeline = Pipeline(

lat_lng_pipeline = Pipeline(

regressor = ColumnTransformer(

out = regressor.fit_transform(X_train)
dfOut = pd.DataFrame(out)

<Figure size 640x480 with 2 Axes>

c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but KMeans was fitted with feature names


0     0.451402
1     0.389156
2     0.248206
3     0.122115
4     0.216116
5     0.207172
6     0.099466
7     0.354101
8     0.220731
9     0.215103
10    0.268471
11    0.135516
12    0.496451
13    0.466844
14    0.007782
15    0.313854
16    0.333929
17    1.000030
18    1.000030
19    1.000030
20    1.000030
21    1.000030
22    1.000030
23    1.000030
dtype: float64

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor, LassoLars
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

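max_iter_linear = 100000
# Shared GridSearchCV settings; the values match the GridSearchCV(...) reprs printed in the outputs
verbose = 2
n_jobs = -1
cv = 5
refit = True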

lin_reg = LinearRegression(n_jobs=-1)


- Lasso = linear model trained with an L1 prior as regularizer. Technically, the Lasso model optimizes the same objective function as the Elastic Net with l1_ratio=1.0.
- alpha is the constant that multiplies the regularization term. It must be a non-negative float; alpha=0 is equivalent to ordinary least squares (LinearRegression).
- selection='random' often leads to significantly faster convergence
lasso_param_grid = [
    {"alpha": [1, 10, 100]}
]
lasso = GridSearchCV(Lasso(max_iter=max_iter_linear,selection='random'),param_grid=lasso_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 3 candidates, totalling 15 fits

GridSearchCV(cv=5, estimator=Lasso(max_iter=100000, selection='random'),

n_jobs=-1, param_grid=[{'alpha': [1, 10, 100]}], verbose=2)
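
The relationship between Lasso and the Elastic Net mentioned in the notes for this model can be checked directly. The following is a minimal sketch on synthetic data (not the housing data); the coefficient vector and `alpha=0.1` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem; the true coefficients are made up for the demo.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Same alpha, and l1_ratio=1.0 so the Elastic Net penalty is pure L1.
lasso_fit = Lasso(alpha=0.1).fit(X, y)
enet_fit = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)

same_coefficients = np.allclose(lasso_fit.coef_, enet_fit.coef_)
print(same_coefficients)
```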

- Ridge is linear least squares with L2 regularization
- alpha is the constant that multiplies the L2 regularization term
ridge_param_grid = [
    {"alpha": [1, 10, 100]}
]
ridge = GridSearchCV(Ridge(max_iter=max_iter_linear),param_grid=ridge_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 3 candidates, totalling 15 fits

GridSearchCV(cv=5, estimator=Ridge(max_iter=100000), n_jobs=-1,

param_grid=[{'alpha': [1, 10, 100]}], verbose=2)

- ElasticNet is linear regression with combined L1 and L2 priors as regularizers
- alpha is a constant that multiplies the penalty terms
- l1_ratio should be between 0 and 1: l1_ratio=1 is a pure L1 penalty, l1_ratio=0 is a pure L2 penalty
- selection='random' often leads to significantly faster convergence
elastic_param_grid = [
    {"alpha": [0.5, 1, 10, 100], "l1_ratio": [0.1, 0.25, 0.5, 0.75, 0.9]}
]
elastic = GridSearchCV(ElasticNet(selection="random"),param_grid=elastic_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 20 candidates, totalling 100 fits

GridSearchCV(cv=5, estimator=ElasticNet(selection='random'), n_jobs=-1,

param_grid=[{'alpha': [0.5, 1, 10, 100],

'l1_ratio': [0.1, 0.25, 0.5, 0.75, 0.9]}],


- Epsilon-Support Vector Regression. The free parameters in the model are C and epsilon
- C is the regularization parameter. The strength of the regularization is inversely proportional to C
- epsilon is the epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon of the actual value
svm_param_grid = [
    { 'C': [ 1000000 ], 'epsilon': [10, 1, 0.1]},
]
svm = GridSearchCV(SVR(max_iter=-1,kernel="linear"),param_grid=svm_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 3 candidates, totalling 15 fits

GridSearchCV(cv=5, estimator=SVR(kernel='linear'), n_jobs=-1,

param_grid=[{'C': [1000000], 'epsilon': [10, 1, 0.1]}], verbose=2)

- Lasso Model implemented using Least Angle Regression (LARS)
- Linear model trained with L1 prior as regularizer
lars_lass_param_grid = [
    { "alpha": [1, 10, 100], "eps": np.linspace(np.finfo(float).eps,np.finfo(float).eps*10,3) }
]
lars_lass_reg = GridSearchCV(LassoLars(max_iter=max_iter_linear),param_grid=lars_lass_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 9 candidates, totalling 45 fits

GridSearchCV(cv=5, estimator=LassoLars(max_iter=100000), n_jobs=-1,

param_grid=[{'alpha': [1, 10, 100],

'eps': array([2.22044605e-16, 1.22124533e-15, 2.22044605e-15])}],


- Linear model fitted by minimizing a regularized empirical loss with SGD
- SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule
- The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both
- This implementation works with data represented as dense numpy arrays of floating point values for the features.
sgd_reg_param_grid = [
    # "squared_error" was misspelled as "sqaured_error", which SGDRegressor rejects as an invalid loss
    {"loss": ["squared_error", "huber"], "penalty": ["l2", "elasticnet"], "alpha": [0.000001, 0.00001, 0.001], "early_stopping": [True, False]}
]
sgd_reg = GridSearchCV(SGDRegressor(),param_grid=sgd_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 24 candidates, totalling 120 fits

GridSearchCV(cv=5, estimator=SGDRegressor(), n_jobs=-1,

param_grid=[{'alpha': [1e-06, 1e-05, 0.001],

'early_stopping': [True, False],

'loss': ['sqaured_error', 'huber'],

'penalty': ['l2', 'elasticnet']}],


- Regression based on k-nearest neighbors.
- The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set
k_neigh_param_grid = [
    {"n_neighbors": [3, 5, 10, 50], "weights": ["uniform","distance"], "algorithm": ["auto", "ball_tree", "kd_tree"]}
]
k_neigh = GridSearchCV(KNeighborsRegressor(n_jobs=-1),param_grid=k_neigh_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 24 candidates, totalling 120 fits

GridSearchCV(cv=5, estimator=KNeighborsRegressor(n_jobs=-1), n_jobs=-1,

param_grid=[{'algorithm': ['auto', 'ball_tree', 'kd_tree'],

'n_neighbors': [3, 5, 10, 50],

'weights': ['uniform', 'distance']}],


- Regression based on neighbors within a fixed radius.
- The target is predicted by local interpolation of the targets associated with the neighbors that fall within the radius in the training set.
r_neigh_param_grid = [
    {"radius": [0.1, 1, 10, 100], "weights": ["uniform","distance"], "algorithm": ["auto", "ball_tree", "kd_tree"]}
]
r_neigh = GridSearchCV(RadiusNeighborsRegressor(n_jobs=1),param_grid=r_neigh_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
- Gaussian process regression (GPR).
gp_rep = GaussianProcessRegressor(normalize_y=True)

Fitting 5 folds for each of 24 candidates, totalling 120 fits

c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of the test scores are non-finite: [ nan nan nan nan nan nan
-0.00045045 0.12339411 nan nan nan nan
nan nan -0.00045045 0.12339411 nan nan
nan nan nan nan -0.00045045 0.12339411]


- PLS Regression. PLS regression is also known as PLS2 or PLS1, depending on the number of targets.
- n_components = number of components to keep.
- scale = whether to scale X and Y
pls_reg_param_grid = [
    {"n_components": [2, 10, 15], "scale": [True, False]}
]
pls_reg = GridSearchCV(PLSRegression(max_iter=max_iter_linear),param_grid=pls_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 6 candidates, totalling 30 fits

GridSearchCV(cv=5, estimator=PLSRegression(max_iter=100000), n_jobs=-1,

param_grid=[{'n_components': [2, 10, 15], 'scale': [True, False]}],


- Decision Tree Regressor
- Decision trees are non-parametric supervised learning models used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation
- criterion is the function used to measure the quality of a split
- min_samples_leaf is the minimum number of samples required to be at a leaf node
- A larger min_samples_leaf may have the effect of smoothing the model, especially in regression
- max_depth is the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples
dt_reg_param_grid = [
    {"max_depth": [5, 10, 50], "min_samples_leaf": [1, 10, 100], "criterion": ["squared_error", "absolute_error", "poisson"], "splitter": ["best", "random"]}
]
dt_reg = GridSearchCV(DecisionTreeRegressor(),param_grid=dt_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)

Fitting 5 folds for each of 54 candidates, totalling 270 fits

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(), n_jobs=-1,

param_grid=[{'criterion': ['squared_error', 'absolute_error',


'max_depth': [5, 10, 50],

'min_samples_leaf': [1, 10, 100],

'splitter': ['best', 'random']}],


- A Random Forest Regressor
- A random forest is a meta estimator that fits a number of decision tree regressors on various subsamples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting
- Trees in the forest use the best-split strategy, which is equivalent to passing splitter="best" to the underlying DecisionTreeRegressor
rand_forest_param_grid = [
    { "min_samples_leaf": [1,5] }
]
rand_forest = GridSearchCV(RandomForestRegressor(verbose=2,n_estimators=100,criterion="squared_error",max_depth=100),param_grid=rand_forest_param_grid,verbose=3,n_jobs=n_jobs,cv=cv,refit=refit)

GridSearchCV(cv=5, estimator=GradientBoostingRegressor(subsample=1, verbose=3),

n_jobs=-1, param_grid=[{'max_depth': [10, 100]}], verbose=3)

predictors = [
    { "pred": lin_reg, "name": "Linear Regression", "color": 'r'},
    { "pred": lasso, "name": "Lasso Regression", "color": 'g'},
    { "pred": ridge, "name": "Ridge Regression", "color": 'b'},
    { "pred": elastic, "name": "ElasticNet", "color": 'y'},
    { "pred": svm, "name": "SVM", "color": 'pink'},
    { "pred": lars_lass_reg, "name": "LARS Lasso", "color": 'brown'},
    { "pred": sgd_reg, "name": "SGD Regressor", "color": 'magenta'},
    { "pred": k_neigh, "name": "KNeighbors Regressor", "color": '#6495ED'},
    { "pred": r_neigh, "name": "RadiusNeighbors Regressor", "color": '#556B2F'},
    { "pred": gp_rep, "name": "Gaussian Process Regressor", "color": '#5F9EA0'},
    { "pred": pls_reg, "name": "PLSRegression", "color": '#2F4F4F'},
    { "pred": dt_reg, "name": "Decision Tree Regressor", "color": '#66CDAA'},
    { "pred": rand_forest, "name": "Random Forest Regressor", "color": '#DAA520'},
    { "pred": gb_reg, "name": "Gradient Boosting Regressor", "color": '#A0522D'},
]
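import matplotlib.pyplot as plt
fig, ax = plt.subplots(4,4,layout="constrained",figsize=(16,16))
ax[0,0].set_title("Real Data")
# Only transform the test set: calling fit_transform here would refit the preprocessing on test data
X_test_prepared = regressor.transform(X_test)
# Start at 1 because ax[0,0] is reserved for the real data; avoid shadowing the builtin `iter`
for i, pred in enumerate(predictors, start=1):
    Ax = ax[i // 4, i % 4]  # same row/column layout as the original if/elif chain
    y_pred = pred["pred"].predict(X_test_prepared)
    # Predicted vs. actual values (the plotting call for y_pred is missing from the source; scatter is an assumption)
    Ax.scatter(y_test,y_pred,c=pred["color"],alpha=0.1,label=pred["name"])
    Ax.plot(y_test,y_test,c="k",label="Real Data")
    Ax.set_title(pred["name"])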


<Figure size 1600x1600 with 15 Axes>


SVR(C=1000000, gamma=0.1, kernel='linear')

Notes on the Data / Problems Below

  • Total bedrooms is missing some values that need to be imputed - probably best to impute them using IterativeImputer or KNNImputer
  • As can be seen from the correlation matrix, population, total rooms, and total bedrooms are heavily correlated. Multicollinearity is particularly undesirable because it impacts the interpretability of linear regression models. If using a linear regression model, you should try to get rid of this multicollinearity.
  • I am not going to use an ordinal encoder for the ocean_proximity attribute, since I expect one-hot encoding to work better and there are not many classes for that feature
  • For the latitude and longitude, I am looking into ways to properly handle that data
    • Some ways to handle latitude and longitude
      • Choose a model that does not require normalization (not an option, since I am required to use SVM here)
      • Perform reverse geocoding
      • Convert the coordinates into zones using clustering - I am probably going to do this.
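
The imputation and multicollinearity points in this list can be sketched on a tiny frame. This is a minimal illustration, assuming the first four rows of the `head()` output with one `total_bedrooms` value blanked out on purpose; the derived column names `rooms_per_household` and `population_per_household` are hypothetical, chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# First four rows of the housing data, with one total_bedrooms value removed for the demo.
toy = pd.DataFrame({
    "total_rooms":    [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "population":     [322.0, 2401.0, 496.0, 558.0],
    "households":     [126.0, 1138.0, 177.0, 219.0],
})

# Fill the gap with the average total_bedrooms of the 2 most similar rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Replace the heavily correlated raw counts with per-household ratios.
ratios = pd.DataFrame({
    "rooms_per_household": filled["total_rooms"] / filled["households"],
    "population_per_household": filled["population"] / filled["households"],
})
print(filled.isna().sum().sum())
print(ratios.round(2))
```

With only four rows the neighbors are not very meaningful; on the full dataset it would probably help to scale the columns before imputing, since KNNImputer relies on Euclidean distances.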

Using DBSCAN For Latitude and Longitude Clustering

  • DBSCAN Reference
  • I decided to cluster the data using different eps and min_samples values until I got something that looked right - see clustering chart
  • I will then perform One Hot Encoding (Since I only have 19 unique clusters)

Question 1

Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don’t worry about what these hyperparameters mean for now. How does the best SVR predictor perform?

Support Vector Machines are a set of supervised learning methods for classification, regression, and outlier detection. Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data.
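
As a rough sketch of what this question asks for, the snippet below puts a scaler in front of the SVR and searches over both kernels. It assumes synthetic data with features on very different scales instead of the prepared housing data, and the step names `scale` and `svr` and the grid values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data whose features live on very different scales.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [1.0, 100.0, 0.01]
y = 3.0 * X[:, 0] + 0.02 * X[:, 1] + 50.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

svr_pipeline = Pipeline([
    ("scale", StandardScaler()),  # SVR is not scale invariant
    ("svr", SVR()),
])
param_grid = [
    {"svr__kernel": ["linear"], "svr__C": [1.0, 10.0, 100.0]},
    {"svr__kernel": ["rbf"], "svr__C": [1.0, 10.0, 100.0], "svr__gamma": [0.01, 0.1, 1.0]},
]
search = GridSearchCV(svr_pipeline, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_rmse = -search.best_score_  # the scorer returns negative RMSE
print(search.best_params_, best_rmse)
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, so the validation folds never influence the scaling.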

Question 2

Try replacing GridSearchCV with RandomizedSearchCV.
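
One possible shape for this, as a sketch on synthetic data: the lists of candidate values become distributions, and `n_iter` controls how many combinations are sampled. The log-uniform ranges below are assumptions for the illustration, not tuned values.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data standing in for the prepared housing features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=150)

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Distributions instead of fixed lists: C and gamma are sampled on a log scale.
param_distributions = {
    "svr__C": loguniform(1e-1, 1e3),
    "svr__gamma": loguniform(1e-3, 1e0),
}
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=3,
                            scoring="neg_root_mean_squared_error", random_state=0)
search.fit(X, y)
print(search.best_params_)
```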

Question 3

Try adding a transformer in the preparation pipeline to select only the most important attributes.
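
One way to approach this, sketched under the assumption that feature importances from a random forest are an acceptable ranking: scikit-learn's `SelectFromModel` can keep only the top-ranked columns. The data is synthetic, and `max_features=2` is chosen only because two columns carry signal in this toy setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
# Only columns 0 and 3 carry signal; the other four are pure noise.
y = 5.0 * X[:, 0] - 4.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    max_features=2,
    threshold=-np.inf,  # select purely by rank, keeping the top max_features columns
)
pipe = Pipeline([("select", selector), ("model", LinearRegression())])
pipe.fit(X, y)

# Indices of the columns that survived the selection step.
kept = np.flatnonzero(pipe.named_steps["select"].get_support())
print(kept)
```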

Question 4

Try creating a single pipeline that does the full data preparation plus the final prediction.
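
A minimal sketch of such a pipeline, assuming a small synthetic frame that borrows three column names from the housing data. The step names, the explicit column lists, and the SVR settings are illustrative assumptions; this is not the preprocessing used earlier in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (values are random, not real).
rng = np.random.default_rng(7)
n = 120
df = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "total_bedrooms": rng.uniform(1.0, 6000.0, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "<1H OCEAN"], n),
})
df.loc[::10, "total_bedrooms"] = np.nan  # simulate missing values
target = 40000.0 * df["median_income"] + rng.normal(scale=5000.0, size=n)

# Numeric columns: impute, then scale. Categorical column: one-hot encode.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["median_income", "total_bedrooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

# Preparation and prediction in one estimator: fit() and predict() take the raw frame.
full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("svr", SVR(kernel="linear", C=1000.0)),
])
full_pipeline.fit(df, target)
predictions = full_pipeline.predict(df.head(3))
print(predictions.shape)
```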

Question 5

Automatically explore some preparation options using GridSearchCV.
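
Because preparation steps are ordinary pipeline parameters, they can go into the search grid next to the model's hyperparameters. The sketch below assumes synthetic data and swaps in `Ridge` for the final model to keep it fast; the step names and candidate values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the entries

pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", Ridge()),
])
param_grid = {
    "impute__strategy": ["mean", "median", "most_frequent"],       # preparation option
    "scale": [StandardScaler(), MinMaxScaler(), "passthrough"],    # swap or skip a whole step
    "model__alpha": [0.1, 1.0, 10.0],                              # model hyperparameter
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```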

