End-to-End Machine Learning Project Exercises Answers

These answers may not be the best; I had a hard time reviewing feature engineering, but I think I understand it well now. I trained a lot of different models on the California Housing dataset because I was getting frustrated by how poorly the SVM model originally performed.

Using this chapter's dataset

from sklearn.datasets import fetch_openml
import pandas as pd 
import os 
housing = pd.read_csv(os.path.join(os.getcwd(),'..','housing.csv'))
housing.head()
out[2]

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY

housing.describe()
out[3]

          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081
std        2.003532      2.135952           12.585558   2181.615252
min     -124.350000     32.540000            1.000000      2.000000
25%     -121.800000     33.930000           18.000000   1447.750000
50%     -118.490000     34.260000           29.000000   2127.000000
75%     -118.010000     37.710000           37.000000   3148.000000
max     -114.310000     41.950000           52.000000  39320.000000

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000
mean       537.870553   1425.476744    499.539680       3.870671
std        421.385070   1132.462122    382.329753       1.899822
min          1.000000      3.000000      1.000000       0.499900
25%        296.000000    787.000000    280.000000       2.563400
50%        435.000000   1166.000000    409.000000       3.534800
75%        647.000000   1725.000000    605.000000       4.743250
max       6445.000000  35682.000000   6082.000000      15.000100

       median_house_value
count        20640.000000
mean        206855.816909
std         115395.615874
min          14999.000000
25%         119600.000000
50%         179700.000000
75%         264725.000000
max         500001.000000

housing.info()
out[4]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

housing["ocean_proximity"].value_counts()
out[5]

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

corr = housing.iloc[:,:-1].corr()
corr.style.background_gradient(cmap="coolwarm")
out[6]

<pandas.io.formats.style.Styler at 0x21208a5cd10>
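
The Styler object above only renders inside a notebook. To see the same information in plain text, the target column of the correlation matrix can be printed directly (same corr as above):

print(corr["median_house_value"].sort_values(ascending=False))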

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer, KNNImputer

np.random.seed(42)
randint = np.random.randint(100)

target = housing["median_house_value"].copy()
data = housing.drop(columns="median_house_value").copy()

X_train, X_test, y_train, y_test = train_test_split(data,target,test_size=0.2,random_state=randint) 

def transform_to_numpy(X):
    """
    Convert the input to a NumPy array if it is not one already. I prefer to work with NumPy
    arrays rather than DataFrames here, since the transformers below index columns by position.
    """
    if not isinstance(X,np.ndarray):
        X = X.to_numpy()
    return X


# Custom Class for Imputing 
class ImputeCategoricalMode(BaseEstimator,TransformerMixin):
    """
    Custom transformer that imputes missing values in categorical (object) columns with the
    most frequent value. You can pass in a mode value to skip learning it from the data.
    """
    def __init__(self,mode=np.nan):
        self.mode = mode
    def fit(self, X, y=None):
        # If a mode was supplied explicitly, keep it; otherwise learn one per object column
        if isinstance(self.mode,(str,list)):
            return self
        modes = []
        for col in X.select_dtypes(include="object"):
            modes.append(X[col].value_counts().idxmax())
        self.mode = modes[0] if len(modes)==1 else modes
        return self
    def transform(self, X, y=None):
        X = X.copy()
        for i, col in enumerate(X.select_dtypes(include="object")):
            fill = self.mode[i] if isinstance(self.mode,list) else self.mode
            X[col] = X[col].fillna(value=fill)  # fillna returns a copy, so assign it back
        return X
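
A quick, hypothetical sanity check of the imputer on the training data before wiring it into the pipeline below:

cat_imputer = ImputeCategoricalMode()
cat_imputer.fit(X_train[["ocean_proximity"]])
print(cat_imputer.mode)  # most frequent category, i.e. '<1H OCEAN' for this dataset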

class HandleMultiCollinearity(BaseEstimator,TransformerMixin):
    """
    Replace the heavily correlated count columns with per-household ratios.
    Expected column indices in the incoming array:
    total_rooms: 1
    total_bedrooms: 2
    population: 3
    households: 4
    median_income: 5
    """
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        X[:,1] = (X[:,1]+X[:,2]) / X[:,4]  # rooms (including bedrooms) per household
        X[:,2] = X[:,3]                    # keep the raw population
        X[:,3] = X[:,3] / X[:,4]           # population per household
        X = np.delete(X,4,1)               # drop households - its information is now in the ratios
        return X

lat_lng_cluster = KMeans(n_clusters=12)
lat_lng_cluster.fit(X_train.loc[:,["longitude","latitude"]])
predictions = lat_lng_cluster.predict(X_train.loc[:,["longitude","latitude"]])
lat_lng_ohe = OneHotEncoder(sparse_output=False)
lat_lng_ohe.fit(predictions.reshape(-1,1))
import matplotlib.pyplot as plt 
fig, ax = plt.subplots(1,1,layout="constrained")
s = ax.scatter(X_train["longitude"],X_train["latitude"],alpha=0.1,cmap=plt.cm.hsv,c=predictions)
fig.colorbar(s,cmap=plt.cm.hsv)
ax.set_title("Clustering Based on Latitude/Longitude")
plt.show()

class LatLngPreprocessor(BaseEstimator,TransformerMixin):
    """
    Expects to only see the longitude and latitude columns in the X array
    """
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        predictions = lat_lng_cluster.predict(X)
        X = lat_lng_ohe.transform(predictions.reshape(-1,1))
        return X

cat_pipeline = Pipeline(
    steps=[
        ("impute",ImputeCategoricalMode(mode="<1H OCEAN")),
        ("ohe",OneHotEncoder(sparse_output=False))
    ]
)

num_pipeline = Pipeline(
    steps=[
        ("impute",KNNImputer(n_neighbors=5,weights="distance")),
        ("colin",HandleMultiCollinearity()),
        ("scale",StandardScaler())
    ]
)

lat_lng_pipeline = Pipeline(
    steps=[
        ("impute",KNNImputer(n_neighbors=5,weights="distance")),
        ("compute",LatLngPreprocessor())
    ]
)

regressor = ColumnTransformer(
    transformers=[
        ("lat_lng",lat_lng_pipeline,["longitude","latitude"]),
        ("cat",cat_pipeline,["ocean_proximity"]),
        ("scale",num_pipeline,make_column_selector(dtype_include=np.float64))
    ],
    remainder="passthrough"
)

out = regressor.fit_transform(X_train)
dfOut = pd.DataFrame(out)
out[7]
Jupyter Notebook Image

<Figure size 640x480 with 2 Axes>

c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but KMeans was fitted with feature names
warnings.warn(

dfOut.std()
out[8]

0     0.451402
1     0.389156
2     0.248206
3     0.122115
4     0.216116
5     0.207172
6     0.099466
7     0.354101
8     0.220731
9     0.215103
10    0.268471
11    0.135516
12    0.496451
13    0.466844
14    0.007782
15    0.313854
16    0.333929
17    1.000030
18    1.000030
19    1.000030
20    1.000030
21    1.000030
22    1.000030
23    1.000030
dtype: float64

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet, SGDRegressor, LassoLars
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor, RadiusNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


verbose=2
n_jobs=-1
cv=5
refit=True
max_iter_linear = 100000

lin_reg = LinearRegression(n_jobs=-1)
lin_reg.fit(out,y_train)
out[9]

LinearRegression(n_jobs=-1)


"""
- Lasso = Linear model trained with an L1 prior as regularizer. Technically, the Lasso model optimizes the same objective function as the Elastic Net with l1_ratio=1.0.
- alpha is the constant that multiplies the regularization term; it must be a non-negative float. alpha=0 is equivalent to ordinary LinearRegression.
- selection='random' often leads to significantly faster convergence
"""
lasso_param_grid = [
    {"alpha": [1, 10, 100]}
]
lasso = GridSearchCV(Lasso(max_iter=max_iter_linear,selection='random'),param_grid=lasso_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
lasso.fit(out,y_train)
out[10]

Fitting 5 folds for each of 3 candidates, totalling 15 fits

GridSearchCV(cv=5, estimator=Lasso(max_iter=100000, selection='random'),

n_jobs=-1, param_grid=[{'alpha': [1, 10, 100]}], verbose=2)


"""
- Ridge is linear least squares with L2 regularization
- alpha is the constant that multiplies the L2 regularization term
"""
ridge_param_grid = [
    {"alpha": [1, 10, 100]}
]
ridge = GridSearchCV(Ridge(max_iter=max_iter_linear),param_grid=ridge_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
ridge.fit(out,y_train)
out[11]

Fitting 5 folds for each of 3 candidates, totalling 15 fits

GridSearchCV(cv=5, estimator=Ridge(max_iter=100000), n_jobs=-1,

param_grid=[{'alpha': [1, 10, 100]}], verbose=2)


"""
- ElasticNet is linear regression with combined L1 and L2 priors as regularizers
- alpha is a constant that multiplies the penalty terms
- l1_ratio should be between 0 and 1: 1 = L1 penalty only, 0 = L2 penalty only
- selection='random' often leads to significantly faster convergence
"""
elastic_param_grid = [
    {"alpha": [0.5, 1, 10, 100], "l1_ratio": [0.1, 0.25, 0.5, 0.75, 0.9]}
]
elastic = GridSearchCV(ElasticNet(selection="random"),param_grid=elastic_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
elastic.fit(out,y_train)
out[12]

Fitting 5 folds for each of 20 candidates, totalling 100 fits

GridSearchCV(cv=5, estimator=ElasticNet(selection='random'), n_jobs=-1,

param_grid=[{'alpha': [0.5, 1, 10, 100],

'l1_ratio': [0.1, 0.25, 0.5, 0.75, 0.9]}],

verbose=2)


"""
- Epsilon-Support Vector Regression. The free parameters in the model are C and epsilon
- C is the regularization parameter. The strength of the regularization is inversely proportional to C
- epsilon in the epsilon-SVR model specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon of the actual value
"""
svm_param_grid = [
    { 'C': [ 1000000 ], 'epsilon': [10, 1, 0.1]},
]
svm = GridSearchCV(SVR(max_iter=-1,kernel="linear"),param_grid=svm_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
svm.fit(out,y_train)
out[13]

Fitting 5 folds for each of 3 candidates, totalling 15 fits

GridSearchCV(cv=5, estimator=SVR(kernel='linear'), n_jobs=-1,

param_grid=[{'C': [1000000], 'epsilon': [10, 1, 0.1]}], verbose=2)


"""
- Lasso Model implemented using Least Angle Regression (LARS)
- Linear model trained with L1 prior as regularizer
"""
lars_lass_param_grid = [
    { "alpha": [1, 10, 100], "eps": np.linspace(np.finfo(float).eps,np.finfo(float).eps*10,3) }
]
lars_lass_reg = GridSearchCV(LassoLars(max_iter=max_iter_linear),param_grid=lars_lass_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
lars_lass_reg.fit(out,y_train)
out[14]

Fitting 5 folds for each of 9 candidates, totalling 45 fits

GridSearchCV(cv=5, estimator=LassoLars(max_iter=100000), n_jobs=-1,

param_grid=[{'alpha': [1, 10, 100],

'eps': array([2.22044605e-16, 1.22124533e-15, 2.22044605e-15])}],

verbose=2)


"""
- Linear model fitted by minimizing a regularized empirical loss with SGD
- SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (learning rate)
- The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both
- This implementation works with data represented as dense numpy arrays of floating point values for the features.
"""
sgd_reg_param_grid = [
    {"loss": ["sqaured_error", "huber"], "penalty": ["l2", "elasticnet"], "alpha": [0.000001, 0.00001, 0.001], "early_stopping": [True, False]}
]
sgd_reg = GridSearchCV(SGDRegressor(),param_grid=sgd_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
sgd_reg.fit(out,y_train)
out[15]

Fitting 5 folds for each of 24 candidates, totalling 120 fits

c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:
60 fits failed out of a total of 120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
File "c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1467, in wrapper
estimator._validate_params()
File "c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
validate_parameter_constraints(
File "c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'loss' parameter of SGDRegressor must be a str among {'epsilon_insensitive', 'squared_error', 'squared_epsilon_insensitive', 'huber'}. Got 'sqaured_error' instead.

warnings.warn(some_fits_failed_message, FitFailedWarning)
c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of the test scores are non-finite: [ nan nan -3.21805984 -3.21805962 nan nan
-3.20469439 -3.20469347 nan nan -3.21805989 -3.2180601
nan nan -3.20475344 -3.20474389 nan nan
-3.21806264 -3.21806231 nan nan -3.20775984 -3.20756375]
warnings.warn(
c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\linear_model\_stochastic_gradient.py:1575: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(

GridSearchCV(cv=5, estimator=SGDRegressor(), n_jobs=-1,

param_grid=[{'alpha': [1e-06, 1e-05, 0.001],

'early_stopping': [True, False],

'loss': ['sqaured_error', 'huber'],

'penalty': ['l2', 'elasticnet']}],

verbose=2)


"""
- Regression based on k-nearest neighbors.
- The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set
"""
k_neigh_param_grid = [
    {"n_neighbors": [3, 5, 10, 50], "weights": ["uniform","distance"], "algorithm": ["auto", "ball_tree", "kd_tree"]}
]
k_neigh = GridSearchCV(KNeighborsRegressor(n_jobs=-1),param_grid=k_neigh_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
k_neigh.fit(out,y_train)
out[16]

Fitting 5 folds for each of 24 candidates, totalling 120 fits

GridSearchCV(cv=5, estimator=KNeighborsRegressor(n_jobs=-1), n_jobs=-1,

param_grid=[{'algorithm': ['auto', 'ball_tree', 'kd_tree'],

'n_neighbors': [3, 5, 10, 50],

'weights': ['uniform', 'distance']}],

verbose=2)


"""
- Regression based on neighbors within a fixed radius.
- The target is predicted by local interpolation of the targets associated with the neighbors within a fixed radius in the training set.
"""
r_neigh_param_grid = [
    {"radius": [0.1, 1, 10, 100], "weights": ["uniform","distance"], "algorithm": ["auto", "ball_tree", "kd_tree"]}
]
r_neigh = GridSearchCV(RadiusNeighborsRegressor(n_jobs=1),param_grid=r_neigh_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
"""
- Gaussian process regression (GPR).
"""
gp_rep = GaussianProcessRegressor(normalize_y=True)
r_neigh.fit(out,y_train)
gp_rep.fit(out,y_train)
out[17]

Fitting 5 folds for each of 24 candidates, totalling 120 fits

c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning: One or more of the test scores are non-finite: [ nan nan nan nan nan nan
-0.00045045 0.12339411 nan nan nan nan
nan nan -0.00045045 0.12339411 nan nan
nan nan nan nan -0.00045045 0.12339411]
warnings.warn(

GaussianProcessRegressor(normalize_y=True)

"""
- PLS Regression. PLS regression is also known as PLS2 or PLS1, depending on the number of targets.
- n_components = Number of components to keep.
- scale = Whether to scale X and Y
"""
pls_reg_param_grid = [
    {"n_components": [2, 10, 15], "scale": [True, False]}
]
pls_reg = GridSearchCV(PLSRegression(max_iter=max_iter_linear),param_grid=pls_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
pls_reg.fit(out,y_train)
out[18]

Fitting 5 folds for each of 6 candidates, totalling 30 fits

GridSearchCV(cv=5, estimator=PLSRegression(max_iter=100000), n_jobs=-1,

param_grid=[{'n_components': [2, 10, 15], 'scale': [True, False]}],

verbose=2)

"""
- Decision Tree Regressor
- Decision Trees are non-parametric supervised learning models used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation
- criterion  is a function to measure the quality of the split
- min_samples_leaf is the minimum number of samples required to be at a leaf node
- This may have the effect of smoothing the model, especially in regression
- max_depth is the maximum depth of the tree. If None, then nodes are expanded until all leaves are pure and until all leaves contain less than min_samples_split samples
"""
dt_reg_param_grid = [
    {"max_depth": [5, 10, 50], "min_samples_leaf": [1, 10, 100], "criterion": ["squared_error", "absolute_error", "poisson"], "splitter": ["best", "random"]}
]
dt_reg = GridSearchCV(DecisionTreeRegressor(),param_grid=dt_reg_param_grid,verbose=verbose,n_jobs=n_jobs,cv=cv,refit=refit)
dt_reg.fit(out,y_train)
out[19]

Fitting 5 folds for each of 54 candidates, totalling 270 fits

GridSearchCV(cv=5, estimator=DecisionTreeRegressor(), n_jobs=-1,

param_grid=[{'criterion': ['squared_error', 'absolute_error',

'poisson'],

'max_depth': [5, 10, 50],

'min_samples_leaf': [1, 10, 100],

'splitter': ['best', 'random']}],

verbose=2)


"""
- A Random Forest Regressor
- A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting
- Trees in the forest use the best-split strategy, i.e. splitter="best" is passed to the underlying DecisionTreeRegressor
"""
rand_forest_param_grid = [
    { "min_samples_leaf": [1,5] }
]
rand_forest = GridSearchCV(RandomForestRegressor(verbose=2,n_estimators=100,criterion="squared_error",max_depth=100),param_grid=rand_forest_param_grid,verbose=3,n_jobs=n_jobs,cv=cv,refit=refit)
rand_forest.fit(out,y_train)
out[20]

Fitting 5 folds for each of 2 candidates, totalling 10 fits
building tree 1 of 100
building tree 2 of 100
building tree 3 of 100
...
[Parallel(n_jobs=1)]: Done 40 tasks | elapsed: 4.5s
...
building tree 99 of 100
building tree 100 of 100

GridSearchCV(cv=5, estimator=RandomForestRegressor(max_depth=100, verbose=2),

n_jobs=-1, param_grid=[{'min_samples_leaf': [1, 5]}], verbose=3)


"""
- Gradient Boosting for Regression 
- This estimator builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage, a regression tree is fit on the negative gradient of the given loss function
- Loss function is the loss function to be optimized
- "subsample": The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
"""
gb_reg_param_grid = [
    { "max_depth": [10,100]}
]
gb_reg = GridSearchCV(GradientBoostingRegressor(loss="squared_error",verbose=3,subsample=1),param_grid=gb_reg_param_grid,verbose=3,n_jobs=n_jobs,cv=cv,refit=refit)
gb_reg.fit(out,y_train)
out[21]

Fitting 5 folds for each of 2 candidates, totalling 10 fits
Iter Train Loss Remaining Time
1 11220110853.0884 11.86s
2 9564105519.0168 11.20s
3 8199574792.3146 10.55s
...
98 275513677.8863 0.21s
99 274151875.5296 0.11s
100 272132525.0063 0.00s

GridSearchCV(cv=5, estimator=GradientBoostingRegressor(subsample=1, verbose=3),

n_jobs=-1, param_grid=[{'max_depth': [10, 100]}], verbose=3)

predictors = [
    { "pred": lin_reg, "name": "Linear Regression", "color": 'r'},
    { "pred": lasso, "name": "Lasso Regression", "color": 'g'},
    { "pred": ridge, "name": "Ridge Regression", "color": 'b'},
    { "pred": elastic, "name": "ElasticNet", "color": 'y'},
    { "pred": svm, "name": "SVM", "color": 'pink'},
    { "pred": lars_lass_reg, "name": "LARS Lasso", "color": 'brown'},
    { "pred": sgd_reg, "name": "SGD Regressor", "color": 'magenta'},
    { "pred": k_neigh, "name": "KNeighbors Regressor", "color": '#6495ED'},
    { "pred": r_neigh, "name": "RadiusNeighbors Regressor", "color": '#556B2F'},
    { "pred": gp_rep, "name": "GradientProcess Regressor", "color": '#5F9EA0'},
    { "pred": pls_reg, "name": "PLSRegression", "color": '#2F4F4F'},
    { "pred": dt_reg, "name": "Decison Tree Regressor", "color": '#66CDAA'},
    { "pred": rand_forest, "name": "Random Forest Regressor", "color": '#DAA520'},
    { "pred": gb_reg, "name": "Gradient Boosting Regressor", "color": '#A0522D'},
]
out[22]
import matplotlib.pyplot as plt
fig, ax = plt.subplots(4,4,layout="constrained",figsize=(16,16))
ax[0,0].plot(y_test,y_test,c="k")
ax[0,0].set_title("Real Data")
iter = 1
for pred in predictors:
    if iter>=12:
        Ax = ax[3,iter-12]
    elif iter >= 8:
        Ax = ax[2,iter-8]
    elif iter >=4:
        Ax = ax[1,iter-4]
    else:
        Ax = ax[0,iter]
    y_pred = pred["pred"].predict(regressor.fit_transform(X_test))
    Ax.plot(y_test,y_test,c="k",label="Real Data")
    Ax.scatter(y_test,y_pred,c=pred["color"],marker="+",label=pred["name"])
    Ax.legend()
    Ax.set_title(pred["name"])
    iter+=1
ax[3,3].remove()
plt.show()
out[23]

c:\Users\fmb20\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:493: UserWarning: X does not have valid feature names, but KMeans was fitted with feature names
warnings.warn(
(the same UserWarning is emitted once per predictor)
[Parallel(n_jobs=1)]: Done 40 tasks | elapsed: 0.0s

Jupyter Notebook Image

<Figure size 1600x1600 with 15 Axes>

svm.best_estimator_
out[24]

SVR(C=1000000, gamma=0.1, kernel='linear')
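
The notebook never reports a held-out metric for the best SVR, so here is a hedged sketch of how to check it, reusing the fitted regressor ColumnTransformer and the svm search from above (the resulting value is not reproduced here):

from sklearn.metrics import mean_squared_error

X_test_prepared = regressor.transform(X_test)  # transform only - do not refit the preprocessing on the test set
svm_rmse = np.sqrt(mean_squared_error(y_test, svm.best_estimator_.predict(X_test_prepared)))
print(f"SVR test RMSE: {svm_rmse:,.0f}")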

Notes on the Data / Problems Below

  • total_bedrooms is missing some values that need to be imputed - probably best to impute them with IterativeImputer or KNNImputer
  • As can be seen from the correlation matrix, population, total_rooms, and total_bedrooms are heavily correlated. Multicollinearity is particularly undesirable because it makes the coefficients of linear regression models unstable and hard to interpret. If using a linear regression model, you should try to get rid of this multicollinearity - see the sketch after this list.
  • I am not going to use an ordinal encoder for the ocean_proximity attribute since one-hot encoding should work better and there are not many classes for that feature
  • For latitude and longitude, I looked into ways to properly handle that data
    • Some ways to handle latitude and longitude:
      • Choose a model that does not require normalization (not an option here since the exercise requires an SVM)
      • Perform reverse geocoding
      • Convert the coordinates into zones using clustering - this is what I ended up doing
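
The first two notes can be sketched directly on the DataFrame. This is a minimal, hedged example (assuming the housing DataFrame loaded above): IterativeImputer is the alternative to the KNNImputer used in the pipeline, and the ratio features are one common way to reduce collinearity between the raw counts.

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - required to expose IterativeImputer
from sklearn.impute import IterativeImputer

count_cols = ["total_rooms", "total_bedrooms", "population", "households"]
housing_imputed = housing.copy()
# IterativeImputer models each feature from the others; KNNImputer (used in the pipeline) is the other option
housing_imputed[count_cols] = IterativeImputer(random_state=42).fit_transform(housing_imputed[count_cols])

# Ratio features tend to be much less collinear than the raw counts they are built from
housing_imputed["rooms_per_household"] = housing_imputed["total_rooms"] / housing_imputed["households"]
housing_imputed["bedrooms_per_room"] = housing_imputed["total_bedrooms"] / housing_imputed["total_rooms"]
housing_imputed["population_per_household"] = housing_imputed["population"] / housing_imputed["households"]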

Using DBSCAN For Latitude and Longitude Clustering

  • DBSCAN Reference
  • I clustered the data with different eps and min_samples values until the result looked right (with DBSCAN I ended up with 19 unique clusters) - see the clustering chart. The pipeline above ultimately uses KMeans with n_clusters=12 instead; a rough sketch of the DBSCAN search is below.
  • I then perform one-hot encoding on the cluster labels, since there are only a handful of unique clusters
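
A rough sketch of that eps/min_samples search; the values tried here are illustrative guesses, not the ones that produced the 19 clusters mentioned above.

from sklearn.cluster import DBSCAN

coords = housing[["longitude", "latitude"]].to_numpy()
for eps in (0.1, 0.2, 0.3):
    for min_samples in (20, 50, 100):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise points
        print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters")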

Question 1

Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don’t worry about what these hyperparameters mean for now. How does the best SVR predictor perform?

Support Vector Machines are a set of supervised learning methods for classification, regression, and outlier detection. Support Vector Machine algorithms are not scale invariant, so it is highly recommended to scale your data.
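
A sketch of the broader search Question 1 asks for (the grid used earlier only tried kernel="linear" with a single C). The C and gamma values below are illustrative assumptions; out and y_train are the preprocessed training data from above.

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

svr_param_grid = [
    {"kernel": ["linear"], "C": [10.0, 1000.0, 100000.0]},
    {"kernel": ["rbf"], "C": [10.0, 1000.0, 100000.0], "gamma": [0.01, 0.1, 1.0]},
]
svr_search = GridSearchCV(SVR(), param_grid=svr_param_grid,
                          scoring="neg_root_mean_squared_error", cv=3, n_jobs=-1, verbose=2)
svr_search.fit(out, y_train)
print(svr_search.best_params_, -svr_search.best_score_)  # best hyperparameters and their CV RMSE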

Question 2

Try replacing GridSearchCV with RandomizedSearchCV.
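
The same kind of search can be swapped over to RandomizedSearchCV with distributions instead of fixed grids; the distributions below are assumptions for illustration.

from scipy.stats import loguniform
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

svr_param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e1, 1e6),       # sampled log-uniformly rather than from a fixed list
    "gamma": loguniform(1e-3, 1e0),  # ignored when the linear kernel is drawn
}
rnd_search = RandomizedSearchCV(SVR(), svr_param_distributions, n_iter=20,
                                scoring="neg_root_mean_squared_error", cv=3,
                                random_state=42, n_jobs=-1, verbose=2)
rnd_search.fit(out, y_train)
print(rnd_search.best_params_)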

Question 3

Try adding a transformer in the preparation pipeline to select only the most important attributes.
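
One way to approach this is a small custom transformer, in the same style as the transformers above, that keeps only the k features a random forest considers most important. The name TopFeatureSelector and the default k are my own choices, not from the book.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, k=10):
        self.k = k
    def fit(self, X, y=None):
        forest = RandomForestRegressor(n_estimators=100, random_state=42)
        forest.fit(X, y)
        self.feature_indices_ = np.argsort(forest.feature_importances_)[-self.k:]  # indices of the k largest importances
        return self
    def transform(self, X, y=None):
        return X[:, self.feature_indices_]

# It would slot in right after the preparation ColumnTransformer, e.g.
# Pipeline([("prep", regressor), ("select", TopFeatureSelector(k=10))])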

Question 4

Try creating a single pipeline that does the full data preparation plus the final prediction.
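
A sketch for this: chain the preparation ColumnTransformer (confusingly named regressor above) with a final predictor in a single Pipeline. The RandomForestRegressor here is just a placeholder choice.

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

full_pipeline = Pipeline(steps=[
    ("prep", regressor),                               # the ColumnTransformer defined earlier
    ("model", RandomForestRegressor(random_state=42)), # any of the regressors above would work here
])
full_pipeline.fit(X_train, y_train)
print(full_pipeline.score(X_test, y_test))  # R^2 on the held-out set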

Question 5

Automatically explore some preparation options using GridSearchCV.
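
Once everything lives in one pipeline (the full_pipeline sketch from Question 4), the preparation options themselves become searchable. The parameter paths below are assumptions based on the step names used in this notebook: "prep__scale__impute__..." reaches the KNNImputer inside the numeric sub-pipeline.

from sklearn.model_selection import GridSearchCV

prep_param_grid = {
    "prep__scale__impute__n_neighbors": [3, 5, 10],
    "prep__scale__impute__weights": ["uniform", "distance"],
    "model__n_estimators": [100, 300],
}
prep_search = GridSearchCV(full_pipeline, prep_param_grid, cv=3,
                           scoring="neg_root_mean_squared_error", n_jobs=-1, verbose=2)
prep_search.fit(X_train, y_train)
print(prep_search.best_params_)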