Hands On Machine Learning Chapter 2 - End To End Machine Learning Project

I am going to re-read Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow because I don't feel that I got a good grasp of machine learning the first time I read it, and I skipped neural networks entirely on that first pass.

End-to-End Machine Learning Project

In this chapter, you will be going through an example project end to end, pretending to be a recently hired data scientist in a real estate company.

Looking at the Big Picture

The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Your model should learn from this data and be able to predict the median housing price in any district given all the other metrics.

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store.

Framing the Problem

The first step is to frame the problem.

  • This is a supervised learning task since you are given labeled data.
  • It is a regression task, since you are asked to predict a value.
  • This is a multiple regression problem, since the system will use multiple features to make a prediction.
  • It is a univariate regression problem, since we are only trying to predict a single value for each district.
    • If we were trying to predict multiple values per district, it would be a multivariate regression problem.
  • It is a batch learning problem, since there is no continuous flow of data coming into the system.

Select Performance Measure

Next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors:

$$\text{RMSE}(\textbf{X},h)=\sqrt{\frac{1}{m}\sum_{i=1}^m\left(h(\textbf{x}^i)-y^i\right)^2}$$

Although the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. If there are outlier districts, you may consider using the Mean Absolute Error (also known as the Average Absolute Deviation):

$$\text{MAE}(\textbf{X},h)=\frac{1}{m}\sum_{i=1}^m\left\lvert h\left(\textbf{x}^i\right)-y^i\right\rvert$$

Both RMSE and MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

  • Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm. It is also called the $\ell_2$ norm, noted $\lVert \cdot \rVert_2$ or just $\lVert \cdot \rVert$.
  • Computing the sum of absolutes (MAE) corresponds to the $\ell_1$ norm, noted $\lVert \cdot \rVert_1$. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
  • More generally, the $\ell_k$ norm of a vector $\textbf{v}$ containing $n$ elements is defined as $\lVert \textbf{v} \rVert_k = \left( \lvert v_0 \rvert^k + \cdots + \lvert v_n \rvert^k \right)^{1/k}$. The $\ell_0$ norm just gives the number of non-zero elements in the vector, and $\ell_{\infty}$ gives the maximum absolute value in the vector.
  • The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare, the RMSE performs very well and is generally preferred.

The Math Notation Used in Textbook:

  • $m$ is the number of instances in the dataset
  • $\textbf{x}^i$ is a vector of all the feature values (excluding the label) of the $i^{th}$ instance in the dataset, and $y^i$ is its label (the desired output value for that instance)
  • $\textbf{X}$ is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the $i^{th}$ row is equal to the transpose of $\textbf{x}^i$, noted $(\textbf{x}^i)^T$.
  • $h$ is your system's prediction function, also called a hypothesis. When your system is given an instance's feature vector $\textbf{x}^i$, it outputs a predicted value $\hat{y}^i=h(\textbf{x}^i)$ for that instance
  • $\text{RMSE}(\textbf{X},h)$ is the cost function measured on the set of examples using your hypothesis $h$.
  • We use lowercase italic font for scalar values and function names, lowercase bold font for vectors, and uppercase bold font for matrices.
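As a small illustration (my own, not from the book), both error measures above can be computed directly with NumPy from a vector of predictions and a vector of labels; the numbers below are made up purely for demonstration:

import numpy as np
predictions = np.array([210000., 320000., 150000.])   # hypothetical predicted house values
labels = np.array([200000., 300000., 180000.])        # hypothetical true house values
rmse = np.sqrt(np.mean((predictions - labels) ** 2))  # l2-style: squaring gives large errors more weight
mae = np.mean(np.abs(predictions - labels))           # l1-style: every error is weighted linearly
print(rmse, mae)  # RMSE is always >= MAE; the gap grows as the errors become more uneven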

Get the Data

import pandas as pd
def load_housing_data():
    return pd.read_csv("housing.csv")
housing = load_housing_data()
housing.head()
out[2]

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0
1    -122.22     37.86                21.0       7099.0          1106.0
2    -122.24     37.85                52.0       1467.0           190.0
3    -122.25     37.85                52.0       1274.0           235.0
4    -122.25     37.85                52.0       1627.0           280.0

   population  households  median_income  median_house_value ocean_proximity
0       322.0       126.0         8.3252            452600.0        NEAR BAY
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2       496.0       177.0         7.2574            352100.0        NEAR BAY
3       558.0       219.0         5.6431            341300.0        NEAR BAY
4       565.0       259.0         3.8462            342200.0        NEAR BAY

housing.info()
out[3]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Using the DataFrame.head() and DataFrame.info() methods above, you can see the first few rows and get a quick description of the data (number of rows, each attribute's type, and number of non-null values), respectively. You can see that the total_bedrooms column contains some null values. The ocean_proximity column has a type of object, which usually means the column holds text when loaded from a CSV file. Every other column has a numeric type. In the head() output, you can see that the ocean_proximity values are repetitive, which means it is probably a categorical attribute. You can use the value_counts() method to get an idea of what is contained in the categorical column, and the DataFrame.describe() method to get a summary of the numerical attributes. Note that null values are ignored by describe().

housing["ocean_proximity"].value_counts()
out[5]

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

housing.describe()
out[6]

          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081
std        2.003532      2.135952           12.585558   2181.615252
min     -124.350000     32.540000            1.000000      2.000000
25%     -121.800000     33.930000           18.000000   1447.750000
50%     -118.490000     34.260000           29.000000   2127.000000
75%     -118.010000     37.710000           37.000000   3148.000000
max     -114.310000     41.950000           52.000000  39320.000000

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000
mean       537.870553   1425.476744    499.539680       3.870671
std        421.385070   1132.462122    382.329753       1.899822
min          1.000000      3.000000      1.000000       0.499900
25%        296.000000    787.000000    280.000000       2.563400
50%        435.000000   1166.000000    409.000000       3.534800
75%        647.000000   1725.000000    605.000000       4.743250
max       6445.000000  35682.000000   6082.000000      15.000100

       median_house_value
count        20640.000000
mean        206855.816909
std         115395.615874
min          14999.000000
25%         119600.000000
50%         179700.000000
75%         264725.000000
max         500001.000000

Another quick way to get a feel for the data you are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can call the hist() method on the whole dataset, and it will plot a histogram for each numerical attribute.

%matplotlib inline 
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
out[8]
[Figure: histograms of the nine numerical attributes]

Things to notice from histograms:

  1. The median income attribute does not look like it is expressed in US dollars. The numbers actually represent tens of thousands of dollars (e.g., 3 means about $30,000), and the values were capped at 15.
  2. The housing median age and the median house value were also capped.
  3. The attributes have very different scales.
  4. Many of the histograms are tail heavy: they extend much farther to the right of the median than to the left.

Generate a Test Set

Your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of machine learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected. This is called data snooping bias.
You want to avoid generating a different test/train split every time you run your code; otherwise, over many runs, you (and your model) would eventually get to see the whole dataset, which is exactly what you want to avoid. You can use Scikit-Learn's train_test_split function to split the data properly. Its random_state parameter allows you to set the random number seed.

from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(housing,test_size=0.2,random_state=42)
out[10]

You want to make sure that your test set is representative of the overall population. Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of a stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The code below uses the pd.cut function to create an income category attribute with 5 categories. After creating the categories, you can do stratified sampling based on the income category. For this you can use Scikit-Learn's StratifiedShuffleSplit class.
Test set generation is an often neglected but critical part of a machine learning project.

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
housing["income_cat"] = pd.cut(housing["median_income"],bins=[0.,1.5,3.0,4.5,6.,np.inf],labels=[1,2,3,4,5])
housing["income_cat"].hist()
%matplotlib inline 
plt.show()
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in split.split(housing,housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set=housing.loc[test_index]
print(strat_test_set["income_cat"].value_counts()/len(strat_test_set))
for set_ in (strat_train_set,strat_test_set):
    set_.drop("income_cat",axis=1,inplace=True)
out[12]
[Figure: histogram of the income_cat attribute]

income_cat
3 0.350533
2 0.318798
4 0.176357
5 0.114341
1 0.039971
Name: count, dtype: float64

Discover and Visualize the Data to Gain Insights

  • Since there is geographical information (latitude and longitude), it is a good idea to create a scatter plot of all districts to visualize the data.
    • You can see that the plot resembles the shape of California, and by setting the alpha option you can see where there is a high density of data points.
    • Customizing the plot:
      • s is the radius of the circle and represents the district's population
      • c is the color and represents the price
      • We use a predefined color map jet, which ranges from blue (low values) to red (high values)
    • The image tells you that housing prices are very much related to location and population density.

Looking for Correlations

Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method.
The correlation coefficient ranges from -1 to 1. When it is close to 1, it means that there is a strong positive correlation; when it is close to -1, it means that there is a strong negative correlation. Coefficients close to 0 mean that there is no linear correlation. The correlation coefficient only measures linear correlations; it may completely miss nonlinear relationships. It also has nothing to do with the slope of the relationship.

[Figure: standard correlation coefficients of various datasets]

Another way to check for correlation between attributes is to use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute.
As seen from the plot, the most promising attribute for predicting the median house value is the median income.

from pandas.plotting import scatter_matrix
housing = strat_train_set.copy()
housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.4,s=housing["population"]/100,label="population",figsize=(10,7),c="median_house_value",cmap=plt.get_cmap("jet"),colorbar=True)
plt.legend()
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
attributes = ["median_house_value","median_income","total_rooms","housing_median_age"]
a_throwaway = scatter_matrix(housing[attributes],figsize=(12,8)) # Assigning scatter_matrix to a variable to suppress output
out[14]

median_house_value 1.000000
median_income 0.688380
total_rooms 0.137455
housing_median_age 0.102175
households 0.071426
total_bedrooms 0.054635
population -0.020153
longitude -0.050859
latitude -0.139584
Name: median_house_value, dtype: float64

[Figure: geographical scatter plot of housing prices, sized by population and colored by median house value]

[Figure: scatter matrix of median_house_value, median_income, total_rooms, and housing_median_age]

Experimenting with Attribute Combinations

One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out attribute combinations.
As seen from the attribute combinations below, the bedrooms_per_room and rooms_per_household are more informative (more correlated with the housing price) than the total_rooms per district and the total_bedrooms per district.

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]
corr_matrix = housing.corr(numeric_only=True)
print(corr_matrix["median_house_value"].sort_values(ascending=False))
out[16]

median_house_value 1.000000
median_income 0.688380
rooms_per_household 0.143663
total_rooms 0.137455
housing_median_age 0.102175
households 0.071426
total_bedrooms 0.054635
population -0.020153
population_per_household -0.038224
longitude -0.050859
latitude -0.139584
bedrooms_per_room -0.256397
Name: median_house_value, dtype: float64

Prepare the Data for Machine Learning Algorithms

You should write functions to prepare the data for your Machine Learning algorithms for good reason:

  • This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset)
  • You will gradually build a library of transformation functions that you can reuse in future projects
  • You can use these functions in your live system to transform the new data before feeding it to your algorithms
  • This will make it possible for you to try various transformations and see which combination of transformations works best.

Data Cleaning

Most machine learning algorithms cannot work with missing features. You have three options to fix this:

  1. Get rid of rows that contain missing values dropna()
  2. Get rid of the whole column that contains missing values drop()
  3. Set the values to some value (zero, the mean, the median, etc.) fillna()
  • If you use this option, you need to save the mean/median/... so that you can replace missing values in the test set when you want to evaluate the system and so that you can replace missing values in the live system
  • Scikit-Learn provides a handy class to take care of missing values: SimpleImputer
    • Only works on numerical data (need to drop non-numerical data)
    • It's a good idea to apply the imputer to all numerical attributes, since you can't be sure that there won't be missing values in new data once the system goes live
    • The imputer's transform() method returns a NumPy array, but you can easily put it back into DataFrame form.
## Revert to Clean Training Set and Separate Labels 
housing = strat_train_set.drop("median_house_value",axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# Impute or drop columns or rows
from sklearn.impute import SimpleImputer
housing.dropna(subset=["total_bedrooms"]) # option 1
housing.drop("total_bedrooms",axis=1) # option 2
median = housing["total_bedrooms"].median() # option 3
housing["total_bedrooms"].fillna(median)

# Imputing
imputer = SimpleImputer(strategy="median")
# Since the median can only be computed on numerical attributes, you need to create a copy of the data without the categorical attribute
housing_num = housing.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)
print(housing_num.median().values)
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X,columns=housing_num.columns)
out[18]

[-118.51     34.26     29.     2125.      434.     1167.      408.        3.5385]
[-118.51     34.26     29.     2125.      434.     1167.      408.        3.5385]

Scikit-Learn Design

The main design principles of Scikit-Learn are:

  • Consistency: All objects share a consistent and simple interface
    • Estimators: Any object that can estimate some parameters based on a dataset is called an estimator. The estimation is performed by the fit() method, and it takes only a dataset as a parameter. Any other parameter needed to guide the estimation process is considered a hyperparameter, and it must be set as an instance variable
    • Transformers: Some estimators can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform()
    • Predictors: Some estimators are capable of making predictions given a dataset; they are called predictors. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and the corresponding labels, in the case of supervised learning).
  • Inspection: All the estimator's hyperparameters are accessible directly via public instance variables, and all the estimator's learned parameters are also accessible via public instance variables with a trailing underscore
  • Nonproliferation of Classes: Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.
  • Composition: Existing building blocks are reused as much as possible.
  • Sensible Defaults: Scikit-learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
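Here is a minimal sketch (my own, not the book's code) tying these roles to objects already used in this notebook; housing_num and housing_labels come from the Data Cleaning cell above:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
imputer = SimpleImputer(strategy="median")          # strategy is a hyperparameter (public instance variable)
imputer.fit(housing_num)                            # estimator: fit() learns parameters from the dataset
print(imputer.statistics_)                          # learned parameters carry a trailing underscore
housing_filled = imputer.transform(housing_num)     # transformer: transform() returns the transformed dataset
lin_reg = LinearRegression()
lin_reg.fit(housing_filled, housing_labels)         # predictors are estimators too
print(lin_reg.predict(housing_filled[:5]))          # predictor: predict() on a batch of instances
print(lin_reg.score(housing_filled, housing_labels))  # score() measures prediction quality (R² for regressors)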

Handling text And Categorical Attributes

Most machine learning algorithms prefer to work with numbers. You can convert categories from text to numbers using Scikit-Learn's OrdinalEncoder class. One issue that may arise is that ML algorithms will assume that two nearby values are more similar than two distant values. To fix this issue, a common solution is to create one binary attribute per category. This is called one-hot encoding, because only one attribute will be equal to 1 (hot) while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. You can use Scikit-Learn's OneHotEncoder for this.

  • The output of OneHotEncoder's fit_transform is oftentimes a scipy sparse matrix which improves the memory storage of the matrix.
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
ordinal_encoder = OrdinalEncoder()
cat_encoder = OneHotEncoder()
housing_cat = housing[["ocean_proximity"]]
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
out[20]

Custom Transformers

You will sometimes need to write your own transformers for tasks like custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities. You will need to create a class that implements three methods:

  1. fit() returning self
  2. transform()
  3. fit_transform()

    You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be used for automatic hyperparameter tuning.

In this example the transformer has one hyperparameter, add_bedrooms_per_room set to True by default. This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. [...] The more you automate data preparation steps, the more combinations you can automatically try out, making it much more likely you will find a great combination.

from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self,X,y=None):
        return self # nothing else to do
    def transform(self,X,y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
out[22]

Feature Scaling

One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. There are two common ways to get all attributes to have the same scale:

  1. Min-max scaling (also called normalization)
  • Values are shifted and rescaled so that they end up ranging from 0 to 1: $\text{new value}=\cfrac{\text{value}-\text{min value}}{\text{max value}-\text{min value}}$
  • Scikit-Learn provides the MinMaxScaler for this
  2. Standardization
  • First subtracts the mean value and then divides by the standard deviation, resulting in a distribution with zero mean and unit variance: $\text{new value}=\cfrac{\text{value}-\bar{x}}{\sigma}$, where $\bar{x}$ is the mean and $\sigma$ is the standard deviation
  • Does not bound values to certain range, which might be a problem for some algorithms, like neural networks.
  • Standardization is much less affected by outliers
  • Scikit-Learn provides StandardScaler for standardization
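To see both scalers in action, here is a minimal sketch (my own, not from the book) applied to housing_tr, the imputed numerical attributes from the Data Cleaning cell above:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
min_max_scaler = MinMaxScaler()                  # rescales each attribute to the 0-1 range by default
housing_num_minmax = min_max_scaler.fit_transform(housing_tr)
std_scaler = StandardScaler()                    # subtracts the mean, then divides by the standard deviation
housing_num_std = std_scaler.fit_transform(housing_tr)
print(housing_num_minmax.min(axis=0), housing_num_minmax.max(axis=0))        # roughly 0 and 1 per column
print(housing_num_std.mean(axis=0).round(6), housing_num_std.std(axis=0))    # roughly 0 and 1 per column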

Transformation Pipelines

There are many transformation steps that need to be executed in the right order. Scikit-Learn provides the Pipeline class to help with such sequences of transformations. The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers.

When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. (This way, we don't have to handle text and numeric columns separately.) Scikit-Learn has ColumnTransformer for this purpose.

  • ColumnTransformer requires a list of tuples, where each tuple contains a name, a transformer and a list of names (or indices) of columns that the transformer should be applied to.
  • ColumnTransformer applies each transformer to the appropriate columns and concatenates the outputs along the second axis (the transformers must return the same number of rows)
  • ColumnTransformer may return a sparse or dense matrix

Instead of a transformer, you can specify the string "drop" or "passthrough" if you want the columns to be dropped or left untouched, respectively.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
num_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy="median")),
    ('attribs_adder',CombinedAttributesAdder()),
    ('std_scaler',StandardScaler())
])
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num",num_pipeline,num_attribs),
    ("cat",OneHotEncoder(),cat_attribs)
])
housing_prepared = full_pipeline.fit_transform(housing)
out[24]

Select and Train a Model

Our predictions with the Linear Regression model show that it is underfitting the data. This means that the features do not provide enough information to make good predictions, or that the model is not powerful enough.
Our predictions with the Decision Tree Regressor show that it is badly overfitting the data.

We can use Scikit-Learn's K-fold cross-validation feature to randomly split the training set into 10 distinct subsets called folds, then train and evaluate the Decision Tree 10 times, picking a different fold for evaluation each time and training on the other 9 folds. The result is an array containing 10 evaluation scores. Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), which is why the code below uses the neg_mean_squared_error scoring and negates the scores before taking the square root.

Cross-validation allows you to get an estimate of the performance of your model, and also a measure of how precise this estimate is (its standard deviation). It comes at the cost of training the model several times.

RandomForestRegressor models work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further. If the score on the training set is much lower than on the validation sets, it means the model is overfitting the training set.

The goal of this step is to get a few promising models without tuning the hyperparameters yet.

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
housing_predictions_lin = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions_lin)
lin_rmse = np.sqrt(lin_mse)
print("The typical prediction error with Linear Regression is: $",lin_rmse)
print("The Linear Regression Model is underfitting the data - and is probaly not powerful enough.")
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions_tree = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions_tree)
tree_rmse = np.sqrt(tree_mse)
print("The typical prediction error with a Decision Tree Regression is: $",tree_rmse)
print("The Decision Tree Regressor is overfitting the data.")
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores: ",scores)
    print("Mean: ",scores.mean())
    print("Standard Deviation: ",scores.std())
print("\nDecision Tree Regressor Scores:\n---------------------------------------------")
display_scores(tree_rmse_scores)
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print("\nLinear Regression Scores:\n---------------------------------------------")
display_scores(lin_rmse_scores)
print("\nThe Decison Tree Regressor model is overfitting so bad that it performs worse than the Linear Regressor.\n")
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42,max_depth=6)
forest_reg.fit(housing_prepared,housing_labels)
housing_predictions_for = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions_for)
forest_rmse = np.sqrt(forest_mse)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
print("\nRandom Forest Scores:\n---------------------------------------------")
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
out[26]

The typical prediction error with Linear Regression is: $ 68076.77891613315
The Linear Regression Model is underfitting the data - and is probably not powerful enough.
The typical prediction error with a Decision Tree Regression is: $ 0.0
The Decision Tree Regressor is overfitting the data.

Decision Tree Regressor Scores:
---------------------------------------------
Scores: [69065.25916851 68494.43167691 70303.18632228 72981.01296128
69283.68189749 70371.94033815 71182.99764107 70133.40396406
72734.86910305 69530.31211822]
Mean: 70408.10951910166
Standard Deviation: 1422.046054660025

Linear Regression Scores:
---------------------------------------------
Scores: [69870.27769409 67158.73153705 67306.28413204 68467.8035438
67251.83730682 68508.22616777 67525.45824395 69932.10453943
67650.10198615 68264.15862964]
Mean: 68193.49837807339
Standard Deviation: 974.6694004604976

The Decision Tree Regressor model is overfitting so badly that it performs worse than Linear Regression.


Random Forest Scores:
---------------------------------------------
Scores: [60024.72797636 60098.98778019 56641.05214424 59091.76742651
59325.22554697 60814.55248432 60263.04583588 61057.24712315
60608.02951471 60521.45531579]
Mean: 59844.60911481157
Standard Deviation: 1217.5286578636737

You should save every model you experiment with, so you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You can easily save Scikit-Learn models using Python's pickle module or the joblib library, which is more efficient at serializing large NumPy arrays (sklearn.externals.joblib has been removed from recent Scikit-Learn versions, so import joblib directly).
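For example, a minimal sketch of saving and reloading a model with joblib (the file name here is arbitrary, and forest_reg is the Random Forest trained above):

import joblib
joblib.dump(forest_reg, "forest_reg.pkl")          # persists hyperparameters and trained parameters
forest_reg_loaded = joblib.load("forest_reg.pkl")  # reload later to compare against other models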

Fine Tune Your Model

Let's assume you now have a shortlist of promising models; the next step is to fine-tune them.

Grid Search

You can use Scikit-learn's GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross validation. It may take a long time, but when it is done, you can get the best combination of parameters by accessing the best_params_ property of the grid search object

Randomized Search

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. Instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This has two main benefits:

  1. If you let the randomized search run for, say, 1,000 iterations, it will explore 1,000 different values for each hyperparameter, instead of just the few values you list per hyperparameter with grid search.
  2. You have more control over the computing budget, simply by setting the number of iterations.
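A minimal sketch of what a randomized search could look like for this project (the parameter distributions below are illustrative assumptions, not prescribed values):

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
param_distribs = {
    'n_estimators': randint(low=1, high=200),  # each iteration samples a random value from these ranges
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
print(rnd_search.best_params_)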

Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The group (or "ensemble") will often perform better than the best individual model.
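As one possible illustration (not something the book does at this point), Scikit-Learn's VotingRegressor averages the predictions of several regressors; the model choices below are just examples, and np, cross_val_score, and display_scores come from the earlier cells:

from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
voting_reg = VotingRegressor(estimators=[
    ("lin", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
])
voting_reg.fit(housing_prepared, housing_labels)
voting_scores = cross_val_score(voting_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
display_scores(np.sqrt(-voting_scores))  # display_scores was defined in the model selection cell above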

Analyze the Best Models and Their Errors

You will often find good insights on the problem by inspecting the best models. Using this information, you might be able to see that you should drop some features (for example). You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem.

Evaluate on Test Set

There is nothing special about this process [evaluating the model on the test set]: just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform(), you do not want to fit the test set!), and evaluate the final model on the test set.
You might want to have an idea of how precise this estimate is. For this, you can compute a 95% confidence interval for the generalization error using scipy.stats.t.interval()
You must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.

from sklearn.model_selection import GridSearchCV

param_grid = [
    { 
        'n_estimators': [3, 10, 30],
        'max_features': [2, 4, 6, 8] 
    },
    {
        'bootstrap': [False],
        'n_estimators': [3, 10],
        'max_features': [2, 3, 4] 
    }
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
out[28]

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')

print(grid_search.best_params_)
grid_search.best_estimator_
feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
print("Retrieving Information From Final Model\n----------------------------------")
print(sorted(zip(feature_importances, attributes), reverse=True))
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value",axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print("Final Root Mean Square Error",final_rmse)
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,loc=squared_errors.mean(),scale=stats.sem(squared_errors)))
out[29]

{'max_features': 4, 'n_estimators': 30}
Retrieving Information From Final Model
----------------------------------
[(0.3223592339189118, 'median_income'), (0.1312478841804661, 'INLAND'), (0.08666849008356783, 'rooms_per_hhold'), (0.08570112495633296, 'latitude'), (0.08069646681289332, 'longitude'), (0.07078118436534746, 'pop_per_hhold'), (0.06465798339440287, 'bedrooms_per_room'), (0.04197259706944822, 'housing_median_age'), (0.021972881451931346, 'total_rooms'), (0.02047848809538031, 'population'), (0.020279985665433042, 'total_bedrooms'), (0.019735535618530326, 'households'), (0.019091733263987515, '<1H OCEAN'), (0.007115826634862638, 'NEAR OCEAN'), (0.0070979167344117945, 'NEAR BAY'), (0.0001426677540925575, 'ISLAND')]
Final Root Mean Square Error 50373.60561507538

array([48104.56497941, 52544.75343427])

Launch, Monitor, and Maintain Your System

  • You need to write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops. This is important to monitor model degradation.
  • You should evaluate the system's input data quality.
  • You generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible.